I was really nerd happy when I woke up this morning thinking I would spend the day poking around the newly-launched data.gov.ph. In fairness, I appreciate the effort to put all government data in one place (which should be the mandate of NSCB, but anyway), and to have it available in CSV, XML, or ASCII – basically anything that’s not PDF – is an upgrade.
However, upon further perusal I remain frustrated and disappointed. It’s clear that beyond the two points I mentioned above there has been no attempt to make the data usable. In order to avoid being called one of those “cottage industries” that make a living off of criticizing the administration no matter how good its intentions are, I offer some obvious problems and some simple solutions:
1) The search function is not functioning.
Say that I want to know the maternal mortality rate over time. I go to the site and search for “maternal mortality.” No results. I search for “maternal,” no results. “Mortality,” “death” and “deaths,” still no dice. (At this point I’m wondering, “Shouldn’t there be regular data on how many people die every month and the cause of death, disaggregated to at least the provincial level?” But I don’t want to get side-tracked.) Finally I search for “women,” and lo and behold, there is one lonely result: Health and Nutrition: % of Women who Died Due to Pregnancy-Related Causes. Jackpot.
However, this discovery raises other questions. Why is there only one result when you search for women? Does this mean there is no data on women in the workforce, women married and at what age, average income of women-headed households, women who hold public office, or the percentage of LGUs that use their Gender and Development Budgets? (A search, by the way, for “female” or “gender” doesn’t yield these results either.)
I know government data on these topics exists. I’ve seen it. But it is buried in spreadsheets that cover multiple topics and have names so general that they obscure the contents, like “Updates 2012.” This problem of “the data is publicly available but no one knows where” is the exact problem this whole Open Data initiative is trying to address.
Solution: Tagging. The whole point of having a search function is so that you type an intuitive, not technical, description of what you’re looking for and then you find it. However, this only works if the name of the file or its tags have intuitive descriptors. Let’s look at the tags for Health and Nutrition: % of Women who Died Due to Pregnancy-Related Causes.
“Health” and “Nutrition” are already part of the title, so they are unnecessary but don’t hurt. “Philippines,” well yes, we are in the Philippines, all of this is assumed to be data about the Philippines, so again, totally unnecessary but doesn’t take away anything. “NAMRIA” stands for the National Mapping and Resource Information Authority. I had to Google that because I had no idea what it was and I have no idea why anyone who wanted to know about maternal mortality would search for NAMRIA.
Common sense should be used when making tags. What is this dataset? Luckily, its title is clear and specific. What are some other terms people would use to describe this and related topics? Well, “maternal mortality,” “mothers,” “death,” “female,” and “reproductive health” for starters. Tags and search terms must be assigned from the point of view of potential users.
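To make the point concrete, here is a minimal sketch of a search that looks at both the title and the tags. The title is the real one; the tags are the ones I suggested above, and the field names and the `search` function are my own assumptions, not how data.gov.ph actually works.

```python
# Hypothetical catalogue entry: the title is real, the tags are the ones
# suggested above, and the field names are assumptions for illustration.
DATASETS = [
    {
        "title": "Health and Nutrition: % of Women who Died Due to Pregnancy-Related Causes",
        "tags": ["maternal mortality", "mothers", "death", "female", "reproductive health"],
    },
]


def search(query, datasets):
    """Return titles of datasets whose title or tags contain the query."""
    q = query.strip().lower()
    hits = []
    for d in datasets:
        # Search the title and every tag, ignoring case.
        haystack = [d["title"].lower()] + [t.lower() for t in d["tags"]]
        if any(q in text for text in haystack):
            hits.append(d["title"])
    return hits


for term in ["maternal mortality", "death", "women", "NAMRIA"]:
    print(term, "->", len(search(term, DATASETS)), "result(s)")
```

With user-minded tags, the intuitive searches find the dataset, and the only term that comes up empty is the one nobody would type.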
2) Non-uniformity in data. Data is a wonderful thing. Having lots of data means that you can not only know about a single phenomenon, but you can know how it relates to other phenomena and come up with an idea about what causes it.
Say that I want to know the relationship between maternal mortality and public expenditures in health. This seems like a reasonable thing to want to know and something that we probably should know if we are going to make good policy. Maternal mortality data looks like this:
The data is obviously incomplete, which is a whole other issue. I will say, though, that having data that is accurate to the local level, even if you only have it in a few areas, is more useful for designing government interventions than having macro data that reflects the average in the country overall but does not tell you about the situation of any particular community. But I digress.
With this data it should be possible to look at the relationship between public health expenditure at the local level and maternal mortality. However, all the budget tables are just that, tables. This map must have been generated from a table, so that table already exists, but where is it? I need it to make a simple comparison with health expenditure, not to mention to apply more complicated statistical techniques that could reveal the effect of health expenditure on maternal mortality while holding constant things like average income, average children per household, rural or urban character, etc.
Solution: The numbers exist. Show us the numbers.
(By the way, searches for “health expenditures” only resulted in national budgets. Searches for “local government units” and “IRA” turned up null.)
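If both tables were published with a shared local government unit column, the simple comparison would take only a few lines. The sketch below assumes exactly that; the column names, LGU names, and every number are placeholders I made up, not figures from data.gov.ph.

```python
import csv
import io

# Placeholder tables: the column names and every number are made up for
# illustration; they are NOT figures from data.gov.ph.
MORTALITY_CSV = """lgu,year,maternal_deaths_pct
LGU A,2012,0.30
LGU B,2012,0.10
LGU C,2012,0.20
"""

SPENDING_CSV = """lgu,year,health_spending_per_capita_php
LGU A,2012,100
LGU B,2012,300
LGU D,2012,250
"""


def read_table(text, value_column):
    """Index one value column by (lgu, year)."""
    rows = csv.DictReader(io.StringIO(text))
    return {(r["lgu"], int(r["year"])): float(r[value_column]) for r in rows}


def join_tables(left, right):
    """Keep only the (lgu, year) keys present in both tables."""
    return {key: (left[key], right[key]) for key in sorted(left.keys() & right.keys())}


mortality = read_table(MORTALITY_CSV, "maternal_deaths_pct")
spending = read_table(SPENDING_CSV, "health_spending_per_capita_php")
combined = join_tables(mortality, spending)

for (lgu, year), (deaths_pct, php) in combined.items():
    print(lgu, year, deaths_pct, php)
```

Only the LGUs that appear in both tables survive the join, which is also a quick way to see how incomplete the coverage is.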
3) Labeling. In order to demonstrate that math has practical implications in life, 3rd grade math was full of word problems. My 3rd grade teacher insisted that we label all our answers. “The answer is not 5, it’s 5 apples. Why can’t you just say 5? Because I’m a mathematician, not a psychic, and you can’t assume I know what you’re talking about.”
Again, let’s go back to Health and Nutrition: % of Women who Died Due to Pregnancy-Related Causes. It seems pretty straightforward: the percentage of women who died due to pregnancy-related causes in the colored local government units. However, died when? In the past month? In 2013? Ever? You can’t accurately find out important relationships (like those mentioned in #2) if you don’t specify a time period. Also, are these all women or only women of child-bearing age? That is another potential source of skewed data.
Another example: here is a section of a file called “31 October 2012 NEDA Updates” available at http://data.gov.ph/catalogue/dataset/31-october-2012-updates
This data sheet raises more questions than it answers. What does “% g.r.” mean? There is no key that explains this in the file. If the purpose of making data open is so that any citizen, whether they are a technical expert or not, can access it and understand it, then things that are not immediately understandable to the layman should have an explanatory note in the key.
Beyond that, however, there are items that are simply impossible to understand even if you do have prior knowledge. For example: row 25, the number of building permits for quarter 2 of 2012 = 4.4. It would seem that this represents the absolute number of building permits. (I can’t see how building permits could be a percentage of anything, unless you meant the value of building permits, but that’s not what it says.) So the question is 4.4 what? 4.4 thousand? 4.4 million? Similarly, the value of construction for quarter 2 of 2012 = 15.4. Again, 15.4 million pesos? 15.4 billion pesos? Or 15.4% of total GDP? Of total GNI? Or of whatever % g.r. is?
Now please look at rows 18-24, electrical consumption. Residential consumption in August 2012 (row 20) is -3.3. I find it hard to believe that residential households generated 3.3% more electricity than they consumed (thus yielding a negative value), so this makes me think -3.3 represents a change of some sort. But a change of what? Is that the change in October 2012 compared to August 2012? Of residential consumers’ share of all electricity consumed?
Again, maybe these are industry conventions that are simply going above my head, but there is no point to providing open access to data if the meaning of the data is not transparent. Accessible necessarily means understandable. Any provider of information has the duty to clearly explain what that information means, not berate the population for not understanding bureaucratic conventions.
Solution: All of these datasets already have a codebook that clearly explains the meaning of each data point and how it is derived. I am confident these codebooks must already exist because at some point NEDA, DBM, DOF, etc. have to train new staff. Just upload the codebooks and link them to the datasets.
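For anyone wondering what “linking a codebook to a dataset” buys you, here is a rough sketch. The column names, units, and periods are placeholders I invented to show the idea; they are not NEDA’s actual definitions, which is precisely what I can’t find.

```python
# Hypothetical codebook: the column names, units and periods below are
# placeholders, not NEDA's actual definitions.
CODEBOOK = {
    "building_permits": {
        "label": "Number of building permits issued",
        "unit": "thousands",
        "period": "quarter",
    },
    "construction_value": {
        "label": "Value of construction",
        "unit": "billion pesos",
        "period": "quarter",
    },
}


def describe(column, value, codebook):
    """Attach the label, unit and period to a bare number."""
    entry = codebook.get(column)
    if entry is None:
        # Without a codebook entry the number stays unexplained.
        return f"{value} (no codebook entry for '{column}')"
    return f"{entry['label']}: {value} {entry['unit']} per {entry['period']}"


print(describe("building_permits", 4.4, CODEBOOK))
print(describe("electricity_residential", -3.3, CODEBOOK))
```

The first number comes back labeled; the second comes back exactly as mysterious as it is in the file today.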
In conclusion, Open Data could be the start of something immensely useful, but let’s not pat ourselves on the back just yet. For the Open Data project to cause real changes in its intended areas of transparency and governance, for accessibility to be real and not just a technical concept, it is not enough to just upload all the data to a central location and hope someone will have the time, energy, and expertise to do something with it. You must think of the user and the possible practical implications, and format your data accordingly. (And in the future, collect your data accordingly.)
There are lots of other problems that make this data unusable or prohibitively difficult to use in its current form: in many files the data is laid out like text tables instead of rectangular data sets; instead of all the data on a particular topic (say, the amount spent on public education) being in one sheet, every year has its own file, which requires a whole lot of manipulation before you can do any time-series analysis; and the time periods used are inconsistent, making cross-source comparisons very difficult (e.g. some data is displayed by year, some by quarter, some by month, some by calendar year and some by fiscal year – all in the same table).
While a lot more has to be done to format the raw data, I tried to identify three quick and easy ways in which the usability of this site and its data can be greatly improved: better tagging, revealing the numbers, and labeling data (basically uploading codebooks). These are all steps that don’t need technical experts or statisticians; they just need a little common sense.
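To give a sense of the manipulation the one-file-per-year layout forces on every user, here is a small sketch that stacks yearly files into one long table. The file contents, column names, and numbers are placeholders of my own, and real files would first need to be rectangular for even this to work.

```python
import csv
import io

# Hypothetical per-year files: columns and numbers are placeholders.
YEARLY_FILES = {
    2010: "region,education_spending\nRegion X,10\nRegion Y,20\n",
    2011: "region,education_spending\nRegion X,12\nRegion Y,21\n",
}


def stack_years(files):
    """Combine one-file-per-year tables into a single long table."""
    long_rows = []
    for year in sorted(files):
        for row in csv.DictReader(io.StringIO(files[year])):
            long_rows.append(
                {
                    "region": row["region"],
                    "year": year,  # the year comes from the file, not the rows
                    "education_spending": float(row["education_spending"]),
                }
            )
    return long_rows


table = stack_years(YEARLY_FILES)
series_x = [(r["year"], r["education_spending"]) for r in table if r["region"] == "Region X"]
print(series_x)
```

Publishing the stacked table once would spare every user from repeating this step.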
P.S. It’s possible that everything I just said is in the Action Plan for Open Data Philippines, but I have no idea because the link is broken.