Our First Principles When We Work With Data in India
It is safe to say that a day in the life of any DDL team member involves at least 5 hours of staring at a black screen, perhaps pacing around trying to figure out why a beautifully designed code build is failing, and going to bed chewing on the same problem, only to have a team member discover a mundane bug. We work with a lot of data, mostly secondary data collected by a host of data originators: the government, satellites, researchers, and at times the private sector.
We’re slowly figuring out a system of first principles as our data adventures increase in ambition or wishful thinking, depending on your perspective. Here are some broad ideas that drive how our team approaches data.
Stay humble by learning what is already known.
Research questions should never be formed in a vacuum. If you’re asking an important question, then most likely, many people have already put in hundreds of hours writing related research.
Of course, understanding what’s been covered in quantitative papers is important, but we also have so much to learn from rich, detailed, qualitative work in other disciplines like anthropology, sociology, psychology, urban studies, and so many more. When reading in other disciplines, you can ask yourself questions like: “Can I use this case study to make a hypothesis?” “What can I test with my data that they couldn’t?” “Wow, this ethnography is so cool and I’ve been trying to fix this data merge for 2 days now, is it too late to switch to anthropology?” (No it’s not, but now you can do a little bit of both!)
The same goes for data. A lot of recent quantitative research focuses on designing the perfect RCT, or perfect survey, to exactly answer a pre-set question with no errors. That’s a very expensive way to make new data, when a huge amount of data has already been collected and is open to the public. The whole principle behind the Socioeconomic High-resolution Rural-Urban Geographic Platform for India (SHRUG) is that it is a collaborative platform. A lot of people use the SHRUG. When researchers collaborate and offer to add their data to the SHRUG, their work gets cited as well, since people already know about the SHRUG. Their data, in turn, adds to the multidimensionality of the SHRUG, and our data deficit issues get slightly better. The alternative would have been spending a lot of money to collect every bit of information we were interested in. The SHRUG is proof that building off each other’s work and giving credit where it’s due is so much better and easier than the lone warrior approach.
The shiniest, most buzzword-y, high-tech-iest solutions may not always be the best fit for the problem.
We’ve spent some time training neural network models at DDL to do all sorts of things with names. It’s great to milk every aspect of the data for all it’s worth: the name variable that we hastily drop to remove PII (Personally Identifiable Information) can reveal a lot of information (variables) if treated with care. But machine learning is not the perfect solution to every problem. Hasty AI solutions are the worst possible approach.
AI is not a smart robot, and models are often opaque and can be a pain to debug. Buzzword-y methods like “machine learning” and “neural network” aren’t magic; you can’t just chuck a bunch of unclean data in and expect the AI to provide an easy explanation for the whole complex problem. Instead, you’ve got to come up with a very specific task, informed by what’s already been published, prep your data exactly, and then the AI will happily complete the task. You should also have a good reason for choosing one approach over another.
A lot of the work will be painful and not glamorous.
Most of the time, our work is slow and unglamorous, and involves things like:
* Reading through old, grainy government documents trying to figure out what on Earth variable vague_var_mplt means, and why, for Pete’s sake, no one wrote it down
* Sorting out which records are errors and which describe living people when, according to the government, some of India’s oldest people are 217 years old and have 124 children
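A first pass at that second chore can be as simple as flagging values that fall outside a plausible range, so a human can look at them before anything gets dropped. Here is a minimal sketch of the idea. The field names, the bounds, and the two records are made up for illustration; real thresholds would have to come from knowing the dataset.

```python
# Hypothetical plausibility bounds (inclusive); these are assumptions,
# not thresholds taken from any real dataset.
PLAUSIBLE_RANGES = {"age": (0, 120), "num_children": (0, 30)}


def flag_implausible(record, ranges=PLAUSIBLE_RANGES):
    """Return the names of fields whose values fall outside their plausible range."""
    flagged = []
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if value is None:
            # A missing value is a different problem; leave it unflagged here.
            continue
        if not (low <= value <= high):
            flagged.append(field)
    return flagged


# Made-up records for illustration only.
records = [
    {"id": 1, "age": 217, "num_children": 124},
    {"id": 2, "age": 64, "num_children": 3},
]

# Map each record id to the list of suspicious fields, for manual review.
flags = {r["id"]: flag_implausible(r) for r in records}
print(flags)
```

Flagging rather than deleting is deliberate: a value that looks absurd might be a typo, a coding convention nobody wrote down, or occasionally something real.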
Investing time upfront to link datasets will unlock huge potential.
The pandemic has driven home the importance of readily available data linked across space and time. A problem many groups face when working with contemporary Indian data is that there is no clear mapping between current administrative units and those in the most recent censuses. This made it difficult to link hospital capacity data from administrative surveys to crowd-sourced real-time case counts. Half of the time, our team’s biggest value-add is making it possible to link different datasets at the village, town, block, and district levels. At the risk of making that value-add obsolete around here, we wouldn’t have to jump through any of those code hoops if administrative datasets had common geographic identifiers to begin with.
Instead, data is continually collected in an ad-hoc manner to solve one problem at a time. Data can and does have many uses. For example, we had never thought that linking assembly constituency level data to the population censuses could be useful for election reporting in India, but someone thought of using the linked data for high-resolution reporting on the demographic composition of constituencies going into the polls across the many phases of the bitterly contested state elections in West Bengal this year. Not hoarding data has many positive externalities, not the least of which is that people in the field like you more. It’s really nice to not be hated, given that we’re chasing explosive and politically difficult research questions half of the time!
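For a sense of what linking looks like in practice, here is a minimal sketch of joining administrative records to census records through a crosswalk. Everything in it is an assumption made for illustration: the district names, the census codes, the numbers, and the helper names are all invented, and real crosswalks are far messier than a lookup table.

```python
def normalize(name):
    """Lowercase a place name and collapse whitespace so spelling variants match."""
    return " ".join(name.lower().split())


# Hypothetical crosswalk from current district names to census district codes.
CROSSWALK = {"north district": "C001", "south district": "C002"}

# Made-up census records keyed by census district code.
CENSUS = {"C001": {"population": 1000}, "C002": {"population": 2500}}

# Made-up administrative records that only carry a district name.
HOSPITALS = [
    {"district": "North  District", "beds": 40},
    {"district": "SOUTH DISTRICT", "beds": 75},
    {"district": "New District", "beds": 10},
]


def link_to_census(rows, crosswalk, census):
    """Attach census fields to each row; return (linked, unmatched) lists."""
    linked, unmatched = [], []
    for row in rows:
        code = crosswalk.get(normalize(row["district"]))
        if code is None or code not in census:
            # Keep unmatched rows visible instead of silently dropping them.
            unmatched.append(row)
            continue
        linked.append({**row, "census_code": code, **census[code]})
    return linked, unmatched


linked, unmatched = link_to_census(HOSPITALS, CROSSWALK, CENSUS)
print(len(linked), "linked;", len(unmatched), "unmatched")
```

The unmatched list is the part that matters most: units that were split, merged, or renamed since the last census end up there, and that is where the slow manual work begins.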