29 Aug 2019

Data are not oil, they are cupcakes

By this point, it’s a bit cliché to say that data is “the new oil.” It’s the sort of phrase you might have heard if you didn’t have headphones on when you went to that one hip cafe the other day. And while some parts of the metaphor are relevant (“Data spill” has kind of a ring to it), other parts are potentially misleading, especially for an aspiring data scientist.

Instead, I propose that we think of data as cupcakes, or really, desserts more generally. Here are a few reasons:

  1. Data are the result of human intervention. Data don’t just exist in the ground because dinosaurs died a long time ago (though sticklers might disagree). Someone has to deliberately place a sensor somewhere, or track a user’s browsing behavior, or run a sample through some device, or launch a camera into space… and then write down everything that the sensor saw, because they think that information might be useful for some reason someday.

    Of course, this is just like cupcakes. Cupcakes don’t appear out of thin air (we can dream); they are deliberately constructed for human consumption.

  2. Data are wildly varied across domains. This is a corollary of point 1. Since data are generated by people, and since people have extraordinarily diverse backgrounds and intentions, the data you will encounter as a professional data scientist will vary dramatically from project to project. Museum directors, city planners, ecologists, and healthcare providers are all collecting data these days – I think it’s a small miracle that a few of our mathematical abstractions (randomness, latent structure, inference) are relevant across such broad domains.

    Oil is oil no matter where you go; desserts come in many different flavors.

  3. We can strengthen point 2 by noting that, even within a single application domain, the data are not uniform. It’s now standard for a genomics data analysis to incorporate measurements from many different modalities. Formal government surveys are often complemented by automatically recorded transactions. On the internet, images, text, temporal, and relational information are all baked together into a delicious, fruit-filled, and icing-topped whole, as you will discover in your course project.

Okay, okay, I know what you’re going to say: “But Kris, cupcakes have never jeopardized civil society” and “all these additional dessert regulations are going to stifle innovation.” Yes, this metaphor has its own limits, but I hope I’ve gotten my main point across, which is that data are human artifacts, generated by people with diverse visions of the future.

