What is the purpose of data analysis?
In these notes, we will review “Tukey, Design Thinking, and Better Questions,” a blog post by Roger Peng. In the post, he discusses the importance of formulating good questions in data analysis. While the process of refining questions is central to any successful, real-world analysis, it is rarely discussed in formal statistics textbooks, which usually restrict themselves to developments in theory, methodology, or computation.
The post itself is based on a line of thinking from the statistician John Tukey’s 1962 paper, “The Future of Data Analysis.” Perhaps the most famous quote from this paper is:
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”
Throughout the article, Tukey argues that data analysts should not (1) be called upon to provide stamps of authority or (2) be distracted by the allure of proving algorithms optimal in settings that are unlikely to be realistic.
This of course raises the question: what should data analysts be doing? Tukey never directly discusses this, but Peng’s thesis is that the purpose of data analysis is to spark the formulation of better questions. This idea is captured in the diagram below. Algorithms (visual, predictive, and inferential) can help us strengthen the evidence that certain patterns exist. However, if those patterns answer irrelevant questions, then there is little point in the analysis. With so many algorithms available, it’s easy to spend substantial effort gathering evidence to answer low-quality questions. Instead, a good data scientist should focus on improving the quality of the questions asked, even if that means that the answers will be less definitive.
What does it mean to ask better questions? Peng argues that better questions are “sharper,” providing more discriminating information and guiding research toward more promising directions. In my view, a sharper question is one that helps inform decisions that have to be made regardless of the data, either by adding to our body of beliefs or by bringing new uncertainties into focus.
The implication is that data analysis is not about finding “the answer.” Instead, it should be about clarifying what can and cannot be answered based on the data. Beyond that, it can help clarify which questions we really want answered. In Peng’s words, “we can learn more about ourselves by looking at the data.”
The final example in the reading is that the residuals in an analysis are themselves data worth examining. They inform whether a model (and the associated way of conceptualizing the problem) is really appropriate, or whether a reformulation would be beneficial.
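To make the point about residuals concrete, here is a minimal sketch in Python. It is not taken from Peng's post: the simulated data, the function name linear_fit_residuals, and the choice of diagnostic are all illustrative assumptions. We fit a straight line to data whose true relationship is curved, then check whether the residuals still carry structure.

```python
import numpy as np


def linear_fit_residuals(x, y):
    """Fit y ~ intercept + slope * x by least squares and return the residuals."""
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (intercept + slope * x)


# Hypothetical data: the true relationship includes a quadratic term,
# but the model we fit below is only linear.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

residuals = linear_fit_residuals(x, y)

# Treat the residuals as data: if the linear model were adequate,
# they should show no systematic relationship with x**2.
curvature_corr = np.corrcoef(x**2, residuals)[0, 1]
print(f"corr(residuals, x^2) = {curvature_corr:.2f}")
```

A correlation near zero would be consistent with the linear model. A large value, as this simulated setup is designed to produce, suggests that the question "how does y change per unit of x?" is too coarse, and that a reformulation (for example, allowing curvature) would be more informative.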
In the end, the main idea is that data analysis equips us to ask better questions.