The Art of Understanding Distributions in Data Science
Chapter 1: The Mindset of a Data Scientist
Understanding how data scientists perceive the world is complex. Good data scientists and statisticians possess foundational skills that enable them to analyze data with greater clarity than the average person. It's not enough to merely learn the tools; true statistical thinking must go deeper than just memorizing tests and formulas.
Reflecting on my own experiences, I recall my childhood and how my thinking has evolved over time. I remember dining out with my aunt and uncle. As soon as our meals arrived, my uncle instinctively reached for the salt shaker. My aunt playfully remarked that he was foolish to add salt before tasting his food. At that moment, I understood the logic: tasting is essential before making adjustments. Adding salt without prior tasting seemed irrational to me.
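However, as I matured, I recognized that I had fallen into the trap of rigid, rule-based thinking. My uncle was presumably not being irrational: if experience had taught him that restaurant meals were almost always undersalted for his taste, salting first was a reasonable bet on the most likely outcome.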
Section 1.1: Statistical Literacy vs. Rigid Thinking
Individuals who possess statistical literacy perceive the world in terms of distributions. They recognize that various influences and interactions exist, some strong and others weak, and that variability is a natural part of understanding data. A genuine grasp of reality acknowledges that nothing is guaranteed; probabilities govern outcomes.
In contrast, many people lean towards a rules-based mindset, viewing everything in black-and-white terms—something is either true or false. This dichotomy often becomes evident when interpreting scientific studies. For instance, when presented with sensational headlines like, “Eldest siblings are more intelligent,” I often cringe at the predictable responses. Comments like, “I’m the youngest and definitely smarter!” or “That’s not always true!” reflect a fundamental misunderstanding.
Rule-based thinkers interpret such headlines as universal truths: “Older siblings are always more intelligent.” Read that way, a single counter-example disproves the claim. A statistically literate person reads the same headline differently: “On average, the intelligence scores of older siblings are higher than those of their younger siblings.” This is the kind of claim scientific studies actually test. It describes two overlapping distributions, so it fully expects exceptions to exist.
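To make the "overlapping distributions" reading concrete, here is a minimal sketch. The means and standard deviation are invented for illustration and do not come from any study. The sketch also treats the two siblings' scores as independent normal draws, which is a simplifying assumption. It estimates how often the younger sibling outscores the older one even though the older group has the higher average.

```python
import random
from statistics import NormalDist

# Hypothetical numbers, chosen only for illustration (not from any study).
OLDER_MEAN = 101.5
YOUNGER_MEAN = 100.0
SCORE_SD = 15.0


def exception_share_exact(older_mean, younger_mean, sd):
    """P(younger > older) for two independent normal scores with equal sd."""
    # The difference (older - younger) is normal with mean equal to the gap
    # between the group means and standard deviation sd * sqrt(2).
    gap = NormalDist(mu=older_mean - younger_mean, sigma=sd * 2 ** 0.5)
    return gap.cdf(0.0)


def exception_share_simulated(older_mean, younger_mean, sd,
                              pairs=100_000, seed=0):
    """Same quantity, estimated by drawing random sibling pairs."""
    rng = random.Random(seed)
    exceptions = 0
    for _ in range(pairs):
        older = rng.gauss(older_mean, sd)
        younger = rng.gauss(younger_mean, sd)
        if younger > older:
            exceptions += 1
    return exceptions / pairs


if __name__ == "__main__":
    exact = exception_share_exact(OLDER_MEAN, YOUNGER_MEAN, SCORE_SD)
    simulated = exception_share_simulated(OLDER_MEAN, YOUNGER_MEAN, SCORE_SD)
    print(f"exact share of exceptions:     {exact:.3f}")
    print(f"simulated share of exceptions: {simulated:.3f}")
```

Under these made-up numbers, the younger sibling comes out ahead in a little under half of all pairs. "I'm the youngest and definitely smarter!" is therefore exactly what the average-based claim predicts for a large share of families.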
Subsection 1.1.1: The Importance of Effect Sizes
It’s not just that rule-based thinkers treat anecdotes as disproving trends; they also tend to overestimate the size of scientific findings. I remember a conversation with my PhD advisor about a news article claiming that research papers mentioned on social media receive more citations. An undergraduate expressed disbelief, assuming that serious researchers wouldn’t discover work on platforms like Twitter. My advisor simply asked, “What is the effect size?”
This question encapsulates the distinction in thought processes. A rigid thinker might assume that every mention on Twitter guarantees a significant boost in citations. In contrast, a statistically savvy individual understands that while social media might have some effect, it could be minimal. With a large enough sample size, even tiny effects can be detected: if being tweeted about results in an average increase of just 0.1 citations, a big enough study will still flag that difference as statistically significant, simply because it is non-zero. Statistical significance tells you an effect is probably real, not that it is large.
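The interplay between sample size and detectability can be sketched with a simple two-sample z-test. The 0.1-citation difference comes from the example above. The standard deviation of 10 citations and the group sizes are assumptions made purely for illustration. For simplicity, the sketch also assumes the observed difference equals the true effect exactly.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean_diff, sd, n_per_group):
    """z statistic and two-sided p-value for a difference in group means.

    Assumes both groups share the same known standard deviation `sd`
    and have `n_per_group` observations each.
    """
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


def cohens_d(mean_diff, sd):
    """Standardized effect size: the difference measured in standard deviations."""
    return mean_diff / sd


if __name__ == "__main__":
    MEAN_DIFF = 0.1   # extra citations per tweeted paper (from the text)
    CITATION_SD = 10  # hypothetical spread of citation counts
    for n in (100, 10_000, 1_000_000):
        z, p = two_sample_z(MEAN_DIFF, CITATION_SD, n)
        print(f"n per group = {n:>9,}  z = {z:5.2f}  p = {p:.3g}")
    print(f"Cohen's d = {cohens_d(MEAN_DIFF, CITATION_SD):.2f} at every sample size")
```

The p-value shrinks as the sample grows, while the effect size stays exactly the same. That is why "What is the effect size?" is a separate question from "Is it significant?"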
In reality, nearly every effect observed is non-zero. If a plausible connection exists between two variables, there’s a good chance of a measurable effect. The key question is whether this effect is substantial enough to be of importance.
Statements like “Boys develop speech later than girls” mean little without the effect size (how much later?). If the delay is a few months, it matters in practice; if it’s only a few minutes, it is irrelevant, however reliably it can be measured. A statistically minded person treats such facts with caution until the effect size is known.
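One way to turn "is the effect big enough to matter?" into a routine check is to compare a confidence interval against the smallest difference you would actually care about. This is a rough sketch, not a prescribed method. The function name `judge_effect`, the threshold and the example numbers are all hypothetical, and the interval uses a plain normal approximation.

```python
from statistics import NormalDist


def judge_effect(estimate, std_error, smallest_important, level=0.95):
    """Classify an estimated effect relative to a practical threshold.

    Returns the confidence interval and one of:
      "important"    - the whole interval lies beyond the threshold
      "negligible"   - the whole interval lies inside (-threshold, +threshold)
      "inconclusive" - the interval straddles the threshold
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    low = estimate - z * std_error
    high = estimate + z * std_error
    if low >= smallest_important or high <= -smallest_important:
        verdict = "important"
    elif -smallest_important < low and high < smallest_important:
        verdict = "negligible"
    else:
        verdict = "inconclusive"
    return (low, high), verdict


if __name__ == "__main__":
    # Hypothetical speech-delay estimates in months; threshold of one month.
    for estimate, std_error in [(0.02, 0.005), (3.0, 0.5), (1.2, 0.4)]:
        (low, high), verdict = judge_effect(estimate, std_error, 1.0)
        print(f"estimate {estimate:4.2f}  CI [{low:5.2f}, {high:5.2f}]  {verdict}")
```

The first hypothetical estimate has an interval that excludes zero, so it would count as statistically significant, yet it is still labeled negligible because it sits far below the one-month threshold.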
Section 1.2: The Significance of Statistical Understanding
Communicating the essential knowledge that data scientists need regarding data and statistics can be challenging. The ability to view the world through the lens of distributions—with a recognition of the nuances between significant statistical relationships and those that are negligible—is vital. The deeper this understanding seeps into your life and thought processes, the more instinctive it becomes to apply it in a professional context.
Chapter 2: Exploring Key Concepts in Data Science
This video titled "Bayesian Statistics - Thinking about Probability Distributions" delves into how Bayesian methods allow data scientists to interpret data distributions effectively.
In this video, "The 5 Must-Know Distributions for Data Scientists (not what you think)," viewers will explore essential distributions that every data scientist should understand.