Statistical Significance and the Limits of Large Samples

Note: This post is also shared on LinkedIn.

Just finished “Apple in China” by Patrick McGee, an exquisite read!

Among the many valuable insights the book offers, one sentence, about Doug Guthrie’s research, sparked a reflection on research methodology:

👉 “The idea was to capture qualitative, on-the-ground research, but with a big enough sample that it would achieve statistical significance” (Ch. 28, 6:46).

The phrase “achieve statistical significance” in the context of qualitative research immediately caught my attention, since significance testing is a tool of quantitative analysis. It is likely an imprecise way of expressing the intent to collect a rich, diverse dataset to identify reliable patterns. Still, the wording is worth examining, especially in light of ongoing debates around research rigor and the reproducibility crisis.

Framing “achieving statistical significance” (i.e., p < α) as a primary goal is problematic. Large samples boost statistical power, so even negligible differences can come out statistically “significant” without being practically meaningful. Ignoring effect size then creates a false sense of importance.
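To make this concrete, here is a minimal sketch in Python. It assumes a two-sample z-test with equal group sizes and unit variance, and a hypothetical standardized difference of 0.01 (the names `EFFECT_SIZE_D`, `expected_z`, and `two_sided_p_value` are mine, for illustration only). It evaluates the p-value at the expected test statistic for several sample sizes, which is a simplification of what a real study would observe.

```python
import math

# Hypothetical, negligible standardized mean difference (Cohen's d).
EFFECT_SIZE_D = 0.01


def expected_z(effect_size_d, n_per_group):
    """Expected z statistic for a two-sample z-test with equal groups and unit variance."""
    return effect_size_d * math.sqrt(n_per_group / 2)


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


p_values = {}
for n in (100, 10_000, 1_000_000):
    z = expected_z(EFFECT_SIZE_D, n)
    p_values[n] = two_sided_p_value(z)
    # The effect size never changes; only the sample size does.
    print(f"n per group = {n:>9,}  d = {EFFECT_SIZE_D}  z = {z:.2f}  p = {p_values[n]:.3g}")
```

The effect size is identical in every row, yet the p-value drops below any conventional threshold once the sample is large enough. The p-value is tracking the sample size here, not the importance of the difference.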

As Jacob Cohen reminded us: “The primary product of a research inquiry is one or more measures of effect size, not p-values” (1990, Things I have learned (so far)).

To counter this issue, some journals have taken bold steps. For example, Basic and Applied Social Psychology banned null hypothesis significance testing, p-values included, and encouraged larger sample sizes for more stable descriptive statistics (Trafimow & Marks, 2015, Editorial).

But sample size is not everything either: representativeness matters. Douglas Hubbard tells a story in which the famous statistician John Tukey is quoted as saying: “A random selection of three people would have been better than a group of 300 chosen by Mr. Kinsey” (2010, How to Measure Anything).
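The intuition behind Tukey's remark can be sketched with a small simulation. Everything in it is an assumption made for illustration: a synthetic population drawn from a normal distribution, a "chosen" group of 300 picked only from people scoring above 60, and random samples of just 3 people repeated many times. It is not a model of Kinsey's actual data.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic population (assumed): 100,000 scores with mean ~50 and SD ~10.
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# Large but unrepresentative sample: 300 people picked only from high scorers.
high_scorers = [x for x in population if x > 60]
biased_sample = rng.sample(high_scorers, 300)
biased_error = abs(statistics.fmean(biased_sample) - true_mean)

# Tiny but random samples: 3 people each, repeated to estimate the typical error.
TRIALS = 2_000
squared_errors = [
    (statistics.fmean(rng.sample(population, 3)) - true_mean) ** 2
    for _ in range(TRIALS)
]
random_rmse = math.sqrt(statistics.fmean(squared_errors))

print(f"Error of the biased sample of 300:      {biased_error:.2f}")
print(f"Typical error of a random sample of 3:  {random_rmse:.2f}")
```

The random samples of 3 are noisy, but their error is centered on zero. The biased sample of 300 is precise about the wrong value, and adding more people chosen the same way would not fix that.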

🔍 I explored these concepts further in a recent presentation on data science: lnkd.in/efJdbpYg