Wednesday, July 14, 2010

The China Study: With a large enough sample, anything is significant

There have been many references recently on diet and lifestyle blogs to the China Study. Except that they are not really references to the China Study, but to a blog post by Denise Minger. This post is indeed excellent, and brilliant, and likely to keep Denise from “having a life” for a while. That it caused so much interest is a testament to the effect that a single brilliant post can have on the Internet. Many thought that the Internet would lead to a depersonalization and de-individualization of communication. Yet, most people are referring to Denise’s post, rather than to “a great post written by someone on a blog.”

Anyway, I will not repeat here what Denise said in her post. My goal with this post is a bit more general: it concerns the interpretation of quantitative research results more broadly. This post is a warning regarding “large” studies, that is, studies whose main claim to credibility is that they are based on a very large sample. The China Study is a good example. It prominently claims to have covered 2,400 counties and 880 million people.

There are many different statistical analysis techniques that are used in quantitative analyses of associations between variables, where the variables can be things like dietary intakes of certain nutrients and incidence of disease. Generally speaking, statistical analyses yield two main types of results: (a) coefficients of association (e.g., correlations); and (b) P values (which are measures of statistical significance). Of course there is much more to statistical analyses than these two types of numbers, but these two are usually the most important ones when it comes to creating or testing a hypothesis. The P values, in particular, are often used as a basis for claims of significant associations. P values lower than 0.05 are normally considered low enough to support those claims.

In analyses of pairs of variables (known as "univariate", or "bivariate" analyses), the coefficients of association give an idea of how strongly the variables are associated. The higher these coefficients are, the more strongly the variables are associated. The P values tell us how likely an apparent association of that strength would be to arise by chance alone, given a particular sample. For example, a P value of 0.05, or 5 percent, means that if there were no real association, an association at least as strong as the one observed would still appear by chance about 5 percent of the time. Some people like to say that, in a case like this, one has a 95 percent confidence that the association is real.

One thing that many people do not realize is that P values are very sensitive to sample size. For example, with a sample of 50 individuals, a correlation of 0.6 may be statistically significant at the 0.01 level (i.e., its P value is lower than 0.01). With a sample of 50,000 individuals, a much smaller correlation of 0.06 may be statistically significant at the same level. Both correlations may be used by a researcher to claim that there is a significant association between two variables, even though the first association (correlation = 0.6) is 10 times stronger than the second (correlation = 0.06).
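This sensitivity is easy to verify numerically. The sketch below (a generic illustration using the correlation values from the paragraph above, not actual data from the China Study) computes the t-statistic commonly used to test whether a Pearson correlation differs from zero; for large samples, any |t| above roughly 2.6 corresponds to a two-tailed P value below 0.01.

```python
import math

def t_statistic(r, n):
    """t-statistic for testing whether a Pearson correlation r,
    computed from a sample of size n, differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# For large degrees of freedom, |t| > ~2.6 implies P < 0.01 (two-tailed).
print(round(t_statistic(0.6, 50), 1))      # 5.2  - strong correlation, small sample
print(round(t_statistic(0.06, 50000), 1))  # 13.4 - weak correlation, huge sample
print(round(t_statistic(0.06, 50), 1))     # 0.4  - same weak correlation, small sample
```

Both of the first two correlations clear the 0.01 significance bar, even though one association is 10 times stronger than the other; the weak correlation only becomes "significant" because of the enormous sample.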

So, with very large samples, cherry-picking results is very easy. It is sometimes argued that this is not technically lying, since the reported associations are indeed statistically significant. But by reporting only those, one may be omitting other associations that are much stronger. This type of practice is sometimes referred to as “lying with statistics”.

With a large enough sample one can easily “show” that drinking water causes cancer.

This is why I often like to see the coefficients of association together with the P values. For simple variable-pair correlations, I generally consider a correlation around 0.3 to be indicative of a reasonable association, and a correlation at or above 0.6 to be indicative of a strong association. I reach these conclusions regardless of the P values. Whether these associations indicate causation is another story; one has to use common sense and good theory.

If you take my weight from 1 to 20 years of age, and the price of gasoline in the US during that period, you will find that they are highly correlated. But common sense tells me that there is no causation whatsoever between these two variables.

There are a number of other issues to consider that I am not going to cover here. For example, relationships may be nonlinear, and standard correlation-based analyses are “blind” to nonlinearity. This is true even for advanced correlation-based statistical techniques such as multiple regression analysis, which controls for competing effects of several variables on one main dependent variable. Ignoring nonlinearity may lead to misleading interpretations of associations, such as the association between total cholesterol and cardiovascular disease.

Note that this post is not an indictment of quantitative analyses in general. I am not saying “ignore numbers”. Denise’s blog post in fact uses careful quantitative analyses, with good ol’ common sense, to debunk several claims based on, well, quantitative analyses. If you are interested in this and other more advanced statistical analysis issues, I invite you to take a look at my other blog. It focuses on WarpPLS-based robust nonlinear data analysis.

12 comments:

  1. Damn you to hell Ned for gaining weight. It has made filling my car so much more expensive! :)

    Good analysis & explanation - appreciated by statistical numpties such as myself! This sort of analysis is far more beneficial and educational than some of the refutations I have read of Denise's effort, most of which have been from vegetarians worried that their 'don't eat meat cos it will kill you' stance might be on shaky ground!

  2. This comment has been removed by the author.

  3. Good stuff Ned. You do good work here and I appreciate it even if I don't usually comment. I will tweet this to keep adding fuel to the fire that is blazing all over social media concerning Denise's blog post.

    Surprised my raw vegan followers on twitter haven't unfollowed me as of late. Wait until they see my next post on eating a sheep's head!

  4. Hi Jamie.

    A "researcher" starting with very strong preconceptions could argue that as gasoline prices went up, I tended to walk more, which made me hungrier, which made me fatter.

    I have seen some of the refutations on grounds of "univariate" (a.k.a. bivariate, or between pairs of variables) statistics. Denise's response is correct. When one does multivariate analyses, often strong bivariate correlations become stronger, and weak ones become weaker.

  5. Btw, Michael, nice YouTube clip with Weston Price in it!

  6. Hey Ned,

    It just goes to show one has to examine the correlation matrix in great detail.

    Of course, if you are lucky you should be able to construct a linear transformation and diagonalize it. I often wonder why it is only in the psychological and economic "sciences" that "factor analysis" is conducted as a method for extracting information from data. I guess nutritionists and epidemiologists just ain't that sophisticated.

    Alternatively they believe "black swans" are induced by Monsanto via a GMO.

    Personally I only accept Relative Risks in excess of 10 times, NNTs less than 10, correlation coefficients greater than 90% and operating characteristics such that false positives and false negatives are both less than 5%. It means I can reject so much "advice" as being hogwash.

    But then I inherited many of my father's genes; he only developed sarcopenia at age 88 and died at 92.

    Say la vie! I do all the time.

    PS Is this a coincidence or a correlation: Campbell comes from Cam Beul (Gaelic for "Crooked mouth").

  7. Hi Leon.

    Yes, I have never seen FA used in nutrition or epidemiology articles. Actually, I haven't seen it used in medical articles in general.

    I do see multiple regression analyses being used here and there, but often wrongly. Other than that, the preferred method seems to be ANOVA (or a variation, like ANCOVA). More often than not, I see no evidence of normality tests, which ANOVA requires.

    In terms of genetic makeup, it sounds like you and Bob Delmonteque have something in common:

    http://www.bobdelmonteque.com

  8. Everyone who read Denise's post should read this first because the understanding you share in this post is so critical. I was inspired by Denise's post but also quite disappointed as she attacks his arguments with the same type of correlations that make Campbell's arguments weak in the first place. Epidemiological data is so misleading at times . . . darn it!

  9. Thanks Avi. Yes, it is easy to lie with statistics, and this is why researchers should be very careful when drawing conclusions based on their data.

  10. Dear Ned, it would be great if someone launched a data mining competition on Kaggle (http://www.kaggle.com ) with the China Project data. We need really talented researchers/mathematicians to seriously explore this database.

  11. Hello,
    This is an amazing blog post found here.... Very interesting information, Thanks!
