In the comments section of Denise Minger’s post on July 16, 2010, which discusses some of the data from the China Study (as a follow-up to a previous post on the same topic), Denise herself posted the data she used in her analysis. So I decided to take a look at that data and do a couple of multivariate analyses with it using WarpPLS (warppls.com).
First I built a model that explores relationships with the goal of testing the assumption that the consumption of animal protein causes colorectal cancer, via an intermediate effect on total cholesterol. I built the model with various hypothesized associations to explore several relationships simultaneously, including some commonsense ones. Including commonsense relationships is usually a good idea in exploratory multivariate analyses.
The model is shown on the graph below, with the results. (Click on it to enlarge. Use the "Ctrl" and "+" keys to zoom in, and "Ctrl" and "-" to zoom out.) The arrows explore causative associations between variables. The variables are shown within ovals. The meaning of each variable is the following: aprotein = animal protein consumption; pprotein = plant protein consumption; cholest = total cholesterol; crcancer = colorectal cancer.
The path coefficients (indicated as beta coefficients) reflect the strength of the relationships; they are a bit like standard univariate (or Pearson) correlation coefficients, except that they take into consideration multivariate relationships (they control for competing effects on each variable). A negative beta means that the relationship is negative; i.e., an increase in a variable is associated with a decrease in the variable that it points to.
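To make the idea of "controlling for competing effects" concrete, here is a minimal Python sketch. It is not WarpPLS's actual algorithm (which is PLS-based and can also model nonlinear relationships); it assumes plain linear regression on standardized variables with exactly two predictors. The numbers are made up for illustration and are not the China Study data, and the function names are my own.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standardized_betas(x1, x2, y):
    """Standardized regression coefficients of y on x1 and x2.

    Closed form for the two-predictor case, written in terms of the
    three pairwise correlations.
    """
    r12, r1y, r2y = pearson(x1, x2), pearson(x1, y), pearson(x2, y)
    denom = 1.0 - r12 ** 2
    beta1 = (r1y - r2y * r12) / denom
    beta2 = (r2y - r1y * r12) / denom
    return beta1, beta2


# Hypothetical data: y depends only on x2, but x1 moves together with x2.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 * v + 1 for v in x2]

print("Pearson r(x1, y):", pearson(x1, y))
print("Betas (x1, x2):  ", standardized_betas(x1, x2, y))
```

In this made-up example x1 has a fairly strong Pearson correlation with y (about 0.83), but only because it moves together with x2. Once x2 is controlled for, the coefficient for x1 drops to zero, which is the kind of difference between a correlation and a path coefficient described above.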
The P values indicate the statistical significance of the relationship; a P lower than 0.05 is conventionally taken to mean a significant relationship (if there were no real relationship, a coefficient this strong would be expected to arise by chance less than 5 percent of the time). The R-squared values reflect the percentage of explained variance for certain variables; the higher they are, the better the model fits the data. Ignore the “(R)1i” below the variable names; it simply means that each of the variables is measured through a single indicator (or a single measure; that is, the variables are not latent variables).
I should note that the P values have been calculated using a nonparametric technique, a form of resampling called jackknifing, which does not require the data to be normally distributed. This is good, because I checked the data, and it does not look like it is normally distributed. So what does the model above tell us? It tells us that:
- As animal protein consumption increases, colorectal cancer decreases, but not in a statistically significant way (beta=-0.13; P=0.11).
- As animal protein consumption increases, plant protein consumption decreases significantly (beta=-0.19; P<0.01). This is to be expected.
- As plant protein consumption increases, colorectal cancer increases significantly (beta=0.30; P=0.03). This is statistically significant because the P is lower than 0.05.
- As animal protein consumption increases, total cholesterol increases significantly (beta=0.20; P<0.01). No surprise here. And, by the way, the total cholesterol levels in this study are quite low; an overall increase in them would probably be healthy.
- As plant protein consumption increases, total cholesterol decreases significantly (beta=-0.23; P=0.02). No surprise here either, because plant protein consumption is negatively associated with animal protein consumption; and the latter tends to increase total cholesterol.
- As total cholesterol increases, colorectal cancer increases significantly (beta=0.45; P<0.01). Big surprise here!
Why the big surprise with the apparently strong relationship between total cholesterol and colorectal cancer? The reason is that it does not make sense, because animal protein consumption seems to increase total cholesterol (which we know it usually does), and yet animal protein consumption seems to decrease colorectal cancer.
When something like this happens in a multivariate analysis, it usually is due to the model not incorporating a variable that has important relationships with the other variables. In other words, the model is incomplete, hence the nonsensical results. As I said before in a previous post, relationships among variables that are implied by coefficients of association must also make sense.
Now, Denise pointed out that the missing variable here possibly is schistosomiasis infection. The dataset that she provided included that variable, even though there were some missing values (about 28 percent of the data for that variable was missing), so I added it to the model in a way that seems to make sense. The new model is shown on the graph below. In the model, schisto = schistosomiasis infection.
So what does this new, and more complete, model tell us? It tells us some of the things that the previous model told us, but a few new things, which make a lot more sense. Note that this model fits the data much better than the previous one, particularly regarding the overall effect on colorectal cancer, which is indicated by the high R-squared value for that variable (R-squared=0.73). Most notably, this new model tells us that:
- As schistosomiasis infection increases, colorectal cancer increases significantly (beta=0.83; P<0.01). This is a MUCH STRONGER relationship than the previous one between total cholesterol and colorectal cancer; even though some data on schistosomiasis infection for a few counties is missing (the relationship might have been even stronger with a complete dataset). And this strong relationship makes sense, because schistosomiasis infection is indeed associated with increased cancer rates. More information on schistosomiasis infections can be found here.
- Schistosomiasis infection has no significant relationship with these variables: animal protein consumption, plant protein consumption, or total cholesterol. This makes sense, as the infection is caused by a worm that is not normally present in plant or animal food, and the infection itself is not specifically associated with abnormalities that would lead one to expect major increases in total cholesterol.
- Animal protein consumption has no significant relationship with colorectal cancer. The beta here is very low, and negative (beta=-0.03).
- Plant protein consumption has no significant relationship with colorectal cancer. The beta for this association is positive and nontrivial (beta=0.15), but the P value is too high (P=0.20) for us to discard chance within the context of this dataset. A more targeted dataset, with data on specific plant foods (e.g., wheat-based foods), could yield different results – maybe more significant associations, maybe less significant.
Below is the plot showing the relationship between schistosomiasis infection and colorectal cancer. The values are standardized, which means that the zero on the horizontal axis is the mean of the schistosomiasis infection numbers in the dataset. The shape of the plot is the same as the one with the unstandardized data. As you can see, the data points are very close to a line, which suggests a very strong linear association.
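For readers who want to see what standardization does, here is a small Python sketch with made-up numbers (not the actual county data; the variable names are hypothetical). It illustrates the two points above: the standardized values have a mean of zero, and the strength of the linear association is the same before and after standardizing.

```python
from math import sqrt


def zscores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical values, for illustration only.
infection = [0.0, 0.0, 1.0, 2.0, 5.0, 12.0]
cancer = [3.0, 4.0, 4.0, 6.0, 9.0, 20.0]

z_infection = zscores(infection)
z_cancer = zscores(cancer)

print("Mean of standardized infection:", sum(z_infection) / len(z_infection))
print("Correlation, raw data:         ", pearson(infection, cancer))
print("Correlation, standardized data:", pearson(z_infection, z_cancer))
```

Standardizing only shifts and rescales each axis, so the points keep their relative positions; that is why the plot has the same shape either way.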
So, in summary, this multivariate analysis vindicates pretty much everything that Denise said in her July 16, 2010 post. It even supports Denise’s warning about jumping to conclusions too early regarding the possible relationship between wheat consumption and colorectal cancer (previously highlighted by a univariate analysis). Not that those conclusions are wrong; they may well be correct.
This multivariate analysis also supports Dr. Campbell’s assertion about the quality of the China Study data. The data that I analyzed was already grouped by county, so the sample size (65 cases) was not so high as to cast doubt on P values. (Having said that, small samples create problems of their own, such as low statistical power and an increase in the likelihood of error-induced bias.) The results summarized in this post also make sense in light of past empirical research.
It is very good data; data that needs to be properly analyzed!
Wednesday, May 29, 2024
74 comments:
Excellent analysis. Let me pick a few bones, however. I don't like the term "statistically significant", because it implies an essentially arbitrary threshold. The test for p-value < 0.05 is a common one in the biological sciences, but in other fields it is often taken much lower. For instance, when I was in gamma-ray astronomy, the test was more stringent ("5 sigma") for denoting whether or not a source was "detected".
My point here is that the decision about "significance" is arbitrary. What we really care about is the relative belief in different hypotheses. You've gone a long ways towards this in your analysis, elucidating several possible hypotheses of causality. I urge you to take the extra steps and compute some posterior odds ratios, so we can directly compare the support for these hypotheses to each other.
Great work!
Professionally BRILLIANT, Ned.
I particularly like the fact that the results, significances, etc. do not depend on normality of the data, and that the technique is non-parametric.
Did you notice that Campbell's latest offering noted that his analyses rested on the "biological plausibility" of his casein cancer link, yet there is little to support this except his own epidemiological study and his hypothesis of casein acting as an on/off switch? Where is the biochemical/enzymatic/knockout gene mechanism or pathway?
Campbell means crooked mouth (Cam Beul) in Scots Gaelic.
Ned, could you update the graphs to higher resolution images? It's difficult to read the signs on some of the numbers.
Why, in the second model, does the arrow go from colorectal cancer to total cholesterol, and not the other way around? Seems like a mistake or a typo.
Hi Dave, thanks and good point. The P < 0.05 level was the one adopted in all discussions so far, so I decided to stick with it.
Hi Leon, thanks. On one hand, I think we have to seek the truth, which requires debate. But, on the other hand, I think that Dr. Campbell and his collaborators have done a great job in collecting and compiling the data. The data seems to be pretty good, and that is very important.
Hi Uncle H.
The relationship TC -> cancer is less plausible than cancer -> TC.
Often a sign of cancer is "messed up" lipids, especially a high LDL, which contributes to a high TC.
In any case, if I had included the link TC -> cancer instead, the results would have been essentially the same.
Hi Anon.
When you open the graph window, press the "CRTL" and "+" keys at the same time to zoom in. You can make it as large as your screen, and even larger.
Ned,
In the first model, you propose a link between protein consumption and total cholesterol, and the link between all three and cancer. In the second model, you lose the link between protein consumption and cholesterol. I'm confused why you changed the *structure* of Model 2 in addition to adding the effects of schistosomiasis infection. Theoretically, I can see the rationale for protein -> cholesterol -> cancer; it's harder for me to see the link between protein -> cancer -> cholesterol.
Wasn't the original point to determine whether cholesterol moderates the relationship between diet and cancer? The second model tests whether cancer mediates the relationship between diet and cholesterol.
Also, what are your model fit statistics, such as Chi-square, CFI, SRMR, and RMSEA? Without these, it's hard to judge the relevance of the findings.
Hey Ned, you may also want to try stratifying the data and running separate models on the counties where schistosomiasis = 0 and where schistosomiasis is > 0. Try doing this with the relationship between cholesterol and schistosomiasis as well. You should see that cholesterol has little effect on colorectal cancer occurrence in schistosomiasis-free populations but has a stronger relationship with both schistosomiasis and colorectal cancer in infected populations.
Denise
@Ned,
How is causality included in this analysis? Is there something in there that actually distinguishes the direction of causality, or are we looking at associations? Interesting in either case, just want to be clear. Thanks.
Hi Uncle H.
The reason I built the first model that way was to incorporate the generic, "widely accepted" (but wrong), hypothesis that:
X -> cholesterol -> Y
Where Y is something "bad", in terms of health, and cholesterol is usually a measure of LDL or total cholesterol (TC). Therefore, the generic hypothesis goes, X should be removed from the diet. I have quite a few posts on this blog on why this generic hypothesis is wrong. Here are three:
http://healthcorrelator.blogspot.com/2010/05/postprandial-glucose-levels-hba1c-and.html
http://healthcorrelator.blogspot.com/2010/04/long-term-adherence-to-dr-kwasniewskis.html
http://healthcorrelator.blogspot.com/2010/02/want-to-improve-your-cholesterol.html
The second model better reflects what we know about the relationship between TC (and even LDL-C) and cancer. That is, often TC and LDL-C are markers of disease, and thus the onset of cancer may lead to an elevation of TC and LDL-C. This suggests cancer as a causative factor.
But, still, changing the direction of the arrows as you suggested does not affect the overall outcome of the multivariate analysis, because of the very strong overall effect of schistosomiasis infection. The beta from schisto to crcancer goes down a bit (to 0.79), because of the competition with cholest for explained variance on crcancer. But not enough to change the conclusions regarding any path.
Don't take my word for it. You can do several analyses with various model structures yourself and see what happens. WarpPLS is free for trial, and most other statistical analysis software (e.g., SPSS) will do this type of analysis (although something like SPSS may take a lot more effort).
I'll address the fit indices question on a separate comment because this is getting too big.
Dave,
The single arrows imply causality. Had Ned chosen to, he could have used double arrows (pointing in both directions) to remove causality (i.e., to avoid making a directional hypothesis).
Hi Uncle H., again.
Regarding the fit indices question, the ones you mention are more commonly used in covariance-based structural equation modeling.
Since the models in this post don't have latent variables, they are actually path models, not true structural equation models.
Anyway, WarpPLS generates three fit indices. More on these here:
http://warppls.blogspot.com/2010/01/how-are-model-fit-indices-calculated-by.html
Not surprisingly, the fit indices are considerably better for the second model than for the first. This follows from the difference in R-squared values, which are themselves a good indication of model-data fit.
Hi Denise.
Thanks for stopping by and for the excellent suggestion.
Like you, I have not found a way of slowing down the passage of time, or making time management a non-zero-sum game :)
@Uncle Herniation,
What I'm asking is how directionality is modeled in the analysis. What changes, and how might this be related to P("A causes B") vs. P("B causes A")? Thanks.
Dave, Uncle H.:
WarpPLS always takes into consideration inter-correlations between predictor variables that compete for the explained variance in a criterion variable. There is no need to explicitly model them through double-arrows. That is what multiple regression does as well.
Changing the direction of an arrow can affect the results when two or more variables point at one variable. WarpPLS always controls for the effects of competing predictor variables on any criterion variable.
Finally, I should note that WarpPLS does something a bit different from what covariance-based structural equation modeling (SEM) software tools do (e.g., Lisrel and Amos). WarpPLS does variance-based SEM, also referred to as PLS-based SEM, which has some advantages over covariance-based SEM:
http://faculty.chass.ncsu.edu/garson/PA765/pls.htm
Ned,
SPSS won't do this without AMOS; I have MPlus and a few other SEM software packages that I can use, but unfortunately I don't believe the sample size is large enough to actually conclude anything. Do you have 95% confidence intervals for your parameter estimates so we can judge how precise they are?
Whether X -> cholesterol -> Y is right or wrong is an empirical argument that can vary based on what X and Y are. It cannot be stated as unequivocally wrong all the time. In your second model, cholesterol is depicted as the final outcome, when in reality the final outcome that is most interesting is cancer, because we are interested in finding the variables that are risk and protective factors against cancer. We are not concerned with cholesterol as our final outcome measure, and therefore it is more useful as a moderator variable, not the final variable in a causal path.
To put this another way: Let's say I want to understand my risk of getting colon cancer based on my plant protein intake, my animal protein intake, my total cholesterol, and whether or not I've had a schistosomiasis infection. This makes sense.
Now let's say I want to estimate my total cholesterol based on my plant protein intake, my animal protein intake, whether I have colon cancer, and whether or not I've had a schistosomiasis infection.
This is less meaningful for two reasons. First, we want to understand our risk of cancer in the future based on our diet, how our diet affects our cholesterol, and whether or not our cholesterol even means anything when it comes to our risk of colon cancer. Second, if we want to know our cholesterol, we can just get a blood test. We don't need to use a regression equation based on Model 2 to figure out our cholesterol levels. But your second model treats cholesterol as an outcome variable without even asking whether or not it has an influence on any other variables. In other words, the model treats cancer as a predictor variable instead of an outcome variable. But obviously, cancer is the outcome variable, not cholesterol.
Did you get a chance to look at the model fit statistics I mentioned in my earlier post?
Maybe we cross-commented; my answers are above.
Thanks, for the great read.
Nice analysis and DAGs. However, I'd like to know what kind of regression model you used - linear regression? If so, it would be important to check that your outcome variable is normally distributed (this is separate from using a nonparametric method to determine the variance and p-values for the beta coefficients).
Are there other factors you think could be included in the DAG?
@Ned,
I took a quick look at the link you posted on the method (emphasis on "quick") and still don't see how it can distinguish between causal hypotheses. It seems more oriented toward predicting one set of variables from another, which sounds more like correlation than causation. Anyway, I should read more closely. Just trying to get an idea of precisely what question your analysis answers. Very nice work regardless.
I'm going to put in my obligatory plug for Jaynes' book on Probability Theory. First three chapters can be downloaded here:
http://bayes.wustl.edu/etj/prob/book.pdf
The plot at the end, showing the relationship between schistosomiasis infection and colorectal cancer, shows a regression result almost entirely driven by a single data point. A proper regression analysis would remove that point as well as any outliers and high-leverage data points. You need to go back through and consider how this could affect your other results.
Sorry folks, I will have to be quick with this comment. Life and work keep getting in the way of blogging. Let me see if I can address at least a few of the issues raised:
- I did linear and nonlinear analyses. Usually nonparametric stats are the way to deal with deviation from normality. But if there is anything else that can (or needs to) be done, I would like to know.
- That outlier on the plot is actually well aligned with the linear trend, and does not influence it much. Besides, one usually removes outliers only if it is suspected that they reflect error. If it is good data, we need to keep it.
- As for new variables, let's see what comes up. It would be interesting if we could differentiate consumption of fruits and veggies from grains and seeds. Perhaps we don't even need new data, but it would be good to have it. A larger sample would also be great.
- I am planning a new post for the weekend, with a new analysis of data from counties without any schistosomiasis infections. I think you will find it interesting.
As always comments and critique are very welcome.
I'd like to encourage communication among commenters, not only with me.
The readers of this blog tend to be very knowledgeable, and can add a lot to the discussion by commenting on others' comments.
Great analysis. I found similar results when I did the analysis in JMP.
I redid the analysis without including the counties that don't have schistosomiasis data. I also redid the analysis on just the counties where schistosomiasis was 0.
Another thing I did was remove counties that may have a disproportionate effect on the data. The 27th county (I used the numbers, not the county names) has a very high rate of schistosomiasis infection.
There is also a vegetarian county and a near-carnivorous county. I think when I tested for Cook's D influence another two counties came up as disproportionately influential. I redid all the analyses excluding those counties (one or some at a time, then all together).
This was for my personal interest and I didn't save the results. I removed certain counties because I thought they may be skewing the data set and I was curious what I'd find and felt like being as thorough as possible.
If you want I could redo it all again.
I'll repost this over at Denise's blog.
- Bushrat
On the issue of causality vs. association, raised by Dave, which I guess I did not fully address yet:
Most statistical tests generate coefficients of association (e.g., path coefficients, Pearson or Spearman correlations, F coefficients) and P values. Sometimes confidence intervals are reported instead of P values; their interpretation is similar.
There are a few things that one can do to go beyond that, but in my view the best way to infer causality from a given association is to use logic, past research, and theory.
This is particularly true for cross-sectional data; that is, data that reflect what is happening at a particular point in time. Longitudinal data is generally better for inferring causation, because you can segment the data to show that an increase in variable X at time 1 is followed by an increase in variable Y at time 2.
But even with longitudinal data, the reality is that one can NEVER prove a hypothesis based on empirical data. One can only provide support for a hypothesis based on empirical data; using logic, past research, and theory.
What one can do much more easily is to DISPROVE a hypothesis, and for that a single case is enough. In fact, statistical tests are usually aimed at disproving hypotheses, not proving them.
Getting back to causality vs. association, I can make total cholesterol point at colorectal cancer in a model, and get a positive and significant association coefficient.
But to believe that total cholesterol causes colorectal cancer is nonsensical because total cholesterol is generally increased by consumption of animal products, of which animal protein consumption is a proxy. And animal protein consumption seems to be protective against colorectal cancer (negative association on graph).
Hi Bushrat.
I am glad to know we got similar results, since mistakes can happen.
My approach is generally to keep outliers and do a nonlinear analysis to account for their influence on the distribution pattern. WarpPLS does nonlinear analyses yielding U- and S-curve patterns.
But, having said that, I did remove outliers in some intermediate analyses just to see what I would get. I didn't find anything really unexpected.
The reason I like to keep outliers is that they often tell us something unique about a pattern, unless they are outliers due to measurement error.
Hi Ned. Great work! Are you aware of online data from the China Study? Or, could you provide some data already formatted for Excel, so that other people could try similar statistical analyses? Regards.
In the comments section of the post below, which discusses some of the data from the China Study, Denise Minger posted a link to the data she used in her analysis:
http://rawfoodsos.com/2010/07/16/the-china-study-my-response-to-campbell/
That is the data I used.
hi ned,
three suggestions:
1) i think you should consider linearly transforming the outcome variable and redoing multiple linear regression.
2) by stratifying, you'll be left with very little power in each group (small sample size). and if there are differences in the effect of animal protein (your primary exposure of interest), you'd have to perform a multivariate test to see if the beta coefficients differ in the two strata.
3) based on #2, it might be better to include an interaction term in your model and then do a likelihood ratio test to compare the main effects model to the main effects + interaction term model. what you're expecting is a difference in animal protein effect depending on "levels" of schistosomiasis, right? so really what you're trying to evaluate isn't whether schistosomiasis is a *confounder* but rather, an *effect modifier*.
hc
"The second model better reflects what we know about the relationship between TC (and even LDL-C) and cancer. That is, often TC and LDL-C are markers of disease, and thus the onset of cancer may lead to an elevation of TC and LDL-C. This suggests cancer as a causative factor."
i think i'm missing something here. my understanding is that just because something is a biomarker for disease, it doesn't mean that the disease --> biomarker. otherwise, the biomarker is crap. isn't the point of a biomarker to detect disease in its earlier, non-symptomatic state?
however, you're bringing up a good point about the cross-sectional data. we really can't know the direction. furthermore, disease rates were for the late 1970s whereas all of the survey and biospecimen collections were in 1983.
Hi hc, thanks.
I have written a new post that addresses, even if indirectly, some of the issues raised:
http://healthcorrelator.blogspot.com/2010/07/china-study-one-more-time-are-raw-plant.html
This new post discusses data from counties without any schistosomiasis infection only.
hc, regarding interaction effects:
I added a few in my intermediate analyses, but they added too much collinearity to the model. WarpPLS gives VIFs (a common collinearity measure), and they were too high for comfort.
This is often the case with small samples. When interaction effects are added, collinearity often goes through the roof. With high collinearity, path coefficients and P values tend to get distorted.
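For those curious about how VIFs react to an interaction term, here is a rough Python sketch. It uses made-up numbers (not the China Study data) and hypothetical function names, and it is not how WarpPLS computes its VIFs internally. It relies on the fact that the VIFs are the diagonal of the inverse of the predictors' correlation matrix, written out by hand for the two- and three-predictor cases.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    return 1.0 / (1.0 - pearson(x1, x2) ** 2)


def vifs_three(x1, x2, x3):
    """VIFs for three predictors: diagonal of the inverse correlation matrix."""
    r12, r13, r23 = pearson(x1, x2), pearson(x1, x3), pearson(x2, x3)
    det = 1.0 + 2.0 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2
    return ((1.0 - r23 ** 2) / det,
            (1.0 - r13 ** 2) / det,
            (1.0 - r12 ** 2) / det)


# Hypothetical predictors, for illustration only.
x1 = [10, 12, 15, 11, 14, 18, 13, 16]
x2 = [3, 5, 4, 6, 2, 5, 7, 4]
interaction = [a * b for a, b in zip(x1, x2)]  # uncentered product term

vif_without = vif_two(x1, x2)
vif_with = vifs_three(x1, x2, interaction)

print("VIF, main effects only:      ", vif_without)
print("VIFs with interaction added: ", vif_with)
```

Adding a predictor can never lower the VIF of the ones already in the model, and a product term built from the main effects tends to overlap with them heavily, hence the jump.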
wow. kudos, hat up to you.
Hi Ned:
Similar results in a British study, which I discussed here:
http://entropyproduction.blogspot.com/2009/07/british-observational-study-on-cancer.html
I liked the above study because the vegetarians who ate fish (flexitarians) had lower rates of cancers than the vegetarians.
Hi Dr. Gee, thanks. Btw, Dark City is also one of my favs.
Thanks Robert, very interesting post. I am adding your blog to my list so that I can visit often. Looks like my kind of blog!
"This is good, because I checked the data, and it does not look like it is normally distributed."
How did you check this?
Sincerely, NosyOne
There are a number of tests for normality that one can use:
http://en.wikipedia.org/wiki/Normality_test
The Kolmogorov–Smirnov test is a widely used one:
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
With some experience, you can simply take a look at a histogram of the data, or even at the data itself, and "see" that it is not normal if that is the case.
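As a rough illustration of the idea behind the Kolmogorov–Smirnov approach, here is a Python sketch that computes the largest gap between a sample's empirical distribution and a normal curve with the same mean and standard deviation. The data and function names are hypothetical. Note the assumption: because the mean and standard deviation are estimated from the sample itself, the standard Kolmogorov–Smirnov tables do not apply directly, so this sketch only compares distances and does not produce a P value.

```python
from math import erf, sqrt


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


def ks_distance_from_normal(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    data = sorted(values)
    n = len(data)
    mean = sum(data) / n
    sd = sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
    distance = 0.0
    for i, v in enumerate(data):
        fitted = normal_cdf(v, mean, sd)
        # Check the gap just after and just before the step at this point.
        distance = max(distance, (i + 1) / n - fitted, fitted - i / n)
    return distance


# Hypothetical samples: one roughly symmetric, one strongly right-skewed.
symmetric = [-2, -1, -1, 0, 0, 0, 1, 1, 2]
skewed = [1, 1, 1, 1, 2, 2, 3, 5, 9, 20]

d_symmetric = ks_distance_from_normal(symmetric)
d_skewed = ks_distance_from_normal(skewed)

print("Distance, symmetric sample:", d_symmetric)
print("Distance, skewed sample:   ", d_skewed)
```

The skewed sample sits noticeably farther from its fitted normal curve than the symmetric one, which is the same thing a histogram would show at a glance.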
So, you did that to the cancer data? I mean the data set which shows the cancer numbers in different counties.
Sincerely, NosyOne
P.S thank you for the responses!
Hi NosyOne.
Colorectal cancer was the main dependent variable in the analysis in this post.
I didn't have to even worry about normality, because the technique used for P value calculation was nonparametric.
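For the curious, here is a minimal Python sketch of the general idea behind jackknifing. It is built on several assumptions of mine and is not WarpPLS's actual procedure: it uses a delete-one jackknife, a simple Pearson correlation standing in for a path coefficient, a two-tailed P value from a normal approximation, and made-up data.

```python
from math import erfc, sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def jackknife_se(x, y, statistic):
    """Delete-one jackknife standard error of a two-variable statistic."""
    n = len(x)
    # Recompute the statistic n times, leaving out one case each time.
    leave_one_out = [
        statistic(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(n)
    ]
    center = sum(leave_one_out) / n
    return sqrt((n - 1) / n * sum((v - center) ** 2 for v in leave_one_out))


def two_tailed_p(estimate, se):
    """Two-tailed P value, using a normal approximation (an assumption)."""
    return erfc(abs(estimate) / se / sqrt(2.0))


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1, 9.2, 9.7]

estimate = pearson(x, y)
se = jackknife_se(x, y, pearson)
p_value = two_tailed_p(estimate, se)

print("Estimate:    ", estimate)
print("Jackknife SE:", se)
print("P value:     ", p_value)
```

The standard error comes from how much the estimate moves when individual cases are dropped, not from a formula that assumes normally distributed data, which is why normality of the variables was not a concern.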
awesome! thanks
Various spam comments above deleted. This post apparently has become a magnet for spam!
This post is a revised version of a previous post. The original comments are preserved here. More comments welcome, but no spam please!