Sunday, August 15, 2021

The China Study one more time: Are raw plant foods giving people cancer?

In this previous post I analyzed some data from the China Study that included counties where there were cases of schistosomiasis infection. Following one of Denise Minger’s suggestions, I removed all those counties from the data. I was left with 29 counties, a much smaller sample size. I then ran a multivariate analysis using WarpPLS, like in the previous post, but this time I used an algorithm that identifies nonlinear relationships between variables.

Below is the model with the results. (Click on it to enlarge. Use the "CTRL" and "+" keys to zoom in, and "CTRL" and "-" to zoom out.) As in the previous post, the arrows explore associations between variables. The variables are shown within ovals. The meaning of each variable is the following: aprotein = animal protein consumption; pprotein = plant protein consumption; cholest = total cholesterol; crcancer = colorectal cancer.

What is total cholesterol doing at the right part of the graph? It is there because I am analyzing the associations between animal protein and plant protein consumption with colorectal cancer, controlling for the possible confounding effect of total cholesterol.

I am not hypothesizing anything regarding total cholesterol, even though this variable is shown as pointing at colorectal cancer. I am just controlling for it. This is the type of thing one can do in multivariate analyses. This is how you “control for the effect of a variable” in an analysis like this.
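To make the idea of controlling for a variable concrete, here is a minimal sketch using plain two-predictor linear regression on standardized variables. It is not the WarpPLS algorithm, and the numbers are made up for illustration (they are not China Study data); the names `predictor`, `control`, and `outcome` are hypothetical. The outcome is built to depend only on the control, yet the predictor looks strongly related to it until the control enters the model.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def standardized_betas(predictor, control, outcome):
    """Standardized coefficients for outcome ~ predictor + control."""
    r_py = pearson(predictor, outcome)
    r_cy = pearson(control, outcome)
    r_pc = pearson(predictor, control)
    denom = 1 - r_pc ** 2
    beta_predictor = (r_py - r_cy * r_pc) / denom
    beta_control = (r_cy - r_py * r_pc) / denom
    return beta_predictor, beta_control

# Made-up numbers, NOT China Study data: the outcome depends only on the control.
control = [1, 2, 3, 4, 5, 6, 7, 8]
predictor = [2, 1, 4, 3, 6, 5, 8, 7]   # tracks the control closely
outcome = [2 * c for c in control]

raw_r = pearson(predictor, outcome)
beta_predictor, beta_control = standardized_betas(predictor, control, outcome)
print(f"raw correlation, predictor vs outcome: {raw_r:.2f}")
print(f"beta for predictor, controlling for control: {beta_predictor:.2f}")
print(f"beta for control: {beta_control:.2f}")
```

In this toy case the raw correlation is high, but the predictor's coefficient drops to essentially zero once the control is included, because the control carries all of the association.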

Since the sample is fairly small, we end up with nonsignificant beta coefficients that would normally be statistically significant with a larger sample. But it helps that we are using nonparametric statistics, because they are still robust in the presence of small samples and deviations from normality. Also the nonlinear algorithm is more sensitive to relationships that do not fit a classic linear pattern. We can summarize the findings as follows:

- As animal protein consumption increases, plant protein consumption decreases significantly (beta=-0.36; P<0.01). This is to be expected and helpful in the analysis, as it differentiates somewhat animal from plant protein consumers. Those folks who got more of their protein from animal foods tended to get significantly less protein from plant foods.

- As animal protein consumption increases, colorectal cancer decreases, but not in a statistically significant way (beta=-0.31; P=0.10). The beta here is certainly high for such a small sample; the P value means that an association at least this strong would show up by chance about 10 percent of the time if there were no real relationship.

- As plant protein consumption increases, colorectal cancer increases significantly (beta=0.47; P<0.01). The small sample size was not enough to make this association nonsignificant. The reason is that the distribution pattern of the data here is very indicative of a real association, which is reflected in the low P value.

Remember, these results are not confounded by schistosomiasis infection, because we are only looking at counties where there were no cases of schistosomiasis infection. These results are not confounded by total cholesterol either, because we controlled for that possible confounding effect. Now, control variable or not, you would be correct to point out that the association between total cholesterol and colorectal cancer is high (beta=0.58; P=0.01). So let us take a look at the shape of that association:

Does this graph remind you of the one on this post; the one with several U curves? Yes. And why is that? Maybe it reflects a tendency among the folks who had low cholesterol to have more cancer because the body needs cholesterol to fight disease, and cancer is a disease. And maybe it reflects a tendency among the folks who have high total cholesterol to do so because total cholesterol (and particularly its main component, LDL cholesterol) is in part a marker of disease, and cancer is often a culmination of various metabolic disorders (e.g., the metabolic syndrome) that are nothing but one disease after another.

To believe that total cholesterol causes colorectal cancer is nonsensical because total cholesterol is generally increased by consumption of animal products, of which animal protein consumption is a proxy. (In this reduced dataset, the linear univariate correlation between animal protein consumption and total cholesterol is a significant and positive 0.36.) And animal protein consumption seems to be protective against colorectal cancer in this dataset (negative association on the model graph).

Now comes the part that I find the most ironic about this whole discussion in the blogosphere that has been going on recently about the China Study; and the answer to the question posed in the title of this post: Are raw plant foods giving people cancer? If you think that the answer is “yes”, think again. The variable that is strongly associated with colorectal cancer is plant protein consumption.

Do fruits, veggies, and other plant foods that can be consumed raw have a lot of protein?

With a few exceptions, like nuts, they do not. Most raw plant foods have trace amounts of protein, especially when compared with foods made from refined grains and seeds (e.g., wheat grains, soybean seeds). So the contribution of raw fruits and veggies in general could not have influenced much the variable plant protein consumption. To put this in perspective, the average plant protein consumption per day in this dataset was 63 g; even if they were eating 30 bananas a day, the study participants would not get half that much protein from bananas.

Refined foods made from grains and seeds are made from those plant parts that the plants absolutely do not “want” animals to eat. They are the plants’ “children” or “children’s nutritional reserves”, so to speak. This is why they are packed with nutrients, including protein and carbohydrates, but also often toxic and/or unpalatable to animals (including humans) when eaten raw.

But humans are so smart; they learned how to industrially refine grains and seeds for consumption. The resulting human-engineered products (usually engineered to sell as many units as possible, not to make you healthy) normally taste delicious, so you tend to eat a lot of them. They also tend to raise blood sugar to abnormally high levels, because industrial refining makes their high carbohydrate content easily digestible. Refined foods made from grains and seeds also tend to cause leaky gut problems, and autoimmune disorders like celiac disease. Yep, we humans are really smart.

Thanks again to Dr. Campbell and his colleagues for collecting and compiling the China Study data, and to Ms. Minger for making the data available in easily downloadable format and for doing some superb analyses herself.


Anonymous said...

Looking at your U-curve, I'd say that it's not a particularly good shape for the data. The points you show look like a horizontal line up to about 0.5 on the horizontal axis, then a rising line. Can you use a piecewise-linear regression line with your software?

This kind of fit would seem to make more sense - no effect until you get sufficient intake, then a dose-dependent effect starts to show up.

Tom Passin said...

Sorry, didn't mean to be anonymous for my comment above.

Ned Kock said...

Hi Tom.

That is the best-fitting U curve for the data. Or so says the computer!

Aaron Blaisdell said...

Very nice analysis! The Chinese rarely eat vegetables raw, only fruit, so it can't be raw plants linked to cancer. Since going primal, I've been wondering why most people find eating grains and legumes so appealing. Given their toxicity, it is quite a conundrum. My guess is that they contain some addictive qualities. It has been shown that some grains contain proteins that stimulate endogenous opiate receptors, so these would clearly be addictive. Perhaps the rapid blood sugar rise and fall also makes them addictive because they encourage frequent food seeking to replenish blood glucose after the crash--thus perpetuating a vicious cycle best filled by simple carbohydrates that rapidly raise blood glucose. Thus, there's an element of Pavlovian and instrumental conditioning involved. I became aware of this after going primal and extinguishing the carb cravings. This is probably a more significant difficulty in switching western diets than opposition by big Agra and big Pharma, though they have a big influence, too.

Ned Kock said...

Very, very good points Aaron. Thanks.

Tom Passin said...

"That is the best-fitting U curve for the data. Or so says the computer! "

I don't doubt it ... let's see, you are using a quadratic, right? I'm just pointing out that even just going by eye, the variance would be reduced by a piecewise-linear fit where the left-hand segment is horizontal. I don't think that would even reduce the number of degrees of freedom, or if so, only by one.

Ned Kock said...

Hi Tom.

Maybe, but we don't want to fall into the trap of overfitting. This (overfitting) can be a common problem in nonlinear analyses.
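The comparison discussed in this exchange can be sketched as follows: fit a quadratic and a "flat, then rising" piecewise-linear (hinge) function to the same points and compare the sums of squared errors. This is only an illustration under stated assumptions. The data are synthetic and deliberately hinge-shaped (so the hinge wins by construction), the breakpoint is picked by a simple grid search, and the helper names `solve` and `fit_sse` are hypothetical; none of this is what WarpPLS does internally.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_sse(columns, y):
    """Least-squares fit of y on the given columns; return the sum of squared errors."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    coef = solve(xtx, xty)
    pred = [sum(coef[i] * columns[i][t] for i in range(k)) for t in range(len(y))]
    return sum((p - v) ** 2 for p, v in zip(pred, y))

# Synthetic data: flat up to 0.5 on the horizontal axis, then rising.
x = [i / 10 for i in range(11)]
y = [1.0 + 2.0 * max(0.0, xi - 0.5) for xi in x]
ones = [1.0] * len(x)

# Quadratic: intercept, x, x^2 (three fitted parameters).
sse_quad = fit_sse([ones, x, [xi ** 2 for xi in x]], y)

# Hinge: intercept, slope after the breakpoint, plus the breakpoint itself
# (also three parameters, the third chosen by grid search).
sse_hinge, best_bp = min(
    (fit_sse([ones, [max(0.0, xi - bp) for xi in x]], y), bp) for bp in x[1:-1]
)

print(f"quadratic SSE: {sse_quad:.4f}")
print(f"hinge SSE: {sse_hinge:.4f} at breakpoint {best_bp}")
```

On real, noisy data with only 29 points the difference between the two shapes could easily be within noise, which is where the overfitting concern comes in.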

Anne said...

Hi Ned,

I couldn't understand this:

"Maybe most (not all) of the folks who had low cholesterol ended up having more cancer because the body needs cholesterol to fight disease, and cancer is a disease. And maybe most (not all) of the folks who have high total cholesterol do so because total cholesterol (and particularly its main component, LDL cholesterol) is in part a marker of disease,"

If cholesterol helps the body fight disease and people get cancer because they don't have enough cholesterol then how can high cholesterol be a marker for the same disease which happens when people have low cholesterol ? I'm sure I have misunderstood this but I'm trying to figure it out.....mainly 'cos I have high cholesterol (7.1) but I won't take a statin and I've not got metabolic syndrome either, my HDL is high (2.7) and triglycerides low (0.5) so my doc isn't worried either. Please can you clarify about low and high cholesterol.


Ed said...


I just stumbled across Peter's old post, about how wheat germ agglutinin triggers cell division in the intestine, and higher rates of intestinal cancer.

Ned Kock said...

Hi Anne.

High TC and LDL are markers for disease probably for the same reason; they are used to fight disease.

Lipids in general are also diet markers. People on a diet rich in animal fat will often have high TC and LDL, but also high HDL. See this post:

One can have a high LDL-C, but if the LDL particles are mostly large and buoyant, their contribution to atheromas is significantly decreased. So one can have a fairly high LDL and be very healthy. See this post:

Unfortunately health issues are not as simple as some MDs and researchers make them appear.

Ned Kock said...

Hi Ed, thanks.

Anne said...

Thanks Ned !

Denise said...

Hey Ned,

Awesome job! I was working with linear regressions for this data set (as was Campbell, as far as I know, for the cholesterol/colorectal cancer correlation -- he did not employ U-curves, as his hypothesis was that as cholesterol increases, so does disease, even starting at the lowest levels). If this is a more accurate way to model it I will gladly revise my post.

What were your results using this method for A) the full data set and B) the schistosomiasis-infected group?

Ned Kock said...

Hi Denise, thanks.

I haven't done an analysis on the schisto cases only yet. I would be very surprised if the main results were different from the ones for the whole dataset, because of the schisto's disproportionately strong effect. That is on this post, the one you linked earlier:

Also, I personally don't consider it very good practice to delete data from datasets that are already small in order to explore further effects. The small sample size may lead to interpretation problems. I did this here only because the schisto effect was way too strong in the whole dataset.

I hope more people will do their own analyses on the original data, like we have been doing. Then the discussion will move away from X or Y are saying this, to something more like "the data" is saying this.

Thanks again.

Ned Kock said...

As for multiple linear analyses, it is good to have them. They are not that bad. The problem with them is that most relationships in natural and behavioral phenomena are nonlinear, with U curves (straight or inverted) being particularly common.

When I say U curves I mean the whole curves or sections of them. Sections of U curves would model other types of functions such as hyperbolic decay and logarithmic functions.

Denise said...

"I hope more people will do their own analyses on the original data, like we have been doing. Then the discussion will move away from X or Y are saying this, to something more like "the data" is saying this."

I completely agree -- I hope this is what will happen!

I'm going to try posting some more data today. I was going to wait until I made my post with my own regression analysis results, but I'm not sure when I'll be getting to that (I'm working on a reply to Campbell first) so might as well start getting more data out there now. :)

Also: what relationship did you find between animal protein consumption and cholesterol? If it was weak, this is more evidence that Campbell's use of cholesterol as a biomarker for animal food consumption is unreliable. That would also be potential support for the hypothesis that (at least in this case) rising cholesterol could be a result, and not a cause, of disease.

Ned Kock said...

Hi Denise.

The linear and nonlinear relationships between TC and animal protein consumption were strong and significant in all analyses I have done so far on the whole dataset.

I haven't checked it using nonlinear in this smaller dataset, but I can tell you that the linear univariate correlation between animal protein consumption and total cholesterol is a significant and positive 0.36 here.

Doing an analysis that takes nonlinearity into consideration usually (but not always) amplifies the strength of an already strong linear association, even if it is a univariate one.

I personally believe that TC and LDL-C are markers of disease AND of a diet rich in animal fat, of which animal protein may be seen as a proxy. They may be markers of other things as well - e.g., a relatively rare condition known as familial hypercholesterolemia, which usually leads to disease.

From a "whole lipids" perspective, the difference is usually in HDL-C. If one's HDL-C is high, usually a high LDL-C is associated with non-atherogenic large-buoyant LDL particles:

If one's HDL-C is low, usually a high LDL-C is associated with more potentially atherogenic small-dense LDL particles:

The combination of a low HDL-C and a high LDL-C is particularly common among older folks whose diet is primarily of foods rich in refined carbohydrates and sugars (white bread, pasta, doughnuts, regular sodas etc.).

Denise said...

Great info on cholesterol, Ned!

I just added a new post where I'll be posting more data. So far, I've added a number of variables for heart disease.

Data for HDL cholesterol is up there but I'll also be adding non-HDL fractions, especially apo-B, later on, in case you're interested in looking at that.

Chris Masterjohn said...

Hi Ned,

This post is pretty interesting. However, I'm a bit confused why you are controlling for cholesterol. One of the most important criteria of a potential confounder is that it does not lie as an intermediate in the causal pathway between the independent and dependent variable. It's pretty clear that the type of protein one eats can affect total cholesterol and it's pretty unclear whether total cholesterol might have a causal impact on the disease. Even if it doesn't itself have an effect on the disease, it seems removing variation due to total cholesterol would remove a portion of the variation truly due to protein intakes.

I like what you said about LDL-C and HDL-C. I wrote a blog a while back that might offer an explanation for why low HDL-C would be associated with pattern B LDL. Briefly, increased residence time of LDL particles, which is largely due to insufficient activity of the LDL receptor, will lead to increased transfer of cholesterol from HDL to LDL particles and increased a) oxidation of LDL and b) metabolism of LDL by lipoprotein lipase, both of which will reduce LDL size.

Here's the full blog:


Ned Kock said...

Hi Denise, thanks!

Ned Kock said...

Hi Chris.

The U curve of TC vs. cancer has two sides, left and right. The left side reflects a relationship that in my view is causal. That is, as TC goes down, so does the body's ability to fight disease, hence an increase in cancer rates.

As you correctly point out, the right side is not causal. There, TC reflects cancer as TC increases in response to cancer (and the complications that precede it). So, on the right side TC is a marker of disease, not a causative factor.

One may argue that because of the above, the inclusion of TC robs too much explained variance in the variable cancer, thereby reducing the strengths of the paths between animal and plant protein and cancer.

That is a fair point, except that TC reflects other confounding factors that I would like to have controlled for, if data on them was available. This applies primarily to the right side of the U curve. One of those confounding factors is age, which here would have to be the mean age in each county (not very good). The other is smoking, which I suspect was high, with variations across counties. Both of those factors increase cholesterol, and also arguably increase cancer rates.

Ned Kock said...

Chris, I will write another comment addressing the HDL issue, which I think is fascinating.

Ned Kock said...

Hi Chris, again.

Let me first tell you that you have an excellent blog, which I have been reading for quite some time. I also appreciate your sharing your experience, which I read a while ago.

My knowledge of lipid metabolism is mostly from my reading of pubs that have numbers in them, to which I try to apply common sense.

Through that reading, I couldn't help but notice that HDL-C is often strongly and negatively correlated with VLDL-C. I also noticed that VLDL particle size tends to vary a lot, so VLDL is a good candidate for variability in response to dietary factors. (See el. micro. photo on post below.)

Putting things together, my thinking moved in a particular direction. I think that, in folks who don't have familial hyperchol. (FH), the problem starts with an alteration of VLDL secretion patterns. For example, I believe that in diets rich in refined carbs and sugars, VLDL particles are secreted in higher quantities, and are smaller in size. Those particles end up as s-d LDL particles (pattern B). Then I think the mechanism is pretty much what you described, since s-d LDL particles have a longer life, and are more prone to glycation and oxidation.

In other words, I think that in non-FH cases, the problem starts with the VLDL particles. Whereas in FH, I think that the problem is more with receptors.

Chris Masterjohn said...

Hi Ned,

Thanks for the props on the blog! :)

I'll follow you by responding in a separate comment for each topic for clarity.

I did not point out that the right side isn't causal. What I said is that it is not clear from the data alone nor from any outside evidence that has been referenced in this discussion so far whether total cholesterol has any cause-and-effect relationship to colorectal cancer. It is not obvious to me that the left side is causal and the right side is not, and although you might believe that, I think arguing such requires a great deal more evidence than has been presented here and probably a great deal more than has been compiled on the topic elsewhere.

I think your concept of using total cholesterol as a marker for age and smoking is deeply problematic. First, these data are available in the monograph and there is no justification for substituting cholesterol. Second, while the age structure of the community should affect its cancer mortality rate, the mortality statistics are taken from a different decade than the survey. The argument was that the diet did not change much in those communities during that time period, but certainly the age structure was likely to have changed. Third, I do not think there is any legitimacy to using surrogate covariates in this manner and I have never seen it done. Sure, these things might be correlated, but if the r-squared is 15-30%, which is most likely the case, the efficacy of your surrogate will be extremely poor. Moreover, if some are correlated in linear relationships and your surrogate has a U-shaped trend, that's a further problem. The bottom line is no one ever does this and I think for very good reason.

But the most important point is that protein intakes do causally impact cholesterol levels. Therefore, if you remove some of the variance attributable to cholesterol levels, you are removing some of the variance due to protein intake without any real justification. And in fact, if the total cholesterol is a marker of something causally related to cancer -- for example, an inflammatory process -- you could be removing variation representing a causal pathway between protein intakes and colorectal cancer.

I think your analysis is interesting in a very preliminary sense but I think it would be more useful if you were not adjusting for cholesterol or at least if you presented the data in both forms and opened up a discussion about the relative merits of adjusting for cholesterol.


Chris Masterjohn said...

Hi again Ned,

Second post on lipoproteins. :)

You make a good point that lipoprotein size is determined in part by dietary factors that affect their size at secretion. I neglected to include that among factors in my last post.

Diets high in refined carbs actually lead to larger VLDL particles, not smaller ones. However I think you're on the right track because large VLDL particles are associated with small LDL particles.

Here's a study showing carbohydrate restriction leads to reduced VLDL size and increased LDL size:

They argue that larger VLDL particles are better substrates for lipoprotein lipase and hepatic lipase and thus are metabolized to small LDL.

I do not believe that LDL receptor issues are limited to people with familial hypercholesterolemia. The hypothesis that sub-optimal levels of LDL receptor activity mediate atherosclerosis risk in the general population is very strongly supported by the finding that people with a genetic defect in the enzyme that degrades the LDL receptor have an 88% reduced risk of heart disease:

Studies have shown that small LDL is more prone to oxidation in vitro, but no one has definitively shown why. As I pointed out in the main heart disease article on my site, at least part of it is due to the fact that small LDL particles have already been partially oxidized once they are taken out of plasma:

In fact when they first found that prolonged (24 hr) contact with endothelial cells converted LDL into a different form that began killing the cells in the dish, before they knew that the LDL was being oxidized during this incubation, they noted that "endothelial cell-modified" LDL particles had a dramatic increase in density.

Thus, oxidation of the LDL particle makes it small and dense. I think you are right that diets rich in refined carbs lead to VLDL secretion patterns that wind up causing smaller LDL particles, but it is also true that these diets promote oxidative stress, which itself will reduce LDL particle size by oxidizing the LDL particles, and that suboptimal LDL receptor function is widespread, and this too reduces LDL particle size.

So it's a symphony of mechanisms. Thanks for the good points you've made!


Ned Kock said...

Hi Chris.

A model with 3 causative factors for something as complex as cancer is not only preliminary, it is certainly very incomplete. This model explains 34 percent of the variance in colorectal cancer. That is, 66 percent of the variance is unexplained. Plus the sample is small - 29 cases.

There are other factors that need to be controlled for, not only age and smoking. One would be other infections; another would be levels of obesity. TC is certainly not used here as a surrogate for age or smoking, or any of these other variables. It doesn't fit well the definition of a surrogate.

An example of a surrogate, in a different context, would be perceived cognitive effort, used to measure actual cognitive effort. I guess I am being clear, right? I am not sure I even used the words "surrogate" or "proxy".

TC, as I see it, makes the model more complete because there are a number of control variables missing. And also because of the reasons that I gave you. But no, not because TC is a proxy for age. Again, another example: a proxy for age would be estimated age based on recall of past events.

Now, leaving TC out could do different things to the competing path coefficients, but the end result could be rendered meaningless because the explained variance in the cancer variable would be too low. Most likely that would happen, since the path coefficient from TC to cancer is so high.

I am not sure what the validity of writing a post with an even more incomplete model would be. Perhaps it would lead to some interesting discussions, as you suggested, but I am not so sure.

Maybe what we should do is join forces (you, Denise, I, and anybody else who would like to be involved) and put together the best dataset we can based on the China Study. Then we could agree on some key analyses that could be run on it to provide a much better picture of what is going on. I would be happy to do the analyses, or help do them.

In the meantime, there are also some other smaller pieces of the dataset being made available. I like the idea of running analyses on them to see if the results make sense in light of these posts, and existing research.

I just wish I had more time.

Ned Kock said...

Hi Chris, again.

The discussion on lipoproteins is fascinating. I will get to the links later, but just wanted to say that it is really interesting.

Sometimes my impression is that evolution creates mechanisms that are, in some cases, too redundant. But I guess I am looking at this from a rather limited perspective; a human perspective.

Chris Masterjohn said...

Hi Ned,

I apologize for my last comment which didn't really make much sense. I deleted it and am going to try again.

I agree that it is desirable to have a more complete model. That would be a result of course of much deliberation and painstaking data analysis, literature review, etc.

However, in evaluating the effects of plant and animal protein, the most reasonable goal is not to try to maximize the explanation of the variance but instead to try to achieve the truest estimation of the relationship with protein intakes.

So if we include a factor in a multiple regression model, we can basically achieve two things. We might find an independent additive effect where the relationships of each stay the same but the explanatory power is, say, doubled because we have identified two more or less unrelated factors that contribute. Or, we might enhance or diminish the other relationship(s) by introducing an X variable related to the others. This might be great if the factor is a true confounder but could be questionable if it is not.

For example, if you found a relationship between obesity and heart disease you would not "adjust" for the effect of blood pressure because high blood pressure can be a consequence of obesity.

In the case of cholesterol, you do not have the first scenario because you have statistical relationships with the other independent variables and with the dependent variable.

You might have a confounder, in which case it is desirable to add it to the model and adjust, but whether it meets the criteria of a confounder needs to be critically examined. It should be rejected from the model, I believe, if there is significant evidence that it can be an effect of any of the primary independent variables of interest and especially if there is any biological plausibility that it could also be a cause of the dependent variable of interest (or a marker for something that is caused by the independent variable of interest and contributes causally to the Y variable of interest).

Certainly the first of those conditions is met (that plant and animal protein intakes can cause changes in cholesterol). The second condition could be a matter of extensive debate. I think this raises the possibility that including cholesterol in the model is actually introducing a confounder rather than removing one.

If the explanatory power of the model is low without including cholesterol, that might simply mean that animal and plant protein intakes have very little explanatory power. I don't see anything wrong with identifying a lack of a relationship where there is none.

Usually when adjusted analyses are presented it is customary to present several models of adjustment after presenting unadjusted data.

What I was suggesting was that a contrast of the model with and without adjustment for cholesterol and a discussion of the pros and cons of adjusting for cholesterol would be useful.

I hope that makes better sense.

In any case, thanks for the work you put into this!


Ned Kock said...

Hi Chris.

A clarification first. This analysis employs robust nonlinear path analysis. The "robust" here essentially means nonparametric resampling. The nonlinear treatment is quite simple. The software finds a function that fits the data, but that does not overfit it, and then transforms the predictor values prior to calculating the path coefficients. More here, in case you or other readers are interested:
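As a rough illustration of the "fit a function, transform the predictor, then calculate the coefficient" idea described above, here is a minimal sketch. It assumes a plain least-squares quadratic as the fitted function and a single predictor, which is a simplification and not the actual WarpPLS implementation; the data are made up, and `solve` and `quadratic_fit` are hypothetical helper names.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    out = [0.0] * n
    for i in reversed(range(n)):
        out[i] = (m[i][n] - sum(m[i][j] * out[j] for j in range(i + 1, n))) / m[i][i]
    return out

def quadratic_fit(x, y):
    """Least-squares coefficients (a, b, c) for y = a + b*x + c*x^2."""
    cols = [[1.0] * len(x), list(x), [xi ** 2 for xi in x]]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)

# Made-up U-shaped data, not taken from the China Study.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [9.5, 3.5, 1.2, 0.3, 0.8, 4.4, 8.9]

# Step 1: find a function that fits the data (here, a quadratic).
a, b, c = quadratic_fit(x, y)
# Step 2: transform ("warp") the predictor values with that function.
warped = [a + b * xi + c * xi ** 2 for xi in x]
# Step 3: compute the standardized coefficient; with one predictor it is
# simply the correlation between the (warped) predictor and the outcome.
r_linear = pearson(x, y)
r_warped = pearson(warped, y)
print(f"linear coefficient: {r_linear:.2f}")
print(f"coefficient after warping: {r_warped:.2f}")
```

The linear coefficient is close to zero for a symmetric U curve, while the coefficient computed on the transformed predictor is large, which is the sense in which a nonlinear treatment can reveal an association that a linear one misses.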

So this is not exactly like adding a control variable to a multiple regression analysis, but it is closer to that than to adding a covariate to an ANCOVA analysis. In ANCOVA the covariate will have the effect of reducing variance; the predictor is a categorical variable, and ANCOVA compares variance within groups with variance between groups.

Anyway, ultimately what we have here is something like this (an analogy). A widely held but illogical hypothesis: no glass cup ever breaks if you drop it on a concrete floor from a height of 6 ft. What this post does is akin to showing evidence that glass cups break, quite a lot of them, and you don't have to drop them from 6 ft; it is enough to drop them from 5 ft. (Or 4 ft, depending on how TC influences the various relationships.)

I think very few will disagree that including TC makes the analysis more conservative, even though TC is not the best control variable for this model. And, even being conservative, we find that: animal protein does not seem to cause colo. cancer; and plant protein does not seem to protect against colo. cancer. What we find is some preliminary evidence of the opposite, at least in this sample.

Even being conservative.

Chris Masterjohn said...

Hi Ned,

First, I'm really glad I stumbled upon your blog and I look forward to reading more of your posts as I believe I will learn a LOT about statistics from reading your work, which is an important goal in itself for me right now.

Second, thank you for the explanation. I will study the link on warppls when I have some time, perhaps at the end of the week, and look forward to it.

Third, quick tangential question for you -- does the terminology of 'ancova' differentiate between whether the predictor or covariate are categorical? I did an analysis in an experimental context where I calculated within-subject correlations and achieved this by dummy-coding subject number and subtracting SS due to subject out of the final model, generating 'r' by dividing SS due to my predictor by the total SS left over, and the methods paper I relied on referred to this as a form of ancova, but my predictor of interest was continuous and not categorical. Also, is there any analogous method where one controls for a continuous variable by subtracting out the variation due to that variable in a similar way?

On the topic at hand, I'm wondering whether the trend toward a positive association with plant protein might be an artifact of including cholesterol. In the data subset you are using, simple linear regression shows a substantial negative correlation between plant protein and cholesterol (r^2 ~25%, this is using the diet survey; not sure which one you used). If this reflects a causal relationship, can't adding cholesterol to the model potentially distort the relationship? If diets high in plant protein tend to cause lower cholesterol levels, wouldn't "controlling" for cholesterol distort the true relationship to plant protein?

Finally, thank you for putting this all together. I think one of the most interesting things here is the degree to which one can analyze data in different ways and get different results without there always being a straightforwardly obvious "right" way to analyze the data. I look forward to more posts!


Ned Kock said...

Hi Chris.

The addition of a control variable like TC usually “steals” variance explained from other predictors, normally decreasing their respective paths. But it could also influence the paths in an upward manner, through a phenomenon called “suppression”:

There are a few people who have gotten deeper into this topic than I have. I tend to agree with some of the conclusions they reached. One of them is a computer scientist by the name of Judea Pearl, who argues that suppression is a statistical illusion, similar to “Simpson’s paradox”.

So what is one to do? Inclusion of new predictors (as controls or not) can sometimes distort results. (So can nonlinear corrections.) But typically this happens if they significantly increase colinearity in the model. WarpPLS reports a measure of colinearity, the variance inflation factor (VIF), which I always check when I analyze any dataset.

Given this, I’d be inclined to keep a control variable if: it makes at least some theoretical sense to keep it (even if to disprove a competing theory), the variable improves the fit of the model with the data (e.g., R-squared values increase, especially the one for your main dependent variable), and the resulting colinearity coefficients are low (VIFs < 5).
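
As a rough illustration of the VIF threshold mentioned above, here is a minimal Python sketch. It assumes the textbook definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others; with exactly two predictors, R² is simply their squared correlation. The helper names and the four data points are made up for the example (they are not China Study values), and this is not a description of how WarpPLS computes its VIFs internally.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Made-up values chosen so that r = 0.5 (r-squared = 25%); not China Study data.
predictor_a = [1, -1, 0, 0]
predictor_b = [1, 0, -1, 0]

r = pearson_r(predictor_a, predictor_b)
vif = vif_two_predictors(predictor_a, predictor_b)
print(f"r = {r:.2f}, r^2 = {r * r:.2f}, VIF = {vif:.2f}")
# prints: r = 0.50, r^2 = 0.25, VIF = 1.33
```

Under this two-predictor assumption, an r-squared of about 25% between predictors (whatever the sign of the correlation) gives a VIF of about 1.33, well below 5. With three or more predictors the VIF depends on the multiple R², so it cannot be read off a single pairwise correlation.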

One more comment on ANCOVA coming …

Ned Kock said...

Hi Chris, again.

Now to the subject of ANCOVA. I always recommend that people use robust instead of parametric tests like ANCOVA. Parametric tests have a number of restrictions to their use, notably normal distribution of the criterion variable and large sample sizes. This means that the results of parametric tests are less trustworthy if those restrictions are not met.

With WarpPLS, the way to do a robust ANCOVA-like test is to create one dummy variable and replace your categorical predictor with it. For example, if your predictor is country (a categorical variable), and you have two countries (X and Y), then you can assign the value 1 to X and 0 to Y. Your predictor will then be something like “degree of X-ness”. Sounds a bit odd, but works. Then you can add your control variable as I did with TC. That would be your covariate in the ANCOVA test. They all have to be numeric, of course.

With only 2 categories, one single dummy variable will do. If you have more than 2 categories, you will need C dummy variables, where C is the number of categories. Each dummy should be added to the model separately, because if you add them together you will most likely add a lot of colinearity to the model. The dummies would have 1 for the country's cases, and 0 for the other cases. (These numbers are arbitrary; you could use 103 and 4 instead, for example.)
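
The remark that the dummy codes are arbitrary can be checked with a small Python sketch. The country labels, the outcome values, and the helper `pearson_r` are hypothetical and only serve the illustration. The point is that a correlation (and hence a standardized coefficient) does not change when 1 and 0 are replaced by 103 and 4, because standardization removes the scale and the offset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: two countries and a made-up numeric outcome.
countries = ["X", "X", "X", "Y", "Y", "Y"]
outcome = [3.0, 4.0, 5.0, 6.0, 8.0, 7.0]

# "Degree of X-ness" coded two ways: 1/0 and the arbitrary 103/4.
dummy_10 = [1 if c == "X" else 0 for c in countries]
dummy_alt = [103 if c == "X" else 4 for c in countries]

r_10 = pearson_r(dummy_10, outcome)
r_alt = pearson_r(dummy_alt, outcome)
print(f"1/0 coding: {r_10:.4f}; 103/4 coding: {r_alt:.4f}")
```

Both codings give the same value. If the larger code were assigned to Y instead of X, only the sign would flip.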

No need to use a nonlinear algorithm, if you don’t want to. WarpPLS implements 4 algorithms, and two of them are linear. WarpPLS is a structural equation modeling (SEM) software, and many of the standard tests used in the social and natural sciences can be conceptually seen as special cases of SEM. This includes path analysis, multiple regression, MANOVA, MANCOVA, ANCOVA, and ANOVA.

Another comment coming ...

Ned Kock said...

Okay, 3rd part Chris. (Blogger limits comment size.)

You are right. There are many approaches to analyze data. Good researchers normally use several for any given dataset, and triangulate results. I did several intermediate analyses on this post’s dataset, and reported what I am comfortable reporting, being somewhat conservative. I could write ten posts but: (a) non-technical readers would be turned off; and (b) I have other things to do.

By the way, I particularly like triangulating multiple qualitative with multiple quantitative data analysis tests.

Good researchers do a lot of analyses, looking at the data from different angles. Even doing that, it is often difficult to come up with solid generalizations that apply to everybody.

A similar situation occurs in the stock market – trying to predict the future based on the past. But I have been using stats there too, WarpPLS analyses in particular, and it has been working reasonably well for me.

One could argue that I’m lucky. But guess what? The harder I work, the luckier I seem to get!

Chris Masterjohn said...

Hi Ned,

Thanks for your responses.

My minor analysis of the dataset yielded basically no relationship between plant protein and colorectal cancer when analyzed singly and a significant negative correlation between plant protein and total cholesterol (r-squared ~25%), which is a meaningful correlation but nothing approaching collinearity, so I agree that we are witnessing a 'suppressor' effect but without any problems arising from collinearity.

My concern about the potentially confounding effect of including cholesterol in the model arises from theoretical concerns about cause and effect and not from mathematical concerns about collinearity.

In a multiple regression model, which you said your method is similar to, the results show the effect of each variable keeping the others constant.

So, in your model, we are addressing the question, "what is the relationship between plant protein and colorectal cancer when total cholesterol remains constant?" (Please correct me if I am making assumptions about your methods based on multiple regression if your method deviates from multiple regression in this respect).

This is a very useful and important question *if* there is no cause-and-effect relationship between plant protein and cholesterol.

If, however, dietary patterns rich in plant protein *cause* lower total cholesterol levels, then the question "what is the relationship with plant protein when cholesterol levels remain constant" departs from reality because in the real world cholesterol does *not* remain constant when you vary plant protein.

If there were no cause-and-effect relationship, the question your model is asking would cut out the noise and help us see the true relationship to plant protein. If there is a cause-and-effect relationship, the question your model is asking loses biological relevance.

So my concern here is that I find it very plausible that diets rich in plant protein *cause* decreased total cholesterol. I find this plausible primarily because 1) plant protein is less cholesterogenic than animal protein and is negatively correlated with it, 2) plant protein is associated with fiber and PUFA, which both decrease cholesterol levels through different means.

If this cause-and-effect reasoning is correct, then I think including cholesterol in the model becomes a confounder.
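
The concern can be made concrete with a small simulation in Python. This is only a sketch under stated assumptions: the variables are generic (x for an exposure, m for an intermediate, y for an outcome), the effect sizes -0.5 and 0.8 are invented, and x is assumed to influence y only through m. Nothing here is estimated from the China Study data.

```python
import random

def cov(a, b):
    """Sample covariance (cov(a, a) is the sample variance)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (n - 1)

def slope_alone(x, y):
    """Coefficient of x in the regression y ~ x."""
    return cov(x, y) / cov(x, x)

def slope_controlling(x, m, y):
    """Coefficient of x in the regression y ~ x + m (two-predictor OLS)."""
    sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
    sxy, smy = cov(x, y), cov(m, y)
    return (sxy * smm - smy * sxm) / (sxx * smm - sxm ** 2)

random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]         # exposure
m = [-0.5 * xi + random.gauss(0, 1) for xi in x]   # intermediate, caused by x
y = [0.8 * mi + random.gauss(0, 1) for mi in m]    # outcome, caused only by m

total = slope_alone(x, y)            # close to -0.5 * 0.8 = -0.4
direct = slope_controlling(x, m, y)  # close to 0
print(f"x alone: {total:.2f}; x controlling for m: {direct:.2f}")
```

In this invented setup the coefficient of x is clearly negative when x is analyzed alone and close to zero once m is in the model. Both numbers are correct answers, but to different questions: the first is the overall relationship, the second is what remains of it apart from the pathway through m.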


Chris Masterjohn said...

Hi Ned,

Thank you for the explanation. I do have some experience dummy coding. Did you mean C-1 rather than C dummy variables?

Your statistical knowledge is much more sophisticated than mine is, so please correct me if my understanding is wrong, but it seems that ANCOVA and partial correlations treat covariates the same way and thus they can be continuous or categorical. When I calculated within-subject correlations with ANCOVA, I dummy coded my subjects as a categorical variable and removed its variation so I treated the categorical variable as the covariate. You describe the opposite situation treating the continuous variable as covariate.

I did my calculations from the software-generated ANOVA table not realizing the software I was using could have done that itself. Today I ran some analyses of partial correlation between continuous variables and used the same calculations in Excel I used for ANCOVA and it yielded the same "r" with some rounding error.

So my impression is that ANCOVA and partial correlation are mathematically and conceptually equivalent -- is that correct?
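
For the case of one continuous covariate, the agreement between the two calculations can be checked with a short Python sketch. The data are made up, the helper names are hypothetical, and the covariate is assumed to be entered first so that its sum of squares is removed before the predictor is assessed. Route 1 computes the partial correlation from residuals; route 2 follows the sums-of-squares recipe (SS due to the predictor divided by the total SS left over).

```python
import math

def mean(v):
    return sum(v) / len(v)

def sp(a, b):
    """Sum of cross-products of deviations from the means."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b))

def residuals(y, z):
    """Residuals of the simple regression y ~ z."""
    b = sp(z, y) / sp(z, z)
    my, mz = mean(y), mean(z)
    return [yi - my - b * (zi - mz) for yi, zi in zip(y, z)]

# Made-up values: x = predictor of interest, z = continuous covariate, y = outcome.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
z = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [1.5, 2.9, 3.1, 5.2, 4.8, 7.1, 6.9, 9.0]

# Route 1: partial correlation as the correlation between residuals.
res_y, res_x = residuals(y, z), residuals(x, z)
partial_r = sp(res_x, res_y) / math.sqrt(sp(res_x, res_x) * sp(res_y, res_y))

# Route 2: sums of squares, as one would read them off an ANOVA table.
ss_total = sp(y, y)
ss_covariate = sp(z, y) ** 2 / sp(z, z)   # SS explained by z entered first
ss_left = ss_total - ss_covariate         # total SS left after removing z

# Fit y ~ x + z through the 2x2 normal equations to get the full-model error SS.
det = sp(x, x) * sp(z, z) - sp(x, z) ** 2
bx = (sp(x, y) * sp(z, z) - sp(z, y) * sp(x, z)) / det
bz = (sp(z, y) * sp(x, x) - sp(x, y) * sp(x, z)) / det
mx, mz, my = mean(x), mean(z), mean(y)
sse_full = sum((yi - my - bx * (xi - mx) - bz * (zi - mz)) ** 2
               for xi, zi, yi in zip(x, z, y))
ss_predictor = ss_left - sse_full         # SS due to x after z
r_from_ss = math.copysign(math.sqrt(ss_predictor / ss_left), bx)

print(f"partial r = {partial_r:.4f}, r from SS = {r_from_ss:.4f}")
```

The two routes agree up to floating-point error, which is consistent with the rounding-level agreement described above.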

Congratulations on your success in the stock market! :)


Ned Kock said...

Hi Chris.

A deep understanding of multiple regression and path analysis is, in my view, a mathematical understanding. There is only so much one can translate into words. The underlying math is primarily basic and intermediate matrix algebra.

Without it, one has to resort to metaphors, which usually create confusion for those who want a deeper understanding of what is going on. One metaphor is the “keeping the independent variables constant” one, which is widely used, but makes little sense.

If a variable is kept constant in a regression equation it doesn’t affect the path coefficients at all, because the variance of a constant is zero. Without variation there is no coefficient of association; there is no correlation. A correlation is a measure of the degree to which two variables vary in concert. If one is constant, the notion of correlation loses its meaning.

This situation is a bit like trying to understand evolution in depth without math, particularly the math used in population genetics. You can understand some simple evolutionary phenomena in terms of metaphors, but when it comes to phenomena like the evolution of costly traits (see link to PDF below; particularly Appendix A), the metaphors become a mess.

From a math perspective, I can “see” that a control variable’s association with the dependent variable is what really matters when it comes to deciding to include it or not in a model like the one on this post. That is, the control variable’s association with other independent variables is irrelevant for that decision.

Anyway, other issues:

- Bivariate correlations are not measures of colinearity. Not even close. A reasonably good measure of colinearity is the variance inflation factor (VIF).

- With only 2 categories, one single dummy variable will do. If you have more than 2 categories, you will need C dummy variables, where C is the number of categories (not C – 1).

- I don’t see a problem with using a dummy as covariate in ANCOVA. Having said that, I don’t see the point of using a test like ANCOVA at all; there are other better tests available employing robust stats.

- What you may want to do is to create multiple models, and then try to interpret what they are telling you. Depart from different theoretical assumptions, even wrong assumptions. If you include enough links in the models, you should get some results that will help you discard the models that don’t make sense. See my earlier post on the China Study for an example:

. said...
This comment has been removed by a blog administrator.
Ned Kock said...

Hi O Primitivo.

I have just sent you an email.

A lot of the data (perhaps not all) is also available from here:

This data seems to have more variables than the data that Denise posted. For example, it segments most variables based on sex.

Chris Masterjohn said...

Hi Ned,

The mathematical basis for the metaphor that a multiple regression model shows the effect of one variable while "keeping the others constant" is the fact that if the others are held constant, their regression coefficients collapse into the Y intercept.

In other words if, to simplify this greatly,

cancer = 10 + -2(animal protein) + 3(plant protein) + 7(cholesterol)

Then for all subjects with 4.1 mmol/L cholesterol the equation would simplify to

cancer = 38.7 + -2(animal protein) + 3(plant protein)

So the regression coefficients are showing the relationship of the variable "while the others are held constant." When the others are not held constant (which, at the level of the combined sample or the population being estimated, is always), the relationship is much more complex as indicated by the inclusion of the other variables in the model.
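
The collapse into the intercept can be written out in a few lines of Python. The coefficients are the illustrative ones from the equation above, and the intake values 5 and 8 are hypothetical. Fixing cholesterol at 4.1 folds 7 × 4.1 = 28.7 into the constant of 10, giving 38.7.

```python
def predicted_cancer(animal_protein, plant_protein, cholesterol):
    """The illustrative equation from the comment (made-up coefficients)."""
    return 10 - 2 * animal_protein + 3 * plant_protein + 7 * cholesterol

FIXED_CHOLESTEROL = 4.1  # mmol/L, the value used in the comment

# With cholesterol fixed, its term folds into the constant: 10 + 7 * 4.1.
collapsed_intercept = 10 + 7 * FIXED_CHOLESTEROL

def predicted_in_subset(animal_protein, plant_protein):
    """Same prediction, written for the subset with cholesterol = 4.1."""
    return collapsed_intercept - 2 * animal_protein + 3 * plant_protein

# Hypothetical intakes, only to compare the two forms.
full = predicted_cancer(5, 8, FIXED_CHOLESTEROL)
subset = predicted_in_subset(5, 8)
print(round(collapsed_intercept, 1), round(full, 1), round(subset, 1))
```

Within the subset, one extra unit of plant protein still changes the prediction by 3, which is the sense in which the coefficient describes the relationship with the other variables held constant.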

That's the reasoning put forward in an article I have by Garrett Fitzmaurice from Harvard's Department of Biostatistics.

Your understanding of statistics is much more sophisticated and advanced than mine is, but if you believe that the only consideration for inclusion of a variable into the model is the mathematics of the equation then I think this is a philosophical difference on which we will probably just have to agree to disagree. I think the effect of this approach would be to generate mathematically convincing but biologically irrelevant regression models.

Thanks for the correction about the VIF. I read the wiki article on it and it seems VIF diagnoses multicollinearity but the source of multicollinearity is still correlation between the independent variables, no?

I learned from both reading and in stats class that using C rather than C-1 variables when dummy coding distorts the equation. I don't have the statistical expertise to decipher which is correct.

The data from the original monograph, which Denise has been using and Campbell used in the book, divides between male and female. The data from the link you posted is, I believe, actually data from "The China Study II" so, while useful, the results will be different from analyses using the original data set.


Ned Kock said...

Hi Chris.

Before you calculate standardized partial regression coefficients (path coefficients), normally you standardize the variables in a model. (This also helps reduce other problems, because it makes the variables dimensionless.) Once you do that, the constant (intercept) disappears.

Path analysis deals with standardized variables. Classic multiple regression deals with non-standardized variables. The basis for structural eq. modeling, and for the math used in software like WarpPLS, is actually path analysis. This is one of the reasons why (among other things) I told you earlier that what WarpPLS does is not exactly multiple regression.

Of course you can always look at a plot with standardized values like the one on this post, and revert back to the non-standardized values, by multiplying them by the standard deviation and adding the mean of the variable they refer to.
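
The standardization steps described above can be sketched in Python for the simplest case of one predictor. The data and helper names are made up for the example; the sketch assumes sample standard deviations and ordinary least squares, and it is not meant to reproduce WarpPLS's own calculations.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation."""
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

def standardize(v):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(v), sd(v)
    return [(a - m) / s for a in v]

def ols(x, y):
    """Intercept and slope of the simple regression y ~ x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Made-up values, not taken from the China Study data.
x = [52.0, 60.0, 47.0, 71.0, 66.0, 58.0]
y = [12.0, 15.0, 9.0, 20.0, 16.0, 14.0]

raw_intercept, raw_slope = ols(x, y)
std_intercept, std_slope = ols(standardize(x), standardize(y))

# Back-conversions: coefficient and values return to the original units.
slope_back = std_slope * sd(y) / sd(x)
x_back = [zx * sd(x) + mean(x) for zx in standardize(x)]

print(f"standardized intercept = {std_intercept:.6f}")
print(f"raw slope = {raw_slope:.4f}, recovered slope = {slope_back:.4f}")
```

The standardized intercept comes out as zero (up to rounding), and both the unstandardized slope and the original values are recovered from their standardized counterparts.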

I am not saying that only mathematical considerations matter. What I am saying is that a mathematical understanding is key, and can actually lead to new discoveries.

One example, which I don't have a reference for now, is that of Einstein's finding a result through math that contradicted his understanding of physics (which was way more sophisticated than the average person's).

He then modified the math (introducing a term; a constant, if I recall it properly) to fit his understanding. Einstein later deeply regretted it. The math was telling him something new, and apparently true, even though it could not easily be understood in non-math language (e.g., in terms of metaphors).

But of course we can always agree to disagree. You may well be right.

Do you have a link for the China Study I data?

Chris Masterjohn said...

Hi Ned,

I understand that path coefficients are standardized regression coefficients and I understand the concept of standardization, and I understand that the Y intercept of a multiple regression model disappears during standardization. However, I don't think that changes the metaphor used.

First, because I was addressing your comment that the metaphor was unjustified in general, and in general, multiple regression is quite frequently used without standardizing the coefficients.

Second, conceptually, the Y intercept appears again when one performs the mental experiment of "holding the variables constant."

My analogy could be easily modified:

cancer = -0.2(animal protein) + 0.3(plant protein) + 0.5(cholesterol)

For a subset of the population with a given intake of animal X and a given cholesterol level Z, the regression equation simplifies to:

cancer = (-0.2X +0.5Z) + 0.3(plant protein)

where the first term is a constant (Y intercept) and the second represents the relationship between plant protein and cancer "holding the other variables constant."

However if this still doesn't represent warppls accurately, please correct me.

I agree that mathematical observations can be used to generate hypotheses just like any other type of observation, and I think that playing with data in multiple ways can be useful for brainstorming hypotheses. I think where nutrition differs from Einstein's work is that 1) experimentation is readily performable and 2) an enormous mass of experimental evidence is readily available. Thus, mathematical and other observations must always be integrated into the theoretical framework that has as its basis experimental research, which provides definitive evidence of cause-and-effect relationships.

I'm not saying that one should NOT include cholesterol in the model to see if it yields mathematical results that are useful for brainstorming hypotheses. I'm simply saying that one must acknowledge what can be discerned from the currently accumulated body of experimental research and make a well-developed argument from a cause-and-effect theory based on that body of research to justify the inclusion or exclusion of variables in the model.

In many cases, there might be a compelling reason to argue both sides and it might be best to present several models, and that might provide a stimulus for further experimentation to clarify the ambiguities demonstrated by the alternative models.

The China Study I data is only available in print form as a giant monograph, and you would have to interlibrary loan it and spend hundreds of hours (at least dozens, I would imagine) typing it into Excel. I do have a copy of the monograph but have not performed this task yet.


Ned Kock said...

Hi Chris.

I should note that the path coefficients generated by WarpPLS should be identical to the standardized betas generated by a MR analysis, as long as you are using one of WarpPLS's two linear algorithms. The unstandardized coefficients can be easily obtained from the standardized ones.

Actually, the equation you mentioned would become something like this:

cancer' = (-0.2X +0.5Z) + 0.3(P)

Where cancer' would represent cancer within that subset of the population, and P would represent plant protein consumption also within that subset.

I wonder if O Primitivo was referring to the China Study I data when he said he had the data in Excel. What do you think?

Given the impact of the book, it is reasonable to expect that a book with a new set of analyses, showing different things, would have a lot of potential. Particularly if what it says goes against what the original book says.

Anonymous said...

Just for the heck of it, I'd like to see a model that incorporates the rest of the macronutrients and dietary fiber as well. Would that be possible?

Ned Kock said...

Hi Anon.

O Primitivo from Canibais e Reis (link below) is putting together several files with the China Study data.

With those files anyone interested will be able to play with models ad nauseam.

I have been looking at the bivariate (aka univariate) correlations, and they are very interesting. A lot of confounding variables.

Btw, I deleted O Primitivo's comment above at his request. He had included his email and was concerned about spamming.

healthycritique said...

hi ned, i continue to find your posts to be interesting and professionally presented. a few thoughts...

1. one thing i find challenging in all of this is that we're talking about colorectal cancer *mortality* rather than incidence. and the two are not necessarily directly related; there are other factors related to mortality that do not play a role in incidence (eg. treatment). this of course, applies to anyone undertaking an analysis of the china survey data. therefore, i find it difficult to draw conclusions approaching causality because we're not talking about risk of developing the disease.

2. the mortality rates are for 1975-1977 whereas the exposures were collected in 1983, so there is an even greater issue of temporality than we might normally have in a cross-sectional survey.

3. it is possible - especially since we're using mortality as the outcome - that the u-shaped curve actually reflects some other variable (eg. BMI) that might be subject to some reverse causality.

4. have you considered redoing your analyses using the online data? there are some differences between what denise has posted and what is online (denise was forthright about this). it would also be interesting to see the results for 1989 data, as the chinese population may have started to acquire more 'western' dietary patterns by then.

anyway, nice reading your post!

healthycritique said...

why not just use the online data? it is accessible to all, whereas the data denise is working from is in the 1990 printed monograph that few would have access to. it'd save LOTS of data entry time. and things seem to be "cleaner" in the online data (eg. certain dietary variables for tuoli are missing - likely because they were deemed unrepresentative of the typical dietary habits in that county). just a thought for anyone embarking on a huge data entry endeavor.

Ned Kock said...

Yes, analyzing the online data is a good option. I assume you are referring to this dataset:

With the dataset above, one could use data from both genders, and then control for the gender effect. This would double the size of the sample.

Chris Masterjohn said...

Hi Ned,

Sorry for taking so long -- got delayed with comprehensive exams.

So do you agree, then, that the coefficient from a multiple regression model is giving the effect of one variable "holding the others constant" in a sense?

And do you see the potential problem of including an intermediate in the model as if it is a confounder?

I agree a book of that sort would have good potential -- it would also be a massive undertaking, since the China Study data itself is only one chapter of Campbell's book.


Ned Kock said...

Hi Chris.

Since you are such a nice guy I am almost agreeing with the "holding the others constant" metaphor to make you happy :)

I see the problem of adding a variable like TC to a model like the one discussed in this post, and also see the problems associated with leaving the variable out. I think the latter outweigh the former.

I am planning on writing a couple of posts on stats. One post showing that one should control for an effect indirectly (like with TC) if direct control is not possible, and the other post showing that a strong association may exist even with a zero correlation.
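
One common illustration of a strong association with a zero correlation is a symmetric U-shaped relationship; it is offered here only as a sketch and is not necessarily the example a later post would use. In the made-up Python data below, y is completely determined by x (y = x²), yet the linear correlation between them is zero.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: a perfectly U-shaped relationship, symmetric around zero.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_linear = pearson_r(x, y)                               # linear correlation
r_after_transform = pearson_r([xi ** 2 for xi in x], y)  # after squaring x

print(f"r(x, y) = {r_linear:.3f}, r(x^2, y) = {r_after_transform:.3f}")
```

A linear measure misses this relationship entirely, while a transformation that matches its shape recovers it.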

kaney said...

The perception I had was without chemotherapy, colorectal cancer patients would die -- if not all of them, at least a great majority of them. But research data does not support that perception.

