Monday, November 22, 2010

Human traits are distributed along bell curves: You need to know yourself, and HCE can help

Most human traits (e.g., body fat percentage, blood pressure, propensity toward depression) are influenced by our genes, some more than others. The vast majority of traits are also influenced by environmental factors, the “nurture” part of the “nature-nurture” equation. Very few traits, such as blood type, are “innate”.

This means that manipulating environmental factors, such as diet and lifestyle, can strongly influence how the traits are finally expressed in humans. But each individual tends to respond differently to diet and lifestyle changes, because each individual is unique in terms of his or her combination of “nature” and “nurture”. Even identical twins are different in that respect.

When plotted, traits that are influenced by our genes are distributed along a bell-shaped curve. For example, a trait like body fat percentage, when measured in a population of 1000 individuals, will yield values whose distribution looks like a bell. In statistics, this type of distribution is known as a “normal” distribution.

Why is that?

The additive effect of genes and the bell curve

The reason is purely mathematical. A measurable trait, like body fat percentage, is usually influenced by several genes. (Sometimes individual genes have a very marked effect, as in genes that “switch on or off” other genes.) Those genes appear at random in a population, and their various combinations spread in response to selection pressures. Selection pressures usually cause a narrowing of the bell-shaped curve distributions of traits in populations.

The genes interact with environmental influences, which also have a certain degree of randomness. The result is a massive combined randomness. It is this massive randomness that leads to the bell-curve distribution. The bell curve itself is not random at all, which is a fascinating aspect of this phenomenon. From “chaos” comes “order”. A bell curve is a well-defined curve described by a specific function, the probability density function of the normal distribution.
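For reference, the function behind the bell curve can be written out explicitly. The symbols are not used elsewhere in this post and are introduced here only for illustration: μ is the average of the trait in the population, and σ is its standard deviation.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
```

The curve peaks at x = μ and falls off symmetrically on both sides; a smaller σ gives a narrower bell.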

The underlying mathematical reason for the bell shape is the central limit theorem. The genes are combined in different individuals as combinations of alleles, where each allele is a variation (or mutation) of a gene. An allele set, for genes in different locations of the human DNA, forms a particular allele combination, called a genotype. The alleles combine their effects, usually in an additive fashion, to influence a trait.

Here is a simple illustration. Let us say one generates 1000 random variables, each storing 10 random values going from 0 to 1. Then the values stored in each of the 1000 random variables are added. This mimics the additive effect of 10 genes with random allele combinations. The results are numbers ranging from 0 to 10, in a population of 1000 individuals; each number is analogous to an allele combination. The resulting histogram, which plots the frequency of each allele combination (or genotype) in the population, is shown in the figure below. Each allele configuration will “push for” a particular trait range, making the trait distribution also have the same bell-shaped form.
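The simulation just described can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the figure; the function names, the fixed seed, and the 0.5 bin width are assumptions made for this example.

```python
import random
from collections import Counter


def simulate_genotype_sums(n_individuals=1000, n_genes=10, seed=42):
    """Sum n_genes random values (each between 0 and 1) per individual."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    return [
        sum(rng.random() for _ in range(n_genes))
        for _ in range(n_individuals)
    ]


def text_histogram(values, bin_width=0.5):
    """Build a crude text histogram; each '#' stands for 5 individuals."""
    counts = Counter(int(v / bin_width) for v in values)
    lines = []
    for b in range(min(counts), max(counts) + 1):
        low = b * bin_width
        bar = "#" * (counts.get(b, 0) // 5)
        lines.append(f"{low:4.1f}-{low + bin_width:4.1f} | {bar}")
    return "\n".join(lines)


sums = simulate_genotype_sums()
print(text_histogram(sums))
```

The bars should be tallest near 5 and shrink toward both ends, even though each of the 10 underlying values is spread evenly between 0 and 1.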

The bell curve, research studies, and what they mean for you

Studies of the effects of diet and exercise on health variables usually report their results in terms of average responses in a group of participants. Frequently two groups are used, one control and one treatment. For example, in a diet-related study the control group may follow the Standard American Diet, and the treatment group may follow a low carbohydrate diet.

However, you are not the average person; the average person is an abstraction. The mathematics of bell curve distributions tells us that there is about a 68 percent chance that you will fall within 1 standard deviation of the average, to the left or the right of the “middle” of the bell curve. Still, even 0.5 standard deviations above the average is not the average. And there is approximately a 32 percent chance that you will not be within the larger -1 to 1 standard deviation range. If this is the case, the average results reported may be close to irrelevant for you.
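The percentages above can be checked with a short Python sketch. It assumes a perfectly normal distribution, which real traits only approximate; the function name `prob_within` is made up for this example.

```python
import math


def prob_within(k):
    """Chance that a normally distributed value lies within k standard
    deviations of the average (on either side)."""
    return math.erf(k / math.sqrt(2))


for k in (0.5, 1.0, 2.0):
    inside = prob_within(k)
    print(f"within +/-{k} SD: {inside:.1%}   outside: {1 - inside:.1%}")
```

For k = 1 this gives roughly 68 percent inside and 32 percent outside, matching the figures in the text.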

Average results reported in studies are a good starting point for people who are similar to the studies’ participants. But you need to generate your own data, with the goal of “knowing yourself through numbers” by progressively analyzing it. This is akin to building a “numeric diary”. It is not exactly an “N=1” experiment, as some like to say, because you can generate multiple data points (e.g., N=200) on how your body alone responds to diet and lifestyle changes over time.

HealthCorrelator for Excel (HCE)

I think I have finally been able to develop a software tool that can help people do that. I have been using it myself for years, initially as a prototype. You can see the results of my transformation in this post. The challenge for me was to create a tool that was simple enough to use, and yet powerful enough to give people good insights on what is going on with their body.

The software tool is called HealthCorrelator for Excel (HCE). It runs on Excel, and generates coefficients of association (correlations, which range from -1 to 1) among variables and graphs at the click of a button.
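To give a sense of what a coefficient of association is, here is a minimal Python sketch of the standard Pearson correlation formula. It is not HCE's code, and HCE itself runs in Excel; the variable names and the small diary values below are invented purely for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a variable with no variation has no correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical "numeric diary": one value per measurement date (made-up numbers).
diary = {
    "hdl": [45, 48, 50, 53, 55, 58],
    "exercise_hours": [1, 2, 2, 3, 4, 5],
    "refined_carbs_g": [220, 200, 190, 150, 140, 100],
}

target = "hdl"
for name, values in diary.items():
    if name != target:
        r = pearson_r(values, diary[target])
        print(f"{name} vs {target}: r = {r:+.2f}")
```

A value near 1 means the two variables tend to rise together, a value near -1 means one tends to fall as the other rises, and a value near 0 means no clear linear association.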

This 5-minute YouTube video shows how the software works in general, and this 10-minute video goes into more detail on how the software can be used to manage a specific health variable. These two videos build on a very small sample dataset, and their focus is on HDL cholesterol management. Nevertheless, the software can be used in the management of just about any health-related variable – e.g., blood glucose, triglycerides, muscle strength, muscle mass, depression episodes etc.

You have to enter data about yourself, and then the software will generate coefficients of association and graphs at the click of a button. As you can see from the videos above, it is very simple. The interpretation of the results is straightforward in most cases, and a bit more complicated in a smaller number of cases. Some results will probably surprise users, and their doctors.

For example, a user who is a patient may be able to show to a doctor that, in the user’s specific case, a diet change influences a particular variable (e.g., triglycerides) much more strongly than a prescription drug or a supplement. More posts will be coming in the future on this blog about these and other related issues.


Michael Barker said...

Ed, you've been reading my mail. I was just sitting upstairs looking at a database I had created to track the variables that affected my diabetes. I was thinking about adapting it to my experiment that I'm doing on the blog.

One of the reasons I pulled this database up is that I just came from seeing my endo. He threw his hands up on the whole thing but said that since I'm getting the numbers he'll support me as best he can.

I need to get this data in some sort of order. Heck, I need to get it off paper. I'm thinking that for better or for worse, I'll want to publish this data on my blog.

Scott W said...

I'm all over it! I'm an auditor so I work with Excel all the time...but not even close to your level...certainly no programming.

I'm starting a new workout program shortly (one that I think you would very much be able to appreciate) so it would be interesting to track my gains using this software. Also, I am suspecting a food allergy that this could be useful for.

I'm signing up.

Scott W

David Isaak said...

Oh, this sounds like great fun. I've been tracking all manner of my health and activity variables over the last 18+ months on Excel, and I'm always up for a new angle of attack on the problem.

(I'm not really a pro at correlation analysis, though I've been working my way through Schumacker and Lomax's Beginner's Guide to SEM. I'm more of an LP sort of guy.)

Ned Kock said...

Hi Michael.

Good to know that HCE will be useful to you. I certainly hope so; in a few days we'll know for sure, as the beta version is released.

Ned Kock said...

Hi Scott.

Good. HCE is very easy to use. And you still have all the features of Excel.

Ned Kock said...

Hi David.

HCE is certainly much, much simpler than SEM. By the way, if you want an SEM tool that is easy to use, and also powerful, here is my recommendation:

WarpPLS does variance-based SEM (a.k.a. PLS-based SEM), which is a robust approach for SEM. Among other things, it handles small samples well and does not require that the indicator variables (that make up latent variables) be normally distributed.

I guess you know that I use WarpPLS a lot, right? I have several posts here that use it. I developed it using MATLAB, C++, and Java.

js290 said...


why C++ and Java?

Kato said...

I'd been thinking about trying to put together something of this sort and would be happy to help in Beta testing.

Ned Kock said...

Hi Kato, thanks.

Just so you know, the call for beta testers will be issued in a post on this blog. The post will have instructions.

Ned Kock said...

Hi js290.

Both C++ and Java are used for minor customization of MATLAB features. Unfortunately some customization options are available via C++ and others via Java code.

David Isaak said...

Can you recommend what I ought to read to be able to use WarpPLS with a reasonable degree of competence?

I've done a fair amount of regression work, but SEM is new to me.

Ned Kock said...

Hi David.

Path analysis can be seen as a general case of multiple regression, and SEM is path analysis with latent variables. So SEM is a very general case of multiple regression.

I think the book you are reading is good as an intro. to SEM.

To use WarpPLS, I'd recommend going through the documentation on (including the YouTube videos and User Manual), and the posts on the blog (linked there).

Many people who are familiar with regression try to do an SEM analysis with WarpPLS after viewing the YouTube videos only, and do reasonably well.