Wednesday, January 8, 2014

This Post is Not About Autism

Today we are going to talk about Correlation vs Causation!

Pssst! I've got a secret! We actually already talked about this last week. 

Last week we talked about understanding a main cause vs truly understanding the mechanism of why, so it seemed like a great time to talk about correlation vs causation, which is pretty much the same thing. Also, a friend posted this article on Facebook and I suddenly knew what I would post about this week. Here's the graph from the article:


DISCLAIMER TIME!
As the title of this post indicates, this post is NOT about Autism. Autism is a very real and very challenging disorder and I in no way intend this post to be making light of Autism. In fact, part of my motivation here is showing how Autism has become a media hype poster child to the point where other people are making fun of the correlation hype. I am using Autism for this post mainly because all these silly graphs already exist that show the point so well. Anyway, instead of focusing on all the media hype, we SHOULD be talking about ways to help those with the disorder, or teaching people not to discriminate against those with Autism.

OK, back to the post!

The graph above shows a correlation between Organic Food Sales and Autism. I think just about everyone who sees it will think "yeah, but that's just a coincidence." I mean, organic food can't possibly CAUSE Autism, right? Right?!

Well the TDE sure doesn't think it does. It just happens to correlate. Organic food popularity just happened to increase at the same time as Autism.*

*Note: Pay very careful attention to the x axis in that graph. It runs from 1998 to 2007 (for the final data point). That's only 9 years. What would this graph look like if we went back to 1970? What about 1950? What about forward to 2013? Careful choice of which subset of data to put in a visual can drastically alter the conclusions drawn from it. Be skeptical, people. Always skeptical. I think perhaps we'll talk about this more next week!

However, if you saw that graph and thought "OMG! Organic food causes Autism!" here are a couple other things that cause Autism:

                            


So, how exactly is increasing college tuition causing Autism? It's not. While microwave RF or organic food could more easily be tied to Autism (logically), the point is that these graphs show only correlation. To show that organic food causes Autism, we would need to understand the mechanism for HOW this happens.

And really, these graphs don't actually show correlation. They show two parallel trends. If they were truly showing correlation, they would look more like the graphs below.*

Autism vs Organic Food Sales (R Squared of 0.991)


Autism vs Microwave RF Exposure (R Squared of 0.984)


Autism vs College Tuition (R Squared of 0.989)


*Note: The TDE did not have access to the raw data used to create the initial graphs. The graphs above were created by the TDE using data obtained from the original graphs which included a decent amount of error/noise due to interpolation/estimation variation. The point was not absolute accuracy, but to show what a correlation graph SHOULD look like.

A true correlation graph shows the dependent variable (in this case, Autism) on the Y axis and the suspected independent variable (whatever we are trying to show is a cause of Autism) on the X axis. The R Squared shows how good the correlation is (1.0 being perfect correlation). These examples all happen to be linear correlations, which makes everything a bit nicer. And all three of these still hold up as strong correlations, based on R Squared values.
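To make the R Squared idea concrete, here is a minimal Python sketch. The numbers are made up for illustration (they are NOT the data behind any of the graphs in this post), and the function names are just hypothetical helpers. It shows how two series that have nothing to do with each other, but both climb steadily over the same years, can still produce a very high R Squared when plotted against each other.

```python
# Hypothetical, made-up numbers purely for illustration.
# They are NOT the data behind any of the graphs in this post.
years = list(range(1998, 2008))
series_a = [10, 13, 15, 19, 22, 24, 28, 31, 33, 37]            # "thing A" per year
series_b = [200, 240, 290, 330, 390, 420, 480, 520, 570, 610]  # "thing B" per year


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def r_squared(xs, ys):
    """For a straight-line fit of ys on xs, R Squared equals r squared."""
    return pearson_r(xs, ys) ** 2


# Each series tracks the calendar year closely...
print(f"R squared, A vs year: {r_squared(years, series_a):.3f}")
print(f"R squared, B vs year: {r_squared(years, series_b):.3f}")
# ...so they also track each other, even though neither causes the other.
print(f"R squared, A vs B:    {r_squared(series_a, series_b):.3f}")
```

A high value here says only that the points fall close to a straight line. It says nothing about whether A drives B, B drives A, or both simply rise with time.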

Another really important point is to look at the scales used for Autism. All three are different:

1) The Organic Food graph shows sales in millions of dollars per year vs total individuals diagnosed with autism per year. There are multiple problems with this. First of all, this doesn't take into account population variation. It also doesn't limit cases by age of patients. Were many adults suddenly diagnosed in the years just after Autism became better understood/recognized/diagnosed? A more accurate dependent variable would be the percentage of a specified age range diagnosed with Autism each year.

Of course, the whole point of making this graph was to correlate Autism to something that obviously isn't causing Autism. So, they probably used the most compelling graph instead of the most scientifically correct graph to prove their point.

2) The Microwave RF graph shows strength of signal vs Autism incidence per 10,000 children. This is a fairly accurate measurement of Autism to use. It is a rate within a specific population (children), which takes population variation into account. "Children" could be defined more precisely, however. Is that infant to age 18? School age children? You get the point.
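Here is a small Python sketch of why a per-10,000 rate is a fairer yardstick than a raw count. The case counts and population figures are invented for illustration, and `rate_per_10000` is just a hypothetical helper name.

```python
def rate_per_10000(cases, population):
    """Cases per 10,000 people in the given population."""
    return 10000 * cases / population


# Invented numbers: the raw count rises by 20%, but so does the population.
early_cases, early_children = 500, 1_000_000
later_cases, later_children = 600, 1_200_000

early_rate = rate_per_10000(early_cases, early_children)
later_rate = rate_per_10000(later_cases, later_children)

print(f"Raw cases: {early_cases} -> {later_cases}")
print(f"Per 10,000 children: {early_rate} -> {later_rate}")
```

In this made-up scenario the raw count climbs while the rate stays flat, which is exactly the kind of population effect a raw-count graph cannot show.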

Note that this graph uses two Y-scales: one for RF Exposure and one for Autism. It's not wrong, and it's often the only option, but it makes it a little harder to understand quickly.

Also note that "incidence" is different than "individuals diagnosed." As this graph seems to use it, incidence covers everyone who currently has the disorder, whereas individuals diagnosed would only be NEW cases/diagnoses that year. (Strictly speaking, epidemiologists call the everyone-who-currently-has-it number "prevalence" and reserve "incidence" for new cases in a given period, so the graph's label may be a bit loose.) The TDE isn't entirely sure which is a better metric to use to show the increase in Autism. Number of diagnoses per year is probably an easier number to track and obtain and therefore more accurate, but it doesn't capture the full story in terms of what percent of the population has the disorder. Also, note that neither measure can take into account the fact that we have gotten significantly better at recognizing the symptoms and diagnosing the disorder over the years. This alone could account for a large portion of the increase (though it does seem that there is more to the story than just recognizing and diagnosing it more often).

3) The College Tuition vs Autism graph shows cumulative % of college tuition cost vs the cumulative % of Autism cases. 

First of all, what does that even mean? Scroll up and look at the graph again and look at the red line that represents Autism cases. The final point (2007) shows 100% of cases. This sets the number of cases in 2007 to be 100% and then the rest of the data is given in comparison to the number of cases in 2007. In 1999 the data point is at about 5%. This means that the number of cases in 1999 was about 5% of the number of cases in 2007. The same method is used for the college tuition numbers. 

Why would they do this? Probably to put the two variables onto the same scale (percentage from 0 to 100). Otherwise, it would be really difficult to show them on the same graph and the correlation wouldn't be as clear. By converting them both to % of the highest value, we can see how the two trends are similar.

There isn't anything terribly wrong (scientifically) with this method. However, the problem is that it skews the raw data and "hides" it from the readers. I didn't have the raw numbers and had to make my correlation graph with cumulative percents instead of actual number of cases and actual cost of college tuition. A much better graph would have plotted the raw data as a true correlation graph, or at least used the 2 axis method like in the RF exposure graph.
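The "% of the highest value" rescaling can be sketched in a few lines of Python. The raw values below are invented for illustration (the real raw data was not available), and `percent_of_max` is a hypothetical helper name. One thing the sketch also shows: dividing a series by a constant does not change how strongly it correlates with another series, so the R Squared comes out the same either way. Only the units and the slope of the fitted line differ.

```python
# Hypothetical raw values, invented for illustration only.
cases = [50, 120, 210, 330, 480, 600, 750, 1000]
tuition = [9000, 9800, 10900, 12100, 13600, 15000, 16700, 19000]


def percent_of_max(values):
    """Rescale a series so that its largest value becomes 100%."""
    peak = max(values)
    return [100.0 * v / peak for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


cases_pct = percent_of_max(cases)      # first point is 5% of the final point
tuition_pct = percent_of_max(tuition)

r_raw = pearson_r(tuition, cases)
r_pct = pearson_r(tuition_pct, cases_pct)

print(f"R squared on raw values:     {r_raw ** 2:.4f}")
print(f"R squared on percent-of-max: {r_pct ** 2:.4f}")
```

So the rescaling costs the reader the actual numbers, but not the strength of the correlation.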

I think these three graphs were intentionally made in a way that keeps the chronological year on the X-axis. We like seeing how things increase over time. It makes sense to us. By eliminating the time scale we lose the connection to "this is happening now, in my life." We also lose the exponential curve, which is scary! (media hype) Just understand that the original graphs are not truly correlating the two variables to each other.

Anyway, we got off on a Correlation tangent there. Back to the point.

Causation:

So, how do we show causation? Well, you have to fully understand the mechanism.

Here's a simple example:

For a while, I was lactose intolerant.* Many people self-diagnose lactose intolerance via correlation. They notice that when they eat foods that contain a large amount of dairy they have digestive distress. You could also eliminate dairy from your diet entirely and then correlate the elimination of dairy to the elimination of "attacks." You could probably even correlate different amounts of milk consumed to some other variable like subjective rating of intestinal pain or maybe total length of time of the lactose "attack." You might even get a nice graph and a strong correlation. Of course, it would be awful of you to force someone who is lactose intolerant to drink milk, so don't do that. You can do it to yourself if you value the scientific knowledge over your own comfort, I suppose. However, this would show correlation only.

*Yes, past tense. I can eat dairy again now! Yay!

In my case, however, I had a test done. They gave me a glass full of dissolved lactose and made me drink it. It wasn't very nice of them, but it was for diagnostic purposes and didn't do any lasting harm. Then, over the course of two hours they had me exhale into a bag at regular intervals. My breath was analyzed for different sugars via Gas Chromatography.* We know how lactose is broken down in the body (the enzyme Lactase breaks Lactose into glucose and galactose, which we can digest more easily than Lactose). By analyzing the sugars on my breath they were able to tell that my Lactase wasn't doing its job well enough. Because we understand the mechanism, we know what to test in order to confirm the presence or absence of Lactase, and we can therefore know for sure when someone is lactose intolerant. I'm sure there are other, more invasive, ways to directly measure Lactase also, but drinking a glass of lactose mixture was probably less painful than whatever they would have needed to do to get a sample of my gut enzymes directly.

*Of COURSE I asked what they were doing with it and how they were analyzing it!

Sure, it's probably easier to just eliminate Lactose and see if you feel better, but then you only have correlation, not causation. Some people are actually allergic to milk protein as opposed to being unable to digest milk sugar. Or, maybe some people are super sensitive to the additives in milk and would be fine if they drank only organic milk. I know some people have issues drinking pasteurized milk but can handle raw milk without problems. So, just because you feel better if you eliminate all dairy doesn't mean that you are Lactose Intolerant. It's only correlation, not causation.

So, what does actually CAUSE Autism? We don't know. We have lots of correlation and lots of hype. We have lots of scientists and doctors working on it. But right now we still don't know. Sorry.










2 comments:

  1. You've got it all backwards... Increased cases of autism leads to more organic food being purchased, college tuition going up and added exposure to microwaves. Duh!

  2. As for the cumulative % scales, I think in terms of R squared, what you have used in your correlation graph is just as accurate. Using real raw values would only affect the slope of your line (I think...)
