A couple of months ago I stumbled across ancestry.com’s “Genetic Census of America”. Since I was already researching the health outcomes question, I remembered that this data existed and decided to bite the bullet and analyze it systematically. Lo and behold, I quickly discovered some very strong correlations between these genetic proportions (used crudely, without any particular techniques) and the life expectancy of non-hispanic whites in each state. I refined this a bit and produced a toy model that can explain about 85% of the variance in life expectancy between states (not to mention other things)!
Before I get started, let me get some caveats out of the way:
- correlation does not necessarily imply causation
- these particular genetic groups may just be proxies in this country for particular ethnic or other genetic groups (at least in part)
- this could be “cultural” (people with particular frequencies of SNPs are also more likely to have had particular cultural mores, values, and the like passed on to them through their ancestors/parents).
- most “whites” have some fraction of other continental groups, but it’s usually pretty small on average
- the DNA testers may not necessarily be representative of the larger “white” population, but I think it’s good enough to represent the white population (probably less so other groups).
- binning these together by states and other high levels of aggregation likely improves the “accuracy” of these methods, since random accidents, stochastic variances in gene expression, or what have you get averaged out to a large degree. Likewise, to the extent these groups are just a crude proxy for actual groups, this level of aggregation likely further helps.
- Ancestry.com does not provide details by race/ethnic group and my procedures cannot perfectly remove any potential signature introduced by others. Blacks and latinos, in particular, surely introduce some european genetic groups into this data, although they are obviously much under-represented in ancestry’s DNA analysis and I do not think it would skew the results that much.
This is a simple model that I produced to calculate non-hispanic white (NHW) life expectancy by state from a simple genetic calculation and smoking rates, with the two inputs weighted equally and each expressed in standard deviations from the national non-hispanic mean among states.
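To make the "weighted equally on standard deviations" idea concrete, here is a minimal Python sketch of that kind of calculation. It is not my exact procedure: the function names, the sign convention (genetic z-score minus smoking z-score), the use of an unweighted mean across states, and all of the numbers are assumptions made up for illustration, not real data.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as standard deviations from the mean of the list."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def composite_index(genetic_score, smoking_rate):
    """Equal-weight composite (assumed form): genetic z-score minus smoking z-score."""
    zg = z_scores(genetic_score)
    zs = z_scores(smoking_rate)
    return [g - s for g, s in zip(zg, zs)]


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot


# Placeholder numbers for five hypothetical states. These are NOT real data.
genetic_score = [0.10, 0.25, 0.40, 0.55, 0.70]    # hypothetical genetic calculation
smoking_rate = [0.26, 0.22, 0.20, 0.17, 0.14]     # hypothetical NHW smoking rates
life_expectancy = [76.0, 77.5, 78.4, 79.6, 80.9]  # hypothetical NHW life expectancy

index = composite_index(genetic_score, smoking_rate)
intercept, slope, r2 = fit_line(index, life_expectancy)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.3f}")
```

The R² returned by `fit_line` is the "share of variance explained" figure in the sense I use it above; with real state-level inputs in place of the placeholders, the same three steps (standardize, combine, regress) would apply.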
As for how I got here….






