The probable genetic explanation for interstate differences in mortality amongst non-hispanic whites

A couple months ago I stumbled across ancestry.com’s  “Genetic Census of America”.  Since I was researching the health outcomes question already I remembered that this data existed and I decided to bite the bullet and actually analyze this data systematically.  Lo and behold, I quickly discovered some very strong correlations between these genetic proportions (crudely without any particular techniques) and the life expectancy of non-hispanic whites in each state.  I refined this a bit and produced a toy model that can explain about 85% of the variance in life expectancy between states (not to mention other things)!

Before I get started, let me get some caveats out of the way:

  • correlation does not necessarily imply causation
  • these particular genetic groups may just be proxies in this country for particular ethnic or other genetic groups (at least in part)
  • this could be “cultural” (people with particular frequencies of SNPs are also more likely to have had particular cultural mores, values, and the like passed on to them through their ancestors/parents).
  • most “whites” have some fraction of other continental groups, but it’s usually pretty small on average
  • the DNA testers may not necessarily be representative of the larger “white” population, but I think it’s good enough to represent the white population (probably less so other groups).
  • binning these together by states and other high levels of aggregation likely improves the “accuracy” of these methods since random accidents, stochastic variances in gene expression, or what have you get averaged out to a large degree.  Likewise, to the extent these groups are just a crude proxy for actual groups, this level of aggregation likely further helps.
  • Ancestry.com does not provide details by race/ethnic group and my procedures cannot perfectly remove any potential signature introduced by others.   Blacks and latinos, in particular, surely introduce some european genetic groups into this data, although they are obviously much under-represented in ancestry’s DNA analysis and I do not think it would skew the results that much.

This is a simple model that I produced to calculate non-hispanic white (NHW) life expectancy by state from a basic genetic calculation and smoking rates (weighted equally, in standard deviations from the national non-hispanic mean amongst states).

actual_vs_pred_le_smoking_and_genes

As for how I got here….

First, you can start to see these patterns in the three groups I keyed in on (Great Britain, Ireland, and W. Europe) without any adjustments at all.

No-Adjustment: NHW LE by % GB

unadj_gb_le

No-Adjustment: NHW LE by % Irish

noadj_le_irish

No-Adjustment: NHW LE by % Western European

noadj_le_weuro

“Great British” is pretty apparent, but the latter two are obviously noisier.  This is without any adjustment for other major racial/ethnic groups in ancestry’s estimates (e.g., blacks, latinos, asians, american indians, etc) that took these DNA tests in those states and, obviously, some states have considerably more (broad-strokes) racial/ethnic diversity than others.  Put differently, states with more non-european groups and thus test takers (albeit seriously under-represented) will tend to skew the european fractions down without any adjustment.

However, if we merely aggregate these three groups together they tighten up a bit:

unadj_pct_euro

We can do quite a bit better, though, if we crudely divide these and other components by the total for all european ethnic groups (as defined by ancestry.com, which generally tracks the major continental categories) to prevent these other genetic groups from obscuring this signal. (Yes, I know that’s imperfect, but it seems to work!)
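As a rough illustration of that adjustment (a sketch under stated assumptions, not the exact procedure behind the charts), each european group's share can be re-expressed as a fraction of the european total. The group keys and every number below are made up for illustration; they are not ancestry.com figures.

```python
def adjust_by_european_total(shares, european_groups):
    """Re-express each European group's share as a fraction of the European total."""
    euro_total = sum(shares[g] for g in european_groups)
    if euro_total <= 0:
        raise ValueError("no European share to divide by")
    return {g: shares[g] / euro_total for g in european_groups}


# Made-up shares for one hypothetical state (NOT ancestry.com figures).
raw_shares = {
    "great_britain": 0.30,
    "ireland": 0.15,
    "europe_west": 0.15,
    "other_european": 0.20,
    "non_european": 0.20,
}
european = ("great_britain", "ireland", "europe_west", "other_european")
target = ("great_britain", "ireland", "europe_west")

adjusted = adjust_by_european_total(raw_shares, european)
raw_combined = sum(raw_shares[g] for g in target)
adj_combined = sum(adjusted[g] for g in target)
print(f"raw GB+IR+WEU: {raw_combined:.3f}, adjusted: {adj_combined:.3f}")
```

In this toy case the non-european test takers pull the raw combined share down, and dividing by the european total removes that particular distortion. It does nothing about european ancestry carried by non-white test takers, which is the imperfection noted above.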

For each group I used:

Adj. British

adj_le_gb

Adj. Irish

adj_le_irish

Adj. Western European

adj_le_weuro

Adjusted Combined (Great British + Irish + W. European)

adj_neuro_le

I did a similar analysis for every other european group (and other non-european groups), most of which produced weaker signals and/or positive relationships (e.g., E. Europe, Scandinavia, etc), but I found this to be the cleanest (probably, in part, because these populations are larger and thus their contribution to mean NHW life expectancy can be most clearly observed).

Moreover, there is good reason to think that these particular groups (GB, Irish, and W. European) are closer to each other genetically than to other groups (based on geographical, historical, and genetic analysis).

See this PCA analysis (amongst others):

uk_and_other_eu_pca

source: Population structure and genome-wide patterns of variation in Ireland and Britain.

Or this:

european_population_structure

As for ancestry.com’s methodology and definitions of these three ethnic groups, see here:

ancestry_gb

[Note: As you can see I am about as “Great British” as the average English person, presumably. That is probably significantly more than I should be if I interpret these as strictly nationally-aligned ethnic groups rather than as probabilities of these particular ethnic/genetic groups being found in particular regions in higher concentrations.  Although, if you look carefully, you can see a fair amount of GB in northern parts of FR, DE, BE, NL, etc]

ancestry_weuro

[Much less than I would expect based on my knowledge of my family’s recent history, genealogy, and this map!]

ancestry_ireland

A list of ancestry.com’s European ethnic/genetic groups by the average proportions of people in those regions

ancestry_euro_structure

[Note: Both the “Great Britain” and “Europe West” regions have a lot of other genetic groups.  Ireland, by contrast, is much more “pure” or, more technically, less recently admixed or to a much lesser degree]

ancestry_all_groups

Their method is far from perfect and their ethnic labels are a bit confusing, if interpreted in a particular literal fashion, but there is still real information content there and it seems to be good enough to find evidence of population structure within the non-hispanic white population of the United States.

I ran some comparisons against my crude genetic model (GB+IR+WEU fraction alone) to see if there were any apparent systematic patterns whereby other groups seemed to cause notable under- or over-prediction.  I did not find much evidence for this (note: positive Y values=over-prediction)
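A minimal sketch of this kind of residual check, assuming over-prediction is defined as predicted minus actual life expectancy (matching the sign convention in the note above). All values are made up for five hypothetical states, and `pearson_r` and `over_prediction` are illustrative helper names.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def over_prediction(predicted, actual):
    """Positive values mean the model predicted a higher LE than observed."""
    return [p - a for p, a in zip(predicted, actual)]


# Made-up values for five hypothetical states (not real data).
predicted_le = [79.5, 78.0, 77.2, 80.1, 78.8]
actual_le = [79.0, 78.4, 77.0, 80.3, 78.9]
other_group_share = [0.02, 0.05, 0.01, 0.03, 0.04]

delta = over_prediction(predicted_le, actual_le)
r = pearson_r(other_group_share, delta)
print(f"correlation between other-group share and over-prediction: {r:+.2f}")
```

A correlation near zero would suggest that the other group is not systematically pushing the prediction up or down; a clearly non-zero one would be a reason to look closer.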

I also ran a series of cross-checks to see if perhaps other populations might be skewing my data somehow:

My calculated fraction by proportion of census takers reporting to be NHW

pctneuro_as_prop_of_euro_by_nhw_census

[There does not seem to be any systematic relationship between my calculation and the proportion of NHW on the US census]

euro_fraction_by_latino_census

[States with higher latino proportions on the US census do not seem to skew my calculated fraction significantly]

My calculated fraction by proportion of census takers reporting to be black

pcteuro_by_census_percent_black

[States with larger black populations report somewhat larger adj. GB/IR/WEU proportions, which is consistent with known migrations and much stronger patterns in this data.  If black test takers (national avg ~20% european) were skewing this a lot I think we’d see a stronger shift here.]

NHW LE by proportion of NHW on census

nhwle_by_census_nhw

[NHW life expectancy (LE) is not correlated with the census NHW percentage]

NHW LE by the proportion of census latino

nhwle_by_census_latino

[States with larger latino populations have very slightly higher NHW life expectancies, but that’s probably not significant]

NHW LE by proportion census black

nhwle_by_census_black

[States with proportionally larger black populations have lower NHW life expectancies.  I interpret this as mostly being a reflection of southern states having both more blacks and lower NHW LE for the reasons I am describing here (genetic and possibly cultural).  Although I suppose it is possible that the small amount of african admixture may explain a small fraction of this too]

Delta from my simple prediction (genes-only) by percentage American Indian

eu_le_delta_by_american_indian

[A genetic check for the latino angle, mainly.  Apparently not many latinos amongst the DNA testers (that outlier is NM) and they don’t seem to be skewing the results terribly]

eu_le_delta_by_black_ethnitiy

[States with larger fractions of sub-saharan african genetic/ethnic groups in the genetic data may have their life expectancy over-estimated slightly, based on my crude method (excluding smoking), but not by all that much.]

NHW LE by total proportion of European ethnic groups in ancestry.com’s DNA results

nhwle_by_euro_census_fraction

[Note: Most states’ results are very european; obviously not many non-whites took these tests. Those states with higher european genetic fractions amongst ancestry’s results have higher NHW life expectancy, but that’s consistent with the south and the like]

I compared all of the other major continental groups versus my model (over-prediction) — See this PDF here (to save space/bandwidth)

Onward…

We can produce a simple model using this great british, irish, and w. european fraction like so. r**2=0.54, not bad.

actual_vs_pred_genes_alone
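For readers who want to reproduce this kind of single-variable fit, here is a minimal sketch of an ordinary least-squares line and its r**2, assuming the model is a plain linear regression of NHW life expectancy on the GB/IR/WEU fraction. The data points are made up and will not reproduce the 0.54 figure.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    my = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up values for six hypothetical states (not real data).
gene_fraction = [0.55, 0.62, 0.70, 0.58, 0.75, 0.66]
nhw_le = [79.8, 79.0, 77.9, 79.1, 77.5, 78.6]

slope, intercept = fit_line(gene_fraction, nhw_le)
r2 = r_squared(gene_fraction, nhw_le, slope, intercept)
print(f"slope={slope:.2f} years per unit fraction, r**2={r2:.2f}")
```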

However, we can do much better if we add in the proportion of current NHW smokers.

actual_vs_pred_le_smoking_and_genes

[This is a very simple model that just uses smoking rates and the european proportion, equally weighted, for r**2=0.85. I can do a bit better with more variables and stronger weighting, but to prove the point I thought I’d keep this as simple as possible (for now at least) lest I be accused of over-fitting this!]
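One plausible reading of "equally weighted" here is to convert each input to standard deviations from the mean across states and then average the two. The sketch below does that; the use of the population standard deviation and the made-up inputs are assumptions, and `equal_weight_composite` is an illustrative name.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standard deviations from the mean (population SD, an assumption)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def equal_weight_composite(*columns):
    """Average the z-scores of each input column, state by state."""
    standardized = [z_scores(col) for col in columns]
    return [mean(row) for row in zip(*standardized)]


# Made-up values for six hypothetical states (not real data).
gene_fraction = [0.55, 0.62, 0.70, 0.58, 0.75, 0.66]
smoking_rate = [0.16, 0.19, 0.24, 0.18, 0.26, 0.20]

composite = equal_weight_composite(gene_fraction, smoking_rate)
for state_index, score in enumerate(composite):
    print(f"state {state_index}: composite = {score:+.2f} SD")
```

The composite can then be regressed against life expectancy like any single predictor. Because both inputs share one fixed weight, there is only a slope and an intercept to fit, which is what keeps the over-fitting risk low.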

It can be interesting to compare this analysis with life expectancy for other groups to see how they correlate (or not).  Yes, I am well aware that “Asian” and “Hispanic” are not homogeneous populations, i.e., some states have systematically different national/ethnic and admixture characteristics here than others; however, a lack of correspondence ought to tell us something.  As in, if we find no relationship amongst these other groups, then state-wide explanations like healthcare systems, policy, and the like are unlikely to explain much (this position ought to be especially difficult for genetic and/or culture deniers to argue with a straight face!).

State life expectancy by estimated GB/IR/WEU proportion amongst whites

ethnic_le_by_euro

Whites and blacks move together, although the fit is considerably tighter for whites, whereas asians and latinos do not appear to correlate at all!   I suppose one could make the argument that cultural/lifestyle differences explain this particular pattern, especially amongst whites and blacks, but one could also argue, if this is genetic, that that particular european admixture in the south has something to do with this too!  It is certainly hard to blame state policy without making any account for genes, culture, or other behavioral elements given the different outcomes observed in different states.

I, for one, find the similarity of the patterns that I observed earlier in the state of California very interesting.

calif_male_le_by_race_and_income

calif_female_le_by_race_and_income

[source for the data]

[Obviously this is at a different level of granularity, i.e., state vs SES, and it’s possible that the latinos and asians are significantly different than these “same” groups in other states (different national origin, culture, and/or admixture proportions), but I really do not believe that is mere coincidence!  Although these other groups are clearly not homogeneous generally, my theory is that they are actually less stratified genetically and culturally with respect to measures of SES because they have not been here nearly as long, on average, and because their historical regions have not had the same structure with respect to long established market systems and the like.]

ethnic_chd_by_euro

[Similar patterns… although, unfortunately, the only other ethnicity they had was “other” — presumably mostly asians and latinos]

ethnic_smoking_by_euro

ethnic_imr_by_euro

ethnic_obesity_by_euro

ethnic_naep_by_euro

[Other groups appear to be virtually uncorrelated with this measure]

ethnic_asthma_by_euro

[Curiously, asthma rates seem to be notably lower for blacks in these states]

ethnic_low_birth_weight_by_euro

ethnic_pre-term-birth_by_euro

It can also be useful to compare this against other measures.

le_by_gdp_per_capita

state_gdp_per_capita_by_euro

smoking_rates_by_gdp

ethnic_smoking_by_euro
[repeat of the same above for easy comparison]

state_le_by_gini

[Little to no correlation here and, if anything, it does not move in the expected direction]

inequality_by_euro

[Not much of a correlation here either!]

Comparison between ethnic group life expectancy and each group’s “own” (corresponding) 8th grade NAEP scores

state_life_expectancy_by_ethnicity_and_naep_scores

[Note: This fit is much better than just about anything else, save for smoking rates by group probably, and there is some data to support the notion that IQ predicts health outcomes quite well (or here).  Obviously NAEP scores are not a perfect proxy for IQ and likely more subject to variances in effort and conscientiousness, but they are certainly pretty well correlated…]

ethnic_naep_by_euro

[repeat of earlier graph for comparison]

Life expectancy by current smoking rates for each group (data is spotty for latinos and not available for asians)

ethnic_le_by_ethnic_smoking

Obviously smoking adds a lot to this, especially for non-hispanic whites:

gene_only_smoking_overest

However, it’s important to note that these genetic proportions predict much higher rates of smoking, lower NAEP scores, higher obesity, and more, so I wouldn’t assume that smoking per se explains as much itself (surely it’s terrible for health, but it’s also a signal of intelligence these days and preferences/lifestyle).  Why do some states have much higher smoking rates than others?  Genes and (arguably) culture explain this better than GDP or inequality (see above).

gene_only_gini_overest

[inequality adds virtually nothing and it’s probably not even statistically significant]

gene_only_gdp_overest

[likewise for GDP per capita]

gene_only_naep_overest

[NAEP scores seem to add something though]

Even without smoking we can improve our model by adding NAEP scores.

nhwle_by_gene_and_naep

gene_and_naep_overpred_by_smoking

[smoking per se still adds something, but less than before]

Or obesity rates

gene_only_obesity_overest

gene_obese_smoking_overpred

[likewise for smoking with respect to obesity and genes]

Or we can combine genes (GB/IR/WEU fraction), NAEP, and obesity rates for an even better alternative

gene_naep_obese_le_est

gene_obese_naep_overpred_by_smoking

[smaller still]

I still think smoking adds real incremental power and that it is certainly absolutely terrible for health, but my point here is that smoking per se does not explain everything (i.e., it’s as much a signal of other behaviors and other health issues as it is a cause in and of itself) and that still leaves open the question of why some groups smoke so much more than others (the variances today are not apt to be well explained by policy).

genes_naep_obesity_smoking_rate_pred

[We can account for most of the variance in smoking rates amongst NHW with a simple model using equal weights for genes, obesity, and NAEP scores!]

smoking_rates_by_gdp

[and the usual suspects like GDP per capita don’t tell us nearly as much]

ethnic_smoking_by_gini

[inequality tells us virtually nothing for non-hispanic whites]

ethnic_smoking_by_naep

[NAEP scores in and of themselves tell us something]

nhw_smoking_rate_by_euro_and_naep_equal_weight

[States that are 1 SD above average, i.e., average of lower gene fraction  and higher NAEP scores in SD from mean, smoke, on average, ~2.5 percentage points less per capita and vice versa]

gdppc_by_euro_and_naep_equal_weight

[States that are 1 SD above average, i.e., average of lower gene fraction and higher NAEP scores in SD from mean, have on average ~5.3 K more GDP per capita (remember this is without factoring other groups into the mix that surely confound this!)]

multi_pred_by_euro

[A giant mess of different measures/outcomes against the gene percentage]

*****************

actual_le_by_weighted_average_sd_units_above_mean

[States that are 1 SD unit above the mean in GB/IR/W. Euro gene fraction and NHW smoking rates are, on average, ~1 SD unit below the mean in life expectancy]

Below I tossed a lot of this data into a heat map (US states) to try to show the patterns in standard deviation terms.  Blue=better, white=average, red=worse.

I picked this color scheme to show the differences in the sharpest relief possible.  When it’s off by a bit, especially around the mean (+/- average) it looks worse than it really is.  Nevertheless, you can probably pick up some pretty clear patterns here!
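To make the color convention concrete, here is a small sketch that maps a value in standard-deviation units to blue, white, or red, flipping the sign for measures where higher is worse (smoking, say). The three discrete bands and the 0.5 SD cutoff are assumptions for illustration only; the maps themselves may well use a continuous gradient.

```python
def heat_color(z, higher_is_better, band=0.5):
    """Map a z-score to 'blue' (better), 'white' (about average) or 'red' (worse).

    `band` is a hypothetical cutoff in SD units, not a value from the maps.
    """
    oriented = z if higher_is_better else -z
    if oriented > band:
        return "blue"
    if oriented < -band:
        return "red"
    return "white"


# Life expectancy: higher is better. Smoking rate: higher is worse.
print(heat_color(1.2, higher_is_better=True))
print(heat_color(1.2, higher_is_better=False))
print(heat_color(0.1, higher_is_better=True))
```

With a sharp scheme like this, a state sitting just past the cutoff gets the same color as an extreme outlier, which is why values near the mean can look worse than they really are.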

Google Chrome 2

Google Chrome 6

Google Chrome

Google Chrome 5

Google Chrome 10

[Note: I accidentally flipped the sign here!  positive = under-prediction = blue; negative = over-prediction = red.  Nevada is a clear over-prediction here!]



Google Chrome 4

Google Chrome 7


Google Chrome 8

[Texas NHW do surprisingly well on the NAEP.  Accurate or skewed by varying state standards?]

Google Chrome 9

And, of course, these state-wide averages obscure tremendous county-level variation (which is all the more reason why “policy” is unlikely to explain much here!)

Male life-expectancy

male_life_expectancy_by_county

Female life-expectancy

female_life_expectancy_by_county

[Note: these maps include all racial/ethnic groups so it tends to exaggerate a bit]

Nevertheless, you can see these by white-only groups for specific major causes of death, i.e., CHD and stroke, both hospitalizations and death rates here:

Google Chrome 27 Google Chrome 28 Google Chrome 25 Google Chrome 26 Google Chrome 23 Google Chrome 24

You can sort of observe the non-random distribution of ethnicity/ancestry by looking at specific (reported) ancestry in the US census:

“Scotch-Irish”

moa_scotch_irish

“Scottish”

moa_scotish

“English”

moa_english

“German”

moa_german

“Swedish”

moa_swedish

“Irish”
moa_irish

[Note: The responses are fairly fickle (trends, politics, etc) and not weighted by actual proportion amongst “whites”, so it’s not that useful as an absolute gauge… which is why I prefer the genetic approach here, even if it’s not quite as granular!]

major_reported_euro_ancestry_by_region

1990 US census data

In any event, I am not necessarily arguing that the three ethnic/genetic groups I picked out are homogeneous (within or between said groups) or have a uniform distribution, such that we should necessarily expect the same (implied) outcomes in, say, England (although there’s probably some general directional effect here if we compare these groups to, say, southern european life expectancy) or with any individual that has said proportions (i.e., if you could somehow clone said individual or their immediate family 1000 times and observe their mortality rates).  However, it seems very likely to me that this is at least a good proxy for particular ethnic groups that settled or emigrated to this country, so we can make some reasonable guesses about the sort of people in these states (on average) based on the average genetic proportions amongst (apparent) europeans broadly.

In other words, it’s not necessarily that the “Great British”, “Irish”, or “Western European” genetic groups are as unhealthy or otherwise underperforming as we (generally) find here, but rather that those states with high proportions of “Great British”, “Irish”, or “West European” (ancestry.com’s methods) ethnic groups amongst NHWs claim a higher share of particular finer-grain ethnic populations and that many of those groups have a long history of problems (although they could be sorted to some degree based on these sorts of admixture levels).   Likewise, these are averages amongst states: different cities and regions have different distributions of ethnic groups and the like too.  The Scottish (including our so-called “Scotch-Irish”), Irish (especially Northern/Protestants), and people in Northern parts of England have long had worse outcomes (and still do) and I rather suspect many of our early immigrants came from those groups and brought their troubles with them even as they prospered.  Many of them settled in the south, greater Appalachia, etc and a good number of them have since interbred with other groups and moved to other parts of the US (which is where genetic/admixture analysis can possibly reveal more than, say, US census reported ethnicity, which changes depending on current trends/popularity and the like!).

These sorts of patterns are not entirely unique to the United States.

There is a very well established “North-South” gradient in the UK in health outcomes.

Google Chrome 15

Scottish life expectancy by council area

Google Chrome 14

CHD mortality ratios for Wales, N. Ireland, and Scotland vs England

Google Chrome 11

CHD mortality in 1961 vs 2009

Google Chrome 13

Google Chrome 12

[Note: Very little change in relative risk!]

uk_chd_diff

uk_stroke_diff

uk_bp_diff

[Notice the relative lack of spread in Northern Ireland, by health authority at least.  I would not be surprised if this was the result of relatively (traditionally) greater genetic homogeneity in NI; see ancestry.com’s regional distribution, amongst others]

Google Chrome 17

Google Chrome 16

Nor have they abolished differences by major ethnic/racial groups

Google Chrome 18

uk_ethnicity_stroke_rates

UK_life_expectancy_spatial

uk_ethnicity_diabetes_rates

uk_self_reported_diabetes_rates

Google Chrome 19

Study of Y-haplogroups and CHD risks in the UK

uk_y_halpogroup_chd_study

Patterns in health disparities are hardly unique to the UK or the US either

Health inequality study in various european countries (healthcare amenable vs non-amenable mortality)

amenable_mortality_inequality_in_europe

[There are significant long-standing differences in mortality rates according to class in all of Europe: both “amenable” and otherwise]

Finland CHD mortality rates by income group

finland_mortality_rates_chd

Life expectancy in Finland, amongst adults age 35, by class and sex

finland_life_expectancy_by_sex_and_class

********

The notion that all groups have the same risk factors, the same life expectancy, and so on, ergo all differences in outcomes can only be explained by differences in health care treatment, is really not tenable any longer (it’s not even likely).  I suspect that these differences are mostly genetic (even many of the behavioral components are apt to be strongly influenced by genes, e.g., propensity for addiction to tobacco, over-eating, etc), but, if nothing else, we really do not have good environmental explanations, save for smoking, particular types of drug use, STDs, etc.  We barely even have good policy solutions to address those limited parts that we do understand (never mind that which we do not even understand!).   Many medical/technological improvements will help all groups, but, more often than not, these will tend to amplify pre-existing disparities in outcomes and rarely close them appreciably (unless a particular treatment is highly efficacious in treating diseases that disproportionately affect particular groups so as to have a pronounced effect on overall mortality/odds ratios)!

We clearly have more genetic diversity than do most European/Anglo countries that we are typically compared to, even amongst “whites”.

uk_ca_usa_pca_map

us_ca_pca_clusters

Many of our “white” immigrants came from different countries, regions, religious groups, and more.   If these different broad groups have (had) different genotypical life expectancies (given similar environmental conditions), risk factors, and so on this alone will create more apparent health “inequality” (especially if we look at the nation as a whole rather than more narrowly within a particular broad ethnic group and region) even if most people do not think of these groups as being (visibly) different.  (Not to mention some potential admixing with other groups and/or other groups identifying as NHW)

Furthermore, even many of these individual countries, regions, and the like are clearly not homogeneous either.

We can see this quite clearly with the averages produced by ancestry.com’s analysis for Europe (in these regions, some individuals clearly have more, some have less since this admixture is relatively recent):

ancestry_euro_structure

It is quite likely that they were and still are stratified genetically by class, by region, by historical religious groups, etc (whether they know it or not) and that this has implications for life expectancy, disease rates, and other important outcomes.  Emigration from these places to the United States was hardly random: some classes, religious groups, regions, and so on clearly settled here at very different rates for very different reasons at different times.  If we drew a disproportionate share of lower class or otherwise marginalized groups, it is quite likely that many of our groups could have somewhat worse outcomes even when we try to compare apples-to-apples (not that easy!).  There is some evidence for long-standing historical differences in different immigrant populations (and these differences still exist in many of these countries and, particularly, regions that they immigrated from).

See this table on Irish immigrants to the United States and historical health disparities (starting in 1850):

irish_heart_disease

[This is obviously long before modern medicine, fast food, etc and these patterns in CHD differences existed even then!]

Life expectancy at birth by region (both sexes) — zoom in Northern Europe

Google Chrome 21

Life expectancy at birth, both sexes, zoom out

Google Chrome 22

source: WHO

There is clearly not just one German, English, Scottish, French, Dutch, etc  life expectancy, not even at regional level (i.e., there is certainly more within regions, ethnic groups, classes, etc)!  [I would love to get similar level of detail for UK and various european countries to see how this sort of analysis plays out, crude though it may be].  Some regions have different distributions of risk factors and there is some evidence that this correlates with population structure (see this for CHD risks in the UK).  We also know that these groups are not that homogeneous either.

I suspect (but cannot prove) that interstate differences amongst non-hispanic whites, like differences within and between other first world countries, are largely genetic and probably somewhat cultural/behavioral (to the extent we can disentangle culture/behavior from genetics).  Of course lifestyle (e.g., smoking, eating habits, etc) and healthcare systems can have an impact, and they can change over time, but these variances are  usually not that pronounced in practice and they tend not to explain much within broadly similar environments (as in, amongst well established citizens that have adopted broadly western diets, that have decent healthcare systems, not excessively high rates of accidents/murder/etc).
