Analyzing Infant Mortality Rates by County, Race, and Income in North Carolina, 2011-2016

Published

March 7, 2023

Introduction

In November 2022, my friend gave me the datasets included in the “data” folder for this post as “Yr1116Birth.csv” and “Yr1116Death.csv” and asked me to develop a model that predicts a county-by-county infant mortality rate from the various demographic features recorded in the data. The data lists births and deaths of infants in North Carolina in the years 2011-2016 by county of residence and by the race and ethnicity of the mother. The Excel file Dataset Descriptions.xls lists all the information available in this dataset, which I presume comes from the NC State Center for Vital Statistics. I supplemented the data with county-by-county income data that I found at datausa, though this information is also readily available from public agencies like the Census Bureau. I used the data to compute infant mortality rates by county, race, and year, training my models on the 2011-2015 data and validating on the 2016 data.

This post is a write-up of my analysis of this data. This was an interesting task because it was a crash course in a few important data science skills: data wrangling, rare event modeling, and maps & data visualization.

Data wrangling

My data came from a number of different sources. There were separate birth and death data tables, the income data came from its own source, and the purely geographic data used to generate the map at the end of this post came from the mapdata package.

Combining these sources presented some difficulty. The birth and death data used different numerical codings for the race of the baby/mother, so I had to write a custom function, collapse, which collapsed the many extra race codings in the birth data into the simpler “other” category used in the death data. In a similar vein, different sources encoded county information in different ways: North Carolina numbers its counties from 1 to 100, but there is also a FIPS code, which is more useful for national data, and of course a county’s name is a fine label for it too. One defect in my data was that one of my sources misspelled “Tyrrell”, and it took me a while to detect this error.
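The recoding itself was simple once the codings were understood; here is a minimal sketch of what such a collapse function might look like. The specific numeric codes below are illustrative assumptions, not the actual codings from Dataset Descriptions.xls.

```r
# Hypothetical race recoding: suppose codes 1-3 are shared between the
# birth and death tables, and every finer-grained birth code maps to
# 4 ("other"). The real code values live in Dataset Descriptions.xls.
collapse <- function(race_code) {
  ifelse(race_code %in% c(1, 2, 3), race_code, 4)
}

collapse(c(1, 2, 3, 5, 7))  # 1 2 3 4 4
```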

Rare Event Modeling

When my friend presented me with this data, he and I discussed the interesting fact that some counties recorded no infant deaths for certain races in certain years. I don’t think this was due to incomplete records or reporting anomalies: when I investigated these cases, I found that there were fewer than 100 births in the previous year in the same race and county. The overall infant mortality rate was about 0.7% in these years, so the expected number of infant deaths when there are fewer than 100 births is less than 1.
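The arithmetic behind that observation is quick to check:

```r
# Expected infant deaths in a group with fewer than 100 births,
# at the overall rate of roughly 0.7%:
rate <- 0.007
births <- 100
births * rate  # 0.7, i.e. less than one expected death per year
```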

My friend raised the possibility of modeling this as a classification problem: given a particular infant birth, predict the probability that the infant dies in its first year of life. I considered this approach but decided against it: the birth data contained information, like the infant’s birth weight, that was not reflected in the death data, so I thought it might be hard to measure the effect of these additional variables on a given infant’s likelihood of death. Instead, I modeled the problem as a regression problem: predict the infant mortality rate for a given county, year, and race, given the county’s average values for the other predictors in the birth data (e.g., birth weight in grams, median income in the county, number of cigarettes smoked by the mother).

Nevertheless, the data still presented the challenges associated with classification problems in which the classes are very unbalanced. To get a feel for the issue, consider one of those counties where there were no infant deaths in a given year. Because infant mortality rates are on the order of 1/1000, it makes sense to measure them on a log scale in order to detect meaningful differences between counties. In the counties with no infant deaths, we would record an infant mortality rate of 0, which is (infinitely) many orders of magnitude smaller than the typical rate. To solve this, I added .001 to the infant mortality rates before taking the logarithm. This is somewhat like label smoothing: I don’t want the model to make too much of the points where there happened to be no infant deaths.
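Concretely, the smoothing step looks like this (the rates below are made-up illustrations, not values from the dataset):

```r
# Without smoothing, a zero-death group has log(0) = -Inf; adding .001
# before the logarithm keeps every log rate finite.
raw_imr <- c(0, 0.002, 0.007, 0.02)   # illustrative raw rates, one zero
log(raw_imr)                          # first entry is -Inf
smoothed_imr <- log(raw_imr + 0.001)  # all finite
smoothed_imr
```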

One other thing I tried to do was to aggregate the counties into clusters and compute only in-cluster infant mortality rates for the training data. This was an attempt to reduce year-to-year variance due to small sample sizes. However, I found that my implementation of this idea didn’t really improve the validation-set error. So, in the end, I didn’t put this into practice. But, if you look at the code for this notebook, you’ll find relics of that approach.

Visualization and Maps

This project was a good way for me to practice what I had learned about making data visualizations from Introduction to Statistical Learning with R, including feature importance charts. It was also an opportunity to learn how to make a county-by-county heat map, which appears in the last section of this post.

Loading Packages and Cleaning Data

In this section, I clean the data and wrangle it into a form amenable to analysis. I omit most of this process from the presentation version of the notebook, but interested readers can examine the code, which is available on my GitHub page.

The result of all these manipulations is a data frame brdrcounts which looks like this:

head(brdrcounts)
# A tibble: 6 × 15
   YEAR CORES  RACE CIGPN CIGFN CIGSN CIGLN  BWTG  GEST  PLUR  MAGE PARITY
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1  2011     1     1 3.25  1.91  1.43  1.33  3318.  38.7  1.03  27.1   2.33
2  2011     1     2 2.58  1.58  1.06  0.942 3089.  38.2  1.05  25.5   2.77
3  2011     1     3 6.5   5     5     5     2758.  38.5  1     25     2.75
4  2011     1     4 0.347 0.180 0.178 0.159 3287.  38.7  1.03  27.2   3.05
5  2011     2     1 2.93  1.87  1.63  1.49  3252.  38.2  1.05  26.6   2.71
6  2011     2     2 0     0     0     0     3433.  39.5  1     23.4   2.15
# … with 3 more variables: INCOME <dbl>, IMR <dbl>, CLUSTER <dbl>

Most of the columns are explained in the file Dataset Descriptions.xls; “IMR” is the log of the infant mortality rate, and “CLUSTER” is just a duplicate of “CORES” (an artifact from when I tried to apply clustering to the data).

Trying Different Models

Now we proceed to try different models, evaluating each on the 2016 validation data. A word of warning is in order about concluding too much from variable importance, since we expect significant collinearity between some of the predictors.

The first method we try is just a linear model; we perform subset selection by validation-set MSE.

which.min(val.errors)
[1] 10
coef(regfit,which.min(val.errors))
  (Intercept)         RACE2         RACE3         CIGPN         CIGFN 
-1.983988e+00  4.961281e-01  1.442747e-01 -7.878452e-03 -5.804568e-03 
        CIGSN          BWTG          GEST          PLUR        PARITY 
 1.456896e-02 -1.282548e-04 -4.569572e-02 -4.074168e-01  7.307301e-03 
       INCOME 
-9.240510e-06 
subset.error<-val.errors[which.min(val.errors)]

The best model associates an increase in infant mortality rate when the mother is African American or American Indian. It is harder to understand the negative signs of the coefficients for “CIGPN” and “CIGFN”. My guess is that this has to do with the fact that I imputed a slightly lower-than-average infant mortality rate whenever the death count for a given county, race, and year was zero.
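The subset-selection code that produced regfit and val.errors is omitted from this write-up. As a self-contained illustration of the validation-set idea, here is a sketch that runs greedy forward selection (a stand-in for the best-subset search used above) on synthetic data; the variables, split, and coefficients are all made up for the example.

```r
# Validation-set model selection, sketched on synthetic data.
set.seed(1)
n <- 200; p <- 6
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)   # only V1 and V2 matter
dat <- data.frame(y = y, X)
train <- dat[1:150, ]; test <- dat[151:200, ]

chosen <- character(0)
val.errors <- numeric(p)
for (k in 1:p) {
  # Try adding each remaining predictor; keep the one with lowest test MSE.
  cand.err <- sapply(setdiff(paste0("V", 1:p), chosen), function(v) {
    fit <- lm(reformulate(c(chosen, v), response = "y"), data = train)
    mean((test$y - predict(fit, test))^2)
  })
  chosen <- c(chosen, names(which.min(cand.err)))
  val.errors[k] <- min(cand.err)
}
which.min(val.errors)  # size of the best model by validation MSE
```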

Let’s now try lasso regression.

x<-model.matrix(IMR~.-CORES,train.data[,1:14])[,-1]
y<-na.omit(train.data$IMR)
set.seed(1)
cv.lasso<-cv.glmnet(x,y,alpha=1,family="gaussian")
plot(cv.lasso)

coef(cv.lasso,cv.lasso$lambda.min)
15 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept) -7.914631e+00
YEAR         2.907554e-03
RACE2        4.879737e-01
RACE3        1.309074e-01
RACE4       -7.877888e-03
CIGPN       -6.122286e-03
CIGFN        .           
CIGSN        6.751858e-03
CIGLN        .           
BWTG        -1.183393e-04
GEST        -4.529146e-02
PLUR        -3.807254e-01
MAGE        -1.634793e-04
PARITY       6.375718e-03
INCOME      -8.955468e-06
coef(cv.lasso,cv.lasso$lambda.1se)
15 x 1 sparse Matrix of class "dgCMatrix"
                    s1
(Intercept) -4.8453328
YEAR         .        
RACE2        0.1535657
RACE3        .        
RACE4        .        
CIGPN        .        
CIGFN        .        
CIGSN        .        
CIGLN        .        
BWTG         .        
GEST         .        
PLUR         .        
MAGE         .        
PARITY       .        
INCOME       .        

The 1se lambda value gives a model in which the only predictor is RACE2 (African American).

Now we try ridge regression:

cv.ridge<-cv.glmnet(x,y,alpha=0,family="gaussian")
plot(cv.ridge)

coef(cv.ridge,cv.ridge$lambda.min)
15 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept) -1.002797e+01
YEAR         3.996398e-03
RACE2        4.597315e-01
RACE3        1.149176e-01
RACE4       -2.115184e-02
CIGPN       -5.598731e-03
CIGFN       -1.539379e-03
CIGSN        5.012017e-03
CIGLN        2.696535e-03
BWTG        -1.336627e-04
GEST        -4.438239e-02
PLUR        -3.947143e-01
MAGE        -2.115157e-03
PARITY       7.580164e-03
INCOME      -8.632730e-06
coef(cv.ridge,cv.ridge$lambda.1se)
15 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept) -5.599637e+00
YEAR         8.684702e-04
RACE2        8.397161e-02
RACE3       -3.927352e-03
RACE4       -3.221903e-02
CIGPN       -6.046998e-05
CIGFN       -5.124061e-05
CIGSN       -9.542885e-06
CIGLN       -3.908353e-05
BWTG        -5.967245e-05
GEST        -1.357928e-02
PLUR        -1.939902e-02
MAGE        -6.071366e-03
PARITY       7.083761e-04
INCOME      -1.630609e-06

Let’s try to train a single tree.

library(tree)
tree.model<-tree(IMR~.-CORES,train.data)
summary(tree.model)

Regression tree:
tree(formula = IMR ~ . - CORES, data = train.data)
Variables actually used in tree construction:
[1] "RACE"   "INCOME" "PLUR"   "GEST"   "CIGLN" 
Number of terminal nodes:  6 
Residual mean deviance:  0.3125 = 582.5 / 1864 
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.33800 -0.15540 -0.08285  0.00000  0.14070  4.18600 
plot(tree.model)
text(tree.model,pretty=0)

tree.preds<-predict(tree.model,na.omit(test.data))
tree.error<-mean((na.omit(tree.preds)-na.omit(test.data$IMR))^2)

The most notable differences in IMR come from GEST and RACE2.

Let’s do some tree pruning. The following graph shows that the minimum deviance is obtained via a tree with 6 nodes; however, there doesn’t seem to be much difference between a tree with 3 nodes and a tree with 6 nodes.

set.seed(2)
tree.cv<-cv.tree(tree.model)
plot(tree.cv$size,tree.cv$dev,type="b")

prune.tree.model<-prune.tree(tree.model,best=3)
plot(prune.tree.model)
text(prune.tree.model,pretty=0)

prune.tree.preds<-predict(prune.tree.model,na.omit(test.data))
prune.error<-mean((na.omit(prune.tree.preds)-na.omit(test.data$IMR))^2)

As we noted above, GEST and RACE2 are the major factors in this tree.

Let’s try random forests.

library(randomForest)
set.seed(12)
rf.model<-randomForest(IMR~.-CORES,na.omit(train.data[,1:14]), importance=TRUE)
rf.preds<-predict(rf.model,newdata=na.omit(test.data))
rf.error<-mean((rf.preds-na.omit(test.data$IMR))^2)
varImpPlot(rf.model)

Again, “RACE”, “BWTG”, and the various “CIG” predictors appear near the top of the %IncMSE chart.

The last model is boosting:

library(gbm)
set.seed(15)
boost.model<-gbm(IMR~.-CORES,data=na.omit(train.data),distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.model)

            var   rel.inf
GEST       GEST 14.410847
BWTG       BWTG 12.606412
MAGE       MAGE 11.118088
PARITY   PARITY 10.495755
CLUSTER CLUSTER  8.163164
INCOME   INCOME  7.517338
CIGPN     CIGPN  6.803060
PLUR       PLUR  6.779247
CIGFN     CIGFN  5.553177
CIGLN     CIGLN  4.856280
CIGSN     CIGSN  4.735801
RACE       RACE  4.725559
YEAR       YEAR  2.235272
boost.preds<-predict(boost.model,newdata=na.omit(test.data))
boost.error<-mean((boost.preds-na.omit(test.data$IMR))^2)

Summary of Results

Now, we summarize our results in a table:

mean.error<-mean((na.omit(test.data$IMR)-log(avgIMR+.007))^2)
test.error.data<-data.frame(
  Method=c("No Dependence on Predictors","Best Subset Linear Model","Simple Lasso",
           "Lowest MSE Lasso","Simple Ridge","Lowest MSE Ridge","Tree",
           "Pruned Tree","Random Forests","Boosting"),
  `Test Error`=c(mean.error,subset.error,simple.lasso.error,best.lasso.error,
                 simple.ridge.error,best.ridge.error,tree.error,prune.error,
                 rf.error,boost.error))
test.error.data
                        Method Test.Error
1  No Dependence on Predictors  0.6817740
2     Best Subset Linear Model  0.3456520
3                 Simple Lasso  0.3640914
4             Lowest MSE Lasso  0.3462914
5                 Simple Ridge  0.3660809
6             Lowest MSE Ridge  0.3466169
7                         Tree  0.3436803
8                  Pruned Tree  0.3643193
9               Random Forests  0.3240173
10                    Boosting  0.4121946

Random forests seems to have done the best. This is consistent with its reputation as the best out-of-the-box method. Boosting is finicky and probably requires more hyperparameter tuning.
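A fuller tuning pass might grid-search the gbm settings and keep whichever combination minimizes validation-set MSE. A sketch of such a grid follows; the candidate values are illustrative, not recommendations.

```r
# Candidate hyperparameter settings for gbm; each row would be fit with
# gbm(IMR ~ . - CORES, data = ..., distribution = "gaussian",
#     n.trees = ..., interaction.depth = ..., shrinkage = ...)
# and scored by validation-set MSE.
grid <- expand.grid(
  n.trees = c(500, 1000, 5000),
  interaction.depth = c(1, 2, 4),
  shrinkage = c(0.1, 0.01)
)
nrow(grid)  # 18 candidate settings
```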

Graphs and Visualizations

This is a graph of infant mortality rates by race and year.

This is a heat map of North Carolina by infant mortality rate:

Conclusion

It was interesting to go back to this project a few months after I first did it, because I noticed a lot of places where what I have learned in the interim could have come in handy. For example, I now have more robust EDA and feature engineering frameworks. Keep an eye out for a future blog post in which I discuss these issues in more depth in the context of my participation in recent Kaggle competitions.