We evaluate some of the ethnicolr models on the NC Voter Registration Data (access limited to researchers with university affiliation). Evaluation is complicated by the fact that the two states, Florida and North Carolina, code race and ethnicity differently.
North Carolina distinguishes between race and ethnicity and records them in two separate columns. Here's the codebook:
/ ***************************************************************************
Race codes
race description
*******************************************************************************
A ASIAN
B BLACK or AFRICAN AMERICAN
I AMERICAN INDIAN or ALASKA NATIVE
M TWO or MORE RACES
O OTHER
P NATIVE HAWAIIAN or PACIFIC ISLANDER
U UNDESIGNATED
W WHITE
*************************************************************************** /
/ ***************************************************************************
Ethnic codes
ethnicity description
*******************************************************************************
HL HISPANIC or LATINO
NL NOT HISPANIC or NOT LATINO
UN UNDESIGNATED
*************************************************************************** /
FL codebook is as follows:
We start by presenting a full cross-tabulation of predictions from the FL full name model against the NC concatenation of race and ethnicity, e.g., Asian--HL, Asian--NL, etc.
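As a rough illustration of that cross-tabulation, the sketch below uses pandas on a few made-up rows. The column name `pred_race`, the toy data, and the use of raw codes (e.g. `A--NL`) rather than spelled-out names in the concatenated label are assumptions for illustration, not the layout of the actual notebooks.

```python
import pandas as pd

# Toy rows standing in for the NC voter file. `pred_race` is a hypothetical
# column holding the label predicted by the FL full name model.
df = pd.DataFrame({
    "race_code":   ["A", "B", "W", "W", "B"],
    "ethnic_code": ["NL", "NL", "NL", "HL", "NL"],
    "pred_race":   ["asian", "nh_white", "nh_white", "hispanic", "nh_black"],
})

# Concatenate race and ethnicity into a single NC label, e.g. "A--NL".
df["nc_label"] = df["race_code"] + "--" + df["ethnic_code"]

# Rows: NC label; columns: model prediction; cells: counts.
xtab = pd.crosstab(df["nc_label"], df["pred_race"])
print(xtab)
```

Passing `normalize="index"` to `pd.crosstab` would give row shares instead of counts, which can be easier to read when group sizes differ widely.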
Next we present three comparisons:
- (race_code == 'B') & (ethnic_code == 'NL') ==> nh_black
- (race_code == 'W') & (ethnic_code == 'NL') ==> nh_white
The overall accuracy is 82%, with accuracy for NH Black at 33% and NH White at 96%.
- (race_code == 'B') & (ethnic_code == 'NL') ==> nh_black
- (race_code == 'W') & (ethnic_code == 'NL') ==> nh_white
- ((race_code == 'W') & (ethnic_code == 'HL')) | ((race_code == 'B') & (ethnic_code == 'HL')) ==> hispanic
- (race_code == 'A') & (ethnic_code == 'NL') ==> asian
The overall accuracy is 81%, with accuracy for NH Black at 33%, NH White at 96%, Asians at 60%, and Hispanics at 59%.
- (race_code == 'B') & (ethnic_code == 'NL') ==> nh_black
- (race_code == 'W') & (ethnic_code == 'NL') ==> nh_white
- ethnic_code == 'HL' ==> hispanic
- (race_code == 'A') & (ethnic_code == 'NL') ==> asian
The overall accuracy is 81%, with accuracy for NH Black at 33%, NH White at 96%, Asians at 60%, and Hispanics at 71%.
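The mapping rules of the last comparison, and the overall and per-group accuracy computed from them, could be sketched roughly as follows. The helper `nc_to_fl_label`, the column `pred_race`, and the toy data are hypothetical. Dropping rows that match none of the rules before scoring is also an assumption; the notebooks may handle those rows differently.

```python
import pandas as pd

def nc_to_fl_label(df):
    """Map NC race_code / ethnic_code to FL-style labels (hypothetical helper)."""
    label = pd.Series(None, index=df.index, dtype="object")
    not_latino = df["ethnic_code"] == "NL"
    label[(df["race_code"] == "B") & not_latino] = "nh_black"
    label[(df["race_code"] == "W") & not_latino] = "nh_white"
    label[(df["race_code"] == "A") & not_latino] = "asian"
    # Anyone coded HL counts as hispanic, whatever the race code.
    label[df["ethnic_code"] == "HL"] = "hispanic"
    return label

# Toy rows; `pred_race` stands in for the FL model's prediction.
df = pd.DataFrame({
    "race_code":   ["B", "B", "W", "A", "W", "I"],
    "ethnic_code": ["NL", "NL", "NL", "NL", "HL", "NL"],
    "pred_race":   ["nh_black", "nh_white", "nh_white", "asian",
                    "hispanic", "nh_white"],
})
df["true_label"] = nc_to_fl_label(df)

# Assumption: rows that match none of the rules are left out of the scoring.
scored = df.dropna(subset=["true_label"])
correct = scored["true_label"] == scored["pred_race"]

overall = correct.mean()                                  # overall accuracy
per_group = correct.groupby(scored["true_label"]).mean()  # accuracy by true label
print(overall)
print(per_group)
```

Per-group accuracy here is the share of records with a given true label that the model predicts correctly, which is how the figures for NH Black, NH White, Asians, and Hispanics above read.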
We build new LSTM models based on NC data. We start by assuming y = concatenation of ethnic code and race code. We remove U and UN, assuming they are 'missing at random.' This gives us 12 categories.
We build a separate model that predicts only race_code, with 'U' removed, again assuming it to be 'missing at random.' We also build a model that predicts only ethnic_code, with UN removed.
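The three target variables described above could be prepared along these lines. This is a minimal sketch on made-up rows. The `_` separator in the concatenated label and the variable names are assumptions, and the LSTM training step itself is not shown.

```python
import pandas as pd

# Toy rows standing in for the NC voter file.
df = pd.DataFrame({
    "race_code":   ["W", "B", "U", "A", "W"],
    "ethnic_code": ["NL", "HL", "NL", "UN", "HL"],
})

# Combined target: drop rows where either code is undesignated,
# then concatenate ethnic code and race code.
both = df[(df["race_code"] != "U") & (df["ethnic_code"] != "UN")].copy()
both["y"] = both["ethnic_code"] + "_" + both["race_code"]

# Race-only target: drop 'U' only.
race_only = df.loc[df["race_code"] != "U", "race_code"]

# Ethnicity-only target: drop 'UN' only.
ethnic_only = df.loc[df["ethnic_code"] != "UN", "ethnic_code"]

print(both["y"].tolist())
print(race_only.tolist())
print(ethnic_only.tolist())
```

Filtering separately for each model keeps more rows for the single-column models than the combined filter would, since a record missing only one of the two codes is still usable for the other.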
- Download NC Data
- FL Model Evaluation on NC Data
- 12 category Model
- Race code model
- Latino model
- NC Model Evaluation on FL Data
Suriyan Laohaprapanon and Gaurav Sood