This package provides a function that predicts ethnicity (race) from names.
I created this package hoping to help applied researchers in their studies of ethnic bias and discrimination, and potentially to help eliminate racial and ethnic disparities. By using this package, you agree to the following:
Again, you should use the package responsibly.
Sure. I have trained a model to predict and classify race based on last names. Simply use it as:
    predict_ethnicity(lastnames = "Jackson", method = "lastname")
    #>   lastname prob_asian prob_black prob_hispanic prob_white  race
    #> 1  Jackson 0.02337379   0.898527   0.007418942 0.07068024 black
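Judging from the output above, the `race` column appears to be the category with the highest predicted probability. That reading is an assumption based on the printed results, not something stated by the package. A minimal base R sketch using the numbers shown for "Jackson":

```r
# Probabilities copied from the output above for the last name "Jackson".
probs <- c(
  asian    = 0.02337379,
  black    = 0.898527,
  hispanic = 0.007418942,
  white    = 0.07068024
)

# Assumption: the predicted race is the category with the largest probability.
predicted_race <- names(which.max(probs))
predicted_race
#> [1] "black"
```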
Of course. There is a separate model just for that. With both first names and last names, we can achieve higher accuracy than with last names alone. The syntax is similar to what we saw above.
    predict_ethnicity(firstnames = "Samuel", lastnames = "Jackson", method = "fullname")
    #>   firstname lastname prob_asian prob_black prob_hispanic prob_white  race
    #> 1    Samuel  Jackson 0.01741119  0.8898849   0.006667824 0.08603613 black
Cool. I got you covered. Just use vectors as input.
    firstnames <- c("Samuel", "Will")
    lastnames <- c("Jackson", "Smith")
    predict_ethnicity(lastnames = lastnames, method = "lastname")
    #>   lastname prob_asian prob_black prob_hispanic prob_white  race
    #> 1  Jackson 0.02337379  0.8985270   0.007418942 0.07068024 black
    #> 2    Smith 0.08850801  0.5421596   0.033163065 0.33616934 black
    predict_ethnicity(firstnames = firstnames, lastnames = lastnames, method = "fullname")
    #>   firstname lastname prob_asian prob_black prob_hispanic prob_white  race
    #> 1    Samuel  Jackson 0.01741119  0.8898849   0.006667824 0.08603613 black
    #> 2      Will    Smith 0.04450590  0.5568278   0.007727875 0.39093847 black
Just remember that the firstnames and lastnames vectors must have the
same length, and that the first name and last name of
the same person must sit at the same index in each vector.
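One way to guard against mismatched inputs is to check the vectors before calling the prediction function. The sketch below uses only base R; `check_name_vectors` is a hypothetical helper written for illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): check that the two name
# vectors line up before passing them to predict_ethnicity().
check_name_vectors <- function(firstnames, lastnames) {
  stopifnot(is.character(firstnames), is.character(lastnames))
  if (length(firstnames) != length(lastnames)) {
    stop(
      "firstnames has length ", length(firstnames),
      " but lastnames has length ", length(lastnames)
    )
  }
  invisible(TRUE)
}

firstnames <- c("Samuel", "Will")
lastnames <- c("Jackson", "Smith")
check_name_vectors(firstnames, lastnames)

# Keeping names in a data frame is one way to make sure each person's first
# and last name stay at the same index.
people <- data.frame(firstname = firstnames, lastname = lastnames)
```

With the names stored this way, `people$firstname` and `people$lastname` can be passed as the two vectors.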
Alright. The package also supports very fast execution through
multi-threading via the wonderful
RcppThread package. To
use this, pass a number greater than 0 to the
threads argument.
    firstnames <- rep("Samuel", 10000)
    lastnames <- rep("Jackson", 10000)
    # measure the elapsed time
    start_time <- Sys.time()
    p <- predict_ethnicity(firstnames = firstnames, lastnames = lastnames, threads = parallel::detectCores() - 2)
    end_time <- Sys.time()
    end_time - start_time
    #> Time difference of 1.431824 secs
Processing ten thousand names took only about 1.4 seconds in the run shown above. I would call this pretty fast.
For most use cases that I can imagine, the default setting
(threads = 0) should be fast enough, since we are leveraging
C++ routines for the processing. If you have a very large dataset, a
powerful machine, or simply want the code to run faster,
you can set the
threads argument to a value greater than 0 and
you should observe a performance boost.
You may need to choose the number of threads with care. In general, the more threads you use, the faster it should run. But the relationship is not linear, because each additional thread adds overhead. In the example, I chose the maximum number of threads minus 2 (24 - 2 = 22).
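The "maximum minus 2" rule can be wrapped in a small function that also handles machines with few cores. This is only a sketch: `choose_threads` and its `reserve` parameter are hypothetical names, not part of the package.

```r
# Hypothetical helper (not part of the package): leave a few cores free and
# never return less than 1.
choose_threads <- function(reserve = 2L) {
  cores <- parallel::detectCores()
  # detectCores() can return NA when the core count cannot be determined.
  if (is.na(cores)) {
    return(1L)
  }
  max(1L, as.integer(cores) - as.integer(reserve))
}

n_threads <- choose_threads()
n_threads
```

The result could then be passed on as `threads = n_threads`. On a 24-core machine this gives the same 22 threads used in the example.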
I first trained the models in Keras with Python, using the Florida Voter Registration dataset. After training a big model for the prediction, I also trained a smaller model that mimics the predictions of the large model (this is called “distillation of knowledge”). By doing so, we can significantly reduce the size of the model while keeping the accuracy. This is very important if we want the package to be lightweight and fast at processing data.
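To give a feel for what the small model is trained to do, here is a generic base R illustration of a distillation-style loss: the cross-entropy between the large model's softened predictions and the small model's predictions. The exact loss, temperature, and architectures used for this package are not described here, so the function names, the temperature of 2, and the logits below are all made up for illustration.

```r
# Softmax with a temperature; higher temperatures give softer distributions.
softmax <- function(logits, temperature = 1) {
  z <- logits / temperature
  z <- z - max(z)  # subtract the max for numerical stability
  exp(z) / sum(exp(z))
}

# Cross-entropy between the teacher's soft targets and the student's output.
distillation_loss <- function(teacher_logits, student_logits, temperature = 2) {
  p_teacher <- softmax(teacher_logits, temperature)
  p_student <- softmax(student_logits, temperature)
  -sum(p_teacher * log(p_student))
}

# Made-up logits for four classes, purely for illustration.
teacher_logits <- c(0.5, 4.0, -1.0, 1.5)
good_student   <- c(0.4, 3.8, -0.9, 1.4)
poor_student   <- c(2.0, 0.0, 1.0, 0.5)

distillation_loss(teacher_logits, good_student)
distillation_loss(teacher_logits, poor_student)
```

The student whose logits track the teacher's gets the lower loss, which is the signal a distilled model is trained to minimize.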
After training and testing, I saved the distilled model
and exported it to C++ with the
frugally-deep project. This
allows us to drop the dependency on Keras and Python and
make inferences directly from C++. From here, it is straightforward
to wrap the inference procedure with
Rcpp and call
it from R.
Note that one could use the
keras package in
R to load Keras models trained in Python. But I believe this would
defeat the purpose of having an R package, as the
package would still depend on Python and an installation of Keras. You can
argue that we are actually still using Python :)
I have trained two different models for predicting ethnicity from
names: one uses only last names, while the other incorporates both
first names and last names. In some applications, researchers may only
have access to last names, in which case they should consider using the
lastname method. In other cases, first
names are also available, and we can incorporate this information by using the
fullname method. This yields better accuracy than using last names alone.
The processing speed is super fast, as the heavy lifting has been
delegated to the underlying C++ code. What is more, to make it even faster, I
use RcppThread to achieve multi-threading. This is
extremely useful if you have a very large dataset at hand. As shown in
the example above, we were able to process a million names within 10
seconds. In other words, we could predict the race from a name by 10 μs