On using machine learning to predict recidivism
Parole board members use predictions of an offender's risk of recidivating to inform early release decisions. To date, attempts to predict recidivism have been only moderately accurate. Machine learning has the potential to increase predictive accuracy, yet machine learning methods remain relatively unexplored by criminal justice researchers. The goal of this dissertation was to recommend which combination of modern machine learning method (random forests or Extreme Gradient Boosting) and set of predictors is optimal for predicting recidivism at parole hearings. I operationalized recidivism with five outcome variables: whether the offender was rearrested for any crime, rearrested for a violent crime, reconvicted for a violent crime, reincarcerated for any crime, or reconvicted/reincarcerated for a violent crime. Data came from the Texas Department of Criminal Justice and included an exhaustive list of predictor variables on offenders released in fiscal years 2010 through 2013 (n = 258,248). The models predicted recidivism with areas under the receiver operating characteristic curve between .74 and .85, depending on how recidivism was defined, with general reincarceration being the easiest outcome to predict and violent rearrest the hardest. Random forests slightly outperformed Extreme Gradient Boosting on all outcomes. The most important predictors were number of tattoos, age, number of previous convictions for various crimes, age at first arrest, and rate of disciplinary infractions during the current incarceration. Race, sex, and correctional education program participation were relatively unimportant.
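
For readers less familiar with the accuracy metric reported above, the area under the receiver operating characteristic curve can be read as the probability that a randomly chosen recidivist receives a higher predicted risk than a randomly chosen non-recidivist, with ties counted as one half. The following is a minimal Python sketch of that definition, not the procedure used in this dissertation. The function name `roc_auc` and the eight scores and labels are hypothetical and purely illustrative; they are not drawn from the Texas data.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve computed by pairwise comparison.

    Returns the share of (positive, negative) pairs in which the
    positive case has the higher score, counting ties as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # recidivist ranked above non-recidivist
            elif p == n:
                wins += 0.5   # tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks for eight released offenders (illustrative only).
risk_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
# 1 = recidivated, 0 = did not recidivate (illustrative only).
recidivated = [1, 1, 0, 1, 0, 0, 1, 0]

auc = roc_auc(risk_scores, recidivated)
print(f"AUC = {auc:.2f}")
```

On this toy data, 12 of the 16 recidivist/non-recidivist pairs are ranked correctly, giving an area of .75. A value of .50 corresponds to chance-level ranking and 1.0 to perfect ranking.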