[Updated: Mon, Sep 23, 2024 - 14:42:38 ]
We will use two datasets throughout this course. The first has a continuous outcome, and the second has a binary outcome. We will apply several methods and algorithms to both, which gives us an opportunity to compare and contrast the predictions from different models on the same data.
This section provides some background information and context for these two datasets.
The readability dataset comes from a recent Kaggle Competition (CommonLit Readability Prize). You can directly download the training dataset from the competition website, or you can import it from the course website.
'data.frame': 2834 obs. of 6 variables:
$ id : chr "c12129c31" "85aa80a4c" "b69ac6792" "dd1000b26" ...
$ url_legal : chr "" "" "" "" ...
$ license : chr "" "" "" "" ...
$ excerpt : chr "When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an inte"| __truncated__ "All through dinner time, Mrs. Fayre was somewhat silent, her eyes resting on Dolly with a wistful, uncertain ex"| __truncated__ "As Roger had predicted, the snow departed as quickly as it came, and two days after their sleigh ride there was"| __truncated__ "And outside before the palace a great garden was walled round, filled full of stately fruit-trees, gray olives "| __truncated__ ...
$ target : num -0.34 -0.315 -0.58 -1.054 0.247 ...
$ standard_error: num 0.464 0.481 0.477 0.45 0.511 ...
There is a total of 2834 observations. Each observation represents a reading passage. The most significant variables are the excerpt and target columns. The excerpt column includes plain text data, and the target column includes a corresponding measure of readability for each excerpt.
readability[1,]$excerpt
[1] "When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an interior scene, it was a winter landscape.\nThe floor was covered with snow-white canvas, not laid on smoothly, but rumpled over bumps and hillocks, like a real snow field. The numerous palms and evergreens that had decorated the room, were powdered with flour and strewn with tufts of cotton, like snow. Also diamond dust had been lightly sprinkled on them, and glittering crystal icicles hung from the branches.\nAt each end of the room, on the wall, hung a beautiful bear-skin rug.\nThese rugs were for prizes, one for the girls and one for the boys. And this was the game.\nThe girls were gathered at one end of the room and the boys at the other, and one end was called the North Pole, and the other the South Pole. Each player was given a small flag which they were to plant on reaching the Pole.\nThis would have been an easy matter, but each traveller was obliged to wear snowshoes."
readability[1,]$target
[1] -0.3402591
According to the data owner, ‘the target value is the result of a Bradley-Terry analysis of more than 111,000 pairwise comparisons between excerpts. Teachers spanning grades 3-12 served as the raters for these comparisons.’ A lower target value indicates a more challenging text to read. The highest target score is equivalent to the 3rd-grade level, while the lowest target score is equivalent to the 12th-grade level. The purpose is to develop a model that predicts a readability score for a given text to identify an appropriate reading level.
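To give some intuition for what a Bradley-Terry score means, here is a minimal sketch of the standard Bradley-Terry form, in which the probability that one excerpt is preferred over another depends only on the difference between their scores. It assumes the target values behave like Bradley-Terry scores on the logit scale, which the competition description does not spell out, so the resulting probabilities are purely illustrative. The function name bt_prob is hypothetical.

```r
# Standard Bradley-Terry form:
#   P(i is preferred over j) = exp(b_i) / (exp(b_i) + exp(b_j)),
# which equals the logistic function of the difference b_i - b_j.
bt_prob <- function(score_i, score_j) {
  plogis(score_i - score_j)
}

# Two excerpts with equal scores are equally likely to be judged easier
bt_prob(0, 0)   # 0.5

# Illustrative only: the first and fourth target values from the output above,
# ASSUMING they are on the Bradley-Terry logit scale
bt_prob(-0.34, -1.054)
```

Under this assumption, an excerpt with a higher target value would be judged easier to read than one with a lower value in more than half of the comparisons.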
In the following weeks, we will talk briefly about pre-trained language models (e.g., RoBERTa). Our coverage of this material will be at the surface level. We will primarily cover how to obtain numerical vector representations (sentence embeddings) for a given text input from a pre-trained language model using Python through R. Then, we will use the sentence embeddings as features to predict the target score in this dataset using various modeling frameworks.
The Recidivism dataset comes from The National Institute of Justice’s (NIJ) Recidivism Forecasting Challenge. The challenge aims to increase public safety and improve the fair administration of justice across the United States. It had three stages of prediction, and all three stages required modeling a binary outcome (recidivated vs. not recidivated in Year 1, Year 2, and Year 3). In this class, we will only work on the second stage and develop a model for predicting the probability of an individual’s recidivism in the second year after initial release.
You can download the training dataset directly from the competition website, or from the course website. Either way, please read the Terms of Use at this link before working with this dataset.
'data.frame': 25835 obs. of 54 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : chr "M" "M" "M" "M" ...
$ Race : chr "BLACK" "BLACK" "BLACK" "WHITE" ...
$ Age_at_Release : chr "43-47" "33-37" "48 or older" "38-42" ...
$ Residence_PUMA : int 16 16 24 16 16 17 18 16 5 16 ...
$ Gang_Affiliated : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Supervision_Risk_Score_First : int 3 6 7 7 4 5 2 5 7 5 ...
$ Supervision_Level_First : chr "Standard" "Specialized" "High" "High" ...
$ Education_Level : chr "At least some college" "Less than HS diploma" "At least some college" "Less than HS diploma" ...
$ Dependents : chr "3 or more" "1" "3 or more" "1" ...
$ Prison_Offense : chr "Drug" "Violent/Non-Sex" "Drug" "Property" ...
$ Prison_Years : chr "More than 3 years" "More than 3 years" "1-2 years" "1-2 years" ...
$ Prior_Arrest_Episodes_Felony : chr "6" "7" "6" "8" ...
$ Prior_Arrest_Episodes_Misd : chr "6 or more" "6 or more" "6 or more" "6 or more" ...
$ Prior_Arrest_Episodes_Violent : chr "1" "3 or more" "3 or more" "0" ...
$ Prior_Arrest_Episodes_Property : chr "3" "0" "2" "3" ...
$ Prior_Arrest_Episodes_Drug : chr "3" "3" "2" "3" ...
$ Prior_Arrest_Episodes_PPViolationCharges : chr "4" "5 or more" "5 or more" "3" ...
$ Prior_Arrest_Episodes_DVCharges : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ Prior_Arrest_Episodes_GunCharges : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Prior_Conviction_Episodes_Felony : chr "3 or more" "3 or more" "3 or more" "3 or more" ...
$ Prior_Conviction_Episodes_Misd : chr "3" "4 or more" "2" "4 or more" ...
$ Prior_Conviction_Episodes_Viol : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ Prior_Conviction_Episodes_Prop : chr "2" "0" "1" "3 or more" ...
$ Prior_Conviction_Episodes_Drug : chr "2 or more" "2 or more" "2 or more" "2 or more" ...
$ Prior_Conviction_Episodes_PPViolationCharges : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
$ Prior_Conviction_Episodes_DomesticViolenceCharges: logi FALSE TRUE TRUE FALSE FALSE FALSE ...
$ Prior_Conviction_Episodes_GunCharges : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
$ Prior_Revocations_Parole : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Prior_Revocations_Probation : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ Condition_MH_SA : logi TRUE FALSE TRUE TRUE TRUE FALSE ...
$ Condition_Cog_Ed : logi TRUE FALSE TRUE TRUE TRUE FALSE ...
$ Condition_Other : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
$ Violations_ElectronicMonitoring : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Violations_Instruction : logi FALSE TRUE TRUE FALSE FALSE FALSE ...
$ Violations_FailToReport : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Violations_MoveWithoutPermission : logi FALSE FALSE TRUE FALSE FALSE TRUE ...
$ Delinquency_Reports : chr "0" "4 or more" "4 or more" "0" ...
$ Program_Attendances : chr "6" "0" "6" "6" ...
$ Program_UnexcusedAbsences : chr "0" "0" "0" "0" ...
$ Residence_Changes : chr "2" "2" "0" "3 or more" ...
$ Avg_Days_per_DrugTest : num 612 35.7 93.7 25.4 23.1 ...
$ DrugTests_THC_Positive : num 0 0 0.333 0 0 ...
$ DrugTests_Cocaine_Positive : num 0 0 0 0 0 0 0 0 NA 0 ...
$ DrugTests_Meth_Positive : num 0 0 0.1667 0 0.0588 ...
$ DrugTests_Other_Positive : num 0 0 0 0 0 0 0 0 NA 0 ...
$ Percent_Days_Employed : num 0.489 0.425 0 1 0.204 ...
$ Jobs_Per_Year : num 0.448 2 0 0.719 0.929 ...
$ Employment_Exempt : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Recidivism_Within_3years : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ Recidivism_Arrest_Year1 : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ Recidivism_Arrest_Year2 : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ Recidivism_Arrest_Year3 : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
$ Training_Sample : int 1 1 1 1 1 0 1 0 1 1 ...
There are 25,835 observations in the training set and 54 variables, including a unique ID variable, four outcome variables (recidivism in Year 1, recidivism in Year 2, recidivism in Year 3, and recidivism within three years), and a filter variable indicating whether an observation was included in the training dataset or the test dataset. The remaining 48 variables are potential predictive features. A complete list of these variables can be found at this link.
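The count of 48 potential predictors is simply the 54 columns minus the ID, the four outcomes, and the filter variable. A small sketch of that bookkeeping is below. The column names are taken from the structure shown above, while all_columns is a toy stand-in for names(recidivism) so that the snippet runs on its own.

```r
# Columns that are not predictors (names taken from the structure shown above)
non_predictors <- c('ID',
                    'Recidivism_Within_3years',
                    'Recidivism_Arrest_Year1',
                    'Recidivism_Arrest_Year2',
                    'Recidivism_Arrest_Year3',
                    'Training_Sample')

# Toy stand-in for names(recidivism): two predictors plus the six columns above
all_columns <- c('Gender', 'Race', non_predictors)

# Everything that is not an ID, outcome, or filter column is a candidate predictor
predictors <- setdiff(all_columns, non_predictors)
predictors

# The same bookkeeping for the full dataset with 54 columns
54 - length(non_predictors)   # 48
```

With the data loaded, replacing all_columns with names(recidivism) should give the full list of 48 candidate predictors.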
We will work on developing a model to predict the outcome variable Recidivism_Arrest_Year2 using the 48 potential predictive variables. Before moving forward, we must remove the individuals who had already recidivated in Year 1. As you can see below, about 29.9% of the individuals recidivated in Year 1. I am removing these individuals from the dataset.
table(recidivism$Recidivism_Arrest_Year1)
FALSE TRUE
18111 7724
recidivism2 <- recidivism[recidivism$Recidivism_Arrest_Year1 == FALSE,]
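The 29.9% figure follows directly from the counts in the table. As a quick check, the sketch below recomputes the percentages from those two counts, so it runs without the data file. The object names year1_counts and year1_pct are hypothetical.

```r
# Counts copied from the table() output above
year1_counts <- c('FALSE' = 18111, 'TRUE' = 7724)

# Percentage of individuals in each group, rounded to one decimal place
year1_pct <- round(100 * year1_counts / sum(year1_counts), 1)
year1_pct

# With the data loaded, the same proportions could be obtained with
# prop.table(table(recidivism$Recidivism_Arrest_Year1))
```

The two counts add up to 25,835, the number of rows in the original dataset, and the 18,111 individuals who did not recidivate in Year 1 are the ones kept in recidivism2.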
I will also recode some variables before saving the new dataset for later use in class.
require(dplyr)
# Dependents
recidivism2$Dependents <- recode(recidivism2$Dependents,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Prior Arrest Episodes Felony
recidivism2$Prior_Arrest_Episodes_Felony <- recode(recidivism2$Prior_Arrest_Episodes_Felony,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5'=5,
'6'=6,
'7'=7,
'8'=8,
'9'=9,
'10 or more'=10)
# Prior Arrest Episodes Misd
recidivism2$Prior_Arrest_Episodes_Misd <- recode(recidivism2$Prior_Arrest_Episodes_Misd,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5'=5,
'6 or more'=6)
# Prior Arrest Episodes Violent
recidivism2$Prior_Arrest_Episodes_Violent <- recode(recidivism2$Prior_Arrest_Episodes_Violent,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Prior Arrest Episodes Property
recidivism2$Prior_Arrest_Episodes_Property <- recode(recidivism2$Prior_Arrest_Episodes_Property,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5 or more'=5)
# Prior Arrest Episodes Drug
recidivism2$Prior_Arrest_Episodes_Drug <- recode(recidivism2$Prior_Arrest_Episodes_Drug,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5 or more'=5)
# Prior Arrest Episodes PPViolationCharges
recidivism2$Prior_Arrest_Episodes_PPViolationCharges <- recode(recidivism2$Prior_Arrest_Episodes_PPViolationCharges,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5 or more'=5)
# Prior Conviction Episodes Felony
recidivism2$Prior_Conviction_Episodes_Felony <- recode(recidivism2$Prior_Conviction_Episodes_Felony,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Prior Conviction Episodes Misd
recidivism2$Prior_Conviction_Episodes_Misd <- recode(recidivism2$Prior_Conviction_Episodes_Misd,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4 or more'=4)
# Prior Conviction Episodes Prop
recidivism2$Prior_Conviction_Episodes_Prop <- recode(recidivism2$Prior_Conviction_Episodes_Prop,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Prior Conviction Episodes Drug
recidivism2$Prior_Conviction_Episodes_Drug <- recode(recidivism2$Prior_Conviction_Episodes_Drug,
'0'=0,
'1'=1,
'2 or more'=2)
# Delinquency Reports
recidivism2$Delinquency_Reports <- recode(recidivism2$Delinquency_Reports,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4 or more'=4)
# Program Attendances
recidivism2$Program_Attendances <- recode(recidivism2$Program_Attendances,
'0'=0,
'1'=1,
'2'=2,
'3'=3,
'4'=4,
'5'=5,
'6'=6,
'7'=7,
'8'=8,
'9'=9,
'10 or more'=10)
# Program Unexcused Absences
recidivism2$Program_UnexcusedAbsences <- recode(recidivism2$Program_UnexcusedAbsences,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
# Residence Changes
recidivism2$Residence_Changes <- recode(recidivism2$Residence_Changes,
'0'=0,
'1'=1,
'2'=2,
'3 or more'=3)
#############################################################
str(recidivism2)
'data.frame': 18111 obs. of 54 variables:
$ ID : int 1 2 3 4 6 7 8 11 13 15 ...
$ Gender : chr "M" "M" "M" "M" ...
$ Race : chr "BLACK" "BLACK" "BLACK" "WHITE" ...
$ Age_at_Release : chr "43-47" "33-37" "48 or older" "38-42" ...
$ Residence_PUMA : int 16 16 24 16 17 18 16 5 18 5 ...
$ Gang_Affiliated : num 0 0 0 0 0 0 0 0 0 0 ...
$ Supervision_Risk_Score_First : int 3 6 7 7 5 2 5 3 3 7 ...
$ Supervision_Level_First : chr "Standard" "Specialized" "High" "High" ...
$ Education_Level : chr "At least some college" "Less than HS diploma" "At least some college" "Less than HS diploma" ...
$ Dependents : num 3 1 3 1 0 2 3 1 1 1 ...
$ Prison_Offense : chr "Drug" "Violent/Non-Sex" "Drug" "Property" ...
$ Prison_Years : chr "More than 3 years" "More than 3 years" "1-2 years" "1-2 years" ...
$ Prior_Arrest_Episodes_Felony : num 6 7 6 8 4 10 6 3 8 9 ...
$ Prior_Arrest_Episodes_Misd : num 6 6 6 6 0 6 6 6 4 3 ...
$ Prior_Arrest_Episodes_Violent : num 1 3 3 0 1 1 3 2 0 2 ...
$ Prior_Arrest_Episodes_Property : num 3 0 2 3 3 5 1 1 5 2 ...
$ Prior_Arrest_Episodes_Drug : num 3 3 2 3 0 1 2 1 2 4 ...
$ Prior_Arrest_Episodes_PPViolationCharges : num 4 5 5 3 0 5 5 3 1 4 ...
$ Prior_Arrest_Episodes_DVCharges : num 0 1 1 0 0 0 0 1 0 0 ...
$ Prior_Arrest_Episodes_GunCharges : num 0 0 0 0 0 1 0 0 0 1 ...
$ Prior_Conviction_Episodes_Felony : num 3 3 3 3 1 3 1 0 1 3 ...
$ Prior_Conviction_Episodes_Misd : num 3 4 2 4 0 1 4 3 0 2 ...
$ Prior_Conviction_Episodes_Viol : num 0 1 1 0 0 0 1 0 0 1 ...
$ Prior_Conviction_Episodes_Prop : num 2 0 1 3 2 3 0 0 2 1 ...
$ Prior_Conviction_Episodes_Drug : num 2 2 2 2 0 0 2 0 1 1 ...
$ Prior_Conviction_Episodes_PPViolationCharges : num 0 1 0 0 0 1 1 1 0 1 ...
$ Prior_Conviction_Episodes_DomesticViolenceCharges: num 0 1 1 0 0 0 0 0 0 0 ...
$ Prior_Conviction_Episodes_GunCharges : num 0 1 0 0 0 1 0 0 0 0 ...
$ Prior_Revocations_Parole : num 0 0 0 0 0 0 0 1 0 0 ...
$ Prior_Revocations_Probation : num 0 0 0 1 0 0 0 0 0 0 ...
$ Condition_MH_SA : num 1 0 1 1 0 0 0 1 0 1 ...
$ Condition_Cog_Ed : num 1 0 1 1 0 0 1 1 0 1 ...
$ Condition_Other : num 0 0 0 0 1 0 0 0 0 1 ...
$ Violations_ElectronicMonitoring : num 0 0 0 0 0 0 0 1 0 0 ...
$ Violations_Instruction : num 0 1 1 0 0 0 0 1 0 0 ...
$ Violations_FailToReport : num 0 0 0 0 0 0 0 0 0 0 ...
$ Violations_MoveWithoutPermission : num 0 0 1 0 1 0 0 0 0 0 ...
$ Delinquency_Reports : num 0 4 4 0 0 0 0 0 0 0 ...
$ Program_Attendances : num 6 0 6 6 0 0 0 9 0 6 ...
$ Program_UnexcusedAbsences : num 0 0 0 0 0 0 0 2 0 0 ...
$ Residence_Changes : num 2 2 0 3 3 1 0 2 1 1 ...
$ Avg_Days_per_DrugTest : num 612 35.7 93.7 25.4 474.6 ...
$ DrugTests_THC_Positive : num 0 0 0.333 0 0 ...
$ DrugTests_Cocaine_Positive : num 0 0 0 0 0 0 0 0 0 0 ...
$ DrugTests_Meth_Positive : num 0 0 0.167 0 0 ...
$ DrugTests_Other_Positive : num 0 0 0 0 0 ...
$ Percent_Days_Employed : num 0.489 0.425 0 1 0.674 ...
$ Jobs_Per_Year : num 0.448 2 0 0.719 0.308 ...
$ Employment_Exempt : num 0 0 0 0 0 0 0 1 0 1 ...
$ Recidivism_Within_3years : num 0 1 1 0 0 1 0 1 0 0 ...
$ Recidivism_Arrest_Year1 : num 0 0 0 0 0 0 0 0 0 0 ...
$ Recidivism_Arrest_Year2 : num 0 0 1 0 0 0 0 1 0 0 ...
$ Recidivism_Arrest_Year3 : num 0 1 0 0 0 1 0 0 0 0 ...
$ Training_Sample : int 1 1 1 1 0 1 0 1 1 0 ...
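The output above shows both the count variables and the logical variables stored as numeric values. The recode() calls shown earlier cover the count variables, but the step that turned the logical columns into 0/1 is not shown, so the sketch below is only one assumed way to arrive at the same structure. It uses base R on a toy data frame, and the helper names to_count and logical_to_numeric are hypothetical.

```r
# Strip a trailing " or more" and convert to numeric,
# e.g. "3 or more" becomes 3 (top-coded, as in the recode() calls above)
to_count <- function(x) {
  as.numeric(sub(' or more$', '', x))
}

# Convert every logical column of a data frame to 0/1, leaving others untouched
logical_to_numeric <- function(df) {
  is_lgl <- vapply(df, is.logical, logical(1))
  df[is_lgl] <- lapply(df[is_lgl], as.numeric)
  df
}

# Toy stand-in for a few columns of recidivism2
toy <- data.frame(
  Dependents      = c('0', '1', '3 or more'),
  Gang_Affiliated = c(FALSE, TRUE, NA),
  Gender          = c('M', 'F', 'M'),
  stringsAsFactors = FALSE
)

toy$Dependents <- to_count(toy$Dependents)
toy <- logical_to_numeric(toy)
str(toy)
```

Unlike the explicit recode() calls, to_count does not list the expected categories, so any unexpected label would silently become NA (with a coercion warning). The explicit version is longer but makes the allowed values visible.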
Now, we export the final version of the dataset.
In future weeks, we will work with this version of the dataset.
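The export code itself is not shown here. A minimal sketch of how it might look is below, assuming a CSV file is the target format. The file name is hypothetical, and a toy data frame written to a temporary folder stands in for recidivism2 so that the snippet runs on its own.

```r
# Toy stand-in for recidivism2 so the sketch is self-contained
recidivism2_toy <- data.frame(ID = c(1, 2),
                              Recidivism_Arrest_Year2 = c(0, 1))

# Hypothetical file name; a temporary folder is used to avoid cluttering the working directory
out_file <- file.path(tempdir(), 'recidivism_y1_removed.csv')

# row.names = FALSE keeps R's row numbers out of the exported file
write.csv(recidivism2_toy, out_file, row.names = FALSE)

# Read the file back to confirm that it round-trips
check <- read.csv(out_file)
str(check)
```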
You will need to install the reticulate package and the sentence_transformers module for the following weeks. You can run the following code on your computer to get prepared. Note that you only have to run this code once to install the necessary packages.
If you have trouble installing these packages on your computer, I highly recommend using a Kaggle R notebook, in which these packages are already installed (I will give more information about this in class).
# Install the reticulate package
install.packages(pkgs = 'reticulate',
dependencies = TRUE)
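# Install Miniconda (the function comes from the reticulate package,
# which has not been loaded yet at this point)
reticulate::install_miniconda()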
Once you install the reticulate package, run the following code to list your Python environments and make sure everything is properly installed.
# Load the reticulate package
require(reticulate)
conda_list()
  name         python
1 Anaconda3    C:\\Users\\cengiz\\Anaconda3\\python.exe
2 st           C:\\Users\\cengiz\\Anaconda3\\envs\\st\\python.exe
3 Anacond3     C:\\Users\\cengiz\\AppData\\Local\\r-miniconda\\envs\\Anacond3\\python.exe
4 openai       C:\\Users\\cengiz\\AppData\\Local\\r-miniconda\\envs\\openai\\python.exe
5 r-reticulate C:\\Users\\cengiz\\AppData\\Local\\r-miniconda\\envs\\r-reticulate\\python.exe
You should see r-reticulate under the name column as one of your virtual Python environments. Finally, you will also need to install the sentence transformers module. The following code will install it in the virtual Python environment r-reticulate.
# Install the sentence transformer module
use_condaenv('r-reticulate')
conda_install(envname = 'r-reticulate',
packages = 'sentence_transformers',
pip = TRUE)
[1] "sentence_transformers"
# try pip=FALSE, if it gives an error message
Once you install the Python packages using the code above, you can run the following code. If you see the same output as below, you should be all set to explore some very exciting NLP tools using the Readability dataset.
require(reticulate)
# Import the sentence transformer module
reticulate::import('sentence_transformers')
Module(sentence_transformers)
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".