Applying unsupervised machine learning methods to publicly available data to gain insights into schools and districts across a state. Programmed in R.
In this guide, you will be able to apply two unsupervised machine learning methods–specifically, Principal Component Analysis (PCA) and K-Means Clustering–to publicly available state report card data to identify schools that that are similar to each other in terms of student enrollment, faculty and staff characteristics, programs, spending and funding, and other school indicators.
School leaders often use peer groups for comparative analyses and benchmarking. Identifying appropriate peer schools with similar settings and student bodies can be a challenging and time consuming process. The purpose of this guide is to present a data-driven approach to identifying school peer groups. We use PCA and K-Means Clustering algorithms together to identify groups of like-schools and provide an example of how this information can be used to benchmark school proficiency scores.
This guide utilizes publicly available data from the Kentucky school report card for the 2018/19 school year. Schools and districts in Kentucky can use this code to analyze data from earlier report cards from the Kentucky Department of Education. Additionally, code from this guide can be adapted to analyze data found on other state or district report cards.
Once you have identified analyses that you want to try to replicate or modify, click the “Download” buttons to download R code and sample data. You can make changes to the charts using the code and sample data, or modify the code to work with your own data. If you are familiar with GitHub, you can click “Go to Repository” and clone the entire repository to your own computer.
Go to the Participate page to read about more ways to engage with the OpenSDP community or reach out for assistance in adapting this code for your specific context.
To complete this tutorial, you will need R, R Studio, and the following R packages installed on your machine:
tidyverse
: For convenient data and output manipulationbroom
: To extract model details in a tidy data framefactoextra
: To visualize PCA results easilysimputation
: To impute missing dataTo install packages, such as broom
, run the following command in the R console:
install.packages("broom")
In addition, this guide will draw from OpenSDP-written functions defined in the functions.R
document, which is located in the R
folder of this guide’s GitHub repository. Please make sure to have downloaded the entire GitHub repository to run this code.
After you installed your R packages and downloaded this guide’s GitHub repository, run the chunk of code below to load them onto your computer.
# Load packages
library(tidyverse)
library(broom)
library(factoextra)
library(simputation)
# Read in some R functions that are convenience wrappers
source("../R/functions.R")
This guide utilizes publicly available data from the Kentucky School Report Card for the 2018/19 school year. All data is at the school-level with the exception of financial information (dist_seek_funding
, dist_building_funding
), which is at the district level. Assessment and accountability data (prof_dist_rd
, prof_dist_ma
) are only used in the supplemental analysis at the end of the guide. We chose to exclude assessment and accountability data from the PCA and K-Means Clustering analyses because we wanted to focus the analyses on school characteristics and resources. Individuals can add assessment and accountability data to their model as long as the features are numeric–i.e., school performance levels must be re-coded as a numeric variable. The code we used to process the report card data files is stored in the R
folder as clean-src-data.R
.
Below is a list of variables and descriptions used in the analyses:
Variable Name | Variable Description |
---|---|
state_sch_id |
School ID number |
sch_name |
School name |
level |
Indicator of school level (“ES”, “MS”, “HS”) |
stn_membership |
Number of students enrolled |
stn_white_pct |
Percent of white students enrolled |
stn_male_pct |
Percent of male students enrolled |
stn_attendance_pct |
Student attendance rate |
stn_chronic_absence_pct |
Percent of chronic absent students enrolled |
stn_safety_pct |
Percent of students enrolled with behavior event |
stn_ell_pct |
Percent of English Language Learners enrolled |
stn_frpl_pct |
Percent of economic disadvantage students enrolled |
stn_gifted_pct |
Percent of gifted students enrolled |
stn_homeless_pct |
Percent of homeless students enrolled |
stn_iep_pct |
Percent of special education students enrolled |
stn_migrant_pct |
Percent of migrant students enrolled |
tchr_experience_avg |
Average number of years experience by teachers |
tchr_new_pct |
Percent of new teachers |
tchr_ma_plus_pct |
Percent of teachers with advance degree (masters +) |
tchr_national_board_pct |
Percent of National Board Certified teachers |
tchr_turnover_pct |
Teacher turnover rate |
tchr_waivers_pct |
Percent of teachers with certification waivers |
tell_students |
Average rating on Tell Survey (Student Conduct) |
tell_community |
Average rating on Tell Survey (Community Engagement) |
tell_leadership |
Average rating on Tell Survey (School Leadership) |
dist_seek_funding |
Per-pupil state funding amount for district |
dist_building_funding |
Total building funds for district |
title1_status |
Indicator of Title 1 status |
stuent_teacher_ratio |
Student-to-teacher ratio |
prof_dist_rd |
Percent of students scoring Proficient in reading |
prof_dist_ma |
Percent of students scoring Proficient in math |
Observations: 725
Variables: 30
$ state_sch_id <chr> "001001016", "001001020", "002005060", "002...
$ sch_name <chr> "Adair County Elementary School", "Adair Co...
$ level <chr> "ES", "ES", "ES", "ES", "ES", "ES", "ES", "...
$ stn_membership <dbl> 624, 538, 707, 909, 374, 417, 530, 418, 237...
$ stn_white_pct <dbl> 0.8910256, 0.8791822, 0.9108911, 0.9130913,...
$ stn_male_pct <dbl> 0.5064103, 0.5501859, 0.4809052, 0.5258526,...
$ stn_attendance_pct <dbl> 0.950, 0.940, 0.952, 0.943, 0.968, 0.951, 0...
$ stn_chronic_absence_pct <dbl> 0.13782051, 0.20446097, 0.13437058, 0.16941...
$ stn_safety_pct <dbl> 0.060897436, 0.022304833, 0.154172560, 0.09...
$ stn_ell_pct <dbl> 0.027243590, 0.037174721, 0.009900990, 0.02...
$ stn_frpl_pct <dbl> 0.75160256, 0.76951673, 0.66336634, 0.70187...
$ stn_gifted_pct <dbl> 0.10416667, 0.10594796, 0.02404526, 0.05500...
$ stn_homeless_pct <dbl> 0.001602564, 0.020446097, 0.029702970, 0.01...
$ stn_iep_pct <dbl> 0.12820513, 0.23977695, 0.15417256, 0.23322...
$ stn_migrant_pct <dbl> 0.019230769, 0.037174721, 0.002828854, 0.00...
$ tchr_experience_avg <dbl> 13.3, 15.9, 11.9, 12.0, 15.9, 14.2, 11.4, 1...
$ tchr_new_pct <dbl> 9.1, 2.7, 5.9, 1.9, 5.3, 0.0, 4.0, 0.0, 0.0...
$ tchr_ma_plus_pct <dbl> 0.727, 0.946, 0.647, 0.808, 0.921, 0.904, 0...
$ tchr_national_board_pct <dbl> 0.09090909, 0.16216216, 0.20588235, 0.05769...
$ tchr_turnover_pct <dbl> 0.24242424, 0.16216216, 0.02941176, 0.07692...
$ tchr_waivers_pct <dbl> 0.00000000, 0.00000000, 0.02941176, 0.03846...
$ tell_students <dbl> 0.893, 0.803, 0.883, 0.938, 0.933, 0.886, 0...
$ tell_community <dbl> 0.892, 0.887, 0.877, 0.819, 0.997, 0.946, 0...
$ tell_leadership <dbl> 0.949, 0.784, 0.836, 0.884, 0.946, 0.912, 0...
$ dist_seek_funding <dbl> 5249, 5249, 5071, 5071, 1857, 3901, 3901, 3...
$ dist_building_funding <dbl> 993715, 993715, 2258788, 2258788, 242847, 3...
$ title1_status <dbl> 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
$ student_teacher_ratio <dbl> 0.06666667, 0.06250000, 0.05882353, 0.07142...
$ prof_rd <dbl> 52.1, NA, 58.4, 44.3, 89.7, 53.1, 56.5, 43....
$ prof_ma <dbl> 51.4, NA, 52.0, 44.3, 87.7, 45.0, 55.1, 38....
This guide is an open-source document hosted on GitHub and generated using R Markdown. We welcome feedback, corrections, additions, and updates. Please visit the OpenSDP participate repository to read our contributor guidelines.
Purpose: This analysis examines the structure of the dataset and illustrates how PCA can be used to reduce the number of variables in a dataset, while losing as little information as possible.
Required Analysis File Variables:
state_sch_id
stn_membership
stn_white_pct
stn_male_pct
stn_attendance_pct
stn_chronic_absence_pct
stn_safety_pct
stn_ell_pct
stn_frpl_pct
stn_gifted_pct
stn_homeless_pct
stn_iep_pct
stn_migrant_pct
tchr_experience_avg
tchr_new_pct
tchr_ma_plus_pct
tchr_national_board_pct
tchr_turnover_pct
tchr_waivers_pct
tell_students
tell_community
tell_leadership
dist_seek_funding
dist_building_funding
title1_status
stuent_teacher_ratio
A Note on Missing Data
Determining how to address missing data is an important decision for many data analysts. Depending on the amount of missing data and whether data is missing completely at random, missing at random, and missing not at random, different strategies can be applied to a PCA on an incomplete data set. Additionally, PCA algorithms in R often differ in how they treat missing data in their default settings. For example, the prcomp
function excludes data with missing data by default (see ?prcomp
for additional information). In this analysis, we apply a simple imputation method using the R package simputation
as an example of how to estimate missing data using a model-based approach. We encourage you to read more on the topics of missing data with PCA and different approaching to imputing missing data.
We listed a few helpful resources on PCA and missing data at the end of this guide.
Ask Yourself
# // Step 1: Identify which variables are missing
src_data %>%
select(-starts_with("prof")) %>%
gather(variable, value) %>%
group_by(variable) %>%
summarise(missing = calc_pct_missing(value)) %>%
arrange(-missing)
# A tibble: 28 x 2
variable missing
<chr> <dbl>
1 tell_community 0.0207
2 tell_leadership 0.0207
3 tell_students 0.0207
4 tchr_turnover_pct 0.00690
5 tchr_national_board_pct 0.00138
6 dist_building_funding 0
7 dist_seek_funding 0
8 level 0
9 sch_name 0
10 state_sch_id 0
# ... with 18 more rows
# // Step 2: Impute missing values
# We selected three school features, as an example, to estimate the missing data.
# We likely would have taken a different approach if more schools had missing data.
src_data_complete <- src_data %>%
impute_lm(tell_community ~ stn_membership + stn_white_pct + stn_frpl_pct) %>%
impute_lm(tell_leadership ~ stn_membership + stn_white_pct + stn_frpl_pct) %>%
impute_lm(tell_students ~ stn_membership + stn_white_pct + stn_frpl_pct) %>%
impute_lm(tchr_turnover_pct ~ stn_membership + stn_white_pct + stn_frpl_pct) %>%
impute_lm(tchr_national_board_pct ~ stn_membership + stn_white_pct + stn_frpl_pct)
# Check -> Good!
src_data_complete %>%
select(-starts_with("prof")) %>%
gather(variable, value) %>%
group_by(variable) %>%
summarise(missing = calc_pct_missing(value)) %>%
arrange(-missing)
# A tibble: 28 x 2
variable missing
<chr> <dbl>
1 dist_building_funding 0
2 dist_seek_funding 0
3 level 0
4 sch_name 0
5 state_sch_id 0
6 stn_attendance_pct 0
7 stn_chronic_absence_pct 0
8 stn_ell_pct 0
9 stn_frpl_pct 0
10 stn_gifted_pct 0
# ... with 18 more rows
# // Step 3: Select numeric columns and cast to matrix
src_matrix <- src_data_complete %>%
select(-starts_with("prof")) %>%
select_if(is.numeric) %>%
as.matrix()
# // Step 4: Execuite the PCA algorithm
src_pca <- prcomp(src_matrix,
# It's important that you center and scale your data.
# The centering defualt is TRUE. I'm being explicit for demonstration.
center = TRUE,
# Switch to TRUE to scale data
scale = TRUE)
It can be informative to see which factors are contributing the most in the model. If you see variables that contribute very little to both principal components 1 or 2, try repeating the PCA (Steps 3 and 4) without those feature. We observed three variables that were contributing little to model (stn_migrant_pct
, stn_gifted_pct
, student_teacher_ratio
) and decided to drop them from the model as an example of what to do in this situation.
Note: we recommend that you provide a rational for any decision you make to exclude variables from your analysis. Failure to do so may come across to the public that you are manipulating the model. See resources on PCA at the end of this guide for additional information on this topic.
# // Step 5: Plot
# Plot variables contributing to principal component 1
fviz_pca_contrib(src_pca, choice = "var", axes = 1)
# Plot variables contributing to principal component 1
fviz_pca_contrib(src_pca, choice = "var", axes = 2)
# // Step 6: Drop variables contributing little to model and repeate PCA
# List variables we want to drop
drop_vars <- c("stn_migrant_pct", "stn_gifted_pct", "student_teacher_ratio")
# Repeate PCA (condensed code)
src_pca_revised <- src_data_complete %>%
select(-starts_with("prof")) %>%
select_if(is.numeric) %>%
# drop vars here
select(-drop_vars) %>%
as.matrix() %>%
prcomp(center = TRUE, scale = TRUE)
Purpose: Using information from the PCA, this analysis shows how K-Means Clustering algorithms can be used to form groups form groups of comparable schools.
Required Analysis File Variables:
Ask Yourself
# // Step 1: Extract principal components 1 and 2
pca_coords <- augment(src_pca_revised) %>%
select(.rownames, .fittedPC1, .fittedPC2) %>%
mutate(.fittedPC1 = .fittedPC1 * -1)
# // Step 2: Set seed for reproducability
set.seed(2468)
# // Step 3: Apply k-means algorthm
src_km <- kmeans(pca_coords %>% select(-.rownames),
# Set to the number of groups you want.
centers = 30)
# // Step 4: Examine the number of schools in each group
src_km$size
[1] 50 24 30 16 26 29 38 17 17 31 27 31 14 14 25 22 32 36 32 21 24 17 19 14 32
[26] 12 33 12 9 21
# // Step 5: Write a CSV with schools groups
# Add clusters to original dataset
sch_clusters <- src_data %>%
# Add clusters
mutate(cluster = factor(src_km$cluster)) %>%
# Add PCAs
mutate(pc1 = pca_coords$.fittedPC1,
pc2 = pca_coords$.fittedPC2)
# Write file
# Uncomment line below to overwrite file provided in this guide.
# write_csv(sch_clusters, "../data/sch_clusters.csv")
Lastly, examine your group using descriptive statistics and different visualizations (see below as an example). Ask yourself if the groupings make sense. If they do, try rerunning your K-Means algorithm to form new school groupsings. Don’t be surprised if it takes you a few tries to get school groups that are appropriate in size and makeup. It’s an iterative process.
Other advice. Conduct a follow-up analysis using prior year data. If you see that your school groups differ in meaningful ways from those in your original analysis, consider aggregating your data across years and use this information in your analysis. Using multiple years of data has the advantage of evening out periodic swings in the data.
# // Step 6: Plot
ggplot(sch_clusters,
aes(pc1, pc2, color=cluster)) +
geom_point(show.legend = FALSE) +
theme_bw() +
theme(panel.grid = element_blank(),
axis.ticks = element_blank(),
plot.title.position = "plot") +
labs(x = "PC1", y = "PC2",
title = "School groups by first and second principal components",
subtitle = "Number of groups: 30")
Purpose: This analyses examines proficiency rates in reading among schools within a comparison groups. It is an example application of a dataset containing a list of comparison schools.
Required Analysis File Variables:
Ask Yourself
# \\ Step 1: Read csv file of clustered dataset, naming it "src_clusters"
# Note: You can continue to use the dataset from Step 5 of the K-Mean analysis. It's
# the same dataset. We load a new dataset for illustrative purposes.
sch_clusters <- read_csv("../data/sch_clusters.csv")
# \\ Step 2: Plot
# Set school of interest
my_school <- "034165052"
# Plot
sch_clusters %>%
# Flag school of note
mutate(flag = ifelse(state_sch_id == my_school, 1, 0)) %>%
# Restrict to cluster containing school of note
filter(cluster == 3) %>%
# Shorten school names
mutate(sch_name = str_replace(sch_name, "Elementary School", "ES"),
sch_name = str_replace(sch_name, "Elementary", "ES"),
sch_name = str_replace(sch_name, "School", "ES"),
sch_name = str_replace(sch_name, "Traditional", "ES")) %>%
ggplot(aes(reorder(sch_name, prof_rd), prof_rd)) +
geom_col(aes(fill = factor(flag))) +
geom_text(aes(label = paste0(prof_rd, "%")), nudge_y = 5) +
scale_fill_brewer() +
coord_flip() +
theme_minimal() +
theme(panel.grid = element_blank(),
axis.text.x = element_blank(),
legend.position = "none",
plot.title.position = "plot") +
labs(x = "", y = "",
title = "Rosa Parks Elementary School is performing at the top of its peer group",
subtitle = "Percent of students scoring proficient in reading, 2018/19",
caption = "Data: KY School Report Card")
Below are a few helpful resources on PCA, K-Means clustering, and missing data imputation.
FactoMineR
: For dimension reduction methods such as PCAfactoextra
: To visualize PCA results easilyFactoShiny
: Shiny app for PCA