A step-by-step guide to implementing a beating-the-odds (BTO) analysis using a multilevel framework. Programmed in R.
In this guide, you will use statistical models to predict school performance based on the demographic makeup of schools’ student populations and compare these predictions with actual school performance.
School leaders often want to identify promising practices that distinguish high-performing schools from their counterparts and facilitate the transfer of some of these practices to struggling schools. A BTO analysis is one approach school leaders can take to identify schools that perform better or worse than expected, given the unique student populations they serve. In general, BTO analyses predict school performance based on the demographic makeup of schools’ student populations and then compare these predictions with actual school performance. Schools with observed performance that is statistically significantly greater than their predicted performance are typically considered to be performing better than expected, or beating the odds. Schools with observed performance that is statistically significantly less than their predicted performance are considered to be performing worse than expected.
There are a number of examples of BTO analyses being used to identify schools exceeding expectations in achievement gap closure, reading, English language arts, math, graduation rate, and state-determined performance measures. Many of these analyses focused on specific educational contexts, such as rural districts, high-poverty high schools, and charter schools. The approach can guide decision-making by giving leadership objective information about schools that may warrant a closer look, whether because of promising results or out of concern.
State | Performance Metric | Population |
---|---|---|
Colorado | Achievement gap closure | Rural districts |
Georgia | College and career readiness | All K-12 schools |
Florida | Grade 3 reading | Public elementary schools |
Michigan | State-defined performance measures | All K-12 schools |
Mississippi | English language arts; math | Grade 3-8 public schools |
Nebraska | Achievement gap closure | Rural districts |
Puerto Rico | Graduation rate; reading; math | High poverty high schools |
South Carolina | English language arts; math | Charter schools |
The purpose of this guide is to present a data-driven approach to identify BTO schools. We use multilevel models to predict student performance and to estimate school-level effects on that performance. Next, we compare each school’s predicted performance to its actual performance. A school is identified as beating the odds if its actual performance is higher than predicted by a statistically significant margin, and as performing worse than expected if its actual performance is lower than predicted by a statistically significant margin. The procedures presented in this guide are based on those used in a collaborative study by the Kentucky Department of Education and REL Appalachia.
The analysis in this guide draws on SDP’s “Faketucky”, a synthetic dataset based on real student data. While these data are synthetic, the code is not. Schools and districts wanting to conduct their own BTO analysis can adapt the code provided in this guide to their own data.
To replicate or modify the analysis described in this guide, click the “Download” buttons to download R code and sample data. You can make changes to the charts using the code and sample data, or modify the code to work with your own data. If you are familiar with GitHub, you can click “Go to Repository” and clone the entire repository to your own computer.
We encourage you to go to our Participate page to read about more ways to engage with the OpenSDP community or reach out for assistance in adapting this code for your specific context.
To complete this tutorial, you will need R, RStudio, and the following R packages installed on your machine:
tidyverse: For convenient data and output manipulation
lme4: To fit multilevel models
merTools: To extract expected ranks from fitted models
glue: To format and interpolate strings
To install packages, such as lme4, run the following command in the R console:
install.packages("lme4")
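If you would rather check which of the packages listed above are already installed before installing anything, a minimal sketch follows. The helper name find_missing_packages is ours, for illustration only, and is not part of the guide's repository.

```r
# Packages this guide relies on
required_pkgs <- c("tidyverse", "lme4", "merTools", "glue")

# Hypothetical helper: return the packages that are not yet installed
find_missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_pkgs <- find_missing_packages(required_pkgs)

if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  # Print the command to run rather than installing automatically
  message("Run: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
}
```

The sketch only reports what is missing, so you stay in control of when packages are downloaded.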
In addition, we use a custom ggplot theme, sdp_theme(), to apply an OpenSDP style to text size, font and color, axis lines, axis text, and other standard chart components. Custom themes can help you establish a professional brand for your data visualizations. Organizations such as The Urban Institute and BBC News make use of custom ggplot themes to create publication-ready charts that are consistent with their organization’s branding. While you may not have a dedicated marketing team, making an effort to match chart style and formatting to your district’s brand can help your results stand out to stakeholders.
After installing your R packages and downloading this guide’s GitHub repository, run the chunk of code below to load your packages and custom theme onto your computer.
# Load packages
# Note: We do not load the `merTools` package because it masks
# the `select` function from `dplyr` (found in `tidyverse`).
library(tidyverse)
library(lme4)
library(glue)
# Set custom ggplot2 theme for BTO guide
sdp_theme <- function() {
theme_minimal() +
theme(
panel.grid = element_blank(),
plot.title = element_text(size = 16, face = "bold"),
plot.title.position = "plot",
plot.subtitle = element_text(size = 14),
axis.title = element_text(size = 12),
axis.text = element_text(size = 12),
legend.position = "top",
legend.text = element_text(size = 12),
strip.text = element_text(size = 12),
strip.background = element_rect(fill = "gray80", color = "gray80")
)
}
This guide uses SDP’s “Faketucky” dataset. The Faketucky synthetic dataset contains high school and college outcome data for two graduating cohorts of approximately 40,000 students. There are no real students in the dataset, but it mirrors the relationships between variables present in real data. The dataset was developed as an offshoot of SDP’s College-Going Diagnostic for Kentucky, using the R synthpop package. In addition, we created school-level aggregates for each variable. The first step of the analysis, “Prepare data for analysis”, describes how we created the school-level variables.
Below is a list of variables and descriptions used in the analyses:
Variable Name | Variable Description |
---|---|
first_dist_code | Code of first district attended in high school |
first_dist_name | Name of first district attended in high school |
first_hs_code | Code of first high school attended |
first_hs_name | Name of first high school attended |
chrt_ninth | Student 9th grade cohort |
male | Student male indicator |
race_ethnicity | Student race/ethnicity |
frpl_ever_in_hs | Student ever received free or reduced price lunch in high school |
sped_ever_in_hs | Student ever classified as special ed in high school |
lep_ever_in_hs | Student ever classified as limited English proficiency in high school |
gifted_ever_in_hs | Student ever classified as gifted in high school |
scale_score_8_math | Scaled score of 8th grade math test |
scale_score_8_read | Scaled score of 8th grade reading test |
scale_score_11_math | Scaled score of highest math ACT |
scale_score_11_read | Scaled score of highest reading ACT |
# Load "Faketucky" data file
load("../data/faketucky.rda")
# Select variables of interest
my_vars <- c("first_dist_code", "first_dist_name",
"first_hs_code", "first_hs_name",
"chrt_ninth", "male", "race_ethnicity",
"frpl_ever_in_hs", "sped_ever_in_hs",
"lep_ever_in_hs", "gifted_ever_in_hs",
"scale_score_8_math", "scale_score_8_read",
"scale_score_11_math", "scale_score_11_read")
faketucky <- faketucky_20160923[my_vars]
This guide is an open-source document hosted on GitHub and generated using R Markdown. We welcome feedback, corrections, additions, and updates. Please visit the OpenSDP participate repository to read our contributor guidelines.
Purpose: This analysis illustrates how to identify schools that perform better or worse than expected, given the unique student populations they serve, using a BTO approach.
Required Analysis File Variables:
first_hs_code
chrt_ninth
male
race_ethnicity
frpl_ever_in_hs
sped_ever_in_hs
lep_ever_in_hs
gifted_ever_in_hs
scale_score_8_math
scale_score_8_read
scale_score_11_math
scale_score_11_read
Analytic Technique: We use multilevel models to predict student performance and to estimate school-level effects on that performance. There are a number of benefits to using a multilevel approach over a more traditional approach like ordinary least squares. In particular, a multilevel approach allows us to account for the hierarchical or nested structure of the data. In this case, student observations and school observations from different years are nested within schools. Additionally, recent BTO studies have used a multilevel approach with success (e.g., Bowers, 2015; Partridge, Rudo, & Herrera, 2017).
We encourage you to check out Gelman and Hill’s book Data Analysis Using Regression and Multilevel/Hierarchical Models if you’re interested in learning more about multilevel models and their applications.
Ask Yourself:
A Note on Missing Data: It is important to determine how you want to address missing data before you begin your analysis. For the purposes of this guide, we chose to exclude students with missing data from the analyses for simplicity. We recommend that you conduct a missing data analysis to determine whether your data are missing completely at random, missing at random, or missing not at random, and then apply the appropriate strategy to address your missing data. Andrew Gelman’s chapter on Missing-data Imputation in R is a great resource to help you think about your options.
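As a starting point for a missing data review, the sketch below counts missing values per variable and the share of complete records by school before applying listwise deletion. The toy data frame and its values are made up for illustration; only the column names mirror the Faketucky variables.

```r
# Toy student records (made-up values) with some missing test scores
students <- data.frame(
  first_hs_code = c("A", "A", "B", "B", "B"),
  scale_score_8_math = c(40, NA, 55, 60, NA),
  scale_score_11_math = c(18, 20, NA, 24, 19)
)

# Count missing values per variable
n_missing <- colSums(is.na(students))

# Share of students with complete data, by school
students$complete <- complete.cases(students)
complete_by_school <- tapply(students$complete, students$first_hs_code, mean)

# Listwise deletion: keep only students with no missing values
students_complete <- students[students$complete, ]

print(n_missing)
print(complete_by_school)
```

If the share of complete records differs widely across schools, that is a hint that the data may not be missing completely at random and that simple exclusion deserves a closer look.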
This BTO analysis uses a multilevel framework that incorporates school-level information into the modeling process. To prepare the analytical dataset, we calculated school averages within each school year. These variables were then centered using the grand mean of each variable. We centered variables to aid in the interpretation of the school intercepts.
Note: We recognize that grand-mean centering may not be an appropriate option for your analysis. We encourage you to explore other centering techniques for multilevel modeling. Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models and Raudenbush and Bryk’s Hierarchical Linear Models provide further details on the different centering techniques as well as the advantages and disadvantages of their use in multilevel modeling.
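To make the difference between centering choices concrete, here is a small sketch that contrasts grand-mean centering with centering within schools. The four toy scores and the new column names are assumptions for illustration and do not come from the Faketucky data.

```r
# Toy 8th grade math scores for two schools (made-up values)
scores <- data.frame(
  first_hs_code = c("A", "A", "B", "B"),
  scale_score_8_math = c(40, 50, 60, 70)
)

# Grand-mean centering: subtract the overall mean from every score
grand_mean <- mean(scores$scale_score_8_math)
scores$math_grand_center <- scores$scale_score_8_math - grand_mean

# Group-mean centering: subtract each school's own mean instead
school_mean <- ave(scores$scale_score_8_math, scores$first_hs_code)
scores$math_school_center <- scores$scale_score_8_math - school_mean

print(scores)
```

Grand-mean centering keeps between-school differences in the centered variable, while group-mean centering removes them, so the two choices change what the school intercepts represent.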
# Calculate school averages by cohort year
sch_avg_by_cohort <- faketucky %>%
# Create indicator for students identifying as white
mutate(race_white = ifelse(race_ethnicity == "White", 1, 0)) %>%
# Group data by high school and cohort year
group_by(first_hs_code, chrt_ninth) %>%
# Calculate school averages
mutate(sch_male = mean(male, na.rm = TRUE),
sch_white = mean(race_white, na.rm = TRUE),
sch_frpl = mean(frpl_ever_in_hs, na.rm = TRUE),
sch_sped = mean(sped_ever_in_hs, na.rm = TRUE),
sch_lep = mean(lep_ever_in_hs, na.rm = TRUE),
sch_gifted = mean(gifted_ever_in_hs, na.rm = TRUE),
sch_8_math = mean(scale_score_8_math, na.rm = TRUE),
sch_8_read = mean(scale_score_8_read, na.rm = TRUE)) %>%
# Remember to ungroup
ungroup()
# Calculate cohort-year averages across all schools
cohort_avg <- sch_avg_by_cohort %>%
# Flag 2010 cohort (for modeling)
mutate(flag_2010_cohort = ifelse(chrt_ninth == 2010, 1, 0)) %>%
# Group by cohort year
group_by(chrt_ninth) %>%
# Center prior achievement test scores
mutate(math_8_center = scale_score_8_math - mean(scale_score_8_math, na.rm = TRUE),
read_8_center = scale_score_8_read - mean(scale_score_8_read, na.rm = TRUE)) %>%
# Calculate averages
mutate(sch_male_mean_year = mean(sch_male, na.rm = TRUE),
sch_white_mean_year = mean(sch_white, na.rm = TRUE),
sch_frpl_mean_year = mean(sch_frpl, na.rm = TRUE),
sch_sped_mean_year = mean(sch_sped, na.rm = TRUE),
sch_lep_mean_year = mean(sch_lep, na.rm = TRUE),
sch_gifted_mean_year = mean(sch_gifted, na.rm = TRUE),
sch_8_math_mean_year = mean(sch_8_math, na.rm = TRUE),
sch_8_read_mean_year = mean(sch_8_read, na.rm = TRUE)) %>%
# Ungroup data frame
ungroup()
# Center school-level variables
sch_avg_center <- cohort_avg %>%
# Subtract cohort-year averages from school averages
mutate(sch_male_center = sch_male - sch_male_mean_year,
sch_white_center = sch_white - sch_white_mean_year,
sch_frpl_center = sch_frpl - sch_frpl_mean_year,
sch_sped_center = sch_sped - sch_sped_mean_year,
sch_lep_center = sch_lep - sch_lep_mean_year,
sch_gifted_center = sch_gifted - sch_gifted_mean_year,
sch_8_math_center = sch_8_math - sch_8_math_mean_year,
sch_8_read_center = sch_8_read - sch_8_read_mean_year)
Multilevel models are a powerful and flexible extension to conventional regression frameworks. This is one of the many reasons why they are so attractive to education researchers. However, this added flexibility can make fitting and interpreting such models a challenge. Here, we present a relatively simple multilevel model that takes into account school-level variation. We encourage you to read more on the topic of multilevel modeling and its application in BTO analyses.
We fit a two-level multilevel model for each subject area outcome of interest – ACT math and reading – using the lmer function in the lme4 package. Specifically, these models use a random intercept framework, which allows the school-level intercept to vary randomly around a cross-school mean. We recommend you read the lme4 Reference Manual and vignette Fitting Linear Mixed-Effects Models Using lme4 before you fit your models. These resources contain a wealth of information.
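In equation form, a random intercept model of this kind can be written roughly as follows. The notation is ours and is meant only as a reading aid: $y_{ij}$ is the ACT score of student $i$ in school $j$, $\mathbf{x}_{ij}$ holds the student-level predictors, $\mathbf{z}_{j}$ holds the centered school-level averages and the cohort flag (the cohort index is suppressed for brevity), and $u_j$ is the school-specific deviation from the cross-school mean intercept $\beta_0$.

```latex
y_{ij} = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}
       + \mathbf{z}_{j}^{\top}\boldsymbol{\gamma} + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \tau^2),
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)
```

The estimated $u_j$ values are the school effects that the BTO analysis goes on to rank.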
# Fit multilevel models for each subject area
m_math <- lmer(
# Define model formula
formula = scale_score_11_math ~
male + race_white + frpl_ever_in_hs + sped_ever_in_hs +
lep_ever_in_hs + gifted_ever_in_hs + math_8_center +
sch_male_center + sch_white_center + sch_frpl_center +
sch_sped_center + sch_lep_center + sch_gifted_center +
sch_8_math_center + flag_2010_cohort + (1|first_hs_code),
# Call dataframe containing the variables named in formula
data = sch_avg_center
)
m_read <- lmer(
formula = scale_score_11_read ~
male + race_white + frpl_ever_in_hs + sped_ever_in_hs +
lep_ever_in_hs + gifted_ever_in_hs + read_8_center +
sch_male_center + sch_white_center + sch_frpl_center +
sch_sped_center + sch_lep_center + sch_gifted_center +
sch_8_read_center + flag_2010_cohort + (1|first_hs_code),
data = sch_avg_center
)
Examine the coefficients and standard errors for each variable. Ask yourself if the estimates are in the range of reasonable possibility. If not, go back and inspect your dataset and make sure there are no errors in processing the data. Also inspect your dataset to make sure that the assumptions of multilevel modeling hold.
For information on other model checking and sensitivity analysis for multilevel models, see Snijders and Berkhof’s chapter on Diagnostic Checks for Multilevel Models.
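The sketch below shows one way to pull out the pieces worth inspecting from a fitted lmer model. To stay self-contained it uses the sleepstudy data that ships with lme4 rather than the Faketucky models, so the variable names (Reaction, Days, Subject) are stand-ins for your own outcome, predictors, and school identifier.

```r
library(lme4)

# Random-intercept model on lme4's built-in sleepstudy data
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Fixed-effect estimates, standard errors, and t values
fixed_tbl <- coef(summary(fit))
print(fixed_tbl)

# Variance components for the group intercepts and the residual
var_tbl <- as.data.frame(VarCorr(fit))
print(var_tbl)

# Estimated group-level deviations (one row per Subject)
group_effects <- ranef(fit)$Subject
head(group_effects)

# Quick residual check: fitted values versus residuals
plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
```

A fan shape or strong curvature in the residual plot would suggest that the constant-variance or linearity assumptions need a closer look.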
Below we print a summary of the math model for illustrative purposes.
# Call summary statistics for the math model
summary(m_math)
Linear mixed model fit by REML ['lmerMod']
Formula:
scale_score_11_math ~ male + race_white + frpl_ever_in_hs + sped_ever_in_hs +
lep_ever_in_hs + gifted_ever_in_hs + math_8_center + sch_male_center +
sch_white_center + sch_frpl_center + sch_sped_center + sch_lep_center +
sch_gifted_center + sch_8_math_center + flag_2010_cohort +
(1 | first_hs_code)
Data: sch_avg_center
REML criterion at convergence: 404785.3
Scaled residuals:
Min 1Q Median 3Q Max
-5.8065 -0.6742 -0.1224 0.5558 6.5091
Random effects:
Groups Name Variance Std.Dev.
first_hs_code (Intercept) 0.3799 0.6164
Residual 11.6758 3.4170
Number of obs: 76329, groups: first_hs_code, 378
Fixed effects:
Estimate Std. Error t value
(Intercept) 18.9194811 0.0605949 312.229
male -0.0612752 0.0250516 -2.446
race_white 0.0210941 0.0386175 0.546
frpl_ever_in_hs -0.7038140 0.0286906 -24.531
sped_ever_in_hs -0.7261136 0.0445451 -16.301
lep_ever_in_hs -0.0587592 0.1395022 -0.421
gifted_ever_in_hs 1.7477650 0.0342171 51.079
math_8_center 0.1200613 0.0007308 164.289
sch_male_center -0.4505802 0.3362711 -1.340
sch_white_center -0.6539464 0.2425249 -2.696
sch_frpl_center -1.4640371 0.1938528 -7.552
sch_sped_center -0.6070127 0.4397803 -1.380
sch_lep_center 1.5766275 1.1772262 1.339
sch_gifted_center -0.2540672 0.3349658 -0.758
sch_8_math_center -0.0225188 0.0051242 -4.395
flag_2010_cohort 0.3110568 0.0250484 12.418
# Remove the comment (#) below to print a summary of the reading model
# summary(m_read)
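One quantity that can help when reading the random effects block is the share of the remaining variance that lies between schools. The sketch below computes it from the two variance components printed in the math model summary above. Calling this a conditional intraclass correlation is our framing and not something the guide itself reports.

```r
# Variance components copied from the math model summary above
var_school <- 0.3799     # first_hs_code (Intercept)
var_residual <- 11.6758  # Residual

# Share of unexplained variance attributable to schools
icc <- var_school / (var_school + var_residual)
round(icc, 3)
```

With these values, roughly 3 percent of the variance left after accounting for the predictors lies between schools, which is why the school effects plotted next are small relative to student-level variation.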
Visual diagnostic plots are another way to inspect the quality of your model. Using a helpful plotting function found in the merTools package, we plot the results of a simulation of random effects for each model. Look for variation and confidence bands (dark bars) that do not overlap the red line for zero. Here, we established that a number of school effects are meaningfully different from zero.
Note: We do not load the merTools package because it masks the select function from dplyr (found in tidyverse). Instead, we chose to call individual functions directly from merTools using the :: operator (e.g., merTools::plotREsim). This is a handy workaround when dealing with packages with conflicting functions.
# Plot random effects for the math model
merTools::plotREsim(merTools::REsim(m_math, n.sims = 100), stat = "median", sd = TRUE)
# Remove comment (#) below to plot random effects for the reading model
# merTools::plotREsim(merTools::REsim(m_read, n.sims = 100), stat = "median", sd = TRUE)
We used a measure called “expected rank” to identify BTO schools. Expected rank provides the percentile ranks for the observed groups (i.e., schools) in the random effect distribution taking into account both the magnitude and uncertainty of the estimated effect for each group. Incorporating magnitude and uncertainty in the BTO process is a key advantage of using this technique when assessing the performance of schools with small student populations. Estimates for small schools are more uncertain due to having few student observations. A BTO analysis that relies only on confidence intervals and point estimates biases the results towards small schools with uncertain, but very large positive values. Using expected ranks mitigates these biases.
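To build intuition for why uncertainty matters, here is a simplified sketch of the idea behind expected ranks. It is not the merTools implementation: the function name expected_rank_sketch, the pairwise normal-probability formula, and the toy estimates are all assumptions made for illustration.

```r
# Toy school effects: school A has the largest estimate but is very uncertain
estimate  <- c(A = 1.0, B = 0.8, C = 0.0, D = -0.5)
std_error <- c(A = 1.5, B = 0.2, C = 0.2, D = 0.2)

# Hypothetical helper: for each school, add up the probability that its true
# effect exceeds each other school's, using a normal approximation
expected_rank_sketch <- function(estimate, std_error) {
  diff_mat <- outer(estimate, estimate, "-")
  sd_mat <- sqrt(outer(std_error^2, std_error^2, "+"))
  prob_mat <- pnorm(diff_mat / sd_mat)
  diag(prob_mat) <- 0  # a school is not compared with itself
  1 + rowSums(prob_mat)
}

naive_rank <- rank(estimate)
exp_rank <- expected_rank_sketch(estimate, std_error)

print(rbind(naive_rank, exp_rank))
```

In this toy example school A has the highest point estimate, yet school B receives the higher expected rank because its estimate is far more precise.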
We extracted the expected rank and more reliable confidence intervals using the merTools package’s REsim and expectedRank functions. Next, we identified schools as beating the odds if they were above the 70th percentile of all schools. Conversely, schools were identified as performing worse than expected if they were below the 30th percentile.
ranks_math <- merTools::expectedRank(m_math, groupFctr = "first_hs_code")
ranks_read <- merTools::expectedRank(m_read, groupFctr = "first_hs_code")
calc_bto <- function(.ranks, .var) {
# Input model expected ranks
.ranks %>%
# Flag schools that perform above/below benchmark (i.e., BTOs)
# assume math/read have same cut offs
mutate(bto = ifelse(pctER >= 70 | pctER < 30, "yes", "no")) %>%
# Select and name variables of interest
select(first_hs_code = groupLevel, "estimate_{{ .var }}" := estimate,
"pctER_{{ .var }}" := pctER, "bto_{{ .var }}" := bto)
}
bto_math <- calc_bto(ranks_math, math)
bto_read <- calc_bto(ranks_read, read)
# Merge BTO datasets
bto_read_math <- left_join(bto_read, bto_math, by = "first_hs_code")
# Pull school and district info from original dataset
sch_names <- faketucky %>%
select(first_dist_code, first_hs_code, first_dist_name, first_hs_name) %>%
distinct() %>%
# Convert high school code to factor for merge
mutate(first_hs_code = factor(first_hs_code))
# Merge BTO and school info datasets
sch_bto_data <- left_join(sch_names, bto_read_math, by = "first_hs_code")
It’s always helpful to plot the results of your analysis. We’ve found a nice scatter plot can be an effective visualization for communicating the overall results of the analysis.
# Create scatter plot math/read residuals
sch_bto_data %>%
mutate(bto_type = case_when(
bto_math == "yes" & bto_read == "yes" ~ "Math AND Reading",
bto_math == "yes" | bto_read == "yes" ~ "Math OR Reading",
TRUE ~ "Neither/Not BTO"
)) %>%
ggplot(aes(estimate_math, estimate_read, color = bto_type)) +
geom_point(size = 2, alpha = .6) +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_color_manual(values = c("#E69F00", "#0049E6", "#999999")) +
sdp_theme() +
theme(panel.grid = element_line(color = "grey92")) +
labs(x = "Math Residual", y = "Reading Residual", color = "",
title = "Schools that Performed Better/Worse than Expected in Math and Reading")
Purpose: This analysis examines the distribution of BTO schools in math and reading and establishes performance levels (e.g., small, medium, or large differences between predicted and actual school performance). Results provide an overall summary of a BTO analysis.
Required Analysis File Variables:
first_hs_code
pctER_math (calculated field from BTO analysis)
pctER_read (calculated field from BTO analysis)
Ask Yourself
Analytic Technique: Determine reasonable cut points for performance categories (such as small, medium, or large). Count the number of schools by performance level and plot a matrix to compare school performance across subject areas.
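There are a number of ways to establish performance levels. Examples include referencing academic literature, technical documents, pilot studies, and statistical analyses. For illustrative purposes, we aligned performance levels to expected-rank percentiles where:
Small: 70th to 80th percentile (increase) or 20th to 30th percentile (decrease)
Medium: 80th to 90th percentile (increase) or 10th to 20th percentile (decrease)
Large: 90th percentile and above (increase) or below the 10th percentile (decrease)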
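The percentile-based levels used in this guide's code (small, medium, and large on either side of a middle "Not BTO" band) can also be expressed with base R's cut(), which makes the handling of boundary values explicit. This is an alternative sketch and not the guide's own function: the toy percentiles are made up, and with right = FALSE a value that sits exactly on a cut point falls into the higher band, which can differ from between() at the boundaries.

```r
# Toy expected-rank percentiles (made-up values)
pct_er <- c(5, 15, 25, 50, 75, 85, 95)

# Left-closed bins: [0,10), [10,20), [20,30), [30,70), [70,80), [80,90), [90,100]
perform_text <- cut(
  pct_er,
  breaks = c(0, 10, 20, 30, 70, 80, 90, 100),
  labels = c("Large decrease", "Medium decrease", "Small decrease",
             "Not BTO",
             "Small increase", "Medium increase", "Large increase"),
  right = FALSE,
  include.lowest = TRUE
)

print(data.frame(pct_er, perform_text))
```

Because cut() returns an ordered set of factor levels in the order of the labels, the result can be plotted without a separate releveling step.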
Possible Next Steps or Action Plans: Identify which schools are performing at different levels. Develop an academic plan for schools at each performance level.
# Create function to add performance levels
add_performance_levels <- function(.bto_data, .pctER.subject, .var) {
# Input BTO data
.bto_data %>%
# Define performance levels
mutate(perform_lvl = case_when(
# Positive
between({{ .pctER.subject }}, 70, 80) ~ 1,
between({{ .pctER.subject }}, 80, 90) ~ 2,
{{ .pctER.subject }} >= 90 ~ 3,
# Negative
between({{ .pctER.subject }}, 20, 30) ~ -1,
between({{ .pctER.subject }}, 10, 20) ~ -2,
{{ .pctER.subject }} < 10 ~ -3,
# Other
TRUE ~ 0
)) %>%
# Label performance levels
mutate(perform_text = case_when(
perform_lvl == -3 ~ "Large\ndecrease",
perform_lvl == -2 ~ "Medium\ndecrease",
perform_lvl == -1 ~ "Small\ndecrease",
perform_lvl == 0 ~ "Not\nBTO",
perform_lvl == 1 ~ "Small\nincrease",
perform_lvl == 2 ~ "Medium\nincrease",
perform_lvl == 3 ~ "Large\nincrease"
)) %>%
mutate(perform_text = fct_relevel(
perform_text,
"Large\ndecrease", "Medium\ndecrease", "Small\ndecrease",
"Not\nBTO",
"Small\nincrease", "Medium\nincrease", "Large\nincrease")) %>%
# Rename variables of interest
rename("perform_lvl_{{ .var }}" := perform_lvl,
"perform_text_{{ .var }}" := perform_text)
}
perf_lvl_math <- add_performance_levels(sch_bto_data, pctER_math, math) %>%
# Drop reading variables for following merge
select(-contains("read"))
perf_lvl_read <- add_performance_levels(sch_bto_data, pctER_read, read) %>%
select(-contains("math"))
bto_perform_lvl <- left_join(perf_lvl_math, perf_lvl_read)
bto_perform_lvl %>%
select(first_hs_code, starts_with("perform_text")) %>%
pivot_longer(-first_hs_code,
names_to = "subject", values_to = "perform_lvl") %>%
count(subject, perform_lvl) %>%
mutate(subject = ifelse(str_detect(subject, "math"), "Math", "Reading")) %>%
ggplot(aes(perform_lvl, n, fill = subject)) +
geom_col(position = position_dodge(.85), width = .8) +
geom_text(aes(label = n),
position = position_dodge(.85), vjust = -.45) +
scale_fill_manual(values = c("#E69F00", "#999999")) +
sdp_theme() +
theme(panel.grid.major.y = element_line(color = "grey92")) +
labs(x = "Performance Level", y = "Number of Schools", fill = "",
title = "Distribution of School Performance Categories for Math and Reading")
Creating tables or plots that compare school performance levels in math and reading allows us to see whether schools perform better or worse than expected in one or both subject areas. In this example, we see that school performance in math and reading is correlated. This information can be used to inform the way we think about best practices.
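A quick way to check how closely the two subjects line up is to cross-tabulate the performance levels and summarize the agreement. The sketch below uses six made-up schools with a simplified three-level scale; the column names perform_math and perform_read are assumptions for illustration.

```r
levels_ordered <- c("Decrease", "Not BTO", "Increase")

# Toy performance levels for six schools (made-up values)
schools <- data.frame(
  perform_math = factor(
    c("Increase", "Increase", "Not BTO", "Not BTO", "Decrease", "Decrease"),
    levels = levels_ordered),
  perform_read = factor(
    c("Increase", "Not BTO", "Not BTO", "Not BTO", "Decrease", "Not BTO"),
    levels = levels_ordered)
)

# Cross-tabulate math by reading performance
cross_tab <- table(math = schools$perform_math, read = schools$perform_read)
print(cross_tab)

# Share of schools placed in the same level for both subjects
agree_share <- sum(diag(cross_tab)) / sum(cross_tab)

# Rank correlation between the ordered levels
rank_cor <- cor(as.integer(schools$perform_math),
                as.integer(schools$perform_read),
                method = "spearman")
```

Counts concentrated on the diagonal of the table, together with a positive rank correlation, indicate that schools tend to land in similar levels for both subjects.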
bto_perform_lvl %>%
select(first_hs_code, starts_with("perform_text")) %>%
count(perform_text_math, perform_text_read) %>%
ggplot(aes(perform_text_math, perform_text_read)) +
geom_tile(aes(width = .95, height = .95),
fill = "white", color = "black") +
geom_text(aes(label = n)) +
sdp_theme() +
labs(x = "Math Performance", y = "Reading Performance",
title = "School Counts by Performance Categories for Math and Reading")
Purpose: This analysis compares prior achievement by BTO performance levels to better understand the progress made by schools in the different performance groups.
Required Analysis File Variables:
first_hs_code
chrt_ninth
scale_score_11_math
scale_score_11_read
bto_math (calculated field from BTO analysis)
bto_read (calculated field from BTO analysis)
Ask Yourself
Analytic Technique: Calculate the average prior achievement for each school in the BTO analysis, then rank schools by their score.
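When converting school averages to percentile ranks, it is worth deciding how to treat schools with no valid scores. By default, rank() places missing values last, which would give such a school the highest rank. The sketch below, with made-up averages, shows one possible way to keep them missing instead.

```r
# Toy school averages (made-up values); school B has no valid scores
avg_scores <- c(A = 18.2, B = NA, C = 21.5, D = 19.9)

# Default behavior: the missing value is ranked last (highest)
default_rank <- rank(avg_scores)

# Keep missing values as NA so they do not receive a rank
kept_rank <- rank(avg_scores, na.last = "keep")

# Percentile rank among schools with a valid average
pct_rank <- kept_rank / sum(!is.na(avg_scores)) * 100

print(rbind(default_rank, kept_rank, pct_rank))
```

Whether to exclude such schools or handle them another way is a judgment call that depends on why their scores are missing.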
prior_achieve <- faketucky %>%
filter(chrt_ninth == "2009") %>%
select(first_hs_code, scale_score_11_math, scale_score_11_read) %>%
mutate(first_hs_code = as.character(first_hs_code)) %>%
# Calculate average scores by subject
group_by(first_hs_code) %>%
summarise(avg_11_read = mean(scale_score_11_read, na.rm = TRUE),
avg_11_math = mean(scale_score_11_math, na.rm = TRUE)) %>%
# Rank schools by subject
mutate(rank_11_read = rank(avg_11_read) / length(avg_11_read) * 100,
rank_11_math = rank(avg_11_math) / length(avg_11_math) * 100)
prior_achieve_bto <- bto_perform_lvl %>%
select(first_hs_code, starts_with("bto")) %>%
left_join(prior_achieve)
prior_achieve_bto %>%
mutate(bto_type = case_when(
bto_math == "yes" & bto_read == "yes" ~ "Math AND Reading",
bto_math == "yes" | bto_read == "yes" ~ "Math OR Reading",
TRUE ~ "Neither/Not BTO"
)) %>%
ggplot(aes(rank_11_math, rank_11_read)) +
geom_point(aes(color = bto_type),
size = 2, alpha = .6) +
geom_vline(xintercept = 50, linetype = "dashed") +
geom_hline(yintercept = 50, linetype = "dashed") +
scale_color_manual(values = c("#E69F00", "#0049E6", "#999999")) +
sdp_theme() +
labs(x = "Prior Performance - Math (Percentile Rank)",
y = "Prior Performance - Reading (Percentile Rank)",
color = "",
title = "Prior Performance by BTO Status")
Purpose: This analysis explores the student demographics of schools at different performance levels.
Required Analysis File Variables:
first_hs_code
chrt_ninth
sch_white
sch_frpl
sch_sped
sch_lep
perform_lvl_math (calculated field from BTO analysis)
perform_lvl_read (calculated field from BTO analysis)
Ask Yourself
Analytic Technique: Calculate the percentage of white (or minority) students, low income students, and special education students for schools in each performance level.
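When averaging school demographics within a performance group, the result depends on whether each school counts once or each student counts once. The sketch below contrasts the two using four made-up schools. The columns perform_group and n_students are assumptions for illustration.

```r
# Toy school-level data (made-up values)
schools <- data.frame(
  first_hs_code = c("A", "B", "C", "D"),
  perform_group = c("High", "High", "Low", "Low"),
  n_students = c(100, 900, 500, 500),
  sch_frpl = c(0.80, 0.40, 0.60, 0.50)
)

# Each school counts once
school_weighted <- tapply(schools$sch_frpl, schools$perform_group, mean)

# Each student counts once (schools weighted by enrollment)
student_weighted <- sapply(
  split(schools, schools$perform_group),
  function(d) weighted.mean(d$sch_frpl, d$n_students)
)

print(rbind(school_weighted, student_weighted))
```

In the "High" group the two averages differ noticeably because one school is much larger than the other, so it helps to state which version a chart reports.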
# Create function to calculate proportions by subject area
calc_props <- function(.perform_lvl_subject, .subject_lbl) {
# Call BTO performance levels
bto_perform_lvl %>%
# Label BTO schools
mutate(perform_high_low = case_when(
{{ .perform_lvl_subject }} > 0 ~ glue("{.subject_lbl}_High Performing"),
{{ .perform_lvl_subject }} < 0 ~ glue("{.subject_lbl}_Low Performing"),
{{ .perform_lvl_subject }} == 0 ~ glue("{.subject_lbl}_Neither High/Low"),
TRUE ~ NA_character_
)) %>%
# Merge with school demographics
left_join(sch_avg_by_cohort %>%
filter(chrt_ninth == 2010) %>%
mutate(first_hs_code = factor(first_hs_code))) %>%
drop_na(perform_high_low) %>%
# Calculate proportions by performance level
group_by(perform_high_low) %>%
summarise(prop_white = mean(sch_white, na.rm = TRUE),
prop_frpl = mean(sch_frpl, na.rm = TRUE),
prop_sped = mean(sch_sped, na.rm = TRUE),
prop_lep = mean(sch_lep, na.rm = TRUE)) %>%
# Convert to long format for plotting
pivot_longer(cols = starts_with("prop")) %>%
rename(group = perform_high_low)
}
prop_math <- calc_props(perform_lvl_math, "Math")
prop_read <- calc_props(perform_lvl_read, "Reading")
prop_stn_perform_lvl <- bind_rows(prop_math, prop_read)
prop_stn_perform_lvl %>%
separate(group, into = c("subject", "group"), sep = "_") %>%
mutate(group = fct_relevel(group, "Low Performing", "High Performing", "Neither High/Low")) %>%
mutate(value = round(value * 100, 1)) %>%
mutate(name = case_when(
name == "prop_white" ~ "White",
name == "prop_frpl" ~ "Low Income",
name == "prop_sped" ~ "Special Education",
name == "prop_lep" ~ "Limited English"
)) %>%
ggplot(aes(name, value, fill = group)) +
geom_col(position = position_dodge(.85), width = .8) +
geom_text(aes(label = round(value, 1)),
position = position_dodge(.85), vjust = -.45) +
expand_limits(y = c(0, 100)) +
facet_wrap(~ subject, nrow = 2) +
scale_fill_manual(values = c("#E69F00", "#0049E6", "#999999")) +
sdp_theme() +
theme(panel.grid.major.y = element_line(color = "grey92")) +
labs(x = "Student Group", y = "Percent of Students", fill = "",
title = "School Demographics by Performance Status in Math and Reading")