Overview of R2L Methodology

R2L combines numerous data sources with new methods for ongoing data collection (web scraping, machine learning, natural language processing, and researcher review) to create a robust panel dataset for each district based on published school district website content. Each week, the R2L dashboard presents current and historical district instructional status data for the nation as a whole and by district demographic and sociopolitical variables, such as district poverty, voting history, and rates of broadband access.

Research Design

This dashboard presents both current status in a given week and changes since early in the school year. The data are broken down by a number of demographic and sociopolitical characteristics to display associations with district statuses.

Constructs. Our primary construct is mode of instruction, broadly defined. Our coding is based on district-wide policies collected from district webpages, social media announcements, or direct contact with a district representative. Idiosyncratic closures due to school-specific outbreaks are not reflected in our data unless they closed a grade range for the entire district for at least one week. To ensure uniformity in collector responses, each district’s operating plan is coded into one of three mutually exclusive categories: fully in-person, hybrid, or fully remote. These broad definitions allowed us to categorically separate district operating plans:

  1. In-person. All grade levels can attend school in buildings five days per week, though families can opt for fully remote instruction or a hybrid model.
  2. Hybrid. Either (a) students in some grades can return to buildings in person while other grades can return only in a hybrid or remote model, or (b) all students can return to buildings for four or fewer days each week (or five partial days) while learning remotely from home the remaining time.
  3. Remote. All grade levels above first grade participate in virtual instruction five days per week, with no option for in-person or hybrid learning. Districts that only allowed in-person or hybrid instruction for prekindergarten, kindergarten, first grade, or select subgroups of students are included in this category.
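Read as a decision rule, the three definitions above can be sketched as follows. This is an illustration only, not R2L's coding tool: the `Offer` labels, the grade labels, and the `classify_district` function are hypothetical, and the sketch assumes a collector has already summarized the most in-person option open to each grade. It does not model the "select subgroups of students" case.

```python
from enum import Enum


class Offer(Enum):
    """Most in-person option open to a grade (hypothetical labels)."""
    FULL_IN_PERSON = "full_in_person"  # buildings open five full days per week
    PART_TIME = "part_time"            # four or fewer days, or five partial days
    REMOTE_ONLY = "remote_only"        # no in-building option


# Grades the remote definition ignores (prekindergarten through first grade).
EARLY_GRADES = {"PK", "K", "1"}


def classify_district(offers):
    """Map {grade label: Offer} to 'in-person', 'hybrid', or 'remote'."""
    if not offers:
        raise ValueError("at least one grade is required")
    # In-person: every grade can attend in buildings five days per week.
    if all(o is Offer.FULL_IN_PERSON for o in offers.values()):
        return "in-person"
    # Remote: only grades above first grade decide this category.
    # A district with no such grades falls through to hybrid in this sketch.
    upper = [o for g, o in offers.items() if g not in EARLY_GRADES]
    if upper and all(o is Offer.REMOTE_ONLY for o in upper):
        return "remote"
    # Everything else is a mix of models across grades or days.
    return "hybrid"
```

Under this rule, a district that brings back only kindergarten and first grade part time while all higher grades stay remote is still coded as remote, which matches the third definition.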

Information on districts’ instructional status was gathered from school district websites (and pages linked to them) on the assumption that these sites are the centralized communication hubs for all schools in those districts.

R2L data can be reported at the district or school level, but the results should be explicitly understood for what is represented at each level. At the district level, percentages are the proportion of districts offering in-person, hybrid, or remote instruction at all schools. At the school level, percentages are of schools that belong to districts offering in-person, hybrid, or remote instruction at all schools. As such, in-person percentages capture only a portion of the schools that offer an in-person option: they exclude, for example, elementary schools open five days a week in a district in which high schools are remote or hybrid. Similarly, remote percentages capture only a subset of the schools that are fully remote and would not include the high schools in that example. We refer to school percentages to communicate the extent of these categories, since district percentages equate three-school districts with 500-school districts. Approximations of the percentage of students in each model are available based on a sample gathered by Burbio.
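The difference between the two reporting levels is easiest to see with numbers. The sketch below is a minimal illustration with made-up districts; the function name `status_shares` and its input format are assumptions, not part of the R2L data files.

```python
from collections import defaultdict


def status_shares(districts, weight_by_schools=False):
    """Return the percentage in each status.

    districts: iterable of (status, number_of_schools) pairs.
    weight_by_schools=False counts each district once (district level);
    weight_by_schools=True counts each district's schools (school level).
    """
    totals = defaultdict(float)
    for status, n_schools in districts:
        totals[status] += n_schools if weight_by_schools else 1
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no districts supplied")
    return {status: 100 * t / grand_total for status, t in totals.items()}


# Made-up example: one 500-school remote district, one 3-school in-person district.
example = [("remote", 500), ("in-person", 3)]
district_level = status_shares(example)                        # 50% and 50%
school_level = status_shares(example, weight_by_schools=True)  # roughly 99.4% and 0.6%
```

At the district level the two districts count equally; at the school level the 500-school district dominates, which is why school percentages better convey the extent of each category.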

Data Collection

R2L’s approach to data capture rests on school district websites’ published content. In September 2020, R2L began scraping data from websites of regular, non-charter, public school districts that had three or more schools (as defined by the National Center for Education Statistics 2018–19 Common Core of Data Local Education Agency data file). We track changes by identifying new content posted on district websites in a given week (by subtracting the previous week’s content from the given week’s scraped data) and then use a machine learning approach to analyze whether the new content indicates a change in operational status. To update our weekly time series data, AEI and C2i researchers read web content and, when necessary, call school districts to review the predictions of the machine learning models, affirm new changes, and review past weeks’ changes. Districts whose websites yielded no predicted changes in a given week are assumed to retain their previous instructional status.
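The week-over-week subtraction and the carry-forward assumption described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the R2L pipeline: it treats a scrape as a list of text blocks, the function names `new_content` and `next_status` are hypothetical, and the machine learning and researcher review steps are represented only by an optional `reviewed_status` value.

```python
from typing import List, Optional


def new_content(previous_week: List[str], current_week: List[str]) -> List[str]:
    """Return blocks in this week's scrape that were absent last week."""
    seen = set(previous_week)
    return [block for block in current_week if block not in seen]


def next_status(previous_status: str, reviewed_status: Optional[str]) -> str:
    """Carry the prior status forward unless a reviewed change exists."""
    return previous_status if reviewed_status is None else reviewed_status


# Made-up scrape: only the new announcement would go on to classification.
last_week = ["Welcome back", "Lunch menu"]
this_week = ["Welcome back", "All schools move to remote learning Monday", "Lunch menu"]
changed_blocks = new_content(last_week, this_week)
```

Only the blocks returned by `new_content` would need to be examined for a status change, which keeps the weekly review focused on what a district actually posted that week.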

During the fall and early winter, R2L worked to improve the effectiveness of our scraping and machine learning and surveyed a representative sample of more than 2,200 of our approximately 8,400 targeted school districts. In January 2021, R2L supplemented this sample of 2,200 surveyed districts with multiple external data sources to extend our time series to the full population of regular school districts with three or more schools.

The largest source of external data came from MCH Strategic Data’s “COVID-19 IMPACT: School District Status,” which included baseline data for nearly all of the approximately 8,400 regular public school districts in the United States with at least three schools. MCH’s data included up to 18 data points per district through January 2021 and served as the baseline for R2L’s time series. We supplemented MCH data with data drawn from 29 states’ departments of education* (and a few additional supplemental sources) for corrections and fidelity checks to our collective baseline time series. Data collected by AEI and C2i researchers have been applied over the baseline data from MCH and states. These researcher-collected data cover all of our initial sample of 2,200 districts; districts flagged by the R2L web scraping and machine learning algorithm since November 2, 2020; and districts with sparse or insufficient time series data. A subset of states provide ongoing data that are used to flag changes and do quality assurance on R2L data.
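One way to picture data being "applied over" a baseline is as a layered merge in which researcher-collected observations take precedence wherever they exist. The sketch below assumes each source is a mapping from a (district ID, week) pair to a status; the function name `layer_time_series`, the district IDs, and the dates are hypothetical.

```python
def layer_time_series(baseline, researcher):
    """Overlay researcher-collected statuses on the baseline series.

    Both arguments map (district_id, week) -> status. Baseline values
    survive only where no researcher observation exists for that key.
    """
    merged = dict(baseline)    # copy so the baseline itself is not modified
    merged.update(researcher)  # researcher data win on any shared key
    return merged


# Made-up records for two hypothetical districts.
baseline = {("D001", "2020-11-02"): "hybrid", ("D002", "2020-11-02"): "remote"}
researcher = {("D001", "2020-11-02"): "in-person"}
combined = layer_time_series(baseline, researcher)
```

In this example the researcher observation replaces the baseline value for the first district, while the second district keeps its baseline status.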

Existing Data Sources. School district–level demographic data and crosswalks between districts and counties come from the Common Core of Data from the National Center for Education Statistics. Historical achievement data come from Stanford Education Data Archive 4.0. The US Census Bureau, US Bureau of Labor Statistics, and St. Louis Federal Reserve’s FRED platform provide county economic and educational attainment data. Vote totals from 2020 come from USElectionAtlas.org. County-level mask usage data gathered in July 2020 come from The New York Times. Finally, aggregate county COVID-19 case data come from USAFacts.org, and age-specific case data reported monthly come from the Centers for Disease Control and Prevention.

Districts’ Instructional Offerings by Week

These bar graphs display the percentage of districts in each category of instruction. The categories of instruction are also displayed by district characteristics such as poverty, in which high-poverty districts are above the national district average for the measure of poverty and low-poverty districts are below the average.
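A minimal sketch of such a split follows, assuming that high-poverty means strictly above the national district average and that low-poverty means at or below it. The handling of districts exactly at the average is an assumption made for the example, and the function name and poverty rates are hypothetical.

```python
from statistics import mean


def split_by_national_average(poverty_rates):
    """Label each district relative to the unweighted district average.

    poverty_rates maps district_id -> poverty measure. Districts exactly
    at the average are labeled low-poverty here (an assumption).
    """
    average = mean(poverty_rates.values())
    return {
        district: "high-poverty" if rate > average else "low-poverty"
        for district, rate in poverty_rates.items()
    }


# Made-up poverty rates (percent) for three hypothetical districts.
labels = split_by_national_average({"D001": 10.0, "D002": 20.0, "D003": 30.0})
```

The average here is taken across districts rather than across students, consistent with the "national district average" wording.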

*Alabama, Alaska, Arizona, Arkansas, Colorado, Connecticut, Georgia, Hawaii, Illinois, Kansas, Louisiana, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Nebraska, New Mexico, North Carolina, Ohio, Oregon, Rhode Island, South Carolina, and Washington.