Rosie Hughes, Adanya Lustig and Erin Rhoda, BDN staff members, created a survey to better understand how Maine schools help students who might be struggling now or in the future with drugs or alcohol. They heard back from 229 schools.
But before we could start to think about what schools do that works (or doesn’t) to protect kids from the influence of substances, we had to run our survey data through some paces. This is the inglorious, even maddening, part of our work. But we had to be sure that our conclusions were built on a solid foundation of statistical reasoning.
This section describes how we validated our data, outlines our data exploration tools, and then shows how we tested the significance of our stronger conclusions. You can find the gory details of the analysis on GitHub.
Now that we have some confidence in our numbers and our process, it’s time to get into the fun part: exploration.
Statistics as a flashlight
When trying to find your way on a hiking trail at night, you learn to get used to tripping on the odd rock or taking a branch to the face. A flashlight helps keep you on the path, and so does a map, if you have one.
There are a lot of parallels between hiking at night and data exploration. When faced with a collection of shiny new data, as we were when the survey responses from principals started coming in, there is a temptation to jump right in and hunt for the next great insight.
But branches and rocks can be lurking in the data, too, in the form of bias and over-eagerness to find significant results. In this world of data there is no map, but statistics can be our flashlight.
Here we will lay out how we checked for non-response bias and how we tested for statistical significance.
It is common practice in surveys to ask questions of a small group of people to infer how the whole population would answer. That small group is called a sample. In our survey we were interested in what all of Maine’s principals think and how all districts deal with drug issues.
Our goal was not to take a small sample and make assessments based on that. Rather we wanted to hear from everyone. However, we only heard back from about 68 percent of high schools, and 31 percent of elementary schools. So, these respondents became our sample.
Sampling to infer trends in a population is not inherently bad. It doesn’t necessarily introduce errors from bias, but this is possible. Bias in this case means that some part of the population may be underrepresented, so our results might not accurately tell the population’s story.
When the people who answer a survey differ in some systematic way from the people who don’t, we call this non-response bias. But how can we know whether the respondents differ in a meaningful way from those who didn’t respond? This calls for a flashlight.
We tested for non-response bias by looking at data that were available for all schools. We checked to see if the responding schools were essentially the same as the non-responding schools in three areas: student-teacher ratio, spending per pupil, and percentage of students who receive free or reduced-price lunches.
If there was a difference in any of these areas, then we would have to correct for that bias.
Student-to-teacher ratio
The Maine Department of Education’s data warehouse gave us access to enrollment numbers at each grade level for each school, as well as the number of teachers. With this we could calculate the student-to-teacher ratio both for high schools and elementary schools.
Looking at the high schools first, we separated the schools into respondents and nonrespondents. Then we had to figure out if these two groups were truly different. If they were different, then that would point to a potential area of bias. We measured the amount of difference with two statistical measures: effect size and p-value.
In the figure below, you will see some green bars and some blue bars. The blue bars show how many responding schools had each student-teacher ratio, from six to 15 students per teacher. You will notice that there is a wide range (from six students per teacher to more than twice that number), but on average we see about 11 students per teacher. This is the “distribution” of student-teacher ratios for responding high schools.
Effect size measures how far apart the responding schools’ distribution is from that of the non-responding schools. The figure shows visually that the two distributions have a lot of overlap. We measured an effect size of 0.19, which is considered to be a small difference.
But how likely is it that we would see a difference at least this big purely by chance, if there were no real difference between the groups? That is the p-value. If that probability is small, we call the difference statistically significant. In this case, however, we calculated a p-value of 0.39: a 39 percent chance that we could have seen a difference this extreme just due to chance.
In other words, there is no support from these data that there is a difference between the respondents and the non-respondents.
For the elementary schools, we found an effect size of 0.14, and a p-value of 0.35. Again, there is no evidence in these data of a significant difference.
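To make the effect-size and p-value comparison concrete, here is a small Python sketch of the kind of test we ran. The numbers below are synthetic stand-ins, not our actual survey data (the real analysis lives in our GitHub repository); the effect size shown is Cohen’s d, one standard choice for comparing two means.

```python
import numpy as np
from scipy import stats

# Synthetic student-teacher ratios for illustration only --
# two groups drawn from similar distributions, like our
# responding and non-responding schools.
rng = np.random.default_rng(0)
responders = rng.normal(loc=11.0, scale=2.0, size=60)
nonresponders = rng.normal(loc=11.4, scale=2.0, size=30)

def cohens_d(a, b):
    """Standardized difference of means (pooled standard deviation)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d(responders, nonresponders)
# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(responders, nonresponders, equal_var=False)
print(f"effect size d = {d:.2f}, p-value = {p_value:.2f}")
```

A small effect size together with a large p-value, as we found, is the pattern that says: no evidence these two groups differ.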
Expenditure per pupil
Next we looked at how much money each school district spends per pupil. This was also based on data from the Maine Department of Education data warehouse. Since the shape of these distributions was different from the normally distributed student-teacher ratio data, we used a different statistical test. For these data we found a tiny effect size (0.02), and a large p-value (0.87). There is no evidence of bias here.
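When a distribution is skewed rather than bell-shaped, a rank-based test such as the Mann-Whitney U test is one common alternative to the t-test. The sketch below uses made-up, log-normally distributed spending figures purely for illustration, with the rank-biserial correlation as one common effect-size companion to the U test (the source of our published 0.02 figure may differ).

```python
import numpy as np
from scipy import stats

# Made-up per-pupil spending figures; log-normal draws give the
# right-skewed shape typical of dollar amounts.
rng = np.random.default_rng(1)
responders = rng.lognormal(mean=9.3, sigma=0.25, size=60)
nonresponders = rng.lognormal(mean=9.3, sigma=0.25, size=30)

# Rank-based test: makes no assumption that the data are bell-shaped.
u_stat, p_value = stats.mannwhitneyu(responders, nonresponders,
                                     alternative="two-sided")

# Rank-biserial correlation: an effect size on a -1 to 1 scale.
rank_biserial = 1 - 2 * u_stat / (len(responders) * len(nonresponders))
print(f"rank-biserial r = {rank_biserial:.2f}, p-value = {p_value:.2f}")
```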
Percent of students receiving free or reduced-price school lunch
For data on students receiving free or reduced-price lunch, we found a tiny effect size (0.02) and a large p-value (0.83). Again, there is no evidence of bias here.
Exploring the results
With confidence in our data, we could move on to the exploration. There was a lot of data, though. We wanted to figure out if it could tell us what sorts of school policies and approaches were helpful, and which were not. To do this we systematically looked at associations between questions.
For the most part, the information we gathered from schools included answers to yes/no or multiple-choice questions. We can’t use tools like linear regression or correlation here. Instead we turned to the equivalent of correlation for nominal, i.e., labeled, data: association.
Associations between survey responses
“Association” refers to measures of the strength of relationship in which at least one of the variables is nominal. It’s like correlation but for variables that are not numbers. A handy way to see associations is with a contingency table.
Here is an example of a contingency table. This one compares the number of drug-related incidents in a year (rows) to the student-teacher ratio (columns) for Maine high schools.
From this we can see that there are generally more incidents in schools with higher student-teacher ratios, and fewer incidents in schools with few students per teacher. The strength of this relationship can be measured with a statistic called Cramer’s V.
V measures the association between two variables on a scale from 0 (no association) to 1 (the strongest possible association). For the data above we found the strength of association to be 0.36, a moderate association between student-teacher ratio and the number of incidents.
This is by no means a complete explanation, but it does give us a good place to start looking. We might also ask: What are all of the potential effects that could occur when a school system adds more teachers? It may not be the case that teachers specifically prevent the use of drugs, but there may well be some other variable that follows (or is followed by) the ratio which does reduce the tendency toward drug use.
Contingency tables were a handy tool for looking at all of the possible associations, and Cramer’s V helped us find the ones that were the most promising. We know that correlation (or association) is not the same as causation, but it shined a light on some interesting paths that were worthy of more exploration.
Testing for significance
The most satisfying result of data experiments is finding statistical significance. If we want to know if there is an actual difference or a genuine effect in an experiment, then this is how we make that assessment. We use the same techniques here that we did when searching for non-response bias. We’ll also talk about the statistical power of this experiment.
Our exploration of the associations between question responses turned up several associations that we investigated further. One that stood out was the connection between student-teacher ratio and the question: “In the past year, how many students have been caught with alcohol, drugs or drug paraphernalia on your campus?”
We separated schools into two groups, schools with student-teacher ratio above 11 (“high s-t”) and those with 11 and below (“low s-t”). When comparing the distributions of these two groups, it seemed like there was a real difference between them. The distributions are shown in the graphic below.
We found an effect size of 0.88 and a p-value of 0.0005. This means that there is a large difference between the distributions, and only a 0.05 percent chance of finding a more extreme difference just due to chance.
We also needed to make sure that we had enough samples to have at least an 80 percent chance of detecting a difference of this size. That probability is called the statistical power of the experiment. For this we calculated the number of samples required in an experiment with two independent samples and a continuous outcome. This gave a required sample size of at least 20 samples per group. We had 54 samples from the “high s-t” group and 31 from the “low s-t” group.
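The required sample size for a two-sample comparison can be sketched with the standard normal-approximation formula, shown below in Python. This is a textbook approximation, not necessarily the exact calculation we published, but it lands in the same neighborhood for an effect size of 0.88.

```python
import math
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-independent-sample
    test of means: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for significance
    z_beta = norm.ppf(power)            # critical value for power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.88))  # roughly 20 per group for our observed effect size
```

Note the trade-off the formula makes explicit: the smaller the effect you hope to detect, the more samples you need, which is why our large effect size kept the requirement modest.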
This meets the criteria for a statistically significant difference, meaning that schools with fewer students per teacher tend to have fewer students getting caught with drugs or alcohol.
We found our way to a satisfying destination in our exploration.