Educational institutions like the University of Michigan are, on the surface at least, deeply committed to fairness. This commitment is often tested: any time we select students for admission, evaluate their work for grades, or decide who to honor. Although the commitment is simple, living up to it is not. Creating an educational system that is fair is a constant challenge. To meet it, we must be vigilant, using all the information we have to find out where we might be falling short.
I’ve been thinking for a while now about how to use data to test the fairness and equity of higher education. Nothing I have to say here is entirely new, but it’s clear more people need to hear it, because the problems remain widespread.
What makes a course fair?
Let’s consider an example. Since joining the Michigan faculty in 1995, I have regularly taught large introductory courses. In these courses, every student is provided with the same opportunities to learn, asked to complete the same assignments, and tested in the same way. These days, each student’s test and homework scores are often determined by automatic graders unaware of their identities. Point totals are used to rank students, then converted to letter grades intended to say something about global competence. On the surface, a course like this seems like a model of fairness. But is it?
Start with the core idea: is treating everyone in exactly the same way fair? Well, it might be, if everyone arrived at the start in exactly the same condition, with precisely the same background and preparation. Then their performance might depend only on what they do during the class. Of course that’s never true. In real-world college courses, final rankings say as much about what students were like when the class began as they do about what happened as students worked through the semester.
What’s the point of reiterating where people were when they started? Doesn’t that just ensure that those who get ahead early, whether through wealth and privilege, or even hard work, stay on top forever? Even if the purpose of the final grade is solely to report achievement at the end of the class, regardless of where everyone started, we’d still want to know more, if only to understand whether and how our class is working.
Better-or-worse than expected – a different kind of fairness
Imagine that we take a step forward. Instead of just comparing outcomes, why not measure how every student improves during the course, from start to finish? Gather up everything we know about them at the start, and see how they do at the end. Then we could make comparisons only among students who start out looking just the same. Clearly a comparison like this is ‘fair’ in a different way.
Even among students who start out the same, we find a variety of outcomes. Most do about the same, and this ‘typical’ performance becomes our rational expectation for students with that kind of background. Some do better-than-expected – they outperform most of the students who enter with a background like theirs. Some do worse-than-expected – they underperform most of the students who enter with a background like theirs.
There could be lots of reasons for better-or-worse than expected performance like this. The differences might be purely random: a bit of good luck for some students and bad for others. They could be due to effort: just rewards for especially hard work and penalties for slacking off. Intrinsic traits invisible in our measures of background might act too – experience we haven’t seen, or maybe talent.
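One way to make “better-or-worse than expected” concrete is to model each student’s expected final score from their incoming measures, then look at the residual. Here is a minimal sketch using entirely synthetic data; the background measures, the linear model, and all the numbers are illustrative assumptions, not the actual analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic incoming measures for 500 hypothetical students:
# high school GPA and a placement test score.
n = 500
hs_gpa = rng.normal(3.5, 0.3, n)
placement = rng.normal(70.0, 10.0, n)

# Synthetic final course scores: depend on background, plus noise.
final = 20.0 * hs_gpa + 0.3 * placement + rng.normal(0.0, 5.0, n)

# Fit a linear model predicting the final score from background.
# Its prediction is the "rational expectation" for each student.
X = np.column_stack([np.ones(n), hs_gpa, placement])
coef, *_ = np.linalg.lstsq(X, final, rcond=None)
expected = X @ coef

# The residual is the better-or-worse than expected performance.
anomaly = final - expected

# With an intercept in the model, residuals average to ~0 by construction,
# so "anomaly > 0" means outperforming students with a similar background.
print(anomaly.mean())
```

The point of the residual is that it removes what we already knew about a student at the start, leaving only what changed during the semester (plus luck).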
Testing for fairness by comparing outcomes
If a class is fair, student performance should depend only on factors that are salient: background and preparation, effort, talent. Other things shouldn’t determine student outcomes. No one’s grade should depend on their shirt color, their height, or what they had for lunch on Tuesday. This assertion gives us a way to test whether a course is fair: check whether better-or-worse than expected performance depends on something it shouldn’t.
Testing for fairness by comparing outcomes is a time-honored approach. A typical application happens in a large course taught in many sections. Instructors concerned about fairness often check to see whether grades differ strongly from one section to another. In doing this, they typically assume that the mix of salient traits (background and preparation, effort, talent) is the same in every section, so that they expect the same average performance for students in each group.
When grades differ significantly from section-to-section, concerns about fairness arise. Perhaps instruction was better in some sections than others. Perhaps grading was more lenient in some sections than others. In either case, some would consider the outcomes unfair. Of course it’s also possible that students in some sections were better prepared, or worked harder, than students in others. But it’s not unusual for instructors to seek fairness by forcing the average grades in every section to be very nearly the same.
Now we can do more. With rich knowledge of student background and preparation, and perhaps even information about effort, we can test whether students in one group do better-or-worse than expected.
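A test like this can be sketched directly on the residuals from the previous step: compare the mean “better-or-worse than expected” score between two groups, and ask how often a gap that large would arise by chance. The permutation test below is my own illustrative choice of significance check, and the data are again simulated, with group labels (say, two sections) assigned at random so that, by construction, no real gap exists:

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose we already have a better-or-worse than expected residual for
# each of 400 hypothetical students (simulated here), plus a group label
# that should NOT matter -- e.g. section A (0) vs. section B (1).
n = 400
anomaly = rng.normal(0.0, 1.0, n)
group = rng.integers(0, 2, n)

# Observed gap: mean residual in group 1 minus mean residual in group 0.
observed = anomaly[group == 1].mean() - anomaly[group == 0].mean()

# Permutation test: shuffle the labels many times to see how large a gap
# arises purely by chance when the label truly carries no information.
gaps = []
for _ in range(2000):
    shuffled = rng.permutation(group)
    gaps.append(anomaly[shuffled == 1].mean() - anomaly[shuffled == 0].mean())
p_value = np.mean(np.abs(gaps) >= abs(observed))

print(f"observed gap = {observed:.3f}, p = {p_value:.3f}")
```

A small p-value would flag a gap too large to dismiss as chance, which is the signal that something not salient is shaping outcomes.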
Do outcomes differ for unacceptable reasons?
We’ve done this test for my own physics classes, gathering information about every student: admissions data like standardized test scores and high school GPA, placement test scores, and a record of every class they’ve taken and every grade they’ve received. Then we take students for whom all these things are the same, and compare their outcomes.
What we found was a surprise. In my introductory physics classes (and those of all my colleagues), male and female students who seem academically identical receive, on average, quite different grades – female students typically receive grades about a third of a letter grade lower than male students. Remember, this is after carefully accounting for everything we know about the academic background of these students. This is not a small effect either – the difference between an A- and a B+ hurts. It is significant both as performance feedback to the student and for its role in ranking and the award of honors. Such an inequity can’t be ignored.
Why do I say inequity? Gender is one of the descriptors of a person that, like height, should not influence success in my physics class. But it does. To me, this is clear evidence that I have created in my class an inequitable environment, one in which female students encounter barriers to success that equivalently prepared male students do not. This is my problem, and I’m glad to say that I’m finally starting to take steps to do something about it.
It’s not just my problem though. Gendered performance differences like this are ubiquitous in large, introductory science and math lecture classes. Remarkably, they are absent in the lab courses associated with the same topics. I’ll be writing more on this in future posts.
The key lesson is simple. If we want our higher education system to be fair, we must pay attention. We must constantly examine what’s happening, in careful, quantitative ways. When we find problems, we need to work to address them.