Misuse of Models: IB predicting test scores

Posted by Q McCallum on 2020-07-27

The group behind the International Baccalaureate (IB) Diploma skipped this year’s exams due to Covid-19 concerns. That’s understandable, even if far from ideal. IB is certainly not the only organization forced to make quick moves to adapt to the pandemic.

What’s less understandable is IB’s follow-up move: they used a model to predict exam performance. Some universities have since withdrawn students’ acceptances based on lower-than-expected IB exam scores, even though those “results” didn’t come from students’ exam-day performance.

While there are many lessons one can learn from this, I’ll explore two related to ML/AI:

1. Be mindful when developing and deploying predictive models. The higher the stakes, the more care you must employ.

IB’s actions, in my view, constitute a misuse of a predictive model. I am not surprised that someone at IB had the idea to develop a predictive model; I’m just confused that this idea made it all the way to a deployed, in-the-wild model which held weight in schools’ decisions (and, ergo, students’ futures).

These kinds of exams serve as a proxy for a person’s long-term aptitude based on a single day’s effort, so there’s already a lot of pressure on students to perform well. What’s worse is when we take that single abstraction (the exam concept) and layer another abstraction on top of it (predicting the exam’s score). It’s akin to looking through two lenses you’ve stacked together: it’s possible that you were able to place them such that you get the true, clear picture. It’s also possible that any flaws in those lenses, or in your arrangement thereof, will compound one another and distort the view.

This is the same argument that is (rightfully!) levied against predictive policing, predictive hiring, and social credit scores: these systems don’t give a person a fair chance to prove themselves in the moment. They instead reduce a person to a score that pops out of a black box model. That score is often based only in part on the person’s past, and it mixes in statistics from a wider group. And it is further skewed by any mistakes made by the model’s creators.

Worst of all, the group that employs the predictive model reaps the benefits when it is correct but rarely suffers consequences when it is wrong. That means there is little incentive for the group to correct the model (or, better yet, to remove it from service) when things go awry. It’s a clear case of “Heads: I win; tails: you lose.”

2. Mind your upstream data provider. Beware of changes to the numbers they provide.

If IB’s lesson is about when to (not) employ a predictive model, the universities’ lesson is to be mindful of using data you didn’t collect yourself.

When you source data from an external vendor, it pays to understand how it came to be. (This is known as the data generating process (DGP), which is a link in what I call the data supply chain.) This goes double when the vendor furnishes a distilled data product, such as scores or classifications, as opposed to raw data that you can inspect for yourself. Changes in how that product is calculated will shift the values it emits, which will in turn have a material impact on any models you build on top of it. And even if the new values are “more correct” because they better reflect what you are trying to represent, they are still “less correct” because downstream data consumers are tuned to the old values.

Consider a credit score as an example. Banks pull that number from upstream credit bureaus and factor it into their decision of whether to offer you a loan. If the credit bureaus radically change how they calculate the score, an individual’s old and new scores may differ substantially. This holds even if the average score across their test data set – which they may use as a quick check that everything is in order – remains the same.
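To make that concrete, here’s a minimal sketch in Python with made-up numbers: five hypothetical consumers whose average score is identical before and after the recalculation, even though every individual score has moved.

    # Hypothetical credit scores for the same five consumers, before and after
    # the bureau changes its calculation. All values are made up for illustration.
    old_scores = {"a": 700, "b": 640, "c": 720, "d": 580, "e": 660}
    new_scores = {"a": 650, "b": 700, "c": 670, "d": 640, "e": 640}

    mean_old = sum(old_scores.values()) / len(old_scores)
    mean_new = sum(new_scores.values()) / len(new_scores)
    print(f"mean old: {mean_old:.0f}, mean new: {mean_new:.0f}")  # both print 660

    # The aggregate check passes, yet every consumer's score has shifted.
    for consumer in old_scores:
        delta = new_scores[consumer] - old_scores[consumer]
        print(f"{consumer}: {old_scores[consumer]} -> {new_scores[consumer]} ({delta:+d})")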

The bank therefore needs to be able to compare the old and new scores – not on an aggregate basis, but for individual consumers – to backtest across historical data. This check would detect cases in which lending decisions would change based on the new scores. Someone could otherwise be denied credit, not because of their actual financial history, but because they applied for a loan the day after the upstream credit bureaus changed their calculations.
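A sketch of that kind of backtest, again with hypothetical scores and a hypothetical approval threshold of 660, might look like this:

    # Re-score historical applicants with both the old and new credit scores and
    # flag anyone whose approve/deny decision would flip. The threshold and the
    # scores are hypothetical; a real backtest would replay the bank's actual policy.
    APPROVAL_THRESHOLD = 660

    def would_approve(score: int) -> bool:
        return score >= APPROVAL_THRESHOLD

    def flag_decision_flips(old_scores: dict, new_scores: dict) -> list:
        """Return (consumer, old, new) for anyone whose decision changes."""
        flips = []
        for consumer, old in old_scores.items():
            new = new_scores[consumer]
            if would_approve(old) != would_approve(new):
                flips.append((consumer, old, new))
        return flips

    old_scores = {"a": 700, "b": 640, "c": 720, "d": 580, "e": 660}
    new_scores = {"a": 650, "b": 700, "c": 670, "d": 640, "e": 640}

    for consumer, old, new in flag_decision_flips(old_scores, new_scores):
        print(f"consumer {consumer}: decision flips ({old} -> {new})")

In this toy example, three of the five consumers see their lending decision flip even though the average score never moved, which is exactly the kind of change an aggregate check would miss.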

Lessons learned (?)

My hope is that the IB story serves as a cautionary tale for groups that choose to deploy predictive models, as well as for those who rely on those models’ outputs. There are many ways an ML/AI model can prove incorrect or unreliable, so it’s unwise to treat its output as gospel.