UK A-level algorithm fiasco a global example of what not to do - what went wrong and why

By Cath Everett, August 24, 2020
The UK government ignored its own guidance and refused external help, resulting in an A level algorithm disaster that disproportionately impacted students from disadvantaged backgrounds. But what exactly went wrong and what are the implications?

(Image of an A-level student behind a pile of books, by Wokandapix from Pixabay)

The reputation of UK government technology projects - and by extension that of the government itself - has been hard hit lately, not least due to last week's A-level exam results disaster.

Firstly, there have been the seemingly endless development U-turns and delays to the launch date of the Covid-19 contact tracing app.

Then there was the Home Office's decision earlier this month to scrap a controversial artificial intelligence (AI) system that had been in use since 2015 to process visa applications, ahead of a judicial review requested by charity the Joint Council for the Welfare of Immigrants. Campaigners claimed the software created a "hostile environment" for migrants and was biased in favour of white applicants.

But the third and most embarrassing debacle of all is undoubtedly the A-level exam results fiasco. Even though exam regulator Ofqual had been warned by external advisors, including the United Learning schools trust, that the mathematical (rather than AI) algorithm being used to assess results was "volatile" and risked producing erratic outcomes, it pressed ahead anyway under ministerial pressure to prevent grade inflation.

This situation resulted in students from disadvantaged backgrounds having their results disproportionately downgraded, while those attending public schools benefited from being awarded higher marks. The subsequent outcry forced the government to ditch the algorithm and go with teacher-assessed grades, which saw affected students scramble belatedly to find a place at oversubscribed universities.

So what exactly went wrong here and what are the likely implications? According to Professor Jo-Anne Baird, director of the Department of Education at the University of Oxford, who is a member of Ofqual's Standing Advisory Group, because the regulator was specifically directed to deliver exam results that were not subject to grade inflation, the algorithm model developed was as good as it was possible to get. She explains:

Mathematical models never predict perfectly, especially in the field of human learning. But the Secretary of State's remit to Ofqual was to produce a system that brought about broadly comparable results with those of the past. So it wasn't possible just to use teachers' grades as they're not comparable. This meant some statistical moderation was needed to produce a model that worked best within the parameters set.

The problem was that the algorithm could not draw on concurrent attainment grades from standard coursework, given the current predilection for relying on exams to assess students' abilities. As a result, forecasts were based on the data that was available: how individuals had performed at GCSE and the historical performance of their school.
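To see why moderating against a school's past results can cap individual students, consider a drastically simplified sketch of this kind of statistical moderation. This is an illustration only, not Ofqual's actual model: all names and figures below are hypothetical, and the real standardisation process was considerably more complex.

```python
# Illustrative sketch only: grades are assigned so that a school's results
# match its own historical grade distribution, regardless of how strong
# this year's cohort actually is. All data here is hypothetical.

def moderate_grades(ranked_students, historical_distribution):
    """Assign grades to a school's students, ranked best-first by teachers,
    so the cohort's grade distribution mirrors the school's historical one.

    ranked_students: list of student names, strongest first.
    historical_distribution: dict mapping grade -> fraction of the school's
    past students who achieved it (fractions sum to 1). Relies on dicts
    preserving insertion order (Python 3.7+), so list grades best-first.
    """
    n = len(ranked_students)
    grades = {}
    cursor = 0
    cumulative = 0.0
    for grade, fraction in historical_distribution.items():
        cumulative += fraction
        # Cumulative share of the cohort that should get this grade or better.
        upper = round(cumulative * n)
        for student in ranked_students[cursor:upper]:
            grades[student] = grade
        cursor = upper
    # Any rounding remainder falls to the lowest grade in the distribution.
    for student in ranked_students[cursor:]:
        grades[student] = grade
    return grades

cohort = ["Asha", "Ben", "Carl", "Dina", "Ed", "Fay", "Gita", "Hal", "Ian", "Jo"]
history = {"A": 0.2, "B": 0.3, "C": 0.5}  # hypothetical school history
print(moderate_grades(cohort, history))
```

Under a scheme like this, a strong student at a school that has historically produced few top grades simply cannot be awarded one, whatever their teachers predicted: this is the mechanism behind the disproportionate downgrading described above.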

As good as it gets

In other words, the two-thirds accuracy levels attained by the algorithm were simply "as good as it got". But this situation was known about in advance, and "the government would also have been aware of the extent to which it worked", Baird says. She continues:

There's no way you can improve on the data when that's all you have available. These systems only reproduce what's gone before and entrench inequalities already in the data, which is a real problem. So you have to be careful of the ethics of how this is done and move very slowly. Educational technology sales pitches make big claims that don't pan out, so technological development is needed, but the real issue is conceptual. It's about understanding what the models should look like and what the data can and can't do - and this will only get more important as we move to AI.

Another big problem though, Baird says, was the lack of "joined-up" or holistic "systems thinking" around what to do about any injustices thrown up by the system, the knock-on effects and how to mitigate them:

The algorithm might have carried public confidence if it had been accompanied by an appeals system for those suffering a clear injustice. It's a huge systems issue to get millions of results out the door so it would have given credibility to the results for the majority.

Katja Bego, principal researcher in innovation foundation Nesta's technology futures and explorations team, agrees. But she points out that a key problem in many government tech projects is the desire to do things too quickly, which leads to governance issues being overlooked:

There's never going to be a perfect algorithm or a completely fair system, so you have to get the governance right. But the lack of governance, robust systems and human decision-making here created a toxic combination and turned it into a fiasco. Also nothing was done about the warnings from The Royal Statistical Society (RSS) a few months ago, which said this could happen. So there was no real public scrutiny, although things might have been quite a bit better if there'd been more oversight and transparency.

The RSS, according to an article by the Ada Lovelace Institute entitled ‘Can algorithms make the grade?', had identified concerns over the composition of the government's technical advisory group, which consisted mainly of "government employees or current or former employees of the qualification regulators". As a result, the RSS claimed it asked some of its "independent Fellows" to join the group, but was "met with the condition of a strict, five-year non-disclosure agreement", which it declined.

But Ofqual's chair Roger Taylor has now disputed the Society's allegation and attested that the non-disclosure agreement only pertained to confidential data. The regulator has also published the non-disclosure agreement to back up its case. 

Compounding the apparent absence of independent, external scrutiny, the government also chose to ignore its own ethics guidelines. Examples include the Department for Digital, Culture, Media & Sport's (DCMS) ‘Data ethics guidance', which was published in June 2018.

The importance of ethics

These were followed in June last year by guidelines on ‘Understanding artificial intelligence ethics and safety' from the Office for AI, a joint Department for Business, Energy & Industrial Strategy and DCMS unit, which would have been just as applicable to the A-level results system, based as it was on a mathematical algorithm, as to any AI system. As Elena Sinel, CEO of Acorn Aspirations and Teens in AI, both of which provide young people with the opportunity to learn tech skills, points out:

When you design a system, you need checks and balances and a checklist to go through. There are various ethical frameworks designed to do just that, including the government's own, so it's just a shame it decided not to follow them. But there are also lots of advisory bodies it could have consulted, such as the Alan Turing and Ada Lovelace Institutes. Whenever the government deploys technology, there has to be public scrutiny as that's what democracy's all about. You can't just deploy things in secret and then try to justify your actions later. And you absolutely have to have risk assessment and mitigation measures in place.

But in the government's defence, Bego says, there is currently no global "gold standard" among ethical frameworks, most of which are "not usually very practical". There is also very little legislation to guide organisations. This means that many in both the public and private sectors either go for a "free and open interpretation" of the rules or develop their own.

In a bid to do something about the situation, Bego says, Nesta is working with the cities of Amsterdam and Helsinki to build an AI Registry based on rules laid down by the European Commission's European Group on Ethics (EGE) in Science and New Technologies.

The aim of the Registry is to make it transparent to the public how, where and in what context algorithms and data are being used across the public sector. The overarching aim of the initiative, however, is to implement the EGE's ethics rules and develop a practical set of recommendations for procurement, development and implementation on that basis, the goal being to set a gold standard that can be used around the world.

As for the likely impact of the A-level debacle though, she believes the repercussions will be long and deep, not least in terms of the damage done to the international reputation of UK government IT:

Globally, this is probably the highest profile case of this kind of thing going horrendously wrong, and of a system being so biased and discriminatory. It'll become a case study of how not to do things, but it'll also have quite a big impact on public trust and, as a result, we might see people pushing for more accountability and transparency in other areas of life too, like car insurance. In fact, I wonder if this mightn't be the story that changes things - it's so big and its impact is so visible. Algorithms weren't easy to explain in the past and it was necessary to give semi-hypothetical examples to show the risks, but now you can cite the A-level situation as it's so extreme.

But Bego also believes that this scenario could have implications for uptake of the UK Covid-19 contact-tracing app when it appears, due to question marks over the government's internal development processes and governance procedures. She explains:

It could be a good moment for the government to become more deliberate in terms of trust-building as it's clear that those countries with effective processes generally have higher adoption rates.

My take

There seem to be a number of lessons to be learned from this debacle, so here are a few:

Firstly, don't rush tech initiatives through for the sake of expediency, neglecting vital governance and risk mitigation activities in the process.

Secondly, ensure you use an appropriate data set to achieve your aims as any results will only be as good as the information your algorithm has to go on.

Thirdly, ensure your aims fit with those of your stakeholders - preventing grade inflation was patently not top of mind for the students affected by this scenario. Fairness was.

[This article was updated after publication to add Ofqual's rebuttal of the RSS's statements relating to its non-disclosure agreement].