The UK Government is doing nothing about AI companies scraping copyrighted content, preferring to let the courts decide if their actions are illegal.
That’s the suggestion of a new report by the House of Lords Communications and Digital Committee, a select committee of the UK Parliament, published this month in the wake of its 2023 Inquiry into Large Language Models (LLMs).
The 93-page report on AI policy and regulation, produced after the Committee heard from 41 expert witnesses in November and examined over 900 pages of written evidence, urges an immediate response from No 10:
The government has a duty to act. It cannot sit on its hands for the next decade until sufficient case law has emerged.
In response to this report, the government should publish its view on whether copyright law provides sufficient protections to rightsholders, given recent advances in LLMs. If this identifies major uncertainty, the government should set out options for updating legislation to ensure copyright principles remain future proof and technologically neutral.
Bold words. And the Committee’s view on AI providers’ actions is equally clear:
LLMs may offer immense value to society. But that does not warrant the violation of copyright law or its underpinning principles. We do not believe it is fair for tech firms to use rightsholder data for commercial purposes without permission or compensation, and to gain vast financial rewards in the process.
So, there we have it.
The Inquiry reached its conclusion after hearing evidence from all sides of the debate. Expert witnesses included: Dan Conway, CEO of the Publishers Association; Richard Mollet, Head of European Government Affairs at RELX (parent of LexisNexis, Elsevier, and others); Arnav Joshi, Senior Associate at law firm Clifford Chance; Dr Hayleigh Bosher, Reader in Intellectual Property Law at Brunel University London; plus senior representatives from Microsoft, Meta, Google DeepMind, Stability AI, Aleph Alpha, Holistic AI, Hugging Face, Mozilla AI, and others. (ChatGPT maker OpenAI was scheduled to appear, but its session coincided with the company’s temporary implosion.)
The report adds:
There is compelling evidence that the UK benefits economically, politically, and societally from upholding a globally respected copyright regime. [The creative sectors alone account for nearly six percent of total economic activity.]
The application of the law to LLM processes is complex, but the principles remain clear. The point of copyright is to reward creators for their efforts, prevent others from using works without permission, and incentivize innovation. The current legal framework is failing to ensure these outcomes occur and the government has a duty to act.
A clear and unambiguous view. The Lords were apparently persuaded by publishers’ belief that LLM companies’ scraping of unlicensed data to train their systems – including databases of known pirated material – was illegal. So, it is now incumbent on the government to do something about it.
Google told the Inquiry last year that it “always seeks to be compliant with IP laws” when training AI models, a statement echoed by Meta and others. But as we said in our report at the time, what did they mean by compliance, and with whose laws? And wasn’t “seeks” a vague ambition rather than a verifiable act?
Speaking on behalf of the UK’s publishers in November, Conway did not pull his punches:
AI is not being developed in a safe and responsible, or reliable and ethical way. And that is because LLMs are infringing copyrighted content on an absolutely massive scale.
Naysayers have long countered that reading a book, learning from it, and using that knowledge in daily life is not only legal, but desirable. So why shouldn’t that principle apply to machine intelligence?
Copyright expert Dr Bosher told the Inquiry:
Is the purpose of you reading a book to benefit commercially from the story within? No, you are just enjoying and consuming the story. [Strictly speaking, some books are read for commercial gain.]
But, if the purpose of an AI ‘reading’ a huge dataset of information – of value, that's owned by copyright – is to create a new business and be remunerated themselves from something created by someone else, then that is typically a licence model.
Standing up to the big money
Other naysayers have claimed that, because academic inquiry is protected by fair-use conventions, training an LLM is simply a new form of research. To which Dr Bosher responded last year:
The principle of when you need a licence and when you don't is clear: to make a reproduction of a copyright-protected work without permission would require a licence or would otherwise be an infringement. And that is what AI does, at different steps of the process – the ingestion, the running of the programme, and potentially even the output.
Some AI developers are arguing a different interpretation of the law. I don't represent either of those sides, I'm simply a copyright expert.
Well put. So, copyright holders have won the day – at least with Britain’s second chamber of Parliament. But it is now up to government to seize the initiative and stop trying to play both sides in the hope that the market will somehow deliver a win that everyone is happy with.
But aren’t these questions being dealt with on a voluntary, goodwill basis by the UK’s Intellectual Property Office (IPO)? Not anymore. Sadly – or perhaps fortuitously for copyright holders – publication of the Lords report coincided with the collapse of attempts by the IPO to draft a non-binding voluntary code.
It has been suggested that AI companies were unwilling to engage with any agreement that might involve massive licensing costs and thus “stifle innovation”.
However, given that LLMs are being developed by five of the world’s 10 richest companies – with OpenAI backed by the most valuable, Microsoft, whose $3 trillion market cap is roughly equal to UK GDP – those claims should be treated with the contempt they deserve. After all, licensing would not just be a cost to vendors, but also a step towards billions of dollars in uncontested revenue!
While well intentioned, work towards a voluntary code was also staggeringly naïve, given the bad-faith content-scraping that has clearly taken place. For example, the Books3 data set, which was used to train some LLMs, contained nearly 200,000 pirated texts, while some image-generation tools have produced works containing the watermarks of commercial image banks. In this light, any claims that content was scraped accidentally are absurd.
Even so, the Committee had hailed the IPO-led voluntary process, describing it as “welcome and valuable” in the report, though adding that “debate cannot continue indefinitely”.
On that now-abandoned scheme, the report says:
If the process remains unresolved by Spring 2024 the government must set out options and prepare to resolve the dispute definitively, including legislative changes if necessary.
So, with the voluntary scheme now scrapped, the government has no choice but to act. But by doing what exactly? Some clues may lie in the ashes of the IPO’s scheme, as presented in the report:
The [since abandoned] IPO code must ensure creators are fully empowered to exercise their rights, whether on an opt-in or opt-out basis.
Developers should make it clear whether their Web crawlers are being used to acquire data for Generative AI training or for other purposes. This would help rightsholders make informed decisions, and reduce risks of large firms exploiting adjacent market dominance.
The government should encourage good practice by working with licensing agencies and data repository owners to create expanded, high-quality data sources at the scales needed for LLM training. The government should also use its procurement market to encourage good practice.
The IPO code should include a mechanism for rightsholders to check training data. This would provide assurance about the level of compliance with copyright law.
Excellent suggestions. However, the House of Lords report is clearest on one point: AI companies were wrong to scrape copyrighted content, and this government must say so, and do something about it in law.
But that means No 10 doing the thing that seems harder for Rishi Sunak than for any other recent Prime Minister: standing up to big money, and backing the UK’s valuable copyright holders rather than the Big Tech he would prefer to appease.
The UK should seize the day in other areas too, urges the report:
The Government must continue to forge its own path on AI regulation, balancing rather than copying the EU, US or Chinese approaches. In doing so, the UK can strengthen its position in technology diplomacy and set an example to other countries facing similar decisions and challenges.
International regulatory co-ordination will be key, but difficult and probably slow. Divergence appears more likely in the immediate future. We support the government’s efforts to boost international co-operation, but it must not delay domestic action in the meantime.
Quite. Are you listening, Prime Minister?