While Parliament’s elected chamber, the House of Commons, has been concerned with AI safety this month – plus war, pestilence, and the King’s Speech – the second chamber, the House of Lords, has been looking at Large Language Models (LLMs). Indeed, it has been exposing issues that the Bletchley Park Summit largely ignored, given the latter’s focus on frontier technologies.
On 7 November the Lords’ Communications and Digital (Select) Committee explored the vexed issue of copyright and LLMs, in a world where many creators believe that AI companies have torn up the concept of a proprietary work.
Expert witnesses at a session chaired by Baroness Stowell were Dan Conway, CEO of the Publishers Association; Arnav Joshi, Senior Associate at law firm Clifford Chance; Richard Mollet, Head of European Government Affairs at RELX (multinational parent of LexisNexis, Elsevier, and others); and Dr Hayleigh Bosher, Reader in Intellectual Property Law at Brunel University London.
In an earlier session, Google claimed it “always seeks to be compliant with IP laws” when training AI models, a statement echoed by Amazon and Meta. But what did they mean by compliance, and with whose laws? And isn’t “seeks” just a vague ambition rather than a verifiable act?
Speaking on behalf of the UK’s publishers, Conway acknowledged that while LLMs can be a force for good – one used by the creative industries themselves – the elephant in the room needs urgent recognition:
AI is not being developed in a safe and responsible, or reliable and ethical way. And that is because LLMs are infringing copyrighted content on an absolutely massive scale.
He cited examples such as some LLM makers’ use of the Books3 data set (nearly 200,000 pirated titles), then added:
They are not currently compliant with IP law. We've had conversations with technical experts around the processes undergone via these LLMs, and it is our contention to the Committee that these LLMs do infringe copyright, at multiple parts of that process.
This is not a new claim: class actions in the US are based on it, while Caroline Cummins, the Association’s Head of Policy and Public Affairs, used similar words in September at a Westminster Forum on AI regulation.
Naysayers to these arguments claim that reading a book, learning from it, and using that knowledge in daily life is not only legal, but desirable. So why shouldn’t that principle apply to machine intelligence? Dr Bosher explained:
Is the purpose of you reading a book to benefit commercially from the story within? No, you are just enjoying and consuming the story. But if the purpose of an AI ‘reading’ a huge dataset of information – of value, that's owned by copyright – is to create a new business and be remunerated themselves from something created by someone else, then that's typically a licence model.
In fairness, we should allow that some people might read a book for commercial gain – a ‘how to’ text, for example – but generally they will have bought that text, so the author is paid.
Meanwhile, other naysayers claim that academic inquiry is protected by fair-use conventions, and training an LLM is simply a new form of research. Some believe that all information should flow like water to benefit humanity, free of proprietary dams (perhaps not an argument that resonates in a UK currently surrounded by sewage).
Despite these perspectives, Dr Bosher explained that the Publishers Association’s position – that AI developers are engaged in wholesale copyright infringement – is legally sound:
Copyright is very much a technological and cultural tool that needs to be applied in different circumstances. And we try to write copyright law in a way that is technologically neutral – to the degree that it lasts a long time, even when technologies evolve.
The principle of when you need a licence and when you don't is clear: to make a reproduction of a copyright-protected work without permission would require a licence or would otherwise be an infringement. And that is what AI does, at different steps of the process – the ingestion, the running of the programme, and potentially even the output.
Some AI developers are arguing a different interpretation of the law. I don't represent either of those sides, I'm simply a copyright expert. And from my position, understanding what copyright is supposed to achieve and how it achieves it, you would require a licence for that activity.
However, she acknowledged that not every AI company has scraped the Web and/or used unlicensed data to train its models. (NB – diginomica is aware of generative image tools that only use licensed content.) Yet due to the lack of transparency among many developers, it is often unclear whether permissions have been sought and/or licences obtained.
Indeed, some Big Techs’ size and internal complexity make evidence of infringement much harder to find, even if one might infer it from an AI’s output.
Perhaps the claim made by Google, Amazon, Meta, and others, that they “seek to be compliant” refers to a different legal system, i.e. the US? RELX’s Mollet said:
Of course, they are operating in a different jurisdiction […] and some people do maintain – I would say, and our counsel says, erroneously – that US law allows this [fair use of unlicensed content for commercial purposes]. But UK law – and indeed, EU law – is pretty clear. If you are reproducing works for the purposes of text and data mining, and you are a commercial entity, then you have to obtain the permission of the rights holder. And if you haven't got that permission in the UK, then that is an infringement.
My own research on US copyright law reveals that part of the challenge is that jurisdiction’s own lack of clarity. Take the following statement from the US Copyright Office (which is itself engaged in an industry-wide consultation). At the time of writing, this is the status quo:
Courts […] are more likely to find that non-profit educational and non-commercial uses are fair. This does not mean, however, that all non-profit education and non-commercial uses are fair, and all commercial uses are not fair; instead, courts will balance the purpose and character of the use against the other factors below.
Additionally, ‘transformative’ uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.
In addition, it says:
Using a more creative or imaginative work (such as a novel, movie, or song) is less likely to support a claim of a fair use than using a factual work (such as a technical article or news item).
Then factor in this additional statement from the US Copyright Office:
Courts look at both the quantity and quality of the copyrighted material that was used. If the use includes a large portion of the copyrighted work, fair use is less likely to be found; if the use employs only a small amount of copyrighted material, fair use is more likely.
But it adds:
That said, some courts have found use of an entire work to be fair under certain circumstances. And in other contexts, using even a small amount of a copyrighted work was determined not to be fair because the selection was an important part – or the ‘heart’ – of the work.
Got all that? But despite this frustrating lack of precision – which seems designed for expensive lawyers to argue each case in court (ker-ching!) – it appears that any hypothetical claim of ‘fair use’ when ingesting entire libraries of unlicensed novels or songs would be spurious.
That said, the word ‘transformative’ seems likely to be the key legal battleground for generative AI – isn’t transformation what those systems do? Sometimes. But increasingly, enterprises use them simply to summarize and explain existing information.
Even so, the real heart of the matter is money. For example, the huge financial risk involved in any author, artist, or musician taking a trillion-dollar corporation to court. A cynic might observe that this constitutes a licence of a different kind: for unethical behaviour on an epic scale – behaviour for which, the Publishers Association believes, there is clear evidence.
All of which brought the Lords to a more interesting question: Are traditional sectors such as publishing obstructing innovation? Are their claims of copyright infringement just a form of institutional Luddism – King Canute trying to hold back a wave of change?
Dr Bosher was having none of it:
There is no evidence to suggest that copyright is a barrier to innovation in that context. We can see that, for example, with the UK IPO [Intellectual Property Office] consultation on the text and data-mining exception, lots of tech firms did not opt for [i.e. did not support] the broadest option.
This refers to the withdrawal (in March 2023) of a proposal to allow broad IP law exceptions for data mining when training AI and ML systems, including for commercial purposes.
This demonstrates that [copyright] is not a barrier to them. And that they understand that the tech industry and creative industries are not separate entities. Creative Industries are also developing AI, and AI companies are also creative. And so, everyone can benefit from the copyright framework.
So, we don't believe it is a barrier. The purpose of copyright is to encourage creativity and innovation, and also the dissemination of that creativity and innovation for culture and knowledge. And it does that by balancing the protection of the creator’s output with limitations, such as exceptions or the length of copyright.
All of which seems clear and fair – if copyright is such a barrier, then why did AI companies not seek to demolish it when they had the chance? The answer is that the tech sector itself is hugely reliant on IP. Those companies don’t want to create precedents in which a holder’s IP can simply be cast aside in the interests of commercial research. Fancy that!
RELX is both a traditional publisher and a deep user of AI and data analytics, explained Mollet – a model that is fast becoming the norm. So, what conclusion could he draw from having both perspectives at the heart of the business? Cognitive dissonance? He said:
In all of those areas, we are taking data and content – some of it proprietary, so we're a rights holder – then aggregating it with other data. Then applying analytic tools, including AI and generative AI, to help our business customers and improve their decision-making.
But from both perspectives, we would agree […] with what's being said from a copyright point of view. We think it's vital that there's transparency about what's going into the models, not only so creators can be rewarded and credited, and give their consent and get compensation, but also to incentivize the creation of high-quality data.
Unless we can trade off intellectual property rights, there's no incentive for companies in the long run to ensure that data is of the highest possible quality, that it's peer-reviewed and authenticated. And in the two areas where we work, scientific and legal research, it's absolutely vital to have that quality. So, copyright is important for that reason.
Then he added:
I think for any AI developer, the cliché of ‘garbage in, garbage out’ is never more apposite than in this world of generative AI. Unless we can see what's going in, including protected works, then we can't have trust in the outputs. For both those reasons – and from both sides of our business – we think copyright should be upheld. And I certainly agree that there's strong evidence that it hasn't always been.
A respectable position, with which I broadly concur. But as I say above, the underlying issue is money – on both sides of the argument.
For evidence, another part of the AI industry said the quiet part out loud this month. In a written note, VC investors Andreessen Horowitz (A16z) went as far as claiming that the billions of dollars ploughed into AI companies have been “premised on an understanding that, under copyright law, any copying necessary to extract statistical facts is permitted.”
Fascinating, as the clear implication is that developers have told investors that copyright won’t be a problem. However, the focus on “statistical” is interesting, and perhaps designed to create leeway over other forms of copyrighted material. Though couched in the vaguest terms, US copyright law is clearly designed to protect creative works at scale.
But then the company added a statement that may take your breath away:
Under any licensing framework that provided for more than negligible payment to individual rights holders, AI developers would be liable for tens or hundreds of billions of dollars a year in royalty payments.
Yup, there it is, folks: in black and white. The context for those comments was the US Copyright Office’s own call for evidence, during which OpenAI, Meta, and others, have admitted they would have to figure out how to pay copyright holders. This would impose an intolerable burden on them, they claimed.
Pity the poor billionaires having to pay those awful artists, academics, writers, and creators. You know, the ‘paying creative humans’ problem that, one might logically infer, AI is here to solve – rather than, say, famine, climate change, war, and disease.
During the US proceedings, Google referred to Web scraping as “knowledge harvesting”. That description ignores an obvious fact: an author who has spent a year, a decade – or a lifetime, perhaps – researching and writing a book has probably not done so for a trillion-dollar corporation to grab it and put it in an AI system, one that means nobody need buy that book anymore.
All of which brings us to the real question – in ‘knowledge harvesting’, who is driving the harvester? Not the little guy who planted the crops, it seems.