    Using Copyrighted Content to Train AI: Can Licensing Bridge the Gap?

    • 01.03.2025
    • By Hugh Stephens
    Hugh Stephens Blog

     

    The struggle between authors (writers, artists, musicians) and AI developers over the unauthorized and uncompensated use of copyrighted works to train AI applications continues, both in the courts (here is a summary of the current state of play in the US, where most of the litigation is taking place) and in the political arena, such as the UK government’s latest initiative to put its thumb on the scale in favour of the AI industry, now slowed down by opposition within Parliament. The creative industries in Britain are still nervous, however, as demonstrated by the coordinated “Make it Fair” campaign organized by leading UK newspapers on February 25. While the courts may provide some guidance, it is unlikely to be dispositive and is almost certain to be somewhat contradictory and slow in coming, given the appeal process that will play out. With new applications being rolled out every day, AI appears to be unstoppable. Let’s accept that this is the case. If so, what then will be the rules governing the use of AI training content, particularly content that is protected by copyright, such as books, journalistic output, paintings, musical compositions etc.?

    It is already apparent that at least some of the AI output trained on these materials will compete in the marketplace with the original works. If that is the case, then surely some of that additional value should be shared with those who helped create the content initially. The way this will most likely be done is through licensing in the form of payment and permission for use of the copyrighted creative output that enabled the training to take place. Licensing would also help resolve another potential issue, the possibility that the final product produced by the AI algorithm infringes on the copyright of the works on which it was trained. This is unlikely to happen in the case of written works but is certainly possible with graphic or musical works.

    While many have called for licensing as a solution, there are many challenges to be overcome to make it work effectively. Yet some licensing is already taking place between AI developers and owners of well-delineated data sets. As an example, various newspaper and magazine publishers have already reached licensing agreements with AI providers. OpenAI has signed licensing deals with the Wall Street Journal, Times of London, the Financial Times, Time, Le Monde, Axel Springer and others. This is in marked contrast to OpenAI’s relationship with the New York Times, which has led to one of the most prominent lawsuits in the field, with the Times suing OpenAI for copyright infringement. The reason for this lawsuit, of course, is that licensing negotiations between the two entities broke down. Some photo and image licensing companies have concluded AI deals (Shutterstock is the most prominent example) while others, such as Getty Images, have not (Getty is suing Stability AI in the UK). Eventually most of the institutional or corporate holders of valuable content in one form or another will likely reach, or attempt to reach, licensing deals with the major AI developers. But that still leaves out an awful lot of copyright-protected content.

    The conundrum is how to deal with the millions of individual creators who produce content in different formats, and tie them into a workable licensing regime. The first challenge is how to even figure out who is producing content that is likely to be used by AI developers. The second is to calculate how much that use is worth. Then there is the challenge of how to administer a collective licensing scheme in a way that is both practical and affordable, and where the small royalties for individual works are not swamped by the administrative costs of collection and disbursement. Finally, there is the question of how to resolve the issue of competing licensing organizations in order to provide more or less one-stop-shopping for the AI industry.

    It is worth noting that one-stop-shopping currently does not exist in any area of collective licensing. Different collectives represent creators in different fields, so music, publishing, art, broadcasting and visual arts licensing are all represented by different organizations, in some cases with more than one collective in a given field. The Copyright Board of Canada lists 36 copyright collectives on its website. I haven’t seen a definitive list for the US, but this university website lists about the same number.

    Whereas users of music only have to deal with a handful of CMOs (collective management organizations), and users of text-based content (online or offline) need only acquire a reprographic license from the major licensing collectives for published works, such as the Copyright Clearance Center in the US or Access Copyright in Canada, AI developers access the full gamut of content. It will be challenging to make access easy for the AI development industry, a point developed by Dr. Pamela Samuelson of the University of California, Berkeley, a well-known copyright scholar (and skeptic, let it be added). In her recent paper in the UCLA Law Review (“Fair Use Defenses in Disruptive Technology Cases”), Samuelson focuses primarily on the question of fair use, as suggested by the title, but also examines the issue of a collective licensing regime for generative AI development. She manages to raise just about every objection conceivable (see pp. 80-86 of the document for more details):

    -generative AI uses all forms of content, therefore the licence would have to be very broad
    -an issue would arise as to whether content for training was used just once, or on repeat occasions
    -it would be very difficult and costly to administer given that there could be literally billions of creators involved
    -creators would get very little revenue; the bulk would go to the administering agencies, the CMOs
    -it would be difficult to determine value and to set a price on each transaction
    -what about orphan works?
    -differing national regimes might create confusion; alternatively some countries might not require a licence payment, giving them an unfair advantage
    -it would be unfair to startups since the incumbents have already scooped volumes of content without payment.

    She notes that creators may lose out, but since AI will affect the livelihoods of so many others, this is not exclusively a copyright problem. Tough luck, creators.

    Clearly Dr. Samuelson is not in favour of a collective licensing regime for content appropriated by AI developers, yet despite her firehose of cold water, there are a number of promising developments in this area. For example, the Copyright Clearance Center (CCC) in the US recently announced it would provide AI re-use rights within its Annual Copyright Licenses, making the CCC’s licence “the first-ever collective licensing solution for the internal use of copyrighted materials in AI systems.” Note the caveat. While covering re-use of content for AI applications, the CCC makes it clear that:

    “The license enables participating rightsholders to fulfill the needs of companies that require an efficient way to legally acquire the rights to use copyrighted materials within AI systems for internal use.”

    Not training. The Copyright Agency in Australia has done something very similar.

    “Starting from February 2025, Copyright Agency will extend its Annual Business Licence to cover staff of licensed businesses who include third party material in prompts for AI tools (and) copy and share outputs from AI tools with colleagues”.

    However, it does not apply to AI training and does not allow capture of the content outside the business, such as by an externally provided AI tool.

    Likewise, the Copyright Licensing Agency (CLA) in the UK issues a Text and Data Mining (TDM) Licence. The CLA’s website explains that TDM “is the process of transforming unstructured content into a structured format to analyse, extract and identify meaningful information and insights. By using TDM, organisations can harness the power of vast volumes of information and data, capturing and revealing key concepts, trends, and hidden relationships.” Sounds quite a bit like training generative AI, but it’s not.

    CLA’s TDM licence extension includes rights covering use of published content for TDM purposes. This does not cover the use of content in training or prompting Generative AI models.

    Canada’s equivalent CMO, Access Copyright, is actively examining the issue, as it notes in its new strategic plan for 2025-2028:

    “Like collective rights management organizations around the world, we will actively explore how we might enhance our corporate licence offerings to include uses related to AI, providing Canadian rights holders who wish to participate in the emerging market for AI licensing to do so, either in Canada or by virtue of reciprocal agreements with sister organizations.”

    It is clear that these Reproduction Rights Organizations (CMOs by another name) are cautiously feeling their way forward to find the appropriate role for collective licensing. Meanwhile the private sector has not been sitting idly by. Forbes reports that so many content aggregation startups have been established that they have formed a Data Providers Alliance. Recently launched “Created by Humans” is another commercial entrant that is pitching itself to authors.

    “Take control of your work’s AI Rights and get compensated for its use by AI companies.”

    As these new enterprises enter the market, it threatens to become quite crowded. Just as there are more and more AI companies, including new entrants like DeepSeek, a proliferation of new sector-specific content aggregators will make licensing more challenging. If the CMOs wait too long, they will face entrenched competition. Not all these new aggregators will survive. In the end, AI developers will not subscribe to multiple content licensors; they will go with the ones that provide the broadest coverage. It will be a Darwinian selection process.

    While this is happening, other countries are experimenting with the concept of extended collective licensing for AI content. This allows CMOs to grant licences on behalf of their members and non-members alike. An extended collective licence is not a compulsory licence but it could lead to such a system being established. Spain was first out of the gate, but has since pulled back after the proposed Royal Decree attracted criticism from many rights holders that it would restrict their options. Yet it is one of many solutions being tested.

    The recently released US Copyright Office report “Identifying the Economic Implications of Artificial Intelligence for Copyright Policy” includes an extensive discussion of licensing possibilities, including examining the pros and cons of a new statutory blanket licence. This would need to include a provision excluding rightsholders (such as entities that have already reached licensing agreements with AI developers) who have the ability and wish to issue voluntary licences that generate greater remuneration than a statutory payout would earn. This raises thorny opt-in/opt-out issues. Compromises will be required, but the challenges are not insurmountable.

    The trick is to devise a system that will capture as much content as possible while allowing some flexibility to rightsholders, allocating payments in a way that is fair and efficient (the USCO paper suggests that revenues associated with a work could serve as a rough proxy for its relative value), and at the same time minimizing administrative costs so that expenses do not exceed potential revenues for rightsholders holding limited content inventory. Can it be done?

    Despite the many obstacles identified by Dr. Samuelson and others, I am convinced that in the end collective licensing for content used in AI development and applications will become as accepted as the collective licensing regimes for use of various forms of copyrighted content today. The way forward won’t be straightforward; there will be zigs and zags. The courts and legislatures will play a role, as will authors, publishers, and the AI developers themselves. But in the end we will get there. Licensing, including some form of collective licensing, is the inevitable bridge that will bring AI developers and copyright holders together.

    This article was first published on Hugh Stephens Blog