Machine Learning Is Not Your Copilot: AI System Accused of Violating Open Source Copyright Licenses

As previously reported in this space, the Court of Appeals for the Federal Circuit has ruled that an AI machine cannot be an inventor because it is not a “natural person.” You can read those posts here and here. Issues regarding AI and intellectual property have now crossed over into the realm of copyright and open-source licensing.

On November 3, 2022, a group of plaintiffs filed suit in the Northern District of California against several defendants, including GitHub, Inc., Microsoft Corporation, and OpenAI, Inc. and related OpenAI entities. The dispute stems from a product called Copilot and a product integrated into Copilot called Codex. Some backstory helps put the issue in context. The case is 4:22-cv-06823 (note that the website cited below, which follows this litigation, hosts documents incorrectly marked by the court as 3:22-cv-06823).

GitHub, Inc. provides a platform, github.com, that helps developers who use Git track and manage changes to a software project’s code. Git is an open-source version control system created by Linus Torvalds (of Linux fame). Users can sign up and use GitHub to host public code repositories for free, or can pay a fee to have GitHub host private code repositories.

In June 2021, GitHub and OpenAI launched a product called Copilot. Microsoft purchased GitHub in 2018 and has become the largest investor in OpenAI, which is why Microsoft is a named defendant. In August 2021, OpenAI launched Codex, a program that converts natural language into code and is integrated into Copilot. Together, these programs promise to assist software developers by providing or filling in blocks of code using AI.
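
To illustrate the kind of assistance being described (a hypothetical sketch, not actual Copilot or Codex output), a developer might type a plain-language comment describing what they want, and the tool would suggest code to fill in beneath it, along these lines:

    # Developer's prompt, written as an ordinary comment:
    # return only the even numbers from a list

    # The kind of completion an AI assistant might suggest (hypothetical):
    def even_numbers(values):
        """Return the even integers from the given list."""
        return [v for v in values if v % 2 == 0]

    print(even_numbers([1, 2, 3, 4, 5, 6]))  # prints [2, 4, 6]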

Machine learning systems, a subset of AI but referred to simply as AI here, must undergo “training.” Training is a process in which the machine learning program derives its behavior from operating on (or “studying”) a set of material and/or data referred to as training data. It is the manner in which Copilot obtains and uses training data from the public repositories on GitHub that is at the heart of this dispute.
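
As a toy sketch of what “learning from training data” means (it says nothing about how Codex is actually built), the short program below derives its behavior from a handful of example code snippets rather than from hand-written rules: it counts which word follows which during “training,” then uses those counts to suggest a continuation.

    # Toy illustration of training on example snippets; not how Codex works internally.
    from collections import Counter, defaultdict

    training_data = [
        "def add(a, b): return a + b",
        "def sub(a, b): return a - b",
        "def mul(a, b): return a * b",
    ]

    # "Training": record which word tends to follow each word in the examples.
    next_words = defaultdict(Counter)
    for snippet in training_data:
        words = snippet.split()
        for current, following in zip(words, words[1:]):
            next_words[current][following] += 1

    # "Prediction": suggest the continuation seen most often during training.
    print(next_words["return"].most_common(1))  # [('a', 3)], learned from the data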

As mentioned above, GitHub and Git provide access to open-source software code. Open-source code, while typically free of charge, usually comes with a license that defines the terms and conditions under which other users may use the code. These licenses include attribution requirements, typically satisfied by comment lines in the code carrying the name and copyright notice of the original author. Some licenses also include terms of use that restrict the types of products or purposes for which the code may be used.
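
For illustration (the author name and year here are hypothetical), such an attribution notice might appear at the top of a source file as comment lines like these; the “SPDX-License-Identifier” line is a widely used convention for declaring which license applies.

    # Copyright (c) 2020 Jane Developer
    # SPDX-License-Identifier: MIT
    #
    # Distributed under the MIT License: copies or substantial portions of this
    # code must retain the copyright notice above and the license's permission notice.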

In their complaint, the class-action plaintiffs allege that Copilot provides output text derived from the plaintiffs’ and the class’s licensed materials without adhering to the license terms. They allege that Copilot “ignores, violates, and removes the Licenses offered by thousands—possibly millions—of software developers, thereby accomplishing software piracy on an unprecedented scale.” (Complaint, N.D. Cal., case 4:22-cv-06823, also available at githubcopilotlitigation.com.) The plaintiffs further allege that Copilot will displace a huge amount of previously public, open-source code by taking it and keeping it behind the GitHub paywall, allowing GitHub, OpenAI, and Microsoft to monetize something that was supposed to be free of charge.

The complaint alleges violations of the Digital Millennium Copyright Act (DMCA), 17 U.S.C. §§ 1201–1205. (Title 17 is the Copyright Act.) Specifically, it alleges that Copilot and Codex’s practice of ingesting and distributing licensed open-source code stripped of its attribution, copyright notices, and license terms violates the DMCA.

In addition, the complaint alleges reverse passing off, unjust enrichment, and unfair competition under the Lanham Act, 15 U.S.C. § 1125. It alleges that the defendants passed off the licensed materials as their own or Copilot’s creation and were unjustly enriched because users paid fees to use Copilot. It also alleges violations of California’s unfair competition law and of California’s privacy statute, the California Consumer Privacy Act (CCPA).

The complaint also includes contract-based claims: breach of the licenses governing the use of the open-source code for training, and republication of that code without the required attribution, copyright notice, and license terms; interference with contractual relations between the class and the public regarding the licensed materials by concealing the license terms; and fraud based on GitHub allegedly violating its own terms of service through the sale and distribution of the licensed code outside GitHub.

The plaintiffs filed a second class-action lawsuit (4:22-cv-07074) naming two additional defendants, and the Northern District of California has consolidated the two cases.

The defendants have not yet filed their answer, so at this point in the litigation only the plaintiffs’ side of the story is available. The complaint alleges that, in negotiations before the suit was filed, the defendants first offered a defense of “fair use.” The fair use doctrine, 17 U.S.C. § 107, weighs four factors: the purpose and character of the use; the nature of the copyrighted work; the amount and substantiality of the portion used; and the effect of the use on the potential market for or value of the work. While all of these factors come into play here, one aspect of the market-effect factor may prove persuasive: reproduction of entire software works can cause a fair use defense to fail. Questions remain about whether the output of Copilot and Codex does in fact copy most of the open-source code, and what effect that copying has on the market. In this case, the market comprises the developers who would otherwise freely share their code on GitHub.

The complaint further alleges that the defendants have stated that Copilot and Codex do not retain copies of the materials on which they are trained. This bears on the question above about the nature of the products’ output. The complaint counters that the output text of the products is nearly identical to the code in the training data, just without the attribution, license, or terms. Whether the output is in fact “nearly identical” will ultimately be for the court to decide.

We will keep you posted.

This article is provided for informational purposes only—it does not constitute legal advice and does not create an attorney-client relationship between the firm and the reader. Readers should consult legal counsel before taking action relating to the subject matter of this article.
