OpenAI accidentally deleted crucial data related to its ongoing copyright lawsuit with The New York Times. The data was stored on a dedicated virtual machine provided to the plaintiffs, and OpenAI acknowledged the incident to the court in a recent filing. Attorneys for the Times say they lost a week’s worth of work on the case as a result.
OpenAI faces data loss setback in lawsuit with The New York Times
According to a letter from the Times’ legal team, “an entire week’s worth of its experts’ and lawyers’ work” was “irretrievably lost.” The plaintiffs were investigating claims that OpenAI’s models had been trained on unauthorized content. As part of this process, they spent more than 150 hours searching OpenAI’s training datasets for instances of copyright infringement. A report from TechCrunch indicated that the deletion occurred on November 14, when “programs and search result data stored on one of the dedicated virtual machines was erased by OpenAI engineers.”
The core of the lawsuit asserts that OpenAI, along with Microsoft, its partner that uses OpenAI’s technology in its Bing AI chatbot, infringed The New York Times’ copyright by using paywalled content without authorization. The Times claims OpenAI’s models produced “near-verbatim” replicas of its articles, which forms the basis of its argument for damages. OpenAI has consistently denied these allegations, maintaining that its models were trained on publicly available data and that such training qualifies as fair use under copyright law.
A spokesperson for OpenAI described the incident as a “glitch.” Although the company recovered most of the deleted data, critical elements, including “the folder structure and file names,” remain lost, leaving the recovered material unusable. As a result, the Times’ attorneys now face the challenge of restarting their evidence collection from the ground up. Despite the circumstances, the attorneys reported having “no reason to believe [the erasure] was intentional,” while stressing that OpenAI is best positioned to search its own datasets. They also noted the company’s reluctance to disclose details about its training data.
Further complicating matters, similar copyright claims have emerged against OpenAI. A recent lawsuit against the company by Raw Story and AlterNet was dismissed because the plaintiffs could not provide sufficient evidence of harm related to their allegations. In contrast, The New York Times has reportedly invested over $1 million in legal fees to pursue its case against OpenAI. That level of spending illustrates the distinct challenge smaller publishers face when taking on large tech companies.
OpenAI, meanwhile, has recently entered licensing agreements with several major media companies that allow it to train its AI models on their content in exchange for compensation and credit. Reports indicate OpenAI is paying publishing giant Dotdash Meredith at least $16 million annually for licensing rights, reflecting a strategy of seeking formal partnerships rather than ongoing litigation.
Image credit: Furkan Demirkaya/Ideogram