ChatGPT-like AI models are running out of text to train on, claims UC Berkeley professor

AFP

Stuart Russell, an artificial intelligence expert and professor at the University of California, Berkeley, has raised concerns that AI-powered language models such as ChatGPT may be "running out of text in the universe" to train on.

He explained that the technology behind AI bots, which rely on vast amounts of text data, is "starting to hit a brick wall".

Russell shared this insight during an interview with the International Telecommunication Union, a UN communications agency, last week. He emphasised that there is a finite amount of digital text available for these language models to consume.

The implications of this text scarcity may influence the future practices of generative AI developers as they collect data and train their technologies.

However, he maintained his belief that AI will increasingly replace humans in various language-dependent jobs. Russell referred to these jobs as "language in, language out" tasks during the interview. His comments contributed to the ongoing discussion surrounding data acquisition practices conducted by OpenAI and other developers of generative AI models.

Concerns have been raised by creators worried about their work being replicated without consent, as well as by social media executives dissatisfied with the unrestricted use of their platforms' data. Russell's observations drew attention to another potential vulnerability: the scarcity of text available for training these models.

A study conducted in November by Epoch, a group of AI researchers, estimated that machine learning datasets are likely to exhaust all "high-quality language data" before 2026. The study defined "high-quality" language data as text from sources like "books, news articles, scientific papers, Wikipedia, and filtered web content".

Today's most popular generative AI tools, powered by large language models (LLMs), were trained on massive amounts of published text extracted from public online sources, including digital news platforms and social media websites. The practice of "data scraping" from the latter was a contributing factor behind Elon Musk's decision to limit daily tweet views, as he previously stated.

Russell highlighted in the interview that OpenAI, in particular, had to supplement its public language data with "private archive sources" to develop GPT-4, the company's most robust and advanced AI model to date. However, he acknowledged in an email to Insider that OpenAI has yet to disclose the exact training datasets used for GPT-4.

Recent lawsuits filed against OpenAI allege that datasets containing personal data and copyrighted materials were used to train ChatGPT. Notably, a prominent lawsuit filed by 16 unnamed plaintiffs asserts that OpenAI used sensitive data such as private conversations and medical records.

Another lawsuit, involving comedian Sarah Silverman and two additional authors, accused OpenAI of copyright infringement due to ChatGPT's capability to generate accurate summaries of their work. Authors Mona Awad and Paul Tremblay also filed a similar lawsuit against OpenAI in late June.
