News Publishers Block Internet Archive Web Crawlers

Global news organizations are blocking the Internet Archive's web crawlers to prevent AI companies from using archived content to train models without permission. This move follows ongoing lawsuits against firms like OpenAI for copyright infringement. A separate court ruling denied copyright protection for fully AI-generated works, potentially preserving human roles in creative industries.

3 sources · May 1, 10:24 AM · 3m read

Around 245 news organizations from nine countries have begun blocking the Internet Archive's crawlers, which capture and archive web pages for the Wayback Machine. These blocks aim to stop AI companies from accessing historical content to train large language models without compensation or consent.

The Internet Archive holds over one trillion web pages dating back to 1996, including past articles from outlets like CNN, The New York Times, The Guardian, and USA Today. More than 20 major news organizations already block the main crawler, ia_archiverbot, according to an analysis by Originality AI.

At least one of the Archive's four bots is blocked by 241 news sites worldwide, many of them owned by USA Today Co., the largest U.S. newspaper publisher. This has effectively removed hundreds of local publications from historical records.

Content from the Internet Archive has appeared in key AI datasets, prompting lawsuits against companies like Perplexity and OpenAI for alleged copyright violations. News organizations argue this use competes directly with their original journalism.

“The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us,” said Graham James, a spokesperson for The New York Times, as cited by The Next Web.

Some outlets, such as The Guardian, have opted to limit rather than fully block the Archive's access. The Internet Archive's director stated that the organization is collateral damage in the dispute, with AI companies being the real issue. The Archive has implemented measures like preventing large downloads and limiting automated extractions to curb misuse.
