Substrate
News Publishers Block Internet Archive Web Crawlers

Global news organizations are blocking the Internet Archive's web crawlers to prevent AI companies from using archived content to train models without permission. This move follows ongoing lawsuits against firms like OpenAI for copyright infringement. A separate court ruling denied copyright protection for fully AI-generated works, potentially preserving human roles in creative industries.

Euronews
The Atlantic
thenextweb.com
3 sources · May 1, 2:24 PM · 3 min read

Around 245 news organizations from nine countries have begun blocking the Internet Archive's crawlers, which capture and archive web pages for the Wayback Machine. These blocks aim to stop AI companies from accessing historical content to train large language models without compensation or consent.

The Internet Archive holds over one trillion web pages dating back to 1996, including past articles from outlets like CNN, The New York Times, The Guardian, and USA Today. More than 20 major news organizations already block the main crawler, ia_archiverbot, according to an analysis by Originality AI.

At least one of the Archive's four bots is blocked by 241 global news sites, many owned by USA Today Co, the largest U.S. newspaper publisher. This has effectively removed hundreds of local publications from historical records.

Content from the Internet Archive has appeared in key AI datasets, prompting lawsuits against companies like Perplexity and OpenAI for alleged copyright violations. News organizations argue this use competes directly with their original journalism.

"The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us," said Graham James, a spokesperson for The New York Times, as cited by The Next Web.

Some outlets, such as The Guardian, have opted to limit rather than fully block the Archive's access. The Internet Archive's director stated that the organization is collateral damage in the dispute, with AI companies being the real issue. The Archive has implemented measures like preventing large downloads and limiting automated extractions to curb misuse.

More than 90 lawsuits have been filed by creators, including authors, musicians, artists, and news publishers, accusing AI firms like OpenAI, Meta, and Anthropic of using copyrighted works to train models without permission. The Atlantic is involved in one such lawsuit against Cohere.

These cases highlight concerns about the future of creative labor. In Thaler v. Perlmutter, a federal appeals court ruled in 2025 that works generated autonomously by AI cannot receive copyright protection, as copyright requires a human author. The Supreme Court declined to review the decision in March 2026.

This leaves open questions about how much AI involvement renders a work uncopyrightable. The ruling has economic implications for industries reliant on monetizing intellectual property through licensing. Entertainment companies like studios, record labels, and book publishers depend on copyright to generate revenue from films, music, and books.

Major players have avoided fully AI-generated content to maintain copyrightability. Netflix's production guidelines warn against using AI for main characters, key visuals, or central settings without approval. Hachette pulled the book Shy Girl after allegations of AI-written portions.

These decisions reflect business pragmatism, as uncopyrightable AI content cannot be licensed or protected from copying. The prohibition incentivizes keeping human creators involved to preserve profitable IP models.

OpenAI shut down its video tool Sora months after announcing a licensing deal with Disney. Sources suggested that high costs and a lack of popularity contributed, but the inability to copyright AI-generated output may also have factored in, because it makes large investments unviable.

The Copyright Office has suggested that human prompting alone is insufficient for copyright on AI outputs, though courts have not yet ruled on the question. There are calls for harsher penalties for misrepresenting AI involvement in registrations.

Advocacy group Fight for the Future launched a petition, signed by 100 journalists, protesting blocks on the Archive. They argue preservation is crucial for accountability, as the Wayback Machine tracks edits to articles. Some news organizations are seeking compromises with the Archive to limit access without full blocks.

Key Facts

245 organizations: blocking Internet Archive crawlers
Over 90 lawsuits: filed against AI firms for copyright infringement
Thaler v. Perlmutter: ruled AI-generated works uncopyrightable
241 news sites: block at least one Archive bot
Sora shutdown: OpenAI ended video tool post-Disney deal

Story Timeline

6 events
  1. Mar 2026

    OpenAI shut down its video-generation tool Sora after announcing a licensing deal with Disney.

    1 source: The Atlantic
  2. Mar 2026

    Hachette pulled the book Shy Girl following allegations of AI-written portions.

    1 source: The Atlantic
  3. Mar 2026

    Supreme Court declined to review the Thaler v. Perlmutter decision on AI copyright.

    1 source: The Atlantic
  4. 2025

    Court of Appeals ruled in Thaler v. Perlmutter that autonomous AI works cannot be copyrighted.

    1 source: The Atlantic
  5. Recent months

    News organizations began blocking Internet Archive crawlers to prevent AI training use.

    1 source: Euronews
  6. 1996 onward

    Internet Archive started archiving web pages, now holding over one trillion.

    1 source: Euronews

Potential Impact

  1. Courts will define thresholds for human input in AI-assisted creations for copyright eligibility.

  2. News organizations will pursue more lawsuits against AI companies for using archived content.

  3. Entertainment companies will restrict AI use in production to preserve licensing revenue.

  4. Creative industries will maintain human involvement to ensure copyright protection for works.

  5. Internet Archive will negotiate compromises with publishers to limit AI access without full blocks.

  6. Advocacy groups will push for policies protecting digital preservation amid AI disputes.

Transparency Panel

Sources cross-referenced: 3
Framing risk: 55/100 (moderate)
Confidence score: 74%
Synthesized by: Substrate AI
Word count: 620 words
Published: May 1, 2026, 2:24 PM
Bias signals removed: 5 across 3 outlets
Signal breakdown: Amplifying 2, Speculative 1, Loaded 1, Framing 1
