AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%

written by TheFeedWired

The Surge in Bandwidth Consumption by Wikimedia Commons

The Wikimedia Foundation, the organization behind Wikipedia and various other collaborative knowledge initiatives, announced on Wednesday that the bandwidth used for multimedia downloads from Wikimedia Commons has increased by 50% since January 2024.

This surge, as the Foundation explained in a recent blog entry, is not driven by heightened interest from users seeking information. Instead, it stems from automated scrapers that are busily gathering data to train artificial intelligence models.

The Impact of Scraper Bots

Wikimedia explained that its infrastructure is built to absorb sudden surges in traffic from human readers, particularly during moments of heightened public interest. The traffic generated by scraper bots, however, is on a different scale entirely, and it brings significant risks and costs to the organization.

Wikimedia Commons serves as a free resource for a variety of multimedia, including images, videos, and audio files that are licensed for public use. Upon examining the data, Wikimedia revealed that nearly two-thirds (65%) of the most resource-intensive traffic originates from bots, while these bots account for only 35% of overall pageviews. This disconnect occurs because frequently accessed content is typically cached closer to the user, reducing costs, while less popular material is stored further away, making it more expensive to retrieve.

The Behavior of Human Users vs. Bots

The difference in traffic patterns highlights a behavior distinction: human users generally concentrate on specific topics, while crawler bots read extensively across numerous pages, including those that are less frequently visited. Consequently, these requests are often funneled to the core data center, increasing resource consumption significantly.
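To make the cost asymmetry concrete, here is a minimal sketch in Python. The tier names and cost figures are invented for illustration and do not reflect Wikimedia's actual architecture or numbers; the point is simply that a human session mostly hits the edge cache while a crawler sweeping the long tail mostly misses it.

    # Illustrative sketch only: invented cost figures, not Wikimedia's real setup.
    EDGE_CACHE = {"Popular_Image_1", "Popular_Image_2"}   # frequently requested files, cached near users
    EDGE_COST, CORE_COST = 1, 20                          # assumed relative cost of serving one request

    def serve(path: str) -> int:
        """Return the relative cost of serving one request."""
        if path in EDGE_CACHE:
            return EDGE_COST      # cache hit: cheap, served close to the user
        return CORE_COST          # cache miss: fetched from the core data center

    # A human session tends to revisit popular pages; a crawler sweeps the long tail.
    human_requests = ["Popular_Image_1"] * 9 + ["Rare_Scan_42"]
    crawler_requests = [f"Rare_Scan_{i}" for i in range(10)]

    print("human cost:  ", sum(serve(p) for p in human_requests))    # mostly cache hits
    print("crawler cost:", sum(serve(p) for p in crawler_requests))  # mostly cache misses

In this toy model, ten human requests cost 29 units while ten crawler requests cost 200, which is the kind of imbalance the Foundation is describing.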

To mitigate these challenges, the Wikimedia Foundation’s site reliability team has been devoting significant time and resources to blocking crawlers and ensuring that ordinary readers experience minimal disruption. This strain is compounded by the rising costs of the cloud resources involved.

The Broader Implications for the Open Internet

This issue is representative of a broader trend that threatens the fundamental principles of the open internet. Recently, open source advocate Drew DeVault expressed concern over how AI crawlers disregard “robots.txt” files, which are meant to limit automated traffic. Additionally, software engineer Gergely Orosz pointed out that scrapers from large tech companies have increased bandwidth demands for his projects.
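For readers unfamiliar with how robots.txt is supposed to work, the following is a minimal sketch using Python's standard-library urllib.robotparser: a well-behaved crawler checks the file before each fetch, but compliance is entirely voluntary, which is the crux of DeVault's complaint. The user-agent string and page URL below are illustrative assumptions, not real crawler identifiers.

    from urllib.robotparser import RobotFileParser

    # A cooperative crawler fetches the site's robots.txt and honors it.
    rp = RobotFileParser()
    rp.set_url("https://commons.wikimedia.org/robots.txt")
    rp.read()

    # Check whether a hypothetical crawler user-agent may fetch a given page.
    user_agent = "ExampleAIBot"   # illustrative name, not a real crawler
    url = "https://commons.wikimedia.org/wiki/Special:ListFiles"
    if rp.can_fetch(user_agent, url):
        print("robots.txt permits this fetch")
    else:
        print("robots.txt disallows this fetch")

    # Scrapers that ignore robots.txt simply skip this check entirely,
    # which is the behavior the article describes.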

Despite the challenges posed by these automated systems, developers are pushing back with ingenuity and determination, and some technology firms are stepping in as well. Cloudflare, for instance, has introduced AI Labyrinth, a tool that feeds AI-generated content to scrapers in order to slow them down.

The Ongoing Battle

Ultimately, the ongoing struggle between content creators and scrapers resembles a classic cat-and-mouse game. If left unchecked, this relentless crawling may push many publishers to protect their content behind logins and paywalls, adversely affecting the accessibility of information for users across the web.
