
How AI Bots Are Changing the Rules of Data Collection

How AI Bots Are Changing the Rules of Data Collection. Credits: AIPlusInfo

How OpenAI’s Bot Bricked a Small Business Website

It seemed like an ordinary Saturday. Then Triplegangers’ e-commerce website suddenly started to malfunction. At first it looked like a typical DDoS attack, but the culprit turned out to be something more unexpected: OpenAI’s bot.

The OpenAI bot had been relentlessly scraping the entire Triplegangers website, which hosts more than 65,000 products. Each product page contains descriptive text and at least three high-quality photos. Triplegangers CEO Oleksandr Tomchuk explained that the bot was generating “tens of thousands” of server requests to download this enormous amount of content, hundreds of thousands of photos and other data, all at once.

“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week,” said Tomchuk. “Their crawlers were crushing our site. It was basically a DDoS attack.”

Triplegangers, a seven-person company, has spent over a decade creating what it claims to be the world’s largest database of “human digital doubles.” This includes 3D image files scanned from real human models, as well as high-quality photos of hands, hair, skin, and full-body images. The company caters to 3D artists, video game developers, and anyone needing realistic human recreations.

Despite having a terms-of-service page prohibiting bots from scraping its content, Triplegangers’ website fell victim because it lacked a properly configured robots.txt file. The file is part of the Robots Exclusion Protocol and tells search engines and bots which parts of a site to avoid. While OpenAI claims to honor such files, its bots reportedly take up to 24 hours to recognize updates.
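
For illustration, a minimal robots.txt aimed at OpenAI’s crawler might look like the snippet below. This is only a sketch: GPTBot is the user-agent token OpenAI documents for its crawler, the file must be served from the site root (for example, example.com/robots.txt), and compliance ultimately depends on the crawler choosing to honor it.

    # Ask OpenAI's crawler to stay out of the entire site
    User-agent: GPTBot
    Disallow: /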

Unexpected Consequences

Beyond the downtime, Triplegangers’ servers were overloaded by the surge in activity from OpenAI’s bot during business hours in the United States. Worse still, Tomchuk expects a steep increase in his AWS bill from the excessive CPU usage and data downloads.

Robots.txt files are not a failsafe solution; compliance with them is voluntary. Some AI companies, such as Perplexity, have been accused in the past of ignoring robots.txt, which raises serious questions about the ethics behind AI-driven data scraping.

Each of these is a product, with a product page that includes multiple more photos. Used by permission. Image credits: Triplegangers

By midweek, Triplegangers had set up a correctly configured robots.txt file and added Cloudflare protections to block bots like GPTBot. Tomchuk is still unsure how much data OpenAI was able to extract. He has found no way to contact OpenAI directly to request the removal of any stolen content.

“We’re in a business where rights are a serious issue because we scan actual people,” Tomchuk noted. Under legislation such as the GDPR in Europe, such action could have wide-ranging legal repercussions.

A Perfect Target for AI Crawlers

Triplegangers’ site is a goldmine for AI training data. Its photos are meticulously tagged with attributes like ethnicity, age, and body type, making them highly valuable for AI companies looking to enhance their models. Ironically, the sheer aggressiveness of OpenAI’s bot is what tipped Tomchuk off to the issue. Had the bot scraped less aggressively, it might have gone unnoticed.

Triplegangers’ server logs showed how ruthlessly an OpenAI bot was accessing the site from hundreds of IP addresses. Credits: Triplegangers

Tomchuk is not alone in his experience. Other website operators have reported that OpenAI’s bots crashed their sites and inflated their hosting bills. According to a 2024 study by DoubleVerify, AI scrapers and crawlers drove an 86% increase in “general invalid traffic.”

Small businesses such as Triplegangers suffer from the lack of transparency and accountability on the part of AI companies. Tomchuk now checks his server logs every day to try to pinpoint any possible attacks.
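
For site owners doing similar checks, a short script can make the daily log review less tedious. The sketch below is purely illustrative and not Triplegangers’ actual tooling: it assumes a standard combined-format access log at a hypothetical path (access.log) and an assumed list of AI crawler user-agent tokens, then tallies hits per bot and per source IP.

    # Minimal sketch: tally requests from known AI crawler user agents
    # in a combined-format access log. Path and bot list are assumptions.
    import re
    from collections import Counter

    LOG_PATH = "access.log"  # hypothetical path
    AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")  # illustrative list

    # Combined log format: IP ident user [time] "request" status size "referer" "user-agent"
    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    hits_per_ip = Counter()
    hits_per_bot = Counter()

    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = LINE_RE.match(line)
            if not match:
                continue
            ip, user_agent = match.groups()
            for token in AI_BOT_TOKENS:
                if token in user_agent:
                    hits_per_ip[ip] += 1
                    hits_per_bot[token] += 1
                    break

    print("Requests per bot:", dict(hits_per_bot))
    print("Top source IPs:", hits_per_ip.most_common(10))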

According to Tomchuk, AI companies should seek permission before scraping data. “It’s scary because there seems to be a loophole that these companies are using. They put the onus on business owners to block them,” he said. “They should be asking permission, not just scraping data.”
