How OpenAI’s Bot Bricked a Small Business Website
It seemed like an ordinary Saturday until Triplegangers' e-commerce website suddenly began to malfunction. At first, it looked like a typical DDoS attack. The culprit, however, turned out to be something more unexpected: OpenAI's bot.
The OpenAI bot had been relentlessly scraping the entire Triplegangers website, which hosts more than 65,000 products, each with its own page containing descriptive text and at least three high-quality photos. Triplegangers CEO Oleksandr Tomchuk explained that the bot was sending "tens of thousands" of server requests to download all of that content, hundreds of thousands of photos plus other data, all at once.
“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week,” said Tomchuk. “Their crawlers were crushing our site. It was basically a DDoS attack.”
Triplegangers, a seven-person company, has spent over a decade building what it claims is the world's largest database of "human digital doubles": 3D image files scanned from real human models, along with high-quality photos of hands, hair, skin, and full bodies. The company caters to 3D artists, video game developers, and anyone else needing realistic human recreations.
Despite having a terms-of-service page that prohibits bots from scraping its content, Triplegangers' website fell victim because it lacked a properly configured robots.txt file. That file, part of the Robots Exclusion Protocol, tells search engines and other bots which parts of a site to stay away from. OpenAI says it honors such files, though its bots reportedly take up to 24 hours to recognize updates.
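For reference, a minimal sketch of what such a file could look like, assuming the goal is simply to keep OpenAI's crawler off the entire site (OpenAI documents "GPTBot" as the user agent its web crawler identifies itself with):

User-agent: GPTBot
Disallow: /

The file sits at the root of the domain, and, as noted above, OpenAI's crawlers may take up to a day to pick up any changes to it.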
Unexpected Consequences
Beyond the downtime itself, OpenAI's bot overloaded Triplegangers' servers with its surge of requests during U.S. business hours. The worst news: Tomchuk expects a steep increase in his AWS bill from the excess CPU usage and data transfer.
Robots.txt files are not a failsafe, either; compliance with them is voluntary. Some AI companies, including Perplexity, have previously been accused of ignoring robots.txt, which raises serious questions about the ethics of AI-driven data scraping.
By midweek, Triplegangers had set up a correctly configured robots.txt file and added Cloudflare protections to block bots like GPTBot. Tomchuk is still unsure how much data OpenAI was able to extract. He has found no way to contact OpenAI directly to request the removal of any stolen content.
“We’re in a business where rights are a serious issue because we scan actual people,” Tomchuk noted. Under legislation such as the GDPR in Europe, such action could have wide-ranging legal repercussions.
A Perfect Target for AI Crawlers
Triplegangers’ site is a goldmine for AI training data. Its photos are meticulously tagged with attributes like ethnicity, age, and body type, making them highly valuable for AI companies looking to enhance their models. Ironically, the sheer aggressiveness of OpenAI’s bot is what tipped Tomchuk off to the issue. Had the bot scraped less aggressively, it might have gone unnoticed.
Tomchuk is not alone in this experience, though. Other website operators have reported that OpenAI's bots crashed their sites and inflated their hosting bills. According to a 2024 study by DoubleVerify, AI scrapers and crawlers caused an 86% increase in "general invalid traffic."
Small businesses such as Triplegangers bear the cost of this lack of transparency and accountability from AI companies. Tomchuk now checks his server logs every day to try to spot any further scraping.
According to Tomchuk, AI companies should seek permission before scraping data. “It’s scary because there seems to be a loophole that these companies are using. They put the onus on business owners to block them,” he said. “They should be asking permission, not just scraping data.”