The AI Data Scraping Crackdown: Why Regulators Say Tech’s Biggest Secret Violates U.S. Privacy Law
By Ramyar Daneshgar
Security Engineer | USC Viterbi School of Engineering
Disclaimer: This article is for educational purposes only and does not constitute legal advice.
A quiet battle has begun between U.S. regulators and the artificial intelligence industry, triggered by the discovery that many of the most influential AI systems were trained on personal data scraped from the open internet. Investigations by major outlets such as the New York Times revealed that image datasets containing millions of photos were assembled without consent, pulled from platforms like Flickr, LiveJournal, and personal blogs, and often included identifiable faces of adults and children. As the Washington Post later documented, voice-AI firms trained speech models on YouTube and TikTok audio without notifying creators, asserting that anything posted publicly was fair game.
Regulators now reject that assumption. The Federal Trade Commission has publicly stated that scraping facial geometry or voiceprints from public websites can constitute illegal biometric collection, and in several enforcement actions the agency ordered companies not only to delete improperly obtained data but also to destroy machine-learning models trained on it. This occurred in the FTC's action against Everalbum, the photo-app maker that later rebranded as Paravision, signaling that the government views scraped biometric datasets as incompatible with U.S. consumer-protection law.
How AI Firms Built Models on Scraped Content
Early generative AI systems depended heavily on scraping. Companies harvested images from personal blogs, long-forgotten forum posts, audio clips from livestreams, and social-media photos of individuals who never imagined their content would be used to train a commercial model. Investigations by NBC News and VICE/Motherboard showed that datasets like LAION-5B were assembled by crawling billions of URLs, capturing everything from school portraits to medical discussions on public message boards.
The scale was enormous. The LAION dataset, used to train image-generation systems such as Stable Diffusion, was built from raw web crawls. Its own academic paper acknowledges that identifiable individuals, children's photos, and other personal material were ingested with minimal filtering. Similar investigations revealed that voice models were trained on scraped audio of YouTubers who never licensed their voices.
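To make the mechanics concrete, here is a minimal sketch of how a LAION-style dataset is consumed. The dataset itself is essentially a list of image URLs paired with captions, and anyone holding the index can bulk-download the underlying photos. The file name and column layout below are illustrative assumptions, not the actual LAION release format, which is distributed as metadata shards.

```python
import csv
import pathlib

import requests

# Hypothetical index file: each row pairs an image URL with its alt-text
# caption, the same basic shape as LAION-style web-crawl datasets.
INDEX_FILE = "image_index.csv"          # assumed columns: url, caption
OUT_DIR = pathlib.Path("scraped_images")
OUT_DIR.mkdir(exist_ok=True)

with open(INDEX_FILE, newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f)):
        try:
            # Fetch whatever the URL points at: a family photo on a blog,
            # a school portrait, a forum avatar. The crawler cannot tell.
            resp = requests.get(row["url"], timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # dead links are simply skipped at this scale
        # Store the raw bytes next to the caption for later model training.
        (OUT_DIR / f"{i:08d}.jpg").write_bytes(resp.content)
        (OUT_DIR / f"{i:08d}.txt").write_text(row["caption"], encoding="utf-8")
```

Nothing in a loop like this can distinguish a stock photo from a child's school portrait, which is precisely the filtering gap regulators have seized on.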
Regulators Reframe Scraping as Biometric Surveillance
The FTC’s 2023 Policy Statement on Biometric Information reframed scraping as “surreptitious biometric surveillance.” The Commission emphasized that once a company extracts a faceprint or voiceprint, the individual cannot replace it the way they would a password. The immutability of biometric identifiers elevates the harm.
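The technical step at issue is small. As a rough illustration, the snippet below uses the open-source face_recognition library and hypothetical file names to show how a faceprint can be extracted from one scraped photo and then matched against any later photo of the same person; the numeric embedding, not the image itself, is the identifier regulators are describing.

```python
import face_recognition  # open-source face-embedding library built on dlib

# Illustrative file names only (assumptions, not any company's pipeline).
# One publicly posted photo yields a numeric faceprint: a 128-number
# embedding that re-identifies the person in any future photo and,
# unlike a password, cannot be rotated or revoked once extracted.
public_photo = face_recognition.load_image_file("scraped_profile_photo.jpg")
faceprints = face_recognition.face_encodings(public_photo)

if faceprints:
    stored_faceprint = faceprints[0]  # the immutable biometric identifier

    # Years later, a different photo of the same person still matches.
    later_photo = face_recognition.load_image_file("photo_taken_later.jpg")
    for candidate in face_recognition.face_encodings(later_photo):
        match = face_recognition.compare_faces([stored_faceprint], candidate)[0]
        print("Same person identified:", match)
```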
State regulators share this view. Illinois courts interpreting the Biometric Information Privacy Act have repeatedly held that scraping publicly available photos to extract facial geometry violates state law if done without consent. In Vance v. IBM, the court rejected IBM’s argument that Flickr images released under public Creative Commons licenses could be used to train facial-recognition models without notice or permission. In A.C. v. Clearview AI, an Illinois state court concluded that scraping billions of facial photos from social media sites violated biometric privacy even though the photos were visible on the open internet.
The Texas Attorney General reached the same conclusion in its landmark facial-recognition settlement with Meta, which focused on unauthorized extraction of facial data from user photos.
Children’s Data and COPPA
Scraping minors’ photos introduces additional legal exposure. Under the Children’s Online Privacy Protection Act, collecting personal data from children under 13 for commercial use requires parental consent. Regulators argue that training AI models on scraped images of children violates COPPA because a child’s photograph, voice, and derived biometric data all qualify as personal information under the rule.
The FTC’s action against Edmodo reinforced this view, warning AI firms that minors’ biometric identifiers cannot be used merely because they appear publicly.
State Attorneys General Begin Large-Scale Investigations
Across the country, attorneys general in New York, New Jersey, California, Vermont, and Texas have launched investigations into AI companies after journalists documented scraping practices. The New York Times investigation into Clearview AI exposed how billions of social-media images were harvested to power a face-matching service sold to law enforcement agencies. This reporting prompted multi-state inquiries into the legality of operating face-matching algorithms built from scraped images.
The California Privacy Protection Agency has also begun examining how scraped data is used in AI training pipelines, citing protections under the California Consumer Privacy Act.
The FTC has gone further by requiring deletion of trained models in certain cases, marking the first time a U.S. regulator has recognized, in practice, a “right to have an algorithm un-trained.”
Civil Rights Concerns Behind the Crackdown
Civil-rights organizations, including the ACLU and Georgetown University’s Center on Privacy & Technology, have documented how scraped biometric datasets enable surveillance of vulnerable groups.
Research shows that face-recognition tools trained on scraped datasets have been deployed to identify:
- Protest participants
- Religious minorities
- Immigrants
- Domestic-violence survivors
- LGBTQ+ individuals
In one case examined by Amnesty International, authorities abroad used Clearview AI’s scraped dataset to track activists as they moved across cities.
Congress took interest after learning that U.S. agencies had purchased commercial biometric datasets rather than obtaining warrants, prompting Senate Judiciary Committee hearings on the legality of such practices.
Courts Begin Defining Boundaries
Courts across the U.S. are increasingly unwilling to accept that “public equals free to harvest.” In Sosa v. Onfido, the court found that scraping images to extract facial geometry—even when photos were publicly posted—triggered biometric-privacy requirements. Federal courts have similarly allowed suits against voice-AI firms that scraped YouTube videos to extract voiceprints without notification or consent.
In the Clearview multi-district litigation, federal judges rejected arguments that scraping public websites was categorically protected under the First Amendment, setting the stage for broader regulation of biometric scraping.
The International Context Pressures the U.S.
Other nations have already established strict rules. Under the GDPR, the European Union treats biometric data used for identification as a special category that generally cannot be processed without explicit consent. Canada’s Office of the Privacy Commissioner concluded in its 2021 report that Clearview AI’s scraping practices violated federal privacy law. Brazil’s national data protection authority has initiated enforcement actions against AI firms importing scraped datasets.
These international developments are shaping American legislative proposals, which increasingly reference European-style restrictions on biometric data.
Conclusion: A New Phase in American Privacy Law
The AI scraping crackdown marks a decisive moment in U.S. privacy history. Regulators no longer accept the idea that public visibility implies permission. Courts no longer presume that technical capability justifies biometric extraction. And lawmakers increasingly treat scraping not as data collection, but as identity capture.
The future of artificial intelligence in the United States will depend on how this conflict is resolved. The age of unchecked scraping is ending, and the contours of the next chapter are being written in the courts, regulatory filings, and legislative chambers right now.