Meta AI Siphons Content from Lemmy Instances, Ignorance Blamed on 'Public' Data Wild West

Post date: August 12, 2025 · Discovered: April 23, 2026 · 3 posts, 95 comments

Meta/Facebook is accused of training its AI on user-generated content scraped directly from multiple, diverse Lemmy instances across the federated web.

Commenters split over whether the scraping is legitimate. Some argue the data is inherently 'public' and its use simply inevitable ('anarchiddy'). Others call the scraping an 'aggressive overreach' and a 'dick move' against data offered via federation. Technical concerns were raised as well: 'mesamunefire' reported that scraping bots ignore conventions such as robots.txt and claimed that AI crawlers account for 95% of traffic on their small server.

The core conflict centers on consent and infrastructure strain. While some believe the data is already accessible through federation itself ('danc4498'), others point to legal avenues, suggesting 'trespass to chattels' might apply if the scraping degrades service quality ('litchralee'). Operators are urged to strengthen their legal position with clear Terms of Service ('fmstrat').

Key Points

#1 Scraping bypasses established protocols.

Multiple users point out that Meta’s data collection ignores protocols, with 'mesamunefire' stating bots bypass robots.txt.

#2 Legal challenge centers on service degradation.

The 'trespass to chattels' legal theory was raised, suggesting AI scraping that cripples a service could be actionable interference ('litchralee').

#3 Data is framed as fundamentally public.

A segment of commenters argued usage is inevitable because content on federated platforms is 'fundamentally public' ('anarchiddy').

#4 Scraping undermines site stability.

'nickwitha_k' shifted the focus beyond privacy, warning about the economic harm from excessive server load.

#5 Inaction allows abuse.

'halcyoncmdr' suggested Meta's scraping indicates a failure of intended federation protocols, implying scraping is just a fallback measure.
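
The robots.txt and traffic-share points above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the robots.txt content, the crawler name "ExampleAIBot", and the host lemmy.example are all hypothetical and do not come from the discussions. The sketch uses Python's standard urllib.robotparser to show what a well-behaved crawler would be expected to check, and a simple helper to estimate what share of requests comes from a named crawler. Note that robots.txt is advisory: nothing in it stops a bot that never performs the check, which is the behavior 'mesamunefire' describes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a small instance.
# "ExampleAIBot" is a made-up crawler name used only for illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /admin
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def crawler_share(user_agents: list[str], markers: list[str]) -> float:
    """Fraction of requests whose User-Agent contains any marker (case-insensitive)."""
    if not user_agents:
        return 0.0
    lowered = [m.lower() for m in markers]
    hits = sum(1 for ua in user_agents if any(m in ua.lower() for m in lowered))
    return hits / len(user_agents)


if __name__ == "__main__":
    post = "https://lemmy.example/post/1"  # hypothetical URL
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", post))
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", post))

    # Hypothetical User-Agent values, as they might be pulled from an access log.
    seen = ["ExampleAIBot/1.0", "Mozilla/5.0", "ExampleAIBot/1.0", "ExampleAIBot/2.0"]
    print(crawler_share(seen, ["ExampleAIBot"]))
```

An operator could feed crawler_share the User-Agent column of a real access log to check a figure like the 95% claim on their own server. Crawlers that spoof a browser User-Agent would not be counted, so the result is a lower bound at best.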

Source Discussions (3)

This report was synthesized from the following Lemmy discussions, ranked by community score.

435 points
Leaked list shows Facebook training their AI on multiple Lemmy instances
[email protected]·163 comments·8/8/2025·by geneva_convenience
425 points
Leaked list shows Facebook training their AI on multiple Lemmy instances
[email protected]·68 comments·8/8/2025·by geneva_convenience
181 points
Leaked list shows Facebook training their AI on multiple Lemmy instances
[email protected]·48 comments·8/12/2025·by glowing_hans