LEAKED: A New List Reveals Top Websites Meta Is Scraping of Copyrighted Content to Train Its AI

Jerry on PieFed

My Mastodon instance is on the list. I try hard to block them.

The problem with the list is that it’s a target list, but not a list showing how much content, if any, they manage to process from any of the sites.

Pavidus

Stole.

@[email protected]

We ban the petty scrapers and celebrate the great ones as innovators and promote them as Fortune 500 companies.

BlueÆther

% grep lemmy Meta.txt
lemmy.ca
lemmynsfw.com
lemmy.sdf.org
lemmy.ml
lemmy.world
lemmygrad.ml

@[email protected]

Good catch. That’s worth a separate post.

Hexbear is on the list too.

marcie (she/her)

LLMs will start randomly shitting out hexbear emotes lol

irelephant [he/him]

where did you get the .txt file?

BlueÆther

Just extracted it from the PDF

@[email protected]

Always amused when leftist instances treat intellectual property like it’s real.

@[email protected]

IP debate aside, LLM scrapers absolutely annihilate system resources. I host a WordPress site, and before setting up Cloudflare Labyrinth my whole server would get DDoS’d at least twice a day.

irelephant [he/him]

It’s not, but scraping is annoyingly resource-intensive.

NutWrench

One person’s “scraping” is another person’s plagiarism.

r00ty

I blocked the entire ASN for Meta, because they were downright dirty with their scraping. No gradual crawling, fake UAs, random addresses across a large number of subnets.

They weren’t the only ones either. The AI scraping heist is the new gold rush.
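
For anyone wondering what blocking by prefix looks like at the application level, here is a rough sketch using Python's standard `ipaddress` module. It is not r00ty's actual setup. The prefixes below are placeholder documentation ranges, not Meta's real address space, and `BLOCKED_PREFIXES` and `is_blocked` are hypothetical names. In practice you would fill the list with the prefixes announced by the ASN you want to block, and a firewall rule is likely a better place for this than application code.

```python
import ipaddress

# Placeholder prefixes taken from the reserved documentation ranges.
# ASSUMPTION: these are NOT Meta's real ranges; replace them with the
# prefixes announced by whatever ASN you actually want to block.
BLOCKED_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked prefix."""
    addr = ipaddress.ip_address(client_ip)
    # Only compare addresses and networks of the same IP version.
    return any(
        addr.version == net.version and addr in net
        for net in BLOCKED_PREFIXES
    )


if __name__ == "__main__":
    for ip in ("192.0.2.55", "203.0.113.9", "2001:db8::1"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Matching on whole prefixes rather than single addresses is the point here: a scraper that hops between random addresses in many subnets still lands inside the same announced ranges.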
