ruby.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
@olivierlacan Alternatively, reply to that user agent with a blank page and 402 Payment Required

@jstepien @olivierlacan @brainwane Oh god, that is the obviously correct response, and now I have to figure out how to do it.
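For Rack-based Ruby apps (Rails, Sinatra, etc.), one way to figure it out is a tiny middleware. This is only a sketch: the `PayUpBot` class name and the user-agent regex are my own illustration, with the `GPTBot`/`ChatGPT-User` agent names taken from later in this thread.

```ruby
# Rack middleware that answers OpenAI's crawlers with
# 402 Payment Required instead of serving the page.
class PayUpBot
  BLOCKED_AGENTS = /GPTBot|ChatGPT-User/i

  def initialize(app)
    @app = app
  end

  def call(env)
    if env["HTTP_USER_AGENT"].to_s.match?(BLOCKED_AGENTS)
      [402, { "content-type" => "text/plain" }, ["402 Payment Required\n"]]
    else
      @app.call(env) # everyone else gets the real page
    end
  end
end
```

Enabled with `use PayUpBot` near the top of `config.ru`, it short-circuits matching crawlers before they reach the app.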

@olivierlacan honestly, it was VERY generous and benign of them to show how to opt out if you want. That's not the level of ethics you usually see in the "scene"; they set a good example for others to follow. (Who may or may not, but that's a different can of worms.)

@smicur @olivierlacan I'm not sure how "generous" or "benign" it is to say you can opt out of a thing that's already been performed essentially everywhere.

"Oh, the killings? You can opt out. Yeah, I know we shot essentially everyone you know, but it's fine. You can opt out now."

@masukomi @olivierlacan they had to start somewhere. The stuff they gathered was most likely already available to regular web crawlers, so the privacy argument doesn't hold there.

And you don't train an ML model this effective, and even beneficial for all humanity, by asking for consent like some cookie modal at every single turn.

OpenAI took the public data on the Internet for granted, and I don't see moral issues with that.

Midjourney and the likes are a whole different can of worms though.

@smicur @olivierlacan this would be true if they had done it before building and deploying their projects, certainly not now that they've scraped the whole deal already.

@olivierlacan Can you copy the content of the page? It requires an account.

@olivierlacan wild that you need to log in to see this link lol

@olivierlacan Just applied to all pages on my site. Thanks.

@olivierlacan I just do this:

```
echo "Blocking IP addresses..."
echo "[OpenAI egress ranges]"
iptables -A INPUT -s 23.102.140.112/28 -j DROP
iptables -A INPUT -s 23.98.142.176/28 -j DROP
```

@olivierlacan ironic that I needed to pass the Cloudflare "are you a human" check to see that web page

@olivierlacan Even better to return a bunch of GPT-generated garbage to poison the model instead of just disallowing it

@Beldantazar @olivierlacan now that is a great idea. Have a single directory on your site that is a bunch of lorem ipsum or nonsense-language pages that you allow the GPTBot to access. Surely some nice soul will make a generator for this soon. Just make sure to tell the agent only about that dodgy directory. Of course, that assumes you trust them to do the right thing, and I think they've already demonstrated they shouldn't be trusted.

@Danwwilson @olivierlacan I mean, you're already trusting them if you use the disallow anyway. But rather than returning a bunch of lorem ipsum that will be easily detected, the ideal thing is to return stuff that is ChatGPT-generated trash: that's harder for them to detect, and AI models get killed fast if they feed off their own outputs. Ideally, have the same set of pages return either the normal data or the GPT data depending on user agent, so it's even harder to detect.

@olivierlacan Would be a shame if a bunch of websites responded to this user agent with megabytes of nonsensical English text. That might screw up its model

@olivierlacan @darren we need to extend the concept of robots.txt to machine-readable content licenses, specifically to disallow use for ML training.

@sminnee Do you think this is something that Creative Commons licensing could/should cover?

@olivierlacan
So:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

Any others I'm missing? What about the browser.ai and axiom.ai crapola? Are they good netizens that can be kept at bay via robots.txt?

@oblomov @olivierlacan

It would be nice to intercept them and send them to a script generating a page with a few MB of "you're full of shit".

@olivierlacan so if you now opt out, will they actually remove your data from their training model? Doubt it.

@olivierlacan and if people do get their data removed from AI, will it be like the end of 2001 as HAL has his modules removed? #DaisyDaisy

@olivierlacan I am happy that they are providing methods for web developers to opt out, which I think is really nice. I am hoping that SOURCES of training data (images and the like) implement something similar; it could become a factor in choosing a web site to post stories or images on -- does that site make a good-faith effort to prevent data from being used to train AI?

@olivierlacan All this garbage has made me switch to blocking * from / and giving wget and ia_archiver an exception

@wertercatt @olivierlacan mind sharing your robots.txt?

```
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: Googlebot
User-agent: Googlebot-news
User-agent: Googlebot-Image
User-agent: AdsBot-Google
Disallow: /

User-agent: *
Disallow: /
```

I figure this should block some stuff, but maybe I don't need to name each agent.

@olivierlacan unfortunate that this even has to happen 😞 it'd be nice if websites just weren't scraped for AI training, or it was on a very opt-in basis.
@olivierlacan This is cool but technically shouldn’t even be necessary. Anything you create is protected by copyright, automatically and implicitly. Even if the ChatGPT bot is free to lumber through your site, it is violating copyright law if it copies your stuff, uses it to train its LLMs, and then makes that derived work publicly available.

@JetForMe @olivierlacan I'm often pessimistic and sarcastic. Instead, I'm going to be honest here: Yeah, I believe they will. I try to be charitable with voluntary claims about robots.txt.

@olivierlacan There are at least three people here with a similar idea... a script to feed it an infinite number of generated, cross-linked nonsense pages, each linking to more of the same, wasting resources and poisoning the dataset.

They don't even need to be anything advanced. Any basic Markov library spitting out vaguely (but unhelpfully) semi-coherent nonsense would do. It just needs to look enough like language, while being garbage, wrapped in HTML tags.

Please! Create or boost!
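For anyone inclined to create rather than boost: a minimal sketch of such a generator in Ruby, using a word-level Markov chain over whatever seed text you feed it. `NonsenseGenerator` and everything in it is my own hypothetical illustration, not an existing library:

```ruby
# Word-level Markov chain: learns word-pair transitions from a seed
# text, then walks the chain to emit vaguely language-shaped nonsense.
class NonsenseGenerator
  def initialize(corpus)
    @chain = Hash.new { |h, k| h[k] = [] }
    @words = corpus.split
    @words.each_cons(2) { |a, b| @chain[a] << b }
  end

  def sentence(length = 12)
    word = @words.sample
    out = [word]
    (length - 1).times do
      nexts = @chain[word]
      # Dead end in the chain? Jump to a random word and keep going.
      word = nexts.empty? ? @words.sample : nexts.sample
      out << word
    end
    out.join(" ").capitalize + "."
  end

  # Wrap a few sentences in just enough HTML to look like a page.
  def page(sentences = 5)
    body = Array.new(sentences) { "<p>#{sentence}</p>" }.join("\n")
    "<html><body>\n#{body}\n</body></html>"
  end
end
```

Seed it with any public-domain text and serve `page` from the dodgy directory; adding links between generated pages is left as an exercise.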

@olivierlacan Imagine if someone made something like this and then published it as a #WordPress plugin! :mastomindblown:

@olivierlacan or just allow a directory filled with (generated) rubbish? :D

@olivierlacan it seems weird to post this with a link to OpenAI that requires login.

No thank you.

@olivierlacan this is actually deviously evil. They are pulling up the drawbridge now that they've scraped the internet without consent.

I'm sure OpenAI will also push for legislation requiring new AI companies to respect robots.txt. They are trying to cement their monopoly on data before credible competitors emerge.