 @Salastil 

> I'm getting my Nitter instanced scraped by a botnet that appears to be 100k IP large, they get fed in as fast as I ban them, but I don't believe they assign more than 100 IP to scraping at a time as to not DDOS the site,

WEIRD

> one IP never doing a scrape under 7 seconds so rate limiting wont nab them.

Out of curiosity, what UAs are they using?  Tried SSL fingerprinting?  You know why they'd be hitting your server, like did you check if DiscordBot or something is in your referrers, or someone linked to it from somewhere, or...?

> something I read about 10+ years ago, a sticky trap. I want to ensnare the bot into a perpetually open http request so that it never completes its loop,

Ah, okay, so you can do this pretty easily with nginx:  you can forward to different backends conditionally with one of the (really badly documented) `if` directives.  Set up a little script, listen/accept on one end, and then make a connection on the other to the actual upstream.  So if your Nitter instance is running on localhost:4444, you have this script listen on localhost:4445.  Have it relay all of the traffic upstream and then get the entire response (to avoid jamming up the real server), but trickle the response a few bytes at a time.  Some clients time out if you take too long to get the headers to them, so maybe send the headers back faster, but like delay a couple of seconds, then send the headers, then trickle the rest at a few bytes per second.
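Rough sketch of what that little script could look like, in Python with asyncio. Only the 4444/4445 ports come from the description above; the delays, chunk size, and all the function names are made up for illustration. It also assumes plain GET-style requests (no request body) and forces `Connection: close` upstream so the whole response can be read in one go. Not production code, just the shape of the thing:

```python
import asyncio

UPSTREAM_HOST = "127.0.0.1"
UPSTREAM_PORT = 4444      # the real Nitter instance, per the post above
LISTEN_PORT = 4445        # where nginx sends the suspected bots
HEADER_DELAY = 2.0        # assumed: seconds to wait before sending headers
CHUNK_SIZE = 4            # assumed: bytes per trickle
CHUNK_INTERVAL = 1.0      # assumed: seconds between trickles


def force_close(request: bytes) -> bytes:
    """Swap any Connection header for 'Connection: close' so upstream hangs up."""
    lines = [line for line in request[:-4].split(b"\r\n")
             if not line.lower().startswith(b"connection:")]
    lines.append(b"Connection: close")
    return b"\r\n".join(lines) + b"\r\n\r\n"


def split_response(raw: bytes) -> tuple[bytes, bytes]:
    """Split a raw HTTP response into (headers plus blank line, body)."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    return head + sep, body


def chunks(data: bytes, size: int):
    """Yield data in pieces of at most `size` bytes."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


async def handle(client_r, client_w):
    try:
        # Read only the request headers; request bodies are ignored here.
        request = await client_r.readuntil(b"\r\n\r\n")
        up_r, up_w = await asyncio.open_connection(UPSTREAM_HOST, UPSTREAM_PORT)
        up_w.write(force_close(request))
        await up_w.drain()
        raw = await up_r.read()   # read to EOF so the real server is freed quickly
        up_w.close()
        headers, body = split_response(raw)
        await asyncio.sleep(HEADER_DELAY)
        client_w.write(headers)   # headers go out in one piece
        await client_w.drain()
        for piece in chunks(body, CHUNK_SIZE):
            client_w.write(piece)
            await client_w.drain()
            await asyncio.sleep(CHUNK_INTERVAL)
    except (asyncio.IncompleteReadError, asyncio.LimitOverrunError, ConnectionError):
        pass                      # client or upstream went away; nothing to do
    finally:
        client_w.close()


async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", LISTEN_PORT)
    async with server:
        await server.serve_forever()
```

Start it with `asyncio.run(main())`. Each trapped connection just costs one sleeping coroutine on your side, which is the point: cheap for you, slow for them.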

Another way to do this is to use iptables.  True story:  `-m statistic --mode random --probability 0.5 -j DROP` does more or less what you would expect.  This is what I did when Pawoo was flooding FSE with massive numbers of deletes, like as a kind of dopey rate-limiting ability.  (Unintentional on their part:  a few accounts with really long post history deleted themselves, and this causes Mastodon to send one delete per activity since the beginning of time to every server it has heard of...except the ones that it has blocked.)

> I figure that it the botnet notices when its banned and starts getting 403'd,

Basically zero of the scrapers that hit FSE do this.  Boardreader.com didn't even notice when I started actively poisoning their data until about a week after I started including the phone number of the guy that was ignoring my emails in the data. 
 > Out of curiosity, what UAs are they using?  Tried SSL fingerprinting?  You know why they'd be hitting your server, like did you check if DiscordBot or something is in your referrers, or someone linked to it from somewhere, or...?

Nitter is a Twitter proxy, and there are only a few left after Elon's antics trying to make it a walled garden. The current design of Nitter requires us to make a large number of "guest accounts" that are created during an onboarding process using an old Android version of the Twitter app. These guest accounts give us access to most of the API features that existed before the walled garden. Each one is good for about 499 requests before getting rate limited, and each only lasts 30.5 days before expiring.

As to why? Nitter is effectively the only way to scrape content from Twitter. The guest_account stuff can only be created one per IP per day, so a lot have to be generated via a proxy service. All of the basic stuff, like obvious bot user agents, has been handled. These botnets never have a single IP make a request more than once every 7-11 seconds, and always with a legitimate User-Agent. Sometimes it looks like desktop Windows Chrome sessions, sometimes iPhones; it's all over the place with no real pattern, and the same goes for the stuff being searched for.

I think I may come up with a way of 403ing any request to specific endpoints that arrives without a referrer. In theory a real user should hit the root page, search from there, and get referred to another page.
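The check itself would be something like this, sketched in Python just to pin down the logic. The hostname and the list of protected paths are placeholders, not anything from the actual setup, and the function name is hypothetical. Worth keeping in mind that the Referer header is trivial to fake, and that people following a direct link or a bookmark would get blocked too:

```python
from typing import Optional
from urllib.parse import urlsplit

SITE_HOST = "nitter.example"        # placeholder for the instance's hostname
PROTECTED_PREFIXES = ("/search",)   # placeholder: endpoints that need a referrer


def should_block(path: str, referer: Optional[str]) -> bool:
    """Return True if a request to a protected endpoint lacks an on-site referrer."""
    if not path.startswith(PROTECTED_PREFIXES):
        return False                # root page and everything else stays open
    if not referer:
        return True                 # no Referer header at all
    return urlsplit(referer).hostname != SITE_HOST
```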
 @Salastil @pistolero nginx has a 444 status code, which is better than 403 here since 444 hard-closes the connection immediately without sending a response.
 @Salastil @pistolero you should have these requests return static content for bots that basically poisons their dataset. like a markov chain poster with a bunch of fake followers that are smurfed. that is a good tarpit idea
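For the markov chain poster part, a word-level chain is only a few lines of Python. This is a toy sketch with made-up function names, assuming you have some corpus of text with at least two words to feed it:

```python
import random
from collections import defaultdict


def build_chain(corpus: str) -> dict:
    """Map each word to the list of words seen directly after it."""
    words = corpus.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def fake_post(chain: dict, length: int = 12, seed=None) -> str:
    """Walk the chain to produce `length` words of plausible-looking junk."""
    rng = random.Random(seed)
    starts = sorted(chain)          # sorted so a fixed seed gives a fixed post
    word = rng.choice(starts)
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (word only ever appeared last): jump to a random start.
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)
```

Seeding per URL (for example from a hash of the path) would make the same page return the same junk every time, so it looks like stable content to the scraper.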
 @Salastil @pistolero every post ending with "and the pee is stored in the balls too" 
 @Salastil @pistolero oh oh, idea, map every user to some random fedi user and have it serve that as the tarpit. lots of ganee words in there tbh