 In my case they're not scraping a single account, they're scraping half of Twitter via my instance, so such simple regex options have never worked for me. One of the approaches brought up was to ban anything that didn't carry a referrer from the site. 
 Welp, after autobanning anything that connected to the site for 8 hours, the botnet is only increasing in speed. The access log moves so fast I cannot even begin to read it any longer.

https://media.salastil.com/media/4732e1dea6710a704ae91ea822f532d94dd199a03fcb2d3d8ec812644cf6e0ac.mp4 
 @Salastil @fzorb @anime graf mays 🛰️🪐 ...They've all got the same user-agent. 
 This batch right there does, since it's at the tail end of an 8 hour session of banning everything that connected to the site. The guy does indeed rotate user agents on his bots. I've seen him masquerade as Netscape Navigator 5 at one point, which was impressive since the browser was never released. 
 

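        # Flag outdated browser versions by user agent (1 = bad, 0 = ok).
        map $http_user_agent $baduseragents {
                default                                                     0;
                # Trident 1-7 (Internet Explorer engines)
                "~Trident/[1-7]\."                                          1;
                # Chrome 1-79 (the 7[0-9] branch is already covered by [0-7][0-9])
                "~Chrome/(([1-9]{1})|([0-7]{1}[0-9]{1})|(7[0-9]{1}))\."     1;
                # YaBrowser 1-9, plus any two-digit version not ending in 9
                "~YaBrowser/(([1-9]{1})|([1-9]{1}[0-8]{1}))\."              1;
                # Firefox 1-89
                "~Firefox/(([1-9]{1})|([0-7]{1}[0-9]{1})|(8[0-9]{1}))\."    1;
                # Edg/ and EdgA/ 1-86 (the legacy "Edge/" token is not matched)
                "~EdgA?/(([1-9]{1})|([0-7]{1}[0-9]{1})|(8[0-6]{1}))\."      1;
                # Version/ 1-9, plus two-digit versions ending in 0 or 1
                "~Version/(([1-9]{1})|([1-9]{1}[0-1]{1}))\."                1;
        }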
 @anime graf mays 🛰️🪐 @fzorb @pistolero Doesn't this cover every permutation of Chrome, Firefox, or Edge? 
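
One way to answer that is to run the same patterns against a few version tokens and see what falls through. A rough Python sketch, assuming nginx's `~` (case-sensitive regex) behaves like `re.search` for these particular patterns; the function name and the sample tokens are made up for illustration:

```python
import re

# Patterns copied verbatim from the map above (nginx "~" = case-sensitive regex).
BAD_UA_PATTERNS = [
    r"Trident/[1-7]\.",
    r"Chrome/(([1-9]{1})|([0-7]{1}[0-9]{1})|(7[0-9]{1}))\.",
    r"YaBrowser/(([1-9]{1})|([1-9]{1}[0-8]{1}))\.",
    r"Firefox/(([1-9]{1})|([0-7]{1}[0-9]{1})|(8[0-9]{1}))\.",
    r"EdgA?/(([1-9]{1})|([0-7]{1}[0-9]{1})|(8[0-6]{1}))\.",
    r"Version/(([1-9]{1})|([1-9]{1}[0-1]{1}))\.",
]
COMPILED = [re.compile(p) for p in BAD_UA_PATTERNS]


def is_bad_user_agent(ua: str) -> bool:
    """Return True if any pattern matches anywhere in the user-agent string."""
    return any(p.search(ua) for p in COMPILED)


# Hypothetical version tokens, not taken from any real log.
SAMPLES = [
    "Chrome/79.0.3945.88",
    "Chrome/120.0.0.0",
    "Firefox/89.0",
    "Firefox/90.0",
    "Edg/86.0.622.69",
    "Edge/17.17134",
    "Version/11.1",
    "Version/12.1",
    "YaBrowser/19.1",
    "YaBrowser/24.1",
]

if __name__ == "__main__":
    for ua in SAMPLES:
        print(f"{ua:24} {'blocked' if is_bad_user_agent(ua) else 'allowed'}")
```

Going by this, the Chrome, Firefox, and `Edg/` rules cover contiguous ranges of old versions, while the `Version/` and `YaBrowser/` rules have gaps (for example `Version/12.` and `YaBrowser/19.` are not matched) and the legacy `Edge/` token is not matched at all. A bot that simply claims a current version passes every rule.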
 @Salastil @fzorb @pistolero @anime graf mays 🛰️🪐 now get a dark hoodie and an rgb keyboard and make it green. Also make your terminal colors green on black 
 @buy robux today :ROBUX: @fzorb @pistolero @anime graf mays 🛰️🪐 I already have an RGB keyboard and I use tiling window manager i3wm btw did you know I use Arch? :archlinux: 
 @Salastil @buy robux today :ROBUX: @fzorb @pistolero @anime graf mays 🛰️🪐 Amber on black > green on black. 
 new vegas enjoyer spotted 
 Ah yes the  P i s s f o n t 
 my gf the other day "i miss when you used to play fallout" 
 I switched fnv to German and now it's impossible 
 I miss your Doritos 
 :graf_1::graf_2::graf_3::graf_4::graf_5:       
:graf_6::graf_7::graf_8:      :graf_9: 
 Me too 
 I'm building a half-timber cottage and I might use an old audio oscilloscope I have as a monitor for a small terminal. 
 @Salastil @buy robux today :ROBUX: @fzorb @pistolero @anime graf mays 🛰️🪐 The first computer I got to play with was a dual 5.25 floppy amber screened laptop type thing.  Played Zork, and some kinda ASCII dungeon crawler thing on it.  Shame modern web has a bit of a fit when you try to cram in low ASCII characters. 
 @Salastil @fzorb @anime graf mays 🛰️🪐 

> they're scraping half of twitter via my instance,

Ha, it sounds a lot like what Boardreader was doing to FSE.  They actually recorded browser sessions and played them back through a big army of residential US proxies.  I ended up writing a script that watched the logs and waited until some client had a suspiciously high proportion of requests hitting TWKN (watching behavior instead of source).  They would fire off a few hundred requests and then hop IPs, and if I killed an IP, another one would arrive really quickly.  Since they'd recorded browser sessions, it was hard to tell until they had already gotten some of the data, and by the time they had hit several hundred requests for TWKN after the initial burst, it was too late for detection to help.  Maybe check if you see `devtools.boardreader.com` in your logs anywhere, ha.
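
For anyone wanting to try the "watch behavior instead of source" idea, here is a minimal Python sketch. It is not the script described above, just an illustration: it assumes a combined-style access log, and the function name, the `watched_prefix` parameter, and both thresholds are hypothetical values you would tune for your own site.

```python
import re
from collections import defaultdict

# Combined-log-style line: client IP, two fields, [timestamp], "METHOD PATH ...
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)'
)


def suspicious_clients(lines, watched_prefix, min_requests=50, min_ratio=0.8):
    """Return {ip: ratio} for clients whose traffic is mostly the watched path.

    A client is flagged once it has made at least `min_requests` requests and
    at least `min_ratio` of them start with `watched_prefix`.
    """
    totals = defaultdict(int)
    watched = defaultdict(int)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        ip = m.group("ip")
        totals[ip] += 1
        if m.group("path").startswith(watched_prefix):
            watched[ip] += 1
    flagged = {}
    for ip, total in totals.items():
        ratio = watched[ip] / total
        if total >= min_requests and ratio >= min_ratio:
            flagged[ip] = ratio
    return flagged
```

As noted above, the catch is timing: a client that hops IPs after a few hundred requests may be gone by the time any threshold like this trips.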

They weren't executing JavaScript (they couldn't), but I didn't wanna break all the clients by requiring it.  Nitter, on the other hand, is basically *just* a web UI, so you could go that route.  Tack on some JS that adds a hash of the IP address plus a nonce to every link.  This precludes a lot of proxy use and non-JS-executing scrapers, because they'd have to know which place they're exiting from and then do a hash.
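
A sketch of what the server-side half of that check could look like, in Python. Everything here is an assumption for illustration: the function names, the `ip:nonce` input format, and the choice of SHA-256. The page's JS would have to compute the same digest from the IP the server told it and the nonce embedded in the page.

```python
import hashlib
import hmac
import secrets


def make_nonce() -> str:
    """Fresh random nonce to embed in each served page (hypothetical helper)."""
    return secrets.token_hex(16)


def link_token(client_ip: str, nonce: str) -> str:
    """Digest the client-side JS is expected to append to every link."""
    return hashlib.sha256(f"{client_ip}:{nonce}".encode()).hexdigest()


def link_is_valid(request_ip: str, nonce: str, token: str) -> bool:
    """Recompute the digest from the IP the request arrived from and compare.

    A scraper that fetched the page through one proxy and follows the link
    through another fails here, as does one that never ran the JS.
    """
    return hmac.compare_digest(link_token(request_ip, nonce), token)
```

Usage would be roughly: issue `make_nonce()` with each page, let the JS build links carrying `link_token(ip, nonce)`, and reject any request where `link_is_valid` returns `False`.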

> One of the approaches brought up was to ban anything that wasn't containing a referrer from the site.

That works sometimes, but they will pretty often spoof it or start spoofing it. 
 >That works sometimes, but they will pretty often spoof it or start spoofing it.

They already are spoofing to a degree, but they fuck up and will sometimes use a referrer from the wrong site: I'll see a referrer from nitter.poast.org or one of the other instances, and this isn't how Nitter operates. I just think banning isn't a viable strategy at this point. I've banned about 120k IPs today and the botnet doesn't seem to have slowed a bit. I've been dealing with this guy since August and have managed to get him to fuck off multiple times, but this time he seems hellbent on scraping my instance until the instance no longer functions.
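
The wrong-site referrer slip can at least be spotted mechanically. A small Python sketch, assuming you know your own hostname; `nitter.example.org` and the function name are placeholders, and this only classifies a request, it does not decide whether banning is worth it:

```python
from urllib.parse import urlsplit


def classify_referrer(referer_header: str, own_host: str) -> str:
    """Classify a Referer header as 'missing', 'own', or 'foreign'."""
    if not referer_header:
        return "missing"
    host = urlsplit(referer_header).hostname
    # hostname is None for values that are not absolute URLs; treat as foreign.
    if host is not None and host.lower() == own_host.lower():
        return "own"
    return "foreign"


if __name__ == "__main__":
    # Placeholder hostname; substitute the instance's real one.
    print(classify_referrer("https://nitter.poast.org/someuser", "nitter.example.org"))
```

A referrer pointing at another Nitter instance comes back as `foreign`, which is the tell described above.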