The problem with garbage data is that it could ensnare normal users; it's difficult to identify the bots with 100% accuracy.

Secondly, I don't want to get into a legal gray zone. Nitter is a proxy of Twitter accounts, a very simple and lightweight front end, but for the most part it's 100% faithful to what is on the selected user's timeline. If I start getting into the rat race of poisoning the feed, I could in theory run into libel lawsuits from some demented Twitter user who thinks their reputation is being ruined because the feed is full of gore and gamer words.

Third, if I start getting into the habit of "curating" the timelines, I wouldn't be able to hide behind Section 230 as effectively when some cretin starts browsing child porn accounts and hashtags. Most of the VPS hosting companies are aware of Nitter, and when you get a nastygram from the gubmint they are more willing to play ball since you're just repackaging public data; if Nitter gets a reputation for being all over the place, that trust is eroded.

Sorry, I just woke up, but I think my thoughts on the matter are clear enough. 
 @Salastil @pistolero 

You don't have to identify all of them with 100% certainty, just some of them.
You don't have to retain the original usernames or images on the scraped pages; those can also be replaced.
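As a rough illustration of what that could look like (the bot flag and the replacement handles below are placeholders, not any particular detection method):

```python
# Rough sketch of serving altered pages to suspected scrapers: swap the
# @handles in an already-rendered page for throwaway names. The detection
# flag and the replacement handles are placeholders.
import random
import re

FAKE_HANDLES = ["user1234", "placeholder_handle", "not_a_real_account"]

def poison_handles(html: str) -> str:
    """Replace every @handle in the page with a random throwaway name."""
    return re.sub(r"@\w+", lambda m: "@" + random.choice(FAKE_HANDLES), html)

def render_for(page_html: str, looks_like_bot: bool) -> str:
    # Only suspected bots get the rewritten page; normal visitors see the
    # timeline untouched.
    return poison_handles(page_html) if looks_like_bot else page_html
```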

>I wouldn't be able to hide behind Section 230 as effectively
Is there a single similar case where this happened? 
 >Is there a single similar case where this happened?
Legal fees still cripple people even if you win the case; I have no intention of handing insane people ammunition to grind me down for no reason. Were this a Pleroma instance I was running and I was the HNIC, I'd consider well poisoning, because ultimately it's _MY_ domain and the users would have to abide by some sort of EULA or be briefed in advance that I was doing such things to their data. With Nitter it's just meant to be a replication of existing data. 

>You don't have to identify all of them with 100% certainty, just some of them. 

Therein lies the problem: I _can't_ distinguish between a random user who just has a bookmark of @realgronalddrumpf and lands at his timeline and a bot that lands at the same timeline. That requires getting into invasive practices like fingerprinting or CAPTCHA programs. I'm supposed to be offering a privacy frontend; subjecting the users to this stuff defeats the purpose. 
 @Salastil @laurel 

> if I start getting into the habit of "curating" the timelines, I wouldn't be able to hide behind Section 230 as effectively

I don't think feeding garbage data to bots counts as curating: it's not exercising editorial control over the posts. It's like putting an alarm on the back door because legitimate customers are supposed to come in through the front. I get what you mean about poisoned data, though, and there are a lot of other options.

Have you tried associating cursors with IPs? 
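Roughly, something like this: remember which IP each pagination cursor was handed out to, and refuse the cursor from anyone else, so a botnet can't fan one timeline's cursors out across many machines. The in-memory store, TTL, and function names below are just illustrative, not anything Nitter actually does:

```python
# Hypothetical sketch of tying pagination cursors to the IP they were
# issued to; the store, TTL and function names are made up for illustration.
import time

CURSOR_TTL = 600   # seconds a cursor stays valid; arbitrary choice
_issued = {}       # cursor -> (client_ip, issued_at)

def register_cursor(cursor: str, client_ip: str) -> None:
    """Record which IP a freshly issued cursor went to."""
    _issued[cursor] = (client_ip, time.time())

def cursor_allowed(cursor: str, client_ip: str) -> bool:
    """Serve a cursor only to the IP that originally received it."""
    entry = _issued.get(cursor)
    if entry is None:
        return False   # cursor we never issued: almost certainly a scraper
    ip, issued_at = entry
    if time.time() - issued_at > CURSOR_TTL:
        _issued.pop(cursor, None)
        return False   # stale cursor, make the client start from the top
    return ip == client_ip
```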
I managed to smack down 90% of the bots by 403ing anything that makes a request to a specific endpoint without a referrer from the site itself. In normal use the site should operate with people going to the root page / -> search -> then either to the timeline of an account or to a reply. This is a bit draconian in that it prevents people with a bookmark from just showing up at the with_replies timeline, but I set up a 403 page explaining why. I doubt the guy with the botnet is really investigating why his bots are getting 302'd to an error page; it's just not getting data. Now it's back down to the baseline bots again.
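The check itself is simple enough that a rough sketch conveys it; the Flask framing, the hostname, and limiting it to with_replies are assumptions for illustration, whatever the real deployment looks like:

```python
# Illustrative sketch of the referrer check: 403 requests to the guarded
# endpoint unless the Referer header points back at the instance itself.
# The hostname, the Flask framing and the with_replies match are assumptions.
from urllib.parse import urlparse

from flask import Flask, abort, request

app = Flask(__name__)
INSTANCE_HOST = "nitter.example.com"   # placeholder instance hostname

@app.before_request
def require_internal_referrer():
    # Normal navigation is root -> search -> timeline/reply, so a legitimate
    # request to the timeline carries a referrer from the instance itself.
    if request.path.endswith("/with_replies"):
        referrer_host = urlparse(request.headers.get("Referer", "")).netloc
        if referrer_host != INSTANCE_HOST:
            # A human who arrived via bookmark gets an explanation instead
            # of data; a bot just stops getting timelines.
            abort(403, description="Direct requests to this endpoint are "
                                   "blocked; please navigate from the site.")
```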