and the remote server falls for it every time – Being undead is no excuse for skipping leg day

ejbarnes:

foone:

Speaking as someone who has written web crawlers for non-evil reasons…

I like how all the platforms (twitter, reddit) are destroying their APIs to kill “bots” but the fun thing about APIs is that only the good guys use them. They’re like door locks: they only keep out honest people. Someone wanting to steal your TV will just put a brick through your window.

Similarly, people wanting to flood a platform with spam and pornbots will often just not use the API, because it makes it too easy to track them down. They’ll instead write a program that pretends to be a browser, and clicks on links just like a human does.

Fun fact: that’s an “API” that exists for every website, and for a long while it was the only API that any sites ever had. So when you’re trying to automate using a site (for good or evil), the “api-zero” of just doing web scraping and user-agent-impersonation is always there. That’s what the bad guys will use, and that’s what the good guys are sometimes forced to use.
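To make the “api-zero” idea concrete, here is a minimal sketch using only Python’s standard library. It assumes a made-up browser-like User-Agent string (`BROWSER_UA`) and hypothetical helper names (`build_request`, `extract_links`, `fetch_links`); none of these come from the original post, and a real crawler would need a lot more care (rate limiting, robots.txt, error handling).

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Assumption: an illustrative browser-like User-Agent string, not taken from the post.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url):
    """Build a GET request that identifies itself the way a browser would."""
    return Request(url, headers={"User-Agent": BROWSER_UA, "Accept": "text/html"})


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, i.e. the links a human could click."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every link target found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_links(url):
    """Fetch a page as a 'browser' and return its links (does network I/O)."""
    with urlopen(build_request(url), timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
    return extract_links(body)
```

Nothing here touches an official API: the server just sees a request that claims to come from a browser, and the “response format” is whatever HTML a human visitor would have received.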

Anyway, the end result of this sort of API monetization/destruction nonsense is that you’re only killing the bots that were written with good intentions. You’re killing the haikubots and that “THERE ARE FIVE LIGHTS” twitter bot. You’re killing the reddit bots that help moderate submissions by automatically applying flair or timing out replies once too much time has passed.

But the bots that are just there to send you cryptocurrency scams and entice you in with stolen porn? They don’t use the APIs. They won’t be affected. They’ll keep on working. The people scraping your site for AI research? They won’t even register an account, they’ll just request the plain HTML contents of your pages.

So once you know that, locking out users from your APIs seems like a real bad idea, doesn’t it?

Can anyone contradict this? It seems way too sensible. It’s been a long time since I had to program anything using a third party’s API (MS Word for Lexis-Nexis’s CompareRite) and there’s no question that in 1996 a lot of human work was involved.

No, it’s very sensible; just google “headless browser.” Using the API makes things faster by 10x or so, but not using the API just means you go up a level: instead of pulling the data out of an HTTP response, you extract it from the DOM of a browser, which is exactly the software a website is made for.
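A rough illustration of “going up a level,” as a sketch only: the same two post titles are pulled once from a JSON payload (the API route) and once from page markup (the scraping route). The payload, the markup, the `post-title` class and the function names are all invented for this example, and a static HTML string stands in for the DOM a real headless browser would render.

```python
import json
from html.parser import HTMLParser

# Assumption: a made-up API response and the matching made-up page markup.
API_RESPONSE = '{"posts": [{"title": "First post"}, {"title": "Second post"}]}'
PAGE_HTML = """
<html><body>
  <div class="post"><h2 class="post-title">First post</h2><p>...</p></div>
  <div class="post"><h2 class="post-title">Second post</h2><p>...</p></div>
</body></html>
"""


def titles_from_api(payload):
    """The API route: the data arrives already structured."""
    return [post["title"] for post in json.loads(payload)["posts"]]


class TitleCollector(HTMLParser):
    """The scraping route: find the text inside <h2 class="post-title">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "post-title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False


def titles_from_html(html):
    """Return the post titles found in the page markup, in document order."""
    collector = TitleCollector()
    collector.feed(html)
    return [title.strip() for title in collector.titles]
```

Both routes end up with the same list; the scraping route just needs more code and breaks whenever the markup changes, which is an annoyance for well-meaning bot authors and not much of an obstacle for anyone determined.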
