Cloudflare: Perplexity uses stealth crawling techniques, like undeclared user agents and rotating IP addresses, to evade robots.txt rules and network blocks

Pro@mander.xyz · 1 month ago

some_guy@lemmy.sdf.org · 1 month ago

In other words, they’re assholes.

CarbonatedPastaSauce@lemmy.world · 1 month ago

The only surprising thing to me from this article is that OpenAI actually follows the rules for bot crawlers.

0_o7@lemmy.dbzer0.com · 1 month ago

Or they haven’t been caught yet.

The article explains that PerplexityBot respects robots.txt, but Perplexity then sends a second request from a different IP with a different user-agent. OpenAI could very well be using some other method to get around it.
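
For anyone wondering why switching the user-agent is enough: robots.txt rules are matched by user-agent name, so a rule aimed at the declared bot says nothing to a request that claims to be a regular browser. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt contents, the URL, and the browser-style user-agent string are all made up for illustration, not taken from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks the declared crawler by name,
# allows everything else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/page"  # placeholder URL

# The declared crawler matches the named rule and is refused.
declared = is_allowed("PerplexityBot", URL)

# An illustrative browser-like user-agent only matches the wildcard rule.
generic = is_allowed(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0.0.0", URL
)

print(f"declared bot allowed: {declared}")
print(f"generic user-agent allowed: {generic}")
```

This only models the robots.txt side. The rotating IPs matter for the network-level blocks, which this sketch doesn't cover.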

CarbonatedPastaSauce@lemmy.world · 1 month ago

The article explains how they tested for that, and as far as they could tell OpenAI is respecting the rules.

TomasEkeli@programming.dev · 1 month ago

A sure sign that they are a nefarious company.

beeng@discuss.tchncs.de · 1 month ago

Perplexity fired back in their blog.

Pretty tasty.

Kay Ohtie@pawb.social · 1 month ago

Perplexity’s firing back assumes website owners distinguish between automated scraping and on-demand scraping.

I don’t think most people make that distinction.

And that falls in line perfectly with the typical “assumption of access” all of these “AI” companies make.

Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives