
Many (most?) "big content" sites let the Google and Bing spiders scrape the full text of their articles, so that when people search for terms in an article they get a hit and are then referred to the paywall.

Google doesn't want everyone to know what a Google indexing request looks like for fear the CEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywall.

Or maybe they're okay with letting the archive index their content...



Just FYI, Google and Bing publish the user agent strings[1][2] for their crawlers. At least in my experience, most of the typical ad-infested and paywalled news sites won't display the paywall if you change your user agent to that of a crawler they prefer.

[1] https://developers.google.com/search/docs/crawling-indexing/... [2] https://www.bing.com/webmasters/help/which-crawlers-does-bin...
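
For anyone curious what "changing the user agent" amounts to, here is a minimal sketch in Python. The helper name `build_crawler_request` and the example URL are made up for illustration, and the UA string is the classic Googlebot token as I remember it from Google's docs, so check [1] for the current list. It only builds the request and does not send it. Sites that verify crawlers by IP or DNS rather than by header will not be fooled by this.

```python
from urllib.request import Request

# Assumption: the long-standing Googlebot user agent token; see [1] for current values.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_crawler_request(url: str, user_agent: str = GOOGLEBOT_UA) -> Request:
    """Build (but do not send) an HTTP request that presents a crawler User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Hypothetical URL, used only to show the header being set.
req = build_crawler_request("https://example.com/article")
print(req.get_header("User-agent"))  # urllib stores header names capitalized this way
```

Passing `req` to `urllib.request.urlopen` would actually send it; whether the server then skips the paywall depends entirely on how that site identifies crawlers.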


Doesn't almost every site on the web know exactly what the Google bot looks like?


Google gives precise details about how to verify their bot is crawling your site and how to denote what content is paywalled and what isn’t.
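
The verification Google describes is, as far as I recall, a reverse DNS lookup on the requesting IP followed by a forward lookup to confirm that the hostname maps back to the same IP. A rough sketch, not a definitive implementation: the function name, the injectable resolver parameters, and the exact suffix list are my assumptions, and the default forward lookup here only handles IPv4.

```python
import socket
from typing import Callable, List

# Assumption: hostname suffixes taken from Google's published guidance; check the current docs.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    # PTR lookup: IP address to hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    # Forward lookup: hostname to IPv4 addresses only.
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(
    ip: str,
    reverse_lookup: Callable[[str], str] = _reverse,
    forward_lookup: Callable[[str], List[str]] = _forward,
) -> bool:
    """Return True only if reverse and forward DNS agree that ip is a Google crawler."""
    try:
        hostname = reverse_lookup(ip).lower().rstrip(".")
    except OSError:
        return False  # no PTR record or the lookup failed
    if not hostname.endswith(GOOGLE_CRAWLER_SUFFIXES):
        return False  # hostname is not under a Google crawler domain
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False  # hostname does not resolve
```

The resolvers are parameters so the logic can be exercised without network access. A spoofed User-Agent from an arbitrary IP fails the first suffix check, which is why header-only tricks stop working on sites that bother to do this.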


Bingo. This is what I use to incentivize using a nonmonopolistic search engine to find the few sites I run.



