To determine whether content changes Google has to spend budget as well, hasn't ...

throw0101a · on Aug 9, 2023

> So it has to fetch that 20-years old article.

It doesn't have to fetch every article (statistical sampling can give confidence intervals), and it doesn't have to fetch the full article: doing a "HEAD /" instead of a "GET /" will save on bandwidth, and throwing in ETag / If-Modified-Since / whatever headers can get the status of an article (200 versus 304 response) without bothering with the full fetch.
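To make the 200-versus-304 distinction concrete, here is a minimal sketch of a conditional HEAD check run against a throwaway local server. The names `ArticleHandler`, `has_changed`, `demo` and the ETag value `"v1"` are made up for illustration; this is not a description of how any real crawler is implemented.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ARTICLE = b"<html><body>An article from 2003.</body></html>"
ETAG = '"v1"'  # hypothetical validator for the current version of the article


class ArticleHandler(BaseHTTPRequestHandler):
    """Toy server: answers 304 when the client's ETag matches, else 200."""

    def _answer(self, send_body):
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)  # unchanged: no body is sent
            self.send_header("ETag", ETAG)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("ETag", ETAG)
        self.send_header("Content-Length", str(len(ARTICLE)))
        self.end_headers()
        if send_body:
            self.wfile.write(ARTICLE)

    def do_HEAD(self):
        self._answer(send_body=False)

    def do_GET(self):
        self._answer(send_body=True)

    def log_message(self, *args):
        pass  # keep the demo quiet


def has_changed(host, port, path, known_etag):
    """Send a conditional HEAD; True means the stored copy is stale."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("HEAD", path, headers={"If-None-Match": known_etag})
        response = conn.getresponse()
        response.read()  # a HEAD response carries no body
        return response.status != 304
    finally:
        conn.close()


def demo():
    """Start the toy server on a free port and probe it twice."""
    server = HTTPServer(("127.0.0.1", 0), ArticleHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return (
            has_changed("127.0.0.1", port, "/article", '"v1"'),  # current ETag
            has_changed("127.0.0.1", port, "/article", '"v0"'),  # outdated ETag
        )
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` probes the server once with the current ETag and once with an outdated one. Neither request transfers the article body.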

voramok · on Aug 9, 2023

There’s an obvious way this can be exploited. Bait and switch.

strken · on Aug 9, 2023

If the content is literally the same, the crawler should be able to use If-Modified-Since, right? It still has to make an HTTP request, but not parse or index anything.

codedokode · on Aug 9, 2023

If the content is dynamic (e.g. a list of popular articles in a sidebar has changed), then the page will be considered "updated".
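A small sketch of that effect, assuming the validator is a hash of the rendered output. The helpers `etag_for` and `render_page` and the sample strings are hypothetical; real sites and frameworks derive ETags in many different ways.

```python
import hashlib


def etag_for(content):
    """Derive a validator from content (assumption: a truncated SHA-256 hash)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return '"' + digest[:16] + '"'


def render_page(article, sidebar):
    """Combine the unchanging article with a dynamic 'popular articles' sidebar."""
    items = "".join(f"<li>{title}</li>" for title in sidebar)
    return f"<main>{article}</main><aside><ul>{items}</ul></aside>"


article = "Text written in 2003 and never edited."
monday = render_page(article, ["Popular A", "Popular B"])
tuesday = render_page(article, ["Popular C", "Popular A"])

# Hashing the whole page: the sidebar alone makes the page look updated.
whole_page_changed = etag_for(monday) != etag_for(tuesday)

# Hashing only the article text: the validator stays stable across days.
article_changed = etag_for(article) != etag_for(article)
```

Here `whole_page_changed` is true even though the article text is identical, while a validator derived from the article alone does not move.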

wise_young_man · on Aug 10, 2023

This is not correct. It’s up to the server, under the application’s control, to send that or other headers, much like sending a <title> tag. The headers take priority and, as another commenter said, the crawler will do a HEAD request first and not bother with a GET request for the content.
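As a rough illustration of the application deciding the answer itself, the sketch below picks a status code from the article's own last-edit time rather than from the rendered page. The function `status_for` and the sample dates are assumptions for this example, not any particular framework's API.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def status_for(if_modified_since, article_last_modified):
    """Return 304 if the article has not been edited since the given HTTP date."""
    if if_modified_since is None:
        return 200  # unconditional request
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the header
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if article_last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


# Hypothetical last edit of the article body (sidebar changes are not counted).
last_edit = datetime(2003, 3, 1, 12, 0, tzinfo=timezone.utc)
last_modified_header = format_datetime(last_edit, usegmt=True)

unchanged = status_for(last_modified_header, last_edit)
stale_copy = status_for("Wed, 01 Jan 2003 00:00:00 GMT", last_edit)
```

Because the comparison uses only the article's edit time, a reshuffled sidebar would not flip the answer from 304 to 200.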