
To determine whether content has changed, Google has to spend crawl budget as well, doesn't it? So it has to fetch that 20-year-old article.


> So it has to fetch that 20-year-old article.

It doesn't have to fetch every article (statistical sampling can give confidence intervals), and it doesn't have to fetch the full article: doing a "HEAD /" instead of a "GET /" saves on bandwidth, and throwing in ETag / If-Modified-Since / whatever headers can get the status of an article (200 versus 304 response) without bothering with the full fetch.
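For illustration, a rough sketch of what that conditional re-check could look like on the crawler side (assuming Python's requests library and a made-up URL; Google's actual crawler obviously works differently):

    import requests

    url = "https://example.com/old-article"  # hypothetical URL

    # Cheap metadata check: HEAD returns headers only, no body.
    head = requests.head(url, allow_redirects=True)
    etag = head.headers.get("ETag")
    last_modified = head.headers.get("Last-Modified")

    # Conditional GET: the server may answer 304 Not Modified with no body.
    conditional = {}
    if etag:
        conditional["If-None-Match"] = etag
    if last_modified:
        conditional["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=conditional)
    if resp.status_code == 304:
        print("Unchanged; keep the cached copy, nothing to re-index")
    else:
        print("Changed; re-parse", len(resp.content), "bytes")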


There’s an obvious way this can be exploited. Bait and switch.


If the content is literally the same, the crawler should be able to use If-Modified-Since, right? It still has to make an HTTP request, but it doesn't have to parse or index anything.


If the content is dynamic (e.g. a list of popular articles in a sidebar has changed), then the page will be considered "updated".


This is not correct. It's up to the server, controlled by the application, to send that or other headers, similar to sending a <title> tag. The headers take priority, and, as another commenter said, the crawler will do a HEAD request first and not bother with a GET for the content.
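For illustration, a rough sketch of the server side (assuming Flask; the route, timestamp, and body are made up) showing that the application decides whether a 304 is ever sent:

    from flask import Flask, request, Response
    from email.utils import format_datetime, parsedate_to_datetime
    from datetime import datetime, timezone

    app = Flask(__name__)

    # Hypothetical fixed "last edited" time for an old article.
    ARTICLE_MTIME = datetime(2005, 6, 1, tzinfo=timezone.utc)
    ARTICLE_BODY = "<html><head><title>Old article</title></head><body>...</body></html>"

    @app.route("/old-article")
    def old_article():
        # If the crawler's cached copy is still current, answer 304 with no body.
        ims = request.headers.get("If-Modified-Since")
        if ims and parsedate_to_datetime(ims) >= ARTICLE_MTIME:
            return Response(status=304)
        # Otherwise serve the page and advertise its modification time.
        resp = Response(ARTICLE_BODY, mimetype="text/html")
        resp.headers["Last-Modified"] = format_datetime(ARTICLE_MTIME, usegmt=True)
        return resp

An application that skips the Last-Modified / ETag headers (or ignores the conditional request headers) never gives the crawler a 304 to work with, no matter what the crawler sends.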



