It doesn't have to fetch every article (statical sampling can give confidence intervals), and it doesn't have to fetch the full article: doing a "HEAD /" instead of a "GET /" will save on bandwidth, and throwing in ETag / If-Modified-Since / whatever headers can get the status of an article (200 versus 304 response) without bother with the full fetch.
If the content is literally the same, the crawler should be able to use If-Modified-Since, right? It still has to make a HTTP request, but not parse or index anything.
This is not correct. It’s up to the server, controlled by the application to send that or other headers. Similar to sending a <title> tag. The headers take priority and similar to what another person said they will do a HEAD request first and not bother with a GET request for the content.