
I don't understand the page; it shows a list of data sets (I think?) up to 91 TiB in size.

The list of short links and their target URLs can't be 91 TiB in size, can it? Does anyone know how this works?



The data is saved as a WARC file, which contains the entire HTTP request and response (compressed, of course). So it's much bigger than just a short -> long URL mapping.
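For illustration, here's a minimal sketch (not ArchiveTeam's actual pipeline) of what one archived goo.gl redirect could look like as a WARC record, using the Python warcio library; the short ID and target URL are made up:

    from io import BytesIO

    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    # Hypothetical example: write one goo.gl redirect response as a WARC record.
    with open('googl-sample.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)

        # The 301 response a shortener returns: status line plus headers,
        # including the Location header that carries the long URL.
        http_headers = StatusAndHeaders(
            '301 Moved Permanently',
            [('Location', 'https://example.com/some/very/long/target?with=params')],
            protocol='HTTP/1.1')

        record = writer.create_warc_record(
            'https://goo.gl/abc123',   # made-up short link
            'response',
            payload=BytesIO(b''),      # a redirect usually has an empty body
            http_headers=http_headers)

        writer.write_record(record)

On top of the HTTP headers, each record carries its own WARC headers (record ID, timestamp, digests), which is part of why a WARC dump is much larger than a bare short -> long table.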


Did they follow the redirect and archive the page content? But why?


I did some ridiculous napkin math. A random URL I pulled from a Google search was 705 bytes. A goo.gl link is 22 bytes, but if you only store the ID, it'd be 6 bytes. Some URLs will be shorter, some longer, but ballparking it all, that lands us in the neighborhood of hundreds of billions of URLs, up to trillions.
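To make that concrete, a quick sketch of the arithmetic, assuming 91 TiB total and one 705-byte URL plus a 6-byte ID per entry and nothing else:

    TIB = 2 ** 40
    total_bytes = 91 * TIB       # reported dataset size
    bytes_per_entry = 705 + 6    # one long URL plus a 6-byte short ID

    print(total_bytes / bytes_per_entry)  # ~1.4e11, i.e. roughly 140 billion entries

With a more typical ~80-byte URL, the same 91 TiB would imply over a trillion entries, hence the "hundreds of billions up to trillions" range.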


> A random URL I pulled from a Google search was 705 bytes.

705 bytes is an extremely long URL. Even if we assume that URLs that get shortened tend to be longer than URLs overall, that’s still an unrealistic average.


It is long; it's what gives the lower bound, the hundreds of billions, in my awful napkin math.


The 91 TiB includes not just the URL mappings but the actual content of all destination pages, which ArchiveTeam captures to ensure the links remain functional even if original destinations disappear.


OK, but the destination pages are not at risk (or at least no more than any random page on the web), so why spend any effort crawling them before all the short links have been saved?


3.75 billion URLs, and according to this[1] the average URL is 76.97 characters, which works out to ~268.8 GiB without the goo.gl IDs/metadata. So I also wonder what's up with that.

[1] https://web.archive.org/web/20250125064617/http://www.superm...
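As a rough check of that figure (a sketch assuming exactly 3.75 billion links and the 76.97-character average from the linked survey, ignoring per-record overhead):

    GIB = 2 ** 30
    num_urls = 3.75e9     # goo.gl links cited above
    avg_url_len = 76.97   # average URL length from the linked survey

    print(num_urls * avg_url_len / GIB)  # ~268.8 GiB of raw long URLs

That's roughly 1/350th of the 91 TiB figure, which only adds up if the archived response bodies and WARC overhead are included.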


They might be storing it in WARC format, which records all the request and response headers and maybe even TLS certificates and the like.



