
I don't understand the page; it shows a list of data sets (I think?) up to 91 TiB in size.

The list of short links and their target URLs can't be 91 TiB in size, can it? Does anyone know how this works?



The data is saved as a WARC file, which contains the entire HTTP request and response (compressed, of course). So it's much bigger than just a short -> long URL mapping.
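For illustration, here's a minimal sketch (not ArchiveTeam's actual pipeline) of what one archived goo.gl redirect could look like as a WARC record, using the Python warcio library; the short ID and target URL are made up:

    from io import BytesIO

    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    # Hypothetical example: write one goo.gl redirect response as a WARC record.
    with open('googl-sample.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)

        # The 301 response a shortener returns: status line plus headers,
        # including the Location header that carries the long URL.
        http_headers = StatusAndHeaders(
            '301 Moved Permanently',
            [('Location', 'https://example.com/some/very/long/target?with=params')],
            protocol='HTTP/1.1')

        record = writer.create_warc_record(
            'https://goo.gl/abc123',   # made-up short link
            'response',
            payload=BytesIO(b''),      # a redirect usually has an empty body
            http_headers=http_headers)

        writer.write_record(record)

On top of the HTTP headers, each record carries its own WARC headers (record ID, timestamp, digests), which is part of why a WARC dump is much larger than a bare short -> long table.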


Did they follow the redirect and archive the page content? But why?


I did some ridiculous napkin math. A random URL I pulled from a Google search was 705 bytes. A goo.gl link is 22 bytes, but if you only store the ID, it'd be 6 bytes. Some URLs will be shorter, some longer, but ballparking it all, that lands us in the neighborhood of hundreds of billions of URLs, up to trillions.
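To make that concrete, a quick sketch of the arithmetic, assuming 91 TiB total and one 705-byte URL plus a 6-byte ID per entry and nothing else:

    TIB = 2 ** 40
    total_bytes = 91 * TIB       # reported dataset size
    bytes_per_entry = 705 + 6    # one long URL plus a 6-byte short ID

    print(total_bytes / bytes_per_entry)  # ~1.4e11, i.e. roughly 140 billion entries

With a more typical ~80-byte URL, the same 91 TiB would imply over a trillion entries, hence the "hundreds of billions up to trillions" range.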


> A random URL I pulled from a Google search was 705 bytes.

705 bytes is an extremely long URL. Even if we assume that URLs that get shortened tend to be longer than URLs overall, that’s still an unrealistic average.


It is long; it's what gives the lower bound, the hundreds of billions, in my awful napkin math.


The 91 TiB includes not just the URL mappings but the actual content of all destination pages, which ArchiveTeam captures to ensure the links remain functional even if original destinations disappear.


OK, but the destination pages are not at risk (or at least no more than any random page on the web), so why spend any effort crawling them before all the short links have been saved?


3.75 billion URLs, and according to this[1] the average URL is 76.97 characters, which works out to ~268.8 GiB without the goo.gl IDs/metadata. So I also wonder what's up with that.

[1] https://web.archive.org/web/20250125064617/http://www.superm...
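As a rough check of that figure (a sketch assuming exactly 3.75 billion links and the 76.97-character average from the linked survey, ignoring per-record overhead):

    GIB = 2 ** 30
    num_urls = 3.75e9     # goo.gl links cited above
    avg_url_len = 76.97   # average URL length from the linked survey

    print(num_urls * avg_url_len / GIB)  # ~268.8 GiB of raw long URLs

That's roughly 1/350th of the 91 TiB figure, which only adds up if the archived response bodies and WARC overhead are included.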


They might be storing it in WARC format, which records all the request and response headers and maybe even TLS certificates and the like.



