Every file in TAR ends with 1KiB of zeros as “end of file marker”
1 point by vitiral on Sept 17, 2023 | 17 comments
https://docs.fileformat.com/compression/tar/

I don't understand how this format was ever thought to be a good idea

* the filename field is exactly 100 bytes

* file size is octal, at most 8^12 (correction: 8^11, since 1 byte is a \0, so 8 GiB). Why octal and not binary?

* there is an extra "end of file" marker of 1024 bytes of zeros

* contains NUMERIC owner/group IDs... on a serialized data format meant to be sent between computers (whose IDs, other than root's, might not agree, right?)
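
For reference, a rough sketch of how those fields sit in the 512-byte header block (offsets per the usual ustar layout; illustrative only, not a real tar reader):

    # Sketch: pull the fields listed above out of one 512-byte classic
    # (ustar) header block. Numeric fields are ASCII octal text.
    def parse_header(block: bytes):
        def octal(field: bytes) -> int:
            # numeric fields are octal digits, padded with NULs/spaces
            return int(field.rstrip(b"\0 ") or b"0", 8)
        name = block[0:100].rstrip(b"\0").decode()   # fixed 100-byte name field
        uid  = octal(block[108:116])                 # numeric owner id
        gid  = octal(block[116:124])                 # numeric group id
        size = octal(block[124:136])                 # 12-byte octal size field
        return name, uid, gid, size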

Why the extra EoF? Is this out of concern for data corruption? If so, why not a simple CRC checker for both detection and recovery?

The whole thing seems poorly designed yet is ubiquitous in Linux. I've used it for more than a decade without ever asking about its format.

Some alternatives off the top of my head:

* remove EoF waste

* 2-byte filename lengths, allowing file NAMES up to 64KiB and removing the wasted bytes when most of the 100-byte field goes unused.

* add CRC checksum

* NAMED owner/group, permitting some kind of cross-platform usage. Or just remove this feature entirely (preferable IMO)

* Don't use octal for file size and get nearly infinite file sizes.
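
Purely to illustrate that wish list, a hypothetical entry layout with binary fields and a CRC could be as simple as this (every field and name here is invented; it isn't any real format):

    import struct, zlib

    def pack_entry(name: str, data: bytes) -> bytes:
        # hypothetical layout: 2-byte name length, 8-byte binary size,
        # UTF-8 name, CRC32 over header+data, then the data itself
        name_b = name.encode("utf-8")
        header = struct.pack("<HQ", len(name_b), len(data)) + name_b
        crc = zlib.crc32(header + data)
        return header + struct.pack("<I", crc) + data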



Most of the alternatives you suggest are implemented in PAX, which is an extension of the tar format. The EoF block is useful if the archive is written directly to physical media without a filesystem: it lets you determine where the archive ends.
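
For example, Python's tarfile module can write PAX archives, where an over-long name or an over-size file spills into extended header records instead of the fixed fields (a minimal sketch; the file name and contents are made up):

    import io, tarfile

    data = b"hello"
    info = tarfile.TarInfo(name="dir/" + "very_long_name_" * 20 + ".txt")  # > 100 bytes
    info.size = len(data)
    with tarfile.open("example.tar", "w", format=tarfile.PAX_FORMAT) as tf:
        tf.addfile(info, io.BytesIO(data))  # the long name goes into a pax extended header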


Wouldn't such a feature be better put on the physical media controller/protocol itself?


Yes; but then people wanted to be able to read the tapes they'd written long before the hardware and media acquired such features, so TAR kept the ability (it was more "don't change things that ain't broke", I think).

ARC and ZIP files are written as a fresh take on the idea of archive files, with much more capable hardware, after TAR had been around a couple decades. They have many features designed to use those new hardware capabilities, and were (and still are) very popular because of those.

They have bits that probably seem dated now, too. Breaking archives into floppy-sized chunks, but without any sort of forward error correction? No format support for Unicode? (Who cares, it hadn't been invented when the ZIP file spec was created.)


Because because.

TAR was written for very simple/tiny machines by today's standards, and was designed to read/write full valid blocks on physical tapes with constraints on spool-up and spool-down times/distances.
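
Concretely, everything tar writes is padded out to whole 512-byte blocks, so a block-oriented tape only ever sees full blocks; a rough sketch (the helper name is made up):

    BLOCK = 512

    def pad_to_block(data: bytes) -> bytes:
        # pad member data with NULs up to the next 512-byte boundary,
        # so the device is only ever handed full blocks
        rem = len(data) % BLOCK
        return data if rem == 0 else data + b"\0" * (BLOCK - rem)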

The description here seems reasonable: https://en.wikipedia.org/wiki/Tar_(computing)


I don't think any of the alternatives I listed couldn't be done on an embedded device or a tape drive, though. I understand that the format is old, so perhaps many of the arbitrary constraints weren't seen as that bad at the time.


This format was developed a long time ago, before the luxury of (a) experience and (b) newer, more capable storage hardware.

It's really strange to complain that a legacy format is full of bad features by modern tastes and hardware - how do you think we worked out which features of formats and hardware were good or bad in the first place?

The history in the Wikipedia page that I linked is instructive.


I'm arguing that these features make no sense _in the old context_ either. Why waste 1KiB, plus the unused (100 - namelen) bytes per file, when space is so precious?


We used to have files spanning many tapes. We changed the OS to do dead reckoning on the end of each tape so we could stop well clear of the actual end mark. That way individual tapes could be copied and substituted if needed. Hard to see the reason if you don't just know.


I don't think I understand but maybe that's the point. Maybe it seems mysterious because there were other requirements at the time which were themselves already mysterious?


If you try to copy a tape to a slightly shorter tape you will be out of luck, and only find out when you get there.


Because some tape drives could only read and write whole (512B) blocks, and the way to be relatively sure that you didn't have a new file was to see two blocks of zeros.
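
In other words, a reader pulling raw blocks off a device, with no filesystem telling it how long the archive is, just keeps reading until it sees two zero blocks in a row. A rough sketch (not robust; a real reader would also follow the sizes in the headers):

    BLOCK = 512
    ZERO_BLOCK = b"\0" * BLOCK

    def blocks_until_end(stream):
        # yield 512-byte blocks from a raw stream (e.g. a tape device) until
        # the end-of-archive marker: two consecutive all-zero blocks
        zeros = 0
        while zeros < 2:
            block = stream.read(BLOCK)
            if len(block) < BLOCK:
                break  # ran off the end of the media
            zeros = zeros + 1 if block == ZERO_BLOCK else 0
            yield block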

The Wikipedia item explains this.


I think I'm confused. You make it sound like the "tape reader" hardware/driver isn't talking to the "file reader" part in software. Didn't the file reader tell the tape reader the size of the file, so it would already know where the end was (how many blocks it should read)?


Tar is a descendant of file formats that were written to tapes. Recovery was to the same system & space was not cheap. There is also an error in the maximal size: it's 8 GiB, not 64. The last byte must be a \0. This is fixed in modern GNU and BSD tar.
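
The arithmetic behind that: the size field is 12 bytes of octal text, the last of which must be the \0, leaving 11 octal digits:

    >>> 8 ** 11                     # 11 octal digits of size
    8589934592
    >>> 8 ** 11 == 8 * 1024 ** 3    # = 8 GiB
    True
    >>> 8 ** 12 == 64 * 1024 ** 3   # 12 digits would have read as 64 GiB
    True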


Thanks for the correction.

If space was so important, then why waste 1KiB per file? Why waste (100 - namelen) bytes per file? Why spend space on owner/group IDs?


There was a push to use the newer and better-designed CPIO format, with limited success... seems tar is just too ingrained.

edit: looked into it a bit and it was not much better in fact.


I don't understand... Here is a better way...

No. Don't even think about it. tar is. It always will be that way so leave it.


I'm not suggesting changing it; I'm trying to understand how it could possibly have been this way in the first place.



