
The original Word format was a literal dump of a part of the data segment of the Word process. Basically like an mmapped file. Super fast. It is a pity that modern languages and their runtimes do not allow data structures to be saved like that.
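
A minimal sketch of what that amounts to, in C, assuming a fixed-layout, pointer-free struct (the names are made up): the in-memory bytes are written and read back verbatim, which is very fast but only safe when the exact same build reads them back.

    /* Hypothetical pointer-free document state, dumped and reloaded as raw bytes. */
    #include <stdio.h>

    struct Doc {
        int  version;
        int  char_count;
        char text[1024];
    };

    int save(const char *path, const struct Doc *d) {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        size_t n = fwrite(d, sizeof *d, 1, f);   /* the whole struct, verbatim */
        fclose(f);
        return n == 1 ? 0 : -1;
    }

    int load(const char *path, struct Doc *d) {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        size_t n = fread(d, sizeof *d, 1, f);    /* read the bytes straight back */
        fclose(f);
        return n == 1 ? 0 : -1;
    }

    int main(void) {
        struct Doc d = { 1, 5, "hello" };
        save("doc.bin", &d);

        struct Doc e;
        load("doc.bin", &e);
        printf("%s\n", e.text);
        return 0;
    }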


You can absolutely save data like that; it's just that it's a terrible idea. There are obvious portability concerns: little-endian vs. big-endian, 32-bit vs. 64-bit, struct padding, etc.

Essentially, this system works great if you know the exact hardware and compiler toolchain, and you never expect to upgrade it in ways that might break memory layout. Obviously this does not hold for Word: it was written originally in a 32-bit world and now we live in a 64-bit one, MSVC has been upgraded many times, etc. There's also an address-space concern: if you embed your pointers, are you SURE that you're always going to be able to load them at the same place in the address space?
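
A small illustration of the portability problem, with a made-up struct: the on-disk bytes of a raw dump depend on padding, pointer width, and endianness, and any embedded pointer is meaningless in another process.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    struct Run {
        uint8_t  style;     /* 1 byte, then (typically) 3 bytes of padding */
        uint32_t start;     /* stored little- or big-endian depending on the CPU */
        char    *text;      /* 4 or 8 bytes, and garbage in any other process */
    };

    int main(void) {
        printf("sizeof(struct Run) = %zu\n", sizeof(struct Run));
        printf("offsetof(start)    = %zu\n", offsetof(struct Run, start));
        printf("offsetof(text)     = %zu\n", offsetof(struct Run, text));
        /* A 32-bit build, a 64-bit build, and a different ABI can disagree
           on these numbers, so they cannot all share the same dump. */
        return 0;
    }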

The overhead of deserialization is very small with a properly written file format; it's nowhere near worth the sacrifice in portability. This is not why Word is slow.
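
For contrast, a sketch of what a "properly written" format tends to look like (the helper names are made up): each field goes out byte by byte in a fixed order, so padding, ABI, and CPU endianness stop mattering, and the per-field cost is tiny.

    #include <stdio.h>
    #include <stdint.h>

    /* Write and read a 32-bit value in a fixed (little-endian) byte order. */
    static void put_u32le(FILE *f, uint32_t v) {
        uint8_t b[4] = { v & 0xff, (v >> 8) & 0xff, (v >> 16) & 0xff, (v >> 24) & 0xff };
        fwrite(b, 1, 4, f);
    }

    static uint32_t get_u32le(FILE *f) {
        uint8_t b[4] = {0};
        fread(b, 1, 4, f);
        return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
               ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    }

    int main(void) {
        FILE *f = fopen("doc.dat", "wb+");
        if (!f) return 1;
        put_u32le(f, 0xCAFEBABE);                 /* write a field explicitly */
        rewind(f);
        printf("%x\n", (unsigned)get_u32le(f));   /* read it back the same way */
        fclose(f);
        return 0;
    }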


Andrew Kelley (author of Zig) has a nice talk about how programming without pointers allows ultra-fast serialization/deserialization. [0]

And then you have things like Cap'n Proto if you want to control your memory layout. [1]

But for "productivity" files, you are essentially right. Portability and simplicity of the format are probably what matter.

[0]: https://www.hytradboi.com/2025/05c72e39-c07e-41bc-ac40-85e83...

[1]: https://capnproto.org/
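
A rough sketch of the index-instead-of-pointer idea (not code from the talk; all names are made up): nodes live in one array and refer to each other by index, so the whole arena is position-independent and serializing it is a single write.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_NODES 1024
    #define NIL 0xFFFFFFFFu

    struct Node {
        uint32_t value;
        uint32_t next;        /* index into nodes[], not a pointer */
    };

    struct Arena {
        uint32_t    count;
        struct Node nodes[MAX_NODES];
    };

    static uint32_t push(struct Arena *a, uint32_t value, uint32_t next) {
        a->nodes[a->count] = (struct Node){ value, next };
        return a->count++;
    }

    int main(void) {
        static struct Arena a = {0};
        uint32_t tail = push(&a, 3, NIL);
        uint32_t head = push(&a, 1, tail);

        /* No pointers anywhere, so this one write is the whole
           serialization step (endianness aside). */
        FILE *f = fopen("arena.bin", "wb");
        if (!f) return 1;
        fwrite(&a, sizeof a, 1, f);
        fclose(f);

        for (uint32_t i = head; i != NIL; i = a.nodes[i].next)
            printf("%u\n", a.nodes[i].value);
        return 0;
    }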


That is true; Cap'n Proto and FlatBuffers are excellent realizations of this basic concept. But that's a very different thing from what the commenter describes Word doing in the 90s: just memory-mapping the internal data structures and being done with it.


Smalltalk is something like that.


It's only a terrible idea because our tools are terrible.

That's exactly the point!

(For example, if Rust could detect a version change, it could rewrite the data into a compatible format, etc.)
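
A sketch of that version-detection idea, with made-up layouts: the file carries a version tag, and an older layout gets migrated into the current one at load time.

    #include <stdio.h>
    #include <stdint.h>

    struct DocV1 { uint32_t version; uint32_t chars; };                  /* old layout */
    struct DocV2 { uint32_t version; uint32_t chars; uint32_t style; };  /* current layout */

    int load_doc(const char *path, struct DocV2 *out) {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        uint32_t version = 0;
        fread(&version, sizeof version, 1, f);
        rewind(f);

        if (version == 1) {                 /* old layout: migrate it */
            struct DocV1 old;
            fread(&old, sizeof old, 1, f);
            out->version = 2;
            out->chars   = old.chars;
            out->style   = 0;               /* default for the new field */
        } else {                            /* current layout: read as-is */
            fread(out, sizeof *out, 1, f);
        }
        fclose(f);
        return 0;
    }

    int main(void) {
        struct DocV1 v1 = { 1, 42 };
        FILE *f = fopen("old.bin", "wb");
        if (!f) return 1;
        fwrite(&v1, sizeof v1, 1, f);
        fclose(f);

        struct DocV2 d;
        load_doc("old.bin", &d);
        printf("version=%u chars=%u style=%u\n", d.version, d.chars, d.style);
        return 0;
    }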


At which point you're not just memory-mapping the file anymore. And if the new version changes the size of an object, it doesn't pack into the same place in memory, so you have to repack before saving. Even serializing with versioning is very hard. Memory mapping is much worse. Several other comments indicate that I am not the only one with bad experiences here.


Your mileage may vary. I didn't work on Word (though I talked to those guys about their format), but I worked on two other apps that used the same strategy in the same era. One, on load you had to fix up runtime data that landed in that part of the data segment. Two, the in-memory representation was actually somewhat sparse. This meant that a serializer actually read and wrote less to disk than mapping the file, so documents were smaller, and there was less I/O and faster loads.
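
A sketch of one common form of that load-time fix-up, with made-up names: the stored image keeps offsets, and the runtime pointers are rebuilt after the block is read in.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct Para {
        uint32_t text_off;   /* offset of the paragraph text within the block */
        char    *text;       /* runtime-only pointer, rebuilt on every load */
    };

    int main(void) {
        /* Pretend this 64-byte block was just read straight off disk. */
        char *block = calloc(1, 64);
        if (!block) return 1;
        struct Para *p = (struct Para *)block;
        p->text_off = sizeof *p;
        memcpy(block + p->text_off, "hello", 6);

        p->text = block + p->text_off;   /* the fix-up: offset -> pointer */
        printf("%s\n", p->text);

        free(block);
        return 0;
    }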

The reason I hated it, though, was that it was very hard to version. I know the Word team had that problem, especially when the mandate came down for older versions to be able to read newer versions. It's hard enough to organize the disk format so old versions can ignore new stuff, but now you're putting the same requirements on the in-memory representation. Maybe Word did it better.


There are all kinds of discussions about recovering text from corrupted files that just kind of went away when they moved over to explicit serialization in docx.



