True, but a couple of years ago I ported most of the Tarsnap dedup algorithms to Python. It wasn't too hard, just time-consuming. I was hoping someone else had done that in a thorough way, but I guess the intersection of "I love tarsnap's design!" and "I have the time to port it from C!" might not be too large.
> Redistribution and use in source and binary forms, without modification,
> is permitted for the sole purpose of using the "tarsnap" backup service
> provided by Tarsnap Backup Inc.
The codebase is a jewel. I love the design, the way it's organized, the coding style, the algorithms, everything.
Then I started making a mental map of tarsnap: How does it build its deduplication index? How does it decide where block boundaries start within a file? Etc.
Eventually I started coding the algorithms in Python, mostly as a way of understanding the code. It's not actually as hard as it sounds, but you have to be rigorous. (It's a C -> Python conversion, after all, so there's not much room for error.)
My process was basically: Copy the C code into a Python file; comment out the code; for each line, write the corresponding Python; try to get something running as quickly as possible.
It worked pretty well, but I eventually lost interest.
Over the years, I've wanted a deduplication library, and 2021 is no exception. Someday I'll just roll up my sleeves and finish porting it.
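For anyone wondering what "deciding where block boundaries start" and "building a deduplication index" look like in general, here's a minimal sketch of generic content-defined chunking. To be clear, this is not Tarsnap's algorithm and not taken from its code; the rolling hash, the constants, and the names `chunk_boundaries` and `store_chunks` are all assumptions made up for illustration.

```python
import hashlib

# Illustrative constants only; real tools use much larger chunks.
WINDOW = 16           # bytes covered by the rolling hash
MASK = (1 << 6) - 1   # a boundary fires roughly once per 64 candidate bytes
MIN_CHUNK = 32        # must be >= WINDOW so the hash always covers a full window
MAX_CHUNK = 1024      # hard upper bound on chunk length
BASE = 257
MOD = (1 << 31) - 1


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets; cut points depend on content, not position."""
    out_power = pow(BASE, WINDOW - 1, MOD)
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * out_power) % MOD
        h = (h * BASE + byte) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


def store_chunks(data: bytes, store: dict) -> list:
    """Add unseen chunks to `store` (digest -> bytes); return the list of digests."""
    recipe = []
    for s, e in chunk_boundaries(data):
        chunk = data[s:e]
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)  # identical chunks are stored only once
        recipe.append(digest)
    return recipe
```

The point of cutting on content is that inserting a few bytes near the start of a file only disturbs the first chunk or so; later boundaries tend to line up again, so most chunks hash to digests already in the index.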
Wow, that's much more technically advanced than I was as a teenager! Way to go!
To print progress with tarsnap 1.0.39, send it a SIGUSR1 or SIGINFO signal. On FreeBSD, you can do this by pressing ctrl-t. On Linux, you have to use the unfortunately-named `kill` or `killall` command to send the signal.
(Note that Tarsnap is not responsible for naming the unix `kill` or `killall` commands.)
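If you'd rather send the signal from a script, here's a minimal Python sketch of the same mechanism on a POSIX system. The function names are hypothetical, and you'd need to find the PID of the running tarsnap process yourself; the demo only signals the current process to show that SIGUSR1 gets delivered.

```python
import os
import signal
import time


def send_progress_signal(pid: int) -> None:
    """Send SIGUSR1 to the process with the given PID (hypothetical helper)."""
    os.kill(pid, signal.SIGUSR1)


def demo_on_self() -> bool:
    """Install a temporary SIGUSR1 handler, signal this process, report delivery."""
    received = []
    previous = signal.signal(
        signal.SIGUSR1, lambda signum, frame: received.append(signum)
    )
    try:
        send_progress_signal(os.getpid())
        # Handlers run between bytecodes, so wait briefly for delivery.
        deadline = time.monotonic() + 1.0
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        signal.signal(signal.SIGUSR1, previous)  # restore the old handler
    return received == [signal.SIGUSR1]
```

Despite the name, `os.kill` (like the shell command) just delivers whatever signal you ask for; the target process decides what to do with it.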
In the unreleased git version of tarsnap, there's a `--progress-bytes SIZE` option, which prints a progress message after every SIZE bytes are processed.
As a general note: the tarsnap-users mailing list is a great place to ask for tips. As you mentioned in your lessons learned, some of the options could have helped a lot (such as `--recover`).
https://www.tarsnap.com/lists.html
Note that that's gperciva not cperciva. I believe the person's name is Graham Percival, and I am fairly sure Colin Percival is still called Colin Percival. They may well be related, but I'm pretty sure that's a bona fide "employed by" :-).
The algorithm itself takes three cost and memory parameters (N, r, and p), and if you're calling the `crypto_scrypt()` function in C or C++ as a KDF, you need to specify those.
The command-line binary generally takes "max time" and "max ram" (as percent and/or raw value) and estimates appropriate cost values. As of version 1.3.1, you can manually specify cost values for the binary if you want.
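To make the three parameters concrete, here's a rough sketch using Python's standard `hashlib.scrypt` rather than the C `crypto_scrypt()` function. The helper names and the default values (N = 2^14, r = 8, p = 1) are assumptions for illustration, not recommendations from the scrypt authors; the memory figure is the usual approximation of 128 * N * r bytes for the main working array.

```python
import hashlib


def scrypt_memory_bytes(n: int, r: int) -> int:
    """Approximate size of scrypt's main working array: 128 * N * r bytes."""
    return 128 * n * r


def derive_key(password: bytes, salt: bytes,
               n: int = 2**14, r: int = 8, p: int = 1,
               dklen: int = 32) -> bytes:
    """Derive a key with explicit cost parameters (hypothetical wrapper)."""
    # Give OpenSSL headroom above the ~128 * r * (N + p + 2) bytes it checks for.
    maxmem = 2 * 128 * r * (n + p + 2)
    return hashlib.scrypt(password, salt=salt, n=n, r=r, p=p,
                          maxmem=maxmem, dklen=dklen)
```

With these defaults, `scrypt_memory_bytes(2**14, 8)` works out to 16 MiB. Doubling N doubles both the memory and the CPU time, while raising p mostly adds CPU time.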
Tarsnap's deduplication code is not available under an open-source license.