gperciva's comments

gperciva · on Sept 15, 2021

> Has anyone made an interface to tarsnap's tarball dedup code?

Tarsnap's deduplication code is not available under an open-source license.

sillysaurusx · on Sept 15, 2021

True, but a couple of years ago I ported most of the Tarsnap dedup algorithms to Python. It wasn't too hard, just time-consuming. I was hoping someone else had done that in a thorough way, but I guess the intersection of "I love tarsnap's design!" and "I have the time to port it from C!" might not be too large.

akerl_ · on Sept 16, 2021

This may be a foolish question, but if the dedup code isn’t open source, how did you port it to Python?

sillysaurusx · on Sept 16, 2021

Not foolish! The Tarsnap client source code is publicly available, but the license only permits using it with the Tarsnap backup service: https://github.com/Tarsnap/tarsnap/blob/master/COPYING

> Redistribution and use in source and binary forms, without modification, is permitted for the sole purpose of using the "tarsnap" backup service provided by Tarsnap Backup Inc.

The codebase is a jewel. I love the design, the way it's organized, the coding style, the algorithms, everything.

My process was to skim Colin's thesis: http://www.daemonology.net/papers/thesis.pdf

Along with the rsync thesis: https://www.samba.org/~tridge/phd_thesis.pdf

Then I started making a mental map of tarsnap: How does it build its deduplication index? How does it decide where block boundaries start within a file? Etc.
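For readers unfamiliar with those two questions, here is a minimal sketch of the general idea (content-defined chunking plus a hash-keyed index). It is not Tarsnap's algorithm and not the port described here: the polynomial rolling hash, the constants, and the names `chunk_boundaries` and `dedup_store` are all assumptions chosen for illustration.

```python
import hashlib

# Illustrative constants (assumptions, not Tarsnap's values).
WINDOW = 48            # bytes covered by the rolling hash
MASK = (1 << 12) - 1   # cut when the low 12 bits are zero (~4 KiB apart on average)
MIN_CHUNK = 1024       # never cut before this many bytes
MAX_CHUNK = 16384      # always cut at this many bytes
BASE = 257
MOD = (1 << 61) - 1


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets chosen from the content, not fixed positions."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, byte in enumerate(data):
        pos = i - start  # offset of this byte inside the current chunk
        if pos >= WINDOW:
            # Drop the oldest byte so the hash covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = pos + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


def dedup_store(data: bytes, index: dict) -> list:
    """Add unseen chunks to index (digest -> bytes); return the list of digests."""
    recipe = []
    for s, e in chunk_boundaries(data):
        chunk = data[s:e]
        digest = hashlib.sha256(chunk).hexdigest()
        index.setdefault(digest, chunk)  # store each distinct chunk only once
        recipe.append(digest)
    return recipe
```

Because each cut depends only on the last `WINDOW` bytes, inserting a few bytes near the start of a file tends to change only the first chunk or two; later chunks hash to the same digests and are already in the index.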

Eventually I started coding the algorithms in Python, mostly as a way of understanding the code. It's not actually as hard as it sounds, but you have to be rigorous. (It's a C -> Python conversion, after all, so there's not much room for error.)

My process was basically: Copy the C code into a Python file; comment out the code; for each line, write the corresponding Python; try to get something running as quickly as possible.

It worked pretty well, but I eventually lost interest.

Over the years, I've wanted a deduplication library, and 2021 is no exception. Someday I'll just roll up my sleeves and finish porting it.

gperciva · on June 22, 2021

Do you normally hire high school students at your company?

ikiris · on June 22, 2021

They cost less.

gperciva · on June 22, 2021

Wow, that's much more technically advanced than I was as a teenager! Way to go!

To print progress with tarsnap 1.0.39, send it a SIGUSR1 or SIGINFO. On FreeBSD, you can do this by pressing Ctrl-T. On Linux, you have to use the unfortunately-named `kill` or `killall` command, such as:

killall -SIGUSR1 tarsnap

https://www.tarsnap.com/tips.html#check-current

(Note that Tarsnap is not responsible for naming the unix `kill` or `killall` commands.)
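If you would rather send the signal from a script, a rough Python equivalent of the `killall` line might look like the sketch below. The helper names `find_pids` and `request_progress` are made up for this example, and it assumes the `pgrep` utility is installed; only SIGUSR1 is used, since Python on Linux does not define SIGINFO.

```python
import os
import signal
import subprocess


def find_pids(name):
    """Return PIDs of processes named exactly `name` (assumes `pgrep` exists)."""
    result = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def request_progress(pids, sig=signal.SIGUSR1):
    """Send `sig` to each PID and return the PIDs that were signalled."""
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except (ProcessLookupError, PermissionError):
            # The process exited already, or it belongs to another user.
            pass
    return signalled
```

Calling `request_progress(find_pids("tarsnap"))` should then have the same effect as the `killall` command above.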

In the unreleased git version of tarsnap, there's a `--progress-bytes SIZE` option, which prints a progress message after every SIZE bytes are processed.

As a general note: the tarsnap-users mailing list (https://www.tarsnap.com/lists.html) is a great place to ask for tips. As you mentioned in your lessons learned, some of the options (such as `--recover`) could have helped a lot.

(Disclaimer: I'm employed by Tarsnap Backup Inc.)

bigiain · on June 23, 2021

"employed by"

<grin>

gjm11 · on June 23, 2021

Note that that's gperciva not cperciva. I believe the person's name is Graham Percival, and I am fairly sure Colin Percival is still called Colin Percival. They may well be related, but I'm pretty sure that's a bona fide "employed by" :-).

gperciva · on June 23, 2021

Yes. For the curious,

https://github.com/Tarsnap/tarsnap/graphs/contributors

cperciva · on June 23, 2021

gperciva · on Aug 28, 2020

The algorithm itself takes three cost and memory parameters, N, r, and p. If you're calling the `crypto_scrypt()` function in C or C++ as a KDF, you need to specify those yourself.

The command-line binary generally takes "max time" and "max ram" (as percent and/or raw value) and estimates appropriate cost values. As of version 1.3.1, you can manually specify cost values for the binary if you want.
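To make the three parameters concrete, here is a small sketch using Python's standard-library `hashlib.scrypt` (backed by OpenSSL), not the `crypto_scrypt()` C function from the scrypt distribution. The helper name `derive_key` and the default values are assumptions for illustration, not recommendations.

```python
import hashlib


def derive_key(password: bytes, salt: bytes,
               n: int = 2**14, r: int = 8, p: int = 1,
               dklen: int = 32) -> bytes:
    """Derive a key with scrypt using explicit cost parameters.

    n: CPU/memory cost, a power of two greater than 1
    r: block size (memory use grows roughly as 128 * n * r bytes)
    p: parallelization factor
    """
    # Allow enough memory for the chosen parameters, plus some headroom,
    # so larger values are not rejected by OpenSSL's default limit.
    maxmem = 128 * r * (n + p + 2) + 1024 * 1024
    return hashlib.scrypt(password, salt=salt, n=n, r=r, p=p,
                          maxmem=maxmem, dklen=dklen)
```

For example, `derive_key(b"passphrase", os.urandom(16))` (after `import os`) returns 32 bytes; the salt and the N, r, p values have to be stored alongside the result to derive the same key again.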

gperciva · on Aug 28, 2020

BTW, we released scrypt 1.3.1 yesterday: http://mail.tarsnap.com/scrypt/msg00268.html

Main page, including the signed tarball: http://www.tarsnap.com/scrypt.html