As a benchmark, I tested computing exp(1) using the binary splitting algorithm (...

zhemao · on Jan 1, 2014

A bunch of people have been asking for benchmarks of LibTommath for comparison. I've ported this benchmark to tommath at https://gist.github.com/zhemao/8204404.

Here are the runs on my machine

BSDNT

    digits = 100: 2.43242e-05 s
    digits = 1000: 0.000143717  s
    digits = 10000: 0.0016644 s
    digits = 100000: 0.0378964 s
    digits = 1000000: 1.20317 s
    digits = 10000000: 40.1549 s

GMP

    digits = 100: 1.64086e-05 s
    digits = 1000: 0.000117272 s
    digits = 10000: 0.00113377 s
    digits = 100000: 0.0175584 s
    digits = 1000000: 0.284023 s
    digits = 10000000: 4.96335 s

TOMMATH

    digits = 100: 4.90092e-05 s
    digits = 1000: 0.000302586 s
    digits = 10000: 0.00548351 s
    digits = 100000: 0.368923 s
    digits = 1000000: 36.7576 s
    digits = 10000000: longer than I was willing to wait

I have no idea why libtommath gets so slow for larger numbers. Would anyone care to comment? I think something must be wrong in the way I ported the benchmark.

luckydude · on Jan 1, 2014

Any chance we can see results for 10 digits as well? The reason I ask is I'm guessing that a lot of work will actually be with smaller numbers of digits.

Is it possible to design a math lib that uses native ints/floats for stuff that fits and switches in the big num stuff on overflow? I believe ruby does this, do any of the libs do it?

I think the ruby approach is to cast the 32 bit items to 64 bit, do the operation into a 64 bit, see if any of the top 32 bits are set, if so, switch to bignum, if not, use the result. If tommath or any of the others do that, then I don't really care that much (for my purpose which is scripting langs that use tommath) about the big num perf. Faster is better but the high order bit for me, at least, is perf on the smaller numbers.

fdej · on Jan 1, 2014

Yes, for example the fmpz type in flint (http://flintlib.org/) implements this optimization, and also caches bignum objects for quick allocation/deallocation. Here is the benchmark with the mpir mpz type vs the flint fmpz type:

                       mpz           fmpz
    digits = 10:       4.24e-06 s    1.2e-06 s
    digits = 100:      1.28e-05 s    3.3e-06 s
    digits = 1000:     8e-05 s       2.7e-05 s
    digits = 10000:    0.001 s       0.0006 s
    digits = 100000:   0.017 s       0.013 s
    digits = 1000000:  0.25 s        0.23 s
    digits = 10000000: 4.32 s        4.08 s

In fact, the benchmark is a bit biased against the mpz type because it spends a lot of time allocating and deallocating temporary mpz's. It would be much faster to replace the recursion by a loop with in-place operations when b - a < 20, say. (On the other hand, unoptimized code of this kind is probably quite common in practice...)

fdej · on Jan 1, 2014

A possible explanation is that libtommath has a slow division algorithm. You could check what happens if you disable the final division.

zhemao · on Jan 1, 2014

I tried that. It's an improvement, but libtommath is still much slower than the other two.

    digits = 100: 4.31597e-05 s
    digits = 1000: 0.000249898 s
    digits = 10000: 0.00252304 s
    digits = 100000: 0.0618554 s
    digits = 1000000: 5.24447 s
    digits = 10000000: 703.01 s

wbhart · on Jan 2, 2014

Certainly he has only basecase division, which explains the original benchmark timings.

I'm not very familiar with libtommath, though I had a passing experience with it back in about 2004-2005. However, looking through the code, he does seem to have toom3 (I have both toom3 and toom32, which may or may not be relevant here -- the latter is for unbalanced multiplication).

But more obvious is that libtommath seems to check for errors after every operation, even internally. I think this is some kind of exception mechanism.

I also recall there being a libtomcrypt at some point. Maybe it still exists. That suggests that Tom is possibly focusing on the much more difficult area of crypto, where your code needs to much, much more defensive.

Also, libtommath claims to be 100% C, which bsdnt is not. We use some assembly language, which gives us 5-30% speedup (we could get about another factor of 2 if we unrolled the loops like the C compiler we are comparing against).

Those are a few of the things I can see.

But performance isn't everything. It has never been my intention to be compared with GMP performance-wise for example. You simply cannot beat GMP without being as technical as GMP. The focus here is simplicity and reliability (not to the extremes required for crypto though), and it always will be.

Roughly, my goals with this project were to eventually be much faster than say Python bignums, maybe within a factor of 2 or so of GMP on generic problems, but with code that could be maintained by language designers themselves, without being bignum experts.

zhemao · on Jan 2, 2014

Thanks for the clear explanation. And great job on writing a high-performing Bignum library. From these benchmarks, it looks like the go-to for a high-performing, permissively licensed Bignum library.

NAFV_P · on Dec 31, 2013

I noticed your gists used time.h. Coincidentally I have just been fooling around with sys/time.h because I find the former to be unreliable. From what I have read on the web many other hackers share the same sentiment. I'm surprised to see you using it in code like this, or have I missed something?

fdej · on Jan 1, 2014

I just copied the profile code from bsdnt. I guess the reason it uses it is that it simpler and/or more portable. In this case, it does not matter much. I use the functions from sys/time.h when I do more serious profiling.

NAFV_P · on Jan 1, 2014

> I guess the reason it uses it is that it simpler and/or more portable.

After more fooling around I see what you mean. The Open Group has one version of it, my Linux man pages serve me another. I skimmed over the GNU C website, they say that the struct timezone type is obsolete and should be avoided. Happy days.