It's random forests ... each tree is trained on a subset of the data. You can split a massive dataset into chunks and train each tree independently. That sidesteps the "big data" hangup.
If you look at the scikit-learn implementation, each tree emits a normalised probability vector for each prediction, and those vectors are simply averaged to get the aggregate prediction, so it's not very difficult to do yourself.
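A minimal sketch of the idea, assuming the data arrives as an iterable of (X, y) chunks and that every chunk contains all classes (so the probability columns line up across forests):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_on_chunks(chunks):
    """Fit one independent forest per (X, y) chunk."""
    forests = []
    for X, y in chunks:
        rf = RandomForestClassifier(n_estimators=50)
        rf.fit(X, y)
        forests.append(rf)
    return forests

def aggregate_predict(forests, X):
    """Average the normalised probability vectors across forests
    (soft voting), then take the argmax as the prediction."""
    probs = np.mean([rf.predict_proba(X) for rf in forests], axis=0)
    return np.argmax(probs, axis=1)
```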
Regardless, you are still applying a batch learning technique. For big data you want an incremental learner.
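For contrast, a sketch of the incremental approach using scikit-learn's partial_fit interface (the chunk iterable is assumed, as above); the full dataset never has to fit in memory:

```python
from sklearn.linear_model import SGDClassifier

def train_incrementally(chunks, classes):
    clf = SGDClassifier(loss="log_loss")  # logistic regression fit by SGD
    for X, y in chunks:
        # the full label set must be supplied, since an individual
        # chunk may be missing some classes
        clf.partial_fit(X, y, classes=classes)
    return clf
```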
The training subset for each tree can still be quite large. Note that most of the implementations failed on their 12 GB dataset.
Although I'm a big believer in streaming/online machine learning, it's not necessarily the best solution. There are many cases when batch is the better option, especially for big data. Anything historical, really.
We have been working hard to reduce computation time and memory footprint (though there is still a lot of room for improvement on that side).
(Unfortunately, I cannot run your benchmarks myself, because the compiled version of WiseRF requires a newer version of glibc than the one on my cluster, and crashes.)
Question: Why do I have to implement hyperparameter selection?
For me, the promise of in-the-cloud machine learning is that I can call a 'train' method and specify a single hyperparameter: the training budget (i.e. $). Perhaps also the maximum time before I am returned a trained model.
This is exactly what we're enabling with our ML Platform (currently in private beta). Such a system needs to be built on top of fast & scalable ML technology with smart & efficient tuning/optimization.
Would love to hear about your use cases & get you on the beta.
There are different ways of finding optimal hyperparameters, and while a cloud system might be capable of providing a mechanism that figures this out on its own, this will generally be less efficient...
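One simple way to make "budget is the only hyperparameter" concrete is budget-constrained random search; a sketch below, swapping the dollar budget for wall-clock seconds and using plain cross-validated random search rather than anything smarter:

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def train_with_budget(X, y, budget_seconds, seed=0):
    """Sample random hyperparameter settings until the time budget
    runs out; return the best model refit on all the data."""
    rng = np.random.default_rng(seed)
    best_score, best_model = -np.inf, None
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        params = {
            "n_estimators": int(rng.integers(10, 200)),
            "max_depth": int(rng.integers(2, 20)),
        }
        model = RandomForestClassifier(**params)
        score = cross_val_score(model, X, y, cv=3).mean()
        if score > best_score:
            best_score, best_model = score, model.fit(X, y)
    return best_model
```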