NLTK has always seemed like a bit of a toy when compared to Stanford CoreNLP. I'd be very curious to see performance/accuracy charts on a number of corpora in comparison to CoreNLP.
The Cython implementation makes it somewhat believable that it's faster than CoreNLP, but I'd also like to hear a deep-dive on why it's several times faster, beyond that control over memory layout is the best way to win performance (stipulated). In particular, it would be good to know whether CoreNLP is doing more processing than spaCy or otherwise handling more concerns.
Finally, I'd really love to see a feature table comparing spaCy with CoreNLP.
> The Cython implementation makes it somewhat believable that it's faster than CoreNLP, but I'd also like to hear a deep-dive on why it's several times faster [...]
Time complexity. The Stanford parser is a phrase-structure parser that creates dependencies as a post-processing step. So, assuming they use some variation of CKY, the time complexity is O(N^3 |G|), where |G| is the size of the grammar. spaCy uses Nivre-style greedy transition-based parsing, which is O(N).
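To make the O(N) claim concrete, here's a minimal sketch of a greedy arc-standard transition loop (not spaCy's actual code; the `score` callback stands in for a trained model). Each word is shifted exactly once and removed by exactly one arc action, so the parser makes 2N-1 transitions regardless of how the scores come out:

```python
def parse(words, score):
    """Greedy arc-standard parsing sketch. Returns a head index per word
    (None for the root). `score(stack, buffer, action)` is a stand-in
    for the trained model's scoring function."""
    heads = [None] * len(words)
    stack = []
    buffer = list(range(len(words)))
    while buffer or len(stack) > 1:
        actions = []
        if buffer:
            actions.append("shift")
        if len(stack) >= 2:
            actions.append("left-arc")
            actions.append("right-arc")
        action = max(actions, key=lambda a: score(stack, buffer, a))
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "left-arc":
            dep = stack.pop(-2)        # second-from-top takes the top as head
            heads[dep] = stack[-1]
        else:
            dep = stack.pop()          # top takes the new top as head
            heads[dep] = stack[-1]
    return heads
```

Every iteration either consumes a buffer item or shrinks the stack, which is why the whole parse is linear in sentence length, versus cubic for chart parsing.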
So, a slightly fairer comparison would be against e.g. the Malt parser. spaCy will probably still come out ahead on accuracy, since last I checked the Malt parser doesn't use dynamic oracles yet and doesn't integrate Brown clusters or word embeddings by default (though you could add those yourself). I do wonder a bit about feature-set construction, though, because in my experience perceptrons are far more sensitive to adding 'wrong' features than e.g. SVM classifiers. This becomes especially interesting when you train a model for another language or dependency annotation scheme, since which features are relevant differs per set-up.
That's not true --- I'm comparing against their neural-network shift-reduce dependency parser, which is very fast. Actually I don't know of a faster parser than theirs, other than spaCy.
First of all, the target output of the two systems is exactly the same --- labelled dependency parses with the same schemes, parts-of-speech, tokens, and lemmas. At least, that's how I run the CoreNLP in this benchmark. It has some other processing modules, but I turn them off for the speed comparison.
Second, very similar algorithms are being run. The new CoreNLP model uses greedy shift-reduce dependency parsing, the same as spaCy. That CoreNLP model was published late last year; before that, CoreNLP only implemented the older polynomial-time parsing algorithms, which are much slower and often less accurate.
The contribution of Chen and Manning's paper is to use a neural network model, where I'm using a linear model. (More specifically: they show some interesting tricks to make the neural network actually perform well. I suspect many people have tried to do this and failed.)
Chen and Manning say that their model is much faster than a linear model, because the linear model must explicitly compute lots of conjunction features --- I use about 100 feature templates.
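To illustrate what those conjunction features look like in a linear model, here's a hedged sketch (the template names and state layout are made up for illustration, not spaCy's real feature set). Each template conjoins a few atomic context values into one string, and every parse decision costs a weight-table lookup per template per action:

```python
def extract_features(state):
    """state: dict of atomic context values for the current parse state.
    A real parser would have ~100 templates; four shown here."""
    templates = [
        ("s0_word",),                      # word on top of the stack
        ("n0_word",),                      # first word of the buffer
        ("s0_tag", "n0_tag"),              # conjunction of two POS tags
        ("s0_word", "s0_tag", "n0_word"),  # a three-way conjunction
    ]
    features = []
    for template in templates:
        values = [state[name] for name in template]
        features.append("+".join(template) + "=" + "|".join(values))
    return features

def score(weights, features):
    """Linear model: sum the weights of the active features."""
    return sum(weights.get(f, 0.0) for f in features)
```

With ~100 templates, every candidate transition needs ~100 string constructions and hash lookups, which is the cost Chen and Manning argue their dense model avoids.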
So, they probably have something of an algorithmic advantage over my parser, although the extent of it is unclear. I'll only know when I implement their model. It's not terribly hard to do --- it's just a neural network --- but it's lower on my queue than a number of other things I want to work on. My hunch is that I won't see nearly as much benefit from it as their results suggest, because their baseline is quite weak.
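For contrast, here's a rough sketch of the neural scorer's shape (all dimensions are made-up placeholders, and I've used ReLU where the Chen and Manning paper actually uses a cube activation). Instead of enumerating explicit conjunctions, the model embeds a fixed set of context tokens and lets a hidden layer learn the feature combinations, so one matrix multiply replaces the ~100 template lookups:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED, CONTEXT, HIDDEN, ACTIONS = 1000, 50, 18, 200, 3

E = rng.normal(size=(VOCAB, EMBED))             # word/tag/label embeddings
W1 = rng.normal(size=(CONTEXT * EMBED, HIDDEN)) # hidden layer weights
W2 = rng.normal(size=(HIDDEN, ACTIONS))         # output layer weights

def score_actions(context_ids):
    """context_ids: ids of the tokens/tags/labels drawn from the parse state."""
    x = E[context_ids].reshape(-1)   # concatenate the context embeddings
    h = np.maximum(x @ W1, 0)        # ReLU stand-in for the paper's cube activation
    return h @ W2                    # one score per candidate transition
```

The hidden units play the role of learned conjunctions, which is where the claimed algorithmic advantage over explicit feature templates would come from.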
So, I do think all we're seeing here is the same algorithm implemented in Java and C, so the C version is coming out 7x quicker. This makes sense to me. But, possibly the CoreNLP parser has to do some contortions to integrate into their framework. I don't know.
There's also a meta-level point. Maybe I just tried harder. The Stanford paper would still have been accepted, and still have been great, if it ran at 50% of the speed that it does. And we'll probably never know what would happen if the author spent a month doing nothing but trying to optimise the code --- I can't imagine he/she ever will. That wouldn't get a publication.
As for your other question, about what spaCy offers compared with CoreNLP: these are the main things spaCy is missing at the moment:
* Named entity recognition
* Phrase-structure parsing
* Coreference resolution
I have some preliminary work on NER, and plan to roll that out next, along with some word-sense disambiguation. Phrase-structure parsing is no problem to add either.
Thanks for the suggestion to include an evaluation of OpenNLP -- I'll do that.
> spaCy’s parser offers a better speed/accuracy trade-off than any published system: its accuracy is within 1% of the current state-of-the-art, and it’s seven times faster than the 2014 CoreNLP neural network parser, which is the previous fastest parser that I’m aware of.
Compelling work!