Not losing, but commoditized - in those fields, there are now perfectly good open-source alternatives.
I was at Google from 2009-2014. When I joined, Google was literally the only place you could work if you wanted to do data science on web-scale data sets; nobody else had the infrastructure or the data. Now if you want to do search, Elasticsearch has basically the same algorithms & data structures as Google's search server, with Dremel + some extra features thrown in. (The default ranking algorithm continues to suck, though.) If you want to do deep learning, you reach for Keras, and it'll use TensorFlow behind the scenes but with a much more fluent API. Hadoop was a major PITA to use when I joined Google; now in many ways it's easier & more robust than MapReduce, and the ecosystem has many more tools. Spark compares well with Flume. ZooKeeper stands in for Chubby. There are a number of NoSQL databases that operate on the same principles as BigTable, though I'd pick BigTable over them for robustness. Take your pick of HTML5 parsers (I even wrote and open-sourced one while I was at Google). Google was struggling mightily with headless WebKit for JS execution when I left; now you can stand up a Splash proxy in minutes or use one of the many SaaS versions. Protobufs, Bazel, gRPC, and LevelDB have all been open-sourced, along with many other projects.
The big advantage of big companies like Google is that they have lots (and I mean lots) of data, and for them that data is comparatively cheap to store and manipulate.
I mean, I first wrote a text-categorization algorithm using k-NN about 12-13 years ago, and to get acceptable results I only needed to manually categorize about 200 articles per category for the training set. That was very doable, both in the time spent constructing the training set and in storage costs.

Now, I've been thinking for some time about writing an ML algorithm that would automatically identify forests in present-day satellite images or in some 1950s Soviet maps (which are very detailed). I'm pretty sure there's already some open-source code that does that, but I think the training-set requirements would "kill" the project for me. I read a couple of days ago (the article was shared here on HN) about some people at Stanford implementing an ML algorithm for identifying ships in satellite images, and I remember they used 1 million high-res images as a training set. For me as a hobbyist, or even for a small-ish company, there's no cheap way to store that training set - never mind the cost of labeling 1 million images.

Otherwise I totally agree with you: we live in a golden age of AI/ML code being made available to the general public, but unfortunately it's the data that makes all the difference.
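To make the "200 articles per category" point concrete, here's a minimal sketch of that kind of k-NN text categorizer: bag-of-words vectors, cosine similarity, majority vote over the k nearest labeled examples. The tiny training set and category names here are made up for illustration - a real setup would use hundreds of labeled articles per category, as described above.

```python
# Minimal k-NN text categorizer: bag-of-words + cosine similarity.
# Training data below is illustrative, not real.
import math
from collections import Counter

def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k most similar examples."""
    qv = vectorize(query)
    ranked = sorted(training,
                    key=lambda ex: cosine(qv, vectorize(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    ("stock market shares rally",   "finance"),
    ("bank interest rates rise",    "finance"),
    ("central bank cuts rates",     "finance"),
    ("team wins championship game", "sports"),
    ("player scores winning goal",  "sports"),
    ("coach praises team defense",  "sports"),
]

print(knn_classify("interest rates and the stock market", training))  # finance
```

With only a few hundred labeled examples per category this runs comfortably on a laptop, which is exactly why the storage and labeling costs were a non-issue back then - the 1-million-image satellite case is a different regime entirely.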