Would you care to elaborate on specialized database engines and rapid data annotation tools? There are plenty of data annotation tools out there that seem to do the job, so I'm not sure what "rapid" would mean in this instance.
I mean things like Vertica or kdb+ that have specialized performance properties for some use cases, and to a lesser extent managed cluster pipeline tooling like Spark. I don’t mind paying for managed versions of these (not Databricks though). For annotation tools I mean things like Prodigy.
Thanks. Why the interest in managed cluster pipeline tooling but not Databricks? They seem to be the biggest name in the Spark game. Would some tooling around snapshotting, sequential A/B testing, staged rollouts, or ghost models be of use?
Databricks is a notebook frontend to cluster computing. Cluster computing is useful and I’m willing to pay for it as long as I have total & complete control of the environments and tooling used in the cluster workflow.
A notebook frontend, however, is less than useless: it is actively harmful, propagating poor notebook environments even further into aspects of computing where they cause harm and hurt reproducibility and code factoring.
Given this, even if Databricks offered perfectly complete features for all aspects of cluster computing, it would still be inferior to just my own managed EC2 or EMR clusters or equivalent with other providers, where there is no “notebook as control plane” garbage.
But when you add to that the fact that Databricks lacks the features I'd need to totally own every customized detail of my cluster environment (e.g. how can I run plain Python multiprocessing tasks with zero Spark in Databricks? How can I bring my own custom-defined GPU container with custom-compiled TensorFlow?), the deal gets even worse.
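To make the "plain Python multiprocessing with zero Spark" point concrete, here's a minimal sketch of the kind of workload meant: CPU-bound work fanned out with the stdlib `multiprocessing` pool, no Spark runtime and no notebook control plane, just a script you run on a machine you fully control. The `transform` function is a hypothetical stand-in for any per-record task.

```python
# Plain-Python parallelism with the standard library -- no Spark,
# no notebook runtime, just a script under your own control.
from multiprocessing import Pool


def transform(x: int) -> int:
    # Stand-in for any CPU-bound per-record task.
    return x * x


if __name__ == "__main__":
    # Fan the work out across 4 worker processes.
    with Pool(processes=4) as pool:
        results = pool.map(transform, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The `if __name__ == "__main__":` guard matters here: on platforms that spawn rather than fork, worker processes re-import the script, and the guard prevents them from recursively creating pools.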
Databricks is just another Spark / Hadoop style snake oil seller banking on capturing a bunch of data science teams before people realize that it’s a conceptually junk way to work.
As for the other tooling you mention, I’d almost always say to build it in house. For example, I don’t know of a single A/B testing provider that actually uses frequentist sequential testing to correctly avoid early-stopping bias. You need real statisticians hired in-house to solve these problems, and the engineering work to set up an extensible internal A/B-testing service is not bad. (I’ve built Bayesian A/B test frameworks with teams of 2-3 engineers at 3 different companies.) It’s just not cost-effective to outsource it on the false hope of not needing to hire your own in-house statistics experts. Just pony up the dough and hire them.
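For a sense of how small the statistical core of such an in-house Bayesian A/B framework can be, here's a minimal sketch of the Beta-Binomial comparison that conversion-rate tests are typically built around. The function name and numbers are illustrative, not from any of the frameworks mentioned above, and a real system would add decision thresholds, loss estimates, and logging around this.

```python
# Beta-Binomial core of a Bayesian A/B test: estimate P(rate_B > rate_A)
# by Monte Carlo sampling from the two posteriors.
import random


def prob_b_beats_a(success_a: int, trials_a: int,
                   success_b: int, trials_b: int,
                   samples: int = 100_000, seed: int = 0) -> float:
    """Estimate P(rate_B > rate_A) under independent
    Beta(1 + successes, 1 + failures) posteriors (uniform priors)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        a = rng.betavariate(1 + success_a, 1 + trials_a - success_a)
        b = rng.betavariate(1 + success_b, 1 + trials_b - success_b)
        if b > a:
            wins += 1
    return wins / samples


# e.g. 120/1000 conversions on variant A vs 150/1000 on variant B
p = prob_b_beats_a(120, 1000, 150, 1000)
```

Because the posterior probability is a coherent quantity at any sample size, peeking at it mid-experiment doesn't carry the same early-stopping penalty that naive repeated frequentist significance tests do, which is part of why the Bayesian framing is attractive for an internal service.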