Assuming your event data is immutable (i.e. no UPDATEs, just INSERTs), you'd probably have fewer headaches long-term if you just dumped the database to flat files, stored them in HDFS and queried them using Hive (which has MySQL-ish query syntax anyway). This architecture will take you to billions of rows quite happily.

This is the architecture we use for eventstream analysis at SnowPlow (https://github.com/snowplow/snowplow).
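To make that concrete, here's a rough sketch of the Hive side (table name, columns and paths are all made up, and it assumes you're talking to Hive over its JDBC driver against a HiveServer2 endpoint) - an external table pointed at the flat-file dumps, queried like any other SQL database:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveEventQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; adjust host/port for your cluster.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-server:10000/default", "", "");
            Statement stmt = conn.createStatement();

            // External table over the flat-file dumps already sitting in HDFS.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS events (" +
                "  event_time STRING, user_id STRING, event_type STRING)" +
                " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'" +
                " LOCATION '/data/events/'");

            // MySQL-ish ad hoc query; Hive fans it out as a MapReduce job.
            ResultSet rs = stmt.executeQuery(
                "SELECT event_type, COUNT(*) FROM events GROUP BY event_type");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }

The same query keeps working as the data grows from millions to billions of rows - Hive just throws more map and reduce tasks at it.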



At my company, we experimented with Hive and found it to be too slow. Once you decide that every row of data will be materialized into a Java hash table instance, there are real performance limitations that follow from that representation. We decided to use HDFS and Hadoop but built our own query language called Trecul; it uses LLVM to JIT-compile dataflow code into native code. Picking off a single field to filter on is a single machine instruction for us.

The code is open source (albeit with rough edges) at https://github.com/akamai-tech/trecul/
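To give a flavour of where the overhead comes from (this is just a toy illustration, nothing to do with Trecul's actual internals), compare materializing every row into a hash map with filtering straight off a fixed field offset in the raw record:

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    public class RowFilterCost {
        // Per-row materialization: every field boxed into a heap-allocated map.
        static boolean filterViaMap(String line) {
            String[] parts = line.split("\t");
            Map<String, String> row = new HashMap<String, String>();
            row.put("event_time", parts[0]);
            row.put("user_id", parts[1]);
            row.put("event_type", parts[2]);
            return "page_view".equals(row.get("event_type"));
        }

        // Fixed-layout records: compare bytes at a known offset, no allocation.
        static boolean filterViaOffset(byte[] record, int offset, byte[] wanted) {
            for (int i = 0; i < wanted.length; i++) {
                if (record[offset + i] != wanted[i]) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            String prefix = "2012-09-01 00:00:01" + "user0000001";
            byte[] record = (prefix + "page_view").getBytes(StandardCharsets.US_ASCII);
            byte[] wanted = "page_view".getBytes(StandardCharsets.US_ASCII);
            System.out.println(filterViaMap("2012-09-01 00:00:01\tuser0000001\tpage_view"));
            System.out.println(filterViaOffset(record, prefix.length(), wanted));
        }
    }

The first version allocates a map and several strings for every row it looks at; the second touches a handful of bytes, which is the kind of work a JIT-compiled filter can boil down to a few instructions.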


Thanks for sharing, Michael - Trecul looks to be the first OSS release from Akamai (at least on GitHub)?

While we're on the subject of high-performance alternatives to Hive for event analysis, we've been watching the development of Ben Johnson's behavioural db, called Sky, with interest:

https://github.com/skylandlabs/sky


Thanks for the shout out, Alex.

Michael - Trecul looks cool. I'm doing something similar with Sky. I'm building an LLVM-based query language on it called EQL (Event Query Language) that's optimized for evented data. It does a lot of function rewriting to optimize loops and avoid heap allocations, so it can crank through tens of millions of events per second. It's not finished yet but it should be done in the next couple of weeks.


On the other end of the spectrum, for "small data" (i.e. millions of rows as opposed to tens of billions)... check out crush-tools. I love it, use it all the time.

http://code.google.com/p/crush-tools/


Thanks for the tip - looks cool, like a mini-map-reduce for Bash!


How big are the flat files that you're storing in HDFS? I've looked at it before for such a use, but for durability I want to write events in an isolated manner, which means lots and lots of small writes, either to single files or as a series of small files. I was under the impression that HDFS doesn't perform well in a use case like this (due to its large block size), but would LOVE if I could use it like that!


We're using HDFS with a periodic merger process that occasionally merges small files into larger files. Given the block size, HDFS really does want larger files, but it can tolerate a decent number of small files. The bigger problem with this approach is providing a consistent view of the dataset so that already running programs don't have the world totally change out from under them.
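For what it's worth, the merge step itself can be pretty small - something like this (paths are made up; Hadoop's FileUtil.copyMerge concatenates everything in a source directory into one target file):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class SmallFileMerger {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path incoming = new Path("/data/events/incoming/2012-09-01");
            Path staging  = new Path("/data/events/staging/2012-09-01.tsv");
            Path merged   = new Path("/data/events/merged/2012-09-01.tsv");

            // Concatenate the small files into one big file; deleteSource=true
            // removes the originals once the copy has completed.
            FileUtil.copyMerge(fs, incoming, fs, staging, true, conf, null);

            // Rename into the final location so readers never pick up a
            // half-written file.
            fs.rename(staging, merged);
        }
    }

The staging-then-rename step helps a bit with the consistency issue, but it doesn't solve the problem of long-running jobs seeing the directory change underneath them.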


You're right, there can be something of a "small files" issue with HDFS. This is a good article for strategies to get round it: http://www.cloudera.com/blog/2009/02/the-small-files-problem...


It's also 3-4 orders of magnitude more expensive to support: MySQL is an out-of-the-box install of a single process which many people are familiar with, which has well-known data warehousing techniques, and which has many GUIs and other tools for casual analysis.

Hadoop + Hive is a beast which will require multiple high-memory systems just to run without daemons crashing and mysteriously deadlocking the entire cluster (i.e. you'll be searching for non-obvious messages in log files, googling and reading the source until you learn that some Java developers are still struggling with 30-year-old memory management and error handling techniques). You then need to write custom data loaders, architect completely around the write-once file I/O model, and learn Hive before you can get a single result.

If you really do need the things which Hadoop does better, it's worth that investment - but the decision is akin to knowing that you need to haul a billion pounds of coal and thus that building a railroad makes sense. If you don't know that - and if you don't already know your access patterns well enough to carve them into stone - the overhead cost dwarfs the benefit.


It sounds like a problem better solved by triggers, or a few materialized views.
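For example (made-up schema), a trigger that keeps a daily rollup table current on every insert gets you most of what a materialized view would in MySQL:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class EventRollupTrigger {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/analytics", "user", "pass");
            Statement stmt = conn.createStatement();

            // Summary table that stands in for a materialized view.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS events_daily (" +
                " day DATE, event_type VARCHAR(64), cnt BIGINT," +
                " PRIMARY KEY (day, event_type))");

            // Trigger keeps the rollup current on every insert into events.
            stmt.execute(
                "CREATE TRIGGER events_rollup AFTER INSERT ON events" +
                " FOR EACH ROW" +
                " INSERT INTO events_daily (day, event_type, cnt)" +
                " VALUES (DATE(NEW.event_time), NEW.event_type, 1)" +
                " ON DUPLICATE KEY UPDATE cnt = cnt + 1");

            conn.close();
        }
    }

The rollup table stays tiny relative to the event table, so reporting queries hit an index instead of scanning everything.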


Are there any good resources which cover HDFS+Hive? I'd really love to see performance measurements, but the entire premise of "let's analyze ALL the data on every single query, we'll just use a zillion workers!" has always come across to me as incredibly inefficient and computationally expensive.

I'd wager they query their data frequently and in a predictable manner, at which point using some sort of data structure actually designed for fast searching (such as the simple B-tree indexes in most RDBMSs) makes perfect sense here.

If the data is purely for archive purposes and is rarely queried or is queried in very random patterns that can't really make adequate usage of indices, then I'd agree with your suggestion.


Sadly there's not a great deal of documentation about HDFS/Hive - we're learning a lot as we go with SnowPlow.

But I do agree with you - if the same queries keep coming up frequently, then it's worth putting some tech in place to save on ad hoc querying costs. But an RDBMS isn't a great fit for this - because the data is a) non-relational and b) immutable. Happily there's a whole class of databases designed for this type of work - analytics databases such as Greenplum and Infobright (Infobright is a modded MySQL too).

In any case, if you have your event data stored in flat files, then as well as a "raw load" into your analytics database, you can also schedule regular map-reduce jobs to populate specific cubes into your analytics db. This is something we're working on for SnowPlow now as well.
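To give a sense of what those cube jobs look like, here's a minimal sketch (field positions and paths are invented) that rolls events up into a (day, event type) count cube, ready for a bulk load into the analytics db:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DailyEventCube {

        public static class CubeMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Tab-separated event lines: timestamp, user_id, event_type, ...
                String[] fields = value.toString().split("\t");
                if (fields.length < 3 || fields[0].length() < 10) return;
                String day = fields[0].substring(0, 10);   // yyyy-MM-dd
                context.write(new Text(day + "\t" + fields[2]), ONE);
            }
        }

        public static class CubeReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                    Context context) throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) sum += v.get();
                context.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "daily-event-cube");
            job.setJarByClass(DailyEventCube.class);
            job.setMapperClass(CubeMapper.class);
            job.setCombinerClass(CubeReducer.class);
            job.setReducerClass(CubeReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/events/merged"));
            FileOutputFormat.setOutputPath(job, new Path("/data/cubes/daily"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because the flat files are immutable, a job like this can be re-run over any date range without worrying about partial updates.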



