
How do people solve the resource allocation problem with distributed job queues?

By resource allocation problem, I mean that jobs may be small (so that lots of them can run in parallel) or large (occupying a significant fraction of a machine's CPU, memory, bandwidth, whatever), and the two may be mixed together. Trying to do too much at once can effectively crash a machine via the OOM killer or paging.

Does everybody just roll their own resource-based scheduler?
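
To make that concrete: the thing I keep ending up hand-rolling is a gate where each job declares an estimated cost, and a worker only starts it once that much budget is free. A minimal sketch in Python (the cost units and budget number are made up):

    # Sketch of resource-based admission control: jobs declare an estimated
    # cost (say, MB of memory) and only start once that much budget is free.
    import threading

    class ResourceGate:
        def __init__(self, budget):
            self.budget = budget      # total capacity, e.g. MB of RAM
            self.in_use = 0
            self.cond = threading.Condition()

        def acquire(self, cost):
            with self.cond:
                while self.in_use + cost > self.budget:
                    self.cond.wait()  # block until enough budget frees up
                self.in_use += cost

        def release(self, cost):
            with self.cond:
                self.in_use -= cost
                self.cond.notify_all()  # a smaller waiting job may now fit

    def run_job(gate, cost, fn):
        gate.acquire(cost)
        try:
            fn()
        finally:
            gate.release(cost)

With that, one near-budget job runs alone while lots of small ones pack in parallel, which is the mixed-size behavior I'm after. It just feels like something every shop shouldn't have to reinvent.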



It may be worth reading about Mesos: https://www.cs.berkeley.edu/~alig/papers/mesos.pdf


In the case of RabbitMQ, it starts paging messages to disk, IIRC. If it runs into descriptor / memory / disk limits, it starts turning away new messages until it drops back below a safe threshold.

There's a certain amount it can do, but ultimately, you have to apply backpressure at some point unless you're willing to start losing messages at the message broker layer.

After that, there's still a lot you could do: spool them to disk on the publisher, retry later, etc. You still risk message loss, filling up THOSE disks, etc... but again, something has to eventually give.
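
FWIW, the publisher-side spool can be fairly small. A sketch assuming pika with confirm_delivery() enabled so a refused publish raises, and with a made-up spool path and queue name:

    # Sketch of publisher-side spooling: if the broker refuses a publish
    # (e.g. its memory/disk alarms are firing), keep the message on local
    # disk and retry later. Assumes channel.confirm_delivery() was called.
    import os
    import uuid

    import pika

    SPOOL_DIR = "/var/spool/myapp"  # hypothetical local spool

    def publish_or_spool(channel, body):
        try:
            channel.basic_publish(exchange="", routing_key="jobs", body=body)
        except pika.exceptions.AMQPError:
            # Broker is pushing back (or down): keep the message locally.
            with open(os.path.join(SPOOL_DIR, uuid.uuid4().hex), "wb") as f:
                f.write(body)

    def retry_spooled(channel):
        # Run periodically; delete a file only after a clean re-publish.
        for name in os.listdir(SPOOL_DIR):
            path = os.path.join(SPOOL_DIR, name)
            with open(path, "rb") as f:
                channel.basic_publish(exchange="", routing_key="jobs",
                                      body=f.read())
            os.remove(path)

Note the delete-after-publish ordering: a crash between the two steps can re-send a message, so consumers have to treat delivery as at-least-once.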


I'm not worried about too many messages piling up; I'm worried about the consumers pulling off too many messages and acting on them concurrently, using up all the resources on the machine, or pulling off too few and leaving resources idle.
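
The closest off-the-shelf knob I know of is consumer prefetch, which bounds how many unacked messages each consumer holds, but it counts messages rather than their actual cost. A sketch with pika (queue name and prefetch value are placeholders):

    # basic_qos caps unacked deliveries per consumer: the broker stops
    # handing us messages until we ack. It bounds message COUNT, though,
    # not CPU/memory cost, which is the gap I'm asking about.
    import pika

    def do_the_work(body):  # hypothetical job runner
        ...

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.basic_qos(prefetch_count=4)  # at most 4 in-flight per consumer

    def handle(ch, method, properties, body):
        do_the_work(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="jobs", on_message_callback=handle)
    channel.start_consuming()

A prefetch of 4 keeps a machine from drowning in small jobs, but 4 large jobs can still OOM the box; hence the question.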


Kind of a different problem, isn't it?

My first exposure to the resource allocation problem and solutions came at Google, whose internal cluster manager (Borg) evolved into Kubernetes (now an open source project). It's so damn effective I hope it takes off everywhere.


Well, my first concern with a "job queue" (as opposed to a message queue) is getting work done efficiently and reliably: do too little at once and resources sit idle, but do too much and you might never get anything done.

We have this exact problem at the startup where I work. We have a homemade solution (a combo of Akka, plus Play for the API / admin UI) that works OK, but we'd really prefer not to be in the business of writing job queues and schedulers.

Something big and cluster-oriented like Hadoop isn't a good fit; we typically only have one or two actual machines servicing jobs, because of our business model. Financial entities don't like their data being mixed with other people's data, so we give everybody tiny little networks of VMs, and can't farm work out to a giant cluster.

We use Redis as the backing storage for our homemade job queue. But without resource monitoring and allocation, Disque doesn't help with our problem. I'm sure it'll find much use elsewhere, though.
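
For reference, the dequeue side of ours is conceptually something like this; the resource check up front is the part most off-the-shelf queues lack. (Key names, the threshold, and the psutil check are illustrative, not our actual code.)

    # Reliable-queue pattern on Redis: atomically move a job into a
    # per-worker "processing" list so it survives a worker crash, but
    # only dequeue when the machine has memory headroom.
    import socket

    import psutil
    import redis

    r = redis.Redis()
    PROCESSING = f"processing:{socket.gethostname()}"
    MIN_FREE_BYTES = 2 * 1024**3  # made-up floor: 2 GB free

    def next_job(timeout=5):
        # Leave jobs queued rather than risk the OOM killer.
        if psutil.virtual_memory().available < MIN_FREE_BYTES:
            return None
        return r.brpoplpush("jobs", PROCESSING, timeout=timeout)

    def finish_job(raw):
        r.lrem(PROCESSING, 1, raw)  # drop exactly one matching entry

A reaper can later re-queue anything stuck in a dead worker's processing list. As I understand it, Disque automates that redelivery, but still not the resource part.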



