
How do people solve the resource allocation problem with distributed job queues?

By resource allocation problem, I mean that jobs may be small (so that lots of them can run in parallel) or large (occupying a significant fraction of a machine's CPU, memory, bandwidth, whatever), and the two may be mixed together. Trying to do too much at once can effectively crash a machine via the OOM killer or paging.

Does everybody just roll their own resource-based scheduler?
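
To make that concrete: the thing I keep ending up hand-rolling is a gate where each job declares an estimated cost, and a worker only starts it once that much budget is free. A minimal sketch in Python (the cost units and budget number are made up):

    # Sketch of resource-based admission control: jobs declare an estimated
    # cost (say, MB of memory) and only start once that much budget is free.
    import threading

    class ResourceGate:
        def __init__(self, budget):
            self.budget = budget      # total capacity, e.g. MB of RAM
            self.in_use = 0
            self.cond = threading.Condition()

        def acquire(self, cost):
            with self.cond:
                while self.in_use + cost > self.budget:
                    self.cond.wait()  # block until enough budget frees up
                self.in_use += cost

        def release(self, cost):
            with self.cond:
                self.in_use -= cost
                self.cond.notify_all()  # a smaller waiting job may now fit

    def run_job(gate, cost, fn):
        gate.acquire(cost)
        try:
            fn()
        finally:
            gate.release(cost)

With that, one near-budget job runs alone while lots of small ones pack in parallel, which is the mixed-size behavior I'm after. It just feels like something every shop shouldn't have to reinvent.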



It may be worth reading about Mesos: https://www.cs.berkeley.edu/~alig/papers/mesos.pdf


In the case of RabbitMQ, it starts paging messages to disk, IIRC. If it runs into descriptor / memory / disk limits, it starts turning away new messages until it drops back below a safe threshold.

There's a certain amount it can do, but ultimately, you have to apply backpressure at some point unless you're willing to start losing messages at the message broker layer.

After that, there's still a lot you could do: spool them to disk on the publisher, retry later, etc. You still risk message loss, filling up THOSE disks, etc... but again, something has to eventually give.
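
FWIW, the publisher-side spool can be fairly small. A sketch assuming pika with confirm_delivery() enabled so a refused publish raises, and with a made-up spool path and queue name:

    # Sketch of publisher-side spooling: if the broker refuses a publish
    # (e.g. its memory/disk alarms are firing), keep the message on local
    # disk and retry later. Assumes channel.confirm_delivery() was called.
    import os
    import uuid

    import pika

    SPOOL_DIR = "/var/spool/myapp"  # hypothetical local spool

    def publish_or_spool(channel, body):
        try:
            channel.basic_publish(exchange="", routing_key="jobs", body=body)
        except pika.exceptions.AMQPError:
            # Broker is pushing back (or down): keep the message locally.
            with open(os.path.join(SPOOL_DIR, uuid.uuid4().hex), "wb") as f:
                f.write(body)

    def retry_spooled(channel):
        # Run periodically; delete a file only after a clean re-publish.
        for name in os.listdir(SPOOL_DIR):
            path = os.path.join(SPOOL_DIR, name)
            with open(path, "rb") as f:
                channel.basic_publish(exchange="", routing_key="jobs",
                                      body=f.read())
            os.remove(path)

Note the delete-after-publish ordering: a crash between the two steps can re-send a message, so consumers have to treat delivery as at-least-once.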


I'm not worried about too many messages piling up; I'm worried about the consumers pulling off too many messages and acting on them concurrently, using up all the resources on the machine, or pulling off too few and leaving resources idle.
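
The closest off-the-shelf knob I know of is consumer prefetch, which bounds how many unacked messages each consumer holds, but it counts messages rather than their actual cost. A sketch with pika (queue name and prefetch value are placeholders):

    # basic_qos caps unacked deliveries per consumer: the broker stops
    # handing us messages until we ack. It bounds message COUNT, though,
    # not CPU/memory cost, which is the gap I'm asking about.
    import pika

    def do_the_work(body):  # hypothetical job runner
        ...

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.basic_qos(prefetch_count=4)  # at most 4 in-flight per consumer

    def handle(ch, method, properties, body):
        do_the_work(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="jobs", on_message_callback=handle)
    channel.start_consuming()

A prefetch of 4 keeps a machine from drowning in small jobs, but 4 large jobs can still OOM the box; hence the question.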


Kind of a different problem, isn't it?

My first exposure to the resource allocation problem and solutions came at Google, whose internal cluster manager (Borg) evolved into Kubernetes (now an open source project). It's so damn effective I hope it takes off everywhere.


Well, my first concern with a "job queue" (as opposed to a message queue) is getting work done efficiently and reliably: do too little at once and resources sit idle, but do too much and you might never get anything done.

We have this exact problem at the startup where I work. We have a homemade solution (a combo of Akka, plus Play for the API / admin UI) that works OK, but we'd really prefer not to be in the business of writing job queues and schedulers.

Something big and cluster-oriented like Hadoop isn't a good fit; we typically only have one or two actual machines servicing jobs, because of our business model. Financial entities don't like their data being mixed with other people's data, so we give everybody tiny little networks of VMs, and can't farm work out to a giant cluster.

We use Redis as the backing storage for our homemade job queue. But without resource monitoring and allocation, Disque doesn't help with our problem. I'm sure it'll find much use elsewhere, though.
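
For reference, the dequeue side of ours is conceptually something like this; the resource check up front is the part most off-the-shelf queues lack. (Key names, the threshold, and the psutil check are illustrative, not our actual code.)

    # Reliable-queue pattern on Redis: atomically move a job into a
    # per-worker "processing" list so it survives a worker crash, but
    # only dequeue when the machine has memory headroom.
    import socket

    import psutil
    import redis

    r = redis.Redis()
    PROCESSING = f"processing:{socket.gethostname()}"
    MIN_FREE_BYTES = 2 * 1024**3  # made-up floor: 2 GB free

    def next_job(timeout=5):
        # Leave jobs queued rather than risk the OOM killer.
        if psutil.virtual_memory().available < MIN_FREE_BYTES:
            return None
        return r.brpoplpush("jobs", PROCESSING, timeout=timeout)

    def finish_job(raw):
        r.lrem(PROCESSING, 1, raw)  # drop exactly one matching entry

A reaper can later re-queue anything stuck in a dead worker's processing list. As I understand it, Disque automates that redelivery, but still not the resource part.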



