
A lot of people get this backwards and think a stable program should never "crash". It's actually the opposite: it should throw errors at every opportunity.

The errors should then be logged and the program should be restarted by a watcher process.

Here's an example of how you can both log errors and e-mail them if a process crashes, using a startup script (Linux, Ubuntu):

  exec sudo -u user /bin_location /program_path 2>&1 >>/log_path | tee -a /error_log_path | mail -s email_subject mail@domain.com
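
The "restarted by a watcher process" part could look roughly like the sketch below. This is only an illustration in Python, not the startup script above: `supervise`, `max_restarts`, `delay` and `log_path` are made-up names, and in practice the init system would usually do the restarting for you.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay=1.0, log_path="watcher.log"):
    """Run cmd, restart it after every non-zero exit, return the last exit code."""
    restarts = 0
    while True:
        with open(log_path, "a") as log:
            # The child's stdout and stderr are both appended to the log file.
            code = subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)
            if code == 0:
                return 0  # clean exit: nothing left to watch
            log.write(f"exit code {code} after {restarts} restart(s)\n")
        if restarts >= max_restarts:
            return code  # restart budget used up: give up and report the failure
        restarts += 1
        time.sleep(delay)  # crude pause so a crash loop cannot spin the CPU
```

Calling something like `supervise(["/program_path"], max_restarts=5)` would keep the program running through up to five crashes. The restart limit is a design choice: without it, a program that crashes immediately on startup would restart forever.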


This is how Erlang (for example) gets its reputation of being "nine-nines" capable (i.e. capable of 99.9999999% uptime, or roughly 30 milliseconds of downtime per year). Erlang (and Elixir and LFE) software following the OTP framework is usually organized into "supervision trees" - layer upon layer of Erlang processes managing other Erlang processes, in turn managing other Erlang processes, all potentially distributed across multiple Erlang VM (nowadays BEAM) instances.
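
To give a feel for the shape of a supervision tree, here is a toy, single-threaded sketch in Python. It is not OTP and not how BEAM works: `Worker`, `Supervisor` and `max_restarts` are hypothetical names, the "processes" are plain callables that run to completion, and the only behaviour modelled is "restart the failed child, and fail upwards once the restart budget is exhausted".

```python
class Worker:
    """Leaf of the tree: wraps a callable that may raise."""

    def __init__(self, name, task):
        self.name = name
        self.task = task

    def run(self):
        self.task()


class Supervisor:
    """Runs its children in order and restarts any child that raises."""

    def __init__(self, name, children, max_restarts=3):
        self.name = name
        self.children = children
        self.max_restarts = max_restarts

    def run(self):
        for child in self.children:
            restarts = 0
            while True:
                try:
                    child.run()
                    break  # child finished, move on to the next one
                except Exception as err:
                    if restarts >= self.max_restarts:
                        # Out of budget: fail ourselves so that our own
                        # supervisor (if any) can restart this whole subtree.
                        raise RuntimeError(
                            f"{self.name}: {child.name} exceeded "
                            f"{self.max_restarts} restarts"
                        ) from err
                    restarts += 1
```

Because a `Supervisor` has the same `run()` interface as a `Worker`, supervisors can be nested, which is what makes it a tree: a failure that a lower layer cannot fix by restarting becomes a failure of that layer, and the layer above gets its turn.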


A watchdog pattern just splits the program into several processes; the program as a whole still never crashes.


If an error is caught and handled, calling it a crash seems disingenuous.


One mistake people make is wrapping their code in a try ... catch when it would be better to throw an error and exit. If there's an error in one place, chances are there are errors elsewhere too, so it's better to restart the program than to continue with a bad state.
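
A small made-up illustration of "continuing with a bad state" (Python; both function names are hypothetical):

```python
def total_swallowing(prices):
    """Catches the error and carries on: the result is silently wrong."""
    total = 0.0
    for price in prices:
        try:
            total += float(price)
        except ValueError:
            pass  # bad input is skipped and nobody ever finds out
    return total


def total_fail_fast(prices):
    """No try/except: the first bad value raises ValueError to the caller."""
    return sum(float(price) for price in prices)
```

With the input `["1.5", "oops", "2"]` the first version returns a total that quietly ignores one item, while the second raises `ValueError` and, if nothing catches it, ends the program with a traceback pointing at the bad value.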

When the error gets thrown in your face, there's a higher chance that it gets fixed.

But this also has its drawbacks. Losing the whole state can be really bad.


I'm really not sure why you think catching an error in a separate process is somehow superior to catching it in a higher scope.


I think it depends on the kind of error. If it is a "bug-detected" error (null-pointer dereference, out-of-bounds, divide-by-zero, out-of-memory, etc.), you'd better restart the program, since you're in an unstable state. If it is a "domain-specific" error (connection lost, robot could not reach its destination, battery low, etc.), you'd better deal with it as soon as possible.
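
One way to express that split in code, as a rough Python sketch: `DomainError`, `run_step` and `on_domain_error` are hypothetical names, and the assumption is that the program raises `DomainError` (or a subclass) for every expected operational failure.

```python
class DomainError(Exception):
    """Expected operational failure, e.g. connection lost or battery low."""


def run_step(step, on_domain_error):
    """Run one unit of work, handling only the errors we know how to handle."""
    try:
        return step()
    except DomainError as err:
        # Domain-specific error: deal with it right here, right away.
        return on_domain_error(err)
    # Anything else (ZeroDivisionError, IndexError, ...) is deliberately not
    # caught: it propagates, the process dies, and a watcher can restart it.
```

The point of the design is that the `except` clause names one specific family of errors, so a "bug-detected" error can never be mistaken for something the program knows how to recover from.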



