Why intuitive troubleshooting has stopped working (honeycomb.io)
71 points by kiyanwang on Aug 3, 2022 | 75 comments


The architecture of our systems has gotten a lot more sophisticated and complex over the past 20 years. We’re not running monoliths on a few beefy servers these days.

I am running a monolith, and troubleshooting is still pretty easy.

We’re operating distributed microservice ecosystems on top of a deep stack of frameworks, abstractions and runtimes that are all running on other people’s servers (aka 'the cloud').

Yikes, good luck.

Snark aside, I just want to remind most people out there: you do not work for Google or Facebook, and your systems do not have to be architected as if you are the next Google or Facebook. Neither of those companies started out with microservice architectures or anything like them. Save yourself the pain. Build a product. Find some customers. Worry about the architecture astronautery later.


The problem is that if a company advertises a job that is all about maintaining one or two monolithic applications without K8s, distributed-systems topics, tracing, microservices, AWS/GCP, etc., well, then not many engineers will apply. That's actually a paradox in itself: my bet is that only a small group of (good) engineers would apply to such a job. Management, on the other hand, usually thinks that the bigger the pool of applicants, the better for the company. Also, a lot of engineers out there are resume-driven: they do want to work with microservices, the cloud and K8s.

It's a vicious circle.


> then not many engineers will apply

I would. I don't run server farms and don't need to provision entire environments so K8s would be extreme overkill for anything I do. Docker makes sense to formally define system requirements but even that I use sparingly.

I develop embedded systems though and usually only trace coffee machines.

I do host stuff on AWS and gladly use all the services they provide. Neat, easy and less maintenance, but the most important thing for me personally is that I work with an account that doesn't use my own credit card.

I can understand the applicants though; there is a fear of committing to solutions where the experience might not increase their market value. But in my experience that fear is highly exaggerated. There are people who reject candidates on those grounds, but usually not the good ones, who know they need to invest in engineers. There are also quite a few bosses who are only interested in the newest tech. They have absolutely no idea what it means, but they will buy it. And a lot of cloud software hits the sweet spot for them.


> > then not many engineers will apply

I would.

Like you said: one point is probably market value, and another future relevance. I have had discussions with hiring managers lately, and almost none of those I talked to actually use most (or any) of the buzzwords they put in their ads. They just say they do to get more applications.


How broadly is this true?

Some of us developers do want to focus on things that are useful to people, that we can describe to non-tech people we meet in public. Are those of us who run bootstrapped, all-remote SaaS companies going to have trouble hiring?


Based on my experience:

- many frontend engineers would run away from companies that hire for jQuery skills. Frontend engineers nowadays usually want to write React and use Vercel and what not (even though the product may only require jQuery)

- many backend engineers would run away from companies that hire for PHP. Backend engineers nowadays usually want to write Python/Go/Kotlin/Java, well, anything except PHP, even though modern PHP is well-suited for most SaaS/web products

- many infrastructure engineers would only apply for companies that use K8s, modern CI/CD pipelines (e.g., GitHub/GitLab) and cloud providers (e.g., AWS, GCP) and run away from anything that smells like on-prem installations using Ansible/Chef/custom bash scripts, even though the vast majority of web products out there do not need K8s/AWS/Terraform/etc.

I think in general, engineers do want to build useful stuff for people but only if they get to play with the coolest tech.


> even though modern PHP is well-suited for most SaaS/web-products

Maybe, but is it pleasant to use?

My memories of PHP are not fond at all, it is a really miserable language to write. Maybe it has the features it needs but is it actually nice to use?

Same with jQuery. There's a lot of rose-colored-glasses nostalgia for jQuery these days, imo. It is not fun to write.

Also, the complexity lift of using React over jQuery is minor compared to the complexity lift of distributed K8s microservices over an Nginx server with a custom bash script.

The extra complexity of a scalable microservices architecture might be sufficient to actually sink your startup. The extra complexity of using React over jQuery costs... essentially nothing.

Edit: I guess what I'm really saying is that choosing a different backend technology or frontend framework isn't remotely the same as choosing to build a scalable distributed microservice architecture over a static bare metal server.

Choosing a technology is a choice based on what you want to work with and what you think you can hire other people to work with.

Choosing K8s/Terraform/distributed microservices for scalability early on is the most premature optimization you can possibly make. Get a customer before thinking about scaling.


A Python/Go/Kotlin/Java framework has the same amount of complexity as a PHP one (you are using one, right? It's not <?php tags all over the source, right?) and are nicer to use. (Not much nicer, mind you, PHP frameworks have evolved too, but still nicer.)

Isn't that what we expect from evolving technologies? (And well, if you are doing <?php all over, bare Python with print statements has the same complexity, with about the same downsides. Nobody will want to touch it, for very good reasons.)

jQuery is a very predictive marker for decades-old legacy cruft that nobody is allowed to improve. The modern equivalent is vanilla JS, which is simpler, about as powerful, and perfectly viable to hire for. There's actually no technical problem with jQuery; it just correlates very well with bad companies.

The infrastructure world has a real problem. A large part of it is that Puppet, Ansible and Chef are all badly managed, centralized, complex pieces of software, and custom bash scripts can be anything. Also, cloud offerings tend to minimize on-call incidents, and companies normally have very bad policies about on-call time (kept secret until the candidate has no other option).

A purely community-oriented ops tool (Apache style) would do wonders here. As would better labor laws (in the US, mostly).

(And to edit here: the trend in newer ecosystems, like Go, Rust and Haskell - and old languages where the ecosystem is mostly new - is to make the backend frameworks simpler. So you get a nicer-to-use system with less complexity to stop you from solving issues.)


All right, you got me. I've eschewed PHP for almost 20 years. And I don't use jQuery in new code either, though I do sometimes use plain, framework-less JavaScript, which in modern browsers provides a lot of the functionality of jQuery but with no runtime overhead. I wouldn't disqualify a candidate for being primarily experienced with jQuery, and I'd be curious to see their modern PHP.


You really don't want to hire those engineers though? We explicitly hire for judgement, and while all of those choices might be good judgement for the business some day, they aren't today, and an engineer who doesn't get that is much more of a liability for the business.


"The problem is that if a company advertises a job that is all about maintaining one or two monolithic applications without K8s, distributed system topics, tracing, microservices, AWS/GCP, etc., well, then not many engineers will apply."

Also: if you are an engineer that has managed to run successful systems for years but didn't use microservices, K8s, AWS and all the other stuff you shouldn't even bother to apply for many jobs because you are "not up-to-date". Heck, reading the job ads from my company even they probably wouldn't hire me or a lot of my colleagues if we applied.


What I find more worrying is that incubators and VCs now often push it. One disturbing metric for a well-known incubator (which I can't name) is how fast you burn through your free $100k AWS budget; if you don't do it fast enough, that is a failure signal as far as they are concerned. And I see this a lot: this push to pump massive amounts of money into stuff that's not needed and probably never will be needed.


Good! Let the people who want to over-engineer their systems self-select into companies with over-engineered systems.


Neither Facebook nor Google is built on microservices in this fashion. Amazon is (but it's on its own servers).


Doctor, doctor, every time I touch my left knee to my right pinky I get a stabbing pain in my heart!

Don’t worry, I know a physical therapy regime that will fix you right up: no medicines or scans required

Wonderful! what is it?

Every day before bed, don’t touch your left knee to your right pinky


Even without architecture astronautics you still run on a very deep stack of frameworks and abstractions that make troubleshooting difficult.


My turtle stack is pretty well documented and 95%+ of problems have already been solved by other people.

For everything else, I'm quite happy with https://sentry.io and https://www.skylight.io for my troubleshooting needs.


How do you know that?


Because even if you run bare metal on a Raspberry Pi, you rely on fairly complex binary blobs. Most software is too complicated for such a simple stack.


You mean the Linux kernel and device drivers? Most of the time these things cause fewer problems than other things.


That seems a fairly stupid line to draw, to be honest. Do you know how the silicon your code runs on is made? Do you know how the fibreglass the CPU chip is stuck to is made? How do they get the copper to stick?


Yes, but the sooner you build to scale the better. More costly later on.


Chances are quite high that you will never need scale. If you knew with 100% certainty that you needed it, you would build for scale from day 1. But most startups, and many projects in large corps as well, don't know if they will need scale. And startup statistics definitely show there's a very high chance you never will, because you'll be out of business long before then. It's mostly a waste of effort, time and money.


Is it? You trade retrofit costs later for maintenance costs and feature development friction every day until scale is needed.


You make a very good point.


Scalability is a two-way street, and those services don't scale small.

That makes business opportunities for more flexible engineers.


i recall reading ken thompson's reply to what he thought of virtual machines. something like "they are such a good idea that i wrote unix".


> astronautery

gonna borrow that phrase.


Surely "astronautics" is the right word though?


In the sci-fi future such as star trek, you often see systems being reprogrammed by swapping crystals around or something like that. I used to think it was really stupid, but as I deal with trying to build complicated systems I find myself wishing I could build something like that to set up systems and have them just work.

Interestingly, 90's industrial automation sort of feels like it was trying to build systems like that. The control cabinets are filled with simple, discrete, computational blocks that control valves and actuators based on sensor inputs.

Getting back to the sci-fi, I realized that the crystal swapping could just be the high level interface to all the complex code below. They really wouldn't need to do anything except light up when they are detected correctly, but still control complex code hiding below the surface... Maybe written on demand by AI that reads the crystals as prompts.

I don't know where this is all heading, but maybe someday swapping databases will be no more complicated than changing a lightbulb.


Decades back there was a UI sketch for a phone answering machine. It dropped a marble into a tray for each message received. "New messages? How many?" became a simple glance: how many new marbles? To play a marble/message, place it on the play-back dimple. Replay by nudging it. Save the marble, save the message. No "to save this message yet again, press N". Delete by dropping it in the recycling hopper. Running out of length-limited n-slot message storage is running out of marbles in the hopper. Very tactile, tangible affordances. The marbles were merely IDs, not storage. But (reliably archival?) storage might be fun too - e.g., finding a bag of marbles from when you were a kid? EDIT: the designer was Durrell Bishop


That sounds amazing actually. One of the more non-technical user-friendly paradigms I've heard of.


This is insanely creative thinking. I wonder what other things people have thought up like this that haven't seen the light of day.


Leads to a new meaning for “I lost my marbles”.


Not possible. Imagine the number of scenarios that require fixes; there are just too many of them. So you either end up with a gigantic number of crystals (and still can't solve the problem in some exceptional cases), or you have a small number of crystals but a high number of situations you can't resolve with them.

So maybe those crystals are more like app-configurations that we already have: you can tune things in a predictable way, but as soon as it goes beyond that you need to change the code.


"Getting back to the sci-fi, I realized that the crystal swapping could just be the high level interface to all the complex code below. "

Personally I find it a bit hilarious when I think about it too much. We can fit a terabyte on something that fits on our thumbnail comfortably, and we can provide enough computation power to run most any plausible control system comfortably in a phone form factor, but the superadvanced aliens way smarter than us and millennia if not more beyond us need massive crystals.

I suppose we can hypothesize that some variant of "hyperspace" requires vast computational powers to work at all, and maybe some of the other advanced stuff does (e.g. one imagines the holodeck or anything even resembling it would possibly require more computing power than our entire world has), but it's hard to imagine "life support" requires some massive crystal to run.

Science fiction has always had a complicated relationship with computers. The exponential power increases of so many decades is just hard to deal with. Even if you understand that exponential power increases are occurring, that doesn't mean you can correctly understand what that means in practice.

(In the alien's defense, one thing they may legitimately have on us, and for which unspecified "crystals" clearly made out of a very hard and durable substance may even make some sense, is longevity. A crystal that "merely" stored a few gigabytes and had the computation power of a modern cell phone, but that could be accidentally left behind in someone's grave for 20,000 years and work just as well as the day it was made would have a lot of legitimate uses, even if they had the capability of making things with a lot more power that wouldn't last as long. But even then I don't see swapping these things around like they're a fancy puzzle from Rubiks or you're trying to solve an ancient Sudoku as being a likely use case.)


Maybe it's an exabyte micro micro micro SD card the size of a grain of sand inside a crystal case. Works the same as a floppy disk.


Imagine 100 years from now when humanity's population has been reduced by 100s of millions because of climate catastrophe. Now imagine all the software society relies upon. Now imagine being one of the few people left alive being capable of working on that software. You probably won't have source code.

One thing that has remained constant in our lives is that the number of people capable of working on complex software has steadily increased. What happens if that number suddenly decreases?


Gotta make it simple and robust enough that someone who isn't an idiot can reach in and change it when it stops working.


Now I've started to wonder what the actual ratio is between the software we need and the software that is produced or distributed.


The real solution to this is how we solved it last time we were here: abandon the mainframe and go back to local applications on PCs.

The cloud really provides two things for most applications:

(1) Data synchronization at scale. This could be provided by a much, much thinner cloud that basically just syncs files but using CRDTs to make that more robust. Applications could use these services to sync state for collaboration. The vast majority of the brains could be at the endpoint.

(2) The cloud is copy protection and DRM. It gives software publishers an easy place to put a toll booth. This could be solved by a reboot of the concept of local licensing, or by just periodically calling out to a license server. Yes people will pirate your software but those people aren't going to pay anyway.
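The CRDT-based sync in (1) can be sketched with a minimal last-writer-wins register in Python. This is purely illustrative: the class name and explicit timestamps are my own, not from any particular library, but it shows why a "thin cloud" can stay dumb - merging converges no matter what order replicas exchange state in.

```python
class LWWRegister:
    """A last-writer-wins register, one of the simplest CRDTs.

    Replicas accept writes independently and exchange state;
    merge() is commutative, associative, and idempotent, so any
    gossip order converges to the same value.
    """

    def __init__(self, value=None, ts=0.0):
        self.value = value
        self.ts = ts  # logical or wall-clock timestamp

    def set(self, value, ts):
        # Accept a local write only if it's not older than
        # what we already have.
        if ts >= self.ts:
            self.value, self.ts = value, ts

    def merge(self, other):
        # Take whichever write is newer; ties keep ours.
        if other.ts > self.ts:
            self.value, self.ts = other.value, other.ts


# Two endpoints edit offline, then sync in either order.
a, b = LWWRegister(), LWWRegister()
a.set("draft v1", ts=1.0)
b.set("draft v2", ts=2.0)  # the later write
a.merge(b)
b.merge(a)
assert a.value == b.value == "draft v2"
```

Real systems layer file chunks and vector clocks on top of this, but the convergence property is the whole trick: the sync service only has to relay state, never arbitrate it.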


If I'm not mistaken, you overlooked another role of cloud services: coordination of real-time connections between endpoints, and in some cases actual relaying of data to work around NAT. I guess this is the space you're in with ZeroTier. But I'm not sure how much of a market there is for generalized solutions like ZeroTier and Tailscale. For the most part, people want solutions that connect their machines for a specific purpose, such as remote desktop access (e.g. TeamViewer/AnyDesk), and developers of such products don't want to require their users to have another networking product (ZeroTier/Tailscale) first.


Not sure this works. For example, how do you implement search? I'm not looking forward to having to download the entire search index catalog of a web store in order to find out whether they have something.


I'm not really talking about store web sites and other classical (and appropriate) uses of the web. I'm talking more about all the SaaS applications that are really desktop apps running remotely on a "mainframe."


I'm afraid I don't know of examples that don't take advantage of the much larger storage availability of the cloud, especially for things like search.


Maybe not run everything locally, but most “web apps” could be local apps with something like dropbox to back up the data and make it accessible.

The issue with this is the web is zero install cross platform and it is hard to compete with that.


I went to a serverless conference some time ago (great excuse to travel on the company's dime), and every single talk had a few minutes about how they were unable to test or debug their production infrastructure.

They couldn't even run their code locally, just bits of it. Serverless code could not run without a server. A specific provider's servers.

Call me old-fashioned, but I develop with a few docker containers locally, then release them to a plain old linux server, and it's working great. Sure, my humble business isn't Google, but neither are yours, right?


I went to one too, and saw a talk about synchronising data between microservices. I was astounded at how much work it was, and at the fact that they had a whole team just to build tooling to debug issues at this one specific company.

Basically they stored 100x more data than they needed because of data duplication between microservices. The company didn't even have that many customers (<1000), that much data, or even that many users (a handful per company), but duplication was still generating several terabytes.

It's amazing how much complexity people are able to cram into a simple system.

(Interestingly enough there's another recent comment on another article here talking about this exact issue: https://news.ycombinator.com/item?id=32320660 )


Agree that being unable to test serverless code locally is begging for pain. While not perfect, testing AWS Lambda has improved over the years. I recently deployed a Lambda function only because I was able to reproduce the environment locally, using their Lambda Runtime Interface Emulator (RIE):

https://docs.aws.amazon.com/lambda/latest/dg/images-test.htm...


Until you need to connect to another service that cannot be run locally and is only accessible within the same cloud provider.

Sure, it's not impossible to work around that - start mocking, or run some alternative locally for debugging - but I don't think that extra level of complexity is necessary (at least not always).


It's odd that over time things don't get simpler, instead everything trends towards more and more layered complexity - where the higher layers are abstracted so as to appear simpler.


Such is life (that is, biological evolution worked the same way).


But unlike nature, we can do intelligent design.


How is the evolutionary design not intelligent?


The laryngeal nerve goes in roughly a straight line in fish. With evolutionary development of a neck stretching out, both ends of it are in the neck but the middle is still stuck near the heart. It now goes from brain down to heart, loops around an artery then back up the neck to the larynx. With a giraffe that's a pointless detour of several meters because evolution can't intelligently move it above the artery.[1]

The optic nerve connects to the front of your retina giving you a blind spot, then pokes a hole in the retina to get out to the back and off to the visual cortex which is at the other end of the brain. Intelligent design would do better.

[1] https://forum.donanimhaber.com/store/c9/ef/2d/c9ef2d955e3f99...


There is no intelligence guiding it. If it gets stuck in local maximum there it stays. If routing a vein through the worst possible places works it won't get optimized etc.


Intelligence (in the sense of "this is an intelligent way to do X") implies some purpose. But life is just one big "systems that survive and grow - grow and survive" tautology.


Veering into philosophy perhaps, but intelligence to me is “thinking”.


Only if you are designing another TempleOS: a thing for its own sake, unaffected by any external requirements and completely specified, designed and implemented by one person.


Zealous creationism in that case.


Life implies death. When will these systems implode under their own complexity?


This is a well known cycle in software. Systems grow in complexity until they collapse under their own weight, then a new system is created that steals the most valuable ideas from the old system. Repeat.

It happens in nature too. Your skull has fewer bones in it than the skulls of many fish and humans do not by any stretch have the largest genome in nature.


Things may change when (if) our machines get smarter than us.

Currently we're constantly expanding the complexity of our systems, but since they don't understand their underlying purpose, they can't help us manage that complexity without drastically limiting their flexibility; there's no working around that.

But if they knew what they were doing, instead of "I need to reorganise this database stack to reduce overhead" we'll be giving simple commands like "please stop taking over the world" and letting the machines handle the complex details.


I hate headlines like this piece's full one - omitted here on HN so far - which is "Why Intuitive Troubleshooting Has Stopped Working For You".

For me? Excuse me no, because I design systems to be supportable in production. Tsk.

"For some"? "For many"? Sure! But please don't annoy me before I even start reading with this kind of headline.


Why the hate? This headline makes the audience clear, and it’s not you! Enjoy the freedom this gives you to skip reading the post.


Ha, fair enough, it's a strong term. Forgive me, I am just venting a bit about a wider pattern of these sorts of headlines: "Why YOU aren't saving enough for retirement", "Why YOU need to eat fewer avocados", etc.

It's not just an audience indicator; it's a challenge designed to engage, and it just winds me up. A stylistic technique I really dislike, that's all! :-)


A/B testing has destroyed headlines and I hate it.


Yeah. But, fortunately, it's a dynamic system! Learn to ignore the headlines that are written this way, because the noise is high.

Some folks have to use ad blockers to avoid seeing ads, while others train themselves to just ignore ads entirely. Personally, I don't even see them most of the time. If you're in the latter camp, training yourself to ignore headlines like these doesn't seem much harder!


Regardless of complexity: error messages, error messages, error messages. I can't emphasize enough how important decent error messages are. When you are having trouble with a file, include the full path in the error message. Put in the name of the function in the error message, what you expected and what you got. Bubble up lower level errors, don't swallow them. Include a stack trace, include the path of the log file if applicable. Please don't have your error message say "Something went wrong".
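As a minimal sketch of that practice in Python (the function and file name here are hypothetical, chosen only for illustration): name the function, include the resolved path, and state what was expected.

```python
from pathlib import Path


def load_config(path):
    """Read a config file, failing with a message that names
    the function, the full path, and what was expected."""
    p = Path(path).resolve()
    if not p.exists():
        # Bad:  raise RuntimeError("Something went wrong")
        # Good: say where it failed, on what, and why.
        raise FileNotFoundError(
            f"load_config: expected a readable config file at "
            f"{p}, but no such file exists"
        )
    return p.read_text()


try:
    load_config("missing.conf")
except FileNotFoundError as e:
    print(e)  # names the function and the absolute path
```

The few extra characters in the f-string cost nothing to write, and they are the difference between a five-minute fix and an hour of guessing which file, on which host, the code was actually looking at.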


Agreed. A few times I've found myself in the unfortunate position of needing to push changes to production just to be able to diagnose an error, because it was being completely swallowed by a try-catch block with a generic "something went wrong" message.
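The fix for the swallowing part is exception chaining: in Python, `raise ... from` wraps the failure with context while keeping the original cause and its traceback attached. A small sketch with made-up function names:

```python
def fetch_user(user_id):
    # Stand-in for a low-level call that fails.
    raise ConnectionError("db host unreachable")


def get_profile(user_id):
    try:
        return fetch_user(user_id)
    except ConnectionError as e:
        # Don't swallow: add context and chain the original
        # error via `from`, preserving its traceback.
        raise RuntimeError(
            f"get_profile: failed to load user {user_id}"
        ) from e


try:
    get_profile(42)
except RuntimeError as e:
    # The low-level cause is still there for debugging.
    assert isinstance(e.__cause__, ConnectionError)
```

A generic catch block that logs "something went wrong" throws the `ConnectionError` away; this version surfaces both the high-level context and the root cause in one stack trace, with no production redeploy needed to find out what happened.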


I often feel that Python web stacks, for example, are overly reliant on conventions. Your work is expected to follow nested conventions, and if you do so, things work. Yet some conventions imported from underlying packages are not documented in the current package, so if you are new to the stack you are lost immediately.


I went to a serverless conference some time ago (great excuse to travel on the company's dime), and every single talk had a few minutes about how they were unable to test or debug their production infrastructure. They couldn't even run their code locally, just bits of it.


That seems odd to me. If the service can’t run or emulate locally then I wouldn’t use it.


Please, someone tell me whether it is at all possible to deliver very complex/sophisticated B2B products via cloud services.

For me it seems a nightmare to debug complex systems without the "old" tooling; it seems quite impossible when everything is buried behind queues, pub/subs, cloud functions and all that.

Any comment on this is quite welcome.


tldr: Loose coupling means no intuitive troubleshooting.

Which you can infer from the diagrams but isn't stated directly in this marketing puff piece.



