Instrumentation: The First Four Things You Measure (honeycomb.io)
239 points by cyen on Jan 26, 2017 | 37 comments


Came across this, which gives good insight into the 4 golden signals for top-level health tracking: https://blog.netsil.com/the-4-golden-signals-of-api-health-a...

One thing of note in the graph is the tracking of response size. This would be very useful for 200 responses with "Error" in the text, because then the response size would drop drastically below a normal successful response payload size.

In addition to Latency, Error Rates, Throughput and Saturation, folks like Brendan Gregg @ Netflix have recommended tracking capacity.


Are you a plant? It must just be coincidence that the second post in the series is titled "measuring capacity." :) https://honeycomb.io/blog/2017/01/instrumentation-measuring-...

(bias alert - I work on Honeycomb)


we are all learning from the same folks ahead of us it seems :)

I agree with other comments, though: the devil is in the details of how to actually set up these "golden signals" so that they are useful and don't just drown everyone in packet-level nonsense.


TCP retransmission rate looks like a useful metric that can help in monitoring the health of a service. One way to obtain it is by analyzing service interactions, as mentioned in the blog; tracing could be another way to find that info. I am curious how code-instrumented monitoring solutions get that information. (PS: I work for Netsil)


By default you can only get that per-kernel from /proc/net/snmp. BPF may allow something more granular.

The other way of approaching it is to look for the additional latency it causes, which you can spot on a per-service basis.
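
A minimal sketch of reading the kernel-wide counters (Linux only; the field names come straight from /proc/net/snmp):

    def tcp_retransmit_ratio(path="/proc/net/snmp"):
        # the Tcp: section is a header line followed by a values line
        with open(path) as f:
            rows = [line.split() for line in f if line.startswith("Tcp:")]
        fields = dict(zip(rows[0][1:], (int(v) for v in rows[1][1:])))
        # retransmitted segments as a fraction of all segments sent since boot
        return fields["RetransSegs"] / max(fields["OutSegs"], 1)

    print("retransmit ratio since boot: %.4f%%" % (100 * tcp_retransmit_ratio()))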


Additional latency could be an indicator, but there's no guarantee that it is because of retransmissions?


If you look at your latency histogram and are seeing a bump at around 200ms above normal (which was the default minimum wait time a few years back anyway), it's probably retransmits.


Got it.


You can get retransmits from 'sar' (e.g. 'sar -n ETCP') on Linux.


I see. But it looks like it's per host, and there is no way to find out the figure for a particular service running on the host.


> A histogram of the duration it took to serve a response to a request, also labelled by successes or errors.

I recommend against this; rather, have one overall duration metric and another metric tracking a count of failures.

The reason for this is that very often just the success latency will end up being graphed, and high overall latency due to timing-out failed requests will be missed.
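
A rough sketch of what I mean with the Python client (metric names are made up, and do_work is a stand-in for your handler):

    import time
    from prometheus_client import Counter, Histogram

    REQUEST_DURATION = Histogram(
        "http_request_duration_seconds", "Time spent serving requests")
    REQUEST_FAILURES = Counter(
        "http_request_failures_total", "Requests that ended in an error")

    def handle(request):
        start = time.time()
        try:
            return do_work(request)   # hypothetical handler
        except Exception:
            REQUEST_FAILURES.inc()
            raise
        finally:
            # every request, success or failure, lands in the one histogram
            REQUEST_DURATION.observe(time.time() - start)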

The more information you put on a dashboard, the more chance someone will miss a subtlety like this in the interpretation. Particularly if debugging distributed systems isn't their forte, or they've been woken up in the middle of the night by a page.

This guide only covers what I'd consider online serving systems; I'd suggest a look at the Prometheus instrumentation guidelines on what sort of things to monitor for other types of systems: https://prometheus.io/docs/practices/instrumentation/


By creating events that contain both the duration of the request and whether it succeeded, you can create graphs that show you the detail you need. Unless you include those data together at the beginning, it will be impossible to tease them apart later on. Combining them into one graph will likely conceal the difference in the two cases, as you describe, unless you feed them into a system that can natively tease them apart as easily as show them together (such as http://honeycomb.io). So it seems like the disagreement is more about visualization than collection (the section of the blog in which that quote appears).

The originally quoted advice, to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualization of that data makes clear the separation.
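
As a quick illustration of capturing both on the same event (the field names and the JSON-to-stdout bit are just a stand-in, not a prescribed schema):

    import json, time

    def handle(request):
        start = time.time()
        event = {"endpoint": request.path, "success": True}
        try:
            return do_work(request)   # hypothetical handler
        except Exception as exc:
            event["success"] = False
            event["error"] = type(exc).__name__
            raise
        finally:
            event["duration_ms"] = (time.time() - start) * 1000
            print(json.dumps(event))  # stand-in for sending to your event store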

I absolutely agree that careful consideration is required when choosing what to put on dashboards to avoid confusion. That seems to be a separate issue.

(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)


In practice, if your system is complicated and you have to look at the visualization, you are already in trouble. For anything complicated, you need exactly the inputs you describe, but everything has to already be processed by another layer that can give you higher-level ideas.

This is a place where I think you guys could beat what other 3rd party monitoring tools are doing. I work with some of your guest bloggers, and I work on a subsystem with its own dashboard: about 50 charts. To make bringing in new teammates a sensible experience, we need both a layer of alerts on top of the charts, and then a set of rules of thumb (which would be programmed in, if the alerting system were good enough) that put the alerts together into realistic failure cases: if X and Y triggered but Z didn't, then chances are this piece is the culprit.
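
A toy example of what that rules layer might look like (alert names and rules invented):

    def diagnose(firing):
        # (must_fire, must_not_fire, verdict) -- "if X and Y triggered but Z didn't"
        rules = [
            ({"queue_depth_high", "consumer_lag_high"}, {"ingest_error_rate_high"},
             "consumers are slow; ingest itself is probably fine"),
            ({"ingest_error_rate_high"}, {"queue_depth_high"},
             "bad input or a broken deploy on the ingest tier"),
        ]
        for must_fire, must_not_fire, verdict in rules:
            if must_fire <= firing and not (must_not_fire & firing):
                return verdict
        return "no rule matched; go look at the dashboard"

    print(diagnose({"queue_depth_high", "consumer_lag_high"}))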

There are also opportunities in visualizations that aren't chart based: we used to have something like that for another complex system at another employer, but that's expensive, custom work, unless you join forces with something that understands where all your services are, knows all ingress and egress rules, and thus could automatically generate a picture of your system, along with understanding the instrumentation. So leave that until you merge with SkylinerHQ or something.

That said, I think you guys are heading towards a good, marketable product as it is. Fixing the annoying statsd/splunk divide of older monitoring would probably make us buy it already.


> Combining them into one graph will likely conceal the difference in the two cases, as you describe

Indeed. The first order issue is locating the problem though.

If you don't spot which of your microservices is the culprit due to only looking at successful latency, you're not going to get to the stage of comparing successful vs failed latency (and in practice, the increased error ratio combined with increased overall latency should tip you off).

> unless you feed them into a system that can natively tease them apart as easily as show them together

And the user actually thinks to perform that additional analysis.

> So it seems like the disagreement is more about visualization than collection

What I've seen happen is that the collection leads to the visualisation, which subsequently leads to prolonged outages due to misunderstanding.

Thus I suggest removing the risk of the issue on the visualisation end, by eliminating the problem at the collection stage. This is particularly important when the people doing the visualisation aren't the same people writing the collection code, and thus don't know if the people creating the dashboards will all be sufficiently operationally sophisticated.

> to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualisation of that data makes clear the separation.

It's a little messier than that. Depending on exactly how the data is collected, such a split could make some analyses more difficult or impossible. For example, I need the overall latency increase in order to see if this server is entirely responsible for the overall latency increase I see one level up in the stack, or if there's some other or additional problem that needs explanation. There's no equivalent math for success or failure.

Put another way, the math on the overall works the way your intuition thinks it does. The split out version is more subtle.
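
A toy example with made-up numbers: the overall mean is the request-count-weighted mean, which is what the layer above experiences, not the average of the two split series.

    success_count, success_mean_ms = 990, 50.0
    failure_count, failure_mean_ms = 10, 5000.0   # timing-out requests

    overall = (success_count * success_mean_ms + failure_count * failure_mean_ms) \
              / (success_count + failure_count)
    naive = (success_mean_ms + failure_mean_ms) / 2

    print(overall)  # 99.5 ms -- what the caller one level up actually sees
    print(naive)    # 2525.0 ms -- what eyeballing the two split series suggests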

>(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)

I work on Prometheus which is a metrics system. Honeycomb seems to be based on event logs. There's logic to removing the success/failure split for duration metrics as I suggest, but it'd be insanity to remove it for event logs. So in your case it is purely a visualisation problem, whereas for us losing granularity at the collection stage is an option (and sometimes required on cardinality grounds).

The terminology the article uses (incrementing a counter at an instrumentation point) led me to believe we were discussing only metrics.

The way I would see things is that you'd use a metrics-based system like Prometheus to locate and understand the general problem and which subsystems are involved, and then start using log-based tools like Honeycomb as you dig further in to see which exact requests are at fault. They're complementary tools with different tradeoffs.

I've written about this in more depth at http://thenewstack.io/classes-container-monitoring/


> rather have one overall duration metric

What exactly would that be?


The duration metric as recommended by the article, but not broken out by success/failure.


yea that's not good advice, but "don't rely on overstuffed dashboards" is good advice.

You should only have simple dashboards (and alerting) for KPIs and end-to-end checks. Everything else should be instrumented and debugged using a real-time sorting and slicing tool, especially if you have a complex system (microservices, distributed systems, polyglot persistence).


While this is good advice, I feel it is a bit over-simplified.

Counting incoming and outgoing requests misses a lot of potential data points when determining "is this my fault?"

I work mainly in system integrations. If I check the input:output ratio, then I may miss that some service providers return a 200 with a body of "<message>Error</message>".

A better approach is to make sure your systems are knowledgeable about how data is received from downstream submissions, and to have a universal way of translating that feedback into a format your own service understands.

HTTP codes are (pretty much) universal. But let's say you forgot to include a header, forgot to base64 encode login details, or are simply using a wrong value for an API key. If your system knows that "this XML element means Y for provider X, and means Z in our own system", then you can better gauge issues as they come up, instead of waiting for customers to complain. This is also where tools like Splunk are handy, so you can be alerted to these kinds of errors as they come up.
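
Roughly what I mean, as a sketch (the provider name, XML shape, and codes are invented for illustration):

    import xml.etree.ElementTree as ET

    def normalise(provider, http_status, body):
        if provider == "provider_x" and http_status == 200:
            # provider X reports failures inside a 200 response body
            message = ET.fromstring(body).findtext("message", default="")
            if "Error" in message:
                return {"ok": False, "reason": "provider_reported_error"}
        if http_status == 401:
            return {"ok": False, "reason": "bad_credentials"}
        return {"ok": 200 <= http_status < 300, "reason": None}

    print(normalise("provider_x", 200,
                    "<response><message>Error</message></response>"))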


The article never defines what an error is, so I think it is very reasonable to take your approach. I think you're mistaking very abstract advice for very simple advice :)


I agree that this is over-simplified. It also skips over the mess you can get when a downstream dependency is having issues.

If the "things calling you" can't be effectively throttled, you often run into issues like, for example, hitting the limit on number of open sockets, file descriptors, receive queue, threads etc.

So, just saying "the downstream service is at fault" isn't really correct. Your service may also not be acting correctly in that situation. Those issues can also affect your logging and metrics.

It's not a trivial exercise to architect your service such that it always does the right thing (throttling input vs retries vs fail fast vs priority queues vs load balancing to multiple instances of a downstream service, exponential backoff, etc) when a downstream dependency is slow and/or down.
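
For example, one of those options sketched out (retry with exponential backoff and full jitter; the exception type is a stand-in for whatever your client raises):

    import random, time

    class TransientError(Exception):
        """Stand-in for timeouts/5xx from the downstream client."""

    def call_with_backoff(fn, retries=5, base=0.1, cap=5.0):
        for attempt in range(retries):
            try:
                return fn()
            except TransientError:
                if attempt == retries - 1:
                    raise                              # budget spent: fail fast
                delay = min(cap, base * 2 ** attempt)
                time.sleep(random.uniform(0, delay))   # full jitter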

Edit: Similar to your observation about structured errors, connection pooling is probably also worth talking about in this situation. Which would change the stats you want...once # of connections made isn't the same thing as # of transactions, you would want to know both.


> A histogram of the duration it took to serve a response to a request, also labelled by successes or errors.

This is so much easier said than done. Most time-series DBs that people use to instrument things quite simply cannot handle histogram data correctly. They make incorrect assumptions about the way roll-ups can happen, or they require you to be specific about resolution requirements before you can know them well.

Then histogram data tends to be very expensive to query, so it bogs down, preventing you from making the kinds of queries that are really valuable for diagnosing performance regressions.

Finally, the visualization systems for histograms are really difficult because you need a third dimension to see them over time. Heat maps accomplish this, but they are hard to read at times, and most dashboard systems don't have great visualization options for "show this time period next to this time period", which is an incredibly common requirement when comparing latency histograms.
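
On the roll-up point: fixed-bucket counts are the representation that does merge cleanly across windows or hosts (unlike pre-computed percentiles), which is the property a store has to preserve. A toy sketch with arbitrary bounds:

    BOUNDS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, float("inf")]

    def to_buckets(latencies_ms):
        counts = [0] * len(BOUNDS_MS)
        for v in latencies_ms:
            counts[next(i for i, b in enumerate(BOUNDS_MS) if v <= b)] += 1
        return counts

    def merge(a, b):
        # counts just add, so rolling up windows is safe;
        # averaging two pre-computed p99s would not be
        return [x + y for x, y in zip(a, b)]

    print(merge(to_buckets([12, 48, 210, 95]), to_buckets([7, 7, 600])))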


Yup! It's hard! All the things you point out are right on.

We don't have the visualizations for histograms yet (though you can chart specific percentiles), but for the reasons you mention, Honeycomb is perfectly suited to give you that kind of data. I can't say we'll get that out the door soon, but it's one of my most-wanted pet features, so as soon as I can convince myself it's actually more important than the mountain of other things that need to get done, you'll get your histograms and your time-over-time comparisons.

I've been advocating for a heat map style presentation of histograms for a long time, but I hadn't considered the difficulty that creates when trying to show time over time. That's an interesting one to noodle on.

Thanks for articulating well the value and reasons for difficulty in implementing histograms!

(bias alert - I work on Honeycomb)



Author appears to use "downstream" and "upstream" to refer to "further down the stack" and "further up the stack".

Is this normal usage? Seems reversed to me.


It seems backwards to me too. I usually see these terms when referencing dependencies in software projects, ie: Ubuntu is downstream from the Linux kernel. I would think that you would see the same thing with services.


It depends on whether you look at control flow (who calls whom) or data flow.


This is the issue, yeah.

To rephrase that, "upstream" means "where events come from".


Seems reversed to me too. Literally, "upstream" is water coming from uphill, flowing to you, and then going "downstream" to some other destination.

If the river is data, then stuff you depend on is upstream from you, and things that depend on you are downstream.


Feels right to me, though I'm no authority figure on this matter. The visual I get is that requests flow from the end-user your service is supporting down to things that indirectly support your user.


Is the last paragraph a joke? If so, could someone explain it?


It trips my sarcasm detector, but it's also exactly what I'd expect business-speak to say if they did cut a series short due to lack of money.


Saturation and utilization are different things. For CPU time, utilization would be how many cycles were spent running user tasks over total cycles; saturation would be how much time (cycles?) was spent in the run queue. For disks, utilization could be IOPS; saturation is time spent in I/O wait, or queue sizes. For a network interface, utilization could be Gbps; saturation is total time spent waiting to write to the send queue.
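
A rough host-wide sketch of the distinction on Linux, reading /proc/stat (simplified; the iowait handling and the "beyond capacity" measure are judgment calls):

    import os, time

    def read_stat():
        cpu, running = None, None
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("cpu "):
                    cpu = [int(x) for x in line.split()[1:]]
                elif line.startswith("procs_running"):
                    running = int(line.split()[1])
        return cpu, running

    c1, _ = read_stat()
    time.sleep(1)
    c2, running = read_stat()
    delta = [b - a for a, b in zip(c1, c2)]
    utilization = 1 - (delta[3] + delta[4]) / sum(delta)   # idle + iowait
    saturation = max(0, running - os.cpu_count())          # runnable tasks beyond the CPUs
    print("CPU utilization: %.0f%%, tasks queued beyond capacity: %d"
          % (100 * utilization, saturation))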


Every request to and from the app should be instrumented. Paying attention to the requests to the app is a good start, but you really need detailed instrumentation of all downstream dependencies your service uses to process its requests to understand where the issue is. It's often likely you're slow or throwing errors because a dependency you use is slow or throwing errors. Or maybe the upstream service complaining has changed its request pattern and they're making more expensive queries. There is often a small minority of requests that are responsible for most of the performance issues, so even if the overall volume hasn't changed, the composition and type of requests matter as well.
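
A minimal sketch of wrapping downstream calls with the same stats you keep for your own handlers (the decorator, names, and in-memory store are illustrative, not any particular APM library):

    import time
    from collections import defaultdict

    DOWNSTREAM = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})

    def instrumented(dependency):
        def wrap(fn):
            def inner(*args, **kwargs):
                start = time.time()
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    DOWNSTREAM[dependency]["errors"] += 1
                    raise
                finally:
                    DOWNSTREAM[dependency]["count"] += 1
                    DOWNSTREAM[dependency]["total_ms"] += (time.time() - start) * 1000
            return inner
        return wrap

    @instrumented("user-db")
    def fetch_user(user_id):
        ...   # hypothetical downstream query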


Off Topic: Does anyone know what tools the author uses to make the diagrams?


Paper by 53 or Sketches or Sketchbook or any of the several sketching apps available for the iPad would be my guess.


I assumed it was whiteboard photos with hand-drawn photoshop masks.


This is a very useful, common-sense post. I've created exactly that sort of thing using redis. The expiry mechanism combined with formatting date strings to create time buckets for keys allows quite a bit of power and is simple to write.
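
Something like this sketch of the pattern, using redis-py (key layout and retention are illustrative):

    import time
    import redis

    r = redis.Redis()

    def record_request(endpoint, error=False, retention_seconds=24 * 3600):
        bucket = time.strftime("%Y%m%d%H%M", time.gmtime())   # one key bucket per minute
        key = "stats:%s:%s:%s" % (endpoint, bucket, "err" if error else "ok")
        pipe = r.pipeline()
        pipe.incr(key)
        pipe.expire(key, retention_seconds)   # old buckets expire on their own
        pipe.execute()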



