Overview
Recently, I have been thinking about my preference for structured logging (details on this in a to-be-written entry). A trivial but significant aspect of this is that once you have switched to structured logging, markers like `INFO`, `DEBUG`, `ERROR`, … don't help. Simply grouping such a wide range of code paths into these categories is insufficient for any kind of analysis.
What is an error?
An error is a local failure scenario that you don't currently understand. Our continuous improvement process for error handling aims to move all of these anomalies in the system from unknown to known.
First we must detect the failure, then decide how to behave, and finally cause an event to be recorded so we can observe the failure.
Failure detection
Most software uses return codes or try/catch blocks; we can visualise failure-detection opportunities as concentric circles.
The innermost circle has the most detail about the current state but the least context. As you move outwards the context broadens but the detail diminishes, until you go beyond the scope of one `Main` process, across your `Cluster`, then beyond, eventually reaching the `Enterprise` at large.
Error Circle | Description |
---|---|
Function | The point at which the error occurs |
Outer Function | Typically business logic code (can span several layers within an application) |
Handler Process | The message processing or HTTP handler |
Main() | Entry point to your process (not all systems use main as the entry point) |
Cluster Aggregation | Covers all instances of your process |
Enterprise Aggregation | The whole collection of services and processes |
`Outer Function` and `Handler Process` could be one or several layers deep.
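As a rough sketch of these circles in code (Go, with hypothetical function and event names), the inner circles are where an error first surfaces and the outer circles are where it is eventually handled and recorded:

```go
// Hypothetical sketch: each layer corresponds to one "error circle".
package main

import (
	"errors"
	"fmt"
	"log"
)

// Function: the point at which the error occurs (most detail, least context).
func fetchCustomer(id string) error {
	return errors.New("connection refused")
}

// Outer Function: business logic wraps the error with more context.
func placeOrder(customerID string) error {
	if err := fetchCustomer(customerID); err != nil {
		return fmt.Errorf("place order for %s: %w", customerID, err)
	}
	return nil
}

// Handler Process: the HTTP or message handler decides how to respond
// and records a named event rather than a generic ERROR line.
func orderHandler(customerID string) {
	if err := placeOrder(customerID); err != nil {
		log.Printf("event=order.place.failed customer_id=%s err=%q", customerID, err)
	}
}

// Main(): the outermost circle within a single process; aggregation across
// the Cluster and Enterprise circles happens outside the process.
func main() {
	orderHandler("cust-42")
}
```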
How to behave
Given this view of system failures, make a decision about the desired behaviour for each condition the system experiences. This may be as simple as returning an error and giving up processing, but can often include some retry strategy.
Either way, by recording the specific failure mode as an event you can start to better understand your system. Furthermore, you can start to react to these events as part of your expected system behaviour.
The key aspect is to decide where best (in the error circles) to record it and what name to give it. This gives you access to the best quality information to query within your monitoring systems.
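A minimal sketch of such a decision, assuming a hypothetical `callDependency` and a simple fixed-delay retry before giving up (the event names are illustrative):

```go
// Sketch: decide the behaviour (retry, then give up) and record each
// specific failure mode as a named event rather than a generic error.
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// callDependency stands in for any remote call that can fail.
func callDependency() error {
	return errors.New("dependency unavailable")
}

func callWithRetry(attempts int, delay time.Duration) error {
	for i := 1; i <= attempts; i++ {
		err := callDependency()
		if err == nil {
			return nil
		}
		// Record the specific, expected failure mode and the retry attempt.
		log.Printf("event=dependency.unavailable.retry attempt=%d err=%q", i, err)
		time.Sleep(delay)
	}
	// Only after the strategy is exhausted do we give up and report upwards.
	log.Print("event=dependency.unavailable.gave-up")
	return fmt.Errorf("dependency unavailable after %d attempts", attempts)
}

func main() {
	if err := callWithRetry(3, 100*time.Millisecond); err != nil {
		log.Printf("event=request.failed err=%q", err)
	}
}
```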
Expect the unexpected
The well-documented fallacies of distributed systems highlight that, at any time, any two components in a system may experience a disruption in connectivity. In addition, software components and services are constantly evolving, which occasionally results in system contracts being broken or assumed behaviours changing.
Given this, any remote service will:
- become unreachable over the network
- time out
- report to be unavailable
- return a fault
- supply an unexpected response
All of these will be observable to any calling client.
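As a sketch of how a caller might detect and name each of these outcomes separately (Go standard library only; the event names are hypothetical):

```go
// Sketch: classify common remote-call outcomes into distinct, named events.
package main

import (
	"errors"
	"log"
	"net"
	"net/http"
	"time"
)

func classifyCall(url string) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		var netErr net.Error
		if errors.As(err, &netErr) && netErr.Timeout() {
			log.Print("event=http.network-timeout.count") // timed out
			return
		}
		log.Print("event=http.unreachable.count") // unreachable over the network
		return
	}
	defer resp.Body.Close()

	switch {
	case resp.StatusCode == http.StatusServiceUnavailable:
		log.Print("event=http.service-unavailable.count") // reports to be unavailable
	case resp.StatusCode >= 500:
		log.Print("event=http.remote-fault.count") // returns a fault
	case resp.StatusCode >= 400:
		log.Print("event=http.unexpected-response.count") // supplies an unexpected response
	default:
		log.Print("event=http.request.ok.count")
	}
}

func main() {
	classifyCall("https://example.com/health")
}
```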
Be specific
The classic approach is to log the exception and stack trace when these occur. A dashboard will then show an increased number of errors by counting exceptions, sometimes grouping instances by type and stack trace.
However, we shouldn't immediately treat these as errors. As seen above, many of these are specific and expected behaviours. As with any expected behaviour, the product team (engineering and product owner/manager) should have a defined response.
- Is it a temporary fault?
- Can the operation be re-tried?
- If re-trying, how far back in our process do we re-try from?
- What is the back-off strategy?
- Do you need a circuit breaker?
- How long can you wait?
- How do you respond to the caller?
- Does anything need to be undone/rolled-back?
Under these conditions you should deal with the scenario (re-try, give up, etc.) and emit some kind of event to your diagnostics/metrics system. It is only when you don't recognise the problem, for example in your outermost error handler (see `Main()` above), that it can be classified as an unknown error.
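A sketch of that outermost classification, assuming a hypothetical `handleRequest` wrapper: anything that escapes the expected, named behaviours further in is recorded as an unknown error:

```go
// Sketch: only the outermost handler classifies a problem as unknown.
package main

import "log"

func handleRequest() {
	defer func() {
		// Anything reaching this point was not handled by an expected,
		// named behaviour further in, so classify it as unknown.
		if r := recover(); r != nil {
			log.Printf("event=error.unknown.count detail=%v", r)
		}
	}()

	process() // expected failure modes are named and handled inside
}

func process() {
	panic("something nobody anticipated")
}

func main() {
	handleRequest()
}
```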
Name behaviours
In order to actually understand what your system is doing at any given time you must instrument it. If we accept that broad categories such as `ERROR` are too coarse, we must define a higher-fidelity naming scheme. Your naming approach may depend on the capabilities of your tooling (etcs, statsd, splunk, ELK stack, etc.), but a left-to-right, coarse-to-fine-grained name may help. For example:
http.request.my-handler.count
db.query.customer.duration
http.network-timeout.count
message.validation-error
queue.a-queue-name.count
Let's examine one of these names, `http.request.my-handler.count`: it starts with the coarse-grained `http`, this is a `request` (rather than a `response`) for a specific handler, `my-handler`, and the event is a simple `count` with no additional metadata.
Add any additional useful attributes to your events (structured logging helps here): ids and enumerations, not narratives.
Names tend to end with a suffix for the event type, such as `count` or `duration`, and start with the part of the system where the event occurred.
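A sketch of emitting such named events with structured attributes, using Go's log/slog package (the event names, ids and fields are hypothetical):

```go
// Sketch: coarse-to-fine event names plus ids and enumerations as
// structured attributes (no narrative messages).
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Counter-style event: the name says where and what, attributes carry ids.
	logger.Info("http.request.my-handler.count",
		slog.String("request_id", "req-123"),
		slog.String("customer_id", "cust-42"),
	)

	// Duration-style event: the suffix names the measurement type.
	logger.Info("db.query.customer.duration",
		slog.Duration("elapsed", 34*time.Millisecond),
		slog.String("query", "customer-by-id"),
	)
}
```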
Because every interesting behaviour in an application is journaled, it is possible to view either all events for a single logical process or an aggregate for a particular type of process, e.g. what happened for this request, or how my database queries are performing.
Conclusion
You may think this is just a semantic argument, but once you acknowledge that errors are simply behaviours that you don't currently understand, you can stop worrying about overly broad KPIs like error count and focus on how your system is actually behaving.
It is only by observation from the outside, at the point where these events are aggregated, that you can truly evaluate the health of your system.
As soon as you see a new error, triage it and deal with it; then it is no longer an error.
Further reading
Cindy Sridharan
Michael Nygard
- http://www.michaelnygard.com/blog/2016/11/fault/
- http://www.michaelnygard.com/blog/2016/11/availability-and-stability/
Peter Bourgon
- http://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
- http://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html