Recently, I have been thinking about my preference for structured logging (details in a to-be-written entry). A trivial but significant consequence is that, once you have switched to structured logging, severity markers like `ERROR` don't help. Grouping such a wide range of code paths into these broad categories is insufficient for any kind of analysis.
What is an error?
An error is a local failure scenario that you don't currently understand. Our continuous improvement process for error handling aims to move all these anomalies in the system from unknown to known.
First we must detect the failure, then decide how to behave, and record an event so we can observe the failure.
Most software uses return codes or try/catch blocks to detect failure; we can visualise the detection opportunities as concentric circles. The innermost circle has the most detail about the current state but the least context. As you move out, the context broadens but the detail diminishes, until you go beyond the scope of one `Main` process, across your `Cluster`, and eventually reach the `Enterprise` at large.
| Circle | Description |
| --- | --- |
| Error site | The point at which the error occurs |
| Outer Function | Typically business logic code (can span several layers within an application) |
| Handler | The message processing or HTTP handler |
| Main | Entry point to your process (not all systems use one) |
| Cluster | Covers all instances of your process |
| Enterprise | The whole collection of services and processes |

`Outer Function` and `Handler` could be one or several layers deep.
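As a minimal sketch (Python; the event name `payment.timeout` and all function names are hypothetical), the inner circles map naturally onto nested scopes. The handler still has enough context to record a specific, named event, while anything that escapes to the entry point is, by definition, unknown:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("circles")

def business_logic(order_id):
    # Innermost circle: the point at which the error occurs.
    # Most detail (we know exactly which order failed), least context.
    raise TimeoutError(f"payment service timed out for order {order_id}")

def handle_message(message):
    # Handler circle: we still know which message we were processing,
    # so we can record a specific, named event and choose a behaviour.
    try:
        business_logic(message["order_id"])
    except TimeoutError:
        logger.info("payment.timeout order_id=%s", message["order_id"])
        return "retry"  # a known failure mode, not an "error"

def main():
    # Outermost circle in this process: anything reaching here is a
    # failure mode we did not anticipate -- an unknown error.
    try:
        return handle_message({"order_id": 42})
    except Exception:
        logger.exception("unknown.error")

result = main()  # → "retry": the timeout was recognised at the Handler circle
```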
How to behave
Given this view of system failures, make a decision about the desired behaviour for each condition the system experiences. This may be as simple as returning an error and giving up processing, but can often include some retry strategy.
Either way, by recording the specific failure mode as an event you can start to better understand your system. Furthermore you can start to react to these events as part of your expected system behaviour.
The key aspect is to decide where best (in the error circles) to record it and what name to give it. This gives you access to the best quality information to query within your monitoring systems.
Expect the unexpected
The well-documented fallacies of distributed systems highlight that, at any time, any two components in a system may experience a disruption in connectivity. In addition, software components and services are constantly evolving, which occasionally results in system contracts being broken or assumed behaviours changing.
Given this, any remote service will, at some point:
- become unreachable over the network
- time out
- report to be unavailable
- return a fault
- supply an unexpected response
All of which will be observable to any calling client.
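Each of these failure modes can be recognised and named at the call site rather than lumped into a generic error. A sketch in Python using the standard library's `urllib` exceptions (the event names are illustrative, not a standard; note that `HTTPError` must be checked before its parent class `URLError`):

```python
import socket
import urllib.error

def classify_failure(exc):
    """Map an expected remote-failure mode to a specific event name."""
    if isinstance(exc, socket.timeout):
        return "remote.timeout"               # time out
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 503:
            return "remote.unavailable"       # reports to be unavailable
        return "remote.fault"                 # returns a fault
    if isinstance(exc, urllib.error.URLError):
        return "remote.unreachable"           # network disruption
    if isinstance(exc, ValueError):
        return "remote.unexpected-response"   # e.g. an unparsable body
    return None  # not a recognised failure mode: an unknown error

print(classify_failure(socket.timeout()))  # → remote.timeout
```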
The classic approach is to log the exception and stack trace when these occur. Some dashboard then shows an increased number of errors by counting exceptions, sometimes grouping instances by type and stack trace.
However, we shouldn't immediately treat these as errors. As seen above, many of them are specific and expected behaviours. As with any expected behaviour, the product team (engineering and product owner/manager) should have a defined response:
- Is it a temporary fault?
- Can the operation be re-tried?
- If re-trying, how far back in our process do we re-try from?
- What is the back-off strategy?
- Do you need a circuit breaker?
- How long can you wait?
- How do you respond to the caller?
- Does anything need to be undone/rolled-back?
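Several of those answers can be captured directly in code. A minimal sketch of a bounded retry with exponential back-off, emitting a named event per attempt; `emit` is a hypothetical stand-in for whatever diagnostics/metrics client you use:

```python
import time

def emit(name, **attrs):
    # Stand-in for your diagnostics/metrics client.
    print(name, attrs)

def with_retries(operation, attempts=3, base_delay=0.1):
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError:  # a temporary fault we have decided is retryable
            emit("operation.timeout", attempt=attempt)
            if attempt == attempts:
                emit("operation.gave-up", attempts=attempts)
                raise  # how do we respond to the caller? here: re-raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # back-off strategy

# An operation that fails twice, then succeeds.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("temporary fault")
    return "ok"

result = with_retries(flaky, base_delay=0)  # → "ok" after two retries
```

The timeouts are handled and journaled as expected behaviour; only exhausting the retry budget escalates the problem to the caller.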
Under these conditions you should deal with the scenario (re-try, give up, etc.) and emit some kind of event to your diagnostics/metrics system. It is only when you don't recognise the problem, for example in your outermost error handler (see `Main` above), that it can be classified as an unknown error.
In order to actually understand what your system is doing at any given time, you must instrument it. If we accept that broad categories such as `ERROR` are too coarse, we must define a higher-fidelity naming system. Your naming approach may depend on the capabilities of your tooling (StatsD, Splunk, the ELK stack, etc.), but a left-to-right, coarse-to-fine-grained name may help. For example:
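Some example names (only the first appears in this post; the other two are hypothetical companions following the same left-to-right scheme):

```
http.request.my-handler.count
http.response.my-handler.duration
db.query.order-lookup.duration
```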
Let's examine one of these names: `http.request.my-handler.count`. It starts with the coarse-grained `http`; this is a `request` (rather than a `response`) for a specific handler, `my-handler`; and the event is a simple `count` with no additional metadata.
Add any additional useful attributes to your events (structured logging helps here): ids and enumerations, not narratives.
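A sketch of what such an event might look like on the wire, as structured JSON; all field names and values here are illustrative:

```python
import json

# Ids and enumerations, not narrative prose.
event = {
    "name": "payment.timeout",  # the coarse-to-fine event name
    "order_id": "ord-1234",     # an id, not "the order for our biggest customer"
    "attempt": 2,               # where we are in the retry loop
    "outcome": "retrying",      # a value from a fixed vocabulary, not a sentence
}
print(json.dumps(event))
```

Because every attribute is an id, a number, or an enumerated value, monitoring systems can filter and aggregate on them directly, which free-text log messages cannot offer.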
Names tend to end with a suffix for the event type, such as `count` or `duration`, and start with the part of the system where the event occurred.
Because every interesting behaviour in an application is journaled, it is possible to view either all events for a single logical process or an aggregate for a particular type of process, e.g. what happened for this request, or how are my database queries performing?
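Those two views can be sketched over an in-memory journal (in practice your log store or metrics system answers these queries; event and field names are illustrative, with `request_id` as an assumed correlation id):

```python
from collections import defaultdict

events = [
    {"request_id": "r1", "name": "http.request.my-handler.count"},
    {"request_id": "r1", "name": "db.query.order-lookup.duration", "ms": 12},
    {"request_id": "r2", "name": "db.query.order-lookup.duration", "ms": 30},
]

# "What happened for this request?" -- every event for one logical process.
for_r1 = [e["name"] for e in events if e["request_id"] == "r1"]

# "How are my database queries performing?" -- aggregate by event name.
durations = defaultdict(list)
for e in events:
    if e["name"].endswith(".duration"):
        durations[e["name"]].append(e["ms"])
```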
You may think this is just a semantic argument but, once you acknowledge that errors are simply behaviours that you don’t currently understand, you can stop worrying about overly broad KPIs like error count and focus on how your system is actually behaving.
It is only by observation from the outside, at the point where these events are aggregated, that you can truly evaluate the health of your system.
As soon as you see a new error, triage it and deal with it, then it’s no longer an error.