# Overview

Recently, I have been thinking about my preference for structured logging (more on this in a to-be-written entry). A trivial but significant consequence is that once you have switched to structured logging, markers like INFO, DEBUG, ERROR,… no longer help. Grouping such a wide range of code paths into these broad categories is insufficient for any kind of analysis.
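To make the contrast concrete, here is a minimal sketch of the difference between a level-tagged log line and a structured event that names the specific behaviour. The event name and attributes are illustrative, not from any particular library:

```python
import json

def emit(event_name, **attributes):
    """Emit a structured event: a specific name plus key/value
    attributes, rather than a free-text message tagged INFO or ERROR."""
    record = {"event": event_name, **attributes}
    return json.dumps(record, sort_keys=True)

# A level-based log line might read:
#   ERROR: payment failed for customer 42
# The structured equivalent names the behaviour itself:
line = emit("payment.provider-timeout", customer_id=42, attempt=1)
```

Counting occurrences of `payment.provider-timeout` tells you far more than counting lines tagged ERROR.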

# What is an error?

An error is a local failure scenario that you don't currently understand. A continuous improvement process for error handling aims to move all these anomalies in the system from unknown to known.

First we must detect the failure, then decide how to behave, and finally record an event so we can observe the failure.

# Failure detection

Most software uses return codes or try/catch blocks to detect failures. We can visualise the failure detection opportunities as concentric circles.

The innermost circle has the most detail about the current state but the least context. As you move out, the context broadens but the detail diminishes, until you go beyond the scope of one Main process, across your Cluster, and eventually reach the Enterprise at large.

| Error Circle | Description |
| --- | --- |
| Function | The point at which the error occurs |
| Outer Function | Typically business logic code (can span several layers within an application) |
| Handler Process | The message processing or HTTP handler |
| Main() | Entry point to your process (not all systems use main as the entry point) |
| Cluster Aggregation | Covers all instances of your process |
| Enterprise Aggregation | The whole collection of services and processes |

Outer Function and Handler Process could each be one or several layers deep.
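The in-process circles can be sketched as nested layers of a hypothetical payment service. The function names and event names here are illustrative; the point is where each circle detects failure and what it knows:

```python
def parse_amount(raw):
    """Innermost circle (Function): most detail, least context."""
    return int(raw)  # raises ValueError on malformed input

def apply_payment(raw_amount):
    """Outer Function: business logic, adds domain context."""
    return {"charged": parse_amount(raw_amount)}

def handle_request(raw_amount, events):
    """Handler Process: decides behaviour and records a named event."""
    try:
        return apply_payment(raw_amount)
    except ValueError:
        events.append("payment.invalid-amount")  # a known failure mode
        return {"error": "invalid amount"}

def main(requests):
    """Main(): last-resort catch for anything no inner circle recognised."""
    events = []
    for raw in requests:
        try:
            handle_request(raw, events)
        except Exception:
            events.append("unknown-error")
    return events
```

Note that only failures nothing inner recognised reach Main() and get the generic `unknown-error` name.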

# How to behave

Given this view of system failures, make a decision about the desired behaviour for each condition the system experiences. This may be as simple as returning an error and giving up, but it often includes some retry strategy.

Either way, by recording the specific failure mode as an event you can start to understand your system better. Furthermore, you can start to react to these events as part of your expected system behaviour.

The key decision is where best (in the error circles) to record it and what name to give it. This gives you the best quality information to query within your monitoring systems.
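One way to make these decisions explicit is a per-condition policy table: each known condition maps to an agreed behaviour and the event name recorded for it. This is a sketch with illustrative condition and event names:

```python
# Each known condition maps to an agreed behaviour and an event name.
# The conditions and names here are hypothetical examples.
POLICY = {
    "connection-refused": {"behaviour": "retry",
                           "event": "remote.unreachable.retry"},
    "bad-request":        {"behaviour": "give-up",
                           "event": "request.validation-error"},
}

def decide(condition):
    """Look up the agreed behaviour for a condition; anything missing
    from the table is, by definition, not yet understood."""
    return POLICY.get(condition,
                      {"behaviour": "escalate", "event": "unknown-error"})
```

Every time a condition falls through to `unknown-error`, that is a prompt to triage it and add a row to the table.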

# Expect the unexpected

The well-documented fallacies of distributed systems highlight that, at any time, any two components in a system may experience a disruption in connectivity. In addition, software components and services are constantly evolving, which occasionally results in system contracts being broken or assumed behaviours changing.

Given this, any remote service will:

• become unreachable over the network
• time out
• report to be unavailable
• return a fault
• supply an unexpected response

All of these will be observable to any calling client.
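Each of these failure modes can be given its own event name at the call site. A minimal sketch, using stand-in exception classes (real code would map your HTTP client's actual exceptions and status codes):

```python
# Illustrative stand-ins for the failure modes listed above.
class Unreachable(Exception): pass          # network-level failure
class Timeout(Exception): pass              # no response in time
class Unavailable(Exception): pass          # e.g. an HTTP 503
class Fault(Exception): pass                # e.g. an HTTP 500
class UnexpectedResponse(Exception): pass   # contract broken

FAILURE_EVENTS = {
    Unreachable:        "remote.unreachable",
    Timeout:            "remote.timeout",
    Unavailable:        "remote.unavailable",
    Fault:              "remote.fault",
    UnexpectedResponse: "remote.unexpected-response",
}

def classify(exc):
    """Return the event name for a known failure mode, else None
    (None means: not yet understood, i.e. an actual error)."""
    return FAILURE_EVENTS.get(type(exc))
```

Anything `classify` returns `None` for is a genuine unknown and belongs in the outermost handler.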

# Be specific

The classic approach is to log the exception and stack trace when these occur. Some dashboard will then show an increased number of errors by counting exceptions, sometimes grouping instances by type and stack trace.

However, we shouldn't immediately treat these as errors. As seen above, many of these are specific and expected behaviours. As with any expected behaviour, the product team (engineering and product owner/manager) should have a defined response:

• Is it a temporary fault?
• Can the operation be re-tried?
• If re-trying, how far back in our process do we re-try from?
• What is the back-off strategy?
• Do you need a circuit breaker?
• How long can you wait?
• How do you respond to the caller?
• Does anything need to be undone/rolled-back?

Under these conditions you should deal with the scenario (re-try, give up, etc.) and emit some kind of event to your diagnostics/metrics system. It is only when you don't recognise the problem, for example in your outermost error handler (see Main() above), that it can be classified as an unknown error.
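A retry with exponential back-off that emits a named event per attempt might be sketched like this. `TimeoutError` stands in for whichever exception your client raises for a recognised temporary fault; anything unrecognised is re-raised so the outermost handler can record it as unknown:

```python
import time

def call_with_retry(operation, events, attempts=3, base_delay=0.0):
    """Retry a known-temporary fault with exponential back-off,
    emitting a named event per attempt. Unrecognised exceptions
    propagate to the outermost handler untouched."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError:  # a recognised, temporary fault
            events.append(("remote.timeout.retry", attempt))
            time.sleep(base_delay * (2 ** (attempt - 1)))
    events.append(("remote.timeout.gave-up", attempts))
    raise TimeoutError("all retries exhausted")
```

Note that both outcomes, recovery after a retry and giving up, leave a specific, queryable event trail rather than a generic error count.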

# Name behaviours

In order to actually understand what your system is doing at any given time you must instrument it. If we accept that broad categories such as ERROR are too coarse, we must define a higher-fidelity naming system. Your naming approach may depend on the capabilities of your tooling (statsd, Splunk, the ELK stack, etc.) but a left-to-right, coarse-to-fine-grained name may help. For example:

• http.request.my-handler.count
• db.query.customer.duration
• http.network-timeout.count
• message.validation-error
• queue.a-queue-name.count

Let's examine one of these names, http.request.my-handler.count: it starts with the coarse-grained http, this is a request (rather than a response) for a specific handler, my-handler, and the event is a simple count with no additional metadata.

Add any additional useful attributes to your events (structured logging helps here): IDs and enumerations, not narratives.

Names tend to end with a suffix giving the event type, such as count or duration, and to start with the part of the system where the event occurred.
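Both conventions can be sketched together: a counter event carries a .count suffix, and a small context manager times a block and emits a .duration event. This is a hand-rolled illustration; statsd-style client libraries provide equivalents directly:

```python
import time

class Timer:
    """Time a block and record a <name>.duration event (a sketch;
    real metrics clients provide this out of the box)."""
    def __init__(self, events, name):
        self.events, self.name = events, name
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    def __exit__(self, *exc):
        elapsed = time.perf_counter() - self.start
        self.events.append((self.name + ".duration", elapsed))
        return False  # never swallow exceptions

events = []
with Timer(events, "db.query.customer"):
    pass  # run the query here
events.append(("http.request.my-handler.count", 1))
```

The coarse-to-fine prefix (db.query.customer) is supplied by the caller, so the suffix convention stays in one place.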

Because every interesting behaviour in an application is journaled, it is possible to view either all events for a single logical process or an aggregate for a particular type of process, e.g. what happened for this request, or how my database queries are performing.
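Those two query shapes, per-request trace and per-type aggregate, can be sketched over a plain list of event dicts. The request_id and value attribute names are illustrative:

```python
def events_for_request(events, request_id):
    """All events for one logical process, selected by its id."""
    return [e for e in events if e.get("request_id") == request_id]

def aggregate(events, name):
    """Mean value across all events of one type, e.g. a query duration."""
    values = [e["value"] for e in events if e["event"] == name]
    return sum(values) / len(values) if values else None
```

In practice your monitoring system (Splunk, Kibana, etc.) runs these queries for you; the naming and attribute discipline is what makes them possible.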

# Conclusion

You may think this is just a semantic argument but, once you acknowledge that errors are simply behaviours that you don’t currently understand, you can stop worrying about overly broad KPIs like error count and focus on how your system is actually behaving.

It is only by observation from the outside, at the point where these events are aggregated, that you can truly evaluate the health of your system.

As soon as you see a new error, triage it and deal with it, then it’s no longer an error.