
Run!

My experience with production incidents


Preamble

Today I’d like to talk about a professional experience that’s close to my heart.

A year ago, I had already been working as a software engineer for 12 years.
I started out as a web developer, then became a deep learning specialist, before moving into cybersecurity to work on an IAM system.
Different as they were, these professional experiences all had one thing in common: I had hardly ever touched a production environment in my life.

During my web experience, as soon as I’d pushed my last commit, the code was no longer in my hands.
My AI experience took place in a proof-of-concept team within an innovation cluster. In my IAM experience, I’d move my Jira tickets to “Done” once I’d pushed my last commit, and then, once again, it was out of my hands.

On top of that, I’ve only ever worked on one project at a time. Well, almost, but you get the idea.

That’s when I applied for a job on a team responsible for the company’s internal information system. I discovered that they were developing and operating around fifty microservices and middleware components of all kinds.
I joined the team, found out a bit more about what they do, and my boss announced two annual objectives:

  • Be able to handle RUN within 3 months
  • Join the night/weekend on-call rotation 2 months after that

I’ll go into more detail about these two objectives later, but I want to stress that, given my background, it seemed obvious to me that I was incapable of achieving them.

My thought was: “It’s crazy, my new boss has more confidence in me than I do in myself”.

In the end, despite some apprehension, I achieved these objectives without any problems.

How?

The incidents were all over the place, and most of the time I’d never even touched the microservice that triggered the alert, so how did I get through them without too much trouble?

That’s the question I’d like to answer. I’d like to tell you what this team is like, and what organization and tools they’ve put in place to help a newcomer integrate so quickly into the business.

I’m going to tell you what I think are the essential tricks of the RUN trade.

Note: In this article, I’ll confine myself to talking about RUN, which takes place during the day, and not on-call, which takes place at night or on weekends.
Although we’re talking about incidents in both cases, they’re still two different subjects, and I’ll only be concentrating on “daytime on-call”, i.e. RUN.

My first incidents

RUN refers to all activities linked to the maintenance and operation of systems in production. It is often contrasted with BUILD, which represents the design and development phase of new functionalities or infrastructures.
There are many aspects to this: real-time supervision and incident response, management of security patches and fixes, SLA management and monitoring of availability KPIs, etc.

The definition of RUN may vary, but I think the key idea is that RUN is all about the present: how it’s working, what’s happening, how to correct/improve.

In the case of my team, we mainly talk about incident management, although this can vary.

In an ecosystem where numerous microservices coexist, incidents can occur at any time. The causes can be numerous:

  • An edge case not handled by the source code
  • A downstream service that stops responding
  • A problem with the infrastructure
  • Network latency
  • A regression
  • Etc.

And countless other reasons besides, but that gives you an idea.

As luck would have it, I joined the team just as a new need arose, so I was able to develop a new microservice from scratch.
Having written a large part of the code and having it all in my head, I found it relatively easy, once the service was in production, to understand what was happening when an incident occurred.
I’m not saying that finding the solution to a problem that arises on a project you’ve mastered is always easy, but at least you know roughly what you’re getting into.

That’s when I thought to myself:

If it’s relatively easy to form an intuition about a problem on a subject you’ve mastered, it’s much less easy on a subject you don’t know anything about.

During my first week of RUN, I wasn’t feeling particularly calm. My phone rang, I looked at the alert, and the name of the microservice vaguely rang a bell.
The alert message looked something like [user-sync-service] creating user: POST /api/users: 403....

I’m not familiar with this project. The first thing I have immediately at my disposal is a prefix containing the name of the service and the start of an error message. Did you notice that in just a few characters, we’ve already got quite a bit of information?
That’s the whole point of properly formatting log and error messages in your code.
A good practice is to divide your alert message into three parts:

  • A prefix containing the name of the affected service
  • A very short message describing the operation concerned
  • The error message that has been wrapped

At a glance, I’ve already understood that this is a microservice whose purpose is to synchronize users from one place to another. I can see that while trying to create a user on the destination side via a POST endpoint, it got a 403 error. So it’s potentially an authorization problem.
I’m far from being able to troubleshoot the problem, but I’ve already got a clue to guide my research.

For the record, synchronization, or interconnection, consists of transmitting information from one point to another: typically between two tools that don’t know each other and weren’t specifically designed to communicate, but that each expose a REST API offering CRUD operations.
The idea is: “I take data from one application and send it to another, adapting it to that application’s data format.” That’s the general idea.

Another example: [doc-sync-service] get document: unexpectedly empty result.

Obviously another synchronization. But this time, the problem seems to have arisen when we tried to retrieve a document from the source.

I’d like to take this opportunity to stress the importance of writing error messages properly. I really like the approach proposed in this article, which we’ve adopted: https://preslav.me/2023/04/14/golang-error-handling-is-a-form-of-storytelling/
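To illustrate, here’s a minimal sketch in Go (the language the linked article is about; the function names below are invented for the example) of how each layer can wrap the error it receives with one short piece of context, so that the final alert reads like a little story:

package main

import (
    "errors"
    "fmt"
)

// postUser stands in for the real HTTP call; here we simply simulate a 403 response.
func postUser() error {
    return errors.New("POST /api/users: 403 Forbidden")
}

// createUser adds exactly one short piece of context and wraps the rest.
func createUser() error {
    if err := postUser(); err != nil {
        return fmt.Errorf("creating user: %w", err)
    }
    return nil
}

func main() {
    if err := createUser(); err != nil {
        // The service-name prefix is added at the outermost layer, where it is known.
        fmt.Printf("[user-sync-service] %v\n", err)
        // Prints: [user-sync-service] creating user: POST /api/users: 403 Forbidden
    }
}

Each function describes only its own step; it’s the chain of wrapped errors that ends up in the alert.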

Obviously, I’m going to need to get to know all the projects on my team very quickly. The aim isn’t to troubleshoot with my eyes shut for the rest of my life, but I prefer to keep two things in mind:

  • On a dense perimeter, nobody will have the same level of knowledge of all projects.
  • There will always be potential new people to join the team.

In conclusion: never rely on the team’s experience, assuming that everyone knows the project inside out, and always be as clear and concise as possible when dealing with errors.

The course of a RUN week

I’m starting to get an idea of what incidents can look like, and now it’s time to start my first week of RUN.

I think there are a multitude of ways to manage the RUN, so rather than trying to be exhaustive, I’ll simply share our method and what I’ve learned from it.

To begin with, we work in RUN weeks. It’s a classic setup, although not universal; it depends on the team. It means we take it in turns to be responsible for the week.
Being in charge of the RUN means handling whatever is happening live, and live comes first.
This involves a number of things:

  • Carrying out the morning check
  • Taking all incoming incidents in real time
  • Monitoring ongoing incident tickets
  • Being the team’s point of contact for urgent matters

Let’s take a closer look.

Morning check

The morning check consists of consulting a central dashboard covering all our services.
This dashboard is designed not to be a messy aggregation of all the team’s dashboards, but a digest of the most important data.
Metrics and colors are very clear:

  • Green: all’s well
  • Orange: something to watch
  • Red: something’s wrong, action to be taken

With a glance and a few scrolls, I need to know the overall status of our entire perimeter.
This allows me both to check that no major problems have occurred overnight, and to anticipate potential incidents for the day ahead.

Incidents

The quality of incident handling is measured by two metrics:

  • TTA: Time To Acknowledge, the time it takes to signal that you are aware of the incident.
  • TRS: Time To Restore Service, the time it takes to resolve the incident and bring the situation back to normal.

When an incident occurs, the priority is to acknowledge it, then take two actions:

  • Assign it to the corresponding project (hence the need for clear log messages)
  • Requalify its priority if necessary

Yes, because there are several levels of priority for incidents, depending on severity and urgency.
They are generally classified from P1 (maximum urgency) to P5 (minimum urgency).
Depending on the team involved, a P1 will have to be resolved in a very short time (generally between 2 and 4 hours), while a P5 may take a week or more.

Obviously, for a P1 that occurs during the day (i.e. during the RUN), I’m not alone: it becomes the priority of the whole team.

Monitor tickets

Incidents in progress take the form of tickets, usually gathered on a dedicated dashboard.
It might look something like this:
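Picture a handful of open tickets, each with its project, its priority and the time left before its resolution deadline (the values below are invented for illustration):

ticket | project   | priority | time left before breach
--------------------------------------------------------
 #1042 | project W | P3       | 6 hours
 #1057 | project S | P1       | 2 hours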

Here, I can see that project W, with its remaining resolution time, was the priority, until project S arrived with a much higher priority.
The person in charge of the RUN must always keep an eye on all the tickets, whether assigned to them or not, to avoid letting a resolution time exceed the maximum allowed.
When that happens, it’s called a breach: the time limit has been exceeded (or, more generally, the SLA has been breached).

Be the point of contact

A team is rarely an island; it often shares its ecosystem with other teams. When you need to escalate something urgent to another team, whether a specific request or a bug you’ve found, you contact them in the hope of getting a quick answer.
Whatever your communication medium (Slack, Webex, …), if you ask a question in a channel dedicated to a team, that team needs to be at least somewhat responsive.

This responsiveness is ensured by having a member of the team make it a priority to monitor the channels that concern the team, so they can respond quickly and relay the information if necessary.

Troubleshooting

Having said that, let’s get to the heart of incident management: troubleshooting.

Oracle definition: Troubleshooting is a logical and systematic form of problem solving. It consists of searching for the source of a problem, identifying its symptoms and eliminating potential causes until the problem is resolved.

Troubleshooting is like carrying out an investigation: the aim is to understand what has happened with the information available to us, in order to restore service.
There are quite a few methods of troubleshooting, but I’d like to focus on the one I use most: Bottom-Up Troubleshooting.

The “Bottom-Up Troubleshooting” method starts by examining the physical components of a system (the lowest layer), then works its way up the layers of the OSI model until the cause of the problem is identified. This approach is effective when the problem is suspected to lie in the lower layers of the network.

Applying this method to a purely software context, we start by checking the execution environment: making sure the application is running, that its processes are active and that the deployment went through correctly. Next, we check system dependencies, such as the database and associated services, as well as the versions of the languages and runtimes used.
We then test connectivity: does the web server forward requests correctly? Does the application have access to the resources it needs?
Then we analyze configurations and logs to identify any errors.
If all seems in order, we examine the source code for recent bugs or problematic modifications.
Finally, we test the user interface by inspecting the browser console and network requests to see whether the problem is coming from the frontend or the backend.

I don’t claim to be exhaustive here, but it gives an idea of the approach to take when conducting an investigation.

Determining the entire path that led to the error in this way is called Root Cause Analysis (RCA): identifying the root causes of faults or problems.

When an incident occurs, especially when it’s a high priority, there’s one question I think is essential to ask:

What problem am I trying to solve?

“It doesn’t work” => what doesn’t work? All of it? A particular scope?
Are we able to reproduce the error systematically?

Although I’ve been talking about troubleshooting in the sense of solving an investigation, we must never lose sight of the initial objective: restoring service, and as quickly as possible.

When a customer loses access to the website you host for them, it’s easy to fall into the trap of diving straight into root cause analysis. There’s a problem, and you want to understand why.

Sometimes we have to accept that understanding the cause of a problem can take time, and prioritize service availability above all else.

Case study 1: managing emergencies

I’d like to start with a practical example of an emergency situation. Let’s step away from the RUN for two minutes to talk about a night-time on-call incident.
Your customer has a problem with their website: it has suddenly become very slow, and a large number of incoming requests are being lost.

It’s 4am.

You investigate and see that the website is load-balanced across 3 servers. Two of them are not responding at all. The website gets huge traffic, it’s a top priority, and you don’t have a minute to lose. Asking the customer to wait for a full troubleshooting session before their service is restored is unthinkable.
You need to buy time.

Hastily restarting machines that are no longer responding is out of the question.
My first instinct would be to head for the load balancer to redirect all incoming requests to the only server still standing.

Since load balancing wasn’t put in place for nothing, this solution is obviously not viable in the long term, but at least service has been restored.
We still need to make sure that the only functional server will be able to hold the load for the next few hours, and we’ll need to do some extra monitoring (hypercare) before going back to bed.
The service is hanging on by a thread, but it’ll last through the night until my colleagues wake up.

The next day, we’ll be able to carry out all the necessary troubleshooting as a team, in order to correct the problem permanently and distribute the load across the 3 machines once again.

Case study 2: thinking outside the box

Let’s move on to a slightly less urgent case, typical of the RUN. There are so many examples of troubleshooting that it’s hard to pick just one.
I chose the case I’m about to present for two reasons: it’s one of the first I handled in my current job, and it’s the first where I realized the importance of stepping outside my own frame of thinking.

What I mean by this is that, as I said earlier, when an incident occurs, we immediately form an intuition based on the little information we have at first sight, and then dig deeper. But there’s much more to troubleshooting than that. Most of the time, there’s a whole context to take into account: business constraints, the services that revolve around the one concerned, the infrastructure on which it runs, etc.

I’m going to modify the example slightly for reasons of confidentiality, but the principle will remain the same.

Let’s take our microservice for synchronizing users from point A to point B: [user-sync-service] creating user: POST /api/users: 500....
This is a 500 error. The microservice has failed to push users to the destination. A failed push.

My first instinct is to look at the metrics dashboard to assess the extent of the damage. I see that on that route, around 20% of requests return a 500 while 80% return a 200.
So we can already say that not everything is down.

I also notice that this synchronization sends a lot of requests in a short space of time.
My first thought is a rate-limiting problem, but looking at the history of previous synchronizations, I see that the volume is similar to what we’ve always had.

Next, let’s take a look at the logs.

Note: a well-written log message, error or not, should contain only a very simple description of what we were trying to do.
All other information goes into the log fields, which we can then use to filter.
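To make this concrete, here’s a minimal sketch of what such a log call could look like, assuming Go and its standard log/slog package (the createUser stub is invented for the example, and the real setup may differ):

package main

import (
    "errors"
    "log/slog"
    "os"
)

// createUser stands in for the real synchronization call; here it simply fails.
func createUser(id string) error {
    return errors.New("POST /api/users: 500")
}

func main() {
    // JSON output, so that the log platform can index every field.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    if err := createUser("42"); err != nil {
        // The message stays short and stable; everything else becomes a filterable field.
        logger.Error("creating user",
            slog.String("operation", "createUser"),
            slog.String("user_id", "42"),
            slog.String("error", err.Error()),
        )
    }
}

With every message carrying structured fields like these, filtering on operation, or on infrastructure fields such as pod_id, becomes trivial.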

So I start by filtering on all the log messages that contain an error, looking for something they have in common. Knowing that we always include an “operation” field, a sort of code for the operation being performed when the message was emitted, I decide to display it as well.

message              | operation
---------------------------------
error: creating user | createUser
error: creating user | createUser
error: creating user | createUser
...

The first thing I notice is that all the errors have the same message and the same operation.
So there seems to be a problem with user creation, or so the messages seem to indicate.
I could then dig into the user-creation code to see what’s wrong, but I remember that a good proportion of the requests returned a 200, and I tell myself that filtering only on errors is probably not enough.

I therefore remove this filter and display all the messages:

message                | operation
-----------------------------------
error:   creating user | createUser
success: creating user | createUser
error:   creating user | createUser
success: creating user | createUser
success: creating user | createUser
error:   creating user | createUser

In fact, plenty of createUser operations go perfectly well. The problem may not lie with the operation itself.
I wander through the available fields and one of them catches my eye: pod_id, the ID of the pod on which the microservice is running, on the infrastructure side.

I decide to display it:

message                | operation  | pod_id
-----------------------------------------------------
error:   creating user | createUser | app-54fb9-n4vkg
success: creating user | createUser | app-54fb9-n4vkg
error:   creating user | createUser | app-54fb9-n4vkg
success: creating user | createUser | app-54fb9-n4vkg
success: creating user | createUser | app-54fb9-n4vkg
error:   creating user | createUser | app-54fb9-n4vkg

And then it hits me: after checking, all the messages, success and error, come from the same pod.
And yet, I know a little about this microservice and I know that it’s supposed to run on two pods, not least to spread the load. This is suspicious.

So rather than diving into the source code, I logged on to the infrastructure to check the status of the pods, and bingo: one of them was down.
The whole load was then concentrated on a single pod, which couldn’t handle it and dropped some of the requests.
So it was a request overload after all.

I restarted the pod, relaunched the synchronization, the load was once again distributed between the two pods, and everything went through.
All that’s left is to find out why the pod went down in the first place, but a large part of the root cause has been established.
We can now close the incident.

This example shows that having an overview of the information we have at our disposal can enable us to broaden our intuition towards a cause that may not have been suspected at first.

Five whys

I’d like to finish talking about troubleshooting by mentioning another method called “five whys”.

Wikipedia definition: The five whys are an iterative questioning technique used to explore the cause-and-effect relationships underlying a particular problem.
The main aim of this technique is to determine the root cause of a defect or problem by repeating the “why?” question five times, each time linking the current “why” to the answer of the previous “why”.
The method asserts that the answer to the fifth “why” posed in this way should reveal the root cause of the problem.

Example:

Problem: I’m late for work.

Why? I left home later than usual.
Why? My alarm didn’t go off.
Why? My phone was off.
Why? The battery was empty.
Why? I didn’t plug in my phone before going to bed.

The cause of my lateness at work would therefore be the fact that I had forgotten to charge my phone, and one solution to prevent this from happening again would be for me to pay more attention to it in future.

This is a classic example of the advantages and disadvantages of this method.

  • Advantage: it encourages you to take a step back and look at what’s going on.
  • Disadvantage: it tends to over-simplify problems. Remembering to charge my phone is no guarantee that I’ll be on time for work in the days to come.

Let’s take another, more technical example:

Problem: Users can no longer connect to the application.

Why? The authentication system returns an error 500.
Why? The authentication service is unable to query the user database.
Why? The authentication service cannot establish a connection to the database.
Why? The database server is unreachable.
Why? A recent firewall configuration change blocked the authentication service from accessing the database.

This method is still useful for taking a step back before rushing headlong into resolving an incident, and is often used when writing post-mortems to gain another view of the problem.

Troubleshooting is a fairly vast field, so I don’t claim to have been exhaustive, but I hope I’ve been able to give you some food for thought on the subject.

Be a resilient developer

Here, I’ve outlined the main points I wanted to make about the RUN.
However, I do have one question:

How is all this possible?

What I mean is that the reason I was able to get to grips with these tools and find information so quickly during an incident is that those tools and that information exist in the first place.

Software and microservices are written by developers, and these developers have their part to play in the troubleshooting process well upstream of the incidents themselves.

I wanted to cover the development practices that good incident management relies on, and to that end I’ve written a dedicated blog post, Be a resilient developer.

Don’t hesitate to read it if you’d like to know more about the day-to-day life of a developer working with incidents; otherwise, I’ll let you finish this read with one last word!

A word about night on-call

This concludes this blog post on RUN incident management.
I’ve chosen to focus on this aspect of incident management because it represents more of my daily life than night shifts, but it really is something that can vary from team to team.

I want to leave you with a little blog post that I really liked about things that I think are useful to keep in mind when you’re on-call at night: What I tell people new to on-call

I hope these few lines (or rather a few pages) have taught you a few things, and given you an idea of the day-to-day life of engineers who regularly come into contact with incidents.

Thanks for reading all the way to the end!

