Preamble
Last week, I experienced a software incident that made a particular impression on me.
Those of you who know me or have read my other articles know that I work in a field where I regularly deal with incidents.
However, the subject I would like to talk to you about has nothing to do with my work.
It concerns my volunteer project.
Everything turned out fine, but I would like to share what I learned from the experience.
I made a mistake, but I think it’s always better to share these things rather than keep them to myself, so that I don’t repeat the same mistakes.
This article will recount the story of this incident in order to highlight the lessons to be learned. If you prefer a quick read, you can skip directly to this section: Lessons to be learned
Context
A little background: for the past five years, I have been developing an Android app for tracking chronic symptoms. It helps people track their symptoms as well as their lifestyle, take quick notes, and view clear daily summaries.
The data is not stored anywhere online. This is a conscious choice on my part. As the data relates to users’ health, I don’t want to be connected to it in any way.
So, local storage it is.
Users are responsible for their backup files, and that’s fine.
I’ll spare you the whole development history, but following the disclosure of numerous Android vulnerabilities and the policies Google introduced in response, my automatic data export feature could no longer work, so I replaced it with a manual export.
I’m talking about the data generated by the user throughout their use of the app, so that it can be imported in the event of data deletion or a change of smartphone.
Another point to note is that I chose to build my app in Angular, based on Ionic/Cordova. This was certainly a debatable choice, but it allowed me to use Angular, a web technology that I particularly like, and to consider cross-platform development in the future.
Little did I know that in the future I would come to hate Cordova, whose features are a little limited for my taste.
Five years later, we arrive at Monday, December 1, 2025.
To date, 11,000 people in around 30 countries use the app on a daily or almost daily basis.
I originally developed this app for myself and then decided to put it on the Play Store so that my friends and family could benefit from it, but I never imagined it would have such a large user base.
In fact, I hadn’t even considered RUN (the day-to-day operation of software in production) at all. Production was definitely not my thing before I started the job I’ve been doing for the past two years.
I’ll avoid going into the whole five-year history and focus on the incident itself, and especially what I learned from it.
It was very naive of me to think at the time that I could deliver an app and let it live without needing to maintain it.
I lacked rigor when it came to updates, which was my first mistake.
Over time, vulnerabilities are exposed in all technologies. They are gradually patched as Android updates are released. These updates regularly impose new constraints on app developers.
Depending on the updates, if you don’t keep up, two things can happen to the app in question:
- It can be suspended from the Play Store.
- It may remain on the Play Store, but be incompatible with newer versions of Android.
So, by December 1, my app had been unavailable for download on the latest versions of Android for three or four weeks.
So I’m updating the target SDK (target API level) to remain compliant, and taking the opportunity to give some news about the app’s development (from time to time I display a pop-up with news when the update is available).
This involves updating a few dependencies and Cordova plugins.
I do my usual tests: first I create a debug build, check that everything is working properly, then a release build, test it thoroughly, make sure there are no regressions, and it goes into production.
App updates on Android take anywhere from a few hours to a few days. I went to bed, unaware that I would wake up to an unpleasant surprise.
The incident
What happened
Okay, enough about the context, let’s get to the point. As you might have guessed, things didn’t go as planned, and my wake-up call on December 2 was particularly unpleasant.
I received a dozen emails from users telling me the worst possible scenario: “I lost all my data.”
Four of them left 1-star reviews on the Play Store (deserved).
My day begins. My real job is my priority, so before anything else I quickly take the time to run through the first steps of a critical incident:
- Assess the extent of the impact
- Immediate rollback
Then I start my workday.
Lucky break: half of the app’s users are American, and the majority of the app’s users live in time zones behind France. What’s more, the update had only been live for an hour when I woke up.
This means that it is still nighttime for most users when I perform the rollback. Out of 11,000 users, only 100 are potentially affected (100 smartphones on which the version causing the problem is installed).
100 people permanently losing their data is far too many, but it’s already much lower than the worst-case scenario. For everyone else, the update is no longer available.
To my knowledge, around 20 people were actually affected by this incident (they installed the update, launched the app, noticed the data loss, and had never made a manual backup).
At this point, I don’t care about the bad reviews and aggressive messages that come with it. What matters to me right now is that I created this project to help people, and for some of them, I’ve just done the opposite.
Even though I know that these are adults using a free, voluntary app and that it’s up to them to manage their data, I quickly feel guilty.
I respond as quickly as possible to everyone who has contacted me, whether by email or via reviews, and inform them of the situation.
I have no idea what happened, but I focus on my real job. The rollback is complete, the bleeding has stopped, and I’ll see what happens next tonight.
The plan is obvious to me: I cancel all my plans for the rest of the week and spend my evenings after work working on the incident until I find a solution.
What follows are two evenings, and good parts of the nights, spent troubleshooting.
On the second night, December 3, my nerves start to fray. I’ve worked two double shifts without a break, and after hours and hours of troubleshooting, I still haven’t found a solution.
I was already feeling pretty tired.
I didn’t know it at the time, but one of my first intuitions was correct; I just hadn’t been able to test it thoroughly.
For those who are interested, here is a summary of the root cause analysis.
Technical context
First, a little technical background:
A Cordova app is a native application (Android/iOS) that embeds a WebView responsible for executing a locally stored web application (HTML/CSS/JS), with access to native APIs via plugins.
Although the files are local, the WebView serves the application via an internal server bound to a local address (http://localhost or https://localhost, depending on the configuration), which is not a real external server but a secure abstraction provided by the WebView to make the app work like a real modern web app.
Local data is stored via IndexedDB, which is a browser-side NoSQL database.
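To make the "origin" part concrete, here is a minimal TypeScript sketch of how IndexedDB ties data to the origin serving the page. The database and store names are hypothetical and for illustration only; this is not the app's actual code.

```typescript
// Minimal sketch (not the app's actual code): IndexedDB databases are scoped
// to window.location.origin (scheme + host + port), e.g. "http://localhost".
function openLocalDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    // "symptom-tracker" and "entries" are hypothetical names for illustration.
    const request = indexedDB.open('symptom-tracker', 1);
    request.onupgradeneeded = () => {
      // Runs only when this origin has never seen version 1 of the database.
      // For an existing user it should never fire again... unless the origin
      // itself changes and the browser starts from a blank slate.
      request.result.createObjectStore('entries', { keyPath: 'id' });
    };
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// The same call made from "https://localhost" instead of "http://localhost"
// opens a different, empty database, even though the data stored under the
// old origin is still sitting on disk.
```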
Root cause analysis
- cordova-plugin-ionic-webview, the plugin that provides the WebView and is essential for the app to function, underwent a major update. This update was required for the API level I was targeting, so I applied it.
- This update contains a significant breaking change: it no longer exposes the WebView on http://localhost but on https://localhost.
- IndexedDB is strictly isolated by origin (protocol + domain + port). By switching from http:// to https://, the browser considers the app a completely different origin and therefore looks for a completely different (empty) database.
- From the user’s point of view, all the data has been lost. In reality, it has become inaccessible, which is significantly different.
Here is the key information about this incident: the data has not been lost. It is just no longer accessible. This means that if I find a way to revert to http://, users will get their data back. Even better: users who had the automatic update but haven’t opened the app yet wouldn’t even notice the incident if I provide the fix in time.
However, there is one major constraint to consider: an Android app’s internal data cannot be accessed from outside the app on non-rooted devices, and more importantly, it is permanently deleted when the app is uninstalled.
I immediately contact all affected users with whom I am already in contact to warn them not to uninstall the app.
They all respond favorably, except for one who had already uninstalled it out of frustration. Luckily, he hadn’t used the app in a long time.
Then the race against time began: the faster I could patch it, the more I could limit the potential impact.
This is not the time to think about a clean and definitive solution; I have to think efficiently.
The resolution
If you ever find yourself dealing with an incident caused by a breaking change in one of your dependencies, your first instinct should be this: if the technology is well designed, its maintainers probably didn’t ship the breaking change without providing a way to remain backward compatible.
I combed through the changelogs, but they were extremely incomplete and contained nothing useful on this point.
I then tried asking ChatGPT and Gemini, but they were completely clueless. Yes, I gave in to the temptation of the easy answer; how very 2025.
Let’s go back to the good old methods and search Stack Overflow, where I come across this thread: https://stackoverflow.com/questions/74049849/persist-indexeddb-in-cordova-app-when-app-updates. It points to the Cordova configuration parameter designed specifically to preserve backward compatibility in file handling, which looks very much like what I need.
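For readers who have never touched a Cordova project, this kind of setting lives in config.xml as a preference entry. The snippet below is an illustrative sketch only, using the scheme/hostname preferences documented for recent cordova-android versions; the exact names depend on your Cordova and WebView plugin versions, so don't take it as the exact line I shipped.

```xml
<!-- Illustrative sketch only: these lines go inside config.xml's <widget> element.
     Preference names vary across cordova-android / WebView plugin versions,
     so check your plugin's documentation before relying on them. -->
<preference name="scheme" value="http" />
<preference name="hostname" value="localhost" />
```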
I try the fix, and this time I do things properly: I create a beta-testing program and include myself in it to update via the Play Store as a user would on the production version. The feedback loop for this method is awful (sometimes several hours), but it’s the most reliable way to really test an update.
When I originally went live, I reasoned that since I hadn’t touched the data management module, there was no reason for an app update to affect the data. Only this method would have allowed me to notice the problem.
It’s 2 a.m. and I have my answer: it works.
I then submit the release for review, taking care to disable automatic publishing at the end of the review (an option designed precisely because no one knows when the review will finish).
I take the opportunity to place an update pop-up reminding users of the importance of making manual backups.
I communicate once again with the affected users and go to bed. I take my misfortune in stride.
On Friday evening, the review is done and the release is ready to publish. All I have to do is click, and it goes live everywhere.
There’s no way I’m launching the release just before bedtime. I go to sleep and launch the update on Saturday morning so I have the day to do hypercare.
Throughout the day, I receive emails saying, “My data is back!” Everyone is thrilled and thanks me profusely. I can breathe again.
As I write these words, it’s been exactly one week since everything went back to normal.
Summary:
- Initial potential impact: 11,000 users
- Potential impact following rollback: 100 users
- Actual impact: approximately 20 users
- Irreversible impact: 1 user
That’s still one user too many, and nothing to be proud of, not to mention the inconvenience caused to everyone else, but I’m glad the impact was much smaller than I feared at the start of the incident.
Of course, I’m only talking about the resolution of the incident itself here. This fix is only temporary: I’ve already started working on a transparent migration so that the app no longer has to rely on the Cordova setting I had to add (which may be removed in a future version).
This is the end of the story of what was probably the most stressful software incident of my life.
Lessons to be learned
I wrote this blog post mainly for this part, which is by far the most important: the lessons I learned. What I would like to say to my past self, who naively embarked on this volunteer project.
I’ll take this opportunity to share some ideas that seemed obvious to me at the time, but are always worth remembering.
Anticipate RUN. Software in production is alive. Once you launch something into production, you can’t just leave it alone. Regardless of the platform, there will always be a minimum amount of RUN and a duty to update. While this seems obvious to me today, it seemed much less so five years ago.
Anticipate potential success. Just because you’ve never published anything doesn’t mean your work won’t be seen by anyone. If people like your creation, they’ll adopt it. The more widespread the adoption, the more overwhelmed you may feel by your own work. A volunteer project shouldn’t become a thorn in your side. Future work must be anticipated.
It was precisely because I started to see it as a second job when the app was adopted by several thousand people that I ended up detaching myself from it and becoming lax about updates.
Anticipate a communication channel. My app is 100% local, completely cut off from the network. That’s fine, but when you find yourself unable to update it, you find yourself completely cut off from all communication with users. Setting up a call to a service that allows you to display emergency messages, regardless of the version of the app, can be very useful.
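To make that idea concrete, here is a minimal TypeScript sketch of such a check, assuming a hypothetical static JSON file hosted at a URL you control. It illustrates the pattern; it is not code that exists in the app today.

```typescript
// Illustrative sketch of an "emergency message" check. The URL and message
// shape are hypothetical; the point is that the file can be updated at any
// time without shipping a new version of the app.
interface EmergencyMessage {
  id: string;   // used to avoid showing the same message twice
  text: string; // what to display to the user
}

async function fetchEmergencyMessage(): Promise<EmergencyMessage | null> {
  try {
    // A static JSON file on a domain you control, e.g. your app's website.
    const response = await fetch('https://example.org/emergency-message.json');
    if (!response.ok) {
      return null;
    }
    return (await response.json()) as EmergencyMessage;
  } catch {
    // No network, or the file doesn't exist: the app keeps working as usual.
    return null;
  }
}
```

Because the file lives outside the app, you can change the message without publishing a new release, and users on any version will see it the next time they open the app.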
Automatic post-review publishing is useful only for launching a critical release while you are asleep. The end of the review may happen at an unpredictable date and time, but the production rollout shouldn’t.
There is no need to do a full rollout. You can start with 10% of users and deploy gradually to limit any potential impact.
If a beta testing program exists, use it. Testing releases directly by installing APKs works, but it doesn’t show the actual behavior of the app after it has been packaged on the Play Store, nor the behavior of a real update. The beta track is time-consuming and frustrating, but necessary. It’s the only truly valid test before putting an Android app into production on the Play Store.
Communicate quickly. This is one of the most important points. My manager often tells us that there is a huge difference in frustration for a user between an incident where no one communicates and an incident where those responsible communicate quickly. I have seen this very clearly for myself: angry emails quickly turned into messages of understanding and encouragement.
Follow up on leads to the end. This also seems obvious, but after hours of troubleshooting, you can get discouraged and end up like the famous cartoon character who stops digging just before he finds diamonds.
Don’t panic. Assess the impact, roll back, and take the time to analyze. It’s difficult when you’re in a race against time, but that’s often how incidents are. Keeping your cool is essential. That’s advice for my past self, but also my present and future self.
If even one of these tips can help you avoid making the same mistakes I did, I’ll be delighted.
Final words
If you’re interested, the app is called Life-Notes.
It is available on Android, in French and English: https://life-notes.fr/
I have recently resumed development after a break of several years, and new features are on the way!
Updates will once again be regular.
If you are interested in RUN and incident management, I have given a talk on this topic at various events in France: https://www.youtube.com/watch?v=HDDbCymVRhE