Solving an application murder mystery
TL;DR: Jump to the NOTE at the bottom to find out why the application had mysteriously stopped.
It was almost the end of the first milestone, and an interim demo of our application was scheduled with the customer. On demo day, the team had freshly deployed the containerized application and its required services on an Ubuntu (18.04.3 LTS) cloud instance. Some team members were still working on things in the background on the server.
Around 12 pm, I was halfway through the demo when suddenly everything stopped working. Pages weren’t loading, nor could Postman complete any API call. I immediately contacted a team member (let’s call him Chris) who I knew was most probably logged into the system, to check what had happened. He said things had stopped for him too. He had been connected to one of the database containers and was forced out of the container’s bash.
He said that, apart from him, most probably no one else was using the server, and that he had not issued any command that could have stopped the application. I took these inputs and got him to restart the application so I could carry on with my demo. I also asked him to look into possible causes of the issue and gather as many details as possible.
The customer had obviously taken note of the situation. I let him know that it was an unusual incident, most likely caused by some unknown or unanticipated external influence on the application. He wasn’t particularly tech savvy, so I did not divulge any more details and completed the demo.
After the demo, I spent some time thinking about the issue. Since the app was freshly deployed and well tested, and Chris was logged into the system and could run commands without issues, a server performance problem was unlikely. Now, if no one from the team had anything to do with it, then it had to be someone else with access to the system, or one of the programs running on it. Apart from us, the customer’s IT team and the cloud provider could access the system. So my suspect list was: the application team, the customer’s IT team, the cloud provider, and the programs running on the system itself.
I quickly scanned through the command history and realized that the default bash configuration appends the latest commands of closed sessions to the history file, recording nothing more than each command and its order. It can be configured to log extra details, such as timestamps, but ours was on the default configuration. It also keeps only a limited number of commands, so it was necessary to take a backup to ensure the list wasn’t lost.
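For reference, this is roughly what a more forensic-friendly bash history setup looks like; these are standard bash variables, and the values here are illustrative, not what we actually had configured.

```shell
# Sketch: make bash history more useful for incident tracing
# (append to ~/.bashrc; values are illustrative).
export HISTTIMEFORMAT='%F %T '   # show a date/time stamp next to each `history` entry
export HISTSIZE=10000            # commands kept in memory per session
export HISTFILESIZE=20000        # lines kept in ~/.bash_history
shopt -s histappend              # append to the history file instead of overwriting it
```

With timestamps recorded, matching commands against an outage window becomes trivial.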
It was difficult to tell from that list which commands had been executed around the time the issue occurred, and by whom. I noted down the section of the list that could plausibly have been executed around the time of deployment, intending to cross-check it with the team during the sync-up call later in the day. In the meantime, I carried on with my priority tasks.
By now, everyone on the team had moved on with their tasks, tying up loose ends for the upcoming delivery. No one could devote any time to digging further into the issue. After the sync-up, I quickly started verifying with the team whether, in fact, no one apart from Chris had been logged in to the server. It turned out there was one more engineer (Jack) logged into the system at the same time. This was new information that could alter the suspect list.
So I asked Jack if he had issued any command that could have had an impact. He said he hadn’t; he had only been executing read-only commands like listing files. So no one from the team had (knowingly) caused the issue. That took us back to the same suspect list as before, with the application team crossed out (for now).
Next we checked if the command sequence Chris executed matched the filtered sequence of commands from history and it did! We could now check for commands post this point for any issues. I asked folks to try and remember any details that could help solve the case.
Chris mentioned that all of the Docker containers had gone down, even the unrelated ones, which was a bit odd. He had to bring the containers back up to get the application running again, though he did not have to restart the Docker service. This meant that either all the Docker processes had been killed, or the Docker service had been restarted.
I quickly checked the server logs. There were signs that applications had been force-killed, which suggested that commands had most probably been issued to stop the containers. This couldn’t have been initiated through our application or services; at least not the way we’d built them. Chris also reported that system resource consumption at the time had been normal.
Meanwhile, Jack tracked down the users logged in that day and found that there were four of them at the time of the issue. Three had addresses in the same private IP range; the fourth was a public IP that was briefly logged in at around the same time as the issue. I figured the three similar IP addresses could only have been obtained through the VPN we used to connect to the cloud instance; the range matched the cloud instance’s private IPs.
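On Ubuntu, reconstructing who was logged in and from where usually comes down to a few standard commands; something along these lines (illustrative, and the btmp check typically needs root):

```shell
# Who is logged in right now, with source IPs
who

# Login/logout records with full timestamps, read from /var/log/wtmp
last -F

# Failed login attempts, read from /var/log/btmp (usually needs root)
sudo lastb -F
```

Comparing the source addresses from `last -F` against the VPN’s private range is how a rogue public IP stands out.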
Fortunately, all three of us (Chris, Jack and I) had been logged in to the system since morning. We each checked our current IP addresses, and the three private IPs belonged to the three of us. We were now left with that one rogue IP address. Jack looked it up, and it turned out to belong to the cloud provider. Jack rested his case: they must have done something that caused the issue, as the login timing was just right.
Everyone, including me, was inclined to believe that something the provider did had affected the setup. I could have reported this to the customer to take forward, since they were the ones interfacing with the cloud provider. But I had no proof that the provider actually had anything to do with the outage; all we had was circumstantial evidence.
I explained this to the team, and understandably, they were not up for any more digging. They had already had a long week preparing for the demo and just wanted to finish their tasks and retire for the day. I wrote to the customer anyway, asking them to check with the cloud provider whether they’d had anything to do with the outage.
The customer got back stating that someone from the cloud provider had indeed logged in to the system, but had not done anything that could have altered the deployment. Assuming they were sure that none of their actions could have directly or indirectly affected the setup, this made the cloud provider a less likely culprit. I temporarily lowered the probability of this suspect.
I had some priority tasks to complete for the day but I wanted to be sure that this never happened again. I would not have been able to focus on other things until I got to the bottom of this. The whole mystery had, by now, occupied me and it was the only thing on my mind. So I put aside all of my work for the day and started to look around for more details.
I wanted to check for hacking attempts, to rule out any manual intervention. It turned out there were many, and they were frequent: some from European regions, but mostly from China. They were all futile, however, as none of these calls went through. I could state this with some certainty because only the HTTPS port was open to the public, all the endpoints were secured, the enabled routes were highly custom, and all access attempts were logged.
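For SSH specifically, Ubuntu logs authentication attempts to `/var/log/auth.log`, and a quick summary of attack traffic can be pulled with something like this (the `awk` field position assumes the usual "Failed password ... from IP port N ssh2" line format, so treat it as a sketch):

```shell
# Count failed SSH password attempts per source IP
sudo grep 'Failed password' /var/log/auth.log \
  | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head

# Successful logins, to spot anything unexpected
sudo grep 'Accepted' /var/log/auth.log
```

A long tail of failures with no unexpected `Accepted` lines is what supports the "futile attempts" conclusion.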
Unless somebody had accessed the instance directly and erased every trace of their presence, or knew the exact routes and had cracked the credentials, or had exploited an unknown bug, it was highly improbable that anyone had hacked into the system at that time. So I assigned it a low probability and moved on to the low-hanging fruit.
I now had to figure out whether only the Docker containers had been stopped, or whether the Docker service itself had been restarted. Checking the latter first would have been a time saver, but there was a lingering thought in my mind that nudged me to check the former first.
A few days earlier, an engineer had set up a cron job to clean up orphaned Docker containers, networks, and so on. I wanted to confirm it wasn’t the source. The engineer was no longer on the team, and no one knew where the job lived on the server. I tracked it down and found there wasn’t much to it, unless the Docker commands it used had relevant open bugs. I assigned it a low probability and went ahead to check whether the Docker service itself had been restarted.
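Tracking down a cron job someone else left behind means checking every place cron reads from; roughly like this on Ubuntu (the sample cleanup entry at the end is purely illustrative of what such a job might look like):

```shell
# Places a cron job can hide:
crontab -l                                # current user's crontab
sudo ls /var/spool/cron/crontabs/         # every user's crontabs
ls /etc/cron.d /etc/cron.daily /etc/cron.weekly /etc/cron.hourly
sudo grep -r docker /etc/cron* /var/spool/cron 2>/dev/null

# A typical docker cleanup entry looks something like this (illustrative):
# 0 3 * * * root docker system prune -f >> /var/log/docker-prune.log 2>&1
```

A `docker system prune` run only removes stopped containers and unused resources, which is one reason such a job is a low-probability suspect for killing running containers.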
I googled to check whether running containers are always stopped automatically when the Docker service restarts. It turns out there is an option to disable the automatic stopping of running containers when the daemon is unavailable; the default, however, is to stop them. The probability that someone on the team had changed this setting was low, and I verified that it had in fact been left untouched in the config file. So now there was only one thing left to do.
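The setting in question is Docker’s "live restore". We had left it at the default (off), but enabling it looks roughly like this:

```shell
# Enable Docker "live restore": running containers survive a daemon
# stop, crash, or upgrade. Writes /etc/docker/daemon.json.
sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "live-restore": true
}
EOF

# Reload the daemon configuration (live-restore can be picked up on reload)
sudo systemctl reload docker
```

With this on, a daemon restart like the one in our outage would have left the application containers running.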
I checked the logs for the Docker service and found a restart at the exact time of the outage. Et voilà! It was now clear that the Docker service restart had stopped all the running containers, which in turn had stopped the application. Either the person who had logged in did it, or some program unknown to me was responsible. Since the bash history had no such command recorded, and all our possible manual suspects were mostly ruled out, the probability that it had happened without manual intervention was high.
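On a systemd-based Ubuntu, the Docker service’s restart history is in the journal; queries along these lines pin down the timing (adjust the window to your outage):

```shell
# Recent docker daemon log entries, including start/stop messages
sudo journalctl -u docker.service --no-pager | tail -50

# Narrow to the suspected outage window
sudo journalctl -u docker.service --since "11:50" --until "12:10"
```

Matching the daemon’s "Stopping Docker Application Container Engine" timestamp against the outage time is the "Et voilà" moment here.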
I looked through the system config and found nothing specific to Docker in there. By now everything was silent around me, and it was 1 am; I realized I had been at it for a couple of hours straight. I had an early call the next day, but I couldn’t stop after coming this far without knowing the root cause. Oh, the thrill!
So I kept at it and started looking through the installed system services to figure out which ones could have had an impact. I sifted through them all and stumbled upon one that was new to me in the Ubuntu distribution: unattended-upgrades. This service installs upgrades automatically, which is useful in some cases (such as security upgrades) but not so much when you have a production server running and don’t want anything unapproved happening that could affect the setup.
This service ran every day at around the same time (about 12 pm), checking for updates and installing them. On that day a containerd update was available, and the service attempted the upgrade. Its logs proved that it had stopped the Docker service, tried to upgrade it, failed, and restarted the Docker service at the exact time of the outage.
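unattended-upgrades keeps its own logs on Ubuntu, and the apt timers that trigger it are visible through systemd; checks along these lines reveal what it did and when:

```shell
# The service's own decision log and the dpkg-level detail of each attempt
ls /var/log/unattended-upgrades/
sudo tail -100 /var/log/unattended-upgrades/unattended-upgrades.log
sudo tail -100 /var/log/unattended-upgrades/unattended-upgrades-dpkg.log

# When the periodic apt timers fire (the daily schedule behind the 12 pm runs)
systemctl list-timers apt-daily.timer apt-daily-upgrade.timer
```

The dpkg log is where a failed containerd upgrade, bracketed by a Docker service stop and restart, would show up with timestamps.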
This was conclusive enough for me, but I had to know whether it had been manually triggered. The uptime of the service, the absence of any related commands in bash history, and the fact that it had executed at precisely the same time on previous days proved otherwise. It had been running since the last reboot of the system and is enabled by default on that Ubuntu version. So the login at the same time was indeed circumstantial, and no person had anything to do with the outage.
Now, all we had to do to prevent this same issue from reoccurring in the future was to either
- block specific updates through this service on all our servers OR
- fully disable it
so we can control the updates of interest on the system and avoid any unintended consequences.
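Concretely, on Ubuntu 18.04 those two options look something like this (the package names in the hold/blacklist are illustrative and depend on how Docker was installed):

```shell
# Option 1: keep unattended-upgrades, but pin the packages we manage ourselves.
# Either add them to the blacklist in /etc/apt/apt.conf.d/50unattended-upgrades:
#   Unattended-Upgrade::Package-Blacklist {
#       "docker-ce";
#       "containerd";
#   };
# ...or hold them at the current version:
sudo apt-mark hold docker-ce containerd

# Option 2: disable automatic upgrades entirely
sudo systemctl stop unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades   # answer "No" to automatic updates
```

Option 1 keeps security patches flowing for everything else, which is usually the safer trade-off on an internet-facing server.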
There were other things we could do to help root-cause analysis (RCA) in the future: alter configurations so that containers keep running across a Docker service restart, log extra details (such as timestamps) in bash history, and so on. The cloud provider could also block certain IP addresses or ranges to reduce the hacking attempts.
We were lucky to have had such a conclusive outcome, and it was all thanks to the logs of the various services available for tracing, plus a bit of luck.
It was 2 am and I finally got the closure and could move on ;)
NOTE: When you google ‘all containers stopped automatically ubuntu’ you get this result. Our issue occurred well before this question was asked, so back then there wasn’t a ready answer that would show up in the immediate search results; at least not with that search query.