Many problems in IT start off looking like one thing and end up having a completely unrelated root cause. I've worked with databases for decades in my career, and while I wouldn't consider myself a DBA (Database Administrator), having a strong understanding of databases is crucial.
One of our clients runs version control software backed by a Postgres database, and over time our team noticed that disk space on the data drive was being consumed at an unusually rapid rate.
Collaborative Insights: Unraveling Usage Patterns and Data Bloat
Our team of experts at iuvo did not have much insight into how the client’s engineers used the system or what kind of growth would be typical during a development cycle. We collaborated with an external consultant, who worked as part of the client's staff and shared responsibility for the infrastructure, to gain insights into the system's history.
iuvo, along with this external consultant, analyzed usage and tried to determine what the issue was. Disk space was recovered by deleting old, unneeded backups, along with other unnecessary software. Our initial suspicion was that the root cause was not excessive growth, but rather unnecessary data accumulating over time.
Unfortunately, weeks later all the space our team had freed up was used again. This is when I started to feel that something was wrong.
In IT work, the root problem is often not as obvious as the symptoms make it appear to be.
In this case, it seemed likely that the growth was driven by increased storage needs within the application, and that became the focal point of our discussion. The external consultant was pushing for a new, larger server with a more complicated setup and migration, which would fix the situation from their standpoint while saddling our client with unnecessary expenses.
Root Cause Analysis: Posing the Right Questions
It did not feel right to me; the question of “why now?” had not been answered. A new, more expensive server was not a fix, it was a bandage. We needed to determine what had suddenly caused all this growth when, for years, it had been minimal. I believe that the only stupid questions are the ones that are not asked. Questions are how root cause analysis gets done.
In elementary school I learned the 5 W's of writing/research (who, what, where, when, why) and it is an excellent mindset to apply to technology troubleshooting. I ask questions all day every day.
Who: Who has access to the system? Who has made changes recently? Who owns the product and knows the most about how it is supposed to operate? Who can you get more information from?
What: What is broken? Are there misconfigurations? Does everything look the way you expect it to look? If not, what is different? With computers, everything that is being done is being done because of a command from someone or something.
When: When did this happen? How much growth is happening over a given period of time? Was it happening like this years ago? When did the uptick start? This is where time series graphs are invaluable, as they allow you to go back in time and look for the beginning of your anomalies.
Where: Where is it happening? Where is the space being consumed (see the sketch after this list for one way to answer that on a Postgres system)? Where does the data live and originate from? Where are the machines that are interacting with this environment? Where do the backups live?
Why: “Why” is the first question that I believe most IT professionals will ask themselves for any issue that comes up. Why did this occur? Why did something get changed? Why did the rate of space consumption increase?
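As a concrete illustration of the “where” question on a Postgres-backed system, the largest relations can be listed straight from the catalog. The snippet below is a minimal sketch rather than part of the client's tooling; it assumes the psycopg2 driver is installed, and the connection details are placeholders to adjust for your own environment.
Example query (Python):
import psycopg2

# Placeholder connection details; adjust for your environment.
conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost")
with conn, conn.cursor() as cur:
    # pg_total_relation_size() counts the table plus its indexes and TOAST data.
    cur.execute("""
        SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
        FROM pg_class
        WHERE relkind IN ('r', 'm')
        ORDER BY pg_total_relation_size(oid) DESC
        LIMIT 10
    """)
    for name, size in cur.fetchall():
        print(f"{name}: {size}")
conn.close()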
So, I began thinking about what had changed in the last few months, especially changes that could have caused some sort of residual effect days or weeks later. It reminded me of an incident I experienced a little over a decade ago: the leap second JVM bug on June 30th, 2012. There was fallout for weeks, even months, afterward, in little, obscure things that were related but hard to notice. The issue with this client felt similar. I opened a support case with the product vendor to begin an investigation on their side and figure out how we could see a breakdown of the data their software was keeping in the database.
In addition, I began looking around the system more generally for other anomalies to see if I could prove that my suspicions were correct. I had worked in this environment for a few months, so while I had some familiarity with it, I was still a bit uncertain about what exactly I was looking for.
As I started going through old e-mails and change requests, I remembered some permission scheme changes the external consultant had made a while back, which had broken other parts of the environment and the build processes. I hadn't considered this as a potential issue here because the storage in question was all local disk, but it still used the same network accounts that the NFS-connected systems did for application access and for running services.
Uncovering Backup Setbacks: A Permission Predicament
iuvo had not been involved in any of the backups for the version control system initially, so I was uncertain how the client's external contractors had set them up. I got into the admin interface of the VCS and saw that automated backups were in place and configured to run, but something had prevented them from succeeding. I investigated the system side of things and saw that the local backup directory that Postgres was trying to export to no longer had sufficient permissions, due to the permission/UID changes the external consultant had made weeks prior. The database growth wasn't due to new source code being added at a greater rate than before; instead, the failing Postgres backups were preventing the maintenance tasks that keep the overall database size down from completing.
Postgres uses Write-Ahead Logging (WAL), which records changes in a log before writing them to the data files themselves. Under normal circumstances, Postgres periodically runs archive and maintenance tasks that checkpoint the data in these logs into the database files and then clean up the old WAL files to keep disk usage under control. This archiving is also done as part of taking backups, which we discovered were failing, so these archival tasks, and thus the cleanup of the WAL files, were not happening.
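For anyone chasing a similar symptom, Postgres exposes the archiver's success and failure counts in the pg_stat_archiver view, and the size of the WAL directory itself shows how far behind cleanup has fallen. The snippet below is a minimal sketch of that kind of check, not the exact commands we ran; it assumes the psycopg2 driver and uses a placeholder path for the WAL directory.
Example check (Python):
import os
import psycopg2

# Placeholder connection details; adjust for your environment.
conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost")
with conn, conn.cursor() as cur:
    # pg_stat_archiver tracks successful and failed archive attempts.
    cur.execute("""
        SELECT archived_count, failed_count, last_failed_wal, last_failed_time
        FROM pg_stat_archiver
    """)
    archived, failed, last_failed_wal, last_failed_time = cur.fetchone()
    print(f"archived={archived} failed={failed}")
    print(f"last failed WAL segment: {last_failed_wal} at {last_failed_time}")
conn.close()

# A growing WAL directory (pg_wal on PostgreSQL 10+) is the other telltale sign.
wal_dir = "/var/lib/pgsql/data/pg_wal"  # placeholder data directory path
total_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, files in os.walk(wal_dir)
    for name in files
)
print(f"pg_wal size: {total_bytes / 1024**3:.1f} GiB")
A steadily climbing failed_count alongside a WAL directory measured in tens of gigabytes is exactly the pattern a failing archive step produces, because Postgres will not remove a WAL segment until it has been archived successfully.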
Connecting the Dots: A Resolution
The errors that I discovered in the vendor application logs were not straightforward. The application was indicating that certain directories it expected to find were absent, and it had been unable to execute a successful archive for a considerable period.
Example error:
WARNING: archiving write-ahead log file "0000000100000215000000EA" failed too many times, will try again later.
I followed the paths mentioned in the logs on the filesystem and discovered that the service account running this application had been updated, but the underlying filesystem permissions on the directory the backups were supposed to be writing into had not, causing these failures. I adjusted the file permissions to match the user account running the service, and the logs began showing successful archival transactions and cleanup.
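Drift like this is easier to catch before it snowballs if you routinely compare the service account against the ownership and mode of the backup destination. The sketch below is a generic illustration of that comparison; the account name and directory are hypothetical placeholders, not the client's actual values.
Example check (Python):
import os
import pwd
import grp
import stat

SERVICE_ACCOUNT = "vcs-svc"        # hypothetical service account name
BACKUP_DIR = "/backups/postgres"   # hypothetical backup destination

svc = pwd.getpwnam(SERVICE_ACCOUNT)
st = os.stat(BACKUP_DIR)

owner = pwd.getpwuid(st.st_uid).pw_name
group = grp.getgrgid(st.st_gid).gr_name
print(f"{BACKUP_DIR}: owner={owner} group={group} mode={stat.filemode(st.st_mode)}")

# Can the service account write here? (Primary group only; supplementary
# groups are ignored for brevity in this sketch.)
writable = (
    (st.st_uid == svc.pw_uid and st.st_mode & stat.S_IWUSR)
    or (st.st_gid == svc.pw_gid and st.st_mode & stat.S_IWGRP)
    or (st.st_mode & stat.S_IWOTH)
)
print("writable by service account:", bool(writable))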
Within a day or so, the bloated database had begun cleaning itself up, shrinking back down to normal usage levels. Although I took the lead as the primary technical resource in resolving this issue, I was not alone in the troubleshooting process.
The value of an MSP like iuvo extends beyond my own experience as a consultant with years of expertise in diverse environments. It lies in the strength of our collaborative team of consultants who genuinely care about making a positive impact in both our working environment and for our clients. Together, we ensure top-notch support and solutions for any challenge that comes our way. If you are interested in learning more, please contact us today to get started.
Related Content:
- Backup and Recovery Testing (iuvotech.com)
- Backup Best Practices (iuvotech.com)
- Best Practices for File Safety (iuvotech.com)