If there is one thing I've learned about actually doing software work in production environments over my career, it is: debugging is hard. And once you accept that it is hard, you have no option, in my mind, except to fall back on two essential but different tools to address problems. Firstly, the scientific method and the process of elimination. And secondly: as much creativity as you can muster.
At a healthcare software company where I served many roles (it was a small operation), I encountered a significant problem with no obvious solution. Our CRM was homegrown - the company had been around for much longer than Salesforce, so years prior, our customer records, billing, and recurring invoices were all coded by hand. Over time, we integrated this monolith with a lot of other systems, but it remained the source of truth where customer records started and were stored. It was a relatively simple Windows Forms-type application with a simple database backend.
Then something bizarre started happening: customers started getting deactivated. Almost every day, we'd come in to find some number of well-known customers deactivated. There didn't seem to be a pattern to the ones who were deactivated. They didn't share ordinary traits like alphabetical order, database ID order, length of time in the system, or other familiar attributes. As a homegrown system, it had never had a significant focus on role-based access control. And while there were a few roles, many of us in leadership and the administrative staff "had" to have administrative access to perform other system tasks. With that access, it could have been any of several users at fault, or the fault of some automated process. Looking at the automated processes we had in place, we couldn't find anything that would change or set the value of the "active" flag in the database. In fact, in the end, we realized the only place to change that flag was one check box in the UI, and nowhere else did any code we could find even touch that flag for a write.
Having little else to go on, we added more discrete auditing to the system to figure out who could be checking this box and deactivating the customers. Once that logging was in place, we found an executive assistant whose account was causing the deactivations. She was mortified - she didn't have any knowledge of how this could be happening and wasn't - as far as she knew - intentionally deactivating customers. There was another odd pattern in the data: the deactivations all happened between about 4:55 pm and 5:05 pm. We knew something was happening at the end of the workday, but we didn't know what - so I decided a good old-fashioned "stake out" was in order. Okay, it wasn't actually a stakeout; she knew I was watching...but I asked her to go through her usual end-of-day routine while I observed what was going on.
She started to pack up. She got her lunch bag from the break room and put it on her desk. Then she picked up her purse, put it on the desk, and tidied up the rest of the desk. But the purse wasn't on the desk - it was on the keyboard. Where she put her lunch down relative to her bag made the edge of her bag rest on the Enter key on the right-hand side of her keyboard. While she diligently went around her desk, cleaning and sorting for the end of the day, the CRM system was on the screen. One Enter opened a customer record. The next Enter caused the focus to go from the Customer Name field to the Address field. Enter was still held down, so the form cycled focus through all the tabs. Eventually, it unchecked the "active" checkbox and cycled through to the "Save" button.
Mystery solved! Was the computer making a bunch of noise while this was happening from the stuck key? Yes. Did she notice it? Not really. Should we have sanitized the inputs and locked the system down more to begin with? Sure. Would it have been cheaper to buy a CRM in the long run? Quite possibly.
The real lesson here isn't anything technical. The lesson is that as humans interact with systems - or as systems become complex enough to take actions on their own - mistakes will happen. And while you can't possibly anticipate every one of those mistakes from the outset, when you encounter one, you can work on making sure you have observability at every level so you can see it when it happens. And you can apply creative problem solving - what I like to call "alternative methods for success" - to issues that seem particularly perplexing.
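For readers who want something concrete, here is a minimal sketch of the kind of field-level auditing that helped in this story: record who changed which field and when, then group the "active" flag flips by user and time of day. It is written in Python rather than the original Windows Forms stack, and it is not the actual code from that system. Every name in it (`AuditEvent`, `AuditLog`, `record`, `deactivations_by_user_and_window`) is hypothetical, and the in-memory list stands in for whatever database table a real system would use.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuditEvent:
    """One recorded field change. All names here are hypothetical."""
    user: str
    record_id: int
    field: str
    old_value: object
    new_value: object
    at: datetime


class AuditLog:
    """In-memory stand-in for an audit table."""

    def __init__(self):
        self.events = []

    def record(self, user, record_id, field, old_value, new_value, at=None):
        # Skip no-op saves so the log only contains real changes.
        if old_value == new_value:
            return None
        event = AuditEvent(user, record_id, field, old_value, new_value,
                           at or datetime.now())
        self.events.append(event)
        return event

    def deactivations_by_user_and_window(self, minutes=15):
        """Count active -> inactive flips per (user, time-of-day bucket)."""
        counts = Counter()
        for e in self.events:
            if e.field == "active" and e.old_value is True and e.new_value is False:
                # Round the minute down to the start of its bucket.
                bucket_minute = (e.at.minute // minutes) * minutes
                counts[(e.user, f"{e.at.hour:02d}:{bucket_minute:02d}")] += 1
        return counts
```

Grouping by time of day rather than by date is the point of the second method: a cluster that recurs in the same buckets across many days, for one user, is the sort of signal that suggested watching the end-of-day routine. Note that fixed buckets can split a cluster that straddles a boundary (such as the top of the hour), so it is worth looking at adjacent buckets together.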