my first production bug
This week marked an important event 🍾 - I have been at my current job since the first of March, and it’s been great, but one thing was missing. I hadn’t broken anything yet. This week, that changed. Here’s what happened…
Part of our platform at Vypr is a service called VyPops. It lets our clients gather customer feedback via video insights, and it’s proving popular with clients like Marks and Spencer. We got a message in our team Slack channel from a co-worker asking “Have any VyPops responses come through for this question yet? It doesn’t look like there are any?” I check the database - yes, there appear to be twenty-five answers. She comes back to us saying that they might exist in the database, but they can’t see any in the front end of the platform. Is there an issue with them coming through? I log into the platform and see that yep, there aren’t any videos visible.
We begin investigating. Eventually we get to the logs for the third-party service that uploads the videos, and they’re full of errors. The error shows an incorrectly formatted URL, which causes a 404 and blocks the upload entirely. OK, where does that URL come from? Eventually we find it: it’s generated by passing a file path to a Rails method called helpers.image_url. Doesn’t seem very helpful, if you ask me.
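For context, here’s a rough standalone imitation (plain Ruby, not Rails itself, with a hypothetical hostname) of what an asset-host-aware image_url helper produces when things work: it joins a configured asset host, such as a CloudFront distribution, onto the asset path. Real Rails also fingerprints the filename; that’s omitted here.

```ruby
require "uri"

# Toy stand-in for Rails' image_url asset helper: prepend the
# configured asset host (e.g. a CDN) to the asset path.
# The host below is hypothetical.
def image_url(path, asset_host:)
  URI.join(asset_host, File.join("/assets", path)).to_s
end

puts image_url("thumb.png", asset_host: "https://dxxxx.cloudfront.net")
```

With an asset host configured you get a CloudFront-based URL like the ones in the old, successful uploads; the broken uploads were getting something else entirely.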
My colleague has the smart idea to look at previous successful uploads, and the URL in those is a) valid and b) built on a different base URL - one from CloudFront rather than one of ours. What’s going on?
We look back through the video service upload logs to see when the errors started occurring. I keep clicking back through the pages. This is taking too long. Finally, I find it. The last successful upload was at about 13:55, five days ago. As my colleague says “Didn’t we do a release that day?”, the sinking feeling in my stomach tells me the answer. Yes, we did. We did a release at about 14:00. Full of my code 😬
First, we need to deploy a fix to get the system running again ⚒️. Hardcode the URL for now, that’ll allow the requests to go through successfully. We merge and release this quickly. Now we’re operational again, and I can take a deep breath and start investigating 🕵️.
I find a quick way to use the tests to show me the URL that’s being generated. Checking out each commit in turn, starting with the one that got released. Running the tests to find the commit where the URL goes from valid to invalid… yep, it’s one of my merged PRs. OK, so I’ve successfully isolated the commit that causes the problem. What’s next?
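That checkout-and-test loop is exactly what git bisect automates. Here’s a self-contained toy demo (a throwaway repo, with a file of “generated URLs” standing in for our real test suite) showing bisect pinpointing the first bad commit:

```shell
# Toy git bisect demo: one commit "breaks the generated URL", and
# `git bisect run` finds it automatically. All names are illustrative.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

echo "https://cdn.example.com/assets/img.png" > url.txt
git add url.txt && git commit -qm "good: valid URL"
good=$(git rev-parse HEAD)
git commit -qm "unrelated change" --allow-empty

echo "https:///bad//url" > url.txt           # the breaking change
git commit -qam "bad: URL generation broken"
bad=$(git rev-parse HEAD)
git commit -qm "another unrelated change" --allow-empty

git bisect start
git bisect bad HEAD
git bisect good "$good"
# The "test": does the generated URL still start with the CDN host?
git bisect run grep -q "^https://cdn" url.txt
found=$(git rev-parse refs/bisect/bad)       # first bad commit
git bisect reset
echo "first bad commit: $found"
```

In the real case, `git bisect run` would invoke the test file that checks the generated URL instead of `grep`.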
After checking a few files that don’t affect the output, I get to the main routes configuration file. Copy over the changes, and run the tests. Yep, that’s the one. Makes sense! The problem is to do with a URL being generated, a route to a resource. Maybe I could have started there. I’ll know for next time.
Eventually, it turns out that I erroneously included a built-in action in the Rails routing array that I thought I needed for a feature. I didn’t actually need it, but it had the side effect of breaking the asset URL generator. But how? I think it’s because we have an Image resource that uses the /images route, and something about generating the route I didn’t need (the :update route) clashed with the way that Rails generates the URL for files in assets/images, but I’m still not certain. More research needed! 📚
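One plausible mechanism, consistent with a known Rails gotcha, is a naming collision: member routes on resources :images define route helpers called image_path/image_url, the same names as Action View’s asset helpers. A hypothetical sketch (illustrative names, not our actual routes.rb):

```ruby
# Hypothetical reconstruction of the routes change -- not the real file.
Rails.application.routes.draw do
  # Collection-only routes define images_path/images_url: no clash with
  # the singular asset helpers.
  resources :images, only: [:index]

  # The buggy version added the unneeded member route, which also
  # defines image_path(id)/image_url(id). Those route helpers can
  # shadow Action View's image_url asset helper, so
  # helpers.image_url("foo.png") may resolve to an app-relative route
  # like /images/foo.png instead of the CloudFront asset URL.
  #
  # resources :images, only: [:index, :update]
end
```

This fits the symptoms (a different base URL, then a 404), but as above, it’s a hypothesis to verify, not a confirmed diagnosis.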
This has also shown a blind spot regarding Rails 💎. The great thing about Rails is that it does loads of stuff for you. The not-so-great thing about Rails is that it does loads of stuff for you, and in doing so hides a lot of what it’s doing, and the how and the why. It’s good to have found out pretty early on that my mental model of route configuration isn’t right, and that I need to understand it better. Not least because routing is very important!
This bug also revealed that we’d had a service failing silently since the release, and that’s just not OK. It had been pumping errors into the logs because it was repeatedly making a request involving the invalid URL, and we didn’t know about it. This was a recently built microservice - had it been an issue with the main monolith that powers our platform, we’d have known much more quickly. Luckily, we use Papertrail and New Relic on Heroku, so I’ve been able to set up a Slack integration that messages our development channel if we get more than twenty errors in a minute, meaning we won’t get caught out by something like this again. We can’t completely stop ourselves from releasing bugs, but we can make them a lot easier to catch.
Working on a small team and a relatively small system means that when issues like this arise, we can swarm to address them. Having a fix released within a couple of hours of the problem being raised by a user is a real benefit of that.
The autonomy we have has also meant that I’ve been able to identify system improvements (e.g. the Slack alerts, and better error handling for the request that was failing with the 404) and implement them quickly. Being able to learn lessons from this mistake and use them to make the system more resilient is a big benefit of being part of a small, agile team.