How not to drive users and developers crazy

By Den Howlett, June 5, 2020
Summary:
Lessons from a significant infrastructure change.

(Train wreck image via public images)

This story is in response to a request to expand on a couple of Tweets I sent following an 'incident' here at diginomica Towers. Here goes. 

Here's the history

When we re-engineered the site, the goal was to have 70% of developer effort go to building new stuff and 30% to maintenance. In an ideal world I'd like to get to 80/20. A year in and we were comfortably meeting our goal. Before COVID-19 hit, we were re-evaluating our hosting provider for several reasons, not least that they were largely unresponsive and their pricing was opaque. You all know where that goes - unexpected price and/or cost hikes. However, switching out providers is one of the riskier things you can do when your business depends on being 'always on.' 

The previous provider included an embedded CDN, whereas we had to source a new CDN to work alongside the new hosting provider. That CDN inclusion was useful for us as it represented 'one less thing to do.' It also meant we got pretty good - but not stellar - performance. For those who don't know, Google PageSpeed is your benchmark. Period. So we were elevating our risk of failure by tacking a CDN onto the new infrastructure. 

The planned switchover was delayed for technical reasons, but we eventually got to a point where we could run the switch on a day when, if anything went wrong, at least it wasn't a Friday. We also had rollback in place in case the switch proved catastrophic. We'd done the obligatory testing, I'd run tests from a user perspective, and nothing seemed off.

Then Sod's law hit us. 

Sod's law comes to bite us in the ass

We had expected 10-15 minutes of downtime accompanied by a temporary loss of back end access and possible weirdness at the front end, which would mean people would see a day-old snapshot rather than the latest and greatest. I could live with that. In order to give the devs enough time, we agreed to halt content production operations and provide an early-evening two-hour window to complete dev tasks.

The first sign of trouble was when the two hours stretched to four. And then all hell broke loose. The devs had to plough on and managed to get the front end working fine. But when our content team fired up their browsers, a cascade of problems hit. Access to the main content library was borked - everyone was getting 'Access Denied' messages. Drafts could be edited, but pieces could not be edited once published. That's a pain when you find a niggly typo. 

When we do have problems, they are often OS/browser-specific, but on this occasion they were what I call global. To make matters more frustrating, only one of the team could get access to the main content library. That meant the others had to send over docs and/or HTML files so that we could keep production going. 

In the meantime, the picture as we saw it was so confusing that it was impossible to understand what was going on. The devs were hunkered down, but that didn't prevent us from sending emails reflecting an increasing level of tension. Think in terms of DefCon 2. At one point I wondered whether user rights had been compromised which, if true, would have been a serious problem. 

What were the problems?

The problems were solved - as they always are - but the fix is really a kludge, one that is being refined so that we can harden the back end. What happened? As always, it wasn't just one thing. 

  • The CDN documentation provided to the devs conflicted with what they saw happening. The conditions the devs observed were the direct opposite of what they understood session cookie behavior should be (there's a sketch of the expected rule after this list).
  • None of this was obvious from pre-flight testing either at the dev end or with users. 
  • The devs were so overwhelmed that communication between them and users went out of the window. As users, we were largely in the dark. 
  • As users, we were panicking because we couldn't get a consistent picture of what was happening and so our reports back to the devs were incomplete. 
  • As PMP I was incredibly frustrated and if you know me then you know where that goes.

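For context on that first point: the usual expectation with a CDN sitting in front of a CMS - and roughly what we assumed would happen - is that requests carrying a session cookie (i.e. logged-in editors) bypass the cache and go to the origin, while anonymous readers get cached pages. The sketch below illustrates that rule only; the cookie prefix and the function are hypothetical, not our actual CDN configuration.

```typescript
// Illustrative only: the rule we expected the edge to apply, expressed as a
// stand-alone function. The "SESS" cookie prefix is a hypothetical example.
function shouldBypassCache(cookieHeader: string | null): boolean {
  if (!cookieHeader) {
    return false; // anonymous visitor: safe to serve the cached page
  }
  // Any session-style cookie is treated as a sign the visitor is logged in
  return cookieHeader
    .split(";")
    .map((part) => part.trim())
    .some((part) => part.startsWith("SESS"));
}

// A logged-in editor's request should go to the origin, not the cache...
console.log(shouldBypassCache("SESSabc123=token; theme=dark")); // true
// ...while an anonymous reader can be served from the cache.
console.log(shouldBypassCache("theme=dark")); // false
```

When a rule like that is inverted or ignored - cached, anonymous responses being served to logged-in users - it would plausibly produce exactly the symptoms we saw, from 'Access Denied' on the content library to generally flaky back end behavior.
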
Once the dust settled I ran a call with the devs and got a drains up as to what went wrong at their end. They acknowledged that communication was less than pristine, something I saw from our side too. After the call, I briefed our team in outline and we moved on. But it got me thinking about how we handle issues and how we can improve our methods both as users and devs. 

What we used to do

Right now, everything is documented in JIRA. I'm not a fan of the system but I appreciate its use from a dev perspective. Everything gets logged there so the devs can readily call up a history of events. 

We communicate with the devs via email and they in turn drop stuff into JIRA. I get JIRA notifications for design, dev and test. Internally, I drop dev notes into Workplace but avoid too much technical jargon because our team is not overly technical.  

This 'system' mostly works even though it might seem clunky and inefficient. Our relatively low level of ongoing issues means we can be fairly informal, even when that means a certain amount of back and forth. 

Where we're going

But there are some missing essentials. Here's what I set out as our incident protocol going forward:

  • When reporting bugs, please ensure you have tested under both logged-in and logged-out conditions, take screenshots and, where appropriate, note the URLs where the problems occur. I know, for example, there is a social button problem the devs have marked as ‘cannot replicate’ but which occurs only when logged in.
  • Also, it helps considerably if you describe with as much precision as possible the event leading to the suspected bug and the nature of the bug.
  • It’s not reasonable to simply state ‘x is borked’ - devs need as much information as possible to replicate the problem at their end. There will be occasions where a bare report like that is all you can manage, but adding screenshots helps immensely.
  • I’ve asked devs to ensure there is as much documentation as is reasonable so that when issues recur they can quickly trace and resolve the issue to hand.
  • I appreciate this sounds like a lot of work but in reality it will cut down the back and forth to solve problems. 
  • It’s also important to report the OS/browser combination under which bugs occur (the console snippet after this list makes those details quick to capture). They’re not all created equal and I know that OSX/Safari can be particularly troublesome - especially if, like me, you’re running the Catalina public beta. If you have access to additional combos, try as many as time allows to help identify whether a problem is system-wide or OS/browser-specific. 
  • Finally, for their part, I’ve asked devs to be much quicker and fuller in their communication. When, as today, we’re faced with complex and difficult issues, a quick note to let us know they’re on the case, waiting for third-party answers and so on helps keep the FFS/WTAF factors at a manageable level.

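To make the OS/browser and URL reporting painless, something like the snippet below can be pasted into the browser's developer console on the offending page and the output dropped straight into the ticket. It's a minimal sketch: the field names are my own suggestion, not a JIRA field set, and the logged-in check is a rough proxy you'd adapt to the site's actual session cookie.

```typescript
// Run this in the browser's developer console on the page where the bug
// occurs, then paste the output into the bug report. Field names are illustrative.
const bugReportContext = {
  url: window.location.href,              // where the problem occurred
  userAgent: navigator.userAgent,         // OS/browser combination
  viewport: `${window.innerWidth}x${window.innerHeight}`,
  loggedIn: document.cookie.length > 0,   // rough proxy - adapt to the real session cookie
  timestamp: new Date().toISOString(),
};
console.log(JSON.stringify(bugReportContext, null, 2));
```
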
I'm sure this isn't exhaustive and I'm keen to ensure users are not overburdened. 

What do you think? Have we struck a decent balance? What would you do differently? Anything glaringly missing?