A Standard+Case Tale

A red line of text appears on Danny’s screen. The monitor on the internal chat system has picked up a staff member complaining of problems with the client management system, ERNIE. Danny grins: the “#fail” keyword always gets the social monitor’s attention. Danny opens an Incident ticket. Another day on the Service Desk.

He looks her up on the HR system to find her location. Danny is a one-star Service Desk Analyst, certified to run basic ERNIE diagnostics. The certification took a 2-hour CBT and a simple 10-minute online exam. Danny tries to do one new certification every week: part of his bonus is based on the number of certifications he accumulates.

So he runs the diagnostics but can’t find any problems at her location. If it was affecting everyone the phones would be running hot and there’d be something on the Service Desk console by now.

Perhaps it is PBCAK (problem between chair and keyboard). He calls her. A lucky day, she actually answers. No “nilm” required on the ticket history (“not in, left message”).

“Hello Inge, this is Danny from the Service Desk. I saw your comment on Chatter. Having problems with ERNIE?”

It turns out Inge knows exactly what she is doing, but her location is out of date on the HR system. Danny always enjoys things like that: it guarantees that automation isn’t going to do him out of a job any time soon. Danny likes being a Service Desk Analyst – it is the most rewarding job he has had. And here at BigMed he knows his pay will be pretty good within a few years as he builds seniority through further certifications, accreditations, performance, and experience.

Danny runs a quick check for the correct location while she waits, and finds nothing. He takes remote control of her desktop (another certification he has – it included a whole lot of privacy policy questions along with the technical ones). He finds a configuration issue on her desktop client.

He categorises the Incident ticket accordingly. There is no automated workflow in the ticketing tool for that category of Incident, but there is a solution in the knowledgebase with a scripted series of steps to fix the configuration properly, and a checklist of everything that should be in place when finished. Danny clicks the “thumbs down” on the script – it is a bit confusing. He dashes off a quick comment. The solution’s use-count increments by one: Danny can see it gets quite a lot of use. Negative feedback and high use: Danny knows the solution will bubble up to the top of the QA queue, and he or someone else on the team will look into it soon. There are good status points to be earned for knowledge improvement work, and the whole team have learned to treasure their knowledgebase as the thing that most contributes to making their job easier. He chuckles to himself, remembering when he learned that lesson: he came in one day with a monster hangover. He’d never have made it through the day without the knowledgebase.
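The tale does not say how BigMed's QA queue is ranked, only that negative feedback combined with high use pushes a solution to the top. One way such a ranking might work is sketched below. The `Solution` fields, the scoring formula, and the example titles and numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    title: str
    use_count: int      # times the solution was applied to a ticket
    thumbs_down: int    # negative feedback votes
    thumbs_up: int = 0  # positive feedback votes


def review_priority(s: Solution) -> float:
    """Hypothetical score: use count multiplied by the share of negative votes."""
    votes = s.thumbs_up + s.thumbs_down
    if votes == 0:
        return 0.0  # no feedback yet, so nothing to prioritise on
    return s.use_count * (s.thumbs_down / votes)


def qa_queue(solutions):
    """Return solutions ordered so the most urgent review comes first."""
    return sorted(solutions, key=review_priority, reverse=True)


# Invented example data: a heavily used, poorly rated script outranks the rest.
solutions = [
    Solution("Map a network printer", use_count=200, thumbs_down=2, thumbs_up=48),
    Solution("Fix ERNIE desktop client configuration", use_count=120, thumbs_down=30, thumbs_up=10),
    Solution("Rarely used VPN fix", use_count=3, thumbs_down=3),
]
ranked = qa_queue(solutions)
```

With these invented numbers the ERNIE configuration script lands first in `ranked`: it is both widely used and mostly disliked, which is the combination Danny relies on.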

Danny has also learned to always follow the script. The Service Delivery Coordinator and the Service Desk team leader both do random QA audits of closed tickets. Staff have been busted down the seniority tables for not following a standardised procedure when one exists. Besides, the scripts and checklists and automated workflows have saved his butt uncounted times when he nearly forgot something.

Access restored. Inge is happy and agrees the ticket can be closed. That’s one more on his First Call Resolution stats.

Danny is about to hang up when Inge, obviously inspired by Danny’s helpfulness, asks “What do I do about a dead client?” It turns out someone came in for an appointment but ERNIE showed them as deceased. Inge has all the details on paper but is still trying to work out how to process the appointment for a cadaver.

Danny opens a new Incident ticket and captures as much detail as he can from Inge while he has her on the phone, but regretfully tells her the Service Desk will have to get back to her. That’s one less on his FCR stats. Never mind, plenty more where that came from.

Danny knows dead people don’t arrive for appointments so he sets the Incident situation code to “unknown”.

He knows he is out of his depth – this isn’t going to be a First Level Resolution for him either. A quick glance at the case workers’ Kanban board shows that Lee is only handling two cases at the moment. Danny changes the Incident status to “case”, assigns it to the Case queue and calls Lee. She is accredited by the vendor of ERNIE as a Level 2 technician – she can handle it.

After Lee gets off the phone from Danny, she picks up the Incident off the queue, assigns it to herself, and adds the third case to her Kanban column. That’s her full up: no more than three cases in progress at a time. Next year she’ll have enough seniority to handle four cases at a time, with a consequent pay increment.

Lee logs on to ERNIE. The client is indeed dead, according to ERNIE. Lee considers the state of the Case:

    Situation: wrongly dead client. There is no administrative function to change the status once a patient is dead.
    Goal: change their status code to “active”.
    Action: Lee will need the DBA to make a direct change to the client’s database record. Policy says this has to be done as a Change, but there have been enough other issues with ERNIE data requiring direct DBA intervention that it is a Standard Change for the DBA to correct a patient code (ERNIE is a bit flaky). Lee opens a Standard Change ticket. Because it is a Standard Change it is pre-approved, so she assigns it directly to the DBA group queue.

You don’t get to be a case worker without developing some instincts. While she waits for the DBAs to action the change, Lee opens the Audit Log on ERNIE and searches to see what idiot marked the patient as dead. No idiot. There is no audit log of a status change on that patient.

Next, Lee goes to the Change ticket database and searches for Standard Changes to ERNIE data for the “deceased” status code. There are seven others this year. Warren – her least favourite case worker – has requested four of them. She calls Warren.

“Warren, it’s Lee here.”
“Hello Lee.” Warren doesn’t sound enthused to hear from Lee.
“Did you know you have requested four patients be raised from the dead in ERNIE so far this year?”
“Yes there have been a couple of those. Someone is killing our clients.”
“It’s not funny Warren. Did you check who marked them deceased? I’ll save you the trouble. No-one did: there’s no audit record. Did it occur to you there might be a Problem here?”
“No it didn’t Lee. I had more important things to worry about. We play God, we get them marked alive again. Problem solved ok?”
“No the Problem isn’t solved! Only the incident is. There’s a Problem out there: something is falsely setting that code to ‘deceased’.”
“Go get ’em Lee, you’re the ace detective. Let me know how it works out.”

Lee fumes as she hangs up. She opens a Problem ticket linked to the Incident, and assigns it to herself. As the most highly accredited staff member for ERNIE, Lee deals with Problems on ERNIE when she isn’t fully occupied on Cases.

Lee schedules an online conference with Simon from 3Thimbles Tech, the vendor of ERNIE, and two BigMed staff: Ann, the applications developer who works on ERNIE integration, and Russell the IT security analyst (if audit records are missing he needs to be involved).

The conference battles with all the usual issues of people talking over each other, fractured video, and slow graphics on the shared online whiteboard. Lee pines for the days when people actually got together in a room, but at least the conference happened quickly.

The team apply all their regular tools: a situational analysis, a barrier analysis, a brainstorm, and a root cause analysis.

They soon focus on two essential clues:
• the missing audit record
• all the accidentally-deceased clients were transferred from two particular external clinics

This unravels the cause. The two clinics are the only external agencies who use the BERC4 patient management system. Patient records are transferred into BigMed’s ERNIE through a complex system of scripted batch feeds called GLUE. Updates from the external agencies are also applied via GLUE. GLUE doesn’t write ERNIE audit records; it has its own basic log. It is a known issue that has caused heated debate, but GLUE was written quickly when the ERNIE implementation project ran short of funds and the properly designed integration system using expensive 3Thimbles services was cut.

Ann needs an hour to trawl the GLUE log, so the conference call disbands. Russell clearly thinks it was a waste of his time, and Simon manages to impart a strong “I told you so” message without actually saying anything. But Lee is unfazed: she has her result. She knows it already, in her gut.

While she waits, Lee checks the Standard Change to resurrect the original client. Nobody has picked it up from the DBAs’ queue so she rings the DBA team leader and applies a little pressure.

Lee’s gut was right. Ann comes back looking shame-faced. She physically turns up at Lee’s desk, so Lee can see the embarrassment as Ann explains that the table which maps BERC4 status codes to ERNIE status codes has an error. When BERC4 updates a patient’s status to “in remission”, GLUE maps that to an ERNIE code of “deceased”. The mapping table has been fixed: Ann has tested the fix in the User Acceptance environment and will ask one of the external clinics to run a transaction against a test patient in Production. There will be no more wrongly deceased patients.
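A mapping error of this kind can be caught by a very small sanity check. The sketch below is illustrative only: the tale names just the “in remission”, “deceased” and “active” codes, so the table contents, the assumed correct target for “in remission”, and the function names are all hypothetical and do not describe how GLUE is actually written.

```python
# Hypothetical status codes; only "in remission" -> "deceased" comes from the tale.
BROKEN_MAP = {
    "active": "active",
    "in remission": "deceased",  # the error Ann found
    "deceased": "deceased",
}

# Assumed correction: a patient in remission is still an active client.
FIXED_MAP = {**BROKEN_MAP, "in remission": "active"}


def suspicious_mappings(mapping):
    """Return source codes that map to 'deceased' without being 'deceased' themselves."""
    return [src for src, dst in mapping.items()
            if dst == "deceased" and src != "deceased"]


def translate(mapping, berc4_status):
    """Translate a BERC4 status, refusing unknown codes rather than guessing."""
    if berc4_status not in mapping:
        raise KeyError(f"unmapped BERC4 status: {berc4_status!r}")
    return mapping[berc4_status]
```

Here `suspicious_mappings(BROKEN_MAP)` flags “in remission”, while the fixed table comes back clean.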

The testing of GLUE was another project shortcut.

Lee consoles Ann but also adds, “Sorry Ann, but I think this Problem is sure to be flagged for review because of the client impact. They may put this on the weekly Intelligence Report.” Ann sighs and goes back to nursing GLUE along.

Lee updates the Problem record and puts it into a waiting status until Ann completes the production test. Lee has her own informal checklist of things to go over after an application problem is found. She runs through it now, and reminds herself she must publish it to the team knowledgebase. It would be useful for them, and besides she needs to get her five new solutions published before month end.

An alert pops up to tell her the Standard Change has been closed. Lee logs on to ERNIE and sure enough the client lives again.

Lee updates the Incident ticket with the actions taken and the result, then changes the ticket status from “Case” to “Resolved”. She smiles: she is chasing the 50-A-Month badge that the case workers covet. It also rewards her with an extra day’s leave but it is the badge she really wants, to show to Warren. He’s never had one; it will be her third.

She runs through the case resolution checklist to ensure she has done and written everything required. Problem ticket: check. History detail updated: check. Cost and effort estimate recorded: check. Medical impact and risk flagged: check. Nothing else applies to this one.

Lee transfers the Incident back to Danny, the Incident owner. When Danny sees it come up on his queue he calls Inge. No need to tell her all the details of the screw-up, but he thanks her for bringing it to their attention and assures her it shouldn’t happen again as the underlying Problem has been resolved. Inge agrees to close the Incident.

Danny is about to hang up and start the incident closure checklist when Inge says “Hey, while you are on the line, I’ve got another question…”

Inge has a query about her last payslip, something about payment for a public holiday. Danny has no payroll support certifications, so he puts Inge on hold, checks the Service Desk console, sees that George is off the phone and has the required certification, and transfers the call to George. While George answers, Danny opens a new Incident ticket and transfers it to George. Danny always does it that way: you are recognised for your throughput of tickets. Besides, it does George a small favour too.

Now Danny wraps up the original ticket. He pulls up the incident closure checklist and checks off every item:

  • Confirm that the Incident is resolved
  • Confirm that recovery is complete: the service has been restored to the user
  • Confirm that the user agrees the Incident can be closed
  • Check that Incident is in the right category and correct if necessary
  • Check that Incident is associated with the correct asset(s)
  • Ensure that the Incident history is a complete record
  • Ensure that Incident documentation is complete
  • Determine whether the information in the Incident record should be part of the knowledgebase, and take the necessary action to copy it there
  • If it is likely that the Incident could recur, check whether there is an existing Problem record, or create one
  • Link the Incident to any related Master-Incident, Problem or Change records
  • Close the Incident

So he does.

Clearly inspired by The Phoenix Project (G. Kim, K. Behr, G. Spafford; IT Revolution Press, 2013; ISBN 978-0988262591), an excellent parable on the benefits of DevOps.