For Want of a Nail the Shoe Was Lost, or Sometimes It’s the Simple Answer
Sometimes it’s the simple things that cause the most heartache, especially in the technology world in which we live. We roll out sophisticated technology solutions using a myriad of virtualized tools to consolidate servers, storage, and even networking. We implement monitoring mechanisms with all kinds of bells, whistles, and alarms to manage said virtual systems, all to make our lives a little more sane. We learn to rely on these tools and our intrinsic knowledge of our systems. And sometimes, just sometimes, we get so caught up in the technology that we overlook the simple solutions.
Not that long ago I was made aware of a situation where someone fought for two weeks to resolve an infrastructure problem. The problem had been escalated up the ladder of multiple vendors: the integration partner, server provider, virtualization software vendor, storage vendor, and application vendor. People were working around the clock, updating firmware, swapping NICs, reinstalling software, and scouring audit logs. After two weeks everyone was scratching their heads; no one was any closer to a solution than when the problem was first reported. This was a new install; why was there such a problem? Then one of the on-site engineers started cleaning up a little, dressing cables and labeling servers. Just on a whim, he checked to make sure the power cords were seated all the way. When he pushed a cord in, the whole power supply slid forward! Come to find out, the power supplies in the chassis were not seated properly. Once all of the power supplies were re-seated, everything worked fine. Not a single trap or alarm had picked up on the power issue. No diagnostic test revealed a power problem, but there it was. In hindsight, it seems simple.
It made me think: it’s always like that. How many times have we exhausted complex technical avenues before realizing the answer was so simple it was hiding in plain sight? Here’s another example.
A long, long time ago at a customer site far away. . .
I was at a customer site for the first phase of a very large infrastructure conversion project. This was a project that had been almost two years in the planning. They were ripping out and replacing the heart and lungs of the Information Technology infrastructure. This project had overcome countless hurdles to get to the point where the network infrastructure could be replaced. Not only were we replacing everything in the network, but we were fundamentally changing the way in which IT services were delivered. We were leapfrogging two generations of technology. This project, because of its size, had a lot of visibility – visibility all the way to the CEO level. The customer was spending millions of dollars on a new infrastructure and they were about to incur an extended downtime. Everybody was watching.
At midnight on the chosen day of the chosen weekend, we took down the systems and began our conversion. We had conducted several dry runs in advance of the downtime, so we knew the processes and procedures we needed to follow and had some practice with them. We pre-cabled and pre-configured everything we possibly could prior to shutting down the first system. By the time the system came down, the entire team was already in action installing and configuring the new systems. By 4:00 AM we were finished. We were standing around looking at each other saying things like, “I wonder what we should do with the extra 20 hours we budgeted but didn’t use?” I kept saying things like, “See, it’s just like painting. The magic is in the preparation.” By 5:00 AM we were all sitting at Denny’s finishing our breakfast.
The next day ended up being a free day, or so we thought. The technical team scattered to the winds. At about 11:00 PM my phone rang. It was the on-call support person, and he was having a meltdown. He had been working with the customer since 4:00 PM and had been taking angry calls for seven hours. He couldn’t fix anything remotely, and he couldn’t reach me or the other technical people on the team. I grabbed my technical guy and headed back to the data center.
When we got to the data center, there were about a dozen people waiting. We walked through the door and immediately started fielding questions: “Where were you? Why didn’t you answer the phone?” Everybody started talking at once. I asked my technical person to see what was going on and drifted over to talk to the CIO. The CIO told me an interesting story. Apparently, the IT team had known the system was down at 8:00 AM. At 10:00 AM they had called the CIO. They called us at 4:00 PM, and now it was almost midnight. For all intents and purposes, the system had been down 24 hours.
So my tech guy started working with their tech folks. After a little confusion, he finally got the story: the whole site was not down, just one system. A little less pressure, but not much. This was one of the systems my team had not moved; the customer’s team had responsibility for it. At this point, we had been in the data center about five minutes. My guy was sitting on the floor behind a rack of computer gear, trying to figure out what was wrong, with about a dozen people talking to him at once. All of a sudden, from behind the rack, I heard him ask, “There are two ports on the back of this system and only one cable. Did anyone try the other port?” The room went silent; you could have heard a pin drop. Even the equipment seemed to stop making a sound. People started pawing at the floor with their toes and looking at the ceiling. No one said a word. Everyone was trying to look small. The Technical Consultant asked again, “Did anyone even TRY the second port?” No one answered. So he moved the cable to the second port. When he did, the system immediately popped up and started to work.
This system had been down for 24 hours. We had replaced the network, so when this system reported it had lost network connectivity, everyone looked at the network. They were testing, reconfiguring, and swapping network gear all day. The simple thing got overlooked. This is not an indictment of the customer’s IT team; their technical people were far more credentialed than ours. However, when we found out that the system that was down was one we hadn’t moved, we started from the beginning. And the beginning was: is it plugged in? Had it been one that we had touched, we too would most likely have started with the network gear.
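The habit my consultant applied can be boiled down to a pre-flight check: before anyone swaps network gear, confirm which ports even report a live link. As a minimal sketch, assuming a Linux host (where each interface exposes its state under /sys/class/net), it might look like the following; the function names and the `sysfs_root` parameter are mine, added so the check can also be pointed at a test directory:

```python
import os

def link_states(sysfs_root="/sys/class/net"):
    """Return {interface: operstate} for every interface the host knows about.

    On Linux, /sys/class/net/<iface>/operstate reads "up", "down",
    "unknown", etc. A port with no cable in it (or an unseated one)
    shows "down" -- worth checking before anyone swaps network gear.
    """
    states = {}
    for iface in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, iface, "operstate")
        try:
            with open(path) as f:
                states[iface] = f.read().strip()
        except OSError:
            # Interface directory exists but state is unreadable.
            states[iface] = "unknown"
    return states

def basics_look_ok(states):
    """True only if at least one non-loopback interface reports 'up'."""
    return any(state == "up" for name, state in states.items() if name != "lo")
```

Nothing here is clever, and that is the point: a five-second look at link state would have said "this box has no live cable" long before anyone reconfigured a switch.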
I guess this is a cautionary tale. The caution: sometimes it pays to take a giant step backwards and check whether the simple things are all OK. It’s been that way forever. It’s to our benefit to have all the alarms, audit logs, and management tools at our disposal, but they can’t identify everything or eliminate the need to double-check. Sometimes an error, like the power supply issue mentioned previously, slips past.
I had the desktop techs reporting to me about twenty years ago. After an especially vexing user interaction, one of my people documented the fix to a problem as “O-F-F does NOT mean ON, FULL FORCE. Turned on PC.” It never ceases to surprise me when I hear a new example. I guess the only moral to these stories is: it’s only simple in hindsight.
Joseph Kelly is the Director of Technical Consulting at Park Place International. Joe has been working with Healthcare providers and/or payers since the mid-1980s and has focused solely on MEDITECH and MEDITECH hospitals since 1997. Joe has provided technology consulting, architecture, design, and planning services at organizations such as EDS, JJWILD, Perot Systems, Dell Services, and now Park Place International. Joe’s overall goal is to bring the leveraged, cloud-based virtual universe down to earth to most effectively meet real-world objectives for MEDITECH hospitals. Joe has a BS in Computer Information Systems from Bentley University.