At 45, I’m more amazed than ever at just how incapable people are of performing basic troubleshooting: getting to the root cause of a problem. Any problem. While I’m focusing on “technology” for the moment, this applies to LIFE in general. Every day I see people start to gather information about a problem and, before they finish listening to the whole story, dive in to “fix” it.
9 times out of 10 they don’t fix the problem; they prolong or intensify it. The consequences can be dangerous. Wasted time is just the beginning. If you’re a doctor, someone could die. If you maintain certain kinds of machinery or systems, people could die. This is serious shit! If you’re laughing right now: shut the fuck up and pay attention!
Gather the facts (not the rhetoric, not the bullshit story, just the facts)
Isolate the scope of the problem (how big, how far spread)
Compare with something not-broken
Look for patterns
Basic IT Troubleshooting
Gather the facts
Find out what changed
Inspect event logs and log files
Eliminate the obvious (cables, power)
When did the problem first occur?
Did it EVER work correctly?
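Here is a rough sketch of the "inspect the logs" and "when did it first occur" steps working together: pull only the warnings and errors logged at or after the time the problem was first reported. The line format, the function name entries_since, and the sample lines are all made up for illustration; real event logs will need their own parsing.

```python
from datetime import datetime

# Assumed (hypothetical) line format: "2024-03-05 14:02:11 ERROR message text"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def entries_since(lines, first_seen, levels=("ERROR", "WARNING")):
    """Return (timestamp, level, message) tuples logged at or after first_seen."""
    matches = []
    for line in lines:
        parts = line.strip().split(" ", 3)
        if len(parts) < 4:
            continue  # skip lines that do not fit the assumed format
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines whose timestamp cannot be parsed
        level, message = parts[2], parts[3]
        if stamp >= first_seen and level in levels:
            matches.append((stamp, level, message))
    return matches


# Made-up sample lines, not taken from a real system.
sample_log = [
    "2024-03-05 08:59:40 INFO Service started",
    "2024-03-05 09:14:02 WARNING License server slow to respond",
    "2024-03-05 09:15:30 ERROR License checkout failed",
    "garbage line",
]
first_seen = datetime(2024, 3, 5, 9, 0, 0)

for stamp, level, message in entries_since(sample_log, first_seen):
    print(stamp, level, message)
```

The point is not the parsing. The point is that the time window and the severity filter come from the facts you gathered, not from a guess.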
Isolate the Scope
Is it user-specific (one user affected, others are not)?
Is it machine-specific (all users affected on the same computer)?
Is it application-specific (one app, or all apps)?
Is it device-specific (printer, scanner)?
Is it resource-specific (a particular shared folder)?
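The first two scope questions can be reduced to two quick tests: does the app work for a different user on the same machine, and does it work for the same user on a different machine? A minimal sketch of that decision follows. The function name classify_scope and the labels it returns are my own invention, and the two-test model is a simplification.

```python
def classify_scope(other_user_works_here, same_user_works_elsewhere):
    """Map two quick tests to a likely scope for the problem.

    other_user_works_here: a different user can run it on the affected machine.
    same_user_works_elsewhere: the affected user can run it on another machine.
    """
    if other_user_works_here and same_user_works_elsewhere:
        # Only this user on this machine fails: look at the local profile.
        return "user-on-this-machine"
    if other_user_works_here:
        # The user fails everywhere: look at the account, groups, permissions.
        return "user-specific"
    if same_user_works_elsewhere:
        # Everyone fails on this machine: look at the machine or the install.
        return "machine-specific"
    # Fails for other users and on other machines: look at shared dependencies.
    return "wider-than-one-user-or-machine"


print(classify_scope(other_user_works_here=False, same_user_works_elsewhere=False))
```

Two five-minute tests narrow the search before anybody touches an installer.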
Compare with Something Not-Broken
Compare before-and-after log results
Verify interfaces (ping, browse)
Verify user accounts (enabled, locked, group memberships)
Verify security settings
What differs between this one and another? (user, machine, app, device, etc.)
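Comparing against something not-broken can be as simple as collecting the same settings from a working machine and a broken one and listing only what differs. The sketch below assumes you have already gathered those settings into two dictionaries; the function name diff_settings and the sample keys and values are hypothetical.

```python
MISSING = "<missing>"


def diff_settings(working, broken):
    """Return {setting: (working_value, broken_value)} for settings that differ."""
    differences = {}
    for key in sorted(set(working) | set(broken)):
        good = working.get(key, MISSING)
        bad = broken.get(key, MISSING)
        if good != bad:
            differences[key] = (good, bad)
    return differences


# Made-up sample settings for illustration only.
working_machine = {"app_version": "4.2", "license_server": "lic01", "proxy": "none"}
broken_machine = {"app_version": "4.2", "license_server": "lic02"}

for name, (good, bad) in diff_settings(working_machine, broken_machine).items():
    print(f"{name}: working={good} broken={bad}")
```

Everything that matches drops out of the picture, which leaves a short list of suspects instead of a whole machine.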
Look for Patterns
Does it happen consistently?
Does it happen on particular days, at particular hours, or in particular weeks?
Does it coincide with another process?
What circumstances cause it to occur?
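One cheap way to look for time patterns is to write down when each occurrence happened and count them by hour and by weekday. This is only a sketch: the function name occurrence_patterns and the timestamps are made up, and a cluster in the counts is a lead to follow, not proof of a cause.

```python
from collections import Counter
from datetime import datetime

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def occurrence_patterns(timestamps):
    """Count occurrences by hour of day and by weekday name."""
    by_hour = Counter(ts.hour for ts in timestamps)
    by_weekday = Counter(WEEKDAY_NAMES[ts.weekday()] for ts in timestamps)
    return by_hour, by_weekday


# Made-up occurrence times for illustration only.
occurrences = [
    datetime(2024, 3, 4, 2, 5),
    datetime(2024, 3, 11, 2, 7),
    datetime(2024, 3, 18, 2, 3),
    datetime(2024, 3, 13, 15, 40),
]

by_hour, by_weekday = occurrence_patterns(occurrences)
print("Most common hour:", by_hour.most_common(1))
print("Most common weekday:", by_weekday.most_common(1))
```

If most occurrences land in the same hour on the same weekday, the next question is what else runs at that time, such as a backup or a scheduled job.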
Nearly every time someone contacts me for help with something on their computer (and I’m talking about IT “professionals” here, not family, friends and so on), I ask, “What do the event logs show?” and I get the same answer: “I haven’t checked yet.”
Here’s a real-world example: a user calls in a support request saying their application is “broken and won’t launch anymore”. The Help Desk technician immediately uninstalls and reinstalls the application, a process that normally takes an hour per machine. But guess what? The problem returns. Did they check to see if another user could run that application under their own login? Did they check to see if the application works on other computers?
In one case, the problem was a license server not responding, so NONE of the applications on ANY computer were working. Restarting a service fixed the problem. In another case, the user’s profile had a corrupt registry key (HKCU), and simply deleting the key and its subkeys forced the application to rebuild them, after which everything worked fine. In both cases, the reinstall was the wrong “fix”: it wasted an hour of everyone’s time and accomplished nothing.
Rushing in to fix a problem without being careful to diagnose it first is dangerous. It’s how space shuttles blow up. It’s how ships run aground. It’s how patients die. We all make mistakes, but a mistake is a deviation from a normal pattern of NOT making mistakes. If you make mistakes all the time, every time, you need to find another career.
Finally, if you step back and look at your environment and find that you’re fixing the same kinds of problems a lot, that is almost always a clear indication that there wasn’t enough testing performed early on. Whether it was picking the wrong products or technologies, or not implementing them properly, or not training users to use them effectively, or not telling users a change was coming (and when), well, somebody screwed up.
Slow down, at least a little, and be sure to gather everything you can about a problem so you can get your mind around it and solve it effectively. The time you spend up front will usually save twice that on the other end when you try to solve the problem.