Step 1: Enter the bug in your case tracking system
At the end of all these steps is a phase where you're tearing your hair out and still haven't gone home. That's when you will realize one of two things:
- you've forgotten some crucial detail about the bug, such as what it was, or
- you could assign this to someone who knows more than you.
A case tracking system will prevent you from losing track of both your current task and any that have been put on the back burner. And if you're part of a team, it'll also make it easy to delegate tasks to others and keep all discussion related to a bug in one place.
You should record these three things in each bug report:
- What the user was doing
- What they were expecting
- What happened instead
These will tell you how to recreate the bug. If you can't recreate the bug on demand, then your chances of fixing it will be nil.
Step 2: Google the error message
If there is an error message then you're in luck. It might be descriptive enough to tell you exactly what went wrong, or else give you a search query to find the solution on the web somewhere. No luck yet? Then continue to the next step.
Step 3: Identify the immediate line of code where the bug occurs
If it's a crashing bug then try running the program in the IDE with the debugger active and see what line of code it stops on. This isn't necessarily the line that contains the bug (see the next step), but it will tell you more about the nature of it.
If you can't attach a debugger to the running process, the next technique is to use "tracer bullets", which are just print() statements sprinkled around the code that tell you how far the program's execution has got. Print to the console (e.g. Console.WriteLine("Reached stage 1") or printf("Reached stage 1")) or log to a file, starting coarse (one print per method or major operation), then refining it until you've found the one single operation that the crash or malfunction occurs on.
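Here's a rough sketch of tracer bullets in Python, using the standard logging module instead of bare prints. The functions parse_order, total_price and process are made up for illustration; the point is only that the last "Reached stage" line in the output tells you which operation blew up.

```python
import logging

# Tracer bullets: one log line per major operation, so the last line
# printed shows how far execution got before the malfunction.
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(message)s")
log = logging.getLogger("tracer")


def parse_order(raw):
    # Hypothetical stage 1: split "item,quantity" into its two parts.
    item, quantity = raw.split(",")
    return item, quantity


def total_price(quantity, unit_price):
    # Hypothetical stage 2: raises ValueError if quantity is not numeric.
    return int(quantity) * unit_price


def process(raw, unit_price):
    log.debug("Reached stage 1: parsing %r", raw)
    item, quantity = parse_order(raw)
    log.debug("Reached stage 2: pricing %r x %r", item, quantity)
    total = total_price(quantity, unit_price)
    log.debug("Reached stage 3: done, total=%r", total)
    return total
```

Calling process("widget,three", 5) logs stages 1 and 2 but never stage 3, which narrows the failure down to total_price. From there you'd add finer-grained tracers inside that one function.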
Step 4: Identify the line of code where the bug actually occurs
Once you know the immediate line, you can step backwards to find where the actual bug occurs. Sometimes you'll discover that they're one and the same line of code. Just as often, you'll discover that the crashing line is innocent and that it has been passed bad data from earlier in the stack.
If you were following program execution in a debugger then look at the Stack Trace to find out what the history of the operation was. If it's deep within a function called by another function called by another function, then the stack trace will list each function going all the way back to the origin of program execution (your main()). If the malfunction happened somewhere within the vendor's framework or a third-party library, then for the moment assume the bug is somewhere in your program--for it is far more likely. Look down the stack for the most recent line of code that you wrote, and go there.
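If you only have an exception object and no debugger, the same "find my most recent line" search can be sketched with Python's traceback module. The helper name innermost_own_frame and the is_own_file predicate are assumptions for this example; how you decide that a file is "yours" (a path prefix, a package name) is up to you.

```python
import traceback


def innermost_own_frame(exc, is_own_file):
    """Return the deepest traceback frame that belongs to our own code.

    exc         -- a caught exception
    is_own_file -- callable taking a filename and returning True when
                   that file is part of our program (an assumption of
                   this sketch; e.g. a check on the path prefix)
    """
    own = [
        frame
        for frame in traceback.extract_tb(exc.__traceback__)
        if is_own_file(frame.filename)
    ]
    # extract_tb lists frames oldest call first, so the last match is the
    # most recent line of our code before control entered a library.
    return own[-1] if own else None
```

The returned frame has filename, lineno and name attributes, so you can jump straight to that line. If the exception was raised deep inside a library, this skips the library's frames and lands on the call you made into it.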
Step 5: Identify the species of bug
A bug can manifest in many bright and colorful forms, but most belong to a short list of species. Compare your problem to the usual suspects below.
- Off-by-one
You began a for-loop at 1 instead of 0, or vice-versa. Or you thought .Length was the same as the index of the last element. Check the language documentation to see if arrays are 0-based or 1-based. This bug sometimes manifests as an "Index out of range" exception, too
- Race condition
Your process or thread is expecting a result moments before it's actually ready. Look for the use of "Sleep" statements that pause a program or thread while it waits for something else to get done. Or perhaps it doesn't sleep because on your overpowered and underutilized development machine every query was satisfied in the milliseconds before your next statement executed. In the real world things get delayed and your code needs a way to wait properly for things it depends on to get done. Look into using mutexes, semaphores, or even a completely different way of handling threads and processes
- Configuration or constants are wrong
Look at configuration files and any constants you have defined. I once spent a 16-hour day in hell trying to figure out why a web site's shopping cart froze at the "Submit Order" stage. It was traced back to a bad value in an /etc/hosts file that prevented the application from resolving the IP address of the mail server, and the app was churning through to a timeout on the code that was trying to email a receipt to the customer
- Unexpected null
Betcha you got "Object reference not set to an instance of an object" a few times, right? Make sure you're checking for null references, especially if you're chaining property references together to reach a deeply nested method. Also check for "DBNull" in frameworks that treat a database NULL as a special type
- Bad input
Are you validating input? Did you just try to perform arithmetic when the user gave you a character value?
- Assignments instead of comparisons
Especially in C-family languages, make sure you didn't do = when you meant to do ==
- Wrong precision
Using integers instead of decimals, using floats for money values, not having a big-enough integer (are you trying to store values bigger than 2,147,483,647 in a 32-bit integer?). Can also be subtle bugs that occur because your decimal values are getting rounded and a deviation is growing over time (talk to Edward Lorenz about that one)
- Buffer overflow & Index Out-of-range
A leading cause of security holes. Are you allocating memory and then trying to insert data larger than the space you've allocated? Likewise, are you trying to address an element that's past the end of an array?
- Programmer can't do math
You're using a formula that's incorrect. Also check to make sure you didn't use div instead of mod, that you know how to convert a fraction to a decimal, etc.
- Concatenating numbers and strings
You are expecting to concatenate two strings, but one of the values is a number and the interpreter tries to do arithmetic. Try explicitly casting every value to a string
- 33 chars in a varchar(32)
On SQL INSERT operations, check the data you're inserting against the types of each column. Some databases throw exceptions (like they're supposed to), and some just truncate and pretend nothing is wrong (like MySQL). A bug that I fixed recently was the result of switching from INSERT statements prepared by concatenating strings to parameterized commands: the programmer forgot to remove the quoting on a string value and it put it two characters over the column size limit. It took ages to spot that bug because we had become blind to those two little quote marks
- Invalid state
Examples: you tried to perform a query on a closed connection, or you tried to insert a row before its foreign-key dependencies had been inserted
- Coincidences in the development environment didn't carry over to production
For example: in the contrived data of the development database there was a 1:1 correlation between address ID and order ID and you coded to that assumption, but now the program is in production there are a zillion orders shipping to the same address ID, giving you 1:many matches
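To make a few of these species concrete, here is a small Python sketch of four of them: off-by-one, race condition, wrong precision and unexpected null. All the function names are invented for the example, and the "slow query" in the race-condition case is just a stand-in assignment.

```python
import threading
from decimal import Decimal


def last_element(items):
    # Off-by-one: len(items) is one past the final index, so
    # items[len(items)] raises IndexError; the last index is len(items) - 1.
    return items[len(items) - 1]


def fetch_with_wait(timeout=1.0):
    # Race condition: wait on an Event that the worker sets, rather than
    # sleeping for a guessed interval and hoping the result is ready.
    result = {}
    ready = threading.Event()

    def worker():
        result["value"] = 42  # stand-in for a slow query
        ready.set()

    threading.Thread(target=worker).start()
    if not ready.wait(timeout):
        raise TimeoutError("worker did not finish in time")
    return result["value"]


def add_prices(a, b):
    # Wrong precision: pass prices as strings so Decimal stores them exactly.
    return Decimal(a) + Decimal(b)


def city_of(order):
    # Unexpected null: check each link before following the chain
    # order -> customer -> address -> city.
    customer = order.get("customer") if order else None
    address = customer.get("address") if customer else None
    return address.get("city") if address else None


print(0.1 + 0.2 == 0.3)                               # False: float rounding
print(add_prices("0.10", "0.20") == Decimal("0.30"))  # True
```

None of these is the only way to handle its species. An Event is one of several synchronization tools, and some languages have a null-safe navigation operator that replaces the chain of checks in city_of.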
If your bug doesn't resemble any of the above, or you aren't able to isolate it to a line of code, you'll have more work to do. Continue to the next step.
Step 6: Use the process of elimination
If you can't isolate the bug to any particular line of code, either begin to disable blocks of code (comment them out) until the crash stops happening, or use a unit-testing framework to isolate methods and feed them the same parameters they'd see when you recreate the bug.
If the bug is manifesting in a system of components then begin disabling those components one-by-one, paring down the system to minimal functionality until it begins working again. Now start bringing the components back online, one by one, until the bug manifests itself again. You might now be able to go back to Step 3. Otherwise, it's on to the hard stuff.
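The bring-them-back-one-by-one loop can be sketched in a few lines of Python. Both find_culprit and the system_works callback are hypothetical: in real life system_works would be you (or a script) enabling the listed components and running the steps that recreate the bug.

```python
def find_culprit(components, system_works):
    """Re-enable components one at a time; return the first that breaks things.

    components   -- names of the parts that were disabled (hypothetical)
    system_works -- callable taking the set of enabled names and returning
                    True when the bug does not appear (hypothetical)
    """
    enabled = set()
    if not system_works(enabled):
        # Even the pared-down system fails, so none of these is to blame.
        raise ValueError("minimal system still shows the bug")
    for name in components:
        enabled.add(name)
        if not system_works(enabled):
            return name
    return None  # everything is back online and the bug never appeared
```

For example, find_culprit(["cache", "mailer", "search"], lambda on: "mailer" not in on) returns "mailer". Note the sketch assumes a single culprit; a bug that needs two components enabled together will be pinned on whichever of the two comes back second.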
Step 7: Log everything and analyze the logs
Go through each module or component and add more logging statements. Begin slowly, one module at a time, and analyze the logs until the malfunction occurs again. If the logs don't tell you where or what, then proceed to add more logging statements to more modules.
Your goal is to somehow get back to Step 3 with a better idea of where the malfunction is occurring. This is also the point where you should be considering third-party tools to help you log better.
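One way to go "one module at a time" without drowning in output, sketched with Python's standard logging module: give each module its own named logger and turn up the level on just the one you're investigating. The module names "orders" and "mailer" and the helper turn_up_logging are assumptions for the example.

```python
import logging

logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)


def turn_up_logging(module_name):
    """Switch one module's logger to DEBUG, leaving the others as they are."""
    logger = logging.getLogger(module_name)
    logger.setLevel(logging.DEBUG)
    return logger


# "orders" and "mailer" are hypothetical module names.
orders_log = turn_up_logging("orders")
mailer_log = logging.getLogger("mailer")

orders_log.debug("loaded cart with %d items", 3)  # passes the DEBUG filter
mailer_log.debug("connecting to mail server")     # inherits the root level
```

If the orders logs come up clean, call turn_up_logging("mailer") next and recreate the bug again.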
Step 8: Eliminate the hardware or platform as a cause
Replace RAM, replace hard drives, replace entire servers and workstations. Install the service pack, or uninstall the service pack. If the bug goes away then it was either the hardware, operating system or runtime. You might even try this step earlier in the process--per your judgement--as hardware failures frequently masquerade as software dysfunction.
If your program does network I/O then check switches, replace cables, and try the software on a different network.
For shits and giggles, try plugging the hardware into a different power outlet, particularly one on a different breaker or UPS. Sound crazy? Maybe when you're desperate.
Do you get the same bug no matter where you run it? Then it's in the software and the odds are that it's still in your code.
Step 9: Look at the correlations
- Does the bug always happen at the same time of day? Check scheduled tasks/cron-jobs that happen at that time
- Does it always coincide with something else, no matter how absurd a connection might seem between the two? Pay attention to everything, and I mean everything: does the bug occur when an air-conditioner flips on, for example? Then it might be a power surge doing something funny in the hardware
- Do the users or machines it affects all have something in common, even if it's a parameter that you otherwise wouldn't think affects the software, like where they're located? (This is how the legendary "500-mile email" bug was discovered)
- Does the bug occur when another process on the machine eats up a lot of memory or cycles? (I once found a problem with SQL-Server and an annoying "no trusted connection" exception this way)
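The time-of-day question is easy to answer mechanically if you have timestamps for each failure. A minimal Python sketch, assuming the timestamps are ISO-8601 strings; the sample data and the name failures_by_hour are invented for illustration.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(timestamps):
    """Count failures per hour of day from ISO-8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


# Hypothetical timestamps pulled from an error log.
sample = [
    "2024-03-01T02:00:14",
    "2024-03-02T02:00:09",
    "2024-03-02T15:41:52",
    "2024-03-03T02:00:11",
]

for hour, count in failures_by_hour(sample).most_common():
    print(f"{hour:02d}:00  {count}")
```

A pile-up in one hour, like the three failures at 02:00 in this sample, is your cue to go look at what's scheduled for that time. The same Counter trick works for grouping by machine, user or location.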
Step 10: Bring in outside help
Your final step will be to reach out to people who know more than you. By now you should have a vague idea of where the bug is occurring--like in your DBMS, or your hardware, or maybe even the compiler. Try posing a question on a relevant support forum before contacting the vendors of these components and paying for a service call.
Operating systems, compilers, frameworks and libraries all have bugs and your software could be innocent, but your chances of getting the vendor to pay attention to you are slim if you can't provide steps to reproduce the problem. A friendly vendor will try to work with you, but bigger or understaffed vendors will ignore your case if you don't make it easy for them. Unfortunately that will mean a lot of work to submit a quality report.
Good practices (and when all else fails)
- Get a second pair of eyes
Collar a co-worker and have them look at the problem with you. They might see something you didn't. Do this at any step of the process
- Have a good gander at the code
I frequently find bugs just by relaxing and reading the code. Walk through it in your mind
- Look at scenarios where the code works, compare the input to when it doesn't work
I recently found a bug where an input in XML form contained "xsi:type='xs:string'" and everything broke, but another input without that attribute succeeded. Turns out, the extra attribute was messing with deserialization
- Go to sleep
Do not refuse to go home until you've fixed it. Your powers diminish with fatigue and you'll just waste time and burn yourself out
- Use creative pause
Creative Pause is the term for getting up and going to do something else for a while. If you've ever noticed how you have your best ideas in the shower or while driving home it's because the change in mental tasks bumps you to another plane of thought. Try going for lunch, watching a movie, browsing the web, or working on a different problem for a while
- Disregard some of the symptoms and error messages and look at the problem again
Nasty bugs can come in disguises that can mislead you. Dial-Up Networking in Windows 95 claimed there was a busy signal when you could clearly hear the remote modem answer and try to negotiate. The 16-hour shopping-cart bug from above manifested in customers losing their shopping cart contents because we had load-balanced application servers, and as each server faulted-out the sessions were being transferred to sister machines that couldn't recover the cart's contents. When you're overwhelmed with symptoms you have to put your hands over your ears and shut them out so you can focus on just one, and when you've identified or eliminated it you can move on to the next until you've found the root
- Imitate Dr. Gregory House
Gather your team in your office, stomp around with a cane, write down the symptoms on a whiteboard and make snarky comments. It seems to work for TV medical dramas, so give it a shot [1]
[1] Skip the Vicodin, though. Oxy is better anyhow.
Things that will not help you
- Swatting flies
Do not go crazy trying everything at once. Some managers panic and start ordering code rollbacks, server reboots, routing changes and other arm-flailing movements all-at-once in the hope that one of them will fix an emergency early. It never works. It also creates a bigger confusion that takes even longer to sort out. Do one thing at a time. Measure the result. Think about it. Then move on to the next hypothesis
- "Help, plz"
When you go to a support forum you should've already got past Step 3, at the minimum. Nobody will want to help you, or be able to help you, if you can't give them a good description of the problem, including your hardware/OS configuration and some relevant lines of code. Start a topic when you think you can describe the problem intelligently and pick a descriptive subject line for the message
- Shouting the solution into existence
If you think it's somebody else's fault, especially early in the process, at least talk to them in a civil fashion. Physicians and neuroscientists have been studying the phenomenon and are now certain that shouting, exhorting, pleading and emphasizing the severity and dire consequences of an emergency have no positive impact on the problem-solving centers of the brain. Even if Democracy itself were under threat, being loudly and annoyingly emphatic will not summon the fix into existence, no matter how much brute willpower the technician can muster
Miscellaneous bugs I've fixed recently
- Duplicate filenames? But they include a timestamp!
Mysterious problem with files being generated twice. Further investigation: the files don't have the same contents. That's strange, they should always have a unique filename because the template includes the date and time formatted as "yyMMddhhmmss". Step 9, Look at the correlations: the first file was generated at 4:30am, and the dupe filename was generated at 4:30pm the same day. Coincidence? Nope, because "hh" in a time formatting string means 12-hour clock values. Doh! Changed template to "yyMMddHHmmss", bug fixed.
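The same trap exists in other languages' format strings. As an illustration only (the original bug was in a "yyMMddhhmmss"-style template, not in Python), here is the Python equivalent, where %I is the 12-hour clock and %H is the 24-hour clock. The dates are made up.

```python
from datetime import datetime

morning = datetime(2024, 3, 1, 4, 30, 0)   # 4:30am
evening = datetime(2024, 3, 1, 16, 30, 0)  # 4:30pm the same day

# %I is the 12-hour clock (like "hh"); %H is the 24-hour clock (like "HH").
TWELVE_HOUR = "%y%m%d%I%M%S"
TWENTY_FOUR_HOUR = "%y%m%d%H%M%S"

# Same name twice: 240301043000 240301043000
print(morning.strftime(TWELVE_HOUR), evening.strftime(TWELVE_HOUR))

# Unique names: 240301043000 240301163000
print(morning.strftime(TWENTY_FOUR_HOUR), evening.strftime(TWENTY_FOUR_HOUR))
```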