UPDATED 15:21 EDT / APRIL 02 2014

Ask DevOps: Unreproducible errors in software – monitoring, error handling and bug tracking

In some companies, programmers ignore non-reproducible errors, even though the usual expectation is that they support testing and report them. Non-reproducible errors can be among the most expensive bugs a software company faces. An error that cannot be reproduced also tends to erode trust in the product.

Inevitably, programs will have errors. No matter how careful the developer or how rigorous the testing, there will always be unexpected errors. People have to learn to live with them rather than close their eyes and ignore them. Since errors are part of software development and deployment, and can be costly in both money and reputation, it is best to keep them in check.

Anthony Vallone, a software developer and tester at Google, wrote an excellent piece on this issue for developers and quality assurance testers.

Understanding software bugs

Effective bug management is a critical activity in any software project. When bug management is ineffective, the project as a whole suffers: time, effort and energy are spent not on fixing bugs, but on arguments and delay tactics.

To validate that a bug exists, developers usually start by using the information in the bug report to reproduce the failure. However, reproducing reported bugs is not always straightforward. In fact, some reported bugs are impossible to reproduce. When all attempts at reproducing a reported bug fail, the bug is marked as non-reproducible.

Although unreproducible errors can occur at any stage of the software life cycle, they are more frequent during testing and after the product has gone live. When errors occur, the log should ideally contain a lot of detail. Unfortunately, the detail that led to an error is often unavailable once the error is encountered. Also, if you’ve followed advice about not logging too much, the log records preceding the error may not provide adequate detail.
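One way to ease the tension between logging too much and having too little detail when an error occurs is to buffer low-level records in memory and write them out only when something goes wrong. The sketch below is an illustration, not something taken from Vallone's post. It uses Python's standard `logging.handlers.MemoryHandler`; the logger name `orders`, the `process` function and the in-memory stream are assumptions made for the example.

```python
import io
import logging
from logging.handlers import MemoryHandler

# Final destination for log records. An in-memory stream keeps the sketch
# self-contained; a real service would more likely use a file or log shipper.
stream = io.StringIO()
target = logging.StreamHandler(stream)
target.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

# Hold records in memory. They are forwarded to the target when a record at
# ERROR level or above arrives, or when the buffer reaches its capacity.
buffered = MemoryHandler(capacity=100, flushLevel=logging.ERROR, target=target)

log = logging.getLogger("orders")  # hypothetical logger name
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(buffered)


def process(order_id, quantity):
    """Hypothetical unit of work that logs debug detail on every call."""
    log.debug("processing order %s quantity=%s", order_id, quantity)
    if quantity <= 0:
        # This ERROR record triggers a flush, so the earlier DEBUG records
        # that led up to the failure are written out alongside it.
        log.error("invalid quantity for order %s", order_id)
        return False
    return True
```

With this arrangement, successful calls leave nothing in the output, while a failing call emits its own error record together with the buffered debug records that preceded it.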

Such errors can also stem from insufficient resources, timing issues, memory corruption, uninitialized memory and memory leaks.

Vallone has provided guidelines for development and testing to reduce the likelihood of these bugs occurring. According to Vallone, the parameters involved in effective bug management range from purely technical issues to human behavior and organizational politics. In many cases, these parameters conflict: satisfying one means neglecting another. Finding the best resolution of these conflicts is not easy.

Guidelines to reduce unreproducible errors

For errors caused by deadlocks, timing issues, memory corruption, uninitialized memory access, memory leaks and resource issues, Vallone offers several guidelines for development. As a precaution, organizations should simplify their synchronization logic. If that logic is too hard to understand, concurrency problems will be difficult to reproduce and debug.

The next step is to avoid deadlocks by defining a fixed order in which multiple locks are obtained, and to avoid creating many fine-grained locks unless they are needed, since extra locks increase concurrency complexity. Developers should also avoid shared memory where possible. Shared memory access is very easy to get wrong, and the resulting bugs may be quite difficult to reproduce.
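The idea of a defined lock order can be illustrated with a small sketch. It is not code from Vallone's post; the `Account` class, the `transfer` function and the choice of ordering by `account_id` are assumptions made for the example. Because every thread takes the two locks in the same order, two opposite transfers cannot each hold one lock while waiting for the other.

```python
import threading


class Account:
    """Hypothetical account guarded by its own lock."""

    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(source, target, amount):
    """Move money between accounts, always locking in account_id order."""
    if source is target:
        return  # nothing to do, and locking the same Lock twice would hang
    # The lock order depends only on the ids, never on the transfer direction.
    first, second = sorted((source, target), key=lambda acct: acct.account_id)
    with first.lock:
        with second.lock:
            source.balance -= amount
            target.balance += amount


def run_demo(rounds=1000):
    """Run opposite transfers concurrently and return the final balances."""
    a = Account(1, 1000)
    b = Account(2, 1000)
    t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(rounds)])
    t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(rounds)])
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.balance, b.balance
```

If `transfer` instead locked `source` first and `target` second, the two threads in `run_demo` could deadlock intermittently, which is exactly the kind of failure that is hard to reproduce on demand.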

For testers, Vallone suggests stress testing the system regularly to expose unexpected failures when it is under heavy load. Testers should exercise both debug and optimized builds under constrained resources by reducing the number of data centers, machines, processes, threads, available disk space or available memory. They can also run dynamic analysis tools such as memory debuggers, ASan, TSan and MSan regularly to identify many categories of unreproducible memory and threading issues.

Vallone then proposes tried and tested techniques, namely defensive programming, fuzz testing and careful error handling, to minimize unreproducible bugs. He said defensive programming means verifying the work of dependencies with known risks of failure, such as user-provided data, I/O operations and RPC calls.
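As a rough illustration of defensive programming combined with a very simple form of fuzz testing, the sketch below validates a piece of user-provided data and then feeds the validator random strings. The `parse_port` and `fuzz_parse_port` helpers are hypothetical and written for this example; a dedicated fuzzing tool would explore inputs far more thoroughly.

```python
import random
import string


def parse_port(raw):
    """Validate a user-supplied port number instead of trusting it."""
    if not isinstance(raw, str):
        raise TypeError("port must be given as a string")
    text = raw.strip()
    # Accept plain ASCII digits only; signs, decimals and letters are rejected.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a number: {raw!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def fuzz_parse_port(iterations=2000, seed=0):
    """Throw random short strings at parse_port and check its contract."""
    rng = random.Random(seed)  # fixed seed so any failure can be replayed
    alphabet = string.digits + string.ascii_letters + " -+.\t"
    for _ in range(iterations):
        raw = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))
        try:
            port = parse_port(raw)
        except ValueError:
            continue  # cleanly rejecting bad input is the expected outcome
        # Anything that was accepted must be a valid port number.
        assert 1 <= port <= 65535, raw
    return True
```

Seeding the random generator is a deliberate choice here: if the fuzz loop ever finds a failing input, rerunning with the same seed reproduces it.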

“The most common sections of code to remain untested is error handling code. Don’t skip test coverage here. Bad error handling code can cause unreproducible bugs and create great risk if it does not handle fatal errors well,” he said.
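Covering error handling code usually means forcing the failure to happen on purpose. The following sketch shows one possible approach, injecting a failing dependency in place of the real one. The `load_settings` function, its `opener` parameter and the default values are assumptions for illustration and do not come from Vallone's post.

```python
import io
import json

DEFAULTS = {"retries": 3}  # hypothetical default settings


def load_settings(path, opener=open):
    """Load JSON settings, falling back to defaults if reading fails.

    The opener parameter exists so tests can substitute a failing file open.
    """
    try:
        with opener(path) as handle:
            loaded = json.load(handle)
    except (OSError, json.JSONDecodeError):
        # The file is missing, unreadable or corrupt: use known-good values.
        return dict(DEFAULTS)
    if not isinstance(loaded, dict):
        return dict(DEFAULTS)
    return {**DEFAULTS, **loaded}


def failing_opener(path):
    """Simulate an I/O failure without touching the real file system."""
    raise PermissionError(f"cannot read {path}")


def corrupt_opener(path):
    """Simulate a file whose content is not valid JSON."""
    return io.StringIO("{not json")


def valid_opener(path):
    """Simulate a well-formed settings file."""
    return io.StringIO('{"retries": 5}')


def test_error_paths():
    """Exercise both error branches as well as the normal path."""
    assert load_settings("settings.json", opener=failing_opener) == DEFAULTS
    assert load_settings("settings.json", opener=corrupt_opener) == DEFAULTS
    assert load_settings("settings.json", opener=valid_opener) == {"retries": 5}
    return True
```

Passing the dependency in as a parameter keeps the error branches reachable from an ordinary unit test, so they are not left to be exercised for the first time in production.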

In addition, Vallone suggested other practices that help with non-reproducible bugs, including checking for duplicate keys, testing concurrent data access, developing APIs and following good logging practices.

See Vallone’s blog post for detailed information on minimizing unreproducible bugs.

