THE BOOK OF TESTING

Thoughts From a Braidy Tester

by Michael Hunter
December 04, 2007

Five Questions With James Hamilton

James Hamilton started out as an auto mechanic in Ottawa, Ontario, almost thirty years ago. After a few years of that he decided to become a code jockey instead and so pursued a couple of CS degrees. With those in hand he went off to IBM, where he worked on projects like IBM's first C++ language compiler and garnered a fistful of awards. From there he moved to Microsoft, where he spent stints on SQL Server, Windows NT, and Exchange, and is now helping the Windows Live Platform Services team build a high-scale services platform. On top of all that he speaks at a plethora of conferences, serves on a multitude of industry committees, and publishes papers on a regular basis.

Here is what James has to say:

DDJ: What was your first introduction to testing? What did that leave you thinking about the act and/or concept of testing?
JH: My first experience with testing was in building systems as part of my undergraduate degree in computer science. I did well with these assignments but it wasn’t by being a better developer. University Computer Science programs are full of good developers. My secret weapons were: 1) build componentized systems that can be independently tested at each component, 2) build simple automated testing systems that can drive all component tests and overall system functional tests, and 3) having made it possible to test and easy to run all the tests, do it any time for anything no matter how small the change. Even a comment. The last point is probably more succinctly summarized as having patience and follow-through. It’s perhaps the easiest statement to make and the most boring attribute to see in print, and yet it’s one of the most important attributes of both good developers and good testers. In all forms of engineering, the first 90% is easy. Getting something running reasonably well and mostly correct is easy. It’s the last 10% that separates great software from frustrating software, and it’s that last 10% that separates good developers and testers from the average.

Going back to the three secret weapons, it’s ironic that I was able to do considerably better than average on development work at university by investing in testing rather than in development. By spending more on testing upfront and automating the tests, I was able to evolve systems more quickly and efficiently, and the resultant systems were often more correct. If you agree with me that great testing is one of the keys to building complex systems, why is it that so few universities actually teach testing?
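
[Illustration, not part of the interview: a minimal Python sketch of the three practices described above, namely components that can be tested on their own, a simple driver that runs every component test plus a whole-system test, and a single entry point that is cheap enough to run after any change. The component names `tokenize` and `evaluate` and the runner `run_all` are hypothetical and chosen only for the example.]

```python
from typing import Callable, Dict, List, Tuple

# Two hypothetical components, each small enough to test in isolation.
def tokenize(text: str) -> List[str]:
    """Split input text into whitespace-separated tokens."""
    return text.split()

def evaluate(tokens: List[str]) -> int:
    """Sum a list of integer tokens."""
    return sum(int(t) for t in tokens)

# Component-level tests.
def test_tokenize() -> None:
    assert tokenize("1 2  3") == ["1", "2", "3"]

def test_evaluate() -> None:
    assert evaluate(["1", "2", "3"]) == 6

# Overall functional test that exercises the components together.
def test_system() -> None:
    assert evaluate(tokenize("10 20 12")) == 42

TESTS: Dict[str, Callable[[], None]] = {
    "tokenize": test_tokenize,
    "evaluate": test_evaluate,
    "system": test_system,
}

def run_all(tests: Dict[str, Callable[[], None]]) -> Tuple[int, List[str]]:
    """Run every test, collecting failures instead of stopping at the first."""
    failures: List[str] = []
    for name, test in tests.items():
        try:
            test()
        except Exception as exc:  # includes AssertionError
            failures.append(f"{name}: {exc!r}")
    return len(tests) - len(failures), failures

if __name__ == "__main__":
    passed, failures = run_all(TESTS)
    print(f"{passed} passed, {len(failures)} failed")
    for failure in failures:
        print("FAILED", failure)
```

Because the whole suite runs from one command, rerunning it after even a trivial change costs almost nothing, which is the point of the third practice.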

My first job after university was leading an Ada compiler test team at IBM. My initial observation was that there was huge opportunity for improvement, and I’ve since learned that this is true industry-wide. In my opinion, if you want to fundamentally improve software quality and the speed with which it’s produced, it makes most sense to go where the biggest opportunities for improvement lie. Improving software testing, along with design and development for testability, is, in my view, the easiest and best way to improve the current state of the industry.

DDJ: What is the most interesting bug you have seen?
JH: Over the years, every software release in which I’ve been involved has had some truly difficult bugs. One of the most interesting was just prior to shipping DB2 V2. I was the Unix architect on DB2 and this was our first Unix port of the DB2 UDB database management system. This was a super important release from a fairly new team and the company was watching. Like all software projects, we were late and there was considerable pressure to ship. We had one remaining serious bug: a very rare index corruption bug that only showed up under load. This is exactly where many software projects hurt themselves and their customers by bending to the pressure to ship and, by so doing, failing over the long haul. Janet Perna, who led the DB2 development team, knew that a new database just entering the Unix database market would quickly become irrelevant if customers encountered even a rare data corruption bug. She was 100% right, and this is the hardest decision that development leaders have to make. Years later Paul Flessner, who led the SQL Server team, said in a tough shiproom discussion: if we ship this bug, customers will still remember 10 years from now, whereas if we’re a month late, it’ll be forgotten before the end of the year.

Geoff Peddle, one of the best engineers on the DB2 team, had been chasing this index corruption bug for weeks. As the release got later, more and more engineers got involved in tracking this bug but the weeks dragged on. This bug always showed up in high-scale, long-haul stress runs but was nearly impossible to reproduce on demand. And, when index corruption was detected, the actual event that corrupted the index may have happened hours earlier. We tried all sorts of tricks to catch the problem as it was happening, including hooking the lowest level of the system where pages are written out and putting full page integrity checks in place. Nothing would catch the problem at the moment it occurred but, once or twice a day, the long-haul stress machines would bump into it.

We eventually found it. I wish I could recall who figured it out (drop me a note if you remember). The problem was in the cache firmware of one broadly distributed hard disk model. On rare occurrences under heavy load, a cache error would show up where a write would be written to cache but not to disk. On subsequent reads, the old page contents from the previous page generation were returned directly from disk. To make things a bit more fun to debug, seconds later the disk firmware would force the dirty page from the cache to disk, restoring overall database integrity. This means that when it came time to debug the data corruption, every page was perfect, which led us all to be convinced that it was a DB2 problem since the on-disk contents were always correct.
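
[Illustration, not part of the interview: a toy Python model of the failure mode described above. It is a sketch under stated assumptions, not a reconstruction of the actual firmware or of DB2. The class `FlakyCachedDisk`, its `buggy_reads` switch, and the `demo` function are all hypothetical. The model shows why a read issued shortly after a write can return the previous page generation while a later inspection finds the disk contents correct.]

```python
from typing import Dict, Optional, Tuple

class FlakyCachedDisk:
    """Toy disk with a write-back cache that can serve stale reads."""

    def __init__(self) -> None:
        self.media: Dict[int, bytes] = {}  # what is physically on disk
        self.cache: Dict[int, bytes] = {}  # dirty pages not yet on disk
        self.buggy_reads = False           # hypothetical switch for the fault

    def write(self, page_no: int, data: bytes) -> None:
        # The write is acknowledged once it reaches the cache.
        self.cache[page_no] = data

    def read(self, page_no: int) -> Optional[bytes]:
        # Correct behavior: a dirty cached page wins over the media copy.
        if not self.buggy_reads and page_no in self.cache:
            return self.cache[page_no]
        # Faulty behavior: the dirty page is ignored and media is returned.
        return self.media.get(page_no)

    def flush(self) -> None:
        # The firmware eventually forces dirty pages out to the media.
        self.media.update(self.cache)
        self.cache.clear()

def demo() -> Tuple[Optional[bytes], Optional[bytes]]:
    disk = FlakyCachedDisk()
    disk.write(7, b"generation-1")
    disk.flush()                    # generation 1 is safely on disk

    disk.buggy_reads = True         # the rare fault under heavy load
    disk.write(7, b"generation-2")
    stale = disk.read(7)            # reader sees the previous generation

    disk.flush()                    # seconds later the dirty page lands
    later = disk.read(7)            # post-mortem inspection looks perfect
    return stale, later

if __name__ == "__main__":
    stale, later = demo()
    print("read during fault:", stale)
    print("read after flush: ", later)
```

In this model the stale read is the only evidence of the fault. Once the flush happens, nothing on the media is wrong, which mirrors why inspecting pages after the fact pointed away from the disk.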

Since the problem was in non-flashable disk firmware, the fix was to change the device driver for this manufacturer’s disk line to disable the cache on affected disk drives. The unfortunate side effect of this fix was that, for the thousands of customers who used these disks, a device driver upgrade shut off their disk caches, which substantially hurt performance for most workloads.

We did the right thing in delaying the ship date to get this problem resolved, since the faulty device was the most common disk in use on the RS/6000, which was our primary target platform. Had we shipped on time, customers would have remembered that bug and hated us for the data corruption problems that followed for the subsequent decade.

DDJ: What has most surprised you as you have learned about testing/in your experiences with testing?
JH: The clear division between test and development that many in the industry still advocate today simply doesn’t work. In healthy teams, testers are developers by all measures. They may not be directly coding on the product or service but they are writing code and they need to be as capable as the rest of the engineering team. Large test teams are a symptom of a problem rather than a way to solve it. I’m not saying we shouldn’t have engineers who focus solely on testing. My point is that these folks need to have all the skills of developers. They need to be just as capable and as well compensated as any other member of the engineering team.

Just as testers are developers, developers need to test as well. All developers need to be responsible for unit testing. By “responsible” I mean that, when code is checked in, it should be functionally correct. A few companies have operated this way in the past, where developers were expected to fully unit test AND there was a considerable social and professional stigma associated with checking in code that later showed multiple functional problems. At these companies, development included unit test and developers were held accountable for both the quality and timeliness of what they produced. At companies where developers write code and testers write tests, developers end up being mostly responsible for code quantity and schedule and considerably less accountable for quality. This is bad for the developers and does terrible things to the products, rewarding what we need least: lots of nearly correct code.

If developers are responsible for unit test, the code will be organized such that it’s easy to test, the overall quality will be better and the code produced will be smaller, tighter, and more correct.

DDJ: What do you think is the most important thing for a tester to know? To do? For developers to know and do about testing?
JH: Testers need to take a customer perspective with a developer’s knowledge. They need to know the system as a whole better than developers and they need to understand and represent the users and administrators of the system. They need to keep the overall project focused on ease of use and ease of administration and they should be actively using the product or service in every way practical.

Developers need to design for testability. Systems that are hard to test are hard to ship. Developers need to own unit test and design systems that can be effectively tested through automated tests below the UI level. Years ago the semiconductor industry responded to exploding complexity by employing design for testability and by including test circuitry in the actual shipped component. We need to do exactly the same thing with software. We should be designing systems that are easy to test and we should be including self-test code in the production systems. Just as modern CPUs include elaborate self-test circuitry, production software systems should spend resources implementing self-test.

Systems where substantial resources are spent on self-test are easy to evolve going forward, show far less software entropy, and generally have lower service costs.
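
[Illustration, not part of the interview: one way to read "self-test code in the production systems" is a component that ships with a check of its own invariants, callable below the UI level. The Python sketch below assumes a hypothetical in-memory `SortedIndex` class; the method name `self_test` is likewise an assumption made for the example.]

```python
import bisect
from typing import List

class SortedIndex:
    """Hypothetical in-memory index that ships with its own integrity check."""

    def __init__(self) -> None:
        self._keys: List[int] = []

    def insert(self, key: int) -> None:
        # Keep keys ordered so lookups can use binary search.
        bisect.insort(self._keys, key)

    def contains(self, key: int) -> bool:
        i = bisect.bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key

    def self_test(self) -> List[str]:
        """Return a list of invariant violations; an empty list means healthy."""
        problems: List[str] = []
        for a, b in zip(self._keys, self._keys[1:]):
            if a > b:
                problems.append(f"keys out of order: {a} before {b}")
        return problems

if __name__ == "__main__":
    index = SortedIndex()
    for key in (5, 1, 3):
        index.insert(key)
    print("violations:", index.self_test())
```

A production system could run such a check on a schedule or on demand from an operations tool, so that corruption is reported close to when it happens rather than hours later.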

DDJ: What do you see as the biggest challenge for testers/the test discipline for the next five years?
JH: Testing doesn’t scale. Software complexity continues to increase and is leading to non-linear test team growth. I’ve worked with teams that have more testers than developers and yet their product quality was still not under control. Bigger test teams usually end up doing more of what doesn’t scale and they just continue to grow. Eventually someone determines that, to control costs, we should offshore the testing work. They feel great and may even be rewarded for taking a third of the cost out of the testing effort, but it’s bad engineering piled on bad engineering. The base problem is that testing doesn’t scale, and growing test teams to cope with the non-linearly scaling complexity doesn’t work. Sending large, ineffective teams offshore does reduce costs temporarily but they are still ineffective and they still can’t scale.

The only way to address the exploding complexity of large software systems is to design them for testability. Developers need to view “done” as having written the code, written the unit tests necessary to establish that the code works, and ensured that the tests pass. Development shops that view “done” as code complete are doomed to be late and ship low-quality software. Developers should be held accountable for the number of bugs found in their code after it has been checked in and declared done. For the most part, this isn’t tracked and often isn’t even noticed.

The biggest challenge for the test discipline over the next 5 years is to focus on this core problem: testing doesn’t scale. Test teams need to educate the development groups in design for testability, and the production systems need to include self-test code. Testers need to push this part of their jobs to development. Successful test teams will show the leadership to push unit test to development, to automate all they do, and to be part of all design and overall project decisions. A healthy test team will be a small, high-talent organization that is a full part of all product decisions. The test team needs to represent the users and administrators of their product and to ensure that what ships actually works for the target audience.

Successful test teams will be the ones that innovate in finding ways to allow test to scale. Successful test teams may get smaller than many current generation test teams but the overall engineering talent density on these teams may actually rise. To continue to be successful, the industry as a whole needs to address the test scaling problem on large software systems.


[See my Table Of Contents post for more details about this interview series.]

Posted by The Braidy Tester at 07:30 AM