July 01, 1999

July 1999: Management Forum: Waste Management

Improving data quality is an important endeavor in any development project, and it's something all managers need to understand.

It is perhaps modern computing’s most famous dictum, yet we often ignore that universal law of “garbage in, garbage out.” When we do, we risk transforming well-designed software into badly behaved systems. In this month’s column, John Boddie argues that controlling and cleaning up the garbage is as central to good software development as it is to modern living. We do not all have to become waste management specialists, but every manager needs a grounding in the rudiments of data migration, along with an appreciation for the complexities of dealing with messy data.

—Larry Constantine

This is the day—the payoff for the technical risks, the weekends spent in front of a terminal, and the mind-numbing requirements meetings. The boss is here, along with the vice president of development and her equivalent from the user community. After a few remarks about all the hard work, you start the demo. It looks good. The user you’ve carefully coached moves confidently through the screen sequences. Response time is great. The user interface draws pleasing comments.

Then someone says, “That can’t be right.”

“What can’t be right?”

“That order. You’re using real data, aren’t you?”

“Yes. We wanted you to see the system in a production setting.”

“Well, McDongle and Crashmore has five sites and you only show two.”

Stay cool. You tell the user at the terminal to go to the customer profile screens, but she can only find two sites.

The user’s vice president looks at your boss’s boss and says, “Are you sure this new technology is worth giving up 60% of our customer base?”

You never planned on this, and therein lies the problem.

Trash Compactors

Software development strives for new functionality and new ways to deliver data to users. In practice, what we often do is reinvent the trash compactor. You remember: the kitchen marvel that turned a 20-pound bag of trash into 20 pounds of trash.

Alas, no level of tool or user interface sophistication can overcome the burden of bad data. Although it is one of the key determinants of development success, the process of converting data from old systems into data for new ones receives almost no attention from the development community.

Developers build trash compactors because they are focused on the processing and not the content of the data. A trash compactor doesn’t care if it’s crushing milk cartons or old broccoli, and our systems don’t care if a customer’s telephone number is right or wrong. We conveniently overlook the fact that referential integrity rules in our databases can be satisfied with data that isn’t related in the real world. Bad data, we say, is “not our problem.”

Trash compactors can be wonderfully complex and great fun to build. Trash itself is much less interesting. Most developers would not consider “making sure the parts list is correct” a career-enhancing opportunity. Typically, such assignments fall to the most junior staff, when they can’t be fobbed off on the users. But if you leave these jobs to junior programmers, what are the chances that you will run into difficulties when you attempt to deploy the system? When these difficulties show up (and they will), will everyone still think you did a great job?

Face it, these are assignments nobody wants. Cleaning up legacy data is like being a referee: any news you have is probably bad. I’m currently leading a data migration effort for a telephone company, and so far we’ve found data for only about 70% of the routers they installed in the field. Results like these create long faces at project meetings.

The View from the Landfill

In my opinion, data migration and data quality improvement are two of the most complex and valuable data processing areas. Ask any business process owner which comes first, accurate data or a really nice interface, and you already know the answer. Nevertheless, chances are that you will spend far more time thinking about the interface and working on it than you ever will dealing with the data’s accuracy.

Even if you want to clean up the data, you may not have the technical skills required. Java development expertise doesn’t easily translate into interpreting VSAM files or using Wang utilities.

Data migration work requires an understanding of the semantics of the data you are dealing with, the business practices that use the current systems, and the way these are expected to change, as well as the technologies that the new system, the old systems, and the migration environment use. In my current project, the data sources include Wang, Oracle, Microsoft SQL Server, Microsoft Excel spreadsheets, Microsoft Access databases, direct feeds from telephone switches in the network, direct feeds from Cisco and Newbridge routers, Lotus Notes, proprietary databases, and flat-file outputs from mainframe systems.

In data migration, you do not have the luxury of dealing with the clean abstractions that are collected into the new system requirements. No requirements document starts with the assumption that the new system will be initialized with bad data, yet this is often the case in practice.

When migrating data, you must contend with idiosyncrasies from years of legacy system operation, including data elements that have been redefined, multiple codings of data fields, obsolete status values, and the like. You cannot disregard or arbitrarily correct these. If you can’t find location data for a $25,000 piece of equipment, you can’t leave it out of the new system just because it doesn’t fit your model. Data migration can make you feel you are turning over rocks by a river after a major storm. You will start to notice odd-looking things when you start dealing with the details that have accumulated over the years.
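One way to handle multiple codings and obsolete status values without discarding anything can be sketched in Python. This is only an illustration under stated assumptions: the code table `STATUS_MAP`, the field names, and the helper functions are all hypothetical, and a real mapping would come from the business process owners. Codes that are not understood are queued for review with the raw value preserved, not dropped or guessed at.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table mapping the codings found across legacy systems
# to one canonical status. Real entries would come from the business.
STATUS_MAP = {
    "A": "active",
    "ACT": "active",
    "1": "active",
    "I": "inactive",
    "X": "inactive",  # assumed obsolete code that once meant "closed"
}


@dataclass
class Normalized:
    raw: str                # value exactly as found in the legacy record
    status: Optional[str]   # canonical value, or None if not understood
    needs_review: bool


def normalize_status(raw: str) -> Normalized:
    """Translate one legacy status code; never guess at unknown codes."""
    status = STATUS_MAP.get(raw.strip().upper())
    return Normalized(raw=raw, status=status, needs_review=status is None)


def migrate(records):
    """Split records into loadable rows and rows queued for review.

    Every input record ends up in exactly one of the two lists.
    """
    loaded, review = [], []
    for rec in records:
        result = normalize_status(rec["status"])
        row = {**rec, "status": result.status, "raw_status": result.raw}
        (review if result.needs_review else loaded).append(row)
    return loaded, review
```

Keeping the raw value next to the canonical one means that whoever works through the review queue can see what the legacy system actually held.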

Primary Treatment

When migrating data, identify the data source for your new system first. This is not simply a matter of identifying what system will be replaced. If the system being replaced gets its data from other systems and they, in turn, get it from other ones, the chances are slim that the most accessible data is the most accurate.

Identifying data sources can be difficult. In most companies, systems environments have become so complex that data processing groups don’t really understand the “big picture,” and users can’t identify all their data sources and data streams. In fact, one of the first things the development teams of these companies probably have to do is examine the current system and “extract the business rules.” If users really understood the process details, this step would be unnecessary.

In many cases, new systems will be replacing multiple legacy systems whose data is supposed to be identical. When it isn’t, you must know how the data came to be in each of the systems in order to choose which of the supposedly identical data items is correct. This involves looking at the business processes as well as the systems.

For example, my current project combines data from legacy systems for order entry and billing. Each contains customer account data, including account status, contact address, and billing address. About 20% of the records are inconsistent in at least one of these attributes. The obvious choice is to use the data from the billing system as the correct data whenever billing and order entry attributes differ. However, contact information is updated in the order entry system, and these updates are often not made in billing. Likewise, accounts may be made inactive in the billing system but are not updated in the order entry system unless the user calls and requests it specifically. Knowing these things, we can make a rule that the data for the new system should include the billing address and status from the billing system and the contact address from the order entry system. We still need to produce reports showing the discrepancies between the order entry and billing systems. One report might show all customers where the billing address differs and the account is more than 60 days overdue, which may indicate that the billing address was updated in the order entry system but not in the billing system.
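The adjudication rule and the discrepancy report described above might look something like the following Python sketch. The `Account` class, its field names, and the assumption that days overdue is known only to billing are illustrative inventions, not details of the actual project.

```python
from dataclasses import dataclass


@dataclass
class Account:
    account_id: str
    status: str
    contact_address: str
    billing_address: str
    days_overdue: int = 0  # assumed to be tracked only by billing


def merge_account(billing: Account, order_entry: Account) -> Account:
    """Billing wins for status and billing address;
    order entry wins for contact address."""
    return Account(
        account_id=billing.account_id,
        status=billing.status,
        contact_address=order_entry.contact_address,
        billing_address=billing.billing_address,
        days_overdue=billing.days_overdue,
    )


def discrepancies(billing: Account, order_entry: Account):
    """Names of the attributes on which the two systems disagree."""
    names = ("status", "contact_address", "billing_address")
    return [n for n in names if getattr(billing, n) != getattr(order_entry, n)]


def overdue_address_report(pairs, min_days_overdue=60):
    """IDs of accounts whose billing address differs between systems and
    that are more than min_days_overdue days overdue."""
    return [
        b.account_id
        for b, o in pairs
        if b.billing_address != o.billing_address
        and b.days_overdue > min_days_overdue
    ]
```

The merge rule and the reports are kept separate on purpose: the rule decides what gets loaded, while the reports keep the disagreements visible so someone can investigate them.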

Looking at the “last edit date” for records is not as useful as it might appear, since it’s difficult to determine what changed in the record. Was it a contact name? Was it an area code in the phone number? Change dates provide collateral information, but they seldom drive adjudication in those cases where data sources contain different values.

To manage data migration, you must create a meta data model, map all the sources and destinations to it, and keep it current. It will be your most valuable management tool. It will track source systems and databases, the destination system and its database, and all the attributes including their formatting and coding. You will use the meta data to establish and enforce conventions, such as always using “Ave.” instead of “Avenue” or “Ave” (without the period).
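As a minimal sketch of the idea, a meta data model can start as a list of source-to-destination attribute mappings, each naming the formatting convention to enforce. Everything here is hypothetical (the class names, the system and field names, and the `street_suffix` convention); a production model would also track data types, codings, and ownership.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AttributeMapping:
    source_system: str
    source_field: str
    target_field: str
    convention: str = ""  # name of a formatting rule, if any


@dataclass
class MetadataModel:
    mappings: list = field(default_factory=list)

    def add(self, mapping: AttributeMapping) -> None:
        self.mappings.append(mapping)

    def sources_for(self, target_field: str):
        """Every (system, field) pair that feeds a destination attribute."""
        return [(m.source_system, m.source_field)
                for m in self.mappings if m.target_field == target_field]


# Matches "Avenue", "Ave", or "Ave." as a whole word, in any letter case.
_AVENUE = re.compile(r"\b(?:Avenue|Ave)\b\.?", re.IGNORECASE)


def normalize_street_suffix(address: str) -> str:
    """Enforce the convention of always writing 'Ave.'."""
    return _AVENUE.sub("Ave.", address)


CONVENTIONS = {"street_suffix": normalize_street_suffix}


def apply_conventions(model: MetadataModel, source_system: str, record: dict) -> dict:
    """Map one source record to destination fields, applying conventions."""
    out = {}
    for m in model.mappings:
        if m.source_system != source_system or m.source_field not in record:
            continue
        rule = CONVENTIONS.get(m.convention)
        value = record[m.source_field]
        out[m.target_field] = rule(value) if rule else value
    return out
```

Because conventions are attached to mappings and not buried in load scripts, a call such as `model.sources_for("billing_address")` answers the question "where does this value come from?" directly.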

You can’t assume that the new system is the only destination for the migrated data. If the system being replaced provided data for multiple downstream systems (not an unusual situation), then the data you are cleaning up may be a candidate for migrating to these systems as well.

All of this analysis work may sound a lot like developing software. In fact, much of it is the same, except there is more detail and more riding on understanding what all of the data really means. Since data migration always shows up early on the critical path, the work that supports it must also be accurate early.

Secondary Treatment

Data systems that include errors are likely to get new errors every day. Reducing the influx of errors means spending a lot of time dealing with users who don’t have a lot of time to worry about data migration.

Downsizing information technology personnel adds its own challenges. The people who understood the legacy systems and data have moved on, and those who are left are so overworked that they hardly have time to breathe. On the plus side, most people still working with the legacy systems are interested in doing what they can to improve the data’s quality because they understand that good data makes their jobs easier. These people also understand that more new systems are promised than are actually delivered and that the work you are doing with data migration may benefit them over an extended period. You need to find ways to let these people help you make the data better with as much efficiency and as little disruption to their regular work as possible.

In most cases, this work involves generating reports that show both the current legacy data and its expected values, and then either developing code to migrate good data back into the legacy systems that should have it already or providing worksheets that let users examine the data and make the necessary changes directly. You need to coordinate the worksheets with the screens the users will be invoking to make the changes.
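A correction worksheet of the kind described above could be produced with a short script. The following Python sketch assumes that legacy rows and expected values are available as dictionaries sharing a key field. The column names and the blank `corrected_value` column for users to fill in are illustrative choices, not a prescribed layout.

```python
import csv
import io

COLUMNS = ["field", "legacy_value", "expected_value", "corrected_value"]


def build_worksheet(legacy_rows, expected_rows, key, fields):
    """One worksheet row per field where legacy and expected values differ."""
    expected_by_key = {row[key]: row for row in expected_rows}
    worksheet = []
    for legacy in legacy_rows:
        expected = expected_by_key.get(legacy[key])
        if expected is None:
            # No expected value is known; this sketch leaves such records
            # to a separate "unmatched" report.
            continue
        for name in fields:
            if legacy.get(name) != expected.get(name):
                worksheet.append({
                    key: legacy[key],
                    "field": name,
                    "legacy_value": legacy.get(name),
                    "expected_value": expected.get(name),
                    "corrected_value": "",  # left blank for the user
                })
    return worksheet


def to_csv(worksheet, key):
    """Render the worksheet as CSV text that opens in a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[key] + COLUMNS)
    writer.writeheader()
    writer.writerows(worksheet)
    return buf.getvalue()
```

Listing only the fields that differ keeps the worksheet short, which matters when the people reviewing it have little time to spare.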

You also need to determine with the business process owner how the errors were introduced into the system. For example, if sales representatives include dummy site address information to get sales approved faster and commissions credited earlier, it may be necessary for the process owner to curtail this practice. If not, then the data migration work needs to be conscious of orders received after a given date, knowing that the site addresses may be erroneous.

Trash to Steam

In the real world, turning trash into steam sounds like a great idea. Unfortunately, it is also complex, expensive, and subject to regulations, hearings, and even public protests.

The software development equivalent of trash to steam is phased implementation. In phased implementation, data migration is no longer a simple matter of concocting files of relatively good data and loading them into a new system. It becomes a complex enterprise with transactional processing, application modification (both legacy and target), special purpose middleware, and shifting objectives. Developers who have been through the process assert that migrating from system A to system B in phases is likely to be more costly and more difficult than developing A and B combined.

Typically, phased implementation requires that the new system and the old system be run in parallel. This means that migration now becomes a two-way street. If the defined phases include both functional and organizational or geographical steps, you’ll need “retrofits” to bring previously migrated data up to the current standard. You might need requirements for new data and functionality as a consequence of business process changes. If multiple systems are involved, this is a complex undertaking.

Phased implementations are often the result of mergers or acquisitions. In these situations, further complications can arise from the cultural differences between organizations and from differences in business processes, terminology, and even legal requirements. When migrating data to support a new system, none of this is an abstraction. You’ll need to deal with it every day.

Now ask yourself: do you want to leave all of data migration’s complexity in the hands of junior programmers and newly minted managers? Do you still believe that data migration and quality improvement are “not your problem”?
