Described by Martin Fowler in his aptly titled book, Refactoring (Addison-Wesley, 1999), and popularized by Extreme Programming (XP), refactoring is a useful technique to restructure code in a disciplined manner. By making certain small changes to your code that retain its original semantics, you can support new requirements and keep your design as simple as possible, allowing you to safely and easily evolve your code without putting existing functionality at risk.
As opposed to traditional development, in which your design is first modeled in detail (an approach often referred to as "big design up front"BDUF), with refactoring, your system's design can evolve over time with an "emergent" method. Refactoring works very well for programming source code, and refactoring tools such as Instantiations Inc.'s jFactor (www.instantiations.com/jfactor) and IntelliJ's IDEA (www.intellij.com/idea) are now available for Java. However, the general consensus is that refactoring data schemas is much more difficult than refactoring source code.
Impediments to Data Refactoring
Why? Because databases are unfortunately notorious for high coupling. Within
the database, there is coupling between tables via foreign keys to other tables,
and further coupling between tables and database code, such as triggers and
stored procedures, which manipulates those tables. A more significant problem
is the coupling between a database and the systems that access it, including
online applications, batch jobs, reporting applications and data extraction
systems. Within these systems, the same data tables and columns are often accessed
from several parts of the system, again increasing the system's interconnections.
Therefore, a simple data refactoring can trigger a cascade of other refactorings
within your database and source code.
Applying Good Principles
Luckily, good design principles like encapsulation, loose coupling and high
cohesion help to make data refactoring easier. Through their familiarity with
the rules of data normalization, data professionals are acutely cognizant of
coupling and cohesion issues between data entities (see C.J. Date's An Introduction
to Database Systems 7/e, Addison-Wesley, 2001). Unfortunately, the importance
of encapsulation isn't as well recognized by the data community. In fact, in
his book, Foundation for Object/Relational Database Systems (Addison-Wesley,
1998), Date brags that encapsulation is a "non-issue." In reality,
when a database is well encapsulatedperhaps through a persistence framework
such as Castor (http://castor.exolab.org)
or CocoBase (www.thoughtinc.com), by
stored procedures, or even by data objectsit becomes significantly easier
to evolve because coupling between systems and your database is dramatically
reduced.
Political Production
Politics can't be avoided, and control of an organization's data is always a
political hotbed. Data administration groups are often very protective of their
production environments, which is a good thing when managed appropriately. However,
centralized data groups commonly tend toward prescriptive and documentation-intensive
processes that dramatically slow down development and hobble emergent design
efforts. Producing detailed logical or physical data models early in a project
is not very agile and is more appropriate for a BDUF approach. Data administration
groups can aid agile development efforts; they just need to reconsider the way
that they operate.
The Process of Data Refactoring
First, don't get carried away. A development team shouldn't modify a production
database on a whim; first, you should try out ideas in your own development
sandbox, examine the production implications of proposed changes (more on this
later), and consider making those changes in the production database only if
and when it makes sense. To paraphrase Ron Jeffries (www.xprogramming.com),
we're agile, not stupid.
For data refactoring to work, effective data migration and conversion are crucialevery time you change the database schema, you still need to store the same data to maintain the original semantics of your systems. Typically, you'd write scripts to copy the affected data to a secondary location, convert the old schema to the new one and then translate the copied data, so you can write it to the new schema. You actually need two scripts: one that evolves your development database schema and another that will eventually be used to help migrate your production database schema. The production script is an accumulation of the scripts for your development environmenta necessary approach because you can refactor your development environment whenever you need tobut your production environment can be changed only when you release your system into production. You must back up your database before running these scripts, so you can restore your database if required. Yes, writing these scripts is difficult, but with experience, it becomes a much easier task.
While production realities will constrain development, you also need to be realistic. For example, a common refrain of data groups states that the organization has x terabytes of data; therefore, you can't change the production schema because it's so big. I believe this is a red herring because the vast majority of projects, including reporting database/warehouse efforts, will typically deal with only a fraction of the available datathus, even a collection of data refactorings will have a negligible impact. The extent of the modifications required for systems coupled to the changed portions of the schema is the real issueit's not the size of your database; it's how you use it that counts.
Include a DBA on Your Team
By far, the best approach for supporting data refactoring is to have an experienced
database administrator (DBA) as an active team member. A DBA will apply the
data refactorings taking into account any necessary constraints from the current
production databases. Furthermore, she'll help the team through the corporate
database administration processes and reviews required to transition into production.
Naturally, she'll also need to work with the people who apply the supporting
refactorings to the systems that access the changed data schema, as well.
Your DBA should be well versed in your company's database standards, guidelines and chosen tools, so she can apply them effectively. Important standards and guidelines typically focus on naming conventions for tables, columns and stored procedures. Tool compatibility is also important to ensure that your data models are usable by others in your organization and to reduce licensing costs.
Having a DBA on your team is a different approach for organizations accustomed to a centralized data administration group that functions merely as a quality gate to production and interacts with project teams only through reviews. For data refactoring to work, however, data administration groups must actively support development teams as best they can, and developers may have to tolerate a few constraints that reflect the realities that the data administration folks must put up with. When everyone comes to the table willing to work toward an effective solution, it's possible to overcome the object-data divide (see "Crossing the Object-Data Divide," Thinking Objectively, Mar. and Apr. 2000) and minimize constraints without putting your production environment at risk.
Refactoring in Practice
Refactoring a well-designed database schema is easy when you don't need to worry
about the systems that access the database. However, unless you've chosen to
encapsulate data access, you often have extensive coupling between systems and
a database, which makes data refactoring a tough task. Data refactoring works
best when you apply one refactoring at a time, so you can iterate and incrementally
release your work. With the right approach and the right team support, you can
bring the well-known benefits of refactoring to your database as well as to
your code.
A summary of known data refactorings is listed at www.agilemodeling.com/essays/dataRefactorings.htm.
I'd like to thank Charles T. Betz, Ron Jeffries and Chris Roffler for their input, either on the Agile Modeling mailing list or in private, about my work to date in data refactoring. There's more to come.
Required Reading
Scott W. Ambler |