Few readers have the luxury of working with a blank slate when developing their data warehouse/business intelligence environment. Instead, many of us inherit the decisions, and potentially the sins, of our predecessors. Your DW/BI environment would likely look very different if you were to build it from scratch, but a complete toss-and-rebuild is seldom an alternative.
More often, DW/BI professionals are tasked with making evolutionary upgrades and improvements to minimize cost and upheaval of the current analytic environment. These four upgrades can breathe new life into legacy data warehouses.
Conform The Nonconformed Dimensions
The Kimball Group has steadfastly made the case for master conformed dimensions. Conformed dimensions contain the descriptive attributes and corresponding names, meanings and values that have been agreed to across the enterprise. Using conformed dimensions ensures that the data warehouse is delivering consistently defined attributes for labeling, grouping, filtering and integrating data from multiple business processes.
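The payoff of conformance can be sketched in a few lines of code: because two business processes share the same conformed product dimension, their facts can be compared side by side in one report. This is a minimal illustrative sketch; the keys, names and numbers are hypothetical.

```python
# Facts from two business processes, both keyed by the SAME conformed
# product dimension (product_key), so their results can be integrated.
sales = {1: 500, 2: 300}    # product_key -> sales dollars
returns = {1: 20, 2: 45}    # product_key -> return dollars

product_dim = {1: "Widget", 2: "Gadget"}  # conformed product dimension

# Drill-across: combine the two processes on the conformed dimension key.
report = {
    product_dim[k]: {"sales": sales.get(k, 0), "returns": returns.get(k, 0)}
    for k in product_dim
}
# report["Widget"] == {"sales": 500, "returns": 20}
```

Had each process defined its own product dimension with different keys or labels, this merge would require ad hoc reconciliation instead of a simple lookup.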
Unfortunately, many data warehouses/marts were developed without regard for this critical master data. Standalone data stores with independently defined dimensions are often constructed because it’s the path of least resistance when deadlines loom. Rather than attempt to reach consensus on common reference data, isolated teams believe it’s quicker and easier to just build autonomous dimensions. This approach may let these teams declare victory, but it doesn’t support the business’ desire for integration and consistency.
Some organizations wind up with independent data stores because developers purposely focused on delivering a departmental solution, likely due to the funding available. Without the vision and acknowledged need for an enterprise perspective, teams are often chartered to build with blinders on to meet a limited set of goals.
So what do you do if you're confronted with an environment that's been built without a foundation of common conformed dimensions? Can these stovepipes be rescued? In spite of the vendor hype, there's no magic elixir that miraculously delivers master dimensions; technology can facilitate and enable data integration, but there's no silver bullet. The first step toward integration nirvana is to assess the state of the data as well as requirements, expectations and buy-in from the business. You can self-diagnose the issues related to nonconformed dimensions, but keep in mind that you're likely to face a long, uphill internal struggle and resistance to change if the business community doesn't perceive the need or incremental value in the project.
As we described in “Data Stewardship 101” (June 2006), one of the most crucial steps in conforming nonconformed dimensions is to organize the appropriate resources to tackle this vexing problem. Data stewards must be identified and assigned responsibility and authority to determine common dimension attributes, define domain values and transformation business rules, and establish ongoing processes to ensure data quality. Obviously, that’s no small feat, so it’s critical to identify the right leader. Ideally, you want someone from the business community who is respected by senior management and who has knowledge and skills to achieve organizational consensus. Navigating the unavoidable cross-functional challenges requires experience, widespread respect, political acumen and strong communication skills.
Not everyone is cut out to be a data steward. It’s feasible for folks on the data warehouse team to serve as stewards, but they need to demonstrate all the traits and characteristics described above. Most important, they need the support of business management and the authority to push cross-enterprise agreement and adoption, even when unpopular compromise is required. Without this power, stewards face the prospect of endless tire spinning as they try to rationalize diverse perspectives.
Once the data stewards produce the specifications for a conformed master dimension, the skilled extract-transform-load (ETL) staff builds it. Depending on the existence and quality of reference data in the operational source systems, this may require intricate record matching and deduplication. When merging multiple source systems, clearly defined rules of survivorship are needed to identify which data source takes precedence for each attribute.
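Attribute-level survivorship can be sketched simply: for each attribute, consult the sources in an agreed precedence order and keep the first non-null value. The source systems, attributes and precedence rules below are hypothetical examples, not a prescribed design.

```python
# Reference data for one customer from two hypothetical source systems.
crm = {"name": "Ann Smith", "email": None, "phone": "555-0100"}
billing = {"name": "A. Smith", "email": "ann@example.com", "phone": None}

# Survivorship rules: which source takes precedence for each attribute.
precedence = {
    "name": [crm, billing],   # CRM is trusted for the customer name
    "email": [billing, crm],  # billing is trusted for contact email
    "phone": [crm, billing],
}

def survive(attr):
    """Return the first non-null value in precedence order, else None."""
    for source in precedence[attr]:
        if source.get(attr) is not None:
            return source[attr]
    return None

master = {attr: survive(attr) for attr in precedence}
# master == {"name": "Ann Smith", "email": "ann@example.com",
#            "phone": "555-0100"}
```

In practice these rules are defined by the data stewards, not the ETL team, and real record matching usually precedes this step.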
With master dimensions built, it’s then time to retrofit the existing data warehouses/marts with standardized data. New surrogate key mapping tables for each dimension are used to recast existing fact table rows. In addition, aggregate summary tables and cubes will likely need to be reconstructed. While the implications for the ETL system are inevitably significant, hopefully the impact on the business intelligence layer of these underlying physical table changes can be minimized with an abstraction layer using views, synonyms or your BI tool’s metadata, depending on your platforms.
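The recasting step works like this sketch: a surrogate key mapping table translates each fact row's operational natural key into the new master surrogate key. Column names and data are hypothetical.

```python
# Surrogate key mapping table: operational natural key -> new surrogate.
key_map = {"CUST-0042": 1, "CUST-0107": 2}

# Existing fact rows, still carrying the operational natural key.
fact_rows = [
    {"customer_nk": "CUST-0042", "amount": 120.0},
    {"customer_nk": "CUST-0107", "amount": 75.5},
]

# Recast each fact row to reference the new surrogate key instead.
recast = [
    {"customer_key": key_map[row["customer_nk"]], "amount": row["amount"]}
    for row in fact_rows
]
# recast[0] == {"customer_key": 1, "amount": 120.0}
```

In a real ETL system this is a bulk join against the mapping table rather than row-at-a-time lookups, and the aggregates built on the fact table must be rebuilt afterward.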
Create Surrogate Keys
Another data warehousing best practice is to create surrogate keys–typically a meaningless simple integer–for the primary key in each dimension table. The associated fact table rows use this same surrogate as a foreign key reference to the dimension tables.
Establishing, managing and using surrogate keys may initially feel like an unnecessary burden on the extract, transformation and load system, so many data warehouses have been constructed based on the operational natural keys, sometimes referred to as smart keys because of their embedded meaning, rather than using surrogates. At first blush, these natural keys may not present any obvious problems. But as the environment matures, teams often wish things had been done differently when the following situations arise:
- The natural keys in the operational system get recycled after a product has been discontinued or an account closed for more than a specified length of time. Two years of dormancy may seem like a lifetime to an operational system, but relying on reassigned natural keys can wreak havoc in the data warehouse where data is retained for extended periods.
- The business decides it’s important to track dimension attribute changes after all. While this need may not have been envisioned up front, as the business develops expertise with the data, team members often want to see the impact when descriptive dimension attributes change. Naturally gravitating to the least onerous route, they may have developed the initial data warehouse with the premise that dimension tables would reflect the most current attribute values, overwriting any previous descriptors when they change (commonly referred to as slowly changing dimension type 1 technique). But the rules may now be maturing, requiring the insertion of new rows into the dimension table to capture new profiles (via slowly changing dimension type 2 methods). Relying on the natural key as the primary key of the dimension table obviously doesn’t allow for multiple profile versions, while using a concatenated key based on the natural key and effective dates has a negative impact on both query performance and usability.
- Performance has been negatively impacted by an inefficient natural key. In contrast to a tight integer, natural keys are often bulky, alphanumeric fields that result in suboptimal dimension/fact table join performance and unnecessarily large indexes.
- You must create conformed dimensions to integrate reference data from multiple sources, each with its own unique natural key. Alternatively, you may need a default dimension primary key to represent the condition when a dimension value is unknown or not applicable to a given measurement event.
Each of these situations can be addressed by using surrogate keys as the primary keys for the dimension tables. So how do you implement surrogates when you’re already working with a system that was built without them? In the simplest scenario, you would add the sequential surrogate key to the dimension table, leaving the existing natural key intact as an attribute, and recast the fact rows to reference the new surrogate key values. Depending on your BI tool, you may need to update the tool’s metadata to reflect the change in join fields. If you need to support type 2 attribute change tracking, additional work would be required to source the historical attribute profiles and then repopulate the fact table with the surrogate key values that were valid and in effect when the fact row occurred.
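When type 2 history exists, one natural key maps to several surrogate keys, each valid over a date range, so repopulating the fact table means picking the surrogate that was in effect on each fact row's date. This sketch assumes hypothetical keys and dates; a production ETL tool would do this as a range join, not a per-row function call.

```python
from datetime import date

# Type 2 dimension history: natural key -> list of
# (surrogate_key, valid_from, valid_to) profile versions.
dim_history = {
    "PROD-9": [
        (10, date(2020, 1, 1), date(2021, 6, 30)),
        (11, date(2021, 7, 1), date(9999, 12, 31)),  # current profile
    ]
}

def surrogate_for(natural_key, txn_date):
    """Return the surrogate key in effect on the fact row's date."""
    for sk, start, end in dim_history[natural_key]:
        if start <= txn_date <= end:
            return sk
    raise LookupError(f"no profile for {natural_key} on {txn_date}")

# A fact dated before the attribute change gets the old surrogate;
# a later fact gets the new one.
old_sk = surrogate_for("PROD-9", date(2021, 3, 15))  # -> 10
new_sk = surrogate_for("PROD-9", date(2022, 1, 1))   # -> 11
```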
Deliver The Details
Some people believe dimensional models are only appropriate for summarized information. They maintain that dimensional structures are intended for managerial, strategic analysis and therefore should be populated with summarized data, not operational detail. Kimball Group vehemently disagrees; in our view, dimensional models should be populated with the most detailed, atomic data captured by the source systems so business users can ask the most detailed, precise questions possible. Even if users don’t care about the particulars of a single transaction or subtransaction, their question of the moment may require rolling up or drilling down in these details in unpredictable ways. Of course, database administrators may opt to presummarize information to avoid on-the-fly summarization in every case, but these aggregate summaries are performance-tuning complements to the atomic level, not replacements.
Restricting your dimensional models to summarized information imposes serious limitations. Summary data naturally presupposes the typical business questions. When business requirements change, as they inevitably will, both the data model and ETL system must change to accommodate new data. Summary data alone also limits query flexibility. Users run into dead ends when the presummarized data can’t support an unanticipated inquiry. While you can roll up detailed data in unpredictable ways, the converse is not true; you can’t magically explode summary data into its underlying components. Delivering dimensional models populated with detailed data ensures maximum flexibility and extensibility.
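The asymmetry is easy to demonstrate: atomic rows can be rolled up along any attribute on demand, but no computation can recover the detail from a summary. The transaction rows below are hypothetical.

```python
from collections import defaultdict

# Atomic, transaction-level fact rows.
details = [
    {"day": "2024-01-01", "product": "A", "qty": 2},
    {"day": "2024-01-01", "product": "B", "qty": 1},
    {"day": "2024-01-02", "product": "A", "qty": 5},
]

def rollup(rows, key):
    """Summarize qty by ANY attribute -- only possible with atomic data."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["qty"]
    return dict(totals)

by_day = rollup(details, "day")          # {'2024-01-01': 3, '2024-01-02': 5}
by_product = rollup(details, "product")  # {'A': 7, 'B': 1}
# Given only by_day, there is no way to reconstruct by_product:
# the summary has discarded the product detail.
```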
So what do you do if you’ve inherited a data warehouse filled with preaggregated information, but the details are missing in action? The solution is straightforward: you need to source and populate the bedrock atomic detail. Since more detailed data is naturally more dimensional, this will almost inevitably require that new dimension tables be constructed as well.
As you’re delivering the details, don’t be lulled into thinking that summary data should be dimensionalized but that atomic details are better handled in a normalized schema. Business users need the ability to seamlessly and arbitrarily traverse up, down and across both the detailed and summary data in a single interface. While normalization may save a few bytes of storage space, the resulting complexity creates navigational challenges and typically slower performance for BI reports and queries.
Consolidate Redundant Data Stores
Many organizations take a piecemeal approach to their data warehouse design and deployment, so it's common for the same performance metrics to reside in numerous analytic environments, often segregated by business department or function. The multiple, uncoordinated extracts from the same operational source systems required to populate these redundant data stores are inefficient and wasteful. Variations of similar-but-different information result from inconsistent business rules and naming conventions, causing unnecessary confusion and requiring reconciliation throughout the enterprise. Obviously, conflicting databases that perpetuate incompatible views of the organization are distracting resource drains.
Take a more enterprise-centric approach, storing core performance metrics once to support the analytic needs of multiple departments or groups. The enterprise framework is defined and communicated via the data warehouse bus matrix. The matrix rows represent the business events or processes of the organization, while the columns reflect the common, conformed dimensions. The bus matrix provides the macro point of view for architecting the data warehouse environment, regardless of database or technology preferences.
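As a data structure, the bus matrix is just a mapping from each business process (a matrix row) to the conformed dimensions it uses (the columns). This tiny sketch, with hypothetical processes and dimensions, shows how the matrix exposes the integration points shared across processes.

```python
# Bus matrix: business process (row) -> conformed dimensions used (columns).
bus_matrix = {
    "Retail Sales": {"Date", "Product", "Store", "Customer"},
    "Inventory": {"Date", "Product", "Store"},
    "Purchase Orders": {"Date", "Product", "Supplier"},
}

# Dimensions shared by every process: the conformance points that
# enterprise-wide drill-across analysis depends on.
shared = set.intersection(*bus_matrix.values())
# shared == {"Date", "Product"}
```

A real bus matrix is usually maintained as a spreadsheet for communication with the business, but the same row/column view drives architectural decisions.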
But what do you do if you’ve discovered that your organization’s key performance metrics and indicators are currently available in five, 10 or more different analytic environments? The first step is to assess the damage. You can create a detailed implementation bus matrix, described in our column “The Matrix: Revisited” (December 2005), to gather information and document your current state. With these details in hand, you’re ready to help senior executives understand the quagmire resulting from uncontrolled, standalone development. Since they’re often the ones demanding more information (and by the way, they want it delivered yesterday), it’s important that they comprehend the consequences of piecemeal development, which can significantly hinder the organization’s decision-making capabilities.
If the business community appreciates the inefficiencies and obstacles of the current environment, it is much more likely to support a migration strategy to reduce the unnecessary redundancies. Unfortunately, this rationalization often has serious ramifications for the BI application layer when the underlying data stores are removed or consolidated.
Face The Realities
We’ve described four of the most common data fixes for more mature warehouses. It’s worth noting that regardless of the opportunities for improvement with your existing environment, it’s important to evaluate the costs versus benefits of taking any corrective action. It’s similar to the decision process when confronted with maintenance of an aging automobile; sometimes it makes sense to spend the money, but other times you may opt to merely live with the dents or dump the heap of junk altogether.
Keep your eyes open for occasions when the impact of the corrective action is either less noticeable or less painful. For example, if your organization is implementing a new operational source system or migrating to a new ETL or BI tool platform, the upheaval to your existing transformation and/or analytic processes presents an opportunity for other corrections. Using the automobile maintenance analogy, if your car is already in the shop and the mechanic has lifted the hood to change the oil, they might as well check the windshield wiper fluid while they're in there.
More than 100 free articles on dimensional modeling are available at the Kimball Group website, where you can also sign up to receive the monthly Kimball Design Tips.