Whether you are developing a new dimensional data warehouse or replacing an existing environment, the ETL (extract, transform, load) implementation effort is inevitably on the critical path. Difficult data sources, unclear requirements, data quality problems, changing scope, and other unforeseen problems often conspire to put the squeeze on the ETL development team. It simply may not be possible to fully deliver on the project team’s original commitments; compromises will need to be made. In the end, these compromises, if not carefully considered, may create long-term headaches.
In my last article on “Six Key Decisions for ETL Architectures,” I described the decisions ETL teams face when implementing a dimensional data warehouse. This article focuses on three common ETL development compromises that cause most of the long-term problems around dimensional data warehouses. Avoiding these compromises will not only improve the effectiveness of your ETL implementation, but will also increase the likelihood of overall DW/BI success.
Compromise 1: Neglecting Slowly Changing Dimension Requirements
Kimball Group has written extensively on slowly changing dimension (SCD) strategies and complementary implementation alternatives. It’s critical that the ETL team embrace SCDs as a key strategy early in the initial implementation process. A common compromise is to put off to the future the effort required to properly support SCDs, especially Type 2 SCDs, where dimension changes are tracked by adding new rows to the dimension table. The result is often a total rework disaster.
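To make the Type 2 mechanics concrete, the sketch below shows the core decision an ETL process makes for each incoming source record: overwrite Type 1 attributes in place, but expire the current row and insert a new surrogate-keyed row when a Type 2 attribute changes. It is a minimal illustration in Python rather than a specific ETL tool; the customer dimension, its column names, and the attribute classifications are hypothetical.

```python
from datetime import date

# Hypothetical attribute classifications for a customer dimension.
TYPE2_ATTRIBUTES = {"customer_segment", "postal_code"}  # changes create a new row
TYPE1_ATTRIBUTES = {"email"}                            # changes overwrite in place

def apply_scd_change(dimension_rows, incoming, next_surrogate_key, as_of=None):
    """Apply one incoming source record to an in-memory Type 1 / Type 2 dimension."""
    as_of = as_of or date.today()
    current = next(
        (row for row in dimension_rows
         if row["customer_id"] == incoming["customer_id"] and row["is_current"]),
        None,
    )
    if current is None:
        # New dimension member: insert its first version.
        dimension_rows.append({"customer_key": next_surrogate_key, "is_current": True,
                               "effective_date": as_of, "expiration_date": date.max,
                               **incoming})
        return next_surrogate_key + 1

    # Type 1 attributes: overwrite on the current row, no history kept.
    for attr in TYPE1_ATTRIBUTES:
        current[attr] = incoming[attr]

    # Type 2 attributes: expire the current row and add a new row with a new key.
    if any(current[attr] != incoming[attr] for attr in TYPE2_ATTRIBUTES):
        current["is_current"] = False
        current["expiration_date"] = as_of
        dimension_rows.append({"customer_key": next_surrogate_key, "is_current": True,
                               "effective_date": as_of, "expiration_date": date.max,
                               **incoming})
        return next_surrogate_key + 1
    return next_surrogate_key

# Example: a segment change produces a second row for the same customer_id.
dim = []
k = apply_scd_change(dim, {"customer_id": "C1", "customer_segment": "Retail",
                           "postal_code": "60601", "email": "a@b.com"}, 1)
k = apply_scd_change(dim, {"customer_id": "C1", "customer_segment": "Wholesale",
                           "postal_code": "60601", "email": "a@b.com"}, k)
assert len(dim) == 2 and dim[0]["is_current"] is False
```

The essential point is the new surrogate key assigned on every Type 2 change; it is exactly these keys, and the history they preserve, that are missing when the technique is deferred.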
Deferring the implementation of proper SCD strategies does save ETL development time in the immediate phase. But as a result, the implementation embraces only Type 1 SCDs, where all history in the data warehouse is associated with current dimension values. Initially, this seems to be a reasonable compromise. However, it’s almost always more difficult to “do it right” when you have to circle back in a later phase. The unfortunate realities are that:
- Following a successful initial implementation, the team faces pressure to roll out new capabilities and additional phases without time to revisit prior deliverables and add the required change-tracking capabilities. Thus, the rework ultimately required to support SCD requirements continues to expand.
- Once the ETL team finally has the bandwidth to address SCDs, the ugly truth becomes apparent. Adding SCD Type 2 capabilities into the historical data requires rebuilding every dimension that contains Type 2 attributes; each of these dimensions must be rekeyed with new surrogate primary keys that reflect the historically appropriate Type 2 rows. Rebuilding and rekeying even one core conformed dimension unavoidably requires reloading all impacted fact tables due to the new dimension key structures (see the sketch following this list).
- Facing a possible rebuild of much of the data warehouse environment, many organizations back away from the effort. Rather than reworking the existing historical data to restate the dimension and fact tables in their correct historical context, they implement the proper SCD strategies only from a point in time forward. By compromising on proper SCD techniques in the initial development effort, the organization loses what may be years of important historical context.
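The reload burden described above is easy to underestimate. The short sketch below, which assumes the historical fact rows can be re-derived with their source natural key and transaction date, shows why: every fact row must be matched against the rebuilt dimension’s effective-date ranges to pick up its new surrogate key, so every impacted fact table has to be reprocessed. Table and column names are hypothetical and continue the customer example above.

```python
def rekey_fact_rows(fact_rows, dimension_rows):
    """Reassign fact foreign keys against a rebuilt, rekeyed Type 2 dimension.

    Each historical fact row is matched to the dimension version whose
    effective/expiration window covers the fact's transaction date; this
    row-by-row reassignment is what forces a full reload of impacted facts.
    """
    for fact in fact_rows:
        version = next(
            row for row in dimension_rows
            if row["customer_id"] == fact["customer_id"]
            and row["effective_date"] <= fact["transaction_date"] < row["expiration_date"]
        )
        fact["customer_key"] = version["customer_key"]
    return fact_rows
```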
Compromise 2: Failing to Embrace a Metadata Strategy
DW/BI environments spin off copious amounts of metadata. Business metadata, process metadata, and technical infrastructure metadata all need to be vetted, captured, and made available. The ETL processes alone generate significant amounts of metadata.
Unfortunately, many ETL implementation teams do not embrace metadata early in the development process, putting off its capture to a future phase. This compromise typically is made because the ETL team does not “own” the overall metadata strategy. In fact, in the early stages of many new implementation efforts, it’s not uncommon for there to be no designated owner of the metadata strategy.
Lack of ownership and leadership makes it easy to defer dealing with metadata, but that’s a short-sighted mistake. Much of the critical business metadata is identified and captured, often in spreadsheet form, during the dimensional-modeling and source-to-target mapping phases. What’s more, most organizations use ETL tools to develop their environment, and these tools have capabilities to capture the most pertinent business metadata. Thus, the ETL development phase presents an opportune, and often squandered, moment to capture richly described metadata. Instead, the ETL development team captures only the information required for its development purposes, leaving valuable descriptive information on the cutting room floor. Ultimately, in a later phase, much of this effort ends up being redone in order to capture the required information.
At a minimum, the ETL team should strive to capture the business metadata created during the data-modeling and source-to-target mapping processes. Most organizations find it valuable to focus initially on capturing, integrating, flowing, and, ultimately, surfacing the business metadata through their BI tool; other metadata can be integrated over time.
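One low-cost way to keep this metadata from evaporating is to record each source-to-target mapping as a structured entry rather than a throwaway spreadsheet row, so it can later be loaded into whatever metadata repository or BI tool the organization settles on. The sketch below shows one hypothetical structure; the field names and JSON export are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ColumnMapping:
    """One row of a source-to-target map, kept as structured business metadata."""
    source_system: str
    source_table: str
    source_column: str
    target_table: str
    target_column: str
    transformation_rule: str      # e.g., "decode segment code using reference table"
    business_definition: str      # plain-language definition to surface in the BI tool
    data_steward: str
    scd_type: int = 1             # 1 = overwrite, 2 = track history with new rows

mappings = [
    ColumnMapping(
        source_system="CRM",
        source_table="CUST_MASTER",
        source_column="CUST_SEG_CD",
        target_table="dim_customer",
        target_column="customer_segment",
        transformation_rule="decode segment code using reference table",
        business_definition="Marketing segment assigned to the customer",
        data_steward="Customer data steward",
        scd_type=2,
    ),
]

# Persist alongside the ETL code so the metadata survives past development.
print(json.dumps([asdict(m) for m in mappings], indent=2))
```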
Compromise 3: Not Delivering a Meaningful Scope
The ETL team is often under the gun to deliver results under tight time constraints. Compromises must be made. Reducing the scope of the initial project can be an acceptable compromise. If, for example, a large number of schemas were included in the initial scope, one time-honored solution is to break that effort up into several phases. It’s a reasonable, considered compromise, assuming the DW/BI project team and sponsors are all fully, if grudgingly, on board.
But it’s a problem when the ETL team makes scope compromises without proactively communicating with the DW/BI project team and sponsors. Clearly, this is a recipe for failure and an unacceptable compromise.
This situation is often a symptom of deeper organizational challenges. It can start innocently enough, with shortcuts taken under pressure in the heat of the moment. In retrospect, however, these compromises would never have been made in the full light of day. In an effort to meet overly ambitious deadlines, the ETL team might fail to handle data quality errors uncovered during the development process, fail to properly support late-arriving data, neglect to fully test all ETL processes, or perform only cursory quality assurance checks on loaded data. These compromises lead to inconsistent reporting, an inability to tie back to existing environments, and erroneous data, and they often result in a total loss of confidence among business sponsors and users. The outcome can be total project chaos and failure.
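Several of these shortcuts are cheap to avoid. As a rough illustration, the sketch below shows the kind of post-load check that is often skipped under deadline pressure: a row-count reconciliation against the source extract and a scan for fact rows whose dimension keys match nothing, a common symptom of mishandled late-arriving data. The table structures and thresholds are hypothetical.

```python
def reconcile_row_counts(source_count, loaded_count, tolerance=0):
    """Fail loudly if the loaded fact rows do not match the source extract."""
    if abs(source_count - loaded_count) > tolerance:
        raise ValueError(
            f"Row count mismatch: extracted {source_count}, loaded {loaded_count}"
        )

def find_orphan_facts(fact_rows, dimension_rows, key="customer_key"):
    """Return fact rows whose dimension key has no matching dimension row,
    often a sign of late-arriving dimension data that was not handled."""
    known_keys = {row[key] for row in dimension_rows}
    return [fact for fact in fact_rows if fact[key] not in known_keys]

# Example: run both checks after a nightly load (in-memory rows for illustration).
dims = [{"customer_key": 1}, {"customer_key": 2}]
facts = [{"customer_key": 1, "amount": 10.0}, {"customer_key": 99, "amount": 5.0}]
reconcile_row_counts(source_count=2, loaded_count=len(facts))
orphans = find_orphan_facts(facts, dims)
assert len(orphans) == 1  # the row pointing at customer_key 99 has no dimension match
```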
Make Compromises Openly and Honestly
Compromises may be necessary. The most common concession is to scale back an overly ambitious project scope, but key stakeholders need to be included in that decision. Other, less intrusive changes can also be considered, such as reducing the number of years of back history used to seed a new environment, reducing the number of dimension attributes or metrics required in the initial phase (while being careful about SCD Type 2 requirements), or reducing the number of source systems integrated in the initial phase. Just keep everyone informed and on the same page. The key is to compromise in areas that do not put the long-term viability of the project at risk.