What does it take to develop a robust dimensional model? Here’s how to get from requirements-gathering to final approval in a process that will ferret out the good, bad and ugly realities of your source data and help you avoid surprises, delays and cost overruns.
Kimball Group has written more than 250 Intelligent Enterprise columns and Kimball Design Tips about dimensional modeling techniques, but we haven’t written much about the dimensional modeling process. What are the tasks and deliverables required to create a robust design? This article presents best practices for designing a dimensional model.
Before embarking on a dimensional design project, you need a solid understanding of the business’s requirements along with a reasonable assessment of the underlying source data. It’s tempting to skip the requirements review, but resist this urge as doing so increases the risk of developing a source-driven model that falls short of business needs in numerous small but significant ways. Optimally, the requirements have been thoroughly researched and documented in a user-approved requirements finding document with the top priority needs clearly identified. This deliverable often also includes a preliminary data warehouse bus matrix. (More details about the bus matrix, a critical data architecture planning tool, are available in an Intelligent Enterprise column, “The Matrix: Revisited.”)
In addition to the business requirements and bus matrix, you’ll also want to review any source data profiling insights uncovered by the project team to date. The final item on your required-reading list is your organization’s naming convention standards document. If you don’t already have naming conventions to adapt for data warehousing and business intelligence, you’ll need to establish them as you develop the dimensional model.
Join the Party
Everyone knows that the key to a successful party is inviting the right people. The same holds true for the dimensional modeling festivities. Unfortunately, too many data modelers view the development of a dimensional model as a solo, independent activity; they retreat to their ivory cubicle to ponder the pros and cons of modeling alternatives on their own, emerging weeks later to unfurl their masterpiece for others to review. While the data modeler should be leading the charge and retain primary responsibility for the deliverables, the best way to arrive at a sound dimensional design is through a collaborative team effort since no one person likely has detailed knowledge of both the business requirements and source system idiosyncrasies.
The modeling team should also include people who can accurately represent the business users’ analytic needs, such as the business analyst, power user, BI application developer, or all of the above. In particular, power user involvement is valuable as they’ve likely already identified business rules to convert source data into more meaningful information for decision making; their insights often result in a richer, more analytically complete design. The designated data stewards should be involved in the modeling process to drive to organizational agreement on names, definitions, and business rules (If you haven’t already established a data stewardship program, there’s no time like the present to do so).
It’s also beneficial to have some source experts at least intermittently involved in the process to quickly answer questions and resolve issues regarding data timing and content nuances. Finally, you should invite members of the ETL team to the party to gain early insights about the model and its source-to-target mapping. ETL team members typically have more to gain than contribute to the modeling process, but getting their buy-in to the design saves valuable time and avoids tire-spinning down the road. For the same reason, you should also involve the DBAs who will implement the physical design.
Don’t attempt to schedule all-day design meetings. Instead, you should structure 2 1/2-hour morning and afternoon sessions so the lead modeler has time to update the documentation before the next meeting (and so team members can deal with the demands of their day jobs). Rather than asking team members to commit 5 days per week to the design sessions, you should schedule the sessions for 3 days per week, such as Tuesday-Thursday. You’re more likely to have consistent participation if the schedule is not overly aggressive. Consistent participation is critical to avoid rehashing design decisions.
With the team assembled and prerequisite reading completed, it may be appropriate to briefly introduce dimensional modeling so everyone on the team understands its core principles (and appreciates that denormalization is not always evil). The team’s first objective is to reach agreement on a high-level model diagram, or bubble chart, as illustrated above. The bubble chart represents a fact table corresponding to a single business process; it includes a clearly articulated grain declaration and identification of core dimensions. The bubble chart can often be derived largely from the preliminary bus matrix. It serves to get everyone on the same page regarding the scope of the model without getting unnecessarily mired in the details. Because it’s easy-to-understand, the high-level diagram also facilitates discussion between the design team and business partners.
Dive Into the Details
With consensus on the bubble chart, the team then launches into table-by-table and column-by-column discussion, drilling into more detail about the requisite attributes and metrics, including definitions, transformation rules and data-quality concerns. Your initial sessions should focus on a more straightforward dimension table so the team experiences a quick win before dealing with the more controversial dimensions. The dimensional model will unfold through a series of design sessions, with each pass producing a more detailed design that’s been repeatedly tested against your understanding of the business needs.
As you’re iteratively fleshing out the design, team members will need to do more interrogation of the data. Thus, ongoing access to a profiling tool (or less sophisticated method) will be critical. Ferreting out the good, bad and ugly realities of your source data during the design process will minimize the surprises and corresponding overruns during the ETL design and development activity.
Throughout the design sessions, the data modeler or designated scribe should be filling in a detailed worksheet for each table with information such as the attribute or fact name, column description, sample values and decodes, change tracking rules for dimension attributes, and preliminary transformation business rules. These worksheets form the basis for the source-to-target mapping, which is further embellished by the physical designer and then ultimately handed off to the ETL team. In addition to the detailed worksheets, the design team scribe should also be logging open issues so they’re captured in a single document to facilitate review and assignment at the end of every session.
Review the Results
The last phase of the modeling process involves reviewing and validating the model with interested parties, starting with the project team, extending to those in IT with intimate knowledge of the source systems, and concluding with the broader business community. Approval from this last group is critical before time and money is invested in the data implementation. Plan to devote much of this review to illustrating how the model will address sample questions from the requirements findings.
A typical design effort usually takes three to four weeks for a single business process dimensional model, but the time required will vary depending on the complexity of the business process, availability of preexisting conformed dimensions, experience of the modeling team, existence of well-documented business requirements, and the difficulty reaching consensus.
Designing a dimensional model with interested parties representing diverse skills requires commitment and cooperation, but the end result is a robust model that has been rigorously tested against both the business needs and data realities. That’s exactly what you want before you move ahead with the implementation.