
In most cases, metadata is a neglected area of the DW/BI system; however, an increasing number of DW/BI teams have made positive strides in delivering business metadata to their users. This Design Tip looks beyond the business metadata to suggest several opportunities for leveraging ETL process metadata to improve data warehouse operations. The goal is to apply dimensional modeling principles and leverage your BI tool’s capabilities to enable a suite of scheduled, exception-driven BI applications that proactively alert DW/BI team members to inconsistencies in the data warehouse environment before they surface to users.

As Warren Thornthwaite described in The Data Warehouse Lifecycle Toolkit, metadata is the DNA of the data warehouse, defining its elements and how they work together. Metadata drives the warehouse and provides flexibility by buffering the various components of the system from each other. Metadata lives in various repositories created by the tools, programs, and utilities that make the DW/BI system work. There are three key categories of metadata:

  • Business metadata describes the contents of the data warehouse in more user-accessible terms. It identifies what data you have, where it comes from, what it means, and what its relationship is to other data in the warehouse. The display name and content description fields are basic examples of business metadata. Business metadata often serves as documentation for the data warehouse. When users browse the metadata to see what’s in the warehouse, they are primarily viewing business metadata. Our recommended metadata approach, described in Warren’s Design Tip #75, suggests concentrating on business metadata first, making sure it is correct, complete, maintained, and accessible to the business users.
  • Technical metadata defines the objects and processes that make up the DW/BI system from a technical perspective. This includes the system metadata that defines the data structures themselves, such as tables, fields, data types, indexes, and partitions in the relational engine, and the databases, dimensions, measures, and data mining models in the OLAP engine. In the ETL process, technical metadata defines the sources and targets for a particular task, the transformations (including business rules and data quality screens), and their frequency. Technical metadata does the same kinds of things in the front room; it defines the data model and how it is to be displayed to the users, along with the reports, schedules, distribution lists, and user security rights.
  • Process metadata describes the results of various operations in the warehouse. In the ETL process, each task logs key data about its execution, such as start time, end time, CPU seconds used, disk reads, disk writes, and rows processed. Similar process metadata is generated when users query the warehouse. This data is initially valuable for troubleshooting the ETL or query process. After people begin using the system, this data is a critical input to the performance monitoring and improvement process.

Process metadata presents a number of interesting opportunities for the DW/BI team to more proactively manage the data warehouse environment. Three particularly interesting types of process metadata include:

  • ETL operations statistics generated by the workflow monitor, including start times, end times, durations, CPU utilization, incoming and loaded row counts, buffer utilization, and fail-over events.
  • Error events captured via the data quality architecture described by Ralph Kimball in his seminal article.
  • Query statistics and aggregate usage statistics captured by the BI tool or query monitor.

Each of these process metadata categories can easily be captured in most data warehouse environments. Imagine applying our knowledge of dimensional modeling to these different types of metadata. We can quickly envision at least one schema for each type of process metadata, each centered on a fact table capturing every detailed measurement.

The ETL operations statistics will result in one fact table row for each ETL process run with its associated metrics; this fact table might be surrounded by dimensions that describe the date, time, ETL process, source system, target schema and job status. In Ralph’s aforementioned article, he describes an error event fact table that captures detailed metadata about each error encountered during the ETL processing. This fact table links to dimensions such as date, time, ETL process, source table/column, and target table/column. Likewise we can imagine a similar schema for the query statistics metadata. Once we’ve designed and populated these schemas, we can begin to leverage this metadata.
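As a concrete illustration, the ETL operations star described above could be sketched as follows. The table and column names here are hypothetical, chosen only to match the grain (one fact row per ETL process run) and the dimensions named in the text; SQLite is used purely to keep the sketch self-contained.

```python
import sqlite3

# Hypothetical sketch of the ETL operations statistics star schema;
# names are illustrative, not a prescribed standard.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE etl_process_dim (
    etl_process_key INTEGER PRIMARY KEY,
    process_name    TEXT,
    source_system   TEXT,
    target_schema   TEXT
);
CREATE TABLE etl_ops_fact (          -- one row per ETL process run
    date_key        INTEGER,
    time_key        INTEGER,
    etl_process_key INTEGER REFERENCES etl_process_dim (etl_process_key),
    job_status      TEXT,
    duration_sec    REAL,
    cpu_sec         REAL,
    rows_incoming   INTEGER,
    rows_loaded     INTEGER
);
""")
conn.execute("INSERT INTO etl_process_dim VALUES (1, 'load_orders', 'ERP', 'dw')")
conn.execute("INSERT INTO etl_ops_fact VALUES "
             "(20240105, 330, 1, 'SUCCEEDED', 730.0, 512.4, 180432, 180432)")
row = conn.execute("""
    SELECT d.process_name, f.duration_sec
    FROM etl_ops_fact f JOIN etl_process_dim d USING (etl_process_key)
""").fetchone()
print(row)  # → ('load_orders', 730.0)
```

The error event and query statistics schemas follow the same pattern, swapping in their own dimensions and measures.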

For example, we can use the BI tool available in our technology stack to evaluate each ETL job’s operations statistics against its recent history. We’ll not only report on the operations statistics, but also leverage the BI tool’s exception reporting capabilities. Perhaps we’d like to see all ETL jobs whose durations or row counts fall more than two standard deviations above or below their historical norms. We’re looking for ETL jobs that have most likely encountered some sort of processing error, causing more or less work than normal. Clearly we’ll want to investigate these outlier situations and take corrective action.
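A minimal version of this exception test can be expressed in a few lines. The two-standard-deviation threshold and the seven-run history are assumptions drawn from the discussion above, not fixed rules; in practice the BI tool would evaluate this against the ETL operations fact table.

```python
from statistics import mean, stdev

def is_outlier(history, current, threshold=2.0):
    """Flag a run whose metric deviates more than `threshold` standard
    deviations from its recent history (illustrative rule of thumb)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > threshold * sigma

# Recent durations (seconds) for a hypothetical nightly load:
recent = [610, 595, 640, 620, 605, 630, 615]
print(is_outlier(recent, 622))    # a typical run → False
print(is_outlier(recent, 1400))   # likely a processing problem → True
```

The same function applies unchanged to row counts or CPU seconds; only the metric fed into it differs.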

Similarly, we can use our BI tool against the error event schema. One of the advantages of the error event schema is that it provides a basis for managing data quality initiatives over time. The dimensionality of the error event data allows us to study the evolution of data quality by source, software module, key performance indicator, and type of error. In addition, we can develop BI analyses that evaluate the daily error events and use them to identify other types of ETL operations challenges, similar to those we described against the ETL operations schema. The error event schema allows the team to identify soft errors, where data is not manifestly wrong but falls outside the expected variation around the mean. We encapsulate the data from the error event fact table and publish it all the way to the BI layer in the form of the audit dimension. The audit dimension elevates metadata to being real data.
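To illustrate the kind of trending the error event schema enables, the sketch below rolls up hypothetical error event rows by source table and error type. The event tuples and names are invented for the example; in practice these would come from the error event fact table and its dimensions.

```python
from collections import Counter

# Invented error event rows: (load_date, source_table, error_type)
error_events = [
    ("2024-01-05", "orders",    "null_violation"),
    ("2024-01-05", "orders",    "out_of_range"),
    ("2024-01-05", "customers", "null_violation"),
    ("2024-01-06", "orders",    "null_violation"),
]

# Trend error counts by source table and error type over time:
by_source_and_type = Counter((src, err) for _, src, err in error_events)
print(by_source_and_type[("orders", "null_violation")])  # → 2
```

Grouping instead by load date gives the day-over-day quality trend; both are simple pivots on the same fact rows.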

In order to leverage our query statistics metadata, we schedule a standard set of test BI queries to run against the presentation area upon completion of the ETL processes. The query stats fact table is populated with the results of this standard suite of test queries, building up a history of performance statistics. We can then build another similar set of BI analyses against this schema, looking for any queries that run more than two standard deviations faster or slower today than they have over the last 90 days. This will help us identify empty tables, missing indexes, and aggregate table opportunities.
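The scheduled test suite might look like the following sketch, where `run_query` is a placeholder for whatever executes SQL against your presentation area. The dictionaries it returns are the rows destined for the query stats fact table.

```python
import time

def run_test_query_suite(queries, run_query):
    """Time each canned test query and return rows for the query stats
    fact table (sketch; `run_query` stands in for your query executor)."""
    stats = []
    for name, sql in queries.items():
        start = time.perf_counter()
        run_query(sql)
        stats.append({"query_name": name,
                      "duration_sec": time.perf_counter() - start})
    return stats

stats = run_test_query_suite(
    {"daily_sales": "SELECT 1"},      # stand-in for a real test query
    run_query=lambda sql: None)       # stand-in executor
print(stats[0]["query_name"])  # → daily_sales
```

The recorded durations then feed the same two-standard-deviation exception test used for the ETL operations statistics.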

From basic process metadata, we can develop capabilities that enable our data warehouse to be self-monitoring. We can package the BI analyses developed against our process metadata schemas and schedule them to run at appropriate points during the ETL processing. Should any of our exception conditions be triggered, the BI tool can fire off an alert to the appropriate person on the DW team. Far better to learn of a bad load or a missing index from our monitoring environment than from a slew of hostile emails and phone calls when business users are unable to retrieve their reports and analyses.
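Putting it together, the alerting step can be as simple as the following sketch, where `send_alert` is a placeholder for your email or paging hook and the exception records are whatever the scheduled analyses produced.

```python
def check_and_alert(exceptions, send_alert):
    """Dispatch one alert per exception condition found by the scheduled
    BI analyses (sketch; `send_alert` stands in for an email/pager hook)."""
    for exc in exceptions:
        send_alert(f"DW monitor: {exc['job']} flagged: {exc['reason']}")
    return len(exceptions)

sent = []  # capture alerts in a list instead of emailing, for illustration
n = check_and_alert(
    [{"job": "load_orders", "reason": "duration > 2 std dev above mean"}],
    send_alert=sent.append)
print(n, sent[0])
```

An empty exception list sends nothing, so a quiet night produces no noise.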

Enabling process metadata reporting and analysis may sound like a lot of work, but the payback is a self-monitoring data warehouse environment with capabilities that enable the DW team to be more proactive in managing the environment. Most data warehouse teams have the skills and technologies available to them to leverage this type of ETL/DW monitoring.
