10 ways to begin a data warehouse project

IT departments typically launch data warehouse projects without input from business partners, explaining DW concepts to the uninitiated and building the business case themselves. But what steps should you take once the project is greenlighted? CIO Jeff Relkin examines key aspects of the process, including determining data access controls, deciding on an implementation approach, identifying project scope, developing the spec document, assessing data quality, and building a metadata repository.

By Jeff Relkin

This article is also available as a PDF download.

High functioning value-added IT departments operate in a consultative mode, using the enterprise business model and strategic plan to work effectively with business partners to identify technology-based solutions in response to requirements as articulated by the business. Projects are launched based on collaboratively deciding what's needed.

A data warehouse, however, is one of the few examples of a project that's typically initiated independently by IT without input from the business. IT has to build the initial business case for the data warehouse, since few people outside the technology discipline understand what a data warehouse really is or what kind of value it can provide. What's the best way to get started? Here are some suggestions.

#1: Determine your organization's appetite for change

What makes a data warehouse different from other traditional applications? One obvious answer is the rather profound difference between transactional (OLTP) and analytic (OLAP) processing. Another is the fact that data warehouses are additive in nature, which means they don't conform to standard accounting rules for financial data and are tolerant of redundancy, a construct normal relational databases avoid at all costs.

Legacy applications typically have standard interfaces and proscribed reporting packages, but data warehouses are mostly accessed via ad-hoc single-use queries. Dimensional cubes don't look anything like relational tables. All in all, the use of a data warehouse requires a completely different user mindset. Is your organization ready for that? If not, you have two choices: Save yourself a lot of grief and don't start the project to begin with or figure out how to use your IT professional change agent skills to move the enterprise culture into one that embraces high-level analytics and the associated set of advanced technology tools.

#2: Identify the most likely business unit to benefit from a data warehouse and approach it proactively

Every time I've gone to my business customers and said, "I'm building a data warehouse... what would you like to do with it?", the response I've gotten is, "What's a data warehouse?" Well begun is well done, so a key initial activity is to determine which of your business customers has the greatest potential need for analytical data and tools, along with the capacity and interest to put those tools to effective use.

If you work for a small organization, you'll probably want to go straight to the top. At this point, you're just trying to generate interest and excitement (and funding), so build a short but effective business case before you have any meetings. Cover the high-level basics of data warehouse functionality in business value terms. Be scrupulously careful to avoid anything remotely resembling geek-speak. Do not make the fatal mistake of attempting to quantify data warehouse ROI. Many people have heard that data warehouses are "sinkholes of money." Don't deny this, because it's usually true, and the worst thing you can do now is kill your credibility. The initial sale you're trying to make is based on improved decision making through advanced analytics. Financial benefits will be longer term and difficult to measure, so be honest about that. Once you get approval to proceed, you'll want to quickly get business management sponsorship. IT should be the project owners only for the shortest time possible.

#3: Determine data access controls

Your proposal has been greenlighted--now what do you do? All kinds of questions need to be answered, and one good place to start is by determining inclusion parameters and security requirements. You have two main areas to think about, sources and targets. Your instinct will be to use every piece of data you can get your hands on, but not every system, database, legacy application, or information enrichment file is a potential source for your data warehouse. A lot of data derived from those sources probably isn't relevant to your identified customer group and may contain qualitative data that doesn't lend itself to analysis.

You also have to consider whether your downstream data warehouse customers are even allowed to process raw data. Users who are authorized at the application level shouldn't necessarily be granted access to granular information. Likewise, the target data warehouse will need some security controls imposed on it. One of the great dangers of data warehouses is that in the hands of unskilled practitioners, they can be used to make and justify spectacularly bad business decisions. Sometimes users have to be protected from themselves.

#4: Assemble the team

Who should work on the data warehouse effort? This is a project, so you'll need a project manager, someone comfortable with nontraditional project management methods (highly iterative process, multiple short-duration interim deliverables, toleration for change and uncertainty, etc.). You'll also need a database designer and a dedicated DBA, people who are well versed in the differences between relational tables and multidimensional cube constructs.

Your DBA needs to be dedicated because a good data warehouse project involves constant revision and tuning, and in addition to unique architectural skills, your DBA can't be distracted by conflicts with other responsibilities. A development team will be necessary to build the various EAI (enterprise application interface) and ETL (extract/transform/load) interfaces, and you'll need dedicated report and query specialists. You'll want to make sure your query developers have great customer focus skills, because they'll also be the ones who'll support the data warehouse to the customer community when it goes live. Don't forget the most important members of your team--a customer council.

#5: Decide on the implementation approach

How should you build your data warehouse? There are two traditional approaches: the galactic data warehouse and the architected datamart. Classic data warehouse topology consists of a source layer, which feeds into an ODS (operational data store), from there into the enterprise data warehouse, and from there into a series of datamarts. All data flows unidirectionally downstream, and the reporting layer at the bottom connects to the datamarts and the data warehouse. The entire environment is connected to a metadata repository.

In the galactic data warehouse approach, you build the complete backend infrastructure from source to data warehouse, encompassing as much of the source data as you might possibly someday want, and implement datamarts fed from the data warehouse as needed. The architected datamart approach bypasses the central data warehouse altogether, sending focused subsets of data directly from the sources to the datamarts. Both approaches have merit, but typically you can get to an interim deliverable far faster with architected datamarts. The danger is that you'll never have, or take, the opportunity to go back and insert the middle, the data warehouse itself. Your decision as to which approach to use depends on your own technology and business environment and should be based on a strategic risk assessment.

#6: Identify the project scope

How broad should you make your data warehouse environment? As is the case with any project, before you start you need to define the three most fundamental components of the effort: what are the inputs, what kind of processing will be performed on them, and what comes out the other side?

One of the key differences between a data warehouse project and nearly every other technology project is that it's almost always impossible to define what the data warehouse and all its various components will look like by the time you're done, so you actually want to avoid too much scope definition up front or you'll lock yourself into a rigid design and end up spending even more money to overcome limitations. You're basically trying to put up a building without working from a blueprint. Nevertheless, you must come to some agreement with your customers as to at least the first few deliverables. This is why the project manager has to be comfortable with uncertainty. You'll be going to your customer intially with a very raw product and saying, "Try this, let me know how you like it," and then moving forward iteratively from there. Definition will become clearer the more you move into implementation.

#7: Establish the success criteria

What's your customer trying to achieve? If you ask your customer at the start of the project, you'll get a blank stare and a shoulder shrug, so here's where your consultative and business skills get exercised. There's no way to make a definitive statement about the financial value of the data warehouse up front--sometimes not even after it's finished. You must understand enough about your customer's business that you can help define, in qualitative and not quantitative terms, how the project will be judged successful. What results will better decision making generate? How will having advanced analytics help to understand the business? What will some of the new techniques, like segmentation analysis, provide in terms of business intelligence? How will the data warehouse help drive strategy at the enterprise and departmental level?

#8: Conduct the "25 question" analysis

Now that you have the beginning of an overall strategy, where do you go from here? I've found the most effective way to proceed is to assemble a group of 15 users and ask them to write down the 25 questions about the business for which they'd most like to know the answers. Don't allow them to collaborate. It's critically important for each one to develop a question list independently. Make sure they don't view this as a technical assignment but phrase each question in English business terms as if they were addressing their boss or the CEO (or as if those folks were addressing them).

Collect everyone's lists, which will give you 375 total questions. Now normalize the list. Eliminate all the redundancies and combine all the similar questions. You'll find you'll end up with somewhere around 50 discrete questions. Convene all 15 users and review the list with them, getting them to put the questions in priority order. Know what you have in your hand? Your spec document, not only for the data warehouse but also for the first iteration of your standard query repository. If you can build a data warehouse and the queries that answer those questions in the order the users specified, you'll be a hero.

#9: Assess current data quality and pre-cleansing efforts

What's the surest way to doom a data warehouse project? Forget about quality. No matter how good your organization thinks the data in your source systems really is, you know better. It's highly suspect. Since a data warehouse is additive in nature, you'll never do any update transactions to it. That means garbage in is not just garbage out, it's permanent garbage. Build the effort and cost of data cleansing into your data warehouse project. Not only will you have a data warehouse that produces credible results, you'll drive an overall push that's probably long overdue to improve the quality of the data in the backend systems themselves. At the same time, you'll probably uncover a host of errors in the legacy systems that should have long since been corrected.

#10: Don't forget the metadata

Remember last Christmas when you tried to put together your daughter's bicycle without the instructions? That's what using a data warehouse without metadata is like. Metadata, or data about your data, is the glossary that documents all the important information users need so they can understand how to correctly utilize the data. The metadata repository is generally set up as a separate database, connected passively to all components of the data warehouse--the sources, the ODS, the warehouse, and the datamarts. It also drives the interfaces, providing technical information that documents the processes and transformations that operate on each data entity. Whether you construct it yourself or purchase a product, remember to build this critical component into your project plan and expect the maintenance of it to be as iterative as the data warehouse itself. As one changes, so does the other.

Jeff Relkin has 30+ years of technology-based experience at several Fortune 500 corporations as a developer, consultant, and manager. He has also been an adjunct professor in the master's program at Manhattanville College. At present, he's the CIO of the Millennium Challenge Corporation (MCC), a federal government agency located in Washington, DC. The views expressed in this article do not necessarily represent the views of MCC or the United States of America.