Scott Lowe explains why data cleansing is so important and describes the first steps in the process for Westminster College.
Westminster College has undertaken a number of data-centric projects over the past year. Although we knew that we had areas with data challenges, these projects have helped us uncover the true scope of the problem. Those challenges have caused problems with our new systems and have kept us from fully implementing them and getting the maximum benefit from them. In this blog, I will outline the steps we are taking to correct campus-wide data challenges.
"How many students do we have on campus right now?" This question, asked often by the College president, never received a consistent answer. Different people in different divisions answered the question differently... and correctly. You might be asking yourself how it's possible that different answers could be considered correct. It's all about context. For example, when the question is asked of the Registrar, she considers the number of students registered in classes. When the same question is asked of the housing office, their context revolves around how many students live in campus-provided housing.
The reason we could not get consistent answers to this question was threefold: 1) We did not have standardized definitions for how to count things. 2) The initial question wasn't specific. 3) In some cases, data elements that would enable easy "counting" were not in place or were being used incorrectly or inconsistently.
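To make the first two points concrete, here is a minimal sketch of what a set of standardized counting definitions might look like. Everything in it is an assumption made for illustration: the `Student` fields, the definition names "registered" and "housed", and the sample records are hypothetical and are not taken from Westminster's systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Student:
    # Hypothetical fields, chosen only for this illustration.
    student_id: str
    registered_credits: int   # credits registered in the current term
    lives_on_campus: bool     # lives in campus-provided housing


# Each named definition is an explicit, agreed-upon counting rule.
# The names and rules are assumptions, not the college's real definitions.
COUNT_DEFINITIONS = {
    "registered": lambda s: s.registered_credits > 0,
    "housed": lambda s: s.lives_on_campus,
}


def count_students(students, definition):
    """Count students under one named definition; reject vague questions."""
    try:
        rule = COUNT_DEFINITIONS[definition]
    except KeyError:
        raise ValueError(f"no agreed definition named {definition!r}") from None
    return sum(1 for s in students if rule(s))


# Made-up sample data.
students = [
    Student("S1", 15, True),
    Student("S2", 12, False),
    Student("S3", 0, True),
    Student("S4", 16, False),
]

print(count_students(students, "registered"))  # 3 (the Registrar's context)
print(count_students(students, "housed"))      # 2 (the housing office's context)
```

Both answers are correct for the same four students. The point of the sketch is that the caller has to say which definition is meant, so a question that isn't specific gets rejected rather than answered inconsistently.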
In addition to issues like these, we were having trouble in another campus division just getting data from their completely separate database. Over the years, their system had been semi-manually integrated with the rest of the campus, but only in one direction. Further, over the years, as new division VPs came and went, all the standard codes in use changed to suit the styles of the incoming VPs, and no time was provided to allow the data people to go back and make historical data consistent. So, as new reports were developed, the division relied on the memory of one individual to make sure that all the right data was presented. As you can imagine, this method left much to be desired. As a result of this and because not all necessary information was being captured, it was close to impossible to perform deep analytics on the data system to support the efforts of that division.
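One way to stop relying on a single person's memory is to write down which codes were in use during which period and translate them to one canonical set. The sketch below assumes that approach. It is not a description of how the division's data was actually handled, and the dates, legacy codes, and canonical codes are all hypothetical.

```python
from datetime import date

# Each era records the codes in use during one period and what they mean
# in a single canonical scheme. All dates and codes here are hypothetical.
CODE_ERAS = [
    (date(2000, 1, 1), date(2004, 12, 31), {"A": "ALUM", "P": "PARENT"}),
    (date(2005, 1, 1), date.max, {"AL": "ALUM", "PA": "PARENT"}),
]


def canonical_code(legacy_code, recorded_on):
    """Translate a code as it was used on a given date.

    Returns None when the code was not in use in that era, so unmapped
    history is surfaced for review instead of being guessed at.
    """
    for start, end, mapping in CODE_ERAS:
        if start <= recorded_on <= end:
            return mapping.get(legacy_code)
    return None


print(canonical_code("A", date(2003, 5, 1)))   # ALUM
print(canonical_code("AL", date(2010, 1, 1)))  # ALUM
print(canonical_code("A", date(2010, 1, 1)))   # None: "A" was retired by then
```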
Next up--business intelligence related to student retention. When it comes to business metrics for tuition-driven colleges, enrollment numbers and retention are critical. Besides wanting to retain students for financial reasons, our educational mission demands that we work with students to help them be successful. For us, this success means that students stay at the college and eventually graduate. In order for them to do so, we need to retain them for their four-year undergraduate stint.
As you might imagine, dozens, if not hundreds, of factors go into a student's decision to stay or leave. Given that colleges track a ton of information about students, it makes sense to mine this data to see whether we can identify common attributes that signal whether a student will be successful at the college. For example, do students who lived in a particular building retain poorly? Do students from a particular area tend to leave the college? If we mine the data and discover this kind of anomaly, we have an opportunity to determine where something might be going wrong for a group of students so that we can remove that barrier to success, if possible.
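As a rough illustration of this kind of mining, the sketch below compares the retention rate for each value of one attribute against the overall rate and flags groups that fall well below it. It is a simple group comparison, not a full statistical analysis. The record layout, the building names, the thresholds, and the sample data are all assumptions made up for the example.

```python
from collections import defaultdict


def retention_by_group(records, group_key):
    """Return {group: (retained, total, rate)} for one attribute."""
    totals = defaultdict(int)
    retained = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec["retained"]:
            retained[group] += 1
    return {g: (retained[g], totals[g], retained[g] / totals[g]) for g in totals}


def flag_low_groups(rates, overall_rate, gap=0.15, min_size=3):
    """Names of groups whose rate trails the overall rate by at least `gap`.

    `gap` and `min_size` are arbitrary illustrative thresholds; very small
    groups are skipped because their rates are too noisy to act on.
    """
    return sorted(
        group
        for group, (_, total, rate) in rates.items()
        if total >= min_size and overall_rate - rate >= gap
    )


# Made-up records: one row per student, with a hypothetical building name.
records = [
    {"building": "Hall A", "retained": True},
    {"building": "Hall A", "retained": True},
    {"building": "Hall A", "retained": True},
    {"building": "Hall A", "retained": False},
    {"building": "Hall B", "retained": True},
    {"building": "Hall B", "retained": False},
    {"building": "Hall B", "retained": False},
    {"building": "Hall B", "retained": False},
]

rates = retention_by_group(records, "building")
overall = sum(r["retained"] for r in records) / len(records)
print(flag_low_groups(rates, overall))  # ['Hall B']
```

A flagged group is only a prompt to go and look at what is happening there. It does not by itself explain why those students left.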
These kinds of efforts require consistent data. That consistency has to span years in order for a regression analysis to be of any benefit and for us to be able to actually use identified factors in any meaningful way.
Correcting the problems
Inherent data quality issues aren't corrected overnight. Instead, it takes committed leadership from the top of the organization to drive data quality throughout the organization and to make it a top priority, especially for those who want to make better decisions based on the information on hand.
To start with, there needs to be a structure in place for handling data quality on an ongoing basis. Without addressing what was obviously a structural deficiency, any data cleanup efforts we undertook would be for naught since we'd be right back here after a period of time. For Westminster, this meant expanding the scope of what we call our Module Managers Committee (MMC), the group charged with centrally coordinating campus data activities. Prior to the scope change, this group was only a loose coordination mechanism. With the scope change, the MMC now has the following roles:
Approval of changes
This is approval -- as opposed to coordination -- of significant changes to individual business unit table codes, use of data fields, and so on. Even a VP can no longer simply order staff to make an internal mass code change.
- Codes are shared throughout the ERP. A change in one division can seriously impact reporting and operations in another--and it has before.
- As unit VPs come and go, we need to maintain historical consistency in the data. With enough business justification and a plan for addressing all second-order consequences, a VP can initiate mass data changes. The goal here is not to have the MMC be a blocking mechanism, but to be a mechanism for ensuring that all aspects of a change are considered and handled.
Expansion of data
Individual units often need to expand their use of the system through the use of user-defined fields or, sometimes, repurposing existing fields. The MMC will consider the best ways to handle these kinds of requests and make a recommendation to IT for implementation. This makes sure that everyone is aware of what data is being captured so that it can be used to maximum benefit.
Holder of the data dictionary
This group is responsible for ensuring that there are clear definitions for data and that data is stored in ways that enable campus processes and business intelligence systems to operate.
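For readers who have not worked with one, a data dictionary can be as simple as a shared record of what each field means, who owns it, and which codes are allowed. The sketch below assumes a tiny in-memory version. The field names, owners, and codes are hypothetical and do not come from the college's ERP.

```python
# A tiny, hypothetical data dictionary: field -> definition, owner, codes.
DATA_DICTIONARY = {
    "housing_status": {
        "definition": "Whether the student lives in campus-provided housing this term.",
        "owner": "Housing office",
        "allowed_codes": {"ON": "On campus", "OFF": "Off campus"},
    },
    "enrollment_status": {
        "definition": "Whether the student is registered in classes this term.",
        "owner": "Registrar",
        "allowed_codes": {"REG": "Registered", "NOT": "Not registered"},
    },
}


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (empty if clean)."""
    problems = []
    for field, value in record.items():
        entry = dictionary.get(field)
        if entry is None:
            problems.append(f"{field}: not defined in the data dictionary")
        elif value not in entry["allowed_codes"]:
            problems.append(f"{field}: unknown code {value!r}")
    return problems


print(validate_record({"housing_status": "ON", "enrollment_status": "REG"}))
# []
print(validate_record({"housing_status": "DORM", "legacy_flag": "Y"}))
# ["housing_status: unknown code 'DORM'", 'legacy_flag: not defined in the data dictionary']
```

A check like this catches both a code that nobody agreed on and a field that is being used without a shared definition.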
Again, the group is not supposed to be a roadblock, but it is supposed to help ensure that all aspects of decisions are considered, and for the rare instance when there is simply no business justification, the group does have the power to reject a request. If the group makes a decision, the decision can be appealed to me, and if I do not accept the appeal, it can be taken to the president of the college. We felt that this was an important item to place into the group's charter in the event that we ended up with a group more interested in control than coordination at some point in the future.
Clear identification of an executive sponsor
I am the executive sponsor and the link between the group and the executive team, although I have delegated the actual operation of the group to my Director of Information Systems/lead DBA.
This is an ongoing effort.
Just the beginning
The creation of our MMC was just the first step in our journey to clean up our data. We are currently working on a major implementation that will move us significantly closer to having better data: a new fundraising system, which involves transferring all the data from our completely separate system into the new one. We've spent significant time cleaning up this data for the migration.
That's just one initiative. In a future blog, I'll describe some other efforts we're undertaking in the name of data quality.