Data Management

Leadership challenges of a data cleansing effort

Scott Lowe explains why data cleansing is so important and describes the first steps in the process for Westminster College.

Westminster College has undertaken a number of data-centric projects over the past year. Although we knew that we had areas with data challenges, these projects have helped us uncover the true scope of the problem. Our data challenges have created problems with our new systems and have prevented us from fully implementing them and realizing their maximum benefit. In this blog, I will outline the steps we are taking to correct campus-wide data challenges.

Early symptoms

"How many students do we have on campus right now?" This question, asked often by the College president, never received a consistent answer. Different people in different divisions answered the question differently... and correctly. You might be asking yourself how it's possible that different answers could be considered correct. It's all about context. For example, when the question is asked of the Registrar, she considers the number of students registered in classes. When the same question is asked of the housing office, their context revolves around how many students live in campus-provided housing.

The reason we could not get consistent answers to this question was threefold: 1) We did not have standardized definitions for how to count things. 2) The initial question wasn't specific. 3) In some cases, data elements that would enable easy "counting" were not in place or were being used incorrectly or inconsistently.
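To make the definition problem concrete, here is a minimal sketch in Python. The roster, names, and fields are all invented for illustration; the point is that the same data set answers "How many students do we have?" differently depending on whose definition is applied.

```python
# Hypothetical roster -- names and fields invented for illustration.
students = [
    {"name": "A", "registered": True,  "on_campus_housing": True},
    {"name": "B", "registered": True,  "on_campus_housing": False},
    {"name": "C", "registered": True,  "on_campus_housing": False},
    {"name": "D", "registered": False, "on_campus_housing": True},
]

# The Registrar's definition: students registered in classes.
registrar_count = sum(1 for s in students if s["registered"])

# The housing office's definition: students in campus-provided housing.
housing_count = sum(1 for s in students if s["on_campus_housing"])

print(registrar_count)  # 3
print(housing_count)    # 2
```

Both answers are correct in their own context, which is exactly why standardized definitions have to come before any cleanup.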

Troubles mount

In addition to issues like these, we were having trouble in another campus division just getting data from their completely separate database. Over the years, their system had been semi-manually integrated with the rest of the campus, but only in one direction. Further, over the years, as new division VPs came and went, all the standard codes in use changed to suit the styles of the incoming VPs, and no time was provided to allow the data people to go back and make historical data consistent. So, as new reports were developed, the division relied on the memory of one individual to make sure that all the right data was presented. As you can imagine, this method left much to be desired. As a result of this and because not all necessary information was being captured, it was close to impossible to perform deep analytics on the data system to support the efforts of that division.
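The "codes changed with each VP" problem can be attacked mechanically once someone writes the institutional memory down. This is a hedged sketch, not our actual implementation; the codes and mapping are invented, but it shows the idea of normalizing legacy codes so historical records become comparable without relying on one person's recall.

```python
# Hypothetical mapping of legacy codes to a current canonical code.
# Each VP era used a different code for the same concept.
LEGACY_CODE_MAP = {
    "ADM":   "ADMIT",  # code used in an earlier era
    "AD":    "ADMIT",  # code used by a later VP
    "ADMIT": "ADMIT",  # current code maps to itself
}

def normalize(code: str) -> str:
    """Translate any era's code to the canonical form; flag unknowns."""
    return LEGACY_CODE_MAP.get(code, "UNKNOWN")

print(normalize("AD"))   # ADMIT
print(normalize("XYZ"))  # UNKNOWN
```

The "UNKNOWN" bucket is the useful part: it surfaces historical records that no one can translate anymore, which is precisely the data that currently lives only in one individual's memory.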

Next up--business intelligence related to student retention. When it comes to business metrics for tuition-driven colleges, enrollment numbers and retention are critical. Besides wanting to retain students for financial reasons, our educational mission demands that we work with students to help them be successful. For us, this success means that students stay at the college and eventually graduate, which means we need to retain them for their four-year undergraduate stint.

As you might imagine, there are dozens, if not hundreds, of factors that go into a student's decision to stay or leave. Given that colleges track a ton of information about students, it makes sense to mine this data to see if we can identify common attributes that might signal whether a student will be successful at the college. For example, did students who lived in a particular building retain poorly? Do students from a particular area tend to leave the college? If we mine the data and discover this kind of anomaly, it gives us an opportunity to determine where something might be going wrong for a group of students so that we can remove that barrier to success, if possible.

These kinds of efforts require consistent data. That consistency has to span years in order for a regression analysis to be of any benefit and for us to be able to actually use identified factors in any meaningful way.
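The simplest version of this kind of mining is just grouping retention outcomes by a candidate attribute and comparing rates. The sketch below uses an invented residence-hall example with made-up numbers; a real analysis would span years of consistent data, and, as noted, a correlation found this way is a prompt to investigate, not a cause.

```python
from collections import defaultdict

# Hypothetical records -- hall names and outcomes invented for the sketch.
records = (
    [{"hall": "North", "retained": True}] * 5
    + [{"hall": "North", "retained": False}]
    + [{"hall": "South", "retained": True}] * 2
    + [{"hall": "South", "retained": False}] * 4
)

# Tally retained vs. total per hall.
totals = defaultdict(lambda: [0, 0])  # hall -> [retained_count, total]
for r in records:
    totals[r["hall"]][0] += r["retained"]
    totals[r["hall"]][1] += 1

rates = {hall: kept / total for hall, (kept, total) in totals.items()}
for hall, rate in sorted(rates.items()):
    print(f"{hall}: {rate:.0%} retained")
# North: 83% retained
# South: 33% retained
```

A gap like the one between the two halls is the anomaly the text describes: a flag telling us to go look at what is happening to that group of students.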

Correcting the problems

Inherent data quality issues aren't corrected overnight. Instead, it takes committed leadership from the top of the organization to drive data quality throughout the organization and to make it a top priority, especially for those who want to make better decisions based on the information on hand.

To start with, there needs to be structure in place for handling data quality on an ongoing basis. Without addressing what was obviously a structural deficiency, any data cleanup efforts we undertook would be for naught since we'd be right back here after a period of time. For Westminster, this meant expanding the scope of what we call our Module Managers Committee, the group charged with centrally coordinating campus data activities. Prior to making the scope change, this group was only a loose coordination mechanism. With the scope change, the MMC now has the following role:

Approval

This is approval -- as opposed to coordination -- of significant changes to individual business unit table codes, use of data fields, and so on. Even a VP cannot simply order his people to make an internal mass code change anymore.

  • Codes are shared throughout the ERP. A change in one division can seriously impact reporting and operations in another--and it has before.
  • As unit VPs come and go, we need to maintain historical consistency in the data. With enough business justification and a plan for addressing all second-order consequences, a VP can initiate mass data changes. The goal here is not to have the MMC be a blocking mechanism, but to be a mechanism for ensuring that all aspects of a change are considered and handled.

Expansion of data

Individual units often need to expand their use of the system through the use of user-defined fields or, sometimes, repurposing existing fields. The MMC will consider the best ways to handle these kinds of requests and make a recommendation to IT for implementation. This makes sure that everyone is aware of what data is being captured so that it can be used to maximum benefit.

Holder of the data dictionary

This group is responsible for ensuring that there are clear definitions for data and that data is stored in ways that enable campus processes and business intelligence systems to operate.
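As a rough sketch of what a data-dictionary entry might look like (the field name, owner, and codes below are invented, not Westminster's actual dictionary), each shared field gets a clear definition, an owning office, and an explicit set of allowed codes that every division uses the same way:

```python
# Hypothetical data-dictionary entry -- structure and values invented.
DATA_DICTIONARY = {
    "enrollment_status": {
        "owner": "Registrar",
        "definition": "Student registered in at least one class this term",
        "allowed_codes": {"FT": "Full-time", "PT": "Part-time", "WD": "Withdrawn"},
    },
}

def validate(field: str, code: str) -> bool:
    """Accept only codes the dictionary defines for this field."""
    return code in DATA_DICTIONARY[field]["allowed_codes"]

print(validate("enrollment_status", "FT"))  # True
print(validate("enrollment_status", "XX"))  # False
```

Enforcing entries like this at data-entry time is what keeps the definitions from drifting again after the cleanup.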

Appeals process

Again, the group is not supposed to be a roadblock, but it is supposed to help ensure that all aspects of decisions are considered, and for the rare instance when there is simply no business justification, the group does have the power to reject a request. If the group makes a decision, the decision can be appealed to me, and if I do not accept the appeal, it can be taken to the president of the college. We felt that this was an important item to place into the group's charter in the event that we ended up with a group more interested in control than coordination at some point in the future.

Clear identification of an executive sponsor

I am the executive sponsor and the link between the group and the executive team, although I have delegated the actual operation of the group to my Director of Information Systems/lead DBA.

This is an ongoing effort.

Just the beginning

The creation of our MMC was just the first step in a journey to clean our data. Right now, we're working on a major implementation that will move us significantly toward better data: a new fundraising system that involves migrating all the data from a completely separate legacy system. We've taken significant time to clean up this data for the migration.

That's just one initiative. In a future blog, I'll describe some other efforts we're undertaking in the name of data quality.

About

Since 1994, Scott Lowe has been providing technology solutions to a variety of organizations. After spending 10 years in multiple CIO roles, Scott is now an independent consultant, blogger, author, owner of The 1610 Group, and a Senior IT Executive w...

12 comments
Lboudreau

Interesting read. There is actually a lot of great work being done with data quality and data linkage tools for the future of education with P20 and SLDS initiatives. In our business, we know good record linkage happens behind the scenes. No personal information is ever shared; even the smallest of data sets aren't shared. Several departments have to approve each and every data request individually.


Linda Boudreau

Data Ladder

TAPhilo

Sometimes you really do need a dictator to ensure that everyone marches to the same data definitions and values. This is where a top-down military style DOES work. It also requires a very large wall to show all the mappings between systems and to ensure that when a new field is created in ONE system, it is made available to all the OTHER systems too so they can use it, and it is now locked into existence.

Tony Hopkinson

All too often, a few compromises are allowed to creep in, sacrificing the goal for some short-term benefit. The main rule is no one gets to corrupt good data; the hard bit is identifying good. :( I've been on the technical end of a few of these efforts; nearly every one was scuppered by lack of leadership in terms of conflicting priorities.

StaticFish

First, I want to apologize for my bluntness. I don't normally write in response to articles, but you caught me at a rare moment. I can appreciate the issue addressed by the author (data cleansing), and I can't say I have an in-depth knowledge of the problem, but a quick visit to Westminster's website shows a liberal arts college of 1,100 to 1,200 students. Westminster is ranked #166 in liberal arts colleges in the U.S. That means there are 165 liberal arts colleges of similar size, and most likely some of them have had to address the issues that you are addressing now. Are you sure the genesis of your problem isn't more system, or should I say systems, related? Maybe getting senior staff out on a field trip to see how other schools of similar size handle their data is in order. You're going over issues that have been addressed before by many other institutions of higher learning with more resources to resolve them than W.C. has. Your senior staff needs to ask: "What are small liberal arts colleges like Amherst, Middlebury, and Williams -- ranking in the top 10, charging $40K-$50K per year -- doing that Westminster is not? What kind of data systems are they using? How are they structuring their systems inter-departmentally? How are they extracting the data?" Go do a panty raid on some of these top 10 and bring back ideas that will work for your college. Best of luck.

naoufel.mami

Hi, thanks for sharing those valuable initiatives. During the data cleansing process, which tools did you use? Rgds

richord

The problem of counting students is similar to the problem faced by organizations when counting customers. The first question to ask is "who's asking?" and the second is "what is the definition of customer?" An important consideration is that the answers to both these questions are subjective and temporal: subjective because the context constrains the answers, and temporal because the answers change over time. Within sales, a prospect is considered a customer. Within finance, a customer may be defined as only those who did business within the past year. The standardized definitions are subject to change (sometimes without notice). For example, are individuals who registered but haven't paid considered students? For promotional purposes, the marketing department might consider these "students."

Updating historic data to include new classifications is a challenge. The attributes needed to classify the data may not have been captured, so going back and attempting to classify the data according to some new codes may not be possible. Many databases contained codes for gender as male/female; however, in today's world a further classification may be needed. How do you go back and determine gender for historic records to code them according to new classifications? It is important to ask how important this historic data is to the decision-making process and whether it is possible to "derive" these classifications from the data previously captured.

Mining data to determine correlations requires extensive planning and execution. Having the data is but one consideration; having the "right" data is the other. Correlations between data such as "living in a particular building" and retention may be discovered, but correlation is not causation. How many data points for each student will be collected, and what is the relevance of this data? It is costly to collect and manage data. Will this data provide insights that will impact the outcomes in a measurable way?

What has been the impact of the data quality efforts, and what are the expectations for future impacts? Is the investment in a bureaucracy paying off? Has progress been affected by the bureaucracy? Has performance of the organization improved as a result of these efforts? Is this effort sustainable?

richord

I assume you have not heard of the latest data fad, Data Governance? Data governance is targeted at solving all the data errors of the past by establishing an elaborate bureaucracy to take over data. There are councils, committees and stewards. There are policies and meetings and meetings and even more meetings. X-spurts will profess that all organizations have to be mature and embrace data governance. Without data governance your business will face the apocalypse. Data governance will solve business and IT relationship and trust issues. It will clean up bad data and it will ensure excellent customer service. Of course you could bypass this democratic process and dictate improvements. Or you can permit the current state of data anarchy to prevail. I suggest a hybrid. Select data management best practices such as data quality that will benefit the business and dictate their adoption. This is the "low-hanging fruit" and "just do it" approach. Then identify preventive practices such as improved database design and data usage practices and embed these into the culture, behavior and processes (e.g. SDLC, BPM etc.) of the organization through enlightenment (not training and education) by showing by doing. This will take years.

Scott Lowe

Tony, I agree with you on all counts. I've started taking a hard line when it comes to tolerance for data integrity, and it's a constant challenge. It seems like I'm notified (sometimes too late) of something taking place that, although it might make an individual process take five fewer seconds, creates massive problems down the road. As I stated elsewhere, there is a huge cultural component to this, and that's the hardest kind of change to effect.

At every opportunity - whenever one of these situations arises - I don't start with the staff member violating data integrity rules. I start with the division VP, explaining why the activity cannot continue, and I try to start the conversation with some kind of relatable analogy. That actually worked today in a situation I had, and the VP immediately got what I was saying. Although most of the people on the executive team "get" data quality, educating the rest is key to the integrity goal. Fortunately, I have the full support of the president, so that always helps.

Rarely do I simply say "no" to a request -- it might be "not right now" or "we'll put it on the list" -- but twice in the past year, I've simply said "no" to requests that have come in to IT. These requests were intended to develop systems by which offices could work around data integrity issues, i.e., "just add a field for us to do X and we don't need to track that data anymore... don't worry... we'll remember it." One of the requestors even had the division VP come and make the request. Again, I said no and told the VP that the only person who would change my mind on it would be the president, and then only over my objection. Eventually, the VP understood where I was coming from: the needs of one particular office need to be weighed against everyone else's. Keeping data clean needs to be a goal for every office... that's going to be hard.

Sorry I ramble. Scott

Scott Lowe

StaticFish, One of the great things about higher ed is the inter-school collaboration that takes place. These kinds of issues are discussed far and wide, but even among schools on the exact same system, the "drift" that takes place is incredible, so there is definitely no one-size-fits-all opportunity. In speaking with others at other institutions in our class - not necessarily the elites with resources that far outstrip our own - we're far from unique, but we did have one massive disadvantage: poor coordination between departments when it comes to data changes. That's the primary issue we're trying to solve now, and this group will also be charged with a good chunk of the cleanup. Regardless of what's happening at other colleges, we will have to do the hard work of the cleanup. In June, I'll be at a conference with a presentation on this very topic, and I intend to ask others in person exactly the same kinds of questions you raised. Honestly, I think our primary challenge is cultural. Just today, we had one division that, to solve a problem internal to their group, was creating duplicate records in our ERP rather than handling the situation in a way that makes more sense. It was only sheer luck that we even found out this was taking place so that we could stop it. The end result of duplicate records is never good, particularly for those who use the data later in the lifecycle. Believe it or not, it's better than it used to be, but we have a very long way to go. I appreciate the comment and can say with certainty that we are looking at others' best practices. Scott

Tony Hopkinson

It's getting there. The more balkanised your systems, with a wide range of age and quality, the harder it is.

Tony Hopkinson

Even if you started right, staying right is a constant battle between short- and long-term, technical and business priorities. If you didn't start right or didn't stay right, well, debts always get called in the real world. Thing is, this isn't arcane technomage bibble-babble; anyone vaguely competent can understand the fundamental issues. This is 'business' not evaluating the long-term effects of a short-term bodge correctly, or even at all. You can explain it, you can demonstrate it, you can prove it, but they needed something yesterday at zero cost; it is what it is. Low-hanging fruit is an approach to be wary of: if you don't use the gains to build ladders and resource pickers with a head for heights, it's a cop-out. Your data governor gets promoted to CIO, and the next guy looks for a better opportunity (a new data island, hurrah) to get promoted themselves.

Tony Hopkinson

I specialise in client-server databases, so I'm the guy who gets forced to screw things up and then the one who gets the blame for it being screwed up. Ran into a minor one today: legacy database, no referential integrity, implied foreign key to another table with a surrogate identity constraint seeded at one. So why do some of the furkers have zero in them, then? Didn't see it in my test data; must have been a symptom of an old bug. It only cost us a few hours, but they add up; in fact, experience suggests it's exponential.
