You just updated your LinkedIn profile with the sexiest job of the 21st Century, according to Harvard Business Review. That’s right: you’re a data scientist. You’re pulling down a six-figure salary. You’re single-handedly turning your once-tired business into a data-driven machine with fancy new machine learning models and algorithms. Your parents may not understand what you do, but they’re proud.
If only they knew that you’re basically a data janitor.
That’s not to say that janitorial work isn’t a noble profession, whether it’s of the sweep-the-floors or the cleanse-the-data variety. Both are important. In data science, data cleansing (or data preparation) is a critical precursor to doing anything useful with data.
SEE: Hiring kit: Data scientist (TechRepublic Premium)
According to Anaconda’s 2021 State of Data Science survey, respondents reported they spend “39% of their time on data prep and data cleansing, which is more than the time spent on model training, model selection and deploying models combined.” According to other studies, data preparation can claim as much as 80% of a data scientist’s time.
Data preparation takes so much of a data scientist’s time because, ultimately, data can’t do much if it hasn’t been vetted and prepped for success. Given the importance of good data preparation to delivering good data science, it’s important to understand what it is and how to do it well.
What is data preparation?
According to TechRepublic, data preparation is “the process of cleaning, transforming and restructuring data so that users can use it for analysis, business intelligence and visualization.” AWS’s definition is even simpler: “Data preparation is the process of preparing raw data so that it is suitable for further processing and analysis.”
But what does this actually mean in practice?
Data doesn’t typically reach enterprises in a standardized format and, thus, needs to be prepared for enterprise use. Some of the data is structured, like customer names, addresses and product preferences, while most is almost certainly unstructured, like geospatial data, product reviews, mobile activity and tweets.
Before data scientists can run machine learning models to tease out insights, they’re first going to need to transform the data, reformatting it or perhaps correcting it, so it’s in a consistent format that serves their needs. This is where data preparation makes all the difference.
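To make “a consistent format” concrete, here is a minimal sketch of that kind of transformation using only the Python standard library. The sample records, the list of date formats and the country mapping are all hypothetical; real sources will need their own rules, and an ambiguous date such as 03/07/2021 is assumed here to be month-first.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from different systems.
RAW_RECORDS = [
    {"name": "  alice SMITH ", "signup": "2021-03-05", "country": "us"},
    {"name": "Bob Jones", "signup": "03/07/2021", "country": "USA"},
    {"name": "carol  white", "signup": "7 Mar 2021", "country": "United States"},
]

# Assumed date formats (month-first for slashes); extend as new ones turn up.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")

# Assumed mapping of country spellings to one canonical code.
COUNTRY_ALIASES = {"us": "US", "usa": "US", "united states": "US"}


def parse_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize(record):
    """Return a cleaned copy of one record without mutating the input."""
    return {
        # Collapse repeated whitespace and apply consistent capitalization.
        "name": " ".join(record["name"].split()).title(),
        "signup": parse_date(record["signup"]),
        # Unknown countries become None so a person can review them later.
        "country": COUNTRY_ALIASES.get(record["country"].strip().lower()),
    }


cleaned = [normalize(record) for record in RAW_RECORDS]
for row in cleaned:
    print(row)
```

Values that cannot be parsed are returned as None rather than guessed, which keeps the correcting step visible to whoever reviews the data next.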
What are the benefits of data preparation?
Commonly cited benefits include:
- The ability to fix errors quickly by “catch[ing] errors before processing”
- The production of top-quality data by “cleaning and reformatting datasets [to] ensure that all data used in analysis will be of high quality”
- The ability to make better business decisions
In addition, data preparation can help to reduce data management costs that balloon when you try to apply bad data to otherwise good ML models. Now, given the importance of getting data preparation right, what are some tips for doing it well?
Top 6 data preparation tips for your business
If you’ve read this far, you are hopefully convinced that you can’t deliver ML success without substantial investment in data preparation. Yet many data scientists want to focus on the sexy part of the job (models) at the expense of adequate data preparation.
It’s relatively easy to train an ML model; it’s much harder, and more important, to understand the distribution of your data and apply models accordingly. Such understanding comes through data preparation. Consider these six tips as you begin the data preparation process for various business use cases:
1. Prepare for preparation
Now that you’ve determined that data preparation is non-negotiable in your future, make a plan for who will complete which preparation tasks, on what timeline, and for what specific business purposes. This helps keep time and resources from being wasted in the preparation process.
2. Don’t pretend the data is perfect
As you prepare data, you’ll get a closer look at what’s there and will almost certainly see gaps in the data. The key is to make sure you communicate any limitations in the data to stakeholders, so you can calibrate expectations accordingly and as early as possible.
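One way to make those gaps easy to communicate is to put a number on them. The sketch below, which assumes records arrive as Python dictionaries, reports the share of missing values per field. The set of placeholder strings treated as “missing” and the sample rows are assumptions for illustration.

```python
# Values treated as "missing" here are an assumption; adjust for your sources.
MISSING_MARKERS = {"", "n/a", "na", "null", "none", "-"}


def is_missing(value):
    """Treat None and common placeholder strings as missing."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_MARKERS


def missing_report(records):
    """Return {field: fraction of records where the field is missing}."""
    fields = sorted({key for record in records for key in record})
    total = len(records)
    report = {}
    for field in fields:
        # A field that is absent from a record counts as missing too.
        gaps = sum(1 for record in records if is_missing(record.get(field)))
        report[field] = gaps / total if total else 0.0
    return report


# Hypothetical sample rows for illustration only.
rows = [
    {"customer": "A-1", "email": "a@example.com", "region": "EMEA"},
    {"customer": "A-2", "email": "N/A", "region": None},
    {"customer": "A-3", "email": "", "region": "APAC"},
    {"customer": "A-4", "region": "EMEA"},
]

for field, share in missing_report(rows).items():
    print(f"{field}: {share:.0%} missing")
```

A short table like this is often enough to start the conversation with stakeholders about which fields a model can and cannot rely on.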
3. Tools can help, but people are essential
From the previously mentioned Anaconda report: “While data preparation and data cleansing are time-consuming and potentially tedious, automation is not the solution. Instead, having a human in the mix ensures data quality, more accurate results, and provides context for the data.”
A savvy data scientist will know what clean data looks like and can help to shape raw data into a usable form. Make sure you’re hiring people with the requisite skills; as a bonus, look for data scientists with the leadership and mentorship skills to build up other team members.
4. Do hypothesis testing to understand your data’s distribution
One trick to getting a sense of the right distribution of your data, and thereby uncovering outliers and missing values, is to do hypothesis testing. Berkeley Lab researcher Adrian Perez has outlined a series of tests you can run to better understand data, so you can more effectively prepare it for use.
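As one illustration of what such a test can look like, here is a sketch of the Jarque-Bera normality test written with only the standard library. It is an example chosen for this sketch, not necessarily one of the tests Perez outlines. The null hypothesis is that the data is normally distributed; the p-value uses a large-sample chi-square approximation, so treat it as a rough signal on small datasets. The sample values are hypothetical.

```python
import math


def jarque_bera(values):
    """Return (statistic, p_value) for the Jarque-Bera normality test.

    Null hypothesis: the sample comes from a normal distribution.
    The p-value relies on a chi-square approximation with 2 degrees
    of freedom, whose survival function is exp(-x / 2).
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(values) / n
    # Central moments using the biased (divide-by-n) estimators.
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    return statistic, math.exp(-statistic / 2)


# Hypothetical order values with one suspicious entry at the end.
order_values = [48.0, 52.5, 49.9, 51.2, 50.3, 47.8, 53.1, 50.0, 49.1, 5000.0]

statistic, p_value = jarque_bera(order_values)
if p_value < 0.05:
    print(f"JB={statistic:.1f}, p={p_value:.3g}: normality rejected, check for outliers")
else:
    print(f"JB={statistic:.1f}, p={p_value:.3g}: no evidence against normality")
```

A rejected test does not say what is wrong with the data. It only flags that a closer look is warranted, which is where human judgment comes back in.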
5. Prioritize data according to your use case
It may seem obvious to consider data from your Eloqua system when working on a marketing analytics use case, for example, but this kind of human judgment is essential to prioritizing data sources for a given model.
Given time or cost constraints, you’re likely going to need to stack-rank data sources by how useful they are likely to be to the model for each project. Choosing which data sources will take precedence over others can help streamline the data preparation process.
6. Take data storage seriously
Many enterprises treat their data lakes like data swamps, shoving data into the repository without worrying about formatting. This is fine until you actually want to use the data. You probably don’t want to undertake the burden of rebuilding databases after the fact, so thinking ahead and standardizing data formats when data is being ingested can remove a great deal of the pain associated with data preparation.
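Standardizing at ingestion can be as simple as running every incoming record through a small schema before it is stored. The following is a sketch under stated assumptions: the field names, the SCHEMA mapping and the rule that timestamps without a time zone are already in UTC are all hypothetical choices, not a prescription for any particular data store.

```python
from datetime import datetime, timezone


def to_utc_timestamp(value):
    """Parse an ISO 8601 string and return it normalized to UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without a zone are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


# Hypothetical schema: field name -> converter that returns a clean value
# or raises an exception when the input cannot be standardized.
SCHEMA = {
    "event_id": str,
    "amount": float,
    "occurred_at": to_utc_timestamp,
}


def ingest(raw_events):
    """Split events into (accepted, rejected) according to SCHEMA."""
    accepted, rejected = [], []
    for event in raw_events:
        try:
            clean = {field: convert(event[field]) for field, convert in SCHEMA.items()}
            accepted.append(clean)
        except (KeyError, ValueError, TypeError) as error:
            # Keep the original event and the reason so nothing is silently lost.
            rejected.append({"event": event, "reason": repr(error)})
    return accepted, rejected


# Hypothetical incoming events for illustration only.
events = [
    {"event_id": 1, "amount": "19.99", "occurred_at": "2022-01-15T10:30:00+02:00"},
    {"event_id": 2, "amount": "oops", "occurred_at": "2022-01-15T11:00:00"},
    {"event_id": 3, "amount": 5},
]

accepted, rejected = ingest(events)
print(f"accepted {len(accepted)}, rejected {len(rejected)}")
```

Routing rejected records to a review queue, rather than dropping them, keeps a person in the loop for the cases the schema cannot settle on its own.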
What are some data preparation tools?
Though people are the primary component of data preparation success, there are tools on the market that can automate some of the drudgery. Some of the leaders in this market include Microsoft, Alteryx, Tableau and Zaloni, though the right data preparation tool for your business will depend on budget and specific business goals and requirements.
Disclosure: I work for MongoDB but the views expressed herein are mine.