Data scientists come from a world of research and hypotheses. They develop queries in the form of big data algorithms that can become quite complex and that may not yield results until after numerous iterations. Their natural counterparts in IT—data analysts—come from a different world of highly structured data work. Data analysts are used to querying data from structured databases, and they see their query results rapidly.
Understandable conflicts arise when data scientists and data analysts try to work together, because their working styles and expectations can be quite different. These differences in expectations and methodologies can even extend to the data itself. When this happens, IT data architecture is challenged.
“There are a lot of historic differences between data scientists and IT data engineers,” said Dave Langton, VP of product at Matillion. “The two main differences are that data scientists tend to use files, often containing machine-generated semi-structured data, and need to respond to changes in data schemas often. Data engineers work with structured data with a goal in mind (e.g., a data warehouse star schema).”
From an architectural standpoint, this has meant that database administrators must set up data for data scientists in file-oriented data lakes, while data for IT data analysts must be stored in data warehouses that use traditional and often proprietary structured databases.
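To make the contrast concrete, here is a minimal Python sketch of the same events handled both ways. It is an illustration only, not something described by Langton or Minnick: the record fields (`device`, `temp_c`, `humidity`) and the function names `read_lake_records` and `to_warehouse_rows` are hypothetical. The lake-style reader keeps whatever fields each record carries, while the warehouse-style loader forces every record into columns declared up front.

```python
import json

# Hypothetical machine-generated events, as a data scientist might read them
# from files in a data lake. The schema drifts: the third record adds a field.
RAW_EVENTS = [
    '{"device": "a1", "temp_c": 21.5}',
    '{"device": "b2", "temp_c": 19.0}',
    '{"device": "a1", "temp_c": 22.1, "humidity": 40}',
]

# Fixed columns, as a warehouse table would declare them in advance.
WAREHOUSE_COLUMNS = ("device", "temp_c")


def read_lake_records(lines):
    """Lake style: parse each record and keep whatever fields it has."""
    return [json.loads(line) for line in lines]


def to_warehouse_rows(records, columns=WAREHOUSE_COLUMNS):
    """Warehouse style: keep only declared columns; missing values become None."""
    return [tuple(rec.get(col) for col in columns) for rec in records]


records = read_lake_records(RAW_EVENTS)
rows = to_warehouse_rows(records)
```

In this sketch the new `humidity` field survives in `records` but is dropped from `rows` until someone changes the declared columns, which is one small example of why the two groups experience schema changes so differently.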
“Maintaining proprietary data warehouses for business intelligence (BI) workloads that data analysts use, and separate data lakes for data science and machine learning workloads has led to complicated, expensive architecture that slows down the ability to get value from data and tangles up data governance,” Joel Minnick, VP of product marketing at Databricks, said. “Data analytics, data science, and machine learning have to continue to converge, and as a result, we believe the days of maintaining both data warehouses and data lakes are numbered.”
This certainly would be good news for DBAs, who would welcome the prospect of having to maintain just one pool of data that all parties can use. Additionally, converging separate data silos might also go a long way toward eliminating the work silos between the data science and IT groups, fostering better coordination and collaboration.
As a single data repository that everyone could use, Minnick proposes a data “lakehouse,” which combines data lakes and data warehouses into one repository.
“The lakehouse is a best-of-both-worlds data architecture that builds upon the open data lake, where most organizations already store the majority of their data, and adds the transactional support and performance necessary for traditional analytics without giving up flexibility,” Minnick said. “As a result, all major data use cases from streaming analytics to BI, data science, and AI can be accomplished on one unified data platform.”
What steps can organizations take to migrate to this all-in-one data strategy?
1. Foster a collaborative culture between data scientists and data analysts that addresses both people and tools
If the data science and IT data analysis groups have grown up independently of each other, organizations may need to build a sense of teamwork and collaboration between the two.
On the data side, the goal will be to consolidate all data in a single data repository. As part of the process, data scientists, IT data analysts and the DBA will need to work together to standardize data definitions and to decide which datasets to combine so that this shared platform can be built.
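As a rough illustration of what standardizing data definitions can involve, the Python sketch below renames each team's field names to one agreed set before the datasets are combined. Everything in it is an assumption made for the example: the field names, the `STANDARD_NAMES` mapping, and the `standardize` and `combine` helpers are hypothetical and do not come from any particular product.

```python
# Hypothetical mapping from each team's field names to one agreed definition.
STANDARD_NAMES = {
    "cust_id": "customer_id",
    "customerId": "customer_id",
    "rev": "revenue_usd",
    "revenue": "revenue_usd",
}


def standardize(record, mapping=STANDARD_NAMES):
    """Rename known fields to their standard names; leave other fields as-is."""
    return {mapping.get(key, key): value for key, value in record.items()}


def combine(*datasets):
    """Standardize every record and merge the datasets into one list."""
    return [standardize(rec) for dataset in datasets for rec in dataset]


# Two teams describing the same kind of fact with different field names.
science_data = [{"customerId": 7, "revenue": 120.0}]
analyst_data = [{"cust_id": 8, "rev": 95.5}]
shared = combine(science_data, analyst_data)
```

The mapping itself is the part that needs the collaboration: the code is trivial once both groups and the DBA have agreed on what each field means and what it should be called.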
2. Consider building a corporate center of data excellence (CoE)
“Data science is a fast-evolving discipline with an ever-growing set of frameworks and algorithms to enable everything from statistical analysis to supervised learning to deep learning using neural networks,” Minnick said. “The CoE will act as a forcing function to ensure communication, development of best practices, and that data teams are marching toward a common goal.”
Organizationally, Minnick recommends that the CoE be placed under a chief data officer.
3. Tie the data science-data analyst unification effort back to the business
A shared set of goals and data can contribute to a stronger and more integrated corporate culture. These synergies can speed time to results for the business, and that’s a win for everyone.
“In order for organizations to get the full value from their data, data teams need to work together instead of data scientists and data engineers each operating in their own siloes,” Minnick said. “A unified approach like a data lakehouse is a key factor to enable better collaboration because all data team members work on the same data rather than siloed copies.”