At its simplest, data ingestion is the process of moving or replicating data from a source to a new destination. Sources include databases, files and even IoT data streams. The destination can be on-premises, but more often than not, it’s in the cloud.
Ingested data remains in its raw, original form, exactly as it existed in the source. If the data needs to be parsed or transformed into a format that is more compatible with analytics or other applications, that is a separate follow-up operation that will still need to be performed. In this guide, we’ll discuss additional specifics and benefits of data ingestion, as well as some of the top data ingestion tools to consider investing in.
- What is the purpose of data ingestion?
- Types of data ingestion
- Data ingestion vs. ETL
- Top data ingestion tools
What is the purpose of data ingestion?
The purpose of data ingestion is to move large volumes of data rapidly. This speed is possible because the data is not transformed while it is being moved or replicated.
Data ingestion uses software automation to move large amounts of data efficiently, so the operation requires little manual effort from IT. It is a means of mass data capture from virtually any source, and it can handle the extremely large volumes of data that enter corporate networks every day.
Data ingestion is a “mover” technology that can be combined with data editing and formatting technologies such as ETL. By itself, data ingestion only ingests data; it does not transform it.
For many organizations, data ingestion is a critical tool for managing the front end of their data, that is, the data that is just entering the enterprise. A data ingestion tool lets companies immediately move their data into a central data repository without the risk of leaving any valuable data “out there” in sources that may later become inaccessible.
Types of data ingestion
There are three fundamental types of data ingestion: real-time, batch and lambda.
Real-time data ingestion
Real-time data ingestion immediately moves data as it comes in from source systems such as IoT devices, files and databases.
To economize this data movement, data ingestion uses a tried-and-true method of data capture: It only captures data that has changed since the last time data was collected. This operation is known as “change data capture” (CDC).
Real-time data ingestion is frequently used for moving application data related to stock trading or IoT infrastructure monitoring.
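The change data capture idea described above can be illustrated with a minimal sketch. It assumes a query-based approach in which every source row carries an `updated_at` value and the ingestion job remembers the highest value it has seen. The `readings` table, its columns and the `capture_changes` function are hypothetical names chosen for illustration; many real tools read the database's transaction log instead of polling a column like this.

```python
import sqlite3

def capture_changes(source, last_seen):
    """Return rows modified after `last_seen`, plus the new high-water mark."""
    rows = source.execute(
        "SELECT id, payload, updated_at FROM readings "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # If nothing changed, keep the previous mark so the next poll starts there.
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

# Hypothetical in-memory source table standing in for a real source system.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, payload TEXT, updated_at INTEGER)"
)
source.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)", [(1, "a", 100), (2, "b", 105)]
)

# First pass: every row is newer than the initial mark of 0.
first_pass, mark = capture_changes(source, 0)

# One row changes in the source; the second pass picks up only that row.
source.execute("UPDATE readings SET payload = 'b2', updated_at = 110 WHERE id = 2")
second_pass, mark = capture_changes(source, mark)
```

Because only the changed row is returned on the second pass, the amount of data moved stays proportional to what changed, not to the size of the source.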
Batch data ingestion
Batch data ingestion collects data in batches, either overnight or at periodic intervals scheduled during the day. This lets organizations capture all of the data they need for decision-making in a timely fashion when the use case does not require real-time data capture.
Collecting sales data from distributed retail and e-commerce selling outlets is a good example of when batch ingestion would be used.
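As a rough sketch of the batch approach, the example below copies a set of sales records to a destination in fixed-size chunks without changing them. The `batches` and `ingest_batch` functions, the record layout and the list used as a destination are all assumptions made for illustration; in practice a scheduler would trigger the job at the chosen interval and the destination would be a bulk write to a data repository.

```python
from itertools import islice

def batches(records, size):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def ingest_batch(records, destination, size=1000):
    """Copy records to the destination unchanged, one batch at a time."""
    count = 0
    for chunk in batches(records, size):
        destination.extend(chunk)  # stand-in for a bulk write to a repository
        count += 1
    return count

# Hypothetical sales records gathered from several outlets.
sales = [
    {"store": "north", "total": 120},
    {"store": "south", "total": 80},
    {"store": "web", "total": 310},
]
warehouse = []
batch_count = ingest_batch(sales, warehouse, size=2)
```

The batch size is the main tuning knob here: larger batches mean fewer writes, while smaller batches keep memory use low.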
Lambda data ingestion
Lambda data ingestion combines real-time and batch data ingestion practices. The goal is to move data as quickly as possible.
If latency or a data transfer speed issue could impact performance, the lambda data ingestion model can temporarily queue data, sending it to target data repositories only when those repositories become available.
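The queuing behavior described above might look something like the following sketch. It assumes a simple in-memory buffer that holds records while the target is unavailable and drains them in arrival order once it is reachable. `BufferedIngestor`, its `send` and `is_available` callbacks and the `state` flag are hypothetical; a real deployment would typically rely on a durable queue, not process memory.

```python
from collections import deque

class BufferedIngestor:
    """Queue records and forward them only while the target is available."""

    def __init__(self, send, is_available):
        self.send = send                  # callable that writes one record
        self.is_available = is_available  # callable returning True or False
        self.pending = deque()

    def ingest(self, record):
        self.pending.append(record)
        self.flush()

    def flush(self):
        # Drain in arrival order, stopping as soon as the target goes away.
        while self.pending and self.is_available():
            self.send(self.pending.popleft())

# Hypothetical target: a list, with a flag simulating availability.
target = []
state = {"up": False}
ingestor = BufferedIngestor(target.append, lambda: state["up"])

ingestor.ingest("r1")  # target is down, so the record is queued
ingestor.ingest("r2")
queued_while_down = len(ingestor.pending)

state["up"] = True
ingestor.ingest("r3")  # target is back, so the whole queue drains in order
```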
Data ingestion vs. ETL
Data ingestion is a rapid-action process that takes raw data from source systems and moves the data in a direct, as-is state into a target central data repository.
ETL (extract, transform, load) also moves data, but it is slower than data ingestion because it transforms the data into formats that are suitable for access in the central data repository where the data will be housed.
The advantage of data ingestion is that you can immediately capture all of your incoming data. However, once you have the data, you will still have to work on it so it can be formatted for use.
With ETL, most of the data formatting is already done. The downside to ETL is that it takes longer to capture and process incoming data.
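To make the contrast concrete, here is a small sketch of both paths applied to the same input. It assumes the source emits JSON text lines with prices stored as strings; the `ingest` and `etl` functions, the field names and the two list-based stores are illustrative only and do not represent any particular tool.

```python
import json

# Hypothetical raw records as they arrive from the source.
raw_events = [
    '{"sku": "A1", "price": "19.90"}',
    '{"sku": "B2", "price": "5.00"}',
]

def ingest(lines, repository):
    """Ingestion: copy each record exactly as it arrived."""
    repository.extend(lines)

def etl(lines, repository):
    """ETL: parse each record and convert its fields before loading."""
    for line in lines:
        record = json.loads(line)
        record["price"] = float(record["price"])  # the transform step
        repository.append(record)

raw_store, curated_store = [], []
ingest(raw_events, raw_store)    # fast, but still unparsed text
etl(raw_events, curated_store)   # more work per record, but ready to query
```

After both calls, `raw_store` holds the original strings untouched, while `curated_store` holds parsed records with numeric prices. The extra per-record work in `etl` is the cost the section above refers to.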
Top data ingestion tools
Precisely Connect
Formerly known as Syncsort, Precisely Connect provides both real-time and batch data ingestion for advanced analytics, data migration and machine learning goals. It also supports both change data capture (CDC) and ETL functionality.
Precisely Connect can read data from and write data to either on-premises or cloud-based systems. Data can be in relational database, big data, streaming or mainframe formats.
Apache Kafka
Geared toward big data ingestion, Apache Kafka is an open-source software solution that provides high-throughput data integration, streaming analytics and data pipelines. It can connect to a wide variety of external data sources. It is also a gateway to a plethora of add-on tools and functionality from the global open-source community.
Talend Data Fabric
Talend Data Fabric enables you to pull data from as many as 1,000 different data sources. Data can be targeted to either internal or cloud-based data repositories.
The cloud services that Talend supports are Google Cloud Platform, Amazon Web Services, Snowflake, Microsoft Azure and Databricks. Talend Data Fabric also features automated error detection and correction.