Data ingestion 101

By Ant Colony Team

Data pipelines

Let's start with the concept of data ingestion. By data ingestion, we mean the process of collecting large amounts of data from one or many sources and saving it to a storage system (or platform). The process must provide a uniform supply of data to the destination and prepare it for data analytics.

How Data Ingestion works

Data ingestion collects data from its origin, does some transformation or filtering, and stores data to the destination.

Data types

We have encountered two main categories of data types:

  1. Data types based on how the data is collected:
  • Real-time (streaming) data: collected from sources such as streaming platforms, machines, IoT devices, and sensors. This data has to be stored immediately in raw format to prevent data loss.
  • Batch data: mostly consists of millions of records accumulated over a period of time.
  2. Data types based on the structure of the data:
  • Structured data: stored in databases, because database tables enforce a strict structure of columns and data types.
  • Semi-structured data: stored in formats such as JSON, XML, or CSV. Columns and data types can differ between records.
  • Unstructured data: can be in any format (image, audio, video), cannot be stored as structured data, and needs some metadata to be more meaningful.
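To make the semi-structured case concrete, here is a minimal Python sketch (field names and records are illustrative, not from any real system) that normalizes JSON records with differing fields into a uniform column set:

```python
import json

# Hypothetical semi-structured records: fields can differ per record.
raw_records = [
    '{"id": 1, "name": "sensor-a", "temp": 21.5}',
    '{"id": 2, "name": "sensor-b"}',                 # no "temp" field
    '{"id": 3, "temp": 19.0, "unit": "C"}',          # extra "unit" field
]

def normalize(raw, columns=("id", "name", "temp")):
    """Parse a JSON record and fill missing columns with None."""
    record = json.loads(raw)
    return {col: record.get(col) for col in columns}

rows = [normalize(r) for r in raw_records]
```

After normalization every row has the same columns, which is what makes the data ready for a structured destination.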

Data Sources

Data sources are the places where the desired data can be found and collected. Data sources are mostly owned by third parties, so you usually cannot choose how the data is accessed or how it is structured. Some common data source types are HTTP APIs, Apache Kafka, databases, or information systems such as ERP or CRM.

Data Destinations

It is very important to select the proper destination type for the collected data. It's better to spend some time researching the best destination options up front than to run into trouble with destination storage after the data has been ingested. Possible data destinations include SQL/NoSQL databases, Amazon S3, or other downstream consumers.
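As a minimal sketch of writing to a SQL destination, the snippet below loads a few hypothetical records into SQLite using Python's standard library (an in-memory database stands in for a real destination; table and record names are made up for illustration):

```python
import sqlite3

# Hypothetical records already collected by an earlier ingestion step.
records = [(1, "login", "2024-01-01T10:00:00"),
           (2, "logout", "2024-01-01T10:05:00")]

conn = sqlite3.connect(":memory:")  # stand-in for a real SQL destination
conn.execute("CREATE TABLE events (id INTEGER, name TEXT, created_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The same load step would look very similar against any SQL destination; only the connection and driver change.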

Types of data ingestion

The many different data ingestion approaches can be grouped into these categories:

  • Batch-based data ingestion
  • Real-time / streaming data ingestion
  • Combination of batch and real-time data ingestion

Batch-based data ingestion

Batch data is ingested at scheduled, fixed intervals, and it can be grouped by schedule or by conditions. This approach is useful for large data volumes that need processing, and it is typically less expensive. Real-life examples of batch data include reports, timeline posts, and credit card transactions.
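The grouping step at the heart of batch ingestion can be sketched in a few lines of Python: split an incoming stream of records into fixed-size batches that a scheduled job would then process. The batch size and the record source below are illustrative assumptions.

```python
from itertools import islice

def batches(iterable, size):
    """Yield successive fixed-size batches from any iterable of records."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical day's worth of transaction IDs, ingested in batches of 1000.
transactions = range(2500)
batch_sizes = [len(b) for b in batches(transactions, 1000)]
```

In a real pipeline each yielded batch would be handed to a load step (a database insert, a file write, and so on) rather than just counted.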

Real-time / streaming data ingestion

This type of data ingestion is used when data must be collected continuously. Each piece of data is collected as soon as it is recognized and is processed individually, so it is not grouped in any way. A good approach to make sure no data is lost is to store the raw data first and let other resources (services) do the processing without interrupting the real-time ingestion services.
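The "store raw first, process separately" idea can be sketched as follows (a Python list stands in for durable raw storage, and the record contents are made up): even when a record is malformed and parsing fails, the raw copy survives for later reprocessing.

```python
import json

raw_store = []   # stand-in for durable raw storage (e.g. an object store)
processed = []

def ingest(event: str):
    """Persist the raw event before any parsing, so a bad record is never lost."""
    raw_store.append(event)          # raw write happens first
    try:
        processed.append(json.loads(event))
    except json.JSONDecodeError:
        pass  # the raw copy is kept; a separate service can retry later

ingest('{"sensor": "a", "temp": 21.5}')
ingest('{"sensor": "b", "temp":')    # malformed, but still stored raw
```

Separating the raw write from the processing is what keeps the ingestion path fast and loss-free.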

Tips and tricks with Data ingestion

  • Consistency: if the ingested data is not structured, try to make it uniform during processing to create more meaningful data.
  • Helpful metadata: add extra fields to every single piece of data (a record), such as created datetime, created by, updated datetime, updated by, source, and tags.
  • Think about the data query layer and data consumers: always consider who will consume the ingested (or processed) data and how to make retrieving it easier and faster. Set proper indexes on the database.
  • Logging, monitoring, and alerting: log as much as you can. Use tools to monitor all data ingestion services and pipelines. Don't swallow exceptions. Create alerts when something goes wrong.
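The metadata tip above can be sketched as a small enrichment helper; the field names (`ingested_at`, `source`, `tags`) are illustrative choices, not a standard:

```python
from datetime import datetime, timezone

def enrich(record, source, tags=()):
    """Attach helpful metadata fields to an ingested record."""
    return {
        **record,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "tags": list(tags),
    }

event = enrich({"user_id": 42, "action": "login"},
               source="auth-service", tags=["audit"])
```

Applying such a helper uniformly at ingestion time means every downstream consumer can rely on the same metadata being present.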

Conclusion

Data ingestion gives businesses the intelligence necessary to make all types of informed decisions. It is a central part of modern data handling and analysis. Raw, unprocessed datasets contain data that may or may not be important to a business; data ingestion and processing turn that raw data into meaningful, valuable information that can change the way we live and work.