Data intake is the first step in, and one of the fundamental pieces of, a Big Data architecture. It is tempting to think of it as "just" copying or moving data between systems. But beware: not giving it the proper importance, and not spending time on everything that goes with it, can cause a real case of data indigestion in your Big Data system.
What is Data Intake?
Data intake is the process by which data from different sources, with varying structures and/or characteristics, is loaded into another data storage or processing system.
This is undoubtedly the first step to consider when designing a Big Data architecture. It requires being very clear not only about the type and source of the data, but also about the final objective and what you intend to achieve with it. A thorough analysis must be made at this point, because it is the basis for choosing the technologies that will make up your Big Data architecture... remember that we do not want to build systems on a bad foundation...
So, what are the factors to be considered?
When analyzing which technology and architecture are appropriate for the data intake in your Big Data system, you have to take the following factors into account:
Data origin and format
At this point, several questions should be asked:
- What will be the origin or origins of the data?
- Do they come from external or internal systems?
- Will it be structured data or unstructured data?
- What is the volume of data? Consider both the daily volume and what the initial data load would look like.
- Is there a possibility that new data sources will be added later?
With this information, we can start to evaluate the types of connectors that will be necessary and, depending on the data volume, the scalability the system should have. It is also the moment to think about how a new data source would be incorporated into the system and what that would involve.
Latency
At this point, it is important to be clear about how long it should take from the moment the data is ingested until it is usable. This must be taken into account when deciding on the extraction method, the destination of the data, and the connectors to be used, since there are batch technologies, where latency can be hours or days, and real-time technologies, where latency is measured in milliseconds. The lower the required latency, the more complex the technology will be to implement.
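The batch-versus-streaming trade-off can be sketched with a toy example (plain Python with invented helper names; real systems would use Kafka, Spark, and the like):

```python
def batch_ingest(records, batch_size=3):
    """Group records into batches; nothing is usable until a batch is flushed,
    so latency grows with the batch window (hours or days in real systems)."""
    batches, current = [], []
    for record in records:
        current.append(record)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:  # flush the final partial batch
        batches.append(current)
    return batches

def stream_ingest(records, handler):
    """Hand each record to the handler as soon as it arrives:
    per-record latency, but a more complex pipeline to operate at scale."""
    for record in records:
        handler(record)
```

In the batch case a record may wait for the whole window before becoming usable; in the streaming case it is available immediately, at the cost of operational complexity.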
Another aspect to take into account is whether the information in the source systems is frequently modified. If it is, you must analyze what strategy to follow with the information that has already been ingested: could all of it be stored, keeping a history of changes, or should the existing information be modified? And which is the better strategy, updates or delete+insert? This will depend on the latency you want to achieve, as well as on the possibilities offered by the chosen data destination.
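The two strategies just mentioned can be illustrated with a minimal in-memory sketch (hypothetical helpers, not tied to any specific store):

```python
def ingest_with_history(store, record):
    """Append every incoming version of a record: a full change history is kept."""
    store.setdefault(record["id"], []).append(record)

def ingest_with_upsert(store, record):
    """Keep only the latest version of each record
    (equivalent to an update, or to delete+insert)."""
    store[record["id"]] = record
```

The history variant never loses data but grows without bound; the upsert variant keeps the destination small but discards previous versions.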
It will also be necessary to analyze whether transformations will be required during the intake process (on-the-fly transformations). Such transformations add latency, which can jeopardize the target latency of the system, since complex transformations can hurt the performance needed to obtain the information in real time. Bear in mind as well that after a transformation the ingested information will no longer be a mirror of the source information, which must be factored into the data update strategy that has been selected.
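An on-the-fly transformation can be sketched as a generator that rewrites records while they are being ingested (the field names and the normalization rule here are invented for illustration):

```python
def transform_on_the_fly(raw_records, transform):
    """Apply a transformation while ingesting; the stored data will no
    longer mirror the source, and each transform adds latency."""
    for record in raw_records:
        yield transform(record)

def normalize(record):
    """Example transform: lowercase keys and strip whitespace from string values."""
    return {key.lower(): value.strip() if isinstance(value, str) else value
            for key, value in record.items()}
```

Because it is a generator, records flow through one at a time, which fits both streaming and batch pipelines.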
Destination of the data
When selecting the final destination, you should take into account aspects such as:
- Will it be necessary to send the data to more than one destination, for example, HDFS and Cassandra?
- How will the data be used at the destination? What kind of queries will be run? Will they be random lookups or not?
- What data transformation processes will be performed once the data is ingested?
- What is the update frequency of the source data?
These answers affect the decision to partition or bucket the data, as well as the tool used to query it (Hive, Cassandra, HBase...), and have a lot to do with the latency and availability of the information at the time of intake or processing.
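For example, partitioning by date is a common layout when the destination is HDFS/Hive; a minimal sketch of the path convention (the base directory and table name are placeholders, adjust to your cluster):

```python
from datetime import date

def partition_path(base_dir, table, event_date):
    """Build a Hive-style date partition directory: .../table/dt=YYYY-MM-DD."""
    return f"{base_dir}/{table}/dt={event_date.isoformat()}"
```

Writing each day's load into its own `dt=` partition lets queries prune to the relevant dates instead of scanning the whole table.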
Study of the data
Other factors to consider:
- Data quality: review whether the data is reliable, i.e. whether it is well reported or contains duplicates. As with updates, we must think about what to do: take the data as-is into our Big Data system, or clean it as far as possible in order to have quality data.
- Data security: take into account whether any data is sensitive or confidential. If so, decide whether it should be masked, or not dumped directly into the system at all, depending on the final objective.
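Both factors can be sketched with minimal helpers (the field names are hypothetical; real pipelines would use dedicated data-quality and masking tooling):

```python
import hashlib

def deduplicate(records, key):
    """Data quality: keep the first occurrence of each key, drop duplicates."""
    seen, clean = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            clean.append(record)
    return clean

def mask_field(record, field):
    """Data security: replace a sensitive field with a one-way SHA-256 digest,
    so the value can still be joined on but not read back."""
    masked = dict(record)
    masked[field] = hashlib.sha256(str(record[field]).encode()).hexdigest()
    return masked
```

Hashing preserves equality (the same input always masks to the same digest), which keeps joins and deduplication possible on the masked column.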
Data Intake Technologies
Once all the information has been analyzed and the final objective is clear, it is time to design the Big Data architecture. A good tactic is to compare the available technologies against the analysis done previously and see what each one contributes. There are many technological solutions for data intake; in fact, some are not "pure" intake tools, since they can ingest and process data at the same time. Some of these technologies are listed below:
- Basic File Transfer to HDFS
- Apache Flume
- Apache Kafka
- Apache NiFi
- Apache Spark
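As an illustration of the first option in the list, a basic file transfer to HDFS usually boils down to the `hdfs dfs -put` command; a minimal sketch that builds that command from Python (actually running it requires a node with the HDFS client configured):

```python
def hdfs_put_cmd(local_path, hdfs_dir, overwrite=True):
    """Build the 'hdfs dfs -put' command for a basic file transfer to HDFS."""
    cmd = ["hdfs", "dfs", "-put"]
    if overwrite:
        cmd.append("-f")  # -f overwrites the destination file if it exists
    cmd += [local_path, hdfs_dir]
    return cmd  # e.g. run with subprocess.run(cmd, check=True)
```

This is the simplest possible intake: no connectors, no transformations, no latency guarantees beyond the copy itself.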
Each of these technologies has its pros and cons in relation to the aspects discussed above. It is therefore necessary to study those pros and cons against your objectives, in order to design the architecture that best suits your needs, so that the final system is a solid one, immune to data indigestion. In future articles we will study the advantages and disadvantages of some of these technologies with respect to the factors we have discussed, to help when designing the architecture to be implemented.