
How do you design a Big Data architecture and not die trying?

Big Data Architectures

Gartner [1] defines Big Data systems by the "three Vs": high volume, high velocity, and a wide variety of information assets. Of these three, variety is becoming the decisive factor when evaluating an investment in Big Data.

In surveys of Big Data department heads conducted in 2016 [2], variety of data was highlighted as the most important factor, followed by volume (25%) and finally velocity (6%). The trend in 2017 suggests that the great opportunity lies in integrating more data sources: not larger amounts of data, but different sources. In other words, variety takes over as the most important characteristic, relegating volume to second place. This is why companies are focusing on identifying new sources and integrating data sources that have traditionally served other purposes, such as supporting their applications: getting more use out of data sources has emerged as the new challenge in the corporate world.

When designing Big Data architectures, we face a very wide and constantly growing toolbox, which can easily feel overwhelming. Because I believe that feeling is familiar to most of us at some point, I decided to write this article to help you, or at least guide you, when it comes to designing Big Data architectures.

Before you dive into the post, a caveat: after reading it you will not have the solution to your architecture, that finished "picture" of the architecture you should use, because there is no single architecture that covers every use case. What I do want is for this article to make clear what to ask ourselves and what to take into account when designing a solution.

What features does your Big Data system need to have?

Normally, when we implement a new Big Data architecture, we do it for a "small objective"; that is, it arises from the need to solve a specific use case. On the one hand this is positive, and something we have strongly recommended on other occasions: "focus on your objective". Without a clear objective your solution may well fail. On the other hand, keep in mind that the architecture will grow and must be able to support future use cases.

One of the most important things is to define the high-level "layers" of your architecture, and to make sure that none of them is tied to a specific tool, because the architecture will often need different types of tools to cover the different problems it has to solve. Let's look at a small example:

Our Big Data architecture must have an entry point for the data, often called the "data ingestion layer". If, a priori, our use case only needs to acquire data from relational databases, you may decide to use Sqoop. But what happens when you later need to read files, logs, or emails? You will have to incorporate new tools such as Flume, NiFi, etc.

This kind of situation will arise not only in your data ingestion layer but in every layer of your architecture, so my recommendation is not to design around a tool; instead, base your architecture on modular nodes that make up each layer. This will keep your design scalable and modular, for example along the lines of the sketch below.
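
Here is a minimal Python sketch of that idea (the class and method names are purely illustrative, not an existing framework): the ingestion layer depends only on an interface, and each concrete tool, Sqoop for relational sources, Flume or NiFi for files and logs, sits behind its own adapter.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable


class IngestionSource(ABC):
    """Contract for any source plugged into the ingestion layer."""

    @abstractmethod
    def read(self) -> Iterable[Dict[str, Any]]:
        """Yield records normalised to a common shape for the next layer."""


class RelationalSource(IngestionSource):
    """Placeholder for a JDBC-style source (the role Sqoop plays in the example)."""

    def __init__(self, connection_string: str, table: str):
        self.connection_string = connection_string
        self.table = table

    def read(self) -> Iterable[Dict[str, Any]]:
        # In a real system this would delegate to Sqoop, a JDBC driver, etc.
        yield {"source": self.table, "payload": "..."}


class LogFileSource(IngestionSource):
    """Placeholder for a file or log source (the role Flume or NiFi plays)."""

    def __init__(self, path: str):
        self.path = path

    def read(self) -> Iterable[Dict[str, Any]]:
        with open(self.path) as handle:
            for line in handle:
                yield {"source": self.path, "payload": line.rstrip()}


def ingest(sources: Iterable[IngestionSource]) -> Iterable[Dict[str, Any]]:
    """The ingestion layer only knows the interface, never the concrete tool."""
    for source in sources:
        yield from source.read()
```

Adding a new type of source then means adding a new adapter, not redesigning the layer.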

Using different tools will let you cover more use cases, but this does not mean you should use everything. Evaluate tools before incorporating them and think ahead: a system with a variety of tools becomes more complex to manage and administer.

Among the characteristics that your architecture should possess, we find:

  • Scalability: it must be possible to increase the hardware capacity of each layer of the system, in some cases its processing capacity and in others its storage capacity.
  • Fault tolerance: if any server or node goes down, the system must remain available and avoid data loss. This applies to every layer of your architecture.
  • Data distribution: because of the large volume of information, these systems distribute the data, so your architecture needs multiple nodes to house it. This is a long way from traditional solutions, whose paradigm was based on centralised data.
  • Distributed processing: to process these volumes of information and apply more or less complex algorithms, these solutions rely on distributed processing to optimise execution times. Your design must therefore be prepared to spread processing across different nodes and to scale out.
  • Data locality: this term is widely used in Big Data systems and refers to keeping the analytical processes close to the data they process. These architectures must favour data locality to avoid network transfers, which traditional systems do not treat as a critical point but which here can penalise execution times (see the sketch after this list).
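
To make the last two points concrete, here is a minimal PySpark sketch (the HDFS path and application name are hypothetical, and it assumes a Spark cluster running over HDFS): the file already sits in blocks spread across the cluster, and Spark tries to schedule each task on the node holding the block it processes, so the heavy work happens next to the data and only small results cross the network.

```python
from pyspark.sql import SparkSession

# Hypothetical application name; on a cluster the master is set externally.
spark = SparkSession.builder.appName("locality-demo").getOrCreate()

# Reading from HDFS: the file is already split into blocks across the cluster,
# and Spark asks the cluster manager to run each task on (or near) the node
# that holds the corresponding block -- data locality in practice.
events = spark.read.text("hdfs:///data/raw/events")

# The count runs in parallel on the nodes holding the data; only the small
# per-partition results travel over the network back to the driver.
print(events.count())

spark.stop()
```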

When it comes to choosing technologies, don't get overwhelmed. Here are some tips that can help you in your design:

  • For data ingestion: evaluate the types of sources you have; not every tool suits every source, and in some cases you will find it is best to combine several tools to cover them all.
  • For processing: evaluate whether your system has to be streaming or batch. Some systems that are not purely streaming use what they call micro-batching, which is usually enough for the problems people colloquially call streaming (see the sketch after this list).
  • For monitoring: keep in mind that we are talking about a multitude of tools whose monitoring, control and management can be very tedious. So whether you decide to install a complete stack or to install independent tools and compose your own architecture, I also recommend you use tools to control, monitor and manage it; this will simplify and centralise all these tasks.
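
As an illustration of that micro-batch idea, here is a minimal Spark Structured Streaming sketch in Python (the Kafka broker and topic names are hypothetical, and the Kafka connector package must be available to Spark): the query reads a continuous stream but actually processes it as a series of small batches, triggered here every 30 seconds.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

# Hypothetical Kafka broker and topic; Kafka exposes key, value, timestamp, etc.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Count events per one-minute window; the stream is handled as a sequence of
# small batches rather than record by record.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

# The trigger makes the micro-batch explicit: one batch every 30 seconds.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .trigger(processingTime="30 seconds")
    .start()
)

query.awaitTermination()
```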

Don't stray from your path: there are things we must always keep in mind.

As you gain experience designing Big Data architectures you will surely be able to give me ideas on this subject, but for the moment, from my point of view, these are some of the main questions you should answer before starting to design a solution.

  • Focus on your use cases: once your objectives are clear, you will know what to equip your architecture with. Volume, variety, velocity, ... do you really need them all?
  • Define your architecture: batch or streaming? Do you really need your architecture to support streaming?
  • Evaluate your data sources: How heterogeneous are your data sources? Do the chosen tools support all the types of data sources you have?

