Everyone has heard about Big Data and how massive data processing can help us solve problems that until now seemed too large, or whose processing would take more time than we have. I am talking about the kind of problems companies face when offering their clients services such as recommendation systems, prediction systems, fraud analysis, or even streaming video and audio.
There are currently many technologies that address these problems. Among them, Apache Spark is one of the most widely used processing systems today, for both structured and unstructured data.
What is Apache Spark?
Apache Spark is a distributed cluster-computing framework based on a system of operations (transformations and actions) performed on distributed data collections called RDDs (Resilient Distributed Datasets).
One of the advantages that makes Spark an efficient and fast system for massive data processing is that RDDs can be kept in memory, allowing faster access to them for complex operations.
However, the reason Spark has become so widespread and a reference technology for data processing is that it is an open source project under continuous development, with a huge active community. As a result, a variety of libraries have been developed to solve different types of problems.
Among all the modules that Spark offers, in this article we are going to dive into Spark SQL.
When we speak of structured data, we refer to the information usually found in most databases: labeled and controlled information that can be organized into rows and columns. Spark SQL is mainly used to process this type of information, and it does so by means of DataFrames.
DataFrames are distributed data sets organized into columns that can be built from various data sources such as Cassandra, Hive, Elasticsearch, JDBC, CSV, JSON or Avro files, or directly from an existing RDD. To put it more simply, DataFrames are conceptually equivalent to tables in a relational database.
The Spark SQL API allows connecting to these data sources and retrieving the data, which is then managed in memory by Spark.
Example of a DataFrame construction using JDBC
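A minimal sketch of such a construction in PySpark; the connection URL, table name and credentials below are hypothetical placeholders, and the matching JDBC driver must be on Spark's classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

# Hypothetical connection details -- replace with your own database
sales = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")
    .option("dbtable", "sales")
    .option("user", "reader")
    .option("password", "secret")
    .load())

# From here on, 'sales' behaves like any other DataFrame
sales.show()
```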
So, what are the advantages of working with Spark?
Bearing in mind that a company's data sources can be diverse, with structured or semi-structured information kept in different systems, a typical use case would be a sales company that collects information on its clients' movements, stores records of all its sales, and also receives additional information from external data sources in JSON format.
With Apache Spark you could unify all this information and process it to obtain results, such as finding out who your potential customers are, or which customers repeat the experience by placing further orders.
From this, we can conclude that Spark SQL provides us with an abstraction layer that supports structured and semi-structured data, and that it offers the following advantages:
- Unifies information distributed across different environments or storage systems.
- Lets you explore and analyze data at a high level of abstraction using the SQL language, making it accessible to non-specialists.
- Removes restrictions imposed by the source systems. For example, Cassandra does not allow queries that filter by fields not included in the primary key; Spark SQL, by keeping the information in memory, allows queries on any field.
- Offers optimized operations. The API provides a multitude of transformation methods that abstract away low-level complexity, ensuring an efficient process.
- Is an open source project in continuous development, backed by a large and active community.