Have you ever needed information without knowing where to find it? Have you had to work with data whose origin you didn't know? Do you know who is responsible for the information you handle? Are you aware of the value of all the data you store? These are some of the many questions that a Data Governance system answers.
As Tania Arcos noted in her article on the entry into force of the GDPR, there are more and more reasons to implement a Data Governance system. Put simply, such a system manages the knowledge a company has about its information, and to achieve this it pursues the following objectives:
- Define policies and procedures to control access to company data.
- Promote, control and monitor the execution metrics of data management services.
- Manage and resolve information-related problems within the company.
- Define, communicate and promote the value of data assets.
But... how do you build a Data Governance system?
Building a system that meets all these objectives is a complex task, which is why companies like Hortonworks included the Apache Atlas tool in their stack to support the construction of such systems.
Apache Atlas is an open-source tool designed specifically to address data governance problems. It provides a set of functionalities generic enough to let you define an appropriate knowledge management system, regardless of your company's business model.
Metadata management
The importance of metadata in this kind of system is that it is the starting point for knowing how and where your data is stored. To this end, Apache Atlas provides an API for managing metadata by creating types and entities that model the information structures contained in your data lake. As a concrete example of metadata management in a data lake backed by Apache Hive: each table that Hive stores is represented by Apache Atlas as an entity of type hive_table.

In the image you can see the hive_table type entities that have been extracted from the data lake.
Atlas extracts metadata automatically in the case of Apache Hive and other compatible systems. For less common storage systems, however, you need to register the metadata manually through its REST API.
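As a sketch of what manual registration might look like, the snippet below builds the JSON body that the Atlas v2 entity endpoint (POST /api/atlas/v2/entity) expects. The server URL, credentials, qualified name and description used here are illustrative assumptions, not values from a real deployment:

```python
import json

ATLAS_URL = "http://atlas-host:21000"  # hypothetical Atlas server address

def build_entity_payload(type_name, qualified_name, name, description=""):
    """Build the request body for POST /api/atlas/v2/entity."""
    return {
        "entity": {
            "typeName": type_name,
            "attributes": {
                # qualifiedName must be unique within the type
                "qualifiedName": qualified_name,
                "name": name,
                "description": description,
            },
        }
    }

payload = build_entity_payload(
    "hive_table",
    "sales.customers@production",  # made-up qualified name
    "customers",
    "Customer master data",
)
print(json.dumps(payload, indent=2))

# To actually register the entity you would POST the payload, e.g.:
# requests.post(f"{ATLAS_URL}/api/atlas/v2/entity",
#               json=payload, auth=("admin", "admin"))
```

For Hive itself this is unnecessary, since the Hive hook pushes metadata to Atlas automatically; a builder like this is only needed for sources Atlas does not integrate with out of the box.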
Information modelling
The tool has a label (tag) assignment system that allows information to be modelled and defined very flexibly. These labels both classify and describe the information, helping users understand the technical and business meaning of the data.
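In the REST API these labels are called classifications, and they are attached by posting a list of classification objects to /api/atlas/v2/entity/guid/{guid}/classifications. The sketch below builds such a list; the label names and the GUID in the comment are made up for illustration:

```python
import json

def build_classifications(*tag_names, attributes=None):
    """Build the body for POST /api/atlas/v2/entity/guid/{guid}/classifications."""
    return [
        {"typeName": name, "attributes": attributes or {}}
        for name in tag_names
    ]

# Label an entity as containing personal data (names are illustrative)
body = build_classifications("PII", "Confidential")
print(json.dumps(body, indent=2))

# guid = "hypothetical-entity-guid"
# requests.post(f"{ATLAS_URL}/api/atlas/v2/entity/guid/{guid}/classifications",
#               json=body, auth=("admin", "admin"))
```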

In addition to labels, Atlas lets you model and define information at a more conceptual, business-oriented level through the use of taxonomies. Taxonomies classify information hierarchically, making the data easier to understand for users with a less technical profile.
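To make the idea concrete, a taxonomy is simply a hierarchy of business terms addressed by a dotted path. The sketch below models a made-up taxonomy as a nested dict and resolves such paths; the hierarchy, the term names and the path syntax are all illustrative assumptions, not Atlas internals:

```python
# A made-up business taxonomy, modelled as a nested dict.
# Terms are addressed by a dotted path, e.g. "Business.Sales.Customer".
taxonomy = {
    "Business": {
        "Sales": {"Customer": {}, "Order": {}},
        "Finance": {"Invoice": {}},
    }
}

def resolve_term(taxonomy, path):
    """Return the sub-tree for a dotted term path, or None if it doesn't exist."""
    node = taxonomy
    for part in path.split("."):
        if part not in node:
            return None
        node = node[part]
    return node

print(resolve_term(taxonomy, "Business.Sales"))      # {'Customer': {}, 'Order': {}}
print(resolve_term(taxonomy, "Business.Marketing"))  # None
```

This hierarchical addressing is what lets a business user navigate from a broad area down to a specific concept without knowing anything about the underlying tables.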

Data lineage
The origin, or lineage, of data is a very important part of this kind of system, since it lets users know where the information they are using comes from. Apache Atlas includes a feature that displays data lineage as a graph, visualizing the processes and transformations the data undergoes over time, from its source to its destination.
Within the Atlas interface you can search for the different entities in your governance system; when you open one of them, a LINEAGE & IMPACT tab shows the lineage of the data:
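The same graph is also available programmatically via GET /api/atlas/v2/lineage/{guid}. The sketch below walks the relations list of such a response and returns each edge as "source -> target"; the sample response, its GUIDs and display names are hand-made for illustration:

```python
# Hand-made sample of the structure returned by GET /api/atlas/v2/lineage/{guid};
# all GUIDs and names are illustrative.
sample_lineage = {
    "baseEntityGuid": "guid-table-b",
    "relations": [
        {"fromEntityId": "guid-table-a", "toEntityId": "guid-etl-job"},
        {"fromEntityId": "guid-etl-job", "toEntityId": "guid-table-b"},
    ],
    "guidEntityMap": {
        "guid-table-a": {"displayText": "raw_sales"},
        "guid-etl-job": {"displayText": "clean_sales_job"},
        "guid-table-b": {"displayText": "sales_clean"},
    },
}

def lineage_edges(lineage):
    """Return every edge of the lineage graph as 'source -> target'."""
    names = lineage["guidEntityMap"]
    edges = []
    for rel in lineage["relations"]:
        src = names[rel["fromEntityId"]]["displayText"]
        dst = names[rel["toEntityId"]]["displayText"]
        edges.append(f"{src} -> {dst}")
    return edges

for edge in lineage_edges(sample_lineage):
    print(edge)
# raw_sales -> clean_sales_job
# clean_sales_job -> sales_clean
```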

Conclusion
With the large volumes of data brought by Big Data and new technologies, systems for managing and administering a company's knowledge of its information are becoming increasingly necessary. Systems of this kind allow data to be organized more efficiently and make it easier to understand.