The 5Ws of document analysis

What?

In this article we draw on our own version of the 5Ws (or the seven circumstances of Hermagoras of Temnos) to give high-level answers to some questions about document analysis.

Who?

Document analysis can rely on mathematical, statistical, machine learning or deep learning techniques. However, the field that really plays the leading role in this task is Natural Language Processing (NLP). NLP is an area of computing that combines artificial intelligence and linguistics, and it addresses the problem of transforming natural language into a formal representation that a machine can process. In this way, a computer can exploit the information present in the manifestations of human language: texts, voice recordings, song lyrics, etc.

Where?

To carry out a document analysis, the first thing we need to know is where to get the data. Depending on what we want to find out and on the circumstances of the project, we have several data sources at our disposal:

The file
If the document is a digital file, we can exploit the data of the file itself: its name, creation date, format or extension (.docx, .pdf, .txt, etc.), owner, size, etc.
File metadata
Some file formats include a metadata layer with specific information about the document they contain. In many cases this metadata describes the publication of the document: the author, publisher, date of publication, title, etc.
The text
The text of the document itself is the basic data source for NLP analysis, to the extent that in many cases it is the only data input to the system.
The document
If the text is structured as a document, we have an excellent source of information before us. Headers, footnotes, tables of contents, chapter headings, etc. concentrate a large amount of data in very few words. These elements make it easier to obtain information such as the author, publisher, bibliographical references or the topics covered in the document. In fact, it may be more effective to analyse only the introduction or the titles of a document than to study the whole text.
Open sources
We must also take into consideration the possibility of finding information about a document outside of it. At this point, we can look for reviews, technical sheets, summaries, opinions, academic works, journalistic articles... This brings us closer to the area of OSINT (Open-Source INTelligence), with which we can obtain information that other humans have contributed about our document.
On the other hand, we can make use of open sources to semantically enrich our analysis. For example, if our text names companies, we can add information about them (size, sector, turnover...) or a link to their official website. Or if technical terms are used, we could include their definition.
Other documents
Finally, we must bear in mind that sometimes we need to know about other documents that serve as a context for our own. This is the case, for example, with news. Thus, in order to locate an editorial line in the political spectrum, it is necessary to have news from other sources with which to compare it.
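As an illustration of the first source in this list (the file itself), here is a minimal sketch of how basic file attributes could be collected with the Python standard library. The function name describe_file and the keys of the returned dictionary are our own choices for the example, not part of any particular tool. Note that it reports the last-modification time, because the creation date is not available in the same way on every operating system.

```python
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path):
    """Collect basic file-system attributes of a document (illustrative helper)."""
    p = Path(path)
    info = p.stat()
    return {
        "name": p.name,
        # Normalise the extension so ".PDF" and ".pdf" are treated alike.
        "extension": p.suffix.lower(),
        "size_bytes": info.st_size,
        # Last-modification time; true creation time is platform-dependent.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }
```

Calling describe_file on the path of any existing file returns a small dictionary that can be stored alongside the text as extra features for later analysis.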

How?

Once we have the data, we can apply NLP techniques to extract knowledge from it. Among the best known and most interesting document analysis techniques are:

Sentiment analysis.
This involves identifying and extracting subjective information (about the writer) from content of various kinds. Typically, we obtain an assessment of how positively or negatively a topic is treated in a text, although more refined metrics such as the detection of irony, affect, joy, etc. can also be obtained.
Document Structure Analysis.
Using this tool we can identify the main structural components of a document, extracting titles, headings, subjects, recipients, etc. We will quickly be able to obtain a table of contents that will provide us with an overview of the structure of the message.
Entity detection.
These techniques make it possible to locate mentions of entities within a text. In addition, they classify each entity into a category, such as people, places, companies, governmental bodies, economic quantities...
Text classification.
Document classification allows us to assign a text to one or more known, predefined categories. With this we can detect spam mail or assign subjects to a book. Since we know the possible classes of texts a priori, we make use of so-called supervised learning techniques, which are trained on examples that have already been labelled.
Text clustering.
Clustering allows us to automatically discover the implicit structure of a collection of documents. With it we can group the most similar or related texts together, or determine how many different types of texts there are. Since the groups are not known in advance and there are no labelled examples, clustering makes use of unsupervised learning techniques.
Summarisation.
These techniques allow us to grasp the content of a text without having read it in full. They typically provide a short alternative text that we can read instead of the original. It is also common to obtain key terms and metrics that help us to understand the content.
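To make the last technique in this list more concrete, here is a minimal sketch of frequency-based key terms and a one-sentence extractive summary, using only the Python standard library. The function names, the tiny stop-word list and the scoring rule are assumptions made for the example; real summarisation systems are considerably more sophisticated.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use fuller lists).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
              "that", "we", "for", "on", "with"}


def content_words(text):
    """Lower-case the text and keep only the words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def key_terms(text, top_n=3):
    """Return the top_n most frequent content words of the text."""
    counts = Counter(content_words(text))
    return [term for term, _ in counts.most_common(top_n)]


def best_sentence(text):
    """Return the sentence whose content words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(content_words(text))
    # Score a sentence by summing the document-wide frequency of its words.
    return max(sentences, key=lambda s: sum(counts[w] for w in content_words(s)))
```

This scoring favours long sentences, because every extra word adds to the total. Dividing the score by the sentence length is one simple way to reduce that bias.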

But there are many more techniques to be explored, such as the similarity of documents or obtaining representative texts of a subject. And, perhaps the most relevant, vectorisation, which allows us to express a text numerically and opens the door to the application of cutting-edge mathematical and analytical techniques in the sector.
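As a rough sketch of what vectorisation and document similarity can look like, the following example turns each text into a bag-of-words count vector and compares two vectors with cosine similarity. It uses only the Python standard library. The function names are our own, and bag-of-words is just one of the simplest possible vectorisations, chosen here for brevity.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Bag-of-words vector: maps each word of the text to its number of occurrences."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 = nothing in common)."""
    # A Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as similar to nothing.
    return dot / (norm_a * norm_b)
```

For instance, cosine_similarity(vectorise("data analysis"), vectorise("data science")) gives a value of about 0.5, since the two texts share one of their two words, while two texts with no words in common score 0.0.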

For what? When? Why?...

If you want to know what document analysis can be used for, the best thing to do is to read the article on our blog in which we describe 10 very interesting use cases.

As for the when and the why, we hope you will tell us yourself when you share the use case you are interested in solving.


Adrian Fernandez Chicote

Trained as an architect, reformed as a web designer and transformed into a data analyst. In my head inhabits a hotchpotch of information about linguistics, structure calculus, art history, algebra, neuroscience, sociology, statistics, drawing, artificial intelligence... But if you scratch a little, you'll see that what I really am is an anarchic homo ludens with a passion for games. "If I can't dance, your revolution doesn't interest me." - Emma Goldman -
