In recent articles we have talked about analysing documents without reading them, as well as the different use cases for this type of tool. One of the applications discussed in the first of those articles was the proposal innovation evaluator, but what exactly does that mean? This is the question we will try to answer in this article. We will take a look at the technologies being developed and how to apply them to optimise time and resources in our day-to-day work.
Documents as protagonists
Documenting a topic, identifying innovative papers on a subject or establishing the state of the art is a major challenge given the large volumes of data available.
So how can we deal with this problem? One of the steps being taken in this direction is the development of innovation evaluators. Supported by Machine Learning, these tools compare the content of different documents and identify whether or not they deal with the same subject without the need to read them. This makes it possible to automate this type of work and reduce the time spent on it.
Certain selection processes, research project competitions and the execution of works in city councils are just some examples of these situations. The documentation to be reviewed, classified and evaluated can demand so much work and time that these processes become very costly, and there is also the risk of "overlooking" relevant and valuable proposals. Innovation evaluators were created to respond to these needs. We will now explain the different phases involved: creation of a corpus, state of the art and innovation evaluator.
Creation of a corpus
When developing a tool to measure the degree of innovation of a document within a field, it is essential to have a sample of real data that is representative of the specific domain. This set of information is what is commonly known as a corpus.
For example, a corpus can be made up of texts that differ in type (scientific research, medical reports, etc.), in subject (financial, advertising, literary) or in the number of languages they cover (monolingual or multilingual). And how can we obtain a representative corpus? There are different ways, which can be summarised as follows:
- Own documents: the corpus is created from documents previously collected by the organisation, e.g. from institutional archives, digitised texts, studies, own scientific research, etc.
- OSINT and Crawler methodologies: if we do not have our own set of texts, we can obtain the corpus from open sources. As we explained in our previous article "OSINT, The power of open source information", there are tools that allow us to collect information from external sources on a given topic and generate a corpus in an automated way.
State of the art
Once the corpus has been established, the next step is what we call the classification of the state of the art: analysing the documents in the corpus to gain a general knowledge of the topics they deal with without the need to read them. In other words, we are able to classify the documents without knowing anything in advance about the themes they contain. This classification is based on the similarity of the texts.
In our article "Classifying the state of the art" we outlined some of the methods that are being used. Some of these ideas are listed below, but we strongly recommend reading that article for more detailed information:
- Machine Learning identification of groupings of documents (classes) that share a certain similarity of content.
- Detection of themes contained in our corpus by means of key concepts, the most characteristic terms of each group, frequency of occurrence of words, etc.
- Identification of the most representative text of each class with the aim of providing an overall idea of the topics covered. This enables a general knowledge of a class of texts to be acquired by reading just one of them.
- Metrics in relation to the degree to which a document belongs to a theme.
- Temporal analysis of text publications. Studies of the temporal evolution of publications that make it possible to detect publication patterns or trends.
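Some of the ideas in the list above can be sketched in a few lines of code. The following is a minimal illustration, not the method used in the articles mentioned: it assumes TF-IDF weighting, cosine similarity and a simple greedy grouping rule, and the function names, the sample texts and the threshold value are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_documents(vectors, threshold=0.1):
    """Greedy grouping (an assumption): a document joins the first group
    whose first member is similar enough, otherwise it starts a new group."""
    groups = []
    for index, vector in enumerate(vectors):
        for group in groups:
            if cosine(vector, vectors[group[0]]) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


def top_terms(vectors, group, k=3):
    """Most characteristic terms of a group: highest summed TF-IDF weight."""
    totals = Counter()
    for index in group:
        totals.update(vectors[index])
    return [term for term, _ in totals.most_common(k)]


def representative(vectors, group):
    """Index of the document most similar, in total, to its own group."""
    return max(group, key=lambda i: sum(cosine(vectors[i], vectors[j]) for j in group))


# Hypothetical sample corpus, for illustration only.
docs = [
    "solar panels convert sunlight into electricity",
    "solar energy storage with batteries and panels",
    "neural networks classify images",
    "deep neural networks for image classification",
]
vectors = tfidf_vectors(docs)
groups = group_documents(vectors, threshold=0.1)
for group in groups:
    print(group, top_terms(vectors, group, k=2), docs[representative(vectors, group)])
```

In practice the grouping step would usually be delegated to an established clustering algorithm; the greedy rule is used here only to keep the sketch self-contained.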
Innovation evaluator
Once the above steps have been taken, we can begin to measure the level of innovation of a text in relation to the state of the art that we use as a frame of reference. This level of innovation is measured through certain parameters that tell us how similar or different each document is: how many topics it deals with that are already known, what percentage of its content deals with new topics, etc.
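As a rough illustration of such parameters, the sketch below compares a candidate text with a reference corpus and reports two figures: how similar it is to the closest known document, and what share of its words never appear in the corpus. This is only one possible approach under simple assumptions (bag-of-words counts, cosine similarity, a tiny stop-word list); the function name `novelty_report` and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter

# A very small stop-word list, assumed here only for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "into", "to", "with", "for"}


def bag_of_words(text):
    """Count the non-stop-word tokens of a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def novelty_report(candidate, corpus):
    """Compare a candidate text with every document of the reference corpus."""
    cand = bag_of_words(candidate)
    corpus_bags = [bag_of_words(doc) for doc in corpus]
    known_vocabulary = set().union(*corpus_bags)
    new_terms = sorted(term for term in cand if term not in known_vocabulary)
    total = sum(cand.values())
    closest = max((cosine(cand, bag) for bag in corpus_bags), default=0.0)
    return {
        "closest_similarity": closest,   # 1.0 means an identical word profile
        "novelty": 1.0 - closest,        # distance to the closest known document
        "new_term_share": sum(cand[t] for t in new_terms) / total if total else 0.0,
        "new_terms": new_terms,
    }


# Hypothetical reference corpus and candidate, for illustration only.
corpus = [
    "solar panels convert sunlight into electricity",
    "wind turbines generate electricity",
]
report = novelty_report("solar panels power quantum computers", corpus)
print(report)
```

A real evaluator would work with topics rather than raw words, but the two figures show the kind of signal involved: a low closest similarity together with a high share of new terms suggests content that the reference corpus does not yet cover.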
The characteristics compared can be formal (length of the text, author, date of publication...), or based on its content (writing style, vocabulary, language...). One of the most relevant is the comparison at the thematic level, which tells us whether a text deals with an already known subject or with one that is not yet included in our reference corpus.
These evaluators also allow us to establish metrics of absolute similarity or difference, and to detect local anomalies within a subject (documents that, although they belong to a subject, present differential aspects with respect to the rest), multi-class texts (the subjects are not new, but the combination is), etc.
Measuring the degree of similarity has other applications and can, for example, help us save reading time. Detecting copying or plagiarism, estimating the level of interest or the amount of new information a text provides (is it relevant to read?) and identifying which parts of an article deal with unfamiliar topics are some examples of situations where such tools are of interest.
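The thematic comparison and the multi-class case can be illustrated with a short sketch. It assumes that the classes of the state of the art are already available as lists of texts, represents each class by the sum of its word counts, and labels a document according to how many classes it resembles. The function `thematic_profile`, the class names, the sample texts and the threshold of 0.3 are hypothetical choices, not values taken from a real evaluator.

```python
import math
import re
from collections import Counter


def bag(text):
    """Term counts of a lowercased text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(texts):
    """Represent a class by the summed term counts of its documents."""
    total = Counter()
    for text in texts:
        total.update(bag(text))
    return total


def thematic_profile(document, classes, threshold=0.3):
    """Score a document against each class and label the result."""
    doc = bag(document)
    scores = {name: cosine(doc, centroid(texts)) for name, texts in classes.items()}
    matched = sorted(name for name, score in scores.items() if score >= threshold)
    if not matched:
        label = "new topic"
    elif len(matched) == 1:
        label = "known topic"
    else:
        label = "multi-class"   # known subjects, new combination
    return label, matched, scores


# Hypothetical classes taken from a tiny reference corpus.
classes = {
    "energy": ["solar panels produce electricity", "wind turbines produce electricity"],
    "health": ["vaccines prevent disease", "doctors treat disease in hospitals"],
}
for text in [
    "solar panels produce electricity for hospitals",
    "hospitals treat disease using solar electricity",
    "quantum computers factor large numbers",
]:
    label, matched, _ = thematic_profile(text, classes)
    print(label, matched)
```

The per-class scores returned by the function can also serve as a simple measure of the degree to which a document belongs to each theme.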
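For the copying or plagiarism case, one common and simple technique is to compare overlapping word sequences ("shingles") rather than individual words. The sketch below is an illustration under that assumption: the function names, the shingle size of three words and the sample sentences are hypothetical, and a real detector would need to handle paraphrasing, which this one does not.

```python
import re


def shingles(text, size=3):
    """Set of all runs of `size` consecutive words in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_share(candidate, source, size=3):
    """Share of the candidate's shingles that also occur in the source."""
    candidate_shingles = shingles(candidate, size)
    if not candidate_shingles:
        return 0.0
    shared = candidate_shingles & shingles(source, size)
    return len(shared) / len(candidate_shingles)


# Hypothetical texts, for illustration only.
source = "the quick brown fox jumps over the lazy dog"
candidate = "a quick brown fox jumps over a sleeping cat"
print(round(copied_share(candidate, source), 2))
```

A value close to 1 means that most of the candidate repeats word sequences from the source, whereas a value close to 0 means that little or nothing was reused verbatim.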
All of the above, together with other aspects of text analysis such as automatic summaries, keyword and frequent word detection, etc., allow us to evaluate a document without even having to read it.