After the previous chapters of A Coffee with IoT, Chapter 0: A recipe for internal hacktivism and Chapter 1: NiFi flows, today we move on to Chapter 2: Raw data.
The coffee machine's statement
Announcer: And that's how a sewer crocodile once again... Excuse me, I'm being told that we're going live to our reporter Cafeole, who is at the scene.
Cafeole: Thank you very much, Hillary. I am standing next to the FutureSpace coffee machine, which is willing to make a highly secret confession about her IoT contamination.
Coffee machine: Souvenirs, novelties, joke articles!
```r
# Read file ---------------------------------------------------------------
# File path
fichero.ruta <- "./data/coffee.csv"
# Read the lines
conexion <- file(fichero.ruta, open = "r")
lineas <- readLines(con = conexion)
close(conexion)
```
An outside view
As mentioned in previous articles, the data resulting from monitoring the electricity consumption of the FutureSpace coffee machine is dumped into a CSV file. Although the data has already been handled in earlier phases (in the smart plug and in the NiFi flow), this file is our primary data source: the first place where we can consider the data to be stored and available, accessible to both humans and machines.
As far as this article is concerned, we will leave aside what happens in a data analysis project before the data exists, and consider that our analysis begins at this point. So let's start by finding out a few things about the file we have in hand.
```r
# File facts --------------------------------------------------------------
# File size in MiB
fichero.tamanio <- round(x = file.size(fichero.ruta) / 2^20, digits = 2)
# Number of lines
fichero.numLineas <- length(lineas)
```
Seen from the outside, the file already gives us some information, such as the format of the data it contains (CSV, comma-separated values) or its size on disk (204.97 MB). Although this information will not take us very far in the analysis itself, it is very useful for loading the data into our analysis tool.
An inside view
Next, it is worth looking at the contents of the file to see what the raw data looks like. In this case, since it is a CSV, the format is oriented to tabular data, so it is important to find out things like the number of records it holds (5,358,606 lines) or whether it has a header, that is, whether the file begins with a line containing the column names. A frequent operation at this point is reviewing the first and last lines of the file. This gives us information about the records, but it also lets us detect content that is not data, such as annotations or titles, since these are the areas where they are most often inserted.
```r
# Sample lines ------------------------------------------------------------
# First lines
lineas.inicio <- lineas[1:10]
lineas.inicio <- paste("First lines: ",
                       paste(lineas.inicio, collapse = "\n"),
                       sep = "\n")
cat(lineas.inicio)
# Last lines
lineas.final <- lineas[(fichero.numLineas - 10):fichero.numLineas]
lineas.final <- paste("Last lines: ",
                      paste(lineas.final, collapse = "\n"),
                      sep = "\n")
cat(lineas.final)
```
Thu Jun 14 13:03:01 CEST 2018,32.274447
Thu Jun 14 13:03:03 CEST 2018,32.299572
Thu Jun 14 13:03:05 CEST 2018,32.321334
Thu Jun 14 13:03:07 CEST 2018,32.377058
Thu Jun 14 13:03:09 CEST 2018,32.279788
Thu Jun 14 13:03:11 CEST 2018,32.304122
Thu Jun 14 13:03:13 CEST 2018,32.202237
Thu Jun 14 13:03:15 CEST 2018,32.254729
Thu Jun 14 13:03:17 CEST 2018,32.300891
Thu Jun 14 13:03:19 CEST 2018,32.360703
Mon Oct 22 11:54:49 CEST 2018,32.333468
Mon Oct 22 11:54:51 CEST 2018,32.277744
Mon Oct 22 11:54:53 CEST 2018,32.158976
Mon Oct 22 11:54:55 CEST 2018,32.255982
Mon Oct 22 11:54:57 CEST 2018,32.198016
Mon Oct 22 11:54:59 CEST 2018,32.191817
Mon Oct 22 11:55:01 CEST 2018,32.277678
Mon Oct 22 11:55:03 CEST 2018,32.216085
Mon Oct 22 11:55:06 CEST 2018,32.199533
Mon Oct 22 11:55:08 CEST 2018,32.206721
Mon Oct 22 11:55:10 CEST 2018,32.213579
Just by looking at these records, there are already some things we can note down for our analysis, for example:
- The file has no header.
- The information is presented in 2 columns.
- The first column records the moment the smart plug was queried for the power consumption it was measuring. It shows a date and time with one-second precision, in English format, but referring to Central European Summer Time (CEST).
- The second shows the measured consumption in watts, as specified by the manufacturer of the smart plug.
- The recording began on 14 June 2018 at around 1 p.m. and ended on 22 October 2018 at around noon.
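The date format in the first column can be parsed directly from the raw strings. A minimal sketch of one way to do it, assuming the "C" locale (so the English month and weekday abbreviations parse) and treating "CEST" as a literal token in the format string; the variable names here are illustrative:

```r
# Sketch: parsing the raw timestamp format seen in the sample lines.
# Assumes the "C" locale so English abbreviations ("Thu", "Jun") parse,
# and treats "CEST" as literal text rather than a parsed timezone.
Sys.setlocale("LC_TIME", "C")
muestra <- "Thu Jun 14 13:03:01 CEST 2018"
momento <- as.POSIXct(muestra,
                      format = "%a %b %d %H:%M:%S CEST %Y",
                      tz = "Europe/Madrid")
format(momento, "%Y-%m-%d %H:%M:%S")  # "2018-06-14 13:03:01"
```

Hard-coding "CEST" only works because the whole file was recorded during summer time; a longer monitoring period would need real timezone handling.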
```r
# Duration and observations -----------------------------------------------
library(lubridate)  # mdy_hms(), seconds_to_period()
# Start and end timestamps
fechaInicio <- mdy_hms("Jun 14 2018 13:03:01")
fechaFin <- mdy_hms("Oct 22 2018 11:55:10")
# Length of the monitoring period
observacionesTeoricas.segundos <- as.integer(
  difftime(fechaFin, fechaInicio, units = "secs"))
observacionesTeoricas.numero <- observacionesTeoricas.segundos / 2
observacionesTeoricas.duracion <- seconds_to_period(
  observacionesTeoricas.segundos)
# Observations actually present in the file
observacionesReales.numero <- fichero.numLineas
observacionesReales.segundos <- observacionesReales.numero * 2
# Difference between theoretical and actual monitoring
observacionesAusentes.segundos <- observacionesTeoricas.segundos -
  observacionesReales.segundos
observacionesAusentes.duracion <- seconds_to_period(
  observacionesAusentes.segundos)
```
If we add some quick counts to this, we realize that 11,227,929 seconds passed between the first and last recorded moments (equivalent to a period of 129d 22H 52M 9S) and that, given that the data extraction was set up to take a measurement every 2 seconds, we should have 5,613,965 observations, although the file only has 5,358,606. In fact, paying attention to the records, it is possible to notice jumps of more than 2 seconds between one record and the next (possibly due to delays in the smart plug's responses). For example, among the last lines, which correspond to October 22nd, there is a 3-second jump between two records, from 11:55:03 to 11:55:06.
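Those jumps can also be located programmatically. A minimal sketch, using a small made-up sample that reproduces the 3-second jump seen above; in the real analysis the vector would come from parsing the whole first column:

```r
# Sketch: finding gaps longer than the 2-second sampling interval.
# 'crudo' is a small hypothetical sample reproducing the jump observed
# in the last lines of the file.
Sys.setlocale("LC_TIME", "C")
crudo <- c("Mon Oct 22 11:55:01 CEST 2018",
           "Mon Oct 22 11:55:03 CEST 2018",
           "Mon Oct 22 11:55:06 CEST 2018",
           "Mon Oct 22 11:55:08 CEST 2018")
momentos <- as.POSIXct(crudo,
                       format = "%a %b %d %H:%M:%S CEST %Y",
                       tz = "Europe/Madrid")
# Seconds elapsed between consecutive observations
intervalos <- diff(as.numeric(momentos))
# Positions where the gap exceeds the sampling period
which(intervalos > 2)  # 2: the 11:55:03 -> 11:55:06 jump
```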
If we want to study the content of the file a little further, we can simulate loading the data into the analysis tool. This makes it easy to detect some peculiar records, especially those that do not follow the previously observed format. In this way, 31,025 malformed records are detected, whose content is an error code returned by the smart plug's API. Let's see how to extract these lines:
```r
# Initial data load -------------------------------------------------------
# Base table
electricidad <- read.csv(
  file = fichero.ruta,
  header = FALSE,
  col.names = c("momento", "consumo"),
  sep = ",",
  colClasses = rep("character", 2),
  stringsAsFactors = FALSE
)
# Malformed observations (non-numeric consumption values)
observacionesErroneas <- which(is.na(suppressWarnings(
  as.numeric(electricidad$consumo))))
observacionesErroneas.numero <- length(observacionesErroneas)
lineas.erroneas <- lineas[observacionesErroneas]
```
At this point, we already have an idea of the kind of things we can learn about data from the file itself. We have also taken note of the characteristics and particularities of the file and of the data, which tell us how to load it into our analysis tool. But there is little more we can get out of it this way. To understand the data, we need something more than just reading it. And that something is called statistics.
And can't we see all the data somehow? We will find out in the next chapters of A Coffee with IoT.
Side note, for the curious: objective data
Let's take an example (one close to home for me). A client can tell an architect: 'I want a house with two bedrooms, a living room, a kitchen and a bathroom'. And, of course, the architect writes down: 'reticular space with mobility dynamised by flow attractors, in which light coexists with the fragmented volumes and the rhizomatic organicity of colour unifies the vertical walls'. A subjective datum and an objective one.
In medicine, a distinction is often made between objective and subjective data. When a patient says 'I have chest pains' or 'I have chills', that is subjective data. When the doctor writes down 'sweating' or 'tachycardia' in the report, we have objective data. It is rare for this duality to be recognized outside the health sciences, but we could say that whenever there is a human being acting as a subject and another who is a professional in the area of interest, we can have objective and subjective data.
When photography appeared, painting became clearly subjective. A photographic image was considered a faithful representation of reality, so painters had to paint other things. But it did not take long to realize that the photographer has a gaze of his own. Simply choosing what to photograph and what not is already an element of subjectivity. To this we add layer upon layer of technical decisions (focal length, exposure time, film grain...) and narrative ones (the title, caption or tweet that accompanies the image, etc.). Not to mention what happens when you hand a camera to a magician like Méliès.
So it is no accident that opinions are called 'points of view'.
The same thing happens to data as to photography. It seems objective, even somehow certain, but we choose what we measure, with what precision and in what units. We build the story we want to tell from the moment we collect the data, and we create a world where, if you are not measured, you do not exist.