
Coffee with IoT: The raw data (Chapter 2)


After the previous chapters of Coffee with IoT, Chapter 0: A recipe for internal hacktivism and Chapter 1: NiFi flows, today we will talk about Chapter 2: Raw data.

Coffee machine declaration

Announcer: And that's how a sewer crocodile once again... Excuse me, I'm being told that we're going to connect live to our reporter Cafeole, who is in the middle of the story.

Cafeole: Thank you for handing over to me, Hillary. I am standing next to the Future Space coffee machine, which is willing to make a highly secret confession about her IoT contamination.

Coffee machine: Souvenirs, novelties, joke articles!

# Read the file -----------------------------------------------------------------
# Path
fichero.ruta <- "./data/coffee.csv"
# Read the lines
conexion <- file(fichero.ruta, open = "r")
lineas <- readLines(con = conexion)
close(conexion)

An outside view

As mentioned in previous articles, the data resulting from monitoring the electricity consumption of the Future Space coffee machine is dumped into a CSV file. Although the data is already handled in the previous phases (in the smart plug or in the NiFi flow), this file is our primary data source: the first place where we can consider the data stored and available, accessible to both humans and machines.

As far as this article is concerned, we will leave out what happens in a data analysis project before we have the data, and we will consider that our analysis begins at this point. So, let's start by finding out a few things about the file we have in hand.

# File information --------------------------------------------------------------
# File size (in MB)
fichero.tamanio <- round(
  x = file.size(fichero.ruta) / 2^20,
  digits = 2)
# Number of lines
fichero.numLineas <- length(lineas)

Seen from the outside, the file gives us some information, such as the format of the data it contains (CSV, or comma-separated values) or its size on disk (204.97 MB). This is information that, although it will not take us very far in the analysis itself, is very useful for loading the data into our analysis tool.

An inside view

Next, it is interesting to look at the contents of the file to see what the raw data looks like. In this case, being a CSV, the format is oriented to tabular data, so it is important to find out things like the number of records it has (5,358,606 lines) or whether it has a header, that is, whether the file begins with a line holding the names of the columns in the table. A frequent operation at this point is the review of the first and last lines of the file. This helps us obtain information about the records, but it also allows us to detect content that is not data, such as annotations or titles, since these are the areas where they are most frequently inserted.

# Sample lines ------------------------------------------------------------------
# First lines
lineas.inicio <- lineas[1:10]
lineas.inicio <- paste(
  "First lines: ",
  paste(lineas.inicio, collapse = "\n"),
  sep = "\n")
# Last lines
lineas.fin <- lineas[(fichero.numLineas - 10):fichero.numLineas]
lineas.fin <- paste(
  "Last lines: ",
  paste(lineas.fin, collapse = "\n"),
  sep = "\n")

First lines: 

Thu Jun 14 13:03:01 CEST 2018,32.274447
Thu Jun 14 13:03:03 CEST 2018,32.299572
Thu Jun 14 13:03:05 CEST 2018,32.321334
Thu Jun 14 13:03:07 CEST 2018,32.377058
Thu Jun 14 13:03:09 CEST 2018,32.279788
Thu Jun 14 13:03:11 CEST 2018,32.304122
Thu Jun 14 13:03:13 CEST 2018,32.202237
Thu Jun 14 13:03:15 CEST 2018,32.254729
Thu Jun 14 13:03:17 CEST 2018,32.300891
Thu Jun 14 13:03:19 CEST 2018,32.360703

Last lines: 

Mon Oct 22 11:54:49 CEST 2018,32.333468
Mon Oct 22 11:54:51 CEST 2018,32.277744
Mon Oct 22 11:54:53 CEST 2018,32.158976
Mon Oct 22 11:54:55 CEST 2018,32.255982
Mon Oct 22 11:54:57 CEST 2018,32.198016
Mon Oct 22 11:54:59 CEST 2018,32.191817
Mon Oct 22 11:55:01 CEST 2018,32.277678
Mon Oct 22 11:55:03 CEST 2018,32.216085
Mon Oct 22 11:55:06 CEST 2018,32.199533
Mon Oct 22 11:55:08 CEST 2018,32.206721
Mon Oct 22 11:55:10 CEST 2018,32.213579

Just by looking at these records, there are already some things we can take note of for our analytical process, for example:

  • The file has no header.
  • The information is presented in two columns.
  • The first column records the moment at which the smart plug is queried about the power consumption it is registering. It shows a date and time with an accuracy of seconds, in English format, referring to Central European Summer Time.
  • The second shows the measured consumption in watts, as specified by the manufacturer of the smart plug.
  • The information recording began on 14 June 2018 at around 1 p.m. and ended on 22 October 2018 at around 12 noon.
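As a side note, these raw timestamps can be parsed directly in R. A minimal sketch using base R's strptime, where the locale call ensures English day and month abbreviations, "CEST" is matched as literal text, and the Europe/Madrid timezone is an assumption on our part:

```r
# Ensure English day/month abbreviations regardless of the system locale
Sys.setlocale("LC_TIME", "C")
# Parse a raw timestamp such as the ones in the file; strptime cannot read
# the timezone abbreviation, so "CEST" is matched literally and the
# timezone is set explicitly instead
momento <- strptime(
  "Thu Jun 14 13:03:01 CEST 2018",
  format = "%a %b %d %H:%M:%S CEST %Y",
  tz = "Europe/Madrid")
```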

# Duration and observations ------------------------------------------------------
library(lubridate)
# Start and end dates
fechaInicio <- mdy_hms("Jun 14 2018 13:03:01")
fechaFin <- mdy_hms("Oct 22 2018 11:55:10")
# Duration of the monitoring
observacionesTeoricas.segundos <- as.integer(
  difftime(fechaFin, fechaInicio, units = "secs"))
observacionesTeoricas.numero <- observacionesTeoricas.segundos / 2
observacionesTeoricas.duracion <- seconds_to_period(observacionesTeoricas.segundos)
# Actual observations
observacionesReales.numero <- fichero.numLineas
observacionesReales.segundos <- observacionesReales.numero * 2
# Difference between theoretical and actual monitoring
observacionesAusentes.segundos <-
  observacionesTeoricas.segundos - observacionesReales.segundos
observacionesAusentes.duracion <- seconds_to_period(observacionesAusentes.segundos)

If we add some quick counts to this, we realize that 11,227,929 seconds passed between the first and last record (equivalent to a period of 129d 22H 52M 9S) and that, given that the extraction was set up to take a measurement every 2 seconds, we should have 5,613,965 observations, although the file only has 5,358,606. In fact, if you pay attention to the records, you can spot jumps of more than 2 seconds between one record and the next (possibly due to delays in the responses from the smart plug). For example, among the last lines, which correspond to October 22nd, there is a jump of 3 seconds between two records, from 11:55:03 to 11:55:06.
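Those jumps can also be located programmatically. A minimal sketch, using a small hypothetical vector of already-parsed timestamps (the variable names `momentos` and `saltos` are illustrative, not from the original analysis):

```r
# Hypothetical vector of parsed timestamps around the 3-second jump
momentos <- as.POSIXct(
  c("2018-10-22 11:55:01", "2018-10-22 11:55:03", "2018-10-22 11:55:06"),
  tz = "Europe/Madrid")
# Seconds elapsed between each pair of consecutive observations
saltos <- as.numeric(diff(momentos), units = "secs")
# Indices of the records after which a gap longer than 2 seconds occurs
saltos.indices <- which(saltos > 2)
```

Applied to the full monitoring period, this would point at every delay in the smart plug's responses.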

If we wanted to study the content of the file a little more, we could simulate a load of the data into the analysis tool. This allows us to easily detect some peculiar records, especially when they do not conform to the previously observed format. In this way, 31,025 records with an incorrect format are detected, whose content is an error code returned by the API of the smart plug. Let's see an example line:

# Initial data load ---------------------------------------------------------------
# Base table
electricidad <- read.csv(
  file = fichero.ruta,
  header = FALSE,
  col.names = c("momento", "consumo"),
  sep = ",",
  colClasses = rep("character", 2),
  stringsAsFactors = FALSE)
# Erroneous observations: rows whose consumption does not parse as a number
observacionesErroneas <- which($consumo)))
observacionesErroneas.numero <- length(observacionesErroneas)
lineas.erroneas <- lineas[observacionesErroneas[1]]

{"error_code":-20002,"msg":"Request timeout"}
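Once located, these malformed records can simply be dropped before the analysis. A minimal, self-contained sketch on a toy two-row table with the same structure as the coffee data (the names `electricidad.ejemplo` and `electricidad.limpia` are illustrative):

```r
# Toy table mimicking the two-column structure of the coffee data
electricidad.ejemplo <- data.frame(
  momento = c("Thu Jun 14 13:03:01 CEST 2018",
              "Thu Jun 14 13:03:03 CEST 2018"),
  consumo = c("32.274447",
              '{"error_code":-20002'),
  stringsAsFactors = FALSE)
# Keep only the rows whose consumption parses as a number
esValida <- !$consumo)))
electricidad.limpia <- electricidad.ejemplo[esValida, ]
```

Whether to discard these rows or to impute them from neighbouring measurements is an analysis decision that we leave for later chapters.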


At this point, we already have an idea of the kind of things we can learn about data from the file itself. In addition, we have taken note of the characteristics and particularities of the file and of the data, which tell us how to load it into our analytical tool. But there is little more we can get out of it this way. To understand the data, we need to do something more than just read it. And that something is called statistics.

But can't we see all the data somehow? We'll see that in the next chapters of Coffee with IoT.

Side note for the curious: Objective data

Let's take an example (one close to home for me). A client can tell an architect 'I want a house with two bedrooms, a living room, a kitchen and a bathroom'. And, of course, the architect replies 'a reticular space with mobility dynamised by flow attractors, in which light coexists with the fragmented volumes and the rhizomatic organicity of colour unifies the vertical walls'. A subjective datum and an objective one.

In medicine, a distinction is often made between objective and subjective data. For example, when a patient says 'I have chest pains' or 'I have chills', this is subjective data. When the doctor writes down 'sweating' or 'tachycardia' in the report, we have objective data. It is rare that this duality of data is recognized outside the health sciences, but we could say that whenever there is a human being who acts as a subject and another who is a professional in the area of interest being treated, we can have objective and subjective data.

When the photograph appeared, the painting became clearly subjective. A photographic image was considered a faithful representation of reality, so painters had to paint other things. But it didn't take long to realize that the photographer has his own look. Simply choosing what to photograph and what not to photograph is already an element of subjectivity. To this, we add layers and layers of technical decisions (focal length, exposure time, film grain...), narratives (title, caption or tweet that accompanies the image, etc.). And let's not say anything about what happens when you leave a camera to a magician like Méliès.

So it is no accident that one's opinions are called 'point of view'.

The same thing happens to data as to photography. They look objective, even somehow certain, but we choose what to measure, with what precision and in what units. We build the story we want to tell from the moment we collect the data, and we create a world where, if you are not measured, you do not exist.
