Analysis of pollution levels in Barcelona

Introduction

This analysis was made in order to participate in the World Dataviz Challenge 2019. Given the growing environmental concern nowadays, and the challenge's explicit recommendation to address pollution as a topic, the choice was pretty clear. R, one of the main programming languages used in data science (and my personal favourite), was the language chosen for this development.

The analysis consists of the following steps:

  1. Loading: Load pollution data.
  2. Preprocessing: Preprocess the data until it is structured enough to analyze.
  3. Exploration: Get insights from pollution trends and characteristics using plots.
  4. Modelling: Train and apply a model to predict future pollution levels.
  5. Application: Develop an application that shows the evolution of pollution levels by station.

The source code for the analysis, documents and application can be found at sourcehut. The application can be seen at shiny portal.

Loading

First, the data need to be loaded. They can be retrieved from the following links:

Although there is the possibility to use an API to retrieve the data directly, in this case I prefer to just download the .csv files.
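
For reproducibility, downloading the monthly .csv files can also be scripted with base R. Below is a minimal sketch; the base URL and the file name prefix are placeholders for illustration, not the actual Open Data BCN endpoints:

# Illustrative sketch only: base_url and the YYYY_MM prefix are assumptions;
# the real download links are the ones listed above.
base_url <- 'https://example.org/air-quality'
months <- format(seq(as.Date('2018-06-01'), as.Date('2019-09-01'), by='month'), '%Y_%m')
for (m in months) {
    f <- paste0(m, '_qualitat_aire_bcn.csv')
    download.file(url=paste(base_url, f, sep='/'), destfile=file.path('../../data/input', f))
}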

The required packages for development are loaded.

library(data.table)
library(dplyr)
library(forecast)
library(ggplot2)
library(leaflet)
library(Metrics)
library(scales)
library(shiny)
library(shinydashboard)
library(shinyWidgets)

As specified in the Air Data Quality URL, the dataset structure changed completely in April 2019. I therefore decided to define two different loading parts:

  1. The old one, which gets data from June 2018 (the first file available) until March 2019 (the last file with the same structure).
  2. The new one, which gets data from April 2019 to September 2019 (the latest file available at that moment).

Auxiliary variables will also be defined.

# Define variables
input_path <- '../../data/input'
output_path <- '../../data/output'
filename <- '*_qualitat_aire_bcn.csv'
limit_dates <- list(min=as.Date('2018-06-01'), max=as.Date('2019-09-30'))

# Get files, but take into account that from April '19 the structure is different
files_to_read <- list.files(path=input_path, pattern=filename, ignore.case=TRUE)
filter <- as.numeric(gsub(pattern='[^0-9]', replacement='', x=files_to_read))
old_files <- files_to_read[!(filter >= 201904)]
new_files <- files_to_read[(filter >= 201904)]

Then it is possible to load the data and inspect both structures. For most of the data processing I tend to use the data.table package, as will be seen from now on.

# Read files
old_quality_dt <- data.table()
for (f in old_files) {
    file_dt <- file.path(input_path, f) %>% fread()
    old_quality_dt <- rbind(old_quality_dt, file_dt)
}

new_quality_dt <- data.table()
for (f in new_files) {
    file_dt <- file.path(input_path, f) %>% fread()
    new_quality_dt <- rbind(new_quality_dt, file_dt)
}

str(old_quality_dt)
## Classes 'data.table' and 'data.frame':   53191 obs. of  18 variables:
##  $ nom_cabina   : chr  "Barcelona - Sants" "Barcelona - Eixample" "Barcelona - Gràcia" "Barcelona - Ciutadella" ...
##  $ qualitat_aire: chr  "--" "Bona" "Bona" "Bona" ...
##  $ codi_dtes    : chr  "ID" "IH" "IJ" "IL" ...
##  $ zqa          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ codi_eoi     : int  8019042 8019043 8019044 8019050 8019054 8019057 8019004 8019042 8019043 8019044 ...
##  $ longitud     : num  2.13 2.15 2.15 2.19 2.15 ...
##  $ latitud      : num  41.4 41.4 41.4 41.4 41.4 ...
##  $ hora_o3      : chr  NA "8h" "8h" "8h" ...
##  $ qualitat_o3  : chr  NA "Bona" "Bona" "Bona" ...
##  $ valor_o3     : chr  NA "37 µg/m³" "51 µg/m³" "27 µg/m³" ...
##  $ hora_no2     : chr  "7h" "8h" "8h" "8h" ...
##  $ qualitat_no2 : chr  "--" "Bona" "Bona" "Bona" ...
##  $ valor_no2    : chr  "--" "62 µg/m³" "44 µg/m³" "40 µg/m³" ...
##  $ hora_pm10    : chr  NA "9h" "9h" NA ...
##  $ qualitat_pm10: chr  NA "Bona" "Bona" NA ...
##  $ valor_pm10   : chr  NA "26 µg/m³" "23 µg/m³" NA ...
##  $ generat      : chr  "11/06/2018 9:00" "11/06/2018 9:00" "11/06/2018 9:00" "11/06/2018 9:00" ...
##  $ dateTime     : int  1528700703 1528700703 1528700703 1528700703 1528700703 1528700703 1528700703 1528704305 1528704305 1528704305 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(new_quality_dt)
## Classes 'data.table' and 'data.frame':   6780 obs. of  57 variables:
##  $ CODI_PROVINCIA  : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ PROVINCIA       : chr  "Barcelona" "Barcelona" "Barcelona" "Barcelona" ...
##  $ CODI_MUNICIPI   : int  19 19 19 19 19 19 19 19 19 19 ...
##  $ MUNICIPI        : chr  "Barcelona" "Barcelona" "Barcelona" "Barcelona" ...
##  $ ESTACIO         : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ CODI_CONTAMINANT: int  7 7 7 7 7 7 7 7 7 7 ...
##  $ ANY             : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
##  $ MES             : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ DIA             : int  2 3 4 5 6 7 8 9 10 11 ...
##  $ H01             : num  1 1 1 2 1 22 3 2 2 10 ...
##  $ V01             : chr  "V" "V" "V" "V" ...
##  $ H02             : num  1 1 1 1 1 9 2 6 4 6 ...
##  $ V02             : chr  "V" "V" "V" "V" ...
##  $ H03             : num  3 1 2 8 1 1 2 3 1 4 ...
##  $ V03             : chr  "V" "V" "V" "V" ...
##  $ H04             : num  1 1 4 2 1 1 2 2 2 2 ...
##  $ V04             : chr  "V" "V" "V" "V" ...
##  $ H05             : num  1 1 1 2 1 1 1 1 1 1 ...
##  $ V05             : chr  "V" "V" "V" "V" ...
##  $ H06             : num  2 1 1 8 1 1 4 2 1 1 ...
##  $ V06             : chr  "V" "V" "V" "V" ...
##  $ H07             : num  13 2 1 21 1 1 19 9 2 3 ...
##  $ V07             : chr  "V" "V" "V" "V" ...
##  $ H08             : num  58 4 3 53 5 4 57 32 13 11 ...
##  $ V08             : chr  "V" "V" "V" "V" ...
##  $ H09             : num  131 6 9 83 3 1 77 58 19 12 ...
##  $ V09             : chr  "V" "V" "V" "V" ...
##  $ H10             : num  148 6 16 81 6 2 87 57 17 10 ...
##  $ V10             : chr  "V" "V" "V" "V" ...
##  $ H11             : num  31 5 13 72 7 3 56 6 12 11 ...
##  $ V11             : chr  "V" "V" "V" "V" ...
##  $ H12             : num  18 7 10 31 5 2 13 6 13 8 ...
##  $ V12             : chr  "V" "V" "V" "V" ...
##  $ H13             : num  15 7 6 17 4 3 20 4 17 6 ...
##  $ V13             : chr  "V" "V" "V" "V" ...
##  $ H14             : num  9 8 10 6 5 2 9 4 22 5 ...
##  $ V14             : chr  "V" "V" "V" "V" ...
##  $ H15             : num  NA 6 8 4 3 3 5 4 20 4 ...
##  $ V15             : chr  "N" "V" "V" "V" ...
##  $ H16             : num  NA 3 5 5 3 2 5 3 7 7 ...
##  $ V16             : chr  "N" "V" "V" "V" ...
##  $ H17             : num  8 6 4 5 5 1 5 3 5 5 ...
##  $ V17             : chr  "V" "V" "V" "V" ...
##  $ H18             : num  7 3 4 5 4 2 6 2 7 3 ...
##  $ V18             : chr  "V" "V" "V" "V" ...
##  $ H19             : num  7 3 3 3 4 2 12 3 5 2 ...
##  $ V19             : chr  "V" "V" "V" "V" ...
##  $ H20             : num  8 3 2 3 6 2 7 6 13 4 ...
##  $ V20             : chr  "V" "V" "V" "V" ...
##  $ H21             : num  3 2 2 2 2 2 5 8 2 1 ...
##  $ V21             : chr  "V" "V" "V" "V" ...
##  $ H22             : num  1 2 1 2 2 2 5 3 3 1 ...
##  $ V22             : chr  "V" "V" "V" "V" ...
##  $ H23             : num  1 2 2 1 1 2 2 3 7 2 ...
##  $ V23             : chr  "V" "V" "V" "V" ...
##  $ H24             : num  1 2 2 2 3 7 2 3 20 1 ...
##  $ V24             : chr  "V" "V" "V" "V" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Preprocessing

Since the formats are completely different, and given that I want to have all combinations of station, pollutant and time, the approach followed will be:

  1. Create a dataset with all the desired combinations, and separate it into two parts (old and new).
  2. Process both datasets separately so that they can be considered tidy, as explained by a great R guru.
  3. Concatenate both datasets to create the definitive one.

Then, the first step is taken.

stations_dt <- file.path(input_path, 'Qualitat_Aire_Estacions.csv') %>% fread()
stations_dt <- unique(stations_dt[, Codi_Contaminant:=NULL]) 
pollutants_dt <- file.path(input_path, 'Qualitat_Aire_Contaminants.csv') %>% fread()
pollutants_dt <- pollutants_dt[Desc_Contaminant %in% c('O3', 'NO2', 'PM10')]
units <- unique(pollutants_dt[, .(units=Unitats)])

date <- seq(limit_dates$min, limit_dates$max, 'days')
time <- 0:23
datetimes_dt <- CJ(date, time)

quality_dt <- merge(stations_dt[, key:=1], datetimes_dt[, key:=1], by='key', allow.cartesian=TRUE)
quality_dt <- merge(quality_dt[, key:=1], pollutants_dt[, key:=1], by='key', allow.cartesian=TRUE)
quality_dt[, key:=NULL]
names(quality_dt) <- names(quality_dt) %>% tolower()
quality_dt_old <- quality_dt[date < '2019-04-01']
quality_dt_new <- quality_dt[date >= '2019-04-01']

Afterwards, the old-structure data are processed.

# In order, get day, convert hours to integer, and filter by pollutant
old_quality_dt[, date:=as.Date(gsub(pattern=' .*', replacement='', generat), format='%d/%m/%Y')]
old_quality_dt[, hora_o3:=as.integer(gsub(pattern='h', replacement='', hora_o3))]
old_quality_dt[, hora_no2:=as.integer(gsub(pattern='h', replacement='', hora_no2))]
old_quality_dt[, hora_pm10:=as.integer(gsub(pattern='h', replacement='', hora_pm10))]
old_quality_dt[, valor_o3:=as.numeric(gsub(pattern=' .*', replacement='', valor_o3))]
old_quality_dt[, valor_no2:=as.numeric(gsub(pattern=' .*', replacement='', valor_no2))]
old_quality_dt[, valor_pm10:=as.numeric(gsub(pattern=' .*', replacement='', valor_pm10))]

old_qual_o3_dt <- copy(old_quality_dt)
o3_code <- as.integer(pollutants_dt[Desc_Contaminant=='O3', .(Codi_Contaminant)])
old_qual_o3_dt <- old_qual_o3_dt[, .(codi_dtes, date, time=hora_o3, 
                        value=valor_o3, pollutant=o3_code)]
old_qual_o3_dt <- na.omit(old_qual_o3_dt)

old_qual_no2_dt <- copy(old_quality_dt)
no2_code <- as.integer(pollutants_dt[Desc_Contaminant=='NO2', .(Codi_Contaminant)])
old_qual_no2_dt <- old_qual_no2_dt[, .(codi_dtes, date, time=hora_no2, 
                        value=valor_no2, pollutant=no2_code)]
old_qual_no2_dt <- na.omit(old_qual_no2_dt)

old_qual_pm10_dt <- copy(old_quality_dt)
pm10_code <- as.integer(pollutants_dt[Desc_Contaminant=='PM10', .(Codi_Contaminant)])
old_qual_pm10_dt <- old_qual_pm10_dt[, .(codi_dtes, date, time=hora_pm10,
                        value=valor_pm10, pollutant=pm10_code)]
old_qual_pm10_dt <- na.omit(old_qual_pm10_dt)

old_quality_dt <- rbind(old_qual_o3_dt, old_qual_no2_dt, old_qual_pm10_dt)

The new-structure data are processed as well.

# In order: get the date, then reshape from wide (one column per hour) to long (one value per row)
names(new_quality_dt) <- names(new_quality_dt) %>% tolower()
new_quality_dt[, date:=as.Date(paste(as.character(any), as.character(mes), as.character(dia), sep='/'))]
value_cols <- names(new_quality_dt) %>% grep(pattern='^h[0-9][0-9]', value=TRUE)
new_quality_dt <- melt(new_quality_dt, id.vars=c('estacio', 'codi_contaminant', 'date'), 
            measure.vars=value_cols, variable.name='time')
new_quality_dt[, time:=as.integer(gsub(pattern='h', replacement='', x=time)) -1]

Finally, data can be appended.

# Join datasets
quality_dt_old <- merge(x=quality_dt_old, y=old_quality_dt, 
            by.x=c('codi_dtes', 'date', 'time', 'codi_contaminant'),
                by.y=c('codi_dtes', 'date', 'time', 'pollutant'), all.x=TRUE) 
quality_dt_new <- merge(x=quality_dt_new, y=new_quality_dt, 
            by.x=c('estacio', 'date', 'time', 'codi_contaminant'),
                by.y=c('estacio', 'date', 'time', 'codi_contaminant'), all.x=TRUE) 
quality_dt <- rbind(quality_dt_old, quality_dt_new)
setorder(quality_dt, estacio, codi_contaminant, date, time)
quality_dt[, datetime:=as.POSIXct(paste(date, time), format='%Y-%m-%d %H')]
quality_dt[, yearmonth:=format(date, '%Y%m')]

Exploration

It is easier to understand the data if they are aggregated. Mean values by month and pollutant are plotted, regardless of the station in which they were measured.

monthly_dt <- quality_dt[, .(value=mean(value, na.rm=TRUE)), 
                by=.(desc_contaminant, yearmonth)]
m <- ggplot(data=monthly_dt) + 
    aes(x=yearmonth, y=value, color=desc_contaminant, group=desc_contaminant) +
    geom_line() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(x='Year-Month', y=paste0('Value (', units, ')'), colour='Pollutant')
plot(m)

It can be seen that there are no data for February 2019 or March 2019, and that O3 levels are higher than the rest, with higher variability as well. I am also interested in how pollution levels evolve during the day, so mean values by hour and pollutant are also plotted.

hourly_dt <- quality_dt[, .(value=mean(value, na.rm=TRUE)), by=.(desc_contaminant, time)]
h <- ggplot(data=hourly_dt) + 
    aes(x=time, y=value, color=desc_contaminant, group=desc_contaminant) + geom_line() + 
    labs(x='Hour', y=paste0('Value (', units, ')'), colour='Pollutant')
plot(h)

It can be seen that there is a clear difference between pollutants: whereas PM10 and NO2 levels are relatively low after noon, O3 levels are really high in comparison. The question of differences between stations, regardless of time, arises as well.

s <- ggplot(data=quality_dt) + 
    aes(x=as.factor(estacio), y=value) + geom_boxplot() + facet_wrap(desc_contaminant~.) + 
    labs(x='Station', y=paste0('Value (', units, ')'))
plot(s)

In this case there are outliers that make it impossible to see the differences. They are first checked, to ensure that there is no strange pattern, and then removed for exploratory purposes.

s <- ggplot(data=quality_dt[value < 1000]) + 
    aes(x=as.factor(estacio), y=value) + geom_boxplot() + facet_wrap(desc_contaminant~.) + 
    labs(x='Station', y=paste0('Value (', units, ')'))
plot(s)

There seem to be no clear differences between stations, and therefore between Barcelona districts. What is curious is that PM10 values are really dispersed above the 75th percentile. Having seen some stations without data, I will keep only the combinations of station and pollutant that contain values.

perc_na_dt <- quality_dt[, .(perc_nas=sum(is.na(value))/.N), by=.(desc_contaminant, estacio)]
include_comb_dt <- perc_na_dt[perc_nas < 0.8, .(desc_contaminant, estacio)]  
quality_dt <- merge(quality_dt, include_comb_dt, by=c('desc_contaminant', 'estacio'))

Considering the outliers, and taking into account that time series forecasting methods in general do not handle NA values, I will try to tackle both with the tsclean function.

# Replace NA values and outliers - modelling purposes
quality_dt[, original_value:=value]
quality_dt[, value:=as.numeric(tsclean(as.ts(original_value))), by=.(desc_contaminant, estacio)]
quality_dt[, imputed_values:=0]
quality_dt[is.na(original_value) | original_value != value, imputed_values:=1]

After a quick search, I found standards defined by the World Health Organization for all three pollutants considered: O3, NO2 and PM10. It could therefore be very useful to visualize their current levels compared to these standards.

# Compare to standards
# PM10 standard: 50 µg/m³ (24-hour mean)
# O3 standard: 100 µg/m³ (8-hour mean)
# NO2 standard: 200 µg/m³ (1-hour mean)

standards_pm10 <- 50
standards_o3 <- 100
standards_no2 <- 200

quality_dt[, hour_range:=NaN]
quality_dt[between(time, 0, 7), hour_range:=1]
quality_dt[between(time, 8, 15), hour_range:=2]
quality_dt[between(time, 16, 23), hour_range:=3]
daily_dt <- quality_dt[, .(value=mean(value, na.rm=TRUE)), by=.(desc_contaminant, date)]
hrange_dt <- quality_dt[, .(value=mean(value, na.rm=TRUE)), by=.(desc_contaminant, date, hour_range)]

pm10_p <- ggplot(data=daily_dt[desc_contaminant=='PM10']) + aes(x=date, y=value) + geom_line() +
    geom_hline(yintercept=standards_pm10, color='red') + labs(x='Date', y=paste0('Value (', units, ')'))
o3_p <- ggplot(data=hrange_dt[desc_contaminant=='O3']) + aes(x=date, y=value) + geom_line() +
    geom_hline(yintercept=standards_o3, color='red') + labs(x='Date', y=paste0('Value (', units, ')'))
no2_p <- ggplot(data=quality_dt[desc_contaminant=='NO2']) + aes(x=date, y=value) + geom_line() +
    geom_hline(yintercept=standards_no2, color='red') + labs(x='Date', y=paste0('Value (', units, ')'))
plot(pm10_p)

plot(o3_p)

plot(no2_p)

It can be seen that, considering the corresponding time ranges for the different pollutants, levels are above the standards in only a few cases.

Modelling

Predicting future pollution levels can provide information to city councils, which can be used to mitigate future high pollution levels, for example.

In general, time series forecasting may not be very accurate for long-term predictions; this is why only October 2019 values will be predicted. Taking into account that there are no data for February and March 2019, the idea is to use data from April 2019 to September 2019 to train the model.

For simplification and computation reasons, daily data (mean values by day) will be used for the analysis, instead of hourly data.

In order to know how well the model behaves on unseen data, it is necessary to backtest it. A model will be trained on data from June 2018 to November 2018; it will be used to predict December 2018 values, and its results will be compared with the actual values. The model that performs best on the past will be the one used for the future.

In this case the two models to compare are ARIMA and Exponential Smoothing. The metric used to compare them is SMAPE, which considers relative differences.
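
For reference, SMAPE can be written in a couple of lines. Below is a minimal sketch that mirrors what the Metrics::smape function computes (possible differences in how zero denominators are handled aside):

# Symmetric Mean Absolute Percentage Error: relative error bounded between 0 and 2,
# so results are comparable across pollutants measured on different scales.
smape_sketch <- function(actual, predicted) {
    mean(2 * abs(actual - predicted) / (abs(actual) + abs(predicted)))
}
smape_sketch(actual=c(100, 50, 20), predicted=c(110, 40, 25))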

gen_model_dates <- list(train_ini=as.Date('2018/06/01'), train_end=as.Date('2018/11/30'), 
            test_end=as.Date('2018/12/31'))
daily_dt <- quality_dt[, .(value=mean(value, na.rm=TRUE)), by=.(desc_contaminant, date, estacio)]

test_models <- function(dt, model_dates) {
    train_ts <- dt[between(date, model_dates$train_ini, model_dates$train_end), value]
    test_ts <- dt[between(date, model_dates$train_end + 1, model_dates$test_end), value]
    fc_periods <- as.numeric((model_dates$test_end) - (model_dates$train_end + 1) + 1)
    pred_autoarima <- auto.arima(train_ts) %>% forecast(fc_periods)
    pred_ets <- ets(train_ts) %>% forecast(fc_periods)
    results_dt <- data.table(actuals=test_ts, pred_autoarima=as.numeric(pred_autoarima$mean), 
                          pred_ets=as.numeric(pred_ets$mean))
    return(results_dt)
}

test_dt <- daily_dt[, test_models(dt=.SD, model_dates=gen_model_dates), by=.(desc_contaminant, estacio)]
smape_autoarima <- smape(test_dt[, actuals], test_dt[, pred_autoarima])
smape_ets <- smape(test_dt[, actuals], test_dt[, pred_ets])
cat('Smape Autoarima', smape_autoarima)
## Smape Autoarima 0.3317006
cat('Smape ETS', smape_ets)
## Smape ETS 0.3999834

As auto.arima has the lower SMAPE of the two options, it will be the model applied to the recent data.

apply_model_dates <- list(train_ini=as.Date('2019/04/01'), train_end=as.Date('2019/09/30'), 
                test_end=as.Date('2019/10/31'))

apply_models <- function(dt, model_dates) {
    train_ts <- dt[between(date, model_dates$train_ini, model_dates$train_end), value]
    fc_periods <- as.numeric((model_dates$test_end) - (model_dates$train_end + 1) + 1)
    pred_autoarima <- auto.arima(train_ts) %>% forecast(fc_periods)
    pred_dates <- seq(model_dates$train_end + 1, model_dates$test_end, 'days')
    pred_dt <- data.table(value=as.numeric(pred_autoarima$mean), date=pred_dates)
    return(pred_dt)
}

predictions_dt <- daily_dt[, apply_models(dt=.SD, model_dates=apply_model_dates), 
                by=.(desc_contaminant, estacio)]
daily_dt <- rbind(daily_dt, predictions_dt)

Application

The last step is to prepare the dataset that will be used for the interactive application. From my point of view, the most important features are the following:

  • Pollutant
  • District
  • Pollution station district and coordinates
  • Date
  • WHO Standards
  • Metrics

For user experience purposes, I think it is better to visualize monthly data. The question, then, is which metrics should be taken into account for this new dataset.

In this case, the metrics are:

  • Mean: Mean value, taking into account mean values per day (mean of means).
  • Max: Maximum value, taking into account mean values per day (max of means).
  • Percentage above standards: Percentage of days whose maximum value is above the defined WHO standards.

It is important to note that comparisons to the WHO standards may not be completely fair. This is because the WHO standards consider different time ranges for mean values, whereas these metrics are computed as monthly means, maximums and percentages over daily means.

# Data for app - select and save dataset
stations_dt <- unique(quality_dt[, .(estacio, nom_districte, latitud, longitud)])
daily_dt <- merge(daily_dt, stations_dt, by=c('estacio'))
daily_dt <- daily_dt[, estacio:=NULL]
old_names <- c('desc_contaminant', 'nom_districte', 'latitud', 'longitud')
new_names <- c('pollutant', 'district', 'latitude', 'longitude')
setnames(daily_dt, old_names, new_names)

units <- unique(pollutants_dt[, .(units=Unitats)])
standards_dt <- data.table(pollutant=c('PM10', 'NO2', 'O3'), 
                standards=c(standards_pm10, standards_no2, standards_o3))
standards_dt[, units:=units]
daily_dt <- merge(daily_dt, standards_dt, by='pollutant')
daily_dt[, yearmonth:=format(date, '%Y%m')]

monthly_dt <- daily_dt[, .(mean_value=round(mean(value), 2), max_value=round(max(value), 2), 
                perc_above_st=sum(value>standards)/.N), 
            by=.(pollutant, yearmonth, district, latitude, longitude, standards, units)] 

fwrite(monthly_dt, file.path(output_path, 'app_data.csv'))

Shiny is an R package for developing interactive dashboards from R code, and it is the package used for the final application. The structure of a generic shiny application is the following (a minimal sketch is shown after the list):

  • global.R: Loads the data and processes them if necessary.
  • server.R: Builds the objects that interact with the data.
  • ui.R: Generates the user interface.
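
As orientation, here is a minimal single-file sketch of how these three pieces fit together. It is purely illustrative: the widget names, the data path and the map content are assumptions, not the actual application code (which lives in the repository).

library(shiny)
library(leaflet)
library(data.table)

# global.R part: load the prepared dataset (path assumed for the sketch)
dt <- fread('../../data/output/app_data.csv')

# ui.R part: a pollutant selector and a map placeholder
ui <- fluidPage(
    selectInput('sel_pollutant', 'Pollutant', choices=unique(dt$pollutant)),
    leafletOutput('map')
)

# server.R part: objects that react to the user inputs
server <- function(input, output, session) {
    output$map <- renderLeaflet({
        leaflet(dt[pollutant == input$sel_pollutant]) %>% addTiles()
    })
}

shinyApp(ui, server)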

The code in global.R is the part worth detailing here.

In the application there are circles representing the stations, and both their radius and colour are customizable; that is why two auxiliary columns will be generated.

The colour logic will be defined as follows (the first matching rule applies):

  • If the Max metric is greater than the standard, the station colour will be red.
  • If the Max metric is greater than or equal to 80% of the standard, the station colour will be yellow.
  • If the Max metric is lower than 80% of the standard, the station colour will be green.

The intuition behind the radius is that a bigger radius means more extreme values (either good or bad). The specific logic will be the following:

  • Green colour:
    • Maximum radius is used when values range from 0 to 50% of the standard.
    • The radius decreases as values approach 80% of the standard, getting closer to the minimum radius.
  • Yellow colour: The radius is constant and set to the minimum.
  • Red colour:
    • The radius increases linearly from the minimum to the maximum as values approach 200% of the standard.
    • Maximum radius is used when values are at 200% of the standard or above.

dt <- file.path(output_path, 'app_data.csv') %>% fread()

# Add features to data for map usage
colour_ratio <- 0.8

dt[, risk_colour:='green']
dt[max_value >= colour_ratio * standards, risk_colour:='yellow']
dt[max_value > standards, risk_colour:='red']

min_radius <- 20
max_radius <- 40
min_ratio <- 0.5
max_ratio <- 2

dt[, risk_radius:=min_radius]
dt[risk_colour=='red', risk_radius:=min_radius * (max_value / standards)]
dt[risk_colour=='red' & max_value > max_ratio * standards, risk_radius:=max_radius]
dt[risk_colour=='green', risk_radius:=min_radius * (standards / max_value)]
dt[risk_colour=='green' & max_value < min_ratio * standards, risk_radius:=max_radius]
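
As a hedged illustration of how these two columns could be consumed on the map (the actual rendering code lives in server.R; the pollutant, month and label format chosen below are assumptions for the example):

# Illustrative sketch: one circle per station for a given pollutant and month,
# coloured and sized by the auxiliary columns computed above.
example_dt <- dt[pollutant == 'NO2' & yearmonth == '201909']
leaflet(example_dt) %>%
    addTiles() %>%
    addCircleMarkers(lng=~longitude, lat=~latitude,
                     color=~risk_colour, radius=~risk_radius,
                     label=~paste0(district, ': ', max_value, ' ', units))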

The rest of the code for the shiny deployment is not detailed in this analysis. As already mentioned, it can be found in the repository specified above.

Wrap up

The main objective of the analysis has been to show the potential of using:

  • Open source technologies
  • Open data
  • Data modelling
  • Visualization tools

Using these tools appropriately, it is possible to provide information that can generate a positive impact on people’s quality of life.
