Obtain & wrangle OpenAQ data with R

ropenaq and beyond

Maëlle Salmon - 2020-03-31

tiny.cc/ropenaq

masalmon.eu

ma_salmon

maelle

Ryan Millier on Pexels

1 / 34

Context (1/2)

😷 ❤️ In December 2015 I was a statistician & data manager with the CHAI project (Cardiovascular Health Effects of Air Pollution in Telangana, India). ➡️ interest in open air quality data

📊 📉 Work with R, an open-source language and environment for statistical computing. ➡️ interest in wrangling such data with R

Hear of OpenAQ... 💥

👨‍🏭 Decision to create an R client for OpenAQ API!

2 / 34

Context (2/2)

That client now exists: ropenaq, docs.ropensci.org/ropenaq

  • Actively maintained

  • Distributed on CRAN: install.packages("ropenaq")

🙋 No longer with the CHAI project but still into R and air quality data. 😉

3 / 34

Why R by the way?

Open-source, free language with many capabilities:

  • Data wrangling, modeling, visualization;

  • Reproducible reports and analyses;

  • Interactive plots and apps.

4 / 34

And what is an R package actually?

The OpenAQ R client is an R "package":

  • An add-in to R;

  • A collection of code and documentation.

5 / 34

What ropenaq does

Gets data into R

  • From OpenAQ's API endpoints (measurements, latest, countries, cities and locations);

  • As tables called "data.frames", a common format in R.

6 / 34

What ropenaq does

library("ropenaq")
aq_latest(country = "IN")
## # A tibble: 5,527 x 15
## location city country distance parameter value lastUpdated unit
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dttm> <chr>
## 1 AAQMS K… Pune IN 1.41e7 co 1220 2018-02-22 03:00:00 µg/m³
## 2 AAQMS K… Pune IN 1.41e7 pm10 165 2018-02-22 03:00:00 µg/m³
## 3 AAQMS K… Pune IN 1.41e7 pm25 66 2018-02-22 03:00:00 µg/m³
## 4 AAQMS K… Pune IN 1.41e7 o3 15.8 2018-02-22 03:00:00 µg/m³
## 5 AAQMS K… Pune IN 1.41e7 so2 22.8 2018-02-22 03:00:00 µg/m³
## 6 AP Tiru… Chit… IN NA no2 49.9 2016-06-30 13:00:00 µg/m³
## 7 AP Tiru… Chit… IN NA o3 32.1 2016-06-30 13:00:00 µg/m³
## 8 AP Tiru… Chit… IN NA so2 5.7 2016-06-30 13:00:00 µg/m³
## 9 AP Tiru… Chit… IN NA so2 6.9 2016-07-04 06:15:00 µg/m³
## 10 AP Tiru… Chit… IN NA pm10 12 2016-07-04 06:15:00 µg/m³
## # … with 5,517 more rows, and 7 more variables: sourceName <chr>,
## # averagingPeriod_value <dbl>, averagingPeriod_unit <chr>, latitude <dbl>,
## # longitude <dbl>, cityURL <chr>, locationURL <chr>
8 / 34
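
Since the result is an ordinary table, the usual R tools apply straight away. A minimal sketch, using a small hand-made data.frame in place of a live API call (the rows are copied from the Pune rows printed above and only three columns are kept), that keeps the particulate matter rows and sorts them by value:

```r
# Hand-made stand-in for part of the result of aq_latest(country = "IN");
# values are copied from the printed output, not fetched from the API.
latest <- data.frame(
  city = c("Pune", "Pune", "Pune", "Pune", "Pune"),
  parameter = c("co", "pm10", "pm25", "o3", "so2"),
  value = c(1220, 165, 66, 15.8, 22.8),
  stringsAsFactors = FALSE
)

# Keep particulate matter only, highest value first
pm <- subset(latest, parameter %in% c("pm10", "pm25"))
pm <- pm[order(pm$value, decreasing = TRUE), ]
pm
```

With real data, the same two lines would work on the tibble returned by aq_latest().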

Select locations

delhi_locations <- aq_locations(
  city = "Delhi",
  country = "IN",
  parameter = "pm25"
)
delhi_locations
## # A tibble: 71 x 25
## id country city cities location locations sourceName sourceNames
## <chr> <chr> <chr> <list> <chr> <list> <chr> <list>
## 1 IN-2… IN Delhi <chr … Alipur,… <chr [1]> caaqm <chr [1]>
## 2 IN-1… IN Delhi <chr … Anand V… <chr [1]> caaqm <chr [1]>
## 3 IN-60 IN Delhi <chr … Anand V… <chr [2]> caaqm <chr [2]>
## 4 IN-6 IN Delhi <chr … Anand V… <chr [2]> CPCB <chr [2]>
## 5 IN-2… IN Delhi <chr … Ashok V… <chr [1]> caaqm <chr [1]>
## 6 IN-1… IN Delhi <chr … Ashok V… <chr [1]> caaqm <chr [1]>
## 7 IN-71 IN Delhi <chr … Aya Nag… <chr [2]> caaqm <chr [2]>
## 8 IN-1… IN Delhi <chr … Bawana,… <chr [1]> caaqm <chr [1]>
## 9 IN-1… IN Delhi <chr … Burari … <chr [2]> caaqm <chr [2]>
## 10 IN-8 IN Delhi <chr … Civil L… <chr [1]> CPCB <chr [1]>
## # … with 61 more rows, and 17 more variables: sourceType <chr>,
## # sourceTypes <list>, firstUpdated <dttm>, lastUpdated <dttm>,
## # countsByMeasurement <list>, count <int>, longitude <dbl>, latitude <dbl>,
## # pm25 <lgl>, pm10 <lgl>, no2 <lgl>, so2 <lgl>, o3 <lgl>, co <lgl>, bc <lgl>,
## # cityURL <chr>, locationURL <chr>
10 / 34

Select locations

For each location, useful metadata to select the optimal location(s) for your needs!

head(delhi_locations$sourceName)
## [1] "caaqm" "caaqm" "caaqm" "CPCB" "caaqm" "caaqm"
head(delhi_locations$pm25)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
min(delhi_locations$firstUpdated)
## [1] "2015-04-09 06:00:00 UTC"
11 / 34
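
One possible way to use that metadata, as a sketch only: keep the locations that report PM2.5 and pick the one with the most measurements. The table below is a hypothetical stand-in for delhi_locations; the column names (locationURL, sourceName, pm25, count) come from the output above, but the pm25 and count values are made up for illustration.

```r
# Hypothetical stand-in for a few rows of delhi_locations;
# the pm25 and count values are invented for this example.
locs <- data.frame(
  locationURL = c("Alipur%2C+Delhi+-+DPCC",
                  "Anand+Vihar%2C+Delhi+-+DPCC",
                  "Ashok+Vihar%2C+Delhi+-+DPCC"),
  sourceName = c("caaqm", "CPCB", "caaqm"),
  pm25 = c(TRUE, TRUE, FALSE),
  count = c(1200L, 45000L, 800L),
  stringsAsFactors = FALSE
)

# Locations that measure PM2.5
candidates <- locs[locs$pm25, ]

# Among those, the location with the largest number of measurements
best <- candidates[which.max(candidates$count), ]
best$locationURL
```

Other criteria (a given sourceName, an early firstUpdated) could be combined in the same way.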

Use locations

head(delhi_locations$location)
## [1] "Alipur, Delhi - DPCC" "Anand Vihar, Delhi - DPCC"
## [3] "Anand Vihar, Delhi - DPCC" "Anand Vihar, Delhi - DPCC"
## [5] "Ashok Vihar, Delhi - DPCC" "Ashok Vihar, Delhi - DPCC"
head(delhi_locations$locationURL)
## [1] "Alipur%2C+Delhi+-+DPCC" "Anand+Vihar%2C+Delhi+-+DPCC"
## [3] "Anand+Vihar%2C+Delhi+-+DPCC" "Anand+Vihar%2C+Delhi+-+DPCC"
## [5] "Ashok+Vihar%2C+Delhi+-+DPCC" "Ashok+Vihar%2C+Delhi+-+DPCC"
12 / 34

Use locations

💡 In aq_latest() and aq_measurements(), use the URL-encoded city / location, i.e. the cityURL / locationURL columns returned by aq_locations().

res <- aq_latest(location = "Alipur, Delhi - DPCC")
## Error: This location/city/country combination is not available within the platform. See ?aq_locations
res <- aq_latest(location = "Alipur%2C+Delhi+-+DPCC")
13 / 34
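
To see what "URL-encoded" means here, the encoded names shown above can be reproduced with base R. This is an illustrative sketch: encode_location() is a hypothetical helper, not part of ropenaq, and in practice it is safer to copy the locationURL column from aq_locations().

```r
# Hypothetical helper: percent-encode reserved characters,
# then write spaces as "+" as in the locationURL column.
encode_location <- function(x) {
  encoded <- utils::URLencode(x, reserved = TRUE)
  gsub("%20", "+", encoded, fixed = TRUE)
}

encode_location("Alipur, Delhi - DPCC")
# returns "Alipur%2C+Delhi+-+DPCC"

encode_location("US Diplomatic Post: New Delhi")
# returns "US+Diplomatic+Post%3A+New+Delhi"
```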

Delhi PM2.5 vs WHO threshold

data <- aq_measurements(location = "US+Diplomatic+Post%3A+New+Delhi",
                        parameter = "pm25")

# dplyr by Hadley Wickham et al
data <- dplyr::filter(data, value > 0)
14 / 34

Delhi PM2.5 vs WHO threshold

data
## # A tibble: 2,158 x 12
## location parameter value unit country city latitude longitude
## <chr> <chr> <int> <chr> <chr> <chr> <dbl> <dbl>
## 1 US Dipl… pm25 12 µg/m³ IN Delhi 28.6 77.2
## 2 US Dipl… pm25 18 µg/m³ IN Delhi 28.6 77.2
## 3 US Dipl… pm25 16 µg/m³ IN Delhi 28.6 77.2
## 4 US Dipl… pm25 19 µg/m³ IN Delhi 28.6 77.2
## 5 US Dipl… pm25 44 µg/m³ IN Delhi 28.6 77.2
## 6 US Dipl… pm25 38 µg/m³ IN Delhi 28.6 77.2
## 7 US Dipl… pm25 17 µg/m³ IN Delhi 28.6 77.2
## 8 US Dipl… pm25 15 µg/m³ IN Delhi 28.6 77.2
## 9 US Dipl… pm25 13 µg/m³ IN Delhi 28.6 77.2
## 10 US Dipl… pm25 15 µg/m³ IN Delhi 28.6 77.2
## # … with 2,148 more rows, and 4 more variables: dateUTC <dttm>,
## # dateLocal <dttm>, cityURL <chr>, locationURL <chr>
15 / 34

Delhi PM2.5 vs WHO threshold

library("ggplot2") # Hadley Wickham et al

ggplot(data) +
  geom_point(aes(x = dateLocal, y = value),
             col = "CornflowerBlue") +
  geom_hline(yintercept = 25,
             size = 1.2,
             col = "darkred") +
  ylab(expression(paste("PM2.5 concentration (", mu, "g/", m^3, ")"))) +
  xlab("Time (days)") +
  ggtitle(data$location[1]) +
  hrbrthemes::theme_ipsum(
    base_size = 16,
    axis_title_size = 16)

16 / 34

Delhi PM2.5 vs WHO threshold

Delhi PM2.5 values over the last 90 days

17 / 34
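
Besides plotting, the comparison with the threshold can be put into numbers. A minimal sketch, using only the ten PM2.5 values printed in the table above rather than the full data, that computes the share of measurements above the 25 µg/m³ line drawn in the plot:

```r
# First ten PM2.5 values from the printed tibble (stand-in for data$value)
value <- c(12, 18, 16, 19, 44, 38, 17, 15, 13, 15)

# Same threshold as the horizontal line in the plot
threshold <- 25

n_above <- sum(value > threshold)   # number of measurements above the line
share <- mean(value > threshold)    # proportion of measurements above the line
n_above
share
```

On the real data, value would simply be replaced by data$value.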

London PM2.5 vs WHO threshold

data <- aq_measurements(location = "London+Westminster",
                        parameter = "pm25")

data <- dplyr::filter(data, value > 0)
18 / 34

London PM2.5 vs WHO threshold

ggplot(data) +
  geom_point(aes(x = dateLocal, y = value),
             col = "CornflowerBlue") +
  geom_hline(yintercept = 25, size = 1.2, col = "darkred") +
  ylab(expression(paste("PM2.5 concentration (", mu, "g/", m^3, ")"))) +
  xlab("Time (days)") +
  ggtitle(data$location[1]) +
  hrbrthemes::theme_ipsum(
    base_size = 16,
    axis_title_size = 16)
19 / 34

London PM2.5 vs WHO threshold

London PM2.5 values over the last 90 days

20 / 34

How much data can you get?

The maximum number of results for any query is 100 (pages) times 10,000 (results per page), i.e. 1,000,000.

You don't need to use the page and limit arguments, ropenaq handles all of this for you. Just wait a bit. ⌚

res <- aq_measurements(country = "IN")
nrow(res)
## [1] 1000000
attr(res, "meta")$found
## [1] 29175743
21 / 34
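
The numbers above can be put in perspective with a little arithmetic. This sketch only reuses the figures from the slide; the lower bound on the number of sub-queries assumes, optimistically, that every sub-query could be filled right up to the cap.

```r
max_results <- 100 * 10000   # pages times results per page
found <- 29175743            # attr(res, "meta")$found from the slide

# Share of the matching results that a single query returned
share_retrieved <- max_results / found

# Lower bound on the number of sub-queries needed to get everything
n_queries <- ceiling(found / max_results)

share_retrieved
n_queries
```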

What if you need more data

  • More data than 1,000,000 results? Split your query, e.g. using date_from and date_to. Or see following item.

  • More data than the last 90 days? OpenAQ data older than 90 days is on Amazon Web Services (AWS) Athena. More setup for you, but worth it, and the skills are transferable 💪

22 / 34
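
Splitting a query by date could look like the sketch below. month_windows() is a hypothetical helper, not part of ropenaq; it only prepares the date_from / date_to pairs. The loop over aq_measurements() is left commented out because it needs network access, and the date format it uses is an assumption to check against ?aq_measurements.

```r
# Hypothetical helper: one row per calendar month between start and end.
# Assumes start is the first day of a month.
month_windows <- function(start, end) {
  starts <- seq(as.Date(start), as.Date(end), by = "month")
  ends <- c(starts[-1] - 1, as.Date(end))
  data.frame(date_from = starts, date_to = ends)
}

windows <- month_windows("2019-01-01", "2019-03-31")
windows

# Not run: one query per window, then bind the pieces together.
# chunks <- lapply(seq_len(nrow(windows)), function(i) {
#   ropenaq::aq_measurements(
#     country = "IN",
#     date_from = format(windows$date_from[i]),
#     date_to = format(windows$date_to[i])
#   )
# })
# res <- do.call(rbind, chunks)
```

If a single month still hits the limit, shorter windows or an additional filter (city, parameter) would be needed.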

OpenAQ on AWS: setup in the browser (1/2)

💰 Will I lose all my money? 💰 For OpenAQ on AWS you can't avoid Athena fees, but they're very small. If you're nervous, do as I did: check the billing interface often until you're confident.

23 / 34

OpenAQ on AWS: setup in the browser (2/2)

Go to the Athena console and follow Joe Flasher's gist.

  • Check your location

  • Create the OpenAQ table

  • Run a query like in the gist to see if it works

  • Get your credentials (go to "My security credentials" in the drop-down menu at the top right of the console, under your username).

24 / 34

OpenAQ on AWS: setup in R

Now, install AWR.Athena. It depends on rJava 🍀. A newer R package that does not need rJava: noctua 👀

Set up credentials. I used AWS CLI and ran aws configure, YMMV. 🤷

In a script, try setting up a connection.

25 / 34

OpenAQ on AWS: test query in R

library("AWR.Athena") # Neal Fultz and Gergely Daróczi
library("DBI") # Kirill Müller et al
con <- dbConnect(AWR.Athena::Athena(),
                 region = 'us-east-1',
                 # S3 bucket where Athena writes query results; use your own
                 S3OutputLocation = 's3://ropenaq',
                 Schema = 'default')
query <- "SELECT *
FROM openaq
WHERE location='US Diplomatic Post: Hanoi'
LIMIT 10"
meas <- dbGetQuery(con, query)

Now on to cooler stuff!

26 / 34

OpenAQ on AWS: use in R

Selecting locations in Wuhan with ropenaq.

locations <- ropenaq::aq_locations(
  country = "CN",
  longitude = 114.3055,
  latitude = 30.5928,
  radius = 5000
)
unique(locations$city)
## [1] "武汉市"
27 / 34

OpenAQ on AWS: use in R

Prepare your query with glue.

query <- glue::glue(
"SELECT *
FROM openaq
WHERE city='{locations$city[1]}'
AND parameter='pm25'"
)
query
## SELECT *
## FROM openaq
## WHERE city='武汉市'
## AND parameter='pm25'
28 / 34
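
Pasting a value straight into SQL breaks as soon as the value contains a single quote. A minimal base R sketch, not from the talk, that doubles single quotes before building the same kind of query: quote_sql_string() and build_query() are hypothetical helpers, and doubling quotes is the standard SQL way of escaping string literals, assumed here to be what Athena expects.

```r
# Hypothetical helper: wrap a value in single quotes,
# doubling any single quote it contains.
quote_sql_string <- function(x) {
  paste0("'", gsub("'", "''", x, fixed = TRUE), "'")
}

# Hypothetical helper: same query shape as on the slide
build_query <- function(city, parameter) {
  sprintf(
    "SELECT *\nFROM openaq\nWHERE city=%s\nAND parameter=%s",
    quote_sql_string(city),
    quote_sql_string(parameter)
  )
}

cat(build_query("Delhi", "pm25"))
quote_sql_string("O'Fallon")
# returns "'O''Fallon'"
```

The glue version on the slide is fine for values you control, such as a city name taken from aq_locations().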

OpenAQ on AWS: use in R

Make the query with AWR.Athena

library("AWR.Athena")
library("DBI")
con <- dbConnect(AWR.Athena::Athena(),
                 region = 'us-east-1',
                 # S3 bucket where Athena writes query results; use your own
                 S3OutputLocation = 's3://ropenaq',
                 Schema = 'default')
meas <- dbGetQuery(con, query)

Now save meas and munge it, plot it, etc.

29 / 34
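
Saving the result right away avoids paying for the same query twice. A minimal sketch of "save, then munge", using a tiny made-up table instead of a real Athena result: the column names day and value are hypothetical, since the real ones depend on how the OpenAQ table was created.

```r
# Made-up stand-in for the query result
meas <- data.frame(
  day = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
  value = c(40, 60, 30)
)

# Save to disk, then read back in a later session
path <- tempfile(fileext = ".rds")
saveRDS(meas, path)
meas_saved <- readRDS(path)

# Example of munging: mean value per day
daily <- aggregate(value ~ day, data = meas_saved, FUN = mean)
daily
```

In a real project the file would go to a permanent path rather than tempfile().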

OpenAQ in R

In short,

  • ropenaq alone: only OpenAQ API data i.e. last 90 days, no setup, no credit card info.

  • AWR.Athena (or noctua?): All OpenAQ data via AWS, more setup, pennies.

Take your time, document steps.

In both cases, you can do a lot from R once you have the data! 🎉

30 / 34

Conclusion: why use ropenaq

  • Need OpenAQ API data in R?

  • Or want an engaging dataset to learn how to use R for further projects?

install.packages("ropenaq")

browseURL("https://docs.ropensci.org/ropenaq")

  • Want more data than the last 90 days? Set up your AWS account, install the AWR.Athena package (or noctua?). Use ropenaq to select locations.
31 / 34

Conclusion: how to learn R

Some resources

32 / 34

Conclusion: how to contribute

ropenaq R package, R client for the OpenAQ API

33 / 34

tiny.cc/ropenaq

Diagram drawn by Damien Cornu. 😍

34 / 34
