class: center, middle, inverse, title-slide

# Obtain & wrangle OpenAQ data with R
## ropenaq and beyond
### Maëlle Salmon - 2020-03-31

tiny.cc/ropenaq
masalmon.eu
ma_salmon
maelle
Ryan Millier on Pexels
---

# Context (1/2)

😷 ❤️ In December 2015 I was a statistician & data manager with the [CHAI project](https://chaiproject.org/) (_Cardiovascular Health Effects of Air Pollution in Telangana, India._) ➡️ interest in open air quality data

--

📊 📉 Work with R, _An open-source language and environment for statistical computing._ ➡️ interest in wrangling such data with R

--

Hear of OpenAQ... 💥

--

👨🏭 **Decision to create an R client for OpenAQ API!**

---

# Context (2/2)

That client now exists: `ropenaq`, [docs.ropensci.org/ropenaq](https://docs.ropensci.org/ropenaq/)

--

* Actively maintained

--

* Distributed on CRAN `install.packages("ropenaq")`

--

* [Peer-reviewed by rOpenSci](https://ropensci.org/software-review/)

--

🙋 No longer with the CHAI project but still into R and air quality data. 😉

---

# Why R by the way?

Open-source, free language with many capabilities:

* Data wrangling, modeling, visualization;

* Reproducible reports and analyses;

* Interactive plots and apps.

---

# And what is an R package actually?

The OpenAQ R client is an R "package":

* An add-in to R;

* A collection of code and documentation.

---

# What ropenaq does

Gets data into R

* From OpenAQ's API endpoints (measurements, latest and countries, cities, locations);

* as tables called "data.frames", common format in R.

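---

# What a data.frame looks like

A minimal sketch of the "data.frame" format mentioned above, built by hand in base R. The station names and values are invented for illustration; they are not OpenAQ data and not `ropenaq` output.

```r
# Hypothetical measurements, made up for this example
toy <- data.frame(
  location = c("Station A", "Station A", "Station B"),
  parameter = c("pm25", "pm10", "pm25"),
  value = c(12, 30, 18),
  unit = "µg/m³",
  stringsAsFactors = FALSE
)

# One row per measurement, one typed column per variable
str(toy)

# Keep only the PM2.5 rows with an ordinary logical condition
pm25_only <- toy[toy$parameter == "pm25", ]
pm25_only
```

Tables returned by `ropenaq` can be handled the same way, only with more columns.
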
---

# What ropenaq does

```r
library("ropenaq")
aq_latest(country = "IN")
```

--

```
## # A tibble: 5,527 x 15
##    location city  country distance parameter  value lastUpdated         unit
##    <chr>    <chr> <chr>      <dbl> <chr>      <dbl> <dttm>              <chr>
##  1 AAQMS K… Pune  IN        1.41e7 co        1220   2018-02-22 03:00:00 µg/m³
##  2 AAQMS K… Pune  IN        1.41e7 pm10       165   2018-02-22 03:00:00 µg/m³
##  3 AAQMS K… Pune  IN        1.41e7 pm25        66   2018-02-22 03:00:00 µg/m³
##  4 AAQMS K… Pune  IN        1.41e7 o3          15.8 2018-02-22 03:00:00 µg/m³
##  5 AAQMS K… Pune  IN        1.41e7 so2         22.8 2018-02-22 03:00:00 µg/m³
##  6 AP Tiru… Chit… IN            NA no2         49.9 2016-06-30 13:00:00 µg/m³
##  7 AP Tiru… Chit… IN            NA o3          32.1 2016-06-30 13:00:00 µg/m³
##  8 AP Tiru… Chit… IN            NA so2          5.7 2016-06-30 13:00:00 µg/m³
##  9 AP Tiru… Chit… IN            NA so2          6.9 2016-07-04 06:15:00 µg/m³
## 10 AP Tiru… Chit… IN            NA pm10        12   2016-07-04 06:15:00 µg/m³
## # … with 5,517 more rows, and 7 more variables: sourceName <chr>,
## #   averagingPeriod_value <dbl>, averagingPeriod_unit <chr>, latitude <dbl>,
## #   longitude <dbl>, cityURL <chr>, locationURL <chr>
```

---

# Select locations

```r
delhi_locations <- aq_locations(
  city = "Delhi",
  country = "IN",
  parameter = "pm25"
)
delhi_locations
```

--

```
## # A tibble: 71 x 25
##    id    country city  cities location locations sourceName sourceNames
##    <chr> <chr>   <chr> <list> <chr>    <list>    <chr>      <list>
##  1 IN-2… IN      Delhi <chr … Alipur,… <chr [1]> caaqm      <chr [1]>
##  2 IN-1… IN      Delhi <chr … Anand V… <chr [1]> caaqm      <chr [1]>
##  3 IN-60 IN      Delhi <chr … Anand V… <chr [2]> caaqm      <chr [2]>
##  4 IN-6  IN      Delhi <chr … Anand V… <chr [2]> CPCB       <chr [2]>
##  5 IN-2… IN      Delhi <chr … Ashok V… <chr [1]> caaqm      <chr [1]>
##  6 IN-1… IN      Delhi <chr … Ashok V… <chr [1]> caaqm      <chr [1]>
##  7 IN-71 IN      Delhi <chr … Aya Nag… <chr [2]> caaqm      <chr [2]>
##  8 IN-1… IN      Delhi <chr … Bawana,… <chr [1]> caaqm      <chr [1]>
##  9 IN-1… IN      Delhi <chr … Burari … <chr [2]> caaqm      <chr [2]>
## 10 IN-8  IN      Delhi <chr … Civil L… <chr [1]> CPCB       <chr [1]>
## # … with 61 more rows, and 17 more variables: sourceType <chr>,
## #   sourceTypes <list>, firstUpdated <dttm>, lastUpdated <dttm>,
## #   countsByMeasurement <list>, count <int>, longitude <dbl>, latitude <dbl>,
## #   pm25 <lgl>, pm10 <lgl>, no2 <lgl>, so2 <lgl>, o3 <lgl>, co <lgl>, bc <lgl>,
## #   cityURL <chr>, locationURL <chr>
```

---

# Select locations

For each location, useful metadata to select the optimal location(s) for your needs!

```r
head(delhi_locations$sourceName)
```

```
## [1] "caaqm" "caaqm" "caaqm" "CPCB"  "caaqm" "caaqm"
```

```r
head(delhi_locations$pm25)
```

```
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
```

```r
min(delhi_locations$firstUpdated)
```

```
## [1] "2015-04-09 06:00:00 UTC"
```

---

# Use locations

```r
head(delhi_locations$location)
```

```
## [1] "Alipur, Delhi - DPCC"      "Anand Vihar, Delhi - DPCC"
## [3] "Anand Vihar, Delhi - DPCC" "Anand Vihar, Delhi - DPCC"
## [5] "Ashok Vihar, Delhi - DPCC" "Ashok Vihar, Delhi - DPCC"
```

```r
head(delhi_locations$locationURL)
```

```
## [1] "Alipur%2C+Delhi+-+DPCC"      "Anand+Vihar%2C+Delhi+-+DPCC"
## [3] "Anand+Vihar%2C+Delhi+-+DPCC" "Anand+Vihar%2C+Delhi+-+DPCC"
## [5] "Ashok+Vihar%2C+Delhi+-+DPCC" "Ashok+Vihar%2C+Delhi+-+DPCC"
```

---

# Use locations

💡 In `aq_latest()` and `aq_measurements()` use the URL-encoded city / location, i.e. the `cityURL` / `locationURL` columns.

```r
res <- aq_latest(location = "Alipur, Delhi - DPCC")
```

```
## Error: This location/city/country combination is not available within the platform. See ?aq_locations
```

```r
res <- aq_latest(location = "Alipur%2C+Delhi+-+DPCC")
```

---

# Delhi PM2.5 vs WHO threshold

```r
data <- aq_measurements(location = "US+Diplomatic+Post%3A+New+Delhi",
                        parameter = "pm25")
```

```r
# dplyr by Hadley Wickham et al
data <- dplyr::filter(data, value > 0)
```

---

# Delhi PM2.5 vs WHO threshold

```r
data
```

```
## # A tibble: 2,158 x 12
##    location parameter value unit  country city  latitude longitude
##    <chr>    <chr>     <int> <chr> <chr>   <chr>    <dbl>     <dbl>
##  1 US Dipl… pm25         12 µg/m³ IN      Delhi     28.6      77.2
##  2 US Dipl… pm25         18 µg/m³ IN      Delhi     28.6      77.2
##  3 US Dipl… pm25         16 µg/m³ IN      Delhi     28.6      77.2
##  4 US Dipl… pm25         19 µg/m³ IN      Delhi     28.6      77.2
##  5 US Dipl… pm25         44 µg/m³ IN      Delhi     28.6      77.2
##  6 US Dipl… pm25         38 µg/m³ IN      Delhi     28.6      77.2
##  7 US Dipl… pm25         17 µg/m³ IN      Delhi     28.6      77.2
##  8 US Dipl… pm25         15 µg/m³ IN      Delhi     28.6      77.2
##  9 US Dipl… pm25         13 µg/m³ IN      Delhi     28.6      77.2
## 10 US Dipl… pm25         15 µg/m³ IN      Delhi     28.6      77.2
## # … with 2,148 more rows, and 4 more variables: dateUTC <dttm>,
## #   dateLocal <dttm>, cityURL <chr>, locationURL <chr>
```

---

# Delhi PM2.5 vs WHO threshold

```r
library("ggplot2") # Hadley Wickham et al

ggplot(data) +
  geom_point(aes(x = dateLocal, y = value),
             col = "CornflowerBlue") +
  geom_hline(yintercept = 25,
             size = 1.2,
             col = "darkred") +
  ylab(expression(paste("PM2.5 concentration (", mu, "g/",m^3,")")))+
  xlab("Time (days)") +
  ggtitle(data$location[1]) +
  hrbrthemes::theme_ipsum(
    base_size = 16,
    axis_title_size = 16)
```

![](index_files/figure-html/unnamed-chunk-4-1.png)

---

# Delhi PM2.5 vs WHO threshold

![Delhi PM2.5 values over the last 90 days](index_files/figure-html/delhiplot-1.png)

---

# London PM2.5 vs WHO threshold

```r
data <- aq_measurements(location = "London+Westminster",
                        parameter = "pm25")
```

```r
data <- dplyr::filter(data, value > 0)
```

---

# London PM2.5 vs WHO threshold

```r
ggplot(data) +
  geom_point(aes(x = dateLocal, y = value),
             col = "CornflowerBlue") +
  geom_hline(yintercept = 25,
             size = 1.2,
             col = "darkred") +
  ylab(expression(paste("PM2.5 concentration (", mu, "g/",m^3,")")))+
  xlab("Time (days)") +
  ggtitle(data$location[1]) +
  hrbrthemes::theme_ipsum(
    base_size = 16,
    axis_title_size = 16)
```

---

# London PM2.5 vs WHO threshold

![London PM2.5 values over the last 90 days](index_files/figure-html/londonplot-1.png)

---

# How much data can you get?

The maximum number of results for any query is 100 pages times 10,000 results per page, i.e. 1,000,000 results.

--

You don't need to use the `page` and `limit` arguments: `ropenaq` handles all of this for you. Just wait a bit. ⌚

--

```r
res <- aq_measurements(country = "IN")
nrow(res)
```

```
## [1] 1000000
```

```r
attr(res, "meta")$found
```

```
## [1] 29175743
```

---

# What if you need more data

* **More data than 1,000,000 results?** Split your query, e.g. using `date_from` and `date_to`. Or see the next item.

--

* **More data than the last 90 days?** OpenAQ data older than 90 days is available through Amazon Web Services (AWS) Athena. **More setup for you, but worth it, and the skills are transferable** 💪

---

# OpenAQ on AWS: setup in the browser (1/2)

* Set up an AWS account (root and IAM)

* Create an S3 bucket. To avoid S3 fees, [set up a lifecycle policy](https://gist.github.com/jflasher/573525aff9a5d8a966e5718272ceb25a#gistcomment-3196277).

--

💰 **Will I lose all my money?** 💰

For OpenAQ on AWS you can't avoid Athena fees. They're super small. If you're nervous, do as I did: check the billing interface often until you're confident.

---

# OpenAQ on AWS: setup in the browser (2/2)

Go to the Athena console.
[Joe Flasher's gist](https://gist.github.com/jflasher/573525aff9a5d8a966e5718272ceb25a)

--

* Check your location

--

* Create the OpenAQ table

--

* Run a query like in the gist to see if it works

--

* Get your credentials (go into your "My security credentials" settings from the drop-down menu at the top right of the console where your username is.)

---

# OpenAQ on AWS: setup in R

Now, install `AWR.Athena`. It depends on `rJava`. 🍀

_Another newer R package without rJava: [`noctua`](https://dyfanjones.github.io/noctua/)_ 👀

--

Set up credentials. I used AWS CLI and ran `aws configure`, YMMV. 🤷

--

In a script, try setting up a connection.

---

# OpenAQ on AWS: test query in R

```r
library("AWR.Athena") # Neal Fultz and Gergely Daróczi
library("DBI") # Kirill Müller et al
con <- dbConnect(AWR.Athena::Athena(),
                 region = 'us-east-1',
                 S3OutputLocation = 's3://ropenaq',
                 Schema = 'default')
query <- "SELECT *
FROM openaq
WHERE location='US Diplomatic Post: Hanoi'
LIMIT 10"
meas <- dbGetQuery(con, query)
```

--

Now on to cooler stuff!

---

# OpenAQ on AWS: use in R

Selecting locations in Wuhan with `ropenaq`.

```r
locations <- ropenaq::aq_locations(
  country = "CN",
  longitude = 114.3055,
  latitude = 30.5928,
  radius = 5000
)
unique(locations$city)
```

```
## [1] "武汉市"
```

---

# OpenAQ on AWS: use in R

Prepare your query with `glue`.

```r
query <- glue::glue(
  "SELECT *
  FROM openaq
  WHERE city='{locations$city[1]}'
  AND parameter='pm25'"
)
query
```

```
## SELECT *
## FROM openaq
## WHERE city='武汉市'
## AND parameter='pm25'
```

---

# OpenAQ on AWS: use in R

Make the query with `AWR.Athena`

```r
library("AWR.Athena")
library("DBI")
con <- dbConnect(AWR.Athena::Athena(),
                 region = 'us-east-1',
                 S3OutputLocation = 's3://ropenaq',
                 Schema = 'default')
meas <- dbGetQuery(con, query)
```

--

Now save `meas` and munge it, plot it, etc.

---

# OpenAQ in R

In short,

* `ropenaq` alone: only OpenAQ API data i.e. last 90 days, no setup, no credit card info.
* `AWR.Athena` (or `noctua`?): All OpenAQ data via AWS, more setup, pennies. Take your time, document steps.

In both cases, you can do a lot from R once you have the data! 🎉

---

# Conclusion: why use ropenaq

* Need OpenAQ API data in R?

* Or want an engaging dataset to learn how to use R for further projects?

`install.packages("ropenaq")`

`browseURL("https://docs.ropensci.org/ropenaq")`

--

* Want more data than the last 90 days? Set up your AWS account, install the `AWR.Athena` package (or `noctua`?). Use `ropenaq` to select locations.

---

# Conclusion: how to learn R

Some resources

* [R for Data Science book](https://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund, free online version;

* [RStudio education website](https://education.rstudio.com/);

* [Data Science specialization by Johns Hopkins University on Coursera](https://www.coursera.org/specializations/jhu-data-science);

--

* Where/how to get help: [my post](https://masalmon.eu/2018/07/22/wheretogethelp/), [Sam Tyner's post](https://sctyner.github.io/rhelp.html);

* Keep informed: [my post](https://masalmon.eu/2019/01/25/uptodate/), [R Weekly](https://rweekly.org/).

---

# Conclusion: how to contribute

**ropenaq R package, R client for the OpenAQ API**

--

* Use cases: report them on [rOpenSci forum](https://discuss.ropensci.org/c/usecases/10)

--

* Docs? Code? Let's talk at [github.com/ropensci/ropenaq](https://github.com/ropensci/ropenaq)

Code cool stuff: HTTP requests with `crul`, asynchronous queries, tests with mocking.

--

* Bugs? GitHub too! 🐛

---
class: part-slide, center, bottom
background-image:url(img/RopenAQ.jpg)

**[tiny.cc/ropenaq](https://maelle.github.io/communityseries/#1)**

Diagram drawn by [Damien Cornu](https://twitter.com/damienaberlin). 😍
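
---

# Appendix: what "URL-encoded" means

A small sketch of how a plain location name relates to the URL-encoded form that `aq_latest()` and `aq_measurements()` expect. `encode_location()` is a hypothetical helper written for this slide, not part of `ropenaq`; in practice the `locationURL` column returned by `aq_locations()` already holds the encoded names, so this only illustrates what the encoding does.

```r
# Hypothetical helper, not a ropenaq function:
# percent-encode reserved characters with base R, then write spaces as "+"
encode_location <- function(x) {
  encoded <- vapply(x, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  gsub("%20", "+", encoded, fixed = TRUE)
}

encode_location("Alipur, Delhi - DPCC")
encode_location("US Diplomatic Post: New Delhi")
```

With these two names, the result should match the encoded strings used in the "Use locations" and "Delhi PM2.5 vs WHO threshold" slides.
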