datagovindia is a wrapper around the more than 100,000 APIs of the Government of India’s open data platform, data.gov.in. This short guide takes you through the package. Its functionality is centered on three aspects:
- API discovery - Finding the right API from all the available APIs
- API information - Getting information about a particular API
- Querying the API - Getting a tidy data set from the chosen API
The package is available on CRAN. Install it using:
install.packages("datagovindia")
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("econabhishek/datagovindia")
- An account on data.gov.in
- An API key from the My Account page (see the official guide for instructions)
library(datagovindia)
Learn more about the package's functions in the vignette.
Once you have your API key and have found the index_name of the API you want using the package's search functions (see the vignette for details), you are ready to extract data from it.
The function get_api_data is the powerhouse of this package. By exploiting the data.frame structure of the underlying data, it can do more than a manually constructed API query: it lets the user filter, sort, select variables, and decide how much of the data to extract. The website itself can filter on only one field with one value at a time, whereas a single command through the wrapper can make multiple requests and append their results.
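To make the multiple-request behaviour concrete, here is a minimal base R sketch of how comma-separated filter values could be expanded into one query per combination. It is an illustration only, not the package's internal code: `expand_filters` is a hypothetical helper, and the field ids `city` and `pollutant_id` are borrowed from the air-quality example further down in this guide.

```r
# Hypothetical helper (not part of datagovindia): expand a named character
# vector of comma-separated values into one filter string per combination.
expand_filters <- function(filter_by) {
  # Split each entry on commas, e.g. "PM10,NO2" -> c("PM10", "NO2")
  values <- lapply(filter_by, function(x) {
    trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  })
  # One row per combination of values across all fields
  combos <- expand.grid(values, stringsAsFactors = FALSE)
  # Build "filters[field]=value" pairs for each row and join them with "&"
  vapply(seq_len(nrow(combos)), function(i) {
    row <- unlist(combos[i, , drop = FALSE])
    paste0("filters[", names(combos), "]=", row, collapse = "&")
  }, character(1))
}

queries <- expand_filters(c(city = "Gurugram,Chandigarh",
                            pollutant_id = "PM10,NO2"))
length(queries)  # 2 cities x 2 pollutants = 4 combinations
```

Each of the four strings stands for one request; the wrapper's own log in the air-quality example likewise shows four request URLs for two cities and two pollutants.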
But before we dive into data extraction, we first need to validate the API key received from data.gov.in. To get the key, you need to register first and then copy the key from your “My Account” page after logging in. More instructions can be found in the official guide. Once you have your API key, you can validate it as follows (you only need to do this once per session; the key below is a sample key from the website, used for demonstration):
##Using a sample key
register_api_key("579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b")
#> Connected to the internet
#> The server is online
#> The API key is valid and you won't have to set it again
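Hard-coding a personal key in a script makes it easy to leak. One common pattern, sketched below, is to keep the key in an environment variable and read it at the start of the session. The variable name `DATAGOVINDIA_API_KEY` and the helper `read_api_key` are assumptions made for illustration; the package does not define them.

```r
# Hypothetical helper: read the API key from an environment variable.
# The variable name is an assumption, not a datagovindia convention.
read_api_key <- function(var = "DATAGOVINDIA_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  key
}

# For demonstration only: set the variable to the sample key used above
Sys.setenv(
  DATAGOVINDIA_API_KEY = "579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b"
)
api_key <- read_api_key()

# With the package loaded, the key would then be registered as shown earlier:
# register_api_key(api_key)
```

In practice the variable would be set outside the script (for example in a user-level `.Renviron` file), so the key never appears in shared code.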
Once you have your key registered, you are ready to extract data from a chosen API. Here is what each argument of get_api_data means:
- api_index: index_name of the chosen API (found by using the search functions)
- results_per_req: Number of results per request sent to the server; takes an integer value, or the string “all” to get all of the available data
- filter_by: A named character vector of field id (not the field name) and value pairs; can take multiple fields as well as multiple comma-separated values per field
- field_select: A character vector of fields, to keep only a subset of variables in the final data.frame
- sort_by: One or more fields to sort by
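Because filter_by expects one comma-separated string per field, it can be convenient to build it from ordinary R vectors. The sketch below does this in base R. `make_filter` is a hypothetical helper, not a package function, and the field ids `city` and `pollutant_id` are borrowed from the air-quality example later in this guide.

```r
# Hypothetical helper: turn a named list of value vectors into the named
# character vector of comma-separated values described for filter_by.
make_filter <- function(values) {
  stopifnot(is.list(values),
            !is.null(names(values)),
            all(nzchar(names(values))))
  vapply(values, function(v) paste(v, collapse = ","), character(1))
}

filter_by <- make_filter(list(
  city         = c("Gurugram", "Chandigarh"),
  pollutant_id = c("PM10", "NO2")
))
print(filter_by)
```

The resulting vector has the same shape as the `filter_by` argument passed to get_api_data in the example below.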
In a nutshell, first find the API you want using the search functions,
get the index_name of the API from the results, optionally take a
look at the fields present in the API's data, and then use the
get_api_data function to extract the data. Suppose we choose the API
“Real time Air Quality Index from various location” with index_name
3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69. First we will look at which
fields are available to construct the right query.
Suppose we want data for only two cities, Chandigarh and
Gurugram, and two pollutants, PM10 and NO2. We will let all fields
(dataset columns) be returned.
We now look at the fields available to play with.
get_api_fields("3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69")
id | name | type |
---|---|---|
document_id | document_id | double |
id | id | double |
country | country | keyword |
state | state | keyword |
city | city | keyword |
station | station | keyword |
pollutant_id | pollutant_id | keyword |
last_update | last_update | date |
pollutant_min | pollutant_min | double |
pollutant_max | pollutant_max | double |
pollutant_avg | pollutant_avg | double |
resource_uuid | resource_uuid | keyword |
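Queries are built from the values in the `id` column, and a mistyped id is easy to miss. As a sketch, a small check like the one below can compare the names used in `filter_by` against the available ids before any request is sent. `check_filter_fields` is a hypothetical helper, and the ids are typed in by hand from the table above rather than taken from the return value of get_api_fields, whose exact structure is not assumed here.

```r
# Field ids copied by hand from the table above
field_ids <- c("document_id", "id", "country", "state", "city", "station",
               "pollutant_id", "last_update", "pollutant_min",
               "pollutant_max", "pollutant_avg", "resource_uuid")

# Hypothetical helper: return the filter names that are not valid field ids
check_filter_fields <- function(filter_by, field_ids) {
  setdiff(names(filter_by), field_ids)
}

ok  <- check_filter_fields(c(city = "Gurugram", pollutant_id = "PM10"), field_ids)
bad <- check_filter_fields(c(city = "Gurugram", polutant_id = "PM10"), field_ids)

print(ok)   # character(0): every name is a known field id
print(bad)  # "polutant_id": the misspelled id is flagged
```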
We accordingly select the city and pollutant_id fields for constructing our query. Note that only the field id is used to query the data.
# Filter names must match the field ids listed by get_api_fields
get_api_data(api_index="3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69",
             results_per_req=10,
             filter_by=c(city="Gurugram,Chandigarh",
                         pollutant_id="PM10,NO2"),
             field_select=c(),
             sort_by=c('state','city'))
#> Connected to the internet
#> The server is online
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b&format=json&offset=0&limit=10&filters[city]=Gurugram&filters[pollutant_id]=PM10
#> gave the API a rest
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b&format=json&offset=0&limit=10&filters[city]=Chandigarh&filters[pollutant_id]=PM10
#> gave the API a rest
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b&format=json&offset=0&limit=10&filters[city]=Gurugram&filters[pollutant_id]=NO2
#> gave the API a rest
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b&format=json&offset=0&limit=10&filters[city]=Chandigarh&filters[pollutant_id]=NO2
#> gave the API a rest
#> No results returned - check your api_index
id | country | state | city | station | pollutant_id | last_update | pollutant_min | pollutant_max | pollutant_avg |
---|---|---|---|---|---|---|---|---|---|
550 | India | Haryana | Gurugram | NISE Gwal Pahari, Gurugram - IMD | PM10 | 25-09-2021 05:00:00 | 22 | 102 | 50 |
555 | India | Haryana | Gurugram | Sector-51, Gurugram - HSPCB | PM10 | 25-09-2021 05:00:00 | 59 | 119 | 81 |
562 | India | Haryana | Gurugram | Teri Gram, Gurugram - HSPCB | PM10 | 25-09-2021 05:00:00 | 36 | 100 | 61 |
103 | India | Chandigarh | Chandigarh | Sector 22, Chandigarh - CPCC | PM10 | 25-09-2021 05:00:00 | 13 | 102 | 49 |
110 | India | Chandigarh | Chandigarh | Sector-25, Chandigarh - CPCC | PM10 | 25-09-2021 05:00:00 | 19 | 84 | 42 |
551 | India | Haryana | Gurugram | NISE Gwal Pahari, Gurugram - IMD | NO2 | 25-09-2021 05:00:00 | 13 | 25 | 17 |
556 | India | Haryana | Gurugram | Sector-51, Gurugram - HSPCB | NO2 | 25-09-2021 05:00:00 | 8 | 13 | 10 |
563 | India | Haryana | Gurugram | Teri Gram, Gurugram - HSPCB | NO2 | 25-09-2021 05:00:00 | 8 | 10 | 8 |
569 | India | Haryana | Gurugram | Vikas Sadan, Gurugram - HSPCB | NO2 | 25-09-2021 05:00:00 | 17 | 40 | 28 |
104 | India | Chandigarh | Chandigarh | Sector 22, Chandigarh - CPCC | NO2 | 25-09-2021 05:00:00 | 15 | 83 | 42 |
111 | India | Chandigarh | Chandigarh | Sector-25, Chandigarh - CPCC | NO2 | 25-09-2021 05:00:00 | 4 | 29 | 13 |
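Once the data is in a data.frame, ordinary R tools apply. As a sketch, the snippet below rebuilds the five PM10 rows of the result above by hand and averages `pollutant_avg` by city and pollutant with base R's `aggregate`. The values are entered as strings and converted with `as.numeric` because the column types returned by get_api_data are not confirmed here; the conversion is a precaution, not a documented requirement.

```r
# PM10 rows copied by hand from the result table above (for illustration)
aq <- data.frame(
  city          = c("Gurugram", "Gurugram", "Gurugram",
                    "Chandigarh", "Chandigarh"),
  pollutant_id  = rep("PM10", 5),
  pollutant_avg = c("50", "81", "61", "49", "42"),
  stringsAsFactors = FALSE
)

# Convert to numeric in case the values arrive as character strings
aq$pollutant_avg <- as.numeric(aq$pollutant_avg)

# Mean of the station averages for each city and pollutant
city_means <- aggregate(pollutant_avg ~ city + pollutant_id,
                        data = aq, FUN = mean)
print(city_means)
```

Note that this is a mean of station-level averages, which is only a rough summary of a city's air quality.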
This wrapper is also available for Python on PyPI. Install it using:
pip install datagovindia
Authors :