sparkbq is a sparklyr extension package providing an integration with Google BigQuery. It builds on top of spark-bigquery, which provides a Google BigQuery data source to Apache Spark.
You can install the released version of sparkbq from CRAN via

```r
install.packages("sparkbq")
```

or the latest development version through

```r
devtools::install_github("miraisolutions/sparkbq", ref = "develop")
```
The following table provides an overview of the supported versions of Apache Spark, Scala, and Google Dataproc:
| sparkbq | spark-bigquery | Apache Spark    | Scala | Google Dataproc |
|---------|----------------|-----------------|-------|-----------------|
| 0.1.x   | 0.1.0          | 2.2.x and 2.3.x | 2.11  | 1.2.x and 1.3.x |
sparkbq is based on the Spark package spark-bigquery, which is available in a separate GitHub repository.
```r
library(sparklyr)
library(sparkbq)
library(dplyr)

config <- spark_config()

sc <- spark_connect(master = "local[*]", config = config)

# Set Google BigQuery default settings
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct"
)

# Reading the public shakespeare data table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
hamlet <-
  spark_read_bigquery(
    sc,
    name = "hamlet",
    projectId = "bigquery-public-data",
    datasetId = "samples",
    tableId = "shakespeare") %>%
  filter(corpus == "hamlet") # NOTE: predicate pushdown to BigQuery!

# Retrieve results into a local tibble
hamlet %>% collect()

# Write result into "mysamples" dataset in our BigQuery (billing) project
spark_write_bigquery(
  hamlet,
  datasetId = "mysamples",
  tableId = "hamlet",
  mode = "overwrite")
```
When running outside of Google Cloud it is necessary to specify a service account JSON key file. The service account key file can be passed as parameter `serviceAccountKeyFile` to `bigquery_defaults`, or directly to `spark_read_bigquery` and `spark_write_bigquery`.
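For example, the key file could be passed per call instead of via the defaults. A minimal sketch, reusing the connection and sample table from above (the key file path is a placeholder):

```r
# Sketch: pass the service account key file directly to spark_read_bigquery
# (the path below is a placeholder, not a real file)
hamlet <- spark_read_bigquery(
  sc,
  name = "hamlet",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare",
  serviceAccountKeyFile = "/path/to/your/service_account_keyfile.json"
)
```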
Alternatively, the environment variable `GOOGLE_APPLICATION_CREDENTIALS` can be set to the path of the key file, e.g. `export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json` (see https://cloud.google.com/docs/authentication/getting-started for more information). Make sure the variable is set before starting the R session.
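To check from within R that the variable was picked up:

```r
# Returns the key file path, or "" if the variable is not set
Sys.getenv("GOOGLE_APPLICATION_CREDENTIALS")
```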
When running on Google Cloud, e.g. on Google Cloud Dataproc, application default credentials (ADC) may be used, in which case it is not necessary to specify a service account key file.
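In that case the BigQuery defaults can be set without a key file. A minimal sketch, assuming `serviceAccountKeyFile` may simply be omitted when ADC are available:

```r
# Sketch: rely on application default credentials on Google Cloud Dataproc
# (assumes serviceAccountKeyFile can be left out when ADC are available)
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  type = "direct"
)
```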