BigQuery¶
Installing BigQuery clients¶
To interact with BigQuery from Python, install the google-cloud-bigquery library:
$ pip install google-cloud-bigquery
There are two main ways to get authorized to GCP: through a personal account, or through a service account (a system user with minimal permissions on the project, typically used for machine-to-machine access such as scripts, jobs, and other automated workflows).
To read data from BigQuery as described here, the relevant account
(either personal or service account) requires at least the BigQuery Data Viewer
and BigQuery Job User roles.
Service account access through application credentials¶
Access to Google services is provided via a credentials file, which can be downloaded from the IAM page of the Google Cloud Console, or given to you by your Google account administrator.
If you have credentials for a service account used only by yourself, we suggest storing them in your home directory under
/home/faculty
. For a description of how your home directory persists across servers, read the Your home directory section of the documentation.

If the credentials belong to a service account linked to the project, we suggest storing them in the project workspace so that everyone in the project can access them.
Once you have settled on a location for the credentials, make sure the
BigQuery client knows how to find them by setting the
GOOGLE_APPLICATION_CREDENTIALS
environment variable. Run the
following commands, replacing /path/to/credentials.json
with the
absolute path to the credentials:
echo "export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json" > /etc/faculty_environment.d/gcloud-credentials.sh
sudo sv restart jupyter # Restart Jupyter to make sure it has access to credentials.
To make setting up servers more reproducible, we recommend adding these commands to the scripts section of a custom environment. For further information on setting environment variables in Faculty, refer to the Environment variables section.
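If you only need the credentials for the current Python process (for example, while experimenting in a single notebook) and do not want to restart Jupyter, you can also set the variable from Python before constructing the client. This is a per-process alternative to the environment file above; the path shown is a placeholder to replace with your own:

```python
import os

# Placeholder path: replace with the absolute path to your credentials file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/credentials.json"

# The BigQuery client reads this variable when it is constructed,
# so set it before creating bigquery.Client().
```

Note that this only affects the current process; other notebooks, terminals, and future servers will not see it, which is why the environment-file approach above is preferred for reproducible setups.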
Alternative access using the gcloud CLI¶
To access BigQuery as a single user (as opposed to using a service account,
outlined above), you can use the gcloud
CLI.
This method will work only within the servers you are running. Once you start a server, authenticate to your Google account by opening a terminal and running the following command:
gcloud auth login --update-adc
This will print a URL that you can copy from the terminal and paste into your browser to complete the authentication process. At the end of the process you will receive a code to paste back into the terminal.
The first time you authenticate you will also need to set your GCP project ID with the command:
gcloud config set project PROJECT_ID
where you fill in your actual project ID (which you can find using the
gcloud projects list
command after authenticating as outlined above).
These settings are saved in your home folder, so the authentication remains valid for all servers that you start until the session expires. You will likely need to reauthenticate with the first command above on a regular basis (at least once a day).
This access should work for both scripts and notebooks, without setting any extra environment variable.
Accessing BigQuery from Python¶
Accessing BigQuery through Google’s BigQuery client library looks like this:
from google.cloud import bigquery
client = bigquery.Client()
query = "SELECT * FROM `bigquery-public-data.london_bicycles.cycle_hire` LIMIT 10"
df = client.query(query).result().to_dataframe()
This returns the result of the query as a Pandas DataFrame.
To access data with the .to_dataframe()
method, the service account or user also needs the BigQuery Read Session User
role.
Refer to the BigQuery documentation for other examples.
An alternative way is accessing the data directly using Pandas:
import pandas
query = "SELECT * FROM `bigquery-public-data.london_bicycles.cycle_hire` LIMIT 10"
# Run a Standard SQL query using the environment's default project
df = pandas.read_gbq(query, dialect='standard')
This doesn’t require any access role beyond the minimum, but it might be slower than Google’s BigQuery client library, especially for large queries or datasets.