BigQuery
========

Installing BigQuery clients
---------------------------

To interact with BigQuery from Python, install the
`google-cloud-bigquery <https://pypi.org/project/google-cloud-bigquery/>`_
library:

.. code-block:: bash

   $ pip install google-cloud-bigquery

There are two main ways to get authorized to GCP: through a personal account,
or through a service account (a system user with minimal permissions on the
project, typically used for machine-to-machine access such as scripts, jobs,
and other automated workflows).

To read data from BigQuery as described here, the relevant account (either
personal or service account) requires at least the ``BigQuery Data Viewer``
and ``BigQuery Job User`` roles.

Service account access through application credentials
--------------------------------------------------------

Access to Google services is provided via a credentials file, which can be
downloaded from the IAM page of the
`Google Cloud Console <https://console.cloud.google.com/>`_, or given to you
by your Google account administrator.

- If you have credentials for a service account used only by yourself, we
  suggest storing these in your home directory under ``/home/faculty``. For a
  description of how your home directory persists across servers, read the
  :ref:`user_workspace` section of the documentation.

- If the credentials belong to a service account linked to the project, we
  suggest storing the credentials in the project workspace so that everyone
  in the project can access them.

Once you have settled on a location for the credentials, make sure the
BigQuery client knows how to find them by setting the
``GOOGLE_APPLICATION_CREDENTIALS`` environment variable. Run the following
commands, replacing ``/path/to/credentials.json`` with the absolute path to
the credentials:

.. code-block:: bash

   echo "export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json" > /etc/faculty_environment.d/gcloud-credentials.sh
   sudo sv restart jupyter  # Restart Jupyter to make sure it has access to credentials.

To make setting up servers more reproducible, we recommend adding these
commands to the ``scripts`` section of a custom environment. For further
information on setting environment variables in Faculty, refer to the
:ref:`environment_variables` section.

Alternative access using the gcloud CLI
----------------------------------------

To access BigQuery as a single user (as opposed to using a service account,
as outlined above), you can use the ``gcloud`` CLI. This method only works on
the servers you are running. Once you start a server, authenticate to your
Google account by opening a terminal and running the following command:

.. code-block:: bash

   gcloud auth login --update-adc

This prints a URL that you can copy-paste from the terminal window into your
browser to complete the authentication process. At the end of the process you
will receive a code that you need to paste back into the terminal.

The first time you authenticate you will also need to set your GCP project ID
with the command:

.. code-block:: bash

   gcloud config set project PROJECT_ID

where ``PROJECT_ID`` is your actual project ID (which you can find with the
``gcloud projects list`` command after authenticating as outlined above).

These settings are saved in your home folder, so the authentication remains
valid for all servers that you start until the session expires. You will
likely have to reauthenticate with the first command above on a regular basis
(at least once a day).

This access works for both scripts and notebooks, without setting any extra
environment variables.
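Whichever of the two methods above you use, you can check that credentials are
discoverable before running any queries. The following is a minimal sketch
using the ``google.auth`` package (installed as a dependency of
``google-cloud-bigquery``), which resolves credentials the same way the
BigQuery client does:

.. code-block:: python

   import google.auth

   # Resolve credentials using the standard lookup order: the
   # GOOGLE_APPLICATION_CREDENTIALS environment variable first, then the
   # application default credentials created by `gcloud auth login --update-adc`.
   credentials, project_id = google.auth.default()

   # project_id may be None if no default project is configured.
   print(f"Found credentials for project: {project_id}")

If this raises a ``DefaultCredentialsError``, revisit the steps above.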
Accessing BigQuery from Python
------------------------------

You can access BigQuery through Google's BigQuery client library like this:

.. code-block:: python

   from google.cloud import bigquery

   client = bigquery.Client()

   query = "SELECT * FROM `bigquery-public-data.london_bicycles.cycle_hire` LIMIT 10"
   df = client.query(query).result().to_dataframe()

This returns the result of the query as a pandas DataFrame. To be able to
access data with the ``.to_dataframe()`` method, the service account or user
also needs the ``BigQuery Read Session User`` role. Refer to the
`BigQuery documentation <https://cloud.google.com/bigquery/docs>`_ for other
examples.

An alternative is to access the data directly using pandas:

.. code-block:: python

   import pandas

   query = "SELECT * FROM `bigquery-public-data.london_bicycles.cycle_hire` LIMIT 10"

   # Run a Standard SQL query using the environment's default project
   df = pandas.read_gbq(query, dialect='standard')

This doesn't require any extra access role beyond the minimum, but might be
slower than Google's BigQuery client library, especially on large requests or
datasets.
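Note that ``pandas.read_gbq`` relies on the ``pandas-gbq`` package being
installed. If no default project is configured (for example, because you have
not run ``gcloud config set project``), you can also pass the billing project
explicitly. A minimal sketch, where ``your-gcp-project`` is a placeholder for
your own project ID:

.. code-block:: python

   import pandas

   query = "SELECT * FROM `bigquery-public-data.london_bicycles.cycle_hire` LIMIT 10"

   # Pass the billing project explicitly instead of relying on the
   # environment's default project. "your-gcp-project" is a placeholder.
   df = pandas.read_gbq(query, project_id="your-gcp-project", dialect="standard")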