Version control with git
========================
Just as when writing any other type of software, when doing data science it's
really useful to keep your code under version control. The *de facto* standard
for code version control is `git `_, which comes
preinstalled on Faculty Platform. This document provides some tips on using git
effectively in a data science workflow on Faculty Platform.

.. contents:: Contents
    :local:
    :depth: 2

The git command line interface
------------------------------
The most common way to use git on the platform is through the command line
interface (CLI), which lets you perform git operations by running commands
in the terminal.
To get started with a repository, you first need to clone it on the platform.
This makes a copy of the repository on the platform where you can commit new
changes, which can later be pushed back to the "main" repository you cloned
from.
For example, to clone `the Faculty Python library
`_ from GitHub, create a server in a
project's workspace, open a terminal (click the Terminal icon on the server
once ready) and run the following command:

.. code-block:: bash

    git clone https://github.com/facultyai/faculty.git

Once the repository is cloned, you can view and edit its files, stage and
commit changes, and eventually push changes back to GitHub (see the note below
:ref:`about authentication `).
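As a rough sketch of that cycle, the commands below stage, commit and push a
change. So that the example runs without network access or credentials, it uses
a local bare repository as a stand-in for GitHub; the directory names, the file
name and the commit identity are all placeholders chosen for illustration.

```shell
# Stand-in for the hosted repository (a placeholder for GitHub)
workdir="$(mktemp -d)"
git init --quiet --bare "$workdir/remote.git"

# Clone it, as you would with a real repository URL
# (git warns that the repository is empty; that is expected here)
git clone --quiet "$workdir/remote.git" "$workdir/clone"
cd "$workdir/clone"

# Placeholder identity, set for this clone only
git config user.name "Jane Bloggs"
git config user.email "jane@bloggs.com"

# Edit a file, stage it and commit it
echo "print('hello')" > analysis.py
git add analysis.py
git commit --quiet -m "Add analysis script"

# Push the current branch back to the remote it was cloned from
git push --quiet origin HEAD
```

With a real repository, only the ``git add``, ``git commit`` and ``git push``
steps are needed after cloning.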
Using git requires more than just cloning repositories, however. For a more
complete introduction to using git, in particular the CLI, we recommend doing
one of the many free interactive tutorials online, such as `this one by
codecademy `_.
VSCode interface
----------------
In addition to the CLI, you can also use git through its integration in the
VSCode editor that comes preinstalled on platform servers.
In the VSCode interface, which can be found via the "Editor" option in the
ellipsis menu on servers in the workspace, git can be used from the "Source
Control" view. In this view, you can clone git repositories via the ellipsis
menu at the top of the interface:
.. image:: images/vscode_clone.png
Adding, committing and other parts of the git workflow can be managed from the
VSCode Source Control view, as summarised in `the VSCode documentation
`_. However, we
recommend using SSH keys for authentication (rather than VSCode's
integrations with GitHub etc.) :ref:`as described below `.
Tips for using git on Faculty Platform
--------------------------------------
Through our experience using Faculty Platform and the experience of our
internal and external users, we have collected some useful practices that help
make using git on the platform more productive.
Configuring the git user name and email
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When you make a commit, git stores information with it about who created it
and when. Code hosting providers like GitHub then use this information to
match commits with users.
By default, Faculty Platform configures git to use the name and email address
you used to sign up to the platform. To see the configured username and email
address, you can run these commands in the terminal:

.. code-block:: bash

    git config user.name
    git config user.email

If you want to modify the name or email address associated with commits you
make, you can do this with:

.. code-block:: bash

    git config --global user.name "Jane Bloggs"
    git config --global user.email jane@bloggs.com

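To check which identity a commit will actually record, you can make a commit in
a scratch repository and inspect its author. The sketch below does this, and
sets the identity at repository level (``git config`` without ``--global``) so
that it does not alter your global configuration. The scratch directory is a
temporary placeholder, and the name and email are the same example values as
above.

```shell
# Scratch repository used only for this check
scratch="$(mktemp -d)"
git init --quiet "$scratch"
cd "$scratch"

# Without --global, the setting applies to this repository only
git config user.name "Jane Bloggs"
git config user.email "jane@bloggs.com"

# Make an empty commit and print the author git recorded for it
git commit --quiet --allow-empty -m "Check commit identity"
git log -1 --format='%an <%ae>'
```

Repository-level settings take precedence over global ones, which can be handy
if one repository needs a different email address from the rest.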
Work on separate copies of the repository
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Most of the time, people use git by cloning repositories to their own computer,
making changes and pushing them back to a shared code hosting service like
GitHub or Bitbucket.
However, the model on Faculty Platform, where all members of a project can
access and edit the same files (those in the project's workspace), can present
some challenges to git users. If, for example, multiple data scientists are
working on the same copy of a repository, one user can very easily overwrite or
discard changes of another. This is an unfortunate but necessary consequence of
git's design when used on shared filesystems, where commonplace operations like
checking out branches modify the working copy of files accessed by the user.
It is possible to have multiple users working on the same copy of a repository,
but it requires a lot of coordination and incurs a lot of friction. Instead,
when multiple users need to work on the same codebase, we strongly recommend
that each user makes their own full copy of the repository and that any code
merges go through the code hosting service. Our most common pattern for
managing this is to have a development directory for each user in a project,
containing files (including code repositories) that others are expected not to
modify without permission.
git is designed to facilitate a branching code development model, recognising
that branching is an inevitable part of code development and providing the
tools to make merging as easy as possible. We have found that, despite the
apparent initial pain of setting up a separate copy of a repository per user,
embracing this model is much more manageable in the long term.
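One way to lay this out is sketched below. The ``dev/<username>`` directory
convention and the user names ``alice`` and ``bob`` are assumptions made for
illustration, not a platform requirement. A temporary directory and a local
bare repository stand in for the project workspace and the code hosting service
so that the example is self-contained.

```shell
# Stand-ins: a temporary "workspace" and a local bare repo as the "hosted" repo
workspace="$(mktemp -d)"
git init --quiet --bare "$workspace/hosted.git"

# Each user clones into their own development directory (hypothetical names)
# (git warns that the repository is empty; that is expected here)
for user in alice bob; do
    mkdir -p "$workspace/dev/$user"
    git clone --quiet "$workspace/hosted.git" "$workspace/dev/$user/repo"
done

# Switching branch in one copy leaves the other copy's checkout alone
git -C "$workspace/dev/alice/repo" checkout --quiet -b alice-feature
git -C "$workspace/dev/alice/repo" symbolic-ref --short HEAD
git -C "$workspace/dev/bob/repo" symbolic-ref --short HEAD
```

The last two commands print the branch checked out in each copy, showing that
the second copy is still on its original branch.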
Clean output from notebooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~
When working on the platform, you may often find yourself using Jupyter
Notebooks. The interactive, rich output environment of notebooks is great for
the exploratory and iterative nature of data science, but notebooks can result
in confusing diffs in git, which is predominantly aimed at simple code files.
Tools like ``nbdiff`` (which comes preinstalled on Faculty Platform) make it
easier to compare versions of notebooks stored in git, and some code hosting
platforms show rich comparisons of notebooks in their user interface. However,
differences in which cells were run in a notebook, and in which order, can
still result in large or complex diffs because of differences in the stored
outputs.
We therefore recommend that all notebooks committed to git be first cleared of
their output. This can be done either through the Jupyter UI (Cell > All Output
> Clear, and then re-save the notebook) or with the ``nbclean`` command line
tool (requires installation with ``pip`` first).
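As an alternative to the two options above, ``jupyter nbconvert`` has a
``--clear-output`` option that can strip outputs in place. The sketch below
wraps it in a small helper; the function name ``clear_notebook_outputs`` is
hypothetical, and the helper assumes that ``jupyter`` and ``nbconvert`` are
available on your server.

```shell
clear_notebook_outputs() {
    # Usage: clear_notebook_outputs [directory]   (defaults to the current one)
    # Finds every notebook below the directory, skipping Jupyter's checkpoint
    # copies, and rewrites each one in place with its outputs removed.
    find "${1:-.}" -name '*.ipynb' ! -path '*/.ipynb_checkpoints/*' \
        -exec jupyter nbconvert --clear-output --inplace {} \;
}
```

Running ``clear_notebook_outputs .`` from the root of a repository before
staging changes would then clear every notebook it contains.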
Use a separate copy of the repo for jobs and development
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:ref:`jobs-main` and git go well together - with a versioned codebase and jobs'
easily reproducible execution environment and parameters, large parts of the
data science workflow can be made faster and more consistent.
However, care should be taken to ensure that the code run by jobs consistently
uses the expected branch or commit. Running code from a copy of the repository
that you or another data scientist is using for development can be troublesome:
an uncommitted change to the repo, or simply forgetting to switch back to the
main branch, can result in unexpected errors or, worse, subtly different
results.
To avoid these issues, we suggest that jobs use a separate copy of the
repository that remains untouched during development, except to explicitly
update the version of code used. This typically lives in a dedicated directory,
separate from any data scientists' copies of the repo being used for
development.
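A job script can also guard against these mistakes by checking the state of the
repository before running anything. The sketch below is one possible check. The
function name ``check_repo_state`` and the idea of passing the expected branch
as an argument are illustrative assumptions, not a platform feature.

```shell
check_repo_state() {
    # Usage: check_repo_state <path-to-repo> <expected-branch>
    repo="$1"
    expected="$2"

    # Name of the branch currently checked out (fails on a detached HEAD)
    branch="$(git -C "$repo" symbolic-ref --short HEAD)" || return 1
    if [ "$branch" != "$expected" ]; then
        echo "Expected branch '$expected' but found '$branch'" >&2
        return 1
    fi

    # --porcelain prints one line per modified or untracked file
    if [ -n "$(git -C "$repo" status --porcelain)" ]; then
        echo "Repository has uncommitted changes" >&2
        return 1
    fi
}
```

A job could then begin with something like ``check_repo_state <path> main ||
exit 1``, where ``<path>`` is the location of the dedicated copy.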
An even stronger pattern, which requires a bit more setup, is to use a feature
of most code hosting providers called deployment keys. These allow you to
generate a special key that provides read-only access to a single repository.
Using deployment keys, a job can be configured to clone a dedicated copy of the
repository before running its code. This means that the job is guaranteed
to run the expected branch or commit, unaffected by the state of any
repo in the project workspace. In this pattern, we recommend storing the job's
copy of the repo in the ``/tmp`` filesystem, which is only accessible by the
individual job run/subrun. Consult your code hosting provider's documentation
for more information on setting up deployment keys.
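The outline of such a job script might look like the following. This is a
sketch under several assumptions: ``REPO_URL`` and ``REF`` are hypothetical
variables that you would set to your repository's SSH URL and to the branch or
tag to run, ``run.sh`` is a made-up entry point, and the deployment key is
assumed to be configured for SSH already. To keep the example runnable without
credentials, it first builds a small local repository and uses that in place of
the real URL.

```shell
# --- Stand-in remote so the sketch runs without network access ---
src="$(mktemp -d)"
git init --quiet "$src"
git -C "$src" config user.name "Jane Bloggs"
git -C "$src" config user.email "jane@bloggs.com"
echo "echo running job code" > "$src/run.sh"
git -C "$src" add run.sh
git -C "$src" commit --quiet -m "Add job entry point"
git -C "$src" tag v1.0

# --- What the job itself would do ---
REPO_URL="$src"   # in practice, the SSH URL of your repository
REF="v1.0"        # branch or tag the job should run

# Fresh directory under $TMPDIR (usually /tmp), used only by this run
job_dir="$(mktemp -d)"

# Clone the repository and check out the requested branch or tag
git -c advice.detachedHead=false clone --quiet --branch "$REF" \
    "$REPO_URL" "$job_dir/repo"

# Record exactly which commit is being run, then run the code
git -C "$job_dir/repo" rev-parse HEAD
sh "$job_dir/repo/run.sh"
```

Printing the commit hash at the start of each run leaves a record in the job
logs of exactly which version of the code produced the results.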
.. _git-ssh-keys:
Use SSH keys for authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, most code hosting services will recommend that you clone git
repositories over HTTPS, which uses your username and password for
authentication. Not only can this be inconvenient (as you need to re-enter your
password for the code hosting service on a regular basis), but in some
instances a git repository might be configured to always use a particular
username for the code hosting service, which can confuse colleagues on
the project.
We recommend that users instead use SSH keys to authenticate their access to
code hosting providers. SSH is another protocol that can be used by git to
communicate with code hosting services like GitHub and Bitbucket, and uses a
pair of key files (one public, one private and secret) to prove your identity.
Setting up SSH keys requires around 10 minutes of one-off setup by each user,
but in the long term this time investment is well worth it:
1. Create the SSH key
+++++++++++++++++++++
Create a server in Faculty Platform and open the terminal. Run the following
command to generate an SSH key:

.. code-block:: bash

    ssh-keygen -t ed25519 -C faculty

You will be prompted to enter a location to store the SSH key and to enter a
passphrase. Just hit enter to accept the default values - this will generate a
passwordless SSH key in your home directory.
Your home directory is available on all your servers on all projects, but is
not accessible to other users, so can safely be used to store credentials such
as SSH keys that identify you as an individual.
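If you would rather skip the prompts, ``ssh-keygen`` also accepts the file
location (``-f``) and the passphrase (``-N``) as options. The sketch below
writes the key pair to a temporary directory purely so that it cannot overwrite
an existing key; for real use you would point ``-f`` at ``~/.ssh/id_ed25519``.

```shell
# Scratch location for the demonstration key pair
keydir="$(mktemp -d)"

# -f sets the output file, -N "" sets an empty passphrase, -q silences output
ssh-keygen -q -t ed25519 -C faculty -f "$keydir/id_ed25519" -N ""

# Two files are produced: the private key and the public key (.pub)
ls "$keydir"

# Print the fingerprint of the public key
ssh-keygen -l -f "$keydir/id_ed25519.pub"
```

The fingerprint is a short digest of the key, which some code hosting services
display next to uploaded keys so that you can tell them apart.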
2. Copy the public key
++++++++++++++++++++++
Running the above command will generate an SSH key with two parts - a public
part and a private part. The public part can be safely uploaded to GitHub and
Bitbucket (they will use it to verify your identity when using the key), while
the private key is secret and should not be shared (anyone in possession of it
could masquerade as you).
To copy the public key, run the following command to print it to the terminal
and then copy the printed text:

.. code-block:: bash

    cat ~/.ssh/id_ed25519.pub

Make sure you copy the file that includes the ``.pub`` extension - the part
without the extension is the private key, which should not be shared. The text
you copy should look similar to the following, and you should include the
leading ``ssh-ed25519`` part::

    ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICLKIXvyP4jrPGyUrNdvvkKv70TeIQ1f0/QFAwihuiY faculty

3. Add the public key to your code hosting service
++++++++++++++++++++++++++++++++++++++++++++++++++
In order to use this key to authenticate access to your code repositories, you
should add it to the settings of your account on your code hosting service.
Below are links for doing this on commonly used providers:
* `GitHub `_
* `Bitbucket `_
* `GitLab `_
4. Clone repositories using SSH
+++++++++++++++++++++++++++++++
Now that you've set up SSH keys, you can access repositories using SSH. To do
this, you'll need to use SSH URLs for repositories when cloning, pulling and
pushing.
For new repositories, this is easy. All the common git hosting providers have a
"Clone" or "Code" button that provides you with the HTTPS and SSH URLs for each
repo. To clone a repository with SSH, simply use the SSH URL instead of the
HTTPS one. For example, our earlier cloning example would become:

.. code-block:: bash

    git clone git@github.com:facultyai/faculty.git

For existing repositories, you may need to change the URL used by git when
pushing and pulling commits. This can be done with the ``git remote`` command -
open a terminal, navigate to the repository and run:

.. code-block:: bash

    git remote set-url origin GIT-URL

replacing ``GIT-URL`` with the correct SSH URL for the repo. In the earlier
example, this would be:

.. code-block:: bash

    git remote set-url origin git@github.com:facultyai/faculty.git

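To confirm that the change took effect, you can print the URL that git now has
on record for the remote. The sketch below does this in a scratch repository
(a temporary placeholder) so that it runs without touching a real clone; in
your own repository only the last command is needed.

```shell
# Scratch repository whose remote initially uses the HTTPS URL
repo="$(mktemp -d)"
git init --quiet "$repo"
cd "$repo"
git remote add origin https://github.com/facultyai/faculty.git

# Switch the remote to the SSH URL
git remote set-url origin git@github.com:facultyai/faculty.git

# Print the URL now configured for the remote
git remote get-url origin
```

``git remote -v`` gives the same information for all remotes at once, listing
the fetch and push URLs separately.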