Connect to Spark on an external cluster

Using the Apache Livy service, you can connect to an external Spark cluster from Faculty notebooks, apps and APIs.

Note

To enable interaction with the Spark cluster, Apache Livy must be installed on it. Contact your cluster administrator to arrange installation - installation instructions are available in the Apache Livy documentation.

We also strongly recommend using Spark 2, which provides a much easier-to-use interface for data science than Spark 1. Indeed, MLlib, the Spark machine learning library, has already deprecated its RDD-based (Spark 1) interface. You can check your Spark version by running print(sc.version) inside a Spark session, as described below. Contact your cluster administrator to install Spark 2 and configure Apache Livy to use it.
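
For example, once you have created a session with sparkmagic (as described below), a cell like the following prints the cluster's Spark version:

%%spark
print(sc.version)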

Any libraries or other dependencies needed by your code must be installed on the Spark cluster, not on your Faculty server. With sparkmagic or pylivy and Apache Livy, the code you run inside a %spark cell is executed on the external cluster, not in your notebook.

Interactive Spark in notebooks

The sparkmagic package provides Jupyter magics for managing Spark sessions on an external cluster and executing Spark code in them.

To use sparkmagic in your notebooks, install the package with pip install sparkmagic in a terminal or with a Faculty Environment, and load the magics in a notebook with:

%load_ext sparkmagic.magics

This makes available the %spark magic, which is the main entry point for managing sessions and executing code.

Session management

Create a new PySpark session with:

%spark add -s my_session -l python -u https://mycluster.com:8998

The -u/--url option sets the URL where Livy is running - you can get this from your system administrator. You can create R and Scala Spark sessions by setting the -l/--language option to r or scala respectively. For documentation of all options run %spark?.
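
For example, to create a Scala session (using a hypothetical session name and the same placeholder Livy URL):

%spark add -s my_scala_session -l scala -u https://mycluster.com:8998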

You can also list running sessions with:

%spark list

and delete sessions with:

%spark delete sessionname

sparkmagic also provides the %manage_spark command, which displays a widget for managing Spark sessions on the Livy server. You may prefer this to the commands described above.
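
To open the widget, run the command on its own in a cell:

%manage_spark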

Executing code

Once you’ve created a Spark session as above, execute code on the cluster by decorating a cell with the %%spark cell magic (note the two % signs):

%%spark
print('I am being executed on the external cluster')

Note

It’s important to bear in mind the distinction between code executed on the external Spark cluster and code executed in the notebook on your Faculty server. Cells decorated with the %%spark magic are executed on the external cluster and will only see variables that exist there; cells without the magic are executed on your Faculty server and will only see variables that exist there.

If you get errors like NameError: name ‘df’ is not defined, it may be because the variable you meant exists in the other context.
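
As an illustrative sketch (assuming a Spark 2 session; the variable name is hypothetical), a cell executed on the cluster:

%%spark
remote_df = spark.range(10)  # created on the external cluster

followed by a plain cell without the %%spark magic:

print(remote_df)  # raises NameError on the Faculty server, where remote_df does not exist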

Transfer of data between the external cluster and the Faculty notebook must be done explicitly, as described below.

The spark SparkSession (Spark 2 only) and sc SparkContext objects will be inserted into the session automatically. For example, to create a Spark DataFrame from a CSV file in the cluster’s HDFS filesystem:

%%spark
df = spark.read.csv('hdfs:////data/sample_data.csv')

Variables created in one cell will persist in the session and will be available in later cells. For example, we can run a second cell that counts the number of rows in the Spark DataFrame created above:

%%spark
print(df.count())

Any output generated by your code in the cluster will be retrieved and displayed as the output of the notebook cell in Faculty.

Retrieving data

Often, you’ll want to retrieve the contents of a Spark DataFrame from the cluster so you can do additional processing and modelling in your normal Jupyter notebook. You can do this with the -o option:

%spark -o df

This will evaluate and collect the Spark DataFrame df on the external Spark cluster, and save its data into a Pandas DataFrame in your Faculty notebook, also called df.

Note

Using %spark -o will attempt to load all of the values from a Spark DataFrame into memory on your Faculty server. If the DataFrame is very large, as is often the case with Spark DataFrames, this may crash your server by running out of memory!

You can also use -o with the %%spark cell magic. The code below creates a Spark DataFrame called top_ten on the external cluster, then collects it into the Faculty notebook as a Pandas DataFrame, also called top_ten.

%%spark -o top_ten
top_ten = df.limit(10)
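
Once collected, the Pandas DataFrame can be used in ordinary notebook cells on your Faculty server, for example:

top_ten.head()  # runs locally on the Faculty server, not on the cluster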

Run Spark jobs from scripts, apps and APIs

The pylivy package provides tools for managing Spark sessions on an external cluster and executing Spark code in them.

Unlike sparkmagic, pylivy does not depend on being executed from inside a Jupyter notebook, making it suitable for use in scripts, apps and APIs.

To use pylivy, install it with pip install livy in a terminal or with a Faculty Environment.

Usage

pylivy provides the LivySession class, which creates a Spark session and shuts it down automatically when finished. To execute code in the session, pass it as a string to the run() method on the session:

from livy import LivySession

with LivySession('https://mycluster.com:8998') as session:
    session.run('print("foo")')

You may also find the textwrap.dedent() function from the Python standard library useful for writing multiple-line code snippets inline:

from textwrap import dedent

with LivySession('https://mycluster.com:8998') as session:
    session.run(dedent("""
        df = spark.read.csv('hdfs:////data/sample_data.csv')
        top_ten = df.limit(10)
    """))

The read() method on the session allows you to evaluate and retrieve the contents of a Spark DataFrame. Pass it the name of the Spark DataFrame you want to read, and it will return it as a Pandas DataFrame:

with LivySession('https://mycluster.com:8998') as session:
    session.run(dedent("""
        df = spark.read.csv('hdfs:////data/sample_data.csv')
        top_ten = df.limit(10)
    """))
    top_ten_pandas = session.read('top_ten')

# Do something useful with the Pandas dataframe
make_plot(top_ten_pandas)
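
In a script, app or API, you might wrap the session in a small helper function. The sketch below is illustrative only - the function name and query are hypothetical, and it reuses the placeholder Livy URL and HDFS path from above:

from livy import LivySession

def fetch_top_rows(livy_url, n=10):
    """Run a Spark job on the external cluster and return the result as a Pandas DataFrame."""
    with LivySession(livy_url) as session:
        session.run(
            "df = spark.read.csv('hdfs:////data/sample_data.csv')\n"
            "top_rows = df.limit({})".format(n)
        )
        return session.read('top_rows')

# For example, from a script or an API handler:
top_rows = fetch_top_rows('https://mycluster.com:8998')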