Connect to Spark on an external cluster
Using the Apache Livy service, you can connect to an external Spark cluster from Faculty notebooks, apps and APIs.
To enable interaction with the Spark cluster, Apache Livy must be installed. Contact your cluster administrator to arrange installation; documentation on installing Livy is available on the Apache Livy website.
We also strongly recommend using Spark 2, which provides a much easier-to-use interface for data science than Spark 1. Indeed, MLlib, the Spark machine learning library, has already deprecated its RDD-based (Spark 1) interface. You can check your Spark version by running print(sc.version) inside a Spark session, as described below. Contact your cluster administrator to install Spark 2 and configure Apache Livy to use it.
Any libraries or other dependencies needed by your code must be installed on the Spark cluster, not on your Faculty server. When you use sparkmagic or pylivy with Apache Livy, the code you run inside a cell is executed on the external cluster, not in your notebook.
The sparkmagic package provides Jupyter magics for managing Spark sessions on an external cluster and executing Spark code in them.
To use sparkmagic in your notebooks, install the package with
pip install sparkmagic in a terminal or with a Faculty Environment, and load the magics in a notebook with:
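%load_ext sparkmagic.magics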
This makes available the
%spark magic, which is the main entry point for
managing sessions and executing code.
Create a new PySpark session with:
%spark add -s my_session -l python -u https://mycluster.com:8998
The --url option sets the URL where Livy is running; you can get this from your system administrator. You can create R and Scala Spark sessions by setting the --language option to r or scala, as in the example below. For documentation of all the available options, run %spark? in a notebook cell.
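For example, to create an R session against the same Livy server (the session name here is just illustrative):

%spark add -s my_r_session -l r -u https://mycluster.com:8998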
You can also list running sessions with:
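%spark info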
and delete sessions with:
%spark delete sessionname
sparkmagic also provides the %manage_spark command, which returns a widget for managing Spark sessions on the Livy server; you may prefer this to the %spark magic.
Once you've created a Spark session as above, execute a cell on the cluster by decorating it with the %%spark cell magic (note the two percent signs):

%%spark
print('I am being executed on the external cluster')
It's important to bear in mind the distinction between code executed on the external Spark cluster and code executed in the notebook on your Faculty server. Cells that have the %%spark magic are executed on the external cluster, and will only see variables that exist there; cells without that magic are executed on your Faculty server, and will only see variables that exist there.
If you get errors like NameError: name ‘df’ is not defined, it may be because the variable you meant exists in the other context.
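For example, a variable defined in a %%spark cell lives only on the cluster (the variable name below is purely illustrative):

%%spark
cluster_df = spark.range(10)  # this DataFrame exists only on the external cluster

Referring to cluster_df in a cell without the %%spark magic would then raise NameError: name 'cluster_df' is not defined, because no variable of that name exists on your Faculty server.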
Transfer of data between the external cluster and the Faculty notebook must be done explicitly, as described below.
The spark SparkSession (Spark 2 only) and sc SparkContext objects are inserted into the session automatically. For example, to create a Spark DataFrame from a CSV file in the cluster's HDFS filesystem:
%%spark
df = spark.read.csv('hdfs:////data/sample_data.csv')
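The sc SparkContext can be used in the same way. For example, this is where you can run the Spark version check mentioned above:

%%spark
print(sc.version)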
Variables created in one cell persist in the session and are available in later cells. For example, we can run a second cell that counts the number of rows in the Spark DataFrame created above:
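%%spark
df.count()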
Any output generated by your code in the cluster will be retrieved and displayed as the output of the notebook cell in Faculty.
Often, you'll want to retrieve the contents of a Spark DataFrame from the cluster so you can do additional processing and modelling in your normal Jupyter notebook. You can do this with the -o option of the %spark magic:

%spark -o df
This will evaluate and collect the Spark DataFrame df on the external Spark cluster, and save its data into a Pandas DataFrame in your Faculty notebook, also called df.
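The collected df is an ordinary Pandas DataFrame, so you can then use it in cells without the %%spark magic, for example to inspect the first few rows:

df.head()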
%spark -o will attempt to load all of the values from a Spark DataFrame into memory on your Faculty server. If the DataFrame is very large, as is often the case with Spark DataFrames, this may crash your server by running out of memory!
You can also use -o with the %%spark cell magic. The code below creates a Spark DataFrame called top_ten in the external cluster, then collects it into the Faculty notebook as the Pandas DataFrame top_ten:

%%spark -o top_ten
top_ten = df.limit(10)
The pylivy package provides tools for managing Spark sessions on an external cluster and executing Spark code in them.
Unlike sparkmagic, pylivy does not depend on being executed from inside a Jupyter notebook, making it suitable for use in scripts, apps and APIs.
To use pylivy, install it with pip install livy in a terminal or with a Faculty Environment.
pylivy provides the
LivySession class, which creates a Spark session and
shuts it down automatically when finished. To execute code in the session, pass
it as a string to the
run() method on the session:
from livy import LivySession

with LivySession('https://mycluster.com:8998') as session:
    session.run('print("foo")')
You may also find the
textwrap.dedent() function from the Python standard
library useful for writing multiple-line code snippets inline:
from textwrap import dedent

with LivySession('https://mycluster.com:8998') as session:
    session.run(dedent("""
        df = spark.read.csv('hdfs:////data/sample_data.csv')
        top_ten = df.limit(10)
    """))
The read() method on the session allows you to evaluate and retrieve the contents of a Spark DataFrame. Pass it the name of the Spark DataFrame you want to read, and it will return it as a Pandas DataFrame:
with LivySession('https://mycluster.com:8998') as session:
    session.run(dedent("""
        df = spark.read.csv('hdfs:////data/sample_data.csv')
        top_ten = df.limit(10)
    """))
    top_ten_pandas = session.read('top_ten')

    # Do something useful with the Pandas dataframe
    make_plot(top_ten_pandas)