API

lens.summarise

Summarise a Pandas DataFrame

lens.summarise.summarise(df, scheduler='multiprocessing', num_workers=None, size=None, pairdensities=True)

Create a Lens Summary for a Pandas DataFrame.

This creates a Summary instance containing many quantities of interest to a data scientist.

Parameters:
    df : pd.DataFrame
        DataFrame to be analysed.
    scheduler : str, optional
        Dask scheduler to use. Must be one of [distributed, multiprocessing, processes, single-threaded, sync, synchronous, threading, threads].
    num_workers : int or None, optional
        Number of workers in the pool. If the environment variable NUM_CPUS is set that number will be used, otherwise it will use as many workers as CPUs available in the machine.
    size : int, optional
        DataFrame size on disk, which will be added to the report.
    pairdensities : bool, optional
        Whether to compute the pairdensity estimation between all pairs of numerical columns. For most datasets, this is the most expensive computation. Default is True.
Returns:
    summary : Summary
        The computed data summary.
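
The num_workers fallback described above can be sketched with the standard library alone. This is a minimal illustration, not the code lens runs: resolve_num_workers is a hypothetical helper, and it assumes that an explicitly passed value takes precedence over NUM_CPUS, which the description does not state.

```python
import multiprocessing
import os


def resolve_num_workers(num_workers=None, environ=None):
    """Hypothetical helper mirroring the documented num_workers fallback."""
    environ = os.environ if environ is None else environ
    if num_workers is not None:
        # Assumption: an explicit argument wins over the environment.
        return int(num_workers)
    if "NUM_CPUS" in environ:
        # Documented behaviour: use NUM_CPUS when it is set.
        return int(environ["NUM_CPUS"])
    # Documented behaviour: otherwise use every CPU on the machine.
    return multiprocessing.cpu_count()


print(resolve_num_workers(4, environ={}))
print(resolve_num_workers(None, environ={"NUM_CPUS": "2"}))
```

Passing the environment as an argument keeps the sketch easy to exercise without modifying the real process environment.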

Examples

Let’s explore the wine quality dataset.

>>> import pandas as pd
>>> import lens
>>> url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"  # noqa
>>> wines_df = pd.read_csv(url, sep=';')
>>> summary = lens.summarise(wines_df)


Now that we have a Summary instance we can inspect the shape of the dataset

>>> summary.columns
['fixed acidity',
'volatile acidity',
'citric acid',
'residual sugar',
'chlorides',
'free sulfur dioxide',
'total sulfur dioxide',
'density',
'pH',
'sulphates',
'alcohol',
'quality']
>>> summary.rows
4898


So far, nothing groundbreaking. Let’s look at the quality column:

>>> summary.summary('quality')
{'desc': 'categorical',
'dtype': 'int64',
'name': 'quality',
'notnulls': 4898,
'nulls': 0,
'unique': 7}


This tells us that there are seven unique values in the quality column, and zero null values. It also tells us that lens will treat this column as categorical. Let’s look at this in more detail:

>>> summary.details('quality')
{'desc': 'categorical',
'frequencies': {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5},
'iqr': 1.0,
'max': 9,
'mean': 5.8779093507554103,
'median': 6.0,
'min': 3,
'name': 'quality',
'std': 0.88563857496783116,
'sum': 28790}


This tells us that the median wine quality is 6 and the standard deviation is less than one. Let’s now get the correlation between the quality column and the alcohol column:

>>> summary.pair_details('quality', 'alcohol')['correlation']
{'pearson': 0.4355747154613688, 'spearman': 0.4403691816246831}


Thus, the Spearman Rank Correlation coefficient between these two columns is 0.44.

class lens.summarise.Summary(report)

A summary of a pandas DataFrame.

Create a summary instance by calling lens.summarise.summarise() on a DataFrame. This calculates several quantities of interest to data scientists.

The Summary object is designed for programmatic use. For more direct visual inspection, use the lens.explorer.Explorer class in a Jupyter notebook.

Attributes:
    columns
        Get a list of column names of the dataset.
    rows
        Get the number of rows in the dataset.
    rows_unique
        Get the number of unique rows in the dataset.

Methods

    cdf(self, column)
        Approximate cdf for column
    correlation_matrix(self[, include, exclude])
        Correlation matrix for numeric columns
    details(self, column)
        Type-specific information for a column
    from_json(file)
        Create a Summary from a report saved in JSON format.
    histogram(self, column)
        Return the histogram for column.
    kde(self, column)
        Return a Kernel Density Estimate for column.
    pair_details(self, first, second)
        Get pairwise information for a column pair.
    pdf(self, column)
        Approximate pdf for column
    summary(self, column)
        Basic information about the column
    tdigest(self, column)
        Return a TDigest object approximating the distribution of a column
    tdigest_centroids(self, column)
        Get TDigest centroids and counts for column.
    to_json(self[, file])
        Produce a JSON serialization of the report.

cdf(self, column)

Approximate cdf for column

This returns a function representing the cdf of a numeric column.

Parameters:
    column : str
        Name of the column.
Returns:
    cdf : function
        Function representing the cdf.

Examples

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> cdf = summary.cdf('chlorides')
>>> min_value = summary.details('chlorides')['min']
>>> max_value = summary.details('chlorides')['max']
>>> xs = np.linspace(min_value, max_value, 200)
>>> plt.plot(xs, cdf(xs))
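
To make the idea of "a function representing the cdf" concrete, here is a small sketch that builds an exact empirical CDF from raw values. It is only an illustration: lens returns an approximation computed from its own summary data, and make_ecdf and the sample values are hypothetical.

```python
from bisect import bisect_right


def make_ecdf(values):
    """Return a function mapping x to the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return ecdf


# Toy measurements, not taken from the wine dataset.
ecdf = make_ecdf([0.03, 0.05, 0.04, 0.05, 0.08])
print(ecdf(0.05))  # 4 of the 5 values are <= 0.05, so this prints 0.8
```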

columns

Get a list of column names of the dataset.

Returns: list Column names

Examples

>>> summary.columns
['fixed acidity',
'volatile acidity',
'citric acid',
'residual sugar',
'chlorides',
'free sulfur dioxide',
'total sulfur dioxide',
'density',
'pH',
'sulphates',
'alcohol',
'quality']

correlation_matrix(self, include=None, exclude=None)

Correlation matrix for numeric columns

Parameters:
    include : list of strings, optional
        List of numeric columns to include. Includes all columns by default.
    exclude : list of strings, optional
        List of numeric columns to exclude. Excludes no columns by default.
Returns:
    columns : list of strings
        List of column names
    correlation_matrix : 2D array of floats
        The correlation matrix, ordered such that correlation_matrix[i, j] is the correlation between columns[i] and columns[j]

Notes

The columns are ordered through hierarchical clustering. Thus, neighbouring columns in the output will be more correlated.

details(self, column)

Type-specific information for a column

The details method returns additional information on column, beyond that provided by the summary method. If column is numeric, this returns summary statistics. If it is categorical, it returns a dictionary of how often each category occurs.

Parameters:
    column : str
        Column name
Returns:
    dict
        Dictionary of detailed information.

Examples

>>> summary.details('alcohol')
{'desc': 'numeric',
'iqr': 1.9000000000000004,
'max': 14.199999999999999,
'mean': 10.514267047774602,
'median': 10.4,
'min': 8.0,
'name': 'alcohol',
'std': 1.2306205677573181,
'sum': 51498.880000000005}

>>> summary.details('quality')
{'desc': 'categorical',
'frequencies':
{3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5},
'iqr': 1.0,
'max': 9,
'mean': 5.8779093507554103,
'median': 6.0,
'min': 3,
'name': 'quality',
'std': 0.88563857496783116,
'sum': 28790}
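
For a categorical column with integer values, the summary statistics in this output follow directly from the frequency table. The sketch below recomputes the sum and mean from the frequencies shown above. It also computes a standard deviation under the assumption that the sample estimator (n - 1 denominator) is used, which is not stated in this documentation.

```python
import math

# Frequency table copied from the details('quality') output above.
frequencies = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

count = sum(frequencies.values())
total = sum(value * n for value, n in frequencies.items())
mean = total / count

# Assumption: sample standard deviation, i.e. n - 1 in the denominator.
squared_dev = sum(n * (value - mean) ** 2 for value, n in frequencies.items())
std = math.sqrt(squared_dev / (count - 1))

print(count, total)  # 4898 28790
print(mean, std)
```

The count matches the number of rows and the total matches the 'sum' entry, so the frequency table and the numeric fields describe the same data.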

static from_json(file)

Create a Summary from a report saved in JSON format.

Parameters:
    file : str or buffer
        Path to file containing the JSON report or buffer from which the report can be read.
Returns:
    Summary
        Summary object containing the summary in the JSON file.

histogram(self, column)

Return the histogram for column.

This function returns a histogram for the column. The number of bins is estimated through the Freedman-Diaconis rule.

Parameters:
    column : str
        Name of the column
Returns:
    counts : array
        Counts for each of the bins of the histogram.
    bin_edges : array
        Edges of the bins in the histogram. Length is length(counts)+1.
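
As an illustration of the Freedman-Diaconis rule mentioned above, the following sketch computes a bin count from raw values using only the standard library. It shows the textbook rule (bin width 2 * IQR / n^(1/3)) and is not lens's implementation; freedman_diaconis_bins and the toy data are hypothetical.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Number of histogram bins suggested by the Freedman-Diaconis rule."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return 1  # degenerate spread: fall back to a single bin
    width = 2 * iqr / n ** (1 / 3)
    return max(1, math.ceil((max(values) - min(values)) / width))


data = list(range(1, 101))  # toy data: the integers 1..100
n_bins = freedman_diaconis_bins(data)

# Equal-width edges: one more edge than there are bins.
lo, hi = min(data), max(data)
bin_edges = [lo + i * (hi - lo) / n_bins for i in range(n_bins + 1)]
print(n_bins, len(bin_edges))  # 5 6
```

The edge list has one more entry than the number of bins, matching the length relation between counts and bin_edges given above.
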
kde(self, column)

Return a Kernel Density Estimate for column.

This function returns a KDE for the column. It is computed between the minimum and maximum values of the column and uses Scott’s rule to compute the bandwidth.

Parameters:
    column : str
        Name of the column
Returns:
    x : array
        Values at which the KDE has been evaluated.
    y : array
        Values of the KDE.
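
A one-dimensional Gaussian KDE evaluated between the minimum and maximum of the data can be sketched as follows. The bandwidth uses Scott's rule in the form adopted by SciPy's gaussian_kde for 1D data (sample standard deviation times n^(-1/5)); whether lens uses exactly this form is an assumption. scott_kde, grid_size and the sample values are hypothetical.

```python
import math
import statistics


def scott_kde(values, grid_size=50):
    """Evaluate a Gaussian KDE on an even grid from min(values) to max(values)."""
    n = len(values)
    # Scott's rule for one dimension: h = std * n ** (-1 / 5).
    bandwidth = statistics.stdev(values) * n ** (-1 / 5)
    lo, hi = min(values), max(values)
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    ys = [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm
        for x in xs
    ]
    return xs, ys


# Toy measurements, not taken from the wine dataset.
values = [0.03, 0.04, 0.05, 0.05, 0.08, 0.12]
xs, ys = scott_kde(values)
print(len(xs), len(ys))  # 50 50
```

The two returned lists play the roles of x and y in the return value described above.
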
pair_details(self, first, second)

Get pairwise information for a column pair.

The information returned depends on the types of the two columns. It may contain the following keys.

correlation
dictionary with the Spearman rank correlation coefficient and Pearson product-moment correlation coefficient between the columns. This is returned when both columns are numeric.
pairdensity
dictionary with an estimate of the pairwise density between the columns. The density is a 2D KDE estimate if both columns are numerical, several 1D KDE estimates (grouped by the categorical column) if one column is categorical and the other numerical, or a cross-tabulation if both columns are categorical.

Parameters:
    first : str
        Name of the first column.
    second : str
        Name of the second column.
Returns:
    dict
        Dictionary of pairwise information.

Examples

>>> summary.pair_details('chlorides', 'quality')
{'correlation': {
'pearson': -0.20993441094675602,
'spearman': -0.31448847828244203},
'pairdensity': {
'density': <2d numpy array>,
'x': <1d numpy array of x-values>,
'y': <1d numpy array of y-values>,
'x_scale': 'linear',
'y_scale': 'cat'}
}

>>> summary.pair_details('alcohol', 'chlorides')
{'correlation': {
'pearson': -0.36018871210816106,
'spearman': -0.5708064071153713},
'pairdensity': {
'density': <2d numpy array>,
'x': <1d numpy array of x-values>,
'y': <1d numpy array of y-values>,
'x_scale': 'linear',
'y_scale': 'linear'}
}
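
The two coefficients reported under the correlation key measure different things: Pearson captures linear association, while Spearman is the Pearson coefficient of the ranks and so captures any monotonic association. The standard-library sketch below illustrates the difference on toy data. The helper names are hypothetical and this is not the code lens uses.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x ** 3: monotonic but not linear
print(pearson(x, y), spearman(x, y))
```

For this toy pair the Spearman coefficient is 1 because the relationship is perfectly monotonic, while the Pearson coefficient is below 1 because it is not linear.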

pdf(self, column)

Approximate pdf for column

This returns a function representing the pdf of a numeric column.

Parameters:
    column : str
        Name of the column.
Returns:
    pdf : function
        Function representing the pdf.

Examples

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> pdf = summary.pdf('chlorides')
>>> min_value = summary.details('chlorides')['min']
>>> max_value = summary.details('chlorides')['max']
>>> xs = np.linspace(min_value, max_value, 200)
>>> plt.plot(xs, pdf(xs))

rows

Get the number of rows in the dataset.

Returns: int Number of rows

Examples

>>> summary.rows
4898

rows_unique

Get the number of unique rows in the dataset.

Returns: int Number of unique rows.
summary(self, column)

This returns information about the number of nulls and unique values in column as well as which type this column is. This is guaranteed to return a dictionary with the same keys for every column.

The dictionary contains the following keys:

desc
the type of data: currently categorical or numeric. Lens will calculate different quantities for this column depending on the value of desc.
dtype
the type of data in Pandas.
name
column name
notnulls
number of non-null values in the column
nulls
number of null values in the column
unique
number of unique values in the column

Parameters:
    column : str
        Column name
Returns:
    dict
        Dictionary of summary information.

Examples

>>> summary.summary('quality')
{'desc': 'categorical',
'dtype': 'int64',
'name': 'quality',
'notnulls': 4898,
'nulls': 0,
'unique': 7}

>>> summary.summary('chlorides')
{'desc': 'numeric',
'dtype': 'float64',
'name': 'chlorides',
'notnulls': 4898,
'nulls': 0,
'unique': 160}

tdigest(self, column)

Return a TDigest object approximating the distribution of a column

Documentation for the TDigest class can be found at https://github.com/CamDavidsonPilon/tdigest.

Parameters:
    column : str
        Name of the column.
Returns:
    tdigest.TDigest
        TDigest instance computed from the values of the column.

tdigest_centroids(self, column)

Get TDigest centroids and counts for column.

Parameters:
    column : str
        Name of the column.
Returns:
    numpy.array
        Means of the TDigest centroids.
    numpy.array
        Counts for each of the TDigest centroids.

to_json(self, file=None)

Produce a JSON serialization of the report.

Parameters:
    file : str or buffer, optional
        File name or writeable buffer to save the JSON report. If omitted, a string containing the report will be returned.
Returns:
    str
        JSON serialization of the summary report
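
The "file name, buffer, or nothing" contract described above is a common pattern. The sketch below shows one way such a dispatch could look, using a plain dictionary in place of a real report. dump_report is a hypothetical stand-in, not lens code, and its return value when a file is given (None) is an assumption.

```python
import io
import json


def dump_report(report, file=None):
    """Hypothetical stand-in for the documented to_json behaviour."""
    text = json.dumps(report)
    if file is None:
        # No target given: hand the serialization back as a string.
        return text
    if isinstance(file, str):
        # A string is treated as a file name.
        with open(file, "w") as handle:
            handle.write(text)
    else:
        # Anything else is assumed to be a writeable text buffer.
        file.write(text)
    return None


report = {"rows": 4898}  # toy report, not the real lens format
as_string = dump_report(report)

buffer = io.StringIO()
dump_report(report, buffer)
print(as_string == buffer.getvalue())  # True
```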

lens.explorer

Explore a Summary

class lens.explorer.Explorer(summary, plot_renderer=<function _render>)

An explorer to visualise a Lens Summary

Once a Lens Summary has been generated with lens.summarise.summarise(), this class provides the methods necessary to explore the summary through tables and plots. It is best used from within a Jupyter notebook.

Methods

    cdf_plot(self, column)
        Plot the empirical cumulative distribution function of a column.
    column_details(self, column[, sort])
        Show type-specific column details.
    correlation(self[, include, exclude])
        Show the correlation matrix for numeric columns.
    correlation_plot(self[, include, exclude])
        Plot the correlation matrix for numeric columns
    crosstab(self, column1, column2)
        Show a contingency table of two categorical columns.
    describe(self)
        General description of the dataset.
    distribution(self, column)
        Show properties of the distribution of values in the column.
    distribution_plot(self, column[, bins])
        Plot the distribution of a numeric column.
    pairwise_density_plot(self, column1, column2)
        Plot the pairwise density between two columns.

cdf_plot(self, column)

Plot the empirical cumulative distribution function of a column.

Creates a plotly plot with the empirical CDF of a column.

Parameters:
    column : str
        Name of the column.

column_details(self, column, sort=False)

Show type-specific column details.

For numeric columns, this method produces a table with summary statistics, including minimum, maximum, mean, and median. For categorical columns, it produces a frequency table for each category sorted in descending order of frequency.

Parameters:
    column : str
        Name of the column.
    sort : boolean, optional
        Sort frequency tables in categorical variables by category name.

correlation(self, include=None, exclude=None)

Show the correlation matrix for numeric columns.

Print a Spearman rank order correlation coefficient matrix in tabular form, showing the correlation between columns. The matrix is reordered to group together columns that have a higher correlation coefficient. The columns to be shown in the table can be selected through either the include or exclude keyword arguments. Only one of them can be given.

Parameters:
    include : list of str
        List of columns to include in the correlation matrix.
    exclude : list of str
        List of columns to exclude from the correlation matrix.

correlation_plot(self, include=None, exclude=None)

Plot the correlation matrix for numeric columns

Plot a Spearman rank order correlation coefficient matrix showing the correlation between columns. The matrix is reordered to group together columns that have a higher correlation coefficient. The columns to be plotted in the correlation plot can be selected through either the include or exclude keyword arguments. Only one of them can be given.

Parameters:
    include : list of str
        List of columns to include in the correlation plot.
    exclude : list of str
        List of columns to exclude from the correlation plot.

crosstab(self, column1, column2)

Show a contingency table of two categorical columns.

Print a contingency table for two categorical variables showing the multivariate frequency distribution of the columns.

Parameters:
    column1 : str
        First column.
    column2 : str
        Second column.
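
For readers unfamiliar with contingency tables, the sketch below builds one from two toy categorical columns with the standard library. It only illustrates what the table contains; contingency_table and the sample data are hypothetical and unrelated to how lens computes or renders its output.

```python
from collections import Counter


def contingency_table(first, second):
    """Count how often each pair of categories occurs together."""
    counts = Counter(zip(first, second))
    rows = sorted(set(first))
    cols = sorted(set(second))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Toy categorical columns of equal length.
colour = ["white", "white", "red", "red", "white"]
grade = ["good", "bad", "good", "good", "good"]

rows, cols, table = contingency_table(colour, grade)
print(rows)   # ['red', 'white']
print(cols)   # ['bad', 'good']
print(table)  # [[0, 2], [1, 2]]
```

Each cell holds the number of rows in which the row category and the column category appear together, so the cells sum to the number of observations.
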
describe(self)

General description of the dataset.

Produces a table including the following information about each column:

desc
the type of data: currently categorical or numeric. Lens will calculate different quantities for this column depending on the value of desc.
dtype
the type of data in Pandas.
name
column name
notnulls
number of non-null values in the column
nulls
number of null values in the column
unique
number of unique values in the column
distribution(self, column)

Show properties of the distribution of values in the column.

Parameters:
    column : str
        Name of the column.

distribution_plot(self, column, bins=None)

Plot the distribution of a numeric column.

Create a plotly plot with a histogram of the values in a column. The number of bins in the histogram is decided according to the Freedman-Diaconis rule unless given by the bins parameter.

Parameters:
    column : str
        Name of the column.
    bins : int, optional
        Number of bins to use for histogram. If not given, the Freedman-Diaconis rule will be used to estimate the best number of bins. This argument also accepts the formats taken by the bins parameter of matplotlib.pyplot.hist.

pairwise_density_plot(self, column1, column2)

Plot the pairwise density between two columns.

This plot is an approximation of a scatterplot through a 2D Kernel Density Estimate for two numerical variables. When one of the variables is categorical, a 1D KDE for each of the categories is shown, normalised to the total number of non-null observations. For two categorical variables, the plot produced is a heatmap representation of the contingency table.

Parameters:
    column1 : str
        First column.
    column2 : str
        Second column.