Quickstart#
This notebook provides a brief introduction to nested-pandas, including the motivation and basics for working with the data structure. For more in-depth descriptions, see the other tutorial notebooks.
Installation#
With a valid Python environment, nested-pandas and its dependencies are easy to install using the pip package manager. The following command installs it:
[1]:
# % pip install nested-pandas
Overview#
Nested-pandas is tailored towards efficient analysis of nested data sets. This includes data that would normally be represented in a Pandas DataFrame with multiple rows needed to represent a single “thing”, and therefore with columns whose values are identical across all rows for that item.
As a concrete example, consider an astronomical data set storing information about observations of physical objects, such as stars and galaxies. One way to represent this in pandas is to create one row per observation with an ID column indicating to which physical object the observation corresponds. However, this approach ends up repeating a lot of data over each observation of the same object, such as its location on the sky (RA, dec), its classification, etc. Further, any operation that processes the data as time series requires the user to first perform a (potentially expensive) group-by operation to aggregate all of the data for each object.
Let’s create a flat pandas dataframe with three objects: object 0 has three observations, object 1 has three observations, and object 2 has four observations.
[2]:
import pandas as pd
# Represent nested time series information as a classic pandas dataframe.
my_data_frame = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "ra": [10.0, 10.0, 10.0, 15.0, 15.0, 15.0, 12.1, 12.1, 12.1, 12.1],
        "dec": [0.0, 0.0, 0.0, -1.0, -1.0, -1.0, 0.5, 0.5, 0.5, 0.5],
        "time": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0, 60676.6, 60676.7, 60676.8, 60676.9],
        "brightness": [100.0, 101.0, 99.8, 5.0, 5.01, 4.98, 20.1, 20.5, 20.3, 20.2],
        "band": ["g", "r", "g", "r", "g", "r", "g", "g", "r", "r"],
    }
)
my_data_frame
[2]:
| | id | ra | dec | time | brightness | band |
|---|---|---|---|---|---|---|
| 0 | 0 | 10.0 | 0.0 | 60676.0 | 100.00 | g |
| 1 | 0 | 10.0 | 0.0 | 60677.0 | 101.00 | r |
| 2 | 0 | 10.0 | 0.0 | 60678.0 | 99.80 | g |
| 3 | 1 | 15.0 | -1.0 | 60675.0 | 5.00 | r |
| 4 | 1 | 15.0 | -1.0 | 60676.5 | 5.01 | g |
| 5 | 1 | 15.0 | -1.0 | 60677.0 | 4.98 | r |
| 6 | 2 | 12.1 | 0.5 | 60676.6 | 20.10 | g |
| 7 | 2 | 12.1 | 0.5 | 60676.7 | 20.50 | g |
| 8 | 2 | 12.1 | 0.5 | 60676.8 | 20.30 | r |
| 9 | 2 | 12.1 | 0.5 | 60676.9 | 20.20 | r |
Note that we cannot cleanly compress this by adding more columns (such as t0, t1, and so forth), because the number of observations is not bounded and may vary from object to object.
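As a quick check of both points, the sketch below uses plain pandas on a copy of the relevant columns from the flat frame above. The names `flat`, `distinct_positions`, and `observations_per_object` are chosen here for illustration only.

```python
import pandas as pd

# Only the columns needed for this check, copied from the flat frame above.
flat = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "ra": [10.0, 10.0, 10.0, 15.0, 15.0, 15.0, 12.1, 12.1, 12.1, 12.1],
        "dec": [0.0, 0.0, 0.0, -1.0, -1.0, -1.0, 0.5, 0.5, 0.5, 0.5],
    }
)

# Each object has exactly one distinct ra and one distinct dec,
# so every additional row for that object only repeats them.
distinct_positions = flat.groupby("id")[["ra", "dec"]].nunique()

# The number of observations differs between objects, which is what
# rules out a fixed set of t0, t1, ... columns.
observations_per_object = flat.groupby("id").size()

print(distinct_positions)
print(observations_per_object)
```

Every entry of `distinct_positions` is 1, while `observations_per_object` varies from object to object.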
Beyond astronomical data, we might be interested in tracking patients’ blood pressure over time, measuring the intensity of emitted light at different wavelengths, or storing a list of the types of rock found at different depths of core samples. In each case it is possible to represent this data with multiple rows (such as one row for each patient + measurement pair) and associate them by IDs.
Nested-pandas is designed for exactly this type of data by allowing columns to contain nested data. We can have regular columns with the (single) value for the objects’ unvarying characteristics (location on the sky, patient birth date, location of the core sample) and nested columns for the values of each observation.
Let’s see an example:
[3]:
from nested_pandas.nestedframe import NestedFrame
# Create a nested data set
nf = NestedFrame.from_flat(
    my_data_frame,
    base_columns=["ra", "dec"],  # the columns not to nest
    nested_columns=["time", "brightness", "band"],  # the columns to nest
    on="id",  # column used to associate rows
    name="lightcurve",  # name of the nested column
)
nf
[3]:
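| id | ra | dec | lightcurve |
|---|---|---|---|
| 0 | 10.000000 | 0.000000 | [{time: 60676.0, brightness: 100.0, band: 'g'}; …] (3 rows) |
| 1 | 15.000000 | -1.000000 | [{time: 60675.0, brightness: 5.0, band: 'r'}; …] (3 rows) |
| 2 | 12.100000 | 0.500000 | [{time: 60676.6, brightness: 20.1, band: 'g'}; …] (4 rows) |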
The above dataframe is a NestedFrame, which extends the capabilities of the Pandas DataFrame to support columns with nested information.
We now have the top-level dataframe with 3 rows, each of which corresponds to a single object. The table has three columns beyond “id”. Two columns, “ra” and “dec”, have a single value for the object (in this case the position on the sky). The last column, “lightcurve”, contains a nested table with a series of observation times, brightnesses, and bands for the object. The first row of this nested table is shown along with its dimensions to give a sense of the contents of the nested data. As we will see below, this nested table allows the user to easily access all of the observations for a given object.
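As a mental model only, the split between per-object values and per-observation values can be sketched in plain pandas. This is not how nested-pandas stores its data internally; the names `flat`, `base`, and `per_object` are illustrative assumptions.

```python
import pandas as pd

# A copy of the relevant columns of the flat frame from the overview.
flat = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "ra": [10.0, 10.0, 10.0, 15.0, 15.0, 15.0, 12.1, 12.1, 12.1, 12.1],
        "dec": [0.0, 0.0, 0.0, -1.0, -1.0, -1.0, 0.5, 0.5, 0.5, 0.5],
        "time": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0, 60676.6, 60676.7, 60676.8, 60676.9],
        "brightness": [100.0, 101.0, 99.8, 5.0, 5.01, 4.98, 20.1, 20.5, 20.3, 20.2],
    }
)

# One row per object for the values that never change.
base = flat[["id", "ra", "dec"]].drop_duplicates().set_index("id")

# One small table per object for the values that vary per observation.
per_object = {
    object_id: group[["time", "brightness"]].reset_index(drop=True)
    for object_id, group in flat.groupby("id")
}

print(base)
print(per_object[2])
```

A NestedFrame keeps both pieces in a single object, so the per-object tables travel with their top-level row instead of living in a separate dictionary.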
Accessing Nested Data#
We can inspect the contents of the “lightcurve” column using pandas API tooling like loc.
[4]:
nf.loc[0]["lightcurve"]
[4]:
| | time | brightness | band |
|---|---|---|---|
| 0 | 60676.0 | 100.0 | g |
| 1 | 60677.0 | 101.0 | r |
| 2 | 60678.0 | 99.8 | g |
Here we see that within the “lightcurve” column there are tables with their own data. In this case we have 3 columns (“time”, “brightness”, and “band”) that represent a time series of observations.
Note that loc itself accesses the row, so the combination nf.loc[0]["lightcurve"] means we are looking at the value in the “lightcurve” column for a single row (row 0). If we just use nf.loc[0], we retrieve the entire row, including the nested “lightcurve” column and all other columns. Similarly, if we use nf["lightcurve"], we retrieve the nested column for all rows. What makes the nesting useful is that once we access the nested entry for a specific row, we can treat the value as a table in its own right.
As in Pandas, we can still access individual entries from a column based on the row index. Thus we can access the values (in a table) in row 0 of the nested column as nf["lightcurve"][0] as well.
[5]:
nf["lightcurve"][0]
[5]:
| | time | brightness | band |
|---|---|---|---|
| 0 | 60676.0 | 100.0 | g |
| 1 | 60677.0 | 101.0 | r |
| 2 | 60678.0 | 99.8 | g |
We can also use dot notation to access all the values in a nested sub column:
[6]:
nf["lightcurve.time"]
[6]:
id
0 60676.0
0 60677.0
0 60678.0
1 60675.0
1 60676.5
1 60677.0
2 60676.6
2 60676.7
2 60676.8
2 60676.9
Name: time, dtype: double[pyarrow]
Note that “lightcurve.time” contains the time values for all rows, but also preserves the nesting information. The id index of the returned series maps each value back to the top-level row (in nf) where it resides.
Similarly, we can access the values for a given top-level row by index. To get all the time values for row 0 we could specify:
[7]:
nf["lightcurve.time"][0]
[7]:
id
0 60676.0
0 60677.0
0 60678.0
Name: time, dtype: double[pyarrow]
Here the [0] tells our nested frame to access the values of the series nf["lightcurve.time"] where id = 0. If we try nf["lightcurve.time"][0][0], we again match id = 0 and get the same series back.
To access a single element within the series, we need to use its location:
[8]:
nf["lightcurve.time"][0].iloc[0]
[8]:
60676.0
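The difference between matching an index label and picking a position is ordinary pandas behaviour for a Series with repeated index labels. The sketch below reproduces it without nested-pandas, using a hand-built series (`times` is an illustrative name) and the explicit `.loc` and `.iloc` accessors.

```python
import pandas as pd

# A plain Series whose index repeats the object id, like nf["lightcurve.time"].
times = pd.Series(
    [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0],
    index=pd.Index([0, 0, 0, 1, 1, 1], name="id"),
    name="time",
)

# Label-based selection returns every entry whose index label is 0.
by_label = times.loc[0]

# Selecting label 0 again matches the same three entries.
by_label_again = by_label.loc[0]

# Position-based selection picks exactly one element.
first_value = by_label.iloc[0]

print(by_label)
print(first_value)
```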
Inspecting Nested Frames#
We can inspect the available columns using some custom properties of the NestedFrame.
[9]:
# Shows which columns have nested data
nf.nested_columns
[9]:
['lightcurve']
[10]:
# Provides a dictionary of "base" (top-level) and nested column labels
nf.all_columns
[10]:
{'base': Index(['ra', 'dec', 'lightcurve'], dtype='object'),
'lightcurve': ['time', 'brightness', 'band']}
Pandas Operations#
Nested-pandas extends the Pandas API, meaning any operation you could do in Pandas is available within nested-pandas. However, nested-pandas has additional functionality and tooling to better support working with nested datasets. For example, let’s look at query:
[11]:
# Normal queries work as expected, rejecting rows from the dataframe that don't meet the criteria
nf.query("ra > 11.2")
[11]:
| id | ra | dec | lightcurve |
|---|---|---|---|
| 1 | 15.000000 | -1.000000 | [{time: 60675.0, brightness: 5.0, band: 'r'}; …] (3 rows) |
| 2 | 12.100000 | 0.500000 | [{time: 60676.6, brightness: 20.1, band: 'g'}; …] (4 rows) |
The above query is native pandas. With nested-pandas, however, we can use hierarchical column names to extend query to nested layers.
[12]:
# Applies the query to the nested "lightcurve" column, filtering based on "time > 60676.0"
nf_g = nf.query("lightcurve.time > 60676.0")
nf_g
[12]:
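| id | ra | dec | lightcurve |
|---|---|---|---|
| 0 | 10.000000 | 0.000000 | [{time: 60677.0, brightness: 101.0, band: 'r'}; …] (2 rows) |
| 1 | 15.000000 | -1.000000 | [{time: 60676.5, brightness: 5.01, band: 'g'}; …] (2 rows) |
| 2 | 12.100000 | 0.500000 | [{time: 60676.6, brightness: 20.1, band: 'g'}; …] (4 rows) |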
This query does not affect the rows of the top-level dataframe, but rather applies the query to the “nested” dataframes. If we look at one of them, we can see the effect of the query.
[13]:
# All observations with time <= 60676.0 have been removed
nf_g.loc[0]["lightcurve"]
[13]:
| | time | brightness | band |
|---|---|---|---|
| 0 | 60677.0 | 101.0 | r |
| 1 | 60678.0 | 99.8 | g |
A limited set of functions has been extended in this way so far, with the aim of fully supporting this hierarchical access wherever it applies in the Pandas API.
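For comparison, the same time cut on the flat representation is an ordinary row filter. The sketch below uses only plain pandas on a copy of the “id” and “time” columns (`flat`, `kept`, and `remaining` are illustrative names) and counts how many observations each object keeps.

```python
import pandas as pd

# The id and time columns of the flat frame from the overview.
flat = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "time": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0, 60676.6, 60676.7, 60676.8, 60676.9],
    }
)

# On flat data the query removes individual observation rows.
kept = flat.query("time > 60676.0")

# Count the observations that survive for each object.
remaining = kept.groupby("id").size()
print(remaining)
```

Object 0 keeps two observations, which matches the two rows shown for nf_g.loc[0]["lightcurve"] above.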
The map_rows Function#
Finally, we’ll end with the flexible map_rows function. map_rows works similarly to pandas’ apply, but it is applied row by row and flattens the inputs from nested layers into arrays before passing them to the given function. For example, let’s find the mean brightness for each dataframe in “lightcurve”:
[14]:
import numpy as np
# use hierarchical column names to access the brightness sub-column,
# which is passed as an array to np.mean
# row_container signals how to pass the data to the function, in this case as direct arguments
nf.map_rows(np.mean, "lightcurve.brightness", row_container="args")
[14]:
| id | 0 |
|---|---|
| 0 | 100.266667 |
| 1 | 4.996667 |
| 2 | 20.275000 |
3 rows × 1 columns
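On the flat frame, the same per-object mean requires the explicit group-by step mentioned in the overview. The following plain pandas sketch is a cross-check of the numbers above, not a description of how map_rows is implemented; `flat` and `mean_brightness` are illustrative names.

```python
import numpy as np
import pandas as pd

# The id and brightness columns of the flat frame from the overview.
flat = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "brightness": [100.0, 101.0, 99.8, 5.0, 5.01, 4.98, 20.1, 20.5, 20.3, 20.2],
    }
)

# Group the observations by object, then hand each group's values to np.mean as an array.
mean_brightness = flat.groupby("id")["brightness"].apply(
    lambda values: np.mean(values.to_numpy())
)
print(mean_brightness)
```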
This can be used to apply any custom function you need for your analysis. To illustrate that point further, let’s define a custom function that just returns its inputs.
[15]:
def show_inputs(row):
    return row
Applying some inputs via map_rows, we see how it sends inputs to a given function. The output frame nf_inputs consists of two columns containing the output of the “ra” column and the “lightcurve.time” column.
[16]:
# row_container="dict" passes the data as a dictionary to the function
nf_inputs = nf.map_rows(show_inputs, columns=["ra", "lightcurve.time"], row_container="dict")
nf_inputs
# map_rows returns a dataframe view of the dicts, but within show_inputs the two columns can be accessed as
# row["ra"] and row["lightcurve.time"]
[16]:
| id | ra | lightcurve |
|---|---|---|
| 0 | 10.000000 | [{time: 60676.0}; …] (3 rows) |
| 1 | 15.000000 | [{time: 60675.0}; …] (3 rows) |
| 2 | 12.100000 | [{time: 60676.6}; …] (4 rows) |
[17]:
nf_inputs.loc[0]
[17]:
ra 10.0
lightcurve time
0 60676.0
1 60677.0
2 60678.0
Name: 0, dtype: object
[18]:
# row_container="args" passes the data as arguments to the function
def show_inputs(*args):
    return args


nf_inputs = nf.map_rows(show_inputs, columns=["ra", "lightcurve.time"], row_container="args")
nf_inputs
[18]:
| id | 0 | 1 |
|---|---|---|
| 0 | 10.0 | [60676.0, 60677.0, 60678.0] |
| 1 | 15.0 | [60675.0, 60676.5, 60677.0] |
| 2 | 12.1 | [60676.6, 60676.7, 60676.8, 60676.9] |
3 rows × 2 columns
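The two row_container styles differ only in how one row's values reach the function. The pure-Python sketch below illustrates that calling convention for a single row. The helper `call_with_container` is hypothetical and not part of nested-pandas, and the nested values are written as a plain list here rather than the array a real call would receive.

```python
def call_with_container(func, row, row_container):
    """Hypothetical helper: call func on one row's values in the requested style."""
    if row_container == "dict":
        # The function receives a single mapping keyed by column name.
        return func(row)
    if row_container == "args":
        # The function receives one positional argument per requested column.
        return func(*row.values())
    raise ValueError(f"unknown row_container: {row_container!r}")


def show_row(row):
    return row


def show_args(*args):
    return args


# One top-level row with a base column and a flattened nested sub-column.
row = {"ra": 10.0, "lightcurve.time": [60676.0, 60677.0, 60678.0]}

as_dict = call_with_container(show_row, row, "dict")
as_args = call_with_container(show_args, row, "args")

print(as_dict["lightcurve.time"])
print(as_args[1])
```

Both calls see the same values. The dict style lets the function look columns up by name, while the args style suits functions such as np.mean that take their data positionally.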
Extended Series Operations with NestedSeries#
In addition to the extended API offered by the NestedFrame for DataFrame operations, nested-pandas provides the NestedSeries, which extends Series operations to nested data.
[19]:
# Single columns containing Nested Data are represented as NestedSeries
type(nf["lightcurve"])
[19]:
nested_pandas.series.nestedseries.NestedSeries
[20]:
# It behaves just like a pandas Series
nf["lightcurve"]
[20]:
id
0 [{time: 60676.0, brightness: 100.0, band: 'g'}...
1 [{time: 60675.0, brightness: 5.0, band: 'r'}; ...
2 [{time: 60676.6, brightness: 20.1, band: 'g'};...
Name: lightcurve, dtype: nested<time: [double], brightness: [double], band: [string]>
NestedSeries offers some unique access patterns for getting data:
[21]:
# Accessing sub-columns
nf["lightcurve"]["time"] # Alternative to nf["lightcurve.time"]
[21]:
id
0 60676.0
0 60677.0
0 60678.0
1 60675.0
1 60676.5
1 60677.0
2 60676.6
2 60676.7
2 60676.8
2 60676.9
Name: time, dtype: double[pyarrow]
[22]:
# Multi-selecting sub-columns
nf["lightcurve"][["time", "brightness"]]
[22]:
id
0 [{time: 60676.0, brightness: 100.0}; …] (3 rows)
1 [{time: 60675.0, brightness: 5.0}; …] (3 rows)
2 [{time: 60676.6, brightness: 20.1}; …] (4 rows)
Name: lightcurve, dtype: nested<time: [double], brightness: [double]>
NestedSeries Masking#
[23]:
# Using masks to filter nested data
g_mask = nf["lightcurve"]["band"] == "g"
nf["lightcurve"] = nf["lightcurve"][g_mask]
nf
[23]:
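| id | ra | dec | lightcurve |
|---|---|---|---|
| 0 | 10.000000 | 0.000000 | [{time: 60676.0, brightness: 100.0, band: 'g'}; …] (2 rows) |
| 1 | 15.000000 | -1.000000 | [{time: 60676.5, brightness: 5.01, band: 'g'}] (1 row) |
| 2 | 12.100000 | 0.500000 | [{time: 60676.6, brightness: 20.1, band: 'g'}; …] (2 rows) |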
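The mask above has one boolean per observation, just like a boolean mask on the flat frame. As a cross-check, the plain pandas sketch below applies the equivalent mask to a copy of the “id” and “band” columns and counts the g-band observations left for each object; `flat`, `g_only`, and `g_counts` are illustrative names.

```python
import pandas as pd

# The id and band columns of the flat frame from the overview.
flat = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "band": ["g", "r", "g", "r", "g", "r", "g", "g", "r", "r"],
    }
)

# One boolean per observation row.
g_mask = flat["band"] == "g"

# Keep only the g-band observations and count them per object.
g_only = flat[g_mask]
g_counts = g_only.groupby("id").size()
print(g_counts)
```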