Lower-level interface for performance and flexibility#

Reveal the hidden power of nested Series#

This section is for users looking to optimize both the compute and memory performance of their workflows. This section also details a broader suite of data representations usable within nested-pandas. It shows how to deal with individual nested columns: add, remove, and modify data using both “flat-array” and “list-array” representations. It also demonstrates how to convert nested Series to and from different data types, like pd.ArrowDtyped Series, flat dataframes, list-array dataframes, and collections of nested elements.

[1]:

import numpy as np
import pandas as pd
import pyarrow as pa

from nested_pandas import NestedDtype
from nested_pandas.datasets import generate_data
from nested_pandas.series.packer import pack

Generate some data and get a Series of `NestedDtype` type#

We will use the built-in data generator to get a NestedFrame whose “nested” column is a Series of NestedDtype type. This column represents light curves of some astronomical objects.

[2]:

nested_df = generate_data(4, 3, seed=42)
nested_series = nested_df["nested"]
nested_series[2]

[2]:

	t	flux	flux_error	band
0	0.411690	29.214465	1.0	g
1	3.636499	19.967378	1.0	g
2	8.638900	60.754485	1.0	r

Get access to different data views using the `.nest` accessor#

pandas provides an interface to access Series through custom “accessors”: special attributes that act as a different view on the data. You may already know the .str accessor (https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str) for strings or the .dt accessor (https://pandas.pydata.org/pandas-docs/stable/reference/series.html#timedelta-methods) for datetime-like data. Recent 2.x releases of pandas also support a few accessors for ArrowDtyped Series: .list for list-arrays and .struct for struct-arrays.

nested-pandas extends this concept and provides the .nest accessor for NestedDtyped Series, which gives the user an object to work with nested data more efficiently and flexibly.

Note: The .nest accessor shares much of its API with the NestedSeries API, as NestedSeries uses the .nest accessor under the hood. As a result, many .nest operations can be used directly on a NestedSeries, without invoking “.nest”, but some lower-level functionality remains unique to the .nest accessor.

`.nest` object is a mapping#

The .nest accessor provides an object implementing the Mapping interface, so you can use it like an immutable dictionary. The keys of this mapping are the names of the nested columns (fields), and the values are “flat” Series representing the nested data.

The only way to modify the nested data in-place with this interface is to re-assign a whole field with new data of the same length and dtype; see the discussion of the mutability limitations in this GitHub issue.

[3]:

list(nested_series.nest.keys())

[3]:

['t', 'flux', 'flux_error', 'band']

You can also get a list of columns with the .columns attribute:

[4]:

nested_series.nest.columns

[4]:

['t', 'flux', 'flux_error', 'band']

The value of each key is a “flat” Series with a repeated index: the original index of nested_series is repeated for each element of the nested data.

[5]:

nested_series.nest["t"]

[5]:

0      12.0223
0    16.648853
0     6.084845
1    14.161452
1     4.246782
1    10.495129
2      0.41169
2     3.636499
2       8.6389
3    19.398197
3      3.66809
3     5.824583
Name: t, dtype: double[pyarrow]
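The repeated index is what ties every flat value back to its original row, so ordinary pandas group-by operations on the index level recover per-row quantities. A minimal sketch with plain pandas, using a small hand-written Series with rounded, illustrative values rather than the generated data:

```python
import pandas as pd

# A "flat" Series: index label 0 and 1 are each repeated once per observation
flat_t = pd.Series(
    [12.0, 16.6, 6.1, 14.2, 4.2, 10.5],
    index=[0, 0, 0, 1, 1, 1],
    name="t",
)

# Group by the repeated index to get one value per original row
first_observation = flat_t.groupby(level=0).min()
n_observations = flat_t.groupby(level=0).size()

print(first_observation)
print(n_observations)
```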

You can also get a subset of nested columns as a new nested Series:

[6]:

nested_series.nest[["t", "flux"]].dtype

[6]:

nested<t: [double], flux: [double]>

You can add new columns, drop existing ones, or modify existing ones. These operations create a new nested Series, but they only make shallow copies of the untouched fields, so they are quite efficient.

In-place modification is currently limited to replacing a whole “flat” Series with a new one of the same length and a compatible dtype. When you modify the nested data, only the column you are working with changes; the rest of the data is neither affected nor copied.

[7]:

new_series = nested_series.copy()

# Change the data in-place
new_series.nest["flux"] = new_series.nest["flux"] - new_series.nest["flux"].mean()

# Create a new series with a new column
new_series = new_series.nest.set_column("lsst_band", "lsst_" + new_series.nest["band"])

# Create a new series with a column removed, you can also pass a list of columns to remove
new_series = new_series.nest.drop("band")

# Add a new column with a python list instead of a Series
new_series = new_series.nest.set_column(
    "new_column",
    [1, 2] * (new_series.nest.flat_length // 2),
)

# Add a new column repeating values for each nested element
# It can be useful when you want to move some metadata to the nested data
new_series = new_series.nest.set_filled_column("index_mult_100", new_series.index * 100)

# Create a new series, with a column dtype changed
new_series = new_series.nest.set_column("t", new_series.nest["t"].astype(np.int8))

new_series.nest.to_flat()

[7]:

	t	flux	flux_error	lsst_band	new_column	index_mult_100
0	12	21.335778	1.0	lsst_r	1	0
0	16	5.757487	1.0	lsst_g	2	0
0	6	19.391945	1.0	lsst_r	1	0
1	14	-25.900125	1.0	lsst_g	2	100
1	4	38.668085	1.0	lsst_g	1	100
1	10	-35.20447	1.0	lsst_g	2	100
2	0	-10.635047	1.0	lsst_g	1	200
2	3	-19.882133	1.0	lsst_g	2	200
2	8	20.904974	1.0	lsst_r	1	200
3	19	-3.213327	1.0	lsst_r	2	300
3	3	11.573932	1.0	lsst_g	1	300
3	5	-22.797099	1.0	lsst_g	2	300

Different data views#

The .nest accessor provides a few different views on the data:

.to_flat() - get a “flat” pandas dataframe with a repeated index; it is effectively a concatenation of all nested elements along the nested axis
.to_lists() - get a pandas dataframe of nested-array (list-array) Series, where each element is a list of nested elements. The data type is pd.ArrowDtype of a pyarrow list.

Both representations are copy-free, so they can be produced very efficiently. The only additional overhead when accessing the “flat” representation is the creation of a new repeating index.

[8]:

nested_series.nest.to_flat(["flux", "t"])

[8]:

	flux	t
0	61.185289	12.0223
0	45.606998	16.648853
0	59.241457	6.084845
1	13.949386	14.161452
1	78.517596	4.246782
1	4.645041	10.495129
2	29.214465	0.41169
2	19.967378	3.636499
2	60.754485	8.6389
3	36.636184	19.398197
3	51.423444	3.66809
3	17.052412	5.824583

[9]:

lists_df = nested_series.nest.to_lists()  # may also accept a list of fields (nested columns) to get
lists_df["t"].list.len()  # here we use pandas' built-in list accessor to get the length of each list

[9]:

0    3
1    3
2    3
3    3
dtype: int32[pyarrow]

List-arrays may be assigned back to the nested Series:

[10]:

# Adjust each time to be relative to the first observation
dt = new_series.nest.to_lists()["t"].apply(lambda t: t - t.min())
new_series = new_series.nest.set_list_column("dt", dt)
new_series.nest.to_flat()

[10]:

	t	flux	flux_error	lsst_band	new_column	index_mult_100	dt
0	12	21.335778	1.0	lsst_r	1	0	6
0	16	5.757487	1.0	lsst_g	2	0	10
0	6	19.391945	1.0	lsst_r	1	0	0
1	14	-25.900125	1.0	lsst_g	2	100	10
1	4	38.668085	1.0	lsst_g	1	100	0
1	10	-35.20447	1.0	lsst_g	2	100	6
2	0	-10.635047	1.0	lsst_g	1	200	0
2	3	-19.882133	1.0	lsst_g	2	200	3
2	8	20.904974	1.0	lsst_r	1	200	8
3	19	-3.213327	1.0	lsst_r	2	300	16
3	3	11.573932	1.0	lsst_g	1	300	0
3	5	-22.797099	1.0	lsst_g	2	300	2

Use familiar pandas masking operations through the `.nest` accessor#

A popular usage pattern within pandas is the ability to filter DataFrames/Series using boolean masks. For example:

[11]:

nf = generate_data(5, 5, seed=1)
nf[nf["a"] > 0.3]

[11]:

	a	b	nested
0	0.417022	0.184677	[{t: 8.38389, flux: 31.551563, flux_error: 1.0, band: r}; +4 rows]
1	0.720324	0.372520	[{t: 13.70439, flux: 68.650093, flux_error: 1.0, band: g}; +4 rows]
3	0.302333	0.793535	[{t: 17.562349, flux: 1.828828, flux_error: 1.0, band: g}; +4 rows]

3 rows x 3 columns

In nested-pandas, the same kind of masking is available for nested data through the .nest accessor:

[12]:

nf = generate_data(5, 5, seed=1)
nf["nested.flag"] = True  # Add an extra flag column
nf["nested"].nest[(nf["nested.t"] < 5) & nf["nested.flag"]]

[12]:

0    [{t: 1.966937, flux: 5.336255, flux_error: 1.0...
1    [{t: 1.700884, flux: 67.883553, flux_error: 1....
2    [{t: 4.089045, flux: 83.462567, flux_error: 1....
3    [{t: 2.807739, flux: 78.927933, flux_error: 1....
4    [{t: 0.547752, flux: 75.014431, flux_error: 1....
Name: nested, dtype: nested<t: [double], flux: [double], flux_error: [double], band: [string], flag: [bool]>

The result here is a new nested column, with the masking applied, which we can assign back to the nf NestedFrame if we wish.

[13]:

nf["nested"] = nf["nested"].nest[(nf["nested.t"] < 5) & nf["nested.flag"]]
nf

[13]:

	a	b	nested
0	0.417022	0.184677	[{t: 1.966937, flux: 5.336255, flux_error: 1.0, band: g, flag: True}]
1	0.720324	0.372520	[{t: 1.700884, flux: 67.883553, flux_error: 1.0, band: r, flag: True}]
2	0.000114	0.691121	[{t: 4.089045, flux: 83.462567, flux_error: 1.0, band: g, flag: True}; {t: 0.781096, flux: 21.162812, flux_error: 1.0, band: g, flag: True}]
3	0.302333	0.793535	[{t: 2.807739, flux: 78.927933, flux_error: 1.0, band: r, flag: True}; {t: 3.396608, flux: 26.554666, flux_error: 1.0, band: g, flag: True}]
4	0.146756	1.077633	[{t: 0.547752, flux: 75.014431, flux_error: 1.0, band: g, flag: True}; {t: 3.962030, flux: 10.322601, flux_error: 1.0, band: g, flag: True}]

5 rows x 3 columns

Convert to and from nested Series#

We have already seen how the .nest accessor can be used to get different views on the nested data: a “flat” dataframe and a list-array dataframe with columns of pd.ArrowDtype.

This section is about converting nested Series to and from other data types. If you just need to add a nested column to a NestedFrame, you can do it with the .join_nested() method.

To and from `pd.ArrowDtype`#

We can convert nested Series to and from pd.ArrowDtyped Series. NestedDtype is close to pd.ArrowDtype for arrow struct-arrays, but it is stricter about the nested data structure. nested-pandas also uses pyarrow struct-arrays as a storage format, where struct fields are list-arrays of the same length. So the conversion is quite straightforward, and doesn’t require any data copying.

[14]:

struct_series = pd.Series(nested_series, dtype=nested_series.dtype.to_pandas_arrow_dtype())
struct_series.struct.field("flux")  # pandas built-in accessor for struct-arrays

[14]:

0    [61.18528947 45.60699842 59.24145689]
1    [13.94938607 78.51759614  4.64504127]
2    [29.21446485 19.96737822 60.75448519]
3    [36.63618433 51.42344384 17.05241237]
Name: flux, dtype: large_list<item: double>[pyarrow]

[15]:

nested_series.equals(pd.Series(struct_series, dtype=NestedDtype.from_pandas_arrow_dtype(struct_series.dtype)))

[15]:

True

`pack()` function for creating a new nested Series#

nested-pandas provides a pack() function to create a new nested Series from either a single flat dataframe with a repeated index or a sequence of elements.

Using `pack()` to nest a flat dataframe#

You can use pack() to create a nested Series from a flat dataframe with a repeated index, for example the one returned by the .nest.to_flat() method.

[16]:

new_series = pack(nested_series.nest.to_flat())
new_series.equals(nested_series)

[16]:

True

[17]:

series_from_flat = pack(
    pd.DataFrame(
        {
            "t": [1, 2, 3, 4, 5, 6],
            "flux": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        },
        index=[0, 0, 0, 0, 1, 1],
    ),
    name="from_flat",  # optional
)
series_from_flat

[17]:

0    [{t: 1, flux: 0.1}; …] (4 rows)
1    [{t: 5, flux: 0.5}; …] (2 rows)
Name: from_flat, dtype: nested<t: [int64], flux: [double]>
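Conceptually, packing a flat dataframe groups its rows by the repeated index and collects each column into one list per index value. The plain-pandas sketch below mimics that grouping on the same input data; it is only an illustration of the idea and makes no claim about how pack() is implemented.

```python
import pandas as pd

flat = pd.DataFrame(
    {
        "t": [1, 2, 3, 4, 5, 6],
        "flux": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    },
    index=[0, 0, 0, 0, 1, 1],
)

# One row per unique index value, each cell holding a Python list
lists = flat.groupby(level=0).agg(list)
print(lists)
```

The result holds ordinary Python lists in object columns, whereas pack() returns a single Series of NestedDtype.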

Using `pack()` to nest a collection of elements#

You can also use pack() to create a nested Series from a collection of elements, where each element represents a single row of the nested data. Each element can be one of several supported types, and you can mix them in the same collection:

pd.DataFrame
dict with items representing the nested columns, all the same length
pyarrow.StructScalar with elements being list-arrays of the same length
None or pd.NA for missing data

All the elements must have the same columns (fields), but the nested data may have different lengths.

[18]:

series_from_pack = pack(
    [
        pd.DataFrame({"t": [1, 2, 3], "flux": [0.1, 0.2, 0.3]}),
        {"t": [4, 5], "flux": [0.4, 0.5]},
        None,
    ],
    name="from_pack",  # optional
    index=[3, 4, 5],  # optional
)
series_from_pack

[18]:

3    [{t: 1, flux: 0.1}; …] (3 rows)
4    [{t: 4, flux: 0.4}; …] (2 rows)
5                               None
Name: from_pack, dtype: nested<t: [int64], flux: [double]>

If we are not happy with the default dtype, we can specify it explicitly. The next section explains how to do this in more detail; here we just show an example.

[19]:

series_from_pack = pack(
    [
        pd.DataFrame({"t": [1, 2, 3], "flux": [0.1, 0.2, 0.3]}),
        {"t": [4, 5], "flux": [0.4, 0.5]},
        None,
    ],
    dtype=NestedDtype.from_columns({"t": pa.float64(), "flux": pa.float32()}),
)
series_from_pack

[19]:

0    [{t: 1.0, flux: 0.1}; …] (3 rows)
1    [{t: 4.0, flux: 0.4}; …] (2 rows)
2                                 None
dtype: nested<t: [double], flux: [float]>

Using pd.Series(values, dtype=NestedDtype.from_columns({…}))#

nested-pandas provides the NestedDtype class, which lets you create a new nested Series with a given dtype directly. A NestedDtype may be built from a mapping of column names to data types.

This approach lets you create a new nested Series from a variety of data types, but, due to pandas interface limitations, it requires you to specify a concrete dtype.

pd.Series from a sequence of elements#

This is the same as using the pack() function, but you need to specify the dtype explicitly.

[20]:

series_from_dtype = pd.Series(
    [
        pd.NA,
        pd.DataFrame({"t": [1, 2, 3], "band": ["g", "r", "r"]}),
        {"t": np.array([4, 5]), "band": [None, "r"]},
    ],
    dtype=NestedDtype.from_columns({"t": pa.float64(), "band": pa.string()}),
)
series_from_dtype

[20]:

0                                  None
1     [{t: 1.0, band: 'g'}; …] (3 rows)
2    [{t: 4.0, band: None}; …] (2 rows)
dtype: nested<t: [double], band: [string]>

pyarrow native objects are also supported. Scalars:

[21]:

series_pa_type = pa.struct({"t": pa.list_(pa.float64()), "band": pa.list_(pa.string())})
scalar_pa_type = pa.struct({"t": pa.list_(pa.int16()), "band": pa.list_(pa.string())})
series_from_pa_scalars = pd.Series(
    # Scalars will be cast to the given type
    [
        pa.scalar(None),
        pa.scalar({"t": [1, 2, 3], "band": ["g", None, "r"]}, type=scalar_pa_type),
    ],
    dtype=NestedDtype(series_pa_type),
    name="from_pa_scalars",
    index=[101, -2],
)
series_from_pa_scalars

[21]:

 101                                 None
-2      [{t: 1.0, band: 'g'}; …] (3 rows)
Name: from_pa_scalars, dtype: nested<t: [double], band: [string]>

pd.Series from an array#

Construction with pyarrow struct arrays is the cheapest way to create a nested Series. It is very similar to the initialization of a pd.Series of pd.ArrowDtype type.

[22]:

pa_struct_array = pa.StructArray.from_arrays(
    [
        [
            np.arange(10),
            np.arange(5),
        ],  # "a" field
        [
            np.linspace(0, 1, 10),
            np.linspace(0, 1, 5),
        ],  # "b" field
    ],
    names=["a", "b"],
)
series_from_pa_struct = pd.Series(
    pa_struct_array,
    dtype=NestedDtype(pa_struct_array.type),
    name="from_pa_struct_array",
    index=["I", "II"],
)

Convert nested Series to different data types#

We have already seen how to convert nested Series to pd.ArrowDtyped Series, to a flat dataframe, or to a list-array dataframe. Let’s summarize it here one more time:

[23]:

# Convert to pd.ArrowDtype Series of struct-arrays
arrow_dtyped_series = pd.Series(nested_series, dtype=nested_series.dtype.to_pandas_arrow_dtype())
# Convert to a flat dataframe
flat_df = nested_series.nest.to_flat()
# Convert to a list-array dataframe
list_df = nested_series.nest.to_lists()

Convert to a collection of nested elements#

The single-element representation of a nested Series is a pd.DataFrame, so iterating over a nested Series yields pd.DataFrame objects.

[24]:

for element in nested_series:
    print(element)

           t       flux  flux_error band
0  12.022300  61.185289         1.0    r
1  16.648853  45.606998         1.0    g
2   6.084845  59.241457         1.0    r
           t       flux  flux_error band
0  14.161452  13.949386         1.0    g
1   4.246782  78.517596         1.0    g
2  10.495129   4.645041         1.0    g
          t       flux  flux_error band
0  0.411690  29.214465         1.0    g
1  3.636499  19.967378         1.0    g
2  8.638900  60.754485         1.0    r
           t       flux  flux_error band
0  19.398197  36.636184         1.0    r
1   3.668090  51.423444         1.0    g
2   5.824583  17.052412         1.0    g

Any collection built by iterating over a nested Series has pd.DataFrame objects as its elements:

[25]:

nested_elements = list(nested_series)
nested_elements[-1]

[25]:

	t	flux	flux_error	band
0	19.398197	36.636184	1.0	r
1	3.668090	51.423444	1.0	g
2	5.824583	17.052412	1.0	g

Nested Series also support direct conversion to a numpy array of object dtype:

[26]:

nested_series_with_na = pack([None, pd.NA, {"t": [1, 2], "flux": [0.1, None]}])
# Would have None for top-level missing data
np_array1 = np.array(nested_series_with_na)
print(f"{np_array1[0] = }")

np_array1[0] = None

[27]:

# Would have an empty pd.DataFrame for top-level missing data
np_array2 = nested_series_with_na.to_numpy(na_value=pd.DataFrame())
print(f"{np_array2[0] = }")

np_array2[0] = Empty DataFrame
Columns: []
Index: []
