Lower-level interface for performance and flexibility#

Reveal the hidden power of nested Series#

This section is for users who want to optimize the compute and memory performance of their workflows. It also covers the broader set of data representations available in nested-pandas: how to work with individual nested columns (add, remove, and modify data) using both “flat-array” and “list-array” representations, and how to convert nested Series to and from other data types, such as pd.ArrowDtyped Series, flat dataframes, list-array dataframes, and collections of nested elements.

[1]:
import numpy as np
import pandas as pd
import pyarrow as pa

from nested_pandas import NestedDtype
from nested_pandas.datasets import generate_data
from nested_pandas.series.packer import pack

Generate some data and get a Series of NestedDtype type#

We are going to use the built-in data generator to get a NestedFrame with a “nested” column, which is a Series of NestedDtype type. This column represents light curves of some astronomical objects.

[2]:
nested_df = generate_data(4, 3, seed=42)
nested_series = nested_df["nested"]
nested_series[2]
[2]:
t flux flux_error band
0 0.411690 29.214465 1.0 g
1 3.636499 19.967378 1.0 g
2 8.638900 60.754485 1.0 r

Get access to different data views using the .nest accessor#

pandas provides an interface to access Series through custom “accessors”: special attributes that act as a different view on the data. You may already know the .str accessor for strings (https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str) or the .dt accessor for datetime-like data (https://pandas.pydata.org/pandas-docs/stable/reference/series.html#timedelta-methods). pandas 2.x also provides a few accessors for ArrowDtyped Series: .list for list-arrays and .struct for struct-arrays.
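As a quick reminder of how accessors behave, here is a minimal sketch using only the long-standing .str and .dt accessors. The Series contents are made up for illustration.

```python
import pandas as pd

# A string Series: .str exposes vectorized string methods
names = pd.Series(["alpha", "beta"])
upper = names.str.upper()

# A datetime Series: .dt exposes datetime components
stamps = pd.Series(pd.to_datetime(["2024-01-15", "2025-06-30"]))
years = stamps.dt.year

print(upper.tolist())
print(years.tolist())
```

In both cases the accessor is just an attribute of the Series that offers dtype-specific operations, which is the same pattern the .nest accessor follows.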

nested-pandas extends this concept and provides the .nest accessor for NestedDtyped Series, which gives users an object to work with nested data more efficiently and flexibly.

Note: The .nest accessor shares much of its API with the NestedSeries API, because NestedSeries uses the .nest accessor under the hood. As a result, many .nest operations can be used directly on a NestedSeries without invoking “.nest”, but some lower-level functionality remains unique to the .nest accessor.

.nest object is a mapping#

The .nest accessor provides an object implementing the Mapping interface, so you can use it much like an immutable dictionary. The keys of this mapping are the names of the nested columns (fields), and the values are “flat” Series representing the nested data.

The only way to modify the nested data in-place with this interface is to re-assign a whole field with new data of the same length and dtype. See the discussion of the mutability limitations in this GitHub issue.

[3]:
list(nested_series.nest.keys())
[3]:
['t', 'flux', 'flux_error', 'band']

You can also get a list of columns with the .columns attribute:

[4]:
nested_series.nest.columns
[4]:
['t', 'flux', 'flux_error', 'band']

The value for each key is a “flat” Series with a repeated index: each index label of nested_series is repeated once for every element of that row's nested data.

[5]:
nested_series.nest["t"]
[5]:
0      12.0223
0    16.648853
0     6.084845
1    14.161452
1     4.246782
1    10.495129
2      0.41169
2     3.636499
2       8.6389
3    19.398197
3      3.66809
3     5.824583
Name: t, dtype: double[pyarrow]
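To see where the repeated index comes from, the following plain pandas/numpy sketch builds a flat Series by hand from per-row lists. It only illustrates the idea and is not how nested-pandas implements it; the data and index labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-row lists, standing in for one nested column
lists = pd.Series([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]], index=[10, 11, 12])

# Each index label is repeated once per element of its list
lengths = lists.map(len).to_numpy()
flat_index = np.repeat(lists.index.to_numpy(), lengths)

# The flat values are the lists concatenated in order
flat_values = np.concatenate(lists.to_list())

flat = pd.Series(flat_values, index=flat_index, name="t")
print(flat)
```

Selecting a label, for example flat.loc[10], then returns all values that belonged to that original row.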

You can also get a subset of nested columns as a new nested Series:

[6]:
nested_series.nest[["t", "flux"]].dtype
[6]:
nested<t: [double], flux: [double]>

You can add new columns, drop existing ones, or modify existing ones. These operations create a new nested Series, but the remaining fields are only shallow-copied, so the operations are quite efficient.

In-place modification is currently limited to replacing a whole “flat” Series with a new one of the same length and a compatible dtype. When you modify the nested data, only the column you are working with changes; the rest of the data is neither affected nor copied.

[7]:
new_series = nested_series.copy()

# Change the data in-place
new_series.nest["flux"] = new_series.nest["flux"] - new_series.nest["flux"].mean()

# Create a new series with a new column
new_series = new_series.nest.set_column("lsst_band", "lsst_" + new_series.nest["band"])

# Create a new series with a column removed, you can also pass a list of columns to remove
new_series = new_series.nest.drop("band")

# Add a new column with a python list instead of a Series
new_series = new_series.nest.set_column(
    "new_column",
    [1, 2] * (new_series.nest.flat_length // 2),
)

# Add a new column repeating values for each nested element
# It can be useful when you want to move some metadata to the nested data
new_series = new_series.nest.set_filled_column("index_mult_100", new_series.index * 100)

# Create a new series, with a column dtype changed
new_series = new_series.nest.set_column("t", new_series.nest["t"].astype(np.int8))

new_series.nest.to_flat()
[7]:
t flux flux_error lsst_band new_column index_mult_100
0 12 21.335778 1.0 lsst_r 1 0
0 16 5.757487 1.0 lsst_g 2 0
0 6 19.391945 1.0 lsst_r 1 0
1 14 -25.900125 1.0 lsst_g 2 100
1 4 38.668085 1.0 lsst_g 1 100
1 10 -35.20447 1.0 lsst_g 2 100
2 0 -10.635047 1.0 lsst_g 1 200
2 3 -19.882133 1.0 lsst_g 2 200
2 8 20.904974 1.0 lsst_r 1 200
3 19 -3.213327 1.0 lsst_r 2 300
3 3 11.573932 1.0 lsst_g 1 300
3 5 -22.797099 1.0 lsst_g 2 300

Different data views#

The .nest accessor provides a few different views on the data:

  • .to_flat() - get a “flat” pandas data frame with a repeated index; it is effectively a concatenation of all nested elements along the nested axis

  • .to_lists() - get a pandas data frame of nested-array (list-array) Series, where each element is a list of nested elements. The data type is pd.ArrowDtype of a pyarrow list.

Both representations are copy-free, so they can be obtained very efficiently. The only additional overhead when accessing the “flat” representation is the creation of a new repeating index.

[8]:
nested_series.nest.to_flat(["flux", "t"])
[8]:
flux t
0 61.185289 12.0223
0 45.606998 16.648853
0 59.241457 6.084845
1 13.949386 14.161452
1 78.517596 4.246782
1 4.645041 10.495129
2 29.214465 0.41169
2 19.967378 3.636499
2 60.754485 8.6389
3 36.636184 19.398197
3 51.423444 3.66809
3 17.052412 5.824583
[9]:
lists_df = nested_series.nest.to_lists()  # may also accept a list of fields (nested columns) to get
lists_df["t"].list.len()  # here we use pandas' built-in list accessor to get the length of each list
[9]:
0    3
1    3
2    3
3    3
dtype: int32[pyarrow]

List-arrays may be assigned back to the nested Series:

[10]:
# Adjust each time to be relative to the first observation
dt = new_series.nest.to_lists()["t"].apply(lambda t: t - t.min())
new_series = new_series.nest.set_list_column("dt", dt)
new_series.nest.to_flat()
[10]:
t flux flux_error lsst_band new_column index_mult_100 dt
0 12 21.335778 1.0 lsst_r 1 0 6
0 16 5.757487 1.0 lsst_g 2 0 10
0 6 19.391945 1.0 lsst_r 1 0 0
1 14 -25.900125 1.0 lsst_g 2 100 10
1 4 38.668085 1.0 lsst_g 1 100 0
1 10 -35.20447 1.0 lsst_g 2 100 6
2 0 -10.635047 1.0 lsst_g 1 200 0
2 3 -19.882133 1.0 lsst_g 2 200 3
2 8 20.904974 1.0 lsst_r 1 200 8
3 19 -3.213327 1.0 lsst_r 2 300 16
3 3 11.573932 1.0 lsst_g 1 300 0
3 5 -22.797099 1.0 lsst_g 2 300 2

Use familiar pandas masking operations through the .nest accessor#

A popular usage pattern in pandas is filtering DataFrames/Series with boolean masks. For example:

[11]:
nf = generate_data(5, 5, seed=1)
nf[nf["a"] > 0.3]
[11]:
  a b nested
0 0.417022 0.184677
t flux flux_error band
8.38389 31.551563 1.0 r
+4 rows ... ... ...
1 0.720324 0.372520
t flux flux_error band
13.70439 68.650093 1.0 g
+4 rows ... ... ...
3 0.302333 0.793535
t flux flux_error band
17.562349 1.828828 1.0 g
+4 rows ... ... ...
3 rows x 3 columns

In nested-pandas, the same kind of masking can be applied to the nested data through the .nest accessor:

[12]:
nf = generate_data(5, 5, seed=1)
nf["nested.flag"] = True  # Add an extra flag column
nf["nested"].nest[(nf["nested.t"] < 5) & nf["nested.flag"]]
[12]:
0    [{t: 1.966937, flux: 5.336255, flux_error: 1.0...
1    [{t: 1.700884, flux: 67.883553, flux_error: 1....
2    [{t: 4.089045, flux: 83.462567, flux_error: 1....
3    [{t: 2.807739, flux: 78.927933, flux_error: 1....
4    [{t: 0.547752, flux: 75.014431, flux_error: 1....
Name: nested, dtype: nested<t: [double], flux: [double], flux_error: [double], band: [string], flag: [bool]>

The result here is a new nested column, with the masking applied, which we can assign back to the nf NestedFrame if we wish.

[13]:
nf["nested"] = nf["nested"].nest[(nf["nested.t"] < 5) & nf["nested.flag"]]
nf
[13]:
  a b nested
0 0.417022 0.184677
t flux flux_error band flag
1.966937 5.336255 1.0 g True
1 0.720324 0.372520
t flux flux_error band flag
1.700884 67.883553 1.0 r True
2 0.000114 0.691121
t flux flux_error band flag
4.089045 83.462567 1.0 g True
0.781096 21.162812 1.0 g True
3 0.302333 0.793535
t flux flux_error band flag
2.807739 78.927933 1.0 r True
3.396608 26.554666 1.0 g True
4 0.146756 1.077633
t flux flux_error band flag
0.547752 75.014431 1.0 g True
3.962030 10.322601 1.0 g True
5 rows x 3 columns
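Conceptually, a nested mask like the one above selects rows of the flat representation while keeping the top-level index, so each object simply ends up with fewer nested rows. The plain-pandas sketch below mimics that idea on a hand-built flat dataframe; the data are hypothetical, and this is only an analogy, not the nested-pandas implementation.

```python
import pandas as pd

# Hypothetical flat data: object 0 has two rows, object 1 has three
flat = pd.DataFrame(
    {
        "t": [1.0, 7.0, 3.0, 9.0, 2.0],
        "flag": [True, True, False, True, True],
    },
    index=[0, 0, 1, 1, 1],
)

# Same shape of condition as in the nested example
mask = (flat["t"] < 5) & flat["flag"]
filtered = flat[mask]

# Number of nested rows left for each top-level object
rows_per_object = filtered.groupby(level=0).size()

print(filtered)
print(rows_per_object)
```

One difference worth keeping in mind: in this flat analogy an object with no surviving rows disappears from the result, whereas the nested output shown above still lists all five top-level rows.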

Convert to and from nested Series#

We have already seen how the .nest accessor can be used to get different views on the nested data: a “flat” dataframe, and a list-array dataframe with columns of pd.ArrowDtype.

This section is about converting nested Series to and from other data types. If you just need to add a nested column to a NestedFrame, you can do it with the .join_nested() method.

To and from pd.ArrowDtype#

We can convert nested Series to and from pd.ArrowDtyped Series. NestedDtype is close to pd.ArrowDtype for arrow struct-arrays, but it is stricter about the nested data structure. nested-pandas also uses pyarrow struct-arrays as a storage format, where struct fields are list-arrays of the same length. So the conversion is quite straightforward, and doesn’t require any data copying.

[14]:
struct_series = pd.Series(nested_series, dtype=nested_series.dtype.to_pandas_arrow_dtype())
struct_series.struct.field("flux")  # pandas built-in accessor for struct-arrays
[14]:
0    [61.18528947 45.60699842 59.24145689]
1    [13.94938607 78.51759614  4.64504127]
2    [29.21446485 19.96737822 60.75448519]
3    [36.63618433 51.42344384 17.05241237]
Name: flux, dtype: large_list<item: double>[pyarrow]
[15]:
nested_series.equals(pd.Series(struct_series, dtype=NestedDtype.from_pandas_arrow_dtype(struct_series.dtype)))
[15]:
True

pack() function for creating a new nested Series#

nested-pandas provides a pack() function to create a new nested Series from either a sequence of nested elements or a single flat dataframe with a repeated index.

Using pack() to nest a flat dataframe#

You can use pack() to create a nested Series from a flat dataframe with a repeated index, for example one returned by the .nest.to_flat() method.

[16]:
new_series = pack(nested_series.nest.to_flat())
new_series.equals(nested_series)
[16]:
True
[17]:
series_from_flat = pack(
    pd.DataFrame(
        {
            "t": [1, 2, 3, 4, 5, 6],
            "flux": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        },
        index=[0, 0, 0, 0, 1, 1],
    ),
    name="from_flat",  # optional
)
series_from_flat
[17]:
0    [{t: 1, flux: 0.1}; …] (4 rows)
1    [{t: 5, flux: 0.5}; …] (2 rows)
Name: from_flat, dtype: nested<t: [int64], flux: [double]>

Using pack() to nest a collection of elements#

You can also use pack() to create a nested Series from a collection of elements, where each element represents a single row of the nested data. Each element can be one of several supported types, and you can mix them in the same collection:

  • pd.DataFrame

  • dict with items representing the nested columns, all the same length

  • pyarrow.StructScalar with elements being list-arrays of the same length

  • None or pd.NA for missing data

All the elements must have the same columns (fields), but the nested data may have a different length in each element.

[18]:
series_from_pack = pack(
    [
        pd.DataFrame({"t": [1, 2, 3], "flux": [0.1, 0.2, 0.3]}),
        {"t": [4, 5], "flux": [0.4, 0.5]},
        None,
    ],
    name="from_pack",  # optional
    index=[3, 4, 5],  # optional
)
series_from_pack
[18]:
3    [{t: 1, flux: 0.1}; …] (3 rows)
4    [{t: 4, flux: 0.4}; …] (2 rows)
5                               None
Name: from_pack, dtype: nested<t: [int64], flux: [double]>

If we are not happy with the default dtype, we can specify it explicitly. The next section gives more details on how to do this; here we just show an example.

[19]:
series_from_pack = pack(
    [
        pd.DataFrame({"t": [1, 2, 3], "flux": [0.1, 0.2, 0.3]}),
        {"t": [4, 5], "flux": [0.4, 0.5]},
        None,
    ],
    dtype=NestedDtype.from_columns({"t": pa.float64(), "flux": pa.float32()}),
)
series_from_pack
[19]:
0    [{t: 1.0, flux: 0.1}; …] (3 rows)
1    [{t: 4.0, flux: 0.4}; …] (2 rows)
2                                 None
dtype: nested<t: [double], flux: [float]>

Using pd.Series(values, dtype=NestedDtype.from_columns({…}))#

nested-pandas provides a NestedDtype class to create a new nested Series with a given dtype directly. NestedDtype may be built from a list of fields, where each field is a pair of a name and a data type.

This approach lets you create a new nested Series from a variety of data types, but due to pandas interface limitations it requires you to specify a concrete dtype.

pd.Series from a sequence of elements#

This is the same as using the pack() function, but you need to specify the dtype explicitly.

[20]:
series_from_dtype = pd.Series(
    [
        pd.NA,
        pd.DataFrame({"t": [1, 2, 3], "band": ["g", "r", "r"]}),
        {"t": np.array([4, 5]), "band": [None, "r"]},
    ],
    dtype=NestedDtype.from_columns({"t": pa.float64(), "band": pa.string()}),
)
series_from_dtype
[20]:
0                                  None
1     [{t: 1.0, band: 'g'}; …] (3 rows)
2    [{t: 4.0, band: None}; …] (2 rows)
dtype: nested<t: [double], band: [string]>

pyarrow native objects are also supported. Scalars:

[21]:
series_pa_type = pa.struct({"t": pa.list_(pa.float64()), "band": pa.list_(pa.string())})
scalar_pa_type = pa.struct({"t": pa.list_(pa.int16()), "band": pa.list_(pa.string())})
series_from_pa_scalars = pd.Series(
    # Scalars will be cast to the given type
    [
        pa.scalar(None),
        pa.scalar({"t": [1, 2, 3], "band": ["g", None, "r"]}, type=scalar_pa_type),
    ],
    dtype=NestedDtype(series_pa_type),
    name="from_pa_scalars",
    index=[101, -2],
)
series_from_pa_scalars
[21]:
 101                                 None
-2      [{t: 1.0, band: 'g'}; …] (3 rows)
Name: from_pa_scalars, dtype: nested<t: [double], band: [string]>

pd.Series from an array#

Construction with pyarrow struct arrays is the cheapest way to create a nested Series. It is very similar to the initialization of a pd.Series of pd.ArrowDtype type.

[22]:
pa_struct_array = pa.StructArray.from_arrays(
    [
        [
            np.arange(10),
            np.arange(5),
        ],  # "a" field
        [
            np.linspace(0, 1, 10),
            np.linspace(0, 1, 5),
        ],  # "b" field
    ],
    names=["a", "b"],
)
series_from_pa_struct = pd.Series(
    pa_struct_array,
    dtype=NestedDtype(pa_struct_array.type),
    name="from_pa_struct_array",
    index=["I", "II"],
)

Convert nested Series to different data types#

We have already seen how to convert nested Series to pd.ArrowDtyped Series, to a flat dataframe, or to a list-array dataframe. Let’s summarize it here one more time:

[23]:
# Convert to pd.ArrowDtype Series of struct-arrays
arrow_dtyped_series = pd.Series(nested_series, dtype=nested_series.dtype.to_pandas_arrow_dtype())
# Convert to a flat dataframe
flat_df = nested_series.nest.to_flat()
# Convert to a list-array dataframe
list_df = nested_series.nest.to_lists()

Convert to a collection of nested elements#

The single-element representation of a nested Series is a pd.DataFrame, so iterating over a nested Series yields pd.DataFrame objects.

[24]:
for element in nested_series:
    print(element)
           t       flux  flux_error band
0  12.022300  61.185289         1.0    r
1  16.648853  45.606998         1.0    g
2   6.084845  59.241457         1.0    r
           t       flux  flux_error band
0  14.161452  13.949386         1.0    g
1   4.246782  78.517596         1.0    g
2  10.495129   4.645041         1.0    g
          t       flux  flux_error band
0  0.411690  29.214465         1.0    g
1  3.636499  19.967378         1.0    g
2  8.638900  60.754485         1.0    r
           t       flux  flux_error band
0  19.398197  36.636184         1.0    r
1   3.668090  51.423444         1.0    g
2   5.824583  17.052412         1.0    g

Any collection built by iterating over a nested Series has pd.DataFrame objects as elements:

[25]:
nested_elements = list(nested_series)
nested_elements[-1]
[25]:
t flux flux_error band
0 19.398197 36.636184 1.0 r
1 3.668090 51.423444 1.0 g
2 5.824583 17.052412 1.0 g

A nested Series also supports direct conversion to a numpy array of object dtype:

[26]:
nested_series_with_na = pack([None, pd.NA, {"t": [1, 2], "flux": [0.1, None]}])
# Would have None for top-level missing data
np_array1 = np.array(nested_series_with_na)
print(f"{np_array1[0] = }")
np_array1[0] = None
[27]:
# Would have an empty pd.DataFrame for top-level missing data
np_array2 = nested_series_with_na.to_numpy(na_value=pd.DataFrame())
print(f"{np_array2[0] = }")
np_array2[0] = Empty DataFrame
Columns: []
Index: []