GroupBy for NestedPandas#

This notebook explores how Pandas’ built-in groupby interacts with NestedPandas structures.

Because Nested-Pandas extends the Pandas library, the native pandas.DataFrame.groupby works with nested-pandas out of the box, with some caveats described below.

[1]:

# Example NestedFrame (nf) used throughout this notebook
from nested_pandas.datasets import generate_data

nf = generate_data(5, 10, seed=1)
nf["c"] = [0, 0, 1, 1, 1]
nf

[1]:

	a	b	nested (first row: t, flux, flux_error, band)	c
0	0.417022	0.184677	8.38389, 10.233443, 1.0, g (+9 rows)	0
1	0.720324	0.372520	13.70439, 41.405599, 1.0, g (+9 rows)	0
2	0.000114	0.691121	4.089045, 69.440016, 1.0, g (+9 rows)	1
3	0.302333	0.793535	17.562349, 41.417927, 1.0, g (+9 rows)	1
4	0.146756	1.077633	0.547752, 4.995346, 1.0, r (+9 rows)	1

5 rows × 4 columns

groupby works on non-nested (base) columns and returns a pandas DataFrameGroupBy object.

Grouping by a nested column does not work: nested values are mutable objects and are therefore unhashable.

Use base columns as group keys, or extract scalar identifiers from the nested data first.

[2]:

nf.groupby("c")  # returns a Pandas GroupBy object

[2]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7a14f7d96210>
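
To illustrate the hashability point, here is a minimal plain-pandas sketch in which a list-valued column stands in for a nested column. This stand-in is an assumption made to keep the example self-contained, and the column names obs and n_obs are hypothetical. Grouping by the list column fails, while grouping by a scalar derived from it works:

```python
import pandas as pd

# Plain-pandas stand-in: the list-valued "obs" column plays the role
# of a nested column (hypothetical data, not from generate_data).
df = pd.DataFrame(
    {
        "c": [0, 0, 1],
        "obs": [[1.0, 2.0], [3.0], [4.0, 5.0]],
    }
)

# Lists are unhashable, so they cannot serve as group keys.
try:
    df.groupby("obs").size()
    grouping_failed = False
except TypeError as err:
    grouping_failed = True
    print(f"Cannot group by a list-valued column: {err}")

# Derive a hashable scalar from each cell and group by that instead.
df["n_obs"] = df["obs"].map(len)
sizes = df.groupby("n_obs").size()
print(sizes)
```

The same idea should carry over to a NestedFrame: compute a scalar per base row from the nested data, store it as a base column, and use that column as the group key.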

Basic Aggregations#

Some built-in methods, such as count, run but may not do what you expect: they treat each nested entry as a single object rather than counting the rows inside it.
Others (min, max, mean) fail on nested columns.
describe, by contrast, works as expected because the nested column is flattened automatically and its numeric sub-columns are summarized.

[3]:

# count is viewing nested columns as single objects
nf.groupby("c").count()

[3]:

	a	b	nested
c
0	2	2	2
1	3	3	3

2 rows × 3 columns

[4]:

# min/max/mean fail on nested columns
try:
    grouped_min = nf.groupby("c").min()
    print(grouped_min)
except TypeError as e:
    print(f"Cannot compute min on nested columns: {e}")

Cannot compute min on nested columns: agg function failed [how->min,dtype->nested<t: [double], flux: [double], flux_error: [double], band: [string]>]

[5]:

# describe works as expected with automatic flattened nested column
nf.groupby("c").describe()

[5]:

	a								b								nested.t								nested.flux								nested.flux_error
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
c
0	2.0	0.568673	0.214467	0.417022	0.492848	0.568673	0.644499	0.720324	2.0	0.278599	0.132825	0.184677	0.231638	0.278599	0.325560	0.372520	20.0	10.881513	6.240902	0.387339	7.83715	12.445851	15.226208	19.777222	20.0	51.891513	32.136814	1.582124	21.910878	53.147725	88.645112	94.948926	20.0	1.0	0.0	1.0	1.0	1.0	1.0	1.0
1	3.0	0.149734	0.151131	0.000114	0.073435	0.146756	0.224544	0.302333	3.0	0.854097	0.200247	0.691121	0.742328	0.793535	0.935584	1.077633	30.0	8.700798	6.111402	0.365766	3.537964	6.070383	13.957988	19.157791	30.0	57.975918	27.028715	0.287033	40.029183	60.184998	75.090985	99.732285	30.0	1.0	0.0	1.0	1.0	1.0	1.0	1.0

2 rows × 40 columns

Type Preservation#

Within each group, the data is still a NestedFrame and its nested columns are still NestedSeries.

We can check this by applying a custom function to our two-group groupby object:

[6]:

# check the type
def type_check(df):
    print("Group DataFrame Type:", type(df))
    print("Nested Column Type:", type(df["nested"]))
    print()
    # return df


nf.groupby("c").apply(type_check, include_groups=False)

Group DataFrame Type: <class 'nested_pandas.nestedframe.core.NestedFrame'>
Nested Column Type: <class 'nested_pandas.series.nestedseries.NestedSeries'>

Group DataFrame Type: <class 'nested_pandas.nestedframe.core.NestedFrame'>
Nested Column Type: <class 'nested_pandas.series.nestedseries.NestedSeries'>

[6]:

0 rows × 0 columns

Note that when accessing a row of a group with .iloc[], integer indexing and slice-based indexing return different types.

For a NestedFrame, integer indexing (.iloc[0]) collapses the first row into a 1-D pandas.Series in which the nested value is stored as a DataFrame. Slice-based indexing (.iloc[0:1]) preserves the nested structure: the result is still a NestedFrame, and its nested column is still a NestedSeries.

[7]:

# check the full row type
def row_type_check(df):
    print("df.iloc[0]: ", type(df.iloc[0]))
    print("df.iloc[0:1]:", type(df.iloc[0:1]))
    print("\n Accessing nested column for both ways:")
    print("df.iloc[0] nested column:", type(df.iloc[0]["nested"]))
    print("df.iloc[0:1] nested column:", type(df.iloc[0:1]["nested"]))
    print()
    # return df


nf.groupby("c").apply(row_type_check, include_groups=False)

df.iloc[0]:  <class 'pandas.core.series.Series'>
df.iloc[0:1]: <class 'nested_pandas.nestedframe.core.NestedFrame'>

 Accessing nested column for both ways:
df.iloc[0] nested column: <class 'pandas.core.frame.DataFrame'>
df.iloc[0:1] nested column: <class 'nested_pandas.series.nestedseries.NestedSeries'>

df.iloc[0]:  <class 'pandas.core.series.Series'>
df.iloc[0:1]: <class 'nested_pandas.nestedframe.core.NestedFrame'>

 Accessing nested column for both ways:
df.iloc[0] nested column: <class 'pandas.core.frame.DataFrame'>
df.iloc[0:1] nested column: <class 'nested_pandas.series.nestedseries.NestedSeries'>

[7]:

0 rows × 0 columns
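
The split between integer and slice indexing is not specific to nested data. Plain pandas behaves the same way, which is presumably where NestedFrame inherits it from. A minimal sketch with an ordinary DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Ordinary DataFrame with hypothetical values.
df = pd.DataFrame({"a": [0.4, 0.7], "b": [0.2, 0.4]})

row_as_series = df.iloc[0]    # integer position: reduces to a 1-D Series
row_as_frame = df.iloc[0:1]   # slice: stays a 2-D DataFrame
row_from_list = df.iloc[[0]]  # list of positions: also stays 2-D

print(type(row_as_series).__name__, row_as_series.shape)
print(type(row_as_frame).__name__, row_as_frame.shape)
print(type(row_from_list).__name__, row_from_list.shape)
```

In other words, any selection that keeps the result two-dimensional keeps the frame type, and only the reduction to a single row changes it.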

For a nested column of type NestedSeries, selecting a single row from df["nested"] returns either a pandas.DataFrame (.iloc[0]) or a pandas.Series (.iloc[0:1]).

Outside of groupby, df["nested"].iloc[0] is likewise a pandas.DataFrame, so this result is expected.

[8]:

# check the nested row type
def nested_row_type_check(df):
    print('df["nested"].iloc[0]:', type(df["nested"].iloc[0]))
    print('df["nested"].iloc[0:1]:', type(df["nested"].iloc[0:1]))
    print()
    # return df


nf.groupby("c").apply(nested_row_type_check, include_groups=False)

df["nested"].iloc[0]: <class 'pandas.core.frame.DataFrame'>
df["nested"].iloc[0:1]: <class 'pandas.core.series.Series'>

df["nested"].iloc[0]: <class 'pandas.core.frame.DataFrame'>
df["nested"].iloc[0:1]: <class 'pandas.core.series.Series'>

[8]:

0 rows × 0 columns

Other operations generally preserve the nested structure, but to work with the contents of a nested column directly you may need to flatten it first using .nest.to_flat().

Custom Functions with `apply`#

.apply() is supported natively for nested operations. It generally works as long as the function flattens the nested column or uses slice-based indexing, so that the operations see matching types.

Some examples:

[9]:

# custom function to flatten nested column
def flatten_nested(df):
    return df["nested"].nest.to_flat()


nf.groupby("c").apply(flatten_nested, include_groups=False)

[9]:

		t	flux	flux_error	band
c
0	0	8.38389	10.233443	1.0	g
0	0	13.40935	53.589641	1.0	g
...	...	...	...	...	...
1	4	9.831463	90.853515	1.0	r
1	4	13.995167	99.732285	1.0	g

50 rows × 4 columns

[10]:

import pandas as pd


# custom function to perform aggregations on flattened nested column
def mean_flux(df):
    flat = df["nested"].nest.to_flat()
    return pd.Series({"mean_flux": flat["flux"].mean(), "mean_t": flat["t"].mean()})


nf.groupby("c").apply(mean_flux, include_groups=False)

[10]:

	mean_flux	mean_t
c
0	51.891513	10.881513
1	57.975918	8.700798

2 rows × 2 columns

Summary#

Always group by base columns, not nested columns.
Use slice-based indexing (.iloc[0:1]) to preserve nested types.
Use .nest.to_flat() to flatten a nested column before numerical or aggregating operations.
Nested structures are designed to reduce the need for expensive groupby operations by allowing data to stay organized hierarchically. However, when grouping is necessary, pandas’ groupby still works with nested-pandas and maintains type consistency.
Some use cases may behave unexpectedly because of the nested structure. If you run into unexpected behavior or edge cases, please open an issue.
