GroupBy for NestedPandas#
This notebook explores how Pandas’ built-in groupby interacts with NestedPandas structures.
Because nested-pandas extends pandas, the native pandas.DataFrame.groupby works on a NestedFrame out of the box, with some caveats described below.
[1]:
# This will be the nf example used in this doc
from nested_pandas.datasets import generate_data
nf = generate_data(5, 10, seed=1)
nf["c"] = [0, 0, 1, 1, 1]
nf
[1]:
| | a | b | nested | c |
|---|---|---|---|---|
| 0 | 0.417022 | 0.184677 | [nested table] | 0 |
| 1 | 0.720324 | 0.372520 | [nested table] | 0 |
| 2 | 0.000114 | 0.691121 | [nested table] | 1 |
| 3 | 0.302333 | 0.793535 | [nested table] | 1 |
| 4 | 0.146756 | 1.077633 | [nested table] | 1 |
groupby works on non-nested columns and returns a pandas GroupBy object (here a DataFrameGroupBy). Use base columns as group keys, or extract scalar identifiers from nested data first.
[2]:
nf.groupby("c") # returns a Pandas GroupBy object
[2]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7a14f7d96210>
Basic Aggregations#
- Some built-in methods, such as count, run but not as you might expect: they treat each nested column value as a single object.
- Others (min, max, mean) fail on nested columns.
- Interestingly, describe works as expected, operating on the automatically flattened nested column.
[3]:
# count is viewing nested columns as single objects
nf.groupby("c").count()
[3]:
| | a | b | nested |
|---|---|---|---|
| c | |||
| 0 | 2 | 2 | 2 |
| 1 | 3 | 3 | 3 |
2 rows × 3 columns
[4]:
# min/max/mean fail on nested columns
try:
    grouped_min = nf.groupby("c").min()
    print(grouped_min)
except TypeError as e:
    print(f"Cannot compute min on nested columns: {e}")
Cannot compute min on nested columns: agg function failed [how->min,dtype->nested<t: [double], flux: [double], flux_error: [double], band: [string]>]
[5]:
# describe works as expected with automatic flattened nested column
nf.groupby("c").describe()
[5]:
| | a | | | | | | | | b | | | | | | | | nested.t | | | | | | | | nested.flux | | | | | | | | nested.flux_error | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max |
| c | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| 0 | 2.0 | 0.568673 | 0.214467 | 0.417022 | 0.492848 | 0.568673 | 0.644499 | 0.720324 | 2.0 | 0.278599 | 0.132825 | 0.184677 | 0.231638 | 0.278599 | 0.325560 | 0.372520 | 20.0 | 10.881513 | 6.240902 | 0.387339 | 7.83715 | 12.445851 | 15.226208 | 19.777222 | 20.0 | 51.891513 | 32.136814 | 1.582124 | 21.910878 | 53.147725 | 88.645112 | 94.948926 | 20.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 3.0 | 0.149734 | 0.151131 | 0.000114 | 0.073435 | 0.146756 | 0.224544 | 0.302333 | 3.0 | 0.854097 | 0.200247 | 0.691121 | 0.742328 | 0.793535 | 0.935584 | 1.077633 | 30.0 | 8.700798 | 6.111402 | 0.365766 | 3.537964 | 6.070383 | 13.957988 | 19.157791 | 30.0 | 57.975918 | 27.028715 | 0.287033 | 40.029183 | 60.184998 | 75.090985 | 99.732285 | 30.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
2 rows × 40 columns
Type Preservation#
Within each group, the object remains accessible as a NestedFrame object and the nested columns remain NestedSeries.
We can check this by applying a custom function on our 2-group groupby object:
[6]:
# check the type
def type_check(df):
    print("Group DataFrame Type:", type(df))
    print("Nested Column Type:", type(df["nested"]))
    print()
    # return df


nf.groupby("c").apply(type_check, include_groups=False)
Group DataFrame Type: <class 'nested_pandas.nestedframe.core.NestedFrame'>
Nested Column Type: <class 'nested_pandas.series.nestedseries.NestedSeries'>
Group DataFrame Type: <class 'nested_pandas.nestedframe.core.NestedFrame'>
Nested Column Type: <class 'nested_pandas.series.nestedseries.NestedSeries'>
[6]:
0 rows × 0 columns
An important note: when accessing a row of each group with .iloc[], integer indexing and slice-based indexing return different types.
For a NestedFrame, integer indexing (.iloc[0]) collapses the first row into a 1-D pandas.Series, with the nested column value stored as a DataFrame. Slice-based indexing (.iloc[0:1]) preserves the nested structure: the row comes back as a NestedFrame whose nested column is still a NestedSeries.
[7]:
# check the full row type
def row_type_check(df):
    print("df.iloc[0]: ", type(df.iloc[0]))
    print("df.iloc[0:1]:", type(df.iloc[0:1]))
    print("\n Accessing nested column for both ways:")
    print("df.iloc[0] nested column:", type(df.iloc[0]["nested"]))
    print("df.iloc[0:1] nested column:", type(df.iloc[0:1]["nested"]))
    print()
    # return df


nf.groupby("c").apply(row_type_check, include_groups=False)
df.iloc[0]: <class 'pandas.core.series.Series'>
df.iloc[0:1]: <class 'nested_pandas.nestedframe.core.NestedFrame'>
Accessing nested column for both ways:
df.iloc[0] nested column: <class 'pandas.core.frame.DataFrame'>
df.iloc[0:1] nested column: <class 'nested_pandas.series.nestedseries.NestedSeries'>
df.iloc[0]: <class 'pandas.core.series.Series'>
df.iloc[0:1]: <class 'nested_pandas.nestedframe.core.NestedFrame'>
Accessing nested column for both ways:
df.iloc[0] nested column: <class 'pandas.core.frame.DataFrame'>
df.iloc[0:1] nested column: <class 'nested_pandas.series.nestedseries.NestedSeries'>
[7]:
0 rows × 0 columns
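The integer-versus-slice distinction mirrors plain pandas, where an integer position collapses a row into a Series and a slice keeps a frame. A small pandas-only illustration follows; the frame and its column names are made up for the example.

```python
import pandas as pd

# A made-up frame purely for illustration
df = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

row = df.iloc[0]        # integer position: the row collapses to a Series
one_row = df.iloc[0:1]  # slice: the result stays a one-row DataFrame

print(type(row).__name__)      # Series
print(type(one_row).__name__)  # DataFrame
```

Using a slice is therefore the safer choice whenever later code expects frame-like behavior, which for a NestedFrame includes keeping the nested column as a NestedSeries.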
For the nested column itself (a NestedSeries), accessing a single row of df["nested"] returns a pandas.DataFrame with .iloc[0] and a pandas.Series with .iloc[0:1].
Note that outside groupby, df["nested"].iloc[0] also returns a pandas.DataFrame, so this behavior is expected.
[8]:
# check the nested row type
def nested_row_type_check(df):
    print('df["nested"].iloc[0]:', type(df["nested"].iloc[0]))
    print('df["nested"].iloc[0:1]:', type(df["nested"].iloc[0:1]))
    print()
    # return df


nf.groupby("c").apply(nested_row_type_check, include_groups=False)
df["nested"].iloc[0]: <class 'pandas.core.frame.DataFrame'>
df["nested"].iloc[0:1]: <class 'pandas.core.series.Series'>
df["nested"].iloc[0]: <class 'pandas.core.frame.DataFrame'>
df["nested"].iloc[0:1]: <class 'pandas.core.series.Series'>
[8]:
0 rows × 0 columns
Other operations will preserve the nested structure in general, but if you need to work with the contents of a nested column directly, you may need to flatten it first using .nest.to_flat().
Custom Functions with apply#
.apply() with custom functions is supported natively on a grouped NestedFrame. It generally works as long as the function flattens the nested column or uses slice-based indexing, so that the operations see consistent types.
Some potential examples:
[9]:
# custom function to flatten nested column
def flatten_nested(df):
    return df["nested"].nest.to_flat()


nf.groupby("c").apply(flatten_nested, include_groups=False)
[9]:
| | | t | flux | flux_error | band |
|---|---|---|---|---|---|
| c | | | | | |
| 0 | 0 | 8.38389 | 10.233443 | 1.0 | g |
| | 0 | 13.40935 | 53.589641 | 1.0 | g |
| ... | ... | ... | ... | ... | ... |
| 1 | 4 | 9.831463 | 90.853515 | 1.0 | r |
| | 4 | 13.995167 | 99.732285 | 1.0 | g |
50 rows × 4 columns
[10]:
import pandas as pd
# custom function to perform aggregations on flattened nested column
def mean_flux(df):
    flat = df["nested"].nest.to_flat()
    return pd.Series({"mean_flux": flat["flux"].mean(), "mean_t": flat["t"].mean()})


nf.groupby("c").apply(mean_flux, include_groups=False)
[10]:
| | mean_flux | mean_t |
|---|---|---|
| c | ||
| 0 | 51.891513 | 10.881513 |
| 1 | 57.975918 | 8.700798 |
2 rows × 2 columns
Summary#
- Always group by base columns, not nested columns.
- Use slice-based indexing (.iloc[0:1]) to preserve nested types.
- Use .nest.to_flat() to flatten a nested column when needed for numerical or aggregating operations.
Nested structures are designed to reduce the need for expensive groupby operations by allowing data to stay organized hierarchically. However, when grouping is necessary, pandas’ groupby still works with nested-pandas and maintains type consistency.
Some use cases may behave unexpectedly because of the nested structures. If you run into unexpected behavior or edge cases, we encourage you to open an issue.