Fine Data Manipulation with Nested-Pandas#

This tutorial will briefly showcase how one would perform data manipulation operations from pandas, like adding columns, replacing values, etc. with nested-pandas.

[1]:

import nested_pandas as npd
from nested_pandas.datasets import generate_data

# Begin by generating an example dataset
ndf = generate_data(5, 20, seed=1)
ndf

[1]:

nested

0.417022

0.184677

t	flux	flux_error	band
8.38389	17.234051	1.0	r
+19 rows	...	...	...

0.720324

0.372520

t	flux	flux_error	band
13.70439	13.713575	1.0	g
+19 rows	...	...	...

0.000114

0.691121

t	flux	flux_error	band
4.089045	93.259546	1.0	r
+19 rows	...	...	...

0.302333

0.793535

t	flux	flux_error	band
17.562349	69.681816	1.0	r
+19 rows	...	...	...

0.146756

1.077633

t	flux	flux_error	band
0.547752	6.600017	1.0	g
+19 rows	...	...	...

5 rows x 3 columns

[2]:

# Show one of the nested dataframes
ndf.iloc[0].nested

[2]:

	t	flux	flux_error	band
0	8.383890	17.234051	1.0	r
1	13.409350	75.546305	1.0	g
2	16.014891	1.988013	1.0	r
3	17.892133	53.883106	1.0	g
4	1.966937	58.575927	1.0	r
5	6.310313	23.297427	1.0	g
6	19.777222	55.624023	1.0	r
7	8.957871	10.749413	1.0	g
8	0.387339	7.197428	1.0	g
9	1.067251	74.382585	1.0	g
10	2.046689	23.984776	1.0	g
11	10.717928	1.857620	1.0	r
12	18.068038	31.736241	1.0	g
13	3.307084	74.533443	1.0	g
14	17.666122	62.971751	1.0	r
15	17.917724	80.475456	1.0	r
16	2.294919	26.329677	1.0	r
17	4.740540	93.197207	1.0	r
18	6.532898	95.017612	1.0	r
19	12.467202	48.599067	1.0	r

Nested Column Selection#

First, we can directly fetch a column from our nested column (aptly called “nested”). For example, below we can fetch the time column, “t”, by specifying "nested.t" as the column to retrieve. This returns a “flat” view of the nested t column, where all rows from all dataframes are present in one dataframe.

[3]:

# Directly Nested Column Selection
ndf["nested.t"]

[3]:

0      8.38389
0     13.40935
       ...
4    18.170703
4    19.946457
Name: t, Length: 100, dtype: double[pyarrow]

The advantage of the flat view being that this is easily manipulatable just as any pandas.Series object.

[4]:

ndf["nested.t"] + 100

[4]:

0     108.38389
0     113.40935
        ...
4    118.170703
4    119.946457
Name: t, Length: 100, dtype: double[pyarrow]

Adding or Replacing Nested Columns#

We can use the “base_column.nested_sub_column” syntax to also perform operations that add new columns or replace existing columns for a nested column. For example, we can directly replace the “band” column with a new column that appends an additional string to the values.

[5]:

# prepend lsst_ to the band column

ndf["nested.band"] = "lsst_" + ndf["nested.band"]

ndf["nested.band"]

[5]:

0    lsst_r
0    lsst_g
      ...
4    lsst_r
4    lsst_g
Name: band, Length: 100, dtype: string[pyarrow]

Next, we can create a new column in the “nested” column. For example, we can subtract a value from each time value and return it as a new column.

[6]:

# create a new "corrected_t" column in "nested"

ndf["nested.corrected_t"] = ndf["nested.t"] - 5

ndf["nested.corrected_t"]

[6]:

0      3.38389
0      8.40935
       ...
4    13.170703
4    14.946457
Name: corrected_t, Length: 100, dtype: double[pyarrow]

[7]:

# Show the first dataframe again
ndf.iloc[0].nested

[7]:

	t	flux	flux_error	band	corrected_t
0	8.383890	17.234051	1.0	lsst_r	3.383890
1	13.409350	75.546305	1.0	lsst_g	8.409350
2	16.014891	1.988013	1.0	lsst_r	11.014891
3	17.892133	53.883106	1.0	lsst_g	12.892133
4	1.966937	58.575927	1.0	lsst_r	-3.033063
5	6.310313	23.297427	1.0	lsst_g	1.310313
6	19.777222	55.624023	1.0	lsst_r	14.777222
7	8.957871	10.749413	1.0	lsst_g	3.957871
8	0.387339	7.197428	1.0	lsst_g	-4.612661
9	1.067251	74.382585	1.0	lsst_g	-3.932749
10	2.046689	23.984776	1.0	lsst_g	-2.953311
11	10.717928	1.857620	1.0	lsst_r	5.717928
12	18.068038	31.736241	1.0	lsst_g	13.068038
13	3.307084	74.533443	1.0	lsst_g	-1.692916
14	17.666122	62.971751	1.0	lsst_r	12.666122
15	17.917724	80.475456	1.0	lsst_r	12.917724
16	2.294919	26.329677	1.0	lsst_r	-2.705081
17	4.740540	93.197207	1.0	lsst_r	-0.259460
18	6.532898	95.017612	1.0	lsst_r	1.532898
19	12.467202	48.599067	1.0	lsst_r	7.467202

Adding New Nested Structures#

Finally, we can also add entirely new nested structures using the above syntax.

[8]:

ndf["bands.band_label"] = ndf["nested.band"]
ndf

[8]:

nested

bands

0.417022

0.184677

t	flux	flux_error	band	corrected_t
8.38389	17.234051	1.0	lsst_r	3.38389
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

0.720324

0.372520

t	flux	flux_error	band	corrected_t
13.70439	13.713575	1.0	lsst_g	8.70439
+19 rows	...	...	...	...

band_label
lsst_g
+19 rows

0.000114

0.691121

t	flux	flux_error	band	corrected_t
4.089045	93.259546	1.0	lsst_r	-0.910955
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

0.302333

0.793535

t	flux	flux_error	band	corrected_t
17.562349	69.681816	1.0	lsst_r	12.562349
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

0.146756

1.077633

t	flux	flux_error	band	corrected_t
0.547752	6.600017	1.0	lsst_g	-4.452248
+19 rows	...	...	...	...

band_label
lsst_g
+19 rows

5 rows x 4 columns

This is functionally equivalent to using join_nested:

[9]:

ndf.join_nested(ndf["nested.band"].to_frame(), "bands_from_add_nested")

[9]:

nested

bands

bands_from_add_nested

0.417022

0.184677

t	flux	flux_error	band	corrected_t
8.38389	17.234051	1.0	lsst_r	3.38389
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

band
lsst_r
+19 rows

0.720324

0.372520

t	flux	flux_error	band	corrected_t
13.70439	13.713575	1.0	lsst_g	8.70439
+19 rows	...	...	...	...

band_label
lsst_g
+19 rows

band
lsst_g
+19 rows

0.000114

0.691121

t	flux	flux_error	band	corrected_t
4.089045	93.259546	1.0	lsst_r	-0.910955
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

band
lsst_r
+19 rows

0.302333

0.793535

t	flux	flux_error	band	corrected_t
17.562349	69.681816	1.0	lsst_r	12.562349
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

band
lsst_r
+19 rows

0.146756

1.077633

t	flux	flux_error	band	corrected_t
0.547752	6.600017	1.0	lsst_g	-4.452248
+19 rows	...	...	...	...

band_label
lsst_g
+19 rows

band
lsst_g
+19 rows

5 rows x 5 columns

In addition to assigning individual nested columns, we can use the above syntax to nest an entire flat dataframe.

As an example, we can flatten our existing “nested” frame and use the [] syntax to assign it as an additional nested frame.

[10]:

# Create a flat dataframe from our existing nested dataframe
flat_df = ndf["nested"].to_flat()

# Nest our flat dataframe back into our original dataframe
ndf["example"] = flat_df
ndf

[10]:

nested

bands

example

0.417022

0.184677

t	flux	flux_error	band	corrected_t
8.38389	17.234051	1.0	lsst_r	3.38389
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

t	flux	flux_error	band	corrected_t
8.38389	17.234051	1.0	lsst_r	3.38389
+19 rows	...	...	...	...

0.720324

0.372520

t	flux	flux_error	band	corrected_t
13.70439	13.713575	1.0	lsst_g	8.70439
+19 rows	...	...	...	...

band_label
lsst_g
+19 rows

t	flux	flux_error	band	corrected_t
13.70439	13.713575	1.0	lsst_g	8.70439
+19 rows	...	...	...	...

0.000114

0.691121

t	flux	flux_error	band	corrected_t
4.089045	93.259546	1.0	lsst_r	-0.910955
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

t	flux	flux_error	band	corrected_t
4.089045	93.259546	1.0	lsst_r	-0.910955
+19 rows	...	...	...	...

0.302333

0.793535

t	flux	flux_error	band	corrected_t
17.562349	69.681816	1.0	lsst_r	12.562349
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

t	flux	flux_error	band	corrected_t
17.562349	69.681816	1.0	lsst_r	12.562349
+19 rows	...	...	...	...

0.146756

1.077633

t	flux	flux_error	band	corrected_t
0.547752	6.600017	1.0	lsst_g	-4.452248
+19 rows	...	...	...	...

band_label
lsst_g
+19 rows

t	flux	flux_error	band	corrected_t
0.547752	6.600017	1.0	lsst_g	-4.452248
+19 rows	...	...	...	...

5 rows x 5 columns

The above again being shorthand for the following join_nested call:

[11]:

ndf.join_nested(flat_df, "example_from_add_nested")

[11]:

nested

bands

example

example_from_add_nested

0.417022

0.184677

t	flux	flux_error	band	corrected_t
8.38389	17.234051	1.0	lsst_r	3.38389
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

t	flux	flux_error	band	corrected_t
8.38389	17.234051	1.0	lsst_r	3.38389
+19 rows	...	...	...	...

t	flux	flux_error	band	corrected_t
8.38389	17.234051	1.0	lsst_r	3.38389
+19 rows	...	...	...	...

0.720324

0.372520

t	flux	flux_error	band	corrected_t
13.70439	13.713575	1.0	lsst_g	8.70439
+19 rows	...	...	...	...

band_label
lsst_g
+19 rows

t	flux	flux_error	band	corrected_t
13.70439	13.713575	1.0	lsst_g	8.70439
+19 rows	...	...	...	...

t	flux	flux_error	band	corrected_t
13.70439	13.713575	1.0	lsst_g	8.70439
+19 rows	...	...	...	...

0.000114

0.691121

t	flux	flux_error	band	corrected_t
4.089045	93.259546	1.0	lsst_r	-0.910955
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

t	flux	flux_error	band	corrected_t
4.089045	93.259546	1.0	lsst_r	-0.910955
+19 rows	...	...	...	...

t	flux	flux_error	band	corrected_t
4.089045	93.259546	1.0	lsst_r	-0.910955
+19 rows	...	...	...	...

0.302333

0.793535

t	flux	flux_error	band	corrected_t
17.562349	69.681816	1.0	lsst_r	12.562349
+19 rows	...	...	...	...

band_label
lsst_r
+19 rows

t	flux	flux_error	band	corrected_t
17.562349	69.681816	1.0	lsst_r	12.562349
+19 rows	...	...	...	...

t	flux	flux_error	band	corrected_t
17.562349	69.681816	1.0	lsst_r	12.562349
+19 rows	...	...	...	...

0.146756

1.077633

t	flux	flux_error	band	corrected_t
0.547752	6.600017	1.0	lsst_g	-4.452248
+19 rows	...	...	...	...

band_label
lsst_g
+19 rows

t	flux	flux_error	band	corrected_t
0.547752	6.600017	1.0	lsst_g	-4.452248
+19 rows	...	...	...	...

t	flux	flux_error	band	corrected_t
0.547752	6.600017	1.0	lsst_g	-4.452248
+19 rows	...	...	...	...

5 rows x 6 columns

Embedding “base” column into nested column#

We can also assign some “base” (non-nested) column to a nested column, which will be broadcasted to all nested dataframes with the values being repeated.

[12]:

ndf["nested.a"] = ndf["a"]
ndf["nested.a"]

[12]:

0    0.417022
0    0.417022
       ...
4    0.146756
4    0.146756
Name: a, Length: 100, dtype: double[pyarrow]

Or we can do some operations over the base columns first:

[13]:

ndf["nested.ab"] = ndf["a"] + ndf["b"] * 2
ndf["nested.ab"]

[13]:

0    0.786376
0    0.786376
       ...
4    2.302023
4    2.302023
Name: ab, Length: 100, dtype: double[pyarrow]

Concatenating Nested Frames#

Users can concatenate two nested data frames using panda’s concat function. Each row from each individual frame becomes a single row in the combined frame with the nested data correctly propagated.

[14]:

from pandas import concat

ndf_part1 = generate_data(5, 20, seed=1)
ndf_part2 = generate_data(5, 20, seed=2)
ndf_joined = concat([ndf_part1, ndf_part2], ignore_index=True)

ndf_joined

[14]:

nested

0.417022

0.184677

t	flux	flux_error	band
8.38389	17.234051	1.0	r
+19 rows	...	...	...

0.720324

0.372520

t	flux	flux_error	band
13.70439	13.713575	1.0	g
+19 rows	...	...	...

0.000114

0.691121

t	flux	flux_error	band
4.089045	93.259546	1.0	r
+19 rows	...	...	...

0.302333

0.793535

t	flux	flux_error	band
17.562349	69.681816	1.0	r
+19 rows	...	...	...

0.146756

1.077633

t	flux	flux_error	band
0.547752	6.600017	1.0	g
+19 rows	...	...	...

0.435995

0.660670

t	flux	flux_error	band
12.422677	26.377853	1.0	g
+19 rows	...	...	...

0.025926

0.409297

t	flux	flux_error	band
10.582842	91.391548	1.0	g
+19 rows	...	...	...

0.549662

1.238542

t	flux	flux_error	band
2.691599	41.973546	1.0	g
+19 rows	...	...	...

0.435322

0.599309

t	flux	flux_error	band
10.271562	54.019152	1.0	r
+19 rows	...	...	...

0.420368

0.533655

t	flux	flux_error	band
3.688797	60.844216	1.0	r
+19 rows	...	...	...

10 rows x 3 columns

Combining Nested Structures#

There may be cases where you would want to combine two nested structures into a single nested structure. There are multiple ways to do this, but by far the most direct path is through direct assignment, first let’s set up a toy example:

[15]:

# Setup a toy dataframe with two nested columns
list_nf = npd.NestedFrame(
    {
        "a": ["cat", "dog", "bird"],
        "b": [1, 2, 3],
        "c": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "d": [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
    }
)

list_nf = list_nf.nest_lists(["c"], "c")
list_nf = list_nf.nest_lists(["d"], "d")
list_nf

[15]:

cat

c
1
+2 rows

d
10
+2 rows

dog

c
4
+2 rows

d
40
+2 rows

bird

c
7
+2 rows

d
70
+2 rows

3 rows x 4 columns

Given the nested structures, “c” and “d”, they can be combined directly as shown below. Note that this requires “c” and “d” to be compatible, which means that the shapes of the inner dataframes should be aligned for every row of your top-level nestedframe.

[16]:

# Combine "c" and "d"
list_nf["nested"] = list_nf[["c", "d"]]
list_nf = list_nf.drop(columns=["c", "d"])  # drop the original columns
list_nf

[16]:

nested

cat

c	d
1	10
+2 rows	...

dog

c	d
4	40
+2 rows	...

bird

c	d
7	70
+2 rows	...

3 rows x 3 columns

Advanced Manipulation with `.eval()`#

Nested-pandas allows you to perform fast calculations on nested columns using the .eval() method.

[17]:

import pandas as pd
import nested_pandas as npd

# Example 1: Convert a flat DataFrame into a NestedFrame
data = {"star_id": ["Star_A", "Star_A", "Star_B", "Star_B"], "mag": [15.3, 15.4, 12.5, 12.6]}
df = pd.DataFrame(data)

nested_stars = npd.NestedFrame.from_flat(
    df, base_columns=["star_id"], nested_columns=["mag"], name="light_curve"
)
print("--- Packed NestedFrame ---")
print(nested_stars)

# Example 2: Adding a calibration constant to a nested magnitude column
# Evaluate the expression and assign it to a new nested column directly inside the string
nf2 = nested_stars.eval("light_curve.calibrated_mag = light_curve.mag + 25.5")
print("\n--- After calibrated_mag ---")
print(nf2)

# Example 3: Combining flat variables with nested aggregations
# We use an f-string to inject the local variable 'a' directly into the evaluation
a = 10.0
nf3 = nested_stars.eval(f"result = {a} + light_curve.mag.median()")
print("\n--- After median aggregation ---")
print(nf3)

--- Packed NestedFrame ---
  star_id    light_curve
0  Star_A  [{mag: 15.3}]
1  Star_A  [{mag: 15.4}]
2  Star_B  [{mag: 12.5}]
3  Star_B  [{mag: 12.6}]

--- After calibrated_mag ---
  star_id                          light_curve
0  Star_A  [{mag: 15.3, calibrated_mag: 40.8}]
1  Star_A  [{mag: 15.4, calibrated_mag: 40.9}]
2  Star_B  [{mag: 12.5, calibrated_mag: 38.0}]
3  Star_B  [{mag: 12.6, calibrated_mag: 38.1}]

--- After median aggregation ---
  star_id    light_curve  result
0  Star_A  [{mag: 15.3}]   23.95
1  Star_A  [{mag: 15.4}]   23.95
2  Star_B  [{mag: 12.5}]   23.95
3  Star_B  [{mag: 12.6}]   23.95

Fine Data Manipulation with Nested-Pandas

Contents

Fine Data Manipulation with Nested-Pandas#

Nested Column Selection#

Adding or Replacing Nested Columns#

Adding New Nested Structures#

Embedding “base” column into nested column#

Concatenating Nested Frames#

Combining Nested Structures#

Advanced Manipulation with .eval()#

Advanced Manipulation with `.eval()`#