Fine Data Manipulation with Nested-Pandas#

This tutorial will briefly showcase how one would perform data manipulation operations from pandas, like adding columns, replacing values, etc. with nested-pandas.

[1]:
import nested_pandas as npd
from nested_pandas.datasets import generate_data

# Begin by generating an example dataset
ndf = generate_data(5, 20, seed=1)
ndf
[1]:
  a b nested
0 0.417022 0.184677
t flux flux_error band
8.38389 17.234051 1.0 r
+19 rows ... ... ...
1 0.720324 0.372520
t flux flux_error band
13.70439 13.713575 1.0 g
+19 rows ... ... ...
2 0.000114 0.691121
t flux flux_error band
4.089045 93.259546 1.0 r
+19 rows ... ... ...
3 0.302333 0.793535
t flux flux_error band
17.562349 69.681816 1.0 r
+19 rows ... ... ...
4 0.146756 1.077633
t flux flux_error band
0.547752 6.600017 1.0 g
+19 rows ... ... ...
5 rows x 3 columns
[2]:
# Show one of the nested dataframes
ndf.iloc[0].nested
[2]:
t flux flux_error band
0 8.383890 17.234051 1.0 r
1 13.409350 75.546305 1.0 g
2 16.014891 1.988013 1.0 r
3 17.892133 53.883106 1.0 g
4 1.966937 58.575927 1.0 r
5 6.310313 23.297427 1.0 g
6 19.777222 55.624023 1.0 r
7 8.957871 10.749413 1.0 g
8 0.387339 7.197428 1.0 g
9 1.067251 74.382585 1.0 g
10 2.046689 23.984776 1.0 g
11 10.717928 1.857620 1.0 r
12 18.068038 31.736241 1.0 g
13 3.307084 74.533443 1.0 g
14 17.666122 62.971751 1.0 r
15 17.917724 80.475456 1.0 r
16 2.294919 26.329677 1.0 r
17 4.740540 93.197207 1.0 r
18 6.532898 95.017612 1.0 r
19 12.467202 48.599067 1.0 r

Nested Column Selection#

First, we can directly fetch a column from our nested column (aptly called “nested”). For example, below we can fetch the time column, “t”, by specifying "nested.t" as the column to retrieve. This returns a “flat” view of the nested t column, where all rows from all dataframes are present in one dataframe.

[3]:
# Directly Nested Column Selection
ndf["nested.t"]
[3]:
0      8.38389
0     13.40935
       ...
4    18.170703
4    19.946457
Name: t, Length: 100, dtype: double[pyarrow]

The advantage of the flat view being that this is easily manipulatable just as any pandas.Series object.

[4]:
ndf["nested.t"] + 100
[4]:
0     108.38389
0     113.40935
        ...
4    118.170703
4    119.946457
Name: t, Length: 100, dtype: double[pyarrow]

Adding or Replacing Nested Columns#

We can use the “base_column.nested_sub_column” syntax to also perform operations that add new columns or replace existing columns for a nested column. For example, we can directly replace the “band” column with a new column that appends an additional string to the values.

[5]:
# prepend lsst_ to the band column

ndf["nested.band"] = "lsst_" + ndf["nested.band"]

ndf["nested.band"]
[5]:
0    lsst_r
0    lsst_g
      ...
4    lsst_r
4    lsst_g
Name: band, Length: 100, dtype: string[pyarrow]

Next, we can create a new column in the “nested” column. For example, we can subtract a value from each time value and return it as a new column.

[6]:
# create a new "corrected_t" column in "nested"

ndf["nested.corrected_t"] = ndf["nested.t"] - 5

ndf["nested.corrected_t"]
[6]:
0      3.38389
0      8.40935
       ...
4    13.170703
4    14.946457
Name: corrected_t, Length: 100, dtype: double[pyarrow]
[7]:
# Show the first dataframe again
ndf.iloc[0].nested
[7]:
t flux flux_error band corrected_t
0 8.383890 17.234051 1.0 lsst_r 3.383890
1 13.409350 75.546305 1.0 lsst_g 8.409350
2 16.014891 1.988013 1.0 lsst_r 11.014891
3 17.892133 53.883106 1.0 lsst_g 12.892133
4 1.966937 58.575927 1.0 lsst_r -3.033063
5 6.310313 23.297427 1.0 lsst_g 1.310313
6 19.777222 55.624023 1.0 lsst_r 14.777222
7 8.957871 10.749413 1.0 lsst_g 3.957871
8 0.387339 7.197428 1.0 lsst_g -4.612661
9 1.067251 74.382585 1.0 lsst_g -3.932749
10 2.046689 23.984776 1.0 lsst_g -2.953311
11 10.717928 1.857620 1.0 lsst_r 5.717928
12 18.068038 31.736241 1.0 lsst_g 13.068038
13 3.307084 74.533443 1.0 lsst_g -1.692916
14 17.666122 62.971751 1.0 lsst_r 12.666122
15 17.917724 80.475456 1.0 lsst_r 12.917724
16 2.294919 26.329677 1.0 lsst_r -2.705081
17 4.740540 93.197207 1.0 lsst_r -0.259460
18 6.532898 95.017612 1.0 lsst_r 1.532898
19 12.467202 48.599067 1.0 lsst_r 7.467202

Adding New Nested Structures#

Finally, we can also add entirely new nested structures using the above syntax.

[8]:
ndf["bands.band_label"] = ndf["nested.band"]
ndf
[8]:
  a b nested bands
0 0.417022 0.184677
t flux flux_error band corrected_t
8.38389 17.234051 1.0 lsst_r 3.38389
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
1 0.720324 0.372520
t flux flux_error band corrected_t
13.70439 13.713575 1.0 lsst_g 8.70439
+19 rows ... ... ... ...
band_label
lsst_g
+19 rows
2 0.000114 0.691121
t flux flux_error band corrected_t
4.089045 93.259546 1.0 lsst_r -0.910955
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
3 0.302333 0.793535
t flux flux_error band corrected_t
17.562349 69.681816 1.0 lsst_r 12.562349
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
4 0.146756 1.077633
t flux flux_error band corrected_t
0.547752 6.600017 1.0 lsst_g -4.452248
+19 rows ... ... ... ...
band_label
lsst_g
+19 rows
5 rows x 4 columns

This is functionally equivalent to using join_nested:

[9]:
ndf.join_nested(ndf["nested.band"].to_frame(), "bands_from_add_nested")
[9]:
  a b nested bands bands_from_add_nested
0 0.417022 0.184677
t flux flux_error band corrected_t
8.38389 17.234051 1.0 lsst_r 3.38389
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
band
lsst_r
+19 rows
1 0.720324 0.372520
t flux flux_error band corrected_t
13.70439 13.713575 1.0 lsst_g 8.70439
+19 rows ... ... ... ...
band_label
lsst_g
+19 rows
band
lsst_g
+19 rows
2 0.000114 0.691121
t flux flux_error band corrected_t
4.089045 93.259546 1.0 lsst_r -0.910955
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
band
lsst_r
+19 rows
3 0.302333 0.793535
t flux flux_error band corrected_t
17.562349 69.681816 1.0 lsst_r 12.562349
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
band
lsst_r
+19 rows
4 0.146756 1.077633
t flux flux_error band corrected_t
0.547752 6.600017 1.0 lsst_g -4.452248
+19 rows ... ... ... ...
band_label
lsst_g
+19 rows
band
lsst_g
+19 rows
5 rows x 5 columns

In addition to assigning individual nested columns, we can use the above syntax to nest an entire flat dataframe.

As an example, we can flatten our existing “nested” frame and use the [] syntax to assign it as an additional nested frame.

[10]:
# Create a flat dataframe from our existing nested dataframe
flat_df = ndf["nested"].to_flat()

# Nest our flat dataframe back into our original dataframe
ndf["example"] = flat_df
ndf
[10]:
  a b nested bands example
0 0.417022 0.184677
t flux flux_error band corrected_t
8.38389 17.234051 1.0 lsst_r 3.38389
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
t flux flux_error band corrected_t
8.38389 17.234051 1.0 lsst_r 3.38389
+19 rows ... ... ... ...
1 0.720324 0.372520
t flux flux_error band corrected_t
13.70439 13.713575 1.0 lsst_g 8.70439
+19 rows ... ... ... ...
band_label
lsst_g
+19 rows
t flux flux_error band corrected_t
13.70439 13.713575 1.0 lsst_g 8.70439
+19 rows ... ... ... ...
2 0.000114 0.691121
t flux flux_error band corrected_t
4.089045 93.259546 1.0 lsst_r -0.910955
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
t flux flux_error band corrected_t
4.089045 93.259546 1.0 lsst_r -0.910955
+19 rows ... ... ... ...
3 0.302333 0.793535
t flux flux_error band corrected_t
17.562349 69.681816 1.0 lsst_r 12.562349
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
t flux flux_error band corrected_t
17.562349 69.681816 1.0 lsst_r 12.562349
+19 rows ... ... ... ...
4 0.146756 1.077633
t flux flux_error band corrected_t
0.547752 6.600017 1.0 lsst_g -4.452248
+19 rows ... ... ... ...
band_label
lsst_g
+19 rows
t flux flux_error band corrected_t
0.547752 6.600017 1.0 lsst_g -4.452248
+19 rows ... ... ... ...
5 rows x 5 columns

The above again being shorthand for the following join_nested call:

[11]:
ndf.join_nested(flat_df, "example_from_add_nested")
[11]:
  a b nested bands example example_from_add_nested
0 0.417022 0.184677
t flux flux_error band corrected_t
8.38389 17.234051 1.0 lsst_r 3.38389
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
t flux flux_error band corrected_t
8.38389 17.234051 1.0 lsst_r 3.38389
+19 rows ... ... ... ...
t flux flux_error band corrected_t
8.38389 17.234051 1.0 lsst_r 3.38389
+19 rows ... ... ... ...
1 0.720324 0.372520
t flux flux_error band corrected_t
13.70439 13.713575 1.0 lsst_g 8.70439
+19 rows ... ... ... ...
band_label
lsst_g
+19 rows
t flux flux_error band corrected_t
13.70439 13.713575 1.0 lsst_g 8.70439
+19 rows ... ... ... ...
t flux flux_error band corrected_t
13.70439 13.713575 1.0 lsst_g 8.70439
+19 rows ... ... ... ...
2 0.000114 0.691121
t flux flux_error band corrected_t
4.089045 93.259546 1.0 lsst_r -0.910955
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
t flux flux_error band corrected_t
4.089045 93.259546 1.0 lsst_r -0.910955
+19 rows ... ... ... ...
t flux flux_error band corrected_t
4.089045 93.259546 1.0 lsst_r -0.910955
+19 rows ... ... ... ...
3 0.302333 0.793535
t flux flux_error band corrected_t
17.562349 69.681816 1.0 lsst_r 12.562349
+19 rows ... ... ... ...
band_label
lsst_r
+19 rows
t flux flux_error band corrected_t
17.562349 69.681816 1.0 lsst_r 12.562349
+19 rows ... ... ... ...
t flux flux_error band corrected_t
17.562349 69.681816 1.0 lsst_r 12.562349
+19 rows ... ... ... ...
4 0.146756 1.077633
t flux flux_error band corrected_t
0.547752 6.600017 1.0 lsst_g -4.452248
+19 rows ... ... ... ...
band_label
lsst_g
+19 rows
t flux flux_error band corrected_t
0.547752 6.600017 1.0 lsst_g -4.452248
+19 rows ... ... ... ...
t flux flux_error band corrected_t
0.547752 6.600017 1.0 lsst_g -4.452248
+19 rows ... ... ... ...
5 rows x 6 columns

Embedding “base” column into nested column#

We can also assign some “base” (non-nested) column to a nested column, which will be broadcasted to all nested dataframes with the values being repeated.

[12]:
ndf["nested.a"] = ndf["a"]
ndf["nested.a"]
[12]:
0    0.417022
0    0.417022
       ...
4    0.146756
4    0.146756
Name: a, Length: 100, dtype: double[pyarrow]

Or we can do some operations over the base columns first:

[13]:
ndf["nested.ab"] = ndf["a"] + ndf["b"] * 2
ndf["nested.ab"]
[13]:
0    0.786376
0    0.786376
       ...
4    2.302023
4    2.302023
Name: ab, Length: 100, dtype: double[pyarrow]

Concatenating Nested Frames#

Users can concatenate two nested data frames using panda’s concat function. Each row from each individual frame becomes a single row in the combined frame with the nested data correctly propagated.

[14]:
from pandas import concat

ndf_part1 = generate_data(5, 20, seed=1)
ndf_part2 = generate_data(5, 20, seed=2)
ndf_joined = concat([ndf_part1, ndf_part2], ignore_index=True)

ndf_joined
[14]:
  a b nested
0 0.417022 0.184677
t flux flux_error band
8.38389 17.234051 1.0 r
+19 rows ... ... ...
1 0.720324 0.372520
t flux flux_error band
13.70439 13.713575 1.0 g
+19 rows ... ... ...
2 0.000114 0.691121
t flux flux_error band
4.089045 93.259546 1.0 r
+19 rows ... ... ...
3 0.302333 0.793535
t flux flux_error band
17.562349 69.681816 1.0 r
+19 rows ... ... ...
4 0.146756 1.077633
t flux flux_error band
0.547752 6.600017 1.0 g
+19 rows ... ... ...
5 0.435995 0.660670
t flux flux_error band
12.422677 26.377853 1.0 g
+19 rows ... ... ...
6 0.025926 0.409297
t flux flux_error band
10.582842 91.391548 1.0 g
+19 rows ... ... ...
7 0.549662 1.238542
t flux flux_error band
2.691599 41.973546 1.0 g
+19 rows ... ... ...
8 0.435322 0.599309
t flux flux_error band
10.271562 54.019152 1.0 r
+19 rows ... ... ...
9 0.420368 0.533655
t flux flux_error band
3.688797 60.844216 1.0 r
+19 rows ... ... ...
10 rows x 3 columns

Combining Nested Structures#

There may be cases where you would want to combine two nested structures into a single nested structure. There are multiple ways to do this, but by far the most direct path is through direct assignment, first let’s set up a toy example:

[15]:
# Setup a toy dataframe with two nested columns
list_nf = npd.NestedFrame(
    {
        "a": ["cat", "dog", "bird"],
        "b": [1, 2, 3],
        "c": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        "d": [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
    }
)

list_nf = list_nf.nest_lists(["c"], "c")
list_nf = list_nf.nest_lists(["d"], "d")
list_nf
[15]:
  a b c d
0 cat 1
c
1
+2 rows
d
10
+2 rows
1 dog 2
c
4
+2 rows
d
40
+2 rows
2 bird 3
c
7
+2 rows
d
70
+2 rows
3 rows x 4 columns

Given the nested structures, “c” and “d”, they can be combined directly as shown below. Note that this requires “c” and “d” to be compatible, which means that the shapes of the inner dataframes should be aligned for every row of your top-level nestedframe.

[16]:
# Combine "c" and "d"
list_nf["nested"] = list_nf[["c", "d"]]
list_nf = list_nf.drop(columns=["c", "d"])  # drop the original columns
list_nf
[16]:
  a b nested
0 cat 1
c d
1 10
+2 rows ...
1 dog 2
c d
4 40
+2 rows ...
2 bird 3
c d
7 70
+2 rows ...
3 rows x 3 columns

Advanced Manipulation with .eval()#

Nested-pandas allows you to perform fast calculations on nested columns using the .eval() method.

[17]:
import pandas as pd
import nested_pandas as npd

# Example 1: Convert a flat DataFrame into a NestedFrame
data = {"star_id": ["Star_A", "Star_A", "Star_B", "Star_B"], "mag": [15.3, 15.4, 12.5, 12.6]}
df = pd.DataFrame(data)

nested_stars = npd.NestedFrame.from_flat(
    df, base_columns=["star_id"], nested_columns=["mag"], name="light_curve"
)
print("--- Packed NestedFrame ---")
print(nested_stars)

# Example 2: Adding a calibration constant to a nested magnitude column
# Evaluate the expression and assign it to a new nested column directly inside the string
nf2 = nested_stars.eval("light_curve.calibrated_mag = light_curve.mag + 25.5")
print("\n--- After calibrated_mag ---")
print(nf2)

# Example 3: Combining flat variables with nested aggregations
# We use an f-string to inject the local variable 'a' directly into the evaluation
a = 10.0
nf3 = nested_stars.eval(f"result = {a} + light_curve.mag.median()")
print("\n--- After median aggregation ---")
print(nf3)
--- Packed NestedFrame ---
  star_id    light_curve
0  Star_A  [{mag: 15.3}]
1  Star_A  [{mag: 15.4}]
2  Star_B  [{mag: 12.5}]
3  Star_B  [{mag: 12.6}]

--- After calibrated_mag ---
  star_id                          light_curve
0  Star_A  [{mag: 15.3, calibrated_mag: 40.8}]
1  Star_A  [{mag: 15.4, calibrated_mag: 40.9}]
2  Star_B  [{mag: 12.5, calibrated_mag: 38.0}]
3  Star_B  [{mag: 12.6, calibrated_mag: 38.1}]

--- After median aggregation ---
  star_id    light_curve  result
0  Star_A  [{mag: 15.3}]   23.95
1  Star_A  [{mag: 15.4}]   23.95
2  Star_B  [{mag: 12.5}]   23.95
3  Star_B  [{mag: 12.6}]   23.95