Fine Data Manipulation with Nested-Pandas
This tutorial will briefly showcase how one would perform data manipulation operations from pandas, like adding columns, replacing values, etc. with nested-pandas.
| |
a |
b |
nested |
| 0 |
0.417022 |
0.184677 |
| t |
flux |
flux_error |
band |
| 8.38389 |
17.234051 |
1.0 |
r |
| +19 rows |
... |
... |
... |
|
| 1 |
0.720324 |
0.372520 |
| t |
flux |
flux_error |
band |
| 13.70439 |
13.713575 |
1.0 |
g |
| +19 rows |
... |
... |
... |
|
| 2 |
0.000114 |
0.691121 |
| t |
flux |
flux_error |
band |
| 4.089045 |
93.259546 |
1.0 |
r |
| +19 rows |
... |
... |
... |
|
| 3 |
0.302333 |
0.793535 |
| t |
flux |
flux_error |
band |
| 17.562349 |
69.681816 |
1.0 |
r |
| +19 rows |
... |
... |
... |
|
| 4 |
0.146756 |
1.077633 |
| t |
flux |
flux_error |
band |
| 0.547752 |
6.600017 |
1.0 |
g |
| +19 rows |
... |
... |
... |
|
5 rows x 3 columns
|
t |
flux |
flux_error |
band |
| 0 |
8.383890 |
17.234051 |
1.0 |
r |
| 1 |
13.409350 |
75.546305 |
1.0 |
g |
| 2 |
16.014891 |
1.988013 |
1.0 |
r |
| 3 |
17.892133 |
53.883106 |
1.0 |
g |
| 4 |
1.966937 |
58.575927 |
1.0 |
r |
| 5 |
6.310313 |
23.297427 |
1.0 |
g |
| 6 |
19.777222 |
55.624023 |
1.0 |
r |
| 7 |
8.957871 |
10.749413 |
1.0 |
g |
| 8 |
0.387339 |
7.197428 |
1.0 |
g |
| 9 |
1.067251 |
74.382585 |
1.0 |
g |
| 10 |
2.046689 |
23.984776 |
1.0 |
g |
| 11 |
10.717928 |
1.857620 |
1.0 |
r |
| 12 |
18.068038 |
31.736241 |
1.0 |
g |
| 13 |
3.307084 |
74.533443 |
1.0 |
g |
| 14 |
17.666122 |
62.971751 |
1.0 |
r |
| 15 |
17.917724 |
80.475456 |
1.0 |
r |
| 16 |
2.294919 |
26.329677 |
1.0 |
r |
| 17 |
4.740540 |
93.197207 |
1.0 |
r |
| 18 |
6.532898 |
95.017612 |
1.0 |
r |
| 19 |
12.467202 |
48.599067 |
1.0 |
r |
Nested Column Selection
First, we can directly fetch a column from our nested column (aptly called “nested”). For example, below we can fetch the time column, “t”, by specifying "nested.t" as the column to retrieve. This returns a “flat” view of the nested t column, where all rows from all dataframes are present in one dataframe.
0 8.38389
0 13.40935
...
4 18.170703
4 19.946457
Name: t, Length: 100, dtype: double[pyarrow]
The advantage of the flat view being that this is easily manipulatable just as any pandas.Series object.
0 108.38389
0 113.40935
...
4 118.170703
4 119.946457
Name: t, Length: 100, dtype: double[pyarrow]
Adding or Replacing Nested Columns
We can use the “base_column.nested_sub_column” syntax to also perform operations that add new columns or replace existing columns for a nested column. For example, we can directly replace the “band” column with a new column that appends an additional string to the values.
0 lsst_r
0 lsst_g
...
4 lsst_r
4 lsst_g
Name: band, Length: 100, dtype: string[pyarrow]
Next, we can create a new column in the “nested” column. For example, we can subtract a value from each time value and return it as a new column.
0 3.38389
0 8.40935
...
4 13.170703
4 14.946457
Name: corrected_t, Length: 100, dtype: double[pyarrow]
|
t |
flux |
flux_error |
band |
corrected_t |
| 0 |
8.383890 |
17.234051 |
1.0 |
lsst_r |
3.383890 |
| 1 |
13.409350 |
75.546305 |
1.0 |
lsst_g |
8.409350 |
| 2 |
16.014891 |
1.988013 |
1.0 |
lsst_r |
11.014891 |
| 3 |
17.892133 |
53.883106 |
1.0 |
lsst_g |
12.892133 |
| 4 |
1.966937 |
58.575927 |
1.0 |
lsst_r |
-3.033063 |
| 5 |
6.310313 |
23.297427 |
1.0 |
lsst_g |
1.310313 |
| 6 |
19.777222 |
55.624023 |
1.0 |
lsst_r |
14.777222 |
| 7 |
8.957871 |
10.749413 |
1.0 |
lsst_g |
3.957871 |
| 8 |
0.387339 |
7.197428 |
1.0 |
lsst_g |
-4.612661 |
| 9 |
1.067251 |
74.382585 |
1.0 |
lsst_g |
-3.932749 |
| 10 |
2.046689 |
23.984776 |
1.0 |
lsst_g |
-2.953311 |
| 11 |
10.717928 |
1.857620 |
1.0 |
lsst_r |
5.717928 |
| 12 |
18.068038 |
31.736241 |
1.0 |
lsst_g |
13.068038 |
| 13 |
3.307084 |
74.533443 |
1.0 |
lsst_g |
-1.692916 |
| 14 |
17.666122 |
62.971751 |
1.0 |
lsst_r |
12.666122 |
| 15 |
17.917724 |
80.475456 |
1.0 |
lsst_r |
12.917724 |
| 16 |
2.294919 |
26.329677 |
1.0 |
lsst_r |
-2.705081 |
| 17 |
4.740540 |
93.197207 |
1.0 |
lsst_r |
-0.259460 |
| 18 |
6.532898 |
95.017612 |
1.0 |
lsst_r |
1.532898 |
| 19 |
12.467202 |
48.599067 |
1.0 |
lsst_r |
7.467202 |
Adding New Nested Structures
Finally, we can also add entirely new nested structures using the above syntax.
| |
a |
b |
nested |
bands |
| 0 |
0.417022 |
0.184677 |
| t |
flux |
flux_error |
band |
corrected_t |
| 8.38389 |
17.234051 |
1.0 |
lsst_r |
3.38389 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
| 1 |
0.720324 |
0.372520 |
| t |
flux |
flux_error |
band |
corrected_t |
| 13.70439 |
13.713575 |
1.0 |
lsst_g |
8.70439 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_g |
| +19 rows |
|
| 2 |
0.000114 |
0.691121 |
| t |
flux |
flux_error |
band |
corrected_t |
| 4.089045 |
93.259546 |
1.0 |
lsst_r |
-0.910955 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
| 3 |
0.302333 |
0.793535 |
| t |
flux |
flux_error |
band |
corrected_t |
| 17.562349 |
69.681816 |
1.0 |
lsst_r |
12.562349 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
| 4 |
0.146756 |
1.077633 |
| t |
flux |
flux_error |
band |
corrected_t |
| 0.547752 |
6.600017 |
1.0 |
lsst_g |
-4.452248 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_g |
| +19 rows |
|
5 rows x 4 columns
This is functionally equivalent to using join_nested:
| |
a |
b |
nested |
bands |
bands_from_add_nested |
| 0 |
0.417022 |
0.184677 |
| t |
flux |
flux_error |
band |
corrected_t |
| 8.38389 |
17.234051 |
1.0 |
lsst_r |
3.38389 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
|
| 1 |
0.720324 |
0.372520 |
| t |
flux |
flux_error |
band |
corrected_t |
| 13.70439 |
13.713575 |
1.0 |
lsst_g |
8.70439 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_g |
| +19 rows |
|
|
| 2 |
0.000114 |
0.691121 |
| t |
flux |
flux_error |
band |
corrected_t |
| 4.089045 |
93.259546 |
1.0 |
lsst_r |
-0.910955 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
|
| 3 |
0.302333 |
0.793535 |
| t |
flux |
flux_error |
band |
corrected_t |
| 17.562349 |
69.681816 |
1.0 |
lsst_r |
12.562349 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
|
| 4 |
0.146756 |
1.077633 |
| t |
flux |
flux_error |
band |
corrected_t |
| 0.547752 |
6.600017 |
1.0 |
lsst_g |
-4.452248 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_g |
| +19 rows |
|
|
5 rows x 5 columns
In addition to assigning individual nested columns, we can use the above syntax to nest an entire flat dataframe.
As an example, we can flatten our existing “nested” frame and use the [] syntax to assign it as an additional nested frame.
| |
a |
b |
nested |
bands |
example |
| 0 |
0.417022 |
0.184677 |
| t |
flux |
flux_error |
band |
corrected_t |
| 8.38389 |
17.234051 |
1.0 |
lsst_r |
3.38389 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 8.38389 |
17.234051 |
1.0 |
lsst_r |
3.38389 |
| +19 rows |
... |
... |
... |
... |
|
| 1 |
0.720324 |
0.372520 |
| t |
flux |
flux_error |
band |
corrected_t |
| 13.70439 |
13.713575 |
1.0 |
lsst_g |
8.70439 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_g |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 13.70439 |
13.713575 |
1.0 |
lsst_g |
8.70439 |
| +19 rows |
... |
... |
... |
... |
|
| 2 |
0.000114 |
0.691121 |
| t |
flux |
flux_error |
band |
corrected_t |
| 4.089045 |
93.259546 |
1.0 |
lsst_r |
-0.910955 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 4.089045 |
93.259546 |
1.0 |
lsst_r |
-0.910955 |
| +19 rows |
... |
... |
... |
... |
|
| 3 |
0.302333 |
0.793535 |
| t |
flux |
flux_error |
band |
corrected_t |
| 17.562349 |
69.681816 |
1.0 |
lsst_r |
12.562349 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 17.562349 |
69.681816 |
1.0 |
lsst_r |
12.562349 |
| +19 rows |
... |
... |
... |
... |
|
| 4 |
0.146756 |
1.077633 |
| t |
flux |
flux_error |
band |
corrected_t |
| 0.547752 |
6.600017 |
1.0 |
lsst_g |
-4.452248 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_g |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 0.547752 |
6.600017 |
1.0 |
lsst_g |
-4.452248 |
| +19 rows |
... |
... |
... |
... |
|
5 rows x 5 columns
The above again being shorthand for the following join_nested call:
| |
a |
b |
nested |
bands |
example |
example_from_add_nested |
| 0 |
0.417022 |
0.184677 |
| t |
flux |
flux_error |
band |
corrected_t |
| 8.38389 |
17.234051 |
1.0 |
lsst_r |
3.38389 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 8.38389 |
17.234051 |
1.0 |
lsst_r |
3.38389 |
| +19 rows |
... |
... |
... |
... |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 8.38389 |
17.234051 |
1.0 |
lsst_r |
3.38389 |
| +19 rows |
... |
... |
... |
... |
|
| 1 |
0.720324 |
0.372520 |
| t |
flux |
flux_error |
band |
corrected_t |
| 13.70439 |
13.713575 |
1.0 |
lsst_g |
8.70439 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_g |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 13.70439 |
13.713575 |
1.0 |
lsst_g |
8.70439 |
| +19 rows |
... |
... |
... |
... |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 13.70439 |
13.713575 |
1.0 |
lsst_g |
8.70439 |
| +19 rows |
... |
... |
... |
... |
|
| 2 |
0.000114 |
0.691121 |
| t |
flux |
flux_error |
band |
corrected_t |
| 4.089045 |
93.259546 |
1.0 |
lsst_r |
-0.910955 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 4.089045 |
93.259546 |
1.0 |
lsst_r |
-0.910955 |
| +19 rows |
... |
... |
... |
... |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 4.089045 |
93.259546 |
1.0 |
lsst_r |
-0.910955 |
| +19 rows |
... |
... |
... |
... |
|
| 3 |
0.302333 |
0.793535 |
| t |
flux |
flux_error |
band |
corrected_t |
| 17.562349 |
69.681816 |
1.0 |
lsst_r |
12.562349 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_r |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 17.562349 |
69.681816 |
1.0 |
lsst_r |
12.562349 |
| +19 rows |
... |
... |
... |
... |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 17.562349 |
69.681816 |
1.0 |
lsst_r |
12.562349 |
| +19 rows |
... |
... |
... |
... |
|
| 4 |
0.146756 |
1.077633 |
| t |
flux |
flux_error |
band |
corrected_t |
| 0.547752 |
6.600017 |
1.0 |
lsst_g |
-4.452248 |
| +19 rows |
... |
... |
... |
... |
|
| band_label |
| lsst_g |
| +19 rows |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 0.547752 |
6.600017 |
1.0 |
lsst_g |
-4.452248 |
| +19 rows |
... |
... |
... |
... |
|
| t |
flux |
flux_error |
band |
corrected_t |
| 0.547752 |
6.600017 |
1.0 |
lsst_g |
-4.452248 |
| +19 rows |
... |
... |
... |
... |
|
5 rows x 6 columns
Embedding “base” column into nested column
We can also assign some “base” (non-nested) column to a nested column, which will be broadcasted to all nested dataframes with the values being repeated.
0 0.417022
0 0.417022
...
4 0.146756
4 0.146756
Name: a, Length: 100, dtype: double[pyarrow]
Or we can do some operations over the base columns first:
0 0.786376
0 0.786376
...
4 2.302023
4 2.302023
Name: ab, Length: 100, dtype: double[pyarrow]
Concatenating Nested Frames
Users can concatenate two nested data frames using panda’s concat function. Each row from each individual frame becomes a single row in the combined frame with the nested data correctly propagated.
| |
a |
b |
nested |
| 0 |
0.417022 |
0.184677 |
| t |
flux |
flux_error |
band |
| 8.38389 |
17.234051 |
1.0 |
r |
| +19 rows |
... |
... |
... |
|
| 1 |
0.720324 |
0.372520 |
| t |
flux |
flux_error |
band |
| 13.70439 |
13.713575 |
1.0 |
g |
| +19 rows |
... |
... |
... |
|
| 2 |
0.000114 |
0.691121 |
| t |
flux |
flux_error |
band |
| 4.089045 |
93.259546 |
1.0 |
r |
| +19 rows |
... |
... |
... |
|
| 3 |
0.302333 |
0.793535 |
| t |
flux |
flux_error |
band |
| 17.562349 |
69.681816 |
1.0 |
r |
| +19 rows |
... |
... |
... |
|
| 4 |
0.146756 |
1.077633 |
| t |
flux |
flux_error |
band |
| 0.547752 |
6.600017 |
1.0 |
g |
| +19 rows |
... |
... |
... |
|
| 5 |
0.435995 |
0.660670 |
| t |
flux |
flux_error |
band |
| 12.422677 |
26.377853 |
1.0 |
g |
| +19 rows |
... |
... |
... |
|
| 6 |
0.025926 |
0.409297 |
| t |
flux |
flux_error |
band |
| 10.582842 |
91.391548 |
1.0 |
g |
| +19 rows |
... |
... |
... |
|
| 7 |
0.549662 |
1.238542 |
| t |
flux |
flux_error |
band |
| 2.691599 |
41.973546 |
1.0 |
g |
| +19 rows |
... |
... |
... |
|
| 8 |
0.435322 |
0.599309 |
| t |
flux |
flux_error |
band |
| 10.271562 |
54.019152 |
1.0 |
r |
| +19 rows |
... |
... |
... |
|
| 9 |
0.420368 |
0.533655 |
| t |
flux |
flux_error |
band |
| 3.688797 |
60.844216 |
1.0 |
r |
| +19 rows |
... |
... |
... |
|
10 rows x 3 columns
Combining Nested Structures
There may be cases where you would want to combine two nested structures into a single nested structure. There are multiple ways to do this, but by far the most direct path is through direct assignment, first let’s set up a toy example:
| |
a |
b |
c |
d |
| 0 |
cat |
1 |
|
|
| 1 |
dog |
2 |
|
|
| 2 |
bird |
3 |
|
|
3 rows x 4 columns
Given the nested structures, “c” and “d”, they can be combined directly as shown below. Note that this requires “c” and “d” to be compatible, which means that the shapes of the inner dataframes should be aligned for every row of your top-level nestedframe.
| |
a |
b |
nested |
| 0 |
cat |
1 |
|
| 1 |
dog |
2 |
|
| 2 |
bird |
3 |
|
3 rows x 3 columns
Advanced Manipulation with .eval()
Nested-pandas allows you to perform fast calculations on nested columns using the .eval() method.
--- Packed NestedFrame ---
star_id light_curve
0 Star_A [{mag: 15.3}]
1 Star_A [{mag: 15.4}]
2 Star_B [{mag: 12.5}]
3 Star_B [{mag: 12.6}]
--- After calibrated_mag ---
star_id light_curve
0 Star_A [{mag: 15.3, calibrated_mag: 40.8}]
1 Star_A [{mag: 15.4, calibrated_mag: 40.9}]
2 Star_B [{mag: 12.5, calibrated_mag: 38.0}]
3 Star_B [{mag: 12.6, calibrated_mag: 38.1}]
--- After median aggregation ---
star_id light_curve result
0 Star_A [{mag: 15.3}] 23.95
1 Star_A [{mag: 15.4}] 23.95
2 Star_B [{mag: 12.5}] 23.95
3 Star_B [{mag: 12.6}] 23.95