map_rows
- NestedFrame.map_rows(func: Callable[[...], Any], columns: None | str | list[str] = None, *, row_container: Literal['dict', 'args'] = 'dict', output_names: None | str | list[str] = None, infer_nesting: bool = True, append_columns: bool = False, njit: bool = False, **kwargs) → NestedFrame
Takes a function and applies it to each top-level row of the NestedFrame.
Nested columns are packaged alongside base columns and available for function use, where base columns are passed as scalars and nested columns are passed as numpy arrays. The way in which the row data is packaged is configurable (by default, a dictionary) and controlled by the row_container argument.
- Parameters:
func (callable) – Function to apply to each top-level row of the NestedFrame. Its first inputs are the row data for the selected columns, packaged according to row_container; any kwargs follow. See the Notes for recommendations on writing func outputs.
columns (None | str | list of str) – Specifies which columns to pass to the function in the row_container format. If None, all columns are passed. If list of str, those columns are passed. If str, that single column is passed; if the string names a nested column, all of its sub-columns are passed (e.g. columns="nested" passes all columns of the nested dataframe “nested”). To pass individual nested sub-columns, use the hierarchical column name (e.g. columns=["nested.t", …]).
row_container ('dict' or 'args', default 'dict') – Specifies how the row data will be packaged when passed as an input to the function. If 'dict', the function will be called as func({"col1": value, …}, **kwargs), so func should expect a single dictionary input with keys corresponding to column names. If 'args', the function will be called as func(value, …, **kwargs), so func should expect positional arguments corresponding to the columns specified in columns.
output_names (None | str | list of str) – Specifies the names of the output columns in the resulting NestedFrame. If None, the output columns keep whatever names the user function returns. If specified, these names override any names returned by the user function, provided the number of names matches the number of outputs. When not specified and the user function returns values without names (e.g. a list or tuple), the output columns will be enumerated (e.g. “0”, “1”, …).
infer_nesting (bool, default True) – If True, the function will pack output columns into nested structures based on column names adhering to a nested naming scheme. E.g. “nested.b” and “nested.c” will be packed into a column called “nested” with columns “b” and “c”. If False, all outputs will be returned as base columns. Note that this will trigger off of names specified in output_names in addition to names returned by the user function.
append_columns (bool, default False) – If True, the output columns are appended to those in the original NestedFrame. The output columns can contain nested sub-columns, which should be specified using their hierarchical column name (e.g. “nested.x”). If their base nested column exists in the original NestedFrame, the new output sub-columns will be added into the frame of the existing nested column. See an example below.
njit (bool, default False) –
If True, the function will try to use numba’s njit to speed up execution. This only works if the custom function is compatible with njit and at most two columns are requested.
Note that using njit disables support for row_container="dict".
kwargs (keyword arguments, optional) – Keyword arguments to pass to the function.
- Returns:
NestedFrame with the results of the function applied to the columns of the frame.
- Return type:
NestedFrame
Examples
>>> from nested_pandas.datasets.generation import generate_data
>>> import numpy as np
>>> nf = generate_data(5, 5, seed=1)
>>> # define a custom user function
>>> # map_rows will return a NestedFrame with two columns
>>> def example_func(row):
...     return np.mean(row["nested.t"]), np.mean(row["nested.t"]) - row["a"]

>>> # apply the function
>>> nf.map_rows(example_func, output_names=["mean", "mean_minus_base"])
        mean  mean_minus_base
0  11.533440        11.116418
1  10.307751         9.587426
2   8.294042         8.293928
3   9.655291         9.352958
4  10.687591        10.540836
We can pass along only the columns we need for the function using the columns argument, which removes the performance overhead of packaging all columns for each row:
>>> nf.map_rows(example_func, columns=["a", "nested.t"], output_names=["mean", "mean_minus_base"])
        mean  mean_minus_base
0  11.533440        11.116418
1  10.307751         9.587426
2   8.294042         8.293928
3   9.655291         9.352958
4  10.687591        10.540836

Alternatively, we can pass along the row data as positional arguments instead of a dictionary by setting row_container="args" and adjusting our function signature accordingly:
>>> def example_func(a, time):
...     return np.mean(time), np.mean(time) - a

>>> nf.map_rows(example_func,
...             columns=["a", "nested.t"],
...             output_names=["mean", "mean_minus_base"],
...             row_container="args")
        mean  mean_minus_base
0  11.533440        11.116418
1  10.307751         9.587426
2   8.294042         8.293928
3   9.655291         9.352958
4  10.687591        10.540836
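The two calling conventions described above can be modelled in plain Python. This is only an illustrative sketch, not the library's implementation: `call_with_container` is a hypothetical helper, and the row is a plain dictionary whose nested column is a list rather than a numpy array.

```python
import statistics


def call_with_container(func, row, columns, row_container="dict", **kwargs):
    """Hypothetical helper mimicking how one row could be handed to func."""
    selected = {name: row[name] for name in columns}
    if row_container == "dict":
        # A single dictionary argument keyed by column name.
        return func(selected, **kwargs)
    if row_container == "args":
        # One positional argument per column, in the order given by `columns`.
        return func(*selected.values(), **kwargs)
    raise ValueError(f"unknown row_container: {row_container!r}")


# One row: the base column "a" is a scalar, the nested column is a sequence.
row = {"a": 0.5, "nested.t": [1.0, 2.0, 3.0]}


def as_dict(r, scale=1.0):
    return statistics.fmean(r["nested.t"]) * scale - r["a"]


def as_args(a, t, scale=1.0):
    return statistics.fmean(t) * scale - a


dict_result = call_with_container(as_dict, row, ["a", "nested.t"], "dict", scale=2.0)
args_result = call_with_container(as_args, row, ["a", "nested.t"], "args", scale=2.0)
```

In this sketch both calls compute the same value; only the signature of the user function changes. With "args", the order of `columns` determines the order of the positional arguments.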
Additional arguments that don’t depend on row data can be passed as kwargs:
>>> def example_func(row, scale):
...     return np.mean(row["nested.t"]) * scale

>>> nf.map_rows(example_func, columns=["nested.t"], output_names="mean", scale=1)
        mean
0  11.533440
1  10.307751
2   8.294042
3   9.655291
4  10.687591
Functions that target a single nested structure can just pass along the nested column name and all sub-columns will be available:
>>> def first_val(row):
...     return {"first_" + key.split(".")[1]: row[key][0] for key in row.keys()}

>>> nf.map_rows(first_val, columns="nested")
     first_t  first_flux  first_flux_error  first_band
0   8.383890   31.551563               1.0           r
1  13.704390   68.650093               1.0           g
2   4.089045   83.462567               1.0           g
3  17.562349    1.828828               1.0           g
4   0.547752   75.014431               1.0           g
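To make the column-selection rules concrete, here is a minimal sketch of how a `columns` argument could expand into flat column names. `resolve_columns` is a hypothetical helper written for illustration; it follows the rules stated in the parameter description but is not the library's code.

```python
def resolve_columns(columns, base_columns, nested_columns):
    """Hypothetical helper: expand a `columns` argument into flat names.

    `nested_columns` maps each nested column name to its sub-column names.
    """
    if columns is None:
        # No selection: every base column and every nested sub-column.
        return list(base_columns) + [
            f"{nested}.{sub}"
            for nested, subs in nested_columns.items()
            for sub in subs
        ]
    if isinstance(columns, str):
        if columns in nested_columns:
            # A nested column name stands for all of its sub-columns.
            return [f"{columns}.{sub}" for sub in nested_columns[columns]]
        return [columns]
    # A list of names is passed through unchanged.
    return list(columns)


# Layout of the frame used in the examples above.
base = ["a", "b"]
nested = {"nested": ["t", "flux", "flux_error", "band"]}

all_cols = resolve_columns(None, base, nested)
only_nested = resolve_columns("nested", base, nested)
picked = resolve_columns(["a", "nested.t"], base, nested)
```

Here `only_nested` contains the four hierarchical names that `first_val` receives as dictionary keys, which is why `key.split(".")[1]` recovers the sub-column name.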
You may want the result of a map_rows call to have nested structure. We can achieve this by using the infer_nesting kwarg:

>>> # define a custom user function that returns nested structure
>>> def example_func(row):
...     '''map_rows will return a NestedFrame with nested structure'''
...     return {"offsets.t_a": row["nested.t"] - row["a"],
...             "offsets.t_b": row["nested.t"] - row["b"]}
By giving both output columns the prefix “offsets.”, we signal to map_rows to infer that these should be packed into a nested column called “offsets”.
>>> # apply the function with `infer_nesting` (True by default)
>>> nf.map_rows(example_func, columns=["a", "b", "nested.t"], infer_nesting=True)
                                          offsets
0    [{t_a: 7.966868, t_b: 8.199213}; …] (5 rows)
1   [{t_a: 12.984066, t_b: 13.33187}; …] (5 rows)
2    [{t_a: 4.088931, t_b: 3.397924}; …] (5 rows)
3  [{t_a: 17.260016, t_b: 16.768814}; …] (5 rows)
4   [{t_a: 0.400996, t_b: -0.529882}; …] (5 rows)
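The name-based grouping behind infer_nesting can be illustrated with a small pure-Python model. `group_by_prefix` is a hypothetical helper, and splitting on the first "." is an assumption of this sketch; the library's actual packing logic may differ in its details.

```python
def group_by_prefix(outputs, infer_nesting=True):
    """Hypothetical helper: sort named outputs into base and nested groups."""
    if not infer_nesting:
        # Every output stays a base column, dots in the name included.
        return {"base": dict(outputs), "nested": {}}
    base, nested = {}, {}
    for name, value in outputs.items():
        prefix, sep, sub = name.partition(".")
        if sep:
            # "offsets.t_a" becomes sub-column "t_a" of nested column "offsets".
            nested.setdefault(prefix, {})[sub] = value
        else:
            base[name] = value
    return {"base": base, "nested": nested}


outputs = {"offsets.t_a": [1.0, 2.0], "offsets.t_b": [0.5, 1.5], "mean": 1.5}

grouped = group_by_prefix(outputs)
flat = group_by_prefix(outputs, infer_nesting=False)
```

In this model, `grouped` holds one nested column "offsets" with sub-columns "t_a" and "t_b" plus a base column "mean", while `flat` keeps all three names as base columns.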
You may also want to append the output columns to the original NestedFrame. We can achieve this by using the append_columns kwarg:
>>> # define a custom user function that creates a nested sub-column
>>> def example_func(row):
...     '''map_rows will return a sub-column for the existing 'nested' column'''
...     return row["nested.t"] - row["a"]

>>> # apply the function with `append_columns` (False by default)
>>> nf.map_rows(example_func,
...             columns=["a", "nested.t"],
...             output_names=["nested.t_a"],
...             append_columns=True)
          a         b                                              nested
0  0.417022  0.184677  [{t: 8.38389, flux: 31.551563, flux_error: 1.0...
1  0.720324  0.372520  [{t: 13.70439, flux: 68.650093, flux_error: 1....
2  0.000114  0.691121  [{t: 4.089045, flux: 83.462567, flux_error: 1....
3  0.302333  0.793535  [{t: 17.562349, flux: 1.828828, flux_error: 1....
4  0.146756  1.077633  [{t: 0.547752, flux: 75.014431, flux_error: 1....
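The truncated display above hides the new sub-column, so the following sketch shows the intended effect on a single row. It is a simplified model under stated assumptions: `append_output` and the `{"base": ..., "nested": ...}` row layout are hypothetical and chosen only for illustration.

```python
def append_output(row, outputs):
    """Hypothetical helper: merge named outputs into a copy of one row."""
    merged = {
        "base": dict(row["base"]),
        "nested": {name: dict(subs) for name, subs in row["nested"].items()},
    }
    for name, value in outputs.items():
        prefix, sep, sub = name.partition(".")
        if sep:
            # Hierarchical name: add the sub-column to the (existing) nested column.
            merged["nested"].setdefault(prefix, {})[sub] = value
        else:
            merged["base"][name] = value
    return merged


row = {"base": {"a": 0.5}, "nested": {"nested": {"t": [1.0, 2.0]}}}
t_minus_a = [t - row["base"]["a"] for t in row["nested"]["nested"]["t"]]

merged = append_output(row, {"nested.t_a": t_minus_a})
```

After the merge, the nested column in `merged` carries both "t" and the new "t_a", and the original `row` is left unchanged.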
Notes
If concerned about performance, specify columns to only include the columns needed for the function, as this will avoid the overhead of packaging all columns for each row.
By default, map_rows will produce a NestedFrame with enumerated column names for each returned value of the function. It’s recommended to either specify output_names or have func return a dictionary where each key is an output column of the dataframe returned by map_rows (as shown above).
>>> def example_func(row):
...     return np.mean(row["nested.t"]), np.mean(row["nested.t"]) - row["a"]

>>> # first output column will be named "0", second "1"
>>> nf.map_rows(example_func)
           0          1
0  11.533440  11.116418
1  10.307751   9.587426
2   8.294042   8.293928
3   9.655291   9.352958
4  10.687591  10.540836
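The naming rules from the Notes can be summarised in a short sketch. `name_outputs` is a hypothetical helper; raising an error when the number of names does not match, and naming a bare scalar "0", are assumptions of this sketch rather than documented behaviour.

```python
def name_outputs(result, output_names=None):
    """Hypothetical helper: attach column names to one row's function output."""
    if isinstance(result, dict):
        # Dictionary keys are the column names returned by the user function.
        names, values = list(result.keys()), list(result.values())
    elif isinstance(result, (list, tuple)):
        # Unnamed values are enumerated: "0", "1", ...
        values = list(result)
        names = [str(i) for i in range(len(values))]
    else:
        # Assumption: a bare scalar is treated as a single unnamed output.
        values, names = [result], ["0"]

    if output_names is not None:
        if isinstance(output_names, str):
            output_names = [output_names]
        if len(output_names) != len(values):
            # Assumption: this sketch rejects a mismatched number of names.
            raise ValueError("number of output_names must match number of outputs")
        names = list(output_names)
    return dict(zip(names, values))


enumerated = name_outputs((11.5, 11.1))
renamed = name_outputs((11.5, 11.1), ["mean", "mean_minus_base"])
overridden = name_outputs({"x": 1.0}, "y")
```

Returning a dictionary from func, or passing output_names, avoids the enumerated names shown in `enumerated`.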