Performance Impact of nested-pandas
For use cases involving nested data, nested-pandas can offer significant speedups over the native pandas API. The example below runs the same workflow in pandas and in nested-pandas: after a few filtering steps, it calculates the amplitude of the photometric fluxes for each object.
[ ]:
import nested_pandas as npd
import pandas as pd
import light_curve as licu
import numpy as np
from nested_pandas.utils import count_nested
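For orientation, the quantity computed at the end of both workflows can be sketched in plain Python. This is a minimal sketch that assumes "amplitude" means half of the peak-to-peak range, (max - min) / 2, which is how the light_curve package describes its Amplitude feature. The `half_amplitude` helper and the toy rows are hypothetical and are not part of either library.

```python
from collections import defaultdict


def half_amplitude(values):
    """Half of the peak-to-peak range: (max - min) / 2 (assumed definition)."""
    return (max(values) - min(values)) / 2.0


# Made-up (object_id, flux) source rows standing in for real photometry
rows = [(1, 10.0), (1, 14.0), (1, 12.0), (2, 3.0), (2, 4.0)]

# Group the flux values by object, then reduce each group to one number
flux_by_object = defaultdict(list)
for object_id, flux in rows:
    flux_by_object[object_id].append(flux)

amplitudes = {oid: half_amplitude(f) for oid, f in flux_by_object.items()}
print(amplitudes)  # {1: 2.0, 2: 0.5}
```

Both workflows below follow the same shape: group the sources by object, then reduce each group to a single value.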
Pandas
[5]:
%%timeit
# Read data
object_df = pd.read_parquet("objects.parquet")
source_df = pd.read_parquet("ztf_sources.parquet")
# Filter on object
filtered_object = object_df.query("ra > 10.0")
# Sync object to source: drops source rows whose index is not in the filtered objects
filtered_source = filtered_object[[]].join(source_df, how="left")
# Count number of observations per photometric band and add it to the object table
band_counts = (
    source_df.groupby(level=0)
    .apply(lambda x: x[["band"]].value_counts().reset_index())
    .pivot_table(values="count", index="index", columns="band", aggfunc="sum")
)
filtered_object = filtered_object.join(band_counts[["g", "r"]])
# Filter on the number of g-band observations, then re-sync the sources
filtered_object = filtered_object.query("g > 520")
filtered_source = filtered_object[[]].join(source_df, how="left")
# Calculate Amplitude
amplitude = licu.Amplitude()
filtered_source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))
498 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
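The `filtered_object[[]].join(source_df, how="left")` idiom in the cell above is easy to misread, so here is a small sketch of what it does. The toy frames are made-up stand-ins for `objects.parquet` and `ztf_sources.parquet`, and the sketch assumes, as the workflow does, that both tables share the object identifier as their index.

```python
import pandas as pd

# Toy stand-ins for the object and source tables (values are made up)
object_df = pd.DataFrame(
    {"ra": [5.0, 15.0, 20.0]},
    index=pd.Index([0, 1, 2], name="id"),
)
source_df = pd.DataFrame(
    {"band": ["g", "g", "r", "g"], "flux": [1.0, 2.0, 3.0, 4.0]},
    index=pd.Index([0, 1, 1, 2], name="id"),
)

filtered_object = object_df.query("ra > 10.0")  # keeps ids 1 and 2

# filtered_object[[]] keeps the filtered index but no columns, so the left
# join returns only the source rows whose index survived the object filter
filtered_source = filtered_object[[]].join(source_df, how="left")
print(filtered_source)
```

Because this is a left join, an object with no matching sources would still appear as a single row of missing values. The pandas workflow has to repeat this sync after every object-level filter, which is the bookkeeping the nested representation avoids.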
Nested-Pandas
[ ]:
%%timeit
# Read in parquet data, nesting sources into objects
nf = npd.read_parquet("objects.parquet")
nf = nf.join_nested(npd.read_parquet("ztf_sources.parquet"), "ztf_sources")
# Filter on object
nf = nf.query("ra > 10.0")
# Count number of observations per photometric band and add it as a column
nf = count_nested(nf, "ztf_sources", by="band", join=True) # use an existing utility
# Filter on the number of g-band observations
nf = nf.query("n_ztf_sources_g > 520")
# Calculate Amplitude
amplitude = licu.Amplitude()
nf.reduce(amplitude, "ztf_sources.mjd", "ztf_sources.flux")
228 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
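To put the two `%%timeit` results side by side, the relative speedup can be worked out directly from the reported means. This is simple arithmetic on the numbers above; it applies only to this dataset and machine and says nothing about other workloads.

```python
# Mean timings reported by %%timeit in the two cells above, in milliseconds
pandas_ms = 498.0
nested_pandas_ms = 228.0

speedup = pandas_ms / nested_pandas_ms
time_saved_pct = 100.0 * (1.0 - nested_pandas_ms / pandas_ms)

print(f"speedup: {speedup:.2f}x")  # speedup: 2.18x
print(f"time saved: {time_saved_pct:.0f}%")  # time saved: 54%
```

For this workflow, the nested-pandas version runs in a little under half the time of the pandas version.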