Performance Impact of nested-pandas


For workflows that operate on nested data, nested-pandas can be significantly faster than the equivalent native pandas code. The comparison below runs the same workflow both ways: it applies a few filtering steps and then calculates the amplitude of the photometric fluxes for each object. In the timings recorded here, the nested-pandas version takes 228 ms per loop against 498 ms for pandas, roughly a 2.2x speedup.
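
To make the workflow concrete, the sketch below walks through the same three steps (filter on an object column, filter on a per-band observation count, compute a per-object amplitude) on a tiny hand-made dataset in plain Python. The data values, the count threshold of 2, and the helper name `half_amplitude` are illustrative assumptions. It also assumes that the amplitude is half the peak-to-peak range of the flux, which should be checked against the `light_curve` documentation before relying on it. It is meant as a reference for what is being computed, not for how fast.

```python
from collections import defaultdict

# Hypothetical toy data: object id -> right ascension (degrees)
objects = {1: 5.0, 2: 12.0, 3: 20.0}

# Hypothetical source rows: (object id, band, mjd, flux)
sources = [
    (1, "g", 1.0, 10.0),
    (1, "g", 2.0, 14.0),
    (2, "g", 1.0, 3.0),
    (2, "g", 2.0, 9.0),
    (2, "g", 3.0, 5.0),
    (2, "r", 1.5, 4.0),
    (3, "g", 1.0, 7.0),
    (3, "r", 2.0, 8.0),
]


def half_amplitude(flux):
    """Half the peak-to-peak range (assumed definition of amplitude)."""
    return (max(flux) - min(flux)) / 2


# Step 1: filter on an object-level column
kept = {oid for oid, ra in objects.items() if ra > 10.0}

# Step 2: count g-band observations per kept object, then filter on the count
g_counts = defaultdict(int)
for oid, band, _mjd, _flux in sources:
    if oid in kept and band == "g":
        g_counts[oid] += 1
kept = {oid for oid in kept if g_counts[oid] > 2}

# Step 3: per-object amplitude over all remaining sources (all bands)
flux_by_object = defaultdict(list)
for oid, _band, _mjd, flux in sources:
    if oid in kept:
        flux_by_object[oid].append(flux)
amplitudes = {oid: half_amplitude(f) for oid, f in flux_by_object.items()}
print(amplitudes)
```

Only object 2 passes both filters in this toy dataset, so the result holds a single amplitude.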

[ ]:
import nested_pandas as npd
import pandas as pd
import light_curve as licu
import numpy as np

from nested_pandas.utils import count_nested

Pandas

[5]:
%%timeit

# Read data
object_df = pd.read_parquet("objects.parquet")
source_df = pd.read_parquet("ztf_sources.parquet")

# Filter on object
filtered_object = object_df.query("ra > 10.0")
# Sync object to source: drops source rows whose index is not in the filtered object table
filtered_source = filtered_object[[]].join(source_df, how="left")

# Count number of observations per photometric band and add it to the object table
band_counts = (
    source_df.groupby(level=0)
    .apply(lambda x: x[["band"]].value_counts().reset_index())
    .pivot_table(values="count", index="index", columns="band", aggfunc="sum")
)
filtered_object = filtered_object.join(band_counts[["g", "r"]])

# Filter on the number of g-band observations
filtered_object = filtered_object.query("g > 520")
filtered_source = filtered_object[[]].join(source_df, how="left")

# Calculate Amplitude
amplitude = licu.Amplitude()
filtered_source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))
498 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
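
The band-count step is the most involved part of the pandas version. As a point of comparison, the sketch below shows one alternative way to build the same kind of per-object, per-band count table with `groupby`, `size`, and `unstack`. The toy `source_df` and its index name `object_id` are hypothetical, and this variant was not benchmarked here, so it makes no claim about the timing reported above.

```python
import pandas as pd

# Hypothetical toy source table, indexed by object id like the real one
source_df = pd.DataFrame(
    {"band": ["g", "g", "r", "g", "r", "r"]},
    index=pd.Index([1, 1, 1, 2, 2, 2], name="object_id"),
)

# Group by the index level and the band column, count rows per group,
# then pivot the band level into columns (missing combinations become 0)
band_counts = (
    source_df.groupby([pd.Grouper(level=0), "band"])
    .size()
    .unstack("band", fill_value=0)
)
print(band_counts)
```

The result has one row per object and one column per band, so it can be joined onto the object table in the same way as `band_counts[["g", "r"]]` above.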

Nested-Pandas

[ ]:
%%timeit

# Read in parquet data
# nesting sources into objects
nf = npd.read_parquet("objects.parquet")
nf = nf.join_nested(npd.read_parquet("ztf_sources.parquet"), "ztf_sources")

# Filter on object
nf = nf.query("ra > 10.0")

# Count number of observations per photometric band and add it as a column
nf = count_nested(nf, "ztf_sources", by="band", join=True)  # use an existing utility

# Filter on the number of g-band observations
nf = nf.query("n_ztf_sources_g > 520")

# Calculate Amplitude
amplitude = licu.Amplitude()
nf.reduce(amplitude, "ztf_sources.mjd", "ztf_sources.flux")
228 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
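
The two timings above come from the `%%timeit` notebook magic with 7 runs of 1 loop each. A rough way to reproduce that measurement outside a notebook is sketched below with the standard-library `timeit` module. The `workflow` function is a hypothetical stand-in for either workflow, and the use of a population standard deviation is an assumption about how the notebook summarises its runs.

```python
import statistics
import timeit


def workflow():
    # Hypothetical stand-in: replace the body with either workflow above
    return sum(i * i for i in range(10_000))


# 7 runs of 1 loop each, matching the settings reported above
runs = timeit.repeat(workflow, repeat=7, number=1)
mean_ms = statistics.mean(runs) * 1e3
std_ms = statistics.pstdev(runs) * 1e3
print(f"{mean_ms:.3g} ms ± {std_ms:.3g} ms per loop")

# Ratio of the two mean timings reported in this document
speedup = 498 / 228
print(f"speedup ≈ {speedup:.2f}x")
```

Absolute timings depend on the machine and on the size of the input files, so the ratio between the two workflows is the more portable number.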