Loading Data into Nested-Pandas#

This notebook provides a brief introduction to loading data into nested-pandas or converting data into a nested structure. For an introduction to nested-pandas, see the quick start tutorial or the readthedocs page

Installation and Imports#

With a valid Python environment, nested-pandas and its dependencies are easy to install using the pip package manager. The following command can be used to install it:

[1]:
# % pip install nested-pandas
[2]:
import os
import tempfile

import pandas as pd

from nested_pandas import NestedFrame, read_parquet
from nested_pandas.datasets import generate_parquet_file

Overview#

Nested-pandas provides multiple mechanisms for loading data or converting data to the nested format. Below we walk through some of the common approaches.

Converting Flat Data#

Commonly existing data sets will be provided in “flat” data structures such as dictionaries or Pandas DataFrames. In these cases the data consists of a rectangular table where each row represents an instance or observation. Multiple instances of the same top-level item are linked together through an ID. All rows with the same ID correspond to the same object/item.

We define one such flat dataframe consisting of 10 rows for 3 distinct items.

[3]:
flat_df = pd.DataFrame(
    data={
        "a": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        "b": [2, 2, 2, 4, 4, 4, 6, 6, 6, 6],
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)
flat_df
[3]:
a b c d
0 1 2 0 5
0 1 2 2 4
0 1 2 4 7
1 2 4 1 5
1 2 4 4 3
1 2 4 3 1
2 3 6 1 9
2 3 6 4 3
2 3 6 1 4
2 3 6 1 1

The first column provides the object id. As we can see there are three rows with ID=0, three rows with ID=1, and four rows with ID=2. Some of the values are constant for each item. For example both columns “a” and “b” take a single value for object. We are wasting space by repeating them in every row. Other values are different per row (columns “c” and “d”).

As a concrete example, consider patient records. Each patient is assigned a unique id and has static data such as a date birth. They also have measurements that are new with every trip to the doctor, such as blood pressure or temperature.

Converting from Flat Pandas#

The easiest approach to converting the flat table above into a nested structure is to use NestedFrame.from_flat(). This function takes

  • a list of columns that are not nested (base_columns)

  • a list of columns to nest (nested_columns)

  • the name of the nested column (name) Rows are associated using the index by default, but a column name on which to join can also be provided.

[4]:
nf = NestedFrame.from_flat(
    flat_df,
    base_columns=["a", "b"],  # the columns not to nest
    nested_columns=["c", "d"],  # the columns to nest
    name="nested",  # name of the nested column
)
nf
[4]:
  a b nested
0 1 2
c d
0 5
+2 rows ...
1 2 4
c d
1 5
+2 rows ...
2 3 6
c d
1 9
+3 rows ...
3 rows x 3 columns

Inserting Nested Rows#

Alternatively, we can use the NestedFrame constructor to create our base frame from a dictionary of our columns (as we would do with a normal pandas DataFrame). This defines the top-level objects and the values that are constant across rows (“a” and “b”).

[5]:
nf = NestedFrame(
    data={
        "a": [1, 2, 3],
        "b": [2, 4, 6],
    },
    index=[0, 1, 2],
)
nf
[5]:
a b
0 1 2
1 2 4
2 3 6

3 rows × 2 columns

We can then create an additional pandas dataframes for the nested columns and pack them into our NestedFrame with NestedFrame.join_nested() function. join_nested will align the nest based on the index by default (one or more columns may be selected instead via the on kwarg), as we see the nested DataFrame has a repeated index corresponding to the nf NestedFrame.

[6]:
nested = pd.DataFrame(
    data={
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)

nf = nf.join_nested(nested, "nested")
nf
[6]:
  a b nested
0 1 2
c d
0 5
+2 rows ...
1 2 4
c d
1 5
+2 rows ...
2 3 6
c d
1 9
+3 rows ...
3 rows x 3 columns

The “index” parameter is used to perform the association. All of the values for index=0 are bundled together into a sub-table and stored in row 0’s “nested” column.

[7]:
nf.loc[0]["nested"]
[7]:
c d
0 0 5
1 2 4
2 4 7

We could add other nested columns by creating new sub-tables and adding them with join_nested(). Note that while the tables added with each join_nested() must be rectangular, they do not need to have the same dimensions between calls. We could add another nested row with a different number of observations.

[8]:
nested = pd.DataFrame(
    data={
        "c": [0, 1, 0, 1, 2, 0],
        "d": [5, 4, 5, 4, 3, 5],
    },
    index=[0, 0, 1, 1, 1, 2],
)

nf = nf.join_nested(nested, "nested2")
nf
[8]:
  a b nested nested2
0 1 2
c d
0 5
+2 rows ...
c d
0 5
1 4
1 2 4
c d
1 5
+2 rows ...
c d
0 5
+2 rows ...
2 3 6
c d
1 9
+3 rows ...
c d
0 5
3 rows x 4 columns

When the join key spans multiple columns, pass a list to on. The matching columns are used only as join keys and are dropped from the nested structure, while the original index is preserved.

[9]:
nested = pd.DataFrame(
    data={
        "a": [1, 1, 2, 3],
        "b": [2, 2, 4, 5],
        "c": [100, 200, 300, 400],
    }
)

nf = nf.join_nested(nested, "nested3", on=["a", "b"])
nf
[9]:
  a b nested nested2 nested3
0 1 2
c d
0 5
+2 rows ...
c d
0 5
1 4
c
100
200
1 2 4
c d
1 5
+2 rows ...
c d
0 5
+2 rows ...
c
300
2 3 6
c d
1 9
+3 rows ...
c d
0 5
None
3 rows x 5 columns

Loading Data from Parquet Files#

For larger datasets, we support loading data from parquet files. In the following cell, we generate a series of temporary parquet files with random data, and ingest them with the read_parquet method:

[10]:
# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    # Generates parquet files with random data within our temporary directory
    generate_parquet_file(10, {"nested1": 100, "nested2": 10}, os.path.join(temp_path, "test.parquet"))

    # Read the parquet file to a NestedFrame
    nf = read_parquet(os.path.join(temp_path, "test.parquet"))

Nested-Pandas nested columns are compatible with the parquet format, meaning they can be written and read from parquet natively.

[11]:
nf  # nf contains nested columns
[11]:
  a b nested1 nested2
0 0.427066 0.136776
t flux flux_error band
5.419617 43.849803 1.0 g
+99 rows ... ... ...
t flux flux_error band
19.495253 8.251719 1.0 r
+9 rows ... ... ...
1 0.211671 1.326893
t flux flux_error band
8.335402 85.171015 1.0 r
+99 rows ... ... ...
t flux flux_error band
19.53081 96.874015 1.0 r
+9 rows ... ... ...
2 0.055414 1.082414
t flux flux_error band
19.355955 94.628664 1.0 r
+99 rows ... ... ...
t flux flux_error band
4.691422 11.562981 1.0 r
+9 rows ... ... ...
3 0.775171 0.387835
t flux flux_error band
19.033199 75.098404 1.0 r
+99 rows ... ... ...
t flux flux_error band
1.955557 80.127908 1.0 g
+9 rows ... ... ...
4 0.137772 1.039994
t flux flux_error band
5.387827 51.88141 1.0 g
+99 rows ... ... ...
t flux flux_error band
19.929865 74.132066 1.0 g
+9 rows ... ... ...
5 0.253176 0.974404
t flux flux_error band
13.755509 14.049722 1.0 r
+99 rows ... ... ...
t flux flux_error band
7.420792 83.609402 1.0 g
+9 rows ... ... ...
6 0.860085 0.090302
t flux flux_error band
16.859351 79.247587 1.0 r
+99 rows ... ... ...
t flux flux_error band
18.108278 77.695262 1.0 r
+9 rows ... ... ...
7 0.425566 0.216094
t flux flux_error band
17.783718 20.052159 1.0 g
+99 rows ... ... ...
t flux flux_error band
16.896547 99.459486 1.0 r
+9 rows ... ... ...
8 0.290725 0.863737
t flux flux_error band
18.091149 90.084562 1.0 g
+99 rows ... ... ...
t flux flux_error band
9.370695 52.067879 1.0 r
+9 rows ... ... ...
9 0.059254 0.704853
t flux flux_error band
18.312317 52.530806 1.0 r
+99 rows ... ... ...
t flux flux_error band
12.02237 45.384563 1.0 r
+9 rows ... ... ...
10 rows x 4 columns

Saving NestedFrames to Parquet Files#

Additionally we can save an existing NestedFrame as a parquet file using NestedFrame.to_parquet.

Note: Nested-Pandas converts any nested columns to pyarrow datatypes when writing to parquet, meaning that parquet files with nested columns can be read by a parquet reader from other packages so long as they understand pyarrow dtypes.

[12]:
# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    nf.to_parquet(
        os.path.join(temp_path, "output.parquet"),  # The output file path
    )

    # List the files in temp_path to ensure they were saved correctly.
    print("The NestedFrame was saved to the following parquet files :", os.listdir(temp_path))
The NestedFrame was saved to the following parquet files : ['output.parquet']