Loading Data into Nested-Pandas#
This notebook provides a brief introduction to loading data into nested-pandas or converting data into a nested structure. For an introduction to nested-pandas, see the quick start tutorial or the readthedocs page
Installation and Imports#
With a valid Python environment, nested-pandas and its dependencies are easy to install using the pip package manager. The following command can be used to install it:
[1]:
# % pip install nested-pandas
[2]:
import os
import tempfile
import pandas as pd
from nested_pandas import NestedFrame, read_parquet
from nested_pandas.datasets import generate_parquet_file
Overview#
Nested-pandas provides multiple mechanisms for loading data or converting data to the nested format. Below we walk through some of the common approaches.
Converting Flat Data#
Commonly existing data sets will be provided in “flat” data structures such as dictionaries or Pandas DataFrames. In these cases the data consists of a rectangular table where each row represents an instance or observation. Multiple instances of the same top-level item are linked together through an ID. All rows with the same ID correspond to the same object/item.
We define one such flat dataframe consisting of 10 rows for 3 distinct items.
[3]:
flat_df = pd.DataFrame(
data={
"a": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
"b": [2, 2, 2, 4, 4, 4, 6, 6, 6, 6],
"c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
"d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
},
index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)
flat_df
[3]:
| a | b | c | d | |
|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 5 |
| 0 | 1 | 2 | 2 | 4 |
| 0 | 1 | 2 | 4 | 7 |
| 1 | 2 | 4 | 1 | 5 |
| 1 | 2 | 4 | 4 | 3 |
| 1 | 2 | 4 | 3 | 1 |
| 2 | 3 | 6 | 1 | 9 |
| 2 | 3 | 6 | 4 | 3 |
| 2 | 3 | 6 | 1 | 4 |
| 2 | 3 | 6 | 1 | 1 |
The first column provides the object id. As we can see there are three rows with ID=0, three rows with ID=1, and four rows with ID=2. Some of the values are constant for each item. For example both columns “a” and “b” take a single value for object. We are wasting space by repeating them in every row. Other values are different per row (columns “c” and “d”).
As a concrete example, consider patient records. Each patient is assigned a unique id and has static data such as a date birth. They also have measurements that are new with every trip to the doctor, such as blood pressure or temperature.
Converting from Flat Pandas#
The easiest approach to converting the flat table above into a nested structure is to use NestedFrame.from_flat(). This function takes
a list of columns that are not nested (base_columns)
a list of columns to nest (nested_columns)
the name of the nested column (name) Rows are associated using the index by default, but a column name on which to join can also be provided.
[4]:
nf = NestedFrame.from_flat(
flat_df,
base_columns=["a", "b"], # the columns not to nest
nested_columns=["c", "d"], # the columns to nest
name="nested", # name of the nested column
)
nf
[4]:
| a | b | nested | |||||||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 |
|
||||||
| 1 | 2 | 4 |
|
||||||
| 2 | 3 | 6 |
|
Inserting Nested Rows#
Alternatively, we can use the NestedFrame constructor to create our base frame from a dictionary of our columns (as we would do with a normal pandas DataFrame). This defines the top-level objects and the values that are constant across rows (“a” and “b”).
[5]:
nf = NestedFrame(
data={
"a": [1, 2, 3],
"b": [2, 4, 6],
},
index=[0, 1, 2],
)
nf
[5]:
| a | b | |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 2 | 4 |
| 2 | 3 | 6 |
3 rows × 2 columns
We can then create an additional pandas dataframes for the nested columns and pack them into our NestedFrame with NestedFrame.join_nested() function. join_nested will align the nest based on the index by default (one or more columns may be selected instead via the on kwarg), as we see the nested DataFrame has a repeated index corresponding to the nf NestedFrame.
[6]:
nested = pd.DataFrame(
data={
"c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
"d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
},
index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)
nf = nf.join_nested(nested, "nested")
nf
[6]:
| a | b | nested | |||||||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 |
|
||||||
| 1 | 2 | 4 |
|
||||||
| 2 | 3 | 6 |
|
The “index” parameter is used to perform the association. All of the values for index=0 are bundled together into a sub-table and stored in row 0’s “nested” column.
[7]:
nf.loc[0]["nested"]
[7]:
| c | d | |
|---|---|---|
| 0 | 0 | 5 |
| 1 | 2 | 4 |
| 2 | 4 | 7 |
We could add other nested columns by creating new sub-tables and adding them with join_nested(). Note that while the tables added with each join_nested() must be rectangular, they do not need to have the same dimensions between calls. We could add another nested row with a different number of observations.
[8]:
nested = pd.DataFrame(
data={
"c": [0, 1, 0, 1, 2, 0],
"d": [5, 4, 5, 4, 3, 5],
},
index=[0, 0, 1, 1, 1, 2],
)
nf = nf.join_nested(nested, "nested2")
nf
[8]:
| a | b | nested | nested2 | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 |
|
|
||||||||||||
| 1 | 2 | 4 |
|
|
||||||||||||
| 2 | 3 | 6 |
|
|
When the join key spans multiple columns, pass a list to on. The matching columns are used only as join keys and are dropped from the nested structure, while the original index is preserved.
[9]:
nested = pd.DataFrame(
data={
"a": [1, 1, 2, 3],
"b": [2, 2, 4, 5],
"c": [100, 200, 300, 400],
}
)
nf = nf.join_nested(nested, "nested3", on=["a", "b"])
nf
[9]:
| a | b | nested | nested2 | nested3 | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 |
|
|
|
|||||||||||||||
| 1 | 2 | 4 |
|
|
|
|||||||||||||||
| 2 | 3 | 6 |
|
|
None |
Loading Data from Parquet Files#
For larger datasets, we support loading data from parquet files. In the following cell, we generate a series of temporary parquet files with random data, and ingest them with the read_parquet method:
[10]:
# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
# Generates parquet files with random data within our temporary directory
generate_parquet_file(10, {"nested1": 100, "nested2": 10}, os.path.join(temp_path, "test.parquet"))
# Read the parquet file to a NestedFrame
nf = read_parquet(os.path.join(temp_path, "test.parquet"))
Nested-Pandas nested columns are compatible with the parquet format, meaning they can be written and read from parquet natively.
[11]:
nf # nf contains nested columns
[11]:
| a | b | nested1 | nested2 | |||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.427066 | 0.136776 |
|
|
||||||||||||||||||||||||
| 1 | 0.211671 | 1.326893 |
|
|
||||||||||||||||||||||||
| 2 | 0.055414 | 1.082414 |
|
|
||||||||||||||||||||||||
| 3 | 0.775171 | 0.387835 |
|
|
||||||||||||||||||||||||
| 4 | 0.137772 | 1.039994 |
|
|
||||||||||||||||||||||||
| 5 | 0.253176 | 0.974404 |
|
|
||||||||||||||||||||||||
| 6 | 0.860085 | 0.090302 |
|
|
||||||||||||||||||||||||
| 7 | 0.425566 | 0.216094 |
|
|
||||||||||||||||||||||||
| 8 | 0.290725 | 0.863737 |
|
|
||||||||||||||||||||||||
| 9 | 0.059254 | 0.704853 |
|
|
Saving NestedFrames to Parquet Files#
Additionally we can save an existing NestedFrame as a parquet file using NestedFrame.to_parquet.
Note: Nested-Pandas converts any nested columns to pyarrow datatypes when writing to parquet, meaning that parquet files with nested columns can be read by a parquet reader from other packages so long as they understand pyarrow dtypes.
[12]:
# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
nf.to_parquet(
os.path.join(temp_path, "output.parquet"), # The output file path
)
# List the files in temp_path to ensure they were saved correctly.
print("The NestedFrame was saved to the following parquet files :", os.listdir(temp_path))
The NestedFrame was saved to the following parquet files : ['output.parquet']