Skip to content

xarray

Scikit-learn with xarray

In the previous post I used linear regression to predict water solubility. I further described how I encountered the lack of support for multi-dimensionality in the Pandas DataFrames, and how that led to problems with dimensions in scikit-learn.

Specifically, Pandas DataFrames store fingerprints as 1-dimensional 'cells', causes an error due to 'incorrect' dimensions

LinearRegression().fit(train.fps, train.expt)
                             ^^^

While the simple workaround .to_list() works, I have been wanting to try xarray for some time. It is a Pandas-like library that explicitly supports multi-dimensional arrays.

It has some very interesting features: * aforementioned multi-dimensional support * Nesting / NetCDF support * Dask support * built on top of NumPy and Pandas

Here is the updated code:

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/SAMPL.csv")

# Pandas is interoperable with xarray
# convert to a dataset
ds = df.to_xarray()

fpgen = AllChem.GetMorganGenerator(radius=3, fpSize=2048)
mols = [Chem.MolFromSmiles(smiles) for smiles in df["smiles"]]

# this is a "data variable"
# the global "index" in the dataset will select the row
# P.S. That's where the magic happens!
ds["fp"] = (("index", "bit"), [fpgen.GetFingerprintAsNumPy(mol) for mol in mols])

# the dataset cannot be split the same way as a DataFrame or NumPy array
# therefore, it is not a direct drop-in replacement
train_idx, test_idx = train_test_split(ds.index)

# recover the selected subsets
train = ds.isel(index=train_idx)
test = ds.isel(index=test_idx)

# Voila! the dimensions are correct
reg = LinearRegression().fit(train.fp, train.expt)

score = reg.score(test.fp, test.expt)

Notice that I refer to index as a dimension. By default, data variables should have the same first dimension, which, in our case, is the index that identifies each row. We explicitly state that with ("index", "bit"). This of course requires that a fingerprint is provided for each row in the dataset.

In an ideal world, the xarray dataset would work with scikit-learn without requiring any extra work on the indices. Nevertheless, if you have a highly multidimensional dataset, xarray could overall bring a lot of clarity.

Now, if you decide to convert your xarray dataset back to a DataFrame

import xarray as xr
xr.Dataset.to_dataframe(ds)

it will result in a whopping 1,314,816 rows, where the fingerprint dimension is spread or flattened into the correct 2D dataframe.