A blog about small things and smaller things still

Scikit-learn with xarray

In the previous post I used linear regression to predict water solubility. I also described how I ran into Pandas DataFrames' lack of support for multi-dimensional data, and how that led to problems with dimensions in scikit-learn.

Specifically, Pandas DataFrames store fingerprints as 1-dimensional 'cells', which causes an error due to 'incorrect' dimensions:

LinearRegression().fit(train.fps, train.expt)
                             ^^^

While the simple workaround .to_list() works, I have been wanting to try xarray for some time. It is a Pandas-like library that explicitly supports multi-dimensional arrays.
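For reference, the workaround from that post amounts to a single change at the fitting step:

# .to_list() yields a list of equal-length NumPy arrays,
# which scikit-learn stacks into the expected (n_rows, 2048) matrix
LinearRegression().fit(train.fps.to_list(), train.expt)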

It has some very interesting features:

* the aforementioned multi-dimensional support (sketched right below)
* nesting / NetCDF support
* Dask support
* built on top of NumPy and Pandas
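To get a feel for the multi-dimensional support, here is a minimal sketch; the temperature, time, and station names are invented purely for illustration:

import numpy as np
import xarray as xr

# a 2-D data variable with labelled dimensions instead of a flat table
ds = xr.Dataset(
    {"temperature": (("time", "station"), np.random.rand(3, 4))},
    coords={"time": [2021, 2022, 2023], "station": list("ABCD")},
)

# label-based selection works along any dimension
print(ds.temperature.sel(time=2022, station="B"))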

Here is the updated code:

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/SAMPL.csv")

# Pandas is interoperable with xarray
# convert to a dataset
ds = df.to_xarray()

fpgen = AllChem.GetMorganGenerator(radius=3, fpSize=2048)
mols = [Chem.MolFromSmiles(smiles) for smiles in df["smiles"]]

# this is a "data variable"
# the global "index" in the dataset will select the row
# P.S. That's where the magic happens!
ds["fp"] = (("index", "bit"), [fpgen.GetFingerprintAsNumPy(mol) for mol in mols])

# the dataset cannot be split the same way as a DataFrame or NumPy array
# therefore, it is not a direct drop-in replacement
train_idx, test_idx = train_test_split(ds.index)

# recover the selected subsets
train = ds.isel(index=train_idx)
test = ds.isel(index=test_idx)

# Voila! the dimensions are correct
reg = LinearRegression().fit(train.fp, train.expt)

score = reg.score(test.fp, test.expt)

Notice that I refer to index as a dimension. For the dataset to behave like a table, the data variables should share the same first dimension, which in our case is the index that identifies each row. We state that explicitly with ("index", "bit"). This of course requires that a fingerprint is provided for every row in the dataset.
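A quick way to confirm how the variables line up is to inspect their dimensions directly; this is just a sanity check, not part of the pipeline (FreeSolv holds 642 molecules, hence the row count):

print(ds["fp"].dims)    # ('index', 'bit')
print(ds["fp"].shape)   # (642, 2048) -- one fingerprint per row
print(ds["expt"].dims)  # ('index',)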

In an ideal world, the xarray dataset would work with scikit-learn without requiring any extra work on the indices. Nevertheless, if you have a highly multi-dimensional dataset, xarray can bring a lot of clarity.

Now, if you decide to convert your xarray dataset back to a DataFrame

ds.to_dataframe()

it will result in a whopping 1,314,816 rows: the fingerprint dimension is unrolled, so each (index, bit) pair gets its own row in the resulting 2D DataFrame.
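The arithmetic checks out: 642 molecules times 2048 bits per fingerprint gives exactly 1,314,816 rows. A quick way to verify:

flat = ds.to_dataframe()

# the flattened frame has one row per molecule-bit combination
print(len(flat))                            # 1314816
print(ds.sizes["index"] * ds.sizes["bit"])  # 642 * 2048 = 1314816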

Linear Regression on RDKit fingerprints

Here is a simple prediction of water solubility using the FreeSolv database. The solubility is given as the log solubility in moles per litre.

I apply linear regression to Morgan fingerprints. By default, splitting uses 75% of the dataset for training, with the remainder reserved for testing.

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/SAMPL.csv")

fpgen = AllChem.GetMorganGenerator(radius=3, fpSize=2048)

# add a column with fingerprints
df["fps"] = df["smiles"].apply(
    lambda smiles: fpgen.GetFingerprintAsNumPy(Chem.MolFromSmiles(smiles))
)

train, test = train_test_split(df)

reg = LinearRegression().fit(train.fps.to_list(), train.expt)
score = reg.score(test.fps.to_list(), test.expt)

Score

The average score is typically between 0.70 and 0.75, while the best possible value is 1. Specifically, the score is defined as the coefficient of determination:

\[ R^2 = 1 - \frac{\sum (y - y_{pred})^2} {\sum (y - \bar{y})^2} \]

When every prediction is perfect, the numerator becomes 0, so \(R^2\) reaches 1.
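As a sanity check, the formula can be computed by hand and compared against scikit-learn's implementation; the numbers below are made up for illustration:

import numpy as np
from sklearn.metrics import r2_score

y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# coefficient of determination, straight from the definition
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
print(1 - ss_res / ss_tot)  # 0.98
print(r2_score(y, y_pred))  # 0.98 -- the two values agree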

It is interesting to investigate the impact of different splits on performance.
Depending on which part of the dataset is used for training and testing, results can vary significantly:

[Figure: distribution of \(R^2\) scores across different train/test splits]
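Here is a minimal sketch of how such a comparison could be produced, reusing df and the imports from the code above:

import numpy as np

# score the same model across several random splits
scores = []
for seed in range(20):
    train, test = train_test_split(df, random_state=seed)
    reg = LinearRegression().fit(train.fps.to_list(), train.expt)
    scores.append(reg.score(test.fps.to_list(), test.expt))

print(f"mean: {np.mean(scores):.2f}, min: {min(scores):.2f}, max: {max(scores):.2f}")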

Pandas, NumPy and scikit-learn intersection

Although this code is straightforward, there is an interesting trap: I used a Pandas DataFrame and stored the fingerprints as "cells", which would ideally let us do:

LinearRegression().fit(train.fps, train.expt)

However, this fails because the fingerprints add another dimension: converting the column with .to_numpy() yields a one-dimensional array of objects (shape (rows_n,)) instead of the expected two-dimensional shape (rows_n, 2048), which leads to a conversion error.

One workaround is to use .to_list(), which hands scikit-learn a list of per-row fingerprint arrays that it can stack into the expected (rows_n, 2048) matrix.
So watch out for dimensional mismatches when using fingerprints in scikit-learn.
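The difference is easy to see by inspecting the shapes directly, again assuming the df from the code above:

import numpy as np

# an object array of arrays -- not what scikit-learn expects
print(df.fps.to_numpy().shape)           # (642,)

# stacking the per-row arrays restores the second dimension
print(np.stack(df.fps.to_list()).shape)  # (642, 2048)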