Scikit-learn with xarray
In the previous post I used linear regression to predict water solubility. I also described how I ran into the lack of support for multi-dimensional data in Pandas DataFrames, and how that led to problems with dimensions in scikit-learn.
Specifically, Pandas DataFrames store fingerprints as 1-dimensional 'cells', which causes an error due to 'incorrect' dimensions.
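To make the dimension problem concrete, here is a minimal sketch that does not need RDKit. The 4-bit arrays and solubility values are made-up placeholders, not real Morgan fingerprints or SAMPL data. It shows that NumPy sees a column of arrays as a 1-dimensional object array, while stacking the cells via .to_list() gives the 2D matrix that scikit-learn expects.

```python
import numpy as np
import pandas as pd

# Placeholder "fingerprints": three made-up 4-bit vectors
fps = [np.array([1, 0, 0, 1]), np.array([0, 1, 1, 0]), np.array([1, 1, 0, 0])]
df = pd.DataFrame({"expt": [-1.2, -3.4, -0.5]})
df["fp"] = fps  # each cell holds a whole array

# The column itself is 1-dimensional: one Python object per row
as_column = df["fp"].to_numpy()
print(as_column.shape, as_column.dtype)

# The workaround: stack the per-row arrays into a proper 2D matrix
as_matrix = np.array(df["fp"].to_list())
print(as_matrix.shape)
```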
While the simple workaround .to_list() works,
I have been wanting to try xarray for some time.
It is a Pandas-like library that explicitly
supports multi-dimensional arrays.
It has some very interesting features:
* the aforementioned multi-dimensional support
* Nesting / NetCDF support
* Dask support
* built on top of NumPy and Pandas
Here is the updated code:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/SAMPL.csv")
# Pandas is interoperable with xarray
# convert to a dataset
ds = df.to_xarray()
fpgen = AllChem.GetMorganGenerator(radius=3, fpSize=2048)
mols = [Chem.MolFromSmiles(smiles) for smiles in df["smiles"]]
# this is a "data variable"
# the global "index" in the dataset will select the row
# P.S. That's where the magic happens!
ds["fp"] = (("index", "bit"), [fpgen.GetFingerprintAsNumPy(mol) for mol in mols])
# the dataset cannot be split the same way as a DataFrame or NumPy array
# therefore, it is not a direct drop-in replacement
train_idx, test_idx = train_test_split(ds.index)
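# recover the selected subsets; isel() selects by position, which matches
# the index labels here because the index is the default 0..n-1 range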
train = ds.isel(index=train_idx)
test = ds.isel(index=test_idx)
# Voila! the dimensions are correct
reg = LinearRegression().fit(train.fp, train.expt)
score = reg.score(test.fp, test.expt)
Notice that I refer to "index" as a dimension.
After df.to_xarray(), every data variable has "index"
as its dimension, which identifies each row. With
("index", "bit") we explicitly state that the fingerprints
share that first dimension and add a second one, "bit". This
of course requires that a fingerprint is provided
for each row in the dataset.
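The role of the dimensions can be checked on a toy dataset. This is only a sketch: the three-row table and the all-zero 4-bit "fingerprints" are placeholders I made up, and it assumes xarray is installed so that df.to_xarray() works. It prints the dimensions of a plain column and of the fingerprint variable, and shows that a variable with the wrong length along "index" is rejected.

```python
import numpy as np
import pandas as pd

# Placeholder table standing in for the SAMPL data (values are made up)
df = pd.DataFrame({"smiles": ["C", "CC", "CCC"], "expt": [-1.0, -2.0, -3.0]})
ds = df.to_xarray()  # needs xarray installed

# A 2D data variable: one row per "index" entry, one column per "bit"
ds["fp"] = (("index", "bit"), np.zeros((3, 4), dtype=np.uint8))
print(ds["expt"].dims)  # dimensions of a plain column
print(ds["fp"].dims)    # dimensions of the fingerprint variable

# A variable with only two rows does not fit the three-row "index" dimension
try:
    ds["bad"] = (("index", "bit"), np.zeros((2, 4), dtype=np.uint8))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```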
In an ideal world, the xarray dataset would work with scikit-learn without requiring any extra work on the indices. Nevertheless, if you have a highly multidimensional dataset, xarray could bring a lot of clarity overall.
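Until then, the index bookkeeping can be hidden in a small wrapper. The function split_dataset below is a hypothetical helper of my own, not part of xarray or scikit-learn, and the toy data is made up. It splits integer positions with train_test_split and selects them with isel(), so it should not depend on what the index labels look like.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(ds, dim="index", **kwargs):
    """Hypothetical helper: split an xarray Dataset along one dimension."""
    # Split integer positions rather than labels, then select by position
    positions = np.arange(ds.sizes[dim])
    train_pos, test_pos = train_test_split(positions, **kwargs)
    return ds.isel({dim: train_pos}), ds.isel({dim: test_pos})


# Toy dataset: 8 rows with placeholder targets and 4-bit "fingerprints"
df = pd.DataFrame({"expt": np.linspace(-5.0, 0.0, 8)})
ds = df.to_xarray()  # needs xarray installed
ds["fp"] = (("index", "bit"), np.ones((8, 4), dtype=np.uint8))

train, test = split_dataset(ds, test_size=0.25, random_state=0)
print(train.sizes["index"], test.sizes["index"])
```

Extra keyword arguments such as test_size and random_state are passed straight through to train_test_split.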
Now, if you decide to convert your xarray dataset back to a DataFrame,
it will result in a whopping 1,314,816 rows (642 molecules × 2048 bits), because the fingerprint dimension is flattened into one row per (index, bit) pair to fit a 2D DataFrame.
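The same flattening can be seen at a smaller scale. In this sketch the three-row table and the 4-bit "fingerprints" are placeholders, and xarray is assumed to be installed. Converting back with to_dataframe() gives one row for every combination of index and bit.

```python
import numpy as np
import pandas as pd

# Placeholder data: 3 molecules with 4-bit "fingerprints"
df = pd.DataFrame({"smiles": ["C", "CC", "CCC"], "expt": [-1.0, -2.0, -3.0]})
ds = df.to_xarray()  # needs xarray installed
ds["fp"] = (("index", "bit"), np.zeros((3, 4), dtype=np.uint8))

# Back to Pandas: every (index, bit) pair becomes its own row
flat = ds.to_dataframe()
print(len(flat))          # number of rows after flattening
print(flat.index.names)   # the two dimensions become a MultiIndex
```

With 3 rows and 4 bits that is 12 rows, and the per-molecule columns such as expt are repeated for each bit.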