Scikit-learn with xarray
In the previous post I used linear regression to predict water solubility. I also described how I ran into the lack of multi-dimensional support in Pandas DataFrames, and how that led to problems with dimensions in scikit-learn.
Specifically, a Pandas DataFrame stores each fingerprint inside a single 'cell' of a 1-dimensional column, which causes an error due to 'incorrect' dimensions.
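The dimension issue can be reproduced without RDKit. The following is a minimal sketch that uses made-up 4-bit arrays as stand-ins for real fingerprints; the values and the column name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for fingerprints: three 4-bit arrays (made-up values).
fps = [np.array([1, 0, 0, 1]), np.array([0, 1, 1, 0]), np.array([1, 1, 0, 0])]
df = pd.DataFrame({"fp": fps})

# The column is a 1-D object array whose elements happen to be arrays,
# so there is no second "bit" axis for an estimator to see.
as_column = df["fp"].to_numpy()
print(as_column.shape, as_column.dtype)

# Converting the column to a list first lets NumPy stack the
# equal-length arrays into a proper 2-D (rows x bits) matrix.
as_matrix = np.array(df["fp"].to_list())
print(as_matrix.shape)
```

The first array has only one axis, whereas the stacked version has the (samples, features) layout that scikit-learn estimators expect.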
While the simple workaround .to_list() works, I have been wanting to try xarray for some time. It is a Pandas-like library that explicitly supports multi-dimensional arrays.
It has some very interesting features:
* the aforementioned multi-dimensional support
* Nesting / NetCDF support
* Dask support
* built on top of NumPy and Pandas
Here is the updated code:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/SAMPL.csv")
# Pandas is interoperable with xarray
# convert to a dataset
ds = df.to_xarray()
fpgen = AllChem.GetMorganGenerator(radius=3, fpSize=2048)
mols = [Chem.MolFromSmiles(smiles) for smiles in df["smiles"]]
# this is a "data variable"
# the global "index" in the dataset will select the row
# P.S. That's where the magic happens!
ds["fp"] = (("index", "bit"), [fpgen.GetFingerprintAsNumPy(mol) for mol in mols])
# the dataset cannot be split the same way as a DataFrame or NumPy array
# therefore, it is not a direct drop-in replacement
train_idx, test_idx = train_test_split(ds.index)
# recover the selected subsets
train = ds.isel(index=train_idx)
test = ds.isel(index=test_idx)
# Voila! the dimensions are correct
reg = LinearRegression().fit(train.fp, train.expt)
score = reg.score(test.fp, test.expt)
Notice that I refer to index as a dimension. By default, data variables should have the same first dimension, which, in our case, is the index that identifies each row. We explicitly state that with ("index", "bit"). This of course requires that a fingerprint is provided for each row in the dataset.
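To make the role of the shared dimension concrete, here is a small sketch on a toy dataset. The SMILES strings, the solubility values and the random 8-bit "fingerprints" are all made up for illustration; only the ("index", "bit") pattern comes from the code above.

```python
import numpy as np
import pandas as pd
import xarray as xr  # needed by DataFrame.to_xarray()

# Toy table with an unnamed default index; to_xarray() names that dimension "index".
df = pd.DataFrame({"smiles": ["CCO", "CCN", "CCC"], "expt": [-5.0, -4.3, 2.0]})
ds = df.to_xarray()

# Made-up 8-bit fingerprints, one row per molecule.
rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(3, 8), dtype=np.uint8)

# The tuple names the dimensions: rows share "index", columns get a new "bit" dimension.
ds["fp"] = (("index", "bit"), fps)

print(dict(ds.sizes))
print(ds["expt"].dims, ds["fp"].dims)

# Selecting along "index" slices every data variable consistently.
subset = ds.isel(index=[0, 2])
print(subset["expt"].shape, subset["fp"].shape)
```

Because both variables are declared on the same "index" dimension, a single selection keeps the targets and the fingerprints in step.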
In an ideal world, the xarray dataset would work with scikit-learn without requiring any extra work on the indices. Nevertheless, if you have a highly multidimensional dataset, xarray could overall bring a lot of clarity.
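One way to hide the extra index handling is a small wrapper. The sketch below is only an illustration: split_dataset is a hypothetical helper name, not part of xarray or scikit-learn, and the toy data is made up. It splits integer positions with train_test_split and then selects them with isel.

```python
import numpy as np
import pandas as pd
import xarray as xr
from sklearn.model_selection import train_test_split


def split_dataset(ds: xr.Dataset, dim: str = "index", **kwargs):
    """Hypothetical helper: split a dataset along one dimension.

    Extra keyword arguments are passed on to train_test_split.
    """
    # Split positions rather than labels, so that isel() is always valid.
    positions = np.arange(ds.sizes[dim])
    train_pos, test_pos = train_test_split(positions, **kwargs)
    return ds.isel({dim: train_pos}), ds.isel({dim: test_pos})


# Toy dataset: 8 rows with made-up targets and 4-bit "fingerprints".
ds = pd.DataFrame({"expt": np.linspace(-3.0, 3.0, 8)}).to_xarray()
ds["fp"] = (("index", "bit"), np.eye(8, 4, dtype=np.uint8))

train, test = split_dataset(ds, test_size=0.25, random_state=0)
print(train.sizes["index"], test.sizes["index"])
```

Splitting positions instead of index labels is a design choice here: it keeps the helper correct even when the index labels are not a simple 0..n-1 range.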
Now, if you decide to convert your xarray dataset back to a DataFrame, it will result in a whopping 1,314,816 rows (642 molecules × 2,048 bits), because the fingerprint dimension is flattened into a long-format 2D DataFrame with one row per molecule-bit pair.
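The row count follows from the flattening: every combination of index and bit becomes its own row. The sketch below shows this on a toy dataset with made-up values (3 rows, 8 bits). It also shows, as one possible alternative, converting only the fingerprint variable, which gives a wide table instead.

```python
import numpy as np
import pandas as pd

# Toy dataset: 3 rows and 8-bit all-zero "fingerprints" (made-up values).
# DataFrame.to_xarray() requires xarray to be installed.
ds = pd.DataFrame({"expt": [-5.0, -4.3, 2.0]}).to_xarray()
ds["fp"] = (("index", "bit"), np.zeros((3, 8), dtype=np.uint8))

# Converting the whole dataset broadcasts every variable over (index, bit).
long_df = ds.to_dataframe()
print(len(long_df), list(long_df.index.names))

# Converting only the 2-D variable keeps one row per molecule.
wide_fp = ds["fp"].to_pandas()
print(wide_fp.shape)

# Same arithmetic for the full dataset: 642 rows x 2048 bits.
print(642 * 2048)
```

In the long table the expt value is repeated for each bit of its row, which is why the full conversion grows so large.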