This is going to be the first in a series of posts on what I am calling “DIY Drug Discovery”. Admittedly, though, this title is hyperbolic. Discovering a new drug requires bringing it through a series of trials, which is very hard, if not impossible for an individual to do themselves. What I’m really going to be discussing is DIY drug screening. I believe in the near future individuals will be able to discover candidate drug molecules using commodity GPU hardware and open source software, such as,, and

Table of contents

Background on chemical space and drug discovery

The set of all possible molecules, which is known as chemical space, is incredibly vast. The Chemical Abstracts Service (CAS) registry lists 49,037,297 known molecules (molecules that have actually be synthesized). The number of different molecules that have been synthesized, in both public and private settings, may reach towards 100,000,000. Yet this is only tiny fraction of the number of possible molecules. Recent research has tried to enumerate the number of possible molecules containing Carbon, Nitrogen, Oxygen, Hydrogen, Oxygen, and halogens which may be of interest for drug discovery. So far, possible molecules up to a size of 17 atoms have been enumerated and a large database called GBD has been created. Enumerating possible molecules isn’t as simple as enumerating possible molecular graphs. There are many physical constraints that the authors had to take into account. For instance:

  • bonding - many types of bonds are not possible – covalent bonds must be between an electron donor and an electron acceptor. A Chlorine atom cannot bond to a flourine and a sodium cannot bond to a magnesium.

  • geometrical constraints - Due to the nature of atomic orbitals, each atom prefers to bond in particular configuration. For instance, carbon prefers to bond tetrahedrally and when forming rings prefers to form hexagons. Pentagons and other shapes are not necessarily impossible, but they are generally high unstable. For instance,, where carbon is forced to bond at 90 degrees, is highly unstable and explosive material. Carbon bonding in a triangle, or an isolated pentagon or decagon is impossible.

  • functional group stability - Many arrangements may be structurally stable in isolation but are highly reactive, making them near impossible to contain in the real world. For instance, if an oxygen or nitrogen atom is bonded to a non-aromatic C=C, this creates an Enol or enamine group which is highly reactive. The authors of the GBD database used 12 seperate filters to limit what types of functional groups could exist.

  • stereochemistry - the basic idea of this constraint is that “two things cannot be in the same place at the same time”. For instance, a molecule cannot loop back and intersect itself.

  • synthesizability - the molecule most be synthesizable. For instance molecules with more than three carbon rings or allenes (C=C=C) are exceedingly difficult to synthesize.

These basic physical constraints limit the chemical space significantly, especially if one is interested in . For instance the authors of GBD note “The vast majority of fused small ring systems are high strained and reactive. 96.7% of molecules with carbon are removed by this constraint. However, even after applying 29 filters, the GBD database still contains 10^11 molecules. This is a large number, but tractable in the sense that it could be stored on the very latest computer hardware. (storing the database in SMILES format would consume about 63 terabytes. The entire database is not available for public consumption, only a subset of 50 million compounds is downloadable). Keep in mind that this number was obtained after applying many constraints - the number of organic molecules of drug like size with normal atoms (CNOH + halogens) and up to 4 rings has been estimated to be around 10^60!

Apart from whether a molecule can physically exist and be synthesized, there are further constraints when it comes to making a drug:

  • solubility the molecule must be soluble in the blood.

  • toxicity the molecule cannot be toxic.

  • permeability & lipophilicity can it get through any biological membranes it needs to, such cell wall or the blood brain barrier?

  • binding target affinity will it actually bind to the target of interest?

Traditionally drug design has been done through a trial and error approach. A professor once described it to me as “trying to build an airplane by attaching the wings in at random orientations, and then seeing if it flies.” Just as computer simulations are used to design highly aerodynamic airplanes, so that less actual physical assembly and testing needs to be done, physics based simulation holds the promise of being able to design and optimize drug molecules before they are actually syntheized.

It would be very helpful if machine learning could be used to screen molecules.

Basic concepts of fingerprinting

For more info an in easy to read format, see the

The first step in molecular machine learning is encoding the structure of the molecule in a form that is amenable to machine learning. This is where a lot of research is currently focused. A useful representation encodes features that are relevant and is efficient, so as to avoid the curse of dimensionality. Fortunately, there is a way method of featurazation called fingerprinting which already has a long history of development in the world of drug discovery.

Fingerprinting was originally designed to solve the problem of molecular substructure search – how do I find all the molecules in my database with a particular substructure? A related problem is molecular similarity - given a molecule, how do I find the molecules in my database that are most similar. Only later were fingerprints applied for machine learning, or what has traditionally called quantitative structure property relationships (QSPR).

Fingerprinting creates an efficient representation of the molecular graph. The basic process of fingerprinting is as follows: First the algorithm generates a set of patterns. For instance, enumeration of different paths is common:

0-bond paths: 1-bond paths:
  C OC
  O C==C
  N CN
2-bond paths: 3-bond paths:

Storing all this data would result in an enormous representation. The trick of fingerprinting is to “hash” each of these features, which essentially means they act as seeds to a random number generator called a hash function. The hash function generates a bit string. Typically the hash function is chosen so that 4 or 5 bits per pattern are non-zero in the bit string. Next, all of the bit strings are OR’ed together.

This process is very useful for substructure searching because every bit of the substructure’s fingerprint will be on (=1) in the molecule’s fingerprint. Accidental collisions between patterns or substructures are possible, but if the bit density is low (known as a sparse representation) they are rare. Ideally, the on bit density of the bitstring should be tuned so as to reach a particular discriminatory power that is needed (ie. the chance that two structurally different molecules in my database have the same fingerprint is only 1%). There is a tradeoff between discriminatory power and the efficiency of the representation.

In practice, creating a bit density is done through folding. A very long, very sparse fingerprint is ‘folded’ down (with OR operations) to create a fingerprint with a desired length and good bit density. In case this is not obvious what this means, it can be literally thought of as folding the vector onto itself.

Now let’s test some fingerprints. We use, an open source cheminformatics library. Our data consists of strings, which encode molecular graphs into a string of characters, and experimentally determined solubility values.

from rdkit import Chem
from rdkit.Chem.EState import Fingerprinter
from rdkit.Chem import Descriptors
from rdkit.Chem.rdmolops import RDKFingerprint
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn import cross_validation
from keras import regularizers
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from sklearn import cross_validation
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from rdkit.Avalon.pyAvalonTools import GetAvalonFP

#Read the data
data = pd.read_table('../other_data_sets/solubility_data.txt', sep=' ', skipfooter=0)

def estate_fingerprint_and_mw(mol):
    return np.append(FingerprintMol(mol)[0], Descriptors.MolWt(x))

#Add some new columns
data['Mol'] = data['SMILES'].apply(Chem.MolFromSmiles)
num_mols = len(data)

#Create X and y
#Convert to Numpy arrays
y = data['Solubility'].values
Next we make a bunch of different fingerprints. To do this, I have created a fingerprint object, which stores the name of the fingerprint and contains a method for applying the fingerprint and then converting the output into a NumPy array.
from rdkit.Chem.rdMolDescriptors import
from rdkit.Chem.AtomPairs.Sheridan import GetBPFingerprint
from rdkit.Chem.EState.Fingerprinter import FingerprintMol
from rdkit.Avalon.pyAvalonTools import GetAvalonFP #GetAvalonCountFP  #int vector version
from rdkit.Chem.AllChem import  GetMorganFingerprintAsBitVect, GetErGFingerprint
from rdkit.DataStructs.cDataStructs import ConvertToNumpyArray
import rdkit.DataStructs.cDataStructs

def ExplicitBitVect_to_NumpyArray(bitvector):
    bitstring = bitvector.ToBitString()
    intmap = map(int, bitstring)
    return np.array(list(intmap))

class fingerprint():
    def __init__(self, fp_fun, name):
        self.fp_fun = fp_fun = name
        self.x = []

    def apply_fp(self, mols):
        for mol in mols:
            fp = self.fp_fun(mol)
            if isinstance(fp, tuple):
                fp = np.array(list(fp[0]))
            if isinstance(fp, rdkit.DataStructs.cDataStructs.ExplicitBitVect):
                fp = ExplicitBitVect_to_NumpyArray(fp)
            if isinstance(fp,rdkit.DataStructs.cDataStructs.IntSparseIntVect):
                fp = np.array(list(fp))

            self.x += [fp]

            if (str(type(self.x[0])) != "<class 'numpy.ndarray'>"):
                print("WARNING: type for ",, "is ", type(self.x[0]))

def make_fingerprints(length = 512, verbose=False):
    fp_list = [
         #fingerprint(lambda x : GetBPFingerprint(x, fpfn=AtomPair), 
         #            "Physiochemical properties (1996)"), ##NOTE: takes a long time to compute
         fingerprint(lambda x : GetHashedAtomPairFingerprintAsBitVect(x, nBits = length),
                     &qfuot;Atom pair (1985)"),
         fingerprint(lambda x : GetHashedTopologicalTorsionFingerprintAsBitVect(x, nBits = length),
                     "Topological torsion (1987)"),
         fingerprint(lambda x : GetMorganFingerprintAsBitVect(x, 2, nBits = length),
                     "Morgan circular "),
         fingerprint(FingerprintMol, "Estate (1995)"),
         fingerprint(lambda x: GetAvalonFP(x, nBits=length),
                    "Avalon bit based (2006)"),
         fingerprint(lambda x: np.append(GetAvalonFP(x, nBits=length), Descriptors.MolWt(x)),
                    "Avalon+mol. weight"),
         fingerprint(lambda x: GetErGFingerprint(x), "ErG fingerprint (2006)"),
         fingerprint(lambda x : RDKFingerprint(x, fpSize=length),
                     "RDKit fingerprint")

    for fp in fp_list:
        if (verbose): print("doing",

    return fp_list

fp_list = make_fingerprints()

The following code compares the fingerprints we have created, using 20 fold and a ridge regression model.
def test_model_cv(model, x, y, cv=20):
    scores = cross_validation.cross_val_score(model, x, y, cv=cv, n_jobs=-1, 
scoring='neg_mean_absolute_error') scores = -1scores return scores.mean() def test_fingerprints(fp_list, model, y, verbose = True): fingerprint_scores = {} for fp in fp_list: if verbose: print("doing ", fingerprint_scores[] = test_model_cv(model, fp.x, y) sorted_names = sorted(fingerprint_scores, key=fingerprint_scores.__getitem__, reverse=False) print("\\begin{tabular}{c c}") print(" name & avg abs error in CV (kJ/cc) \\\\") print("\\hline") for i in range(len(sorted_names)): name = sorted_names[i] print("%30s & %5.3f \\\\" % (name, fingerprint_scores[name])) print("\\end{tabular}") test_fingerprints(fp_list, Ridge(alpha=1e-9), y, verbose=True)
doing  Atom pair (1985)
doing  Topological torsion (1987)
doing  Morgan circular
doing  Estate (1995)
doing  Avalon bit based (2006)
doing  Avalon+mol. weight
doing  ErG fingerprint (2006)
doing  RDKit fingerprint

\begin{tabular}{c c}
           name        &  avg abs error in CV (kJ/cc) \\
                 Estate (1995) & 0.699 \\
    Topological torsion (1987) & 1.492 \\
        ErG fingerprint (2006) & 1.697 \\
             RDKit fingerprint & 2.869 \\
              Morgan circular  & 2.928 \\
              Atom pair (1985) & 3.043 \\
            Avalon+mol. weight & 20.190 \\
       Avalon bit based (2006) & 26.482 \\
def test_model_cv(model, x, y, cv=20):
    scores = cross_validation.cross_val_score(model, x, y, cv=cv, n_jobs=-1, scoring='neg_mean_absolute_error')

    scores = -1scores

    return scores.mean()

def test_fingerprint_vs_size(model, num_sizes_to_test = 20, max_size=2048, cv = 20, verbose=False, makeplots=False):

    fp_list = make_fingerprints(length = 10) #test
    num_fp = len(fp_list)

    sizes = np.linspace(100,max_size,num_sizes_to_test)

    scores_vs_size = np.zeros([num_fp, num_sizes_to_test])

    num_fp = 0
    for i in range(num_sizes_to_test):
        if verbose: print(i, ",", end='')
        length = sizes[i]
        fp_list = make_fingerprints(length = int(length))
        num_fp = len(fp_list)
        for j in range(num_fp):
            scores_vs_size[j,i] = test_model_cv(model, fp_list[j].x, y)

    if (makeplots):
        import matplotlib.pyplot as plt
        fig = plt.figure(figsize=(10,10))
        fp_names = []
        for i in range(num_fp):
            plt.plot(sizes, scores_vs_size[i,:],'-')
            fp_names += [fp_list[i].name]
        plt.title('Ridge regression, average CV score vs fingerprint length',fontsize=25)
        plt.ylabel('average mean absolute error in CV ',fontsize=20)
        plt.xlabel('fingerprint length', fontsize=20)

    return scores_vs_size

scores_vs_size = test_fingerprint_vs_size(Ridge(alpha=1e-9), verbose=True, makeplots=True)

We see that the Estate fingerprint performs best. This is not surprising as Estate encodes information about the effective charge on each atom. For other applications I have tried, the Avalon or atom pair fingerprints often perform best. Estate has a fixed length of 80. Another interesting thing we see in this graph is that the error increases once the fingerprint length grows beyond 500. This is not something I have observed in other applications yet, but is likely due to the curse of dimensionality rearing its ugly head, even though we are using a regularized model (Ridge regression). In retrospect, it was pointed out to me that Lasso regression (L1 regularization) would probably do a lot better than Ridge regression here, since fingerprints have a lot of zeros. [L1 regularization introduces a penalty term which has a gradient that forces parameters corresponding to variables with no predictive power to be precisely zero, whereas L2 only force them to be very small.] In any case, we create the array *X* using the Estate fingerprint to feed into machine learning models:

def estate_fingerprint(mol):
    return FingerprintMol(mol)[0]

#Scale X to unit variance and zero mean
data['Fingerprint'] = data['Mol'].apply(estate_fingerprint)

X = np.array(list(data['Fingerprint']))

st = StandardScaler()
X = np.array(list(data['Fingerprint']))
X = st.fit_transform(X)

Next we do grid searches to tune the hyperparameters of the KernelRidge, Ridge, GaussianProcessRegressor, and RandomForestRegressor models in sci-kit-learn:
KRmodel = GridSearchCV(KernelRidge(), cv=10,
              param_grid={"alpha": np.logspace(-10, -5, 10),
             "gamma": np.logspace(-12, -9, 10), "kernel" : ['laplacian', 'rbf']}, scoring='neg_mean_absolute_error', n_jobs=-1)

KRmodel =, y)
Best_KernelRidge = KRmodel.best_estimator_
print("Best Kernel Ridge model")
Best Kernel Ridge model
{'alpha': 3.5938136638046258e-10, 'gamma': 1e-10, 'kernel': 'laplacian'}
Rmodel = GridSearchCV(Ridge(), cv=20,
              param_grid={"alpha": np.logspace(-10, -5, 30),}, scoring='neg_mean_absolute_error', n_jobs=-1)

Rmodel =, y)
Best_Ridge = Rmodel.best_estimator_
print("Best Ridge model")
Best Ridge model
{'alpha': 1e-10}
GPmodel = GridSearchCV(GaussianProcessRegressor(normalize_y=True), cv=20,
              param_grid={"alpha": np.logspace(-15, -10, 30),}, scoring='neg_mean_absolute_error', n_jobs=-1)
GPmodel =, y)
Best_GaussianProcessRegressor = GPmodel.best_estimator_
print("Best Gaussian Process model")
Best Gaussian Process model
{'alpha': 2.2122162910704501e-15}
RFmodel = GridSearchCV(RandomForestRegressor(), cv=20,
              param_grid={"n_estimators": np.linspace(50, 150, 25).astype('int')}, scoring='neg_mean_absolute_error', n_jobs=-1)

RFmodel =, y)
Best_RandomForestRegressor = RFmodel.best_estimator_
print("Best Random Forest model")
Best Random Forest model
{'n_estimators': 91}

Testing out different machine learning models

import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import rbf_kernel

def make_scatter_plot(y_pred_train, y_pred_test, y_train, y_test, title='', figsize=(6,4), fontsize=16):    
    plt.scatter(y_train,y_pred_train, label = 'Train', c='blue')
    plt.xlabel('Experimental Solubility (mol/L)', fontsize=fontsize)
    plt.ylabel('Predicted Solubility (mol/L)', fontsize=fontsize)
    plt.scatter(y_test,y_pred_test,c='lightgreen', label='Test', alpha = 0.8)

def test_models_and_plot(x, y, model_dict, plots=True):
    ''' test a bunch of models and print out a sorted list of CV accuracies
            x: training data features, numpy array or Pandas dataframe
            y: training data labels, numpy array or Pandas dataframe
            model_dict: a dictionary of the form {name : model()}, where 'name' is a string
                        and 'model()' is a sci-kit-learn model object. 

    mean_scores = {}
    percent_errors = {}

    for (name, model) in model_dict.items():
        #print("running %s" % name)
        scores = cross_validation.cross_val_score(model, x, y, cv=20, n_jobs=-1, scoring='neg_mean_absolute_error')
        scores = -1*scores
        mean_score = scores.mean()
        mean_scores[name] = mean_score

        X_train, X_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.1), y_train)

        y_pred_train = model.predict(X_train)

        y_pred_test  = model.predict(X_test)

        percent_error = np.mean( 100*np.abs(y_test -y_pred_test)/np.abs(y_pred_test))

        percent_errors[name] = percent_error

        fulltitle = name+'\n mean % error: '+str(percent_error)

        if plots:
            make_scatter_plot(y_pred_train, y_pred_test, y_train, y_test, title=fulltitle, figsize = (8,6))

    sorted_names = sorted(percent_errors, key=mean_scores.__getitem__, reverse=False)

    print("\\begin{tabular}{c c c}")
    print("           name     &      % test err   & .    abs error in CV \\\\ ")
    for i in range(len(sorted_names)):
        name = sorted_names[i]
        print("%30s & %5.3f & %5.3f \\\\" % (name, percent_errors[name], mean_scores[name]))

alpha_grid = {'alpha': np.logspace(1e-11,1e-1,8)}

model_dict = {
            'Linear Regression': LinearRegression(),
            'Kernel Ridge Regression': Best_KernelRidge,
            'Ridge Regression':Best_Ridge,
            'Guassian Process Regressor': Best_GaussianProcessRegressor,
            'Support Vector Regression': SVR(),
            'KNeighborsRegressor': KNeighborsRegressor(),
            'Neural Network': MLPRegressor(alpha=100,max_iter=8000, hidden_layer_sizes=[8,6], early_stopping=False),
            'Gradient Boosted Trees': GradientBoostingRegressor(n_estimators=100),
            'Random forest': Best_RandomForestRegressor

test_models_and_plot(X, y, model_dict)
\begin{tabular}{c c c}
           name     &      % test err   & .    abs error in CV \\
     Support Vector Regression & 81.770 & 0.641 \\
                 Random forest & 46.825 & 0.643 \\
       Kernel Ridge Regression & 55.056 & 0.685 \\
              Ridge Regression & 40.707 & 0.699 \\
        Gradient Boosted Trees & 40.320 & 0.755 \\
           KNeighborsRegressor & 73.877 & 0.840 \\
    Guassian Process Regressor & 38.252 & 0.967 \\
                Neural Network & 35.076 & 1.043 \\
             Linear Regression & 212.043 & 1384058953421.953 \\

The output is formatted in LaTeX so a nice table can be produced, which ends up looking like this:

latex output


The best performing model in this case was support vector regression, followed closely by Random Forrest and Kernel Ridge Regression. In general, I find that KRR and random forrest are usually the best performing models, and this is the first time SVR has won out. However, in molecular machine learning, the model that is used is really not the limiting factor, provided one does their due diligence and explores several classes of models and optimizes their hyperparameters. The component which really determines success is the featurization technique used. The machine learning model can be thought of an black box machine that extracts the nonlinear dependance of Y on X. As long as the black box is large enough and very well tuned and regularized, we don’t have to worry about it much. What we need to worry about more is creating the right X from our molecule data so that the black box can due its job. Traditionally, human created fingerprints and descriptors have been used, which are based on physics and chemical intuition about what molecular features are important. In the future ML can also be applied to fingerprint design – this will be discussed in a future post.

Another thing we observe is the large difference between the test and CV errors. The test error is for a particular train-test split, and appears to have high variance. The CV error is averaged over all test-train splits. The percentage error depends highly on whether one is working with a very soluble or insoluble molecule. For some applications the error profile of this model may be acceptable, in others it may be too crude. The error is definitely larger than a different application I applied this same code to, which was predicting decomposition energy. In that case I obtained average errors on the order of ~20% or less.