In the paper “Cassotti, M., Ballabio, D., Todeschini, R., & Consonni, V. (2015). A similarity-based QSAR model for predicting acute toxicity towards the fathead minnow (Pimephales promelas). SAR and QSAR in Environmental Research, 26(3), 217–243. https://doi.org/10.1080/1062936X.2015.1018938”, the authors studied the prediction of the acute toxicity of chemicals to fish. In particular, they presented QSAR models to predict the 96-hour LC50 for the fathead minnow (Pimephales promelas).
The dataset and information described in the paper were used to build the alvaRunner project that we present here.
alvaRunner project
This alvaRunner project contains two models:
- a KNN regression model built using the 726 molecules of the paper's training set (KNN_Training)
- a KNN regression model built using all the 908 molecules (KNN_All)
Both models include the following six descriptors:
- MLOGP: Moriguchi octanol-water partition coefficient
- CIC0: complementary Information Content index (neighborhood symmetry of 0-order)
- NdssC: number of atoms of type dssC
- NdsCH: number of atoms of type dsCH
- SM1_Dz(Z): spectral moment of order 1 from Barysz matrix weighted by atomic number
- GATS1i: Geary autocorrelation of lag 1 weighted by ionization potential
Cassotti et al. highlighted that CIC0 encodes information about the number of different heteroatoms, while NdssC and NdsCH encode information about the electrophilic characteristics of chemicals. SM1_Dz(Z) is highly correlated with the number of heteroatoms: the molecules with the lowest SM1_Dz(Z) values are composed entirely of carbon atoms, while the largest values are observed in highly fluorinated and chlorinated compounds and, more generally, in compounds with several heteroatoms. Finally, GATS1i tends to have low values for molecules with several carbon–carbon bonds and for molecules with bromine and iodine atoms; its values are also lower for aromatic compounds.
It is worth noting that the paper's models use the Jaccard-Tanimoto distance, whereas the models in this project use the Euclidean distance. The molecular descriptors were range-scaled before the distances were computed. The paper presents a few models that use ad hoc distance thresholds (Strict/Soft) for the Applicability Domain. Because this solution is specific to the paper, we decided not to include an Applicability Domain in this project.
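To make the combination of range scaling and Euclidean distance concrete, here is a minimal sketch of a KNN regression in plain Python. It is not the alvaRunner implementation: the function names `range_scale` and `knn_predict` are hypothetical, the prediction is assumed to be the unweighted mean of the k nearest training responses, the scaling parameters are assumed to come from the training set only, and the numbers are toy values rather than real descriptor values.

```python
import math

def range_scale(rows):
    """Range-scale (min-max) each descriptor column to [0, 1].

    Returns the scaled rows plus the per-column minima and spans,
    so that the same transformation can be applied to a query molecule.
    """
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    # A constant column has zero span; use 1.0 to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    scaled = [[(v - lo) / s for v, lo, s in zip(r, mins, spans)] for r in rows]
    return scaled, mins, spans

def knn_predict(train_X, train_y, query, k=6):
    """Predict a response as the mean of the k nearest training responses.

    Assumption: unweighted average over neighbours found with the
    Euclidean distance on range-scaled descriptors.
    """
    scaled, mins, spans = range_scale(train_X)
    q = [(v - lo) / s for v, lo, s in zip(query, mins, spans)]
    # Pair each Euclidean distance with the training response and sort.
    neighbours = sorted((math.dist(row, q), y) for row, y in zip(scaled, train_y))
    nearest = neighbours[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy example with two made-up descriptors and four made-up molecules.
toy_X = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [10.0, 100.0]]
toy_y = [1.0, 2.0, 3.0, 10.0]
toy_prediction = knn_predict(toy_X, toy_y, [1.0, 10.0], k=3)
```

Without the scaling step, the second toy descriptor, whose range is ten times larger, would dominate the distance; range scaling gives each descriptor a comparable weight.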
The scores of the two models of the alvaRunner project are presented in the following table:
Model name | Training R2 | Training Q2CV | Training RMSE | Training RMSECV | Test R2 | Test RMSE |
---|---|---|---|---|---|---|
KNN_Training (k: 6, Euclidean) | 0.616 | 0.62 | 0.884 | 0.88 | 0.651 | 0.919 |
KNN_All (k: 6, Euclidean) | 0.626 | 0.633 | 0.89 | 0.882 | – | – |
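For readers unfamiliar with the scores in the table, the following sketch shows the usual definitions of R2 and RMSE. The helper names `r2_score` and `rmse` are hypothetical and the values are toy numbers. Q2CV and RMSECV are commonly obtained by applying the same formulas to cross-validated predictions; the exact validation scheme and the exact test-set formula used for this project are not stated here, so this is an assumption rather than a description of how the table was produced.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - rss / tss

def rmse(y_true, y_pred):
    """Root mean squared error between experimental and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Toy experimental and predicted values (not taken from the dataset).
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 3.0, 3.5]
toy_r2 = r2_score(toy_true, toy_pred)
toy_rmse = rmse(toy_true, toy_pred)
```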
The following charts show the predicted (Y) and real (X) values of the two models:
[Chart panels: KNN_Training, KNN_All]