Model: BCF

In the paper “Bhattacharyya, P., Samanta, P., Kumar, A., Das, S., & Ojha, P. K. (2024). Quantitative read-across structure–property relationship (q-RASPR): a novel approach to estimate the bioaccumulative potential for diverse classes of industrial chemicals in aquatic organisms. Environmental Science: Processes & Impacts. https://doi.org/10.1039/D4EM00374H”, the authors introduced a q-RASPR model that integrates read-across techniques with QSAR principles to predict the Bioconcentration Factor (BCF) of chemicals in aquatic organisms. While the paper highlights the q-RASPR model, this page focuses on the traditional QSAR model developed as part of the same study. Specifically, the QSAR model predicts the logarithm of BCF (logBCF), where BCF is measured in L/kg. The logBCF is used to evaluate the bioaccumulation potential of chemical substances in reference organisms, and it directly correlates with ecotoxicity.

The dataset and information described in the paper were used to build the alvaRunner project that we present here.

alvaRunner project

This alvaRunner project contains one model:

  • an OLS regression model built using 978 molecules

The model was originally developed using Partial Least Squares (PLS) regression to determine the beta coefficients. In this project, those coefficients were used to reproduce the model as a simpler Ordinary Least Squares (OLS) linear regression. The original dataset consisted of 1,333 molecules. The authors removed 30 molecules (for example, inorganic compounds) and divided the remaining 1,303 molecules into a training set of 978 molecules and a test set of 325 molecules.
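The dataset counts above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; the variable names are illustrative.

```python
# Counts quoted in the text above.
original = 1333   # molecules in the original dataset
removed = 30      # molecules excluded by the authors (e.g. inorganic compounds)
training = 978    # training-set size
test = 325        # test-set size

curated = original - removed
# The training and test sets should account for every curated molecule.
assert curated == training + test

training_fraction = training / curated
test_fraction = test / curated
print(curated, round(training_fraction, 3), round(test_fraction, 3))
# prints: 1303 0.751 0.249
```

In other words, the split is roughly 75% training and 25% test.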

The model includes the following six 2D descriptors:

  • X1A: average connectivity index of order 1
  • Cl-089: Cl attached to C1(sp2)
  • MaxsOH: Maximum sOH
  • MaxdO: Maximum dO
  • B03[C-N]: Presence/absence of C – N at topological distance 3
  • MLOGP: Moriguchi octanol-water partition coeff. (logP)

The authors shared some insights into the model descriptors:

  • X1A, the average connectivity index of order 1, represents molecular branching and complexity. Its negative contribution suggests that increased branching and complexity decrease the logBCF value. 
  • Cl-089 measures the hydrophobicity of a Cl atom attached to an sp2-hybridized carbon (C1) atom. Its positive contribution to the logBCF endpoint indicates that hydrophobicity makes the compound more bioaccumulative.
  • MaxsOH is the maximum atom-type E-state of the fragment -OH, signifying that compounds containing hydroxyl groups decrease the value of logBCF. This effect is attributed to the formation of hydrogen bonds with water, as reflected by its negative regression coefficient.
  • MaxdO represents the maximum atom-type E-state for the fragment =O which indicates that compounds containing carbonyl groups, amides, or esters are likely to reduce the value of logBCF. This effect is attributed to the formation of hydrogen bonds with water, as reflected by its negative regression coefficient.
  • B03[C-N] reflects the presence or absence of a carbon and nitrogen atom pair at a topological distance of 3. Its negative contribution indicates that the presence of this fragment reduces the compound’s bioaccumulative potential. As an electron donor, the nitrogen atom forms hydrogen bonds with water, further lowering bioaccumulation potential.
  • MLOGP is a measure of lipophilicity. Its positive contribution indicates that more lipophilic compounds are more bioaccumulative.
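Because the model is a linear regression, a prediction is the intercept plus a weighted sum of the six descriptor values. The sketch below illustrates that form only. The fitted coefficients are not listed on this page, so the intercept and coefficients here are hypothetical placeholders chosen solely to match the signs described above, and the descriptor values are made up.

```python
# HYPOTHETICAL coefficients: placeholders that only reproduce the signs
# described above. They are NOT the values fitted by the authors.
INTERCEPT = 0.0
COEFFICIENTS = {
    "X1A": -1.0,       # negative contribution
    "Cl-089": 0.1,     # positive contribution
    "MaxsOH": -0.1,    # negative contribution
    "MaxdO": -0.1,     # negative contribution
    "B03[C-N]": -0.5,  # negative contribution
    "MLOGP": 0.5,      # positive contribution
}


def predict_log_bcf(descriptors):
    """Return intercept + sum(coefficient * descriptor value)."""
    missing = set(COEFFICIENTS) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptors: {sorted(missing)}")
    return INTERCEPT + sum(
        COEFFICIENTS[name] * descriptors[name] for name in COEFFICIENTS
    )


# Made-up descriptor values for a single hypothetical molecule.
example = {
    "X1A": 0.5,
    "Cl-089": 2,
    "MaxsOH": 0.0,
    "MaxdO": 0.0,
    "B03[C-N]": 0,
    "MLOGP": 3.0,
}
print(predict_log_bcf(example))
```

With the real coefficients substituted in, the same function would reproduce the model's logBCF prediction for a molecule whose descriptors have already been calculated.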

The scores of the model are presented in the following table:

Model name | Training R2 | Training Q2CV | Training RMSE | Training RMSECV | Test R2 | Test RMSE
M1         | 0.654       | 0.647         | 0.831         | 0.839           | 0.718   | 0.745
The cross-validation (CV) is 5-fold.
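For readers who want to recompute such statistics, here is a minimal sketch using the usual textbook definitions: R2 = 1 - SS_res / SS_tot and RMSE as the square root of the mean squared error. Q2CV and RMSECV would be the same quantities computed on out-of-fold predictions. The exact conventions used by the authors (for example, how folds were drawn) are not stated on this page, so the contiguous, unshuffled folds below are an assumption, and the toy numbers are made up.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """1 - SS_res / SS_tot, with SS_tot taken around the observed mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds of near-equal size (no shuffling)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Toy values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(round(r_squared(observed, predicted), 3), round(rmse(observed, predicted), 3))

# Fold sizes for a 978-molecule training set.
print([len(fold) for fold in k_fold_indices(978)])
# prints: [196, 196, 196, 195, 195]
```

In a real cross-validation, each fold is held out in turn, the model is refitted on the remaining folds, and the held-out predictions are pooled before computing Q2CV and RMSECV.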

To enhance the reliability of the QSAR model, a distance-based Applicability Domain (AD) has been added. This AD defines the theoretical chemical space where the model’s predictions are considered reliable. Predictions falling outside the AD are flagged as less reliable, providing a clear indication of the model’s predictive boundaries.

The chart on the left shows the predicted (Y) and experimental (X) values and the one on the right is the Williams plot of the model:

[Figure: left panel, logBCF (predicted vs. experimental); right panel, Williams plot]
green: training set, blue: test set, orange: outside the AD
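A Williams plot shows leverage on the horizontal axis against standardized residuals on the vertical axis. The sketch below applies two cutoffs that are commonly used with such plots: a warning leverage h* = 3(p + 1)/n, where p is the number of descriptors and n the training-set size, and a residual limit of ±3 standard deviations. This page does not state which cutoffs alvaRunner applies or how they relate to its distance-based AD, so both thresholds and the function names are assumptions.

```python
def warning_leverage(n_training, n_descriptors):
    """Conventional Williams-plot cutoff h* = 3 * (p + 1) / n."""
    return 3 * (n_descriptors + 1) / n_training


def classify(leverage, std_residual, h_star, residual_limit=3.0):
    """Label one point by the two conventional Williams-plot cutoffs."""
    high_leverage = leverage > h_star
    large_residual = abs(std_residual) > residual_limit
    if high_leverage and large_residual:
        return "high leverage, large residual"
    if high_leverage:
        return "high leverage"
    if large_residual:
        return "large residual"
    return "within both cutoffs"


# Six descriptors and 978 training molecules, as described on this page.
h_star = warning_leverage(978, 6)
print(round(h_star, 4))
# prints: 0.0215

# Made-up points, for illustration only.
print(classify(0.010, 1.2, h_star))
print(classify(0.050, -0.4, h_star))
```

Under these assumed cutoffs, a molecule with leverage above h* lies far from the bulk of the training set in descriptor space, so its prediction is an extrapolation even when its residual is small.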
