Mutagenicity - Salmonella typhimurium (CPDB)
Endpoint Definition
A chemical is classified within the CPDB as mutagenic, i.e. positive, in the Salmonella assay if it was evaluated overall as either mutagenic or weakly mutagenic by Zeiger or as overall positive by the EPA Gene-Tox Program. All other chemicals evaluated for mutagenicity by these two sources are reported as negative.
[Details] [Original data]
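The classification rule above can be sketched in a few lines of Python. This is only an illustration: the string labels used for the Zeiger and Gene-Tox evaluations and the function name `cpdb_salmonella_call` are assumptions made for this example and do not reflect the actual encoding of the CPDB files.

```python
from typing import Optional

# Overall evaluations that count as positive. The label strings are
# assumptions for this sketch, not the CPDB's actual encoding.
ZEIGER_POSITIVE = {"mutagenic", "weakly mutagenic"}
GENETOX_POSITIVE = {"positive"}


def cpdb_salmonella_call(zeiger: Optional[str], genetox: Optional[str]) -> Optional[str]:
    """Return 'positive', 'negative', or None if neither source evaluated the chemical."""
    if zeiger is None and genetox is None:
        return None  # not evaluated for mutagenicity by either source
    if zeiger in ZEIGER_POSITIVE or genetox in GENETOX_POSITIVE:
        return "positive"
    # Evaluated by at least one source, but not positive in either.
    return "negative"
```

A chemical rated "weakly mutagenic" by Zeiger is reported as positive even without a Gene-Tox evaluation, while a chemical evaluated only as non-mutagenic is reported as negative.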
Algorithm Definition
lazar obtains predictions from the experimental results of compounds with similar structures (neighbors). To obtain differentiated predictions, chemical similarities are always determined with respect to the endpoint under investigation. A detailed description and formal definition of the lazar algorithm has been published in:
- C. Helma: Lazy Structure-Activity Relationships (lazar) for the Prediction of Rodent Carcinogenicity and Salmonella Mutagenicity, Molecular Diversity 10, 147-158 (2006) [preprint]
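The general idea of predicting from neighbors can be illustrated with a similarity-weighted vote. The sketch below is not the published lazar algorithm (see the reference above for the formal definition); the function name, the `min_similarity` cutoff of 0.3, and the scoring scheme are assumptions chosen for illustration only.

```python
from typing import List, Optional, Tuple

# (similarity to the query in [0, 1], experimentally active?)
Neighbor = Tuple[float, bool]


def predict_from_neighbors(
    neighbors: List[Neighbor], min_similarity: float = 0.3
) -> Optional[Tuple[bool, float]]:
    """Similarity-weighted vote over neighbors.

    Returns (predicted_active, signed_score), or None if no neighbor is
    similar enough or the vote is tied. The cutoff is a hypothetical value.
    """
    relevant = [(s, active) for s, active in neighbors if s >= min_similarity]
    if not relevant:
        return None
    # Active neighbors push the score up, inactive neighbors pull it down,
    # each in proportion to its similarity to the query.
    score = sum(s if active else -s for s, active in relevant)
    if score == 0:
        return None
    return score > 0, score
```

For example, `predict_from_neighbors([(0.9, True), (0.5, False), (0.1, False)])` ignores the third neighbor (below the cutoff) and predicts activity, because the active neighbor is more similar to the query than the inactive one.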
The present version of lazar uses a slightly modified definition of chemical similarity that a) uses a Gaussian distribution function and b) considers the presence of fragments that cannot be evaluated for statistical reasons (i.e. because they are too infrequent in the database). The definition for chemical similarity (Equation 1) is now:
[Equation 1: image not reproduced]
You can download the source code for this lazar version (GNU General Public License) with git:
git clone git://github.com/helma/lazar-core.git
Applicability Domain Definition
The applicability domain (AD) of the training set is characterized by the confidence index of a prediction. A high confidence index means that the query is close to the applicability domain of the training set and the prediction is reliable; a low confidence index means that it is far from the applicability domain and the prediction is unreliable. The confidence index considers (i) the similarity and number of neighbors and (ii) contradictory examples within the neighbors. A formal definition can be found in:
- C. Helma: Lazy Structure-Activity Relationships (lazar) for the Prediction of Rodent Carcinogenicity and Salmonella Mutagenicity, Molecular Diversity 10, 147-158 (2006) [preprint]
The reliability of predictions decreases gradually with increasing distance from the applicability domain (i.e. decreasing confidence index). Figure 1 shows this dependency visually. Table 1 weights true/false predictions with their confidence and provides the best indication of the overall performance of the system.
For simplicity we also provide results for an applicability domain definition with a sharp border at a confidence index of 0.025. These results are summarized in Table 2, indicated by the grey area in Figure 1, and shown in the ROC curve in Figure 2. Misclassifications within the applicability domain are summarized in the table of misclassifications.
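The two ways of summarizing predictions described above (weighting by confidence as in Table 1, and applying a sharp border at 0.025 as in Table 2) can be sketched as follows. The data layout, the function names, and the assumption that the confidence index is a non-negative number are choices made for this illustration; only the threshold of 0.025 is taken from the text.

```python
from collections import Counter
from typing import Iterable, Tuple

# (predicted active?, measured active?, confidence index >= 0)
Prediction = Tuple[bool, bool, float]

AD_THRESHOLD = 0.025  # sharp applicability-domain border quoted in the text


def outcome(predicted: bool, measured: bool) -> str:
    """Name the confusion-matrix cell for one prediction."""
    if predicted:
        return "tp" if measured else "fp"
    return "fn" if measured else "tn"


def weighted_counts(predictions: Iterable[Prediction]) -> Counter:
    """Each prediction contributes its confidence instead of a count of 1."""
    counts = Counter()
    for predicted, measured, confidence in predictions:
        counts[outcome(predicted, measured)] += confidence
    return counts


def counts_within_ad(
    predictions: Iterable[Prediction], threshold: float = AD_THRESHOLD
) -> Counter:
    """Count only predictions whose confidence exceeds the sharp border."""
    counts = Counter()
    for predicted, measured, confidence in predictions:
        if confidence > threshold:
            counts[outcome(predicted, measured)] += 1
    return counts
```

Weighting explains why a table of confidence-weighted results can contain non-integer "counts": a false positive with a confidence of 0.01 adds only 0.01 to the weighted total, and is dropped entirely when the sharp border is applied.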
The presence of substructures that are unknown to the training set (unknown fragments) is another factor that limits the applicability domain. As the training data cannot provide any information about unknown fragments, their relevance has to be evaluated by an expert user. As a rule of thumb, large fragments are of less concern, because all shorter subfragments have been evaluated by the system. For this reason the presence or absence of unknown fragments is not considered in the formal applicability domain definition, but their presence is indicated in the table of misclassifications.
Validation Results (leave-one-out crossvalidation)
Definition and experimental comparison with external validation procedures:
- R. Benigni, T. I. Netzeva, E. Benfenati, C. Bossa, R. Franke, C. Helma, E. Hulzebos, C. Marchant, A. Richard, Y.-T. Woo, and C. Yang: The expanding role of predictive toxicology: an update on the (Q)SAR models for mutagens and carcinogens, J Environ Sci Health C Environ Carcinog Ecotoxicol Rev. 25, 53-97 (2007)
- C. Helma: Lazy Structure-Activity Relationships (lazar) for the Prediction of Rodent Carcinogenicity and Salmonella Mutagenicity, Molecular Diversity 10, 147-158 (2006) [preprint]
- C. Helma and J. Kazius: Artificial Intelligence and Data Mining for Toxicity Prediction, Current Computer-Aided Drug Design 2, 1-19 (2006) [preprint]
- Presentation at Workshop on Evaluating Prediction Models in Mutagenicity and Carcinogenicity, Rome, Italy (2006) [presentation]
Table 1: Predictions weighted by confidence index
| True positive predictions | tp | 28.72 |
| True negative predictions | tn | 23.94 |
| False positive predictions | fp | 4.98 |
| False negative predictions | fn | 6.22 |
| Sensitivity (true positive rate) | tp/(tp+fn) | 0.82 |
| Specificity (true negative rate) | tn/(tn+fp) | 0.83 |
| Positive predictivity | tp/(tp+fp) | 0.85 |
| Negative predictivity | tn/(tn+fn) | 0.79 |
| False positive rate | fp/(tp+fn) | 0.14 |
| False negative rate | fn/(tn+fp) | 0.22 |
| Accuracy (concordance) | (tp+tn)/(tp+fp+tn+fn) | 0.82 |
Best indication of the overall performance (see Applicability Domain Definition)
Table 2: Predictions within the applicability domain
| True positive predictions | tp | 220 |
| True negative predictions | tn | 251 |
| False positive predictions | fp | 57 |
| False negative predictions | fn | 72 |
| Sensitivity (true positive rate) | tp/(tp+fn) | 0.75 |
| Specificity (true negative rate) | tn/(tn+fp) | 0.81 |
| Positive predictivity | tp/(tp+fp) | 0.79 |
| Negative predictivity | tn/(tn+fn) | 0.78 |
| False positive rate | fp/(tp+fn) | 0.2 |
| False negative rate | fn/(tn+fp) | 0.23 |
| Accuracy (concordance) | (tp+tn)/(tp+fp+tn+fn) | 0.79 |
Predictions with a confidence > 0.025 are considered to be within the applicability domain (see Applicability Domain Definition)
Table 3: All predictions
| True positive predictions | tp | 263 |
| True negative predictions | tn | 314 |
| False positive predictions | fp | 84 |
| False negative predictions | fn | 116 |
| Sensitivity (true positive rate) | tp/(tp+fn) | 0.69 |
| Specificity (true negative rate) | tn/(tn+fp) | 0.79 |
| Positive predictivity | tp/(tp+fp) | 0.76 |
| Negative predictivity | tn/(tn+fn) | 0.73 |
| False positive rate | fp/(tp+fn) | 0.22 |
| False negative rate | fn/(tn+fp) | 0.29 |
| Accuracy (concordance) | (tp+tn)/(tp+fp+tn+fn) | 0.74 |
Poor indication of the overall performance. Depends predominantly on the fraction of compounds beyond the applicability domain, which are by definition poorly predictable (see Applicability Domain Definition)
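The summary statistics in the tables follow directly from the four counts. The short sketch below recomputes them for Table 3 using the formulas exactly as they are listed in the tables; the function and key names are hypothetical. Note that the false positive and false negative rates are computed with the denominators given in the tables, which differ from the more common definitions fp/(fp+tn) and fn/(fn+tp).

```python
def summary_statistics(tp: float, tn: float, fp: float, fn: float) -> dict:
    """Recompute the table rows from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictivity": tp / (tp + fp),
        "negative_predictivity": tn / (tn + fn),
        # Denominators as listed in the tables, not the textbook ones.
        "false_positive_rate_as_listed": fp / (tp + fn),
        "false_negative_rate_as_listed": fn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Counts from Table 3 (all predictions).
table3 = summary_statistics(tp=263, tn=314, fp=84, fn=116)
for name, value in table3.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the output reproduces the values in Table 3 (0.69, 0.79, 0.76, 0.73, 0.22, 0.29 and 0.74). The function accepts floats, so it can also be applied to confidence-weighted counts.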
Figure 1: Dependency of predictive accuracy on the confidence index (i.e. the distance to the applicability domain, see Applicability Domain Definition). Fluctuations at the left-hand side of the figure are statistical artefacts (small sample sizes) and therefore irrelevant.
Figure 2: ROC curve of true versus false positive rates. An optimal model would reside in the top left corner; random guessing would lead to points near the diagonal line.
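As a rough illustration of how the points of an ROC curve can be obtained, the sketch below sweeps a decision threshold over scored predictions. It assumes that each prediction carries a numeric score where higher means "more likely active", and it uses the conventional rates tp/(tp+fn) and fp/(fp+tn); how the curve for this model was actually generated is not described here, so the function and its input format are assumptions.

```python
from typing import List, Tuple


def roc_points(scored: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs.

    scored: (score, measured_active) pairs; higher scores mean the model
    leans towards 'active'. Both classes must be present.
    """
    positives = sum(1 for _, active in scored if active)
    negatives = len(scored) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one active and one inactive compound")

    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted active
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Lower the threshold past all compounds sharing this score.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points
```

The curve always starts at (0, 0) and ends at (1, 1); a model that ranks all active compounds above all inactive ones passes through the top left corner (0, 1).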
The table of misclassifications shows all misclassified instances within the applicability domain.
Neighbors
Neighbors are compounds that are similar to the query compound with respect to the Salmonella typhimurium (CPDB) endpoint. It is likely that compounds with high similarity act by similar mechanisms as the query compound. You can retrieve additional experimental data and literature citations for the neighbors and the query structure by following the "Search PubChem" links on the prediction page.
Fragments
Activating and deactivating parts of the query compound are highlighted in red and green. Fragments that are unknown (or too infrequent for statistical evaluation) are marked in yellow. You can retrieve additional statistical information about the individual fragments by following the "Relevant Fragments" link. Please note that lazar predictions are based on neighbors and not on fragments. Fragments and their statistical significance are used for the calculation of activity-specific similarities.
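One way to picture an activity-specific similarity is a Tanimoto-style index in which each fragment counts in proportion to its statistical significance for the endpoint. The sketch below is an illustration under that assumption only: the weighting scheme, the function names, and the example fragment strings are hypothetical, and the formal definition used by lazar is given in the publication cited under Algorithm Definition.

```python
from typing import Dict, Set


def unknown_fragments(fragments: Set[str], significance: Dict[str, float]) -> Set[str]:
    """Fragments without a significance value (unknown or too infrequent)."""
    return {f for f in fragments if f not in significance}


def activity_specific_similarity(
    query: Set[str], neighbor: Set[str], significance: Dict[str, float]
) -> float:
    """Significance-weighted Tanimoto index over fragments with known weights."""
    query_known = {f for f in query if f in significance}
    neighbor_known = {f for f in neighbor if f in significance}
    union = query_known | neighbor_known
    total = sum(significance[f] for f in union)
    if total == 0:
        return 0.0  # nothing comparable between the two compounds
    common = query_known & neighbor_known
    return sum(significance[f] for f in common) / total


# Hypothetical fragment weights for one endpoint (higher = more significant).
weights = {"N=O": 0.9, "c:c": 0.1, "C-Cl": 0.5}
query = {"N=O", "c:c", "X"}  # "X" stands for a fragment unseen in training
neighbor = {"N=O", "C-Cl"}
similarity = activity_specific_similarity(query, neighbor, weights)
```

Because the weights depend on the endpoint, the same pair of compounds can be very similar for one endpoint and dissimilar for another. Fragments returned by `unknown_fragments` do not enter the index in this sketch and would have to be judged separately.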