Accuracy assessment
We use predictions for genes from the known sites library to assess pipeline performance. By comparing experimentally known binding sites with predicted motifs, we define the following observables:
- TP (true positives): total number of overlapping bases in discovered and experimental motifs
- FN (false negatives): total number of bases in experimental motifs that are not covered by discovered motifs
- FP (false positives): total number of bases in discovered motifs that are not in experimental motifs
- TN (true negatives): total number of bases in the search region used in the performance analysis that are covered by neither experimental motifs nor discovered motifs
Note that all of the above observables are defined at the base-pair level.
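The base-level counts above can be illustrated with a minimal sketch. It assumes that motifs and the search region are given as half-open (start, end) coordinate pairs on a single sequence; that representation and the names `covered_positions` and `base_level_counts` are hypothetical and are not taken from the pipeline itself.

```python
def covered_positions(intervals):
    """Return the set of base positions covered by half-open (start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def base_level_counts(experimental, discovered, region):
    """Count TP, FN, FP and TN at the base level within one search region.

    Assumption: all coordinates refer to the same sequence and `region`
    is the half-open (start, end) search region used in the analysis.
    """
    region_bases = set(range(region[0], region[1]))
    # Restrict both motif sets to the search region.
    exp = covered_positions(experimental) & region_bases
    disc = covered_positions(discovered) & region_bases
    return {
        "TP": len(exp & disc),                 # bases in both
        "FN": len(exp - disc),                 # experimental only
        "FP": len(disc - exp),                 # discovered only
        "TN": len(region_bases - exp - disc),  # covered by neither
    }


counts = base_level_counts(
    experimental=[(10, 20)], discovered=[(15, 30)], region=(0, 100)
)
print(counts)
```

In this toy case the four counts sum to the length of the search region, which is a useful sanity check on any implementation of these definitions.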
We also count hits at the motif level. A discovered motif is called a "true hit" if it overlaps an experimental motif by at least 5 bases; any other discovered motif is called a "false hit". We denote the total number of true hits as TH and the total number of false hits as FH.
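The hit-level counts can be sketched in the same way. This example again assumes half-open (start, end) intervals, and it assumes that the 5-base overlap must be reached with a single experimental motif rather than summed over several; the text does not specify this, and the function names are hypothetical.

```python
MIN_OVERLAP = 5  # minimum overlap in bases for a "true hit", as stated in the text


def overlap_length(a, b):
    """Number of bases shared by two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def count_hits(experimental, discovered, min_overlap=MIN_OVERLAP):
    """Return (TH, FH): counts of true and false hits among discovered motifs.

    Assumption: a discovered motif is a true hit if it overlaps at least one
    single experimental motif by `min_overlap` bases or more.
    """
    th = sum(
        1
        for d in discovered
        if any(overlap_length(d, e) >= min_overlap for e in experimental)
    )
    fh = len(discovered) - th
    return th, fh


th, fh = count_hits(
    experimental=[(10, 20)],
    discovered=[(15, 30), (18, 25), (40, 50)],
)
print(th, fh)
```

Here the three discovered motifs overlap the experimental motif by 5, 2 and 0 bases, so only the first one counts as a true hit.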
We define sensitivity SN = TP / (TP + FN), specificity SP = TN / (TN + FP), and positive predictive value PPV = TP / (TP + FP).
At the hit level, we define the positive predictive value as HPPV = TH / (TH + FH).
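As a small worked example, the four measures can be computed directly from the counts. The function names below are hypothetical, and returning NaN for a zero denominator is an assumption of this sketch, not something the text prescribes.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def performance_metrics(tp, fn, fp, tn, th, fh):
    """Compute SN, SP, PPV (base level) and HPPV (hit level) from raw counts."""
    return {
        "SN": safe_ratio(tp, tp + fn),    # sensitivity
        "SP": safe_ratio(tn, tn + fp),    # specificity
        "PPV": safe_ratio(tp, tp + fp),   # base-level positive predictive value
        "HPPV": safe_ratio(th, th + fh),  # hit-level positive predictive value
    }


# Toy counts, chosen for illustration only (not results from the paper).
metrics = performance_metrics(tp=5, fn=5, fp=10, tn=80, th=1, fh=2)
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```

With these toy counts, SN = 5/10, SP = 80/90, PPV = 5/15 and HPPV = 1/3.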
Figure 1 Results of a performance assessment of the motif prediction pipeline. Sensitivity, specificity and positive predictive value are shown as functions of p-value. Predictive performance is better than random (curve not shown). The known sites comprise a total of 7757 bases from 161 genes. Predicted motifs that overlap with experimentally known sites have higher significance (lower p-values).