Accuracy assessment

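We assess pipeline performance using predictions for genes in the known sites library. By comparing experimentally known binding sites with predicted motifs, we define the following observables:

TP (true positives): bases in known sites that are covered by a predicted motif.
FN (false negatives): bases in known sites that are not covered by any predicted motif.
FP (false positives): bases in predicted motifs that do not lie in a known site.
TN (true negatives): bases that lie in neither a known site nor a predicted motif.

Note that all of the above observables are defined at the base pair level.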

We also calculate the total number of true hits, TH. We call a discovered motif a "true hit" when it overlaps an experimental motif by at least 5 bases. Any other discovered motif is called a "false hit". We denote the total number of false hits as FH.

We define sensitivity SN = TP / (TP + FN), specificity SP = TN / (TN + FP), and positive predictive value PPV = TP / (TP + FP).
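The base pair level measures can be illustrated with a short Python sketch. It is not the pipeline's own code: the function name `base_level_metrics`, the half-open `(start, end)` interval convention, and the `region_length` parameter (the number of bases searched, needed to count TN) are assumptions made for illustration only.

```python
def base_level_metrics(known, predicted, region_length):
    """Base-pair-level SN, SP and PPV for one searched region.

    known, predicted: lists of half-open (start, end) intervals
    (assumed convention) lying inside [0, region_length).
    """
    # Expand intervals into sets of covered base positions.
    known_bases = {i for s, e in known for i in range(s, e)}
    pred_bases = {i for s, e in predicted for i in range(s, e)}

    tp = len(known_bases & pred_bases)   # known and predicted
    fn = len(known_bases - pred_bases)   # known, not predicted
    fp = len(pred_bases - known_bases)   # predicted, not known
    tn = region_length - tp - fn - fp    # neither

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "SN": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
    }


# Toy example: one known site, two predicted motifs, 100 bases searched.
result = base_level_metrics(known=[(10, 20)],
                            predicted=[(15, 25), (40, 50)],
                            region_length=100)
```

In this toy example TP = 5, FN = 5, FP = 15 and TN = 75, so SN = 0.5, SP = 75/90 and PPV = 0.25. Because TN is usually much larger than FP in promoter-sized regions, SP tends to sit close to 1, which is one reason PPV is reported alongside it.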

We also define a positive predictive value at the hit level, HPPV = TH / (TH + FH).
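The hit-level count can be sketched in the same way. Again, this is only an illustration under stated assumptions: the names `overlap` and `hit_level_ppv` are hypothetical, intervals are half-open `(start, end)` pairs, and the threshold of 5 overlapping bases is taken from the definition of a true hit above.

```python
def overlap(a, b):
    """Number of bases shared by two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def hit_level_ppv(known, predicted, min_overlap=5):
    """Return (TH, FH, HPPV) for lists of known sites and predicted motifs."""
    # A predicted motif is a true hit if it overlaps at least one
    # known site by min_overlap bases or more.
    th = sum(
        1 for p in predicted
        if any(overlap(p, k) >= min_overlap for k in known)
    )
    fh = len(predicted) - th  # every other predicted motif is a false hit
    hppv = th / (th + fh) if predicted else float("nan")
    return th, fh, hppv


# Toy example: the first motif overlaps the known site by 5 bases (true hit),
# the second does not overlap, and the third overlaps by only 2 bases.
th, fh, hppv = hit_level_ppv(known=[(10, 20)],
                             predicted=[(15, 25), (40, 50), (18, 30)])
```

Here TH = 1 and FH = 2, giving HPPV = 1/3. Note that a motif overlapping a known site by fewer than 5 bases counts as a false hit even though it contributes true positives at the base pair level.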

Figure 1. Results of a performance assessment of the motif prediction pipeline. Sensitivity, specificity and positive predictive value are shown as functions of p-value. Predictive performance is better than random (curve not shown). The known sites comprise a total of 7757 bases from 161 genes. Predicted motifs that overlap experimentally known sites are more significant (have lower p-values).

Questions or comments: cisred@bcgsc.ca