Recent progress in the use of data mining and statistical techniques to automatically classify related software failures and to localize defects suggests that if appropriate information is collected about executions of deployed software, then such techniques can assist developers in prioritizing action on software failures reported by users and in diagnosing their causes. Users, however, are not reliable judges of correct software behavior: they may overlook real failures, neglect to report failures they do observe, or report spurious failures. Instead, I propose to employ users as independent checks on each other. Previous work demonstrated that executions with similar execution profiles often represent similar program behavior. Grouping similar executions together therefore lets user-submitted labels corroborate one another:
similar executions with the same label represent consensus, and similar executions
with differing labels represent suspicious or confusing behavior.
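As an illustration only, the sketch below groups hypothetical execution profiles with k-means clustering and flags any cluster whose user-submitted labels disagree. The profile features, cluster count, and labels are invented for the example; they are not the data or the exact procedure evaluated here.

```python
# Illustrative sketch: group executions by profile similarity, then check
# whether users' labels within each group agree (consensus) or conflict
# (suspicious behavior worth reviewing first). All values are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

# Each row is an execution profile (e.g., normalized branch or call counts);
# each label is the user's own judgment of that run.
profiles = np.array([
    [0.90, 0.10, 0.00],   # execution 0
    [0.85, 0.15, 0.00],   # execution 1 (similar to 0)
    [0.10, 0.20, 0.70],   # execution 2
    [0.12, 0.18, 0.70],   # execution 3 (similar to 2)
])
labels = ["SUCCESS", "FAILURE", "SUCCESS", "SUCCESS"]

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)

for c in sorted(set(clusters)):
    members = [i for i, k in enumerate(clusters) if k == c]
    member_labels = {labels[i] for i in members}
    if len(member_labels) == 1:
        print(f"cluster {c}: consensus on {member_labels.pop()} {members}")
    else:
        # Similar executions with conflicting labels: flag for developer review.
        print(f"cluster {c}: conflicting labels {member_labels}, review {members}")
```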
An empirical evaluation of two proposed techniques, Corroboration-Based Filtering and Review-All-FAILUREs plus k-Nearest Neighbors, indicates that they discover significantly more failures and defects than the naive Review-All-FAILUREs (RAF) strategy. A third proposed technique, round-robin cluster sampling, discovers failures and defects more quickly than RAF.
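Round-robin cluster sampling is only named above, not specified. The sketch below shows one plausible reading, assuming clusters of execution profiles are already available (e.g., from the grouping illustrated earlier): executions are selected one per cluster in rotation until a review budget is exhausted. The round_robin_sample helper and the example clusters are hypothetical, not the evaluated implementation.

```python
# A minimal sketch of round-robin cluster sampling under the assumptions above.
from collections import deque

def round_robin_sample(clusters, budget):
    """Yield up to `budget` execution ids, taking one from each cluster in turn."""
    queues = deque(deque(members) for members in clusters if members)
    picked = []
    while queues and len(picked) < budget:
        q = queues.popleft()
        picked.append(q.popleft())
        if q:                    # cluster still has unreviewed executions,
            queues.append(q)     # so rotate it to the back of the round
    return picked

# Example: three clusters of execution ids and a review budget of 5 executions.
clusters = [[0, 1, 4], [2, 3], [5]]
print(round_robin_sample(clusters, budget=5))  # -> [0, 2, 5, 1, 3]
```

Reviewing clusters in rotation spreads the developer's limited review effort across dissimilar behaviors, which is one way such a strategy could surface distinct failures and defects earlier than reviewing all FAILURE-labeled runs in arrival order.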