Software defect data - predictability and exploration






Software defect reports have been used prominently in reliability modeling. Data about the defects found during software testing are recorded in software defect reports, also called bug reports. The data include the number of defects at various testing stages, the complexity and severity of each defect, the component to which the defect belongs, the tester, and the person fixing the defect. Reliability models mainly use the number of defects and the corresponding times to predict the number of remaining defects.

This thesis proposes an empirical approach to systematically extract useful information from software defect reports by (1) employing a data exploration technique that analyzes relationships between the software quality of different releases using appropriate statistics, and (2) constructing predictive models for forecasting the time for fixing defects using existing machine learning and data mining techniques. This work differs from traditional software reliability modeling in two ways. First, it aims to predict the time for fixing defects, as opposed to the number of remaining defects. While the latter gives a useful measure of software quality, in practice it cannot be used directly for development planning, since the number of defects is not linearly related to the time and resources required. In contrast, a prediction of the time for fixing defects can be used directly to help schedule and manage software activities. Second, while reliability models are mainly based on a small number of defect data attributes with numerical values, the proposed approach extends the use of defect data to include more relevant attributes whose values can be both quantitative and qualitative.
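As a minimal sketch of the data exploration step in (1), the relationship between per-component defect counts in two releases could be summarized with a rank correlation. The abstract does not name the statistic used, so Spearman's coefficient is an assumed choice here, and the defect counts below are hypothetical values made up for illustration.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical defect counts per component (not data from the thesis).
defects_release_a = [12, 30, 7, 22, 15, 3]
defects_release_b = [4, 9, 1, 8, 3, 2]

rho = spearman(defects_release_a, defects_release_b)
print(f"Spearman rho = {rho:.3f}")
```

A coefficient near 1 would suggest that components with many defects in one release tend to have many in the other. It says nothing about why.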

To illustrate the approach, we present an empirical study on a software defect report collected during the testing of a large medical software system. For data exploration, we use defects found per component and investigate relationships between defects in modules before and after release. For building predictive models, we apply various well-established machine learning and data mining algorithms, including the decision tree learner, the Naive Bayes learner, and neural networks with backpropagation learning. The average results obtained from these algorithms are compared to illustrate the robustness of the proposed approach in predicting the time for fixing defects. The results are promising, with the top-performing model having an average accuracy of 93.5%.
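The modeling step described above can be sketched, under stated assumptions, as a classifier over qualitative defect attributes evaluated by average accuracy. The sketch covers only a Naive Bayes learner with add-one smoothing and k-fold cross-validation; it is not the thesis's implementation. The attribute names, the two time-to-fix classes ("fast" and "slow"), and the synthetic records are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    domains = defaultdict(set)           # attribute index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict(model, row):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(domains[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_val_accuracy(rows, labels, k=5, seed=0):
    """Mean accuracy over k folds of a shuffled index list."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        model = train_naive_bayes([rows[i] for i in train_idx],
                                  [labels[i] for i in train_idx])
        correct = sum(predict(model, rows[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k


# Synthetic, hypothetical defect records: (severity, complexity, component).
severities = ["low", "medium", "high"]
complexities = ["simple", "complex"]
components = ["ui", "db", "net"]
rows = [(s, c, m) for s in severities for c in complexities for m in components] * 5
# Hypothetical rule: only high-severity defects take long to fix.
labels = ["slow" if s == "high" else "fast" for s, _, _ in rows]

accuracy = cross_val_accuracy(rows, labels, k=5)
print(f"mean 5-fold accuracy: {accuracy:.3f}")
```

Because the synthetic labels follow a simple rule, the accuracy printed here is not comparable to the 93.5% reported for the real defect data.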



Effort estimation, Prediction, Data mining, Defect, Software defects