TY - JOUR
T1 - A statistical methodology to simplify software metric models constructed using incomplete data samples
AU - Chan, Victor K.Y.
AU - Wong, W. Eric
AU - Xie, T. F.
N1 - Funding Information:
This paper is based on work supported by grant 045/2005/A from the Science and Technology Development Fund of the Government of the Macau Special Administrative Region, China.
PY - 2007/12
Y1 - 2007/12
N2 - Software metric models predict the target software metric(s), e.g., the development work effort or defect rates, for any future software project based on the project's predictor software metric(s), e.g., the project team size. Obviously, the construction of such a software metric model makes use of a data sample of such metrics from analogous past projects. However, incomplete data often appear in such data samples. Moreover, the decision on whether a particular predictor metric should be included is most likely based on an intuitive or experience-based assumption that the predictor metric has an impact on the target metric with statistical significance. However, this assumption is usually not verifiable "retrospectively" after the model is constructed, leading to redundant predictor metric(s) and/or unnecessary predictor metric complexity. To solve all these problems, we derived a methodology consisting of the k-nearest neighbors (k-NN) imputation method, statistical hypothesis testing, and a "goodness-of-fit" criterion. This methodology was tested on software effort metric models and software quality metric models, the latter of which usually suffer from far more serious incomplete data. This paper documents this methodology and the tests on these two types of software metric models.
AB - Software metric models predict the target software metric(s), e.g., the development work effort or defect rates, for any future software project based on the project's predictor software metric(s), e.g., the project team size. Obviously, the construction of such a software metric model makes use of a data sample of such metrics from analogous past projects. However, incomplete data often appear in such data samples. Moreover, the decision on whether a particular predictor metric should be included is most likely based on an intuitive or experience-based assumption that the predictor metric has an impact on the target metric with statistical significance. However, this assumption is usually not verifiable "retrospectively" after the model is constructed, leading to redundant predictor metric(s) and/or unnecessary predictor metric complexity. To solve all these problems, we derived a methodology consisting of the k-nearest neighbors (k-NN) imputation method, statistical hypothesis testing, and a "goodness-of-fit" criterion. This methodology was tested on software effort metric models and software quality metric models, the latter of which usually suffer from far more serious incomplete data. This paper documents this methodology and the tests on these two types of software metric models.
KW - Imputation method
KW - Missing data
KW - Model simplification
KW - Models
KW - Software metrics
KW - Software quality
UR - http://www.scopus.com/inward/record.url?scp=38849129507&partnerID=8YFLogxK
U2 - 10.1142/S0218194007003495
DO - 10.1142/S0218194007003495
M3 - Article
AN - SCOPUS:38849129507
SN - 0218-1940
VL - 17
SP - 689
EP - 707
JO - International Journal of Software Engineering and Knowledge Engineering
JF - International Journal of Software Engineering and Knowledge Engineering
IS - 6
ER -