Applying statistical methodology to optimize and simplify software metric models with missing data

W. Eric Wong, Jin Zhao, Victor K.Y. Chan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 Citations (Scopus)


During the construction of a software metric model, the decision on whether to include a particular predictor metric is most likely based on an intuitive or experience-based assumption that the predictor metric has a statistically significant impact on the target metric. However, a model constructed on such an assumption may contain redundant predictor metrics and/or unnecessary predictor metric complexity, because the assumption made before model construction is not verified after the model is constructed. To resolve the first problem (i.e., possibly redundant predictor metrics), we propose a statistical hypothesis testing methodology to verify "retrospectively" the statistical significance of the impact of each predictor metric on the target metric. If the variation of a predictor metric does not correlate enough with the variation of the target metric, the predictor metric should be deleted from the model. For the second problem (i.e., unnecessary predictor metric complexity), we use "goodness-of-fit" to determine whether certain categories of a categorical predictor metric should be combined. In addition, missing data often appear in the data sample used for constructing the model. We use a modified k-nearest neighbors (k-NN) imputation method to deal with this problem. A study using data from the "Repository Data Disk - Release 6" is reported. The results indicate that our methodology can be useful in trimming redundant predictor metrics and identifying unnecessary categories initially assumed for a categorical predictor metric in the model.
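As a rough illustration of the imputation step mentioned in the abstract, the sketch below shows plain k-NN imputation in Python. It is not the authors' modified k-NN method, whose details the abstract does not give. The function name `knn_impute`, the distance definition (root mean squared difference over the columns both rows have observed), and the toy values are all assumptions made for this example; the values are not taken from the Repository Data Disk.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of rows with each None replaced by a k-NN estimate.

    Assumption for this sketch: the distance between two rows is the root
    mean squared difference over the columns observed in both rows, and a
    missing value is filled with the mean of that column over the k nearest
    rows that have the column observed.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue  # nothing to impute for this cell
            candidates = []
            for m, other in enumerate(rows):
                # A donor must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue  # no common observed columns to compare on
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy sample (hypothetical values): the last row lacks its third metric.
sample = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 6.0, 50.0],
    [1.05, 2.05, None],
]
completed = knn_impute(sample, k=2)
print(completed[3])
```

In this toy sample the two closest rows to the incomplete one are the first two, so the missing entry is filled from their values rather than from the distant third row. In practice the metrics would usually be rescaled before computing distances so that no single metric dominates.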

Original language: English
Title of host publication: Applied Computing 2006 - The 21st Annual ACM Symposium on Applied Computing - Proceedings of the 2006 ACM Symposium on Applied Computing
Publisher: Association for Computing Machinery
Number of pages: 6
ISBN (Print): 1595931082, 9781595931085
Publication status: Published - 2006
Event: 2006 ACM Symposium on Applied Computing - Dijon, France
Duration: 23 Apr 2006 to 27 Apr 2006

Publication series

Name: Proceedings of the ACM Symposium on Applied Computing


Conference: 2006 ACM Symposium on Applied Computing


  • Imputation method
  • Missing data
  • Model optimization
  • Model simplification
  • Models
  • Software metrics

