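Supervised by: Chinese Academy of Sciences
Sponsored by: Chinese Society of Optimization, Overall Planning and Economic Mathematics
   Institutes of Science and Development, Chinese Academy of Sciences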

Chinese Journal of Management Science ›› 2019, Vol. 27 ›› Issue (3): 11-19. doi: 10.16381/j.cnki.issn1003-207x.2019.03.002

An EM-similar Imputation Algorithm for Multivariable Data Missing and its Application in Credit Scoring

JIANG Hui, MA Chao-qun, XU Xu-qing, LAN Qiu-jun   

  1. Business School of Hunan University, Changsha 410082, China
  • Received: 2018-01-17 Revised: 2018-05-25 Online: 2019-03-20 Published: 2019-04-28

Abstract: Missing data can significantly reduce the accuracy and usability of credit scoring models, especially when several variables are missing at once. The classical remedy is to substitute the mean or the mode, and the EM algorithm has recently become popular.
To address missing data in credit scoring, this paper proposes a new multivariable imputation method whose idea is similar to that of the EM algorithm. It is more widely applicable, however, because it does not require the distribution functions of the missing variables. The method consists of two stages: a model preparation stage and a data filling stage. In the model preparation stage, the Naive Bayes method is used to train a prediction model on the initial data set for every variable that may have missing values. In the data filling stage, the missing variables of a sample are filled using the prediction models built in the previous stage, by alternating iteration. The algorithm is proved to be monotonically convergent.
Three data sets are used in the experiments. One was downloaded from Renrendai, a well-known P2P financial company, and the other two (German and Australian) are benchmark data sets provided by UCI. Experimental results show that, on all three data sets, both the data recovery accuracy and the credit evaluation accuracy of the proposed method are clearly better than those of mode filling and the EM method. This indicates that the proposed method is better able to handle multivariable missing data in credit evaluation.
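
The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `NaiveBayes`, `impute`, `MISSING` and `max_iter` are hypothetical, all variables are assumed to be categorical, Laplace smoothing is an added assumption, and the stopping rule (stop when a full pass changes nothing) merely stands in for the convergence criterion of the paper, which the abstract does not spell out.

```python
import math
from collections import Counter, defaultdict

MISSING = None  # assumed marker for a missing categorical value


class NaiveBayes:
    """Categorical Naive Bayes (Laplace-smoothed) that predicts column `target`."""

    def __init__(self, rows, target, n_cols):
        self.features = [j for j in range(n_cols) if j != target]
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (feature, class) -> value counts
        self.domains = defaultdict(set)           # feature -> values seen in training
        for row in rows:
            c = row[target]
            if c is MISSING:
                continue  # rows without the target cannot be used for training
            self.class_counts[c] += 1
            for j in self.features:
                if row[j] is not MISSING:
                    self.value_counts[(j, c)][row[j]] += 1
                    self.domains[j].add(row[j])

    def predict(self, row):
        """Return the most probable class, ignoring features that are still missing."""
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for c in sorted(self.class_counts):  # sorted for deterministic tie-breaking
            score = math.log(self.class_counts[c] / total)
            for j in self.features:
                if row[j] is MISSING or not self.domains[j]:
                    continue
                counts = self.value_counts[(j, c)]
                score += math.log(
                    (counts[row[j]] + 1) / (sum(counts.values()) + len(self.domains[j]))
                )
            if score > best_score:
                best, best_score = c, score
        return best


def impute(rows, max_iter=20):
    """Fill missing entries of each row by alternating Naive Bayes predictions."""
    n_cols = len(rows[0])
    # Stage 1 (model preparation): one model per column that has missing entries.
    targets = [j for j in range(n_cols) if any(r[j] is MISSING for r in rows)]
    models = {j: NaiveBayes(rows, j, n_cols) for j in targets}
    # Stage 2 (data filling): revisit the holes of a row until nothing changes.
    filled = []
    for row in rows:
        holes = [j for j in range(n_cols) if row[j] is MISSING]
        current = list(row)
        for _ in range(max_iter):
            changed = False
            for j in holes:
                guess = models[j].predict(current)
                if guess != current[j]:
                    current[j], changed = guess, True
            if not changed:
                break
        filled.append(current)
    return filled
```

In this sketch a row with several missing variables is filled one variable at a time: the first prediction uses only the observed values, and later predictions also use the values filled so far. No distribution has to be specified for the missing variables, which matches the applicability argument made in the abstract.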

Key words: EM algorithm, credit scoring, data missing, data mining
