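Supervised by: Chinese Academy of Sciences
Sponsored by: Chinese Society of Optimization, Overall Planning and Economic Mathematics
   Institutes of Science and Development, Chinese Academy of Sciences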

Chinese Journal of Management Science, 2019, Vol. 27, Issue (3): 11-19. doi: 10.16381/j.cnki.issn1003-207x.2019.03.002

• Articles •

An EM-similar Imputation Algorithm for Multivariable Data Missing and its Application in Credit Scoring

JIANG Hui, MA Chao-qun, XU Xu-qing, LAN Qiu-jun

  1. Business School of Hunan University, Changsha 410082, China
  • Received: 2018-01-17 Revised: 2018-05-25 Online: 2019-03-20 Published: 2019-04-28
  • Corresponding author: LAN Qiu-jun (born 1972), male, Han ethnicity, native of Loudi, Hunan; Professor, Business School of Hunan University; research interest: financial data mining. E-mail: lanqiujun@hnu.edu.cn
  • Supported by:

    Key Program of the National Natural Science Foundation of China (71431008); Emergency Program of the National Natural Science Foundation of China (71850012); Humanities and Social Sciences Research Planning Fund of the Ministry of Education (18YJAZH038)

Abstract: Missing data can significantly reduce the accuracy and usability of a credit scoring model, especially when several variables are missing at the same time. The classical remedy is to substitute the mean or the mode, and the EM algorithm has recently become popular.
This paper addresses missing data at the model application (scoring) stage and proposes a new multivariable data filling method whose idea is similar to the EM algorithm. It is more widely applicable than EM, however, because it does not require the distribution functions of the missing variables. The method consists of two stages: a model preparation stage and a data filling stage. In the model preparation stage, the Naive Bayes method is trained on the initial data set to build a single-variable prediction model for every variable that may be missing. In the data filling stage, the missing variables of a given sample are filled by applying these prediction models in alternating iterations, updating the filled values step by step. The algorithm is proved to be monotonically convergent.
Three data sets are used in the experiments: one downloaded from Renrendai, a well-known P2P financial company, and two benchmark data sets (German and Australian) provided by UCI. On all three data sets, the proposed method is clearly better than mode filling and EM filling in both the accuracy of data recovery and the accuracy of credit evaluation after filling. The proposed method therefore offers a better way to handle multivariable missing data in credit evaluation.
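
The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `CategoricalNB` and the functions `prepare_models` and `impute` are hypothetical names, and the Laplace smoothing, the mode-based initial fill, the stopping rule (stop when no filled value changes) and the toy data are all assumptions made for illustration. The sketch handles categorical variables only and does not reproduce the convergence proof.

```python
from collections import Counter, defaultdict
import math


class CategoricalNB:
    """Naive Bayes that predicts column `target` from all other (categorical) columns."""

    def __init__(self, target, alpha=1.0):
        self.target = target
        self.alpha = alpha  # Laplace smoothing constant (assumption of this sketch)

    def fit(self, rows):
        self.features = [j for j in range(len(rows[0])) if j != self.target]
        self.n = len(rows)
        self.class_counts = Counter(r[self.target] for r in rows)
        # cond[j][c][v] = number of rows with target value c and value v in column j
        self.cond = {j: defaultdict(Counter) for j in self.features}
        self.values = {j: set() for j in self.features}
        for r in rows:
            for j in self.features:
                self.cond[j][r[self.target]][r[j]] += 1
                self.values[j].add(r[j])
        return self

    def predict(self, row):
        best, best_score = None, -math.inf
        for c in sorted(self.class_counts):
            n_c = self.class_counts[c]
            score = math.log(n_c / self.n)  # log prior
            for j in self.features:
                k = len(self.values[j])
                count = self.cond[j][c][row[j]]
                score += math.log((count + self.alpha) / (n_c + self.alpha * k))
            if score > best_score:
                best, best_score = c, score
        return best


def prepare_models(complete_rows, candidate_cols):
    """Stage 1: one single-variable prediction model per column that may be missing."""
    return {j: CategoricalNB(j).fit(complete_rows) for j in candidate_cols}


def impute(row, models, modes, max_iter=20):
    """Stage 2: fill the None entries of `row` by alternating re-prediction."""
    missing = [j for j, v in enumerate(row) if v is None]
    # Initial guess: column mode (assumption; the abstract does not state the start value).
    filled = [modes[j] if v is None else v for j, v in enumerate(row)]
    for _ in range(max_iter):
        changed = False
        for j in missing:
            new_value = models[j].predict(filled)
            if new_value != filled[j]:
                filled[j] = new_value
                changed = True
        if not changed:  # no filled value changed in a full pass: stop
            break
    return filled


# Toy complete data set (made up): columns 2 and 3 may be missing at scoring time.
rows = [["a", "r", "x", "p"]] * 6 + [["b", "s", "y", "q"]] * 4
candidate_cols = [2, 3]
models = prepare_models(rows, candidate_cols)
modes = {j: Counter(r[j] for r in rows).most_common(1)[0][0] for j in candidate_cols}

result = impute(["b", "s", None, None], models, modes)
print(result)  # ['b', 's', 'y', 'q']
```

In this toy run the two missing entries start at the column modes `x` and `p`; the first pass replaces them with `y` and `q`, and the second pass changes nothing, so the loop stops. Each model is trained once on complete data and only reused during filling, which mirrors the separation between the preparation and filling stages described above.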

Key words: EM algorithm, credit scoring, data missing, data mining
