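Missing data significantly reduces the accuracy and usability of credit scoring models, especially when several variables are missing at once. This paper addresses multivariate missing data at the model application stage and proposes a new data imputation algorithm. The algorithm has two stages: a preparation stage and a data imputation stage. In the preparation stage, it trains on the initial data set using the Naive Bayes method and builds a univariate prediction model for every variable that may be missing. The imputation stage borrows the idea of the EM algorithm: using the univariate prediction models from the preparation stage, it alternately and iteratively fills in and updates the missing values of a given sample with multiple missing variables. The algorithm is proved to converge monotonically. Using the Renrendai data set and the German and Australian credit scoring benchmark data sets provided by UCI, the algorithm is compared experimentally with mode imputation and EM imputation. The results show that the proposed method is clearly better in both data recovery performance and post-imputation credit scoring accuracy. This offers a better approach to handling multivariate missing data in credit scoring.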
Missing data can significantly reduce the accuracy and usability of credit scoring models, especially when several variables are missing simultaneously. The classical way to fill in missing data is mean or mode substitution, and the EM algorithm has become popular more recently.
To address missing data at the stage where the credit scoring model is applied, this paper proposes a new multivariate data filling method whose idea is similar to that of the EM algorithm. It is, however, more widely applicable because it does not require the distribution functions of the missing variables. The method consists of two stages: a model preparation stage and a data filling stage. In the model preparation stage, the Naive Bayes method is used to train, on the initial data set, a prediction model for every variable that may be missing. In the data filling stage, the missing variables of a sample are filled in using the prediction models built in the previous stage and are updated alternately and iteratively. The algorithm is proved to be monotonically convergent.
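The two stages described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes purely categorical variables, add-one (Laplace) smoothing, initialisation of each missing value with the training mode, and an iteration cap `max_iter` as a safeguard. The function names `train_nb`, `predict_nb`, and `fill_missing` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, target):
    """Preparation stage: fit a categorical Naive Bayes model that predicts
    column `target` from all other columns of the complete training rows."""
    n_cols = len(rows[0])
    classes = Counter(r[target] for r in rows)   # counts of each target value
    cond = defaultdict(Counter)                  # (column, target value) -> value counts
    for r in rows:
        for j in range(n_cols):
            if j != target:
                cond[(j, r[target])][r[j]] += 1
    n_values = [len({r[j] for r in rows}) for j in range(n_cols)]
    return {"target": target, "classes": classes, "cond": cond,
            "n_values": n_values, "n": len(rows)}


def predict_nb(model, row):
    """Return the most probable value of the model's target column given `row`."""
    best, best_score = None, -math.inf
    for c in sorted(model["classes"]):           # sorted for deterministic tie-breaking
        count_c = model["classes"][c]
        score = math.log(count_c / model["n"])
        for j, v in enumerate(row):
            if j == model["target"]:
                continue
            # Assumption: add-one smoothing avoids log(0) for unseen combinations.
            num = model["cond"][(j, c)][v] + 1
            score += math.log(num / (count_c + model["n_values"][j]))
        if score > best_score:
            best, best_score = c, score
    return best


def fill_missing(models, row, max_iter=20):
    """Filling stage: re-predict each missing variable in turn from the current
    values of all other variables until no filled value changes."""
    missing = [j for j, v in enumerate(row) if v is None]
    filled = list(row)
    for j in missing:
        # Assumption: start from the most frequent training value (the mode).
        filled[j] = models[j]["classes"].most_common(1)[0][0]
    for _ in range(max_iter):
        changed = False
        for j in missing:
            new_value = predict_nb(models[j], filled)
            if new_value != filled[j]:
                filled[j], changed = new_value, True
        if not changed:
            break
    return filled


# Toy data (hypothetical): four categorical variables, two distinct profiles.
rows = [("a", "k", "x", "p")] * 5 + [("b", "m", "y", "q")] * 5
models = {j: train_nb(rows, j) for j in range(4)}   # one model per variable
print(fill_missing(models, ["b", "m", None, None]))  # ['b', 'm', 'y', 'q']
```

In the toy run both missing variables start at their training modes (`x` and `p`), are corrected in the first pass, and the second pass confirms that nothing changes. The sketch makes no claim about convergence beyond the `max_iter` cap; the monotone convergence result belongs to the algorithm as defined in the paper.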
Three data sets are used in the experiments. One was downloaded from Renrendai, a well-known P2P lending company, and the other two (German and Australian) are benchmark data sets provided by UCI. The experimental results show that, on all three data sets, the proposed method achieves clearly better data recovery accuracy and credit evaluation accuracy than mode filling and the EM method. This indicates that the proposed method is better able to handle multivariate missing data in credit evaluation.