An EM-similar Imputation Algorithm for Multivariable Data Missing and its Application in Credit Scoring

JIANG Hui, MA Chao-qun, XU Xu-qing, LAN Qiu-jun

doi:10.16381/j.cnki.issn1003-207x.2019.03.002

Chinese Journal of Management Science, 2019, Vol. 27, Issue 3: 11-19

Business School of Hunan University, Changsha 410082, China

Received date: 2018-01-17

Revised date: 2018-05-25

Online published: 2019-04-28

Abstract

Missing data can significantly reduce the accuracy and usability of a credit scoring model, especially when several variables are missing at once. The classical approach fills missing values with the mean or the mode, and the EM algorithm has become popular more recently.
To address missing data in credit scoring, this paper proposes a new multivariable data filling method whose idea is similar to that of the EM algorithm. It is more widely applicable, however, because it does not require the distribution functions of the missing variables. The method consists of two stages: a model preparation stage and a data filling stage. In the model preparation stage, the Naive Bayes method is used to train, on the initial data set, a prediction model for every variable that may be missing. In the data filling stage, the missing variables of a sample are filled by alternating iteration, using the prediction models built in the first stage. The algorithm is proved to be monotonically convergent.
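The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes categorical variables, a hand-written Naive Bayes with Laplace smoothing, an initial guess made from the observed values only, and a stopping rule of "no value changed in a full pass". The names `train_nb`, `predict_nb`, `impute` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

MISSING = None  # assumption: missing entries are encoded as None


def train_nb(rows, target):
    """Stage 1: fit a categorical Naive Bayes model that predicts column
    `target` from the other columns, using rows where `target` is observed."""
    class_counts = Counter()
    cond_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)           # feature index -> observed values
    for row in rows:
        for j, v in enumerate(row):
            if v is not MISSING:
                values[j].add(v)
        y = row[target]
        if y is MISSING:
            continue
        class_counts[y] += 1
        for j, v in enumerate(row):
            if j != target and v is not MISSING:
                cond_counts[(j, y)][v] += 1
    return class_counts, cond_counts, values


def predict_nb(model, row, target):
    """Return the most probable value of column `target`, ignoring any
    feature of `row` that is still missing."""
    class_counts, cond_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y in sorted(class_counts, key=str):  # sorted for deterministic ties
        score = math.log(class_counts[y] / total)
        for j, v in enumerate(row):
            if j == target or v is MISSING:
                continue
            counts = cond_counts[(j, y)]
            # Laplace smoothing over the observed values of feature j
            denom = sum(counts.values()) + max(len(values[j]), 1)
            score += math.log((counts[v] + 1) / denom)
        if score > best_score:
            best, best_score = y, score
    return best


def impute(rows, max_iter=10):
    """Stage 2: fill each sample's missing variables by alternating
    predictions until no filled value changes (or max_iter is reached)."""
    n_cols = len(rows[0])
    models = {j: train_nb(rows, j) for j in range(n_cols)
              if any(r[j] is MISSING for r in rows)}
    filled = []
    for row in rows:
        holes = [j for j, v in enumerate(row) if v is MISSING]
        current = list(row)
        for j in holes:  # initial guess from the observed values only
            current[j] = predict_nb(models[j], row, j)
        for _ in range(max_iter):
            changed = False
            for j in holes:  # update one missing variable at a time
                new = predict_nb(models[j], current, j)
                if new != current[j]:
                    current[j], changed = new, True
            if not changed:
                break
        filled.append(current)
    return filled


# Toy data (hypothetical): housing status, income level, credit label.
rows = [
    ("own", "high", "good"),
    ("own", "high", "good"),
    ("rent", "low", "bad"),
    ("rent", "low", "bad"),
    ("own", MISSING, "good"),
    ("rent", MISSING, MISSING),
]
filled = impute(rows)
for r in filled:
    print(r)
```

Because each variable is predicted by a classifier rather than drawn from a fitted parametric model, the sketch needs no assumption about the distribution of the missing variables, which is the property the abstract highlights.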
Three data sets are used in the experiments. One was downloaded from Renrendai, a well-known P2P financial company, and the other two (German and Australian) are benchmark data sets provided by UCI. On all three data sets, the experimental results show that the proposed method achieves clearly higher data recovery accuracy and credit evaluation accuracy than mode filling and the EM method. This indicates that the proposed method is better able to handle multivariable missing data in credit evaluation.
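As an illustration of how a data recovery accuracy could be measured for the mode-filling baseline, the sketch below hides known values, fills them with the column mode, and reports the share of hidden values recovered exactly. The abstract does not define the metric, so this definition is an assumption, and the names `hide`, `mode_fill`, `recovery_accuracy` and the toy data are hypothetical.

```python
from collections import Counter


def hide(rows, positions):
    """Return a copy of `rows` with the given (row, column) cells set to None."""
    masked = [list(r) for r in rows]
    for i, j in positions:
        masked[i][j] = None
    return masked


def mode_fill(rows):
    """Fill every missing cell with the most frequent observed value
    of its column (the classical mode-substitution baseline)."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        modes.append(Counter(observed).most_common(1)[0][0] if observed else None)
    return [[modes[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def recovery_accuracy(truth, filled, positions):
    """Share of hidden cells whose filled value equals the true value."""
    hits = sum(truth[i][j] == filled[i][j] for i, j in positions)
    return hits / len(positions)


# Toy data (hypothetical): housing status and income level.
truth = [
    ["own", "high"],
    ["own", "high"],
    ["rent", "low"],
    ["own", "low"],
    ["own", "high"],
]
hidden = [(0, 1), (2, 0), (3, 1)]  # cells removed to simulate missing data
masked = hide(truth, hidden)
accuracy = recovery_accuracy(truth, mode_fill(masked), hidden)
print(accuracy)
```

The same `recovery_accuracy` function could score any other filling method on the same hidden cells, which keeps the comparison between methods on equal terms.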

Key words: EM algorithm; credit scoring; data missing; data mining

Cite this article

JIANG Hui, MA Chao-qun, XU Xu-qing, LAN Qiu-jun. An EM-similar Imputation Algorithm for Multivariable Data Missing and its Application in Credit Scoring[J]. Chinese Journal of Management Science, 2019, 27(3): 11-19. DOI: 10.16381/j.cnki.issn1003-207x.2019.03.002
