Credit Scoring Based on Semi-supervised Support Vector Machine

doi:10.16381/j.cnki.issn1003-207x.2021.2434

Abstract

Labeled samples in credit scoring are difficult and costly to obtain. To address this problem, a new credit scoring model based on semi-supervised support vector machines is proposed. Introducing separate parameters for the unlabeled samples removes the need for the missing-at-random assumption, which makes the model broadly applicable. A semi-supervised term added to the loss function encourages similarity between the coefficients of the labeled and unlabeled samples, so that information from the unlabeled samples is fused effectively and estimation improves. In addition, Group LASSO is used for variable selection, which exploits group structure information when screening important variables. Numerical simulations and a real credit card default prediction dataset demonstrate the feasibility of the proposed method and its good performance in variable selection, coefficient estimation, and classification prediction.

Key words: semi-supervised classification, support vector machines, variable selection, credit scoring
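The abstract names three ingredients: a loss on labeled samples, a semi-supervised term that pulls the labeled and unlabeled coefficient vectors together, and a Group LASSO penalty. The exact objective is not given here, so the following is only a minimal sketch under stated assumptions: a TSVM-style symmetric hinge for the unlabeled samples, a squared-difference similarity term, and the hypothetical names `w_lab`, `w_unl`, `lam_sim`, and `lam_grp`. It should not be read as the authors' formulation.

```python
import numpy as np

def hinge(margin):
    """Standard hinge loss max(0, 1 - margin), applied elementwise."""
    return np.maximum(0.0, 1.0 - margin)

def group_lasso_penalty(w, groups):
    """Sum over groups of sqrt(group size) times the L2 norm of that group's coefficients."""
    return sum(np.sqrt(len(idx)) * np.linalg.norm(w[idx]) for idx in groups)

def objective(w_lab, w_unl, X_lab, y_lab, X_unl, groups, lam_sim=1.0, lam_grp=0.1):
    """Illustrative semi-supervised SVM objective (assumed form, not taken from the paper)."""
    # Supervised part: hinge loss on labeled samples, labels in {-1, +1}.
    sup = hinge(y_lab * (X_lab @ w_lab)).mean()
    # Unlabeled part (assumption): symmetric hinge, small when points lie far from the boundary.
    unsup = hinge(np.abs(X_unl @ w_unl)).mean()
    # Similarity part (assumption): squared distance between the two coefficient vectors.
    sim = lam_sim * np.sum((w_lab - w_unl) ** 2)
    # Group LASSO on both coefficient vectors, so whole variable groups can be dropped.
    pen = lam_grp * (group_lasso_penalty(w_lab, groups) + group_lasso_penalty(w_unl, groups))
    return sup + unsup + sim + pen

# Tiny hypothetical data set: two labeled samples, one unlabeled sample, two variable groups.
X_lab = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]])
y_lab = np.array([1, -1])
X_unl = np.array([[0.5, 0.5, 0.0]])
groups = [[0], [1, 2]]

w = np.array([1.0, 0.0, 0.0])
value_equal = objective(w, w, X_lab, y_lab, X_unl, groups)
value_apart = objective(w, w + 0.5, X_lab, y_lab, X_unl, groups)
```

Keeping two coefficient vectors tied by a penalty, rather than one shared vector, is what lets the labeled and unlabeled populations differ; setting `lam_sim` very large recovers the shared-coefficient case.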

CLC number: C812

Song Chen, Xiuyun Yu, Yongqin Qiu, Kuangnan Fang. Credit Scoring Based on Semi-supervised Support Vector Machine[J]. Chinese Journal of Management Science, 2024, 32(3): 1-8.

Figures and tables (5)

Table 1. Simulation results for α = β

Setting	Model	Accuracy ($n_1 = 50$)	Precision ($n_1 = 50$)	TPR ($n_1 = 50$)	Accuracy ($n_1 = 100$)	Precision ($n_1 = 100$)	TPR ($n_1 = 100$)	Accuracy ($n_1 = 200$)	Precision ($n_1 = 200$)	TPR ($n_1 = 200$)
$\rho_L = \rho_U = 0.2$	GP_SSVM	0.709 (0.061)	0.734 (0.067)	0.925 (0.166)	0.764 (0.047)	0.776 (0.062)	0.971 (0.057)	0.819 (0.035)	0.828 (0.045)	0.998 (0.011)
	TSVM	0.683 (0.065)	0.699 (0.094)	-	0.750 (0.040)	0.748 (0.064)	-	0.810 (0.032)	0.807 (0.051)	-
	GP_SVM	0.590 (0.071)	0.711 (0.157)	0.667 (0.174)	0.666 (0.067)	0.721 (0.149)	0.725 (1.040)	0.783 (0.041)	0.754 (0.060)	0.963 (0.410)
$\rho_L = \rho_U = 0.8$	GP_SSVM	0.706 (0.067)	0.730 (0.060)	0.897 (0.109)	0.766 (0.059)	0.777 (0.057)	0.967 (0.059)	0.803 (0.033)	0.800 (0.037)	0.997 (0.013)
	TSVM	0.663 (0.060)	0.686 (0.064)	-	0.721 (0.055)	0.734 (0.073)	-	0.766 (0.031)	0.744 (0.042)	-
	GP_SVM	0.582 (0.060)	0.608 (0.109)	0.575 (0.135)	0.648 (0.053)	0.645 (0.074)	0.613 (0.605)	0.728 (0.050)	0.709 (0.076)	0.975 (0.066)
$\rho_L = 0.2,\ \rho_U = 0.8$	GP_SSVM	0.700 (0.061)	0.736 (0.058)	0.901 (0.136)	0.744 (0.064)	0.783 (0.064)	0.973 (0.052)	0.801 (0.042)	0.817 (0.040)	0.997 (0.013)
	TSVM	0.647 (0.065)	0.685 (0.081)	-	0.711 (0.053)	0.731 (0.069)	-	0.770 (0.049)	0.777 (0.057)	-
	GP_SVM	0.577 (0.053)	0.633 (0.092)	0.692 (0.776)	0.661 (0.065)	0.664 (0.059)	0.683 (1.210)	0.739 (0.050)	0.742 (0.069)	0.958 (0.444)
$\rho_L = 0.8,\ \rho_U = 0.2$	GP_SSVM	0.699 (0.072)	0.708 (0.073)	0.880 (0.136)	0.761 (0.044)	0.766 (0.058)	0.972 (0.073)	0.818 (0.030)	0.824 (0.039)	0.993 (0.027)
	TSVM	0.667 (0.062)	0.675 (0.097)	-	0.736 (0.053)	0.729 (0.065)	-	0.787 (0.038)	0.771 (0.062)	-
	GP_SVM	0.559 (0.076)	0.597 (0.113)	0.605 (0.096)	0.624 (0.060)	0.608 (0.090)	0.675 (0.826)	0.764 (0.048)	0.742 (0.079)	0.942 (0.587)

TPR: true positive rate. "-": no value appears in the source.

Table 2. Simulation results for α ≠ β

Setting	Model	Accuracy ($n_1 = 50$)	Precision ($n_1 = 50$)	TPR ($n_1 = 50$)	Accuracy ($n_1 = 100$)	Precision ($n_1 = 100$)	TPR ($n_1 = 100$)	Accuracy ($n_1 = 200$)	Precision ($n_1 = 200$)	TPR ($n_1 = 200$)
$\rho_L = \rho_U = 0.2$	GP_SSVM	0.698 (0.062)	0.728 (0.066)	0.897 (0.103)	0.760 (0.044)	0.774 (0.049)	0.977 (0.050)	0.802 (0.036)	0.814 (0.040)	0.997 (0.013)
	TSVM	0.688 (0.057)	0.700 (0.082)	-	0.736 (0.045)	0.737 (0.072)	-	0.795 (0.031)	0.796 (0.052)	-
	GP_SVM	0.607 (0.073)	0.693 (0.069)	0.667 (0.205)	0.681 (0.064)	0.692 (0.121)	0.708 (0.910)	0.751 (0.051)	0.763 (0.085)	0.983 (0.061)
$\rho_L = \rho_U = 0.8$	GP_SSVM	0.693 (0.060)	0.714 (0.057)	0.907 (0.097)	0.743 (0.054)	0.743 (0.057)	0.967 (0.053)	0.786 (0.031)	0.780 (0.038)	0.991 (0.028)
	TSVM	0.645 (0.048)	0.667 (0.072)	-	0.710 (0.055)	0.713 (0.067)	-	0.744 (0.034)	0.728 (0.052)	-
	GP_SVM	0.611 (0.058)	0.734 (0.050)	0.964 (0.102)	0.624 (0.069)	0.673 (0.110)	0.658 (1.395)	0.718 (0.043)	0.680 (0.073)	0.883 (0.801)
$\rho_L = 0.2,\ \rho_U = 0.8$	GP_SSVM	0.687 (0.062)	0.717 (0.063)	0.903 (0.122)	0.730 (0.070)	0.764 (0.066)	0.976 (0.052)	0.767 (0.046)	0.784 (0.045)	0.996 (0.015)
	TSVM	0.647 (0.067)	0.671 (0.074)	-	0.705 (0.049)	0.716 (0.049)	-	0.743 (0.049)	0.761 (0.058)	-
	GP_SVM	0.578 (0.063)	0.633 (0.108)	0.648 (0.074)	0.651 (0.073)	0.706 (0.121)	0.667 (0.795)	0.739 (0.041)	0.728 (0.072)	0.958 (0.444)
$\rho_L = 0.8,\ \rho_U = 0.2$	GP_SSVM	0.691 (0.057)	0.705 (0.062)	0.864 (0.129)	0.746 (0.050)	0.744 (0.053)	0.970 (0.064)	0.803 (0.036)	0.810 (0.046)	0.998 (0.011)
	TSVM	0.660 (0.063)	0.677 (0.087)	-	0.721 (0.045)	0.718 (0.065)	-	0.777 (0.037)	0.765 (0.058)	-
	GP_SVM	0.604 (0.089)	0.584 (0.092)	0.667 (0.102)	0.621 (0.070)	0.606 (0.088)	0.625 (0.607)	0.728 (0.065)	0.681 (0.076)	0.908 (0.686)

TPR: true positive rate. "-": no value appears in the source.

Figure 1

Table 3

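Table 4. Coefficient estimation results

Variable name	Variable	$\hat{\beta}$	$\hat{\alpha}$	Variable name	Variable	$\hat{\beta}$	$\hat{\alpha}$
30 days overdue	X1	0.407	0.335	Education	X14.1	0	0
Bad-debt record (yes/no)	X2	0.158	0.218		X14.2	0	0
Loan balance above 8 million yuan (yes/no)	X3	0.123	0.209		X14.3	0	0
Bounced-check record (yes/no)	X4	0.401	0.338	Occupation	X15	0	0
Dishonored-account record (yes/no)	X5	0.155	0.179	Average monthly income	X16	-0.006	-0.007
Forced card suspension record (yes/no)	X6	0.398	0.323	Average monthly expenses	X17	-0.020	-0.018
Number of credit cards held	X7	-0.015	-0.210	Housing status	X18.1	0.004	0
Credit card usage frequency	X8	-0.212	-0.210		X18.2	-0.0004	0
Place of household registration	X9.1	0	0		X18.3	-0.007	0
	X9.2	0	0		X18.4	-0.002	0
	X9.3	0	0		X18.5	0.002	0
Gender	X11	-0.006	0	Average monthly household income	X19	-0.016	-0.016
Age	X12	-0.006	0	Average monthly card spending	X20	-0.034	-0.034
Marital status	X13.1	0	0
	X13.2	0	0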
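In the coefficient table, the dummy-coded levels of a categorical variable (for example X9.1 to X9.3, or X14.1 to X14.3) are zero together, which is the behaviour a group penalty is designed to produce. The sketch below shows the standard group soft-thresholding step (the proximal operator of an unweighted group-lasso penalty) on made-up numbers. It illustrates the mechanism only and is not the authors' fitting algorithm; `group_soft_threshold`, the coefficients, and the threshold are hypothetical.

```python
import numpy as np

def group_soft_threshold(w, groups, thresh):
    """Shrink each group's coefficient block toward zero; drop the block if its norm <= thresh."""
    out = np.zeros_like(w, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        if norm > thresh:
            # Scale the whole block by a common factor, so its members survive or vanish together.
            out[idx] = (1.0 - thresh / norm) * w[idx]
    return out

# Hypothetical coefficients: one strong single variable and one weak three-level dummy group.
w = np.array([0.40, 0.03, -0.02, 0.01])
groups = [[0], [1, 2, 3]]
shrunk = group_soft_threshold(w, groups, thresh=0.05)
print(shrunk)
```

Because the threshold is compared with the norm of the whole block, a categorical variable is kept or removed as a unit instead of one dummy level at a time.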

