High-dimensional Active PU Learning with Its Application to Credit Scoring

doi:10.16381/j.cnki.issn1003-207x.2023.0483

Abstract

In classification problems, it is common that only positive and unlabeled samples are available, i.e., PU (positive and unlabeled) data. Most existing methods for modeling PU data require the class prior and a sufficiently large sample to perform well; when the data are high-dimensional with a small sample size, model estimation is often poor. To address this, a high-dimensional active PU learning method is proposed. By adjusting the classical A-optimality criterion, the method not only selects new samples effectively in high-dimensional settings and improves model estimation, but also significantly reduces the time cost of sample selection. In addition, while samples are being selected and labeled, the proposed method estimates the class prior without requiring an initial value, which reduces the bias caused by incorrect prior information. Simulation experiments show that the proposed method outperforms the comparison methods in variable selection, coefficient estimation, and classification prediction. Furthermore, compared with the classical A-optimality criterion, the method substantially reduces selection time. Finally, the proposed model is applied to consumer finance loan credit scoring data. The results indicate that, compared with randomly selecting samples for labeling, actively selecting samples improves predictive performance more, and the improvement is more pronounced when the number of labels is limited. The approach is also more robust in selecting important variables. The model developed in this paper can provide a powerful tool for credit risk analysis of newly launched credit products.

Key words: active learning, PU learning, high-dimensional A-optimality, credit scoring
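
For readers unfamiliar with the classical A-optimality criterion mentioned in the abstract, the following is a minimal sketch of how it can drive sample selection for an ordinary logistic model. It is not the adjusted high-dimensional criterion proposed in the paper, whose details are not given on this page. The function name `a_optimal_pick`, the small `ridge` term added for invertibility, and the use of a plain (non-PU) logistic likelihood are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def a_optimal_pick(X_labeled, X_pool, beta, ridge=1e-3):
    """Pick the pool point that most reduces tr(I(beta)^{-1}).

    I(beta) is the logistic Fisher information sum_i w_i x_i x_i^T with
    w_i = p_i (1 - p_i). The ridge term is an assumption of this sketch,
    added only so the matrix stays invertible.
    """
    p_lab = sigmoid(X_labeled @ beta)
    w_lab = p_lab * (1.0 - p_lab)
    dim = X_labeled.shape[1]
    info = (X_labeled * w_lab[:, None]).T @ X_labeled + ridge * np.eye(dim)
    info_inv = np.linalg.inv(info)

    p_pool = sigmoid(X_pool @ beta)
    w_pool = p_pool * (1.0 - p_pool)

    # Sherman-Morrison: adding w x x^T lowers the trace of the inverse by
    # w * ||A^{-1} x||^2 / (1 + w * x^T A^{-1} x).
    V = X_pool @ info_inv
    num = w_pool * np.sum(V * V, axis=1)
    den = 1.0 + w_pool * np.sum(V * X_pool, axis=1)
    gain = num / den
    return int(np.argmax(gain)), gain

# Illustrative usage on synthetic data (not the paper's data).
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(30, 5))
X_pool = rng.normal(size=(20, 5))
beta0 = rng.normal(size=5)
best, gains = a_optimal_pick(X_lab, X_pool, beta0)
```

Even with the rank-one update, this sketch forms an explicit inverse of a p by p matrix, which costs on the order of p cubed operations per update. That is one plausible reason a classical criterion becomes slow as the dimension grows.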

CLC number: C812

Yongqin Qiu, Kuangnan Fang, Qingzhao Zhang, Lean Yu. High-dimensional Active PU Learning with Its Application to Credit Scoring[J]. Chinese Journal of Management Science, 2025, 33(9): 57-66.

Figure 1

Table 1. Experiment 1: performance of each model after 100 updates

		Variable selection		Prediction accuracy			Parameter estimation accuracy
$p$	Model	TPR	FPR	AUC	AUC adjust	MCR	RMSE
	Random Logistic	0.908 (0.112)	0.020 (0.022)	0.815 (0.020)	0.816 (0.046)	0.517 (0.007)	1.589 (0.052)
	Active Logistic	0.944 (0.095)	0.027 (0.023)	0.825 (0.016)	0.826 (0.042)	0.517 (0.007)	1.587 (0.039)
	Random PU	0.934 (0.114)	0.052 (0.087)	0.824 (0.019)	0.825 (0.047)	0.273 (0.017)	1.190 (0.198)
	Active PU	0.966 (0.076)	0.046 (0.040)	0.835 (0.012)	0.836 (0.012)	0.259 (0.012)	1.172 (0.136)
	High-Active PU	0.964 (0.077)	0.070 (0.056)	0.836 (0.012)	0.838 (0.041)	0.257 (0.013)	1.143 (0.185)
300	Random Logistic	0.964 (0.077)	0.070 (0.061)	0.836 (0.012)	0.838 (0.04)	0.258 (0.014)	1.132 (0.218)
	Active Logistic	0.818 (0.134)	0.004 (0.004)	0.800 (0.024)	0.801 (0.042)	0.518 (0.007)	1.657 (0.051)
	Random PU	0.864 (0.124)	0.005 (0.004)	0.812 (0.021)	0.812 (0.038)	0.518 (0.007)	1.683 (0.032)
	Active PU	0.914 (0.111)	0.015 (0.009)	0.816 (0.017)	0.816 (0.036)	0.283 (0.014)	1.261 (0.13)
	High-Active PU	0.950 (0.092)	0.031 (0.010)	0.823 (0.015)	0.824 (0.038)	0.274 (0.015)	1.229 (0.094)
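
The columns of Table 1 correspond to standard evaluation quantities. As a hedged sketch, the helper functions below show one common way to compute variable-selection TPR and FPR, coefficient RMSE, and the misclassification rate (MCR). The exact definitions used in the paper are not stated on this page, so the function names, the zero tolerance `tol`, the 0.5 threshold, and the choice of a mean (rather than a sum) inside the RMSE are all assumptions.

```python
import numpy as np

def selection_rates(beta_true, beta_hat, tol=1e-8):
    """TPR and FPR of variable selection, treating |coef| > tol as selected."""
    true_nz = np.abs(beta_true) > tol
    est_nz = np.abs(beta_hat) > tol
    tpr = np.sum(true_nz & est_nz) / max(np.sum(true_nz), 1)
    fpr = np.sum(~true_nz & est_nz) / max(np.sum(~true_nz), 1)
    return float(tpr), float(fpr)

def coef_rmse(beta_true, beta_hat):
    """Root mean squared error between estimated and true coefficients."""
    return float(np.sqrt(np.mean((beta_hat - beta_true) ** 2)))

def misclassification_rate(y_true, prob, threshold=0.5):
    """Share of observations whose thresholded prediction differs from the label."""
    pred = (prob >= threshold).astype(int)
    return float(np.mean(pred != y_true))

# Toy inputs chosen for illustration only; they are not taken from the paper.
beta_true = np.array([1.5, 0.0, -2.0, 0.0, 0.0])
beta_hat = np.array([1.2, 0.3, 0.0, 0.0, 0.0])
tpr, fpr = selection_rates(beta_true, beta_hat)
rmse = coef_rmse(beta_true, beta_hat)
mcr = misclassification_rate(np.array([1, 0, 1, 0]),
                             np.array([0.9, 0.6, 0.4, 0.2]))
```

Under these definitions, a higher TPR and a lower FPR mean the fitted model recovers the truly relevant variables while admitting fewer irrelevant ones, and lower MCR and RMSE are better.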

Figure 2

Table 2. Average total time spent selecting samples (200 updates)

$p$	High-Active PU (seconds)	Active PU (seconds)
50	1.703	420.812
300	8.965	41749.217

Figure 3

Figure 4

Table 3


		Number of updates
Initial positive sample size	Model	10	20	30
20	Random Logistic	0.538 (0.041)	0.559 (0.040)	0.579 (0.053)
	Active Logistic	0.556 (0.041)	0.576 (0.030)	0.652 (0.057)
	Random PU	0.584 (0.038)	0.627 (0.063)	0.709 (0.069)
	Active PU	0.720 (0.084)	0.773 (0.063)	0.812 (0.046)
	High-Active PU	0.731 (0.085)	0.790 (0.082)	0.825 (0.041)
40	Random Logistic	0.581 (0.025)	0.656 (0.069)	0.696 (0.075)
	Active Logistic	0.643 (0.044)	0.729 (0.065)	0.790 (0.044)
	Random PU	0.666 (0.052)	0.727 (0.051)	0.775 (0.047)
	Active PU	0.736 (0.039)	0.805 (0.031)	0.815 (0.034)
	High-Active PU	0.747 (0.073)	0.802 (0.057)	0.832 (0.033)
60	Random Logistic	0.702 (0.049)	0.732 (0.053)	0.757 (0.049)
	Active Logistic	0.737 (0.053)	0.758 (0.046)	0.771 (0.049)
	Random PU	0.715 (0.036)	0.758 (0.039)	0.794 (0.037)
	Active PU	0.779 (0.030)	0.810 (0.030)	0.827 (0.027)
	High-Active PU	0.772 (0.045)	0.809 (0.032)	0.842 (0.025)