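Supervised by: Chinese Academy of Sciences
Sponsored by: Chinese Society of Optimization, Overall Planning and Economical Mathematics
   Institutes of Science and Development, Chinese Academy of Sciences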

Chinese Journal of Management Science (中国管理科学), 2025, Vol. 33, Issue 9: 57-66. doi: 10.16381/j.cnki.issn1003-207x.2023.0483


高维主动PU学习及其在信用评分中的应用

邱涌钦1, 方匡南2,3, 张庆昭2,3, 余乐安4

    1.中国科学技术大学管理学院,安徽 合肥 230026
    2.厦门大学经济学院,福建 厦门 361000
    3.厦门大学信用大数据与智能风控研究中心,福建 厦门 361000
    4.四川大学商学院,四川 成都 610065
  • 收稿日期:2023-03-24 修回日期:2024-05-01 出版日期:2025-09-25 发布日期:2025-09-29
  • 通讯作者: 方匡南 E-mail:xmufkn@xmu.edu.cn
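  • Funding:
    National Natural Science Foundation of China (72071169); National Natural Science Foundation of China (72233002); Fundamental Research Funds for the Central Universities (20720231060)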

High-dimensional Active PU Learning with Its Application to Credit Scoring

Yongqin Qiu1, Kuangnan Fang2,3, Qingzhao Zhang2,3, Lean Yu4

    1.School of Management,University of Science and Technology of China,Hefei 230026,China
    2.School of Economics,Xiamen University,Xiamen 361000,China
    3.Research Center of Credit Risk Control and Big Data,Xiamen University,Xiamen 361000,China
    4.Business School,Sichuan University,Chengdu 610065,China
  • Received:2023-03-24 Revised:2024-05-01 Online:2025-09-25 Published:2025-09-29
  • Contact: Kuangnan Fang E-mail:xmufkn@xmu.edu.cn

摘要:

在分类问题中,常常会遇到只能获得正标签和无标签样本的情况,即PU(positive and unlabeled)数据。针对此类PU数据建模,现有的研究大多需要类别先验(class prior),并在样本量充足的情况下才能取得较好的效果,当数据呈现“高维小样本”特点时,模型估计效果往往不佳。基于此,本文提出了高维主动PU学习方法,通过对经典的A-optimality准则进行调整,不仅能够在高维情况下有效挑选新样本,提升模型估计效果,同时,显著减少了样本挑选的时间成本。此外,在挑选样本并标记的过程中,本文提出的方法无需初值即可对类别先验进行参数估计,减少先验信息错误带来的偏差。通过模拟实验发现,本文所提出的方法在变量选择、系数估计和分类预测上的效果均优于对比方法。最后,将本文提出的模型应用到实际的消费金融贷信用评分数据中,实证结果表明,利用本文提出的方法可以显著提高模型的预测效果。

关键词: 主动学习, PU学习, 高维A-optimality, 信用评分

Abstract:

In classification problems, it is common that only positive and unlabeled samples are available, i.e., PU (positive and unlabeled) data. Most existing methods for PU data require a known class prior and a sufficiently large sample to perform well, and their estimates are often poor when the data are high-dimensional with a small sample size. To address this, a high-dimensional active PU learning method is proposed. By adjusting the classical A-optimality criterion, it not only selects new samples effectively in high-dimensional settings and improves model estimation, but also significantly reduces the time cost of sample selection. In addition, while samples are being selected and labeled, the proposed method estimates the class prior without requiring an initial value, which reduces the bias caused by errors in prior information. Simulation experiments show that the proposed method outperforms the comparison methods in variable selection, coefficient estimation, and classification prediction, and that it substantially reduces selection time relative to the classical A-optimality criterion. Finally, the proposed method is applied to consumer finance loan credit scoring data. The results indicate that, compared with randomly selecting samples for labeling, actively selecting samples improves predictive performance more, and the improvement is most pronounced when the number of labels is limited. The approach is also more robust in selecting important variables. The method can serve as a useful tool for credit risk analysis of newly launched credit products.

Key words: active learning, PU learning, high-dimensional A-optimality, credit scoring
