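Supervised by: Chinese Academy of Sciences
Sponsored by: Chinese Society of Optimization, Overall Planning and Economic Mathematics
   Institutes of Science and Development, Chinese Academy of Sciences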

Chinese Journal of Management Science ›› 2025, Vol. 33 ›› Issue (9): 57-66. doi: 10.16381/j.cnki.issn1003-207x.2023.0483

High-dimensional Active PU Learning with Its Application to Credit Scoring

Yongqin Qiu1, Kuangnan Fang2,3, Qingzhao Zhang2,3, Lean Yu4

  1. School of Management, University of Science and Technology of China, Hefei 230026, China
  2. School of Economics, Xiamen University, Xiamen 361000, China
  3. Research Center of Credit Risk Control and Big Data, Xiamen University, Xiamen 361000, China
  4. Business School, Sichuan University, Chengdu 610065, China
  • Received: 2023-03-24 Revised: 2024-05-01 Online: 2025-09-25 Published: 2025-09-29
  • Contact: Kuangnan Fang E-mail: xmufkn@xmu.edu.cn

Abstract:

In some classification problems, only positive and unlabeled samples are available, a setting known as PU (positive and unlabeled) data. Most existing methods require a known class prior and a sufficiently large sample to perform well, and their estimates are often poor when the data are high-dimensional and the sample is small. To address this, a high-dimensional active PU learning model is proposed. By adjusting the classical A-optimality criterion, the method can effectively select new samples in high-dimensional settings and improve model estimation, while also significantly reducing the time cost of sample selection. In addition, while selecting and labeling samples, the proposed method estimates the class prior parameter without requiring an initial value, reducing the bias caused by errors in prior information. Simulation experiments show that the proposed method outperforms the comparison methods in variable selection, coefficient estimation, and classification prediction, and that it substantially reduces selection time relative to the classical A-optimality criterion. Finally, the proposed model is applied to credit scoring data for consumer finance loans. The results indicate that, compared with randomly selecting samples for labeling, actively selecting samples yields better predictive performance, and the improvement is more pronounced when the number of labels is limited. The approach is also more robust in selecting important variables. The proposed model can serve as a powerful tool for credit risk analysis of newly launched credit products.
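
For readers unfamiliar with the baseline, the following is a minimal sketch of classical A-optimality sample selection for an ordinary logistic regression. It is not the adjusted high-dimensional criterion proposed in the paper, whose form the abstract does not give, and it ignores the PU structure and class prior entirely. The function name `a_optimal_pick`, the `ridge` stabilizer, and the use of a fixed coefficient vector `beta` are assumptions made for illustration only.

```python
import numpy as np

def a_optimal_pick(X_labeled, X_pool, beta, ridge=1e-6):
    """Pick the pool sample whose addition most reduces trace(I^{-1}).

    I is the Fisher information of a logistic regression evaluated at
    the current coefficients `beta`. `ridge` is an illustrative
    stabilizer (an assumption) that keeps I invertible.
    """
    def weights(X):
        # Logistic variance weights p(1 - p) at the current beta.
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        return p * (1.0 - p)

    d = X_labeled.shape[1]
    w_lab = weights(X_labeled)
    info = (X_labeled * w_lab[:, None]).T @ X_labeled + ridge * np.eye(d)
    info_inv = np.linalg.inv(info)

    # Sherman-Morrison: adding w * x x^T lowers trace(I^{-1}) by
    #   w * x^T I^{-2} x / (1 + w * x^T I^{-1} x).
    w_pool = weights(X_pool)
    Z = X_pool @ info_inv  # row j equals (I^{-1} x_j)^T, as I^{-1} is symmetric
    numer = w_pool * np.sum(Z * Z, axis=1)
    denom = 1.0 + w_pool * np.sum(Z * X_pool, axis=1)
    reduction = numer / denom
    return int(np.argmax(reduction)), reduction
```

The sketch needs the inverse of a d × d information matrix, which becomes expensive and ill-conditioned as the dimension d approaches or exceeds the number of labeled samples. That is the regime in which the abstract reports that the classical criterion is slow and an adjusted criterion is used instead.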

Key words: active learning, PU learning, high-dimensional A-optimality, credit scoring

CLC Number: