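Supervised by: Chinese Academy of Sciences
Sponsored by: Chinese Society of Optimization, Overall Planning and Economic Mathematics
   Institutes of Science and Development, Chinese Academy of Sciences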

Chinese Journal of Management Science, 2025, Vol. 33, Issue (11): 29-40. doi: 10.16381/j.cnki.issn1003-207x.2024.0438


Research on Predicting Corporate Fraud of Listed Companies Based on Multi-Source Text Data and Feature-Augmented Tree Models

Gang Li1, Chaochao Qiu1, Zhipeng Zhang2, Simeng Qin1, Xingnan Xue1

  1. School of Business Administration, Northeastern University, Shenyang 110000, China
  2. Antai College of Economics & Management, Shanghai Jiao Tong University, Shanghai 200030, China
  • Received: 2024-03-28 Revised: 2024-07-05 Online: 2025-11-25 Published: 2025-11-28
  • Contact: Zhipeng Zhang E-mail: zhangzhipeng@sjtu.edu.cn
  • Funding:
    National Natural Science Foundation of China, Young Scientists Fund (72401189); National Natural Science Foundation of China, General Program (71971051); National Natural Science Foundation of China, General Program (72371067); National Social Science Fund of China, Major Program (23ZDA039); Shanghai Science and Technology Commission Rising-Star Program, Sailing Special Project (23YF1415000); Hebei Provincial Department of Education Innovation Capability Training Program for Graduate Students (CXZZSS2025155)

Abstract:

Existing research on predicting financial fraud at listed companies relies mainly on financial features, while the use of textual information, and of multi-source textual information in particular, remains underexplored. This paper draws on multi-source text data, namely the annual reports of listed companies, provincial government work reports, and the central bank's monetary policy reports. Textual features, including text similarity, text tone, and text readability, are extracted with text mining techniques. These features, together with financial and other non-textual features of listed companies, are fed into a feature-augmented tree model (Augboost) to predict fraud at listed companies. The model's accuracy is evaluated in prediction scenarios that combine multi-source textual information, alongside other prediction models (such as logistic regression, naive Bayes, and random forest). Empirical results for China's A-share listed companies in the manufacturing industry from 2001 to 2020 indicate that: (1) Multi-source textual features provide incremental information on financial fraud at listed companies. (2) The incremental information varies by text type: compared with the annual reports of listed companies and provincial government work reports, the central bank's monetary policy reports offer the most significant incremental information. (3) Compared with common algorithms such as logistic regression, the Augboost model used in this paper predicts more accurately whether a listed company has committed fraud. The proposed prediction framework, which integrates multi-source textual information, extends existing research on financial fraud prediction for listed companies, can improve prediction accuracy, and has practical significance for corporate regulation and decision-making.

Key words: multi-source text, Augboost model, fraud prediction, listed companies, text analysis
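
The three kinds of textual indicators named in the abstract (text similarity, text tone, and text readability) can be illustrated with a minimal Python sketch. The abstract does not state how each indicator is computed, so everything below is an assumption for illustration: cosine similarity over term counts, a net-tone ratio over two tiny hypothetical word lists, and average sentence length as a readability proxy. The tokenizer handles English text only, whereas the paper's source documents are Chinese, and all function and key names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny tone lexicons for illustration only (not from the paper).
POSITIVE_WORDS = {"growth", "improve", "strong", "stable"}
NEGATIVE_WORDS = {"loss", "risk", "decline", "weak"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def tone(text):
    """Net tone: (positive - negative) / (positive + negative)."""
    tokens = tokenize(text)
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def avg_sentence_length(text):
    """Readability proxy: mean number of word tokens per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def text_features(report, previous_report):
    """Collect the three illustrative indicators for one firm-year."""
    return {
        "similarity_to_previous": cosine_similarity(report, previous_report),
        "tone": tone(report),
        "avg_sentence_length": avg_sentence_length(report),
    }


# Example with two made-up report snippets.
this_year = "Revenue growth was strong. Risk remains."
last_year = "Revenue growth was stable. Risk remains."
features = text_features(this_year, last_year)
```

In a pipeline like the one the abstract describes, a dictionary of this kind would be computed per text source and joined with financial and other non-textual features before being passed to the classifier.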

CLC number: