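Supervised by: Chinese Academy of Sciences
Sponsored by: Chinese Society of Optimization, Overall Planning and Economic Mathematics
   Institutes of Science and Development, Chinese Academy of Sciences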

Chinese Journal of Management Science, 2025, Vol. 33, Issue 11: 29-40. doi: 10.16381/j.cnki.issn1003-207x.2024.0438

Research on Predicting Corporate Fraud of Listed Companies Based on Multi-Source Text Data and Feature-Augmented Tree Models

Gang Li1, Chaochao Qiu1, Zhipeng Zhang2, Simeng Qin1, Xingnan Xue1

  1. School of Business Administration, Northeastern University, Shenyang 110000, China
  2. Antai College of Economics & Management, Shanghai Jiao Tong University, Shanghai 200030, China
  • Received: 2024-03-28  Revised: 2024-07-05  Online: 2025-11-25  Published: 2025-11-28
  • Contact: Zhipeng Zhang  E-mail: zhangzhipeng@sjtu.edu.cn

Abstract:

Existing research on predicting financial fraud at listed companies relies mainly on financial features, while the use of textual information, and of multi-source textual information in particular, remains underexplored. This paper draws on multi-source text data: the annual reports of listed companies, the work reports of provincial governments, and the monetary policy reports of the central bank. Textual features, including text similarity, text tone, and text readability, are extracted using text mining techniques. These features, together with the financial and other non-text features of listed companies, are fed into a feature-augmented tree model (Augboost) to predict fraud at listed companies. The model's accuracy is compared with that of other prediction models (such as Logistic Regression, Naive Bayes, and Random Forest) in prediction scenarios that combine multi-source textual information. Empirical results based on China's A-share listed companies in the manufacturing industry from 2001 to 2020 indicate that: (1) Multi-source textual features provide incremental information about the financial fraud of listed companies. (2) The incremental information provided varies by text type. Compared with the annual reports of listed companies and the work reports of provincial governments, the monetary policy reports of the central bank offer the most significant incremental information. (3) Compared with common algorithms such as logistic regression, the Augboost model used in this paper predicts the fraud status of listed companies more accurately. The proposed prediction framework, which integrates multi-source text information, extends existing research on financial fraud prediction for listed companies, can improve prediction accuracy, and has practical significance for corporate regulation and decision-making.
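The three families of textual features named in the abstract (similarity, tone, readability) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the abstract does not specify the paper's actual measures, so cosine similarity over word counts, a dictionary-based net tone ratio, and average sentence length stand in as common proxies. The word lists, the function names, and the English tokenizer are hypothetical, and the source reports are Chinese texts that would need a proper word segmenter.

```python
import math
import re
from collections import Counter

# Hypothetical toy word lists; the paper's sentiment dictionary is not given.
POSITIVE = {"growth", "improve", "strong", "stable"}
NEGATIVE = {"loss", "decline", "risk", "weak"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (English only)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tone(text):
    """Net tone: (positive - negative) / (positive + negative) word counts."""
    tokens = tokenize(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def avg_sentence_length(text):
    """Readability proxy: average number of tokens per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(tokenize(text)) / len(sentences) if sentences else 0.0


def text_features(current_report, previous_report):
    """Bundle the three illustrative features for one document."""
    return {
        "similarity_to_previous": cosine_similarity(current_report, previous_report),
        "tone": tone(current_report),
        "avg_sentence_length": avg_sentence_length(current_report),
    }
```

In a pipeline of the kind the abstract describes, a feature dictionary like this would be computed per document source and joined with the financial features before model training; how the paper actually combines them is not stated in this abstract.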

Key words: multi-source text, Augboost model, fraud prediction, listed companies, text analysis
