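In many civil-law countries, new types of legal relationships keep emerging, and the inability of statute law to be enacted and amended in time has gradually become apparent. At the same time, the number of lawsuits is growing rapidly around the world, so many countries face the problem of how to improve the trial efficiency of the judicial system while maintaining trial quality. Therefore, alongside institutional reform, building a decision support system can effectively assist judicial decisions. Taking Chinese medical damage litigation documents as an example, this paper uses text mining and automatic classification to propose a court judgment decision support system (CJ-DSS), which predicts the outcome of a new litigation document from past cases: rejected or not rejected. Based on real cases, the study finds that combined feature extraction does improve classifier performance, and that for the three classifiers considered, support vector machine (SVM), artificial neural network (ANN) and K-nearest neighbor (KNN), the improvement from the combined document frequency and chi-square (DF-CHI) feature extraction method differs, with ANN improving the most. In addition, after ensemble learning the system's classification performance is more stable and significantly better than that of any single classifier, with an F1 score of 93.3%.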
In many countries with a continental (civil-law) legal system, new legal relationships keep emerging, and a weakness of statute law has gradually become obvious: it cannot be formulated and amended in time. As the number of lawsuits grows rapidly, many countries face the problem of how to improve the efficiency of the judicial system while guaranteeing the quality of trials. Therefore, in addition to reforming the system, building a decision support system can effectively assist judicial decisions.
In this paper, medical damage judgment documents in China are taken as an example, and a court judgment decision support system (CJ-DSS) is proposed based on text mining and automatic classification. The system predicts the trial outcome of a new lawsuit text from the verdicts of previous cases: rejected or not rejected. Different feature extraction methods (DF, chi-square, and the combined DF-CHI method) are paired with different classifiers (SVM, ANN and KNN), and the combinations that meet the expected performance are selected as base learners. Following the idea of the Delphi method, integrated (ensemble) learning is then used to predict new cases. Integrated learning here means constructing a new model that, after proper training, takes the predictions of the qualifying base learners as input and outputs the prediction with maximum probability through linear or non-linear calculations.
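The integrated learning step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes three base learners that each output a probability for the "rejected" class, and it uses a logistic-regression meta-learner trained with plain gradient descent as one possible "linear calculation". The function names, the probability values, the learning rate and the number of epochs are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_learner(base_probs, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-learner on base-learner outputs.

    base_probs: list of rows; each row holds the probability of the
                "rejected" class predicted by each base learner.
    labels:     list of true outcomes (1 = rejected, 0 = not rejected).
    """
    n_features = len(base_probs[0])
    n_samples = len(labels)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(base_probs, labels):
            p = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
            err = p - y  # gradient of the log-loss w.r.t. the linear score
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        # Batch gradient-descent update using the mean gradient.
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(weights, bias, row):
    """Return the more probable class label and the probability of 'rejected'."""
    p = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
    return ("rejected" if p >= 0.5 else "not rejected"), p


# Hypothetical outputs of three base learners on six training cases.
base_probs = [
    [0.9, 0.8, 0.7],
    [0.8, 0.9, 0.6],
    [0.2, 0.1, 0.4],
    [0.3, 0.2, 0.1],
    [0.7, 0.6, 0.9],
    [0.1, 0.3, 0.2],
]
labels = [1, 1, 0, 0, 1, 0]

weights, bias = train_meta_learner(base_probs, labels)
new_case_label, new_case_prob = predict(weights, bias, [0.85, 0.9, 0.8])
print(new_case_label, round(new_case_prob, 3))
```

In practice the meta-learner would be trained on held-out predictions of the base learners rather than on the data they were fitted to, so that it does not simply learn to trust overfitted outputs.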
Experiments on real cases show that the combined feature extraction method does improve the performance of the SVM, ANN and KNN classifiers. In addition, the system's classification performance became more stable after integrated learning. The best performance (F1 score) reached 93.3%, a significant improvement over the single classifiers.
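As a rough illustration of combined feature extraction, the sketch below first filters terms by document frequency (DF) and then ranks the survivors by the chi-square statistic against the binary verdict label. How the paper actually combines DF and chi-square is not stated here, so this two-stage scheme is an assumption, as are the function names, the thresholds and the toy tokenized documents.

```python
from collections import Counter


def chi_square(term, docs, labels):
    """Chi-square statistic of a term against a binary label (2x2 table).

    a: class-1 docs containing the term    b: class-0 docs containing it
    c: class-1 docs without the term       d: class-0 docs without it
    """
    a = b = c = d = 0
    for tokens, y in zip(docs, labels):
        present = term in tokens
        if present and y == 1:
            a += 1
        elif present:
            b += 1
        elif y == 1:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def df_chi_select(docs, labels, min_df=2, top_k=3):
    """Assumed DF-CHI scheme: drop rare terms by DF, then keep the top-k by chi-square."""
    doc_sets = [set(tokens) for tokens in docs]
    df = Counter(term for s in doc_sets for term in s)
    candidates = [term for term, count in df.items() if count >= min_df]
    # Sort by descending chi-square; break ties alphabetically for determinism.
    ranked = sorted(candidates, key=lambda t: (-chi_square(t, doc_sets, labels), t))
    return ranked[:top_k]


# Hypothetical tokenized judgment documents (1 = rejected, 0 = not rejected).
docs = [
    ["hospital", "negligence", "insufficient", "evidence"],
    ["plaintiff", "insufficient", "evidence", "hospital"],
    ["hospital", "negligence", "compensation", "appraisal"],
    ["appraisal", "compensation", "hospital", "fault"],
    ["insufficient", "evidence", "hospital", "appraisal"],
    ["compensation", "fault", "hospital", "negligence"],
]
labels = [1, 1, 0, 0, 1, 0]

selected = df_chi_select(docs, labels)
print(selected)  # ['compensation', 'evidence', 'insufficient']
```

Here "plaintiff" is removed by the DF filter because it appears in only one document, and "hospital" scores zero on chi-square because it appears in every document and so carries no information about the verdict.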
The data for this paper come from the "BeiDaFaBao" legal database. Using "medical malpractice" as the keyword, more than 300 court verdicts and mediation documents from 2013 were retrieved. Because mediation documents are short and describe the case only briefly, they were excluded from the study. The remaining documents were preprocessed and then used for training and testing.
Previous studies have found that the accuracy of a text classification system is strongly influenced by the size of the training set: the larger the training set, the better the performance. This paper therefore offers a reference for building structured, high-performance systems from small training samples in the future. Moreover, because labelling documents is costly, research and model construction for unlabeled text should be a focus of future work for data scientists.