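In recent years, customer targeting modeling has become a research focus in customer relationship management. To address the highly imbalanced class distribution of the training samples used for customer targeting modeling, this paper first proposes a hybrid sampling method. It then introduces the group method of data handling (GMDH) neural network into customer feature selection and proposes a new feature selection algorithm, Log-GMDH. The algorithm improves the traditional GMDH network model in two respects: the choice of transfer function and the construction of a new external criterion. Finally, the proposed hybrid sampling, Log-GMDH, and the Logistic regression classification algorithm are combined to build the customer targeting model LogGMDH-Logistic. An empirical analysis on the customer targeting dataset of a car insurance company from the CoIL2000 prediction competition shows that the LogGMDH-Logistic model not only outperforms some existing customer targeting models but also offers good interpretability.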
In recent years, database marketing has become a hot topic in customer relationship management (CRM), and customer targeting modeling is one of its most important problems. Customer targeting modeling is essentially a binary classification problem: all customers are divided into those who respond to the company's marketing activities and those who do not. This study combines group method of data handling (GMDH) neural networks, a resampling technique, and the Logistic regression classification algorithm to construct the customer targeting model LogGMDH-Logistic. The model consists of three phases. (1) To address the highly imbalanced class distribution of the training set used for customer targeting modeling, a new resampling method (hybrid sampling) is proposed to balance the class distribution. (2) To select key features from the large number of characteristics describing the customers, the GMDH neural network is introduced and a new feature selection algorithm, Log-GMDH, is presented. It improves the traditional GMDH neural network model in both the choice of transfer function and the construction of a new external criterion: the non-linear Logistic regression function replaces the linear transfer function of the traditional GMDH neural network, and the hit rate, which suits customer targeting modeling, replaces the regularization criterion of the traditional GMDH neural network. (3) The training set is mapped onto the selected feature subset, the Logistic regression classification algorithm is trained on it, and the response probability of potential customers is predicted.
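Two building blocks of the phases above, class balancing and a hit-rate score, can be illustrated with a short sketch. The abstract does not give the exact hybrid sampling procedure or the hit-rate cut-off, so what follows is an assumption rather than the authors' method: the minority class is oversampled with replacement, the majority class is undersampled, and the hit rate is taken as the share of actual responders among the top-scored fraction of customers. The names `hybrid_sample`, `hit_rate`, `target_per_class`, and `top_fraction` are hypothetical.

```python
import numpy as np


def hybrid_sample(X, y, rng, target_per_class=None):
    """Balance a binary training set (labels 0/1).

    Assumed scheme, not taken from the paper: both classes are resampled
    to the same size, which undersamples the larger class and oversamples
    the smaller one with replacement.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    idx_pos = np.flatnonzero(y == 1)  # responders
    idx_neg = np.flatnonzero(y == 0)  # non-responders
    if len(idx_pos) == 0 or len(idx_neg) == 0:
        raise ValueError("both classes must be present in y")
    if target_per_class is None:
        # Hypothetical default: keep the overall sample size roughly unchanged.
        target_per_class = (len(idx_pos) + len(idx_neg)) // 2

    def draw(idx):
        # Without replacement when the class is large enough (undersampling),
        # with replacement otherwise (oversampling).
        replace = len(idx) < target_per_class
        return rng.choice(idx, size=target_per_class, replace=replace)

    chosen = np.concatenate([draw(idx_pos), draw(idx_neg)])
    rng.shuffle(chosen)
    return X[chosen], y[chosen]


def hit_rate(y_true, scores, top_fraction=0.2):
    """Share of actual responders among the top-scored customers.

    top_fraction is a hypothetical cut-off; the abstract does not state one.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(round(top_fraction * len(scores))))
    top = np.argsort(-scores, kind="stable")[:k]  # highest scores first
    return float(y_true[top].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.arange(20).reshape(10, 2)
    y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
    X_bal, y_bal = hybrid_sample(X, y, rng)
    print(y_bal.sum(), len(y_bal))  # responders and total after balancing
    print(hit_rate([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.1, 0.2], top_fraction=0.4))
```

In a Log-GMDH-style selection loop, a score such as `hit_rate` would be evaluated on held-out data to rank candidate Logistic units. That loop is omitted here because the abstract does not specify its details.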
The experiment is carried out on the customer targeting dataset of a car insurance company from the CoIL2000 prediction competition, and the results show that the LogGMDH-Logistic model is superior to some existing customer targeting models in both performance and interpretability. CRM involves many customer classification problems similar to customer targeting modeling, such as customer churn prediction and customer credit scoring. The model proposed in this study can therefore also be applied to these problems and is expected to achieve satisfactory classification performance.