
Decision Support Systems

Volume 48, Issue 3, February 2010, Pages 480-487

A hybrid approach for efficient ensembles

https://doi.org/10.1016/j.dss.2009.06.007

Abstract

An ensemble of classifiers, i.e., a systematic combination of individual classifiers, often yields better classification performance than any single classifier. However, which classifiers should be chosen in a given situation to construct an optimal ensemble remains an open question. In addition, ensembles are often computationally expensive, since they require the execution of multiple classifiers for a single classification task. To address these problems, we propose a hybrid approach for selecting and combining data mining models into ensembles by integrating Data Envelopment Analysis and stacking. Experimental results show the efficiency and effectiveness of the proposed approach.

Introduction

To reveal interesting patterns hidden within large data sets, diverse mining concepts and techniques have been used [8], [17], [18]. Classification, a common mining technique, is one of the most explored topics within data mining. Significant effort has been devoted to constructing reliable classifiers that can accurately predict the value of a categorical class variable. A classifier is typically built from training data containing known class values and can then predict class values for similar, previously unseen data. A variety of algorithms for constructing such classifiers has been proposed in the literature [13], [20], [26], [28]. While many effective algorithms have been developed, no single algorithm has been shown to be either empirically or theoretically better than the others in all scenarios [4], [30]. From this perspective, determining which data mining algorithm to use can be perplexing. Although some basic rules exist for choosing an appropriate algorithm, in many cases the final decision rests largely on personal preference.
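
As a concrete illustration of this training-and-prediction workflow, the following minimal sketch builds a classifier from labeled data and predicts class values for held-out records; the dataset and the choice of a decision tree are arbitrary assumptions for illustration, not choices made in this paper.

    # Minimal supervised-classification workflow: learn from labeled
    # training data, then predict class values for previously unseen records.
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)          # features and known class values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train, y_train)                  # learn from known class values
    y_pred = clf.predict(X_test)               # predict classes for unseen data
    print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")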

To address this problem, one proposed solution is to use 'ensemble methods', which systematically combine different classifiers. Research has shown that combining a set of simple classifiers can produce better classifications than any single sophisticated classifier [7], [17]. Furthermore, combining sophisticated and unsophisticated classifiers can improve classification dramatically compared to combinations of simple classifiers alone [1], [7], [14], [16], [17], [27]. Among the diverse ensemble classification techniques available, the voting-based methods Bagging (short for Bootstrap Aggregating) and Boosting are used most often. Both finalize classifications through voting, but each derives its component models in a dissimilar way [4], [23]; a basic understanding of these individual techniques is needed to see why combining them with other classification techniques is beneficial. Although Bagging and Boosting have shown significant classification improvements, including all available classifiers in one ensemble would be impractical. As a result, researchers have begun investigating dynamic approaches to constructing ensembles that take into account the characteristics of a given dataset and incorporate the classifiers best suited to it. However, the selection of a classifier as the meta-learner is not clearly defined, and the way these classifiers are combined has remained simplistic.
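
For a concrete sense of both techniques, the sketch below compares Bagging and Boosting on the same data using scikit-learn; the dataset, member counts, and base learners are illustrative assumptions, not the experimental setup of this paper.

    # Voting-based ensembles: Bagging trains each member on a bootstrap
    # sample and aggregates by majority vote; Boosting (AdaBoost) reweights
    # training examples so later members focus on earlier mistakes.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                                random_state=0)
    boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

    for name, model in [("bagging", bagging), ("boosting", boosting)]:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean CV accuracy = {scores.mean():.3f}")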

In this study, we propose a new hybrid approach for constructing an ensemble classifier by integrating Data Envelopment Analysis (DEA) and stacked generalization. DEA is a nonparametric method from operations research and economics, introduced by Charnes, Cooper, and Rhodes, for estimating production frontiers [6]. An experimental study is designed to show the efficiency and effectiveness of the proposed method. The rest of the paper is organized as follows. Section 2 reviews the literature on ensemble data mining; Section 3 introduces our hybrid DEA-based ensemble construction approach; Section 4 presents the framework of the proposed system and its components; Section 5 presents the experiments and results; and Section 6 concludes the work and provides directions for future research.
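
To make the DEA component concrete, the following is a minimal sketch of a standard input-oriented CCR multiplier model solved as a linear program: for each decision-making unit (DMU), find non-negative input and output weights that maximize its efficiency ratio while no DMU's ratio exceeds 1. The input/output numbers are made up, and this is a generic textbook formulation, not necessarily the exact DEA model used in this paper.

    # DEA (CCR, input-oriented multiplier form) via linear programming.
    import numpy as np
    from scipy.optimize import linprog

    # Rows are DMUs; two inputs and one output per DMU (hypothetical data).
    X = np.array([[2.0, 3.0], [4.0, 1.0], [3.0, 4.0], [5.0, 5.0]])  # inputs
    Y = np.array([[1.0], [1.0], [1.0], [1.0]])                      # outputs

    def ccr_efficiency(o):
        n, m = X.shape                 # n DMUs, m inputs
        s = Y.shape[1]                 # s outputs
        # Variables z = [u_1..u_s, v_1..v_m]; maximize u . y_o,
        # i.e., minimize -u . y_o.
        c = np.concatenate([-Y[o], np.zeros(m)])
        # Ratio constraints: u . y_j - v . x_j <= 0 for every DMU j.
        A_ub = np.hstack([Y, -X])
        b_ub = np.zeros(n)
        # Normalization: v . x_o = 1.
        A_eq = np.concatenate([np.zeros(s), X[o]]).reshape(1, -1)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (s + m))
        return -res.fun                # efficiency score in (0, 1]

    for o in range(len(X)):
        print(f"DMU {o}: efficiency = {ccr_efficiency(o):.3f}")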

Section snippets

Related work

The main objective of an ensemble approach is to improve classification accuracy by aggregating the classifications of a diverse set of classifiers. Previous research has shown that an ensemble of classifiers is often more accurate than any of the single classifiers in the ensemble [17]. Two popular ensemble methods are Bagging [4] and Boosting [9], [10], [23]; both employ re-sampling techniques to obtain a different training set for each of the classifiers.
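
The re-sampling idea behind Bagging can be seen in a few lines: each member classifier is trained on a bootstrap sample drawn with replacement from the original data, and the ensemble classifies by majority vote. This is a minimal sketch on an arbitrary dataset, not the configuration used in this paper.

    # Bootstrap re-sampling and majority voting, written out explicitly.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(0)

    members = []
    for _ in range(15):
        idx = rng.integers(0, len(X), size=len(X))   # draw with replacement
        members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    # Majority vote over the members' predictions (on the training set,
    # for brevity; a proper evaluation would use held-out data).
    votes = np.stack([m.predict(X) for m in members])
    y_vote = (votes.mean(axis=0) >= 0.5).astype(int)
    print("vote accuracy:", (y_vote == y).mean().round(3))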

AdaBoost, short for Adaptive Boosting

Design of a two-step procedure

Ensemble methods have proven to be more effective than single classifiers. However, two questions arise when creating an ensemble: (1) which models to choose, and (2) how to combine them? In this research, we propose a two-step procedure to construct efficient and effective ensembles, as described below.
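
As a rough sketch of such a two-step procedure, the code below screens a pool of candidate models and then combines the survivors with stacking. Plain cross-validated accuracy stands in here for the paper's DEA-based selection step, and the candidate pool and dataset are illustrative assumptions.

    # Step 1: screen candidates; Step 2: stack the selected models.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    candidates = {
        "tree": DecisionTreeClassifier(random_state=0),
        "knn": KNeighborsClassifier(),
        "nb": GaussianNB(),
        "rf": RandomForestClassifier(random_state=0),
    }

    # Step 1: evaluate every candidate and keep the top three.
    scores = {name: cross_val_score(m, X, y, cv=5).mean()
              for name, m in candidates.items()}
    selected = sorted(scores, key=scores.get, reverse=True)[:3]

    # Step 2: combine the selected models with a stacking meta-learner.
    stack = StackingClassifier(
        estimators=[(n, candidates[n]) for n in selected],
        final_estimator=LogisticRegression(max_iter=1000), cv=5)
    print("stacked CV accuracy:",
          cross_val_score(stack, X, y, cv=5).mean().round(3))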

A hybrid approach for ensemble classifiers

Most ensemble methods need to evaluate multiple models for one specific task. Although it is impractical to apply every candidate model to the entire target data space, many models can still be evaluated on the training dataset if the evaluation is organized properly. Instead of evaluating models sequentially on a single machine, we propose to assign the evaluation tasks to different machines so they can be performed simultaneously. In our framework, different
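
As a single-machine stand-in for such distributed evaluation, the sketch below runs the candidate evaluations concurrently in separate processes; an actual deployment would dispatch the same tasks to different machines. The models and dataset are again illustrative assumptions.

    # Evaluate candidate models concurrently, one task per worker process.
    from concurrent.futures import ProcessPoolExecutor

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    models = {"tree": DecisionTreeClassifier(), "knn": KNeighborsClassifier(),
              "nb": GaussianNB()}

    def evaluate(name):
        return name, cross_val_score(models[name], X, y, cv=5).mean()

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            for name, score in pool.map(evaluate, models):
                print(f"{name}: {score:.3f}")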

Experiments and results

Extensive experiments were conducted to empirically evaluate the performance of the proposed ensemble construction approach. In this section, we describe the design of our experiments and subsequent results.

Conclusion

In this study, we combined the strengths of Data Envelopment Analysis and the stacking method to develop a new framework for constructing efficient and effective ensembles. Wrapped in a distributed framework, DEA can be used more efficiently in the model selection process. In addition, stacking allows for an effective combination of different models. The proposed approach appears to be the best ensemble-building and combination approach, as indicated by its outperforming all other benchmarking

Acknowledgment

This research is supported in part by a fund from the Information Infrastructure Institute (iCube) and a research fund from College of Business at Iowa State University. I would like to thank Mr. X. Yang and Mr. Prashant Singh for their technical assistance. I would also like to thank the editors and three anonymous reviewers. Additionally, I would like to acknowledge the help from Mr. James Hall, Mr. Abhijit Rao and Mr. Kurt Roots for their careful proofreading of the paper.

References (32)

  • L. Breiman

    Random forests–random features

    Machine Learning

    (2001)
  • W. Cooper et al.

    Data envelopment analysis: A comprehensive text with models, applications, references and DEA-solver software

    (2002)
  • T. Dietterich

    An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization

    Machine Learning

    (2000)
  • U. Fayyad et al.

    Advances in knowledge discovery and data mining

    (1997)
  • Y. Freund

    Boosting a weak learning algorithm by majority

    Information and Computation

    (1996)
  • Y. Freund et al.

    Experiments with a new boosting algorithm

    Proceedings of the 13th International Conference on Machine Learning, Bari, Italy

    (1996)
Cited by (38)

    • Cost-sensitive multiple-instance learning method with dynamic transactional data for personal credit scoring

      2020, Expert Systems with Applications
      Citation excerpt:

      Cost-sensitive learning methods could give a higher cost to the minor class's misclassification. Therefore, the classifier will pay more attention to the minor class (Bansal, Sinha, & Zhao, 2009; Lee & Zhu, 2011; Zhu, 2010). First, some researchers handle cost-sensitive problems by preprocessing the training data, which we call the resampling method.

    • A spectral clustering based ensemble pruning approach

      2014, Neurocomputing
      Citation excerpt:

      Many other dynamic approaches have also been proposed [24,37–39]. Zhu [40] integrated data envelopment analysis and stacking, and described a hybrid approach to classifier selection. Bakker and Heskes [41] proposed a clustering method for ensemble classifier extraction, in which a small collection of representative entities is used to represent a large entity collection.


Dan Zhu obtained her Ph.D. degree in Management Science and Information Systems from Carnegie Mellon University. Her current research interests are in business intelligence and decision support systems. Dr. Zhu's research has been published in the Proceedings of the National Academy of Sciences, Information Systems Research, Decision Sciences, Naval Research Logistics, Annals of Statistics, Annals of Operations Research, Journal of Databases, Journal of Information and Software Technology, International Journal of Knowledge Management, Omega, etc. Her work has been funded by the National Science Foundation. She teaches Business Intelligence, Databases, System Analysis and Design, and Advanced Software Development courses.
