Accurate and Interpretable Regression Trees using Oracle Coaching
2014 (English). Conference paper, Published paper (Refereed)
Abstract [en]
In many real-world scenarios, predictive models need to be interpretable, thus ruling out many machine learning techniques known to produce very accurate models, e.g., neural networks, support vector machines and all ensemble schemes. Most often, tree models or rule sets are used instead, typically resulting in significantly lower predictive performance. The overall purpose of oracle coaching is to reduce this accuracy vs. comprehensibility trade-off by producing interpretable models optimized for the specific production set at hand. The method requires production set inputs to be present when generating the predictive model, a demand fulfilled in most, but not all, predictive modeling scenarios. In oracle coaching, a highly accurate, but opaque, model is first induced from the training data. This model (“the oracle”) is then used to label both the training instances and the production instances. Finally, interpretable models are trained using different combinations of the resulting data sets. In this paper, oracle coaching produces regression trees, using neural networks and random forests as oracles. The experiments, using 32 publicly available data sets, show that oracle coaching leads to significantly improved predictive performance compared to standard induction. In addition, it is shown that a highly accurate opaque model can be successfully used as a preprocessing step to reduce the noise typically present in data, even in situations where production inputs are not available. In fact, just augmenting or replacing training data with another copy of the training set, but with the predictions from the opaque model as targets, produced significantly more accurate and/or more compact regression trees.
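The procedure described in the abstract can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the oracle here is a small bagged ensemble of deeper trees standing in for the neural networks and random forests used in the paper, the particular data set combinations are assumptions about what "different combinations" might include, and all function names and the synthetic data are hypothetical.

```python
import math
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(X, y, max_depth, min_leaf=2):
    """Greedy regression tree. A leaf is a float; a split is
    (feature index, threshold, left subtree, right subtree)."""
    if max_depth == 0 or len(y) < 2 * min_leaf:
        return sum(y) / len(y)
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[1:]:
            left = [i for i, row in enumerate(X) if row[j] < t]
            right = [i for i, row in enumerate(X) if row[j] >= t]
            if min(len(left), len(right)) < min_leaf:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return sum(y) / len(y)
    _, j, t, left, right = best
    return (j, t,
            fit_tree([X[i] for i in left], [y[i] for i in left], max_depth - 1, min_leaf),
            fit_tree([X[i] for i in right], [y[i] for i in right], max_depth - 1, min_leaf))


def predict(tree, x):
    """Follow splits until a leaf value is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] < t else right
    return tree


def fit_oracle(X, y, n_trees=15, seed=0):
    """Opaque stand-in oracle (assumption): average of deep trees fit on bootstrap samples."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        trees.append(fit_tree([X[i] for i in idx], [y[i] for i in idx], max_depth=6))
    return lambda x: sum(predict(t, x) for t in trees) / len(trees)


def oracle_coached_trees(X_train, y_train, X_prod, max_depth=3):
    """Label training and production inputs with the oracle, then fit one
    shallow tree per (assumed) combination of the resulting data sets."""
    oracle = fit_oracle(X_train, y_train)
    y_train_o = [oracle(x) for x in X_train]
    y_prod_o = [oracle(x) for x in X_prod]
    setups = {
        "standard induction": (X_train, y_train),
        "train, oracle targets": (X_train, y_train_o),
        "train, true + oracle targets": (X_train + X_train, y_train + y_train_o),
        "train + production, oracle targets": (X_train + X_prod, y_train_o + y_prod_o),
        "production, oracle targets": (X_prod, y_prod_o),
    }
    return {name: fit_tree(X, y, max_depth) for name, (X, y) in setups.items()}


def make_data(n, rng):
    """Hypothetical noisy synthetic regression data with two inputs."""
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    y = [math.sin(3 * a) + 0.5 * b + rng.gauss(0, 0.3) for a, b in X]
    return X, y


rng = random.Random(1)
X_train, y_train = make_data(60, rng)
X_prod, y_prod = make_data(30, rng)  # y_prod is used only for evaluation
trees = oracle_coached_trees(X_train, y_train, X_prod)
for name, tree in trees.items():
    err = math.sqrt(sum((predict(tree, x) - t) ** 2
                        for x, t in zip(X_prod, y_prod)) / len(y_prod))
    print(f"{name}: production RMSE {err:.3f}")
```

Only the production inputs are passed to the coaching step; the true production targets are held back for evaluation, matching the requirement that production set inputs be available when the model is generated. The first three setups need no production data at all, which corresponds to the augmenting and replacing variants mentioned at the end of the abstract.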
Place, publisher, year, edition, pages
IEEE, 2014.
Keywords [en]
Oracle coaching, Regression trees, Predictive modeling, Interpretable models, Machine learning, Data mining
National Category
Computer Sciences Computer and Information Sciences
Identifiers
URN: urn:nbn:se:hb:diva-7319
Local ID: 2320/14712
ISBN: 978-1-4799-4518-4/14 (print)
OAI: oai:DiVA.org:hb-7319
DiVA, id: diva2:888032
Conference
5th IEEE Symposium on Computational Intelligence and Data Mining, 9-12 December, Orlando, FL, USA
Note
Sponsorship:
This work was supported by the Swedish Foundation for Strategic
Research through the project High-Performance Data Mining for Drug Effect
Detection (IIS11-0053), the Swedish Retail and Wholesale Development
Council through the project Innovative Business Intelligence Tools (2013:5)
and the Knowledge Foundation through the project Big Data Analytics by
Online Ensemble Learning (20120192).
2015-12-22, 2015-12-22, 2018-01-10