Efficient multisplitting on numerical data

Tapio Elomaa, Juho Rousu

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

Numerical data poses a problem for symbolic learning methods, since numerical value ranges inherently need to be partitioned into intervals for representation and handling. An evaluation function is used to approximate the goodness of different partition candidates. Most existing methods for multisplitting on numerical attributes are based on heuristics, because of their apparent efficiency advantages. We characterize a class of well-behaved cumulative evaluation functions for which efficient discovery of the optimal multisplit is possible by dynamic programming. A single pass through the data suffices to evaluate multisplits of all arities. This class contains many important attribute evaluation functions familiar from symbolic machine learning research. Our empirical experiments show that there are no significant differences in efficiency between the method that produces optimal partitions and those that are based on heuristics. Moreover, we demonstrate that optimal multisplitting can be beneficial in decision tree learning, in contrast to the widely used binarization of numerical attributes or heuristic multisplitting.
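The dynamic-programming idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a generic textbook formulation, assuming the evaluation function is cumulative in the sense that a partition's cost is the sum of its intervals' costs, and using training-set error as that cost. The function name `optimal_multisplits` and its parameters are hypothetical, and the sketch runs in time quadratic in the number of distinct values rather than reproducing the efficiency results reported in the paper.

```python
from collections import Counter

def optimal_multisplits(values, labels, max_arity):
    """Return {arity: (cost, thresholds)} for arity = 1..max_arity.

    Cost is training-set error summed over intervals (an assumed
    cumulative evaluation function, chosen only for illustration).
    """
    # Sort by value and group examples that share a value, so a cut
    # never separates equal values.
    bins = []  # list of (value, [labels of examples with that value])
    for v, y in sorted(zip(values, labels)):
        if bins and bins[-1][0] == v:
            bins[-1][1].append(y)
        else:
            bins.append((v, [y]))
    n = len(bins)
    max_arity = min(max_arity, n)

    # imp[j][i]: error of the single interval covering bins j..i-1.
    imp = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(n):
        counts, size = Counter(), 0
        for i in range(j + 1, n + 1):
            counts.update(bins[i - 1][1])
            size += len(bins[i - 1][1])
            imp[j][i] = size - max(counts.values())

    # cost[k][i]: best cost of splitting the first i bins into k intervals.
    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(max_arity + 1)]
    back = [[0] * (n + 1) for _ in range(max_arity + 1)]
    cost[0][0] = 0
    for k in range(1, max_arity + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                c = cost[k - 1][j] + imp[j][i]
                if c < cost[k][i]:
                    cost[k][i], back[k][i] = c, j

    # One table fill yields the optimum for every arity up to max_arity.
    result = {}
    for k in range(1, max_arity + 1):
        cuts, i = [], n
        for kk in range(k, 1, -1):
            j = back[kk][i]
            # Threshold midway between adjacent distinct values.
            cuts.append((bins[j - 1][0] + bins[j][0]) / 2)
            i = j
        result[k] = (cost[k][n], sorted(cuts))
    return result
```

Because the table is filled arity by arity, the optimal partitions for all arities up to `max_arity` fall out of the same computation, which is the property the abstract highlights for cumulative evaluation functions.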

Original language: English
Title of host publication: Principles of Data Mining and Knowledge Discovery
Publisher: Springer
Pages: 178-188
Number of pages: 11
Volume: 1263
ISBN (Electronic): 978-3-540-69236-2
ISBN (Print): 978-3-540-63223-8
DOIs: https://doi.org/10.1007/3-540-63223-9_117
Publication status: Published - 1997
MoE publication type: A4 Article in a conference publication
Event: First European symposium on principles of data mining and knowledge discovery (PKDD '97) - Trondheim, Norway
Duration: 24 Jun 1997 to 27 Jun 1997

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 1263
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: First European symposium on principles of data mining and knowledge discovery (PKDD '97)
Country: Norway
City: Trondheim
Period: 24/06/97 to 27/06/97

Fingerprint

Function evaluation
Decision trees
Dynamic programming
Learning systems
Numerical methods
Experiments

Cite this

Elomaa, T., & Rousu, J. (1997). Efficient multisplitting on numerical data. In Principles of Data Mining and Knowledge Discovery (Vol. 1263, pp. 178-188). Springer. Lecture Notes in Computer Science, Vol. 1263. https://doi.org/10.1007/3-540-63223-9_117
Elomaa, Tapio ; Rousu, Juho. / Efficient multisplitting on numerical data. Principles of Data Mining and Knowledge Discovery. Vol. 1263 Springer, 1997. pp. 178 - 188 (Lecture Notes in Computer Science, Vol. 1263).
@inproceedings{c2bdc8ed3ad348af8b8bdebdadcc94de,
title = "Efficient multisplitting on numerical data",
author = "Tapio Elomaa and Juho Rousu",
year = "1997",
doi = "10.1007/3-540-63223-9_117",
language = "English",
isbn = "978-3-540-63223-8",
volume = "1263",
series = "Lecture Notes in Computer Science",
publisher = "Springer",
pages = "178 -- 188",
booktitle = "Principles of Data Mining and Knowledge Discovery",
address = "Germany",

}

Elomaa, T & Rousu, J 1997, Efficient multisplitting on numerical data. in Principles of Data Mining and Knowledge Discovery. vol. 1263, Springer, Lecture Notes in Computer Science, vol. 1263, pp. 178 - 188, First European symposium on principles of data mining and knowledge discovery (PKDD '97), Trondheim, Norway, 24/06/97. https://doi.org/10.1007/3-540-63223-9_117

Efficient multisplitting on numerical data. / Elomaa, Tapio; Rousu, Juho.

Principles of Data Mining and Knowledge Discovery. Vol. 1263 Springer, 1997. p. 178 - 188 (Lecture Notes in Computer Science, Vol. 1263).

TY - GEN

T1 - Efficient multisplitting on numerical data

AU - Elomaa, Tapio

AU - Rousu, Juho

PY - 1997

Y1 - 1997

U2 - 10.1007/3-540-63223-9_117

DO - 10.1007/3-540-63223-9_117

M3 - Conference article in proceedings

SN - 978-3-540-63223-8

VL - 1263

T3 - Lecture Notes in Computer Science

SP - 178

EP - 188

BT - Principles of Data Mining and Knowledge Discovery

PB - Springer

ER -

Elomaa T, Rousu J. Efficient multisplitting on numerical data. In Principles of Data Mining and Knowledge Discovery. Vol. 1263. Springer. 1997. p. 178 - 188. (Lecture Notes in Computer Science, Vol. 1263). https://doi.org/10.1007/3-540-63223-9_117