Postponing the evaluation of attributes with a high number of boundary points

Tapio Elomaa, Juho Rousu

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

The efficiency of the otherwise expedient decision tree learning can be impaired in processing data-mining-sized data if superlinear-time processing is required in attribute selection. An example of such a technique is optimal multisplitting of numerical attributes. Its efficiency is hit hard even by a single troublesome attribute in the domain. Analysis shows that there is a direct connection between the ratio of the numbers of boundary points and training examples and the maximum goodness score of a numerical attribute. Class distribution information from preprocessing can be applied to obtain tighter bounds for an attribute's relevance in class prediction. These analytical bounds, however, are too loose for practical purposes. We experiment with heuristic methods which postpone the evaluation of attributes that have a high number of boundary points. The results show that substantial time savings can be obtained in the most critical data sets without having to give up on the accuracy of the resulting classifier.
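As an illustration of the quantity the abstract refers to, the following Python sketch counts boundary points of a numerical attribute and uses the ratio of boundary points to training examples to decide which attributes to evaluate later. It assumes the common definition in which a cut between two adjacent attribute values is a boundary point unless both values occur with one and the same single class. The function names `count_boundary_points` and `evaluation_order` and the `threshold` parameter are hypothetical; this is not the heuristic evaluated in the paper, only a minimal sketch of the general idea.

```python
from itertools import groupby
from typing import Dict, List, Sequence, Tuple


def count_boundary_points(values: Sequence[float], labels: Sequence[str]) -> int:
    """Count class-boundary cut points of one numerical attribute."""
    # Sort examples by attribute value and collect the set of classes
    # seen at each distinct value (one "bin" per distinct value).
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    bins = [
        {label for _, label in group}
        for _, group in groupby(pairs, key=lambda p: p[0])
    ]
    # A cut between adjacent bins is skipped only when both bins are pure
    # and hold the same class; every other cut counts as a boundary point.
    return sum(
        1
        for left, right in zip(bins, bins[1:])
        if not (len(left) == 1 and left == right)
    )


def evaluation_order(
    columns: Dict[str, Sequence[float]],
    labels: Sequence[str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Split attributes into (evaluate now, postponed) by boundary-point ratio."""
    n = len(labels)
    ratios = {
        name: count_boundary_points(vals, labels) / n
        for name, vals in columns.items()
    }
    eager = sorted((a for a in ratios if ratios[a] <= threshold), key=ratios.get)
    postponed = sorted((a for a in ratios if ratios[a] > threshold), key=ratios.get)
    return eager, postponed


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    columns = {"smooth": [1, 2, 3, 4], "noisy": [1, 3, 2, 4]}
    print(evaluation_order(columns, labels))
```

In this toy data the attribute `smooth` has a single boundary point, while `noisy` alternates classes along its sorted values and therefore ends up in the postponed group under the assumed threshold.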
Original language: English
Title of host publication: Principles of Data Mining and Knowledge Discovery
Subtitle of host publication: Second European Symposium, PKDD ’98
Publisher: Springer
Pages: 221-229
ISBN (Electronic): 978-3-540-49687-8
ISBN (Print): 978-3-540-65068-3
Publication status: Published - 1998
MoE publication type: A4 Article in a conference publication
Event: 2nd Eur. Symp., PKDD'98. Principles of Data Mining and Knowledge Discovery. Nantes, 23 - 26 Sept. 1998
Duration: 1 Jan 1998 → …

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 1510
ISSN (Print): 0302-9743

Conference

Conference: 2nd Eur. Symp., PKDD'98. Principles of Data Mining and Knowledge Discovery. Nantes, 23 - 26 Sept. 1998
Period: 1/01/98 → …

Fingerprint

Heuristic methods
Decision trees
Processing
Data mining
Classifiers
Experiments

Cite this

Elomaa, T., & Rousu, J. (1998). Postponing the evaluation of attributes with a high number of boundary points. In Principles of Data Mining and Knowledge Discovery: Second European Symposium, PKDD ’98 (pp. 221-229). Springer. Lecture Notes in Computer Science, Vol. 1510.
Elomaa, Tapio ; Rousu, Juho. / Postponing the evaluation of attributes with a high number of boundary points. Principles of Data Mining and Knowledge Discovery: Second European Symposium, PKDD ’98. Springer, 1998. pp. 221-229 (Lecture Notes in Computer Science, Vol. 1510).
@inproceedings{da995c7c25f743e78435e4d08a5ab9a4,
title = "Postponing the evaluation of attributes with a high number of boundary points",
abstract = "The efficiency of the otherwise expedient decision tree learning can be impaired in processing data-mining-sized data if superlinear-time processing is required in attribute selection. An example of such a technique is optimal multisplitting of numerical attributes. Its efficiency is hit hard even by a single troublesome attribute in the domain. Analysis shows that there is a direct connection between the ratio of the numbers of boundary points and training examples and the maximum goodness score of a numerical attribute. Class distribution information from preprocessing can be applied to obtain tighter bounds for an attribute's relevance in class prediction. These analytical bounds, however, are too loose for practical purposes. We experiment with heuristic methods which postpone the evaluation of attributes that have a high number of boundary points. The results show that substantial time savings can be obtained in the most critical data sets without having to give up on the accuracy of the resulting classifier.",
author = "Tapio Elomaa and Juho Rousu",
year = "1998",
language = "English",
isbn = "978-3-540-65068-3",
series = "Lecture Notes in Computer Science",
publisher = "Springer",
pages = "221--229",
booktitle = "Principles of Data Mining and Knowledge Discovery",
address = "Germany",

}

Elomaa, T & Rousu, J 1998, Postponing the evaluation of attributes with a high number of boundary points. in Principles of Data Mining and Knowledge Discovery: Second European Symposium, PKDD ’98. Springer, Lecture Notes in Computer Science, vol. 1510, pp. 221-229, 2nd Eur. Symp., PKDD'98. Principles of Data Mining and Knowledge Discovery. Nantes, 23 - 26 Sept. 1998, 1/01/98.

Postponing the evaluation of attributes with a high number of boundary points. / Elomaa, Tapio; Rousu, Juho.

Principles of Data Mining and Knowledge Discovery: Second European Symposium, PKDD ’98. Springer, 1998. p. 221-229 (Lecture Notes in Computer Science, Vol. 1510).

TY - GEN
T1 - Postponing the evaluation of attributes with a high number of boundary points
AU - Elomaa, Tapio
AU - Rousu, Juho
PY - 1998
Y1 - 1998
N2 - The efficiency of the otherwise expedient decision tree learning can be impaired in processing data-mining-sized data if superlinear-time processing is required in attribute selection. An example of such a technique is optimal multisplitting of numerical attributes. Its efficiency is hit hard even by a single troublesome attribute in the domain. Analysis shows that there is a direct connection between the ratio of the numbers of boundary points and training examples and the maximum goodness score of a numerical attribute. Class distribution information from preprocessing can be applied to obtain tighter bounds for an attribute's relevance in class prediction. These analytical bounds, however, are too loose for practical purposes. We experiment with heuristic methods which postpone the evaluation of attributes that have a high number of boundary points. The results show that substantial time savings can be obtained in the most critical data sets without having to give up on the accuracy of the resulting classifier.
M3 - Conference article in proceedings
SN - 978-3-540-65068-3
T3 - Lecture Notes in Computer Science
SP - 221
EP - 229
BT - Principles of Data Mining and Knowledge Discovery
PB - Springer
ER -

Elomaa T, Rousu J. Postponing the evaluation of attributes with a high number of boundary points. In Principles of Data Mining and Knowledge Discovery: Second European Symposium, PKDD ’98. Springer. 1998. p. 221-229. (Lecture Notes in Computer Science, Vol. 1510).