TY - BOOK
T1 - Mining sequential patterns
AU - Ahola, Jussi
PY - 2001
Y1 - 2001
N2 - Discovering associations is one of the fundamental tasks
of data mining. Its aim is to
automatically seek for dependencies from vast amounts of
data. The task results in socalled
association rules, which are of form: If A occurs in the
data then B occurs also. Only
those rules that occur in the data frequently enough are
generated.
However, various information sources generate data with
an inherent sequential nature,
i.e., it is composed of discrete events which have a
temporal/spatial ordering. This kind of
data can be obtained from, e.g., telecommunications
networks, electronic commerce,
www-servers of Internet, and various scientific sources,
like gene databases. The
sequential nature of the data is totally ignored in the
generation of the association rules.
Thus, a part of the useful information included in the
data is discarded. Thus, since the mid
90's the interest in discovering also the sequential
associations in the data has arisen
among the data mining community.
The sequential associations or sequential patterns can be
presented in the form: when A
occurs, B occurs within some certain time. So, the
difference to traditional association
rules is that here the time information is included both
in the rule itself and also in the
mining process in the form of timing constraints.
Nowadays there exist several highly
efficient methods for mining these kind of patterns. The
problem with them is that they
assume the input data to be sequences of discrete events
including only the information of
the ordering, usually the time. Often, however, the
events are associated with some
additional attributes. The existing methods cannot take
this multi-dimensionality of the
data into account and so they lose the additional
information it involves. Furthermore, the
methods are designed for some specific problem, and are
not, as such, applicable to
different types of sequential data.
In this report, a general formulation of the sequential
patterns is introduced as it is
presented in [1]. By using this approach the last problem
of the existing algorithms can be
tackled. A survey of the existing algorithm is then done.
Three algorithms are presented in
detail: WINEPI [2] and GSP [4] as they form the basis of
the algorithms, and cSPADE [6]
since it seems to be the most promising method proposed
for the problem yet. Also the
other relevant approaches are shortly introduced. Lastly,
the extension of the patterns into
the multi-dimensional is considered. Some ideas of
handling the problem are given and
also the features of the existing algorithms supporting
multi-dimensionality are studied.
AB - Discovering associations is one of the fundamental tasks
of data mining. Its aim is to
automatically seek for dependencies from vast amounts of
data. The task results in socalled
association rules, which are of form: If A occurs in the
data then B occurs also. Only
those rules that occur in the data frequently enough are
generated.
However, various information sources generate data with
an inherent sequential nature,
i.e., it is composed of discrete events which have a
temporal/spatial ordering. This kind of
data can be obtained from, e.g., telecommunications
networks, electronic commerce,
www-servers of Internet, and various scientific sources,
like gene databases. The
sequential nature of the data is totally ignored in the
generation of the association rules.
Thus, a part of the useful information included in the
data is discarded. Thus, since the mid
90's the interest in discovering also the sequential
associations in the data has arisen
among the data mining community.
The sequential associations or sequential patterns can be
presented in the form: when A
occurs, B occurs within some certain time. So, the
difference to traditional association
rules is that here the time information is included both
in the rule itself and also in the
mining process in the form of timing constraints.
Nowadays there exist several highly
efficient methods for mining these kind of patterns. The
problem with them is that they
assume the input data to be sequences of discrete events
including only the information of
the ordering, usually the time. Often, however, the
events are associated with some
additional attributes. The existing methods cannot take
this multi-dimensionality of the
data into account and so they lose the additional
information it involves. Furthermore, the
methods are designed for some specific problem, and are
not, as such, applicable to
different types of sequential data.
In this report, a general formulation of the sequential
patterns is introduced as it is
presented in [1]. By using this approach the last problem
of the existing algorithms can be
tackled. A survey of the existing algorithm is then done.
Three algorithms are presented in
detail: WINEPI [2] and GSP [4] as they form the basis of
the algorithms, and cSPADE [6]
since it seems to be the most promising method proposed
for the problem yet. Also the
other relevant approaches are shortly introduced. Lastly,
the extension of the patterns into
the multi-dimensional is considered. Some ideas of
handling the problem are given and
also the features of the existing algorithms supporting
multi-dimensionality are studied.
KW - data mining
M3 - Report
T3 - VTT Research Report
BT - Mining sequential patterns
PB - VTT Technical Research Centre of Finland
CY - Espoo
ER -