Abstract
Short documents have become an important corpus for
several text mining applications. In consumer research,
for instance, one effective way of gathering opinions is
to use an online questionnaire containing
questions that users can answer freely.
One questionnaire may contain several questions that
often get thousands of answers that are both informal and
short. In these cases the traditional term weighting
measures have trouble identifying the important words.
One particular problem is the TF=1 challenge: as each word
occurs almost always only once per document, methods that
rely on term frequency (e.g., TF-IDF) do not produce good
results. In this paper we describe a term weighting
approach for text categorization that does not rely on
term frequencies but uses other statistics instead. In
addition, we propose a novel multi-label learning method
that is based on a divide and conquer approach. For
categorization, we use a Naive Bayes classifier. We
evaluate our approach by comparing it against χ² and
TF-IDF term weighting, and TWCNB and kNN classification
approaches. The results show that the proposed approach
produces the best results when used for short document
categorization.
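
The TF=1 challenge mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's weighting method; it only shows, under the assumption of raw term counts and a plain log(N/df) IDF, that TF-IDF collapses to IDF when every word occurs once per document. The sample answers and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (raw TF times log(N/df))."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical short questionnaire answers; each word occurs once per answer.
answers = [
    "good taste cheap price".split(),
    "bad taste".split(),
    "cheap but good".split(),
]

weights = tf_idf(answers)

# IDF alone, computed independently of term frequency.
df = Counter(term for doc in answers for term in set(doc))
idf = {t: math.log(len(answers) / df[t]) for t in df}

for doc_weights in weights:
    for term, w in doc_weights.items():
        # With TF = 1 the TF-IDF weight carries no per-document information.
        print(f"{term}: tfidf={w:.3f} idf={idf[term]:.3f}")
```

Because the term-frequency factor is always 1 here, a word receives the same weight in every document that contains it, which is why a weighting scheme built on other statistics is needed for such corpora.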
Original language | English |
---|---|
Publication status | Published - 2011 |
MoE publication type | Not Eligible |
Event | 7th International Conference on Advanced Data Mining and Applications, ADMA'11 - Beijing, China. Duration: 17 Dec 2011 → 19 Dec 2011 |
Conference
Conference | 7th International Conference on Advanced Data Mining and Applications, ADMA'11 |
---|---|
Abbreviated title | ADMA'11 |
Country/Territory | China |
City | Beijing |
Period | 17/12/11 → 19/12/11 |
Keywords
- Text categorization
- Naive Bayes
- multi-label learning
- feature selection
- short document categorization