Classification of short documents to categorize consumer opinions

Mika Timonen, Paula Silvonen, Melissa Kasari

Research output: Contribution to conference > Other conference contribution > Scientific

Abstract

Short documents have become an important corpus for several text mining applications. In consumer research, for instance, one effective way of gathering opinions is to use an online questionnaire with questions that users can answer freely. One questionnaire may contain several questions, each of which often gets thousands of answers that are both informal and short. In these cases, traditional term weighting measures have trouble identifying the important words. One particular problem is the TF=1 challenge: because each word almost always occurs only once per document, methods that rely on term frequency (e.g., TF-IDF) do not produce good results. In this paper we describe a term weighting approach for text categorization that does not rely on term frequencies but uses other statistics instead. In addition, we propose a novel multi-label learning method that is based on a divide-and-conquer approach. For categorization, we use a Naive Bayes classifier. We evaluate our approach by comparing it against χ² and TF-IDF term weighting, and against the TWCNB and kNN classification approaches. The results show that the proposed approach produces the best results when used for short document categorization.
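The TF=1 challenge mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method; it only shows that when every word occurs once per document, a plain TF-IDF weight (raw term frequency times log(N/df), one common variant chosen here as an assumption) carries no more information than IDF alone. The example answers and the function names `document_frequency`, `tf_idf`, and `idf` are hypothetical.

```python
import math
from collections import Counter


def document_frequency(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    return Counter(term for tokens in tokenized_docs for term in set(tokens))


def idf(tokenized_docs):
    """Inverse document frequency log(N / df) for every term in the corpus."""
    n_docs = len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(tokenized_docs):
    """One {term: weight} dict per document, using raw TF times IDF."""
    idf_weights = idf(tokenized_docs)
    weights = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        weights.append({term: tf[term] * idf_weights[term] for term in tf})
    return weights


# Hypothetical short questionnaire answers; no word repeats within an answer.
answers = [
    "too expensive",
    "great taste",
    "taste too sweet",
    "expensive packaging",
]
tokenized = [answer.lower().split() for answer in answers]

idf_weights = idf(tokenized)
tfidf_weights = tf_idf(tokenized)

# With TF = 1 everywhere, the TF factor never changes any weight.
tf_is_always_one = all(
    count == 1 for tokens in tokenized for count in Counter(tokens).values()
)
tfidf_equals_idf = all(
    doc_weights[term] == idf_weights[term]
    for doc_weights in tfidf_weights
    for term in doc_weights
)
print(tf_is_always_one, tfidf_equals_idf)
```

In this toy corpus the term-frequency factor is constant, so any ranking of words within a document comes from the corpus-level statistic alone, which is the motivation the abstract gives for using other statistics.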
Original language: English
Publication status: Published - 2011
MoE publication type: Not Eligible
Event: 7th International Conference on Advanced Data Mining and Applications, ADMA'11 - Beijing, China
Duration: 17 Dec 2011 to 19 Dec 2011

Conference

Conference: 7th International Conference on Advanced Data Mining and Applications, ADMA'11
Abbreviated title: ADMA'11
Country: China
City: Beijing
Period: 17/12/11 to 19/12/11

Keywords

  • Text categorization
  • Naive Bayes
  • multi-label learning
  • feature selection
  • short document categorization
