Classification of short documents to categorize consumer opinions

Mika Timonen, Paula Silvonen, Melissa Kasari

Research output: Contribution to conference › Other conference contribution › Scientific

Abstract

Short documents have become an important corpus for several text mining applications. In consumer research, for instance, one effective way of gathering opinions is to use an online questionnaire containing questions that users can answer freely. One questionnaire may contain several questions, each of which often receives thousands of answers that are both informal and short. In these cases, traditional term weighting measures have trouble identifying the important words. One particular problem is the TF=1 challenge: because each word almost always occurs only once per document, methods that rely on term frequency (e.g., TF-IDF) do not produce good results. In this paper we describe a term weighting approach for text categorization that does not rely on term frequencies but uses other statistics instead. In addition, we propose a novel multi-label learning method that is based on a divide-and-conquer approach. For categorization, we use a Naive Bayes classifier. We evaluate our approach by comparing it against χ² and TF-IDF term weighting, and against the TWCNB and kNN classification approaches. The results show that the proposed approach produces the best results when used for short document categorization.
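
The TF=1 challenge described in the abstract can be illustrated with a minimal sketch. The example answers, the function name `tf_idf`, and the particular TF-IDF variant (raw term frequency times log(N / df)) are assumptions made for illustration; they are not taken from the paper, and the sketch does not implement the authors' proposed weighting. It only shows that when every word occurs once per document, the TF factor is always 1, so the weight collapses to the IDF value.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (weights, idf) for whitespace-tokenized documents.

    weights: one {term: raw_tf * idf} dict per document.
    idf:     {term: log(N / document_frequency)}.
    This is one common textbook variant, assumed here for illustration.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights, idf


# Hypothetical short questionnaire answers; every word occurs once per answer.
answers = ["great taste", "too expensive", "great price", "bad taste"]
weights, idf = tf_idf(answers)

for answer, w in zip(answers, weights):
    print(answer, w)
```

Because TF is 1 for every term in these answers, each weight equals the term's IDF, and the term-frequency component contributes no information for distinguishing words within a document.
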
Original language: English
Publication status: Published - 2011
MoE publication type: Not Eligible
Event: 7th International Conference on Advanced Data Mining and Applications, ADMA'11 - Beijing, China
Duration: 17 Dec 2011 – 19 Dec 2011

Conference

Conference: 7th International Conference on Advanced Data Mining and Applications, ADMA'11
Abbreviated title: ADMA'11
Country/Territory: China
City: Beijing
Period: 17/12/11 – 19/12/11

Keywords

  • Text categorization
  • Naive Bayes
  • multi-label learning
  • feature selection
  • short document categorization
