Statistical Approach for Term Weighting in Very Short Documents for Text Categorization

Mika Timonen, Melissa Kasari

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

In this paper, we propose a novel approach for term weighting in very short documents that is used with a Support Vector Machine classifier. We focus on market research and social media documents. In both of these data sources, the average length of a document is below twenty words. As the documents are short, each word usually occurs only once within a document. Such a word is known as a hapax legomenon, and in our previous work we called this situation the Term Frequency=1 challenge. For this reason, the traditional term weighting approaches become less effective with short documents. The proposed approach does not use term frequency within a document but substitutes it with other word statistics. In the experimental evaluation and comparison against several other term weighting approaches, the proposed method produced promising results by outperforming the other approaches.
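
The Term Frequency=1 challenge described above can be illustrated with a minimal sketch. It uses plain TF-IDF, not the weighting method proposed in the paper. The toy corpus and all names (`docs`, `share_tf1`, `tfidf`) are assumptions made for illustration and do not come from the paper. The sketch shows that when almost every term occurs once per document, the TF factor is 1 and the weight reduces to the IDF alone.

```python
import math
from collections import Counter

# Hypothetical toy corpus of very short documents (not data from the paper).
docs = [
    "great phone battery lasts long",
    "battery died after one week",
    "love the screen and the camera",
    "camera quality is poor",
]

tokenized = [d.split() for d in docs]

# Raw term frequency within each document.
tf_per_doc = [Counter(tokens) for tokens in tokenized]

# Share of (document, term) pairs in which the term occurs exactly once.
counts = [c for tf in tf_per_doc for c in tf.values()]
share_tf1 = sum(1 for c in counts if c == 1) / len(counts)

# Document frequency and unsmoothed inverse document frequency.
n_docs = len(docs)
df = Counter(t for tf in tf_per_doc for t in tf)
idf = {t: math.log(n_docs / df[t]) for t in df}


def tfidf(doc_index):
    """Plain TF-IDF weights for one document: tf * idf."""
    return {t: c * idf[t] for t, c in tf_per_doc[doc_index].items()}


print(f"share of terms with TF=1: {share_tf1:.3f}")
# In a document where every term occurs once, TF-IDF equals IDF.
print(tfidf(0) == {t: idf[t] for t in tokenized[0]})
```

In this toy corpus only one term ("the" in the third document) has a within-document frequency above 1, so term frequency carries almost no information for separating terms within a document.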
Original language: English
Title of host publication: Knowledge Discovery, Knowledge Engineering and Knowledge Management
Subtitle of host publication: 4th International Joint Conference, IC3K 2012
Editors: Ana Fred, Jan L.G. Dietz, Kecheng Liu, Joaquim Filipe
Publisher: Springer
Pages: 3-18
ISBN (Electronic): 978-3-642-54105-6
ISBN (Print): 978-3-642-54104-9
Publication status: Published - 2013
MoE publication type: A4 Article in a conference publication
Event: 4th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2012 - Barcelona, Spain
Duration: 4 Oct 2012 – 7 Oct 2012

Publication series

Series: Communications in Computer and Information Science
Volume: 415
ISSN: 1865-0929

Conference

Conference: 4th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2012
Abbreviated title: IC3K 2012
Country/Territory: Spain
City: Barcelona
Period: 4/10/12 – 7/10/12

Keywords

  • feature weighting
  • hapax legomenon
  • short document categorization
  • support vector machine
  • text categorization
