Statistical Approach for Term Weighting in Very Short Documents for Text Categorization

Mika Timonen, M. Kasari

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

In this paper, we propose a novel approach to term weighting in very short documents for use with a Support Vector Machine classifier. We focus on market research and social media documents. In both of these data sources, the average length of a document is below twenty words. Because the documents are short, each word usually occurs only once within a document. A word that occurs only once is known as a hapax legomenon, and in our previous work we referred to this as the Term Frequency=1 challenge. For this reason, traditional term weighting approaches become less effective with short documents. The proposed approach does not use term frequency within a document but substitutes it with other word statistics. In the experimental evaluation against several other term weighting approaches, the proposed method produced promising results, outperforming the competition.
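The Term Frequency=1 effect described in the abstract can be illustrated with a minimal sketch. It uses a plain textbook TF-IDF (raw count times log(N/df)), not the weighting proposed in the paper, and the three toy documents and the function names `tf_idf` and `idf_only` are assumptions made up for illustration. When every word occurs once per document, the term-frequency factor is always 1 and the weight collapses to the IDF value alone.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count in how many documents each term appears."""
    return Counter(term for doc in docs for term in set(doc))


def tf_idf(docs):
    """Textbook TF-IDF: raw term count times log(N / df)."""
    n = len(docs)
    df = document_frequency(docs)
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


def idf_only(docs):
    """The same weighting with the term-frequency factor dropped."""
    n = len(docs)
    df = document_frequency(docs)
    return [{t: math.log(n / df[t]) for t in set(doc)} for doc in docs]


# Hypothetical very short documents (illustrative only, not from the paper).
docs = [
    "great phone battery".split(),
    "battery died fast".split(),
    "great service".split(),
]

# Every term occurs exactly once in its document, so TF carries no information.
all_tf_one = all(c == 1 for doc in docs for c in Counter(doc).values())
same_weights = tf_idf(docs) == idf_only(docs)
print(all_tf_one, same_weights)
```

In this toy corpus the two weightings coincide, which is why a method for very short documents has to draw on statistics other than within-document term frequency.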
Original language: English
Title of host publication: Knowledge Discovery, Knowledge Engineering and Knowledge Management
Subtitle of host publication: IC3K 2012
Editors: A. Fred, J.L.G. Dietz, K. Liu, J. Filipe
Publisher: Springer
Pages: 3-18
ISBN (Electronic): 978-3-642-54105-6
ISBN (Print): 978-3-642-54104-9
DOI: https://doi.org/10.1007/978-3-642-54105-6_1
Publication status: Published - 2013
MoE publication type: Not Eligible
Event: 4th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2012 - Barcelona, Spain
Duration: 4 Oct 2012 to 7 Oct 2012

Publication series

Series: Communications in Computer and Information Science
Volume: 415
ISSN: 1865-0929

Conference

Conference: 4th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2012
Abbreviated title: IC3K 2012
Country: Spain
City: Barcelona
Period: 4/10/12 to 7/10/12

Keywords

  • feature weighting
  • hapax legomenon
  • short document categorization
  • support vector machine
  • text categorization

Cite this

Timonen, M., & Kasari, M. (2013). Statistical Approach for Term Weighting in Very Short Documents for Text Categorization. In A. Fred, J. L. G. Dietz, K. Liu, & J. Filipe (Eds.), Knowledge Discovery, Knowledge Engineering and Knowledge Management: IC3K 2012 (pp. 3-18). Springer. Communications in Computer and Information Science, Vol. 415. https://doi.org/10.1007/978-3-642-54105-6_1