Map of technology: Topic modelling full-text patent data

Arho Suominen, Hannes Toivanen

    Research output: Contribution to conference › Conference article › Scientific

    32 Citations (Scopus)

    Abstract

    A number of studies have focused on creating patent maps [1,2,3]. Many of the patent map studies rely on traditional bibliometrics: for example, [4] uses citation analysis to connect patent mapping to public policy analysis, and [5] shows what can be regarded as the de facto bibliometric technology forecasting example. Recently the focus has turned to the use of big data and data mining, specifically text mining, in patent mapping. In 2007, [6] illustrated text mining techniques for patent analysis. Reference [7] focused on creating machine-produced summarizations and mappings of patents. A data mining approach has also been used in technology roadmapping [8]. As computational methods from the machine-learning field become more available and stable, studies have moved towards using stable processes and applying the methods to, for example, the management of technology. Unsupervised learning methods such as Latent Dirichlet Allocation are relatively stable and readily accessible to scholars outside the computer science domain. In this study, we show the capability of unsupervised learning, on a single-node computer, to learn the thematic areas of all full-text patent documents published by the USPTO in 2014 (N=374,704). We further discuss the key challenges in running the study, interpreting the outcome, and directions for further development.

    Unsupervised learning produces an outcome from an input without receiving any feedback from the environment. It differs from supervised or reinforcement learning in its reliance on a formal framework that enables the algorithm to find patterns. The majority of unsupervised methods rely on a probabilistic model of the input data: an unsupervised learning method estimates the model that represents the probability distribution for an input, either based on previous inputs or independently. Topic models are unsupervised learning methods, and Latent Dirichlet Allocation (LDA) is one topic model that draws out latent patterns from text. In 2007, [9] showed the usability of topic models in modelling the structure of semantic text, noting that topic models "...can extract surprisingly interpretable and useful structure without any explicit 'understanding' of the language by computer". The basic idea behind the model is that each document in a corpus is a random mixture over latent topics, and each latent topic is characterized by a distribution over words. In the LDA model, each document is a mixture of a number of topics based on the words attributable to each of the topics. LDA allows us to uncover these latent probability distributions from the semantic text used in the documents, thus classifying the documents based on the latent patterns within them. For a detailed explanation of the algorithm, see for example [10]; for an evaluation analysing scientific publications, see [11].

    We analyse patent data published by the USPTO in 2014 (N=374,704). The data consists of all patents published in 2014, and the analysis uses the full-text description as source data for unsupervised learning. Prior to analysis, the texts were pre-processed with a Python script that removes stopwords and punctuation. Terms that occur only once in the whole dataset were also removed at this stage.
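    The authors' pre-processing script is not published; the following is a minimal sketch of the described step, assuming NLTK's English stopword list (all names here are illustrative, not the authors' code):

        import string
        from collections import Counter
        from nltk.corpus import stopwords  # requires nltk.download('stopwords')

        STOPWORDS = set(stopwords.words('english'))

        def preprocess(documents):
            """Lowercase, strip punctuation, drop stopwords and hapax terms."""
            table = str.maketrans('', '', string.punctuation)
            tokenized = [
                [tok for tok in doc.lower().translate(table).split()
                 if tok not in STOPWORDS]
                for doc in documents
            ]
            # Remove terms that occur only once in the whole corpus.
            counts = Counter(tok for doc in tokenized for tok in doc)
            return [[tok for tok in doc if counts[tok] > 1] for doc in tokenized]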
    After these terms were removed, the text was tokenized and each token was mapped to a corresponding integer id to further reduce the complexity of the data. As LDA requires a fixed number of topics, we employed a KL-divergence-based evaluation of the natural number of topics [12]. Based on a qualitative evaluation of the KL divergence values and multiple runs of the algorithm, we produced 200 topics. The topics were visualized using word clouds.

    Our results show how meaningful latent patterns can be drawn out of a large text corpus with a single-node computer. Our setup classified the 374,704 full-text documents in practical time, producing a model that can be used to infer the classification of new documents. The key challenges of LDA-based analysis are estimating the number of topics and deciding on the pre-processing needed. Methods such as the one proposed in [12] take significant computational time while adding limited value to the analysis. For pre-processing, [11] suggested limiting the pre-processing of data prior to analysis; our results, however, show added value in a more aggressive approach, though more research is needed. Methodological development in machine-learning methods has clearly reached a point where algorithms are available "off the shelf". Visualizing matrices of size 374,704 × 200 remains more challenging, however, and there is a clear need to turn the focus towards creating actionable results for users.
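    The abstract does not name the LDA implementation used; a sketch of the described pipeline using the gensim library (an assumption) could look as follows, mapping tokens to integer ids, building a bag-of-words corpus, and fitting a 200-topic model that can then infer the topic mixture of new documents:

        from gensim import corpora, models

        def fit_lda(tokenized_docs, num_topics=200):
            """Map tokens to integer ids, build bag-of-words vectors, fit LDA."""
            dictionary = corpora.Dictionary(tokenized_docs)
            bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
            lda = models.LdaModel(bow_corpus, id2word=dictionary,
                                  num_topics=num_topics)
            return dictionary, lda

        # Inferring the classification of a previously unseen document:
        # dictionary, lda = fit_lda(preprocess(patent_texts))
        # new_bow = dictionary.doc2bow(preprocess([new_patent_text])[0])
        # print(lda.get_document_topics(new_bow))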
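    Similarly, the word-cloud visualization of a fitted topic could be produced along these lines, assuming the wordcloud and matplotlib packages (neither is named in the abstract):

        from wordcloud import WordCloud
        import matplotlib.pyplot as plt

        def plot_topic(lda, topic_id, num_words=30):
            """Render the top words of one topic, weighted by probability."""
            weights = dict(lda.show_topic(topic_id, topn=num_words))
            cloud = WordCloud(background_color='white').generate_from_frequencies(weights)
            plt.imshow(cloud, interpolation='bilinear')
            plt.axis('off')
            plt.show()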
    Original language: English
    Publication status: Published - 2015
    MoE publication type: Not Eligible
    Event: 5th Global TechMining Conference, GTM 2015 - Atlanta, United States
    Duration: 16 Sept 2015 – 16 Sept 2015
    Conference number: 5

    Conference

    Conference: 5th Global TechMining Conference, GTM 2015
    Abbreviated title: GTM 2015
    Country/Territory: United States
    City: Atlanta
    Period: 16/09/15 – 16/09/15

    Keywords

    • topic modelling
    • patents
    • machine-learning
    • classification
