Map of technology: Topic modelling full-text patent data

Arho Suominen, Hannes Toivanen

Research output: Contribution to conference › Conference article › Scientific

15 Citations (Scopus)

Abstract

A number of studies have focused on creating patent maps [1,2,3]. Many patent map studies rely on traditional bibliometrics: [4] used citation analysis to connect patent mapping to public policy analysis, and [5] provides what can be regarded as the de facto example of bibliometric technology forecasting. Recently, the focus has turned to the use of big data and data mining, specifically text mining, in patent mapping. In 2007, [6] illustrated text mining techniques for patent analysis. Reference [7] focused on creating machine-produced summarizations and mappings of patents. A data mining approach has also been used in technology roadmapping [8]. As computational methods in machine learning become more available and stable, studies have moved towards established processes and focus on applying the methods to, for example, the management of technology. Unsupervised learning methods such as Latent Dirichlet Allocation are relatively stable and relatively easily accessible to scholars outside the computer science domain. In this study, we show the capabilities of unsupervised learning on a single-node computer in learning the thematic areas of all full-text patent documents published by the USPTO in 2014 (N=374,704). We further discuss the key challenges in running the study, interpreting the outcome, and further developments. Unsupervised learning produces an outcome from an input without receiving any feedback from the environment. It differs from supervised or reinforcement learning in its reliance on a formal framework that enables the algorithm to find patterns. The majority of unsupervised methods rely on a probabilistic model of the input data: an unsupervised learning method estimates the model that represents the probability distribution for an input, either based on previous inputs or independently.
Topic models are unsupervised learning methods, and Latent Dirichlet Allocation (LDA) is one topic model that draws out latent patterns from text. In 2007, [9] showed the usability of topic models in modeling the structure of semantic text. In presenting the methodology, [9] noted that topic models "...can extract surprisingly interpretable and useful structure without any explicit 'understanding' of the language by computer". The basic idea behind the model is that each document in a corpus is a random mixture over latent topics, and each latent topic is characterized by a distribution over words. In the LDA model, each document is a mixture of a number of topics based on the words attributable to each of the topics. LDA allows us to uncover these latent probability distributions from the semantic text of the documents, thus classifying the documents by the latent patterns within them. For a detailed explanation of the algorithm, refer to, for example, [10]; for an evaluation analyzing scientific publications, refer to [11]. We analyse USPTO patent data published in 2014 (N=374,704). The data consists of all patents published in 2014, and the analysis uses the full-text description as source data for unsupervised learning. Prior to analysis, the texts were pre-processed with a Python script that removes stopwords and punctuation. Terms that occur only once in the whole dataset were also removed at this stage. After these terms were removed, the text was tokenized and each token was transformed into a corresponding number to further reduce the complexity of the data. As LDA requires a fixed number of topics, we employed the KL divergence based evaluation of the natural number of topics [12]. Based on a qualitative evaluation of the KL divergence values and multiple runs of the algorithm, we produced 200 topics. The topics were visualized using wordclouds.
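The authors' pre-processing script is not reproduced here, but the described steps (stopword and punctuation removal, dropping corpus-wide hapax terms, tokenizing, and mapping tokens to integer ids) can be sketched with the standard library alone. The stopword list and function name below are illustrative, not the paper's actual code:

```python
from collections import Counter
import string

# Hypothetical minimal stopword list; the paper's actual list is not specified.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def preprocess(documents):
    """Tokenize, drop stopwords and punctuation, remove terms occurring
    only once in the whole corpus, and map tokens to integer ids."""
    # Strip punctuation, lowercase, and split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    tokenized = [
        [w for w in doc.translate(table).lower().split() if w not in STOPWORDS]
        for doc in documents
    ]
    # Count term frequencies across the whole corpus.
    freq = Counter(w for doc in tokenized for w in doc)
    # Drop hapax legomena (terms seen exactly once corpus-wide).
    tokenized = [[w for w in doc if freq[w] > 1] for doc in tokenized]
    # Assign a stable integer id to each surviving vocabulary term.
    vocab = {w: i for i, w in enumerate(sorted({w for doc in tokenized for w in doc}))}
    return [[vocab[w] for w in doc] for doc in tokenized], vocab

docs = [
    "The laser emits a beam.",
    "A laser beam cutting device.",
    "Rotor blades, of a turbine.",
]
corpus, vocab = preprocess(docs)
# corpus is now a list of integer-id sequences ready for an LDA implementation.
```

The resulting id sequences are the usual input format for off-the-shelf LDA implementations, which is why the mapping step reduces complexity: the model never needs to touch the raw strings again.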
Our results show how meaningful latent patterns can be drawn out of a large text corpus with a single-node computer. Our setup classified the 374,704 full-text documents in practical time, creating a model that can be used to infer the classification of new documents. The key challenge of LDA-based analysis is estimating the number of topics to build and the pre-processing needed. The method proposed by, for example, [12] takes significant computational time and produces limited value for the analysis. For pre-processing, [11] suggested limiting the pre-processing of data prior to analysis; our results, however, show that there is added value in taking a more aggressive approach. More research is nevertheless needed. Clearly, methodological development in machine learning is at a point where algorithms are available "off the shelf". Visualizing a matrix of size 374,704 by 200 is, however, more challenging, and there is a clear need to turn the focus to creating actionable results for users.
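The divergence-based evaluation of the number of topics cited as [12] compares probability distributions derived from the fitted topic-word and document-topic matrices across candidate topic counts, choosing the count that minimises their divergence. The full procedure is not reproduced here; a stdlib-only sketch of the symmetric KL divergence at its core (function names are illustrative) could look like:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetrised KL divergence: identical distributions score 0,
    and the score grows as the distributions diverge. In topic-count
    selection, candidate counts are scored and the minimiser chosen."""
    return kl(p, q) + kl(q, p)

# Identical distributions diverge by zero; dissimilar ones score higher.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(symmetric_kl(uniform, uniform))  # exactly 0.0
print(symmetric_kl(uniform, skewed))   # strictly positive
```

This also illustrates why the procedure is computationally heavy at this scale: each candidate topic count requires fitting a separate LDA model over all 374,704 documents before the divergence can even be computed.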
Original language: English
Publication status: Published - 2015
Event: 5th Global TechMining Conference, GTM 2015, Atlanta, United States
Duration: 16 Sep 2015 – 16 Sep 2015
Conference number: 5

Conference

Conference: 5th Global TechMining Conference, GTM 2015
Abbreviated title: GTM 2015
Country: United States
City: Atlanta
Period: 16/09/15 – 16/09/15


Keywords

  • topic modelling
  • patents
  • machine-learning
  • classification

Cite this

Suominen, A., & Toivanen, H. (2015). Map of technology: Topic modelling full-text patent data. Paper presented at 5th Global TechMining Conference, GTM 2015, Atlanta, United States.