TY - JOUR
T1 - Connecting firm's web scraped textual content to body of science
T2 - Utilizing Microsoft Academic Graph hierarchical topic modeling
AU - Hajikhani, Arash
AU - Pukelis, Lukas
AU - Suominen, Arho
AU - Ashouri, Sajad
AU - Schubert, Torben
AU - Notten, Ad
AU - Cunningham, Scott W.
N1 - Funding Information:
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 870822.
Publisher Copyright:
© 2022 The Authors
PY - 2022
Y1 - 2022
N2 - This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in linking established economic classification systems with the novel and agile constructs that new data sources enable. To this end, we experimented with the European classification of economic activities (known as NACE) at the sectoral and company levels. We established a connection with Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP, and a data transformation/linkage procedure. The method contains three main steps: data source identification, raw data retrieval, and data preparation and transformation. These steps are applied to two distinct data sources.
KW - A method for linking web-scraped company website content to the topical structure of scientific literature
KW - Economic classification scheme
KW - Knowledge transformation
KW - Natural language processing
KW - Web scraping
UR - http://www.scopus.com/inward/record.url?scp=85125892314&partnerID=8YFLogxK
U2 - 10.1016/j.mex.2022.101650
DO - 10.1016/j.mex.2022.101650
M3 - Article
C2 - 35284247
AN - SCOPUS:85125892314
SN - 2215-0161
VL - 9
SP - 101650
JO - MethodsX
JF - MethodsX
M1 - 101650
ER -