Website Classification from Webpage Renders

Leonardo Espinosa-Leal*, Anton Akusok, Amaury Lendasse, Kaj-Mikael Björk

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

In this paper, we present a fast and accurate method for classifying web content. Our algorithm uses the visual information of a website's main homepage, saved as an image from a full-page snapshot. Sliding windows of different sizes and overlaps are used to obtain a large set of sub-images from each render. For each sub-image, a feature vector is extracted with a pre-trained deep learning model. An Extreme Learning Machine (ELM) model is trained for different numbers of hidden neurons using the large collection of features from a curated dataset of 5979 webpages from different classes: adult, alcohol, dating, gambling, shopping, tobacco and weapons. Our results show that the ELM classifier can be trained without manually tagging specific objects in the sub-images, and that it gives excellent results in comparison to more complex deep learning models. A random forest classifier trained for the specific class of weapons achieved an accuracy of 95% with an F1 score of 0.8.
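The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window size, the overlap fraction, the tanh activation and the ridge term are assumptions, and `placeholder_features` is a hypothetical stand-in for the pre-trained deep learning model, which the abstract does not name. The sketch also assumes that each sub-image simply inherits the label of its page.

```python
import numpy as np


def sliding_windows(image, size, overlap):
    """Yield square crops of side `size`; `overlap` is the fraction
    of a window shared with its neighbour (assumed parameterisation)."""
    step = max(1, int(size * (1.0 - overlap)))
    height, width = image.shape[:2]
    for top in range(0, height - size + 1, step):
        for left in range(0, width - size + 1, step):
            yield image[top:top + size, left:left + size]


def placeholder_features(crop):
    """Hypothetical stand-in for a pre-trained deep model:
    per-channel mean and standard deviation of the crop."""
    pixels = crop.reshape(-1, crop.shape[-1]).astype(np.float64)
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])


class ELM:
    """Basic Extreme Learning Machine: a fixed random hidden layer
    followed by output weights solved by regularised least squares."""

    def __init__(self, n_hidden, ridge=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.ridge = ridge
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        # Hidden-layer weights are drawn once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(n_classes)[y]  # one-hot targets
        A = H.T @ H + self.ridge * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta).argmax(axis=1)


# Usage on a synthetic "render" (random pixels, for illustration only).
render = np.random.default_rng(0).integers(0, 256, size=(100, 60, 3))
features = np.array([placeholder_features(crop)
                     for crop in sliding_windows(render, size=20, overlap=0.5)])
print(features.shape)
```

Only the output weights `beta` are fitted, so training reduces to one linear solve. That makes it cheap to repeat the fit for several values of `n_hidden`.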
Original language: English
Title of host publication: Proceedings of ELM2019
Editors: Jiuwen Cao, Chi Man Vong, Yoan Miche, Amaury Lendasse
Place of Publication: Cham
Publisher: Springer
Pages: 41-50
ISBN (Electronic): 978-3-030-58989-9
ISBN (Print): 978-3-030-58988-2, 978-3-030-59049-9
DOIs
Publication status: Published - 12 Sept 2020
MoE publication type: A4 Article in a conference publication
Event: 2019 International Conference on Extreme Learning Machine (ELM 2019) - Yangzhou, China
Duration: 14 Dec 2019 → 16 Dec 2019

Publication series

Series: Proceedings in Adaptation, Learning and Optimization
Volume: 14
ISSN: 2363-6084

Conference

Conference: 2019 International Conference on Extreme Learning Machine (ELM 2019)
Country/Territory: China
City: Yangzhou
Period: 14/12/19 → 16/12/19
