TY - GEN
T1 - Website Classification from Webpage Renders
AU - Espinosa-Leal, Leonardo
AU - Akusok, Anton
AU - Lendasse, Amaury
AU - Björk, Kaj-Mikael
PY - 2020/9/12
Y1 - 2020/9/12
N2 - In this paper, we present a fast and accurate method for the classification of web content. Our algorithm uses the visual information of the main homepage, saved in an image format by means of a full-body snapshot. Sliding windows of different sizes and overlaps are used to obtain a large subset of images for each render. For each sub-image, a feature vector is extracted by means of a pre-trained deep learning model. An Extreme Learning Machine (ELM) model is trained for different numbers of hidden neurons using the large collection of features from a curated dataset of 5979 webpages with different classes: adult, alcohol, dating, gambling, shopping, tobacco and weapons. Our results show that the ELM classifier can be trained without manual tagging of specific objects in the sub-images while giving excellent results in comparison to more complex deep learning models. A random forest classifier was trained for the specific class of weapons, providing an accuracy of 95% with an F1 score of 0.8.
AB - In this paper, we present a fast and accurate method for the classification of web content. Our algorithm uses the visual information of the main homepage, saved in an image format by means of a full-body snapshot. Sliding windows of different sizes and overlaps are used to obtain a large subset of images for each render. For each sub-image, a feature vector is extracted by means of a pre-trained deep learning model. An Extreme Learning Machine (ELM) model is trained for different numbers of hidden neurons using the large collection of features from a curated dataset of 5979 webpages with different classes: adult, alcohol, dating, gambling, shopping, tobacco and weapons. Our results show that the ELM classifier can be trained without manual tagging of specific objects in the sub-images while giving excellent results in comparison to more complex deep learning models. A random forest classifier was trained for the specific class of weapons, providing an accuracy of 95% with an F1 score of 0.8.
U2 - 10.1007/978-3-030-58989-9_5
DO - 10.1007/978-3-030-58989-9_5
M3 - Conference article in proceedings
SN - 978-3-030-58988-2
SN - 978-3-030-59049-9
T3 - Proceedings in Adaptation, Learning and Optimization
SP - 41
EP - 50
BT - Proceedings of ELM2019
A2 - Cao, Jiuwen
A2 - Vong, Chi Man
A2 - Miche, Yoan
A2 - Lendasse, Amaury
PB - Springer
CY - Cham
T2 - 2019 International Conference on Extreme Learning Machine (ELM 2019)
Y2 - 14 December 2019 through 16 December 2019
ER -