TY - CONF
T1 - Encoding Temporal Information for Automatic Depression Recognition from Facial Analysis
AU - De Melo, Wheidima Carneiro
AU - Granger, Eric
AU - Lopez, Miguel Bordallo
PY - 2020/5
Y1 - 2020/5
AB - Depression is a mental illness that may be harmful to an individual's health. Using deep learning models to recognize the facial expressions of individuals captured in videos has shown promising results for automatic depression detection. Typically, depression levels are recognized using 2D-Convolutional Neural Networks (CNNs) that are trained to extract static features from video frames, which impairs the capture of dynamic spatio-temporal relations. As an alternative, 3D-CNNs may be employed to extract spatio-temporal features from short video clips, although the risk of overfitting increases due to the limited availability of labeled depression video data. To address these issues, we propose a novel temporal pooling method to capture and encode the spatio-temporal dynamics of video clips into an image map. This approach allows fine-tuning a pre-trained 2D-CNN to model facial variations, thereby improving the training process and model accuracy. Our proposed method is based on a two-stream model that performs late fusion of appearance and dynamic information. Extensive experiments on two benchmark AVEC datasets indicate that the proposed method is efficient and outperforms state-of-the-art schemes.
KW - Affective Computing
KW - Depression Detection
KW - Expression Recognition
KW - Temporal Pooling
KW - Two-Stream Model
UR - http://www.scopus.com/inward/record.url?scp=85089237658&partnerID=8YFLogxK
U2 - 10.1109/ICASSP40776.2020.9054375
DO - 10.1109/ICASSP40776.2020.9054375
M3 - Conference article in proceedings
AN - SCOPUS:85089237658
SN - 978-1-5090-6632-2
T3 - Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
SP - 1080
EP - 1084
BT - ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing
PB - IEEE - Institute of Electrical and Electronics Engineers
T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Y2 - 4 May 2020 through 8 May 2020
ER -