Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features

Zhang, Xuejie; Xu, Yan; Abel, Andrew K; Smith, Leslie S; Watt, Roger; Hussain, Amir; Gao, Chengxiang

doi:10.3390/e22121367

Please use this identifier to cite or link to this item: http://hdl.handle.net/1893/32078

Appears in Collections:	Computing Science and Mathematics Journal Articles
Peer Review Status:	Refereed
Title:	Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features
Author(s):	Zhang, Xuejie; Xu, Yan; Abel, Andrew K; Smith, Leslie S; Watt, Roger; Hussain, Amir; Gao, Chengxiang
Keywords:	Speech Recognition; Image Processing; Gabor Features; Lip Reading; Explainable
Issue Date:	Dec-2020
Date Deposited:	11-Dec-2020
Citation:	Zhang X, Xu Y, Abel AK, Smith LS, Watt R, Hussain A & Gao C (2020) Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features. Entropy, 22 (12), Art. No.: 1367. https://doi.org/10.3390/e22121367
Abstract:	Extraction of relevant lip features is of continuing interest in the visual speech domain. Using end-to-end feature extraction can produce good results, but at the cost of the results being difficult for humans to comprehend and relate to. We present a new, lightweight feature extraction approach, motivated by human-centric glimpse based psychological research into facial barcodes, and demonstrate that these simple, easy to extract 3D geometric features (produced using Gabor based image patches), can successfully be used for speech recognition with LSTM based machine learning. This approach can successfully extract low dimensionality lip parameters with a minimum of processing. One key difference between using these Gabor-based features and using other features such as traditional DCT, or the current fashion for CNN features is that these are human-centric features that can be visualised and analysed by humans. This means that it is easier to explain and visualise the results. They can also be used for reliable speech recognition, as demonstrated using the Grid corpus. Results for overlapping speakers using our lightweight system gave a recognition rate of over 82%, which compares well to less explainable features in the literature.
DOI Link:	10.3390/e22121367
Rights:	Copyright 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Licence URL(s):	http://creativecommons.org/licenses/by/4.0/

Files in This Item:

File	Description	Size	Format
entropy-22-01367.pdf	Fulltext - Published Version	2.47 MB	Adobe PDF

This item is protected by original copyright

A file in this item is licensed under a Creative Commons License

STORRE: Stirling Online Research Repository