Embedded Vision Systems: A Review of the Literature

. Over the past two decades, the use of low power Field Programmable Gate Arrays (FPGA) for the acceleration of various vision systems mainly on embedded devices have become widespread. The reconﬁgurable and parallel nature of the FPGA opens up new opportunities to speed-up computationally intensive vision and neural algorithms on embedded and portable devices. This paper presents a comprehensive review of embedded vision algorithms and applications over the past decade. The review will discuss vision based systems and approaches, and how they have been implemented on embedded devices. Topics covered include image acquisition, preprocessing, object detection and tracking, recognition as well as high-level classiﬁcation. This is followed by an outline of the advantages and disadvantages of the various embedded implementations. Finally, an overview of the challenges in the ﬁeld and future research trends are presented. This review is expected to serve as a tutorial and reference source for embedded computer vision systems.


Introduction
Scene understanding and prompt reaction to an event is a critical feature for any time critical computer vision system. The deployment scenarios include a range of applications such as mobile robotics, autonomous cars, mobile and wearable devices or public space surveillance (airport / railway station). Modern vision systems which play a significant role in such interaction process require higher level scene understanding with ultra-fast processing capabilities operating at extremely low power. Currently, such systems rely on traditional computer vision techniques which often follow compute intensive brute-force approaches (slower response time) and prone to fail in environments with limited power, bandwidth and computing resources. The aim of this paper is to review state-of-the-art embedded vision systems available from the literature and in the industry; and therefore to aid researchers for future development.
Research into computer vision has made steady and significant progress in the past two decades. The tremendous progress, coupled with cheap computational power has enabled many portable and embedded devices to operate with vision capabilities. Digital Signal Processing and for that matter Digital Image Processing (DIP) is an exciting area to be involved in today. Having been around for over two decades, it is typically used in application areas where cost and performance are key [7], including the entertainment industry, security surveillance systems, medical systems, automotive industry and defence. DIP systems are often implemented using the ubiquitous general purpose processors (GPPs). The increasing demand for high-speed has resulted in the use of dedicated Digital Signal Processors (DSPs) and General Purpose Graphics Processing Units (GPGPU); special types of GPP optimised for signal processing algorithms. However, power dissipation is important in almost all DSP-based consumer electronic devices; hence the high-speed, power-hungry GPPs become unattractive. Battery-powered products are highly sensitive to energy consumption, and even line-powered products are often sensitive to power consumption [41]. For hardware acceleration and low power consumption, DIP designers have opted for alternatives like the Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuits (ASIC).
The use of FPGAs in application areas like communication, image processing and control engineering has increased significantly over the past decade [54]. Computer vision and image processing algorithms often perform a large number of inherently parallel operations, and are not good candidates for implementation on machines designed around the von Neumann architecture. Some image processing algorithms have successfully been implemented on embedded system architectures running in real-time on portable devices [35] [45], and relatively small literature has been dedicated to the development of high-level algorithms for embedded hardware [39] [63]. The demand for real-time processing in the design of any practical imaging system has led to the development of the Intel Open source Computer Vision library (OpenCV) for the acceleration of various image processing tasks on GPPs [46]. Many imaging systems rely heavily on the increasing processing speed of today's GPPs to run in real-time.

Application specific vision systems
Every embedded vision systems follows a common pipeline of image processing functional blocks as depicted in Fig. 1. The image sensor or camera is the starting point of this pipeline followed by a frame grabber that controls the frame synchronization and frame rate. The raw pixels are then passed for further processing which includes image pre-processing, feature extraction and classification. Within this higher level abstraction various vision systems implemented required functionalities as shown in the figure. Image preprocessing functions are often pixel processing and offer stream computations. However features extraction and classification tasks are complex in nature and usually involves non-deterministic loop conditions. Analysis and optimisations [59] of such complexity with respect to performance and power [13] is an emerging topic of interest and often seen as a trade-off including the choice of the hardware. Embedded vision systems are usually developed either to accelerate complex algorithms that handles large stream of image data, e.g., stereo matching, video compression etc.; or to minimize power at resource constraint systems including unmanned aerial vehicle (UAV) or autonomous driver assistant systems. While a large number of applications of embedded vision systems can be found in the literature, they can be grouped to major application areas including robotics, face detection applications, multimedia compression, autonomous driving and assisted living as shown in Table 1. Various implementation techniques are proposed in the literature that considers a range of image processing algorithms. Efforts were made either to parallelize the algorithms, or to approximate computing to reduce computational complexities.
While the first approach has implications in performance improvement, the latter ones are more suitable for low power applications. Popular higher level complex image processing algorithms that are used in embedded computer vision literature includes stereo vision, feature extraction and tracking, motion estimation, object detection, scene segmentation and more recent convolutional neural network (CNN). These categories and corresponding literature are captured in Table 2.

Central Processing Unit (CPU)
The widespread adoption of imaging and vision applications in industrial automation, robotics and surveillance calls for a better way of implementing such techniques for real-time purposes. The need to address the gap in knowledge for students who have either studied computer vision or microelectronics to fill positions in the industry requiring both expertise has been address with the introduction of various CPU based platforms like Beagleboard [47] and Raspberry-Pi [48]. Hashmi et al. [27] used a beagleboard-xM low-power open-source hardware to prototype a real-time copyright protection algorithm. A human tracking system which reliably detect and track human motion has been implemented on a beagleboard-xM [24]. In [5], a LeopardBoard has been used to implement an efficient edge-detection algorithm for tracking activity level in an indoor environment. Similarly, Sharma and Kumar [56] presented an image enhancement algorithm on a beagleboard, mainly for monitoring the health condition of an individual. To demonstrate the efficiency of embedded image processing Sahani and Mohanty [55] showcased various computer vision applications developed on Raspberry-Pi. The system uses a camera powered by the raspberry-pi with a resolution of 1280×720 to detect text and images in real-time. Various other computer vision algorithm have been implemented on small dedicated platforms using Raspberry-Pi. In [30], a robot with on-board camera for carrying lightweight objects is presented and uses raspberry-pi to process the camera data in aid of navigation. Other robotic systems like [37] [44] have all implemented some vision based algorithms on a Raspberry-Pi because of its portability and ease of programmability.

Graphic Processing Unit (GPU)
The parallel nature of GPUs have made them a choice for the acceleration of many computer vision algorithms [66]. Coupled with the emerging heterogeneous programming models like OpenCL, GPGPU has been enabled on mobile devices. To explore the capabilities of mobile GPU for the acceleration of computer vision algorithms, Wang et 5 al. [66] presented and exemplar-based inpainting algorithm for object removal. Rister et al. [53] presented an implementation of the Scale-Invariant Feature Transform feature detection algorithm on a mobile based GPU to achieved 7× speed-up over optimised GPP implementation. A face detection and recognition system implementation on two GPU architectures are presented in [71] with reported speed-up of approximately 3.7×. A mobile GPU based object detection algorithm with twofold speed-up compared to a similar implementation on a mobile GPP is presented in [3]. The implementation also reported energy savings of up to 84% compared to a smartphone GPP. A GPU enabled architecture for scaling up convolutional networks have been presented in [62]. The explored networks [62] are trained with stochastic gradient distributed machine learning system using 50 replicas on a NVidia Kepler GPU. Deep learning or Convolutional Neural Network (CNN) has become popular in the fields of machine learning and computer vision, because of it's high performance in object detection [33]. Using only GPP, a complex CNN may require more than one month to train [19]. GPUs offer approximately ten fold speed-up compared to GPP, which is demonstrated in [33] for faster training and testing. A number of other computer vision and image processing algorithms [57] [34] [9] have been implemented on GPU mainly to accelerate them for real time needs.

Field Programmable Gate Array (FPGA)
FPGAs are successfully used in many application areas, including embedded computer vision and image processing. The key advantage of FPGAs over conventional CPUs or GPUs is configurability. Resource allocation and memory hierarchy on general purpose processors must perform well across a range of applications, whereas FPGA designs leave many of those decisions to the application designer to optimally use logic gates to implement one specific application. Moreover, they can be significantly faster as their nature supports fine-grained, massively parallel and pipelined execution. FPGAs allows stream processing from camera input and offers parallel execution of processing blocks that resembles the vision system pipeline as depicted in Fig. 1. Various forms of parallelism, e.g., pipeline, task or data parallelism were exploited in FPGA based vision systems [59]. Additionally FPGAs are known for low power execution and vision system designers often exploit this characteristics by using multi-clock domain design paradigm [13]. However on the downside, FPGAs are blamed on programmability aspect as FPGAs are most often specified directly in low level less expressive hardware description languages such as Verilog or VHDL. The intrinsic parallel architecture of FPGAs have also been exploited in a number of application areas including high level feature classification with conventional neural networks [60] [29], convolutional neural networks [52] [18] [12] and architecture specific neural networks [49] [4]. A variant of self-oganising map designed specifically for FPGA is presented in [4] and tested on two computer vision applications; character recognition and appearance-based object identification. The implementation in [4] was achieved using Xilinx Virtex-4 XC4VLX160 and capable being trained with approximately 25,000 patterns every second. Embedded vision systems, implemented on FPGAs are usually evaluated on a few objective measurements including 1) performance measured in throughput (e.g., frames per second or fps); 2) clock frequency; 3) input image frame size; 4) FPGA resource usage (e.g., DSP, BRAM, FF/LUTs) and 5) power consumption. Power consumption on FPGAs consists of a) static power, which is directly proportional to the amount of used logic; and b) dynamic power, which is a weighted sum of several components (these include clock signal propagation power, proportional to clock frequency; signals power, proportional to signal switching rates, among others). The implementation relies on available programmable logic gates available on different FPGA boards from handful of manufacturers, including Xilinx and Altera (now Intel). Table 3 provides a comparative overview of these measurements metrics reported in the literature that are referred earlier in Section 2.

ASIC
Vision based applications and systems are typically associated with high computational cost, slow when implemented on general purpose processors and not very useful in realtime applications. To address some of theses problems, mainly the real-time requirements, most researchers have resulted to the use of dedicated and application specific systems. In [61], Sugiura et al used an application specific instruction-set processor to execute a lossless data compression method as part of a visual prosthesis systems. Deep networks, models for understanding the content of images, videos and audio have been used successfully in various application [40] with relatively high computational cost. Gokhale et al [26] presented a scalable, low-power co-processor for enabling real-time execution of deep neural networks on mobile devices. This was implemented using a large number of parallel operators, optimised to process multiple streams of information. The implementation presented in [26] shows that image understanding with deep networks can be accelerated on custom hardware to achieve better performance per watt. Chen et al. [16] presented an application specific integrated circuit accelerator on a 65nm scale technology, for large-scale convolutional and deep neural networks capable of performing 452 GOP/s of key neural network operations in a small footprint. A convolution chip built on 0.35 µm CMOS technology for event-driven vision sensing and processing is presented in [14]. 7

Future trends and conclusions
In this paper we made a modest effort to review embedded computer vision systems that satisfy application specific constraints e.g., performance or power. The literature is scattered and covers a range of application areas, vision algorithms and target hardware. This paper made an effort to categorize them in an orderly fashion. We identified two emerging trends (described below) in this domain namely, heterogeneous computing and bi-inspired computing for efficient vision systems.

Heterogeneous computing for vision systems
Current computer vision algorithms are highly complex and consist of different functional blocks that are suitable for a variety of targets i.e., CPUs, GPUs or FPGAs. Therefore, designing computer vision systems for single target hardware platform is inefficient and does not necessarily meet performance and power budgets especially for embedded and remote operations. A heterogeneous architecture is a natural alternative but manifests new challenges: -design choices to dissect the algorithm according to their suitability for the target hardware, -interoperability and data flow synchronisations between functional units as different blocks may have different timing constraints. -programmability and coordination between different hardware platforms. There is a need for unified programming environment.
Although recently hardware manufacturers launched new heterogeneous products, e.g., Xilinx Zynq Ultrascale+ MPSoC 1 (CPU, GPU and FPGA) and Altera SoC products 2 (CPU and FPGA), these are not fully exploited in computer vision domain (except handful of recent work, e.g., Zhang et al. [73]) as majority of the existing algorithms are not designed to target heterogeneous platforms. Consideration of target hardware during the algorithmic development cycle is not always necessary and the domain experts often prototype new algorithms using library-rich languages such as MATLAB. However, efficient deployment of these prototypes on a heterogeneous hardware is challenging. Asynchronous data process network [20] may provide a plausible solution to this problem, however requires further research.

Biologically inspired vision systems
The ability to detect moving objects in a scene is a fundamental problem in computer vision. This is a baseline problem that requires detection accuracy as well as computational efficiency to guarantee a successful high level processing in behavioural or event analysis [72]. Various background subtraction methods [25] have been proposed and 8 proven to be successful for detecting moving objects with the use of stationary cameras. These methods build statistical background models and extract moving objects by finding regions which do not have similar characteristics to the background model. Human visual systems processes a very high volume of data and hence it is often selective and activity driven (responsive to the scene event). The high volume data problem is also faced by many modern technical systems like computer vision systems which need to deal with a multitude of image pixels at any point in time. Physiological research has illustrated that biological vision systems use neuronal circuits to extract movement in the visual scenes [38]. Biological visual systems are intrinsically complex hierarchical processing systems with diverse specialised neurons, displaying very powerful specific biological processing functionalities that traditional computer vision techniques have not yet fully emulated [38]. Another important finding during the last decades, that most neuromorphic designers may overlook is the fact that processing of the visual information is not serial but rather highly parallel [23] and hence such implementations should target parallel architectures.
A concept proposed and implemented in [21], shows that motion information can be capture with the use of one retina sheet and two LGN sheets (one ON and one OFF). Orientation preference has successfully been modelled using a Gain Control, Adaptation, Laterally (GCAL) model consisting of four two-dimensional sheets. Solari et al. [58], presented a feed-forward model based on the biological visual system to solve motion estimation problem. The model integrates media temporal (MT) neurons for estimation of optical flow by extending it into a scalable framework. What is missing from their model is the feedback capabilities as perceived in the visual pathway, but the results are very promising and acts as a good starting point for building bio-inspired scalable computer vision algorithms.