TL;DR This post presents a dual-cam first-person vision translation system for sign language using convolutional neural networks. A prototype was developed to recognize 24 gestures. The vision system consists of a head-mounted camera and a chest-mounted camera, and the machine learning model consists of two convolutional neural networks, one for each camera.
Sign language recognition is a problem that has been addressed in research for years. However, we are still far from a complete solution being available in our society.
Among the works developed to address this problem, the majority have been based on two approaches: contact-based systems, such as sensor gloves, or vision-based systems, which use only cameras. The latter is far cheaper, and the rise of deep learning makes it even more appealing.
This post presents a prototype of a dual-cam first-person vision translation system for sign language using convolutional neural networks. The post is divided into three main parts: the system design, the dataset, and the deep learning model training and evaluation.
Vision is a key factor in sign language: every sign language is intended to be understood by a person standing in front of the signer, since from that perspective a gesture is completely observable. Viewed from another perspective, a gesture becomes difficult or nearly impossible to understand, because not every finger position and movement is observable.
Trying to understand sign language from a first-person perspective has the same limitation: some gestures end up looking the same. However, this ambiguity can be resolved by placing more cameras at different positions, so that what one camera cannot see is perfectly observable by another.
The vision system is composed of two cameras: a head-mounted camera and a chest-mounted camera. With these two cameras we obtain two different views of each sign, a top view and a bottom view, which work together to identify signs.
Another benefit of this design is that the user gains autonomy, something classical approaches do not achieve: in those, the camera is operated not by the person with the disability but by a third person, who has to take out a camera system and point it at the signer while the sign is being performed.
To develop the first prototype of this system, a dataset of 24 static signs from the Panamanian Manual Alphabet was used.
To model this as an image recognition problem, dynamic gestures such as the letters J, Z, RR, and Ñ were discarded because of the extra complexity they add to the solution.
To collect the dataset, four users were asked to wear the vision system and perform every gesture for 10 seconds while both cameras recorded at a 640×480 pixel resolution.
The users were asked to repeat this process in three different scenarios: indoors, outdoors, and in front of a green background. For the indoor and outdoor scenarios, the users were asked to move around while performing the gestures in order to obtain images with different backgrounds, light sources, and positions. The green-background scenario was intended for a data augmentation process described later.
After recording the videos, the frames were extracted and downscaled to a 125×125 pixel resolution.
Since preprocessing before the convolutional neural networks was simplified to just rescaling, the background is always passed to the model. The model therefore needs to be able to recognize a sign despite the different backgrounds it can appear against.
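The post does not include code, but the rescaling step can be sketched as follows. This is a minimal NumPy nearest-neighbor downscale for illustration only; the original work very likely used a library resizer such as OpenCV's, and the function name here is my own.

```python
import numpy as np

def downscale(frame: np.ndarray, size: int = 125) -> np.ndarray:
    """Nearest-neighbor resize of an H x W x 3 frame to size x size."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return frame[rows[:, None], cols]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # one extracted 640x480 frame
small = downscale(frame)
print(small.shape)  # (125, 125, 3)
```

Note that this keeps the raw pixel values untouched: no segmentation or background removal happens before the network, which is exactly why the model must cope with arbitrary backgrounds.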
To improve the generalization capability of the model, additional images were generated artificially by replacing the green backgrounds with different backgrounds. In this way, more data was obtained without investing too much time.
During training, another data augmentation process was added, consisting of transformations such as rotations, changes in light intensity, and rescaling.
These two data augmentation processes were chosen to help improve the generalization capability of the model.
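The green-background replacement can be sketched with a simple chroma key: mask the green-dominant pixels and copy in a new background. This is an assumption about how it was done; the threshold value and function name are illustrative, not taken from the thesis.

```python
import numpy as np

def replace_green(frame: np.ndarray, background: np.ndarray,
                  threshold: int = 60) -> np.ndarray:
    """Replace green-dominant pixels in `frame` with pixels from `background`.

    A pixel counts as green screen if its green channel exceeds both the
    red and blue channels by `threshold` (a simplistic chroma key; the
    threshold is an assumed value).
    """
    r = frame[..., 0].astype(int)
    g = frame[..., 1].astype(int)
    b = frame[..., 2].astype(int)
    mask = (g - r > threshold) & (g - b > threshold)
    out = frame.copy()
    out[mask] = background[mask]
    return out

# Toy example: a pure-green frame is fully replaced by the new background.
green = np.zeros((4, 4, 3), dtype=np.uint8)
green[..., 1] = 255
bg = np.full((4, 4, 3), 10, dtype=np.uint8)
print(replace_green(green, bg)[0, 0])  # [10 10 10]
```

Running this over the green-background recordings with many candidate background images yields a much larger training set at almost no labeling cost.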
This problem was modeled as a multiclass classification problem with 24 classes, which was in turn divided into two smaller multiclass classification problems.
To decide which gestures would be classified by the top-view model and which by the bottom-view model, all gestures that looked too similar from the bottom-view perspective were assigned to the top-view model, and the remaining gestures were assigned to the bottom-view model. Essentially, the top-view model was used to resolve ambiguities.
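The class split can be expressed as a simple partition plus a routing rule. The set of bottom-view-ambiguous letters below is a hypothetical example (the real assignment is in the thesis table), but the 24-letter list matches the static signs described above.

```python
# 24 static letters: the manual alphabet without the dynamic J, Z, RR, and Ñ.
ALL_SIGNS = list("ABCDEFGHIKLMNOPQRSTUVWXY")

# Hypothetical set of signs that look alike from below; illustrative only.
AMBIGUOUS_FROM_BELOW = {"R", "U", "V", "W"}

top_view_classes = sorted(AMBIGUOUS_FROM_BELOW)
bottom_view_classes = sorted(set(ALL_SIGNS) - AMBIGUOUS_FROM_BELOW)

def route(sign: str) -> str:
    """Return which model is responsible for classifying a given sign."""
    return "top" if sign in AMBIGUOUS_FROM_BELOW else "bottom"

print(route("U"), route("A"))  # top bottom
```

Splitting the label space this way keeps each network small and lets the head-mounted view do only the disambiguation work the chest-mounted view cannot.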
As a result, the dataset was divided into two parts, one for each model, as shown in the following table.
As the state-of-the-art technology, convolutional neural networks were the option chosen for this problem. Two models were trained: one for the top view and one for the bottom view.
The same convolutional neural network architecture was used for both the top-view and bottom-view models; the only difference is the number of output units.
The architecture of the convolutional neural networks is shown in the following figure.
To improve the generalization capability of the models, dropout was applied between the fully connected layers.
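For reference, inverted dropout, the variant used by modern frameworks, can be sketched in a few lines. The 0.5 rate is a common default, not necessarily the value used in the thesis.

```python
import numpy as np

def dropout(activations: np.ndarray, rate: float = 0.5,
            training: bool = True, rng=None) -> np.ndarray:
    """Inverted dropout: randomly zero units during training and rescale
    the survivors, so activations need no adjustment at inference time."""
    if not training or rate == 0.0:
        return activations
    rng = rng or np.random.default_rng(0)
    keep = rng.random(activations.shape) >= rate  # Boolean keep mask
    return activations * keep / (1.0 - rate)

x = np.ones((2, 8))
y = dropout(x, rate=0.5)            # roughly half the units zeroed
print(dropout(x, training=False))   # unchanged at inference time
```

By randomly silencing units in the dense layers, the network cannot rely on any single feature, which is exactly the kind of regularization a small, background-heavy dataset like this one needs.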
The models were evaluated on a test set with data corresponding to normal indoor use of the system; in other words, a person acting as the observer appears in the background, similar to the input image in the figure above (convolutional neural network architecture). The results are shown below.
Although the models learned to classify some signs, such as Q, R, and H, in general the results are somewhat discouraging: the generalization capability of the models was not very good. However, the models were also tested with real-time data, showing the potential of the system.
The bottom-view model was tested with real-time video against a uniform green background. I wore the chest-mounted camera, capturing video at 5 frames per second, while running the bottom-view model on my laptop and trying to fingerspell the word fútbol (my favorite sport, in Spanish). The entry for each letter was confirmed with a click. The results are shown in the following video.
Note: because of the model's performance, I had to repeat the process several times until I ended up with a good demo video.
Sign language recognition is a hard problem if we consider all the possible combinations of gestures that a system of this kind needs to understand and translate. That being said, probably the best way to solve this problem is to divide it into simpler problems, and the system presented here would correspond to a possible solution to one of them.
The system didn't perform very well, but it demonstrated that a first-person sign language translation system can be built using only cameras and convolutional neural networks.
It was observed that the model tends to confuse several signs with each other, such as U and W. On reflection, though, perhaps it doesn't need perfect performance: adding a spelling corrector or a word predictor would increase the translation accuracy.
The next step is to analyze the solution and study ways to improve the system. Some improvements could be made by collecting more high-quality data, trying more convolutional neural network architectures, or redesigning the vision system.
I developed this project as part of my thesis work at university, motivated by the feeling of working on something new. Although the results weren't great, I think it can be a good starting point for building a better and bigger system.
If you are interested in this work, here is the link to my thesis (written in Spanish).
Thanks for reading!