In July 2015, the Google Translate app added real-time visual translation for 20 additional languages. So we’ve got your back the next time you’re in Prague and can’t read a menu. But how does the app actually read and translate these new languages?
How the Computer Vision Works
When Google Translate receives a camera image, it must first locate the letters in it. It needs to pick out the words we want translated while ignoring background elements like trees or cars. To do this, it looks for clusters of pixels of similar color that sit close to other such clusters. These might be letters, and if they are close to one another, they form a continuous line of text that we should read.
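The idea of grouping nearby pixels into candidate letters can be sketched as a connected-components pass over a binary "ink" mask. This is a simplified illustration, not Google's actual detector; the function name and the 4-connected flood fill are my own choices.

```python
from collections import deque

def find_blobs(mask):
    """Group adjacent 'ink' pixels (truthy cells) into connected blobs.

    Each blob is a candidate letter; blobs that sit close together
    along a line can later be grouped into words.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first flood fill over 4-connected neighbours.
                blob, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    blob.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                blobs.append(blob)
    return blobs

# Two separate clusters of dark pixels -> two candidate letters.
mask = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(len(find_blobs(mask)))  # 2
```

A real system would also filter blobs by size and aspect ratio, and compare colors rather than a hard binary mask, but the grouping principle is the same.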
Second, Translate has to recognize what each letter actually is. This is where deep learning comes into play: to teach a convolutional neural network what different letters look like, we train it on examples of both letters and non-letters.
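The core operation in such a network is convolution: sliding a small learned filter over the image and recording how strongly each patch matches it. The toy sketch below, with a hand-written vertical-stroke filter, only illustrates that one operation; a trained network learns many such filters (and combinations of them) from letter/non-letter data.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image and
    record how strongly each patch matches it."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A vertical-stroke detector: responds strongly where the image
# contains a vertical bar, weakly elsewhere.
vertical = np.array([[-1, 2, -1],
                     [-1, 2, -1],
                     [-1, 2, -1]])

letter_l = np.array([[0, 1, 0, 0],
                     [0, 1, 0, 0],
                     [0, 1, 0, 0],
                     [0, 1, 1, 1]])

response = convolve2d(letter_l, vertical)
print(response.max())  # 6.0 -- strong response on the vertical stroke
```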
The final stage is to look up the recognized letters in a dictionary to produce the translation. The lookup must be approximate, because every step before it could have introduced an error. That way, even if we read an “S” as a “5”, the word “5uper” is still recognizable as “super”.
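Approximate lookup is commonly done with edit distance: find the dictionary entry reachable from the OCR output with the fewest single-character changes. A minimal sketch, assuming a plain Levenshtein distance and a hypothetical `approximate_lookup` helper (the source doesn't specify Google's actual matching method):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete from a
                           cur[j - 1] + 1,               # insert into a
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def approximate_lookup(word, dictionary, max_dist=1):
    """Return the dictionary entry closest to the OCR output,
    if any entry is within max_dist edits."""
    best = min(dictionary, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_dist else None

dictionary = ["super", "soup", "sugar"]
print(approximate_lookup("5uper", dictionary))  # super
```

The `max_dist` cutoff keeps a badly misread word from matching some unrelated dictionary entry.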
Finally, we render the translation over the original words, in the same style as the original text. Because we have already found and read the letters in the image, we know exactly where they are. We can sample the colors surrounding the letters and use them to paint over the originals, then draw the translation on top in the original foreground color.
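The erase-then-redraw step can be illustrated on a simple character canvas, where the "background color" is just a fill character. This is a deliberately simplified stand-in for pixel-level inpainting and text rendering; the function and the Czech menu word are illustrative only.

```python
def overlay_translation(canvas, row, col, width, translated, bg=" "):
    """Replace a detected word on a character canvas: erase its cells
    with the sampled background, then write the translation in place."""
    line = list(canvas[row])
    # 1. Erase the original word using the surrounding background.
    for c in range(col, col + width):
        line[c] = bg
    # 2. Draw the translation at the same position (clipped to the box).
    for i, ch in enumerate(translated[:width]):
        line[col + i] = ch
    canvas[row] = "".join(line)
    return canvas

menu = ["...polevka..."]  # 'polevka' detected at column 3, width 7
print(overlay_translation(menu, 0, 3, 7, "soup")[0])  # ...soup   ...
```

On a real image the same two steps happen per pixel: fill the word's bounding box with interpolated background color, then rasterize the translated string in a matching font and foreground color.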
How Google Translate Runs on the Phone
Now, this wouldn’t be too difficult if Google could perform the visual translation in its data centers. But many users, particularly those who are just starting to use the internet, have slow or erratic network connections and phones with very limited processing power. A good laptop is already much slower than the data centers that typically power image recognition systems, and these low-end phones can be about 50 times slower than a good laptop. So how does Google manage real-time visual translation on such phones, with a moving camera and no cloud connection?
Google had to build a very small neural network and place severe limits on what it tried to teach it, in effect setting an upper bound on the amount of information the network can handle. The challenge then became producing the best possible training data. Because Google generates its own training data, it worked hard to include exactly the right data and nothing more. For instance, the network should recognize letters with a moderate amount of rotation, but if the rotation is overdone, the network wastes too much of its limited capacity on images it will never encounter. So Google made an effort to build tools that provided fast iteration times and good visualizations.
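The "moderate rotation, but not too much" idea corresponds to bounding the augmentation range when generating synthetic training letters. A minimal sketch, assuming a hypothetical 15-degree cap (the source does not state Google's actual limits):

```python
import random

def augmented_angles(n, max_rotation=15.0, seed=0):
    """Sample rotation angles for synthetic training letters, capped at
    a moderate range: the network should tolerate slightly tilted text
    without spending capacity on upside-down letters it will never see."""
    rng = random.Random(seed)
    return [rng.uniform(-max_rotation, max_rotation) for _ in range(n)]

angles = augmented_angles(1000)
print(max(abs(a) for a in angles) <= 15.0)  # True
```

Each sampled angle would then be applied to a rendered letter image before it is added to the training set; the cap is exactly the kind of knob that fast iteration and good visualizations help you tune.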
Google can generate training data, retrain on it, and visualize the results in a short period of time, then examine which kinds of letters are failing and why. At one point, the training data was being warped too aggressively, causing “$” to be mistaken for “S.” That was quickly spotted and fixed by adjusting the warping parameters. It was like trying to paint a picture of the letters you would actually see in real life, complete with all of their imperfections.