Microsoft’s custom voice recognition service hits public beta
The service lets developers tailor cloud voice recognition for specific scenarios
By Blair Hanley Frank
PCWorld, Feb 7, 2017 11:16 am PST
Companies building applications that use speech recognition have a new machine learning-based tool to improve their work. Microsoft is opening the public beta for its Custom Speech Service, the company said Tuesday.
The service, formerly known as CRIS, lets customers train a speech recognition system for a specific scenario so that it produces more accurate results. For example, the Custom Speech Service can be trained to provide better results in a noisy airport, or set up to work better with voices from a particular group, such as kids or people with different accents.
Right now, the Custom Speech Service works with English and Chinese, but one of its advantages is that it can be trained to work with accents from non-native speakers.
Microsoft is making it available as part of its suite of Cognitive Services, a set of cloud-based tools aimed at opening up the fruits of the company’s artificial intelligence and machine learning research to the rest of the world.
There are currently eight such cognitive services generally available, with an additional 17 in beta. More than 424,000 developers have tried the services since they launched, Microsoft said. Developers all over the world can access the services, many of which are available for purchase through Microsoft Azure.
Each of the services has a free tier with heavy limits on its use, so developers have the freedom to test the APIs out without spending a cent. The Custom Speech Service has a complicated, tiered pricing model that includes a subscription fee along with charges based on the number of voice samples fed into the system and the amount of acoustic adaptation training.
The Custom Speech Service is a key tool in the arsenal of Human Interact, a small game development shop using voice commands as the sole means of interaction for its forthcoming game Starship Commander. Custom speech recognition, along with Microsoft’s Language Understanding Intelligent Service (LUIS), makes up key parts of the voice recognition and understanding system that players use to guide their ship.
The service allows Human Interact to create its own dictionary specific to Starship Commander, which means the system can understand players when they ask about the Ecknians, the game’s alien antagonists. After players’ speech has been converted into machine-readable text, LUIS processes it and translates it into game commands.
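The two-stage flow described above, speech converted to text and text then mapped to a game command, can be sketched roughly as follows. This is a toy illustration only, not Human Interact's code and not the API of the Custom Speech Service or LUIS: simple string replacement stands in for the custom dictionary, keyword matching stands in for intent recognition, and every name here (CUSTOM_VOCABULARY, INTENT_KEYWORDS, GameCommand, to_command) as well as the sample misrecognitions and intents are assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical custom dictionary: likely misrecognitions mapped to a
# game-specific term. A real service would learn this from training data.
CUSTOM_VOCABULARY = {
    "eck knee ans": "Ecknians",
    "ek nians": "Ecknians",
}

# Hypothetical intents, each triggered by simple keywords. This stands in
# for a trained language-understanding model.
INTENT_KEYWORDS = {
    "ask_about_enemy": ("ecknians",),
    "set_course": ("set course", "fly to"),
}


@dataclass
class GameCommand:
    intent: str      # name of the matched intent, or "unknown"
    utterance: str   # transcript after vocabulary correction


def apply_vocabulary(transcript: str) -> str:
    """Lowercase the transcript and swap in game-specific terms."""
    text = transcript.lower()
    for heard, term in CUSTOM_VOCABULARY.items():
        text = text.replace(heard, term)
    return text


def to_command(transcript: str) -> GameCommand:
    """Map a raw transcript to the first intent whose keyword appears."""
    text = apply_vocabulary(transcript)
    lowered = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return GameCommand(intent, text)
    return GameCommand("unknown", text)
```

Calling to_command("Tell me about the eck knee ans") would first correct the transcript to mention the Ecknians and then match the hypothetical ask_about_enemy intent. Keeping the vocabulary step separate from the intent step mirrors the split the article describes between the two services.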
Both systems are important to the core gameplay of Starship Commander. Human Interact set out to make an interactive virtual reality experience that was accessible to a wide range of players, not just those who have been playing video games for years, creative director Alexander Mejia said.
“The answer was stupidly clear,” Mejia said. “What if you just talk to somebody? I mean, if we put a person in front of you, and they start talking to you, would you talk back?”
To that end, the company opted to use the microphones that are built into the Oculus Rift and Gear VR systems and create a game that feels like a much more open-ended and immersive choose-your-own-adventure book.
Microsoft is far from the only company providing machine learning-based cloud voice recognition, but its services were the best for what the team is doing, Mejia said. The services provide what the team needs for not only custom dictionaries, but also fast response times and the ability to see and validate the results that the voice recognition system puts out.
Two other cognitive services from Microsoft will reach general availability next month. The Content Moderator service is designed to automatically block objectionable content in text, videos, and images while allowing for human review of questionable cases. It can detect profanity in more than 100 languages and allows customers to include custom lists of objectionable text as well.
The Bing Speech API is designed to give developers an easy, generalized way to convert speech to text and vice versa. It supports voice recognition in 18 languages and dialects from 28 countries, including German, French, Chinese, Spanish, and Arabic. Developers can also use the API to do text-to-speech work in 10 languages with support for dialects from 18 countries.
Microsoft is battling with a number of other cloud companies in this area, including Google, Amazon, and IBM, which each have their own set of machine intelligence-based tools.