Evolving voice and speech recognition technology in the automotive space

In-vehicle technology is seeing unprecedented growth. Within it, voice and speech recognition systems are advancing rapidly in managing navigation and infotainment systems. The automotive voice recognition system market is expected to cross USD 4.87 billion by 2030, growing at a CAGR of 22.48% over the forecast period 2022 to 2030. These advancements are slated not only to increase driver safety and help avoid mishaps, but also to enable frictionless command execution for a better driver and passenger experience.

In this light, Todd Mozer, the CEO of Sensory Inc. and a pioneer in voice and biometrics, sat down with Siddharth Jaiswal, Automotive Practice Head at Netscribes, to discuss the evolving space of speech and voice recognition systems in the automotive realm in 2023 and beyond.


Siddharth: Can you tell us what Sensory is all about? What are its solutions and your journey with it?

Todd: Sure. Sensory began more than 25 years ago as a technology development and licensing company. We started with speech recognition and simple natural language understanding. Over the years, we’ve added biometrics, including face and voice, and we’ve done speech synthesis, sound identification, and a variety of what’s now called deep learning technologies. We started with the vision of enabling products to communicate with people in precisely the way we communicate with each other.

We decided to do this on-device using neural nets. Today we would call it edge computing and shallow learning. We were using many of the technologies popular today quite a few years ago, just with less data. We’ve built real expertise in low-power, edge-based technologies.

In terms of licensing, the experience has been pretty successful. We’ve shipped in over 3 billion products, and our licensees are the biggest names in consumer electronics and technology. Even companies with huge AI teams have licensed our tech because of our unique specialty in small, low-power AI. Google, Amazon, Microsoft, Samsung: all these giants have licensed technology from us. And what’s interesting is that we’ve now come out with not just the tiny low end of technologies, but also cloud-based solutions that challenge the state of the art in AI, and everything in between.


Siddharth: Wow, that’s fascinating! Speaking a bit more about the automotive industry, what kind of work is Sensory doing in this space? What kind of offerings do you have, and where do you fit in this conventional yet emerging ecosystem of automotive tech?

Todd: As I mentioned, we’ve been very successful with low-power, high-accuracy on-device technologies, and on-device has been very important for the automotive industry for a long time, because when you’re driving around you don’t always have an Internet connection. The area we’re probably most famous for is wake words: speaking either a single word or one of a set of short phrases to wake a device, entirely on-device. We’ve done a lot of that type of work in the automotive environment because our wake word tech is the most accurate available today.

We’ve also done some biometrics work and a lot of interesting things like in-car voice assistants, sound identification, and computer vision to watch the driver’s face for drowsiness. These kinds of technologies are in discussion for licenses in the automotive industry. The in-car systems handle all sorts of different voice-driven functions like navigation, infotainment, all car controls, heating, and seat adjustments.

These kinds of things make a lot of sense for voice control, and sound identification can be useful for identifying sirens. We’ve had an interest in even health-related solutions for Ubers and taxis so that if people are coughing in the car, it can be identified. So we’re working on a lot of different things that fit into the automotive environment.


Siddharth: Excellent! No doubt, especially when we’re talking about the future of mobility. Automotive is largely perceived as a very conventional sort of industry with most of the disruption coming in from non-auto companies like Google, Amazon, Microsoft, and so on. What do you think is the role of these companies? I’ve heard you mention that these big tech giants rely on you for certain technologies in speech. So how does this pan out in the automotive context?

Todd: It’s a really good, complex question, and the answer is evolving as well. As you know, Google and Amazon are currently making pretty major cutbacks in their voice assistant divisions. Google and Amazon were reported to be losing around USD 10 billion a year on devices with their assistants. So they’re trying to cut losses in areas where they invested heavily, got into too much competition with each other, and over-invested.

It’s easy to see that in retrospect, and I’d say you can now see that Apple played its cards right by not getting into that kind of marketing battle over assistants. Google had Android Auto for the automotive industry, dropped it and replaced it with Google Assistant driving mode, and that, in turn, was dropped last month. Nevertheless, the Android and iOS ecosystems are always going to play a role in the automotive environment because we bring our phones into our cars.

It seems like automotive manufacturers want to have their own in-house technologies now. For instance: an assistant that runs on-device and handles all the car-related controls, whether it’s navigation, adjusting seats, or similar functions. Then they can turn to third-party assistants for things like “What’s the weather in Stockholm?”, “Tell me my stock quote,” or “Order me a pizza.” Those kinds of functions don’t need to be run by the automotive assistant, and it’s fine to turn to Google or Siri for them. It gets the auto companies out of the escalating cost of a state-of-the-art online voice assistant. Basically, they focus on what they know and how they can differentiate.
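
To make that split concrete, here is a minimal sketch of how such routing might look, assuming a hypothetical keyword-based intent classifier and a hypothetical set of on-device intents; it illustrates the idea, not any manufacturer’s actual implementation.

```python
# Sketch: routing between an on-device car assistant and a third-party
# cloud assistant. CAR_INTENTS, classify_intent, and the handlers are
# hypothetical stand-ins, not a real API.

CAR_INTENTS = {"navigation", "climate", "seat_adjust", "media"}

def classify_intent(utterance: str) -> str:
    """Toy keyword-based classifier (a real system would run an
    on-device NLU model)."""
    keywords = {
        "navigate": "navigation", "directions": "navigation",
        "temperature": "climate", "heat": "climate",
        "seat": "seat_adjust", "radio": "media",
    }
    for word, intent in keywords.items():
        if word in utterance.lower():
            return intent
    return "general_query"

def handle_command(utterance: str) -> str:
    intent = classify_intent(utterance)
    if intent in CAR_INTENTS:
        # Car controls stay on-device: no connectivity required.
        return f"on-device handler -> {intent}"
    # Weather, stocks, pizza: hand off to a third-party assistant.
    return "forwarded to third-party cloud assistant"

print(handle_command("Set the temperature to 70"))
print(handle_command("What's the weather in Stockholm?"))
```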

A lot of interesting stuff is going on. You mentioned these big giants disrupting the automotive industry. I would add to that the rise of independent car companies. Elon Musk’s Tesla is probably the prime example of one that has done much to revolutionize the auto industry and its use of technology.

Some of these giants are very powerful, but automotive companies are demanding a lot of things. They’re using Android, they’re using Siri, and they’re using different voice technologies, but they’re also scared that these giants could come out with cars someday. Apple was purported to be developing a car, and there have been conversations about even Google and Amazon doing similar kinds of things. So it’s a very mixed and interesting environment.


Siddharth: Totally agree. So like you mentioned, conventional OEMs do want to control the in-vehicle experience. They want their systems to interact with the consumers and be more at the center of the whole consumer experience. So since OEMs are very keen to bring in the smartphone-like experience, what kind of solutions does Sensory have in this space?

Todd: Wake words are super important in every experience, automotive or otherwise, because it’s the technology that’s always on. And when something’s always on and always listening, it has the opportunity to fail with false accepts or false rejects at any moment. You don’t want it accidentally going off. So this component of the technology stack is extremely important, and a lot of manufacturers underestimate how much of an effect it has on user experience.
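
As a rough illustration of that tradeoff, here is a minimal sketch of an always-on detection loop; score_frame is a hypothetical stand-in for a tiny acoustic model. Lowering the threshold produces more false accepts, raising it produces more false rejects.

```python
# Sketch: an always-on wake word loop. score_frame() stands in for a real
# on-device acoustic model; the threshold trades false accepts (too low)
# against false rejects (too high).
import random

THRESHOLD = 0.85  # tuned on held-out audio in a real system

def score_frame(audio_frame: bytes) -> float:
    """Hypothetical wake word confidence in [0, 1]."""
    return random.random()  # placeholder for a tiny neural net

def listen(frames):
    for frame in frames:
        if score_frame(frame) >= THRESHOLD:
            yield "wake word detected"  # hand off to the full assistant

# Simulate a stream of audio frames.
for event in listen([b"\x00"] * 100):
    print(event)
```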

Sensory has done a whole lot of work in this area. We not only have wake words that can be predefined with the automotive manufacturer’s name, but also tools that give manufacturers the freedom to create their own wake words and experiment with them; VoiceHub is a free tool that we let companies use. We can add speaker verification to wake words. We can also let the end user who drives the car create their own wake words. So maybe I want my car to wake up when I say, “Hey, Todd Mozer’s car.” I could train it on that.

Sensory has a whole host of wake word-based technologies. We can not only run on a super tiny device with 50 kilobytes of memory, but also revalidate in the car or in the cloud to make detection more accurate, and we can do it on-device with ultra-low power. We have the most complete suite of wake word technologies in the world, and we can do it in any language and on most platforms, including microcontrollers and DSPs, and on any OS.
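
A minimal sketch of that revalidation idea, with tiny_score and big_score as hypothetical placeholder models: a tiny always-on model fires cheaply, and a larger model confirms before the assistant actually wakes.

```python
# Sketch: cascaded wake word verification. tiny_score() mimics a ~50 KB
# always-on model; big_score() mimics a larger revalidation model that
# runs only when the first stage fires. Both are placeholders.

def tiny_score(audio: bytes) -> float:
    return 0.9   # cheap, ultra-low-power, slightly trigger-happy

def big_score(audio: bytes) -> float:
    return 0.95  # expensive, run rarely, more accurate

def wake(audio: bytes) -> bool:
    if tiny_score(audio) < 0.8:        # stage 1: always-on gate
        return False
    return big_score(audio) >= 0.9     # stage 2: revalidate on a bigger model

print(wake(b"hey car"))  # True only when both stages agree
```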

The good news for automotive vendors is that if they have a bad wake word, it’s a really easy component to swap out of their system. If they wanted to change to Sensory, it’s not tied into everything else; it’s a separate, contained solution.

Here’s a quick anecdote. I was at CES a few years ago and went by the Mercedes booth, and they had a “Hey Mercedes!” wake word. Of course, I was interested. It wasn’t Sensory, and I wanted to try it out, and what they told me was, “We’re not showing it on the show floor because it’s too noisy in here.” That’s a really good sign that they need to switch wake words, because cars are noisy.


Siddharth: Stepping to the other end of the spectrum, I remember you mentioning your work in biometrics. We’ve been hearing about it for about a decade, but we haven’t seen it take off. We don’t see it in many cars, or many OEMs showing a keen interest in it. What seems to be the direction of biometric technology within the automotive space?

Todd: I’ll both agree and disagree with you. It hasn’t shown up as much as we would like, but there has been a lot of interest in it, probably just behind the scenes. At different CES shows, we’ve done demos with Chrysler using our face biometrics, and we did demos with Mitsubishi Automotive using our voice biometrics. So we’ve seen a lot of interest in this, and a lot of these use cases are about making things more convenient for customers.

Right now, I share my car with my whole family because I don’t drive to work anymore. So we have fewer cars and we’re sharing them more, and it would be nice if, when I sit down, all my favorite settings were automatically applied. I wouldn’t have to adjust my seat, change the radio, or move destinations around. By looking at my face, the car should be able to do things like that. And as social media and similar services become more important for passengers, you can identify who’s sitting where and cater to their needs, favorites, and histories.

So there’s a lot of interest in biometrics within and outside of the car. Remember the old commercials where somebody walks up to a car with a bag of groceries, wiggles their foot under the back trunk, and the trunk pops open? Well, you don’t need to wiggle your foot anymore with voice biometrics. You can walk up and just say, “Open the door,” and it can pop open securely because it’s the right voice. We’ve seen a lot of interest in these kinds of use cases.
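
For a sense of how that could work, here is a minimal speaker verification sketch that compares an enrolled voiceprint with a new utterance by cosine similarity; embed is a hypothetical stand-in for a real speaker embedding network, not Sensory’s actual system.

```python
# Sketch: speaker verification by comparing a stored voiceprint with a
# new utterance. embed() stands in for a real speaker embedding network.
import math

def embed(audio: bytes) -> list[float]:
    """Hypothetical speaker embedding."""
    return [0.1, 0.9, 0.3]  # placeholder vector

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def unlock(enrolled: list[float], utterance: bytes, threshold: float = 0.8) -> bool:
    # Open the trunk only if the voice matches the enrolled owner.
    return cosine(enrolled, embed(utterance)) >= threshold

owner = embed(b"enrollment audio")
print(unlock(owner, b"open the door"))  # True when the voiceprint matches
```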


Siddharth: Interesting. When companies understand or appreciate a new technology, they start forming consortiums of players who are technically competitors but eager to discuss certain technologies. For technologies such as speech recognition or biometrics, do you see this as a good step toward taking them to the next level?

Todd: Yes. There are a lot of different types of consortiums we could talk about, including standards consortiums. The interesting thing about a standard is that there are often competing standards, so they’re not really standards. Then there are open-source consortiums, which are a very interesting space within AI, because a lot of very high-quality open-source AI systems have become very popular and are really advancing the state of the art.

If you look at giants like Google and Meta, they’re open-sourcing a lot of great technologies that they develop. Just in the last few weeks, ChatGPT from OpenAI has gotten a lot of press coverage, and there have been some really impressive demos. What Sensory has found is that when we experiment with the latest open-source technologies, they’re often really big, and in terms of practical deployment it’s very difficult to deploy them, for example, in a car.

Open-source models are typically built for accuracy over size and often have to run in a cloud environment. Part of what we do is take these state-of-the-art technologies and make them a lot smaller so that they can run on-device, and in an automotive environment you’re just a lot safer with them on-device.
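
One common way to shrink a large model for on-device use is knowledge distillation: training a small student network to mimic a big teacher. The generic PyTorch sketch below illustrates the technique with toy data and layer sizes; it is not Sensory’s actual method.

```python
# Sketch: knowledge distillation. A small "student" learns to match the
# softened outputs of a large "teacher" so the student can run on-device.
# Models and data are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(40, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
student = nn.Sequential(nn.Linear(40, 32), nn.ReLU(), nn.Linear(32, 10))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature: softens the teacher's distribution

for _ in range(100):                 # toy training loop on random features
    x = torch.randn(16, 40)          # stand-in for acoustic features
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    # KL divergence between softened teacher and student distributions
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    opt.zero_grad()
    loss.backward()
    opt.step()
```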

On the standards side, we recently joined the SOAFEE consortium that was started by ARM, and it’s an interesting approach. We joined it, of course, because we work with ARM processors and other components, but the idea is that you can take cloud-based containers and move them on-device in a standard way. So if you have cloud containers, you can run them in the SOAFEE environment.

It’s a nice approach that could lead to a real standard evolving in automotive, which would make it easier to incorporate AI technologies and give automotive manufacturers a little more control over their environment. The more standards that emerge in hardware and software, the easier it is to take out one component and put in another, so that manufacturers can control their architecture and not be reliant on a sole vendor, which may or may not work out for them.


Siddharth: Very interesting. You mentioned there is a lot of sharing happening for building that standard or just kick-starting this technology and bringing it to a level where OEMs recognize this and start building it within their vehicles. Now within the context of scalability where people are using shared architectures, what does it mean for Sensory? And second, what does it mean to the end consumer? Will all of this technology in essence make the experience better, or will it just complicate things?

Todd: Good question. The answer is probably both. A lot depends on how automotive manufacturers incorporate things. If you look at what’s happened with autonomous driving, they’ve invested huge amounts to bring that in-house. If you look at what’s happened with speech recognition, they’ve invested large amounts, not as big, but they haven’t yet brought it in-house. And I think there’s going to be a tendency for automotive companies to build bigger and stronger in-house AI teams so that they can take advantage of open source and other technologies.

And I think the standards that are evolving will help companies like Sensory, because we can help automotive manufacturers bring these things in-house and have state-of-the-art technologies. After all, our expertise is taking big models and deploying them efficiently on-device. We can help automotive manufacturers in this manner, and we have a pretty flexible business model.

We’re a small company, so we’ve been in discussions about source code licensing, so that automotive companies can take our technology stack and train on it. They can take our tools and then take over, not being dependent on us. It’s an interesting state of the speech and AI technology industry: you’ve got these giant technology companies with their technology stacks, which automotive players suspect may someday be competitors.

Meanwhile, the stability of smaller companies is less secure, and manufacturers don’t want to rely on them either. They want to bring the technology in-house and have control over it, without the investment in, or dependence on, the giants. That’s what Sensory is good at helping these companies achieve.


Siddharth: Interesting. You’ve also mentioned the importance of on-device computing. So talking about an autonomous future, what is your opinion or perspective on how autonomous technology will evolve and what would it mean for Sensory in terms of in-vehicle experience?

Todd: I think, at least in the short term, everything has to happen on-device. You can’t risk going external for autonomous driving or speech recognition when you’re in a car. You don’t want to be distracted. When I was talking about wake words, I said they can have false accepts or false rejects. That’s a distraction. It’s a bother, and it hurts the user experience when it fails.

But if you’re doing something even more important, like relying on your car to spot signs or people or make visual decisions of any kind, you can’t have mistakes, or crashes will happen. So for safety reasons, at least in the near future, everything’s going to need to move on-device, and our cars are already giant computers. That’s going to stay the same. In a lot of industries, we’ve seen movement from the device to the cloud and back again over time.

Think about the telephone and telephone answering systems. Initially, you had tiny tape cassettes built into your phone answering machine, probably before you were born. Then it moved into giant cloud solutions; they didn’t call them clouds back then, it was a client-server kind of architecture that would store your messages. And then, with the advent of digital technology, the messages came back to your home devices.

Nowadays we don’t even have that, because people barely use home phone systems; everybody’s got a cell phone. So we’ve seen this back-and-forth between on-device and the cloud. The same is going to happen with automotive. But for things to move to the cloud again, it’s got to be super secure, so that the cloud can always be reached, and I don’t think we’re going to see that level of security for several years.

So cars will stay giant computers, able to carry out all the AI functions fast. An interesting question is how that architecture will play out. We’ve talked to a lot of different manufacturers that want us to stay super tiny even though the car has a lot of resources.


Siddharth: This brings me to my next question. Architectures are changing and moving into more centralized domains, where functions are consolidated so that one master computer handles multiple tasks. In this light, what is your perspective on the way forward for the auto industry in speech technology, and by extension, for Sensory?

Todd: I think the approach is to have a big platform that can do a lot of different things. They’re going to see cost savings with approaches like SOAFEE, which are quite large but enable a lot of different functions to be carried out on the same hardware. In doing that, it’ll give vendors like Sensory more leeway to offer better technologies.

Right now, an automotive company might tell us they want a 50-kilobyte wake word that doesn’t draw much power because of the model’s small size. It’s not going to be as good as a 1 MB wake word model. So we’re always asking whether we can revalidate on-device to make it perform better, and the more they move to architectures that provide more resources to everyone, the better.

It’s interesting what we’re seeing in edge-based computing. A lot of specialized inference engines are emerging on the edge, and not just stand-alone engines but ARM controllers and the like that add specialized hardware functions for inferencing, and that’s letting us do the computing we want much more easily. The issues we’re seeing are a lack of memory and forced quantization for deep learning. We would prefer to use our own quantization approaches, because that’s what we’ve trained on.

But the more we move into some of these predefined inference engines and de facto standards like PyTorch and TensorFlow, the more we have to use their quantization schemes. So a lot is going on there, and it’s an interesting dynamic. But I do think the centralized architecture will emerge, with specialized inferencing capabilities within it.
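
As a concrete example of framework-imposed quantization, here is PyTorch’s built-in post-training dynamic quantization applied to a toy model: the framework’s int8 scheme is used as-is, which is exactly the loss of control over the quantization approach described above.

```python
# Sketch: post-training dynamic quantization with PyTorch's built-in int8
# scheme. The model is an illustrative placeholder.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Quantize the Linear layers' weights to int8 using the framework's scheme.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 40)
print(quantized(x).shape)  # same interface, smaller weights
```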


Siddharth: Great! So here’s my final question. As a leader in this space, what would you like to communicate to our readership about what’s in store? Since we’re on the brink of the new year, what is your prediction for 2023 and beyond?

Todd: My first thought is as a consumer, because most of us are consumers of automotive devices. I would love to have an autonomous RV that can take my wife and me around the country. I’m sixty-two years old, and one of these years I’m going to retire, and it would be really cool to have an autonomous RV I could tell, “Take me to Yosemite,” and it takes me there.

My guess is that we’re a good ten or fifteen years away from that reality, but I think that’ll be a real future need. If Sensory can help move toward that future by making voice commands that always work, we’d be happy to help everyone.

Todd F. Mozer

CEO of Sensory, Inc.

Todd F. Mozer is the CEO of Sensory, Inc., a leading supplier of AI software for both edge and cloud applications, especially in the automotive space. Todd founded Sensory in 1994 and successfully raised venture capital to fund its growth to profitability. At Sensory, he has been involved in both corporate and product-line acquisitions and has worked on incorporating speech recognition into the products of companies such as Amazon, Google, Huawei, Nokia, Samsung, LG, Sony, Toshiba, JVC, and many of today’s leaders in consumer electronics. Sensory technologies have shipped in over 3 billion products from hundreds of leading brands. Mr. Mozer has over 20 patents issued to him in speech technology. He received BA degrees from UC Santa Barbara and an MBA from Stanford University. In his leisure time, he plays musical instruments and runs marathons.