Why Voice Interfaces Are an Evolutionary Dead End and the Future Is in "Thought"
Imagine this scene, which for many is perhaps not imagination but vivid reality: you are in an immense open-plan office, born from the imagination of a designer who believed that transparency and collaboration were the key to innovation. Around you are dozens of colleagues, each immersed in their own screen, amid colorful noise-canceling headphones, clattering mechanical keyboards, ringing phones, and people talking and laughing in the middle of the room.
Right at that moment, you need to interact with your favorite AI; you need to reply to an email or get a summary of the last meeting. You need to tell your wife that for Christmas you want to get a "clock" for your father-in-law, but the AI keeps hearing "cock" and reminds you that it is an inappropriate gift for a relative.
What do you do? Do you scream "be quiet!" like a madman, retreat to an isolated corner of the building hoping no one passes by to interrupt you, or lock yourself in the bathroom hoping no one thinks you're crazy?
The Captain Kirk Syndrome
A childhood spent watching Star Trek, where Captain Kirk issued complex orders to the ship's computer in the absolute silence of the Enterprise bridge, leaves us with a romantic expectation of the future. That future is achievable in the silence of one's own room, but almost unachievable in a world where the open-plan office reigns as the sole tool of socialization and privacy is a necessity for almost everyone.
Let's reflect on this: are we convinced that this is the future? Are we certain that the pinnacle of technological evolution consists of talking to oneself in an empty room or shouting commands in a noisy environment hoping that a Natural Language Processing algorithm understands the difference between "clock" and "cock" when you talk to your partner?
What If It Was All an Input Problem?
If we carefully analyze all the technology we have churned out in recent years, we notice a worrying asymmetry. We have built an exceptional output infrastructure, but we have remained in the Stone Age regarding input.
Let's try asking ourselves a question: what if we were getting it all wrong? What if the era of voice assistants wasn't the destination, but just a clumsy transitional phase, the classic temporary patch awaiting a real solution? What if we were still in a sort of primitive era where our vocal cords laboriously vibrating in the air are the punched cards of this entire system?
Let's explore a different type of input, where the future is not "speaking" but "thinking". The voice is an unsustainable bottleneck for the bandwidth of our brain: it is constrained by how we emit sounds, by the environment, and by pronunciation. Thought is not: it is faster than speech, it is not affected by noise, and it keeps everything we need to say completely private.
Get ready to question your smart speaker and to look at your headphones with different eyes. In one of the hypothetical futures of our metaverse, talking to computers will provoke the same astonishment that a rotary dial telephone provokes in Generation Z kids today.
The Man-Machine Interface Asymmetry
To understand the problem with current interfaces, we must first analyze the context in which we operate. We live in a technological paradox. We have brilliantly solved half of the communication problem: we can generate video, audio, and text at the speed of light, producing mountains of data. The other half we have left in a state of dysfunctional chaos.
In recent years, billions have been invested to perfect the way machines send information to our brain through the auditory channel. New hardware and evolved algorithms have allowed us to have listening experiences so immersive that we can close our eyes and imagine ourselves in other worlds: we are not yet in Holodecks, but that is the path.
From the scratchy speakers of gramophones to the high-fidelity wearable devices we find in our pockets, the leap has been enormous. One of the key innovations has been the spread of Active Noise Cancellation, a technology capable of "listening" to the chaos in which we are immersed—traffic, noisy colleagues, vacuum cleaners—and canceling it in real-time with a sound wave of opposite phase.
This technology creates an acoustic "privacy bubble" making us perceive only the sound coming from the headphones we wear, even if it completely alienates us from the outside world.
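The "opposite phase" idea can be illustrated with a minimal sketch. It assumes the noise is a single steady tone, which real noise is not, and every name and number in it (`sine_wave`, `rms`, the 200 Hz hum, the sample rate) is a hypothetical choice made for illustration, not a description of how any real headphone works.

```python
import math

SAMPLE_RATE = 8000   # samples per second (assumed for the example)
NOISE_FREQ = 200.0   # Hz, a stand-in for a steady background hum (assumed)
N_SAMPLES = 800      # 0.1 seconds of signal, i.e. 20 full periods of the hum


def sine_wave(freq_hz, phase=0.0):
    """Return N_SAMPLES of a unit-amplitude sine wave with the given phase."""
    return [
        math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE + phase)
        for i in range(N_SAMPLES)
    ]


def rms(samples):
    """Root-mean-square level: a simple measure of how loud a signal is."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix(a, b):
    """What reaches the ear: the sample-by-sample sum of two sound waves."""
    return [x + y for x, y in zip(a, b)]


noise = sine_wave(NOISE_FREQ)

# Ideal anti-noise: the same wave shifted by half a cycle (pi radians).
ideal_residual = mix(noise, sine_wave(NOISE_FREQ, phase=math.pi))

# Imperfect anti-noise: a small phase error leaves part of the noise audible.
imperfect_residual = mix(noise, sine_wave(NOISE_FREQ, phase=math.pi + 0.3))

print("noise level:        ", rms(noise))
print("ideal residual:     ", rms(ideal_residual))
print("imperfect residual: ", rms(imperfect_residual))
```

With a perfect half-cycle shift the two waves sum to practically nothing. With even a small timing error a residual remains, which hints at why cancellation has to track the noise in real time.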
What have we gained from this technological progress? Three key benefits:
- Digital Intimacy: Bone conduction technology and "open ear" earphones allow digital audio to be superimposed on physical reality without blocking the auditory canal. You can hear your wife asking where you put the salt while you are in a meeting, and your colleagues' voices reach your brain as if they were implanted thoughts.
- Channel Ubiquity: We can receive output everywhere. We can listen to an audiobook while grocery shopping, have a WhatsApp notification read out during a meeting, or be guided by GPS while riding a bike.
- Quality and Humanization: Speech synthesis has reached uncanny levels of realism. Compared with the first robotic voices, current neural models generate voices that are almost indistinguishable from human ones, with emotional inflections, pauses for breathing, and variations in tone.
As far as output is concerned, technology has made giant strides.
The Failure of Input: The Voice Bottleneck
Well, fantastic, exceptional: we can receive much more data than we are able to assimilate. But when we have to provide the data? In this case, technology stumbles. Voice input suffers from physical, social, and cognitive limitations that at the moment no technology seems to be able to solve in the short term.
It is not a tool problem: it is a physical interface problem. Voice interfaces assume an ideal environment that rarely exists in real life: silence or controlled noise. Without this starting point, the risk of mixing real data with out-of-context data is very high.
The real world is full of acoustic chaos: background music, coughing, TVs on, neighbors talking. In this context, voice input becomes problematic and unreliable.
The problems fall into three macro-categories:
- The Cocktail Party Effect: The human brain manages to isolate a single voice in a crowded room; it is a natural process called "selective attention". It is the reason why, at a restaurant table, we can focus on our dining companions and yet, if we want to, concentrate entirely on the conversation at the next table. For a microphone, it is a huge algorithmic challenge: despite current technologies, the error rate rises sharply in noisy environments.
- The Loudness War: In a noisy environment, whoever wants to prevail over the noise of others tends to raise their voice. This turns an interaction that should be fluid into a physical clash, with high stress and a progressive loss of privacy.
- Vocal and Cognitive Fatigue: Talking to a computer requires effort. Words must be articulated and sentences built so that they are easy to understand, leaving nothing implicit, so as not to risk repeating the same phrase a thousand times. This creates a cognitive and physical load that does not exist with fluid thought or typing.
The Absence of Privacy in Voice Input
How many times have you found yourself in the middle of a group of people and felt embarrassed to book a proctologist appointment? How many times did you not want to let it be known that you were looking for a house, or that you had family problems?
There is an insurmountable psychological barrier in using voice in public:
- Privacy Violation: Sending a voice message with your shared account password in the middle of a train car is not the best choice. Asking an AI how to treat venereal diseases could start uncontrolled rumors about you among colleagues. Sending a message to your accountant on delicate tax issues could expose information you don't want the world to know. The voice interface makes public what should be private.
- Disturbance of Peace: In some public places, the rule of "silence" applies. In a "modern" office, if everyone spoke with their AIs simultaneously to manage emails, calendars, and searches, the noise level would make working impossible. Many open-plan offices explicitly ban loud conversations, which makes the voice interface socially unacceptable.
- The Madman Syndrome: Talking to an object, even if the practice is starting to become socially accepted, makes you look like a madman. Haven't you ever thought so when seeing someone alone in a car, screaming and gesturing?
Input Channel Bandwidth
If we think about the ways in which humans produce input towards machines, we can classify interfaces into three main methods:
- Thought: The fastest and most natural data flow, occurring directly in the brain, multimodal and parallel.
- Voice: Slow, ambiguous, influenced by the environment, requires physical and cognitive effort.
- Writing: Very slow, requires the use of hands, but precise and structured.
Speaking is like trying to download a 4K movie using an old modem connection. If we use voice as input, we have to take a complex concept, compress it into imperfect words, and hope that the machine correctly understands the message and returns a relevant result.
It is a lossy compression in which much of the nuance is lost: using voice in this way inexorably degrades quality.
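To make the bandwidth comparison concrete, here is a back-of-the-envelope sketch. The rates are rough, illustrative assumptions and not measurements from this article: about 40 words per minute for writing and about 150 for voice are common ballpark figures, while the "thought" rate is a purely hypothetical placeholder. `InputChannel` and `seconds_to_convey` are names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputChannel:
    name: str
    words_per_minute: float  # assumed, illustrative rate


def seconds_to_convey(channel: InputChannel, word_count: int) -> float:
    """Time needed to push word_count words through the channel."""
    return word_count / channel.words_per_minute * 60.0


# Illustrative assumptions only; the "thought" figure is a placeholder.
CHANNELS = [
    InputChannel("writing", 40.0),
    InputChannel("voice", 150.0),
    InputChannel("thought (hypothetical)", 600.0),
]

MESSAGE_WORDS = 300  # roughly a long email

for channel in CHANNELS:
    seconds = seconds_to_convey(channel, MESSAGE_WORDS)
    print(f"{channel.name:<24} {seconds:6.0f} s")
```

Whatever the real numbers turn out to be, the shape of the argument is the same: the time spent on input scales inversely with the rate of the channel, so a faster channel shortens every single interaction.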
Are We Primitive?
We are probably trying to mount a warp drive on a horse-drawn carriage. We are dazzled by the power of the engine, but we forget that the transmission system is obsolete.
We are at a peculiar moment in history. We have almost unlimited access to Large Language Models and, like Captain Kirk, we think we live in an advanced future, but the interface with which we feed these machines is millennia old. The human vocal apparatus evolved for short-range tribal communication; it was not designed for high-speed, high-density human-machine interaction.
As a result, we reduce the complexity of the interaction: we simplify and linearize information, hiding the options and nuances that a visual or textual interface could show. In doing so, we are training AIs to be terse and superficial oracles rather than deep analysis tools, simply because asking complex things by voice is extremely tiring.
If we want to move faster, we must rethink the input interface, towards a mechanism that allows us to communicate with machines at a higher speed.
What If We Used Thought?
The highest-performance input we have available is thought: fast, precise, silent, and private. So why not replace the voice input tool with a thought-based input tool?
Some researchers asked exactly this question, and two main schools of technological thought emerged:
- The non-invasive approach: using micro-movements of the face
- The direct and invasive approach: sensors implanted in the brain
There is a project at MIT called AlterEgo (https://www.media.mit.edu/projects/alterego/overview/) that uses subvocalization as input.
The concept is quite simple: reading the motor intent of the word. When we read silently or speak "in our head", our brain sends very weak neuromuscular signals to the speech organs and, even if we do not emit any sound, it is possible to intercept these signals. This physiological phenomenon is called subvocalization.
The device looks like a headset with an arm that reaches from the ear to the chin. It intercepts the neuromuscular signals the brain sends to the speech muscles and translates them into words.
With such a device, we elegantly solve a series of problems related to voice without resorting to harmful or invasive equipment: no privacy problems, no noise problems, no social embarrassment.
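The step from "signals" to "words" can be pictured with a toy sketch. This is not AlterEgo's actual method, which is not described here. It assumes each silent word has already been reduced to a short feature vector and simply picks the closest known template. The vocabulary, the numbers, and the names `TEMPLATES`, `euclidean`, and `decode` are all invented for illustration.

```python
import math

# Hypothetical templates: one made-up feature vector per silent word.
# A real system would learn these from recorded signals.
TEMPLATES = {
    "yes": [0.9, 0.1, 0.2],
    "no": [0.1, 0.8, 0.3],
    "stop": [0.2, 0.3, 0.9],
}


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def decode(features, templates=TEMPLATES):
    """Return the word whose template is closest to the observed features."""
    return min(templates, key=lambda word: euclidean(features, templates[word]))


# A noisy reading that resembles the "yes" template.
print(decode([0.85, 0.15, 0.25]))
```

The point of the sketch is only the shape of the pipeline: capture a signal, turn it into numbers, and match those numbers against a known vocabulary, all without a sound ever leaving the mouth.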
If, on the other hand, we allow a company to drill a small hole in our head, we can experience the final frontier of technology: Brain-Computer Interfaces (BCI). If AlterEgo reads muscles, BCIs read neurons directly.
Companies like Neuralink, Synchron, Precision Neuroscience, and others are trying to completely skip the physical mediation of the body.
The goal is to bypass the mouth and hands entirely: read the brain's electrical signals directly, decode them, and send them to the computer. The promise is input at the speed of thought, with no noise, no privacy problems, and an immediate link between man and machine.
This future is still far away, but the progress made in recent years is impressive, and it already lets us glimpse the problems this technology brings, including its potential for social control.
The goal towards which everyone is trying to go is to increase bandwidth at the root. If we can intercept the intent before it propagates into the body, we will be able to communicate with machines at the speed of thought.
How Far Are We from Man-Machine Fusion?
Studies from Stanford University (https://www.lescienze.it/news/2023/08/30/news/paralizzato_dispositivi_lettura_cervello_parlare-13239619/) have shown that speech decoded from brain activity can reach speeds that begin to approach natural conversation and exceed average smartphone typing.
Unlike AlterEgo, high-performance BCIs require invasive surgery. Until this "small" problem is resolved, BCIs will remain confined to extreme medical cases: quadriplegics, patients with ALS, and other conditions that conventional medicine cannot treat. It is difficult, though not impossible, to imagine a healthy person choosing to have a chip implanted in their brain to improve their productivity at work.
Unfortunately, or perhaps it would be better to say "fortunately", our skull was designed to protect the brain from external damage, and any surgery carries significant risks of infection, rejection, and neurological damage.
Non-invasive BCIs also exist, although at the moment they do not match the performance of invasive implants.
And When Will We Become Machines?
Once we accept that current input mechanisms perform poorly, and that massive marketing campaigns could make invasive or non-invasive BCIs seem natural to us, we must start asking what the pros and cons of these technologies are.
From a certain point of view, it is scary, also because slowness has always been a protection against potential disasters: what will happen if we no longer have this barrier? We are creating Thanos' gauntlet and putting it in anyone's hands.
We must question all the "involuntary" thoughts that could be intercepted by these devices: are we sure we want to open the door to our mind? How many times have you thought something embarrassing and didn't say it out loud? And what if a device were able to intercept these thoughts? Probably almost all of us would go to prison for some thought crime.
Think of all the moments of anger in which you would have thrown your boss out of the window. Now imagine an automaton connected to your thoughts and ready to act on them: how dystopian would our future be?
And let's talk about hacking: it would be enough to manipulate a thought to create enormous damage.
With great power comes great responsibility
Is Transhumanism Our Future?
Among those who support BCIs, the most common mantra is: "If you can't beat them, join them".
In a world where machines, little by little, will become smarter and faster than us, it is useless to try to compete: we must merge our mind with theirs. The Borg Queen would be proud of us.
Increasing the volume of data flowing from us to machines makes our interactions faster and more efficient, and this "should" improve our lives.
We Are in Full Evolutionary Refactoring
Are we ready to leave the "noise" to embrace the "signal"?
The right to slowness, to reflection, to taking time to think, is something that millions of years of evolution have given us to protect us from errors. What we see as a limit could actually be our lifeline.
In twenty years, when our children ask us why we still type on the keyboard instead of thinking directly at the computer, what will we answer?
Are we cavemen who learned to play with sand or are we aware people who understood that moving at the speed of thought can create more damage than opportunity?
The future is not inevitable. The future is a sum of choices we make every day. We can choose which type of interface to use to connect with an apparatus we have built, but every time we make these choices we make a decision that defines us.
The voice interface is technologically limited, but it represents a fundamental compromise: maintaining a distance between us and what we have created.
New interfaces promise to eliminate this filter, painting it as a limit, but perhaps it is precisely the limit that protects us and avoids the loss of human contact.
Clean code has taught us that the best refactoring is the one that helps us better understand what we do, not the one that makes us accelerate at the cost of comprehensibility. It is what helps us not to make mistakes, not what makes us do more things in less time.
Slowness, imperfection, and errors are a symbol of humanity, an aspect we must preserve and should never want to optimize.