
Voice-controlled games are quietly back, and they're better than you'd guess

The voice-controlled game genre is older than most people remember, and a lot quieter now than in its 2008 heyday. SingStar sold well over 20 million copies across the PS2 and PS3 lifecycles [1]. Then, for about a decade, the genre nearly vanished from mainstream gaming, kept alive by karaoke arcades and a couple of mobile apps. Around 2023 something interesting started happening on the web, and that's what this piece is really about.

The plastic-microphone era

The first wave of voice gaming was tied to plastic peripherals: SingStar mics, Rock Band mics, the Xbox Live Vision camera and headset, the Kinect. The technology worked but the friction was real. You needed a console, a TV, a clean line of sight, and a controller you only used for that one game. Wii Music sold roughly 3 million copies in 18 months: respectable in absolute terms, a disappointment relative to other Wii titles [2]. The genre worked, but the on-ramp was steeper than most casual players could be bothered to climb.

The Web Audio API quietly fixed the on-ramp

What changed isn't a new game design. It's a stack. The Web Audio API hit its first stable release in 2014, and getUserMedia() (the browser call that asks for microphone access) has shipped in Chrome and Firefox since around 2012 [3]. By 2020 the typical latency from microphone capture to a JavaScript callback on a mid-range laptop was about 20–40 ms, fast enough that a game can actually respond to your voice without the lag feeling comical. I measured 32 ms on my own MacBook Air in our Pitch Pong prototype using the default AudioContext buffer size. That's the same order of magnitude as a button press on a wired gamepad (about 8 ms on a good day, 50 ms on a bad one).
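For a rough sense of where latency figures like these come from, here is a small TypeScript sketch of the buffer arithmetic. The Web Audio API renders audio in 128-frame blocks; everything else here (the 48 kHz sample rate, the 1024-frame analysis window, the 10 ms of device input latency) is an illustrative assumption, not a measurement from Pitch Pong.

```typescript
// Time covered by one audio buffer, in milliseconds.
function bufferDurationMs(frames: number, sampleRate: number): number {
  return (frames / sampleRate) * 1000;
}

// Rough capture-to-callback estimate: device input latency plus the time it
// takes to fill the buffers a callback waits for. The inputs are values you
// would measure on real hardware; they are not constants from any spec.
function estimateCaptureLatencyMs(
  inputLatencyMs: number,
  frames: number,
  sampleRate: number,
  buffersToFill: number
): number {
  return inputLatencyMs + buffersToFill * bufferDurationMs(frames, sampleRate);
}

// One 128-frame render block at 48 kHz: about 2.67 ms.
const renderBlockMs = bufferDurationMs(128, 48000);

// A hypothetical 1024-frame analysis window at 48 kHz: about 21.3 ms.
const analysisWindowMs = bufferDurationMs(1024, 48000);

// Assumed 10 ms of input latency plus one analysis window: about 31.3 ms.
const illustrativeTotalMs = estimateCaptureLatencyMs(10, 1024, 48000, 1);
```

The point of the sketch is that the analysis window, not the browser, tends to dominate: halving the window roughly halves that part of the delay, at the cost of less audio to analyze per frame.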

The honest punchline is that the killer hardware for voice gaming was always sitting right in front of your face. You've had a microphone since you bought the laptop. The genre needed to delete the plastic mic, not invent a better one.

What the 2023–2026 wave actually looks like

Most of the new voice-controlled games I've enjoyed share four things:

They run in a browser tab. No app, no account, no install. You click, you grant mic permission, you play.
They use one signal, not speech. Pitch, volume, sustained vowels. They don't try to do speech recognition, which is harder, slower, and language-locked.
Rounds are 30 seconds to two minutes. This isn't a coincidence: the games are often played in shared spaces, and a long round in front of a partner or coworker is socially awkward.
The game accepts that you might sound terrible. The bad-singing edge case is the point, not a bug.
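To make the "one signal, not speech" point concrete, here is a minimal sketch of the simplest such signal, volume, computed from one frame of raw samples. It assumes the samples arrive as a Float32Array in the range -1 to 1 (in a browser, AnalyserNode.getFloatTimeDomainData() fills one). The function names and the floor and ceiling thresholds are hypothetical tuning values of mine, not taken from any particular game.

```typescript
// Root-mean-square level of one frame of samples in the range [-1, 1].
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Map an RMS level onto a 0..1 control value between a noise floor and a
// "loud enough" ceiling. Both thresholds are assumptions to tune per game.
function volumeControl(rms: number, floor: number, ceiling: number): number {
  if (ceiling <= floor) return 0;
  const t = (rms - floor) / (ceiling - floor);
  return Math.min(1, Math.max(0, t));
}
```

No vocabulary, no language model, no accent problem: a few multiplications per frame. That is why this kind of input stays fast where speech recognition does not.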

Pitch Pong is the example I keep poking people about: it's two-player Pong where the position of your paddle is the pitch you're singing. You can play it with a closed-mouth hum, your normal voice, or by being absurd about it. A three-round match runs under a minute. The friction is mic permission and that's it.
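Here is a sketch of how a pitch-to-paddle mapping can work in principle. It is not Pitch Pong's actual code. It uses naive autocorrelation, which is the simplest pitch estimator I know of and is prone to octave errors on real voices, so treat it as a stand-in for something sturdier. The function names, the silence cutoff, and the 0.3 confidence threshold are all assumptions.

```typescript
// Naive autocorrelation pitch estimate for one frame of mono samples.
// Returns null for near-silent frames or when no lag correlates clearly.
function estimatePitchHz(
  samples: Float32Array,
  sampleRate: number,
  minHz: number,
  maxHz: number
): number | null {
  let energy = 0;
  for (let i = 0; i < samples.length; i++) {
    energy += samples[i] * samples[i];
  }
  if (samples.length === 0 || energy / samples.length < 1e-4) return null;

  // Candidate periods, in samples, for the allowed pitch range.
  const minLag = Math.max(1, Math.floor(sampleRate / maxHz));
  const maxLag = Math.min(Math.floor(sampleRate / minHz), samples.length - 1);

  let bestLag = -1;
  let bestScore = 0;
  for (let lag = minLag; lag <= maxLag; lag++) {
    let score = 0;
    for (let i = 0; i + lag < samples.length; i++) {
      score += samples[i] * samples[i + lag];
    }
    if (score > bestScore) {
      bestScore = score;
      bestLag = lag;
    }
  }
  // Assumed threshold: the best lag must explain a fair share of the energy.
  if (bestLag < 0 || bestScore < 0.3 * energy) return null;
  return sampleRate / bestLag;
}

// Map a frequency onto 0..1 between the player's low and high notes, on a
// logarithmic scale so each semitone moves the paddle the same distance.
function pitchToPaddle(hz: number, lowHz: number, highHz: number): number {
  if (!(hz > 0) || !(lowHz > 0) || highHz <= lowHz) return 0;
  const t = Math.log(hz / lowHz) / Math.log(highHz / lowHz);
  return Math.min(1, Math.max(0, t));
}
```

With a playable range of 110 Hz to 440 Hz, a 220 Hz hum sits exactly one octave above the bottom and one below the top, so pitchToPaddle(220, 110, 440) puts the paddle at 0.5, mid-court. The log scale matters: on a linear scale the lower octave would be squeezed into the bottom third of the screen.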

Why this matters for casual play specifically

Two reasons, both observable:

First, voice input collapses the "skill barrier" that gatekeeps most multiplayer games. Aim, reflexes, knowledge of the meta: none of these matter when the controller is your throat. A 50-year-old who's never opened Steam can play competitively with their teenager within sixty seconds. That dynamic doesn't exist in almost any other multiplayer genre.

Second, voice input fights phone-pose passivity. Ofcom's Online Nation 2023 report flagged that UK adults averaged just over 4 hours of daily passive screen time on phones [4]. Most casual games slot neatly into that posture: one thumb, head down. Voice games break it. You sit up, you breathe, you make a noise that ends in laughter. It's a small thing, but the posture difference is real and I notice it in myself after a session.

What the genre still needs

A few things, honestly:

A better fingerprint for "you". Voices change with the weather. A model that can re-calibrate in two seconds instead of asking the player to "sing this note now" would help.
A graceful fallback for shared rooms. Some people will not, ever, make noise in front of a coworker. The current answer is "go play in another tab." A whisper mode or hum-only mode is doable; nobody's nailed it yet.
Latency budgets on cheap Android. The 32 ms I measured on my MacBook gets uglier on a low-end Android phone with Bluetooth earbuds. Bluetooth audio added 40–80 ms in my tests, which is enough to make a fast paddle feel "stuck."
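On the first item in that list, one way to re-calibrate without a "sing this note now" prompt is to keep watching the pitches a player actually produces and trim the outliers. The sketch below assumes pitch readings arrive at about 50 per second, so a capacity of 100 covers roughly two seconds. The class name, the capacity, and the 10th and 90th percentile cutoffs are my own hypothetical choices, and nothing here is a shipped feature.

```typescript
// Keeps the most recent pitch readings and reports a trimmed low/high range.
class VocalRangeTracker {
  readings: number[] = [];
  capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Record one pitch reading; silence and failed detections are ignored.
  add(hz: number): void {
    if (!(hz > 0)) return;
    this.readings.push(hz);
    if (this.readings.length > this.capacity) this.readings.shift();
  }

  // Returns null until there are enough readings to trust the estimate.
  // Uses the 10th and 90th percentiles so one squeak does not stretch the range.
  range(minReadings: number): { lowHz: number; highHz: number } | null {
    if (this.readings.length < minReadings) return null;
    const sorted = this.readings.slice().sort(function (a, b) {
      return a - b;
    });
    const lowIndex = Math.floor(0.1 * (sorted.length - 1));
    const highIndex = Math.ceil(0.9 * (sorted.length - 1));
    return { lowHz: sorted[lowIndex], highHz: sorted[highIndex] };
  }
}
```

Because old readings fall out of the window, the range follows the player: a voice that warms up or tires over a session shifts the bounds within a couple of seconds instead of needing a reset.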

If you want to try one

The shortlist I'd actually click into right now: Pitch Pong, some browser implementations of Yodel-style note-matching games, and the better web-based karaoke trainers (look for ones that don't ask you to log in). Mic permission and a quiet minute — that's the whole investment.


Sources & references

[1] Sony Computer Entertainment Europe. SingStar lifetime sales, cumulative PS2 + PS3 figures reported through 2014. Press archive.
[2] Nintendo. Financial Highlights, FY 2010. Wii Music software shipment data.
[3] W3C Web Audio Working Group. Web Audio API. Stable since 2014; MediaStream / getUserMedia history.
[4] Ofcom. Online Nation 2023. UK adult media use, daily screen-time aggregates.
[5] Mozilla Developer Network. Web Audio API best practices and latency notes.
[6] Author's own latency log, AudioContext default buffer, MacBook Air M2, May 2026.