Google’s new translation model does something the older tools couldn’t pull off: it keeps pace with an actual conversation. No long waits, no awkward gaps, no robotic voice reading back words. You speak, and roughly two to three seconds later, the other person hears you in their language. That’s the whole pitch, and it mostly holds up.

The details matter, though, so it is worth knowing them before you decide whether this deserves your time.

What it actually does well

The voice quality is the biggest leap over anything before it. Previous translation apps stripped out everything that makes someone sound human: the pacing, the stress on certain words, the slight rise at the end of a question. Gemini 3.5 keeps most of that intact. In short conversations, you’d be hard-pressed to tell the translated voice apart from that of a fluent speaker.

Coverage is genuinely wide: seventy-plus languages and over 2,000 language pairs available in one session, with no English step in the middle. A Tamil speaker and a Japanese speaker can talk directly, without the model bouncing through English in the background. That matters a lot for conversations where English isn’t anyone’s first language.
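
The pair count is easy to sanity-check. A minimal sketch, assuming exactly 70 languages (the figure quoted is "seventy-plus", so treat this as a floor) and that any language can be paired with any other:

```python
import math

LANGUAGES = 70  # assumption: lower bound for "seventy-plus"

# Unordered pairs: Tamil and Japanese counted once, whichever way the audio flows.
unordered_pairs = math.comb(LANGUAGES, 2)

# Directed pairs: Tamil-to-Japanese and Japanese-to-Tamil counted separately.
directed_pairs = math.perm(LANGUAGES, 2)

print(unordered_pairs)  # 2415
print(directed_pairs)   # 4830
```

Either way of counting clears the 2,000 mark, which fits the claim that supported languages pair directly rather than through English.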

For everyday use, the Google Translate app update is already live on Android and iOS. Android users get a “listening mode” where you hold the phone to your ear, speak, and hear the translation through the earpiece. Nobody around you hears it. Useful for travel, guided tours, or any time you want discretion.

Where it falls short

Long sessions are a problem. After about 30 minutes, the output voice starts to drift. It can shift gender, change tone, or lock onto one voice even when multiple people are speaking. Google acknowledges this. For a quick call or short meeting, you won’t notice. For a 90-minute client presentation, you might.

Heavy accents trip it up more than you’d expect for a model claiming this level of accuracy. Non-native speakers can cause the model to misdetect which language is being spoken, and closely related languages such as Spanish and Portuguese occasionally get confused. European Spanish is also listed as an incomplete dialect, which is a gap worth knowing about before you rely on it.

Noisy environments mostly work, but “mostly” is doing real work in that sentence. A quiet café is fine. A busy street or loud conference hall introduces audio artifacts. Not deal-breaking, but noticeable.

Pricing if you’re a developer

Audio input costs $3.50 per million tokens, and output runs $21.00 per million tokens. A one-minute conversation works out to roughly $0.024. A one-hour meeting lands around $2.30. Compared to building the same pipeline yourself from separate speech recognition, translation, and text-to-speech services, that’s meaningfully cheaper and far simpler to maintain.
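
For budgeting, the two per-token rates can be turned into a small estimator. This is a rough sketch, not an official calculator: the function names are made up for illustration, and the number of tokens per second of audio is not given here, so it is left as a parameter you would fill in from your own usage logs.

```python
INPUT_USD_PER_MILLION = 3.50    # audio input rate quoted above
OUTPUT_USD_PER_MILLION = 21.00  # audio output rate quoted above


def session_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one session given measured input and output token counts."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


def cost_from_duration(seconds: float, tokens_per_second: float) -> float:
    """Rough estimate assuming equal amounts of input and output audio.

    tokens_per_second is a hypothetical placeholder; measure it from real usage.
    """
    tokens = round(seconds * tokens_per_second)
    return session_cost_usd(tokens, tokens)


# One million tokens each way costs the sum of the two rates.
print(session_cost_usd(1_000_000, 1_000_000))  # 24.5

# Scaling the quoted per-minute figure linearly to an hour.
print(round(0.024 * 60, 2))  # 1.44
```

Note that the per-minute figure scaled linearly gives about $1.44 per hour, not $2.30, so the two quoted numbers probably rest on different assumptions about how much audio flows in each direction. Measuring token counts on a real session is the safer basis for a budget.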

The API uses persistent WebSocket connections, which means context carries across a conversation without you re-sending everything. That’s a sensible design choice for real-time use.
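
To see why that matters, here is a toy model comparing the two designs. It is not Google’s API; the function names and turn sizes are illustrative assumptions. It only counts how much data goes over the wire when every request must re-send the conversation so far, versus when the server keeps context and each turn is sent once.

```python
def stateless_payload(turn_sizes: list[int]) -> int:
    """Total units sent if every request re-sends all earlier turns."""
    total = 0
    history = 0
    for size in turn_sizes:
        history += size
        total += history  # the whole conversation so far is sent again
    return total


def persistent_payload(turn_sizes: list[int]) -> int:
    """Total units sent if context lives server-side and each turn is sent once."""
    return sum(turn_sizes)


turns = [10] * 20  # hypothetical: 20 turns of 10 units each
print(stateless_payload(turns))   # 2100
print(persistent_payload(turns))  # 200
```

The stateless total grows quadratically with the number of turns, while the persistent one grows linearly, which is the practical argument for keeping one connection open during a live conversation.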

Who should use it now

Short calls, travel situations, customer support, and one-on-one meetings are where the model works well today. Google Meet integration is still in private enterprise preview, so if you were hoping to roll this out for team meetings, you’re waiting until late 2026 at the earliest.

Healthcare and legal contexts need caution. The accuracy is high, but “high” still leaves room for errors in settings where a single mistranslation can cause real harm.

Quick verdict

This is the best real-time voice translation available. It beats every competitor on language coverage, latency, and voice naturalness, often at lower cost. The voice drift in long sessions and the accent sensitivity are real weaknesses, not theoretical ones. Test it against your specific use case before committing, especially if your users speak with heavy regional accents or if your conversations regularly run past the 30-minute mark.