
What’s in a /ˈneɪm/?

by Danny DeRuntz
6 min read

Getting AI to pronounce names correctly.

Part of a series on prototyping with generative AI. The prototype further down the page is an example of how to extract phonemes from a voice recording containing a name, and then use those phonemes to control how an AI voice assistant pronounces that name.

Ok, now that we’re getting bored with chatting with AI in text, it’s time to use our voice. There are various speech-to-text and text-to-speech services that, in theory, make this relatively easy. But once you have something working, it only takes a simple greeting to realize a whole new set of problems is coming for us. For example, getting your AI to correctly pronounce someone’s name!

I’ll try to illustrate the challenge in this simple courtesy:

Bot — “Hello, I’m looking forward to working with you. To get started, can you tell me your name?”

Hannah — “Sure. My name is Hannah.”
User pronounced “Han” like Han… Solo

Bot — “It’s nice to meet you Hannah!”
The Bot pronounced “Han” like Hand

Hannah — “No, it’s Hannah. My name is Hannah.”

Bot — “Ah, I’m sorry for the confusion. Your name is Hannah. It’s nice to meet you Hannah!”
The Bot still pronounced “Han” like Hand

Hannah just said their name correctly into a microphone. “Isn’t this AI?!” This can quickly become annoying, or even insulting, as the assistant reinforces the idea that this AI was definitely not made for Hannah. Meanwhile, we haven’t even gotten started with whatever this voice assistant is supposed to do. So, let’s dig in. There are two spots where Hannah’s correct pronunciation can (and will) degrade or get lost:

  1. Audio processing — Hannah speaks into the microphone that’s recording
  2. Transcription — Recording transcribed to text — degrades here
  3. Chat completion — Text message sent to AI, AI sends the message back
  4. Text-to-speech — AI message converted to Speech — degrades here
  5. Playback — AI speech plays through the speaker

We are going to try to fix both.

Speech-To-Text (aka transcription)

The first spot where we lose information is when rich audio input is converted into text. Every STT (speech-to-text) service can give a slightly different transcription, especially with proper nouns.

If we send Hannah’s “Sure. My name is Hannah.” audio clip to OpenAI’s whisper-1, it might transcribe her name as “Hana” instead of “Hannah,” which might actually lead to the correct pronunciation later. But using Whisper means going slow: Hannah finishes talking → package up the audio recording → send it to whisper-1 → wait for the transcription. You’re also going to pay for every Whisper call.

Most devices and browsers have free STT built in. Chrome’s STT works in real time and is my choice for this project. The instant the user is done talking, you already have the transcribed message and can send it to the AI for a reply. But Chrome transcribes Hannah as “Hannah” no matter how I pronounce it.
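
For reference, wiring up Chrome’s recognizer looks roughly like this (sendToBot is a hypothetical stand-in for whatever forwards the transcript to your AI):

// Minimal sketch of Chrome's built-in real-time STT (the Web Speech API).
// sendToBot() is a hypothetical stand-in for your message handler.
const recognition = new webkitSpeechRecognition();
recognition.lang = "en-US";
recognition.interimResults = true; // stream partial transcripts while the user talks

recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  if (result.isFinal) {
    // Available the instant the user stops talking, but a name like
    // "Hannah" arrives already flattened to its default spelling
    sendToBot(result[0].transcript);
  }
};

recognition.start();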

I think it’s safest to assume the transcription will write their name wrong, or write it in a way that doesn’t necessarily tell you how to pronounce it. To solve this, I found an STT model on Hugging Face that transcribes directly to phonemes. I send the voice recording containing Hannah’s name to both whisper-1 and wav2vec2-ljspeech-gruut and get this back:

whisper: "My name is Hannah, but some people mispronounce it as Hannah."
wav2vec2_ljspeech_gruut: "maɪ n eɪm æ z h ɑ n ə b ʌ t s ʌ m p i p ə l m ɪ s p ɹ ə naʊn s ɪ d æ z h æ n ə"

Then I send both results to GPT-4 to have it correlate the phonemes with the words, and I get back a JSON object: a pronunciation dictionary with a few different tools to use with various TTS (text-to-speech) services.

{
  "name": "Hannah",
  "ipa": "hɑnə",
  "x_sampa": "hAn@",
  "uroman": "hana",
  "raw_phonemes": "h ɑ n ə"
}

You can step through the process below and hear a few services attempt pronunciation:

It feels inevitable that services like whisper-1 (from OpenAI) will start to integrate phonetic features, which sometimes play larger roles in languages that don’t use the Roman alphabet. But they aren’t there today.

Text-To-Speech

A lot of the noticeable improvement in TTS (text-to-speech) has to do with the reader’s “comprehension” of what it is reading (is this a whispery secret, part of a question, sarcastic, etc.?). And while the pronunciation of proper nouns keeps improving, there is still a lot of variety in it. If a name has “conventional spelling” and “conventional pronunciation,” it will probably be spoken correctly by AI. But to be sure, we can make adjustments to the messages we ask these services to read.

Google Speech Studio has a few studio voice models that are high quality and, importantly, accept SSML (Speech Synthesis Markup Language). We can send it the name Hannah enclosed in a phoneme tag that tells it how to pronounce the name, using the IPA (International Phonetic Alphabet) field we got back from GPT-4. If you send the right phonemes, this is going to work. GPT-3.5/4 will easily give you messages in the required format:

"<speak>It's nice to meet you <phoneme alphabet="ipa" ph="hɑnə">Hannah</phoneme>!</speak>"

Eleven Labs has some really natural voices with great vocal presence. However, to get proper pronunciation, the only trick currently available is to replace the user’s name with the uroman (Universal Romanizer) spelling. For “uroman,” GPT-4 wrote “Hana,” which Eleven Labs will say correctly. This isn’t as foolproof as using a service with SSML, but it’s a conceptual start.
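
A sketch of that substitution against the Eleven Labs API might look like this (the voice ID and API key are placeholders):

// Sketch: swap the name for its uroman spelling before sending to Eleven Labs.
// VOICE_ID and ELEVEN_API_KEY are placeholders. (Run inside an async function.)
const text = "It's nice to meet you Hannah!".replace("Hannah", "Hana"); // uroman spelling
const response = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`, {
  method: "POST",
  headers: {
    "xi-api-key": ELEVEN_API_KEY,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ text }),
});
const audioBuffer = await response.arrayBuffer(); // audio to play back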

Alexa skills and many other services accept SSML as well. In those cases, it really comes down to figuring out how to extract those phonemes. I also built a second, more brutish method you can test out: a simple voice bot that uses SSML and Google TTS. Its main trick is that it iterates its IPA spelling based on your conversation, as sketched below.
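
I won’t unpack that bot here, but the core of the trick is roughly this kind of hypothetical system instruction, re-sent on every turn:

// Hypothetical sketch of the "iterate the IPA" trick: each turn, the system
// prompt tells the model to revise its phoneme guess when the user objects.
const messages = [
  { role: "system", content:
      'You are a voice bot. Reply in SSML. Wrap the user\'s name in a ' +
      '<phoneme alphabet="ipa" ph="..."> tag. If the user corrects your ' +
      'pronunciation, revise the ph attribute and greet them again.' },
  ...conversationSoFar, // prior turns, including any correction from the user
];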

I’m curious to hear how it worked for you 👆

Higgins Call to Get Phonemes

That’s it for this one. I think the most useful code is contained in the single server call below (plus the setup it needs). ChatGPT can easily help you run it and send your own audio file for processing.

app.post("/higgins", upload.single('audio'), async (req, res) => {

const audioFileBuffer = req.file.buffer;
const audioFilePath = path.join(__dirname, 'temp_audio.mp3');
await fs.promises.writeFile(audioFilePath, audioFileBuffer);

const ASRResPromise = hf.automaticSpeechRecognition({
processor: "bookbot/wav2vec2-ljspeech-gruut",
model: "bookbot/wav2vec2-ljspeech-gruut",
data: audioFileBuffer,
});
const WhisperResPromise = openai.createTranscription(
fs.createReadStream(audioFilePath),
"whisper-1"
);
const [ASRRes, WhisperRes] = await Promise.all([ASRResPromise, WhisperResPromise]);
console.log(WhisperRes.data.text)
console.log(ASRRes.text)

// Extract and tidy up any names
const messages = [
{ "role": "system", "content": system_command },
{ "role": "assistant", "content": `Transcription: ${WhisperRes.data.text} \n\n Phonemes: ${ASRRes.text}` }
]
const AIresponse = await openai.createChatCompletion({
model: "gpt-4", // significantly better than 3.5 at this task
messages: messages,
temperature: 0,
});
console.log(AIresponse.data.choices[0].message.content)

// Package up transcriptions, extracted names, send to client
const reply = {
transcription: WhisperRes.data.text,
phonemes: ASRRes.text,
extractedNames: JSON.parse(AIresponse.data.choices[0].message.content)
}
res.json(reply);

});

const system_command = `You are a speech-to-text chatbot. You have just processed an audio file containing a voice recording. First you detected and transcribed the words in English. Then you detected and transcribed the phonemes.

Your job is to capture how the user actually pronounced their name (regional accent included) by detecting any mention of a user name along with its detected phonemes, and listing the results as JSON. If you detect a user name, return:

{
  "detected": true,
  "names": [
    {
      "name": <name>,
      "phonemes": <name raw phonemes>,
      "x_sampa": <phonemes converted into X-SAMPA format, removing spaces>,
      "ipa": <phonemes converted into IPA, removing spaces>,
      "uroman": <the IPA converted into U-ROMAN, i.e. roman characters as pronounced in English, removing any spaces and representing unstressed syllables appropriately>
    }
  ]
}

or, if there are no user names, return:

{
  "detected": false
}`;
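
To try it out, a hypothetical client-side call could look like this (assuming the server listens on port 3000 and recordedBlob holds your captured audio):

// Hypothetical client-side test: POST a recording to /higgins.
// The field name "audio" must match upload.single("audio") on the server.
const formData = new FormData();
formData.append("audio", recordedBlob, "name.mp3"); // e.g. MediaRecorder output

const res = await fetch("http://localhost:3000/higgins", { method: "POST", body: formData });
const { transcription, phonemes, extractedNames } = await res.json();
console.log(extractedNames); // the pronunciation dictionary for any detected names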

Danny DeRuntz
As an Executive Design Director at IDEO Cambridge, I help prototype and explore the intersection of emerging technology with core human needs.

Ideas In Brief
  • The article delves into the complexities of AI voice assistants mispronouncing names and offers a solution using phoneme extraction and TTS services to ensure accurate and personalized pronunciations for a better user experience.
