What’s in a /ˈneɪm/?

by Danny DeRuntz

6 min read

August 3, 2023

Getting AI to pronounce names correctly.

Part of a series on prototyping with generative AI. The prototype further down the page is an example of how to extract phonemes from a voice recording containing a name, and then use those phonemes to control how an AI voice assistant pronounces that name.

Ok, now that we’re getting bored with chatting with AI in text, it’s time to use our voice. There are various speech-to-text and text-to-speech services that make this relatively easy in theory. But once you have something working, it only takes a simple greeting to realize a whole new set of problems is coming for us. For example, getting your AI to correctly pronounce someone’s name!

I’ll try to illustrate the challenge in this simple courtesy:

Bot — “Hello, I’m looking forward to working with you. To get started, can you tell me your name?”

Hannah — “Sure. My name is Hannah.”
User pronounced “Han” like Han… Solo

Bot — “It’s nice to meet you Hannah!”
The Bot pronounced “Han” like Hand

Hannah — “No, it’s Hannah. My name is Hannah”

Bot — “Ah, I’m sorry for the confusion. Your name is Hannah. It’s nice to meet you Hannah!”
AI still pronounced the “Han” like Hand

Hannah just said her name correctly into a microphone. “Isn’t this AI?!” This can quickly become annoying, or even insulting, because the assistant reinforces the idea that it was definitely not made for Hannah. Meanwhile, we haven’t even gotten started with whatever this voice assistant is supposed to do. So, let’s dig in. There are two spots in the pipeline where Hannah’s correct pronunciation can (and will) degrade or get lost.

Audio processing — Hannah speaks into the microphone that’s recording
Transcription — Recording transcribed to text — degrades here
Chat completion — Text message sent to AI, AI sends the message back
Text-to-speech — AI message converted to Speech — degrades here
Playback — AI speech plays through the speaker

We are going to try to keep the pronunciation intact through both of those spots.

Speech-To-Text (aka transcription)

The first spot where we lose information is when rich audio input is converted into text. Every STT (speech-to-text) service can give a slightly different transcription, especially with proper nouns.

If we send Hannah’s “Sure. My name is Hannah.” audio clip to OpenAI’s whisper-1, it might transcribe her name as “Hana” rather than “Hannah,” which might result in correct pronunciation later. But using Whisper is slow: Hannah finishes talking → package up the audio recording → send it to whisper-1 → wait for the transcription. You’re also going to pay for a whisper.

Most devices and browsers have free STT built in. Chrome’s STT works in real time and is my choice for this project. The instant the user is done talking, you already have the transcribed message and can send it to the AI for a reply. But Chrome transcribes the name as “Hannah” no matter how I pronounce it.
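As a rough sketch of the real-time browser transcription described above: this assumes Chrome's prefixed `webkitSpeechRecognition` constructor from the Web Speech API. The names `listenForMessage`, `onFinalTranscript`, and `sendToAI` are made up for illustration and are not from the prototype. The constructor is passed in as an argument so the function can also be tried outside a browser with a stand-in.

```javascript
// Sketch only: wires up browser speech recognition and reports each finished phrase.
// `listenForMessage` and `onFinalTranscript` are illustrative names, not from the prototype.
function listenForMessage(RecognitionCtor, onFinalTranscript) {
  const recognition = new RecognitionCtor();
  recognition.lang = "en-US";
  recognition.continuous = false;    // stop once the speaker pauses
  recognition.interimResults = true; // partial results arrive while the user talks

  recognition.onresult = (event) => {
    // Only the results from resultIndex onward changed in this event.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (result.isFinal) {
        // The first alternative is the recognizer's best guess.
        onFinalTranscript(result[0].transcript.trim());
      }
    }
  };

  recognition.start();
  return recognition;
}

// In Chrome (sendToAI is a hypothetical function that forwards the text):
// listenForMessage(window.webkitSpeechRecognition, (text) => sendToAI(text));
```

Note that the callback only ever receives text, which is exactly why the spoken pronunciation is already gone by this point.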

I think it’s safest to assume the transcription will write their name wrong, or write it in a way that doesn’t necessarily tell you how to pronounce it. To solve this, I found an STT model on Hugging Face that transcribes directly to phonemes. I send the voice recording containing Hannah’s name to both whisper-1 and wav2vec2-ljspeech-gruut and get this back:

whisper: "My name is Hannah, but some people mispronounce it as Hannah."
wav2vec2_ljspeech_gruut: "maɪ n eɪm æ z h ɑ n ə b ʌ t s ʌ m p i p ə l m ɪ s p ɹ ə naʊn s ɪ d æ z h æ n ə"

Then I send both results to GPT-4, have it correlate the phonemes to the words, and get back a JSON object: a pronunciation dictionary with a few different representations to use with various TTS (text-to-speech) services.

name: "Hannah",
ipa: "hɑnə",
x_sampa: "hAn@",
uroman: "hana",
raw_phonemes: "h ɑ n ə"
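Since the model's reply is just text, it helps to parse it defensively before relying on it. Here is a minimal sketch, assuming the reply is wrapped as `{ detected, names: [...] }` as requested in the prompt at the end of this article. `buildPronunciationLookup` is an illustrative name, not part of the prototype.

```javascript
// Sketch only: turns the model's JSON reply into a lookup keyed by lowercased name.
// Assumes the { detected, names: [...] } wrapper; `buildPronunciationLookup` is an illustrative name.
function buildPronunciationLookup(replyText) {
  const lookup = new Map();
  let parsed;
  try {
    parsed = JSON.parse(replyText);
  } catch (err) {
    return lookup; // the model did not return valid JSON
  }
  if (!parsed || parsed.detected !== true || !Array.isArray(parsed.names)) {
    return lookup; // no names were detected
  }
  for (const entry of parsed.names) {
    // Skip entries that are missing the fields we need for TTS.
    if (!entry || typeof entry.name !== "string" || typeof entry.ipa !== "string") continue;
    lookup.set(entry.name.toLowerCase(), entry);
  }
  return lookup;
}

// Usage:
// const lookup = buildPronunciationLookup(replyFromModel);
// const hannah = lookup.get("hannah"); // undefined if the name was not detected
```

Returning an empty map on bad input lets the rest of the bot fall back to the plain spelling instead of crashing.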

You can step through the process below and hear a few services attempt pronunciation:

It feels inevitable that newer services like OpenAI’s whisper-1 will start to integrate phonetic features, which sometimes play larger roles in languages that don’t use the Roman alphabet. But they’re not there today.

Text-to-speech +

A lot of the noticeable improvements in TTS (text-to-speech) have to do with the reader’s “comprehension” of what they are reading (is this a whispery secret, part of a question, sarcastic, etc). And while the pronunciation of proper nouns keeps improving, there is still a lot of variety to it. If a name has “conventional spelling” and “conventional pronunciation,” it will probably be spoken correctly by AI. But to be sure, we can make adjustments to the messages we ask these services to read.

Google Speech Studio has a few studio voice models that are high quality, and importantly accept SSML (Speech Synthesis Markup Language). We can send it the name Hannah enclosed in a phoneme tag that tells it how to pronounce that name phonetically using the IPA (International Phonetic Alphabet) field we got back from GPT-4. If you send the right phonemes, this is going to work. GPT-3.5/4 will easily give you messages in the required format:

"<speak>It's nice to meet you <phoneme alphabet="ipa" ph="hɑnə">Hannah</phoneme>!</speak>"

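If you would rather build that markup yourself than ask the model for it, a small helper can wrap the name in the phoneme tag. This is a sketch under assumptions: `escapeXml`, `escapeRegExp`, and `toSsml` are illustrative names, and the name is matched only as a whole word so that "Han" inside "Hand" is left alone.

```javascript
// Escapes the characters that are special in XML text and attribute values.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Escapes characters that are special inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Sketch only: wraps every whole-word occurrence of `name` in an SSML phoneme tag.
function toSsml(message, name, ipa) {
  const tagged = `<phoneme alphabet="ipa" ph="${escapeXml(ipa)}">${escapeXml(name)}</phoneme>`;
  // Lookarounds keep the match from firing inside a longer word, including non-ASCII ones.
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}])${escapeRegExp(escapeXml(name))}(?![\\p{L}\\p{N}])`,
    "gu"
  );
  // A function replacer avoids "$" in the IPA string being read as a replacement pattern.
  return `<speak>${escapeXml(message).replace(pattern, () => tagged)}</speak>`;
}

// Usage:
// toSsml("It's nice to meet you Hannah!", "Hannah", "hɑnə");
```

Escaping the message first matters because SSML is XML, and a stray `&` or `<` in a chat reply would otherwise make the whole request invalid.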
Eleven Labs has some really natural voices with great vocal presence. However, to achieve proper pronunciation, the only trick we can currently use is to replace the user’s name with its uroman (Universal Romanizer) spelling. For “uroman,” GPT-4 wrote “Hana,” which Eleven Labs will say correctly. This isn’t as foolproof as using a service with SSML, but it’s a conceptual start.
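The respelling trick can be sketched as a plain text substitution applied only to the copy of the message that goes to the TTS service, so the on-screen text keeps the user's own spelling. `respellName` is an illustrative name, and capitalizing the respelling is my assumption for tidiness, not something the services are known to require.

```javascript
// Sketch only: swaps the user's name for a phonetic respelling before sending text to TTS.
// `respellName` is an illustrative name; capitalizing the result is an assumption.
function respellName(message, name, uroman) {
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Match the name only as a whole word so longer words are left alone.
  const pattern = new RegExp(`(?<![\\p{L}\\p{N}])${escaped}(?![\\p{L}\\p{N}])`, "gu");
  const spoken = uroman.charAt(0).toUpperCase() + uroman.slice(1);
  return message.replace(pattern, () => spoken);
}

// Usage:
// const displayText = "It's nice to meet you Hannah!";
// const spokenText = respellName(displayText, "Hannah", "hana");
```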

Alexa skills and many other services can accept SSML. In those cases, it really comes down to figuring out how to extract those phonemes. I built a second, more brutish method you can test out: a simple voice bot that uses SSML and Google TTS. Its main trick is that it iterates on its IPA spelling based on your conversation.

I’m curious to hear how it worked for you 👆

Higgins Call to Get Phonemes

That’s it for this one. I think the most beneficial code is mainly contained in this single server call below. ChatGPT can easily help you run this and send your own audio file for processing.

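// Assumes setup elsewhere in the file: an Express `app`; a multer `upload` using
// memory storage (so req.file.buffer exists); `hf`, an HfInference client from
// @huggingface/inference; `openai`, an OpenAIApi client from the v3 openai package;
// and Node's `fs` and `path` modules.
app.post("/higgins", upload.single('audio'), async (req, res) => {
  try {
    // createTranscription is given a read stream, so write the buffer to a temp file first.
    // Note: a fixed filename means concurrent requests would overwrite each other.
    const audioFileBuffer = req.file.buffer;
    const audioFilePath = path.join(__dirname, 'temp_audio.mp3');
    await fs.promises.writeFile(audioFilePath, audioFileBuffer);

    // Run phoneme recognition and word transcription in parallel
    const ASRResPromise = hf.automaticSpeechRecognition({
      processor: "bookbot/wav2vec2-ljspeech-gruut",
      model: "bookbot/wav2vec2-ljspeech-gruut",
      data: audioFileBuffer,
    });
    const WhisperResPromise = openai.createTranscription(
      fs.createReadStream(audioFilePath),
      "whisper-1"
    );
    const [ASRRes, WhisperRes] = await Promise.all([ASRResPromise, WhisperResPromise]);
    console.log(WhisperRes.data.text)
    console.log(ASRRes.text)

    // Extract and tidy up any names
    const messages = [
      { "role": "system", "content": system_command },
      { "role": "assistant", "content": `Transcription: ${WhisperRes.data.text} \n\n Phonemes: ${ASRRes.text}` }
    ]
    const AIresponse = await openai.createChatCompletion({
      model: "gpt-4", // significantly better than 3.5 at this task
      messages: messages,
      temperature: 0,
    });
    console.log(AIresponse.data.choices[0].message.content)

    // Package up transcriptions, extracted names, send to client
    const reply = {
      transcription: WhisperRes.data.text,
      phonemes: ASRRes.text,
      extractedNames: JSON.parse(AIresponse.data.choices[0].message.content)
    }
    res.json(reply);
  } catch (err) {
    // A failed service call or a non-JSON model reply lands here instead of hanging the request
    console.error(err);
    res.status(500).json({ error: "Could not extract phonemes" });
  }
});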

const system_command = `You are a Speech to text chatbot. You have just processed an audio file containing a voice recording. First you detected and transcribed in English. Then you detected and transcribed phonemes.

Your job is to be able to detect regional accents by detecting any mention of a user name along with their detected phonemes and list them out in json. If you detect a user name:

{
 "detected" : true,
 "names" : [
  {
    "name" : <name>,
    "phonemes" : <name raw phonemes>,
    "x_sampa" : <phonemes converted into X-SAMPA format removing spaces>,
    "ipa" : <phonemes converted into IPA removing spaces>,
    "uroman" : <convert the IPA into U-ROMAN, or roman character as pronounced in english, removing any spaces, and representing unstressed syllables appropriately>
  }
 ]
}

or if there are no user names return

{
 "detected" : false
}`;
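To send your own audio file to that route, a client call could look roughly like this. It is a sketch under assumptions: `requestPhonemes` is an illustrative name, the `http://localhost:3000` base URL is a guess at where the server runs, and the `fetchImpl` parameter exists only so the function can be exercised without a live server. It relies on the global `fetch` and `FormData` available in browsers and in Node 18 or later.

```javascript
// Sketch only: posts a recording to the /higgins route and returns the parsed reply.
// `requestPhonemes`, the default `baseUrl`, and `fetchImpl` are illustrative assumptions.
async function requestPhonemes(audioBlob, baseUrl = "http://localhost:3000", fetchImpl = fetch) {
  const form = new FormData();
  // The field name must match upload.single('audio') on the server.
  form.append("audio", audioBlob, "recording.mp3");

  const response = await fetchImpl(`${baseUrl}/higgins`, { method: "POST", body: form });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  // Expected shape: { transcription, phonemes, extractedNames }
  return response.json();
}

// Usage, with a Blob from a recorder or a file input:
// const reply = await requestPhonemes(recordedBlob);
// console.log(reply.extractedNames);
```

Leaving the Content-Type header unset is deliberate: the runtime fills in the multipart boundary itself when the body is a `FormData` object.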

Danny DeRuntz
As an Executive Design Director at IDEO Cambridge, I help prototype and explore the intersection of emerging technology with core human needs.

Ideas In Brief

The article delves into the complexities of AI voice assistants mispronouncing names and offers a solution using phoneme extraction and TTS services to ensure accurate and personalized pronunciations for a better user experience.
