
What’s in a /ˈneɪm/?

by Danny DeRuntz
6 min read

Getting AI to pronounce names correctly.

Part of a series on prototyping with generative AI. The prototype further down the page is an example of how to extract phonemes from a voice recording containing a name, and then use those phonemes to control how an AI voice assistant pronounces that name.

Ok, now that we’re getting bored with chatting with AI in text, it’s time to use our voice. There are various speech-to-text and text-to-speech services that, in theory, make this relatively easy. But once you have something working, it only takes a simple greeting to realize a whole new set of problems is coming for us. For example, getting your AI to correctly pronounce someone’s name!

I’ll try to illustrate the challenge in this simple courtesy:

Bot — “Hello, I’m looking forward to working with you. To get started, can you tell me your name?”

Hannah — “Sure. My name is Hannah.”
User pronounced “Han” like Han… Solo

Bot — “It’s nice to meet you Hannah!”
The Bot pronounced “Han” like Hand

Hannah — “No, it’s Hannah. My name is Hannah.”

Bot — “Ah, I’m sorry for the confusion. Your name is Hannah. It’s nice to meet you Hannah!”
The Bot still pronounced “Han” like Hand

Hannah just said their name correctly into a microphone. “Isn’t this AI?!” This can quickly become annoying, or even insulting, as the assistant reinforces the idea that this AI was definitely not made for Hannah. Meanwhile, we haven’t even gotten started with whatever this voice assistant is supposed to do. So, let’s dig in. There are two spots where Hannah’s correct pronunciation can (and will) degrade or get lost:

  1. Audio processing — Hannah speaks into the microphone that’s recording
  2. Transcription — Recording transcribed to text — degrades here
  3. Chat completion — Text message sent to AI, AI sends the message back
  4. Text-to-speech — AI message converted to Speech — degrades here
  5. Playback — AI speech plays through the speaker

We are going to try to fix both.

Speech-To-Text (aka transcription)

The first spot where we lose information is when rich audio input is converted into text. Every STT (speech-to-text) service can give a slightly different transcription, especially with proper nouns.

If we send Hannah’s “Sure. My name is Hannah.” audio clip to OpenAI’s whisper-1, it might transcribe her name as “Hana” instead of “Hannah,” which might actually lead to the correct pronunciation later. But using Whisper means going slow: Hannah finishes talking → package up the audio recording → send it to whisper-1 → wait for the transcription. You’re also going to pay for every Whisper call.

Most devices and browsers have free STT built in. Chrome’s STT works in real time and is my choice for this project. The instant the user is done talking, you already have the transcribed message and can send it to the AI for a reply. But Chrome transcribes Hannah as “Hannah” no matter how I pronounce it.
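
For reference, wiring up Chrome’s recognizer looks roughly like this (sendToBot is a hypothetical stand-in for whatever forwards the transcript to your AI):

// Minimal sketch of Chrome's built-in real-time STT (the Web Speech API).
// sendToBot() is a hypothetical stand-in for your message handler.
const recognition = new webkitSpeechRecognition();
recognition.lang = "en-US";
recognition.interimResults = true; // stream partial transcripts while the user talks

recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  if (result.isFinal) {
    // Available the instant the user stops talking, but a name like
    // "Hannah" arrives already flattened to its default spelling
    sendToBot(result[0].transcript);
  }
};

recognition.start();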

I think it’s safest to assume the transcription will write their name wrong, or write it in a way that doesn’t necessarily tell you how to pronounce it. To solve this, I found an STT model on Hugging Face that transcribes directly to phonemes. I send the voice recording containing Hannah’s name to both whisper-1 and wav2vec2-ljspeech-gruut and get this back:

whisper: "My name is Hannah, but some people mispronounce it as Hannah."
wav2vec2_ljspeech_gruut: "maɪ n eɪm æ z h ɑ n ə b ʌ t s ʌ m p i p ə l m ɪ s p ɹ ə naʊn s ɪ d æ z h æ n ə"

Then I send both results to GPT-4 to have it correlate the phonemes with the words, and I get back a JSON object: a pronunciation dictionary with a few different tools to use with various TTS (text-to-speech) services.

{
  "name": "Hannah",
  "ipa": "hɑnə",
  "x_sampa": "hAn@",
  "uroman": "hana",
  "raw_phonemes": "h ɑ n ə"
}

You can step through the process below and hear a few services attempt pronunciation:

It feels inevitable that services like whisper-1 (from OpenAI) will start to integrate phonetic features, which sometimes play larger roles in languages that don’t use the Roman alphabet. But they aren’t there today.

Text-To-Speech

A lot of the noticeable improvement in TTS (text-to-speech) has to do with the reader’s “comprehension” of what it is reading (is this a whispery secret, part of a question, sarcastic, etc.?). And while the pronunciation of proper nouns keeps improving, there is still a lot of variety in it. If a name has “conventional spelling” and “conventional pronunciation,” it will probably be spoken correctly by AI. But to be sure, we can make adjustments to the messages we ask these services to read.

Google Speech Studio has a few studio voice models that are high quality and, importantly, accept SSML (Speech Synthesis Markup Language). We can send it the name Hannah enclosed in a phoneme tag that tells it how to pronounce the name, using the IPA (International Phonetic Alphabet) field we got back from GPT-4. If you send the right phonemes, this is going to work. GPT-3.5/4 will easily give you messages in the required format:

"<speak>It's nice to meet you <phoneme alphabet="ipa" ph="hɑnə">Hannah</phoneme>!</speak>"

Eleven Labs has some really natural voices with great vocal presence. However, to get proper pronunciation, the only trick currently available is to replace the user’s name with the uroman (Universal Romanizer) spelling. For “uroman,” GPT-4 wrote “Hana,” which Eleven Labs will say correctly. This isn’t as foolproof as using a service with SSML, but it’s a conceptual start.
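
A sketch of that substitution against the Eleven Labs API might look like this (the voice ID and API key are placeholders):

// Sketch: swap the name for its uroman spelling before sending to Eleven Labs.
// VOICE_ID and ELEVEN_API_KEY are placeholders. (Run inside an async function.)
const text = "It's nice to meet you Hannah!".replace("Hannah", "Hana"); // uroman spelling
const response = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`, {
  method: "POST",
  headers: {
    "xi-api-key": ELEVEN_API_KEY,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ text }),
});
const audioBuffer = await response.arrayBuffer(); // audio to play back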

Alexa skills and many other services accept SSML as well. In those cases, it really comes down to figuring out how to extract those phonemes. I also built a second, more brutish method you can test out: a simple voice bot that uses SSML and Google TTS. Its main trick is that it iterates its IPA spelling based on your conversation, as sketched below.
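
I won’t unpack that bot here, but the core of the trick is roughly this kind of hypothetical system instruction, re-sent on every turn:

// Hypothetical sketch of the "iterate the IPA" trick: each turn, the system
// prompt tells the model to revise its phoneme guess when the user objects.
const messages = [
  { role: "system", content:
      'You are a voice bot. Reply in SSML. Wrap the user\'s name in a ' +
      '<phoneme alphabet="ipa" ph="..."> tag. If the user corrects your ' +
      'pronunciation, revise the ph attribute and greet them again.' },
  ...conversationSoFar, // prior turns, including any correction from the user
];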

I’m curious to hear how it worked for you 👆

Higgins Call to Get Phonemes

That’s it for this one. I think the most useful code is contained in the single server call below (plus the setup it needs). ChatGPT can easily help you run it and send your own audio file for processing.

app.post("/higgins", upload.single('audio'), async (req, res) => {

const audioFileBuffer = req.file.buffer;
const audioFilePath = path.join(__dirname, 'temp_audio.mp3');
await fs.promises.writeFile(audioFilePath, audioFileBuffer);

const ASRResPromise = hf.automaticSpeechRecognition({
processor: "bookbot/wav2vec2-ljspeech-gruut",
model: "bookbot/wav2vec2-ljspeech-gruut",
data: audioFileBuffer,
});
const WhisperResPromise = openai.createTranscription(
fs.createReadStream(audioFilePath),
"whisper-1"
);
const [ASRRes, WhisperRes] = await Promise.all([ASRResPromise, WhisperResPromise]);
console.log(WhisperRes.data.text)
console.log(ASRRes.text)

// Extract and tidy up any names
const messages = [
{ "role": "system", "content": system_command },
{ "role": "assistant", "content": `Transcription: ${WhisperRes.data.text} \n\n Phonemes: ${ASRRes.text}` }
]
const AIresponse = await openai.createChatCompletion({
model: "gpt-4", // significantly better than 3.5 at this task
messages: messages,
temperature: 0,
});
console.log(AIresponse.data.choices[0].message.content)

// Package up transcriptions, extracted names, send to client
const reply = {
transcription: WhisperRes.data.text,
phonemes: ASRRes.text,
extractedNames: JSON.parse(AIresponse.data.choices[0].message.content)
}
res.json(reply);

});

const system_command = `You are a speech-to-text chatbot. You have just processed an audio file containing a voice recording. First you detected and transcribed the words in English. Then you detected and transcribed the phonemes.

Your job is to capture how the user actually pronounced their name (regional accent included) by detecting any mention of a user name along with its detected phonemes, and listing the results as JSON. If you detect a user name, return:

{
  "detected": true,
  "names": [
    {
      "name": <name>,
      "phonemes": <name raw phonemes>,
      "x_sampa": <phonemes converted into X-SAMPA format, removing spaces>,
      "ipa": <phonemes converted into IPA, removing spaces>,
      "uroman": <the IPA converted into U-ROMAN, i.e. roman characters as pronounced in English, removing any spaces and representing unstressed syllables appropriately>
    }
  ]
}

or, if there are no user names, return:

{
  "detected": false
}`;
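
To try it out, a hypothetical client-side call could look like this (assuming the server listens on port 3000 and recordedBlob holds your captured audio):

// Hypothetical client-side test: POST a recording to /higgins.
// The field name "audio" must match upload.single("audio") on the server.
const formData = new FormData();
formData.append("audio", recordedBlob, "name.mp3"); // e.g. MediaRecorder output

const res = await fetch("http://localhost:3000/higgins", { method: "POST", body: formData });
const { transcription, phonemes, extractedNames } = await res.json();
console.log(extractedNames); // the pronunciation dictionary for any detected names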

Danny DeRuntz
As an Executive Design Director at IDEO Cambridge, I help prototype and explore the intersection of emerging technology with core human needs.

Ideas In Brief
  • The article delves into the complexities of AI voice assistants mispronouncing names and offers a solution using phoneme extraction and TTS services to ensure accurate and personalized pronunciations for a better user experience.
