Part of a series on prototyping with generative AI. The prototype further down the page is an example of how to extract phonemes from a voice recording containing a name, and then use those phonemes to control how an AI voice assistant pronounces that name.
Ok, now that we’re getting bored with chatting with AI in text, it’s time to use our voice. There are various Speech-to-text and Text-to-speech services that can make this relatively easy in theory. But once you have something working, it only takes a simple greeting to realize we have a whole new set of problems coming for us. For example, getting your AI to correctly pronounce someone’s name!
I’ll try to illustrate the challenge with this simple exchange of courtesies:
Bot — “Hello, I’m looking forward to working with you. To get started, can you tell me your name?”
Hannah — “Sure. My name is Hannah.”
User pronounced “Han” like Han… Solo
Bot — “It’s nice to meet you Hannah!”
The Bot pronounced “Han” like Hand
Hannah — “No, it’s Hannah. My name is Hannah.”
Bot — “Ah, I’m sorry for the confusion. Your name is Hannah. It’s nice to meet you Hannah!”
The Bot still pronounced “Han” like Hand
Hannah just said her name correctly into a microphone. “Isn’t this AI?!” This can quickly become annoying, or even insulting, as the assistant reinforces the idea that this AI was definitely not made for Hannah. Meanwhile, we haven’t even gotten started on whatever this voice assistant is actually supposed to do. So, let’s dig in. There are two spots where Hannah’s correct pronunciation can (and will) degrade or get lost:
- Audio processing — Hannah speaks into the microphone that’s recording
- Transcription — Recording transcribed to text — degrades here
- Chat completion — Text message sent to AI, AI sends the message back
- Text-to-speech — AI message converted to Speech — degrades here
- Playback — AI speech plays through the speaker
We’re going to try to fix both.
Speech-To-Text (aka transcription)
The first spot where we lose information is when rich audio input is converted into text. Every STT (speech-to-text) service can give a slightly different transcription, especially with proper nouns.
If we send Hannah’s “Sure. My name is Hannah.” audio clip to OpenAI’s Whisper-1, it might transcribe her name as “Hana” instead of “Hannah,” which might actually lead to correct pronunciation later. But using Whisper means going slow: Hannah finishes talking → package up the audio recording → send it to whisper-1 → wait for the transcription. You’re also going to pay for every Whisper call.
Most devices and browsers have free STT built in. Chrome’s STT works in real time and is my choice for this project. The instant the user is done talking, you already have the transcribed message and can send it to the AI for a reply. But Chrome transcribes Hannah as “Hannah” no matter how I pronounce it.
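For reference, a minimal sketch of that in-browser route using Chrome’s Web Speech API (still prefixed as webkitSpeechRecognition in Chrome) looks something like this:

// Minimal in-browser speech-to-text with Chrome's Web Speech API.
// It runs in the page, costs nothing, and returns results in real time.
const recognition = new webkitSpeechRecognition();
recognition.lang = "en-US";
recognition.interimResults = true; // stream partial transcripts while the user is still talking
recognition.continuous = false;    // stop after a single utterance

recognition.onresult = (event) => {
  const transcript = event.results[event.results.length - 1][0].transcript;
  console.log(transcript); // "My name is Hannah", spelled however Chrome decides
};

recognition.start();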
I think it’s safest to assume the transcription will write their name wrong, or write it in a way that doesn’t necessarily tell you how to pronounce it. To solve this, I found an STT model on Hugging Face that transcribes directly to phonemes. I send the voice recording containing Hannah’s name to both whisper-1 and wav2vec2-ljspeech-gruut and get this back:
whisper: "My name is Hannah, but some people mispronounce it as Hannah."
wav2vec2_ljspeech_gruut: "maɪ n eɪm æ z h ɑ n ə b ʌ t s ʌ m p i p ə l m ɪ s p ɹ ə naʊn s ɪ d æ z h æ n ə"
Then I send both results to GPT-4 to have it correlate the phonemes to the words, and get back a JSON object: a pronunciation dictionary with a few different tools to use with various TTS (text-to-speech) services.
{
  name: "Hannah",
  ipa: "hɑnə",
  x_sampa: "hAn@",
  uroman: "hana",
  raw_phonemes: "h ɑ n ə"
}
You can step through the process below and hear a few services attempt pronunciation:
It feels inevitable that services like Whisper-1 (by OpenAI) will start to integrate phonetic features, which often play an even larger role in languages that don’t use the Roman alphabet. But it’s not there today.
Text-to-speech
A lot of the noticeable improvements in TTS (text-to-speech) have to do with the reader’s “comprehension” of what they are reading (is this a whispery secret, part of a question, sarcastic, etc). And while the pronunciation of proper nouns keeps improving, there is still a lot of variety to it. If a name has “conventional spelling” and “conventional pronunciation,” it will probably be spoken correctly by AI. But to be sure, we can make adjustments to the messages we ask these services to read.
Google Speech Studio has a few studio voice models that are high quality and, importantly, accept SSML (Speech Synthesis Markup Language). We can send it the name Hannah enclosed in a phoneme tag that tells it how to pronounce the name, using the IPA (International Phonetic Alphabet) field we got back from GPT-4. If you send the right phonemes, this is going to work. GPT-3.5/4 will easily give you messages in the required format:
"<speak>It's nice to meet you <phoneme alphabet="ipa" ph="hɑnə">Hannah</phoneme>!</speak>"
Eleven Labs has some really natural voices with great vocal presence. However, to get the pronunciation right, the only trick currently available is to replace the user’s name with the uroman (Universal Romanizer) spelling. For “uroman,” GPT-4 wrote “Hana,” which Eleven Labs will say correctly. This isn’t as foolproof as using a service with SSML, but it’s a conceptual start.
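A sketch of that substitution, assuming the Eleven Labs REST API as documented at the time of writing (the voice ID and model name here are placeholders):

// Sketch: swap the written name for its uroman spelling before sending text to Eleven Labs.
// ELEVENLABS_VOICE_ID and the model name are placeholders; use your own voice and current model.
async function elevenLabsSay(message, name, uroman) {
  const text = message.replaceAll(name, uroman); // "It's nice to meet you Hana!"
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${process.env.ELEVENLABS_VOICE_ID}`,
    {
      method: "POST",
      headers: {
        "xi-api-key": process.env.ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text, model_id: "eleven_monolingual_v1" }),
    }
  );
  return Buffer.from(await response.arrayBuffer()); // audio bytes ready to play or save
}

elevenLabsSay("It's nice to meet you Hannah!", "Hannah", "Hana");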
Alexa skills and many other services also accept SSML. In those cases, it really comes down to figuring out how to extract those phonemes. I built a second, more brutish method you can test out: a simple voice bot that uses SSML and Google TTS. Its main trick is that it iterates its IPA spelling based on your conversation.
I’m curious to hear how it worked for you 👆
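The iteration trick itself is small: when the user corrects the bot, hand the correction back to GPT-4 and ask for a revised IPA string. A hedged sketch of that step (the prompt and helper name are mine, not the demo’s exact code, and openai is the same v3 SDK client used below):

// Sketch: ask GPT-4 to revise the stored IPA after the user corrects the pronunciation.
async function reviseIpa(name, currentIpa, userCorrection) {
  const completion = await openai.createChatCompletion({
    model: "gpt-4",
    temperature: 0,
    messages: [
      { role: "system", content: "You adjust IPA pronunciations for names. Reply with only the new IPA string." },
      { role: "user", content: `Name: ${name}\nCurrent IPA: ${currentIpa}\nThe user corrected the pronunciation by saying: "${userCorrection}"` },
    ],
  });
  return completion.data.choices[0].message.content.trim(); // e.g. "hɑnə"
}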
Higgins Call to Get Phonemes
That’s it for this one. I think the most useful code is contained in the single server call below. ChatGPT can easily help you run it and send your own audio file for processing.
app.post("/higgins", upload.single('audio'), async (req, res) => {
const audioFileBuffer = req.file.buffer;
const audioFilePath = path.join(__dirname, 'temp_audio.mp3');
await fs.promises.writeFile(audioFilePath, audioFileBuffer);
const ASRResPromise = hf.automaticSpeechRecognition({
processor: "bookbot/wav2vec2-ljspeech-gruut",
model: "bookbot/wav2vec2-ljspeech-gruut",
data: audioFileBuffer,
});
const WhisperResPromise = openai.createTranscription(
fs.createReadStream(audioFilePath),
"whisper-1"
);
const [ASRRes, WhisperRes] = await Promise.all([ASRResPromise, WhisperResPromise]);
console.log(WhisperRes.data.text)
console.log(ASRRes.text)
// Extract and tidy up any names
const messages = [
{ "role": "system", "content": system_command },
{ "role": "assistant", "content": `Transcription: ${WhisperRes.data.text} \n\n Phonemes: ${ASRRes.text}` }
]
const AIresponse = await openai.createChatCompletion({
model: "gpt-4", // significantly better than 3.5 at this task
messages: messages,
temperature: 0,
});
console.log(AIresponse.data.choices[0].message.content)
// Package up transcriptions, extracted names, send to client
const reply = {
transcription: WhisperRes.data.text,
phonemes: ASRRes.text,
extractedNames: JSON.parse(AIresponse.data.choices[0].message.content)
}
res.json(reply);
});
const system_command = `You are a speech-to-text chatbot. You have just processed an audio file containing a voice recording. First you detected and transcribed the speech in English. Then you detected and transcribed its phonemes.
Your job is to detect any mention of a user name along with its detected phonemes, accounting for regional accents, and list the results in JSON. If you detect a user name, return:
{
  "detected": true,
  "names": [
    {
      "name": <name>,
      "phonemes": <name raw phonemes>,
      "x_sampa": <phonemes converted into X-SAMPA format, removing spaces>,
      "ipa": <phonemes converted into IPA, removing spaces>,
      "uroman": <the IPA converted into U-ROMAN, i.e. roman characters as pronounced in English, removing any spaces and representing unstressed syllables appropriately>
    }
  ]
}
or, if there are no user names, return:
{
  "detected": false
}`;
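On the client side, sending a recording to that route is just a multipart upload. A sketch, assuming the recording is already a Blob (from MediaRecorder or a file input) and the field name matches upload.single('audio'):

// Sketch: post an audio Blob to the /higgins route and read back the pronunciation data.
async function getPhonemes(audioBlob) {
  const form = new FormData();
  form.append("audio", audioBlob, "hannah.mp3");
  const res = await fetch("/higgins", { method: "POST", body: form });
  return res.json(); // { transcription, phonemes, extractedNames }
}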