
Large-Scale IVR System
Intent Design, NLU & ASR Analysis, Prompt Creation, RASA
Background on the project:
- 150k-200k calls per day
- client project (wireless cell phone & data domain)
- 55 intents (large granularity)
- 2 languages (English & Spanish)
- RASA
- custom-built ASR model
My team was tasked with upgrading an existing, legacy IVR (Interactive Voice Response) system to a new and improved system that maintained identical functionality while using machine learning (NLU).
The existing IVR system used directed dialog -- think "For voicemail options, say 'Voicemail'".
The legacy design documentation was 242 pages long!
Read below to find out a little about my role as Conversation Designer/Analyst on this project.
Intent Design
I owned the project of converting the legacy main menu into an initial intent structure.
Initial Intent Architecture
The legacy menu structure included 50 distinct menu options, with several more possible flow permutations among lower menus.

To begin building out the initial intent structure, I converted the main menu options from the legacy structure into a skeleton outline of intents.
I processed and labeled about 50,000 rows of legacy caller utterance data to build out and train an initial NLU model, refining the intent structure as I went.
Eventually, I ended up with 65 intents.
Iterating on Intent Structure
As we started collecting fresh caller data from traffic to our upgraded ML-based application, I began to improve upon the initial intent structure.
Data analysis to improve the intent structure included:
- examining call logs (full conversation data)
- listening to call audio
- examining "missed intent" phrase data, where intent confidence fell below threshold
- interpreting F1, precision, and recall metrics for individual intents, targeting the lowest performers
- identifying the intents with the most "confusion" by interpreting confusion matrices (see the sketch after this list)
- mining supplemental utterances from production data by targeting keywords (Python pandas)
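As an illustrative sketch of the confusion-pair analysis (not the project's actual code; the labels below are hypothetical):

```python
# Minimal sketch: surface the most-confused intent pairs from an
# evaluation set of (true, predicted) labels.
from collections import Counter

def most_confused_pairs(y_true, y_pred, top_n=10):
    """Count off-diagonal (true, predicted) pairs, worst first."""
    return Counter(
        (t, p) for t, p in zip(y_true, y_pred) if t != p
    ).most_common(top_n)

# Hypothetical evaluation labels, for illustration only:
y_true = ["payment", "payment", "add_line", "suspend_service"]
y_pred = ["payment", "autopay", "add_line", "remove_subscriber"]
print(most_confused_pairs(y_true, y_pred))
# [(('payment', 'autopay'), 1), (('suspend_service', 'remove_subscriber'), 1)]
```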
Data Analysis Tools (Python)
Here are two Python pandas Jupyter notebooks that I wrote to aid in NLU data analysis during this project. They contain tools to:
- find phrases within a data set that contain combinations of keywords, and
- take a random sub-sample of a dataset that preserves proportional intent ratios.
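As a sketch of the two ideas (not the notebooks themselves; column names like `utterance` and `intent` are illustrative):

```python
import pandas as pd

def phrases_with_keywords(df, keywords, text_col="utterance"):
    """Return rows whose text contains ALL of the given keywords."""
    mask = pd.Series(True, index=df.index)
    for kw in keywords:
        mask &= df[text_col].str.contains(kw, case=False, na=False)
    return df[mask]

def proportional_sample(df, n, intent_col="intent", seed=42):
    """Random sub-sample of ~n rows preserving per-intent ratios."""
    frac = n / len(df)
    return (
        df.groupby(intent_col, group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

# Example: mine candidate payment utterances, then pull a 1,000-row
# stratified sample for review (file name is hypothetical).
# df = pd.read_csv("production_utterances.csv")
# payments = phrases_with_keywords(df, ["one time", "payment"])
# sample = proportional_sample(df, 1000)
```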
I used all the information gleaned from the data analysis described above to refine and optimize the intent structure, eventually reducing the total number of intents from 65 to 55.
I conducted mini-experiments using Rasa's built-in testing tools to validate each intent change and ensure that performance was maintained or improved.

To illustrate with an example, one change to the intent structure combined the intents "suspend_service" and "remove_subscriber" into a single intent, as the data belonging to each was too similar to represent separate intents.

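For instance, `rasa test nlu` (optionally with `--cross-validation`) writes per-intent metrics to `results/intent_report.json`. A small script along these lines (illustrative; the paths are hypothetical) can compare F1 before and after a change like the merge above:

```python
import json

def f1_by_intent(report_path):
    """Per-intent F1 scores from a Rasa intent_report.json."""
    with open(report_path) as f:
        report = json.load(f)
    return {
        name: scores["f1-score"]
        for name, scores in report.items()
        if isinstance(scores, dict) and "f1-score" in scores
    }

before = f1_by_intent("results_before/intent_report.json")
after = f1_by_intent("results_after/intent_report.json")
for intent in sorted(set(before) | set(after)):
    delta = after.get(intent, 0.0) - before.get(intent, 0.0)
    print(f"{intent:35s} {delta:+.3f}")
```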
As the project matured and our client continued to pursue the goal of maximizing the level of self-service available to their customers, we needed to add new functionality to the IVR system.
On average, we added 1-2 new features every other month.
Each new application feature typically required either a new intent, or supplemental data added to an existing intent.
As the NLU owner, I needed to design the most appropriate intent/entity combination for each new feature.
However, this can be very challenging for new features, since there is no real utterance data yet to train a new intent on... because callers can't ask for functionality that doesn't exist yet :)
So, I developed the following two-phase process for introducing new features/intents into an NLU-based system.
Process for Adding a New Intent
Phase 1: Add New Intent
Start with "synthetic" data
Add an intent to the NLU model as a "beta" version of the new feature.
Procedure:
- Research
  - If possible, conduct user interviews, user testing, or surveys to elicit phrase data as close as possible to what real users would say.
  - Curate any materials that can be used for research. Examples: marketing materials that users might encounter, similar functionality in previous systems, competitor analysis, etc.
  - Users' mental models are always influenced by what they've seen, heard, and read. This will be reflected in how they word input for the new feature/intent.
- Generate Initial Training Data
  - Generate a list of training phrases that you predict are representative of what users will actually say.
  - Sometimes you have no choice but to generate synthetic or manufactured data to stand up the new intent.
  - ChatGPT (or another LLM) can be very helpful during generation, covering a variety of wordings right off the bat (see the sketch after this list).
- Vet Initial Data
  - Use focus groups, peer feedback, etc. to vet the initial training phrases.
  - Use Wizard of Oz testing to perform some initial validation.
  - Get the initial list of phrases approved by the client.
- Train Model and Monitor Traffic
  - Once real traffic is hitting the NLU model with the new intent, monitor traffic to that intent.
  - Determine the proportion of traffic expected to hit the new intent. This informs how long it will take to collect sufficient supplemental data (roughly: days needed ≈ target utterance count / (daily calls × expected traffic share)).
  - Intents predicted to receive lower traffic will require a longer collection period.
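For the LLM-assisted generation step above, a minimal sketch using the openai Python package (the prompt, model name, and example intent are hypothetical, not project specifics):

```python
# Illustrative sketch of LLM-assisted seed-phrase generation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Generate 20 short, varied phrases a caller might say to a wireless "
    "carrier's phone system to ask about adding a new line to their plan. "
    "Vary formality, length, and vocabulary. One phrase per line."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice
    messages=[{"role": "user", "content": prompt}],
)
seed_phrases = response.choices[0].message.content.splitlines()
print(seed_phrases)
```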
Phase 2: Update Intent with Real Data
Retrain NLU with updated data
Update the new intent with real user phrase data and retrain the NLU model.

Procedure:
- Review Metrics
  - Review the number of user utterances/conversations where the new intent was classified with high enough confidence.
  - Review F1 and other metric reports for clues about intent performance.
- Pull User Data
  - Curate utterances from call logs/data reports to supplement the new intent's training data (see the sketch after this list).
- Refine Intent Definition
  - Refine the intent definition based on insights gained from data analysis.
- Retrain NLU Model
  - Incorporate the newly curated data and retrain the NLU model.
  - Gather new metrics on intent performance.
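A sketch of the "Pull User Data" step, assuming a call-log export with predicted-intent and confidence columns (all file, column, and intent names here are hypothetical):

```python
import pandas as pd

# Hypothetical call-log export with columns:
# utterance, predicted_intent, confidence
logs = pd.read_csv("call_logs.csv")

hits = logs[logs["predicted_intent"] == "add_line"]

# High-confidence hits are strong supplemental-data candidates;
# near-threshold hits deserve a manual review for edge-case wordings.
candidates = hits[hits["confidence"] >= 0.80]
borderline = hits[hits["confidence"].between(0.50, 0.80)]
print(len(candidates), "candidates;", len(borderline), "to review")

candidates["utterance"].drop_duplicates().to_csv(
    "add_line_supplemental.csv", index=False
)
```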
NLU & Speech Recognition Analysis
The distinct ML components of a language recognition system must be compatible, both in the language patterns they are trained to recognize and in the way they produce output.
NLU and non-NLU classifiers
In many IVR systems, segments of dialog are separated into "states", and many states use a form of speech recognition called "grammars", which works much more like a dictionary match than an AI's intent prediction.
Our system included an NLU engine at the "main menu" state, while many lower states used grammars.
We needed to make sure that the NLU engine could recognize phrasing used in lower menus, in case a caller became familiar with that wording and used it at the NLU menu.

For example, "one time payment" was not one of the most common phrases used by callers to access the system payment flow. However, that phrase was used in multiple system prompts among the grammar-based payment dialog states, so users were familiar with the phrase.
I collected utterance data with variations on the phrase "one time payment" and added it to the "payment" intent training data.
Additionally, we needed to ensure that the ASR (automated speech recognition), which transcribes callers' spoken phrases to text, could accurately transcribe phrases prevalent in the NLU training data.
Phrases containing the word "autopay" were very prevalent in caller data. At the beginning of our project, even though the NLU was trained well on "autopay", the ASR lacked training for it and was mis-transcribing it as "Otto pay".
I made recommendations for supplemental training for the ASR based on call log analysis like this.
Diagnosing Recognition Errors
I gained a lot of experience analyzing user conversation data for errors and diagnosing which system was the culprit.

The example below shows how a missed user utterance can be caused by problems at different points along the conversational pipeline.
Suppose the user says "Add a line" and receives the "I'm sorry, I didn't get that" response.
In the first variation of the pipeline, the ASR transcribed the phrase correctly and the NLU engine predicted the correct intent ("ADD_LINE") for the phrase. However, the NLU engine did not predict the intent with high enough confidence (0.43 is typically below threshold), so the user is re-prompted.
In the second variation, the ASR mis-transcribed the phrase as "Adeline", which is not a phrase the NLU engine is trained on. The NLU engine therefore could not reach high enough confidence on any intent, and the system again had to re-prompt the user.
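A minimal sketch of the decision logic at play (the threshold value and the stand-in functions are illustrative, not our system's actual components):

```python
# Sketch of the ASR -> NLU -> threshold pipeline; `transcribe` and
# `classify` stand in for the real ASR and NLU components.
CONFIDENCE_THRESHOLD = 0.70  # hypothetical value

def handle_turn(audio, transcribe, classify):
    text = transcribe(audio)             # ASR: audio -> text
    intent, confidence = classify(text)  # NLU: text -> (intent, score)
    if confidence < CONFIDENCE_THRESHOLD:
        return "REPROMPT"  # "I'm sorry, I didn't get that"
    return intent

# Demo with stand-in components: a correct transcript scored at 0.43
# still ends in a re-prompt, just like the mis-transcription case.
result = handle_turn(
    b"...audio...",
    transcribe=lambda a: "add a line",
    classify=lambda t: ("ADD_LINE", 0.43),
)
print(result)  # REPROMPT
```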
Audio Creation & TTS
In addition to my primary role as NLU analyst, I also gained a lot of experience designing and creating the audio for the IVR system's prompts.
Working with Voice Talent
I got to assist in coaching our voice talent, who was our client's proprietary brand voice.
We held two separate sessions with him, attending via video call to listen in on his live recording.
I assisted in coaching him to manipulate his vocal emphasis, speed, intonation, etc. to align with our client's brand guidelines.
The phrases recorded in these sessions were part of a large data set used to train a custom text-to-speech model (third party) to produce synthetic voice identical to our talent's voice.
Prompt creation using Custom TTS
I gained a lot of experience creating audio files using the custom text-to-speech model.
This model took input in the form of textual phrases and generated .WAV files with the spoken phrase audio.
The TTS model tooling allowed us to manipulate the style, intonation, speed, etc. of the generated speech to a certain extent, but its capability was limited.
I also frequently had to define custom pronunciations within the TTS tool using IPA (the International Phonetic Alphabet) to get the TTS model to produce the desired pronunciation for certain words.
For example, one word that the TTS model had difficulty with was "Mountain" (in the phrase "Mountain Time") -- the 't' sound was absent in the default pronunciation.
I achieved better generated pronunciation by defining an IPA pattern that made the /t/ explicit, along the lines of ˈmaʊntən.
Audio Editing - Style Guides
I often had to edit prompt audio files in Audacity, even after generating them with the TTS tooling, for several reasons.

One important editing task was compressing the .WAV files to the format required by our telephony platform.

I also performed some editing to ensure the prompts followed brand and style guidelines.

Some of our technical audio style guidelines:
- all commas should receive a 0.25-second pause
- remove all sounds of the speaker breathing/inhaling before their next word
- remove any distortions or clicking sounds (sometimes produced by TTS)
- leading fragments should have a 0.5-second trailing pause
- ending fragments should have no leading or trailing pauses
- middle fragments should have a 0.5-second trailing pause
I adjusted audio prompts to comply with the above guidelines.
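I did this work by hand in Audacity, but as an illustrative sketch, a couple of these rules could also be scripted with pydub (assumes ffmpeg is installed; the 8 kHz mono μ-law target is a common telephony format, not necessarily our platform's exact spec):

```python
from pydub import AudioSegment

prompt = AudioSegment.from_wav("prompt.wav")  # hypothetical file

# Leading/middle fragments get a 0.5-second trailing pause.
prompt = prompt + AudioSegment.silent(duration=500)  # milliseconds

# Downsample to a common telephony format: 8 kHz, mono, u-law.
prompt = prompt.set_frame_rate(8000).set_channels(1)
prompt.export("prompt_telephony.wav", format="wav", codec="pcm_mulaw")
```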
Cross-Channel Transition
Our client is beginning the transition to cross-channel, and so our team has begun developing a chat-based version of this IVR system, with a visual chat interface.
Can we use the same NLU model?
We have started discussions about how to approach the language model aspect of the chat channel. I've been asked to assess whether we can re-use the IVR system's NLU model for chat.
My answer is:
- We can use the IVR NLU model to bootstrap the chat model...
- But we must collect true chat utterances to supplement the existing training data.
The structure of the chat language model will be the same as the IVR channel's -- it should have the same intents and entities and should support equivalent linguistic coverage.
Thus, much of the intent training data can be identical.

However, a user phrase typed through a chat channel, even when its linguistic signal aligns well with an intent category in the IVR model's training data, will differ in style enough that we should account for the differences in the training data.
See below for important considerations for handling chat input when transitioning from voice to a chat channel.
User Input: How Chat differs from Voice
These are aspects of input that differ between chat and voice channels, due to the differing natures of each interaction:
- User phrases are generally shorter via chat.
- Spelling mistakes ("typos") are introduced with typed input, especially on mobile; relatedly, autocorrect can contribute as well.
- Users often submit input before they intend to (hitting "enter" too early).
- Text shortcuts are common (e.g., "thru", abbreviations, acronyms, "u" vs. "you").
- It is much easier for users to enter "garbage" repeatedly (entering "g98h23nbkpsdzo" over and over).
- Profanity/lewdness is more common in chat than in voice.
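These differences suggest some light pre-processing of chat text before it reaches the NLU model. A sketch (the shortcut list and gibberish heuristic are illustrative, not from the project):

```python
# Illustrative shortcut expansions; a production list would be curated
# from real chat traffic.
SHORTCUTS = {"u": "you", "thru": "through", "pls": "please"}

def normalize_chat(text):
    """Light normalization of typed chat input before NLU."""
    words = text.lower().split()
    return " ".join(SHORTCUTS.get(w, w) for w in words)

def looks_like_garbage(text, min_vowel_ratio=0.2):
    """Crude heuristic: almost no vowels suggests mashed keys."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return True
    vowels = sum(c in "aeiou" for c in letters)
    return vowels / len(letters) < min_vowel_ratio

print(normalize_chat("can u pay thru autopay"))  # can you pay through autopay
print(looks_like_garbage("g98h23nbkpsdzo"))      # True
```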
Omnichannel Nielsen Norman Course
I wanted to learn more about designing for omnichannel experiences, so I took the Nielsen Norman course "Omnichannel Journey & Customer Experiences" in March 2023.
I wrote up a summary deck to present the information I learned to my colleagues. Click below to see the deck and more information about my experience with the course.
One of the most notable things I learned from this course was the technique of creating an "Asset Map", which involves overlaying artifacts from all channels at each interaction point along a flow to ensure consistency.
I thought it could be helpful to use learnings from this course, including creating a mini Asset Map, to explore a specific interaction that was causing callers trouble.
Pain Point: Repeat Confirmation Number
A big pain point for IVR callers in the payment flow was getting their confirmation number repeated. This issue was uncovered while helping our client begin their cross-channel journey.
I've written a case study to flesh this out -- click below.