Everyone can do customer support quality assurance (QA), but few truly do it well.
The objective of today's article is to walk through a couple of the more technical aspects of tuning AI for the QA process.
Bad QA Process, Bad AI
We start by cleaning up the QA questions that make up one's Internal Quality Score (IQS)^.
In today's era of AI-led QA, having a solid first-principles foundation for QA is more important than ever. AI amplifies whatever you feed it, and the last thing you want is to amplify logic gaps within your QA process.
The most important thing in designing a QA question is:
A clear, objective, measurable definition of what is good and what is bad
If applicable, a clear definition of what "it depends on the context" means
Let's take two example questions:
Question 1: The agent had good energy with the customer.
Question 2: The agent used correct spelling and grammar.
Intuitively, we can tell that Question 2, unlike Question 1, can be universally defined and agreed upon even amongst different human evaluators, and will likely lead to similar results.
That's because Question 2 has a clear, objective standard that can be measured against, while Question 1 does not. AI, like human evaluators, requires clarity and objectivity to produce consistent results.
Types of QA Questions
The good news, though, is that QA questions can generally be broken down into a few types. Let's walk through the typical question types, and then touch on how to "translate" them into "AI-speak".
First, QA questions can be broken down into the following output types:
Binary questions – answer is either yes/no
Rating questions – answer is a point on a spectrum of possibilities, e.g. a 1 to 10 rating or a percentage
Choice questions – answer is one or multiple options from a pre-defined list
Open-ended questions (rare) – a wide variety of possible answers, no pre-defined list
But that's just the output. Next, let's consider the possible factors that a QA question measures in order to arrive at an output:
| Type | Measured On | Output | Example |
| --- | --- | --- | --- |
| Binary | Clear, objective factor with inherently defined standards | Yes/No | Did the agent refer to the customer by the correct name? |
| Binary | Clear, objective factor with pre-defined standards | Yes/No | Did the agent have fewer than 3 spelling/grammar mistakes? |
| Binary | Subjective factor | Yes/No | Did the agent demonstrate good energy on this ticket? |
| Rating | Clear, objective factor with mathematically defined standards | Spectrum (e.g. 1 to 10) or percentage | Out of all the times the agent referred to the customer, what percentage of the time did he/she use the correct name? |
| Rating | Subjective factor | Spectrum (e.g. 1 to 10) or percentage | How well did the agent address the customer's concern? |
| Choice | Clear, objective factor with pre-defined standards | One or multiple pre-defined options | Which topic(s) are relevant to this ticket? "Relevant" is defined as the customer inquiring about an issue or requiring some clarification of that topic. |
| Choice | Subjective factor | One or multiple pre-defined options | Which topic(s) do you feel you would like to tag this ticket with? |
| Open-ended | Left out, as IQS programs today typically avoid open-ended questions | | |
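To make these categories concrete, here is a minimal sketch of how such questions could be represented as data before being handed to an AI pipeline. It's written in Python; the class and field names (and the example topic options) are illustrative assumptions, not a fixed schema we prescribe.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class OutputType(Enum):
    BINARY = "yes/no"
    RATING = "spectrum or percentage"
    CHOICE = "one or more pre-defined options"


@dataclass
class QAQuestion:
    text: str                       # the question as shown to evaluators
    output_type: OutputType         # how the answer is expressed
    objective: bool                 # objective factor vs. subjective judgement
    standard: Optional[str] = None  # pre-defined or inherently defined standard, if any
    options: list[str] = field(default_factory=list)  # for CHOICE questions


# Examples drawn from the table above
questions = [
    QAQuestion(
        text="Did the agent refer to the customer by the correct name?",
        output_type=OutputType.BINARY,
        objective=True,
        standard="customer name on the ticket",
    ),
    QAQuestion(
        text="Did the agent have fewer than 3 spelling/grammar mistakes?",
        output_type=OutputType.BINARY,
        objective=True,
        standard="fewer than 3 mistakes",
    ),
    QAQuestion(
        text="Which topic(s) are relevant to this ticket?",
        output_type=OutputType.CHOICE,
        objective=True,
        options=["billing", "refunds", "shipping"],  # illustrative list only
    ),
]
```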
Ideally, QA questions should always measure a clear, objective factor with either (a) pre-defined or (b) inherently defined standards. Only QA questions designed this way can counteract emotional bias, human fatigue, forgetfulness and the like to improve consistency.
With AI, this also holds true. AI excels at reasoning and pattern recognition, and giving it crystal-clear directions ensures optimal consistency and results.
Translating QA Questions Into AI Chains
Now that we've exposed the logic that we want to "teach" our AI, the next step is to adjust it to what I term the quirks of AI.
AI has its own quirks. For example, AI excels at single-prompt, single-task problems, but performance starts to degrade as complexity stacks up. Break that complex problem down into single-task steps in a chain, however, and performance largely recovers.
To give another example, a prompt can work really well with one AI model but not with another. This has to do with the way different AI models are trained. I won't dive into the technical complexities here, but suffice it to say, one really has to know these nuances in order to design a highly performant AI system.
You can read more about some of the more advanced technical stuff that we do under the hood here.
Applying these to the customer support QA space, some typical steps we will take are:
break down a vague question into clearer, more objective sub-questions, and
convert definitions of "good" into something quantifiable (e.g. fewer than 3 spelling errors, 80% of relevant processes followed)
build multi-step LLM chains (i.e. the "AI") that can answer these specific, quantifiable questions (see the sketch after this list)
this includes using both open-source and closed-source paid models (such as OpenAI's) at different steps to optimise for cost, effectiveness and speed
build validation LLM chains that double-check the accuracy of answers
test out our LLM chains on historical IQS data to validate accuracy and find edge cases
iterate and improve our LLM chains using the edge case scenarios, rinse and repeat
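To illustrate the chain-building and validation steps above, here is a minimal sketch of what a small chain with a validation step might look like. The `call_llm` helper and the model names are placeholders for whichever open- or closed-source models you route each step to; this is a sketch of the pattern, not our production code.

```python
def call_llm(prompt: str, model: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text reply."""
    raise NotImplementedError("wire this up to your model provider of choice")


def count_spelling_errors(ticket_text: str) -> int:
    # Step 1: a narrow, single-task prompt (a cheap model is usually enough here).
    reply = call_llm(
        "Count the spelling and grammar mistakes in the agent's messages below. "
        f"Reply with a single integer only.\n\n{ticket_text}",
        model="small-open-source-model",  # hypothetical model name
    )
    return int(reply.strip())


def answer_binary_question(ticket_text: str) -> bool:
    # Step 2: apply the quantifiable standard ("fewer than 3 mistakes") in code,
    # not in the prompt, so the pass/fail logic stays deterministic.
    return count_spelling_errors(ticket_text) < 3


def validate(ticket_text: str, proposed_answer: bool) -> bool:
    # Validation chain: a second model double-checks the first answer.
    reply = call_llm(
        "An evaluator claims the agent made "
        f"{'fewer than' if proposed_answer else 'at least'} 3 spelling/grammar "
        "mistakes in the messages below. Reply YES if that claim looks correct, "
        f"otherwise NO.\n\n{ticket_text}",
        model="stronger-closed-source-model",  # hypothetical model name
    )
    return reply.strip().upper().startswith("YES")
```

The worked examples below show how the first two steps, breaking questions down and making them quantifiable, play out on real QA questions.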
For example:
Q: How well did the agent answer this ticket?
What are we trying to measure?
How completely the agent explained the necessary information
How do we measure completeness?
We compare what the agent said against a "golden" list of things that should be said for that particular topic
What situational exceptions might arise?
In the "golden" list of things that the agent should have said, we should exclude irrelevant things that the customer is not concerned about
What is considered a good measure for completeness?
If the agent hits 80% of what should have been said in the "golden" list, we consider that the agent answered the ticket well
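As a rough sketch, here is how that completeness check could be scored once each "golden" item has been marked as covered or not. The topic, items and threshold below are illustrative placeholders, not a real golden list.

```python
# Illustrative golden list for one topic; in practice this comes from
# internal documentation for each topic.
GOLDEN_LIST = {
    "refund_request": [
        "confirm the order number",
        "state the refund eligibility window",
        "explain how long the refund takes to appear",
        "offer to escalate if the refund does not arrive",
    ]
}


def completeness_score(covered: dict[str, bool]) -> float:
    """`covered` maps each *relevant* golden item to whether the agent said it."""
    if not covered:
        return 1.0  # nothing relevant had to be said
    return sum(covered.values()) / len(covered)


def answered_well(covered: dict[str, bool], threshold: float = 0.8) -> bool:
    # The 80% bar from the example above.
    return completeness_score(covered) >= threshold


# Usage: items the customer never asked about are dropped before scoring
# (the situational exception above), then the remaining ones are checked.
covered = {
    "confirm the order number": True,
    "state the refund eligibility window": True,
    "explain how long the refund takes to appear": False,
}
print(answered_well(covered))  # False: 2 of 3 items is about 67%, below 80%
```

The same ticket question can also be scored along a second dimension, how well the agent managed the customer's emotions, which the next set of sub-questions walks through.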
What are we trying to measure?
The emotional management of the customer
How do we measure emotions of the customer?
We run sentiment analysis of the customer's responses from start to end of the ticket.
What is considered a good measure?
Sentiment "de-escalates" either from negative to neutral, or better yet, negative to positive
What situational exceptions might arise?
Sometimes, a customer is just in a foul mood. A negative to negative sentiment change does not always indicate the agent did a bad job.
What do we do to run a second layer of analysis to see if it was the agent's fault?
Did the agent say anything that caused the "intensity" of the negative sentiment to drastically increase (versus a general negative baseline)?
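A minimal sketch of that de-escalation check, assuming a `sentiment` helper that scores text from -1.0 (very negative) to +1.0 (very positive); the tolerance value is an illustrative placeholder, not a recommended setting.

```python
def sentiment(text: str) -> float:
    """Placeholder: run your sentiment model of choice over the text."""
    raise NotImplementedError


def de_escalated(customer_messages: list[str],
                 intensity_tolerance: float = 0.3) -> bool:
    """True if the customer's sentiment improved, or at least did not
    drastically worsen, between their first and last message."""
    first = sentiment(customer_messages[0])
    last = sentiment(customer_messages[-1])
    if last >= first:
        return True  # negative -> neutral/positive, or already fine
    # Exception handling: a customer who stays negative is not automatically
    # the agent's fault; only flag tickets where negativity intensified well
    # beyond the starting baseline.
    return (first - last) <= intensity_tolerance
```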
Q: Did the agent articulate follow-up steps throughout the ticket, and give the correct timelines based on the proposed action(s)?
We break this up into two sub-questions:
1(a): Did the agent articulate post-incident follow-up steps?
1(b): If so, did the agent give the correct timelines based on the proposed action(s)?
What are we trying to measure?
1(a) is an objective factor with an inherently defined standard
1(b) is an objective factor with pre-defined standards (e.g. based on internal documentation)
What is considered a good measure?
1(a) → yes
1(b) → depends on internal documentation
What situational exceptions might arise?
1(a): No need for follow-up steps if:
an issue is resolved
the follow up action is by the customer (unless agent is required to check back periodically)
the agent has acknowledged that the timeline for the action is uncertain but has committed to get back to the customer within a specific time period, and the customer accepted this as sufficient
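One straightforward way to handle such exceptions is to encode them directly in the prompt for sub-question 1(a). The sketch below assumes the same hypothetical `call_llm` helper from the earlier chain sketch; the exception wording simply mirrors the list above.

```python
FOLLOW_UP_PROMPT = """You are reviewing a customer support ticket.
Question: Did the agent articulate post-incident follow-up steps?

Answer "N/A" instead of "NO" if any of these exceptions apply:
1. The issue was fully resolved within the ticket.
2. The next action sits with the customer (and the agent is not required
   to check back periodically).
3. The agent acknowledged the timeline is uncertain but committed to get
   back to the customer within a specific time period, and the customer
   accepted this as sufficient.

Reply with exactly one of: YES, NO, N/A.

Ticket:
{ticket_text}
"""


def check_follow_up(ticket_text: str) -> str:
    # Single-task prompt; the exceptions live in the prompt, the parsing in code.
    reply = call_llm(FOLLOW_UP_PROMPT.format(ticket_text=ticket_text),
                     model="your-model-of-choice")  # hypothetical model name
    return reply.strip().upper()
```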
As you can tell, one key challenge in implementing AI for an IQS QA process lies in defining exception situations. To minimise these sorts of problems, we often use historical QA results to find edge cases and cover them during the AI training phase.
This process typically continues even after our system goes "live", so our AI systems effectively get sharper over time. While this requires some effort to guide the AI system initially, the AI system will never forget what it has learned and scales effortlessly. Best of all, our AI systems are completely unaffected by people quitting and new people coming onboard. It's really like a superstar customer support manager who never leaves.
A Little About The Need to Monitor AI's Output
Before I end, I just wanted to touch briefly on why continuous monitoring of AI systems is necessary (which we do, not to worry!).
Building AI is unique in the sense that it's a lot about making (very technical and) educated guesses about its "knobs and dials", and testing them against what we term a "golden" dataset to see how well the system performs.
Once we find the right "knobs and dials", we need to continuously monitor the system in production and watch for deviations that exceed our tolerance thresholds. When they do, we step in to make fixes and improve the AI model.
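As a rough sketch of that loop: compare the AI's answers against human grades on the golden set, and flag when agreement drifts below a tolerance threshold. The 95% figure below is purely illustrative, not our actual setting.

```python
def agreement_rate(ai_answers: list[bool], human_answers: list[bool]) -> float:
    """Fraction of golden-set tickets where the AI matches the human grade."""
    matches = sum(a == h for a, h in zip(ai_answers, human_answers))
    return matches / len(human_answers)


def needs_review(ai_answers: list[bool], human_answers: list[bool],
                 tolerance: float = 0.95) -> bool:
    """True when deviations exceed tolerance and a human should step in to
    inspect edge cases and re-tune the chain."""
    return agreement_rate(ai_answers, human_answers) < tolerance
```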
In this way, building AI is 85% science and 15% art. But what a powerful tool it is when implemented correctly!
Conclusion
To conclude, we help companies that are looking to infuse AI into their QA process and alleviate the (expensive) manpower burden. Managers should spend their time adding value: articulating the voice of the customer and coaching customer support staff based on pre-prepared data, not manually and tediously reviewing one ticket conversation after another.
We do everything end-to-end, and you don't even need to provide us with a single engineer. We believe actions speak louder than words, so let us run an analysis on a sample of your tickets today for you to consider before taking things any further!
_______
^ An Internal Quality Score (IQS) is a metric used in Customer Support that measures your team's ability to meet customer service expectations. By establishing a quality assurance program, you can accurately assess IQS and identify areas for improvement.