End-to-end automation testing of an Alexa skill (conversational AI testing)

Nitin Verma
6 min read · Jun 9, 2021

Testing has evolved considerably over the years for traditional software, and there is no doubt that AI is now a hot topic. Most major sectors are focusing on improving their customer service by integrating AI into their products and services, reaching a wider audience through channels such as chatbots, virtual assistants and voice assistants.

Like any other software, conversational AI can be tested with automation, and many tools for it are available open source along with a few paid solutions.

At the same time, AI is new to many industries, which makes shift-left testing and early automation implementation important for delivering a successful product.

Despite the varied activity in this space, AI is still in an exploratory phase and largely an untested customer value proposition. As people become more comfortable with the technology, designers need to pay greater attention to how AI fits into the wider user experience, and key design considerations for automating this emerging technology will continue to surface.

What is Conversational AI?

Conversational AI is the technology that enables machines to interact naturally with humans through language. It is a subset of artificial intelligence that leverages concepts such as neural networks and machine learning, and makes them available for building useful applications: hands-free control while you are driving or at home, Alexa waiting for your command, or virtual agents that assist with customer support over the phone.

How does it work?

Conversational AI is a fusion of technologies such as Automatic Speech Recognition (ASR), machine learning, Natural Language Processing (NLP) and Natural Language Understanding (NLU), which process every written or spoken word, work out the best way to respond and learn from every user interaction.

A conversational AI flow can be broken down into three stages:

· Automatic Speech Recognition (ASR)

· Natural Language Processing (NLP) or Natural Language Understanding (NLU)

· Text-to-Speech (TTS)

Conversational flow

What are Alexa skills?

Just as we have apps for smartphones, Alexa devices have skills, which can be enabled and disabled from the Alexa app.

Keywords in the Alexa Skills Kit:

  • Wakeword: Echo devices have a ring of always-on microphones, meaning the device is always in listening mode. It will only 'wake up' and actively pay attention to you when it hears a specific word or phrase, called a wakeword.

- For Alexa: 'Alexa', 'Echo' and 'Computer' are wakewords

- For Google Assistant: 'Ok Google'

  • Invocation Name: An 'invocation name' is the word or phrase used to trigger your skill. For example:

- Alexa, play music

- Alexa, ask doctor connect

  • Intent: An intent represents the action a user is trying to accomplish
  • Utterance: Utterances are the specific phrases people use when making a request to a voice app or Alexa
  • Slots: A slot is a variable tied to an intent that allows Alexa to capture additional information about the request

Utterance example:

Utterance Example
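To make these concepts concrete, here is a rough sketch of what a small piece of the skill's interaction model could look like in the Alexa Skills Kit JSON format. The intent name, slot type and sample utterances shown are illustrative assumptions, not the actual Doctor Appointment model.

```json
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "doctor connect",
      "intents": [
        {
          "name": "BookAppointmentIntent",
          "slots": [{ "name": "doctorType", "type": "DOCTOR_TYPE" }],
          "samples": [
            "book an appointment",
            "book an appointment with a {doctorType}",
            "I want to see a {doctorType}"
          ]
        }
      ],
      "types": [
        {
          "name": "DOCTOR_TYPE",
          "values": [
            { "name": { "value": "dentist" } },
            { "name": { "value": "cardiologist" } }
          ]
        }
      ]
    }
  }
}
```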

So far, I have completed three different successful POCs on conversational AI automation.

  1. Botium Connector for Amazon Alexa Skills Management API
  2. Botium Alexa AVS connector
  3. Custom framework/tool to test Alexa skills end to end

I will walk through each of them one by one.

AUT (Application Under Test) summary:

A custom Alexa skill, Doctor Appointment, was created; it helps users book an appointment with a doctor.

Custom Skill

Botium Connector for Alexa Skills Management API (SMAPI):

This verifies the utterances mapped to each intent. Using it, one can test an Alexa skill programmatically instead of testing it through the Alexa Skills Kit developer console.

  • It uses the Skill Invocation API to invoke your skill for testing purposes
  • The Skill Simulation API helps to test the skill and see which intent a simulated device resolves from your interaction model
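
As a rough illustration of how such a Botium setup is wired together, a botium.json and a convo file might look like the following. This is a sketch under assumptions: the capability names (e.g. ALEXA_SMAPI_SKILLID, ALEXA_SMAPI_REFRESHTOKEN) should be checked against the connector's README, and the skill ID, token and bot reply are placeholders.

```json
{
  "botium": {
    "Capabilities": {
      "PROJECTNAME": "Doctor Appointment Skill",
      "CONTAINERMODE": "alexa-smapi",
      "ALEXA_SMAPI_API": "simulation",
      "ALEXA_SMAPI_SKILLID": "amzn1.ask.skill.xxxxxxxx-xxxx",
      "ALEXA_SMAPI_LOCALE": "en-US",
      "ALEXA_SMAPI_REFRESHTOKEN": "Atzr|..."
    }
  }
}
```

A Botium convo file then describes the expected exchange, and Mocha (via botium-bindings) runs one test per convo:

```
book an appointment

#me
ask doctor connect to book an appointment

#bot
Which doctor would you like to see?
```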

Demo:

Result:

Automation execution report

Tools and technology:

  • Node.js, AVS, SMAPI, Mocha

Botium Connector for Amazon Alexa with AVS:

This tool can also be used to test Alexa skills. For this, one has to register a virtual Alexa device in the Alexa Voice Service (AVS) portal. It works the same way as a physical Alexa device. It is not bound to any particular Alexa skill, so you have to activate your skill with its activation utterance, such as "Alexa, order a pizza".

For a single-level conversation it works flawlessly, for example "Alexa, turn the light on". But when it comes to testing a multilevel conversation, as in this case (the Doctor Connect skill), it fails, because it validates the Alexa response at every step while Alexa waits only 8 seconds (its default) for user input. If the user fails to respond within that window, the skill session expires and Alexa drops out of the skill's context.
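
For illustration, a multilevel convo of the kind that exposed this limitation might look like the sketch below; the bot replies are hypothetical placeholders for the actual Doctor Connect prompts. The AVS connector handles the first exchange, but by the time the spoken response has been transcribed and validated, the 8-second window has usually elapsed and the next #me step no longer reaches the skill.

```
doctor connect multilevel conversation

#me
alexa, ask doctor connect to book an appointment

#bot
Which doctor would you like to see?

#me
a dentist

#bot
What date works for you?
```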

Note: the Botium AVS connector is not meant to handle multilevel conversations, as Alexa moves out of the skill's context while the transcription job is still in progress.

So I raised an issue (https://github.com/codeforequity-at/botium-bindings/issues/124) on GitHub, in the botium-bindings repository maintained by Florian Treml, co-founder and CTO of Botium. Here was his response:

His response was a real encouragement for me. Thank you, Florian Treml.

Demo:

Result:

Execution report and console logs

Tools and technology:

  • Node.js, AVS, the Botium AVS connector, Amazon Polly, Amazon Transcribe, Mocha

Custom Framework:

I have created a custom tool and framework for testing multilevel Alexa skill conversations end to end.
To make it work, you need to register a new product in the Alexa Voice Service console. This registered product behaves the same as a physical Alexa device. It is not restricted to any particular Alexa skill and can invoke a skill by its invocation name, so you can use this framework to test production skills as well.

How does it work?

  • Utterances are converted from text to speech into .mp3 files using Amazon Polly
  • Each converted .mp3 is then fed to the Alexa Voice Service (AVS) via an Alexa client, which returns Alexa's response in the same format, i.e. .mp3
  • The response audio is then converted into a .wav file using ffmpeg and stored in an S3 bucket
  • Finally, Amazon Transcribe turns the stored .wav file into text, which is verified against the expected response text
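
The sketch below shows roughly how these steps could be wired together with the AWS SDK for Node.js. It is a minimal illustration rather than the actual framework code: the region, voice, bucket and helper names are assumptions, and the AVS step is only stubbed because that client is custom.

```javascript
const AWS = require('aws-sdk');
const fs = require('fs');
const { execSync } = require('child_process');

const polly = new AWS.Polly({ region: 'us-east-1' });
const s3 = new AWS.S3({ region: 'us-east-1' });
const transcribe = new AWS.TranscribeService({ region: 'us-east-1' });

// 1. Text-to-speech: turn the test utterance into an .mp3 with Amazon Polly
async function utteranceToMp3(text, outFile) {
  const res = await polly.synthesizeSpeech({
    Text: text,
    OutputFormat: 'mp3',
    VoiceId: 'Joanna'
  }).promise();
  fs.writeFileSync(outFile, res.AudioStream);
}

// 2. sendToAvs(inputMp3, outputMp3) would stream the utterance audio to the
//    Alexa Voice Service via the custom Alexa client and save Alexa's spoken
//    reply as another .mp3 (implementation omitted; it is specific to the framework).

// 3. Convert Alexa's .mp3 reply to .wav with ffmpeg and upload it to S3
async function mp3ToWavOnS3(mp3File, wavFile, bucket, key) {
  execSync(`ffmpeg -y -i ${mp3File} ${wavFile}`);
  await s3.upload({
    Bucket: bucket,
    Key: key,
    Body: fs.createReadStream(wavFile)
  }).promise();
  return `s3://${bucket}/${key}`;
}

// 4. Speech-to-text: start an Amazon Transcribe job, poll until it finishes and
//    return the transcript URI, whose text is compared with the expected response
async function wavToText(jobName, s3Uri) {
  await transcribe.startTranscriptionJob({
    TranscriptionJobName: jobName,
    LanguageCode: 'en-US',
    MediaFormat: 'wav',
    Media: { MediaFileUri: s3Uri }
  }).promise();

  let job;
  do {
    await new Promise(resolve => setTimeout(resolve, 5000));
    job = (await transcribe.getTranscriptionJob({
      TranscriptionJobName: jobName
    }).promise()).TranscriptionJob;
  } while (['QUEUED', 'IN_PROGRESS'].includes(job.TranscriptionJobStatus));

  return job.Transcript.TranscriptFileUri;
}
```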

Workflow:

Framework flow

Demo:

Result:

Execution report of custom framework
