Voice assistants have advanced considerably in recent years. Building on that progress, in 2019 we introduced our voice shopping assistant for Walmart customers on Google Assistant and Apple Siri. At Walmart, our mission is to build the best shopping assistant in the world and help our customers save time and money. Our vision is to take over the tedious parts of shopping, so customers no longer have to comb through stores and websites themselves. Realizing that vision matters even more in today’s fast-paced life.
Building a voice-based assistant is challenging. It is especially tough in domains such as retail, where the product taxonomy is both wide (it spans many categories of products) and deep (each category contains many types of products). Another major challenge in building voice-based shopping assistants is to precisely understand an utterance in a dialog when the user provides little explicit context. For example:
if a user says ‘add bananas’ and follows up with ‘five’, then she intends to add five bananas to her cart.
if a user says ‘five’ at the beginning of her conversation with the shopping assistant, then her intention is unknown (i.e., it does not represent any e-commerce action at the start of a conversation).
if a user says ‘set a pickup slot’ and follows up with ‘five’, then she intends to set a pickup slot for 5 am or 5 pm, probably whichever is closest to the time of the request.
Handling such scenarios requires the Natural Language Understanding (NLU) component of an assistant to utilize context (what transpired previously in the conversation) while predicting the intent associated with an utterance. In this work, we integrated inter-utterance contextual features into the NLU component of our voice shopping assistant at Walmart to improve its intent classification.
Basic Components of a Voice Assistant and Role of NLU
A voice assistant has four main components: Speech-to-Text, NLU, Dialog Management (DM) and Text-to-Speech. The NLU component identifies the intent(s) and entities in a user utterance. The dialog manager uses the output of the NLU component to prepare a suitable response for the user.
The NLU systems in currently available voice-enabled shopping assistants do not focus on inter-utterance context, so the onus of contextual disambiguation lies on the dialog manager. Although it is possible to capture a small number of such cases in the dialog manager, this approach does not scale to a large number of contextually dependent utterances. For example, consider again the user utterance ‘five’ following a request to add something to the cart. A dialog manager can predict its intent with the rule:
IF the previous intent was ‘add to cart’ AND the current query is an integer
THEN the current intent is ‘add to cart’
ELSE the current intent is ‘unknown’
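As a concrete illustration, the rule above can be written in a few lines of Python. This is a minimal, stdlib-only sketch, not the production dialog manager; the intent labels and the tiny word-to-number table are illustrative assumptions.

```python
# Words we treat as bare numbers; a real system would cover far more.
WORD_NUMBERS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def is_integer(utterance):
    """True if the utterance is a bare integer such as '5' or 'five'."""
    text = utterance.strip().lower()
    return text.isdigit() or text in WORD_NUMBERS

def rule_based_intent(utterance, previous_intent):
    """Dialog-manager rule: a bare number after 'add to cart'
    keeps the 'add to cart' intent; otherwise it is unknown."""
    if previous_intent == "add to cart" and is_integer(utterance):
        return "add to cart"
    return "unknown"
```

For instance, `rule_based_intent("five", "add to cart")` yields `"add to cart"`, while `rule_based_intent("five", None)` yields `"unknown"`.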
But such a general rule cannot be written for many other queries without overfitting to the exact words used in the conversation. For example,
‘organic please’ (where the previous intent was ‘add to cart’ and the current intent is ‘filter search results’) and
‘stop please’ (where the previous intent was ‘add to cart’ and the current intent is ‘stop conversation’).
In this work, we handled the context in the NLU component itself, reducing the contextual disambiguation burden on the dialog manager.
How Did We Use Context?
We built two separate implementations of our approach, each based on a different deep neural network architecture. The details of the two architectures follow.
Bi-LSTM and GRU
The first architecture is inspired by an RNN-based multi-turn QA approach. It has two main components. The first is a Bi-LSTM encoder that generates a vector encoding of the input utterance; this vector is a summarized representation of the utterance, generated from the embeddings of its words. The second component is a GRU layer whose input consists of the output of a feed-forward layer applied to the Bi-LSTM encoding, along with the one-hot encoding of the previous utterance’s intent. The output of this layer (which ends with a Softmax step) is the intent of the current utterance. Figure 1 below shows the architecture diagrammatically.
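To make the data flow concrete, here is a toy NumPy sketch of this architecture. The dimensions and randomly initialized weights are made up for illustration, and a random vector stands in for the real Bi-LSTM encoder output; it only shows how the feed-forward output and the one-hot previous intent are combined into a GRU step whose state is projected to intent probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not the production values).
ENC_DIM, FF_DIM, N_INTENTS, HID = 8, 6, 4, 5
X_DIM = FF_DIM + N_INTENTS  # GRU input: feed-forward output + one-hot intent

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Randomly initialized toy weights (a real model would learn these).
W_ff = rng.normal(size=(FF_DIM, ENC_DIM))
W_z, U_z = rng.normal(size=(HID, X_DIM)), rng.normal(size=(HID, HID))
W_r, U_r = rng.normal(size=(HID, X_DIM)), rng.normal(size=(HID, HID))
W_h, U_h = rng.normal(size=(HID, X_DIM)), rng.normal(size=(HID, HID))
W_out = rng.normal(size=(N_INTENTS, HID))

def gru_step(x, h):
    """One GRU cell update (standard update/reset gate equations)."""
    z = sigmoid(W_z @ x + U_z @ h)
    r = sigmoid(W_r @ x + U_r @ h)
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h))
    return (1 - z) * h + z * h_tilde

def classify(utterance_encoding, prev_intent_id, h):
    """Predict the current intent from the utterance encoding and context."""
    one_hot = np.zeros(N_INTENTS)
    if prev_intent_id is not None:
        one_hot[prev_intent_id] = 1.0
    x = np.concatenate([np.tanh(W_ff @ utterance_encoding), one_hot])
    h = gru_step(x, h)
    return softmax(W_out @ h), h

# Stand-in for the Bi-LSTM encoding of an utterance such as 'five'.
enc = rng.normal(size=ENC_DIM)
probs, h = classify(enc, prev_intent_id=0, h=np.zeros(HID))
```

The GRU state `h` carries across turns, so the same utterance encoding can yield different intent probabilities depending on what came before.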
The encoding layer in Figure 1 above is a module that generates the vector representations of the words in the input utterance. Various off-the-shelf options can be plugged into this layer, including Word2Vec, GloVe, ELMo and BERT. In this work, we experimented with BERT and GloVe to retrieve the initial word embeddings.
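One way to keep the encoding layer pluggable is to hide the embedding model behind a small interface. The sketch below is a hypothetical illustration, not our production code: the class names are assumptions, and the toy encoder simply averages per-word vectors, in the spirit of a GloVe-style lookup.

```python
from abc import ABC, abstractmethod

class Encoder(ABC):
    """Common interface so GloVe-, ELMo- or BERT-based encoders can be swapped."""
    @abstractmethod
    def encode(self, utterance):
        ...

class AveragedWordVectors(Encoder):
    """Toy GloVe-style encoder: average the per-word vectors.
    Unknown words fall back to a zero vector."""
    def __init__(self, vectors, dim):
        self.vectors, self.dim = vectors, dim

    def encode(self, utterance):
        words = utterance.lower().split()
        vecs = [self.vectors.get(w, [0.0] * self.dim) for w in words]
        return [sum(vals) / len(vecs) for vals in zip(*vecs)]

# Two-dimensional made-up vectors, purely for demonstration.
toy_vectors = {"add": [1.0, 0.0], "bananas": [0.0, 1.0]}
encoder = AveragedWordVectors(toy_vectors, dim=2)
vec = encoder.encode("add bananas")  # [0.5, 0.5]
```

Swapping in BERT or ELMo would then only mean providing another `Encoder` implementation; the rest of the pipeline stays unchanged.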
Bi-LSTM and Feed-Forward Layer
Like the Bi-LSTM and GRU architecture above, the second architecture also has two components: a Bi-LSTM layer followed by a feed-forward layer. The output of the Bi-LSTM layer is concatenated with the one-hot encoding of the previous utterance’s intent and fed into the feed-forward layer, which ends with a Softmax step. The output of the feed-forward layer is the most probable intent of the input utterance.
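The concatenate-then-classify step of this second architecture can be sketched as follows. Again, this is a toy NumPy illustration with made-up dimensions and random weights, and a random vector stands in for the Bi-LSTM encoding.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions (assumptions, not the production values).
ENC_DIM, HIDDEN, N_INTENTS = 8, 6, 4

# Randomly initialized toy weights (a real model would learn these).
W1 = rng.normal(size=(HIDDEN, ENC_DIM + N_INTENTS))
W2 = rng.normal(size=(N_INTENTS, HIDDEN))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_intent(utterance_encoding, prev_intent_id):
    """Concatenate the utterance encoding with the one-hot previous
    intent, pass it through a feed-forward layer, and apply Softmax."""
    one_hot = np.zeros(N_INTENTS)
    one_hot[prev_intent_id] = 1.0
    x = np.concatenate([utterance_encoding, one_hot])
    hidden = np.maximum(0.0, W1 @ x)  # ReLU feed-forward layer
    return softmax(W2 @ hidden)

# Stand-in for the Bi-LSTM encoding of the current utterance.
enc = rng.normal(size=ENC_DIM)
probs = predict_intent(enc, prev_intent_id=2)
predicted = int(np.argmax(probs))
```

Compared with the GRU variant, this architecture conditions only on the previous intent rather than on a recurrent state that accumulates over the whole conversation.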
As in the first architecture, we experimented with BERT and GloVe to retrieve the initial word embeddings in the encoding layer.
How Well Did It Work?
The main goal of this work was to improve intent classification in the NLU component of Walmart’s shopping assistant by using inter-utterance context. Accordingly, we tested our approach on real user logs of Walmart’s voice shopping assistant. We found that about 40% of all user-assistant interactions required contextual intent classification. We also found that adding a product to the cart and searching for a product on Walmart were the two most popular contextual interactions, together accounting for about 98% of all contextual interactions. Our best-performing contextual intent classification approach (the Bi-LSTM and GRU implementation with the BERT language model) performed strongly on add-to-cart and search queries, achieving 90% on contextual queries and 87.68% overall.
This work improved the voice assistant’s understanding of add-to-cart queries, which in turn improves the add-to-cart success rate, an important criterion for evaluating an e-commerce application.
Table 1 below presents the experimental results of the two implementations of our approach, with different language models plugged in. More details on the experiments and results are available in our ECNLP 3 article, Improving Intent Classification in an E-commerce Voice Assistant by Using Inter-Utterance Context.
What Do We Conclude?
As hypothesized, the experimental results show that the contextual update in the NLU module improves intent classification.
It also takes the burden of intent-level contextual disambiguation away from the dialog manager. We experimented with two distinct implementations of our approach and compared them using real user logs.
Although our main focus in this work was the contextual disambiguation of intents, entities are also contextually dependent. For example,
‘five’ uttered after ‘add bananas’ may refer to the quantity five, whereas uttered after ‘pick a delivery time’ it may refer to the time of day five (am/pm).
We are currently working on using contextual features to disambiguate entities and improve entity tagging as well.
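In the same rule-style spirit as the intent example earlier, context-dependent entity typing could be sketched as below. This is a hypothetical, stdlib-only illustration; the intent names and entity labels are assumptions, not our implementation.

```python
# Small word-to-number table for the sketch; a real system would cover more.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def tag_number_entity(utterance, previous_intent):
    """Type a bare number as a quantity or a time of day, using the
    previous intent as context. Returns None for non-number utterances."""
    text = utterance.strip().lower()
    if text.isdigit():
        value = int(text)
    elif text in NUMBER_WORDS:
        value = NUMBER_WORDS[text]
    else:
        return None  # not a bare number; defer to other taggers
    if previous_intent == "add to cart":
        return ("quantity", value)
    if previous_intent == "pick a delivery time":
        return ("time_of_day", value)
    return ("number", value)
```

Here `tag_number_entity("five", "add to cart")` yields a quantity, while the same utterance after ‘pick a delivery time’ yields a time of day.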
References and Further Readings
Our work was published in the Proceedings of the 3rd Workshop on e-Commerce and NLP (ECNLP 3), 2020, under the title Improving Intent Classification in an E-commerce Voice Assistant by Using Inter-Utterance Context.
Using Context to Improve Intent Classification in Walmart’s Shopping Assistant was originally published in Walmart Global Tech Blog on Medium.