Introduction:
Task-oriented dialogue systems aim to help users accomplish their daily tasks through natural language, in spoken or written form. In recent years, the widespread use of virtual assistants has also spurred innovative applications such as automated customer service and shopping. Walmart has developed several virtual assistants to support its customers across many domains, including grocery shopping for consumers, customer care, and askSam for store associates, through voice and text channels. Virtual assistants require natural language understanding (NLU) models to identify the user's intent and extract the associated slots (entities) expressed in utterances: the intent classification and slot filling tasks. In this blog, we discuss how to unify the models of virtual assistants from two different domains: voice shopping and store associates. Since these models share commonalities in entities and intents, unifying them reduces maintenance cost and improves the generalization and performance of the unified model on the joint tasks of intent classification and slot filling.
Walmart Virtual Assistants
Virtual assistants for the shopping domain and store associates at Walmart's stores
Walmart has been supporting voice shopping through the Google Assistant and Siri platforms (the shopping domain) since 2018. For example, Walmart's customers can add products to their carts, search for refinements of products and brands, schedule pickup or delivery, or edit their shopping carts using the voice shopping assistant. The shopping assistant must identify the user's intent (e.g., add to cart, checkout, search refinements, etc.) and the entities relevant to that intent (e.g., product, brand, type, time references, etc.).
Walmart's shopping assistant for Text-to-shop
Walmart also built an AI assistant (the askSam app) for store associates so they can serve customers better. For example, associates can ask the virtual store assistant to find the location of items, check the stock status of items, and retrieve the schedules of departments and their associates in stores. The virtual store assistant (the askSam app) uses an ensemble of machine learning and deep learning models to identify the intents and slots expressed in utterances.
Intent Classification and Slot Filling
Intent classification is a classification problem that predicts the intent label of an utterance, while slot filling is a sequence labeling task that tags the input word sequence with a slot label sequence. Training independent intent classification and slot filling models leads to worse performance because the separate models share no representations and generalize poorly. Recently, several joint learning methods for intent classification and slot filling have been proposed that exploit the dependencies between the two tasks and outperform independent models [13]. We also observe that a joint model outperforms the independent models (an intent classifier and a sequence labeler) on retail entity and intent classification tasks in the shopping domain.
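To make the coupling concrete, here is a minimal sketch of a joint architecture in the style of [13]: a shared BERT-style encoder with a sentence-level intent head and a token-level slot head, written in PyTorch with the Hugging Face transformers library. This is an illustration of the general technique, not Walmart's production model; the class name and label conventions are ours.

```python
import torch.nn as nn
from transformers import AutoModel

class JointIntentSlotModel(nn.Module):
    """Shared encoder with two heads: a sentence-level intent classifier
    and a token-level slot tagger, trained with a single summed loss."""
    def __init__(self, encoder_name: str, num_intents: int, num_slots: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)
        self.slot_head = nn.Linear(hidden, num_slots)

    def forward(self, input_ids, attention_mask,
                intent_labels=None, slot_labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_states = out.last_hidden_state        # (batch, seq_len, hidden)
        intent_logits = self.intent_head(token_states[:, 0])  # [CLS] token
        slot_logits = self.slot_head(token_states)  # one label per token
        loss = None
        if intent_labels is not None and slot_labels is not None:
            ce = nn.CrossEntropyLoss(ignore_index=-100)
            # Both losses backpropagate through the shared encoder,
            # which is what couples the two tasks.
            loss = ce(intent_logits, intent_labels) + ce(
                slot_logits.view(-1, slot_logits.size(-1)), slot_labels.view(-1))
        return intent_logits, slot_logits, loss
```

Because the two losses are summed, every gradient step updates the shared encoder for both tasks at once, which is where the shared representation comes from.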
Developing One Specific Model for Each Domain:
Historically, an individual model has been developed for each domain. Maintaining individual models is difficult, since each may need to be retrained separately on a frequent basis. They also have generalization problems, since each focuses on one specific domain and its data. Yet virtual assistant models in different domains may share a large overlap of conversational patterns, such as product and brand detection in e-commerce shopping or locating products in stores, so unifying them can improve their performance. Moreover, a common issue for assistants is their reliance on large amounts of annotated training data. Motivated by the idea of training a unified model for multiple domains, we aggregate the data of each assistant; the paucity of data is resolved by sharing knowledge between the different assistants so that they strengthen each other.
The Unified Model
Multi-task Learning
In multi-task learning, multiple related tasks are learned together to improve the performance of each task [1]. Multi-task learning shares information between the tasks, and the technique has been shown to be effective in diverse domains such as vision [2], medicine [3], and natural language processing [4]. By sharing representations between related tasks, models generalize better on the target task [14]. Multi-task learning improves a model by introducing an inductive bias: the related tasks cause the model to prefer feature representations that explain more than one task, which leads to a better-generalized model [14, 15].
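The simplest realization of this idea is hard parameter sharing [14]: a shared trunk learns one representation and lightweight task heads branch off it, so every task's gradients shape the shared features. Below is a minimal, generic sketch; the dimensions and task names are illustrative only.

```python
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Hard parameter sharing: one shared trunk, one linear head per task.
    Every task's loss backpropagates through the trunk, biasing the shared
    representation toward features that explain all tasks."""
    def __init__(self, in_dim: int, shared_dim: int, task_out_dims: dict):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, shared_dim), nn.ReLU())
        self.heads = nn.ModuleDict(
            {task: nn.Linear(shared_dim, out_dim)
             for task, out_dim in task_out_dims.items()})

    def forward(self, x):
        shared = self.trunk(x)  # representation shared by all tasks
        return {task: head(shared) for task, head in self.heads.items()}

# Illustrative instantiation: two hypothetical tasks over one input space.
model = HardSharingMTL(in_dim=768, shared_dim=256,
                       task_out_dims={"intent": 20, "slots": 35})
```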
The Unified Model for Voice Shopping and Store Associates Assistants
We want to build a unified model for the shopping and askSam domains. The accuracy of the unified model should be comparable to that of each individual model, with similar performance on both the intent classification and slot filling tasks. The unified model must also be robust to all scenarios that may occur in stores or through the voice shopping channels. However, building a unified model for diverse domains presents several challenges, so it is essential to understand the domains properly:
1. The vocabulary that associates use in stores for their daily activities is not common in voice shopping (e.g., "cart pusher schedule," "who is in Deli (one of the departments in stores) today," "what is Mike's schedule for tomorrow," or "pick up associate hours").
2. Similar words may be used in different contexts in the askSam domain and the voice shopping channels. For example, store associates may say "who is in checkout today," "what is the truck delivery schedule," or "John Deere scheduling," whereas customers on voice shopping channels may say "checkout my cart," "schedule my delivery for today," "pick up hours for tomorrow," or "add John Deere Legos."
These assistants have their own skills for their specific domains, but they also share some commonalities in retail-specific entities (e.g., product and brand detection in e-commerce shopping or in stores). The training data of the existing models is prepared by leveraging templates and synthetic data generation. However, both models lack the capability to recognize person and time entities correctly in unseen phrases, particularly in store live logs, and synthetic data generation is not adequate to improve the models' "person" and "time" recognition. It is impractical to generate a representative set for the person scheduling intent using synthetic data generation alone, due to the infinite compositionality of language. Thus, store live logs are filtered to prepare a representative set of person scheduling logs. Store live logs provide useful training instances that cover person names from different countries and ethnicities; they are annotated and appended to the existing training data of the shopping model.
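The post does not show the generation pipeline itself, so the following sketch only illustrates the general idea of template-based synthetic data generation and why a finite slot vocabulary cannot cover the open set of real associate names. All templates and slot values here are hypothetical.

```python
import random

# Hypothetical templates and slot vocabularies; the real generation pipeline
# and slot lists are not shown in the post.
TEMPLATES = [
    ("what is {person}'s schedule for {time}", "person_scheduling"),
    ("add {brand} {product} to my cart", "add_to_cart"),
]
SLOT_VALUES = {
    "person": ["Mike", "John"],        # a finite list can never cover the
    "time": ["today", "tomorrow"],     # open set of real associate names
    "brand": ["John Deere"],
    "product": ["Legos"],
}

def generate(n: int = 10):
    """Sample n (utterance, intent) pairs by filling templates with slot values."""
    samples = []
    for _ in range(n):
        template, intent = random.choice(TEMPLATES)
        fillers = {slot: random.choice(values)
                   for slot, values in SLOT_VALUES.items()
                   if "{" + slot + "}" in template}
        samples.append((template.format(**fillers), intent))
    return samples
```

However many values are listed per slot, the generator only ever recombines them, which is why live logs are needed for realistic person and time mentions.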
We decided to unify the training data of these two models and to augment it with store live logs that include person and time entities. The unified model can then support both domains and use the synergy of the two domains' training data to improve its generalization. Since no labels were available for the store live logs, we needed to set up a pipeline to label them automatically.
Challenges in Annotating Store Live Logs:
- There are ambiguities among person names, common brands, and products. This kind of ambiguity can be resolved using the context of the utterance. Off-the-shelf NER (named entity recognition) models detect "John Deere" as a person in the utterance "John Deere Legos," but considering "Legos" in the context, "John Deere" should be recognized as a brand by the store model. Conversely, "John Deere" in the utterance "John Deere scheduling" should be detected as a person entity, given "scheduling" in the context of the utterance (a sketch of such a context-based rule appears after this list).
- Store associates may speak multiple languages, and their names may be European, American, Spanish, Asian, Middle Eastern, etc. The store model should detect all these names as "person" based on the context of the utterance.
- There are abbreviations and common expressions (store domain knowledge) that Walmart associates use in stores during their daily activities, but off-the-shelf NER models mistakenly detect them as named entities. Thus, the output of those models must be corrected for Walmart's use cases.
- Off-the-shelf NER models use capitalization as a signal to recognize named entities, particularly "person" and "location" mentions (e.g., "John lives in Chicago"). However, capitalization information is unreliable on voice channels because of automatic speech recognition (ASR) errors.
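As promised above, here is a hypothetical context-based labeling function for the brand-versus-person ambiguity. The cue-word sets stand in for store domain knowledge and are purely illustrative, not the actual rules used in production.

```python
# Hypothetical cue-word sets standing in for store domain knowledge.
PRODUCT_CUES = {"legos", "cart", "aisle"}
SCHEDULE_CUES = {"schedule", "scheduling", "shift", "hours"}

def disambiguate_person(ner_label: str, utterance: str) -> str:
    """Override an off-the-shelf PERSON tag when the utterance context
    signals a retail entity instead."""
    if ner_label != "PERSON":
        return ner_label
    tokens = set(utterance.lower().split())
    if tokens & SCHEDULE_CUES:
        return "PERSON"   # e.g., "John Deere scheduling"
    if tokens & PRODUCT_CUES:
        return "BRAND"    # e.g., "add John Deere Legos"
    return ner_label

print(disambiguate_person("PERSON", "add John Deere Legos"))   # BRAND
print(disambiguate_person("PERSON", "John Deere scheduling"))  # PERSON
```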
To address the shortage of person and time entities in the training data of the unified model, we leveraged several state-of-the-art multilingual NER models [10, 11, 12] as labeling functions to detect person and time entities in store live logs. The output of the NER models on store live logs was corrected using store domain knowledge, specifically for common store expressions that the models wrongly classified as named entities. We also developed labeling functions to disambiguate retail entities from named entities by examining the context of store logs and using store domain knowledge, and we leveraged several heuristics based on store domain knowledge to filter out live logs that are not relevant to person and time entities. Finally, we applied weighted majority voting over the labeling functions to annotate the store live logs automatically. Weighted majority voting produced better annotations of store live logs than Snorkel [9], a weak supervision approach that combines several supervision resources to annotate data; Snorkel did not work well on the sequence labeling task, since it requires data points to be independent.
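The exact voting scheme is not detailed in the post; a minimal sketch of weighted majority voting over per-token predictions from the NER labeling functions might look like the following, with the reliability weights as illustrative stand-ins.

```python
from collections import defaultdict

def weighted_majority_vote(token_votes, weights):
    """token_votes: (labeling_function_name, label) pairs for one token.
    weights: per-function reliability weights (hypothetical values here)."""
    scores = defaultdict(float)
    for fn, label in token_votes:
        scores[label] += weights.get(fn, 1.0)
    return max(scores, key=scores.get)

# Example: three NER labeling functions disagree on the token "John".
weights = {"flert": 1.5, "stanza": 1.2, "spacy": 1.0}   # illustrative
votes = [("flert", "B-PER"), ("spacy", "B-BRAND"), ("stanza", "B-PER")]
print(weighted_majority_vote(votes, weights))           # -> "B-PER"
```

Unlike Snorkel's generative label model, this scheme makes no independence assumption across tokens, which is why it suits the sequence labeling setting better.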
Results:
The unified model achieves competitive performance on retail entities compared with the shopping model. Using the annotated store live logs, it performs significantly better on person and time recognition than the existing model. The store live logs also lead to better generalization of the joint model's intent classification: the unified model's intent classification improves slightly over the original shopping model trained without store live logs (which are not relevant to the shopping channels), outperforming the shopping model by 5% on average across different intent supports. This improvement comes from the more generalizable feature representation the unified model learns from data of multiple domains rather than the shopping domain alone; recent work has also shown that leveraging external training data can improve intent classification performance.
The Benefit of a Unified Model:
1. The unified model does not need to be retrained separately for each domain on a frequent basis.
2. Common named entities (e.g., products, brands, time references, etc.) are shared between the training data of the shopping model and the store model.
3. The maintenance cost of models is reduced dramatically. The data science team can maintain one unified model by updating it regularly, rather than maintaining dozens of separate models for each business requirement.
4. GPU costs are reduced by training one unified model instead of two separate models.
5. Neural language models are data-hungry, and their performance improves with more training data. Data and domain knowledge are leveraged across both domains (shopping and askSam).
Summary
The unified model improves the intent classification performance of the shopping model by leveraging store live logs, and it has shown promising results on live log benchmarks for both shopping and store data. The model learns a more robust set of features and achieves better generalization by integrating training data from multiple domains (e.g., store live logs, which are not relevant to voice shopping). The unified model performs significantly better on person and time recognition using store live logs, and the maintenance and training costs of the models are reduced dramatically.
Acknowledgments
Thanks to the Walmart Global Tech Conversational AI team and Komal Dhuri for preparing the auto-labeling pipeline to annotate live logs.
References:
[1] Rich Caruana. 1993. Multitask learning: A knowledge-based source of inductive bias. In ICML, pages 41–48.
[2] Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2017. Aspect-augmented adversarial networks for domain adaptation. TACL, 5:515–528.
[3] Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. 2008. Multi-task learning for HIV therapy screening. In ICML, pages 56–63.
[4] Xing Fan, Emilio Monti, Lambert Mathias, and Markus Dreyer. 2017. Transfer learning for neural semantic parsing. In ACL-RepL4NLP, pages 48–56.
[5] Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2010. What is left to be understood in ATIS? In SLT workshop, pages 19–24.
[6] Shiva Pentyala, Mengwen Liu, and Markus Dreyer. 2019. Multi-task Learning with Task, Group, and Universe Feature Learning. In ACL, pages 820–830.
[8] https://venturebeat.com/2021/05/22/whats-next-machine-learning-at-scale-through-unified-modeling/
[9] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid Training Data Creation with Weak Supervision. In Proceedings of the VLDB Endowment, 11(3).
[10] Stefan Schweter and Alan Akbik. 2020. FLERT: Document-Level Features for Named Entity Recognition. arXiv preprint arXiv:2011.06993.
[11] https://spacy.io/api/entityrecognizer/
[12] Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In ACL: System Demonstrations, pages 101–108.
[13] Qian Chen, Zhu Zhuo, and Wen Wang. 2019. BERT for Joint Intent Classification and Slot Filling. arXiv preprint arXiv:1902.10909.
[14] https://ruder.io/multi-task/index.html#motivation
[15] Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2016. A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. arXiv preprint arXiv:1611.01587.
[16] Alex Kendall, Yarin Gal, and Roberto Cipolla. 2017. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics.
[17] Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. In NAACL, pages 912–921.
[18] Jianfei Yu and Jing Jiang. 2016. Learning Sentence Embeddings with Auxiliary Tasks for Cross-Domain Sentiment Classification. In EMNLP, pages 236–246.
[19] Marek Rei. 2017. Semi-supervised Multitask Learning for Sequence Labeling. In ACL.
[20] Shai Ben-David and Reba Schuller. 2003. Exploiting Task Relatedness for Multiple Task Learning. In Learning Theory and Kernel Machines, pages 567–580.