Using a Ring Architecture For In-Store Application Resiliency

Solving Real-World Resiliency and Scale Challenges for Retail Store System Technologies

Source: https://pixabay.com/illustrations/tech-circle-technology-abstract-5142625/

The Challenge

Imagine being in a situation where poor-performing software, inconsistent behavior, or an outage results in revenue loss. Let’s go a step further: imagine that customer experience and brand reputation are directly impacted by these scenarios. Working in retail and customer-facing solutions means always seeing things through the eyes of the customers and the store associates who use the technologies we deliver. At Walmart, we deal with…well…Walmart scale, which means national, international, retail, e-commerce, and mobile. It’s a highly complex environment, requiring our solutions to work without a hitch and be backed by highly resilient and scalable infrastructure architectures that keep the lights always on. Even if they flicker, there must be a way to get back to full power to fuel sales and keep customer experiences running smoothly. Coming up with creative solutions that give in-store systems high resiliency and scale is no trivial task.

In this article I take a deep dive into the cloud application infrastructure architecture and patterns we recently implemented in digital retail at Walmart for the in-store solution that sells wireless devices and plans. The platform already supported online sales and, with this recent implementation, now supports retail stores as well. If you want to learn more about the application and services layer of this platform, see my other article. These patterns involve many design and implementation decisions, so I’m going to keep it at a slightly higher level to explain the essentials and why specific solution design choices were made.

If you haven’t read my prior article about the importance of being a Full Stack Architect I would encourage you to do so. It discusses how modern architectures are more complex than ever and why being an architect who can design solutions across all layers is critical. The topic discussed here is a real example of why being a full-stack architect matters. When it comes to resiliency and scale, traditional approaches such as clustering, load balancing, and auto-scaling are table stakes. However, in this case, these were simply not enough to meet the needs of our retail use case. Developing a comprehensive solution required a deep understanding of the entire stack: network, infrastructure, application, and data. Before I get into the details let’s lay the groundwork and talk about the various aspects of the problem statement:

  • Wireless devices are sold through multiple channels including in Walmart stores around the country and through Walmart’s e-commerce website. Furthermore, stores can support different carriers.
  • Transactions per carrier and sales revenue differ based on market trends, shopping patterns, and other factors. Although shopping spikes will occur during holidays, promotional events, and the release of new devices, we need to ensure consistent performance and availability throughout the year. Planning for the unexpected is very important.
  • The user interface for store associates is a new tablet-based application. There is a complex network of infrastructure between these tablets which run in the store and the backing microservices in the cloud. This includes edge gateways, in-store wireless networks, global and internal load balancers, and robust security infrastructure.
  • As you would expect, outages mean loss of sales, negative customer experience, and a high potential for brand impact. Although there is some tolerance for outages, we wanted the solution to not get anywhere close to that. After all, this is a platform for the future, not one for yesterday.

Architecture Pattern

Before developing the architecture, we established the key objectives it had to meet and used them to drive the solution design and its implementation. They were deliberate, crafted in alignment with what the business wanted to achieve, and based on what the entire platform needed to deliver.

  1. Limit blast radius. The solution needed to limit the blast radius for any outage or disruption. In other words, any outage needed to be contained so that the number of stores impacted was minimal and, in the worst case, did not cross geographic regional boundaries.
  2. Enable future scalability. We needed to meet existing needs for scalability and for what was yet to come. Although this platform was initially for wireless, there are plans to use it for other services. The architecture needed to be extensible and expand to accommodate future use cases as they emerged.
  3. Preserve customer experience. User interface performance and customer experience could not be sacrificed in normal and exception scenarios. All the backing microservices are in the cloud so the distance between communicating services mattered. The architecture solution needed to be engineered to consider the physical location of stores and the distance between communicating services.
  4. Keep data consistent and easily accessible. Wireless order data and other data used for transactional processing and analytics had to be consistent at all costs. No matter which store was processing a transaction, pricing information, device details, store inventory, and order data needed to be consistent so that all orders were processed the same way and without any discrepancies.

The approach we ultimately landed on was to use a ring infrastructure architecture pattern. So what exactly is this and how does it work? For starters, as the word ring implies, this pattern enables the creation of logical groupings of our infrastructure architecture and the platform services. It is made up of several key pieces:

  • Outer rings — receive customer transactions from stores in order to be processed. They are the closest to the stores and the client applications.
  • Inner rings — contain the application and data services of the technology platform responsible for processing customer transactions. They also handle failover events, ensuring our software continues to run if any single ring becomes unstable or unavailable.
  • Intelligent routing — routes store traffic to the closest outer ring. For our use case, “closest” means the ring that is the fewest number of miles away from any given Walmart store.

Using a ring pattern allowed us to achieve several things. First, there could be as many outer and inner rings as needed, enabling our platform to have extremely high resiliency and scalability. Second, using intelligent routing we were able to guarantee that store transactions are processed by the closest cloud region. This was very important as it avoided scenarios where a single customer transaction would be processed by multiple cloud regions. If that were to happen, performance would be poor and the customer experience would be adversely impacted.
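To make the idea of "closest" concrete, here is a minimal sketch of choosing an outer ring by distance. It assumes great-circle (haversine) distance is an acceptable stand-in for "fewest miles"; the ring names and coordinates are hypothetical and are not taken from our implementation.

```python
import math
from typing import Dict, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

Coordinates = Tuple[float, float]  # (latitude, longitude) in degrees

# Hypothetical outer rings and their approximate locations (illustrative only).
OUTER_RINGS: Dict[str, Coordinates] = {
    "ring-east": (39.0, -77.5),
    "ring-central": (41.6, -93.6),
    "ring-west": (45.6, -121.2),
}


def haversine_miles(a: Coordinates, b: Coordinates) -> float:
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, h)))


def closest_outer_ring(store_location: Coordinates,
                       rings: Dict[str, Coordinates] = OUTER_RINGS) -> str:
    """Return the name of the ring with the smallest distance to the store."""
    return min(rings, key=lambda name: haversine_miles(store_location, rings[name]))
```

In practice a calculation like this would more likely run offline to produce a store-to-ring mapping than on every request, since a store's location rarely changes.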

Solution Design

Now that we’ve reviewed the core aspects of this pattern let’s take a look at the solution design and discuss how it works:

Ring Architecture Solution Design

Working from the top, transactions from client applications in the stores are processed by an intelligent router that determines the destination outer ring. In our implementation, each outer ring is a cloud region so this routing directs store traffic to where processing takes place. Store traffic is sharded between regions using a unique identifier that is assigned to each Walmart store and included with each transactional request. The job of the custom routing algorithm in the router is to interrogate these unique identifiers and send traffic to the appropriate region.
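The routing step described above can be sketched as a simple lookup. This is an illustration under stated assumptions, not our production routing script: the header name `X-Store-Id`, the store numbers, and the ring names are all hypothetical.

```python
from typing import Mapping

# Hypothetical store-to-ring mapping; in a real system this would come from
# centrally managed reference data rather than being defined in code.
STORE_TO_RING = {
    "0001": "ring-east",
    "0100": "ring-central",
    "5260": "ring-west",
}

STORE_ID_HEADER = "X-Store-Id"  # assumed header name, for illustration only


class UnknownStoreError(LookupError):
    """Raised when a request carries a store identifier with no ring mapping."""


def route_request(headers: Mapping[str, str],
                  store_to_ring: Mapping[str, str] = STORE_TO_RING) -> str:
    """Return the outer ring (cloud region) that should process the request."""
    store_id = headers.get(STORE_ID_HEADER)
    if store_id is None:
        raise ValueError("request is missing the store identifier header")
    try:
        return store_to_ring[store_id]
    except KeyError:
        raise UnknownStoreError(store_id) from None
```

Because every request from a given store resolves to the same ring, a single customer transaction is not split across regions.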

Each cloud region has a single primary inner ring and dedicated load balancer. The load balancer sends 100% of its traffic through the primary route during normal processing. The secondary route is used only when the primary route is not available, meaning there is an outage. All of this is configuration-driven and does not require any code to be released. Once traffic is sent to an inner ring, all subsequent service-to-service integrations stay within that ring using our service mesh frameworks.
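As a rough sketch of the primary and secondary route behavior, the selection logic might look like the following. The route names, the configuration shape, and the `is_healthy` callback are assumptions made for illustration; a real load balancer expresses this through its own configuration and health checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class RingRoutes:
    """Routes for one cloud region: all traffic goes to primary unless it is down."""
    primary: str
    secondary: str


# Hypothetical configuration. Changing failover targets means editing this
# data, not releasing code.
ROUTES: Dict[str, RingRoutes] = {
    "ring-east": RingRoutes(primary="inner-east", secondary="inner-east-dr"),
    "ring-west": RingRoutes(primary="inner-west", secondary="inner-west-dr"),
}


def select_route(routes: RingRoutes, is_healthy: Callable[[str], bool]) -> str:
    """Send 100% of traffic to the primary route; fall back only on an outage."""
    if is_healthy(routes.primary):
        return routes.primary
    return routes.secondary
```

The secondary route is never used for load sharing in this sketch. It receives traffic only when the health check on the primary route fails.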

Transactional data is stored locally within each ring to keep transactional processing optimized for performance. These transactional stores are designed in such a way as to guarantee zero data loss during ring or region failures. We configured this using cloud-native solutions to avoid unnecessary complexity. The common layer shown at the bottom is where all transactional data gets integrated for analytics and where we store reference data which is replicated to each inner ring in a data cache. This design enabled us to meet the objectives for preserving the customer experience, ensuring zero data loss, and enabling data consistency during transactional processing.

There are numerous implications when using this pattern that will need to show up in the detailed design and implementation. It’s important to get ahead of these to complete the architecture at a deep enough level to enable the engineering work.

  • Services that are a part of the core transaction must be ring-aware, meaning they all reside in the same inner ring. Keep in mind that making each service ring-aware could change the application architecture for those impacted services.
  • Services that are not a part of the core transaction can be ring-aware, but they don’t have to be, and in some cases they can’t be. These could include dependent services from outside domains, services that are integrated using asynchronous patterns, message/streaming services, or legacy business services that follow a different deployment model.
  • Each primary inner ring will have a full capacity disaster recovery (DR) ring which must be in a different outer ring. This enables operating at full scale in the event an outer ring is not accessible (e.g. there is a region outage) or when an inner ring has become unstable and a controlled failover is in place.
  • There will be limitations that do not allow all services to be ring-aware. A common layer, as shown in the solution design, will need to be a part of the architecture implementation. In our case, the common layer contains reference data and the data warehouse. Reference data is authored and managed outside the ring infrastructure and cached within each inner ring. Caches in each inner ring co-locate this data with the transactional services that use them. All reporting and analytics also sit outside the ring, providing a single, integrated source for operational analytics.
  • Looking for transactional data requires knowing which inner ring to search. In our case, we could do this by identifying the unique store number upfront, and we knew our worst case would be to search each inner ring. Keep this use case in mind and make sure it’s addressed. Searching each ring may work fine when there are only a few rings, but it will not scale well as more are added.
  • It is unlikely that conventional load balancing and routing algorithms will match the strategy needed for traffic sharding to a given outer ring. As detailed above, we used store location and implemented a custom routing script. This was decided after analyzing several data points including sales volumes by store, revenue per store, and latencies between cloud regions. Determine what data points suit your need, and plan this carefully as it will directly drive how many outer and inner rings are needed.
  • Disaster recovery processes will get more complicated. Maximize automation and pay particular attention to how transactional data needs to be treated to recover from an outage without any loss.
  • Don’t overlook the importance of CI/CD automation and deployment parity. When implementing this pattern, build pipelines will need to always deploy to all active and disaster recovery inner rings. You may also need to be able to deploy store applications to a ring or a set of stores in a ring. In our architecture, store to ring mappings are implemented as technical reference data. This data is managed in a central location so that we can avoid any hardcoding.
  • Operational logging, monitoring, and transaction tracing will get harder. Have a unique identifier for each outer and inner ring and add them to all application logs. This will enable tracing and mapping transactions back to the rings from where they were processed. Follow specifications and standards such as OpenTelemetry to get deep telemetry data to help analyze performance and behavior.
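The last point, adding ring identifiers to application logs, can be sketched with Python's standard `logging` module. The logger name, ring identifiers, and log format are hypothetical. This sketch only tags log lines and does not replace a tracing standard such as OpenTelemetry.

```python
import logging
from typing import Optional, TextIO


class RingContextFilter(logging.Filter):
    """Attach the outer and inner ring identifiers to every log record."""

    def __init__(self, outer_ring: str, inner_ring: str) -> None:
        super().__init__()
        self.outer_ring = outer_ring
        self.inner_ring = inner_ring

    def filter(self, record: logging.LogRecord) -> bool:
        record.outer_ring = self.outer_ring
        record.inner_ring = self.inner_ring
        return True  # never drop records, only enrich them


def build_ring_logger(name: str, outer_ring: str, inner_ring: str,
                      stream: Optional[TextIO] = None) -> logging.Logger:
    """Create a logger whose output always includes the ring identifiers."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines from the root logger
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(logging.Formatter(
        "%(levelname)s outer_ring=%(outer_ring)s inner_ring=%(inner_ring)s %(message)s"
    ))
    handler.addFilter(RingContextFilter(outer_ring, inner_ring))
    logger.addHandler(handler)
    return logger
```

With the identifiers in every line, a log search for a given transaction shows immediately which ring processed it, including after a failover to a disaster recovery ring.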

Wrap-Up

Architecting for resiliency and scale in retail systems requires innovative thinking and infrastructure solutions that supplement the approaches taken in the application tier. A ring infrastructure architecture pattern is one such example and is enabling us to meet the high demands of our business. Using it does add complexity, so choose wisely when and where to use this pattern and understand the implications through the entire solution stack. Although one ring per cloud region was illustrated here, there can be as many inner and outer rings as needed for the use case at hand. Start simple, be practical, and evolve based on your unique needs.

Using a Ring Architecture For In-Store Application Resiliency was originally published in Walmart Global Tech Blog on Medium.

Article Link: Using a Ring Architecture For In-Store Application Resiliency | by Navdeep Singh | Walmart Global Tech Blog | Feb, 2022 | Medium