Building Pipelines for Automated Reconciliation with Near Zero Data Loss

Introduction

Walmart Marketplace allows third-party sellers to list items on Walmart.com. Our Marketplace Payments Platform performs vital functions such as user payments, invoicing, billing, and settlement reporting, along with other fintech services for both US and international markets. The platform receives transaction events for sales, returns, and adjustments from upstream systems, computes financial ledgers, and performs payment settlement periodically.

To ensure the best experience for sellers, the payments platform has an important goal: making accurate user payments on time, every time. This translates into several sub-goals, such as:

  • Timely arrival of upstream events
  • Data completeness, meaning zero data loss during transit
  • Data consistency across systems
  • Scaling as needed

We accomplished this goal by building an End-to-End Automated Reconciliation System with a more robust architecture.

Problem

Even a small failure rate against any of the above goals can result in delayed payments, diminishing user trust and increasing the manual work operations must do to identify the root cause. Given that the payments team is the last stop before payments go out to users, it carries the additional responsibility of identifying any issues that may have occurred in upstream systems.

The payments platform is based on a microservices architecture with complex ownership. The process involves asynchronous data exchange between multiple event-driven distributed systems. As more independently deployable systems become involved and data boundaries blur, data traceability becomes challenging.

Anatomy of a missed or delayed event in a microservices architecture

Missed Event Anatomy

In the above diagram, data flows from source to destination. The source system receives events from another upstream system (the origin), and events flow out of the destination system (the final step). The source system also has two dependencies, and the destination system has several life cycle stages (LC1, LC2, LC3) that an event passes through before leaving the system. Ten events came in from the origin; two got stuck at the source's dependencies, so only eight events were emitted. Due to a handshaking issue, one event was lost in transit, and seven events entered the destination. One more event was lost within the life cycle stages, so only six events were finally sent out: an end-to-end data loss (or delay) of four events. The solution should be able to capture these four events.

The previous core system read a stream of events (or order info) from a streaming service, and the database used was NoSQL. This setup was the primary pipeline. The legacy data flow assumed that the data received was complete. However, given that at least three upstream systems are involved, and they in turn depend on other systems, the SLA was occasionally missed.

Legacy Primary Pipeline

The Solution — Architecture

The solution should allow us to foresee potential delays in both the upstream systems and the platform itself. It should be able to track the lineage of the data flow and give us quick access to any delayed events. So, we opted to build an End-to-End Automated Reconciliation System that serves as a second pipeline to identify lost data.

This End-to-End Automated Reconciliation System has a number of key features:

Missed Event Anatomy with Reconciliation

Data Sourcing

This stage involves capturing:

  1. the raw data (blue line below) from the “source” systems before processing. These provide the “expected transactions” that the platform should consider. The minimum requirement is to publish an event with “mandatory identifiers” so that any dropouts during transit can be reconciled; a minimal sketch of such an event appears below. The data can arrive in near-real time or take up to 24 hours, owing to the sources' dependent systems.
  2. the data from the “destination” system at each stage of a transaction’s life cycle, which is joined against the expected set. These provide the “actual transactions”. Tapping the data at each life cycle stage is optional but enables reconciliation at a more granular level.
Revised Generic Architecture
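
As an illustration, here is a minimal sketch of such a raw “expected transaction” event. All field names are hypothetical; the only real requirement from the text above is that the event carry mandatory identifiers for matching downstream.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ExpectedTransactionEvent:
    """Minimal raw event a source system could publish for reconciliation.

    Field names are illustrative only; the real requirement is that the
    event carry mandatory identifiers that let it be matched downstream.
    """
    transaction_id: str  # unique identifier used to join expected vs. actual
    event_type: str      # e.g., "SALE", "RETURN", "ADJUSTMENT"
    seller_id: str       # enables the user-impact metric
    amount: float        # enables the dollar-impact metric
    emitted_at: str      # ISO-8601 timestamp; events may arrive up to 24 hours late

event = ExpectedTransactionEvent(
    transaction_id="txn-1001",
    event_type="SALE",
    seller_id="seller-42",
    amount=125.00,
    emitted_at="2025-03-01T08:15:30Z",
)
print(json.dumps(asdict(event)))  # payload published to the reconciliation pipeline
```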

Master Reconciliation

This phase reconciles the data between “expected” and “actual” transactions on various attributes, and the results are published to the engineering and operations teams over multiple communication channels. An Online Analytical Processing (OLAP) store is at the heart of the reconciliation process.

Any datastore can serve this role, provided it can:

  • handle a massive volume of writes,
  • provide aggregated metrics, and
  • serve fast random reads

In addition, the OLAP store should provide the following views, sketched after this list:

  • a “bird’s eye view” of summarized results of the reconciliation process, and
  • a “ground-level view” of the mismatched or delayed events with a sub-second response time
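
As a rough sketch of those two views, assuming the reconciliation results land in a pandas frame with hypothetical columns (the production queries would run inside the OLAP store itself):

```python
import pandas as pd

# Hypothetical reconciliation results: one row per expected transaction.
recon = pd.DataFrame({
    "transaction_id": ["t1", "t2", "t3", "t4"],
    "event_type":     ["SALE", "SALE", "RETURN", "RETURN"],
    "stage_reached":  ["SETTLED", "SETTLED", "LC1", None],  # None = never received
    "matched":        [True, True, False, False],
})

# Bird's-eye view: summarized match counts per event type.
summary = recon.groupby("event_type")["matched"].agg(total="count", matched="sum")
print(summary)

# Ground-level view: the individual mismatched or delayed events.
print(recon[~recon["matched"]])
```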

Care must also be taken with the database's file-merging (compaction) process to keep small files under control.

At its core, this is a left join in the database between the upstream data and the payment events table. The query checks various criteria to match the data between the two systems, and a report is generated. The query also generates aggregations for each life cycle stage to track data lineage. We also captured the dollar impact and customer impact as part of the reconciliation.

For any mismatches found due to late arrivals, the process repeats on the next run to close the gap.
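
A minimal sketch of that left join using pandas, with hypothetical column names (the production query runs in the OLAP store rather than in memory):

```python
import pandas as pd

# "Expected" transactions captured from the source systems.
expected = pd.DataFrame({
    "transaction_id": ["t1", "t2", "t3"],
    "seller_id":      ["s1", "s2", "s3"],
    "amount":         [100.0, 250.0, 50.0],
})

# "Actual" transactions observed in the destination (payment events).
actual = pd.DataFrame({
    "transaction_id":   ["t1", "t3"],
    "processed_amount": [100.0, 50.0],
})

# Left join: every expected row survives; missing actuals surface as NaN.
recon = expected.merge(actual, on="transaction_id", how="left")
recon["matched"] = recon["processed_amount"].notna() & (
    recon["amount"] == recon["processed_amount"]
)

missed = recon[~recon["matched"]]
print("dollar impact:", missed["amount"].sum())          # 250.0
print("users impacted:", missed["seller_id"].nunique())  # 1
```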

Architecture — Implementation

Retry or Auto Heal Mechanism

The retry/auto-heal step runs daily to ensure that the platform receives all the events from upstream. It checks each event sent during the day. Ideally, the retry count is zero; any non-zero number indicates that a handshake did not complete correctly and that there is an issue in the respective pipeline. The data is loaded into the database after the retry completes. For auto-healed transactions, we capture metrics on how many such occurrences happened during the day, and the report is available as part of the reconciliation process. Hardware upgrades were one of the primary drivers of auto-healed transactions.
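
A hedged sketch of that daily loop; fetch_missing_events and replay_event below are hypothetical stand-ins for the platform's actual interfaces:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("auto-heal")

def fetch_missing_events():
    """Hypothetical: expected transactions not yet seen downstream today."""
    return [{"transaction_id": "t2", "event_type": "RETURN", "amount": 250.0}]

def replay_event(event) -> bool:
    """Hypothetical: re-publishes the event to the primary pipeline."""
    return True

def daily_auto_heal() -> int:
    """Replay every event that failed the day's reconciliation.

    Ideally this returns 0; any non-zero count indicates a handshake
    problem in one of the pipelines and is surfaced in the report.
    """
    healed = 0
    for event in fetch_missing_events():
        if replay_event(event):
            healed += 1
            log.info("auto-healed %s", event["transaction_id"])
    return healed

print("retried:", daily_auto_heal())
```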

Reverse-reconciliation

This stage is also a left-outer join, but this time between the destination events and the source data. It was added to ensure that the secondary pipeline itself is not dropping any events. This two-way reconciliation process helped the platform achieve near-zero data loss.
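
The same join with the sides swapped, again as a hypothetical pandas sketch; unmatched rows here mean the secondary pipeline itself dropped an event:

```python
import pandas as pd

# Events that actually left the destination system.
actual = pd.DataFrame({"transaction_id": ["t1", "t2", "t3"]})

# Events the secondary (reconciliation) pipeline captured.
captured = pd.DataFrame({"transaction_id": ["t1", "t3"]}).assign(captured=True)

# Left join from the destination side: NaN means the reconciliation
# pipeline failed to capture that event.
reverse = actual.merge(captured, on="transaction_id", how="left")
print(reverse[reverse["captured"].isna()])  # t2 was missed by the second pipeline
```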

Data Visualization

A sample reconciliation report:

Sales — All 100 transactions were processed successfully, so there is no dollar impact.

Returns — Out of 20 transactions, only 19 were received and processed, resulting in one impacted user and a 250-dollar impact.

Adjustments — All 10 were received by the platform, but only 8 were processed, which could indicate an issue within the platform itself, possibly in the life cycle stages. Assuming each transaction maps to a unique user, two users were impacted, and the total dollar impact was 200 dollars (150 and 50, respectively).
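
The arithmetic behind that sample report, as a quick sketch with the same hypothetical numbers:

```python
from collections import defaultdict

# Unmatched transactions from the day's reconciliation (values chosen to
# match the sample report above).
unmatched = [
    {"type": "Returns",     "seller_id": "u1", "amount": 250.0},
    {"type": "Adjustments", "seller_id": "u2", "amount": 150.0},
    {"type": "Adjustments", "seller_id": "u3", "amount": 50.0},
]

impact = defaultdict(lambda: {"dollars": 0.0, "users": set()})
for txn in unmatched:
    impact[txn["type"]]["dollars"] += txn["amount"]
    impact[txn["type"]]["users"].add(txn["seller_id"])

for event_type, agg in impact.items():
    print(event_type, agg["dollars"], len(agg["users"]))
# Returns 250.0 1
# Adjustments 200.0 2
```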

Notification and Alerting

The results are published to various notification channels such as email, instant messaging, issue tracking, and on-demand visualization.

Key Performance Indicators (KPIs)

The primary KPIs are:

Bird's-Eye View

  • Master Reconciliation — Are there any delays/mismatches in the events?
  • Reverse Reconciliation — Validates that the data captured by the secondary pipeline is itself complete
  • Life-Cycle Tracker — Are any events stuck as they move through the life cycle?
  • Impact — Revenue (dollar impact) and trust (number of users affected)

Ground-Level View

  • What events are unreconciled?
  • Reason for mismatch — Why?
  • At which system is a transaction stuck or delayed — Where?

Conclusion

In our testing of the new process, we found that the End-to-End Automated Reconciliation System achieved a near-zero payout failure rate and significantly reduced the average time spent manually reconciling tickets. Building this pipeline automated the process of catching lost events, resulting in more accurate payouts and a better experience for Walmart Marketplace sellers.

