American Express Technology

Cell-Based Architecture for Resilient Payment Systems

2026-06-11T00:00:00-04:00

The American Express core payments ecosystem is a global platform relied on by Card Members and partners around the world. Every day, it processes live payment transactions that require high availability, low latency, and predictable performance.

Resiliency is not an afterthought; it has been encoded into the system’s design from the beginning. Localized faults are contained within defined boundaries, and recovery is designed to be fast and predictable.

To achieve this, the platform is built around a cell-based architecture that isolates failures, maintains low-latency processing, and scales capacity without expanding the failure domain.

This blog outlines the principles that guide this architecture and how they help us build a resilient payments platform at global scale.

Core Payments Ecosystem

In 2018, we started a journey to modernize our core payments ecosystem. This platform processes live card and payment transactions and is mission-critical to our Card Members and partners.

As we modernized the platform, resiliency remained a primary design requirement. We needed an architecture that could continue processing transactions reliably, even when individual components failed. Our approach was heavily influenced by our historical design patterns, which predate the term “cell-based architecture” but share many of the same principles.

Our new platform targeted cloud-native technologies, which meant we needed to think differently about how we designed for resiliency and scalability.

In the next sections, we’ll discuss some of the design principles we follow in our core payments ecosystem and how they not only improve our ability to process payments reliably but also help us reduce latency and scale more easily.

What is Cell-based Architecture?

Cell-based architecture is an architecture pattern that has gained popularity in the cloud-native distributed systems space.

The idea behind the concept is to group related microservices, databases, and other components into independent instances called cells. Each cell is able to function independently without reliance on other cells.

In this diagram: Each cell contains its own services and data so a failure stays within that cell instead of spreading across the platform.

The primary benefit of cell-based architecture is reducing the blast radius of failures. With each cell being independent, if one cell experiences issues, it doesn’t impact the others. The trade-off is that cell-based architecture often increases management overhead and architectural complexity, as it requires careful design to ensure that cells are truly independent and that data is appropriately localized.

However, for mission-critical systems like payments, we find that the benefits of a reduced blast radius and improved resiliency outweigh the additional complexity.

We’ve also found that when implemented well, a cell-based architecture can help platforms reduce latency (by reducing external dependencies and network hops) and improve scaling by introducing additional independent cells.

How We Follow Cell-Based Architecture

Each instance of our core payments ecosystem is designed as a cell, which:

Is an independently deployable unit that can process payments on its own.
Has its own set of microservices, databases, and other components.
Is a single failure domain, meaning that if one cell experiences issues, the failure doesn’t cascade beyond the cell boundary.
Can be taken out of rotation for maintenance or in response to failures without impacting the overall system or requiring coordination with other cells.
Has no synchronous cross-cell dependencies in the critical path of processing transactions.

A cell is defined by its failure boundaries rather than a specific infrastructure construct. In practice, cells never span multiple regions—everything required to process transactions (DNS, databases, microservices, and supporting services) remains local within that boundary.

To achieve this, we follow a set of core principles that guide our design decisions and help us ensure that our cells are truly independent and resilient.

Data and Processing Locality by Default

Processing payments requires data: currency rates, merchant category codes, and so on. Some data is static, while some data changes with each transaction.

Static & Semi-Static Data Replication

For static or semi-static data like currency rates and merchant category codes, we replicate that data to each cell.

In this diagram: Reference data is pushed into every cell ahead of time so transaction processing never needs a synchronous lookup to a central source.

Rather than relying on a fall-through read to a centralized system of record during transaction processing, we pre-populate this data in each cell ahead of time. This keeps reference data local before transactions arrive, avoids cache-miss latency during processing, and preserves critical-path isolation.

The replication work happens outside the transaction path, which lets us keep the data available locally without introducing synchronous cross-cell dependencies.
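The pre-population pattern described above can be sketched as a cell-local store with two entry points: one used by replication, one used by transaction processing. This is a minimal illustration under stated assumptions, not the platform's implementation; `LocalReferenceStore`, `apply_snapshot`, `lookup`, and the versioning scheme are hypothetical names introduced here, and the sample values are illustrative.

```python
import threading
from typing import Dict, Optional


class LocalReferenceStore:
    """Cell-local, in-memory copy of reference data (hypothetical sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, str] = {}
        self._version = 0

    def apply_snapshot(self, version: int, snapshot: Dict[str, str]) -> bool:
        """Called by the replication process, outside the transaction path.

        Older or duplicate snapshots are ignored, so replays are harmless.
        """
        with self._lock:
            if version <= self._version:
                return False
            # Swap in a whole new dict so readers never see a partial update.
            self._data = dict(snapshot)
            self._version = version
            return True

    def lookup(self, key: str) -> Optional[str]:
        """Called on the transaction path: local memory only, no remote fall-through."""
        with self._lock:
            return self._data.get(key)


store = LocalReferenceStore()
# Illustrative reference data pushed into the cell ahead of time.
store.apply_snapshot(1, {"mcc:5411": "Grocery Stores", "mcc:5812": "Restaurants"})
print(store.lookup("mcc:5411"))
```

Note that `lookup` returns `None` on a miss rather than calling a central system; how a real cell treats a missing key is a design decision the article does not describe.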

Dynamic Data Routing

Not all data is static and not all data can be pre-populated. For more dynamic data (data that changes with each transaction), data replication may not be fast enough to ensure that every cell has the right data at the right time. We don’t want to route transactions to cells that don’t have the latest data, as that would increase latency and potentially lead to processing failures.

Instead, we use deterministic routing to route transactions to the cell where the right data is already available. In a recent article, Migrating the Payment Network Twice with Zero Downtime, we introduced the Global Transaction Router, which is responsible for managing connectivity and routing transactions to the appropriate cell. It can do so because it understands just enough of the payment specifications to make routing decisions based on the transaction data.

For example, we may route transactions based on partner, market, or payment type. How we route depends on the payment transaction data and the use case, but the key is that we selectively route transactions to the cell that holds the relevant data whenever strong data consistency across transactions is required.
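As a rough sketch of what deterministic routing on transaction attributes could look like, the example below maps a transaction to a cell using only fields carried by the transaction itself. The `Transaction` fields, the routing table, and the cell names are all hypothetical; the article does not describe the Global Transaction Router's actual rules.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Transaction:
    transaction_id: str
    partner: str
    market: str
    payment_type: str


# Hypothetical routing table: (market, payment_type) -> home cell.
ROUTING_TABLE: Dict[Tuple[str, str], str] = {
    ("US", "card_auth"): "cell-us-1",
    ("GB", "card_auth"): "cell-eu-1",
}
DEFAULT_CELL = "cell-us-1"  # assumed fallback for unmapped combinations


def route(txn: Transaction) -> str:
    """Pick a cell from transaction fields only.

    No randomness or external lookups are involved, so the same inputs
    always map to the same cell.
    """
    return ROUTING_TABLE.get((txn.market, txn.payment_type), DEFAULT_CELL)


txn = Transaction("t-1", partner="acme", market="GB", payment_type="card_auth")
print(route(txn))  # cell-eu-1
```

Because the decision depends only on the transaction's own fields, related transactions that share those fields land in the same cell, which is where their dynamic state already lives.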

In this diagram: The router sends a transaction to the cell that already has the authoritative dynamic state, while replication continues asynchronously outside the critical path.

We keep transaction processing localized by restricting microservice communication to pod-to-pod interactions within the cell’s Kubernetes network, ensuring all processing remains within the cell’s boundaries.

To support failover, data is synchronized across cells using message-based replication. That replication happens asynchronously, outside the transaction path, so it doesn’t impact latency or availability.

No in-flight transaction waits for replication to complete; if the latest state is required, the Global Transaction Router sends the transaction to the cell where that data is already authoritative or available.

We only allow our microservices to talk to localized database instances. This keeps latency predictable and avoids unnecessary network hops, but it requires deliberate routing decisions.

By introducing deterministic routing at the edge, we can ensure that transactions are routed to the cell where the right data is already available.

Enforced Boundaries for Ingress and Egress

Along with its routing capabilities, the Global Transaction Router also serves as a key enforcer of our “local only” processing.

Transactions must enter a cell through the Global Transaction Router; if a cell cannot process a transaction and that transaction needs to be rerouted to another cell, it must also go through the Global Transaction Router.

In this way, the Global Transaction Router also serves as a payments mesh, connecting our cells globally.

In this diagram: All cross-cell traffic is funneled through the Global Transaction Router, which preserves strict cell boundaries.

Preventing cross-cell dependencies becomes increasingly difficult as platforms grow.

By tightly controlling cross-cell communication through the Global Transaction Router, we prevent cells from forming strong dependencies on each other, as they do not have the ability to communicate at all—only the Global Transaction Router can communicate across cells.

This enforcement occasionally results in duplicated services where shared implementations might otherwise seem simpler, but it preserves cell independence and improves latency by reducing cross-cell network hops.

The same principle applies to observability. Each cell publishes logs, metrics, and traces to observability components localized within that cell first, so losing part of the observability stack only reduces visibility for that cell instead of the entire platform. We still aggregate observability data asynchronously to provide global dashboards, alerting, and fleet-wide analysis, but that aggregation remains outside the transaction’s critical path.

Cells Break in Isolation; Other Cells Replace Them

Leveraging the ability to reroute transactions to other cells is a key part of our resiliency strategy.

When failures occur, their impact stays contained within the affected cell, and transactions are automatically rerouted to a healthy cell where processing restarts.

We reroute not only new incoming transactions but also transactions that were already in-flight in the failing cell.

Our Payments Processing subsystem follows an orchestrated microservices architecture, where an orchestrator microservice manages the processing workflow and calls other microservices to perform specific tasks.

If a downstream service begins to fail, the orchestrator detects the failure, halts processing, and sends the transaction back to the Global Transaction Router to be rerouted to another cell.

In this diagram: When a cell fails mid-flow, the transaction is rerouted and restarted in a healthy cell rather than resumed across cells.

We do not attempt to resume partially processed transactions across cells. Instead, we restart transaction processing in another cell with the original transaction data.

This restart is only safe while the transaction is still within the core payments ecosystem. Once a transaction has been sent to an external system (e.g., card issuer), we consider that a point of no return, and we don’t allow transactions to be rerouted after that point.

Card authorizations are structured so that the point of no return is toward the end of processing. If a transaction fails before the point of no return, we can safely reroute and restart processing without worrying about duplicate transactions or data consistency issues.
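The restart-before-the-point-of-no-return behavior can be illustrated with a small orchestrator sketch. The names `orchestrate`, `StepFailed`, and `Outcome` are hypothetical, and the handling of a failure after the external call (reported, never rerouted) is an assumption made for this example rather than something the article specifies.

```python
from dataclasses import dataclass
from typing import Callable, List


class StepFailed(Exception):
    """Raised by a processing step when a downstream service fails."""


@dataclass
class Outcome:
    status: str      # "completed", "reroute", or "failed_after_external_call"
    steps_run: int   # internal steps that finished before the outcome


def orchestrate(
    payload: dict,
    internal_steps: List[Callable[[dict], None]],
    external_call: Callable[[dict], None],
) -> Outcome:
    """Run internal steps, then the external call (the point of no return)."""
    steps_run = 0
    for step in internal_steps:
        try:
            step(payload)
        except StepFailed:
            # Nothing has left the platform yet, so the router can restart
            # the transaction in another cell from the original data.
            return Outcome("reroute", steps_run)
        steps_run += 1
    try:
        external_call(payload)
    except StepFailed:
        # Past the point of no return: report the failure, do not reroute.
        return Outcome("failed_after_external_call", steps_run)
    return Outcome("completed", steps_run)
```

Placing the external call last keeps the window in which a reroute is safe as wide as possible, which mirrors how the article describes card authorizations.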

For other payment types, we manage idempotency through transaction identifiers. Each transaction carries a unique transaction identifier that remains consistent across retries and reroutes. Downstream systems use these identifiers to detect and suppress duplicate requests, allowing retries and reroutes to be handled safely without introducing inconsistencies or duplicate transactions.
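A downstream system that suppresses duplicates by transaction identifier might look roughly like the sketch below. `IdempotentProcessor` and its in-memory result cache are hypothetical; a real system would need durable storage and an expiry policy, neither of which the article describes.

```python
from typing import Callable, Dict


class IdempotentProcessor:
    """Runs a handler at most once per transaction identifier (sketch)."""

    def __init__(self, handler: Callable[[dict], str]) -> None:
        self._handler = handler
        self._results: Dict[str, str] = {}
        self.handler_calls = 0

    def process(self, transaction_id: str, payload: dict) -> str:
        # A retry or reroute carries the same identifier, so the stored
        # result is returned instead of running the handler again.
        if transaction_id in self._results:
            return self._results[transaction_id]
        self.handler_calls += 1
        result = self._handler(payload)
        self._results[transaction_id] = result
        return result


processor = IdempotentProcessor(lambda payload: f"approved:{payload['amount']}")
first = processor.process("txn-123", {"amount": 50})
retry = processor.process("txn-123", {"amount": 50})
print(first == retry, processor.handler_calls)  # True 1
```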

The restart model emphasizes the importance of avoiding shared state between cells. Cross-cell shared state would introduce synchronization challenges and potential consistency issues, especially during failover scenarios. Communication failures between cells could impact the ability to process transactions globally, which we want to avoid at all costs for a payments system.

In our architecture, cells are designed to be loosely coupled. Each cell has its own database clusters, and the microservices within a cell only communicate with the local database cluster.

When a cell fails, its impact stays confined to that cell, allowing other cells to continue processing transactions normally.

When rerouted, transactions are processed without reliance on state from the previous cell.

At any point in time, a cell can be taken out of rotation. When a cell is taken out of rotation either automatically or manually, another cell takes its place. This does not have to be a binary cutover. As discussed in Migrating the Payment Network Twice with Zero Downtime, the Global Transaction Router can shift traffic between cells by percentage, allowing us to gradually drain a cell for maintenance, validate a recovering cell under partial load, or respond more safely during incidents.
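One way to picture percentage-based traffic shifting is a weighted choice keyed on the transaction identifier, so a retry of the same transaction lands in the same cell. This is only a sketch: the article does not say how the Global Transaction Router implements percentages, and `pick_cell`, the weights, and the cell names are assumptions.

```python
import hashlib
from typing import Dict


def pick_cell(transaction_id: str, weights: Dict[str, int]) -> str:
    """Map a transaction to a cell using percentage weights that sum to 100."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    # Hash the identifier into a stable bucket in the range 0..99.
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    cumulative = 0
    for cell, weight in sorted(weights.items()):
        cumulative += weight
        if bucket < cumulative:
            return cell
    raise AssertionError("unreachable when weights sum to 100")


# Gradually draining cell-a: shift from 100/0 to 90/10, and so on.
weights = {"cell-a": 90, "cell-b": 10}
print(pick_cell("txn-123", weights))
```

Setting a cell's weight to 0 removes it from rotation entirely, and raising it in small steps lets a recovering cell be validated under partial load.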

Minimal Dependencies at the Edge

Because the Global Transaction Router sits at the edge, it is a critical service that provides connectivity, routing, and resiliency. To ensure its availability, we aim to keep its dependencies as few as possible.

The closer to the edge, the fewer dependencies we aim for.

But we don’t just reduce the dependencies; we also aim to keep them out of the critical path.

If our logging infrastructure becomes unavailable, we don’t want that to impact the ability to process transactions. We do this by using an asynchronous logger configured with a buffer truncation policy, so if the buffer is full, we drop logs instead of blocking transaction processing.
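The drop-instead-of-block behavior can be sketched with a bounded queue and a background worker. `DroppingAsyncLogger` is a hypothetical class written for this illustration; the article does not name the logging library or its configuration.

```python
import queue
import threading
from typing import Callable


class DroppingAsyncLogger:
    """Bounded, non-blocking logger: when the buffer is full, new logs are dropped."""

    def __init__(self, sink: Callable[[str], None], capacity: int = 1024) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=capacity)
        self._sink = sink
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)

    def start(self) -> None:
        self._worker.start()

    def log(self, message: str) -> bool:
        """Never blocks the caller; returns False when the message is dropped."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self) -> None:
        while True:
            message = self._queue.get()
            try:
                self._sink(message)
            except Exception:
                pass  # a failing sink must not stop the worker
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Wait until every buffered message has been handed to the sink."""
        self._queue.join()
```

The transaction path only ever calls `log`, which returns immediately whether or not the sink is healthy; the `dropped` counter gives operators a way to see how much visibility was traded away.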

If our configuration service becomes unavailable, we want to continue running with the last known configuration. For this, we maintain an in-memory configuration that is updated asynchronously. When the configuration service is unreachable, we keep serving that in-memory copy until the service becomes available again and we can pull the latest configuration.
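A last-known-configuration holder can be sketched as follows. `LastKnownConfig`, `refresh`, and the `fetch` callable are hypothetical names for this example; the actual configuration service and its client are not described in the article.

```python
import threading
from typing import Callable, Dict


class LastKnownConfig:
    """In-memory configuration that survives configuration-service outages."""

    def __init__(self, fetch: Callable[[], Dict[str, str]], initial: Dict[str, str]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._current = dict(initial)

    def refresh(self) -> bool:
        """Meant to run periodically in the background, never on the transaction path."""
        try:
            latest = self._fetch()
        except Exception:
            return False  # keep serving the last known configuration
        with self._lock:
            self._current = dict(latest)
        return True

    def get(self, key: str, default: str = "") -> str:
        """Transaction-path read: local memory only."""
        with self._lock:
            return self._current.get(key, default)
```

Reads never touch the network, so an outage of the configuration service shows up only as a failed `refresh`, not as a failed transaction.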

In this diagram: The edge path stays thin and resilient by handling logging and configuration asynchronously instead of letting those dependencies block transactions.

Keeping dependencies out of the critical path reduces failure points. This requires deliberate trade-offs: accepting degraded non-critical functionality (logging, metrics) to preserve transaction processing.

Summary

In distributed payments systems, resiliency isn’t achieved through monitoring and retries alone—it’s achieved by defining clear failure boundaries and enforcing them through design.

By organizing our core payments ecosystem into isolated, independently recoverable cells, we transform major failures into controlled routing decisions. Locality, deterministic routing, idempotent processing, and strict boundary enforcement work together to ensure growth and change don’t increase risk.

This discipline underpins our cell-based architecture, enabling us to operate a global payments platform with low latency and high resiliency—principles that continue shaping our evolution.

Reimagining Software Delivery with AI

2026-05-20T00:00:00-04:00

With the rapid rise of AI agents, we’ve entered a new phase of technological acceleration. Every week introduces new models, new capabilities, and new benchmarks, with one agent claiming deeper reasoning, another promising greater autonomy, and so on. Each new release expands context windows, multimodal inputs, or tool integration. The landscape is evolving at breakneck speed.

As engineers, technical project managers, product owners, quality engineers, and leaders, it’s natural to ask which of these tools truly matter—and how they can be applied to improve the way we build and deliver software.

The truth is that the race to adopt the ‘best’ model will never end. Technology has always evolved this way. The real question isn’t which model tops a benchmark, but rather: how do we use AI to improve the way we deliver value?

As we sought to answer this question, we discovered something unexpected.

The Opportunity

Our objective seemed straightforward: improve the product delivery lifecycle by leveraging emerging AI capabilities.

Initially, we approached the challenge like many organizations do — evaluate tools, pilot agents, integrate those agents into development workflows, and measure productivity gains. But we quickly realized that improving delivery wasn’t primarily a tooling problem.

The software lifecycle spans ideation, requirement definition, design, implementation, testing, deployment, and feedback. It involves product, architecture, engineering, QA, and leadership — each with different artifacts, incentives, and feedback loops.

AI could not simply be inserted into one step to magically transform outcomes. More importantly, we realized this transformation could not be owned by engineering alone. The opportunity extended across the lifecycle — from product teams shaping intent in the earliest phases, to delivery teams executing against validated requirements, to QA organizations continuously strengthening quality and release confidence.

If we wanted meaningful impact, we had to rethink the entire lifecycle — from early ideation and product definition through engineering delivery, testing, and production release.

Introducing Ideation to Implementation

Instead of treating AI as a coding assistant bolted onto implementation, we reframed it as a strategic co-creation partner embedded across the lifecycle. In this model, AI does not replace expertise; it amplifies it. These are some of the principles we looked for in the new approach:

It should help leaders clarify intent earlier.
It should enable product teams to test and refine concepts before committing engineering capacity.
It should strengthen traceability between business objectives and technical execution.
It should accelerate feedback loops and reduce ambiguity before code is written.

The goal was to leverage AI to enhance alignment, accelerate value realization, and consistently turn ideas into outcomes with greater precision and confidence. Rather than chasing model releases, we redesigned how we use AI to turn ideas into real customer impact.

A Recharged Software Development Life Cycle

Inspired by traditional SDLC and Agile principles, we designed a recharged lifecycle enhanced with AI across four integrated phases:

Envision + Define → Verify + Specify → Build + Integrate → Test + Release

Each phase produces measurable outputs, but the power lies in how they connect.

1. Envision + Define

Goal: Transform an ambiguous idea into structured, prioritized capabilities ready for engineering.

Upstream ambiguity is the largest drag on velocity and the root cause of downstream rework. When intent is unclear, everything slows: sprint planning, estimation, testing, and integration.

AI can help bring structure to early-stage thinking. In this phase, product and business stakeholders remain central, with AI helping accelerate discovery, alignment, prioritization, and readiness before implementation begins.

Market signals, research notes, and competitive inputs can be synthesized quickly and business capabilities can be mapped to measurable OKRs. Additionally, features can be decomposed into user stories with acceptance criteria, risks, and dependencies.

Instead of starting engineering with loosely formed epics, teams generate:

Planning-tool-ready feature sets
Clear Gherkin-based acceptance criteria
High-level architecture and integration diagrams
Dependency maps and risk registers
Preliminary estimations and readiness assessments

AI accelerates artifact creation, while humans validate feasibility and tradeoffs. The goal is not more documentation but clearer intent. When done well, this phase reduces ambiguity, improves sprint predictability, and shortens time-to-value.

2. Verify + Specify

Goal: Convert validated features into implementation-ready specifications with measurable completeness.

If “Envision” reduces ambiguity, “Verify” eliminates hidden risk. Most delivery failures don’t originate in code. They originate in incomplete or misaligned specifications — hidden edge cases, undocumented assumptions, missing non-functional requirements, or late-discovered integration constraints.

In this phase, AI acts as a systematic reviewer. User stories are rigorously evaluated against structured quality criteria, with acceptance criteria strengthened, dependencies validated, and data flows and observability requirements clearly defined. In parallel, architecture alignment is assessed early, diagrams are reviewed to close logical gaps, and integration risks are surfaced—so scaling and performance considerations are already modeled before sprint one begins.

Rather than relying solely on human review cycles, AI can serve as a second-pass auditor, identifying patterns of ambiguity across large backlogs.

This can help shift readiness from assumption to validation and support informed estimation. Historical velocity patterns and complexity comparisons inform sprint shaping and capacity modeling. The result is AI that does not replace team judgment but helps augment it with pattern recognition.

Before moving forward, readiness should be measurable:

Complete acceptance criteria
Validated architecture alignment
Documented dependencies and mitigation strategies
Defined test strategy
Clear linkage to KPIs

The cultural shift is subtle but powerful: engineers may begin to focus less on interpreting requirements and more on executing validated intent, with effects that can compound over time.

3. Build + Integrate

Goal: Translate validated intent into production-grade software with speed and discipline.

By the time we reach this stage, ambiguity should be minimal. Requirements are hardened, architecture is aligned, and dependencies are mapped.

Here, AI functions as a force multiplier, not an autopilot.

It can support structured code scaffolding aligned to user stories, automated unit test generation, inline documentation, and contract validation. When implemented correctly, boilerplate shrinks, cognitive load decreases, and engineers can focus on design integrity and problem solving instead of repetition.

Continuous alignment with architecture becomes critical here. Architectural drift — small deviations accumulating over time — is one of the most expensive long-term risks in software systems.

AI-assisted analysis can detect pattern deviations, inconsistent data models, unused interfaces, or emerging security and performance anti-patterns before they spread.

Integration, often the true bottleneck in delivery, shifts left. Early contract validation, mock generation, and scenario simulation can reduce late-stage surprises. Breaking changes may be identified sooner, and stabilization cycles may shorten.

CI/CD pipelines continue to evolve as well. AI can support the contextual analysis of build failures, help identify flaky tests, and assist in surfacing coverage gaps and quality trends. The pipeline can become more than a deployment mechanism — it can become an intelligent feedback engine.

When this phase operates on validated specifications, the impact is tangible:

Shorter development cycles
Reduced rework
Fewer integration failures
Slower technical debt accumulation
Increased delivery confidence

Engineering time shifts from correction to creation. That shift is where competitive advantage lives.

4. Test + Release

Goal: Deliver with confidence, not hope.

Testing in a recharged lifecycle is continuous. Because specifications are hardened early, AI can generate meaningful test cases directly from acceptance criteria, surface edge cases sooner, optimize regression suites instead of bloating them, and make traceability between requirements and coverage explicit. In this model, QA teams are not simply downstream validators. With AI-assisted traceability, risk analysis, and regression optimization, quality organizations become more active participants in continuously improving delivery confidence across the lifecycle.

Quality continues to shift from reactive detection to proactive insight. AI helps analyze defect patterns, build histories, performance regressions, and runtime signals to highlight risk areas before incidents occur.

As a result, release decisions become more evidence-informed rather than deadline-driven. Readiness can be evaluated through measurable indicators: defect trends, coverage completeness, performance stability, dependency health, and rollback preparedness.

After deployment, the loop should continue, with production serving as an ongoing source of learning. Usage patterns, feature adoption, and friction signals can be analyzed and fed back into the next “Envision” phase, supporting a more adaptive lifecycle. When Test + Release operates as an intelligent feedback system, organizations can experience fewer escaped defects, faster recovery, and greater stakeholder trust.

Over time, this can make reliability a stronger strategic differentiator and help build trust.

Our Findings: From Tools to Mindset

After applying this model, one insight became clear: the biggest challenge is not engineering complexity; it’s mindset. When implemented intentionally, AI can sharpen thinking, expose gaps earlier, strengthen alignment, and accelerate learning. But it does not replace judgment, accountability, or ownership. Those remain deeply human responsibilities.

What emerges is a new way of working:

Intent clarified before execution.
Specifications validated before implementation.
Integration treated as continuous.
Quality made predictive.
Feedback fueling the next idea.

AI is not redefining software delivery because it writes code faster, but because it changes when and how clarity is achieved. When ambiguity is reduced earlier and feedback is embedded across the lifecycle, the entire system, rather than individual steps, accelerates. Teams that do this well can move faster, reduce rework, and deliver with greater confidence—not by working harder, but by operating with sharper clarity from the start. This is what it means to move from ideation to implementation with precision.

Trust Without Disclosure: Why Zero-Knowledge Proofs Could Help Build Trust in AI Agents

2026-05-06T00:00:00-04:00

We’re moving from systems that respond to our questions to AI agents that act on our behalf. In this new era, AI agents can help book travel, manage tasks, and coordinate across systems, with less human intervention at each step.

This creates a practical problem: How do we trust these agents? How do we verify what they are allowed to do, or what they have done, without exposing sensitive information?

Enter zero-knowledge proofs—a cryptographic technique that lets you prove you know something without revealing what you know. It sounds like a magic trick, and in many ways, it is. But unlike magic, the mathematics behind it are provably sound.

The Agentic AI Trust Problem

Consider what happens if your AI assistant negotiates a deal with a vendor’s AI agent. Your agent needs to prove it has authorization to spend within a certain threshold, but revealing your exact budget gives the vendor leverage. The vendor’s agent needs to verify that your agent isn’t bluffing, but it doesn’t necessarily need to know your financial details.

Traditional approaches fail here. Revealing everything destroys negotiating leverage. Revealing nothing undermines trust. We need something in between: proof without disclosure.

This isn’t hypothetical. As organizations explore AI agents for more sensitive workflows across healthcare systems, financial platforms, and enterprise infrastructure, the question of how agents prove things to each other has become urgent.

Proving Without Showing: The ZK Paradigm

The classic illustration involves a cave with two paths, A and B, that meet at a magic door in the back. Peggy claims she knows the password to open the door. Victor wants proof, but Peggy refuses to reveal the password itself.

The protocol: Peggy enters the cave and randomly chooses a path. Victor, who can’t see which path she took, calls out which path he wants her to exit from. If Peggy knows the password, she can always comply—she either exits from the path she entered or uses the door to cross to the other side.

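If Peggy does not know the password, she can comply only when Victor happens to name the path she entered, which is a one-in-two chance per round. Each successful round therefore cuts Victor’s doubt in half. After 20 rounds, the chance that Peggy is faking it is (1/2)^20, less than one in a million. Victor becomes statistically convinced Peggy knows the password without ever learning what it is.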
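The halving argument can be checked with a few lines of arithmetic and a toy simulation of the cave protocol. The function names are hypothetical, and the simulation models only the path choices, not any real cryptography.

```python
import random
from fractions import Fraction


def cheat_probability(rounds: int) -> Fraction:
    """Chance that a prover without the password passes every round."""
    return Fraction(1, 2 ** rounds)


def run_protocol(knows_password: bool, rounds: int, rng: random.Random) -> bool:
    """Return True if Peggy exits from the requested path in every round."""
    for _ in range(rounds):
        entered = rng.choice("AB")    # Peggy's hidden choice
        requested = rng.choice("AB")  # Victor's challenge
        # With the password Peggy can always cross; without it she
        # succeeds only if she already entered by the requested path.
        if not knows_password and entered != requested:
            return False
    return True


print(cheat_probability(20))  # 1/1048576
print(run_protocol(True, 20, random.Random(0)))  # True
```

Since 2^20 = 1,048,576, twenty rounds push the cheating probability just below one in a million, matching the figure in the text.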

This captures the three essential properties of zero-knowledge proofs.

Completeness: if Peggy truly knows the password, she can always convince Victor.
Soundness: if Peggy doesn’t know the password, she can’t consistently fool Victor.
Zero-Knowledge: Victor learns nothing beyond the fact that Peggy knows the password.

From Theory to Agentic Reality

The cave example is interactive—it requires back-and-forth communication. Many modern ZK systems have evolved to support non-interactive proofs, where a prover generates a single proof that anyone can verify without further communication. This is essential for agentic AI, where agents may need to prove credentials asynchronously across different systems.
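To make the idea of a single, self-contained proof concrete, here is a toy Schnorr-style proof of knowledge made non-interactive with the Fiat-Shamir heuristic, where a hash of the prover's commitment stands in for Victor's random challenge. This technique is chosen for illustration and is not one of the systems discussed in this article; the group parameters are deliberately tiny and insecure, and the function names are hypothetical.

```python
import hashlib
import secrets
from typing import Tuple

# Toy parameters, far too small for real use:
# G = 2 generates a subgroup of prime order Q = 11 modulo P = 23.
P, Q, G = 23, 11, 2


def _challenge(public_key: int, commitment: int) -> int:
    """Derive the challenge by hashing, replacing the verifier's random choice."""
    data = f"{P}|{Q}|{G}|{public_key}|{commitment}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def prove(secret: int) -> Tuple[int, Tuple[int, int]]:
    """Return (public_key, proof) showing knowledge of `secret` without revealing it."""
    public_key = pow(G, secret, P)
    nonce = secrets.randbelow(Q)
    commitment = pow(G, nonce, P)
    challenge = _challenge(public_key, commitment)
    response = (nonce + challenge * secret) % Q
    return public_key, (commitment, response)


def verify(public_key: int, proof: Tuple[int, int]) -> bool:
    """Anyone can check the proof later, with no further messages from the prover."""
    commitment, response = proof
    challenge = _challenge(public_key, commitment)
    return pow(G, response, P) == (commitment * pow(public_key, challenge, P)) % P


public_key, proof = prove(secret=7)
print(verify(public_key, proof))  # True
```

The verifier recomputes the challenge from public values, so the proof can be generated once and checked asynchronously by any party, which is the property the paragraph above highlights for agentic workflows.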

Three main approaches have emerged, each with distinct trade-offs:

zk-SNARKs: Compact but Trust-Dependent

Succinct Non-Interactive Arguments of Knowledge produce remarkably small proofs—around 200 bytes regardless of what’s being proven. Verification is fast, making them ideal for resource-constrained environments. The catch: they require a trusted setup ceremony. If this setup is compromised, fake proofs become possible.

The trusted setup challenge: SNARKs require a one-time ceremony where multiple parties jointly generate cryptographic parameters. The setup remains secure as long as a single participant acts honestly and destroys their contribution—only if every participant colluded to combine their secret inputs could proofs be forged. This 1-of-N security model is why ceremonies involve many independent participants, but the coordination required is operationally complex for agentic systems that need rapid, dynamic deployment. Newer “universal setup” approaches (like Plonk) reduce this burden but don’t eliminate it entirely.

zk-STARKs: Transparent and Post-Quantum Friendly

Scalable Transparent Arguments of Knowledge eliminate the trusted setup entirely. Everything needed to verify proofs is publicly derivable. They’re also built on hash functions rather than elliptic curves, making them more resistant to quantum computing attacks. The trade-off: larger proofs (tens to hundreds of kilobytes instead of a few hundred bytes) and more computational overhead, which can increase verification time and on-chain costs.

Bulletproofs: Efficient and Setup-Free

Bulletproofs require no trusted setup and are particularly well-suited for proving that a value falls within a certain range—without revealing the value itself. Proof size grows slowly relative to what’s being proven, keeping them practical even in constrained environments.

Performance Reality: Async Over Real-Time

Proof generation takes time—seconds to minutes depending on circuit complexity. Today, this often makes ZK proofs better suited for asynchronous workflows: pre-flight credential checks, batch audit generation, or background compliance verification. An agent negotiating a contract can generate proofs between message exchanges; an agent executing millisecond trades cannot. Hardware acceleration is closing this gap but hasn’t eliminated it.

Where ZK Proofs Meet Agentic AI

The intersection of zero-knowledge proofs and agentic AI opens possibilities that neither technology enables alone:

Agent-to-Agent Authentication

When AI agents interact, they need to verify each other’s capabilities and authorizations. An agent could prove it’s authorized to access certain data, that its operation falls within specified parameters, or that it meets the requirements set by the receiving system—all without revealing the underlying credentials or system architecture.

Verifiable Agent Reasoning

One of the challenges with AI agents is understanding why they made certain decisions. ZK proofs could allow an agent to prove its reasoning followed certain rules or constraints without exposing its full reasoning chain, protecting proprietary models while enabling accountability.

Privacy-Preserving Collaboration

Multiple AI agents working together often need to share information selectively. A medical AI agent could prove that a patient meets defined eligibility criteria without revealing their complete medical history. A financial AI agent could prove that a transaction falls within approved limits without exposing full account details.

Audit Without Surveillance

Regulators and compliance systems need to verify AI agents operate within bounds, but constant surveillance creates privacy and competitive concerns. ZK proofs enable agents to generate audit trails that support compliance audits without exposing operational details.

Real-World Adoption: DIDs, VCs, and Beyond

Some Verifiable Credentials (VCs) and Decentralized Identifiers (DIDs) already leverage ZK proofs in production environments. Standardized credential frameworks and digital identity wallet initiatives are enabling selective disclosure—proving “I’m over 18” or “I hold certification X” without exposing full identity documents. Agentic commerce frameworks are now exploring VCs as the trust substrate for agent-to-agent transactions.

On an emerging frontier, ZK circuits are being developed that allow model creators to prove their training data was used under selected licensing or data-governance requirements—without revealing the dataset itself. As regulators increase scrutiny of AI training practices, this capability becomes a potential differentiator.

Current Limitations

Honest assessment is essential. Several constraints limit immediate deployment:

Tooling fragmentation: Proofs generated in one system (Circom) may not readily verify in another (Noir) without translation. Portability across agentic platforms—where Agent A’s proof must verify on Agent B’s stack—remains immature.
Blockchain dependency: Many of the most mature ZK implementations emerged from blockchain infrastructure (Ethereum L2s, Zcash, Mina). Enterprise tooling outside crypto is maturing but early-stage.
Computational overhead: Proof generation remains resource-intensive. Better suited for high-value, asynchronous verification than real-time decision loops.
Standards gap: There is not yet a broadly adopted standard for ZK-based trust in AI agent interactions. W3C’s DID and Verifiable Credentials specs provide the most mature foundation—already referenced by governments (EU eIDAS 2.0) and enterprises. The Decentralized Identity Foundation (DIF) and Internet Identity Workshop (IIW) are convening efforts, but agent-to-agent trust protocols remain undefined.

Looking Forward

The trust infrastructure for AI agents is still catching up to their capabilities. Zero-knowledge proofs represent one promising direction—offering a mechanism to establish verifiable trust without requiring full disclosure of underlying data.

Early convergence is visible. ZK-based identity frameworks are being explored as a way for agents to assert credentials selectively. Verifiable computation approaches could allow an agent to demonstrate what code it ran and on what inputs—shifting the basis of trust from assertion to proof. Standards work is beginning to examine how these tools might support compliant AI operations across different regulatory contexts.

Whether and how quickly these approaches are adopted remains an open question. But the underlying cryptographic primitives are well-studied, and the problems they address are real.

The cave example showed how to prove you know a password without revealing it. The agentic AI era presents an opportunity to scale that principle to everything agents do: proving authorization, proving compliance, proving correctness, all without disclosure.

What began as an elegant mathematical curiosity in 1985 may become part of the trust infrastructure for a world where autonomous agents act on our behalf. An idea that once seemed like magic may prove increasingly practical as autonomous systems take on more responsibility.

Building Trust in AI-Powered Transactions with Amex Agentic Commerce Experiences (ACE) Developer Kit

2026-04-22T00:00:00-04:00

Introduction: From User-Driven to Agent-Driven Commerce

What if purchases could be made on a Card Member’s behalf by an agent that understands what they need, when they need it, and how to execute the transaction?

In this scenario, the Card Member doesn’t explicitly add items to a cart or tap “Buy Now.” Instead, an agent could recommend options and complete a purchase using an American Express account, based on permissions defined by the Card Member.

At American Express, we’re building for this shift with the Agentic Commerce Experiences (ACE) Developer Kit.

This shift challenges a core assumption of how payments work today.

Traditional systems are built around user actions where a person browses, decides, and executes each step of a transaction. As agents begin to act on behalf of users, that interaction model may no longer hold. In a more autonomous agentic AI system, instead of repeatedly translating intent into action, the user defines intent once, and the agent continuously evaluates context, makes decisions, and executes when conditions are met. This is the new model the Amex Agentic Commerce Experiences (ACE) Developer Kit is designed to enable—bringing intent-driven, agent-powered transactions onto the American Express network with trust and control.

The Core Challenge: Enabling Delegation

This evolution introduces a complex problem: how do we help ensure that any agent taking delegated actions is explicitly authorized, controlled, and accountable?

How can an agent acting on a customer’s behalf prove it has the required authority to make purchases? How can guardrails be set for the agent’s actions? Who is accountable if the agent makes a mistake? These questions require the payment ecosystem to adapt for agent-powered transactions.

What it Takes: Core Capabilities

As we explored these challenges in building the ACE Developer Kit, it became clear that enabling AI-powered payments required a new set of capabilities.

Establishing Trust Through Identity and Enrollment

Trust begins with a clear, explicit setup: the Card Member provides their payment instrument for agents registered with Amex, completes issuer authentication, and defines controls on how that instrument can be used. This interaction model is supported by capabilities such as agent registration and account enablement, which help create a verifiable relationship between the Card Member, the agent, and the payment instrument.

Representing Intent as Enforceable Boundaries

At the time of purchase, the agent may play a role in both discovery and execution—recommending what to buy, selecting from relevant merchants, and completing the transaction. These decisions are guided by natural language instructions that express the Card Member’s intent, which ultimately helps determine what is purchased, how much can be spent, and with which merchants.

With that in mind, we built an interaction model where intent captures qualifiers such as:
- What should be purchased
- Where it could be purchased from
- How much can be spent
- When additional approvals are required

Intent helps define the boundaries within which the agent can operate, and the ACE Developer Kit helps make those boundaries enforceable in practice.

Enforcing Boundaries at Execution

Delegation often involves enforced boundaries, including spending limits, merchant restrictions, frequency controls, and conditional execution. These are applied at execution, helping to ensure agents operate within defined boundaries.

Securing Execution Across the Transaction Flow

At the payment authorization stage, payment credentials are designed to be used in a limited and controlled manner; actions are tied to verified intent, while execution is auditable and traceable.

The ACE Kit is designed to support this through:
- Scoped payment credentials
- Short-lived authorization artifacts
- Strong verification mechanisms

In this way, the agent can complete a payment within a defined framework.

What This Means for Developers

Instead of building complex payment and risk infrastructure from scratch, developers can integrate a framework that helps manage identity, intent, and execution in a consistent way. As the platform evolves, additional tools will further simplify integration and help accelerate development.

How This Works in Practice

When these capabilities come together, an adapted model of payments emerges. In the ACE Developer Kit, this interaction model is implemented through a sequence of coordinated steps.

Intent is first captured and stored as a structured contract. At the time of payment credential issuance, Amex generates a scoped credential tied to that intent.

Before a payment credential is generated, the ACE Kit is designed to validate:

- That the intent is still valid
- That constraints are satisfied
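The validation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Intent` fields, the `evaluate` function, and the three outcomes are hypothetical names chosen for this example and are not part of the ACE Developer Kit API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class Intent:
    """Hypothetical structured form of a Card Member's intent."""
    description: str               # what should be purchased
    allowed_merchants: frozenset   # where it could be purchased from
    max_amount: Decimal            # how much can be spent
    approval_threshold: Decimal    # above this, an additional approval is required
    expires_at: datetime           # after this, the intent is no longer valid


def evaluate(intent: Intent, merchant: str, amount: Decimal, now: datetime) -> str:
    """Return 'deny', 'needs_approval', or 'allow' for a proposed purchase."""
    if now >= intent.expires_at:
        return "deny"              # the intent is no longer valid
    if merchant not in intent.allowed_merchants:
        return "deny"              # merchant constraint not satisfied
    if amount > intent.max_amount:
        return "deny"              # spending constraint not satisfied
    if amount > intent.approval_threshold:
        return "needs_approval"    # within limits, but needs an extra approval
    return "allow"


intent = Intent(
    description="running shoes, size 10",
    allowed_merchants=frozenset({"merchant-a", "merchant-b"}),
    max_amount=Decimal("150.00"),
    approval_threshold=Decimal("100.00"),
    expires_at=datetime(2030, 1, 1, tzinfo=timezone.utc),
)
now = datetime(2029, 6, 1, tzinfo=timezone.utc)

print(evaluate(intent, "merchant-a", Decimal("89.99"), now))   # allow
print(evaluate(intent, "merchant-a", Decimal("120.00"), now))  # needs_approval
print(evaluate(intent, "merchant-c", Decimal("20.00"), now))   # deny
```

Keeping the intent immutable (`frozen=True`) reflects the idea that a credential is tied to the intent as it was captured; a changed intent would be a new contract rather than an in-place edit.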

Partner Integration: Supporting Existing Commerce Flows

The ACE Developer Kit is designed to integrate with partner environments—including AI agent providers, platforms, and merchants—without requiring fundamental changes to their existing payment infrastructure. Partners interact with the Kit through a combination of synchronous APIs and asynchronous notifications. API-driven interactions allow partners to initiate enrollment, create intent, and request payment credentials, while event-driven notifications provide updates on state transitions such as authentication completion or transaction outcomes. This dual model allows partners to operate in both request-driven and event-driven environments, depending on their architecture.

In a typical flow, the partner initiates a request to create or update an intent. Amex evaluates the request, applies risk controls, and determines whether criteria are met to permit execution. Scoped payment credentials are then issued and used by the merchant to process the transaction through existing payment rails.

Security Protocols and Design Patterns

Security is a foundational element of this architecture and is implemented across all stages of the lifecycle.

All API interactions are authenticated using OAuth and mutual TLS, ensuring that each request is associated with a verified partner or agent identity. This allows access to be scoped, monitored, and revoked as needed.

Sensitive payloads are protected using industry-standard encryption mechanisms, helping ensure account details and credentials remain protected in transit.

The platform follows a zero-trust model with respect to client input. No request is accepted based solely on client-provided data. Instead, each step is validated using signed artifacts that bind to a specific intent. These artifacts are short-lived and include protections against replay, ensuring that requests cannot be reused or forged. This combination of authentication, encryption, tokenization, and verification allows ACE to enforce strong security guarantees without introducing unnecessary friction for developers.
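To make the idea of a short-lived, replay-protected artifact bound to an intent more concrete, here is a minimal sketch. It is not the ACE implementation: the HMAC-based signature, the in-memory nonce cache, and the names `issue_artifact` and `verify_artifact` are assumptions chosen to keep the example self-contained. A production system would more likely use asymmetric signatures and a shared replay store.

```python
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional

SECRET = secrets.token_bytes(32)   # hypothetical signing key held by the issuer
_seen_nonces = set()               # in-memory replay cache, for illustration only


def _sign(payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


def issue_artifact(intent_id: str, ttl_seconds: int = 60, now: Optional[float] = None) -> dict:
    """Create a signed artifact bound to one intent, with an expiry and a nonce."""
    issued = time.time() if now is None else now
    payload = {
        "intent_id": intent_id,
        "expires_at": issued + ttl_seconds,
        "nonce": secrets.token_hex(16),
    }
    return {"payload": payload, "signature": _sign(payload)}


def verify_artifact(artifact: dict, intent_id: str, now: Optional[float] = None) -> bool:
    """Accept the artifact only if it is authentic, bound to this intent, fresh, and unused."""
    current = time.time() if now is None else now
    payload = artifact["payload"]
    if not hmac.compare_digest(_sign(payload), artifact["signature"]):
        return False                    # forged or tampered with
    if payload["intent_id"] != intent_id:
        return False                    # bound to a different intent
    if current >= payload["expires_at"]:
        return False                    # expired
    if payload["nonce"] in _seen_nonces:
        return False                    # replayed
    _seen_nonces.add(payload["nonce"])
    return True


artifact = issue_artifact("intent-123", ttl_seconds=60, now=1000.0)
print(verify_artifact(artifact, "intent-123", now=1010.0))  # True
print(verify_artifact(artifact, "intent-123", now=1020.0))  # False: same nonce reused
```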

Scaling Across Ecosystems

As adoption grows, the ACE Developer Kit is designed to scale and be interoperable with different standards and models.

By decoupling intent and credential issuance, and keeping the transaction processing the same for merchants, ACE avoids dependency across components, allowing each to scale independently. This approach enables partners to onboard incrementally, extend their use cases over time, and operate at scale without requiring changes to the core model.

Developer Experience and Integration Simplicity

While the ACE Developer Kit enforces complex validation and security controls, the integration model is intentionally designed to remain straightforward. Developers interact with a small set of well-defined capabilities: enrollment, intent creation, credential retrieval, and lifecycle management, which map cleanly to existing application flows. This reduces the need for custom orchestration logic and allows developers to focus on building agentic experiences.

To simplify integration, in the future, the platform will provide supporting tools such as SDKs, Agent Skills, an MCP server, and reference implementations. These abstractions encapsulate common patterns, allowing developers to focus on building user experiences rather than managing low-level details. Structured error responses and consistent request and response patterns help ensure that integrations are predictable and easier to debug.

This balance between strong controls and simple integration is critical for enabling adoption at scale.

What We Learned: Key Principles

As we worked through these challenges, a few principles became clear.

Delegation should always be explicit and verifiable, rather than implied.
Intent should capture the Card Member’s goal in a way that can be enforced.
Payment credentials should be scoped to the intent.

These principles shaped how we approached the problem.

Conclusion

The ACE Developer Kit brings a new payments model to life, providing developers with the capabilities to enable secure delegation, enforceable intent, and controlled execution on the American Express network.

Developers can explore the APIs, integration patterns, and supporting tools on the American Express developer portal: https://developer.americanexpress.com

Optimizing Istio for Large-Scale Enterprise Applications

2026-03-30T00:00:00-04:00

Overview

In today’s rapidly evolving cloud-native application landscape, adopting service meshes has become vital for effectively managing the complexities inherent in microservices architectures. Among the leading solutions, Istio stands out by offering a comprehensive suite of features, including traffic management, security, and observability.

If a large enterprise is expanding its use of Istio, performance optimization should sit front and center in the overall implementation strategy. Below, I’ll delve into proven strategies for enhancing Istio’s performance in large enterprises.

Sidecar resource usage and sizing

Managing Istio sidecar resources can pose significant challenges that often go unnoticed, potentially leading to application issues if not addressed properly. The default resource requests for the sidecar are 128Mi for memory and 100m for CPU, with limits set at 1Gi for memory and 2 cores for CPU.

When to Update CPU Requests

The Envoy proxy consumes approximately 0.5 vCPU per 1,000 requests per second. CPU requests should be increased when transactions per second (TPS) are high and the Istio sidecar approaches its configured CPU limits.

When to Update Memory Requests

The Envoy proxy uses 50 MB of memory per 1,000 requests per second going through the proxy. When numerous entries—such as egress or import and export resources—are added to the namespace, the sidecar will require additional memory to manage these configurations effectively.
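As a rough illustration of the two rules of thumb above, the sketch below turns a peak request rate into a starting point for sidecar requests. The function name and the `headroom` factor are assumptions made for this example. The estimate covers only the traffic-driven portion and ignores baseline usage and the extra memory needed for large configurations, so treat the result as a first guess to validate with load tests.

```python
import math

CPU_PER_1000_RPS = 0.5     # vCPU per 1,000 requests per second, as quoted above
MEM_MB_PER_1000_RPS = 50   # MB per 1,000 requests per second, as quoted above


def estimate_sidecar_resources(peak_rps: float, headroom: float = 1.5) -> dict:
    """Rough traffic-driven sizing for one sidecar at its peak request rate."""
    cpu_cores = peak_rps / 1000 * CPU_PER_1000_RPS * headroom
    memory_mb = peak_rps / 1000 * MEM_MB_PER_1000_RPS * headroom
    return {
        "cpu_millicores": math.ceil(cpu_cores * 1000),
        "memory_mb": math.ceil(memory_mb),
    }


# A sidecar handling 2,000 requests per second at peak, with 50% headroom.
print(estimate_sidecar_resources(2000))
# {'cpu_millicores': 1500, 'memory_mb': 150}
```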

The sample deployment configuration below provides a guide for modifying resource requests and limits. It’s crucial to specify limits for both CPU and memory; omitting these will result in limits being set to unlimited, which could lead to resource contention and instability.

spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # Sidecar limits: set both CPU and memory
        sidecar.istio.io/proxyMemoryLimit: 3Gi
        sidecar.istio.io/proxyCPULimit: '3'
        # Sidecar requests
        sidecar.istio.io/proxyCPU: '1'
        sidecar.istio.io/proxyMemory: 2G

    

When to use L4 over L7

Istio can handle pod-to-pod traffic at either Layer 7 (L7) or Layer 4 (L4), depending on the protocol declared for the destination Kubernetes service. If the appProtocol is set to ‘tcp’, Istio treats connections to that service as L4; if it is set to an HTTP-based protocol such as ‘http’, they are treated as L7.

In high-traffic scenarios where multiple hops are required between microservices before reaching the final response, each hop adds additional latency. If L7 controls are not required, disabling them can improve latency when using Istio.

Layer	Use When	Examples
L7	Advanced traffic routing, observability, or security for HTTP/HTTPS traffic is needed.	Canary deployments, API routing, fault injection.
L4	Protocol-agnostic traffic management or low-latency handling for non-HTTP protocols is needed.	Database traffic, gRPC, streaming services.

L7 destination [appProtocol as ‘http’]:

kind: Service
apiVersion: v1
metadata:
  name: myapp
  namespace: mynamespace
spec:
  ipFamilies:
    - IPv4
  ports:
    - name: http-8080
      protocol: TCP
      appProtocol: http
      port: 8080
      targetPort: 8080
  internalTrafficPolicy: Cluster
  type: ClusterIP
  ipFamilyPolicy: SingleStack
  sessionAffinity: None
  selector:
    app: myapp

    

L4 destination [appProtocol as ‘tcp’]:

kind: Service
apiVersion: v1
metadata:
  name: myapp
  namespace: mynamespace
spec:
  ipFamilies:
    - IPv4
  ports:
    - name: tcp-8080
      protocol: TCP
      appProtocol: tcp
      port: 8080
      targetPort: 8080
  internalTrafficPolicy: Cluster
  type: ClusterIP
  ipFamilyPolicy: SingleStack
  sessionAffinity: None
  selector:
    app: myapp

    

Sample Test results:

In a complex multi-hop microservices architecture (approximately 15 hops), tests show that using L4 yields about 53% better response times compared to L7 for pod-to-pod communication. This highlights the importance of choosing the right traffic management layer to optimize performance in microservices architectures.

With Layer 7: [TPS graph] [Response time graph: P90 reaching up to 340 ms at 6k TPS]

With Layer 4: [TPS graph] [Response time graph: P90 reaching only up to 140 ms]

Istio Retry logic

The default retry policy for the mesh includes connect-failure, refused-stream, unavailable, cancelled, and retriable-status-codes. It’s important to be cautious about retriable-status-codes, which, when combined with the configuration for http.StatusServiceUnavailable, means that Istio will, by default, retry any 503 error, even those intentionally returned by the service. Teams whose applications may not work well with the default retries should consider updating their retry logic as described at the link below:

https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry
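As a sketch of what an explicit retry policy might look like, the snippet below builds a VirtualService manifest with an HTTPRetry section and prints it as JSON, which Kubernetes accepts alongside YAML. The helper name, the service names, and the chosen values (2 attempts, a 2s per-try timeout, and a retryOn list that leaves out retriable-status-codes and unavailable) are assumptions for illustration only. Check the resulting behavior against the Istio version in use before relying on it.

```python
import json


def virtual_service_with_retries(name, namespace, host, attempts, per_try_timeout, retry_on):
    """Build a VirtualService manifest (as a dict) with an explicit HTTPRetry policy."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "route": [{"destination": {"host": host}}],
                    "retries": {
                        "attempts": attempts,
                        "perTryTimeout": per_try_timeout,
                        "retryOn": retry_on,
                    },
                }
            ],
        },
    }


manifest = virtual_service_with_retries(
    name="myapp",
    namespace="mynamespace",
    host="myapp.mynamespace.svc.cluster.local",
    attempts=2,
    per_try_timeout="2s",
    # Retry only on connection-level failures in this example.
    retry_on="connect-failure,refused-stream",
)
print(json.dumps(manifest, indent=2))
```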

Limiting the configuration sprawl that needs to be pushed out

To optimize control plane performance effectively, the most straightforward approach is to minimize the scope and size of the proxy configurations deployed to the data plane. For example, consider a specific workload ‘myapp’. Instead of pushing configurations for all services within the mesh, it’s possible to significantly improve efficiency by only deploying the proxy configuration relevant to workload ‘myapp’ and its dependent services. Utilizing the Sidecar resource allows for precise control over which configurations are sent, ensuring that only necessary data is pushed to the data plane.

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: sidecar-myapp
  namespace: 600000392-colleague360
spec:
  egress:
    - hosts:
        - '*/mysecondapp.mysecondappnamespace.svc.cluster.local'
        - '*/myexternalendpoint.com'
  workloadSelector:
    labels:
      app: myapp

    

Smart DNS proxy

Smart DNS Proxy is a feature in Istio that enhances DNS resolution for workloads within the service mesh. It allows Istio sidecars to intercept DNS queries and resolve them based on Istio’s service registry. There are a few things that need to be evaluated when setting up ServiceEntries for external access.

https://istio.io/latest/docs/ops/configuration/traffic-management/dns/#proxy-dns-resolution

Switch to resolution: NONE to avoid proxy DNS lookups entirely. This is suitable for many use cases.
If the domains being resolved are controlled internally, increasing their TTL is recommended.
If ServiceEntry is only needed by a few workloads, its scope can be limited with exportTo or a Sidecar.

Reduce stampede of DNS requests to upstream

Unlike most clients, which will handle DNS requests on demand before caching the results, the Istio proxy never conducts synchronous DNS requests. When a resolution: DNS type ServiceEntry is configured, the proxy will periodically resolve the configured hostnames and use those for all requests. This interval is fixed at 30 seconds and cannot be changed. The refresh occurs even if the proxy never sends any requests to these applications, and regardless of the TTL values returned by the DNS server. In large clusters with many service entries, this can generate a heavy load of upstream DNS queries.

The problem: synchronized 30-second DNS refreshes

Because the DNS refresh interval is fixed and identical across all proxies, large Istio meshes can experience highly synchronized DNS lookups. When hundreds or thousands of Envoy sidecars refresh DNS at the same 30-second boundary, a classic thundering herd effect ensues, leading to problems such as:

Burst spikes in DNS queries every 30 seconds
Increased load on CoreDNS / kube-dns or external DNS providers
DNS latency spikes or rate limiting from upstream DNS servers
Increased control-plane pressure during mass restarts or rollouts

This behavior becomes especially problematic during events like rolling restarts, deployments, or config pushes:
- Many proxies restart and reinitialize Envoy clusters simultaneously
- DNS resolution is triggered immediately during Envoy cluster warming
- This stacks on top of periodic refreshes, compounding DNS pressure

Each Envoy sidecar independently maintains its own DNS cache and schedules periodic asynchronous resolution using a timer-driven event loop. However, since the refresh interval is deterministic and starts at roughly the same time (e.g., proxy startup or cluster warming), thousands of sidecars can align their DNS queries on the same boundary.

Envoy’s DNS refresh behavior is interval-driven and does not strictly honor upstream TTLs in all cases. When TTLs are low (or effectively overridden by dns_refresh_rate), queries are issued more frequently than necessary.

Fix via PILOT_DNS_JITTER_DURATION

Thankfully, a solution exists. PILOT_DNS_JITTER_DURATION is an Istio configuration that introduces randomized jitter to DNS refresh scheduling across proxies.

Instead of all Envoy sidecars refreshing DNS exactly every 30 seconds at the same moment, Istio spreads those refreshes across a configurable time window. Each proxy still refreshes DNS on the same fixed interval, but the refreshes are intentionally de-synchronized.

This means:

The 30-second DNS refresh interval remains unchanged
Refresh timing is staggered across proxies
DNS query traffic is evenly distributed over time
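The effect of staggering can be seen in a toy simulation. This is not how Istio or Envoy implement jitter: it simply assumes each proxy receives one random start offset and then refreshes every 30 seconds, and it counts how many queries land in the busiest one-second window. The function name and parameters are hypothetical.

```python
import random
from collections import Counter

REFRESH_INTERVAL = 30  # seconds, the fixed interval described above


def peak_queries_per_second(num_proxies, jitter_seconds, duration=120, seed=0):
    """Return the number of DNS queries in the busiest one-second bucket."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(num_proxies):
        # One random offset per proxy; with no jitter every proxy starts at t=0.
        offset = rng.uniform(0, jitter_seconds) if jitter_seconds > 0 else 0.0
        t = offset
        while t < duration:
            buckets[int(t)] += 1
            t += REFRESH_INTERVAL
    return max(buckets.values())


# Without jitter, all 1,000 proxies refresh in the same second.
print(peak_queries_per_second(1000, jitter_seconds=0))   # 1000

# With refreshes spread across the interval, the peak is far lower.
print(peak_queries_per_second(1000, jitter_seconds=30))
```

The total number of queries is the same in both runs; only their distribution over time changes, which is why the refresh interval itself can stay fixed.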

Resulting benefits

Eliminates DNS query bursts caused by synchronized refreshes
Reduces load and rate-limit risk on DNS infrastructure
Improves DNS latency stability and P99 behavior
Makes large Istio meshes more resilient during restarts and scaling events

When to use it

PILOT_DNS_JITTER_DURATION is strongly recommended for:

Large Istio deployments with many sidecars
Heavy use of ServiceEntry with resolution: DNS
Environments sensitive to DNS performance or quotas

Logging Optimization

Istio can produce a significant volume of logs when default logging is enabled at the cluster level. This excessive logging can result in performance degradation, increased storage costs, and challenges in log analysis. To optimize logging practices, it is recommended to enable error logging by default while allowing application teams to manage logging settings for their respective applications. The steps below outline how to implement this best practice:

Add MeshConfig.ExtensionProvider.EnvoyFileAccessLogProvider at the cluster level to enable cluster-wide logging:

extensionProviders:
  - envoyFileAccessLog:
      path: /dev/stdout
    name: envoy-access-logs

    

Now create a cluster-wide Telemetry object that shows only error logs. The filter below can be updated based on usage requirements:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: default-exception-logging
  namespace: istio-system
spec:
  accessLogging:
    - providers:
        - name: envoy-access-logs
      filter:
        expression: "response.code >= 400 || xds.cluster_name == 'BlackHoleCluster' || xds.cluster_name == 'PassthroughCluster'"

    
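To make the effect of the filter expression above easier to reason about, here is a small Python equivalent applied to a few sample requests. The function name and the sample cluster names are illustrative assumptions. The real filter is evaluated by the proxy, not by application code.

```python
def should_log(response_code: int, cluster_name: str) -> bool:
    """Python equivalent of the access-log filter expression shown above."""
    return (
        response_code >= 400
        or cluster_name == "BlackHoleCluster"
        or cluster_name == "PassthroughCluster"
    )


samples = [
    (200, "outbound|8080||myapp.mynamespace.svc.cluster.local"),
    (503, "outbound|8080||myapp.mynamespace.svc.cluster.local"),
    (200, "PassthroughCluster"),
    (502, "BlackHoleCluster"),
]
for code, cluster in samples:
    outcome = "logged" if should_log(code, cluster) else "dropped"
    print(code, cluster, "->", outcome)
```

Only the first sample is dropped: successful requests to known in-mesh destinations produce no access log line, while errors and traffic to the blackhole or passthrough clusters are still recorded.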

For production setups, it is recommended to enable info-level access logging for Istio gateways as well, using the Telemetry object below:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: disable-providers-envoy-access-logs
  namespace: istio-gateways
spec:
  accessLogging:
    - providers:
        - name: envoy-access-logs

    

Application teams will now see only the minimum required error logs, as defined by the filter at the istio-system namespace level. If needed, they can enable access logging for their own workloads on demand with the Telemetry object below:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: myapp-telemetry
  namespace: mynamespace
spec:
  accessLogging:
    - providers:
        - name: envoy-access-logs
  selector:
    matchLabels:
      app: myapp

    

Metrics Optimization

Istio offers a wide range of additional metrics that can be easily enabled or disabled, as outlined below. However, leveraging these metrics comes with trade-offs in resource consumption and system complexity. Therefore, it is advisable to enable only the necessary metrics in production environments, while maintaining the flexibility to toggle metrics on or off in development and testing environments. Here’s an overview of the potential impacts:

Enabling additional metrics:

annotations:
  proxy.istio.io/config: |
    proxyStatsMatcher:    ## this part is adding additional metrics
      inclusionRegexps:
        - ".*upstream_rq_.*"
        - ".*upstream_cx_.*"
        - ".*downstream_rq_.*"
        - ".*downstream_cx_.*"

    
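Before enabling patterns like these, it can help to check which stat names they would actually admit, since broad regexes can pull in far more series than intended. The sketch below applies the same four expressions to a handful of sample names. The sample names are illustrative, and whole-name matching is an assumption that makes no difference for patterns that begin and end with `.*`.

```python
import re

INCLUSION_REGEXPS = [
    r".*upstream_rq_.*",
    r".*upstream_cx_.*",
    r".*downstream_rq_.*",
    r".*downstream_cx_.*",
]
_compiled = [re.compile(pattern) for pattern in INCLUSION_REGEXPS]


def is_included(stat_name: str) -> bool:
    """True if any inclusion regex matches the whole stat name."""
    return any(pattern.fullmatch(stat_name) for pattern in _compiled)


sample_stats = [
    "cluster.outbound|8080||myapp.mynamespace.svc.cluster.local.upstream_rq_total",
    "cluster.outbound|8080||myapp.mynamespace.svc.cluster.local.upstream_cx_active",
    "http.inbound_0.0.0.0_8080.downstream_rq_2xx",
    "server.memory_allocated",
]
for name in sample_stats:
    print("included" if is_included(name) else "excluded", name)
```

Here the first three names are included and the last is excluded. Counting matches against a real list of stat names from a running proxy gives a quick estimate of how many additional series each pattern would add.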

Increased Resource Usage:

Collecting and exporting additional metrics increases CPU and memory usage for the Envoy sidecar proxies.
The Prometheus server may also consume more resources to scrape, store, and query the expanded dataset.

Higher Network Overhead:

Exporting metrics from sidecars to telemetry systems generates additional network traffic.
This can impact overall cluster performance, especially in high-traffic environments.

Storage Requirements:

More metrics mean larger storage requirements for time-series databases like Prometheus.
Long-term retention policies may need adjustment to accommodate the increased data volume.

Conclusion

Running Istio at enterprise scale requires deliberate trade-offs rather than enabling every feature by default. As environments grow, unmanaged observability, retries, sidecars, and configuration sprawl can introduce significant performance and operational overhead.

A more thoughtful approach goes a long way: focus on high-signal telemetry, right-size sidecars, lean on L4 over L7 when possible, and be selective about when to use retries. Keeping configuration scope controlled is equally important to ensure the control plane remains scalable and predictable.

Advanced patterns like Smart DNS proxies can be useful in targeted scenarios, but they introduce additional DNS load and should be adopted cautiously with clear justification and monitoring.

Ultimately, successful Istio operations depend on continuous tuning—measuring impact, refining configurations, and evolving alongside workload and traffic changes.

*Note: Istio is an open-source technology.

Migrating the Payments Network Twice with Zero Downtime

2026-03-12T00:00:00-04:00

If you tuned in to Monster Scale Summit this year, you may have seen our talk on migrating the American Express Payments Network not once, but twice, with zero customer-impacting downtime: no transactions were interrupted and no planned maintenance windows were required during either migration. The session focused on how we moved live payments traffic reliably under strict operational constraints. If you missed it, the talk is available to watch on the Monster Scale Summit website.

This article expands on the conference talk and dives deeper into the engineering decisions, tradeoffs, and lessons learned across both migrations.

Context: The Payments Network

The payments network is a mission-critical distributed system responsible for processing critical payments traffic, including live card authorization. It serves as the bridge between American Express merchants, acquirers, and issuers globally.

This platform must be continuously available, operate at low latency, and handle large volumes of critical traffic.

Migration Constraints

In 2018, American Express began a multi-year modernization of our payments network, including migrating from a legacy platform to a new microservices-based architecture.

A migration of this scale had to operate within several non-negotiable constraints:

The migration had to be performed online, with no planned or unplanned downtime.
The new platform had to reimplement existing payment processing logic; regressions in functionality were not acceptable.
Latency, throughput, and resiliency characteristics had to remain consistent, and in some cases improve.
Payment requests could not be dropped, delayed, or left unanswered.

Not only did we need to migrate under these constraints once - we needed to do it twice.

Migration #1: From the Legacy Payments Network to the New Platform

The first migration involved transitioning live card authorization traffic from the legacy payments network to a new, modernized platform.

While the payments network is large and complex, real-time card authorization traffic is primarily handled by two subsystems: a routing layer (which we’ll refer to as the “Global Transaction Router” or “GTR”, for simplicity) and the payments processing platform.

Understanding these two subsystems is key to understanding how we approached the migration.

Global Transaction Router (GTR)

The GTR acts as the gateway into the payments network. Unlike typical backend platforms, card authorization traffic is primarily sent over long-lived TCP connections carrying ISO8583 messages, a message format specific to payments.

The GTR manages these long-lived connections from acquirers and issuers and routes incoming transactions to the payments processing platform. It is also responsible for routing responses from the payments processing platform to network participants.

The router intentionally implements a minimal understanding of payment protocols - just enough to make routing decisions. Its primary role is to make routing, failover, and traffic-shaping decisions without owning payment processing logic.

Acting as the gateway, the GTR also provides centralized traffic control and resiliency for the payments network. It sits at the edge of the payments network and is highly specialized, optimized for low latency and high throughput.

Payments Processing Platform

The payments processing platform is where the complex, business-critical payment processing logic lives.

This platform is implemented as a microservices-based architecture, consisting of numerous services and databases. As transactions flow through the payments network, the payments processing platform validates, enriches, and transforms them.

This logic has been developed and refined over many years. Rebuilding this logic was a significant undertaking, and ensuring parity with the legacy system was critical.

Migration Strategy

Rebuilding the full payments network from scratch was a significant, multi-year effort involving complex processing logic, extensive edge cases, and exception handling. Waiting for full platform completion before migrating live traffic was not an option: in the meantime, any new functionality would have had to be built in both the legacy and new systems, leading to duplicated effort and increased risk of functionality drift.

Instead, we broke the migration into three stages:

Stage 1: Connection Migration
Stage 2: Shadow Traffic
Stage 3: Canary Routing

Stage 1: Connection Migration

In the first stage, we wanted to introduce the GTR into the flow of transactions. This was the most critical stage of the migration - it enabled every other stage and was the first time a new component was inserted into the live traffic path.

When new connections landed on the GTR, it routed all traffic to the legacy payments network. This allowed us to introduce the GTR without requiring processing logic parity.

For each incoming connection, the GTR established a corresponding connection to the legacy payments network. Any transaction received on the incoming connection was forwarded to the legacy payments network over the downstream connection. No logic, no message parsing, just simple forwarding.
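The one-to-one forwarding described above can be sketched with Python's asyncio streams. This is an illustration under stated assumptions, not the GTR's implementation: `pipe` and `make_handler` are hypothetical names, and error handling, timeouts, and metrics are left out.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


def make_handler(legacy_host, legacy_port):
    """Build a connection handler that forwards everything to the legacy network."""

    async def handle(client_reader, client_writer):
        # One downstream connection per incoming connection.
        legacy_reader, legacy_writer = await asyncio.open_connection(
            legacy_host, legacy_port
        )
        # Forward bytes in both directions without inspecting them.
        await asyncio.gather(
            pipe(client_reader, legacy_writer),
            pipe(legacy_reader, client_writer),
        )

    return handle
```

A handler built this way could be passed to `asyncio.start_server`. Because no bytes are parsed, the forwarder needs no knowledge of the message format.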

This approach allowed us to insert centralized traffic control and resiliency into the payments network with minimal risk. To reduce risk further, we migrated connections in small batches, monitoring system health and performance closely. Observability and metrics from the GTR were critical during this stage.

Stage 2: Shadow Traffic

With the GTR in place, we were able to introduce shadow traffic to the new payments processing platform.

Shadow traffic is, at its core, a replay of live production traffic. We deployed a dedicated production instance of the new payments processing platform and replayed a copy of live traffic to it.

If there were any functional discrepancies between the legacy and new payments processing platform, they would show up here.

This shadow traffic capability allowed us to validate payment processing logic in a production-like environment without impacting live traffic. It did not replace traditional unit and functional testing, but rather it provided a final validation step before routing live traffic to the new platform.
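As a rough illustration of the idea, the sketch below sends each transaction to the live platform, replays a copy to a shadow platform, and records any difference in the responses. The class and field names are hypothetical, and the shadow call is made inline for brevity; a production setup would more likely replay traffic out of band so the shadow path can never add latency to live transactions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ShadowMirror:
    primary: Callable[[dict], dict]   # serves the live response
    shadow: Callable[[dict], dict]    # receives a copy; its response is never returned
    mismatches: list = field(default_factory=list)

    def handle(self, txn: dict) -> dict:
        live_response = self.primary(txn)
        try:
            # Copy the transaction so the shadow path cannot mutate live data.
            shadow_response = self.shadow(dict(txn))
            if shadow_response != live_response:
                self.mismatches.append((txn, live_response, shadow_response))
        except Exception as exc:
            # A failing shadow platform is a finding, not a live outage.
            self.mismatches.append((txn, live_response, exc))
        return live_response
```

The caller only ever sees the live response; discrepancies accumulate in `mismatches` for later review.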

Stage 3: Canary Routing

With processing logic validated via shadow traffic and the GTR in place, we were ready to route live traffic to the new payments processing platform.

We applied canary deployment principles to the platform migration. We extended the GTR with just enough understanding of payment protocols to make routing decisions based on transaction attributes.

This allowed us to take small percentages of live traffic and route them to the new payments processing platform. As functionality was ready, we identified customer segments and transaction types that could be routed to the new platform.

The GTR took care of routing these transactions to the appropriate backend platform based on the canary configurations. All canary decisions were enforced centrally by the GTR, before transactions reached the payments processing platform. This canary routing capability was implemented as custom logic within the GTR to support this migration and has since become a critical component of the Payments Network architecture.

We started with 1%; when everything looked good, we increased to 5%, then 10%, and so on.

If anomalies were detected, we immediately reverted all routing back to the legacy payments network. This gradual approach allowed us to migrate live traffic with minimal risk. We avoided any big-bang cutovers or customer impacts.
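A minimal sketch of percentage-based canary routing with an instant rollback follows. The attribute names (`segment`, `account_id`), the hashing scheme, and the class itself are assumptions for illustration, not the GTR's actual logic.

```python
import hashlib


class CanaryRouter:
    def __init__(self, eligible_segments, percent=0.0):
        self.eligible_segments = set(eligible_segments)
        self.percent = percent  # share of eligible traffic sent to the new platform

    @staticmethod
    def _bucket(key: str) -> int:
        """Map a key to a stable bucket in the range 0..9999."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % 10_000

    def route(self, txn: dict) -> str:
        # Only segments that are ready for the new platform are considered.
        if txn["segment"] not in self.eligible_segments:
            return "legacy"
        # Stable hashing keeps a given account on the same side as the
        # percentage grows, so raising 1% to 5% only adds accounts.
        if self._bucket(txn["account_id"]) < self.percent * 100:
            return "new"
        return "legacy"

    def rollback(self):
        """Send all traffic back to the legacy platform."""
        self.percent = 0.0
```

Because the decision is a pure function of the transaction and one number, reverting is a single configuration change rather than a redeploy.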

In addition to reducing risk, this approach reduced duplicated development effort. It allowed the platform to evolve with real traffic without needing to maintain two separate codebases for an extended period.

Migration #2: Kubernetes Infrastructure Migration

After the new payments processing platform was operational, we faced a second major migration that reused the same traffic control patterns established during the platform migration. We needed to move from a legacy Kubernetes infrastructure to a new Kubernetes environment.

Due to significant differences in networking, security, and operational practices between the two environments, an in-place migration was not feasible. This required a full rebuild of the payments network infrastructure in the new Kubernetes environment.

This meant we needed to migrate live traffic again - with zero downtime. Latency, throughput, and resiliency characteristics had to remain consistent as well.

Environment Setup and Validation

The first step in this migration was establishing the new Kubernetes environment. We leveraged infrastructure-as-code to ensure consistency and repeatability across test and production environments.

Pod and service configurations were exported from our existing production environment and redefined as declarative infrastructure-as-code configurations.

This approach ensured consistency across regions and environments. It took time to get right, but once we had a solid foundation, we could spin up new environments quickly, both for the initial migration and future expansions. Any new infrastructure changes now start with infrastructure-as-code definitions.

Performance and Resiliency Testing

With the new environment established, we validated that it could meet our performance and resiliency requirements. We first established a performance baseline in our existing environment. We then deployed the same application versions into the new environment and ran load tests to compare performance characteristics. The new environment exhibited differences that required tuning.

We implemented those tuning changes via infrastructure-as-code and rolled them out to all environments.

Resiliency testing followed a similar approach. We ran various failure scenarios in the existing environment, documented the results, and then ran the same scenarios in the new environment. Any discrepancies were investigated and resolved via infrastructure-as-code changes.

Before moving any traffic, we ensured the new environment met or exceeded all performance and resiliency requirements.
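One simple way to express a "meets or exceeds the baseline" gate is to compare tail latency between the two environments. The sketch below is an assumption-laden example: the choice of p99, the 5% tolerance, and the function names are illustrative, not our actual acceptance criteria.

```python
from statistics import quantiles


def p99(samples):
    """99th percentile of a list of latency samples (needs at least two)."""
    return quantiles(samples, n=100)[98]


def meets_baseline(baseline_ms, candidate_ms, tolerance=0.05):
    """True if the candidate's p99 latency is within tolerance of the baseline's."""
    return p99(candidate_ms) <= p99(baseline_ms) * (1 + tolerance)
```

The same pattern extends to throughput or error rates by swapping the statistic being compared.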

Canary — Again

With the new environment validated, we were ready to migrate live traffic again - with zero downtime.

We reused the same canary routing strategy from the first migration. This time, we were routing traffic between two identical payments processing platforms. External ISO8583 connectivity continued to terminate at the edge; canary routing was applied only to internal gRPC traffic between the GTR and the payments processing platform.

As we built the GTR, we implemented canary deployments leveraging Envoy Proxy and a custom control plane. While our initial implementation was focused on routing between different versions within the same region, we extended this capability to route between different regions.

We called this multi-region canary routing. This allowed us to route all traffic from one region to another. With traffic re-routed, we were free to enable the new Kubernetes environment in the original region.

Once ready, we routed percentages of traffic back to the original region, now running the new Kubernetes environment. We gradually increased traffic back to the original region, monitoring system health and performance closely.

Observability was as critical to this step as the canary routing itself. Our business metrics, application logs, and application health metrics all gave us visibility into how the new environment was performing under live traffic. If issues were detected, we could quickly revert all traffic back to the secondary region.
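The ramp-and-revert loop described above can be sketched as follows. `set_weight` and `is_healthy` are hypothetical callbacks standing in for the routing control plane and the observability checks; a real rollout would also hold at each step for a soak period before moving on.

```python
def ramp_traffic(steps, set_weight, is_healthy):
    """Raise the share of traffic sent to the rebuilt region step by step.

    Returns the final percentage that was left in place.
    """
    for percent in steps:
        set_weight(percent)
        if not is_healthy():
            # Revert all traffic to the secondary region and stop ramping.
            set_weight(0)
            return 0
    return steps[-1]
```

For example, `ramp_traffic([1, 5, 10, 25, 50, 100], set_weight, is_healthy)` walks through the steps and stops at the first unhealthy reading.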

Lessons Learned

Both migrations were significant undertakings, and we learned a lot along the way.

Traffic Control was Essential

The GTR and Envoy Proxy-based canary routing were essential components of both migrations. They provided the traffic control needed to safely route live traffic between different platforms and environments.

These capabilities were initially developed as glue code, but over time evolved into critical components of our payments network architecture.

Rolling Back is a First-Class Capability

In both migrations, the ability to quickly and safely roll back changes was essential. Designing systems and processes with rollback in mind reduced risk and allowed us to respond quickly to any issues that arose.

Invest in Observability

Observability was critical to the success of both migrations. Having deep visibility into system health, performance, and business metrics allowed us to make informed decisions during the migrations.

Shadow Traffic is Invaluable

The shadow traffic capability provided a final validation step before routing live traffic to the new payments processing platform. This capability was essential in identifying any unknown discrepancies between the legacy and new systems.

We’ve since leveraged this capability for ongoing testing and validation of new features and changes. We are also using this capability to validate other downstream systems migrations.

Infrastructure-as-Code is Non-Negotiable

Leveraging infrastructure-as-code for the Kubernetes migration ensured consistency and repeatability. It allowed us to manage complex infrastructure changes with confidence, and it set the foundation for future expansions.

The Most Important Lesson

The most important lesson was patience and discipline. In payments, success is measured in reliability, even if it takes longer to get there.

When Human Feedback Is Scarce, How Do You Evaluate AI?

2026-03-02T00:00:00-05:00

Evaluating AI systems is easy… until it isn’t.

For many user-facing applications like travel planning, clinical note drafting, or conversational agents, there is no single “right answer.” Quality is subjective, contextual, and often best judged by people. As a result, human feedback such as ratings, preferences, and real-world behavior is the most reliable signal we have.

But in early-stage systems and research prototypes, that signal is often too sparse, too expensive, or too slow to guide development. This creates a gap between how AI systems are evaluated in research settings and how they need to be evaluated in real-world deployment.

This challenge sits at the heart of a new ICLR paper, AutoMetrics: Approximate Human Judgments with Automatically Generated Evaluators, authored by researchers from American Express and Stanford HAI.

The work introduces a practical open-source framework for transforming small amounts of human feedback into scalable and interpretable evaluation metrics, helping teams move from prototypes to production with greater confidence.

From Expensive Human Labels to Practical Metrics

Today, when evaluating AI systems, there is often a trade-off between two imperfect options:

Human evaluation: Accurate but costly and slow.
LLM-as-a-Judge: Fast and inexpensive but can be brittle and often poorly aligned with what users actually care about.

AutoMetrics offers a third path.

The key idea is simple but powerful: instead of relying on a single evaluator, AutoMetrics learns a weighted combination of evaluation metrics that best matches human judgment, using fewer than 100 feedback points.

The framework operates in four steps:

Generate candidate metrics: AutoMetrics automatically creates task-specific evaluation criteria (e.g., clarity, usefulness, policy compliance) using LLMs.
Retrieve existing metrics from MetricBank: The system draws from MetricBank, a curated library of 48 well-documented evaluation metrics spanning tasks like summarization, dialogue, code generation, and safety.
Learn how to combine them: Using lightweight regression, AutoMetrics identifies which metrics matter most and how they should be weighted to best predict human feedback.
Report interpretable evaluators: The output isn’t just a score; it’s a breakdown of why a system is performing well or poorly.

The result is an evaluation signal that is data-efficient, adaptive, and explainable.
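To give a feel for the "learn how to combine them" step, here is a small sketch that fits one weight per metric so the weighted sum tracks human ratings. It uses ordinary least squares as a stand-in; the paper's regression method may differ, and the function name is hypothetical.

```python
def fit_weights(metric_scores, human_ratings):
    """Least-squares fit: find w so that sum(w[j] * scores[j]) ~= rating.

    metric_scores is a list of rows, one row of metric values per example.
    """
    n = len(metric_scores[0])
    # Normal equations: (X^T X) w = X^T y
    a = [[sum(row[i] * row[j] for row in metric_scores) for j in range(n)]
         for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(metric_scores, human_ratings))
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - tail) / a[i][i]
    return w
```

The fitted weights double as the explanation: a metric with a weight near zero contributes little to predicting what people rated highly.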

Stronger Alignment with Human Judgment

The paper analyzes AutoMetrics across five diverse tasks, including dialogue, product description generation, code completion, and travel planning. Across these settings, AutoMetrics consistently outperforms strong baselines.

It improves correlation with human ratings by up to 33.4% compared to standard LLM-as-a-Judge approaches.
Performance saturates with approximately 80 human feedback examples, making it practical even for low-data settings.
The learned metrics remain stable under irrelevant changes and sensitive to real quality degradations, a key requirement for trustworthy evaluation.

In other words, AutoMetrics doesn’t just assign scores; it behaves like a reliable proxy for how people actually judge quality.
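"Correlation with human ratings" can be measured in several ways. As one illustration, the sketch below computes Kendall's tau, which checks how often a metric orders pairs of outputs the same way people do. The choice of statistic is an assumption here (the paper may report a different one), and this simple version does not correct for ties.

```python
from itertools import combinations


def kendall_tau(metric_scores, human_ratings):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_ratings), 2):
        sign = (m1 - m2) * (h1 - h2)
        if sign > 0:
            concordant += 1   # metric and humans agree on the ordering
        elif sign < 0:
            discordant += 1   # metric and humans disagree
    pairs = len(metric_scores) * (len(metric_scores) - 1) // 2
    return (concordant - discordant) / pairs
```

A value of 1.0 means the metric ranks every pair exactly as people did, and -1.0 means it ranks every pair in the opposite order.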

Evaluation That Can Drive Optimization

One of the most compelling findings goes beyond measurement. The authors show that AutoMetrics can be used as a proxy reward signal to optimize an AI agent, matching or even exceeding the performance of systems trained with a fully verifiable reward. In a realistic airline-assistance benchmark, agents optimized with AutoMetrics improved at the same rate as those trained with explicit ground-truth rewards.

This opens the door to human-aligned optimization in domains where rewards are subjective, ambiguous, or hard to formalize.

Why This Matters

For practitioners building real-world AI systems, AutoMetrics points to a future where:

Evaluation adapts as products evolve
Small amounts of user feedback go much further
Metrics are understandable enough to guide iteration, not just leaderboard scores

By releasing AutoMetrics and MetricBank as open-source tools, the authors aim to make adaptive, human-aligned evaluation a standard part of the AI development workflow.

Looking Ahead: Evaluation That Keeps Pace with AI

As AI systems move faster from prototype to production, evaluation can no longer be an afterthought or a bottleneck. AutoMetrics shows that it’s possible to ground evaluation in human judgment without requiring massive labeling efforts, and to do so in a way that remains explainable, adaptive, and actionable.

The broader implication is clear: evaluation itself must become a learning system. By discovering what users actually value and translating that signal into scalable metrics, AutoMetrics reframes evaluation as a first-class component of AI development, rather than a scorecard at the end. These metrics can then be used to optimize AI Agent configurations.

For teams building AI in open-ended, user-facing domains, this work points toward a future where small amounts of real feedback can drive rapid iteration, reliable optimization, and more human-aligned systems from day one.

As the community continues to explore adaptive evaluation, AutoMetrics provides both a practical toolkit and a compelling blueprint for how we measure progress in AI—when the only true reference is human judgment.

Read the full ICLR paper:
AutoMetrics: Approximate Human Judgments with Automatically Generated Evaluators

Note: This research is part of an industrial affiliate program.

Mastering Decision-Making in Technology

2026-02-11T00:00:00-05:00

We all do a common thing, every single day, especially in the fast-paced world of engineering leadership: making decisions. Big ones, small ones, the kind that keep you up at night, and the ones you barely notice.

For the longest time, I prided myself on my “gut feeling” and ability to make quick calls. Sometimes it worked spectacularly. Other times… well, let’s just say hindsight is 20/20, and some decisions felt more like stumbling in the dark than striding confidently forward.

I realized that just being smart or experienced wasn’t enough. Leading a team, building complex systems, and navigating the business landscape demands more. It demands smarter decision-making. Not just faster, but better. I needed a process to cut through the noise and, frankly, get out of my own way.

So, I went on a bit of a quest, researching, completing trainings and courses, and diving deep into the art and science of decision-making strategy. And wow, did I learn a few things! I want to share my journey and some “aha!” moments, hoping they might help you level up your own decision-making.

Step 1: Define the Problem

How often have we jumped into coding a solution, only to realize later we misunderstood the core need? Well-defined problems lead to breakthrough solutions.

Adopt a more rigorous Problem Definition Process:

Problem Statement: Write it down. Is it clear? Is it actually multiple problems? What does success look like? Who needs to be involved?
Need: What’s the fundamental need? Who benefits? Why?
Justification: Does this align with our strategy? What are the measurable benefits? How do we ensure implementation?
Context: What have we or others tried? What are the constraints (tech debt, budget, regulations)?

“Instead of just fixing a slow page, we dug deeper to define the problem as: an API endpoint’s response times increased significantly over the past month, correlating with a decline in user engagement.”

Step 2: Choose Your Battles

What now? We design a better decision-making system.

Not every decision needs a 10-page analysis. Focus intense effort on the critical, high-impact decisions. Which ones truly warrant the deep dive? Think of it as triaging decisions, classifying them as low-, medium-, and high-stakes.

Leaders can get stuck treating all decisions as equal. The skill is in knowing when to slow down and invest more thinking versus when to move fast and conserve energy for the calls that matter most.

Step 3: Recognize and Spot Biases

We’re all biased. It’s not a moral failing; it’s just how our brains are wired. We take mental shortcuts (heuristics) to deal with complexity, but sometimes these shortcuts lead us down the wrong path. Think of it like wearing slightly warped glasses—the world looks almost right, but things are subtly off, leading to missteps.

Here are the common culprits:

Action-Oriented Bias: We want to go fast, jumping into solutions before fully understanding the problem. We need to embrace uncertainty and explore before executing.

“I dove right into coding a complex feature request without writing a proper design doc. Halfway through, it hit me—I’d completely missed some crucial edge cases and overlooked key non-functional requirements.”

Pattern-Recognition Bias: Seeing patterns where none exist, often based on past (but maybe irrelevant) experiences. Like assuming a new coding challenge is exactly like one you solved five years ago, ignoring crucial differences. Change the angle and look from a different perspective.

“I caught myself assuming a performance issue must be the database again, without even checking the caching layers or network latency first. I defaulted to my past experiences instead of considering other possibilities.”

Stability Bias: Preferring the status quo even when change is needed. “If it ain’t broke, don’t fix it” can be dangerous in a dynamic environment. Sometimes, you need to shake things up!

“Hesitating to upgrade frameworks that are outdated and lack essential features because it feels too disruptive.”

Interest Bias: This one is very common for tech leaders, letting personal or team incentives cloud judgment. Is this really the best technical solution, or does it just let my team use that shiny new framework they love? It is important to make those interests explicit!

“Let’s use Rust for a new service, even if the rest of the team isn’t proficient in it.”

Social Bias: Grounded in groupthink or letting the loudest voice dominate. We need processes that encourage diverse viewpoints and depersonalize debate.

“I remember that architecture review where I found myself deferring to the most senior engineer’s opinion. Even though I had concerns, I hesitated to speak up, and I noticed that other junior members did the same. The senior voice dominated the conversation, and our quieter perspectives were never heard.”

Recognizing these biases is like turning on a light switch. It’s about seeing potential pitfalls before falling into them.

Step 4: Deploy Countermeasures for Biases

Use targeted tactics:

Think statistically and rely on data rather than intuition.
Make sure to gather diverse perspectives.
Aggregate input from multiple team members to improve decision quality.

Step 5: Embed those Countermeasures

Make it routine. Add bias checks to your formal decision processes (like project kick-offs or solution or strategy reviews).

The key to embedding is ritualizing good practices. A few practical ways to make it stick:

Add a simple “bias check” question into project templates: “What blind spots might affect this decision?”
In retrospectives, explicitly review not just outcomes but the decision process: Did we rush? Did we ignore dissenting opinions?
Incorporate bias-awareness and problem-definition training into onboarding for engineers and managers so that new team members are aligned from the start.
Celebrate examples of good decision-making, not just good results — sometimes a well-structured process prevents disaster, even if the initial idea didn’t pan out.

Over time, these small rituals hardwire bias awareness and structured decision-making into the team’s cultural DNA, so it becomes second nature.

Step 6: Remain Grounded in Strategy

Decisions don’t happen in a vacuum. They need to serve a larger strategy.

I used to think strategy was just for senior executives. But strategy is crucial at every level. Why? Scarcity. We don’t have infinite time, money, or people. Strategy helps us make choices about where to focus our limited resources.

Crucially, strategy needs:

Internal Fit: Do the pieces work together logically, reinforcing each other?
External Fit: Does it match the reality of the market, tech trends, regulations, etc.?

Your internal plan isn’t helpful if the external world makes it obsolete. Strategy ensures that every investment and development aligns with the broader objectives of the enterprise, creating a whole that is greater than the sum of its parts.

Step 7: Move From Gut Feel to Hypothesis-Driven

This was a big shift. Instead of saying “I think this feature will work,” start saying, “My hypothesis is that if we build feature X (independent variable), then we will see a Y% increase in metric Z (dependent variable).”

Why? Because most ideas, even good-sounding ones, often don’t deliver the expected value when tested scientifically! We need to move from intuition to evidence.

Process:

Ask Questions: Start broad (exploratory questions) especially with unknowns. Why is the system slow? What are users really trying to do?
Collect Facts / Stats: Gather as many facts, data points, and different perspectives as possible.
Formulate Hypotheses: Get specific (confirmatory questions). Make them measurable and testable.
Test & Learn: Run experiments and gather data (POCs, tests, user studies, input from others who have done it in the past).
Refine: Was the hypothesis right, wrong, or partially right? Update your understanding and iterate.

The Road Ahead

This isn’t an overnight transformation and each of our leadership transformation journeys will look different. It’s an ongoing practice of awareness, discipline, and learning.

But the payoff? More confident decisions, better team alignment, strategies that actually work, and ultimately, building better products and stronger teams to win.

It’s about shifting from simply reacting to proactively architecting our decisions and strategies. It takes effort, but the clarity and effectiveness it brings are invaluable.

The Innovation Behind Amex’s Platinum Card Refresh

2026-01-29T00:00:00-05:00

At American Express, innovation is not a one-time milestone; it’s a continuous journey. Over the past decade, we’ve reimagined the way we deliver new Card products and benefits to our Card Members by thoughtfully investing in technology platform modernization, API architecture, data-driven insights, and digital experiences. As a result, we’ve completed more than 200 Card refreshes since 2019, averaging over 30 each year. Our most recent U.S. Platinum Card refresh, the most ambitious yet across both consumer and business Cards, showcases this transformation in action.

Continuous Modernization Fuels Innovation

Our technology transformation has been deliberate and multi-year, with ongoing investments in continuous modernization. These investments enhance our modular architecture, reusable frameworks, and advanced big data platforms that drive speed and efficiency.

Gen AI-powered rule generation and UI-driven configuration are two key foundational investments. Our technical architecture places a rules engine above the base code. That means Gen AI generates core decision rules for benefits, offers, and rewards, enabling faster configuration and deployment. Parameters and configurations are managed through intuitive UI layers, while Gen AI supports internal teams’ real-time setup, speeding up end-to-end delivery and reducing time and complexity for our engineers.

Previously, setting up new benefits could take three months or more. Now, with UI-driven configuration and Gen AI–powered business rule creation, setup timelines have been reduced to as little as 6–8 weeks.

APIs: The Building Blocks of Exceptional Experiences

At the center of our digital ecosystem is a robust API-led architecture. Our core products and benefits are powered by APIs designed for transparency and scalability.

For customers, this means seamless experiences: real-time benefit tracking, self-service enrollment, and personalized recommendations. For engineers, APIs act as modular building blocks—ready-to-use components that can be combined to create new digital experiences without starting from scratch.

This flexible foundation also made it possible to bring lifestyle, dining, and payment experiences together in one cohesive ecosystem. Now, our mobile app serves as a hub where customers can manage their finances and explore curated lifestyle offerings, like discovering their next Resy reservation or new travel inspiration, all in one place.

And we’re already building for what’s next: many of our APIs are AI-ready, paving the way for intelligent, context-aware features in travel, dining, and beyond. Our API strategy ensures that as technology evolves, our platforms evolve along with it.

Data-Driven Insights at Scale

American Express’ closed-loop model provides a high-definition view of our customers, allowing us to deliver benefits that resonate. Our new cloud-based data platform unlocks scalable computing power, real-time analytics, and robust governance to enable innovation while meeting regulatory requirements. For the Platinum Card refresh, this level of data-driven rigor helped us curate new dining, lifestyle and business benefits. For example, dining is a top passion for American Express Card Members, who spent over $87B on dining in the U.S. in 2024, so offering stronger dining value to Card Members was a key priority in this update.

Looking Ahead: Digital at the Core

The Platinum Card refresh is another exciting milestone in our digital journey, but there is more to come. We’re making strategic investments in our mobile app, website, and tech platforms to enable future product updates and seamless digital experiences for our customers. From more intuitive personalization to new integrations across benefits and rewards, digital is at the heart of every product refresh.

Beyond Vanilla RAG: 7 Techniques for Better Retrieval-Augmented Generation

2026-01-14T00:00:00-05:00

Large Language Models (LLMs) are trained on vast datasets, yet they still struggle when queries require information outside their training data. Correct responses to challenging queries might involve proprietary information, recent events, or specialized knowledge not captured during training. One popular approach to mitigate this issue is Retrieval-Augmented Generation (RAG), which enhances LLMs by leveraging external knowledge to deliver better responses.

The standard, or “vanilla,” RAG process involves the following steps:

Document Chunking: Splitting a document or article into smaller, manageable chunks.
Vectorization: Using an embedding model to transform these chunks into vector representations and store them in a vector store along with relevant metadata.
Similarity Search: When a query is received, the system vectorizes the query using the same embedding model and performs a similarity search to retrieve the top k chunks that are most relevant to the query.
Response Generation: The query, along with the top k chunks, is passed to an LLM to generate a response based on the retrieved information.
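The four steps above can be sketched end to end in a few lines. This is a toy illustration: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and the final LLM call is left out, so `build_prompt` only shows what would be sent. All names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, size=8):
    """Document chunking: split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def build_store(document, size=8):
    """Vectorization: keep each chunk next to its vector."""
    return [(c, embed(c)) for c in chunk(document, size)]


def retrieve(query, store, k=2):
    """Similarity search: embed the query the same way and take the top k."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Response generation input: the query plus the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Every weakness of vanilla RAG discussed below shows up somewhere in this pipeline: retrieval always runs, the top k chunks are trusted as-is, and the model gets no chance to question them.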

While vanilla RAG is effective in many cases, it is not without limitations. It may fail to retrieve the most relevant chunks or generate accurate responses, especially for more complex or nuanced questions. These limitations have driven significant research efforts aimed at enhancing the basic RAG approach.

In this blog post, we will explore seven advanced RAG approaches grouped by three core strategies: reasoning-based, retrieval reliability, and knowledge structure-enhanced. Each approach improves on vanilla RAG and results in better responses from the LLM. By the end of this post, you’ll have a clearer understanding of these advanced techniques and the types of applications they are best suited for.

Reasoning-based: Self-RAG, ActiveRAG, Chain-of-Note, RAFT

Retrieval reliability: CorrectiveRAG, Adaptive-RAG

Knowledge structure-enhanced: Graph-Enhanced RAG

Reasoning-Based

Self-RAG

The Self-RAG [1] approach leverages a fine-tuned model to make more informed decisions during the question-answering process. Unlike the vanilla RAG approach, which always retrieves additional context, Self-RAG introduces a conditional retrieval mechanism. Here’s how it works:

Initial Query and Conditional Retrieval: The process starts with a query being sent to the model, which then decides whether extra context needs to be retrieved from the vector store. If retrieval is necessary, the model retrieves relevant chunks.
Chunk Evaluation and Response Generation: For each retrieved chunk, a two-fold evaluation takes place:
- The model checks if the chunk is relevant to the query.
- Regardless of the chunk’s relevance, the model generates a preliminary response.
Self-Reflection and Validation: The generated response, alongside the query and the chunk, is then passed through the model again to evaluate whether:
- The response is supported by the chunk.
- The response is useful for answering the question.
Re-Ranking Based on “Tokens”: Self-RAG ranks the retrieved chunks based on three key factors (tokens): Relevance, Usefulness, and Supportiveness (though this ranking step is not depicted in the diagram). The top k re-ranked chunks are selected.
Final Answer Generation: Finally, the top k ranked chunks are sent back to the model one last time, along with the original query, to generate the final, refined answer (not depicted in the diagram).
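The control flow described above can be sketched as follows. The `model` object and its methods (`needs_retrieval`, `generate`, `is_relevant`, `is_supported`, `is_useful`) are hypothetical stand-ins for calls to the fine-tuned model, and summing the three scores with equal weight is an assumption made for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk: str
    answer: str
    relevant: float
    supported: float
    useful: float

    @property
    def score(self) -> float:
        # Equal weighting of the three signals is an assumption.
        return self.relevant + self.supported + self.useful


def self_rag(query, model, retrieve, k=2):
    # Conditional retrieval: skip the vector store when the model says so.
    if not model.needs_retrieval(query):
        return model.generate(query, [])
    candidates = []
    for chunk in retrieve(query):
        # A preliminary response is generated regardless of relevance.
        answer = model.generate(query, [chunk])
        candidates.append(
            Candidate(
                chunk=chunk,
                answer=answer,
                relevant=model.is_relevant(query, chunk),
                supported=model.is_supported(answer, chunk),
                useful=model.is_useful(query, answer),
            )
        )
    # Re-rank by the combined score and keep the top k chunks.
    top = sorted(candidates, key=lambda c: c.score, reverse=True)[:k]
    return model.generate(query, [c.chunk for c in top])
```

Counting the model calls in this loop makes the cost trade-off visible: each retrieved chunk triggers several calls before the final answer is generated.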

Self-RAG excels at handling single-hop questions, where the answer can be found within a single retrieved chunk. Its success across various benchmarks, such as PopQA, TriviaQA, PubHealth, ARC-Challenge, Biography, and ASQA, is attributed to the multiple rounds of self-reflection and reasoning achieved through repeated LLM calls. This iterative process significantly enhances the model’s reasoning capacity and ensures higher accuracy.

However, there is a trade-off. Self-RAG requires:

Fine-tuning two large language models.
Multiple calls to one fine-tuned LLM during the inference stage.

While these factors contribute to its superior performance, they also make Self-RAG less cost-effective, especially for applications requiring real-time responses or operating under strict computational budgets.

ActiveRAG

ActiveRAG [2] is a unique approach that can be thought of as running two tasks in parallel. On one hand, a Chain-of-Thought (CoT) query is sent to an LLM to generate a step-by-step reasoning response for the question. Simultaneously, after retrieving relevant chunks based on the query, these chunks are sent to the LLM along with one of four knowledge construction prompting strategies, which enhances the LLM’s reasoning process. For example, one strategy helps the LLM better understand the query by leveraging the retrieved context.

In the final cognitive nexus step, ActiveRAG combines the output of the knowledge construction step with the original CoT response to identify potential errors in that response, ultimately producing the final, refined answer.
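The flow described above can be sketched with a stand-in LLM. Everything here is an assumption made for illustration: the `active_rag` function, the prompt wording, and the `fake_llm` stub are hypothetical and do not reproduce the prompts used in the ActiveRAG paper. The sketch only shows the shape of the pipeline: two parallel calls followed by one reconciling call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def active_rag(query: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Run the CoT and knowledge construction branches in parallel, then reconcile."""
    cot_prompt = f"Answer step by step: {query}"
    knowledge_prompt = (
        "Use the passages to build background knowledge for the question.\n"
        f"Question: {query}\nPassages:\n" + "\n".join(chunks)
    )
    with ThreadPoolExecutor(max_workers=2) as pool:
        cot_future = pool.submit(llm, cot_prompt)
        knowledge_future = pool.submit(llm, knowledge_prompt)
        cot_answer = cot_future.result()
        knowledge = knowledge_future.result()

    # Cognitive nexus: check the draft reasoning against the constructed knowledge.
    nexus_prompt = (
        "Check the draft reasoning against the constructed knowledge and correct it.\n"
        f"Question: {query}\nDraft reasoning: {cot_answer}\nKnowledge: {knowledge}"
    )
    return llm(nexus_prompt)


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM client: records the prompt and echoes its first line."""
    calls.append(prompt)
    return f"[reply to: {prompt.splitlines()[0]}]"


final_answer = active_rag("example question", ["passage 1", "passage 2"], fake_llm)
print(len(calls), final_answer)
```

Even in this toy form, the sketch makes the cost profile visible: every query triggers three LLM calls, two of which can overlap in time.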

ActiveRAG has outperformed baseline approaches on several benchmarks, including Natural Questions (NQ), TriviaQA, PopQA, and WebQ, demonstrating its strength on single-hop questions. This improvement is largely due to the explicit expansion of the LLM’s reasoning capabilities in the knowledge construction step, coupled with the cognitive nexus step, which self-checks the CoT response against the retrieved information.

Unlike some other approaches, ActiveRAG does not require fine-tuning of any large or small LMs. However, it does involve multiple calls to LLMs, which can lead to higher latency and increased computational costs.

Chain-of-Note

The Chain-of-Note [3] approach leverages a fine-tuned model. In this approach, retrieved chunks of information along with the query are passed to the model. The model’s response not only provides the final answer but also includes explanatory notes on how the answer was derived from the retrieved chunks, reducing the risk of hallucination. There are three types of notes that can be generated:

Relevant (contains the answer): The chunk directly provides the correct answer.
Irrelevant (model knows the answer): The model already knows the answer independently of the retrieved chunk.
Irrelevant (model doesn’t know the answer): The chunk does not help, and the model acknowledges uncertainty.

Chain-of-Note has outperformed baseline approaches on key benchmarks such as Natural Questions (NQ), TriviaQA, and WebQ, particularly excelling at single-hop questions. This improvement is due to the model’s additional self-refinement steps before producing the final answer, which strengthen its reasoning abilities.

During inference, only a single call to the fine-tuned model is required, which keeps operational efficiency high. However, collecting data for fine-tuning can be resource-intensive. For instance, Chain-of-Note leveraged ChatGPT to generate answers with notes for 10,000 questions sampled from the NQ dataset. While the approach is effective, using a more robust, commercial LLM for production deployments may offer better performance. Chain-of-Note can also become expensive when developers need to fine-tune a model on their own datasets for their use cases.

RAFT

RAFT [4] employs a fine-tuned model. During the training phase, in addition to relevant chunks, RAFT intentionally includes irrelevant chunks in the training datasets. The model generates responses in a Chain-of-Thought (CoT) style, incorporating reasoning and citing relevant documents. This training strategy equips the model to identify and disregard irrelevant chunks during inference, enabling it to provide accurate answers even when such chunks are mistakenly retrieved.
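A training example of this kind might be assembled as in the sketch below. The function name `build_raft_example`, the field names, the number of distractors, and the placeholder texts are all hypothetical choices for illustration. The only idea taken from the description above is that the relevant chunk is deliberately mixed with irrelevant ones and paired with a CoT-style answer.

```python
import random


def build_raft_example(question, relevant_chunk, distractor_pool, cot_answer,
                       num_distractors=3, seed=0):
    """Mix the relevant chunk with sampled irrelevant chunks in random order."""
    rng = random.Random(seed)
    distractors = rng.sample(distractor_pool, num_distractors)
    context = distractors + [relevant_chunk]
    rng.shuffle(context)  # the model should not learn a fixed position
    return {"question": question, "context": context, "answer": cot_answer}


pool = [f"irrelevant chunk {i}" for i in range(10)]
example = build_raft_example(
    question="example question",
    relevant_chunk="relevant chunk",
    distractor_pool=pool,
    cot_answer="Reasoning that cites the relevant chunk, followed by the answer.",
)
print(len(example["context"]), "relevant chunk" in example["context"])
```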

RAFT has outperformed baseline approaches on benchmarks such as PubMed, HotPotQA, HuggingFace, Torch Hub, and TensorFlow Hub, showcasing its effectiveness in tackling multi-hop questions, where generating a correct answer often requires synthesizing information from multiple chunks located in different contexts. RAFT’s strong performance on multi-hop questions stems from its enhanced reasoning capacity, which allows it to weigh relevant and irrelevant chunks together.

At first glance, RAFT may resemble the Chain-of-Note approach; however, its distinctive training methodology sets it apart. By deliberately including irrelevant chunks during training, RAFT bolsters the robustness of its fine-tuned model, ensuring better performance in inference scenarios where irrelevant information may arise. Moreover, it is specifically trained to excel at multi-hop questions rather than just single-hop inquiries.

Retrieval Reliability

CorrectiveRAG

The CorrectiveRAG [5] approach introduces a Retrieval Evaluator, fine-tuned from T5-large, to assess the relevance of retrieved chunks to the query. If the retrieved chunks are judged correct, the query and chunks are sent to a large language model (LLM) to generate the final response. However, if the chunks are deemed incorrect, CorrectiveRAG falls back to a web search, and the query along with the search results is then passed to the LLM for response generation.

In cases where ambiguity arises, CorrectiveRAG combines both the retrieved chunks from the vector store and the search results from the web search, along with the query, and sends them to the LLM for generating a response.
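The three-way routing can be sketched as a small function. The thresholds, the rule that a single high-scoring chunk is enough to count as correct, and the names `choose_sources`, `vector_store`, and `web_search` are assumptions for this example. In a real system, the scores would come from the fine-tuned Retrieval Evaluator.

```python
def choose_sources(relevance_scores, upper=0.6, lower=0.2):
    """Map evaluator scores for the retrieved chunks to a verdict and the sources to use."""
    if relevance_scores and max(relevance_scores) >= upper:
        # At least one chunk looks reliable: answer from the retrieved chunks.
        return "correct", ["vector_store"]
    if not relevance_scores or max(relevance_scores) < lower:
        # Nothing usable was retrieved: fall back to web search only.
        return "incorrect", ["web_search"]
    # In between: combine retrieved chunks with web search results.
    return "ambiguous", ["vector_store", "web_search"]


for scores in ([0.9, 0.1], [0.1, 0.05], [0.4, 0.1]):
    print(scores, choose_sources(scores))
```

In environments where external web search is prohibited, the `web_search` branch would need a different fallback, which is the limitation discussed below.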

CorrectiveRAG has outperformed baseline approaches on benchmarks such as PopQA, Biography, PubHealth, and ARC-Challenge, demonstrating its effectiveness at handling single-hop questions. This performance boost is due to the added evaluation of the correctness of retrieved chunks and the integration of web search results when necessary.

Compared to previous approaches, CorrectiveRAG only requires fine-tuning a small language model (LM) as a classifier. During inference, it requires just one call to a small LM and one call to an LLM, with occasional web searches. This makes CorrectiveRAG more cost-effective than methods that rely on fine-tuning one or more LLMs.

However, the reliance on web searching, when retrieved chunks are ambiguous or incorrect, is a double-edged sword. While web searches can improve the accuracy of answers, certain use cases prohibit external web searching, limiting the applicability of CorrectiveRAG in such environments.

Adaptive-RAG

Adaptive-RAG [6] employs a classifier, fine-tuned from T5-large, to assess the complexity of incoming queries. If a query is classified as straightforward (answerable from the LLM’s own knowledge), the query is sent directly to an LLM for a response. For simple queries, the system retrieves chunks only once; the query and the retrieved chunks are then sent to the LLM for a response. In the case of complex queries, Adaptive-RAG retrieves chunks multiple times before passing the query and chunks to the LLM for final response generation.
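The routing logic can be sketched as below. The label names, the `adaptive_rag` function, the hop limit, the way each hop's query is expanded, and the stub classifier, retriever, and LLM are all hypothetical. The sketch only illustrates the three paths: no retrieval, one retrieval, and repeated retrieval.

```python
def adaptive_rag(query, classify, retrieve, llm, max_hops=3):
    """Pick a retrieval strategy from the classifier's label, then call the LLM."""
    label = classify(query)
    if label == "no_retrieval":
        return llm(query, [])
    if label == "single_step":
        return llm(query, retrieve(query))

    # Complex query: retrieve repeatedly, expanding the search with what was found.
    context = []
    search_query = query
    for _ in range(max_hops):
        new_chunks = [c for c in retrieve(search_query) if c not in context]
        if not new_chunks:
            break
        context.extend(new_chunks)
        search_query = query + " " + " ".join(new_chunks)
    return llm(query, context)


# Stubs standing in for the fine-tuned classifier, the retriever, and the LLM.
labels = {"easy": "no_retrieval", "single": "single_step", "multi": "multi_step"}
retrieval_calls = []


def fake_retrieve(search_query):
    retrieval_calls.append(search_query)
    return [f"chunk-{len(retrieval_calls)}"]


def fake_llm(query, chunks):
    return f"answered with {len(chunks)} chunk(s)"


results = {q: adaptive_rag(q, labels.get, fake_retrieve, fake_llm) for q in labels}
print(results, len(retrieval_calls))
```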

Adaptive-RAG has outperformed baseline approaches on benchmarks such as SQuAD, Natural Questions (NQ), TriviaQA, MuSiQue, HotPotQA, and 2WikiMultiHopQA, achieving superior results in either accuracy or cost-effectiveness. This indicates that Adaptive-RAG is efficient at handling both single-hop and multi-hop questions, particularly when incoming queries vary in complexity.

While both Adaptive-RAG and CorrectiveRAG use a small LM as a classifier, they apply it differently. Adaptive-RAG classifies the query before any chunk retrieval, whereas CorrectiveRAG evaluates the chunks after retrieval. For straightforward questions, Adaptive-RAG sends them directly to the LLM without retrieving additional information. For complex questions, multiple retrievals are performed, but Adaptive-RAG does not fall back to web searching. We are curious about the potential outcomes of combining Adaptive-RAG with CorrectiveRAG to leverage their respective strengths.

Knowledge Structure-Enhanced

Graph-Enhanced RAG

Knowledge Graphs can significantly enhance the capabilities of vanilla RAG. To differentiate this general approach from Microsoft’s GraphRAG, which is not the focus of this blog post, we refer to this approach as Graph-Enhanced RAG [7].

The vanilla RAG approach begins with a vector database that stores the vectors of chunks (e.g., c1, c2, and c3) extracted from articles or documents. The texts of these chunks are stored as metadata linked to their corresponding vectors (this metadata is not depicted in the diagram). When a query is received, the system retrieves the top k chunks from the vector database based on their similarity to the query.
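A toy version of this retrieval step is sketched below. The two-dimensional vectors, the in-memory dictionary standing in for a vector database, and the names `cosine_similarity` and `retrieve_top_k` are assumptions for illustration. Cosine similarity is used here as one common similarity measure, and a real system would use embeddings from a model and a dedicated vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vector store: each chunk vector carries its text as metadata.
vector_store = {
    "c1": {"vector": [1.0, 0.0], "text": "text of chunk c1"},
    "c2": {"vector": [0.8, 0.6], "text": "text of chunk c2"},
    "c3": {"vector": [0.0, 1.0], "text": "text of chunk c3"},
}


def retrieve_top_k(query_vector, store, k):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda chunk_id: cosine_similarity(query_vector, store[chunk_id]["vector"]),
        reverse=True,
    )
    return ranked[:k]


top_ids = retrieve_top_k([1.0, 0.1], vector_store, k=2)
print(top_ids, [vector_store[i]["text"] for i in top_ids])
```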

By incorporating a knowledge graph, we can conceptualize the chunks as nodes (e.g., n1, n2, and n3), with the chunk texts and vectors represented as properties of these nodes. Relationships can then be established among the nodes, such as a NEXT relationship that describes the sequence of the chunks.

Subsequently, LLMs can be employed to extract entities from the nodes, including people (e.g., p1 and p2), organizations, companies (e.g., c1), locations, and more. These entities become new nodes within the graph. Relationships between the extracted entities and the original chunks can also be added (e.g., MENTIONS).

Additionally, relationships among the newly created nodes can be identified using LLMs. For example, person p1 works for company c1, and person p2 also works for company c1. We can infer that p1 and p2 are colleagues based on this information.
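The graph described above can be sketched with plain Python triples. The NEXT and MENTIONS relationships come from the description, while the `WORKS_FOR` and `COLLEAGUE_OF` names, the specific MENTIONS edges, and the `infer_colleagues` function are hypothetical choices for this example. A production system would typically store the graph in a graph database rather than in a set.

```python
from itertools import combinations

# (source, relationship, target) triples for chunks n1..n3 and extracted entities.
edges = {
    ("n1", "NEXT", "n2"),
    ("n2", "NEXT", "n3"),
    ("n1", "MENTIONS", "p1"),
    ("n2", "MENTIONS", "p2"),
    ("n3", "MENTIONS", "c1"),
    ("p1", "WORKS_FOR", "c1"),
    ("p2", "WORKS_FOR", "c1"),
}


def infer_colleagues(edges):
    """Link every pair of people who work for the same company."""
    staff_by_company = {}
    for source, relationship, target in edges:
        if relationship == "WORKS_FOR":
            staff_by_company.setdefault(target, set()).add(source)

    inferred = set()
    for staff in staff_by_company.values():
        for a, b in combinations(sorted(staff), 2):
            inferred.add((a, "COLLEAGUE_OF", b))
    return inferred


colleague_edges = infer_colleagues(edges)
print(colleague_edges)
```

A multi-hop question about p1 and p2 can then follow these edges back to the chunks that mention each person, instead of relying on vector similarity alone.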

One of the advantages of Graph-Enhanced RAG is that it does not require fine-tuning any models. When a new context is introduced, developers can easily extend the existing knowledge graph. This approach is particularly effective for multi-hop questions and overarching tasks, where responses may need to synthesize content from across an entire article or document, such as generating a summary.

However, Graph-Enhanced RAG does necessitate calls to LLMs for entity and relationship extraction from the texts of the nodes. Additionally, depending on the size and content of the documents, storing the knowledge graph may require significant space, which can increase costs.

Summary

#	Name	Fine-tuning required?	Fine-tuned model size x count	Applications
1	Self-RAG	Yes	7B x 2	single-hop QA
2	ActiveRAG	No	n/a	single-hop QA
3	Chain-of-Note	Yes	7B x 1	single-hop QA
4	RAFT	Yes	7B x 1	single-hop QA, multi-hop QA
5	CorrectiveRAG	Yes	0.77B x 1	single-hop QA+
6	Adaptive-RAG	Yes	0.77B x 1	single-hop QA, multi-hop QA
7	Graph-Enhanced RAG	No	n/a	single-hop QA, multi-hop QA, overarching task

The table above provides a high-level summary of the seven advanced RAG approaches, highlighting whether each method requires fine-tuning a model, the model size, and the applications for which each approach is best suited.

Overall, no single RAG approach fits all use cases. Developers should choose a RAG approach based on their question types (e.g., single-hop or multi-hop) and other requirements (e.g., whether latency is a critical factor and whether fine-tuning a model is feasible). For instance, if the question set includes a mix of single-hop and multi-hop questions, and latency is a critical factor, developers might consider starting with Adaptive-RAG. Conversely, if the questions are complex and necessitate information drawn from various parts of the context, and fine-tuning a model is not feasible, Graph-Enhanced RAG may be the better option. Advanced RAG approaches will continue to evolve, and new ones will emerge. By leveraging these advanced RAG techniques, we can improve the quality of LLM answers for our use cases.

This article summarizes publicly available research on retrieval-augmented generation (RAG) techniques. It is provided for informational purposes only.

Academic Papers

1. Self-RAG
Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv preprint arXiv:2310.11511. https://arxiv.org/pdf/2310.11511

2. ActiveRAG
Xu, Z., Liu, Z., Liu, Y., Xiong, C., Yan, Y., Wang, S., Yu, S., Liu, Z., & Yu, G. (2024). ActiveRAG: Autonomous Knowledge Assimilation and Accommodation through Active Retrieval. arXiv preprint arXiv:2402.13547. https://ar5iv.labs.arxiv.org/html/2402.13547

3. Chain-of-Note
Yu, W., Zhang, H., Pan, X., Cao, P., Ma, K., Li, J., Wang, H., & Yu, D. (2023). Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models. arXiv preprint arXiv:2311.09210. https://arxiv.org/pdf/2311.09210

4. RAFT
Zhang, T., Patil, S. G., Jain, N., Shen, S., Zaharia, M., Stoica, I., & Gonzalez, J. E. (2024). RAFT: Adapting Language Model to Domain-Specific Retrieval-Augmented Generation Tasks. arXiv preprint arXiv:2403.10131. https://arxiv.org/pdf/2403.10131

5. CorrectiveRAG
Yan, S.-Q., Gu, J.-C., Zhu, Y., & Ling, Z.-H. (2024). Corrective Retrieval-Augmented Generation. arXiv preprint arXiv:2401.15884. https://arxiv.org/pdf/2401.15884

6. Adaptive-RAG
Jeong, S., Baek, J., Cho, S., Hwang, S. J., & Park, J. C. (2024). Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. arXiv preprint arXiv:2403.14403. https://arxiv.org/pdf/2403.14403

7. Neo4j Product Examples – SEC EDGAR Data Prep Repository
Neo4j Product Examples. (n.d.). Data Preparation for SEC EDGAR Knowledge Graph Examples. GitHub repository. Retrieved November 2025, from https://github.com/neo4j-product-examples/data-prep-sec-edgar

8. DeepLearning.AI Short Course
DeepLearning.AI. (n.d.). Knowledge Graphs & Retrieval-Augmented Generation [Online short course]. Retrieved November 2025, from https://www.deeplearning.ai/short-courses/knowledge-graphs-rag/