@nikagra nikagra commented Jan 30, 2026

Implement LWT Replica-Only Routing with Local DC Prioritization

Overview

This PR implements specialized load balancing for Lightweight Transaction (LWT) queries to optimize performance by routing exclusively to replicas and prioritizing local datacenter replicas. This reduces coordinator forwarding overhead and minimizes contention during Paxos consensus phases.

Motivation

LWT queries require multi-round Paxos coordination and are sensitive to latency and coordinator variance. The previous implementation used the same routing logic as regular queries, which could:

  • Route to non-replica nodes, adding an extra forwarding hop
  • Fail to prioritize local DC replicas consistently
  • Introduce unnecessary cross-DC traffic for LOCAL_SERIAL transactions

This change ensures LWT queries always target replicas directly, with local replicas first, resulting in lower latency and reduced contention.

Changes

1. Modified RequestHandler LWT Path

  • Removed: Special-case LWT bypass that returned raw replica list from metadata
  • Now: LWT queries flow through the configured LoadBalancingPolicy chain like regular queries
  • Benefit: Enables policy composition (latency-aware, host filters) while maintaining LWT-specific replica-only behavior

2. Enhanced TokenAwarePolicy with LWT-Aware Routing

  • New LWTHostIterator: Specialized iterator for LWT queries that:
    • Returns only replicas (never non-replica hosts)
    • Prioritizes local DC replicas (based on childPolicy.distance(host) == LOCAL)
    • Filters out DOWN hosts and IGNORED hosts (respects allowlists/filters)
    • Preserves primary replica ordering from metadata (no randomization to avoid contention)
  • Regular routing unchanged: Non-LWT queries continue to use existing RANDOM, TOPOLOGICAL, or NEUTRAL strategies
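
The ordering and filtering rules listed above can be illustrated with a minimal, self-contained sketch. It is not the PR's LWTHostIterator: SketchHost, Distance, and LwtPlanSketch are hypothetical stand-ins for the driver's Host, HostDistance, and iterator, and the sketch builds the plan eagerly rather than lazily.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the driver's HostDistance.
enum Distance { LOCAL, REMOTE, IGNORED }

// Hypothetical stand-in for a Host plus the distance the child policy reports for it.
final class SketchHost {
  final String name;
  final boolean up;
  final Distance distance;

  SketchHost(String name, boolean up, Distance distance) {
    this.name = name;
    this.up = up;
    this.distance = distance;
  }
}

final class LwtPlanSketch {
  // Keeps only usable replicas: LOCAL first, then REMOTE, each group in metadata order.
  static Iterator<SketchHost> lwtPlan(List<SketchHost> replicasInMetadataOrder) {
    List<SketchHost> local = new ArrayList<>();
    List<SketchHost> remote = new ArrayList<>();
    for (SketchHost host : replicasInMetadataOrder) {
      if (!host.up || host.distance == Distance.IGNORED) {
        continue; // skip DOWN and IGNORED replicas
      }
      if (host.distance == Distance.LOCAL) {
        local.add(host);
      } else {
        remote.add(host);
      }
    }
    local.addAll(remote); // no shuffling: metadata order is preserved within each group
    return local.iterator();
  }

  // Helper for inspecting a plan in examples and tests.
  static List<String> names(Iterator<SketchHost> plan) {
    List<String> out = new ArrayList<>();
    while (plan.hasNext()) {
      out.add(plan.next().name);
    }
    return out;
  }
}
```

For replicas given in metadata order as remote-1, local-1, local-down (DOWN), ignored (IGNORED), and local-2, this sketch yields local-1, local-2, remote-1.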

3. Key Design Decisions

Rack Awareness Intentionally Omitted

  • Decision: Do not apply rack-based prioritization within the local DC for LWT queries
  • Rationale: Rack-based ordering can create hotspots and contention during Paxos phases. Treating all local DC replicas equally distributes LWT load more evenly and reduces the chance of lockstep contention on hot partitions.

Replica-Only with Pragmatic Fallback

  • Default behavior: When routing key and keyspace are available, the query plan contains only replicas
  • Fallback: If routing information is missing or metadata is unavailable, falls back to child policy's plan (may include non-replicas)
  • Rationale: Hard failure when routing info is unavailable would be overly strict; the fallback preserves availability with a small performance penalty (documented in Javadoc)
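
The fallback decision described above could look roughly like the following sketch. Hosts are modelled as plain strings, LwtFallbackSketch and newLwtQueryPlan are hypothetical names, and an empty replica list is used here as an assumed stand-in for "metadata unavailable".

```java
import java.nio.ByteBuffer;
import java.util.Iterator;
import java.util.List;

final class LwtFallbackSketch {
  // Hosts are plain strings; an empty replica list models unavailable metadata (assumption).
  static Iterator<String> newLwtQueryPlan(
      String keyspace, ByteBuffer routingKey, List<String> replicas, List<String> childPlan) {
    if (keyspace == null || routingKey == null || replicas.isEmpty()) {
      return childPlan.iterator(); // fallback: the child plan may include non-replicas
    }
    return replicas.iterator(); // normal case: replica-only plan
  }
}
```

The point of the design is that missing routing information degrades to ordinary child-policy routing instead of failing the request.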

Local DC First, Remote DC Second

  • Decision: Local replicas (distance == LOCAL) are returned first; remote replicas (distance == REMOTE) follow
  • Rationale: For LOCAL_SERIAL, this keeps Paxos messaging within the local DC for best performance. Remote replicas are still included as fallback for availability.

Filter DOWN and IGNORED Hosts

  • Decision: LWTHostIterator skips hosts that are !isUp() or have distance == IGNORED
  • Rationale: Attempting DOWN hosts wastes time; respecting IGNORED honors user-configured allowlists and filters

4. Documentation

  • Class-level Javadoc: Added comprehensive section explaining LWT routing behavior, requirements, fallback behavior, and rack avoidance rationale
  • README.md: Added LWT optimization to the features list with brief explanation
  • Inline comments: LWTHostIterator now documents filtering and ordering logic

Requirements for LWT Replica-Only Routing

For the optimization to apply, the following must be true:

  1. Statement routing key available: Use prepared statements or manually set routing key
  2. Effective keyspace known: Set on statement or session
  3. Child policy reports DC locality: Use DCAwareRoundRobinPolicy or similar
  4. Cluster metadata available: Driver must have resolved replica topology

Configuration

LWT routing method is controlled via:

QueryOptions.setLoadBalancingLwtRequestRoutingMethod(RequestRoutingMethod)
  • Default: PRESERVE_REPLICA_ORDER (replicas-only, local-first)
  • Alternative: REGULAR (uses standard token-aware routing with non-replica fallback)

Performance Impact

Expected Improvements (with routing info available)

  • Reduced latency: Eliminates coordinator forwarding hops
  • Lower p99 variance: Consistent replica targeting reduces coordinator variance
  • Reduced cross-DC traffic: Local DC replicas prioritized for LOCAL_SERIAL
  • Lower contention: Rack avoidance distributes load across local DC

Fallback Scenario (routing info missing)

  • Adds ~0.5-2ms median latency due to coordinator forwarding
  • Mitigated by using prepared statements (routing key always available)

Compatibility

  • No breaking changes: Public API unchanged
  • Backward compatible: Existing query behavior preserved for non-LWT
  • Java 8 compatible: No new APIs used
  • Clirr check: 0 errors, 0 warnings

Testing

Unit Tests Added

  • should_filter_down_replicas_for_lwt: Verifies DOWN replicas are excluded from LWT query plans
  • should_filter_ignored_replicas_for_lwt: Verifies IGNORED replicas (e.g., from allowlists) are excluded
  • should_filter_down_and_ignored_replicas_for_lwt: Verifies combined filtering of DOWN and IGNORED hosts
  • should_return_empty_plan_when_all_lwt_replicas_filtered: Verifies empty plan when all replicas filtered

Validation Performed

  • ✅ Code formatting: mvn fmt:format
  • ✅ Compilation: mvn compile -pl driver-core -am
  • ✅ Unit tests: mvn test -Dtest=TokenAwarePolicyTest -pl driver-core
  • ✅ Full verification: mvn verify -DskipTests -pl driver-core
  • ✅ API compatibility: Clirr checks pass

Recommended Integration Tests (follow-up)

  • Multi-DC cluster with LWT + LOCAL_SERIAL: verify local DC replicas used first in live cluster
  • Missing routing key: verify fallback to child policy behavior end-to-end

Migration Notes

No action required for existing users. The optimization is automatic when:

  • Using TokenAwarePolicy (default in most configurations)
  • Using prepared statements (best practice)
  • LWT queries detected via server-provided metadata (Scylla) or manual marking

Users not meeting these conditions will continue to work with existing behavior.

Related Documentation

Closes

#752

feat: Enhance LWT routing in TokenAwarePolicy with replica-only optimization and detailed documentation

refactor: Improve formatting in newQueryPlan methods for better readability
…queries

test: Implement LWT replica-only routing with local DC prioritization
@nikagra nikagra requested review from Copilot and dkropachev January 30, 2026 15:19
@nikagra nikagra changed the title from "Fix/lwt routing lbp filtering 752" to "3.x: Implement LWT Replica-Only Routing with Local DC Prioritization" Jan 30, 2026

  if (partitionKey == null || keyspace == null)
-   return childPolicy.newQueryPlan(keyspace, statement);
+   return childPolicy.newQueryPlan(loggedKeyspace, statement);
Author

@dkropachev Please confirm this is a necessary fix

it is not, it is a bug

Copilot AI left a comment

Pull request overview

Implements replica-only routing for LWT queries within TokenAwarePolicy, prioritizing local-DC replicas and allowing fallback to regular routing based on a new QueryOptions setting.

Changes:

  • Removed RequestHandler’s LWT special-casing so LWTs flow through the configured load balancing policy chain.
  • Added LWT-aware query planning to TokenAwarePolicy (replicas-only, local-first, filters DOWN/IGNORED), configurable via QueryOptions.RequestRoutingMethod.
  • Added unit tests covering LWT replica filtering and updated README with the new optimization.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File: Description

  • driver-core/src/main/java/com/datastax/driver/core/policies/TokenAwarePolicy.java: Adds LWT-specific replica-only query plan behavior and related Javadoc.
  • driver-core/src/main/java/com/datastax/driver/core/RequestHandler.java: Removes LWT replica-routing bypass so LWT uses the LBP chain.
  • driver-core/src/main/java/com/datastax/driver/core/QueryOptions.java: Adds configurable LWT routing method (RequestRoutingMethod) and default.
  • driver-core/src/test/java/com/datastax/driver/core/policies/TokenAwarePolicyTest.java: Adds unit tests for LWT local-first ordering and filtering behavior.
  • README.md: Documents the LWT optimization feature at a high level.

* replicas} for that partition key is not empty). If no local replica can be either found or
* successfully contacted, the rest of the query plan will fallback to the child policy's one.
* {@code null} and if the {@linkplain Metadata#getReplicasList(String, String, Token.Factory,
* ByteBuffer)} (String, ByteBuffer) set of replicas} for that partition key is not empty). If
Copilot AI Jan 30, 2026

The class Javadoc link/phrase around Metadata#getReplicasList is malformed ({@linkplain Metadata#getReplicasList(...)} (String, ByteBuffer) set of replicas}), which will render incorrectly in Javadoc and is confusing. Please rewrite this sentence/link to correctly reference the getReplicasList overload used by the policy, without the stray (String, ByteBuffer) fragment and with balanced braces.

Suggested change
- * ByteBuffer)} (String, ByteBuffer) set of replicas} for that partition key is not empty). If
+ * ByteBuffer) set of replicas} for that partition key is not empty). If

+1

Comment on lines +84 to +86
* <li>No non-replica hosts: unlike regular queries, LWT plans do not fall back to non-replica
* hosts, ensuring all attempts target actual replicas to maintain LWT correctness and reduce
* coordinator forwarding overhead.
Copilot AI Jan 30, 2026

The LWT Javadoc says "LWT plans do not fall back to non-replica hosts" and claims this is needed "to maintain LWT correctness", but later in the same Javadoc it documents fallback to the child policy (which may include non-replicas). Also, LWT correctness does not require the coordinator to be a replica; this is a performance optimization. Please adjust this bullet to be consistent with the fallback section and avoid the incorrect correctness claim.

Suggested change
- * <li>No non-replica hosts: unlike regular queries, LWT plans do not fall back to non-replica
- * hosts, ensuring all attempts target actual replicas to maintain LWT correctness and reduce
- * coordinator forwarding overhead.
+ * <li>Replica-only routing when possible: under normal conditions, LWT query plans target only
+ * replicas for the partition in order to reduce coordinator forwarding overhead and improve
+ * performance. When replica information is unavailable, the driver falls back to the child
+ * policy as described in the fallback behavior below, which may include non-replica hosts.

it is ok to fallback to non-replica hosts, but only when replicas are exhausted
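
One way to read this request is a plan that yields every replica first and only then continues with the child policy's hosts, skipping any host already returned. The sketch below illustrates that idea under stated assumptions: hosts are plain strings, and ReplicasThenChildPlan is a hypothetical name, not a class from this PR.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical iterator: replicas first, then the child plan's remaining hosts.
final class ReplicasThenChildPlan implements Iterator<String> {
  private final Iterator<String> replicas;
  private final Iterator<String> childPlan;
  private final Set<String> seen = new HashSet<>();
  private String next;

  ReplicasThenChildPlan(Iterator<String> replicas, Iterator<String> childPlan) {
    this.replicas = replicas;
    this.childPlan = childPlan;
    advance();
  }

  // Finds the next host that has not been returned yet, preferring replicas.
  private void advance() {
    next = null;
    while (replicas.hasNext()) {
      String host = replicas.next();
      if (seen.add(host)) {
        next = host;
        return;
      }
    }
    while (childPlan.hasNext()) {
      String host = childPlan.next();
      if (seen.add(host)) { // skip replicas that were already tried
        next = host;
        return;
      }
    }
  }

  @Override
  public boolean hasNext() {
    return next != null;
  }

  @Override
  public String next() {
    if (next == null) {
      throw new NoSuchElementException();
    }
    String result = next;
    advance();
    return result;
  }
}
```

With replicas a, b and a child plan of b, c, a, d, the sketch returns a, b, c, d: non-replicas appear only after the replicas are exhausted.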

Comment on lines +625 to 628
&& this.prepareOnAllHosts == other.prepareOnAllHosts
&& this.loadBalancingLwtRequestRoutingMethod
== other.loadBalancingLwtRequestRoutingMethod)
&& this.schemaQueriesPaged == other.schemaQueriesPaged;
Copilot AI Jan 30, 2026

equals() now includes loadBalancingLwtRequestRoutingMethod, but hashCode() does not. This violates the equals/hashCode contract and can break behavior in hash-based collections. Please include loadBalancingLwtRequestRoutingMethod in hashCode() (and any other derived methods if present).
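
A minimal sketch of the fix being asked for, using a hypothetical OptionsSketch class with only three of the fields (the real QueryOptions has more, and its hashCode() may be written differently): every field compared in equals() also feeds hashCode().

```java
import java.util.Objects;

// Hypothetical stand-in for QueryOptions.RequestRoutingMethod.
enum RoutingMethodSketch { PRESERVE_REPLICA_ORDER, REGULAR }

// Hypothetical, trimmed-down stand-in for QueryOptions.
final class OptionsSketch {
  final boolean prepareOnAllHosts;
  final boolean schemaQueriesPaged;
  final RoutingMethodSketch lwtRoutingMethod;

  OptionsSketch(
      boolean prepareOnAllHosts, boolean schemaQueriesPaged, RoutingMethodSketch lwtRoutingMethod) {
    this.prepareOnAllHosts = prepareOnAllHosts;
    this.schemaQueriesPaged = schemaQueriesPaged;
    this.lwtRoutingMethod = lwtRoutingMethod;
  }

  @Override
  public boolean equals(Object that) {
    if (this == that) {
      return true;
    }
    if (!(that instanceof OptionsSketch)) {
      return false;
    }
    OptionsSketch other = (OptionsSketch) that;
    return prepareOnAllHosts == other.prepareOnAllHosts
        && schemaQueriesPaged == other.schemaQueriesPaged
        && lwtRoutingMethod == other.lwtRoutingMethod;
  }

  @Override
  public int hashCode() {
    // Uses exactly the fields compared in equals(), including the routing method.
    return Objects.hash(prepareOnAllHosts, schemaQueriesPaged, lwtRoutingMethod);
  }
}
```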

assertThat(rs.getExecutionInfo().getQueriedHost()).isEqualTo(host1);
assertThat(rs.isExhausted()).isFalse();
Row r = rs.one();
assertThat(rs.getExecutionInfo().getQueriedHost()).isEqualTo(host1);
Copilot AI Jan 30, 2026

This assertion duplicates the one immediately above for the same result set execution info and doesn’t add new coverage, making the test harder to read/maintain. Consider removing it or replacing it with an assertion that validates additional behavior.

Suggested change
- assertThat(rs.getExecutionInfo().getQueriedHost()).isEqualTo(host1);


if (nonReplicasIterator.hasNext()) return nonReplicasIterator.next();
}
final Iterator<Host> childIterator = childPolicy.newQueryPlan(loggedKeyspace, statement);
Author

@dkropachev Same here, changed from keyspace to loggedKeyspace

a bug


public LWTHostIterator(Iterator<Host> replicasIterator) {
this.replicasIterator = replicasIterator;
nonLocalReplicas = new LinkedList<>();

LinkedList is quite slow, please use ArrayList instead

Comment on lines +342 to +344
when(childPolicy.distance(host2)).thenReturn(HostDistance.REMOTE);
when(host2.isUp()).thenReturn(true);
when(childPolicy.distance(host2)).thenReturn(HostDistance.IGNORED);

Is something missing here?
It looks like when(childPolicy.distance(host2)).thenReturn(HostDistance.REMOTE); has no effect, because the later stubbing of the same call to return HostDistance.IGNORED replaces it.
