28 commits
d8a4af4
rewards: ensure global stake cache is updated during partitioned epoc…
smcio Jan 13, 2026
85c53d4
update vote cache when validator identity is changed
smcio Jan 13, 2026
6b23725
Revert "update vote cache when validator identity is changed"
smcio Jan 13, 2026
dfe8580
fix possible corner-case with calculation of leader schedule
smcio Jan 13, 2026
d45ea13
perf(simd186): memoize accounts to avoid double-clone in loadAndValid…
7layermagik Jan 14, 2026
d02ca7d
perf: optimize reward distribution memory and pool usage
7layermagik Jan 13, 2026
4c64df0
perf: reward distribution optimizations + thread safety
7layermagik Jan 13, 2026
fc63bb8
perf: reuse worker pool across reward partitions
7layermagik Jan 13, 2026
4c2dea6
perf: always exclude stake accounts from CommonAcctsCache
7layermagik Jan 13, 2026
9f3c4d7
Add separate 2k-entry StakeAcctCache for stake accounts
7layermagik Jan 13, 2026
3e04c1a
Make AccountsDB LRU cache sizes configurable via config.toml
7layermagik Jan 13, 2026
fa1a5ca
Improve snapshot log message consistency
7layermagik Jan 13, 2026
e227175
fix: prevent stale cache entries when account owner changes
7layermagik Jan 13, 2026
c7691ee
perf: skip caching new stake accounts during reward distribution
7layermagik Jan 13, 2026
4f55841
perf: add cache hit/miss profiling to 100-slot summary
7layermagik Jan 13, 2026
cdb5125
perf: split common cache into small/large account caches
7layermagik Jan 13, 2026
fad0841
Add granular cache miss size buckets for profiling
7layermagik Jan 13, 2026
ef0efce
Add cache fill stats to 100-slot summary
7layermagik Jan 13, 2026
81438be
perf: restructure account cache into small/medium/huge tiers
7layermagik Jan 14, 2026
f3e99cc
perf: add admit-on-second-hit filter for common account caches
7layermagik Jan 14, 2026
a42232d
fix: replace map-based seen-once filter with LRU, fix oscillation bug
7layermagik Jan 14, 2026
4f42c5f
fix: remove unused slot parameter from cacheAccount
7layermagik Jan 14, 2026
455c562
feat: add seen-once filter stats to 100-slot summary
7layermagik Jan 14, 2026
a99dbe3
feat: add memory stats to 100-slot summary
7layermagik Jan 14, 2026
5439bec
Add delta memory stats to 100-slot summary
7layermagik Jan 14, 2026
eb0ec98
perf: add program cache hit/miss stats with size breakdown
7layermagik Jan 14, 2026
ed8c17a
fix: add net/http/pprof import to enable /debug/pprof/* handlers
7layermagik Jan 14, 2026
12377aa
Add pprof HTTP server support to mithril run command
7layermagik Jan 14, 2026
28 changes: 25 additions & 3 deletions cmd/mithril/node/node.go
@@ -193,6 +193,7 @@ func init() {
Run.Flags().BoolVar(&sbpf.UsePool, "use-pool", true, "Disable to allocate fresh slices")

// [tuning.pprof] section flags
Run.Flags().Int64Var(&pprofPort, "pprof-port", -1, "Port to serve HTTP pprof endpoint")
Run.Flags().StringVar(&cpuprofPath, "cpu-profile-path", "", "Filename to write CPU profile")

// [debug] section flags
@@ -855,7 +856,15 @@ func runVerifyRange(c *cobra.Command, args []string) {
klog.Fatalf("end slot cannot be lower than start slot")
}
mlog.Log.Infof("will replay startSlot=%d endSlot=%d", startSlot, endSlot)
accountsDb.InitCaches()
accountsDb.InitCaches(
config.GetInt("tuning.cache.vote_acct_lru"),
config.GetInt("tuning.cache.stake_acct_lru"),
config.GetInt("tuning.cache.small_acct_lru"),
config.GetInt("tuning.cache.medium_acct_lru"),
config.GetInt("tuning.cache.huge_acct_lru"),
config.GetInt("tuning.cache.program_lru"),
config.GetInt("tuning.cache.seen_once_filter_size"),
)

metricsWriter, metricsWriterCleanup, err := createBufWriter(metricsPath)
if err != nil {
@@ -1075,6 +1084,11 @@ func runLive(c *cobra.Command, args []string) {
// Now start the metrics server (after banner so errors don't appear first)
statsd.StartMetricsServer()

// Start pprof HTTP server if configured
if pprofPort != -1 {
startPprofHandlers(int(pprofPort))
}

// Determine if using Lightbringer based on block source
// NOTE: Lightbringer mode is TEMPORARILY DISABLED. The background block downloader that
// wrote Lightbringer blocks to disk was removed due to reliability issues (panics, race conditions).
@@ -1160,7 +1174,7 @@ func runLive(c *cobra.Command, args []string) {

// Handle explicit --snapshot flag (bypasses all auto-discovery, does NOT delete snapshot files)
if snapshotArchivePath != "" {
mlog.Log.Infof("Using snapshot file: %s", snapshotArchivePath)
mlog.Log.Infof("Using full snapshot: %s", snapshotArchivePath)

// Parse full snapshot slot from filename for validation
fullSnapshotSlot := parseSlotFromSnapshotName(filepath.Base(snapshotArchivePath))
@@ -1610,7 +1624,15 @@ postBootstrap:
}

liveEndSlot := uint64(math.MaxUint64)
accountsDb.InitCaches()
accountsDb.InitCaches(
config.GetInt("tuning.cache.vote_acct_lru"),
config.GetInt("tuning.cache.stake_acct_lru"),
config.GetInt("tuning.cache.small_acct_lru"),
config.GetInt("tuning.cache.medium_acct_lru"),
config.GetInt("tuning.cache.huge_acct_lru"),
config.GetInt("tuning.cache.program_lru"),
config.GetInt("tuning.cache.seen_once_filter_size"),
)

metricsWriter, metricsWriterCleanup, err := createBufWriter(metricsPath)
if err != nil {
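
The same seven cache-size lookups are passed to `InitCaches` in both `runVerifyRange` and `runLive`. One way to keep the two call sites in sync is to collect the lookups in a single helper. The sketch below is only an illustration under stated assumptions: `cacheSizes` and `loadCacheSizes` are hypothetical names, the `getInt` parameter stands in for `config.GetInt`, and the key prefix is copied from the calls in this hunk.

```go
package main

import "fmt"

// cacheSizes is a hypothetical struct grouping the seven values that the
// diff passes to InitCaches as positional arguments.
type cacheSizes struct {
	VoteAcct       int
	StakeAcct      int
	SmallAcct      int
	MediumAcct     int
	HugeAcct       int
	Program        int
	SeenOnceFilter int
}

// loadCacheSizes reads every cache key through getInt, which stands in for
// config.GetInt so the sketch has no dependency on the real config package.
func loadCacheSizes(getInt func(key string) int) cacheSizes {
	const prefix = "tuning.cache." // prefix as used by the calls in the diff
	return cacheSizes{
		VoteAcct:       getInt(prefix + "vote_acct_lru"),
		StakeAcct:      getInt(prefix + "stake_acct_lru"),
		SmallAcct:      getInt(prefix + "small_acct_lru"),
		MediumAcct:     getInt(prefix + "medium_acct_lru"),
		HugeAcct:       getInt(prefix + "huge_acct_lru"),
		Program:        getInt(prefix + "program_lru"),
		SeenOnceFilter: getInt(prefix + "seen_once_filter_size"),
	}
}

func main() {
	// A map-backed getter plays the role of the loaded config file.
	values := map[string]int{
		"tuning.cache.vote_acct_lru":  5000,
		"tuning.cache.stake_acct_lru": 2000,
	}
	sizes := loadCacheSizes(func(key string) int { return values[key] })
	fmt.Printf("%+v\n", sizes)
}
```

Keys that are absent from the map come back as 0 here; how the real config layer treats missing keys is not shown in this diff.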
1 change: 1 addition & 0 deletions cmd/mithril/node/pprof.go
@@ -3,6 +3,7 @@ package node
import (
"fmt"
"net/http"
_ "net/http/pprof" // registers /debug/pprof/* handlers
"runtime"
"strconv"
"time"
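
The blank import matters because `net/http/pprof` registers its `/debug/pprof/*` handlers on `http.DefaultServeMux` as an import side effect; without it, a server using the default mux has no profiling routes. The following is a minimal standalone sketch of that mechanism, not the contents of `pprof.go`: `pprofAddr` and `startPprof` are hypothetical helpers, and binding to `localhost` is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// pprofAddr is a hypothetical helper. It returns the listen address for the
// given port, or "" when profiling is disabled (port == -1, as in the flag).
func pprofAddr(port int) string {
	if port == -1 {
		return ""
	}
	return fmt.Sprintf("localhost:%d", port)
}

// startPprof serves http.DefaultServeMux in the background and returns the
// listener, or (nil, nil) when profiling is disabled.
func startPprof(port int) (net.Listener, error) {
	addr := pprofAddr(port)
	if addr == "" {
		return nil, nil
	}
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	// A nil handler tells http.Serve to use http.DefaultServeMux.
	go func() { _ = http.Serve(ln, nil) }()
	return ln, nil
}

func main() {
	// Port 0 asks the OS for a free port; this is for the demo only.
	ln, err := startPprof(0)
	if err != nil {
		fmt.Println("pprof listen failed:", err)
		return
	}
	defer ln.Close()

	resp, err := http.Get("http://" + ln.Addr().String() + "/debug/pprof/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("pprof index status:", resp.Status)
}
```

With a fixed port such as 6060, a CPU profile can then be fetched with `go tool pprof http://localhost:6060/debug/pprof/profile`.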
50 changes: 46 additions & 4 deletions config.example.toml
@@ -232,13 +232,13 @@ name = "mithril"
port = 8899

# ============================================================================
# [tuning] - Performance Tuning & Profiling
# [development] - Performance Tuning & Profiling
# ============================================================================
#
# Advanced settings for optimizing Mithril's performance.
# The defaults work well for most deployments.

[tuning]
[development]
# Zstd decoder concurrency (defaults to NumCPU)
# zstd_decoder_concurrency = 16

@@ -254,15 +254,57 @@ name = "mithril"
# Enable/disable pool allocator for slices
use_pool = true

# [tuning.pprof] - CPU/Memory Profiling
[tuning.pprof]
# [development.pprof] - CPU/Memory Profiling
[development.pprof]
# Port to serve HTTP pprof endpoint (-1 to disable)
# Access at http://localhost:PORT/debug/pprof/
# port = 6060

# Filename to write CPU profile (for offline analysis)
# cpu_profile_path = "/tmp/cpuprof.pprof"

# [development.cache] - AccountsDB LRU Cache Sizes
#
# These control the LRU caches for fast account data reads during replay.
# Values are NUMBER OF ENTRIES, not bytes.
#
# NOTE: These are DIFFERENT from the global vote/stake caches used for
# leader schedule building. Those are unbounded maps that store vote STATE
# (voting history, credits) and stake DELEGATIONS. These LRU caches store
# full ACCOUNT data for frequently-accessed accounts.
#
# Larger caches = fewer disk reads, but more memory usage.
# Memory per entry is ~200-1000 bytes depending on account data size.
[development.cache]
# Vote account data cache - number of entries (frequently accessed during replay)
vote_acct_lru = 5000

# Stake account data cache - number of entries (separated to avoid evicting
# hot accounts during epoch rewards when ~1.25M stake accounts are touched once each)
stake_acct_lru = 2000

# Small account data cache - accounts ≤512 bytes (token accounts, etc.)
small_acct_lru = 50000

# Medium account data cache - accounts larger than 512 bytes, up to 64KB
medium_acct_lru = 20000

# Huge account data cache - accounts >64KB (mostly programs)
huge_acct_lru = 500

# Compiled BPF program cache - number of entries
program_lru = 5000

# Admit-on-second-hit LRU filter for common accounts (small/medium/huge)
# Only caches accounts seen twice within the filter window, filtering one-shot reads.
# 0 = disabled (cache everything immediately, like traditional LRU)
# >0 = enable filtering with LRU of this capacity
# Recommended: 50000 (roughly 0.5-1x of small+medium+huge total)
# Monitor SeenOnceAdmitted/SeenOnceFiltered ratio to tune:
# <5-10% admitted = too strict (filter too small)
# >40-50% admitted = too lenient (filter too large or most accesses are hot)
seen_once_filter_size = 0
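
The admit-on-second-hit behaviour described in the comments above can be sketched as a small bounded LRU of keys that have been read once. This is an illustration only, not Mithril's implementation: `seenOnceFilter`, `newSeenOnceFilter`, and `shouldAdmit` are hypothetical names, and keys are plain strings rather than account pubkeys.

```go
package main

import (
	"container/list"
	"fmt"
)

// seenOnceFilter is a hypothetical bounded LRU set of keys that have been
// read once and not yet admitted to the main cache.
type seenOnceFilter struct {
	capacity int
	order    *list.List               // front = most recently seen
	index    map[string]*list.Element // key -> its node in order
}

func newSeenOnceFilter(capacity int) *seenOnceFilter {
	return &seenOnceFilter{
		capacity: capacity,
		order:    list.New(),
		index:    make(map[string]*list.Element),
	}
}

// shouldAdmit reports whether key may enter the main cache now.
// A capacity of 0 disables filtering, so every key is admitted immediately.
func (f *seenOnceFilter) shouldAdmit(key string) bool {
	if f.capacity <= 0 {
		return true
	}
	if el, ok := f.index[key]; ok {
		// Second sighting inside the window: admit and forget the key.
		f.order.Remove(el)
		delete(f.index, key)
		return true
	}
	// First sighting: remember the key, evicting the oldest entry if full.
	if f.order.Len() >= f.capacity {
		oldest := f.order.Back()
		f.order.Remove(oldest)
		delete(f.index, oldest.Value.(string))
	}
	f.index[key] = f.order.PushFront(key)
	return false
}

func main() {
	f := newSeenOnceFilter(2)
	for _, key := range []string{"a", "a", "b", "c", "d", "b"} {
		fmt.Printf("%s admitted=%v\n", key, f.shouldAdmit(key))
	}
}
```

In this sketch the filter capacity sets the window: a key seen once is forgotten after `capacity` other first-time keys arrive, so a filter that is too small rejects accounts that are re-read only occasionally.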

# ============================================================================
# [debug] - Debug Logging
# ============================================================================
88 changes: 88 additions & 0 deletions docs/TODO.md
@@ -0,0 +1,88 @@
# TODO / Known Issues

Identified on branch `perf/reward-distribution-optimizations` at commit `3b2ad67`
dev HEAD at time of identification: `a25b2e3`
Date: 2026-01-13

---

## Failing Tests

### 1. Address Lookup Table Tests - `InstrErrUnsupportedProgramId`

**File:** `pkg/sealevel/address_lookup_table_test.go`
**Test:** `TestExecute_AddrLookupTable_Program_Test_Create_Lookup_Table_Idempotent` (and likely all other ALT tests)

**Root Cause:** `AddressLookupTableAddr` and `StakeProgramAddr` were accidentally removed from the `resolveNativeProgramById` switch in `pkg/sealevel/native_programs_common.go`.

| Program | Removed In | Commit Date | Commit Message |
|---------|------------|-------------|----------------|
| `AddressLookupTableAddr` | `d47c16b` | May 16, 2025 | "many optimisations and changes" |
| `StakeProgramAddr` | `e890f9e` | Jul 26, 2025 | "snapshot download, stake program migration, refactoring" |

**Fix:** Add these cases back to the switch in `resolveNativeProgramById`:
```go
case a.StakeProgramAddr:
	return StakeProgramExecute, a.StakeProgramAddrStr, nil
case a.AddressLookupTableAddr:
	return AddressLookupTableExecute, a.AddressLookupTableProgramAddrStr, nil
```

---

### 2. Bank Hash Test - Nil Pointer Dereference

**File:** `pkg/replay/hash_test.go`
**Test:** `Test_Compute_Bank_Hash`

**Error:**
```
panic: runtime error: invalid memory address or nil pointer dereference
pkg/replay/hash.go:227 - shouldIncludeEah(0x0, 0x0)
```

**Root Cause:** Test passes `nil` for the first argument to `shouldIncludeEah`, which dereferences it without a nil check.

**Fix:** Either add a nil check in `shouldIncludeEah` or fix the test to pass valid arguments.
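
A minimal sketch of the first option, a nil guard at the top of the function. The real signature and logic of `shouldIncludeEah` are not shown here, so `slotCtx`, `features`, `shouldIncludeEahSketch`, and the final condition are all hypothetical placeholders. Whether `false` is the right result for nil inputs is also an assumption.

```go
package main

import "fmt"

// slotCtx and features are hypothetical stand-ins for the two pointer
// arguments that appear as (0x0, 0x0) in the panic trace.
type slotCtx struct{ Slot uint64 }
type features struct{ EahEnabled bool }

// shouldIncludeEahSketch returns false instead of panicking when either
// argument is nil. The final condition is a placeholder, not the real rule.
func shouldIncludeEahSketch(sc *slotCtx, f *features) bool {
	if sc == nil || f == nil {
		return false
	}
	return f.EahEnabled && sc.Slot > 0
}

func main() {
	fmt.Println(shouldIncludeEahSketch(nil, nil))
	fmt.Println(shouldIncludeEahSketch(&slotCtx{Slot: 1}, &features{EahEnabled: true}))
}
```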

---

## Agave/Firedancer Parity Issues

### 3. Missing "Burned Rewards" Semantics in Reward Distribution

**File:** `pkg/rewards/rewards.go` (lines 180-230)

**Problem:** Mithril does not implement "burn" semantics for per-account failures during partitioned reward distribution. This diverges from both Agave and Firedancer.

**Current Mithril behavior:**
- `GetAccount` error → panic (aborts replay)
- `UnmarshalStakeState` error → silent skip (reward lost, not counted)
- `MarshalStakeStakeInto` error → panic (aborts replay)
- Lamport overflow → panic (aborts replay)

**Agave behavior** (`distribution.rs:260`):
- `build_updated_stake_reward` returns `DistributionError::UnableToSetState` or `AccountNotFound`
- Caller logs error and adds to `lamports_burned`
- Continues processing remaining accounts

**Firedancer behavior** (`fd_rewards.c:958`):
- `distribute_epoch_reward_to_stake_acc` returns non-zero on decode/non-stake/etc.
- Caller increments `lamports_burned` and continues

**Failure scenarios that should burn (not panic):**
- Account missing / not found
- Stake state decode fails (including short/invalid data)
- Account isn't a stake account
- Lamport add overflows
- `set_state`/encode fails (e.g., data too small)

**Fix required:**
1. Add `lamports_burned` tracking to reward distribution
2. Change panics to log + burn + continue
3. `epochRewards.Distribute()` should receive `distributedLamports` (successful) separately from burned amount
4. Ensure `SysvarEpochRewards.DistributedRewards` advances correctly (may need to include burned in total)

**Note:** The current silent skip on `UnmarshalStakeState` error reduces `distributedLamports` but doesn't track it as burned, which may cause `SysvarEpochRewards` to diverge from Agave/FD.
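
The "log + burn + continue" pattern from the fix list could look roughly like the sketch below. It is a simplified illustration, not the code in `pkg/rewards/rewards.go`: `stakeReward`, `applyReward`, and `distributePartition` are hypothetical, accounts are modelled as a map of balances, and only two of the listed failure scenarios (missing account, lamport overflow) are represented.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// stakeReward is a hypothetical per-account reward entry.
type stakeReward struct {
	Pubkey   string
	Lamports uint64
}

var (
	errAccountNotFound = errors.New("account not found")
	errLamportOverflow = errors.New("lamport overflow")
)

// applyReward credits one account and returns an error instead of panicking.
func applyReward(balances map[string]uint64, r stakeReward) error {
	bal, ok := balances[r.Pubkey]
	if !ok {
		return errAccountNotFound
	}
	if r.Lamports > math.MaxUint64-bal {
		return errLamportOverflow
	}
	balances[r.Pubkey] = bal + r.Lamports
	return nil
}

// distributePartition logs and burns failed rewards, then continues with the
// remaining accounts. Distributed and burned totals are tracked separately.
func distributePartition(balances map[string]uint64, rewards []stakeReward) (distributed, burned uint64) {
	for _, r := range rewards {
		if err := applyReward(balances, r); err != nil {
			fmt.Printf("burning %d lamports for %s: %v\n", r.Lamports, r.Pubkey, err)
			burned += r.Lamports
			continue
		}
		distributed += r.Lamports
	}
	return distributed, burned
}

func main() {
	balances := map[string]uint64{"a": 100, "b": math.MaxUint64}
	rewards := []stakeReward{{"a", 10}, {"missing", 5}, {"b", 1}}
	distributed, burned := distributePartition(balances, rewards)
	fmt.Println("distributed:", distributed, "burned:", burned)
}
```

Returning the two totals separately leaves the caller free to decide how `SysvarEpochRewards.DistributedRewards` should advance, which is the open question in item 4 of the fix list.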

---