Making Claude Code Tell You What It's Doing
Claude Code has a status line that sits at the bottom of the terminal showing things like the current directory, git branch, model, and context window usage. It’s driven by a shell script that receives JSON on stdin and prints whatever it wants. I wanted to add one more thing: a short description of what the session is actually working on.
The Simple Way: /rename
The built-in /rename command sets a session name that Claude Code displays above the prompt. Type /rename fix auth bug at the start of each session and you’re done — no scripts needed.
The downside is that it’s manual, and /rename can’t be invoked programmatically by Claude. If you want Claude to automatically describe what it’s working on and update that description as the focus shifts, you need the automated approach below.
The Automated Approach
The goal is for Claude to write a short status like “fix auth bug” that shows up in the status line, updated automatically as the session’s focus changes:
op-claude (main) Opus ctx:8% · fix auth bug
This turns out to be harder than it should be. The status line script receives a JSON blob on stdin that includes the session_id. Claude’s bash tool calls don’t. There’s no $SESSION_ID environment variable, and $PPID differs between the two because they’re spawned through different process trees.
So we need a way for the status line side (which knows the session ID) to leave a breadcrumb that the bash side (which doesn’t) can find.
The Breadcrumb
Both the status line script and Claude’s bash calls have a common ancestor: the claude process. They just reach it through different paths. The trick is to walk up the process tree until you find a process named claude, then use its PID as a shared key.
A UserPromptSubmit hook runs on every user message and receives the session_id in its input. It walks the process tree to find the ancestor claude PID and writes a breadcrumb file mapping one to the other:
#!/usr/bin/env bash
# ~/.claude/hooks/session-status.sh
input=$(cat)
session_id=$(echo "$input" | jq -r '.session_id // empty')
[ -z "$session_id" ] && exit 0
# Write breadcrumb mapping ancestor claude PID -> session_id
pid=$PPID
while [ "$pid" -gt 1 ]; do
comm=$(ps -o comm= -p "$pid" 2>/dev/null)
if [ "$comm" = "claude" ]; then
echo "$session_id" > "/tmp/claude-sid-${pid}"
break
fi
pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
done
# If no status file exists yet, remind Claude to create one
if [ -f "/tmp/claude-status-${session_id}" ]; then
exit 0
fi
jq -n '{
"hookSpecificOutput": {
"hookEventName": "UserPromptSubmit",
"additionalContext": "STATUS LINE REMINDER: Run ~/.claude/update-status.sh \"short summary\" to set what this session is working on (under 30 chars)."
}
}'
That last part is important. You can’t just tell Claude in your CLAUDE.md to “please update the status line” and expect it to reliably happen. The hook injects a reminder into the conversation context on every user message until a status file exists. Belt and suspenders.
Writing the Status
Claude calls a small helper script that does the same process-tree walk in reverse — finds the claude ancestor PID, reads the breadcrumb to get the session ID, then writes the status:
#!/usr/bin/env bash
# ~/.claude/update-status.sh "short summary"
msg="$1"
[ -z "$msg" ] && exit 1
pid=$$
while [ "$pid" -gt 1 ]; do
comm=$(ps -o comm= -p "$pid" 2>/dev/null)
if [ "$comm" = "claude" ]; then
sid=$(cat "/tmp/claude-sid-${pid}" 2>/dev/null)
[ -n "$sid" ] && echo "$msg" > "/tmp/claude-status-${sid}"
exit 0
fi
pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
done
The Status Line Script
The full status line script reads the JSON from stdin, extracts the fields it cares about, and builds the output. The session status is just another part appended at the end:
#!/usr/bin/env bash
# ~/.claude/statusline-command.sh
input=$(cat)
cwd=$(echo "$input" | jq -r '.cwd // .workspace.current_dir // ""')
model=$(echo "$input" | jq -r '.model.display_name // ""')
used_pct=$(echo "$input" | jq -r '.context_window.used_percentage // empty')
vim_mode=$(echo "$input" | jq -r '.vim.mode // empty')
session_id=$(echo "$input" | jq -r '.session_id // empty')
# Per-session status from temp file keyed by session_id
session_status=""
if [ -n "$session_id" ]; then
session_status=$(cat "/tmp/claude-status-${session_id}" 2>/dev/null || true)
fi
# Directory: basename of cwd
dir=$(basename "$cwd")
# Git branch (skip optional locks)
branch=""
if git_out=$(GIT_OPTIONAL_LOCKS=0 git -C "$cwd" symbolic-ref --short HEAD 2>/dev/null); then
branch="$git_out"
fi
# Build status line parts
parts=()
# Directory in cyan
parts+=("$(printf '\033[36m%s\033[0m' "$dir")")
# Git branch in yellow if present
if [ -n "$branch" ]; then
parts+=("$(printf '\033[33m(%s)\033[0m' "$branch")")
fi
# Model
if [ -n "$model" ]; then
parts+=("$(printf '\033[90m%s\033[0m' "$model")")
fi
# Context usage with color thresholds
if [ -n "$used_pct" ]; then
used_int=${used_pct%.*}
if [ "$used_int" -ge 80 ] 2>/dev/null; then
color='\033[31m'
elif [ "$used_int" -ge 50 ] 2>/dev/null; then
color='\033[33m'
else
color='\033[32m'
fi
parts+=("$(printf "${color}ctx:%s%%\033[0m" "$used_int")")
fi
# Session status (per-session work summary)
if [ -n "$session_status" ]; then
parts+=("$(printf '\033[90m· %s\033[0m' "$session_status")")
fi
# Vim mode
if [ -n "$vim_mode" ]; then
parts+=("$(printf '\033[90m[%s]\033[0m' "$vim_mode")")
fi
printf '%s' "${parts[*]}"
Wiring It Up
Make both scripts executable:
chmod +x ~/.claude/statusline-command.sh ~/.claude/update-status.sh ~/.claude/hooks/session-status.sh
Register the status line and hook in ~/.claude/settings.json:
{
"statusLine": {
"type": "command",
"command": "bash ~/.claude/statusline-command.sh"
},
"hooks": {
"UserPromptSubmit": [
{
"hooks": [
{
"type": "command",
"command": "~/.claude/hooks/session-status.sh"
}
]
}
]
}
}
And add the instruction to your CLAUDE.md that tells Claude when to update:
## Session Status Line
Update the session status line so the user can see what each session
is working on at a glance.
- **After the first user prompt**: run `~/.claude/update-status.sh "short summary"`
as part of your first response
- **Periodically**: run it again every ~5 interactions or when focus shifts
- Keep summaries under 30 chars
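To sanity-check the wiring without a live session, the whole breadcrumb flow can be simulated in a few lines of shell. The PID and session ID below are made up stand-ins; in the real flow they come from the process-tree walk and the hook's stdin:

```shell
#!/usr/bin/env bash
# Simulated end-to-end check of the breadcrumb flow (no Claude required).
# "12345" stands in for the claude ancestor PID; "abc-123" for the session_id.
pid=12345
session_id="abc-123"

# What the UserPromptSubmit hook does: map claude PID -> session_id
echo "$session_id" > "/tmp/claude-sid-${pid}"

# What update-status.sh does: resolve PID -> session_id, write the status
sid=$(cat "/tmp/claude-sid-${pid}")
echo "fix auth bug" > "/tmp/claude-status-${sid}"

# What the status line script does: read the status for its session_id
cat "/tmp/claude-status-${session_id}"   # -> fix auth bug
```

If the last line prints the status you wrote, the PID-to-session mapping round-trips correctly; the real scripts just discover the PID by walking the process tree instead of hard-coding it.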
What I Learned
The interesting constraint here is that Claude Code’s extensibility points — status line scripts, hooks, and bash tool calls — all run as separate processes with no shared environment. There’s no session ID in the environment, no shared memory, no IPC channel. The process tree walk is a hack, but it’s a reliable one. Every subprocess of a Claude Code session shares a common claude ancestor, even if the paths diverge.
The other lesson is that CLAUDE.md instructions alone aren’t enough for “always do X” behaviors. Claude follows them inconsistently, especially across sessions. Hooks that inject reminders into the conversation context are much more reliable. The CLAUDE.md instruction tells Claude what to do; the hook makes sure it actually does it.
Claude Docker
I’ve been using Claude Code a lot lately. It’s become a core part of how I work — planning changes, exploring unfamiliar codebases, writing and reviewing code. But giving an AI agent the ability to run arbitrary shell commands on your machine does make you think a bit more carefully about what’s happening on your host system.
The natural answer is to run it in a container. Not as a security boundary — Claude still needs access to your code, your git config, a GitHub token, and the internet — but as a way to keep all the side effects contained. If it installs random packages, creates temp files, or leaves build artifacts scattered around, that’s all happening inside the container rather than on your actual machine. It also makes the environment completely reproducible and disposable. Something goes wrong? Tear it down and rebuild.
So I built claude-docker to do exactly that.
How It Works
An Ubuntu container runs an SSH server. Your code directory is bind-mounted at the same path inside the container so file references are identical on both sides — Claude can say “edit /Users/aj/Documents/code/foo/bar.go” and it works whether you’re looking at it from inside or outside the container. Your git config, Claude config, and known hosts are all mounted in too, so everything just works as expected.
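A hypothetical compose-style sketch of that mounting scheme (the service name, image name and container-side config paths here are made up for illustration; the repo's actual configuration will differ):

```yaml
services:
  claude:
    image: claude-docker        # hypothetical image name
    ports:
      - "2222:22"               # be-claude SSHs in through here
    volumes:
      # Code mounted at the SAME path as on the host, so file paths are
      # identical inside and outside the container.
      - ${CODE_PATH}:${CODE_PATH}
      - ~/.gitconfig:/home/dev/.gitconfig:ro
      - ~/.claude:/home/dev/.claude
      - ~/.ssh/known_hosts:/home/dev/.ssh/known_hosts:ro
```

The `${CODE_PATH}:${CODE_PATH}` line is the whole trick: both sides of the bind mount use the same value, so a path Claude prints inside the container is valid on the host too.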
The container comes pre-loaded with the usual development tools: Go, Node.js, mise, gopls, gh, ripgrep, fzf, tmux, and a bunch of others. There’s an EXTRA_PACKAGES option if you need anything else — set it in your .env and it gets installed on the next build.
A be-claude helper script SSHs into the container and launches Claude Code in whatever directory you’re currently in. Symlink it onto your PATH and it works from anywhere. It automatically passes through a GitHub token (from gh auth token or the environment) so Claude can interact with GitHub inside the container.
Build Caches
One thing I wanted to get right was build cache persistence. Rebuilding the container shouldn’t mean re-downloading every Go module and Cargo crate. A single named Docker volume is mounted at ~/.cache and environment variables redirect the various tool caches into it:
- Go module cache via `GOMODCACHE`
- Cargo registry via `CARGO_HOME`
- Solc binaries via `SVM_HOME`
- Foundry and mise already use `~/.cache` by default
So you get fast rebuilds without the volume shadowing any binaries installed in the image (like gopls). The distinction matters — you want caches persisted but binaries to come fresh from each build.
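As a sketch, that redirection could look something like this in a Dockerfile (the exact paths are illustrative assumptions, not copied from the repo; the environment variables themselves are the real ones each tool honours):

```dockerfile
# Illustrative only: redirect per-tool caches into the single ~/.cache volume.
ENV GOMODCACHE=/home/dev/.cache/go/mod \
    CARGO_HOME=/home/dev/.cache/cargo \
    SVM_HOME=/home/dev/.cache/svm

# One named volume then persists all of them across container rebuilds,
# while binaries installed elsewhere in the image come fresh from each build.
VOLUME /home/dev/.cache
```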
Getting Started
The setup is pretty minimal:
git clone git@github.com:ajsutton/claude-docker.git
cd claude-docker
cp .env.example .env
# Edit .env — set CODE_PATH to your code directory
./run.sh
./be-claude
If you have SSH keys loaded in your agent, you don’t even need to configure SSH_AUTHORIZED_KEYS — run.sh picks them up automatically.
If your network requires custom root CAs (corporate proxies, internal domains, etc.), drop .crt files into the certs/ directory and they get installed into the container’s trust store on the next build. The directory is gitignored so your certificates stay local.
What It Isn’t
This is a convenience layer, not a security sandbox. Claude has read/write access to your mounted code, a GitHub token, and unrestricted network access. It’s useful for keeping your host system clean and making the environment reproducible, but don’t treat the container boundary as a trust boundary.
The code is up at github.com/ajsutton/claude-docker — it’s intentionally simple and easy to customise for your own setup.
Yes, I was too lazy to write this post myself and got Claude to do it for me. The whole world is just AI slop now.
Types of Tech Debt
The Optimism blog has published an article I wrote discussing the various types of tech debt. I've been finding it very useful lately to be able to "put words to things", whether that's naming concepts more precisely or just explaining them more clearly.
Teku Event Channels
Teku uses a really nice framework for separating different components - Event Channels. It’s based on similar patterns in the Sail library used at LMAX for sending network messages between services. In Teku though, it’s designed to work in-process while still decoupling the components in the system. Turns out I never wrote about it here, so I’m very belatedly catching up.
Event channels are defined by declaring a pretty standard interface:
public interface SlotEventsChannel extends VoidReturningChannelInterface {
void onSlot(UInt64 slot);
}
There are a few simple restrictions:
- It must extend from `VoidReturningChannelInterface` (or `ChannelInterface`, but we'll get to non-void returning cases later)
- All methods must return `void`
- Methods cannot throw any exceptions
There can be any number of methods on the same interface and any number of subscribers to the channel.
The implementing side simply implements the interface and the calling side simply has an implementation of the interface injected and calls it as normal. So far, this isn’t actually providing any real separation - it’s just using a Java interface. You can pass the concrete implementation of the interface to the calling side and it will all work. The interface provides some decoupling between the caller and receiver, but they’re still coupled temporally because the call is synchronous, and exceptions on the receiving side would propagate back up through the calling side. Both can be fixed to isolate the components fully, but then you have to do that at every call-site.
Instead, the event channel system uses reflection to generate an implementation of the interface that ensures complete isolation between caller and
receiver. The generated implementation is passed to the caller and it implements each method by passing the work to a thread pool, then calling the actual
implementation. It also provides error handling and records metrics to give visibility into the event system. While reflection is used to generate the
implementation, most of the code is in abstract classes that the generated implementations extend, so it's easy to maintain. Importantly, the complexity of
that reflection is abstracted away from the code using the framework - it's just like an interface where part of the API contract is that calls are always
asynchronous and never throw any exceptions. The code for the framework is quite small, all in the infrastructure.events package.
Calls to the interface are added to the queue the thread pool takes work from in call order. So if the thread pool has a single thread, the calls will all be processed in exactly the same order they were made. In most cases there are multiple threads in the thread pool so processing happens in parallel (but starts in order), but
for cases like the StorageUpdateChannel where event order is important, a single thread is used.
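The generated implementation can be sketched with a plain JDK dynamic proxy. This is an illustrative, self-contained reduction of the idea, not Teku's actual code (the real framework in infrastructure.events adds metrics, error handling and configurable executors). Note how a single-threaded executor gives exactly the ordering guarantee described above:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChannelDemo {
    // A channel interface following the rules: void returns, no exceptions.
    interface SlotEventsChannel {
        void onSlot(long slot);
    }

    // Wrap the subscriber in a proxy that queues every call onto an executor,
    // so the caller never blocks on (or sees exceptions from) the subscriber.
    @SuppressWarnings("unchecked")
    static <T> T asyncChannel(Class<T> channelType, T subscriber, ExecutorService executor) {
        InvocationHandler handler = (proxy, method, args) -> {
            executor.execute(() -> {
                try {
                    method.invoke(subscriber, args);
                } catch (Exception e) {
                    e.printStackTrace(); // the real framework records metrics here
                }
            });
            return null; // every method returns void
        };
        return (T) Proxy.newProxyInstance(
                channelType.getClassLoader(), new Class<?>[] {channelType}, handler);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        SlotEventsChannel subscriber = slot -> System.out.println("processed slot " + slot);
        SlotEventsChannel publisher = asyncChannel(SlotEventsChannel.class, subscriber, executor);
        publisher.onSlot(1); // returns immediately; the work is queued
        publisher.onSlot(2); // single-threaded pool preserves call order
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The caller only ever sees the `SlotEventsChannel` interface; the proxy, the executor and the error handling are all invisible at the call-site, which is exactly the "click-throughability" property discussed below.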
The VoidReturningChannelInterface is an ideal case for maximum decoupling of components - the sender is just notifying when events happen and forgetting about them.
But often we need to request data from another component or be able to handle failures. The storage system in Teku is a decoupled component for example. In that case
we use an interface that just extends ChannelInterface. Then methods are allowed to return SafeFuture - the promise type used in Teku.
Exceptions are still not allowed, but the returned SafeFuture can be used to return error information as part of the result. The same implementation approach applies:
reflection is used to generate an implementation that calls the real implementation via a thread pool, but now when the real implementation completes, the result
is used to complete the originally returned SafeFuture. For example:
public interface Eth1DepositStorageChannel extends ChannelInterface {
SafeFuture<ReplayDepositsResult> replayDepositEvents();
SafeFuture<Boolean> removeDepositEvents();
}
Note that the actual implementation still provides a method that returns SafeFuture, which allows it to use an asynchronous implementation when suitable. It
can also simply use SafeFuture.completedFuture(value) to return a value synchronously. The event system will now only allow a single subscriber to the
topic to ensure it knows where the result value should come from. Since publishers and subscribers are created at startup, if multiple subscribers are added
it means Teku fails to start, a lot of tests fail and it won’t go unnoticed.
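The future-returning variant can be sketched the same way. Again, this is a simplified illustration using the JDK's CompletableFuture rather than Teku's SafeFuture, and the channel interface is a cut-down stand-in rather than the real Eth1DepositStorageChannel:

```java
import java.lang.reflect.Proxy;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncQueryDemo {
    // A request/response channel: methods return a future instead of void.
    interface DepositStorageChannel {
        CompletableFuture<Boolean> removeDepositEvents();
    }

    @SuppressWarnings("unchecked")
    static <T> T asyncChannel(Class<T> type, T impl, ExecutorService executor) {
        return (T) Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] {type},
                (proxy, method, args) -> {
                    // The caller gets this future back immediately...
                    CompletableFuture<Object> result = new CompletableFuture<>();
                    executor.execute(() -> {
                        try {
                            // ...and the real implementation's own future is
                            // chained into it once the queued call runs.
                            CompletableFuture<Object> inner =
                                    (CompletableFuture<Object>) method.invoke(impl, args);
                            inner.whenComplete((value, error) -> {
                                if (error != null) {
                                    result.completeExceptionally(error);
                                } else {
                                    result.complete(value);
                                }
                            });
                        } catch (Exception e) {
                            result.completeExceptionally(e);
                        }
                    });
                    return result;
                });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        DepositStorageChannel impl = () -> CompletableFuture.completedFuture(true);
        DepositStorageChannel channel = asyncChannel(DepositStorageChannel.class, impl, executor);
        System.out.println("removed: " + channel.removeDepositEvents().get());
        executor.shutdown();
    }
}
```

Errors travel through the future rather than propagating as exceptions into the caller, which is why the single-subscriber restriction matters: there has to be exactly one implementation responsible for completing each returned future.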
There’s a bunch of nice things about this framework:
- EventChannels have “click-throughability”. You can easily jump from a call to the interface to the actual implementation (or see all implementations) using the go to implementation functionality of an IDE. The details of how the decoupling is implemented are all abstracted away.
- The ability to return a value asynchronously is much easier to reason about than having to send responses via a separate event. The request/response is clearly coupled together in the interface rather than piecing together two independent events.
- For testing, the channel interface can be easily mocked, a synchronous event channel passed or a custom stub provided.
One particularly neat trick in Teku is that the validator client can run either within the Teku beacon node process or as a separate process. It’s the event
channel system that makes that work. The validator client was originally built in-process but as its own component, so all calls to or from it were completely
asynchronous and decoupled through the event channel interfaces. To make it run as an external process we simply wrote an implementation of the channels it called
that worked by sending HTTP requests to the beacon node API rather than using the in-process generated ones. The calls to the validator client were all
timing notifications like the SlotEventsChannel above. For most of those we simply wrote a new publisher that ran on an independent timer inside the validator
client. The few that actually depended on the state of the beacon node were produced by subscribing to the beacon node API event stream and sending events
based off of that.
The main downside is that the asynchronicity of the call isn't visible in the actual code (only in the reflection-generated implementation). That's why, by
convention in Teku channel interfaces and the variables for them are always suffixed with Channel so it is clear that asynchronicity is part of the API contract.
It isn’t immediately obvious to people new to the codebase, but it’s quick to learn and easy to remember so I don’t recall it ever causing any problems in
practice.
Ultimately event channels are a pretty simple system that provides a lot of power and flexibility.
Home Lab
One of the downsides of moving on from working on the Ethereum consensus layer is that you often need a real execution node sync'd, and execution clients don't have the near instantaneous checkpoint sync that consensus clients do. So recently I bit the bullet and custom built a PC to run a whole bunch of different Ethereum chains on. I'm really quite happy with the result.
There’s actually a really good variety of public endpoints available for loads of Ethereum-based chains these days, so while running your own is maximally decentralised, it’s not just a choice between Infura or your own node now. Public Node provide very good free JSON-RPC and consensus APIs. Alchemy and Quicknode both have quite usable free tiers, etc. The downside with all of them though is that their servers are in the Americas or Europe, and that’s a whole lot of latency away from Australia. When you’re syncing L2 nodes or particularly running fault proof systems, you wind up making a lot of requests and that latency becomes very painful very quickly. More than anything, it was wanting to avoid that latency that drove me to run my own nodes locally.
To be useful though, I really want it to run quite a few different chains. Currently it’s running:
- Ethereum MainNet
- Ethereum Sepolia
- OP Mainnet
- OP Sepolia
- Base Mainnet
- Base Sepolia
I’m quite tempted to add a Holesky node just so I can run some validators again - shame most of the L2 stacks and apps use Sepolia but it has a locked down validator set.
Hardware-wise, running this many nodes is primarily about disk space, so I wound up with an MSI Pro Z790-P motherboard which has a rather ridiculous number of ports that you can plug SSDs into - not all at full speed but plenty at fast enough speeds. It’s been nearly 20 years since I built a custom PC so there’s likely a bunch of things that aren’t the perfect trade-offs, but I’m quite happy with the overall result. One of the mistakes which I’m actually happy about was that I mistook the case size names and wound up with a much larger case than I expected. That does give it capacity to shove a heap of spinning rust drives into it and leverage that for things like historic data that doesn’t need the fast disk. It’s got an Intel Core i7 CPU which is barely being used. I had wanted to add 128GB of RAM since Ethereum nodes do like to cache stuff, but apparently using 4 sticks of RAM can cause instability so I’ve stuck to just 64GB for now. It seems to be plenty but is probably the main limiting factor at the moment. For disk it currently has two 4TB NVMe drives.
For software, the L1 consensus nodes are obviously all Teku and they’re doing great. The team has done a great job continuing to improve things since I left, so even with the significant growth in the validator set, it’s running very happily with less memory and CPU than it had been “back in my day”. The L1 Mainnet execution client is a reth archive node, which has been quite successful. I did try a reth node for Sepolia but hit a few issues (which I think have now been fixed), so I’ve wound up running executionbackup and have both geth and reth for Sepolia.
The L2 nodes are all op-node and op-geth - always good to actually run the software I’m helping build. For OP Sepolia, I’m also running op-dispute-mon and op-challenger to both monitor the fault proof system and participate in games to ensure correct outcomes. I really do like the fact that OP fault proofs are fully permissionless so anyone can participate in the process just like my home lab now does.
For coordination, everything is running in docker via docker-compose, which made it much easier to avoid all the port conflicts that would otherwise occur. Each network has its own docker-compose file, though there’s a bunch of networks shared between chains so the L2s can connect to the L1s and everything can connect to metrics. All the compose files and other config are in a local git repo with a hook set up to automatically apply any changes. So I’ve wound up with a home grown gitops kind of setup. I did try using k8s with ArgoCD to “do it properly” at one point but it just made everything far more complex and less reliable, so I switched back to simple docker compose.
For monitoring, I’ve got Victoria Metrics capturing metrics and Loki capturing logs - both automatically pick up any new hosts. Then there’s a Grafana instance to visualise it all. I even went as far as running ethereum-metrics-exporter to give a unified view of metrics when using different clients.
The final piece is an nginx instance that exposes all the different RPC endpoints at easy to remember URLs, e.g. /eth/mainnet/el, /eth/mainnet/cl, /op/mainnet/el etc. All the web UIs for the other services like Grafana are exposed through the same nginx instance. My initial build exposed all the RPCs on different ports and it was a nightmare trying to remember which chain was on which port, so the friendly URLs have been a big win.
Overall I’m really very happy with the setup and it is lightning fast even to perform quite expensive queries like listing every dispute game ever created. Plus it was fun to play with some “from scratch” system admin again instead of doing everything in the cloud with already existing templates and services setup.
Moving On From ConsenSys
After nearly 5 years working with the ConsenSys protocols group, I’ll be finishing up at the end of January.
So what happens with Teku? It will carry on as usual and keep going from strength to strength. There’s an amazing team of people building Teku and I have complete confidence in their ability to continue building Teku and contributing to the future of the Ethereum protocol. Teku started well before I was involved with it and has always been the work of an amazing team of people. I just wound up doing a lot of the more visible stuff - answering discord questions and reacting to the ad-hoc stuff that popped up.
My time at ConsenSys actually started by working on Besu, back before its initial release when it was called Pantheon. I was part of the team adding the initial support for private networks and then later moved over to join the team focussed on MainNet compatibility with work on things like fast sync, core EVM work and all that kind of fun. After that I got to help build a new team to focus on setting up tooling to make development and testing easier - modernising build and release systems, automated deployment and monitoring of test nodes and so on.
Then this “Ethereum 2.0” thing seemed like it might actually be ready to move out of the research phase and move towards production. So I joined the research team that was building “Artemis” to start bringing it out of research and to a real production-ready client. Most of the research team moved on to other research topics and we built a mostly new team around what we then called Teku. And so began one heck of a journey leading to the beacon chain launch, Altair and then The Merge. Hearing the crowd cheering in support of the merge at DevCon this year is one of the great highlights of my career.
I’m so lucky to have gotten to work with some truly amazing people. The folks who have been part of the Teku team along our journey share a truly special place in my heart though and I will always be grateful for the shared knowledge, persistence and dedication they have all contributed but even more so the caring, friendly way they contributed it. It’s not just the teams in ConsenSys but right across the Ethereum eco-system. The way the different consensus client teams have come together to push Ethereum forward is particularly amazing. These are ostensibly teams that are competing with each other and yet actively share knowledge to improve both the protocol and other team’s clients.
As I leave ConsenSys, I do so knowing that there are teams of incredible people who will carry on with the work I’m so privileged to have been able to contribute to.
So why the change? Mostly because this is a good time for me personally. As I mentioned, I started working on Teku to bring it out of research and into production. Getting The Merge done is a natural endpoint of that mission and a natural place to start looking for new challenges and opportunities. Obviously there are plenty of remaining things to improve in the Ethereum protocol and clients like Teku, but I’m keen to get a bit further out of my comfort zone.

So what’s next? I’ll be taking up a role as Staff Protocol Engineer with OP Labs to work on Optimism. I started looking at opportunities at Optimism because I’ve seen some of the great work they’ve been doing and I really like their retroactive public goods funding - it shows they’re investing in Ethereum, not just taking what they can get from it. Primarily though for me, finding a great place to work is about finding a great team of people doing interesting work. As I talked with various people from the Optimism team, I found them to be smart, curious, welcoming people who not only wanted to build great software but also wanted to keep improving the way they went about that. Plus I’ll be staying in the Ethereum eco-system so still get to work with all those amazing people. I can already see there’s a ton of stuff I can learn from the Optimism team and I think there’s places where I can bring some useful skills and experience beyond just writing some code.
In fact, given they mostly use Go and I have no real Go experience, “just writing some code” will be one of the first fun challenges. Java has kind of followed me for my career, not entirely deliberately though I do like it as a language, so I’m actually excited to really dig into writing production grade Go code.
Philosophically, one of the things I dislike about Ethereum (and blockchains in general) is that the high cost of transactions means it often becomes a rich person’s game and it often feels like people just throwing play money around. L2 solutions like Optimism are a big part of solving that by scaling blockchains and dramatically reducing fees. It feels good to me to be contributing to that. So much of the potential of Ethereum is waiting to be unlocked once it really scales. Besides, having worked on execution and consensus layers so far, moving to Layer 2 seems like an obvious next step.
Overall, I’m excited about the future of Teku and will be cheering the team on, and excited about the future of Ethereum and look forward to being part of delivering The Surge.
DevCon VI Talks
Mostly so that I can find them more easily later, here are the recordings of the DevCon VI talks I gave in Bogotá.
Firstly, Post-Merge Ethereum Client Architecture:
And a panel, It’s 10pm, do you know where your mnemonic is?
Understanding Attestation Misses
The process of producing attestations and getting them included into the chain has become more complex post-merge. Combined with a few client issues causing more missed attestations than normal, that’s left lots of people struggling to understand what’s causing those misses. So let’s dig into the process involved and how to identify where problems are occurring.
Attestation Life-Cycle
There’s a number of steps required to get an attestation included on the chain. My old video explaining this still covers the details of the journey well - the various optimisations I talk about there have long since been implemented but the process is still the same. In short, the validator needs to produce the attestation and publish it to the attestation subnet gossip channel, an aggregator needs to include it in an aggregation and publish it to the aggregates gossip channel, and then it needs to be picked up by a block producer and packed into a block.
Attestations are more likely to be included in aggregates and blocks if they are published on time and match the majority of other attestations. Attestations that are different can’t be aggregated, so they’re much less likely to be included in aggregates (the aggregator would have to produce an attestation that matches yours), and they take up one of the 128 attestation slots in a block while paying the proposer less than better aggregated attestations.
Since attestations attest to the current state of the chain, the way to ensure your attestation matches the majority is to ensure you’re following the chain well. That’s where most of the post-merge issues have been - blocks taking too long to import, causing less accurate attestations which are then more likely to not get included. So let’s look at some metrics to follow so we can work out what’s happening.
Key Indicators of Attestation Performance
Often people just look at the “Attestation Effectiveness” metric reported by beaconcha.in, but that’s not a great metric to use. Firstly, it tries to bundle together every possible measure of attestations, some within your control and some not, into a single metric. Secondly, it tends to be far too volatile, with a single delayed attestation causing a very large drop in the effectiveness metric, distorting the result. As a result, it tends to make your validator performance look worse than it is and doesn’t give you any useful information to act on.
So let’s look at some more specific and informative metrics we can use instead.
Firstly, for the overall view, look at the percentage of attestation rewards earned. While that write-up is pre-Altair, the metrics on the Teku Dashboard have been updated to show the right values even with the new Altair rules. Look at the “Attestation Rewards Earned” line on the “Attestation Performance” graph in the top left of the dashboard. This will tell you quite accurately how well you’re doing in terms of total rewards, but it still includes factors outside of your control and won’t help identify where problems are occurring.
To identify where problems are occurring we need to dig a bit deeper. Each epoch, Teku prints a summary of attestation performance to the logs like:
Attestation performance: epoch 148933, expected 16, produced 16, included 16 (100%), distance 1 / 1.00 / 1, correct target 16 (100%), correct head 16 (100%)
This is an example of perfect attestation performance - we expected 16 attestations, 16 were included, the distance had a minimum of 1, average of 1.00 and maximum of 1 (the distance numbers are min / avg / max in the output) and 100% of attestations had the correct target and head. One thing to note is that attestation performance is reported 2 epochs after the attestations are produced to give them time to actually be included on chain. The epoch reported in this line tells you which epoch the attestations being reported on are from.
Each of these values is also available as a metric, and the Teku Dashboard uses them to create the “Attestation Performance” graph. That provides a good way to quickly see how your validators have performed over time and get an overview, rather than fixating on a single epoch that wasn’t ideal.
Attestations Expected
Each active validator should produce one attestation per epoch. So the expected value reported should be the same as the number of active validators you’re running. If it’s less than that, you probably haven’t loaded some of your validator keys and they’ll likely be missing all attestations. It’s pretty rare that expected isn’t what we expect though.
Attestations Produced
If the produced value is less than the expected value then something prevented your node from producing attestations at all. To find out what, you’ll need to scroll back up in your validator client logs to the epoch this performance report is for - remember that it will be 2 epochs ago. We’re looking for a log that shows the result of the attestation duty. When the attestation is published successfully it will show something like:
Validator *** Published attestation Count: 176, Slot: 3963003, Root: b4ca6d61be7f54f7ccc6055d0f37f122943e8313dbcfe49513c9d4ef50bbc870
The Count field is the number of local validators that produced this attestation (this example is from our Görli testnet node - sadly we don’t have that many real-money validators).
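To know how far back to scroll, it helps to convert the reported epoch into its slot range — a sketch, assuming the standard 32 slots per epoch:

```shell
# Convert an epoch number to the slots it covers (32 slots per epoch)
epoch=148933
first_slot=$((epoch * 32))
last_slot=$((first_slot + 31))
echo "epoch $epoch covers slots $first_slot to $last_slot"
```

Any attestation duty logs for that epoch will reference slots in this range, so you can grep for those slot numbers directly.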
When an attestation fails to be produced the log will show something like:
Validator *** Failed to produce attestation Slot: 4726848 Validator: d278fc2
java.lang.IllegalArgumentException: Cannot create attestation for future slot. Requested 4726848 but current slot is 4726847
at tech.pegasys.teku.validator.coordinator.ValidatorApiHandler.createAttestationData(ValidatorApiHandler.java:324)
at jdk.internal.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at tech.pegasys.teku.infrastructure.events.DirectEventDeliverer.executeMethod(DirectEventDeliverer.java:74)
at tech.pegasys.teku.infrastructure.events.DirectEventDeliverer.deliverToWithResponse(DirectEventDeliverer.java:67)
at tech.pegasys.teku.infrastructure.events.AsyncEventDeliverer.lambda$deliverToWithResponse$1(AsyncEventDeliverer.java:80)
at tech.pegasys.teku.infrastructure.events.AsyncEventDeliverer$QueueReader.deliverNextEvent(AsyncEventDeliverer.java:125)
at tech.pegasys.teku.infrastructure.events.AsyncEventDeliverer$QueueReader.run(AsyncEventDeliverer.java:116)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
The specific reason the attestation failed can vary a lot. In this case the beacon node wasn’t keeping up for some reason, which would require further investigation into Teku and its performance. One common source of failures is the beacon node or execution client not being in sync at the time, which appears as a 503 response code from the beacon node when using the external validator client.
We can look at the “Produced” line on the “Attestation Performance” graph of the standard Teku dashboard to see the percentage of expected attestations that were produced over time.
Attestation Timing
If the attestation was produced, the next thing to check is that it was actually produced on time. If you find the Published attestation log line, you can compare the timestamp of that log message to the time the attestation’s slot started. You can use Slot Finder to find the start time of the slot. Attestations are due to be published 4 seconds into the slot, so anywhere from the start of the slot up to about 4.5 seconds after is fine.
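If you'd rather compute the slot start time yourself than use a lookup tool, it follows directly from the genesis time — a sketch using mainnet values (genesis at unix time 1606824023, 12-second slots; other networks differ):

```shell
# Mainnet genesis time and slot duration (assumptions; other networks differ)
genesis=1606824023
seconds_per_slot=12

slot=4765916  # example slot to look up
start=$((genesis + slot * seconds_per_slot))
due=$((start + 4))  # attestations are due 4 seconds into the slot

echo "slot $slot starts at unix time $start, attestation due by $due"
```

Compare the `due` time against the timestamp on the Published attestation log line to see whether you made the deadline.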
You can also use the validator_attestation_publication_delay metric to track publication times. The Teku Detailed dashboard includes graphs of this under the Validator Timings section.
Remember that neither logs nor metrics can identify when your system clock is incorrect, because the timings they use come from the system clock too. Make sure you’re running ntpd or chrony and that it reports the clock as in sync.
Correct Head Vote
If the attestation was published on time, we need to start checking whether it matched what the majority of other nodes produced. There isn’t a simple way to check this directly, but generally if the head block our attestation votes for turns out to be correct, we will almost certainly have agreed with the majority of other validators. The correct head 16 (100%) part of the attestation performance line shows how many attestations produced had the right head block. If that’s at 100% and the attestations were all published on time, there isn’t really much more your node can do.
Having some attestations with incorrect head votes may mean your node is too slow importing blocks. Note though that block producers are sometimes slow in publishing a block. These late blocks sometimes mean that the majority of validators get the head vote “wrong”, so it’s not necessarily a problem with your node when head votes aren’t at 100%. Even if it is your node that’s slow, we need to work out if the problem is in the beacon node or the execution client. Block timing logs can help us with that.
Block Timings
To dig deeper we need to enable some extra timing metrics in Teku by adding the --Xmetrics-block-timing-tracking-enabled option. This does two things. Firstly, when a block finishes importing more than 4 seconds into a slot (after attestations are due), Teku will log a Late Block Import line which includes a breakdown of the time taken at each stage of processing the block (albeit a very Teku-developer-oriented one). Secondly, it enables the beacon_block_import_delay_counter metric which exposes that breakdown as metrics. Generally, for any slot where the head vote is incorrect, there will be a late block import that caused it. We just need to work out what caused the delay.
An example late block log looks like:
Late Block Import *** Block: c2b911533a8f8d5e699d1a334e0576d2b9aa4caa726bde8b827548b579b47c68 (4765916) proposer 6230 arrival 3475ms, pre-state_retrieved +5ms, processed +185ms, execution_payload_result_received +1436ms, begin_importing +0ms, transaction_prepared +0ms, transaction_committed +0ms, completed +21ms
Arrival
The first potential source of delay is that the block just didn’t get to us in time. The arrival timing shows how much time after the start of the slot the block was first received by your node. In the example above, that was 3475ms which is quite slow, but did get to us before we needed to create an attestation 4 seconds into the slot. Delays in arrival are almost always caused by the block producer being slow to produce the block. It is however possible that the block was published on time but took a long time to be gossiped to your node. If you’re seeing late arrival for most blocks, there’s likely an issue with your node - either the system clock is wrong, your network is having issues or you may have reduced the number of peers too far.
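The arrival delay can be pulled straight out of the Late Block Import line and compared against the 4-second attestation deadline — a sketch using the example line above:

```shell
# The Late Block Import example line from above
line='Late Block Import *** Block: c2b911533a8f8d5e699d1a334e0576d2b9aa4caa726bde8b827548b579b47c68 (4765916) proposer 6230 arrival 3475ms, pre-state_retrieved +5ms, processed +185ms, execution_payload_result_received +1436ms, begin_importing +0ms, transaction_prepared +0ms, transaction_committed +0ms, completed +21ms'

arrival=$(echo "$line" | sed -E 's/.*arrival ([0-9]+)ms.*/\1/')

if [ "$arrival" -lt 4000 ]; then
  echo "block arrived ${arrival}ms into the slot, before attestations were due"
else
  echo "block arrived ${arrival}ms into the slot, after attestations were due"
fi
```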
Execution Client Processing
Post-merge, importing a block involves both the consensus and execution clients. The time Teku spends waiting for the execution client to finish processing the block is reported in the execution_payload_result_received value. In this case it was 1436ms, which would have been OK if the block hadn’t been received so late, but isn’t ideal. Under 2 seconds is probably OK most of the time, but under 1 second would be better. Execution clients will keep working on optimisations to reduce this time, so it’s worth keeping up to date with the latest version of your client.
Note that prior to Teku 22.9.1 this entry didn’t exist and the execution client time was just counted as part of transaction_prepared.
Teku Processing
The other values are all various aspects of the processing Teku needs to do. pre-state_retrieved and processed are part of applying the state transition when processing the block. begin_importing, transaction_prepared and transaction_committed record the time taken in various parts of storing the new block to disk. Finally, completed reports the final details, such as updating the fork choice records.
Prior to Teku 22.9.1, transaction_committed was a common source of delays because it updated the actual LevelDB database on disk. The disk update is now asynchronous, so unless the disk is exceptionally slow this value is generally only 0 or 1ms.
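Adding the stage deltas to the arrival time gives a rough point in the slot at which the import completed — a sketch using the timing portion of the example line above:

```shell
# Timing portion of the Late Block Import example line
line='arrival 3475ms, pre-state_retrieved +5ms, processed +185ms, execution_payload_result_received +1436ms, begin_importing +0ms, transaction_prepared +0ms, transaction_committed +0ms, completed +21ms'

arrival=$(echo "$line" | sed -E 's/.*arrival ([0-9]+)ms.*/\1/')

# Sum the per-stage deltas (the +Nms values) on top of the arrival time
total=$arrival
for delta in $(echo "$line" | grep -oE '\+[0-9]+ms' | tr -d '+ms'); do
  total=$((total + delta))
done

echo "import completed roughly ${total}ms into the slot"
```

In this example the import finished over a second after attestations were due, so an attestation produced in that slot couldn’t have voted for this block as head even though the block itself arrived in time.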
Next Steps
All these metrics let us understand where time was spent or where failures occurred. If your node is processing blocks quickly, publishing attestations on time and the system clock is accurate, there’s probably very little you can do to improve things - the occasional delayed or missed attestation isn’t unheard of and isn’t really worth worrying about.
Otherwise, these metrics and logs should give a fairly clear indication of which component is causing problems so you can focus investigations there and get help as needed.
Beacon REST API - Fetching Blocks on a Fork
When debugging issues on the beacon chain, it can be useful to download all blocks on a particular, potentially non-canonical fork. This script will do just that.
The script should work with any client that supports the standard REST API. Execute it with fetch.sh <BLOCK_ROOT> <NUMBER_OF_BLOCKS_TO_DOWNLOAD>
#!/bin/bash
set -euo pipefail

ROOT=${1:?Must specify a starting block root}
COUNT=${2:?Must specify number of blocks to fetch}

for i in $(seq 1 "$COUNT")
do
  # Fetch the block as JSON, keeping it in a temp file until we know its slot
  curl -s "http://localhost:5051/eth/v2/beacon/blocks/${ROOT}" | jq . > tmp.json
  SLOT=$(jq -r .data.message.slot tmp.json)
  PARENT=$(jq -r .data.message.parent_root tmp.json)
  mv tmp.json "${SLOT}.json"
  # Fetch the same block again in SSZ format
  curl -s -H 'Accept: application/octet-stream' "http://localhost:5051/eth/v2/beacon/blocks/${ROOT}" > "${SLOT}.ssz"
  echo "$SLOT ($ROOT)"
  # Walk back to the parent block for the next iteration
  ROOT=$PARENT
done
Blocks are downloaded in both JSON and SSZ format, and the script prints the slot and block root of each block as it downloads.
This is particularly useful when combined with Teku’s data-storage-non-canonical-blocks-enabled option which makes it store all blocks it receives even if they don’t wind up on the finalized chain.
Aggregators and DVT
Obol Network are doing a bunch of work on distributed validator technology and have hit some challenges with the way the beacon REST API determines if validators are scheduled to be aggregators.
Oisín Kyne has written up a detailed explanation of the problem with some proposed changes. Mostly noting it here so I can find the post again later.
Personally I’d like to avoid adding the new /eth/v1/validator/is_aggregator endpoint and just have that information returned from the existing /eth/v1/validator/beacon_committee_subscriptions endpoint, given it has to be changed anyway and the beacon node will have to perform that check as part of handling the call. Otherwise it seems simple enough to implement and is worth it to enable DVT to be delivered as middleware rather than having to replace the whole validator client.