Playbook: debugging a hanging pytest run in one shot
Audience. Future-me (or any agent) the next time a make agent-test / pytest -n auto run hangs without finishing. Specific to this codebase, where the most common cause of a hang is Temporal integration-test infrastructure (in-process test servers, xdist worker crashes, fixture teardown), but the iteration loop generalizes to any hanging suite.
Why this exists. In a recent session (Issue 2 fix to with_conditional_worker), I burned ~30 minutes across multiple test runs that all hung or got killed, requiring the user to flag the hang each time. Every individual decision was defensible; the iteration loop was broken. Following the playbook below, the whole thing — code fix, verify unit, verify integration, find the real failure, fix it, verify again — should finish in one autonomous pass.
What I did wrong (concrete, in order)
-
make agent-testis silent on success. I backgrounded it and waited for harness notification. The output file stayed 0 bytes for 14 minutes. The user had to flag it. → No progress signal at all. -
Re-ran with
-qflag. Even more aggressive buffering. Still 0 bytes for the whole run. → Same problem. -
Re-ran with
| tail -80. Pipe buffering means tail receives nothing until pytest exits. 0 bytes again. → Same problem. -
Set
--timeout=120 --timeout-method=thread. Believed this bounded the run. Wrong: per-test timeout doesn't catch fixture teardown hangs (which is whatboot_temporal'smanager.teardown()was doing), and it doesn't catch xdist's worker-crash-and-replace cycle which can loop indefinitely retrying failed integration tests. -
Never killed zombie
temporal-sdk-pythonprocesses between runs. They accumulated (18+ at one point, PPID=1 after their parents died). Each holds a port and consumes resources. Compounding contention made each new run slower and more likely to crash, which spawned more zombies. -
Two pytest sessions ran concurrently (mine + a leftover from the user's earlier attempt). I assumed
pkillof one PID killed the tree; it didn't. Contention multiplied. Should have done a blanketpkill -9 -f pytestfirst. -
Wrapped wait loops in
while kill -0 $PID; do sleep 10; done. Bounded by pytest's own runtime — i.e., not bounded at all when pytest hangs. No outer wall-clock cap. -
When pytest finally produced output, grep'd for "FAILED tests/" but missed the actual error. The smoking gun (
failure-message="Asynchronous pipeline execution is not enabled...") was buried at line ~1490 of a 1497-line log, inside a Temporal server log line — not in a pytest FAILED line. Should have grep'd for the error class name directly from the start. -
Trusted xdist output as the source of truth. Worker crashes ("node down: Not properly terminated") + worker replacement made it impossible to tell whether a failure was a real bug or xdist flakiness. The truth came from running serially.
The right iteration loop
Before any test run
pkill -9 -f "pytest" 2>/dev/null
pkill -9 -f "temporal-sdk-python" 2>/dev/null
sleep 1
ps -ef | grep -E "pytest|temporal-sdk" | grep -v grep | wc -l # must print 0
If that doesn't print 0, escalate (one of those processes is hung in uninterruptible I/O — investigate before continuing).
For first-pass change verification
Run only the tests directly touched by the change, serially, with -x (stop at first failure) and a short per-test timeout:
.venv/bin/pytest -x --timeout=30 -v tests/path/to/specific_test.py
Fast, fail-loud, no buffering questions. This is what verified the unit-test fix in 0.55s.
For broader verification
Run unit-only (no integration) with a per-test timeout and direct file redirect (no pipe):
.venv/bin/pytest --timeout=120 --timeout-method=thread -q \
-m "not (inference or llm or img_gen or extract or search) and not pipelex_api" \
tests/unit/ > /tmp/pytest_unit.log 2>&1 &
PID=$!
Then either wait $PID in a bash backgrounded with run_in_background: true so the harness notifies on exit, OR — if Temporal integration tests are in scope — wrap in an outer shell timeout so even a fixture-teardown hang gets killed:
timeout 480 .venv/bin/pytest ... > /tmp/pytest_full.log 2>&1
echo "exit=$?" # 124 = timeout fired
The 5555-test unit suite runs in ~52s serially. So 480s (8 min) is a generous outer bound that lets you see "did it finish" without ambiguity.
When the run hangs anyway
- Read the live log (
tail -50 /tmp/pytest_full.log) — direct-redirect means progress is visible as it happens. If-vwas used, you see the last test name that started. - Check active processes —
ps -ef | grep -E "pytest|temporal-sdk"shows which xdist worker is alive and how long it's been running. - Don't poll. If you've already wrapped in
timeout, the outer cap will kill it; let the harness notification fire.
When the run completes with FAILED
Grep for the error class name, not the formatted message:
grep -B 2 -A 10 "AsyncExecutionNotEnabledError\|YourSpecificErrorClass" /tmp/pytest_full.log
The formatted message can show up anywhere — including inside Temporal server log lines that aren't pytest FAILED markers. The error class name is greppable, stable, and points straight at the cause.
If the failures look infrastructure-related (worker crashes, "node down", "replacing crashed worker"), re-run the failing slice serially with no xdist:
.venv/bin/pytest --timeout=60 --tb=short -q tests/integration/path/that/failed/
If serial passes, the original failure was xdist contention — not your code. If serial fails, it's a real bug, and you now have a clean traceback to read (no xdist obscuring it).
Iteration template (the whole thing in one go)
# 0. Clean state
pkill -9 -f "pytest" 2>/dev/null
pkill -9 -f "temporal-sdk-python" 2>/dev/null
sleep 1
# 1. Fast verification of the change
.venv/bin/pytest -x --timeout=30 -v tests/unit/path/touched/ 2>&1 | tail -20
# 2. Broad unit verification, direct redirect, outer wall-clock cap
timeout 480 .venv/bin/pytest --timeout=120 --timeout-method=thread -q \
-m "not (inference or llm or img_gen or extract or search) and not pipelex_api" \
tests/unit/ > /tmp/pytest_unit.log 2>&1
echo "exit=$?" # 0=pass, 1=fail, 124=outer timeout
# 3. If any failures, grep by error class name
grep -B 2 -A 10 "YourErrorClass" /tmp/pytest_unit.log
# 4. If integration temporal tests are in scope, run them serially with shorter timeout
timeout 240 .venv/bin/pytest --timeout=60 --timeout-method=thread -q \
tests/integration/pipelex/temporal/ > /tmp/pytest_temporal.log 2>&1
echo "exit=$?"
Total worst-case wall clock: ~13 minutes (30s + 8min + 4min). If anything exceeds that, the outer timeout kills it.
Why each tool choice matters
| Choice | Why |
|---|---|
Direct file redirect (> file 2>&1) |
No pipe buffering; progress is visible live. |
-q only for final accepted runs, NOT debug runs |
-q buffers everything until session-end. Use -v for debug. |
--timeout=N --timeout-method=thread |
Per-test cap. thread method works in async contexts where signal doesn't. |
Outer shell timeout N |
Catches fixture teardown hangs and xdist worker-replace loops that --timeout cannot. |
-x for first-pass debug |
Stops at first real failure; gives you a clean traceback fast. |
Serial (no -n auto) for verifying flaky integration tests |
xdist worker crashes mask real failures and add runtime via replacement cycles. Serial is honest. |
pkill -9 -f temporal-sdk-python before every run |
Zombie test servers compound contention across runs. |
| Grep for error class name, not message | Messages get embedded in non-pytest log lines; class names don't. |
When to stop iterating and surface to the user
Bail out and ask for help (don't burn more time) when:
- After clean-state + outer-timeout run, the suite still doesn't terminate within 2× its expected duration.
- After serial re-run of the failing slice, you still can't tell whether the failure is your code or pre-existing flake.
- You've made a code change to fix a test failure and the same test now fails differently — pause and read the new failure carefully before changing more code.
- The error message points at sandbox / determinism / serialization issues you can't reason about from the surface (this is the Temporal sandbox class of bug — the one I hit).
In all four cases, write up what you know in 3–4 sentences and surface to the user. Better than spending another 20 minutes on dead ends.
Codebase-specific notes (Pipelex / Temporal)
make agent-testuses-n auto -q— both buffer output and obscure progress. For debug runs, invoke pytest directly with-vand direct file redirect.boot_temporalfixture teardown (intests/integration/pipelex/temporal/conftest.py) is the known fixture-teardown hang point. If a hang traceback mentionsmanager.teardown()ortemporal_hub.reset(), it's this.- xdist worker crashes on integration temporal tests often show as
[gw N] node down: Not properly terminatedfollowed byreplacing crashed worker. Treat as infrastructure flake until you've proven otherwise serially. is_in_temporal_workflow()is the right gate for code that might be invoked from inside a Temporal sandbox —get_config()is unsafe to read from inside, and many singletons return sandboxed-empty copies. This was the actual bug I hit; it would have been caught instantly by the playbook above if I'd run the integration tests serially on first iteration instead of trusting the xdist run.