Daily Journal — 2026-03-02
Today’s Overview
- What I did: Completed the full scGPT+UNI2 fusion experiment pipeline and weekly report visualization output on DCC; managed migration, restart, and progress monitoring of 9 Pi0.5 LoRA training jobs across two Tianhe nodes (an49/an53); fixed multiple critical issues in the eval pipeline and designed a GPU utilization optimization plan
- How I did it: Batch-evaluated 5 fusion strategies on DCC using phase2_evaluate.py and called visualize_from_cache.py to generate clustering visualizations; managed dual-node processes on Tianhe via a dual strategy of SSH plus srun --overlap, iteratively fixed eval pipeline issues including WebSocket timeouts and JIT concurrency, and ran 1-client vs 2-client concurrency comparison experiments
- Why it matters: Prepared complete experimental data for Monday’s presentation (QFormer avg ARI=0.370, +117% vs scGPT-only); discovered a key architectural constraint in STAIG; all 9 Pi0.5 training jobs are back online; eval pipeline is running stably; VLA inference optimization direction is clear (batched inference rather than multi-client concurrency); established an external dependency inventory and updated cluster access strategy
DCC
- What I did: Fixed scGPT cache metadata bug, ran 5 scGPT+UNI2 fusion experiments, generated 33 single-modality clustering visualizations and embedded them in the weekly report, and discovered via code tracing that STAIG does not use the gene encoder output
- How I did it: Batch-fixed the cache_version field in 11 .npz files, ran concat/mean/attention/llava_mlp/qformer evaluations in sequence, switched from mclust (which hung) to kmeans, and verified the STAIG architecture via three-layer code tracing (Fusion.py → runner.py → STAIGTrainer.py)
- Why it matters: scGPT+UNI2+QFormer avg ARI=0.370 (+117% vs scGPT-only 0.170); confirmed that STAIG does not use the gene encoder output (improvements should target the GNN structure instead); weekly report fully generated
Tianhe
- What I did: Monitored progress of 6 Pi0.5 training jobs on an49 (~11%–38%), migrated 3 queued jobs to an53 and restarted 6 accidentally terminated jobs; fixed WebSocket timeout/JIT concurrency issues in the eval pipeline; performed in-depth root cause analysis of low GPU utilization
- How I did it: Monitored progress via Slurm queue queries and log files, managed nodes via a dual strategy of SSH plus srun --overlap, iteratively fixed eval scripts (v3→v4→v5 single-GPU sequential version), ran concurrency comparison experiments and reverted to the stable single-GPU approach
- Why it matters: All 9 jobs are back to training; eval v5 single-GPU version is running stably; confirmed action chunking as the root cause of low GPU utilization; updated cluster node access strategy (SSH first → fallback to srun); completed VLA pipeline optimization plan design
Completed the full scGPT+UNI2 fusion experiment suite and weekly report visualization on DCC, identified the key architectural fact that STAIG does not use the gene encoder; managed migration and restart of 9 Pi0.5 LoRA training jobs across two Tianhe nodes, fixed multiple critical issues in the eval pipeline (WebSocket timeout, JIT concurrency crash), conducted in-depth root cause analysis of the ~10% GPU utilization in VLA inference, and completed the design of a batched concurrent eval optimization plan.
Today’s Tasks
Architecture & Strategy
- ✅ Run the full scGPT+UNI2 fusion experiment suite (5 strategies) and compile weekly report — After fixing the cache metadata in 11 scGPT .npz files, ran concat/mean/attention/llava_mlp/qformer fusion strategies in sequence. QFormer performed best (avg ARI=0.370), followed by LLaVA-MLP (0.316), both significantly outperforming scGPT-only (0.170). Generated a method comparison table and 3 statistical charts, called visualize_from_cache.py to produce 33 clustering visualizations across PCA/scGPT/UNI2 (switched from mclust to kmeans after mclust hung), and embedded all of them in weekly_report_20260301.md
- 🔄 Design VLA eval pipeline GPU utilization optimization plan — Reviewed 40+ related papers and designed a BatchedVLAServer (time-based request aggregation) + multi-worker parallel eval plan; user interrupted during ExitPlanMode, so the plan was not finalized
- 🔄 Fix eval pipeline and run rollout evaluation for 9 jobs — Fixed multiple bugs including MUJOCO_EGL_DEVICE_ID, WebSocket ping_timeout (added ping_timeout=None), and JIT concurrency crash (staggered startup); after concurrency experiments confirmed that multi-client does not improve throughput, reverted from multi-GPU parallel (v3/v4) to single-GPU sequential (v5); launched 20-trial evaluation for 9 jobs on GPU 3
Implementation & Fixes
- ✅ Pi0.5 LoRA dual-node job management (monitoring, migration, restart) — Monitored 6 jobs on an49 (~11%–38% progress), migrated 3 queued jobs to an53 (8 A800 GPUs idle), restarted 6 jobs on an49 that were accidentally terminated by killing the launcher (resumed from step ~4000); all 9 jobs are now running in parallel across two nodes; evaluation results: Stack_D0=95%, Stack_D1=100%, Coffee_D0 only 5%
- ✅ Updated cluster node access strategy — Changed GPU node access strategy from “Slurm only” to “SSH first (when there is an active job) → fallback to srun --overlap”; updated CLAUDE.md and project memory
- ✅ Created external file dependency inventory — Created docs/external_files_inventory.md, cataloging 10 categories of external dependencies (4 conda environments, 10 HDF5 datasets, 9 BC-RNN checkpoints, 4 VLA checkpoints, etc.), each with full path and reference location annotated; added a reference entry in the project overview summary
Issues & Solutions
Critical Issues
1. STAIG two-stage completely ignores gene encoder output in practice, using the raw HVG expression matrix as GCN input — meaning scGPT+UNI2+STAIG ≡ PCA+UNI2+STAIG
Solution: Confirmed via three-layer code tracing (models/Fusion.py:1246, pipeline/runner.py:445-446, STAIGTrainer.py) that the staig_gene_feat input path is entirely independent of gene_emb
Key insight: STAIG’s gene encoder and GNN training are decoupled by design; improvements to STAIG should focus on the GNN structure rather than replacing the gene encoder
2. Pi0.5 eval client reports ‘keepalive ping timeout’, causing every episode to fail immediately with SR at 0%
Solution: Added ping_timeout=None to the connect() call in websocket_client_policy.py to disable the default 20-second timeout
Key insight: JAX’s first inference pass requires JIT compilation (30–60s), which exceeds the WebSocket library’s default 20s keepalive timeout; all JAX applications with long initial inference times must explicitly set ping_timeout=None
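A minimal sketch of the fix, assuming the client is built on the Python `websockets` library (`connect_kwargs` is a hypothetical helper, not the actual code in websocket_client_policy.py): the change amounts to passing `ping_timeout=None` through to `connect()`.

```python
def connect_kwargs(jit_warmup_expected: bool) -> dict:
    """Build keyword arguments for websockets.connect() (hypothetical helper).

    A JAX policy server blocks 30-60s on JIT compilation for its first
    inference, exceeding the websockets default ping_timeout of 20s, so
    the client drops the connection ('keepalive ping timeout') before the
    first action ever returns.
    """
    kwargs = {"ping_interval": 20}        # keep sending keepalive pings
    if jit_warmup_expected:
        kwargs["ping_timeout"] = None     # never drop the link on a late pong
    return kwargs

# usage: websockets.connect(uri, **connect_kwargs(jit_warmup_expected=True))
```

The trade-off of `ping_timeout=None` is that a genuinely dead server is no longer detected automatically; here that is acceptable because each eval run supervises its own server process.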
3. Starting 5 JAX servers simultaneously causes ‘no close frame received’ server crashes; 2 clients connecting to the same server concurrently are 24% slower than one serial client (1771s vs 1428s)
Solution: Switched to staggered startup (v4): start each server one by one, wait for JIT warmup to complete before launching the next; abandoned single-server multi-client approach in favor of one independent server+client pair per GPU
Key insight: Multiple JAX processes JIT-compiling simultaneously contend for CPU and memory bandwidth; VLA server inference is strictly serial, so multi-client concurrency cannot improve throughput — the only correct parallelization is independent deployment across multiple GPUs
4. Pi0.5 eval pipeline GPU utilization is only ~10%, with large amounts of idle GPU time
Solution: Confirmed the root cause is action chunking: each trial has ~400 steps with inference every 50 steps = 8 GPU calls × 2.5s = 20s GPU work time / 200s total time; the optimization direction is batched inference (BatchedVLAServer) rather than adding concurrent clients
Key insight: Action chunking makes inference calls sparse; the true optimization should focus on aggregating multiple inference requests into batch processing, rather than trying to parallelize an already-serial inference path
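The aggregation idea can be sketched as a time-windowed batcher. This is a hypothetical minimal design, not the actual BatchedVLAServer plan: clients block on `submit()`, and a serving loop drains whatever arrived within one window into a single batched forward pass, so the sparse calls of many concurrent trials (8 calls × 2.5 s out of ~200 s each) can share GPU time.

```python
import queue
import threading
import time

class BatchedVLAServer:
    """Sketch of time-based request aggregation (assumed design)."""

    def __init__(self, infer_batch, window_s=0.05):
        self.infer_batch = infer_batch     # fn: list[obs] -> list[action]
        self.window_s = window_s           # aggregation window in seconds
        self.q = queue.Queue()

    def submit(self, obs):
        """Called by eval clients; blocks until the batched result is ready."""
        done = threading.Event()
        slot = {"obs": obs, "done": done, "out": None}
        self.q.put(slot)
        done.wait()
        return slot["out"]

    def serve_once(self):
        """Collect one window of requests, run ONE batched forward pass."""
        batch = [self.q.get()]             # block until at least one request
        deadline = time.monotonic() + self.window_s
        while time.monotonic() < deadline:
            try:
                batch.append(self.q.get(timeout=max(0.0, deadline - time.monotonic())))
            except queue.Empty:
                break
        outs = self.infer_batch([s["obs"] for s in batch])
        for slot, out in zip(batch, outs):
            slot["out"] = out
            slot["done"].set()
```

The window length trades a small added latency per call against batch size; with inference every 50 environment steps, a window of tens of milliseconds is negligible per trial.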
5. Killing the launcher PID caused all 6 training processes (including nohup-launched ones) to be collectively terminated by Slurm cgroup
Solution: Switched to launching each training job via an independent background srun --overlap command; these srun processes do not depend on the launcher and survive independently after SSH disconnects
Key insight: Subprocesses launched by nohup inside srun will still be terminated by Slurm cgroup when the srun command exits; each long-running job must be its own independent srun process
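A sketch of the resulting launch pattern (job id, training command, and log path below are placeholders, not the actual jobs): each run gets its own detached `srun --overlap` step instead of being a `nohup` child of one launcher.

```python
import shlex
import subprocess

def build_srun_cmd(jobid: str, train_cmd: str, log_path: str) -> list:
    """One independent `srun --overlap` step per training job.

    Each job owns its own Slurm step, so killing a launcher process no
    longer tears down sibling jobs; by contrast, nohup children of a
    single srun all die with that srun's cgroup when it exits.
    """
    return [
        "srun", f"--jobid={jobid}", "--overlap",
        "bash", "-c", f"{train_cmd} > {shlex.quote(log_path)} 2>&1",
    ]

def launch_detached(cmd):
    # start_new_session=True detaches from the controlling terminal,
    # so an SSH disconnect does not SIGHUP the srun process.
    return subprocess.Popen(cmd, start_new_session=True,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
```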
General Issues
6. SSH to compute nodes is rejected by pam_slurm_adopt; srun to an49 times out, making it impossible to query GPU status directly
Solution: Nodes with an active job (an53) allow direct SSH login; nodes without an active job use srun --jobid=XXXXX --overlap instead; training logs can also be read directly via the shared filesystem as a workaround
Key insight: pam_slurm_adopt requires the user to have an active job on the target node to SSH in; cluster access strategy should be “SSH first → fallback to srun --overlap”; the shared filesystem is available as a fallback information source
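The access strategy can be written down as an ordered fallback list (a sketch only; the probe command and shared-filesystem log path are placeholders):

```python
def access_plan(node: str, jobid: str) -> list:
    """Ordered ways to reach a compute node, per the 'SSH first' strategy.

    pam_slurm_adopt admits an SSH session only when the user has an active
    job on the node; failing that, attach a step to an existing allocation
    with srun --overlap; as a last resort, read logs off the shared FS.
    """
    return [
        ["ssh", node, "nvidia-smi"],                      # adopted only with an active job
        ["srun", f"--jobid={jobid}", "--overlap",
         "-w", node, "nvidia-smi"],                       # attach to the job's allocation
        ["tail", "-n", "50", f"/shared/logs/{node}.log"], # placeholder shared-FS fallback
    ]
```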
7. AI created a new visualization script instead of using the existing visualize_from_cache.py; mclust hangs on high-dimensional embeddings (512d/1536d)
Solution: After user correction, switched to the existing script; resolved the mclust hang by adding the --cluster_method kmeans parameter
Key insight: Before executing new tasks, always glob-search for existing tools in the project; mclust is not suitable for high-dimensional data — kmeans should be the default
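A sketch of the replacement clustering step (the function name is hypothetical; the real path is visualize_from_cache.py with --cluster_method kmeans): k-means scales linearly in dimension, whereas mclust's full-covariance Gaussian mixtures become very expensive at 512d/1536d, which is presumably why it hung.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_embeddings(emb: np.ndarray, n_clusters: int, seed: int = 0) -> np.ndarray:
    """Cluster high-dimensional embeddings (e.g. 512d scGPT, 1536d UNI2).

    Each k-means iteration costs O(n * k * d), so it stays tractable in
    high dimension where full-covariance mixture models stall.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(emb)
```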
Human Thinking vs AI Thinking
Strategic Level
Proposing Research Innovation Directions vs Code-Level Architecture Tracing
| Role | Approach |
|---|---|
| Human | Prof. Yi proactively proposed the innovative direction of using zero-shot embeddings for cross-sample layer5 patch queries (cross-sample query + HD patches) and identified the batch alignment challenge; Zijia implicitly assumed STAIG uses gene encoder output when asking about STAIG results |
| AI | AI focused on organizing existing experimental data and preparing the presentation, without proactively proposing research directions; however, through systematic three-layer code tracing, AI discovered that STAIG does not actually use gene encoder output, providing line-level evidence |
Analysis: Research innovation direction is driven by the advisor; AI has an advantage in systematic code-level tracing, uncovering architectural design details that documentation rarely reveals
GPU Resource Constraint Insights & Driving Parallel Eval Feasibility
| Role | Approach |
|---|---|
| Human | User observed that the eval client uses only ~400MB of VRAM and proactively suggested server and client could share a single GPU; user proactively asked whether 20 trials could run in parallel, which drove the entire concurrency experiment |
| AI | AI defaulted to server and client each occupying a separate GPU without proactively calculating actual client VRAM usage; AI designed serial eval without proactively considering trial-level parallelization |
Analysis: User identified key optimizations from actual resource constraints and goals; AI followed existing design patterns, lacking proactive optimization and resource accounting awareness
Pragmatism vs Over-Engineering & Limitations of Documentation Rules
| Role | Approach |
|---|---|
| Human | User decisively halted the complex parallel plan for GPUs 3–7, keeping only GPU 3 for stability; user pointed out that SSH to nodes should be the priority, correcting the outdated “never SSH directly” rule AI was following |
| AI | AI tended toward technically more complex parallelization schemes (v4 staggered startup); AI strictly followed CLAUDE.md documentation rules until corrected by the user |
Analysis: User prioritizes practicality and stability, avoiding over-engineering; AI’s reliance on documentation rules sometimes hinders optimal real-world practice
AI Limitations
Critical Limitations
- Did not proactively check the WebSocket library’s default ping_timeout configuration; did not anticipate that Slurm cgroup would terminate all background training processes when killing the launcher; lacked systematic testing of the entire eval pipeline, leading to multiple version iterations (v3/v4/v5) patching a single issue repeatedly
General Limitations
- Did not search for existing project tools before executing new tasks (created a new visualization script instead of using visualize_from_cache.py), and did not anticipate that mclust could hang on high-dimensional embeddings, leading to repeated rework
- Did not proactively calculate eval client GPU memory usage; followed outdated CLAUDE.md node access rules; did not clearly present the complete plan for user confirmation before ExitPlanMode; did not anticipate user preferences before launching multiple agents in parallel
Today’s Key Takeaways
Core Insights
- STAIG two-stage’s GCN uses the raw HVG expression matrix rather than gene encoder output, so gene encoder choice has no impact on STAIG results; improvements to STAIG should target GNN structure design
- JAX JIT compilation time (30–60s) exceeds WebSocket default ping_timeout (20s); multiple JAX processes JIT-compiling simultaneously contend for CPU and memory bandwidth. Solution: set ping_timeout=None in connect() + serial warmup before parallel execution
- Root cause of ~10% GPU utilization in VLA inference: action chunking (50-step sequences) means each trial requires only ~8 GPU calls (8×2.5s/200s=10%); Pi0.5 single-GPU inference is strictly serial, multi-client concurrency cannot improve throughput — the correct optimization is BatchedVLAServer batched inference + independent multi-GPU deployment
- Slurm cgroup mechanics: subprocesses launched by nohup inside srun will still be terminated after srun exits; each long-running task must be an independent srun process; pam_slurm_adopt requires an active job to SSH to a node; cluster access strategy: “SSH first → fallback to srun --overlap”
- Learned fusion significantly improves scGPT representation utilization: scGPT+UNI2+QFormer avg ARI=0.370 (+117% vs scGPT-only 0.170), showing that scGPT’s 512d representation has value but requires nonlinear projection to fully activate
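For reference, the ARI numbers above are sklearn's adjusted_rand_score and the +117% figure is plain relative improvement; the labels below are toy values for illustration only, not benchmark data.

```python
from sklearn.metrics import adjusted_rand_score

# Toy labels, only to show the metric's interface: chance-level
# agreement gives ARI near 0, perfect agreement gives 1.0.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 1, 1, 1, 1, 2, 2, 2]
ari = adjusted_rand_score(truth, pred)

# Relative improvement as reported: QFormer 0.370 vs scGPT-only 0.170.
improvement = (0.370 - 0.170) / 0.170   # 0.2/0.17 ~= 1.18, reported as +117%
```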
Practical Insights
- Full openpi benchmark eval workflow: training complete → start VLA server (pi05_benchmark_{task}_inference config, openpi05 env) → run evaluate_mimicgen.py client (mimicgen_env) → output success rate; Pi0.5 LoRA checkpoints save every 1000 steps by default, max_to_keep=1
Session Summaries
MIHD (Spatial Transcriptomics Benchmark)
✅ Full scGPT+UNI2 fusion experiment pipeline: STAIG architecture discovery, 5-strategy evaluation, weekly report visualization 02:51:19.717 | claude_code After Prof. Yi confirmed satisfaction with the bug fix, AI discovered through three-layer code tracing that STAIG two-stage does not actually use gene encoder output (scGPT+UNI2+STAIG ≡ PCA+UNI2+STAIG, avg ARI≈0.546). After batch-fixing cache metadata in 11 scGPT .npz files, ran 5 fusion strategies in sequence (QFormer best, avg ARI=0.370, +117% vs scGPT-only 0.170), generated a comparison table and 3 statistical charts for the weekly report. During visualization, AI mistakenly created a new script but was corrected by the user to use visualize_from_cache.py; after mclust hung, switched to kmeans and successfully generated 33 clustering visualizations embedded in section 6 of the weekly report. Prof. Yi also proposed the innovative research direction of cross-sample zero-shot querying.
Error Recovery Benchmark
🔄 External dependency inventory, Pi0.5 dual-node training management, eval pipeline fixes and GPU optimization plan design 03:10:32.470 | claude_code Created docs/external_files_inventory.md cataloging 10 categories of external dependencies. Monitored training progress on an49 via log files (6 jobs, ~11%–38%, 3 queued), migrated queued jobs to an53 (8 A800 GPUs idle), and restarted 6 accidentally terminated jobs on an49 after killing the launcher; all 9 jobs recovered. Updated cluster access strategy to “SSH first → fallback srun --overlap”. Iteratively fixed eval pipeline: WebSocket ping_timeout (added ping_timeout=None), JIT concurrency crash (staggered startup), and confirmed via concurrency experiments (2-client was 24% slower than 1-client) that multi-client is ineffective; ultimately reverted to single-GPU v5 version for stable operation. Entered plan mode, reviewed 40+ papers, and designed a BatchedVLAServer batched optimization plan; user interrupted at ExitPlanMode.
Token Usage
Overview
| Metric | Value |
|---|---|
| Total Tokens | 54,883,979 |
| Input Tokens | 110,299 |
| Output Tokens | 106,931 |
| Cache Write | 2,558,990 |
| Cache Read | 52,107,759 |
| Cache Hit Rate | 95.3% |
| Total Cost (USD) | $35.8983 |
Model Breakdown
| Model | Input | Output | Cache Write | Cache Read | Cost | Share |
|---|---|---|---|---|---|---|
| claude-opus-4-6 | 63,192 | 60,737 | 1,640,537 | 42,934,009 | $33.5547 | 93.5% |
| claude-haiku-4-5-20251001 | 47,107 | 46,194 | 918,453 | 9,173,750 | $2.3435 | 6.5% |
Usage by Device
| Device | Total Tokens | Input | Output | Cost |
|---|---|---|---|---|
| DCC | 637,595 | 9 | 435 | $0.4049 |
| tianhe | 54,246,384 | 110,290 | 106,496 | $35.4933 |