Daily Journal — 2026-03-02

Today’s Overview

  • What I did: Completed the full scGPT+UNI2 fusion experiment pipeline and weekly report visualization output on DCC; managed migration, restart, and progress monitoring of 9 Pi0.5 LoRA training jobs across two Tianhe nodes (an49/an53); fixed multiple critical issues in the eval pipeline and designed a GPU utilization optimization plan
  • How I did it: Batch-evaluated 5 fusion strategies on DCC using phase2_evaluate.py and called visualize_from_cache.py to generate clustering visualizations; managed dual-node processes on Tianhe via an SSH + srun --overlap dual strategy, iteratively fixed eval pipeline issues including WebSocket timeouts and JIT concurrency, and ran 1-client vs 2-client concurrency comparison experiments
  • Why it matters: Prepared complete experimental data for Monday’s presentation (QFormer avg ARI=0.370, +117% vs scGPT-only); discovered a key architectural constraint in STAIG; all 9 Pi0.5 training jobs are back online; eval pipeline is running stably; VLA inference optimization direction is clear (batched inference rather than multi-client concurrency); established an external dependency inventory and updated cluster access strategy

DCC

  • What I did: Fixed scGPT cache metadata bug, ran 5 scGPT+UNI2 fusion experiments, generated 33 single-modality clustering visualizations and embedded them in the weekly report, and discovered via code tracing that STAIG does not use the gene encoder output
  • How I did it: Batch-fixed the cache_version field in 11 .npz files, ran concat/mean/attention/llava_mlp/qformer evaluations in sequence, switched from mclust (which hung) to kmeans, and verified the STAIG architecture via three-layer code tracing (Fusion.py → runner.py → STAIGTrainer.py)
  • Why it matters: scGPT+UNI2+QFormer avg ARI=0.370 (+117% vs scGPT-only 0.170); confirmed that STAIG does not use the gene encoder output (improvements should target the GNN structure instead); weekly report fully generated
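
For reference, the two parameter-free strategies among the five can be sketched in a few lines (a toy numpy version; dimensions are illustrative, and the learned strategies — attention, llava_mlp, qformer — additionally need trained projection weights, so they are not reproduced here):

```python
# Toy sketch of the parameter-free fusion strategies evaluated above.
# The learned variants (attention, llava_mlp, qformer) require trained
# weights and are only named here.
import numpy as np

def fuse(gene_emb: np.ndarray, img_emb: np.ndarray, how: str = "concat") -> np.ndarray:
    if how == "concat":
        # e.g. 512d scGPT + 1536d UNI2 -> 2048d joint embedding
        return np.concatenate([gene_emb, img_emb])
    if how == "mean":
        # mean fusion assumes both embeddings were projected to a
        # shared dimension beforehand
        if gene_emb.shape != img_emb.shape:
            raise ValueError("mean fusion needs matching dimensions")
        return (gene_emb + img_emb) / 2.0
    raise ValueError(f"unknown strategy: {how}")
```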

Tianhe

  • What I did: Monitored progress of 6 Pi0.5 training jobs on an49 (~11%–38%), migrated 3 queued jobs to an53 and restarted 6 accidentally terminated jobs; fixed WebSocket timeout/JIT concurrency issues in the eval pipeline; performed in-depth root cause analysis of low GPU utilization
  • How I did it: Monitored progress via Slurm queue queries and log files, managed nodes via an SSH + srun --overlap dual strategy, iteratively fixed eval scripts (v3→v4→v5 single-GPU sequential version), ran concurrency comparison experiments and reverted to the stable single-GPU approach
  • Why it matters: All 9 jobs are back to training; eval v5 single-GPU version is running stably; confirmed action chunking as the root cause of low GPU utilization; updated cluster node access strategy (SSH first → fallback to srun); completed VLA pipeline optimization plan design

Completed the full scGPT+UNI2 fusion experiment suite and weekly report visualization on DCC, identified the key architectural fact that STAIG does not use the gene encoder; managed migration and restart of 9 Pi0.5 LoRA training jobs across two Tianhe nodes, fixed multiple critical issues in the eval pipeline (WebSocket timeout, JIT concurrency crash), conducted in-depth root cause analysis of the ~10% GPU utilization in VLA inference, and completed the design of a batched concurrent eval optimization plan.

Today’s Tasks

Architecture & Strategy

  • Run the full scGPT+UNI2 fusion experiment suite (5 strategies) and compile weekly report — After fixing the cache metadata in 11 scGPT .npz files, ran concat/mean/attention/llava_mlp/qformer fusion strategies in sequence. QFormer performed best (avg ARI=0.370), followed by LLaVA-MLP (0.316), both significantly outperforming scGPT-only (0.170). Generated a method comparison table and 3 statistical charts, called visualize_from_cache.py to produce 33 clustering visualizations across PCA/scGPT/UNI2 (switched from mclust to kmeans after mclust hung), and embedded all of them in weekly_report_20260301.md
  • 🔄 Design VLA eval pipeline GPU utilization optimization plan — Reviewed 40+ related papers and designed a BatchedVLAServer (time-based request aggregation) + multi-worker parallel eval plan; user interrupted during ExitPlanMode, so the plan was not finalized
  • 🔄 Fix eval pipeline and run rollout evaluation for 9 jobs — Fixed multiple bugs including MUJOCO_EGL_DEVICE_ID, WebSocket ping_timeout (added ping_timeout=None), and JIT concurrency crash (staggered startup); after concurrency experiments confirmed that multi-client does not improve throughput, reverted from multi-GPU parallel (v3/v4) to single-GPU sequential (v5); launched 20-trial evaluation for 9 jobs on GPU 3
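
The batch metadata fix on the 11 scGPT cache files amounts to a load-update-resave loop; a rough sketch, assuming each .npz stores a cache_version entry next to the embedding arrays (paths and the version string here are illustrative):

```python
# Rewrite an .npz cache file with an updated cache_version field.
# .npz archives cannot be edited in place, so the whole file is
# loaded, patched, and re-saved.
from pathlib import Path
import numpy as np

def fix_cache_version(path: Path, new_version: str = "v2") -> None:
    with np.load(path, allow_pickle=True) as npz:
        data = {k: npz[k] for k in npz.files}
    data["cache_version"] = np.array(new_version)
    np.savez(path, **data)  # overwrites the original file

# batch application (illustrative glob pattern):
# for f in sorted(Path("cache").glob("scgpt_*.npz")):
#     fix_cache_version(f)
```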

Implementation & Fixes

  • Pi0.5 LoRA dual-node job management (monitoring, migration, restart) — Monitored 6 jobs on an49 (~11%–38% progress), migrated 3 queued jobs to an53 (8 A800 GPUs idle), restarted 6 jobs on an49 that were accidentally terminated by killing the launcher (resumed from step ~4000); all 9 jobs are now running in parallel across two nodes; evaluation results: Stack_D0=95%, Stack_D1=100%, Coffee_D0 only 5%
  • Updated cluster node access strategy — Changed GPU node access strategy from “Slurm only” to “SSH first (when there is an active job) → fallback srun --overlap”; updated CLAUDE.md and project memory
  • Created external file dependency inventory — Created docs/external_files_inventory.md, cataloging 10 categories of external dependencies (4 conda environments, 10 HDF5 datasets, 9 BC-RNN checkpoints, 4 VLA checkpoints, etc.), each with full path and reference location annotated; added a reference entry in the project overview summary

Issues & Solutions

Critical Issues

1. STAIG two-stage completely ignores gene encoder output in practice, using the raw HVG expression matrix as GCN input — meaning scGPT+UNI2+STAIG ≡ PCA+UNI2+STAIG

Solution: Confirmed via three-layer code tracing (models/Fusion.py:1246, pipeline/runner.py:445-446, STAIGTrainer.py) that the staig_gene_feat input path is entirely independent of gene_emb

Key insight: STAIG’s gene encoder and GNN training are decoupled by design; improvements to STAIG should focus on the GNN structure rather than replacing the gene encoder

2. Pi0.5 eval client reports ‘keepalive ping timeout’, causing every episode to fail immediately with SR at 0%

Solution: Added ping_timeout=None to the connect() call in websocket_client_policy.py to disable the default 20-second timeout

Key insight: JAX’s first inference pass requires JIT compilation (30–60s), which exceeds the WebSocket library’s default 20s keepalive timeout; all JAX applications with long initial inference times must explicitly set ping_timeout=None

3. Starting 5 JAX servers simultaneously causes ‘no close frame received’ server crashes; 2 clients connecting to the same server concurrently are 24% slower than serial (1771s vs 1428s)

Solution: Switched to staggered startup (v4): start each server one by one, wait for JIT warmup to complete before launching the next; abandoned single-server multi-client approach in favor of one independent server+client pair per GPU

Key insight: Multiple JAX processes JIT-compiling simultaneously contend for CPU and memory bandwidth; VLA server inference is strictly serial, so multi-client concurrency cannot improve throughput — the only correct parallelization is independent deployment across multiple GPUs
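
The staggered startup reduces to a poll-until-ready loop between launches; a sketch, where the log path and the "ready" marker string are assumptions rather than the real server's output:

```python
# v4 staggered startup: launch one server per GPU, then block until
# its JIT warmup marker appears in the log before launching the next,
# so compiles never overlap and contend for CPU/memory bandwidth.
import time
from pathlib import Path

def wait_for_marker(log: Path, marker: str, timeout: float = 300.0,
                    poll_s: float = 2.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if log.exists() and marker in log.read_text(errors="ignore"):
            return True
        time.sleep(poll_s)
    return False

# per-GPU launch loop (launch_server and the marker are hypothetical):
# for gpu in range(5):
#     launch_server(gpu, log=Path(f"server_{gpu}.log"))
#     assert wait_for_marker(Path(f"server_{gpu}.log"), "JIT warmup done")
```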

4. Pi0.5 eval pipeline GPU utilization is only ~10%, with large amounts of idle GPU time

Solution: Confirmed the root cause is action chunking: each trial has ~400 steps with inference every 50 steps, i.e. 8 GPU calls × 2.5s ≈ 20s of GPU work out of ~200s total time (~10%); the optimization direction is batched inference (BatchedVLAServer) rather than adding concurrent clients

Key insight: Action chunking makes inference calls sparse; the true optimization should focus on aggregating multiple inference requests into batch processing, rather than trying to parallelize an already-serial inference path
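
The core of the time-based aggregation idea can be sketched with a plain queue; the function name, window length, and batch cap are all assumptions about the planned (and not yet finalized) BatchedVLAServer:

```python
# Time-based request aggregation: block for the first request, then
# keep collecting anything that arrives within a short window, and
# answer the whole batch with one model call instead of one call each.
import queue

def drain_batch(requests: "queue.Queue", window_s: float = 0.01,
                max_batch: int = 8) -> list:
    batch = [requests.get()]  # wait for the first request
    try:
        while len(batch) < max_batch:
            batch.append(requests.get(timeout=window_s))
    except queue.Empty:
        pass  # aggregation window elapsed with no new arrival
    return batch
```

The server loop would call drain_batch, run one batched forward pass, then route each action chunk back to its requester.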

5. Killing the launcher PID caused all 6 training processes (including nohup-launched ones) to be collectively terminated by Slurm cgroup

Solution: Switched to launching each training job via an independent background srun --overlap command; these srun processes do not depend on the launcher and can survive independently after SSH disconnects

Key insight: Subprocesses launched by nohup inside srun will still be terminated by Slurm cgroup when the srun command exits; each long-running job must be its own independent srun process
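
The surviving launch pattern is one detached srun step per job; a sketch as a command builder, where the job id, node name, and training command are placeholders:

```python
# Build one independent `srun --overlap` step per training job. Each
# step attaches to the existing allocation but lives in its own Slurm
# step, so killing the launcher (or an SSH disconnect) does not take
# the training process down with it, unlike nohup'd children inside
# the launcher's cgroup.
def srun_step_cmd(jobid: str, node: str, train_cmd: str) -> list:
    return ["srun", f"--jobid={jobid}", "--overlap", "-w", node,
            "bash", "-lc", train_cmd]

# launching (illustrative):
# import subprocess
# subprocess.Popen(srun_step_cmd("123456", "an49", "python train.py"),
#                  stdout=open("job.log", "a"), stderr=subprocess.STDOUT)
```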

General Issues

6. SSH to compute node rejected by pam_slurm_adopt; srun to an49 times out, making it impossible to get GPU status directly

Solution: Nodes with an active job (an53) allow direct SSH login; nodes without an active job use srun --jobid=XXXXX --overlap instead; training logs can also be read directly via the shared filesystem as a workaround

Key insight: pam_slurm_adopt requires the user to have an active job on the target node to SSH in; cluster access strategy should be “SSH first → fallback srun --overlap”; the shared filesystem is available as a fallback information source

7. AI created a new visualization script instead of using the existing visualize_from_cache.py; mclust hangs on high-dimensional embeddings (512d/1536d)

Solution: After user correction, switched to the existing script; resolved the mclust hang by adding the --cluster_method kmeans parameter

Key insight: Before executing new tasks, always glob-search for existing tools in the project; mclust is not suitable for high-dimensional data — kmeans should be the default
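
The kmeans fallback is essentially one scikit-learn call; a sketch on random high-dimensional embeddings, with shapes and cluster count illustrative:

```python
# KMeans scales to 512d/1536d embeddings directly, where the R-based
# mclust (Gaussian mixture fitting) hung on this data.
import numpy as np
from sklearn.cluster import KMeans

def cluster_embeddings(emb: np.ndarray, n_clusters: int,
                       seed: int = 0) -> np.ndarray:
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(emb)  # one integer cluster label per row
```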

Human Thinking vs AI Thinking

Strategic Level

Proposing Research Innovation Directions vs Code-Level Architecture Tracing

Human: Prof. Yi proactively proposed the innovative direction of using zero-shot embeddings for cross-sample layer5 patch queries (cross-sample query + HD patches) and identified the batch alignment challenge; Zijia implicitly assumed STAIG uses gene encoder output when asking about STAIG results
AI: Focused on organizing existing experimental data and preparing the presentation, without proactively proposing research directions; however, through systematic three-layer code tracing, discovered that STAIG does not actually use gene encoder output, providing line-level evidence

Analysis: Research innovation direction is driven by the advisor; AI has an advantage in systematic code-level tracing, uncovering architectural design details that documentation rarely reveals

GPU Resource Constraint Insights & Driving Parallel Eval Feasibility

Human: The user observed that the eval client uses only ~400MB of VRAM and proactively suggested the server and client could share a single GPU; the user also proactively asked whether 20 trials could run in parallel, which drove the entire concurrency experiment
AI: Defaulted to server and client each occupying a separate GPU without calculating actual client VRAM usage; designed serial eval without proactively considering trial-level parallelization

Analysis: User identified key optimizations from actual resource constraints and goals; AI followed existing design patterns, lacking proactive optimization and resource accounting awareness

Pragmatism vs Over-Engineering & Limitations of Documentation Rules

Human: The user decisively halted the complex parallel plan for GPUs 3–7, keeping only GPU 3 for stability; the user pointed out that SSH to nodes should be the priority, correcting the outdated “never SSH directly” rule AI was following
AI: Tended toward technically more complex parallelization schemes (v4 staggered startup); strictly followed CLAUDE.md documentation rules until corrected by the user

Analysis: User prioritizes practicality and stability, avoiding over-engineering; AI’s reliance on documentation rules sometimes hinders optimal real-world practice

AI Limitations

Critical Limitations

  • Did not proactively check the WebSocket library’s default ping_timeout configuration; did not anticipate that Slurm cgroup would terminate all background training processes when killing the launcher; lacked systematic testing of the entire eval pipeline, leading to multiple version iterations (v3/v4/v5) patching a single issue repeatedly

General Limitations

  • Did not search for existing project tools before executing new tasks (created a new visualization script instead of using visualize_from_cache.py), and did not anticipate that mclust could hang on high-dimensional embeddings, leading to repeated rework
  • Did not proactively calculate eval client GPU memory usage; followed outdated CLAUDE.md node access rules; did not clearly present the complete plan for user confirmation before ExitPlanMode; did not anticipate user preferences before launching multiple agents in parallel

Today’s Key Takeaways

Core Insights

  • STAIG two-stage’s GCN uses the raw HVG expression matrix rather than gene encoder output, so gene encoder choice has no impact on STAIG results; improvements to STAIG should target GNN structure design
  • JAX JIT compilation time (30–60s) exceeds WebSocket default ping_timeout (20s); multiple JAX processes JIT-compiling simultaneously contend for CPU and memory bandwidth. Solution: set ping_timeout=None in connect() + serial warmup before parallel execution
  • Root cause of ~10% GPU utilization in VLA inference: action chunking (50-step sequences) means each trial requires only ~8 GPU calls (8×2.5s/200s=10%); Pi0.5 single-GPU inference is strictly serial, multi-client concurrency cannot improve throughput — the correct optimization is BatchedVLAServer batched inference + independent multi-GPU deployment
  • Slurm cgroup mechanics: subprocesses launched by nohup inside srun will still be terminated after srun exits; each long-running task must be an independent srun process; pam_slurm_adopt requires an active job to SSH to a node; cluster access strategy: “SSH first → fallback srun --overlap”
  • Learned fusion significantly improves scGPT representation utilization: scGPT+UNI2+QFormer avg ARI=0.370 (+117% vs scGPT-only 0.170), showing that scGPT’s 512d representation has value but requires nonlinear projection to fully activate
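
For reference, the ARI figures quoted throughout compare predicted clusters to ground-truth labels corrected for chance; a standard implementation is scikit-learn's adjusted_rand_score, which is permutation-invariant, so relabeled but identical partitions still score 1.0:

```python
# ARI = 1.0 means identical partitions (up to label permutation);
# values near 0.0 mean the clustering is no better than random.
from sklearn.metrics import adjusted_rand_score

score = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])  # relabeled copy
print(score)  # 1.0
```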

Practical Insights

  • Full openpi benchmark eval workflow: training complete → start VLA server (pi05_benchmark_{task}_inference config, openpi05 env) → run evaluate_mimicgen.py client (mimicgen_env) → output success rate; Pi0.5 LoRA checkpoints save every 1000 steps by default, max_to_keep=1

Session Summaries

MIHD (Spatial Transcriptomics Benchmark)

✅ Full scGPT+UNI2 fusion experiment pipeline: STAIG architecture discovery, 5-strategy evaluation, weekly report visualization 02:51:19.717 | claude_code After Prof. Yi confirmed satisfaction with the bug fix, AI discovered through three-layer code tracing that STAIG two-stage does not actually use gene encoder output (scGPT+UNI2+STAIG ≡ PCA+UNI2+STAIG, avg ARI≈0.546). After batch-fixing cache metadata in 11 scGPT .npz files, ran 5 fusion strategies in sequence (QFormer best, avg ARI=0.370, +117% vs scGPT-only 0.170), generated a comparison table and 3 statistical charts for the weekly report. During visualization, AI mistakenly created a new script but was corrected by the user to use visualize_from_cache.py; after mclust hung, switched to kmeans and successfully generated 33 clustering visualizations embedded in section 6 of the weekly report. Prof. Yi also proposed the innovative research direction of cross-sample zero-shot querying.

Error Recovery Benchmark

🔄 External dependency inventory, Pi0.5 dual-node training management, eval pipeline fixes and GPU optimization plan design 03:10:32.470 | claude_code Created docs/external_files_inventory.md cataloging 10 categories of external dependencies. Monitored training progress on an49 via log files (6 jobs, ~11%–38%, 3 queued), migrated queued jobs to an53 (8 A800 GPUs idle), and restarted 6 accidentally terminated jobs on an49 after killing the launcher; all 9 jobs recovered. Updated cluster access strategy to “SSH first → fallback srun --overlap”. Iteratively fixed eval pipeline: WebSocket ping_timeout (added ping_timeout=None), JIT concurrency crash (staggered startup), and confirmed via concurrency experiments (1-client vs 2-client was 24% slower) that multi-client is ineffective; ultimately reverted to single-GPU v5 version for stable operation. Entered plan mode, reviewed 40+ papers, and designed a BatchedVLAServer batched optimization plan; user interrupted at ExitPlanMode.

Token Usage

Overview

Metric            Value
Total Tokens      54,883,979
Input Tokens      110,299
Output Tokens     106,931
Cache Write       2,558,990
Cache Read        52,107,759
Cache Hit Rate    95.3%
Total Cost (USD)  $35.8983

Model Breakdown

Model                      Input    Output   Cache Write  Cache Read  Cost      Share
claude-opus-4-6            63,192   60,737   1,640,537    42,934,009  $33.5547  93.5%
claude-haiku-4-5-20251001  47,107   46,194   918,453      9,173,750   $2.3435   6.5%

Usage by Device

Device  Total Tokens  Input    Output   Cost
DCC     637,595       9        435      $0.4049
tianhe  54,246,384    110,290  106,496  $35.4933