Weekly Report — 2026-W08 (2026-02-16 ~ 2026-02-22)
This week centered on the MIHD spatial transcriptomics project, completing a systematic survey of H&E Image-Only clustering methods (establishing an ARI 0.11–0.16 literature baseline), implementing three self-supervised enhancement schemes (SCAN boosted ARI from 0.251 to 0.303, +20.6%), and building the Vision Refinement two-stage fusion framework. Simultaneously on the tianhe cluster: Error Recovery Benchmark (M14 evaluation infrastructure validated, full 649-scenario evaluation launched) and the Phoenix pi0.5 reproduction data pipeline (18.4GB MimicGen dataset ingested, training config ready). Resolved multiple engineering blockers, including the STEGO NaN loss, a double normalization bug, a lerobot version conflict, and HuggingFace proxy issues. The Pi0.5 OOM and visualize_scene.py video validation remain blocked and carry over to next week.
Weekly Overview
| Metric | Value |
|---|---|
| Date Range | 2026-02-16 ~ 2026-02-22 |
| Active Days | 3 / 7 |
| Total Conversations | 9 |
| Projects Involved | 4 |
| Completed Tasks | 11 |
| In-Progress Tasks | 5 |
| Total Tokens | 28,509,501 |
| Total Cost | $14.13 |
| Daily Average Cost | $4.71 |
Project Progress
MIHD (Spatial Transcriptomics Clustering) (3 active days) — 🔄 active
Completed:
- Authored four core technical documents: RM-IDEAL bilingual specification, visual encoder usage guide, pathology PFM literature review, and UNI/UNI2 original paper benchmark analysis
- Conducted a systematic survey of H&E Image-Only methods, establishing DLPFC ARI literature baselines (SpaConTDS=0.16, stLearn=0.11), and analyzed the five root causes of Foundation Model failure
- Implemented and validated three self-supervised enhancement schemes (STEGO/BYOL+GAT/SCAN); SCAN achieved the best result at ARI=0.303 (baseline 0.251, +20.6%), with complementarity between its embeddings and gene features validated via fusion (mean fusion +0.065 ARI)
- Discovered and fixed a double normalization bug in eval_scan_fusion.py for STAIG; decided to switch to the correct run_benchmark.py path instead of patching the standalone script
- Integrated CacheManager embedding caching into run_benchmark.py (cache hits load in seconds; supports custom cache names such as scan_uni2)
- Implemented the Vision Refinement two-stage fusion framework (--vision_refine parameter, ~60 lines of minimally invasive integration); launched background batch experiments across 7 fusion strategies
- Batch-regenerated all 11 section visualizations (leveraging .npz cache, no re-inference required) and added an H&E original image panel
Blockers:
- ⚠️ SCAN fusion joint evaluation script (eval_scan_fusion.py) debugging incomplete due to coordinate dimension bug
- ⚠️ First Vision Refinement experiment (scan_cluster + concat, ARI=0.313) underperformed direct concat (0.387); root cause of self-supervised compression degrading feature diversity needs analysis
- ⚠️ ImageEncoder enhancement under Goal 7 of ENHANCEMENT_PLAN_CN.md not yet started
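As a rough illustration of the mean-fusion step referenced above (+0.065 ARI), here is a minimal numpy sketch; `fuse_mean` and `zscore` are hypothetical names, and the project's actual fusion code (which also handles dimension projection) is not shown.

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Per-feature standardization; eps guards constant columns."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def fuse_mean(vision: np.ndarray, gene: np.ndarray) -> np.ndarray:
    """Mean fusion: standardize each modality, then average.
    Assumes both modalities were already projected to equal width."""
    assert vision.shape == gene.shape, "project to a common dim first"
    return 0.5 * (zscore(vision) + zscore(gene))

rng = np.random.default_rng(0)
v = rng.normal(size=(100, 32))   # stand-in for SCAN visual embeddings
g = rng.normal(size=(100, 32))   # stand-in for PCA-reduced gene features
fused = fuse_mean(v, g)
```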
Error Recovery Benchmark & Phoenix pi0.5 Reproduction (1 active day) — 🔄 active
Completed:
- Drafted a complete Phase II execution plan with 7 steps covering Goals G1–G7 and Milestones M12–M15; critical path approximately 16 days
- Validated M14 baseline evaluation infrastructure (sanity check passed on 10 scenarios); launched the full 649-scenario CPU evaluation (exceeding the expected 454 scenarios by 43%)
- Completed the full Phoenix pi0.5 reproduction data pipeline: convert_mimicgen_to_lerobot.py, evaluate_mimicgen.py, OpenPI training config (100K steps, 4-GPU), downloaded and converted the 18.4GB MimicGen dataset (7–8/9 tasks completed)
- Diagnosed the Pi0.5 OOM blocker; standardized GPU access from SSH to srun --overlap
- Resolved lerobot 0.1.0 incompatibility with datasets>=4.0 (downgraded to 3.6.0)
- Established hf-mirror.com as the standard HuggingFace data access solution on the cluster
Blockers:
- ⚠️ Pi0.5 OOM unresolved (GPU VRAM short by 150MB); baseline evaluation still blocked
- ⚠️ visualize_scene.py force parameter extension complete but video validation blocked by SLURM node permission issues
- ⚠️ Pi0 VLA Server port conflict (port 5555 occupied) caused session interruption
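For the port-5555 conflict above, a small pre-launch check could detect the occupied port and fall back automatically. This is a hypothetical sketch (`port_free` and `pick_port` are illustrative names, not part of the VLA server codebase):

```python
import socket

def port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind the port, i.e. no server holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

def pick_port(preferred: int, fallbacks: range) -> int:
    """Use the preferred port if free, else the first free fallback."""
    for p in (preferred, *fallbacks):
        if port_free(p):
            return p
    raise RuntimeError("no free port found")

chosen = pick_port(5555, range(5556, 5560))
```

The same check replaces the manual `lsof -i:5555` step mentioned in the outlook.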
Key Tasks
- ✅ Systematic survey of H&E Image-Only clustering methods (2026-02-19) — Surveyed the full landscape of MILWRM/F-SEG/Deep Contrastive Clustering and related methods; verified image-only DLPFC ARI figures from ablation experiments (SpaConTDS=0.16, stLearn=0.11); conducted deep research into BYOL/STEGO/SCAN applications in pathology; organized the CV community’s four-level domain gap resolution framework
- ✅ Established MIHD technical documentation system (2026-02-19) — Created four core technical documents: RM-IDEAL bilingual structure document, visual encoder usage guide (12 chapters), pathology PFM literature review, and UNI/UNI2 original paper benchmark analysis (34 clinical tasks)
- ✅ Root cause analysis of Foundation Model failures in spatial domain recognition (2026-02-19) — Systematic analysis across five dimensions: training data domain mismatch, pretraining task mismatch, extremely small inter-layer morphological variation in brain tissue, feature redundancy, and lack of spatial context in single-patch encoding; supported by the brown repetitive patch phenomenon observed in UNI2
- ✅ Implemented Image-Only clustering enhancement schemes (STEGO/BYOL+GAT/SCAN) (2026-02-19) — Created four model files: STEGOHead, BYOLAdapter, SpatialGATRefiner, SCANHead; completed comparative testing on section 151673; SCAN achieved best ARI=0.303 (+20.6%)
- ✅ STAIG fusion double normalization bug investigation and architecture decision (2026-02-20) — Confirmed that eval_scan_fusion.py failed to pass staig_alignment_config, causing STAIGTrainer to apply StandardScaler internally a second time; decided to use run_benchmark.py’s correct path rather than fixing the standalone script
- ✅ Integrated embedding caching into run_benchmark.py (2026-02-20) — Introduced CacheManager; checks the cache before gene/vision encoding and skips encoder instantiation on a cache hit (loading takes seconds instead of a full re-encode); supports scan_uni2 custom cache names; the vision cache has three variants: standard/freq/staig_strict
- 🔄 M14 baseline evaluation infrastructure validation and full evaluation launch (2026-02-22) — Sanity check (60 episodes, ~7 minutes, SR=0% as expected); launched full CPU evaluation of 649 scenarios × 2 policies × 3 seeds (~3894 episodes); running in background
- 🔄 Phoenix pi0.5 reproduction full data pipeline setup (2026-02-22) — Wrote convert_mimicgen_to_lerobot.py and evaluate_mimicgen.py; configured pi05_base_mimicgen_phoenix training parameters; downloaded 18.4GB MimicGen dataset (9 tasks, 9000 demos) via hf-mirror.com; completed format conversion for 7–8/9 tasks
- 🔄 SCAN embedding and multimodal fusion joint evaluation (2026-02-19) — Wrote eval_scan_fusion.py to compare SCAN’s optimized 256-dim visual embeddings against PCA gene features across all fusion methods; mean fusion ARI +0.065; coords dimension bug partially fixed; script debugging ongoing
- 🔄 MIHD Vision Refinement two-stage fusion framework implementation and batch experiments (2026-02-22) — Added the --vision_refine parameter (scan_cluster/stego_refine/byol_spatial) in ~60 lines of minimally invasive integration; the first experiment's ARI=0.313 underperformed the baseline concat 0.387; batch experiments across 7 fusion strategies running in background
- ✅ Error Recovery Benchmark Phase II complete execution plan (2026-02-22) — Analyzed dependency relationships among Goals G1–G7 and Milestones M12–M15; drafted a 7-step execution plan; defined GPU allocation strategy (srun --overlap, ≥50% VRAM free); critical path approximately 16 days
- ✅ Error Recovery Benchmark baseline diagnosis and GPU access standard update (2026-02-22) — Confirmed Pi0.5 OOM (VRAM short by 150MB), BC-RNN obs key issue fixed, 649 scenarios ready; standardized GPU access in CLAUDE.md and MEMORY.md to srun --overlap
- 🚫 visualize_scene.py force parameter extension (2026-02-22) — Completed force_override/duration_override/settle_steps parameters and Phase 3 neutral action logic; video validation blocked by SLURM node permission issues
- ✅ STAIG fusion targeted comparative experiment (151673) (2026-02-20) — Independently tested staig_fusion × {UNI2_raw, SCAN(UNI2)}, ARI 0.3929/0.3880 respectively, nearly identical; confirmed that STAIG’s internal StandardScaler+PCA preprocessing cancels out SCAN’s optimization gains
- ✅ Added H&E panel to UNI2 visualizations and batch update (2026-02-19) — Switched to 1×3 layout (H&E + GT + prediction); batch-regenerated all 11 section visualizations using .npz cache; fixed 151510 via hires→lowres symlink
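The --vision_refine flag described above could be wired up roughly as follows. This is a hedged sketch of the CLI surface only; the actual ~60-line run_benchmark.py integration also dispatches to the refinement models, which is not shown here.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI surface described above; the real
    run_benchmark.py integration may differ in detail."""
    p = argparse.ArgumentParser(
        description="benchmark with optional vision refinement")
    p.add_argument(
        "--vision_refine",
        choices=["scan_cluster", "stego_refine", "byol_spatial"],
        default=None,
        help="optional self-supervised refinement applied to visual "
             "embeddings before multimodal fusion",
    )
    return p

args = build_parser().parse_args(["--vision_refine", "scan_cluster"])
```

Leaving the flag unset (default `None`) preserves the original single-stage pipeline, which is what makes the integration minimally invasive.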
Issues & Solutions
1. STEGO training loss was NaN throughout — 3639×3639 dense similarity matrix causes float32 exponential overflow at temperature=0.07 [MIHD] (2026-02-19)
Solution: Two-step fix: apply L2 normalization to input image_emb; replace InfoNCE with a numerically stable version (subtract row maximum before logsumexp, raise temperature to 0.1)
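The two-step fix can be sketched in numpy (the actual STEGO training code presumably uses torch; `stable_row_logsumexp` is a hypothetical name illustrating the row-max trick, not the project's implementation):

```python
import numpy as np

def stable_row_logsumexp(sim: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Row-wise log-sum-exp with the row maximum subtracted first,
    so exp() stays bounded in [0, 1] and never overflows float32."""
    logits = (sim / temperature).astype(np.float32)
    row_max = logits.max(axis=1, keepdims=True)
    return (row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))).squeeze(1)

rng = np.random.default_rng(0)
raw = rng.normal(size=(512, 64)).astype(np.float32) * 4.0  # unnormalized embeddings
with np.errstate(over="ignore"):
    naive = np.exp((raw @ raw.T) / 0.07)   # large dot products: exp overflows to inf

emb = raw / np.linalg.norm(raw, axis=1, keepdims=True)     # step 1: L2 normalize
lse = stable_row_logsumexp(emb @ emb.T, temperature=0.1)   # step 2: stable LSE
```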
2. MILWRM incorrectly classified as an Image-Only method — AI initial survey mixed multimodal methods into image-only results, requiring major revision of first-draft findings [MIHD] (2026-02-19)
Solution: Used WebFetch to read the full PMC paper and confirmed MILWRM is actually gene-expression-based; specifically targeted image-only data points from ablation experiments in papers such as SpaConTDS
3. Double normalization bug in eval_scan_fusion.py for staig_fusion (STAIGTrainer applies StandardScaler internally a second time) [MIHD] (2026-02-20)
Solution: Abandoned fixing the standalone script; switched to run_benchmark.py which already passes staig_alignment_config correctly, reusing the validated code path
4. run_benchmark.py re-instantiates the encoder to extract embeddings every run, lacking pipeline-level caching [MIHD] (2026-02-20)
Solution: Integrated CacheManager from pipeline/cache_manager.py; prioritizes loading from cache before encoding and writes to cache after extraction; skips encoder instantiation on cache hit
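The cache-first pattern can be reduced to a few lines; this sketch uses a plain .npz file and a hypothetical `get_embeddings` wrapper, and does not reflect CacheManager's real API:

```python
import tempfile
from pathlib import Path
import numpy as np

def get_embeddings(cache_dir: Path, name: str, encode_fn) -> np.ndarray:
    """Load embeddings from an .npz cache if present; otherwise run the
    (expensive) encoder once and persist the result."""
    path = cache_dir / f"{name}.npz"
    if path.exists():
        return np.load(path)["emb"]      # cache hit: no encoder instantiation
    emb = encode_fn()                    # cache miss: encode once
    np.savez(path, emb=emb)
    return emb

calls = []
def fake_encoder():
    calls.append(1)                      # stand-in for UNI2 inference
    return np.ones((5, 8), dtype=np.float32)

cache = Path(tempfile.mkdtemp())
a = get_embeddings(cache, "scan_uni2", fake_encoder)  # miss: encoder runs
b = get_embeddings(cache, "scan_uni2", fake_encoder)  # hit: loaded from disk
```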
5. lerobot 0.1.0 incompatible with datasets>=4.0 (torch.stack raises TypeError because dataset['column'] now returns a Column object instead of a list) [Error Recovery Benchmark / Phoenix pi0.5] (2026-02-22)
Solution: Downgraded datasets from 4.4.1 to 3.6.0 (<4.0); datasets 4.0 changed the dataset['column'] return type from list to Column object, while lerobot expects a list/tuple of tensors
6. Official HuggingFace downloads fail due to proxy (Squid 503); Python download scripts cannot connect [Error Recovery Benchmark / Phoenix pi0.5] (2026-02-22)
Solution: Switched to hf-mirror.com + wget; URL format: https://hf-mirror.com/datasets/{repo_id}/resolve/main/{file_path}; reachable via cluster HTTP proxy at 40–200MB/s
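The URL format above is simple enough to generate programmatically; `mirror_url` is a hypothetical helper, not part of any existing tooling:

```python
def mirror_url(repo_id: str, file_path: str, repo_type: str = "datasets") -> str:
    """Build an hf-mirror.com download URL in the documented
    /{repo_type}/{repo_id}/resolve/main/{file_path} layout."""
    return f"https://hf-mirror.com/{repo_type}/{repo_id}/resolve/main/{file_path}"

url = mirror_url("openai/example-repo", "data/train.parquet")
```

Where tooling supports it, an alternative is pointing huggingface_hub at the mirror via the HF_ENDPOINT environment variable (export HF_ENDPOINT=https://hf-mirror.com) instead of constructing URLs for wget.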
7. Unstable Slurm GPU node access: srun without --overlap causes the command to hang; direct SSH unreliable; multi-partition job submissions rejected [Error Recovery Benchmark / Phoenix pi0.5] (2026-02-22)
Solution: Standardized workflow: source set-XY-I.sh → squeue → srun --jobid=
8. Spatial coordinate dimension anomaly in eval_scan_fusion.py (becomes (1,2)), causing errors in multiple fusion methods [MIHD] (2026-02-19)
Solution: Abandoned calling load_spatial_coordinates() (barcode matching fails); switched to reading coordinates directly from adata.obsm['spatial']; fixed the return value unpacking error in load_dlpfc_data
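A fail-fast shape check at the read site would have caught the (1, 2) anomaly immediately. This is a hypothetical sketch (`load_coords` is not the project's function; the stub class stands in for an AnnData object):

```python
import numpy as np

def load_coords(adata) -> np.ndarray:
    """Read spot coordinates directly from adata.obsm['spatial'] and
    fail fast on shape anomalies like the (1, 2) case above."""
    coords = np.asarray(adata.obsm["spatial"], dtype=float)
    if coords.ndim != 2 or coords.shape[1] != 2:
        raise ValueError(f"expected (n_spots, 2) coordinates, got {coords.shape}")
    if coords.shape[0] <= 1:
        raise ValueError(f"suspiciously few spots: {coords.shape[0]}")
    return coords

class _FakeAdata:  # minimal stand-in for an AnnData object
    obsm = {"spatial": np.arange(20.0).reshape(10, 2)}

coords = load_coords(_FakeAdata())
```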
Lessons Learned
Domain Knowledge
- Pure Image-Only methods achieve only ARI 0.11–0.16 on the fine-grained DLPFC layer segmentation task (vs. 0.45–0.64 for multimodal methods), a result of extremely small inter-layer morphological differences in brain tissue combined with Foundation Model training domain mismatch. Notably, ablation experiments in multimodal papers almost never test image-only in isolation — this itself represents a meaningful research gap
- Five root causes of Foundation Model failure in spatial domain recognition: ① training dominated by cancer tissue (domain gap); ② pretraining tasks misaligned with inter-layer gradient recognition; ③ extremely subtle morphological differences between cortical layers; ④ high redundancy between image features and gene expression; ⑤ single-patch independent encoding lacks spatial positional context
- STAIG uses BYOL for unsupervised domain adaptation on target-dataset H&E patches (retaining the encoder after training, discarding projector/predictor) — this is a direct precedent for introducing unsupervised domain adaptation into spatial transcriptomics. BYOL’s negative-sample-free design is naturally suited to small-batch ST settings (thousands of patches per section)
- The CV community’s four-level framework for “domain gap + fine-grained task + no labels”: Level 1 direct pretrained feature clustering → Level 2 STEGO/SCAN feature refinement → Level 3 in-domain SSL repretraining (BYOL/MAE) → Level 4 dedicated foundation model; GPFM/CHIEF are the top-performing PFMs for spatial domain recognition ARI, UNI2 is best for spot retrieval
Architecture
- SCAN achieves the best ARI in image-only spatial transcriptomics (0.303, +20.6%); its core advantage is offline feature k-NN mining that decouples embedding learning from clustering, and its 256-dim optimized embeddings are genuinely complementary to gene features when fused (mean fusion +0.065 ARI)
- Two-stage fusion does not necessarily outperform single-stage: compressing visual embeddings from 1536d to 256d via scan_cluster resulted in multimodal fusion ARI (0.313) lower than direct concat (0.387), indicating that self-supervised compression loses raw feature diversity needed for fusion — the self-supervised clustering objective is misaligned with the downstream fusion task
- STAIG fusion’s internal StandardScaler+PCA preprocessing absorbs the gains from external embedding optimization (SCAN vs. UNI2_raw ARI difference only 0.005); embedding caches should store raw embeddings before normalization, with post-encoder normalization applied after loading to ensure consistency across different call paths
- VLA baseline evaluations must explicitly declare checkpoint provenance: evaluating with non-target-task fine-tuned models (pi0_libero, pi05_base) measures zero-shot cross-domain recovery capability — papers must proactively declare this experimental setup, and subsequent fine-tuned comparison experiments are needed to fully demonstrate dataset utility
- In complex experimental systems, reusing existing validated tool paths (e.g., run_benchmark.py) should be the priority; standalone scripts easily introduce hidden bugs such as preprocessing inconsistencies. Function signatures should be verified by reading source code in real time rather than relying on memory
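The offline k-NN mining behind SCAN's advantage (first Architecture bullet) can be sketched in a few lines of numpy; this is an illustrative reimplementation under stated assumptions, not the project's actual code, and `mine_knn` is a hypothetical name:

```python
import numpy as np

def mine_knn(emb: np.ndarray, k: int = 3) -> np.ndarray:
    """Offline k-NN mining on frozen embeddings: cosine similarity,
    then the k most similar *other* samples per row. These pairs act
    as SCAN-style pseudo-positives, decoupling representation quality
    from the clustering head that is trained afterwards."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)            # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]    # (n, k) neighbor indices

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 16))               # stand-in for UNI2 features
nbrs = mine_knn(emb, k=3)
```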
Debugging
- When computing InfoNCE contrastive loss on large-scale dense similarity matrices (n>3000), numerically stable log-sum-exp (subtracting row maximum before logsumexp) is mandatory; at float32 precision, temperature=0.07 exponential operations will overflow and produce NaN. This is a critical engineering constraint for large-scale contrastive learning implementations
Tools
- lerobot 0.1.0 has strict version constraints on datasets, requiring a pin to <4.0 (3.x recommended); datasets 4.0 changed the dataset['column'] return type to a Column object, causing torch.stack to raise TypeError. MimicGen and LIBERO obs/action formats are fully compatible (84×84 images, 8D state, 7D action), allowing direct reuse of OpenPI's LeRobotLiberoDataConfig
- The standard solution for accessing HuggingFace on HPC clusters in mainland China is hf-mirror.com; URL format: https://hf-mirror.com/datasets/{repo_id}/resolve/main/{file_path}; achieves 40–200MB/s via wget + HTTP proxy, and should be the default approach
- OpenPI's JAX training natively supports multi-GPU data parallelism — simply specify the GPU list via CUDA_VISIBLE_DEVICES and JAX automatically constructs a 2D mesh for parallelism without modifying TrainConfig. Slurm --overlap is the key parameter for running new commands on top of an existing interactive job, and is the core technique for Claude Code to access cluster GPU nodes
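A tiny launcher-side check of the device list can catch a malformed CUDA_VISIBLE_DEVICES before a long training run starts; `visible_gpus` is a hypothetical helper, not part of OpenPI:

```python
import os

def visible_gpus(env=None):
    """Parse CUDA_VISIBLE_DEVICES into an ordered GPU index list;
    JAX lays its data-parallel mesh over exactly these devices."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [int(d) for d in raw.split(",") if d.strip()]

gpus = visible_gpus({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})  # the 4-GPU config
```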
AI Usage Notes
Effective Patterns:
- ✓ Parallel sub-tasks for exploring project structure and drafting cleanup plans (MIHD organization task), reducing sequential exploration time
- ✓ Running three schemes (STEGO/BYOL/SCAN) in parallel as background GPU jobs for comparative validation, significantly reducing total experiment time
- ✓ Batch regeneration of visualizations using .npz cache, fully decoupling inference from visualization — 11/11 sections successful
- ✓ Used a temporary standalone script (_test_staig_scan.py) to test a single fusion method in isolation, reducing a 2-hour task to 30 seconds
- ✓ Independently explored JAX multi-GPU mechanics (no config changes needed) and hf-mirror.com as an alternative download solution, without requiring user intervention
- ✓ Minimal-invasive code modification (~60 lines) to insert Vision Refinement stage into run_benchmark.py, preserving architectural stability
Limitations:
- ✗ Insufficient accuracy in literature classification: incorrectly classified MILWRM as an image-only method, mixing multimodal results into the image-only survey; required two user interventions before converging on the correct research scope
- ✗ Lack of initiative in surfacing critical experimental assumptions: failed to proactively note the impact of using a LIBERO fine-tuned checkpoint on VLA evaluation validity; only expanded on this after user follow-up
- ✗ Tendency to rely on memory rather than real-time verification when using APIs: eval_scan_fusion.py exhibited repeated function signature/return value unpacking errors; source code should be Read before calling
- ✗ When facing Slurm permission issues, tended to exhaustively try multiple partitions (5+ attempts) rather than quickly asking the user for the correct account/partition information
- ✗ Insufficient awareness of background task status: triggered LeRobot dataset validation while data conversion was still in progress, causing false timestamp violation errors; repeated sleep polling was interrupted multiple times
- ✗ Defaulted to CPU for model validation in HPC environments, masking real performance issues and creating unnecessary interaction friction
Next Week Outlook
Next week focuses on three parallel tracks: ① MIHD: fix the eval_scan_fusion.py coordinate bug to complete the SCAN embedding and full fusion strategy joint evaluation; analyze batch experiment results across 7 fusion strategies and diagnose the root cause of Vision Refinement compression feature degradation; consider adjusting refinement hidden_dim or switching to stego_refine/byol_spatial methods. ② Error Recovery Benchmark: consolidate M14 full 649-scenario CPU evaluation results; resolve Pi0.5 OOM (request higher-VRAM GPU or reduce batch size); launch Phoenix pi0.5 100K-step 4-GPU training. ③ Engineering blockers: resolve SLURM node permission issue to complete visualize_scene.py video validation; resolve Pi0 VLA Server port conflict (lsof -i:5555 detection + fallback port mechanism); advance ImageEncoder enhancement under Goal 7 of ENHANCEMENT_PLAN_CN.md.
Token Usage Statistics
Daily Cost Trend
| Date | Tokens (millions) | Cost ($) |
|---|---|---|
| 2026-02-19 | 3.2 | 2.14 |
| 2026-02-20 | 19.3 | 10.00 |
| 2026-02-22 | 6.0 | 1.99 |
Peak Day: 2026-02-20 — $10.00 / 19.3M tokens
Claude Code
| Metric | Value |
|---|---|
| Total Tokens | 28,509,501 |
| Input Tokens | 93,255 |
| Output Tokens | 16,437 |
| Cache Creation | 2,761,832 |
| Cache Read | 25,637,977 |
| Total Cost | $14.13 |
Model Usage Distribution
| Model | Cost ($) | Input Tokens | Output Tokens |
|---|---|---|---|
| claude-opus-4-6 | 9.57 | 15,496 | 14,319 |
| claude-haiku-4-5-20251001 | 4.14 | 77,744 | 2,084 |
| claude-sonnet-4-6 | 0.42 | 15 | 34 |