Weekly Report — 2026-W11 (2026-03-09 ~ 2026-03-15)

This week, six parallel workstreams advanced across three machines (DCC, tianhe, TzJsDesktop): ①MIHD spatial transcriptomics uncovered a fundamental methodological flaw in cross-sample embedding (per-section independent processing yields incomparable feature spaces) and initiated a fix; ②ErrorRecoveryBenchmark progressed from bug fixes to a 13-skill/29-subtype scale-up, solved the Drop-skill object-not-falling issue, exposed the limitations of the online quota architecture, and established offline injection as the new direction; ③VLA-RoboTwin/pi05 made end-to-end progress from environment setup and training performance optimization (JAX version alignment, +33% speedup) to new data-variable collection and auxiliary-task experiments; ④the gadget toolchain completed an architectural upgrade (MCP Server + common/ shared package + unified output directory), and the research profiler moved to homepage-first student discovery; ⑤CalendarPro completed a 7-phase comprehensive optimization with all 230 tests passing and token consumption reduced by 40–60%; ⑥the gadget research toolchain integrated citation-graph analysis and produced deep profiles for 7 embodied-AI researchers.

Weekly Overview

| Metric | Value |
| --- | --- |
| Date Range | 2026-03-09 ~ 2026-03-15 |
| Active Days | 6 / 7 |
| Total Conversations | 29 |
| Projects Involved | 19 |
| Tasks Completed | 36 |
| Tasks In Progress | 10 |
| Total Tokens | 309,110,118 |
| Total Cost | $227.47 |
| Daily Avg Cost | $32.50 |

Project Progress

VLA-RoboTwin/pi05 (6 days active) — 🔄 active

Completed:

  • Successfully converted 50 RoboTwin episodes to LeRobot format (11,459 frames)
  • Diagnosed the 33% training time gap between pi05 and openpi; upgraded 6 key dependencies including JAX 0.5.0→0.5.3, compressing expected training time from 20h to 15h
  • Completed full end-to-end fix of the eval.sh runtime environment: upgraded torchvision to 0.22.1 and set conda CUDA_HOME, recompiled curobo from source to resolve ABI incompatibility
  • Implemented 5 new data variables for Place Dual Shoes (manip_progress_time/distance_left/right, target_endpose, target_joint), using a post-processing architecture that backpatches pickles after move() to resolve future-state dependencies
  • Designed and implemented four groups of manipulation progress prediction auxiliary experiments across 6 files (last_token vs special_token × time vs distance) under the JAX/Flax NNX framework; added stop_gradient isolation and ProgressConfig toggle
  • Fixed CheckpointWeightLoader missing_regex configurability and pi0.py LeRobot shape squeeze issues; training step-100 action_loss/aux_loss curves show normal descent

Blockers:

  • ⚠️ All four auxiliary experiment groups are blocked because the LeRobot dataset does not include the new fields; dataset must be re-converted
  • ⚠️ eval.sh defaults to checkpoint_id=5000 which does not exist; needs correction to an available value (15000/25000/29999)

ErrorRecoveryBenchmark (4 days active) — 🔄 active

Completed:

  • Fixed two critical bugs: discarded return value in monitor.update() and taxonomy label mapping; re-annotated 1,029 historical scenarios
  • Solved the Drop skill object-not-falling issue: calling mujoco.mj_step() for 15 physics steps bypasses OSC controller interference
  • Fixed 5 systematically failing skills (3 drop variants + grasp_misalignment + trajectory_regression + wrong_object); all 105 unit tests passing
  • Semantically split E2 Drop into 3 independent skills by recovery strategy (drop_in_transit / drop_at_wrong_place / drop_with_interaction), expanding the benchmark to 13 skills / 29 subtypes
  • Fixed Stack body name parsing silent failure; generated MP4 demo videos for 11 demo skills; completed v4 code archival
  • Completed v5.1 architecture planning (InjectionEngine refactor + speed limits + human demo collection pipeline); established milestone of beginning recovery training before April 1
  • v5 full run generated 231 scenarios and MP4s; first D0 round generated 207 scenarios

Blockers:

  • ⚠️ D0 scenario generation is still short of the 600-scenario target; 5 fixed root causes need re-validation
  • ⚠️ Coffee machine part disassembly (lid floating, base displaced) kinematic tree diagnosis is incomplete
  • ⚠️ v5.1 offline injection architecture implementation has not started

MIHD (Spatial Transcriptomics) (3 days active) — 🔄 active

Completed:

  • Completed 151673↔151508 cross-sample RM-IDEAL benchmark; Layer_1/5 positive correlation (r≤0.66), Layer_3 negative correlation reveals fusion embedding layer specificity
  • Implemented CrossModalEnhancer module (spatial neighbor KV sequence construction + symmetric InfoNCE); CPU-side three-mode tests passing
  • Worked around RTX 2080 Ti cuBLAS large-tensor bug (project to hidden_dim first before aggregating neighbors + mini-batch contrastive loss)
  • scGPT literature review confirmed zero-shot underperforms PCA/scVI, providing strategic evidence for gene encoder selection
  • Completed major MIHD output directory restructure (all 14+ file path references updated)
  • Identified fundamental methodological flaw in cross-sample embedding and initiated raw_shared shared HVG intersection (1,137 genes) baseline fix

Blockers:

  • ⚠️ 151676 STAIG embedding is all-zero (model collapse); GPU retraining failed due to PyTorch 2.9.0 + PyG CUDA conflict; cross-section visualization blocked
  • ⚠️ raw_shared embedding diagnosis still running; CrossModalEnhancer full GPU pipeline evaluation incomplete

gadget Toolchain (2 days active) — 🔄 active

Completed:

  • Wrapped 9 MCP tools using FastMCP + capture_stdout + asyncio.to_thread; refactored to content-return pattern (save parameter controls file writing)
  • Enhanced research_scout logging system (RotatingFileHandler dual output); added bioRxiv/PubMed multi-source support with zero new dependencies
  • Created 6 new common/ modules eliminating ~500 lines of duplicate code; paths.py unifies 6 path constants; .gitignore simplified to single-line outputs/
  • Implemented Homepage-Based student discovery (4-phase strategy: homepage-first + co-authorship supplement); completed deep profiles for 7 embodied AI researchers
  • Integrated research_scout.py as unified CLI entry (profile/citations subcommands); integrated Semantic Scholar citation graph API; added Hugo research section

Blockers:

  • ⚠️ Hugo deployment of 7 researcher profiles not yet completed
  • ⚠️ LLM-generated Chinese long-form JSON quote pollution issue unresolved

CalendarPro (2 days active) — ✅ completed

Completed:

  • Implemented gadget integration layer (ResearchScoutTool + DailySummaryTool + conda run cross-environment); auto-triggered at 8AM/11PM daily; 13 unit tests passing
  • Completed 7-phase comprehensive optimization (confidence threshold, hybrid routing, prompt simplification + Chinese token correction, exponential backoff, configurable scheduling weights, automatic threshold tuning, ThoughtStore cache)
  • Fixed 4 real misclassification scenarios; prompt token consumption reduced by 40–60%; all 230 tests passing

UniVLA/CALVIN Evaluation (2 days active) — 🔄 active

Completed:

  • Completed CALVIN dependency chain analysis (4 issues located); found evaluation is purely online simulation; extracted eval-only files (1.3GB → 600KB)
  • Added --single_gpu mode to bypass torchrun/DDP; fixed multiple hardcoded paths; installed braceexpand dependency

Blockers:

  • ⚠️ Full evaluation script pipeline not yet validated; still iterating through debugging

Key Tasks

  • CalendarPro 7-Phase Comprehensive Optimization (2026-03-15) — Implemented semantic routing confidence threshold, hybrid routing (Dense 70% + Keyword 30%), prompt simplification (530 lines → base + 11 fragments) + Chinese token correction (×1.5/character), exponential backoff retry, configurable scheduling weights, automatic threshold tuning feedback loop, ThoughtStore memory cache; fixed 4 real misclassification scenarios; token consumption reduced 40–60%; all 230 tests passing
  • gadget Research Toolchain CLI Integration + Citation Graph + Deep Profiles for 7 Researchers (2026-03-15) — Unified paper scout and researcher profiler under research_scout.py as a single CLI; added Semantic Scholar citation graph API (three-stage report auto-runs citation analysis on top-5 papers); completed deep profiles for Mingyu Ding / Ruoshi Liu / Xiaolong Wang / Shuran Song / Yunzhu Li / Yuke Zhu / Chelsea Finn; identified complete advisor relationship networks
  • 🔄 ErrorRecoveryBenchmark v5 Comprehensive Fix and Scale-Up to 13 Skills/29 Subtypes (2026-03-15) — Fixed 5 systematically failing skills; split E2 into 3 semantically independent skills; completed v4 archival; v5 full run generated 231 scenarios; first D0 round generated 207 scenarios (target: 600); completed v5.1 architecture planning (InjectionEngine + speed limits + human demo collection; recovery training to begin before April 1)
  • gadget common/ Shared Package Extraction + outputs/ Unified Directory Restructure (2026-03-15) — Created 6 new common/ modules (io/cache/json_utils/llm/hugo); eliminated ~500 lines of duplicate LLM call and JSON parsing code; paths.py unifies 6 path constants; .gitignore simplified to single-line outputs/; updated 4 CLAUDE.md files
  • gadget MCP Server Design, Implementation, and Tool Content-Return Refactor (2026-03-09) — Wrapped 9 MCP tools using FastMCP + capture_stdout + asyncio.to_thread; refactored from 'write file and return path' to 'return full content + optional save parameter'; established pip install -e . + console entry point distribution; all tools validated
  • 🔄 MIHD Cross-Sample Embedding Methodology Diagnosis and Fix (2026-03-15) — Identified dual incomparability from per-section independent HVG selection + independent PCA fitting; invalidated the false conclusion that ‘PCA outperforms STAIG = weak input features’; initiated raw_shared baseline with shared HVG intersection (1,137 genes); discovered STAIG’s layer-specific pattern: Layer_1/5 (SL@50=0.94–1.0) vs complete failure in intermediate layers
  • pi05 Training Performance Optimization: JAX Version Alignment + Dependency Conflict Resolution (2026-03-11) — Used parallel sub-agents to compare pyproject.toml/uv.lock/wandb logs; identified JAX version gap (0.5.0 vs 0.5.3) as root cause of 33% slower training due to accumulated XLA compiler optimizations; aligned 6 key dependencies; used uv override-dependencies to resolve lerobot torch<2.7 version constraint conflict; successfully completed uv lock (305 packages)
  • 🔄 pi05 Four-Group Manipulation Progress Prediction Auxiliary Experiment Design and Implementation (2026-03-14) — Implemented manip_progress auxiliary prediction head across 6 files in JAX/Flax NNX framework (last_token vs special_token × time vs distance); added stop_gradient isolation and ProgressConfig toggle; fixed CheckpointWeightLoader and LeRobot shape issues; training step-100 loss curves show normal descent
  • ErrorRecoveryBenchmark v5.1 Architecture Planning (2026-03-15) — Refactored ContextReplayEngine into InjectionEngine (direct recovery by injecting sim state at the target frame, bypassing VLA’s no-context-window assumption); added motion speed limits; designed keyboard teleoperation human demo collection pipeline; limited data source to MimicGen demos; established phased implementation plan for March 16–31
  • RoboTwin New Data Variable Post-Processing Collection Architecture (2026-03-13) — Used post-processing approach of backpatching pickles after move() to implement 5 new variables; resolved target_endpose/target_joint dependency on future states; fixed negative manip_progress_distance (np.clip to [0,1]); pkl2hdf5.py generic recursive design requires no modification
  • 🔄 VLA eval.sh Runtime Environment Full End-to-End Fix (2026-03-12) — Upgraded torchvision 0.22.1+cu126 to fix nms operator mismatch; set CUDA_HOME to conda targets directory and recompiled curobo from source to resolve ABI incompatibility; remaining issue: checkpoint_id=5000 path does not exist
  • gadget Homepage-Based Student Discovery Strategy Implementation (2026-03-15) — Implemented homepage_discovery.py module (~200 lines); 4-phase discovery strategy (homepage-first + co-authorship supplement); multi-strategy URL discovery (S2 homepage field + LLM suggestion + --homepage parameter); HTMLParser text extraction; 2MB limit + 7-day cache TTL; resolved the fundamental limitation of S2 co-authorship analysis failing completely for top-tier researchers

Problems and Solutions

1. Drop Skill: OSC controller actively maintains EEF position during env.step() (impedance control), causing the object to be held by fingers after the gripper opens and unable to fall freely [ErrorRecoveryBenchmark] (2026-03-15)

Solution: Bypass the controller by directly setting MuJoCo qpos/qvel, then call mujoco.mj_step() for 15 physics steps to complete initial separation before entering the standard control loop

2. MIHD Cross-Sample Embedding Comparison Invalid: per-section independent HVG selection + independent PCA fitting causes incomparable feature spaces; conclusion that ‘PCA outperforms STAIG’ is a methodological error [MIHD] (2026-03-15)

Solution: Switch to the raw_shared approach using shared HVG intersection (1,137 genes) + unified processing as the correct baseline; load directly from raw HDF5 rather than relying on per-section cache (which has a var_names integer-conversion bug)
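The raw_shared idea can be sketched with synthetic data (gene names, counts, and the variance-based HVG proxy below are illustrative; the real pipeline presumably uses scanpy-style HVG selection on the raw HDF5): select HVGs per section, intersect, and only then fit a single embedding over the shared columns.

```python
# Minimal sketch: per-section HVG selection followed by intersection, so
# every section lives in the SAME feature space before PCA/embedding.
import numpy as np

rng = np.random.default_rng(0)
genes = np.array([f"g{i}" for i in range(200)])
sections = {name: rng.poisson(1.0, size=(300, 200)).astype(float)
            for name in ("151673", "151508")}   # synthetic stand-ins

def top_hvgs(X, gene_names, n=120):
    # crude HVG proxy: rank genes by variance (real pipelines use
    # dispersion-based selection, e.g. scanpy's highly_variable_genes)
    order = np.argsort(X.var(axis=0))[::-1]
    return set(gene_names[order[:n]])

shared = set(genes)
for X in sections.values():
    shared &= top_hvgs(X, genes)                 # shared HVG intersection
shared = sorted(shared)
idx = np.array([np.where(genes == g)[0][0] for g in shared])

# All sections now share one column space; a PCA fit on the concatenation
# (or a frozen pretrained encoder) yields comparable embeddings.
stacked = np.concatenate([X[:, idx] for X in sections.values()], axis=0)
print(len(shared), stacked.shape)
```

Fitting PCA per section, by contrast, produces bases that are not even aligned in sign or order, which is exactly the dual incomparability the diagnosis identified.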

3. Stack Body Name Parsing Silent Failure: stack.yaml uses cubeA/cubeB, but the actual MuJoCo names are cubeA_main etc.; _sim_body_name2id returns -1, and Python negative indexing then silently reads the last body, so all task phase detection is misidentified as pre_reach [ErrorRecoveryBenchmark] (2026-03-15)

Solution: Fixed body name fields; added _main/_body0 suffix fallback logic in _sim_body_name2id; lookup failures now emit WARNING instead of silently returning -1
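A sketch of the suffix-fallback lookup (function and variable names here are illustrative, not the benchmark's actual code): try the YAML name as-is, then with the _main/_body0 suffixes, and emit a WARNING rather than silently returning -1, which Python would happily treat as a valid (last) index.

```python
# Suffix-fallback body lookup with loud failure instead of a silent -1.
import logging

logger = logging.getLogger("phase_detector")

def sim_body_name2id(name2id: dict, name: str) -> int:
    for candidate in (name, f"{name}_main", f"{name}_body0"):
        if candidate in name2id:
            return name2id[candidate]
    logger.warning("body %r not found (no suffix fallback matched)", name)
    return -1  # callers must check for -1 BEFORE indexing body_xpos

names = {"cubeA_main": 3, "cubeB_main": 4}
print(sim_body_name2id(names, "cubeA"), sim_body_name2id(names, "missing"))
```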

4. pi05 training 33% slower than openpi (20h vs 15h); intuition pointed to hardware differences, root cause unclear [VLA-RoboTwin/pi05] (2026-03-11)

Solution: Used parallel sub-agents to compare software layers (pyproject.toml/uv.lock/wandb logs); identified JAX version gap (0.5.0 vs 0.5.3) as root cause, with accumulated XLA compiler optimizations; used uv override-dependencies to resolve lerobot torch version upper-bound constraint conflict
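The override mechanism is a pyproject.toml setting; a hedged fragment of the kind involved (the exact pin below is illustrative, not the project's actual lockfile state):

```toml
# uv's override-dependencies replaces ALL transitive requirements for the
# listed packages, so lerobot's torch<2.7 upper bound no longer blocks the
# resolver. (Exact version pin is illustrative.)
[tool.uv]
override-dependencies = [
    "torch==2.7.1",
]
```

Unlike a constraint, an override is applied unconditionally, which is why it works for hard upper bounds declared by third-party packages.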

5. curobo precompiled .so ABI-incompatible with torch 2.7.1 (undefined symbol); JIT recompilation failed because conda CUDA header path is non-standard [VLA-RoboTwin] (2026-03-12)

Solution: Set CUDA_HOME to conda environment root, CPATH to targets/x86_64-linux/include/, then pip install -e . to recompile from source

6. Online quota generation severely imbalanced: premature_release naturally captured 7,233 entries, 7 types completely at zero; strategy behavior distribution uncontrollable [ErrorRecoveryBenchmark] (2026-03-09)

Solution: Established offline injection architecture: first do complete rollouts to collect trajectories, offline-detect injectable points to build an index, then selectively inject according to quota; skip already-satisfied types
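The quota logic decouples cleanly from rollout collection; a minimal sketch (data structures and names are hypothetical): index every injectable point per error type offline, then draw from the index until each type's quota is met, skipping types already satisfied.

```python
# Offline injection selection: precise per-type quota control, independent
# of how often each error type occurs naturally during rollouts.
from collections import defaultdict

def build_index(rollout_points):
    index = defaultdict(list)
    for etype, frame in rollout_points:    # (error_type, frame_id) pairs
        index[etype].append(frame)
    return index

def select_injections(index, quota):
    selected, counts = [], defaultdict(int)
    for etype, frames in index.items():
        for frame in frames:
            if counts[etype] >= quota:     # skip already-satisfied types
                break
            selected.append((etype, frame))
            counts[etype] += 1
    return selected

# Toy imbalance mirroring the problem: one type dominates, another is rare.
points = ([("premature_release", i) for i in range(50)]
          + [("drop_in_transit", i) for i in range(3)])
chosen = select_injections(build_index(points), quota=5)
print(len(chosen))  # 5 premature_release + 3 drop_in_transit = 8
```

Under the online scheme, the dominant type would consume nearly all captures; here it is capped at its quota regardless of how often it occurs.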

7. CalendarPro intent misclassification: no confidence threshold (0.52 treated as valid), time expressions misrouted by keyword router, short confirmation words lack context understanding, Chinese token estimation off by 3× [CalendarPro] (2026-03-15)

Solution: Added per-route confidence thresholds (0.40–0.60); introduced keyword scorer with 70/30 embedding hybrid routing; split system prompt into base + 11 fragments injected on demand; switched Chinese token estimate to ×1.5
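The hybrid router reduces to a few lines; a hedged sketch (scores below are toy stand-ins for the real embedding and keyword scorers, and route names are invented):

```python
# 70/30 embedding+keyword hybrid routing with per-route confidence
# thresholds and an LLM fallback for the "uncertain" case.
def route(embed_scores, keyword_scores, thresholds, default="llm_fallback"):
    best_route, best = None, -1.0
    for name in embed_scores:
        score = 0.7 * embed_scores[name] + 0.3 * keyword_scores.get(name, 0.0)
        if score > best:
            best_route, best = name, score
    # Nearest-neighbor always returns SOMETHING; the threshold lets the
    # router say "not confident" and defer to the LLM instead.
    if best < thresholds.get(best_route, 0.5):
        return default
    return best_route

print(route({"create_event": 0.52, "query": 0.31},
            {"create_event": 0.9},
            thresholds={"create_event": 0.55}))
```

With the keyword signal added, the 0.52 embedding score (previously treated as valid on its own) clears its route's threshold; without it, the same input would fall back to the LLM.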

8. S2 co-authorship analysis completely fails for top-tier researchers (Levine/Abbeel/Finn etc.) (depth-2 all empty); Xiaolong Wang/Shuran Song have severe same-name ambiguity [gadget] (2026-03-15)

Solution: Refactored to homepage-first strategy: prioritize scraping student lists from professors’ personal pages, with co-authorship as supplementary; multi-strategy URL discovery; same-name ambiguity flagged with WARNING recommending use of S2 authorId for precise lookup

9. VLA context replay architecture assumption incorrect: designed a full N-1 frame replay mechanism, but most VLAs have no context window, making this overhead useless [ErrorRecoveryBenchmark] (2026-03-15)

Solution: Refactored ContextReplayEngine into InjectionEngine that directly restores sim state at the injection frame; limited data source to MimicGen demo data for better controllability

10. RTX 2080 Ti + PyTorch 2.9.0 triggers cuBLAS CUBLAS_STATUS_EXECUTION_FAILED for high-dimensional tensors with N>3500 [MIHD] (2026-03-09)

Solution: First project to hidden_dim (128) with a Linear layer before indexing neighbors (avoids high-dimensional large tensors entering cuBLAS); switched InfoNCE to mini-batch contrastive loss (batch_size=512)
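The shape arithmetic behind the workaround, sketched with illustrative dimensions (the module structure of the real CrossModalEnhancer is not reproduced here): projecting before the neighbor gather means the large (N, k, ·) tensor carries 128 channels instead of the full input dimension, so it never reaches a cuBLAS matmul at the problematic size.

```python
# Project each spot to hidden_dim BEFORE gathering k spatial neighbors,
# keeping high-dimensional large tensors out of cuBLAS on the 2080 Ti.
import torch

N, k, input_dim, hidden_dim = 4000, 6, 3000, 128
x = torch.randn(N, input_dim)
neighbor_idx = torch.randint(0, N, (N, k))
proj = torch.nn.Linear(input_dim, hidden_dim)

h = proj(x)                  # (N, 128): one well-supported 2-D matmul
neighbors = h[neighbor_idx]  # (N, k, 128) instead of (N, k, 3000)
print(neighbors.shape)
```

The mini-batch InfoNCE change is complementary: computing the contrastive loss over batches of 512 keeps the similarity matrix at (512, 512) rather than (N, N).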

11. MCP Server tools write file and return path; AI cannot directly consume the content [gadget] (2026-03-09)

Solution: Refactored tools to bypass cmd_* wrappers and directly call underlying functions, returning full content (markdown/JSON); file writing controlled by a save parameter
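The content-return pattern, stripped of the FastMCP wrapper (function name and output layout below are hypothetical): the return value is the content itself, and file writing becomes an opt-in side effect.

```python
# 'Return full content, save optionally': the AI consumes the return value
# directly; the file on disk is a side effect controlled by `save`.
from pathlib import Path

def profile_tool(name: str, save: bool = False, out_dir: str = "outputs") -> str:
    content = f"# Profile: {name}\n\n(generated markdown...)\n"
    if save:
        path = Path(out_dir) / f"{name}.md"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)
    return content

print(profile_tool("demo").splitlines()[0])
```

Under the old 'write file and return path' pattern the model had to issue a second read to see its own tool output; returning content eliminates that round trip.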

12. pi0.py made incorrect assumptions about LeRobot internal behavior: inferred shape=(1,) features maintain (b,1) shape and modified code accordingly; actual LeRobot DataLoader squeezes to (b,) causing shape mismatch during training [VLA-RoboTwin/pi05] (2026-03-15)

Solution: Confirmed true shape by actually running training and observing logs (‘aux_targets[…]: (32,)@float32’); reverted original [:, None] and jnp.stack operations
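The squeeze behavior is easy to reproduce in isolation (numpy stand-in for the loader; the real check was done against training logs): stacking scalar features yields (batch,), not (batch, 1), so adding [:, None] on the assumption of a kept singleton axis produces the mismatch.

```python
# Why the [:, None] had to be reverted: scalar features come out of the
# loader as (batch,), not (batch, 1). Verify shapes at runtime, not by
# reading code.
import numpy as np

batch = np.stack([np.float32(0.7)] * 32)   # loader-style batch of scalars
assert batch.shape == (32,), batch.shape   # NOT (32, 1)

target = batch                             # use as-is; no [:, None] needed
print(batch.shape, target.dtype)
```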

Lessons Learned

Architecture

  • Cross-sample embedding comparison requires a shared feature space as a prerequisite: per-section independent HVG selection + independent PCA fitting = dual incomparability. A valid baseline must use shared HVG intersection + joint processing, or a foundation model with fixed pretrained weights
  • Direct state manipulation in MuJoCo fundamentally conflicts with feedback controllers (OSC): sim.forward() only updates kinematics; mujoco.mj_step() advances dynamics and bypasses the controller. Simulation injection design must explicitly choose one path
  • Error type semantic splitting should be based on ‘whether recovery strategies differ,’ not ‘whether injection mechanisms differ’: drop_in_transit / drop_at_wrong_place / drop_with_interaction have completely different detection conditions and recovery logic; even if injection actions are identical, they must be modeled separately
  • Semantic router architectural flaw: embedding nearest-neighbor always produces a result and cannot express ‘uncertain.’ Confidence threshold + fallback LLM + keyword scorer hybrid is the most practical fix pattern, generalizable to all vector-retrieval-based classification systems (RAG routing, tool selection, etc.)
  • MCP tools should prioritize AI consumption: return full content, with file writing as an optional side effect. General benchmarks should not assume models have a context window; InjectionEngine that directly restores sim state is more generalizable than context replay
  • For top-tier researchers (500+ papers), S2 co-authorship frequency analysis cannot identify students — the first-author signal is diluted by a massive number of collaborators. Professors’ personal pages explicitly list students, with reliability an order of magnitude higher. Citation graph (forward + backward) is a core feature of a research toolchain; ‘relevance’ should be decoupled from ‘citation count/popularity’
  • Offline injection architecture is better suited for building balanced error scenario datasets than online quota systems: decoupling 'exploring injectability' from 'executing injection' enables precise control of each error type's count; online natural capture is heavily influenced by policy behavior distribution and cannot control type balance

Debugging

  • A minor JAX version upgrade (0.5.0→0.5.3) can bring ~33% training speedup; the cumulative effect of XLA compiler optimizations should not be ignored. uv override-dependencies can forcibly ignore transitive dependency version constraints, an effective tool for resolving third-party library version conflicts
  • Compiling CUDA extensions in a conda environment: CUDA_HOME = conda environment root, CPATH = envs/<env>/targets/x86_64-linux/include/ (not /usr/local/cuda/include/); after a major torch version upgrade, all .so files that depend on the torch C++ ABI need recompilation
  • Assumptions about third-party framework internal behavior must be verified through actual runs: LeRobot auto-squeezes shape=(1,) scalar features to (batch_size,) during DataLoader; code inference is unreliable. Actual training config values must be verified from wandb logs, as code defaults may be overridden by CLI parameters
  • GPU monitoring inside K8s containers: scan /proc/<pid>/fd/ for /dev/nvidia* device symlinks + prioritize reading CUDA_VISIBLE_DEVICES to bypass PID namespace isolation; processes that open all GPU devices without consuming VRAM are usually monitoring tools and can be filtered accordingly
  • Silent failure is the most dangerous bug pattern: body_xpos[-1] negative indexing always returns the same position for two cubes; cached var_names integer-conversion caused gene name intersection to be zero. Any parsing failure should immediately emit WARNING rather than returning a sentinel value; cached data should be sanity-checked before use

Domain Knowledge

  • An independent benchmark (Genome Biology 2025) confirmed that scGPT zero-shot underperforms PCA/scVI; scGPT-spatial only compared against weak baselines (ARI ≈ 0.30–0.40), while SOTA (GraphST, ARI ≈ 0.55–0.63) was not included, with no independent third-party validation. When evaluating new methods, always verify whether their baselines represent current SOTA
  • CALVIN evaluation is purely online simulation; it does not read episode data at all, only requires validation/.hydra/merged_config.yaml; the 1.3GB dataset can be compressed to a 600KB eval-only version
  • Embodied AI researcher advisor lineage: Mingyu Ding ← Jitendra Malik, Ruoshi Liu ← Carl Vondrick, Xiaolong Wang ← Abhinav Gupta, Shuran Song ← Thomas Funkhouser, Yunzhu Li ← Antonio Torralba, Yuke Zhu ← Li Fei-Fei — showing a systematic output of students toward embodied AI from top perception/robotics advisor groups
  • Flow matching is becoming the mainstream action decoding architecture for VLAs. Pi0 time convention: t=1 is pure noise → t=0 is the target action. Pi0.5 uses adaRMS to inject time conditioning, outperforming simple concatenation. In VLA auxiliary tasks, stop_gradient isolating main task gradients is a safe starting point

Tools

  • On-demand prompt injection strategy: split system prompt into base (~50 lines) + intent-specific fragments (dynamically injected by classification), reducing token consumption by 40–60%. Chinese character token density is approximately 6× that of English characters (1.5 tokens/character vs 0.25 tokens/character); failing to correct this systematically underestimates context length
  • For projects with multiple tools, output directories should be organized by ‘file type first’ (outputs/reports/summarize/ rather than summarize/reports/), allowing .gitignore to be simplified to a single-line outputs/; Python re-export shim pattern (containing only from x import y; __all__ = [...]) is an elegant backward-compatible migration approach
  • PubMed esearch→efetch two-step E-utilities API can freely index metadata from subscription journals such as Nature/Cell/Science; bioRxiv API is equally open; both require no new dependencies (urllib.request); small-batch validation of pipeline feasibility is better than going straight to full scale
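The corrected token estimator from the prompt-optimization lesson can be sketched directly (the per-character ratios are the report's own rough figures, not model-exact values):

```python
# Mixed-language token estimate: ~1.5 tokens per CJK character vs ~0.25 per
# English character; skipping the correction underestimates Chinese text 6x.
def estimate_tokens(text: str) -> float:
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return 1.5 * cjk + 0.25 * other

print(estimate_tokens("schedule a meeting"), estimate_tokens("安排一个会议"))
```

The six-character Chinese phrase scores twice the eighteen-character English one, which is exactly the effect an uncorrected length-based estimate misses.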

AI Usage Notes

Effective Patterns:

  • ✓ Parallel sub-agents accelerate multi-dimensional code analysis: launching 3+ sub-agents simultaneously covering different file sets for dependency version diagnosis and codebase exploration significantly compresses analysis time
  • ✓ Goal-driven delegation + iterative debugging loop: user provides clear termination conditions (‘fix until no errors’), AI independently iterates run → error → minimal fix; built-in error correction mechanism
  • ✓ Deep codebase exploration identifies architecture-level challenges: proactively identified the single-spot KV degeneration issue in CrossModalEnhancer (each spot only has one vector) and proposed spatial neighbor KV sequence construction
  • ✓ sys.path hack → common/ package gradual refactoring: re-export shim pattern maintains backward compatibility while eliminating duplicate code
  • ✓ Small-batch pipeline feasibility validation (207 scenarios exposed 5 systemic defects) is better than going straight to full scale; end-to-end integration tests surface pipeline-level implicit dependencies better than unit tests

Limitations:

  • ✗ Insufficient ability to reflect on experimental conclusions: jumps from numerical results directly to attribution without proactively questioning the validity of experimental design (MIHD embedding methodology flaw required external user trigger to correct)
  • ✗ Silent failure patterns not proactively detected: Stack body name parsing returning -1 + Python negative indexing, cached var_names integer-conversion — both required user discovery due to lack of sanity checks
  • ✗ Over-engineering and incorrect architecture assumptions: VLA context replay based on the erroneous assumption that ‘all VLAs need a context window’; incorrect inference about LeRobot shape behavior leading to code modification — both required user correction or runtime verification
  • ✗ Insufficient ability to proactively question methodology applicability boundaries: when S2 student discovery failed, continued debugging code logic rather than proactively questioning the methodology’s own limitations; required user prompting to pivot to the homepage approach
  • ✗ Weak handling of Semantic Scholar same-name ambiguity: lacks proactive entity disambiguation for common Chinese-to-English name translations; LLM analysis also cannot automatically identify ambiguous researchers
  • ✗ API signatures not verified before use: FastMCP version parameter and conda --no-banner were both found incompatible only after runtime failure

Next Week Outlook

Next week (2026-W12) focus: ①ErrorRecoveryBenchmark v5.1 implementation — complete D0 scenario regeneration for 5 fixed skills (target: 600+ scenarios), advance InjectionEngine refactor, motion speed limits, and keyboard teleoperation human demo collection pipeline; milestone: begin recovery strategy training before April 1; ②VLA-RoboTwin/pi05 — re-convert LeRobot dataset (including 5 new fields such as manip_progress), start the four-group auxiliary experiment training and comparative analysis, correct eval.sh checkpoint_id for formal policy evaluation; ③MIHD — complete raw_shared baseline diagnosis and reach a methodological fix conclusion, resolve the 151676 GPU retraining issue (pin PyTorch version), evaluate CrossModalEnhancer full GPU pipeline performance; ④gadget/research — deploy 7 researcher profiles to the Hugo research section, explicitly require English quotes in prompts to eliminate LLM-generated Chinese JSON pollution; ⑤UniVLA — complete CALVIN evaluation full pipeline validation (--single_gpu mode).

Token Usage Statistics

Daily Cost Trend

| Date | Tokens (millions) | Cost ($) |
| --- | --- | --- |
| 2026-03-09 | 46.9 | 32.17 |
| 2026-03-11 | 30.5 | 20.75 |
| 2026-03-12 | 2.0 | 2.22 |
| 2026-03-13 | 3.0 | 2.23 |
| 2026-03-14 | 19.0 | 13.13 |
| 2026-03-15 | 135.3 | 100.70 |
| unknown | 72.5 | 56.27 |

Peak Day: 2026-03-15 — $100.70 / 135.3M tokens

Claude Code

| Metric | Value |
| --- | --- |
| Total Tokens | 309,110,118 |
| Input Tokens | 315,228 |
| Output Tokens | 1,023,671 |
| Cache Creation | 22,299,827 |
| Cache Reads | 285,471,392 |
| Total Cost | $227.47 |

Model Usage Distribution

| Model | Cost ($) | Input Tokens | Output Tokens |
| --- | --- | --- | --- |
| claude-opus-4-6 | 203.57 | 170,917 | 554,482 |
| claude-haiku-4-5-20251001 | 19.77 | 144,115 | 468,454 |
| claude-sonnet-4-6 | 4.12 | 196 | 735 |