Daily Report — 2026-03-17

Today’s Overview

  • What was done: Five research/tooling projects were advanced in parallel across two machines by multiple researchers. On DCC, the MIHD multimodal spatial transcriptomics project underwent comprehensive code refactoring. On tianhe, work proceeded in parallel on robobrain_pi robot training system fixes and optimizations, CALVIN dataset format conversion fixes, GPU monitor improvements, and multiple dataset availability investigations.
  • How it was done: Tasks were advanced in parallel through multiple methods: three-way parallel code review via /simplify, systematic state machine logic diagnosis, precise issue localization via schema files, dual GPU process filtering with FD scanning + parent process chain, JAX has_aux mechanism adaptation, and more.
  • Why it matters: Eliminated MIHD HD data OOM risk and robobrain_pi task state infinite loop bug; enabled independent wandb monitoring of three loss curves; fixed CALVIN conversion script to run correctly; reduced gpumon from 35 duplicate processes to 8 real processes with keyboard navigation support; confirmed MimicGen data links are an upstream unpublished issue requiring no code fix.

DCC

  • What was done: Comprehensive code refactoring of the MIHD spatial transcriptomics multimodal fusion project — fixed 9 code reuse and efficiency issues, and organized project planning documents.
  • How it was done: Launched three-way parallel code review (reuse/quality/efficiency) via /simplify, located key issues and fixed them one by one, then updated CLAUDE.md via /init and restructured plans.md.
  • Why it matters: Centralized the NEEDS_COORDS_FUSIONS constant (also fixed a missed adaln_attention latent bug), replaced O(N²) cdist with KDTree to prevent HD data OOM, eliminated 8 duplicate device resolution patterns, and restructured plans.md future directions into three temporal priority tiers.

tianhe

  • What was done: Multiple researchers advanced several projects in parallel: robobrain_pi task state reporting bug fix and training loss split monitoring, LIBERO custom suite integration confirmation, error recovery benchmark data quality verification; on the same day, fixed CALVIN RLDS→LeRobot conversion script runtime errors, eliminated duplicate process display in the GPU monitor and added keyboard navigation, and confirmed RoboCasa MimicGen pretrained data download failure as an upstream issue.
  • How it was done: Systematic code review located and fixed vla_infer.py state machine logic defects; precise dataset issue diagnosis by reading schema/config files such as features.json and box_links_ds.json; dual strategy of FD scanning + parent process chain deduplication to eliminate GPU monitor false positives; JSON and NPZ file analysis to validate benchmark data distribution.
  • Why it matters: Eliminated the infinite loop trigger bug after robobrain_pi task completion and enabled independent wandb monitoring of three loss curves; CALVIN conversion script now runs correctly; gpumon reduced from 35 duplicate processes to 8 with keyboard browsing of full commands; confirmed 2920 error scenes generated (while discovering a systemic issue where all threading phase annotations are pre_reach); confirmed MimicGen data links are an upstream unpublished issue.

DCC: comprehensive code refactoring of the MIHD multimodal spatial transcriptomics project (/simplify fixed 9 issues) with documentation updates. tianhe: multiple researchers in parallel completed robobrain_pi task state machine bug fixes and training loss split monitoring, LIBERO test suite integration confirmation, error recovery benchmark 2920-scene data quality verification, CALVIN format conversion script fix, GPU monitor deduplication and keyboard navigation addition, and RoboCasa MimicGen pretrained data download upstream root cause confirmation.

Today’s Tasks

Architecture & Strategy

  • robobrain_pi task state reporting bug fix — Analyzed and fixed 5 issues in vla_infer.py: incorrect None check order (potential crash), task completion without clearing current_prompt (infinite loop sending done), idle state not broadcast, inconsistent debug log threshold (chunk 3/4 silenced), and print message incorrectly labeled as manual annotation.
  • MIHD /simplify code review and refactoring — Conducted three-way parallel review (reuse/quality/efficiency) on 21 modified Python files in the MIHD project, fixing 9 issues: centralized NEEDS_COORDS_FUSIONS constant and added missing adaln_attention, replaced O(N²) cdist with KDTree to prevent HD data OOM, hoisted device resolution to eliminate 8 duplicate checks, reused DataPreparer instances, removed vestigial train_staig() wrapper, fixed duplicate n_pseudo_clusters assignment, removed unused imports, etc. All changes passed Python AST syntax validation.
  • robobrain_pi action_loss and task_loss split monitoring — Modified model.py abstract method return type to tuple[loss_array, dict], pi0.py returns (combined_loss, aux_dict) tuple, pi0_fast.py updated in sync with zero-padding, train.py uses has_aux=True to unpack auxiliary metrics and adds them to info dict, added independent action_loss and task_loss curves in wandb and progress bar.
  • RoboCasa MimicGen pretrained data download failure root cause diagnosis — Diagnosed the root cause of download_datasets --source mimicgen erroring on all tasks: box_links_ds.json contains zero MimicGen download links (0/350 entries), and only 60 of 317 tasks have mg_path registered — an upstream unpublished data issue requiring no code fix.
  • Error Recovery Benchmark generation statistics and issue diagnosis — Confirmed 2920 error scenes successfully generated across 6 tasks (coffee 1076, stack 499, three_piece_assembly 487, pick_place 326, stack_three 382, threading 150); analyzed task×error×difficulty distribution; found 7 D0 error types with fewer than 10 samples, and a systemic issue where all threading task trajectory phase annotations are pre_reach.
  • gpumon.py duplicate process bug fix — Fixed the GPU monitor tool displaying large numbers of duplicate processes: a process is now classified as GPU-using only with open /dev/nvidia* FD evidence (eliminating false positives from inherited environment variables), and parent process chain deduplication was added in get_gpu_procs (collapsing DDP workers and other child processes). Process count reduced from 35 to 8.

Implementation & Fixes

  • LIBERO libero_object_com test suite integration confirmation — Analyzed and confirmed libero_object_com suite integration is essentially complete: libero_suite_task_map.py, __init__.py (with LIBERO_OBJECT_COM class and libero_suites registration), and the bddl_files directory are all done; main.py has been updated with the default suite name and max_steps=300. No need to create init_files.
  • CALVIN dataset RLDS→LeRobot conversion script fix — Fixed multiple issues in rlds_to_lerobot.py: added progress bar, corrected dataset name (calvin_abc_d→calvin_abc), corrected observation key names (image→rgb_static, wrist_image→rgb_gripper), added automatic output directory creation and overwrite confirmation logic. Script now runs correctly.
  • robobrain_pi git workflow management — Multiple commits (command.txt update, vla_infer bug fix, loss split feature); resolved git proxy conflict (overrode local config to use working localhost:9999); branch switching, cherry-pick to sync command.txt to dev/mlp_old, and reverted an erroneous merge commit on main branch.
  • gpumon.py keyboard interactive navigation — Added nvitop-style keyboard interaction: up/down arrows to select process rows (highlighted in reverse), left/right arrows to horizontally scroll full command (10 characters at a time, with … overflow indicator), Esc to deselect, q to quit, dynamic bottom status bar.
  • MIHD project documentation (CLAUDE.md + plans.md) — Updated CLAUDE.md to change the needs_coords description in the “New Fusion Strategies” section to reference the NEEDS_COORDS_FUSIONS constant; cleaned up plans.md by integrating scattered raw notes at the bottom into formal sections, restructured future directions into three temporal priority tiers (near/mid/long-term), and distinguished between the current single-slice and near-term cross-slice two-stage architecture roadmap.
  • wandb directory-level account configuration — Provided shared server users with a solution for overriding global wandb login with personal accounts: primarily recommended direnv (.envrc setting WANDB_API_KEY) or exporting environment variables in ~/.bashrc, clarifying that WANDB_API_KEY takes priority over ~/.netrc.

Issues & Solutions

Critical Issues

1. MIHD’s refine_labels_spatial_majority uses scipy.spatial.distance.cdist to compute an all-pairs distance matrix. For HD spatial transcriptomics data (17K+ cells), memory scales as O(N²), a guaranteed OOM crash.

Solution: Replaced with scipy.spatial.cKDTree.query_ball_point, reducing memory complexity from O(N²) to O(N·k).

Key insight: Radius-neighbor queries only need to find neighbors within a cutoff, not an all-pairs distance matrix. KD-trees are the standard solution; cdist is appropriate only when the full pairwise matrix is genuinely needed, not for large-scale neighbor searches.
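The fix can be sketched as below. The function name comes from the report; its signature, the radius default, and the majority-vote details are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def refine_labels_spatial_majority(coords, labels, radius=50.0):
    """Majority-vote label smoothing via a KD-tree radius query.

    Memory is O(N*k) for k neighbors per point, instead of the O(N^2)
    distance matrix that scipy.spatial.distance.cdist would build.
    """
    tree = cKDTree(coords)
    # One radius query per point; each point's own index is included.
    neighbor_lists = tree.query_ball_point(coords, r=radius)
    refined = labels.copy()
    for i, neighbors in enumerate(neighbor_lists):
        values, counts = np.unique(labels[neighbors], return_counts=True)
        refined[i] = values[np.argmax(counts)]
    return refined
```

For 17K cells, the peak allocation drops from a 17K×17K float64 matrix (roughly 2.3 GB) to per-point neighbor lists.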

2. vla_infer.py does not clear current_prompt after task completion. In the next loop iteration, the prompt is unchanged, chunk_count is still ≥5, and the model score is very likely still high — immediately triggering another done message, causing an infinite loop of task completion reports.

Solution: Added current_prompt='' and _publish_state('idle') at the end of the task completion handling block, and changed the debug log threshold from <3 to <5 to cover all suppressed chunks.

Key insight: In state machine design, upon completion you must simultaneously reset the trigger condition AND broadcast the state change. Doing only one leaves a latent bug.
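A minimal sketch of the principle, built around the current_prompt and _publish_state names from the report; the class, its fields, and the trigger condition are otherwise hypothetical.

```python
class TaskStateMachine:
    """Sketch of the fixed completion handling.

    Both steps are required: clearing current_prompt resets the trigger
    condition, and publishing 'idle' tells upstream the task is no
    longer running. Doing only one reintroduces a latent bug.
    """
    def __init__(self, publish):
        self.current_prompt = ''
        self.chunk_count = 0
        self.state = 'idle'
        self._publish = publish

    def _publish_state(self, state):
        self.state = state
        self._publish(state)

    def on_task_completed(self):
        self._publish('done')        # report completion exactly once
        self.current_prompt = ''     # reset trigger: no prompt, no re-fire
        self.chunk_count = 0
        self._publish_state('idle')  # broadcast the state change

    def step(self, score, threshold=0.9):
        # Completion can only fire while a prompt is active.
        if self.current_prompt and self.chunk_count >= 5 and score >= threshold:
            self.on_task_completed()
```

Before the fix, the second call to step() with a still-high score would fire 'done' again; after it, the cleared prompt blocks the re-trigger.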

3. MIHD’s NEEDS_COORDS_FUSIONS set is independently maintained in both runner.py and evaluation_planner.py, and both locations omit the adaln_attention strategy, causing that strategy to fail at runtime due to missing spatial coordinates.

Solution: Defined a centralized constant NEEDS_COORDS_FUSIONS in Fusion.py (adding adaln_attention), and changed both locations to from models.Fusion import NEEDS_COORDS_FUSIONS.

Key insight: Maintaining duplicate copies of the same set inevitably produces inconsistencies. A Single Source of Truth is the fundamental solution to this class of latent bug.
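The centralization pattern looks roughly like this; the first two set members are hypothetical placeholders, while adaln_attention and the import path come from the report.

```python
# models/Fusion.py — the single definition point (module layout assumed)
NEEDS_COORDS_FUSIONS = frozenset({
    'spatial_attention',  # hypothetical entry, for illustration only
    'graph_fusion',       # hypothetical entry, for illustration only
    'adaln_attention',    # the strategy both duplicate copies had missed
})

# runner.py and evaluation_planner.py then share the one definition:
#   from models.Fusion import NEEDS_COORDS_FUSIONS
def fusion_needs_coords(strategy: str) -> bool:
    """True if this fusion strategy requires spatial coordinates."""
    return strategy in NEEDS_COORDS_FUSIONS
```

Any future strategy added in one place is immediately visible to every consumer, which is the whole point of the Single Source of Truth fix.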

4. pi0.py’s compute_loss only returns a combined loss array, and train.py only records a single loss curve — making it impossible to observe action_loss and task_loss training dynamics separately in wandb.

Solution: Changed compute_loss return type to (loss_array, aux_dict), and had train.py use has_aux=True parameter in nnx.value_and_grad to unpack the auxiliary dictionary, adding action_loss and task_loss fields to info.

Key insight: JAX’s has_aux mechanism is designed exactly for this scenario: carrying monitoring metrics without affecting backpropagation — a cleaner solution than global variables or duplicate computation.
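The mechanism can be sketched with plain jax.value_and_grad (the report uses nnx.value_and_grad, which takes the same has_aux flag); the toy model, parameter names, and loss terms are illustrative.

```python
import jax
import jax.numpy as jnp

def compute_loss(params, batch):
    """Returns (scalar_loss, aux_dict). With has_aux=True, gradients
    flow only through the first element; aux_dict rides along for
    logging without affecting backpropagation."""
    pred = params['w'] * batch['x']
    action_loss = jnp.mean((pred - batch['actions']) ** 2)
    task_loss = (params['w'] - batch['task_target']) ** 2
    loss = action_loss + task_loss
    return loss, {'action_loss': action_loss, 'task_loss': task_loss}

params = {'w': jnp.array(2.0)}
batch = {'x': jnp.array([1.0, 2.0]),
         'actions': jnp.array([1.0, 2.0]),
         'task_target': jnp.array(1.0)}

# (loss, aux) and grads come back together; aux goes straight to wandb.
(loss, aux), grads = jax.value_and_grad(compute_loss, has_aux=True)(params, batch)
```

The train-step change is then just unpacking the tuple and merging aux into the info dict, with no duplicate loss computation and no global side effects.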

5. All clean trajectory phase_labels for the threading task are pre_reach, so only collision_empty can be injected — severely imbalancing the benchmark data. Additionally, 7 D0 error types have fewer than 10 samples.

Solution: (Pending) Need to check whether threading task’s get_task_completion_stages() implementation correctly detects reach/grasp phases. D0 types with fewer than 10 samples require either more clean trajectories or targeted injection opportunities.

Key insight: The threading task (needle threading) has gripper-close detection incompatible with the general framework logic, requiring task-level customization. Data distribution imbalance should be monitored and balanced at the pipeline design stage.

6. gpumon.py displays large numbers of duplicate processes: all processes that inherited CUDA_VISIBLE_DEVICES (bash, ffmpeg, claude) and all DDP worker child processes are incorrectly classified as GPU processes. Of 35 displayed processes, only 8 are real GPU processes.

Solution: Dual filtering: ① classify a process as GPU-using only if it holds an open /dev/nvidia* FD; ② in default mode, collapse a process into its ancestor when the ancestor is also in the GPU list and shares the same GPU set.

Key insight: CUDA_VISIBLE_DEVICES alone cannot distinguish real GPU-using processes from those that simply inherited the environment variable; FD evidence is a more reliable indicator. In Kubernetes PID namespace isolation environments where nvidia-smi cannot display process info, scanning /proc/<pid>/fd is the alternative approach.
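Both filters can be sketched in a few lines. The function names and the parent_of mapping are illustrative, not gpumon's actual code; the /proc scan works on Linux only.

```python
import os

def has_nvidia_fd(pid: int) -> bool:
    """True only if the process holds an open /dev/nvidia* file
    descriptor, i.e. real GPU use rather than a merely inherited
    CUDA_VISIBLE_DEVICES. Reads /proc/<pid>/fd, which also works
    inside Kubernetes PID namespaces where nvidia-smi shows nothing."""
    fd_dir = f'/proc/{pid}/fd'
    try:
        for fd in os.listdir(fd_dir):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd closed between listdir and readlink
            if target.startswith('/dev/nvidia'):
                return True
    except (PermissionError, FileNotFoundError):
        pass  # not our process, or it already exited
    return False

def collapse_to_ancestor(procs, parent_of):
    """Parent-chain dedup sketch: drop a PID whenever some ancestor is
    also a GPU process (collapses DDP workers onto their launcher)."""
    gpu_pids = set(procs)
    kept = []
    for pid in procs:
        p = parent_of.get(pid)
        while p is not None and p not in gpu_pids:
            p = parent_of.get(p)
        if p is None:          # no GPU-using ancestor found
            kept.append(pid)
    return kept
```

In the real tool the GPU-set comparison from the report would gate the collapse as well; this sketch shows only the ancestor walk.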

7. RoboCasa MimicGen download command errors on all tasks, leading the user to assume it was a code bug.

Solution: Analysis of box_links_ds.json revealed it contains no MimicGen paths (0/350 entries), and only 60 of 317 tasks have mg_path registered. Conclusion: upstream data links are unpublished — no code fix possible.

Key insight: Error messages can originate from two different layers (no mg_path registered vs. registered but no Box link), which must be distinguished to correctly identify the root cause. Checking the config file directly is more efficient than analyzing error messages, and can sometimes reveal the root cause is an upstream data publishing issue rather than a local code bug.

8. vla_infer.py unconditionally calls .item() on task_completed before the None check. For non-pi05 models that return None, this immediately crashes with AttributeError.

Solution: Moved the None check before any print calls, and used isinstance to decide whether to call .item(), based on whether the value is a numpy array or a plain scalar.

Key insight: Defensive programming: None checks must come before any attribute access. The fact that it doesn’t currently crash is only because pi05 is always used in practice — not proof the code is correct.
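A minimal sketch of the defensive ordering; the function name and message format are illustrative.

```python
import numpy as np

def format_task_completed(task_completed):
    """None check first, then isinstance to decide whether .item() is
    needed (np.ndarray) or the value is already a Python scalar."""
    if task_completed is None:
        return 'task_completed: n/a (model does not report it)'
    if isinstance(task_completed, np.ndarray):
        value = task_completed.item()
    else:
        value = task_completed
    return f'task_completed: {value:.3f}'
```

The original code called .item() unconditionally before the None check, so any model returning None crashed with AttributeError before reaching the guard.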

General Issues

9. The server’s global git config has an unreachable proxy at 172.16.31.200:3138, causing all git fetch/push operations to fail.

Solution: curl testing revealed localhost:9999 is available (HTTP 200). Overrode the global proxy config using git config --local.

Key insight: git config’s local > global > system priority allows overriding global settings for a single repo without affecting other repos.

10. CALVIN dataset name mismatch (code had calvin_abc_d, actual directory is calvin_abc) caused tfds.builder to fail finding the version; incorrect observation key names (image vs rgb_static/rgb_gripper) caused KeyError; empty directory structure leftover from a previous run caused LeRobotDataset.create() to throw FileExistsError.

Solution: Discovered the name mismatch by checking directory structure; updated key names by reading features.json for the actual schema; added directory existence detection logic to prompt whether to overwrite.

Key insight: RLDS dataset field names vary by dataset — the actual schema must be confirmed via features.json, never assumed from other datasets or code template defaults. LeRobot’s create() does not support overwriting an existing directory and requires manual cleanup before calling.
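Both fixes can be sketched generically. The helper names are illustrative, the key mapping reflects the values from the CALVIN fix, and searching features.json as raw text is an assumption made to stay agnostic about how TFDS nests its metadata.

```python
import shutil
from pathlib import Path

# Key mapping discovered by reading this dataset's actual schema,
# never assumed from other RLDS datasets (values from the CALVIN fix).
OBS_KEY_MAP = {'image': 'rgb_static', 'wrist_image': 'rgb_gripper'}

def check_obs_keys(features_json_path, expected_keys):
    """Fail fast if expected observation keys are absent from the
    dataset's features.json (searched as raw text, an assumption,
    since TFDS metadata nesting varies by version)."""
    text = Path(features_json_path).read_text()
    missing = [k for k in expected_keys if k not in text]
    if missing:
        raise KeyError(f'keys not found in schema: {missing}')

def prepare_output_dir(path, ask=input):
    """LeRobotDataset.create() refuses an existing directory, so prompt
    before wiping a leftover tree from a previous run."""
    out = Path(path)
    if out.exists():
        if ask(f'{out} exists, overwrite? [y/N] ').strip().lower() != 'y':
            raise SystemExit('aborted: output directory already exists')
        shutil.rmtree(out)
    out.mkdir(parents=True)
```

Checking the schema up front would have caught the key-name and leftover-directory errors in one pass instead of three run-fail-fix iterations.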

Human Thinking vs. AI Thinking

Strategic Level

MimicGen Download Failure Root Cause

| Role | Approach |
| --- | --- |
| Human | Assumed it was a code problem and asked for help after triggering errors through the download command. |
| AI | Confirmed the root cause through layered inspection (script logic → registry registration → box_links_ds.json content): upstream data links were never published. Also distinguished between two different error message sources. |

Analysis: The user assumed a local code bug; AI’s systematic investigation revealed an upstream data publishing issue — a conclusion that changed the user’s course of action (no code fix needed).

Entry Point for vla_infer.py Bug Discovery

| Role | Approach |
| --- | --- |
| Human | Initiated review with “check how task state reporting classification is implemented and whether related changes have issues”, implicitly suspicious of a specific commit. |
| AI | Traced the data flow from pi0.py→real_policy.py→utils.py→vla_infer.py systematically and cataloged 5 bugs sorted by severity. Bug 2 (infinite loop) and Bug 3 (None crash) were independently discovered by AI. |

Analysis: The human had an intuitive target (suspected chunk counting logic); AI provided systematic coverage and priority ranking. The potential crash from the incorrect None check order was discovered by AI through code logic reasoning — the human’s focus was primarily on runtime behavior.

gpumon Duplicate Process Localization

| Role | Approach |
| --- | --- |
| Human | Directly observed the symptom (only one process on GPUs 6/7 but multiple shown) and provided an accurate problem description; no AI-driven discovery was needed. |
| AI | Ran debug scripts to analyze each process's FD set and CUDA_VISIBLE_DEVICES, discovered two independent root causes (environment variable false positives plus child process duplicates), and proposed a dual-filter solution. |

Analysis: Problem discovery came from user observation; technical root cause analysis and solution design came from AI.

Scope Decision for compute_loss Return Type Modification

| Role | Approach |
| --- | --- |
| Human | Described only the requirement (wandb should monitor three metrics independently); no implementation was specified and no explicit stance was taken on breaking the abstract interface. |
| AI | Chose to modify the BaseModel abstract method signature (requiring a synchronized zero-padding update in pi0_fast.py), citing JAX-native and cleanest design, over alternatives (duplicating loss computation outside train.py or global side effects). |

Analysis: AI’s approach exceeded the minimal-change principle in scope, but maintained code consistency and maintainability.

Error Recovery Benchmark Data Validation Strategy

| Role | Approach |
| --- | --- |
| Human | Asked “where are we in generation now”, implicitly interested in both progress and quality. |
| AI | First checked squeue to confirm job status, then file counts to confirm completion, then used JSON parsing for fine-grained statistics (classified by error×degree), proactively discovering and reporting two issues: D0 counts below 10 and threading phase annotation anomalies. |

Analysis: AI proactively discovered two data quality issues beyond the scope of the original question, reflecting a tendency toward proactive diagnosis.

AI Limitations

Important Limitations

  • Threading phase annotation all being pre_reach: AI can diagnose symptoms (check NPZ files and get_task_completion_stages code), but cannot actually run the robosuite environment to validate sensor readings — can only infer root cause and provide investigation directions.

General Limitations

  • AI tends to check external docs/remote resources before leveraging locally available information: during LIBERO suite analysis, multiple Explore subagents were launched in sequence despite git status already containing the answer; during CALVIN dataset diagnosis, external GitHub URLs were fetched before reading user-specified local code files. Should prioritize reading diff/status and local files first before deciding whether to explore further.
  • git remote authentication cannot be handled by AI: SSH keys, GitHub tokens, and VS Code credential helpers are all on the user’s side. AI can provide command-line solutions but cannot directly execute push; credential helpers are unavailable in server environments and require user intervention.
  • The CALVIN conversion script required multiple iterations (name error → key name error → directory exists error) before resolution, with each fix revealing the next problem only after running. The ability to foresee all issues in a single pass through features.json and a complete code review was lacking.

Today’s Learnings

Core Learnings

  • Task state machine design principle: a completion event must simultaneously do two things — reset the trigger condition (clear current_prompt) and broadcast the state change (_publish_state('idle')). Doing only one leaves a latent bug of either an infinite loop or upstream state blindness. Both are non-negotiable.
  • JAX has_aux mechanism: nnx.value_and_grad supports the has_aux=True parameter. compute_loss can return (loss_array, aux_dict) where gradients are computed only on loss_array, and aux_dict transparently carries monitoring metrics — the standard JAX functional design pattern for carrying auxiliary information.
  • Spatial data nearest-neighbor queries: for large-scale point clouds (>10K points), radius neighbor queries should prefer KD-trees (O(N log N) preprocessing) over cdist (O(N²) memory). For HD spatial transcriptomics (17K+ cells), the difference can be OOM vs. normal execution.
  • Reliable GPU process detection: in Kubernetes PID namespace isolation environments, nvidia-smi cannot display process information. Scanning for /dev/nvidia* FDs in /proc/<pid>/fd is a more reliable detection method, but FD evidence must be required to avoid false positives from inherited environment variables.
  • Single Source of Truth principle: constant sets referenced across multiple files (such as NEEDS_COORDS_FUSIONS) must be imported from a single definition point. Manually maintaining multiple copies inevitably produces inconsistencies, which often only surface in edge cases (like the missed adaln_attention).
  • Dataset download tool error messages can originate from multiple different levels (registry not registered vs. download link does not exist) — distinguishing between them is essential for correct root cause identification. Checking config files directly (e.g., box_links_ds.json) is more efficient than analyzing error messages, and can sometimes reveal the root cause is an upstream unpublished data issue rather than a local code bug.
  • Benchmark data quality: imbalanced data distribution (some D0 error types having <10 samples) is a systemic issue. Distribution monitoring and balancing mechanisms should be built into the pipeline design phase. General framework adaptation for specific tasks (like threading needle insertion) requires explicit task-level testing and validation.

Practical Learnings

  • RLDS format TFRecord dataset observation field names vary by dataset — the actual schema must be confirmed via features.json. Never rely on experience from other datasets or default key names in code templates.
  • On shared servers, WANDB_API_KEY environment variable takes priority over ~/.netrc (wandb login storage location), making it a lightweight solution for overriding the global account without modifying wandb config files.

Session Summaries

MIHD

✅ MIHD project code refactoring (/simplify 9 fixes) and documentation cleanup 13:54:46.661 | claude_code Conducted three-way parallel code review on 21 modified Python files in the MIHD spatial transcriptomics project, fixing 9 issues (KDTree replacing O(N²) cdist to prevent OOM, centralizing NEEDS_COORDS_FUSIONS constant and adding missing adaln_attention, hoisting device resolution to eliminate 8 duplicate checks, etc.). All changes passed Python AST syntax validation. Subsequently updated CLAUDE.md to reference the new constant, and reorganized plans.md by integrating scattered notes into a three-tier temporal priority future roadmap with an updated architecture timeline.

RoboBrain-Pi

✅ robobrain_pi task state bug fix, training loss split monitoring implementation, and git workflow management 11:09:28.373 | claude_code Systematically reviewed the task_completed data flow in vla_infer.py (pi0.py→real_policy.py→utils.py→vla_infer.py), discovered and fixed 5 bugs (task completion without clearing current_prompt causing infinite done loop, incorrect None check order causing potential crash, idle state not broadcast, incomplete debug log threshold, incorrect print message). Simultaneously implemented action_loss/task_loss split monitoring: four-file coordinated changes (model.py returns tuple, pi0.py carries auxiliary dict, pi0_fast.py zero-pads, train.py uses has_aux=True to unpack), with three independent curves now displayed in wandb. Handled multiple git workflow issues along the way including proxy conflict (local config overriding unreachable global proxy), cherry-pick to sync dev/mlp_old branch, and reverting an erroneous merge — completing multiple commits.

LIBERO-Benchmark

✅ LIBERO libero_object_com test suite integration analysis and confirmation 09:16:47.373 | claude_code Analyzed the file changes needed to add a custom libero_object_com test suite. Exploration revealed most work was already done: 10 task names already added in libero_suite_task_map.py, the LIBERO_OBJECT_COM class already registered in __init__.py, 10 .bddl files already in bddl_files/libero_object_com/, and main.py already updated with the default suite name and max_steps=300. The only uncreated directory, init_files, is unnecessary since main.py already has the related code commented out. Integration is complete with no additional changes needed.

Error-Recovery-Benchmark

🔄 Error Recovery Benchmark generation completion statistics and issue diagnosis 14:03:00.373 | claude_code Confirmed pipeline job completion: 2920 error scenes generated across 6 tasks (coffee most at 1076, threading least at 150). Analyzed distribution by task×error type×D0/D1/D2, finding 7 D0 subtypes with <10 samples (need supplementing) and all threading trajectory phase_labels set to pre_reach (affecting injection of other error types). Deep diagnosis revealed threading task phase annotation is incorrect and get_task_completion_stages() implementation needs review — deferred for follow-up.

CALVIN Dataset Converter

✅ Fixed multiple runtime errors in CALVIN RLDS→LeRobot conversion script and added progress bar 06:51:06.033 | claude_code User chenjunye requested a progress bar and discovered data directory structure mismatch with code configuration. Iteratively fixed dataset name mismatch (calvin_abc_d→calvin_abc), observation key name errors (image→rgb_static/rgb_gripper), and FileExistsError when output directory already exists — by reading directory structure and features.json. Added interactive confirmation logic for overwriting. Script now runs correctly.

GPU Monitor (gpumon)

✅ Fixed gpumon.py duplicate process display and added keyboard interactive navigation 03:01:28.167 | claude_code User chenxingping discovered gpumon.py showing multiple duplicate entries for a single actual process on GPUs 6/7. Debug analysis found two root causes: unrelated processes inheriting CUDA_VISIBLE_DEVICES (bash/claude/etc.) and DDP worker child processes being double-counted. Fix applied dual filtering (FD evidence requirement + parent process chain deduplication), reducing process count from 35 to 8. Subsequently added nvitop-style up/down arrow selection and left/right arrow command scrolling per user request.

RoboCasa MimicGen Data Download

✅ Diagnosed and confirmed upstream root cause of MimicGen pretrained data download failures 03:38:33.780 | claude_code User chenjunye encountered errors on all tasks when running download_datasets --source mimicgen. Systematic inspection (script logic → dataset_registry.py registration → box_links_ds.json content) revealed the root cause: the file contains zero MimicGen paths (0/350 entries), and only 60 of 317 tasks have mg_path registered — an upstream unpublished data issue. A follow-up session verified the key conclusions via grep and distinguished between two error message sources (no mg_path registered vs. registered but no Box link). Confirmed: no code fix needed.

chenxingping Environment Setup

✅ Configured directory-level personal wandb account on shared server 07:46:38.767 | claude_code Provided user chenxingping with a solution for using a personal wandb account in their personal directory on a shared server, primarily recommending direnv (.envrc setting WANDB_API_KEY) and ~/.bashrc environment variable export, clarifying that WANDB_API_KEY environment variable takes priority over ~/.netrc.

Token Usage

Overview

| Metric | Value |
| --- | --- |
| Total Tokens | 19,270,122 |
| Input Tokens | 9,271 |
| Output Tokens | 53,817 |
| Cache Creation | 1,756,812 |
| Cache Read | 17,450,222 |
| Cache Hit Rate | 90.9% |
| Total Cost (USD) | $14.8924 |

Model Breakdown

| Model | Input | Output | Cache Creation | Cache Read | Cost | Share |
| --- | --- | --- | --- | --- | --- | --- |
| claude-opus-4-6 | 1,530 | 19,996 | 923,018 | 7,217,146 | $9.8850 | 66.4% |
| claude-sonnet-4-6 | 1,334 | 21,046 | 507,280 | 6,418,277 | $4.1475 | 27.8% |
| claude-haiku-4-5-20251001 | 6,407 | 12,775 | 326,514 | 3,814,799 | $0.8599 | 5.8% |

Usage by Device

| Device | Total Tokens | Input | Output | Cost |
| --- | --- | --- | --- | --- |
| DCC | 5,753,481 | 1,268 | 14,740 | $7.0047 |
| tianhe | 13,516,641 | 8,003 | 39,077 | $7.8876 |