Daily Log — 2026-02-26

Today’s Overview

  • What was done: Rewrote the BC-RNN configs from low_dim to image mode (consistent with the MimicGen paper) and launched parallel training across 5 MimicGen tasks, while extending the evaluation framework to support multi-task evaluation.
  • How it was done: Attached to an existing SLURM interactive job via srun (bypassing sbatch permission restrictions), launched 5 training processes in parallel across 8 A800 GPUs; added a task registry and --task parameter to centralize task path and config management.
  • Why it matters: Resolved the fundamental conflict between an evaluation framework that only supported PickPlace and training that ran across 5 MimicGen tasks — laying a complete foundation for subsequent cross-task comparison of Pi0.5 vs BC-RNN.

Rewrote BC-RNN training configs to image mode on the HPC cluster and successfully launched 5-task parallel training. Extended the evaluation framework to support 5 MimicGen tasks, then identified and fixed the task distribution mismatch causing Pi0.5’s 0% success rate.

Today’s Tasks

Architecture & Strategy

  • BC-RNN image mode config rewrite (5 tasks) — Rewrote all 5 JSON configs under bc_rnn_configs/ from low_dim to image mode: ResNet18+SpatialSoftmax encoder, CropRandomizer 76×76, RNN hidden_dim 1000, batch_size 16, hdf5_cache_mode low_dim
  • BC-RNN image mode training launch (5-task parallel) — Bound to an existing interactive job via srun --jobid=45628, launched parallel training for coffee/stack/stack_three/threading/three_piece_assembly across GPUs 0–4, all confirmed running with image_obs=5
  • Extended evaluation framework to support 5 MimicGen tasks — Created 5 task YAML configs + task_registry.yaml; added --task parameter to scripts 1/3/4; fixed observation dimension errors caused by _D0 env suffix stripping; rewrote run_full_eval.sh to support multi-task loops
  • Root cause analysis for Pi0.5 evaluation 0% success rate — Confirmed root cause: Pi0.5 was trained on 5 MimicGen tasks (Coffee/Stack/Threading, etc.) but the evaluation framework only tested PickPlace, which was not in the training set
  • 🔄 Research on evaluating Pi0.5 on training tasks — Explored pi05_phoenix/evaluate_mimicgen.py and zhaoganlong’s evaluation pipeline, confirmed checkpoint locations, planned to launch VLA server and evaluate across 5 MimicGen tasks
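The task registry added to the evaluation framework can be sketched as a plain mapping keyed by task name. This is an illustrative reconstruction, not the actual task_registry.yaml: the _D0 environment names other than Coffee_D0 and the resolve_task helper are assumptions.

```python
# Hypothetical sketch of the task registry behind the new --task
# parameter; actual field names and YAML layout may differ.
TASK_REGISTRY = {
    "coffee": {"env": "Coffee_D0"},
    "stack": {"env": "Stack_D0"},
    "stack_three": {"env": "StackThree_D0"},
    "threading": {"env": "Threading_D0"},
    "three_piece_assembly": {"env": "ThreePieceAssembly_D0"},
}

def resolve_task(name: str) -> dict:
    """Look up a task entry, failing loudly on unknown task names."""
    if name not in TASK_REGISTRY:
        raise KeyError(f"Unknown task {name!r}; known: {sorted(TASK_REGISTRY)}")
    return TASK_REGISTRY[name]
```

Centralizing paths and env names this way is what lets scripts 1/3/4 share one --task flag instead of hard-coding PickPlace.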

Implementation & Fixes

  • Modified 3_collect_data.py to support image BC-RNN camera — Added camera_height/width parameters to create_env; extended camera detection logic to cover bc_rnn; VLA uses 256×256, BC-RNN uses 84×84 to match training data
  • SLURM permission workaround — sbatch submission failed (no permissions on either the xy-a800 or ai partitions); found the correct srun binary via source set-XY-I.sh, then used --jobid=45628 to attach to an existing interactive job

Problems & Solutions

Critical Issues

1. Pi0.5 evaluation success rate at 0%

Solution: Switch to evaluating on training tasks (Coffee/Stack/Threading/Assembly) instead of PickPlace

Key insight: The complete lack of overlap between the training dataset and the evaluation task was the root cause — not model quality. This critical finding was one the AI initially missed; it surfaced only through a human’s question.

2. BC-RNN evaluation dimension mismatch (Expected 65, got 37)

Solution: Added import mimicgen in create_env() to register _D0 environments; stopped stripping the _D0 suffix to preserve the full MimicGen observables

Key insight: MimicGen environments with the _D0 suffix (e.g., Coffee_D0) expose additional object-state observables. Stripping the suffix falls back to the base robosuite environment, causing observation dimensions to drop from 65 to 37.
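The fix can be distilled into a small sketch. The build_env_kwargs helper below is hypothetical (the real logic lives in create_env() in 3_collect_data.py); it illustrates the two fixes this log describes: keep the _D0 suffix, and pick the camera size per policy.

```python
def build_env_kwargs(env_name: str, policy: str) -> dict:
    """Hypothetical distillation of the fixed create_env() logic.

    In the real script, `import mimicgen` runs first -- its side effect
    registers Coffee_D0 and the other _D0 variants with robosuite, so
    robosuite.make() can resolve the full name. The suffix must NOT be
    stripped: "Coffee" resolves to the base robosuite env and loses the
    object-state observables (obs dim drops from 65 to 37).
    """
    # VLA was trained on 256x256 frames, BC-RNN on 84x84 (per this log).
    size = 256 if policy == "vla" else 84
    return {
        "env_name": env_name,        # full name, e.g. "Coffee_D0"
        "camera_heights": size,
        "camera_widths": size,
    }
```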

3. BC-RNN configured in low_dim mode, inconsistent with the main results of the MimicGen paper

Solution: Rewrote to image mode following the official MimicGen generate_training_configs_for_public_datasets.py and the set_learning_settings_for_bc_rnn() function in config_utils.py

Key insight: MimicGen provides no pretrained checkpoints — only datasets. The key differences between image mode and low_dim are: RNN hidden_dim (400→1000), batch_size (64→16), epoch_every_n_steps (100→500), and the addition of ResNet18 + SpatialSoftmax encoder.
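The low_dim-to-image deltas listed above can be expressed as a set of config overrides. The dotted-key helper below is illustrative (the official route is set_learning_settings_for_bc_rnn() in MimicGen's config_utils.py); field names follow robomimic's nested JSON config layout as reported in this log.

```python
# Sketch of the low_dim -> image config changes; values are the ones
# recorded in this log, the dotted-key scheme is an assumption.
IMAGE_MODE_OVERRIDES = {
    "algo.rnn.hidden_dim": 1000,            # low_dim uses 400
    "algo.actor_layer_dims": [],            # low_dim uses [1024, 1024]
    "train.batch_size": 16,                 # low_dim uses 64
    "train.hdf5_cache_mode": "low_dim",     # low_dim mode caches "all"
    "experiment.epoch_every_n_steps": 500,  # low_dim uses 100
}

def apply_overrides(config: dict, overrides: dict) -> dict:
    """Apply dotted-key overrides to a nested JSON-style config dict."""
    for dotted, value in overrides.items():
        node = config
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config
```

On top of these scalar changes, image mode also swaps in the ResNet18Conv + SpatialSoftmax encoder with a 76×76 CropRandomizer, which has no low_dim counterpart.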

General Issues

4. sbatch submission failed: User’s group not permitted to use this partition

Solution: Used source set-XY-I.sh to locate /usr/local/slurm.24051/bin/srun, then used --jobid=45628 --overlap to attach to an existing interactive job on node an49

Key insight: Multiple SLURM installations exist on the cluster; you must source the environment script first to access the version you have permissions for. Reusing an existing interactive allocation is the fastest workaround.
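The workaround reconstructed as a command sequence (job id, script name, and node are specific to this cluster; treat this as an illustrative sketch, not a portable recipe):

```shell
# 1. Source the cluster's environment script so the permitted SLURM
#    build comes first on PATH.
source set-XY-I.sh
which srun   # resolves to /usr/local/slurm.24051/bin/srun

# 2. Instead of sbatch (blocked by partition permissions), attach a
#    step to the existing interactive allocation on node an49.
srun --jobid=45628 --overlap --pty bash
```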

Human Thinking vs AI Thinking

Strategic Level

Root cause identification for Pi0.5 0% success rate

  • Human: Asked directly: “You didn’t mix all 9 tasks together and train a single model, did you?” — immediately identified the task distribution mismatch as the core problem through common sense
  • AI: Initially listed 3 candidate causes (server OOM, action clipping, action space mismatch) and failed to surface the task distribution mismatch as the primary factor

Analysis: The human hit the nail on the head with a single question; the AI required multiple steps to reach the same conclusion. Human systemic intuition outperformed AI symptom-based attribution.

BC-RNN mode selection

  • Human: Explicitly specified image mode to stay consistent with the main results of the MimicGen paper
  • AI: The existing config was low_dim + 600 epochs; the AI proactively offered two options and asked the user to decide

Analysis: The human focused on paper consistency; the AI tended to default to the existing config. The human’s decision drove the entire rewrite.

Implementation Level

Granularity of BC-RNN training plan

  • Human: Provided a highly detailed implementation plan with specific parameter tables, per-task horizons, and paths to key reference files
  • AI: Executed the plan, but discovered additional details independently during exploration (e.g., epoch_every_n_steps 500 vs 100)

Analysis: The human supplied high-quality planning; the AI handled precise execution and filled in the details.

AI Limitations

Significant Limitations

  • When initially diagnosing Pi0.5’s 0% success rate, the AI failed to immediately identify the most obvious cause — the complete absence of overlap between the training and evaluation task sets — and instead led with secondary factors like OOM and action clipping.

General Limitations

  • Attempted to call ExitPlanMode twice without user confirmation; both were rejected by the user. The AI was too autonomous when transitioning to execution mode and did not adequately wait for the user to review the plan.
  • The Write tool requires a prior Read before writing; the AI skipped this step when writing bc_rnn_stack/stack_three/threading.json, causing multiple tool call failures before self-correcting.

Today’s Takeaways

Core Insights

  • MimicGen provides no pretrained checkpoints — only HDF5 demo datasets. BC-RNN is its only official benchmark algorithm; image mode is consistent with the paper, which reports Stack/Coffee success rates of 100%, Threading 98%, Assembly 82%.
  • MimicGen _D0 environments (e.g., Coffee_D0) expose additional object-state observables compared to base robosuite environments. Stripping the _D0 suffix drops observation dimensions from 65 to 37, breaking input consistency at inference time.
  • Cross-task evaluation is a common pitfall in robot learning: a complete mismatch between training and evaluation tasks will produce 0% success rates. Train/eval task consistency should be ensured at the experiment design stage.
  • Key differences between BC-RNN image mode and low_dim: RNN hidden_dim 400→1000, actor_layer_dims [1024,1024]→[], batch_size 64→16, hdf5_cache_mode all→low_dim, epoch_every_n_steps 100→500; plus the addition of ResNet18Conv + SpatialSoftmax(32kp) + CropRandomizer(76×76).
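The train/eval consistency insight above could be enforced with a small guard at experiment-setup time. The function name and error handling here are illustrative, not part of the existing framework:

```python
def check_task_overlap(train_tasks: set, eval_tasks: set) -> set:
    """Guard against the 0%-success-rate pitfall: every evaluation
    task must appear in the training set. Raises if any is missing,
    otherwise returns the (fully overlapping) eval set."""
    missing = eval_tasks - train_tasks
    if missing:
        raise ValueError(f"Eval tasks not in training set: {sorted(missing)}")
    return eval_tasks & train_tasks
```

Running this with train={coffee, stack, threading, ...} and eval={pickplace} would have flagged today's Pi0.5 setup before a single rollout was wasted.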

Practical Insights

  • Without sbatch permissions, you can attach a new process to an existing interactive allocation using srun --jobid=EXISTING_JOB_ID --overlap, without needing to request new GPU resources.

Session Summaries

error-recovery-benchmark

🔍 Checking Pi0.5 and BC-RNN training status; discovered the task distribution mismatch causing the 0% success rate (01:10:41.534 | claude_code)

User asked about results for both Pi0.5 and BC-RNN, whose training had completed. Investigation revealed that Pi0.5 was trained on 5 MimicGen tasks but evaluated on PickPlace (not in the training set), leading to a 0% success rate; BC-RNN could not be evaluated properly due to observation dimension mismatches. Key turning point: the human’s question — “You didn’t mix all 9 tasks together and train a single model, did you?” — directly identified the root cause. Decision: switch to evaluating on training tasks.

✅ BC-RNN image mode config rewrite + 5-task parallel training launch (04:00:34.795 | claude_code)

Following the human’s detailed plan, rewrote all 5 BC-RNN configs from low_dim to image mode (ResNet18+SpatialSoftmax+CropRandomizer, RNN hidden 1000); modified 3_collect_data.py to enable 84×84 camera rendering for BC-RNN. After sbatch failed, bypassed permission restrictions via srun --jobid=45628 and successfully launched 5-task parallel training across 8 A800 GPUs, all confirmed running in image obs mode. After ~3 hours of training, stack_three/threading/assembly had reached approximately epoch ~190/600.

✅ Extended evaluation framework to support 5 MimicGen tasks and fixed dimension mismatches (02:34:46.032 | claude_code)

Implemented the multi-task evaluation framework extension: created task YAMLs for coffee/stack/stack_three/threading/three_piece_assembly plus task_registry.yaml; added a --task parameter to three evaluation scripts; created BC-RNN training configs for all 5 tasks; rewrote run_full_eval.sh to support multi-task loops; fixed _D0 suffix stripping issues in all create_env() calls. All scripts passed syntax validation.

🔄 Researching the Pi0.5 evaluation pipeline on MimicGen training tasks (04:10:57.195 | claude_code)

User requested testing Pi0.5 on its training tasks. The AI explored pi05_phoenix/ evaluation scripts, the VLA server startup process, and zhaoganlong’s eval_checkpoints_multi.py; confirmed the checkpoint location at phoenix_comparison/ (step 99,999); established that the websocket policy server must be started before evaluate_mimicgen.py can connect. Evaluation plan research is complete but execution has not yet begun.

Multi-project tests (X-VLA / mozihao-VLA / HOME, etc.)

❌ “hello” connection tests across multiple project paths (09:46:37.000 | claude_code)

Sessions initiated from several different project paths (X-VLA, mozihao-VLA, HOME, etc.) contained only “hello” and were all interrupted by the user or had no substantive interaction — likely environment tests or accidental triggers.

Token Usage

Overview

Metric Value
Total Tokens 15,146,478
Input Tokens 34,198
Output Tokens 10,480
Cache Creation 1,168,906
Cache Read 13,932,894
Cache Hit Rate 92.3%
Total Cost (USD) $6.3382

Model Breakdown

Model Input Output Cache Creation Cache Read Cost Share
claude-opus-4-6 17,829 618 419,292 3,042,572 $4.2465 67.0%
claude-haiku-4-5-20251001 16,369 9,862 749,614 10,890,322 $2.0917 33.0%