Daily Journal — 2026-03-14

Today’s Overview

  • What I did: Completed nvitop-style UI improvements for the GPU monitoring tool, batch MP4 visualization of HDF5 camera data, and architecture design plus code implementation for four groups of manipulation progress prediction auxiliary experiments in the pi05 model
  • How I did it: Optimized the monitoring tool using an alternate terminal buffer and adaptive column widths; batch-decoded JPEG frames with OpenCV into 2×2 grids and wrote MP4s; added auxiliary MLP prediction heads, stop_gradient isolation strategy, and experiment config switches across 6 files in the JAX/Flax NNX framework
  • Why it matters: GPU monitoring tool UX now matches nvitop quality; all 50 demonstration videos generated and ready for data quality inspection; pi05 four-group experiment configs are ready — training can begin as soon as the lerobot data format conversion is complete

Improved GPU monitoring tool UX, completed four-camera visualization for 50 robot demonstration episodes, and designed and implemented a four-group manipulation progress prediction auxiliary task experiment framework in the pi05 VLA model

Today’s Tasks

Architecture & Strategy

  • 🔄 pi05 four-group manipulation progress prediction auxiliary experiment implementation — Implemented manip_progress_time/distance auxiliary prediction heads in the pi05 model (four experiments: last_token vs special_token × time vs distance), with changes spanning pi0_config.py, model.py, tokenizer.py, robotwin_policy.py, config.py, and pi0.py; added ProgressConfig switches and four one-click experiment configs; training is blocked pending lerobot data format conversion
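The four one-click experiment configs are the cross product of the two experiment axes (feature source × prediction target). A minimal sketch of such a switch, with illustrative field names — the actual ProgressConfig in pi0_config.py may differ:

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical sketch of the experiment switch; field names (feature_source,
# target) are illustrative, not the actual pi05 ProgressConfig fields.
@dataclass(frozen=True)
class ProgressConfig:
    enabled: bool = False
    feature_source: str = "last_token"  # "last_token" | "special_token"
    target: str = "time"                # "time" | "distance"
    loss_weight: float = 0.1            # lambda for the auxiliary loss
    stop_gradient: bool = True          # isolate aux head from the backbone

# The four one-click experiment configs are the cross product of the two axes.
EXPERIMENTS = {
    f"{src}_{tgt}": ProgressConfig(enabled=True, feature_source=src, target=tgt)
    for src, tgt in product(("last_token", "special_token"), ("time", "distance"))
}
```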

Implementation & Fixes

  • Batch HDF5 camera data visualization script — Created script/visualize_hdf5_cameras.py to read front/head/left/right four-channel JPEG camera frames from 50 episodes under place_dual_shoes/demo_clean/data, assemble them into annotated 2×2 grids, and write 640×480@30FPS MP4 files; all 50 videos (~2.3MB each) generated successfully
  • gpumon.py alternate buffer and adaptive layout — Added nvitop-style alternate screen buffer to the GPU monitoring script (\033[?1049h to enter a dedicated screen, restored on Ctrl+C exit); changed GPU and process tables to adapt width/height based on the actual terminal COLUMNS/LINES; fixed a bug where os.get_terminal_size() could not read the COLUMNS environment variable in subprocess contexts
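The core tiling step of the HDF5 visualization script above can be sketched with plain NumPy. JPEG decoding (cv2.imdecode on the stored bytes) and MP4 encoding (cv2.VideoWriter) are omitted, and the front/head over left/right layout order is an assumption:

```python
import numpy as np

def assemble_grid(front, head, left, right):
    """Tile four HxWx3 camera frames into a 2x2 grid (front|head / left|right).

    In the actual script the inputs would come from cv2.imdecode() on the
    JPEG bytes stored in the HDF5 file, and the grid would be written out
    with cv2.VideoWriter; this sketch shows only the tiling step.
    """
    top = np.hstack([front, head])
    bottom = np.hstack([left, right])
    return np.vstack([top, bottom])

# Four 240x320 frames tile into the 480x640 canvas used for the 640x480 MP4s.
frames = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(4)]
grid = assemble_grid(*frames)
```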

Problems & Solutions

Critical Issues

1. pi05 training failed to start: dataset missing new fields — manip_progress_time, manip_progress_distance_left/right, target_endpose, target_joint, etc.

Solution: Modify ~/HDD_POOL/mozihao/VLA/convert_robotwin_democlean_to_lerobot.py to add the missing fields, then re-run the dataset conversion

Key insight: Verify that the data pipeline fully supports all required fields before designing training code — discovering missing data after implementation is complete wastes engineering time

General Issues

2. In pipe/subprocess contexts, os.get_terminal_size() does not see the COLUMNS/LINES environment variables, so the table failed to actually expand when tested at 120 columns

Solution: Modified _get_term_size() to prioritize reading COLUMNS/LINES environment variables, falling back to os.get_terminal_size() only on failure

Key insight: Terminal width detection must handle both real TTYs (interactive) and environment-variable-only contexts (pipes/scripts)

Human Thinking vs. AI Thinking

Strategic Level

Research design for the pi05 four-group experiment

  • Human: Proactively proposed the full four-group comparative experiment design: two feature extraction methods (last_token vs special_token), two prediction targets (time vs distance), and the specific mechanism for injecting prediction results as conditioning tokens into the action expert
  • AI: Upon receiving the experiment design, analyzed architectural feasibility and proposed engineering implementation details: MLP scale (2048→256→out), loss weight λ=0.1, stop_gradient strategy, and config switch scheme

Analysis: Research hypotheses and experiment design were human-led; AI primarily handled architecture analysis and engineering implementation — the human contribution to core research direction was larger

Implementation Level

GPU monitoring tool UI specification

  • Human: Explicitly specified the nvitop-style interaction behavior (restore the command window on exit) and the specific requirement for adaptive width/height
  • AI: Implemented the alternate buffer mechanism, but the initial version didn't truly adapt table width to terminal changes; the issue was only caught during debugging

Analysis: The human had a clear target UX in mind; the AI had gaps in implementation details (environment variable vs. TTY), requiring user testing to surface and fix

AI Limitations

Significant Limitations

  • Did not proactively verify data pipeline completeness (whether fields like manip_progress had already been written to the lerobot dataset) before implementing the pi05 training code, resulting in a missing data format issue discovered only at training time — wasting engineering effort

General Limitations

  • The initial adaptive-layout implementation of gpumon.py missed that os.get_terminal_size() cannot read COLUMNS in subprocess contexts; the issue surfaced, and was fixed, only after user testing

Today’s Takeaways

Core Insights

  • Using stop_gradient to isolate main-task and auxiliary-task gradients in VLA auxiliary tasks is the safe starting point — first ensure the auxiliary head doesn’t interfere with action prediction; if results are poor, remove the gradient stop and run a comparison experiment
  • When adding auxiliary task heads in JAX/Flax NNX, training uses GT values for teacher forcing while inference uses predicted values for injection — both paths must be implemented separately in compute_loss and sample_actions with consistent interfaces
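The stop_gradient isolation pattern from the first insight can be illustrated with a toy pure-JAX loss. Shapes and the linear heads are made up for the sketch; the real pi05 heads are MLPs inside the Flax NNX model:

```python
import jax
import jax.numpy as jnp

# Minimal sketch of the gradient-isolation pattern, with made-up shapes and
# linear heads standing in for the real pi05 MLP heads.
def loss_fn(params, x, action_target, progress_target, lam=0.1):
    features = x @ params["backbone"]               # shared trunk features
    action_pred = features @ params["action_head"]  # main-task head
    action_loss = jnp.mean((action_pred - action_target) ** 2)

    # stop_gradient detaches the trunk features: the progress head still
    # learns, but no auxiliary gradient flows back into the backbone.
    detached = jax.lax.stop_gradient(features)
    progress_pred = detached @ params["progress_head"]
    aux_loss = jnp.mean((progress_pred - progress_target) ** 2)

    return action_loss + lam * aux_loss
```

With this structure, the backbone gradient is identical whether the auxiliary loss is on or off, which is exactly the "auxiliary head doesn't interfere with action prediction" guarantee.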

Practical Insights

  • Alternate terminal buffer (\033[?1049h to enter, \033[?1049l to exit) combined with signal.SIGINT capture enables an nvitop-style full-screen refresh UI that automatically restores the original terminal content on exit
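The combination above can be sketched as follows (illustrative only, not gpumon.py's actual loop or drawing code):

```python
import signal
import sys

ENTER_ALT = "\033[?1049h"   # switch to the alternate screen buffer
LEAVE_ALT = "\033[?1049l"   # restore the original terminal content

def run_fullscreen(draw):
    """Run a redraw callable nvitop-style: dedicated screen, restored on exit.

    Hypothetical sketch; the real monitoring loop redraws periodically.
    """
    def _restore(signum, frame):
        sys.stdout.write(LEAVE_ALT)   # Ctrl+C leaves the terminal clean
        sys.stdout.flush()
        sys.exit(0)

    signal.signal(signal.SIGINT, _restore)
    sys.stdout.write(ENTER_ALT)
    try:
        draw()
    finally:
        sys.stdout.write(LEAVE_ALT)   # also restore on normal return/errors
        sys.stdout.flush()
```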

Session Summaries

RoboTwin GPU Monitor

✅ gpumon.py — Added nvitop-style alternate buffer and adaptive terminal layout (09:24:23.170 | claude_code)
User requested refactoring the GPU monitoring script to nvitop style: enter a dedicated screen on launch, restore on exit, and change fixed column widths to adaptive. AI implemented the \033[?1049h alternate buffer mechanism, fixed os.get_terminal_size() being ineffective in pipe environments, and changed GPU/process table widths to proportional allocation. Tested and passed at both 80 and 120 columns; the process table row count is also dynamically truncated based on terminal height.

RoboTwin HDF5 Visualization

✅ Implemented HDF5→MP4 batch visualization script; successfully processed all 50 episodes (13:21:34.636 | claude_code)
User submitted a planning document requesting implementation. AI created script/visualize_hdf5_cameras.py, using cv2.imdecode to decode JPEG frames and assemble annotated 2×2 grids, writing 640×480@30FPS MP4 files. The script ran end-to-end over all 50 episodes; the output directory contains 50 video files of ~2.3MB each, with file count and sizes verified.

🔄 HDF5 camera visualization implementation planning (Plan Mode exploration) (13:18:03.922 | claude_code)
User re-initiated the visualization request. AI used an Explore agent to analyze the data collection pipeline and HDF5 file structure, confirmed the JPEG-encoded camera data format for 50 episodes (240×320, four channels), read the existing parse_hdf5.py tool to understand decoding patterns, and produced a detailed implementation plan (2×2 grid, 640×480@30FPS MP4) before exiting Plan Mode.

❌ HDF5 camera visualization request (interrupted by 403 auth expiration) (13:16:50.918 | claude_code)
User requested a script to visualize HDF5 camera data. AI encountered a 403 Request not allowed error on the first file read attempt (session token expired, requiring re-login). The session was immediately interrupted; user then re-initiated the same request in a new session.

RoboTwin pi05 VLA

🔄 Architecture discussion and six-file implementation for pi05 four-group manipulation progress prediction auxiliary experiments (14:21:07.908 | claude_code)
User proposed four comparative experiments (last_token vs special_token × manip_progress_time vs distance_left/right). After in-depth analysis of the JAX/Flax NNX architecture, AI proposed MLP scale (2048→256), loss weight λ=0.1, stop_gradient strategy, and config switch scheme, all confirmed by the user. AI completed the implementation across 6 files, including the ProgressConfig class, special token registration, Observation.aux_targets field, auxiliary loss computation, and four one-click experiment entry points; import and config validation tests passed. Training launch revealed the dataset is missing progress-related fields; the lerobot conversion script must be updated first.

Token Usage

Summary

Metric           | Value
Total Tokens     | 18,998,065
Input Tokens     | 11,315
Output Tokens    | 60,529
Cache Created    | 1,403,485
Cache Read       | 17,522,736
Cache Hit Rate   | 92.6%
Total Cost (USD) | $13.1289

Model Breakdown

Model                     | Input | Output | Cache Created | Cache Read | Cost     | Share
claude-opus-4-6           | 1,635 | 27,657 | 760,379       | 12,366,827 | $11.6354 | 88.6%
claude-haiku-4-5-20251001 | 9,680 | 32,872 | 643,106       | 5,155,909  | $1.4935  | 11.4%

Per-Device Usage

Device      | Total Tokens | Input | Output | Cost
tianhe      | 7,203,350    | 5,266 | 23,595 | $5.6472
TzJsDesktop | 11,794,715   | 6,049 | 36,934 | $7.4817