Daily Report — 2026-03-15
Today’s Overview
- What was done: Six parallel workstreams: ① MIHD cross-sample embedding methodology diagnosis ② ErrorRecoveryBenchmark v5 fixes and scaling (13 skills/29 subtypes) ③ Full VLA progress prediction pipeline repair ④ UniVLA evaluation container adaptation ⑤ CalendarPro seven-phase comprehensive refactor ⑥ gadget toolchain architecture upgrade (common/ package + outputs/ unification + research profiler + CLI consolidation)
- How it was done: All workstreams used plan-driven development: detailed plans drafted upfront before parallel implementation; GPU node SSH remote execution for simulation pipelines; iterative debugging (run → error → locate → minimal fix) applied throughout; multiple parallel Agent sub-agents accelerated code analysis and implementation
- Why it matters: Benchmark expanded from 11 skills to 13 skills/29 subtypes; eliminated the methodological flaw of per-section independent embeddings; pi05 progress prediction full pipeline is ready; CalendarPro 230 tests all pass; gadget eliminated ~500 lines of duplicate code and established a unified output directory; research profiler student discovery capability went from zero to breakthrough, with 7+ embodied AI scholars profiled in depth
DCC
- What was done: Implemented PCA and raw_shared embedding diagnostic baselines in the MIHD project; traced and verified the root cause of cross-sample embedding dual non-comparability (per-section independent HVG selection + independent PCA fitting)
- How it was done: Added
--embedding_sourceparameter; traced the import chain to discover the per-section independent processing flaw; after fixing the adata_hvg cache gene name integer bug, adopted a raw_shared approach using raw HDF5 to compute HVG intersection (1137 genes) - Why it matters: Disproved the mistaken conclusion that “PCA worse than STAIG = weak input features,” identified the shared HVG intersection as the correct validation baseline; revealed STAIG’s layer-specific behavior — SL@50=0.94–1.0 on boundary layers (Layer_1/Layer_5) while failing completely on intermediate layers
TzJsDesktop
- What was done: Completed CalendarPro seven-phase optimization (230 tests passing); gadget completed common/ package extraction refactor, outputs/ directory unification, research profiler homepage-based student discovery with new Hugo research section, batch deep profiling of 7 embodied AI scholars, and research toolchain CLI consolidation (citation graph + three-backend LLM)
- How it was done: CalendarPro implemented all seven phases via parallel sub-Agents with pytest regression verification; gadget eliminated sys.path hacks and three duplicate LLM implementations by creating the common/ package, unified 6 path constants via paths.py, implemented homepage-priority student discovery via a new homepage_discovery.py module, and integrated profile/citations subcommands into research_scout.py
- Why it matters: CalendarPro fixed 4 misclassification scenarios and reduced prompt token usage by 40–60%; gadget eliminated ~500 lines of duplicate code and simplified .gitignore to a single line (outputs/); the research profiler achieved a student discovery breakthrough, and 7 scholar profiles are complete and deployable to Hugo
tianhe
- What was done: ErrorRecoveryBenchmark: fixed 5 failing error skills, archived v4, semantic split of E2 (13 skills/29 subtypes), generated all 11 Stack demo videos, completed v5.1 architecture planning, first-round D0 scene generation (207 scenes) and failure root cause diagnosis; full pipeline fix and validation for VLA progress prediction training (step 100 loss normal); UniVLA CALVIN evaluation script container compatibility fix
- How it was done: Bypassed OSC controller interference via mujoco.mj_step(); fixed gripper step counts, missing phase labels, and target_object propagation chain; import chain tracing completed v4 archival; E2 split semantically by recovery strategy; SSH to an53 node 8×A800 to run pipeline; VLA debugging via iterative loop
- Why it matters: The fixed benchmark generates 231 scenes + 231 MP4s; first-round D0 generation of 207 scenes exposed 5 systemic defects; pi05 at step 100 shows action_loss=0.37, aux_loss=0.22 with normally decreasing curves; UniVLA’s --single_gpu mode removes the container network dependency
Six parallel workstreams across three devices all day: DCC diagnosed the root cause of cross-sample embedding dual non-comparability in MIHD; tianhe completed ErrorRecoveryBenchmark v5 fixes for five failing skills, E2 semantic split (13 skills/29 subtypes), v4 archival, D0 scene generation, and v5.1 planning, while also fixing the full VLA progress prediction training pipeline and debugging UniVLA container compatibility; TzJsDesktop completed CalendarPro’s seven-phase comprehensive optimization (230 tests passing), gadget common/ package and outputs/ unification refactor, research profiler homepage-based student discovery with Hugo deployment, and research toolchain CLI consolidation (citation graph + three-backend LLM support).
Today’s Tasks
Architecture & Strategy
- ✅ gadget: New Hugo research section and research toolchain CLI consolidation (citation graph + three-backend LLM) — Added research menu item and content/research/_index.md to Hugo, implemented deploy_to_hugo() in output.py; used research_scout.py as the unified CLI entry point, deleted New feature/ and duplicate directories, added 3 citation graph API functions to semantic_scholar.py, added profile/citations subcommands, llm.py now supports claude_cli/anthropic/openai backends, three-phase reports automatically run citation analysis on top-5 papers
- ✅ CalendarPro: 7-phase comprehensive optimization (230 tests passing) — Phase 1–7: ① Semantic router confidence thresholds (per-route 0.40–0.60) ② Hybrid routing (Dense 0.70 + Keyword 0.30) ③ Prompt consolidation (530 lines → base + 11 fragments) + Chinese token correction (×1.5/char) ④ Provider exponential backoff retry ⑤ Configurable scheduling score weights + deadline urgency ⑥ Automatic threshold tuning closed loop ⑦ ThoughtStore memory cache. All 4 real misclassification scenarios fixed, 230 tests all passing
- 🔄 MIHD: PCA and raw_shared embedding diagnostic baseline implementation — Added --embedding_source {fusion,pca,raw_shared} three-mode support in benchmark_rm_ideal.py. PCA results show SL@50=0 for all 14 combinations; discovered that per-section independent processing invalidates comparisons; after fixing the adata_hvg cache gene-name integer bug, switched to a raw_shared approach using raw HDF5 for the HVG intersection (1137 genes); diagnosis still running
- ✅ ErrorRecoveryBenchmark: Fixed 5 failing error skills — Fixed grasp_misalignment (gripper_close_steps 10→30, settle_steps 5→15), three drop skills (added a mujoco.mj_step() physics pre-step of 15 steps to bypass the OSC controller), trajectory_regression (added --label_phases in the pipeline), and wrong_object (fixed the target_object propagation chain; env_wrapper added get_target_object(), trajectory_context added a target_object field). All 105 unit tests pass
- ✅ ErrorRecoveryBenchmark: E2 Drop skill semantic split (13 skills/29 subtypes) — Split by drop location and object interaction into three independent skills: drop_in_transit (mid-air, far from target), drop_at_wrong_place (near target, large offset, no interaction), drop_with_interaction (near target, small offset, object contact); the D0/D2 distinction changed to posterior quaternion orientation change. Created 3 new skill files, deleted the old e02_drop.py, renamed 9 files, and fully updated taxonomy, config, 4 test files, and documentation. 13 skills/29 subtypes; 105 unit tests all pass
- ✅ ErrorRecoveryBenchmark: Stack body name bug fix and all 11 demo video generation — Fixed body_name field in stack.yaml (cubeA→cubeA_main), added _main/_body0 suffix fallback logic to env_wrapper._sim_body_name2id with WARNING on lookup failure; changed demo video script from action replay to set_sim_state() state restoration to avoid open-loop error accumulation; generated one MP4 demo video for each of Stack’s 11 error skills
- ✅ ErrorRecoveryBenchmark: v5.1 architecture planning (remove VLA context replay + velocity limits + human demo collection pipeline) — Completed technical planning document based on three user requirements, with clear phased implementation plan for Mar 16–31 and milestone of starting recovery training before April 1. Refactored ContextReplayEngine into InjectionEngine (direct injection frame sim state restoration), added motion velocity limits, designed keyboard teleoperation human demo collection pipeline (MimicGen demo data only)
- ✅ VLA-RoboTwin: pi05 progress prediction training pipeline full-chain fix and validation — Fixed the HDF5→LeRobot format conversion script (added manip_progress_time/distance_left/right/target_endpose/target_joint fields); fixed a pi05 CheckpointWeightLoader structure mismatch (added configurable missing_regex='.lora.|.progress.'); corrected the aux_targets shape-handling logic in pi0.py (restored [:, None]; confirmed via actual training logs that LeRobot squeezes shape=(1,) features to scalars); added independent action_loss/aux_loss logging (has_aux=True). Validated to step 100 with loss curves decreasing normally
- ✅ gadget: common/ shared package extraction and refactor — Created 6 common/ modules (io/cache/json_utils/llm/hugo/init), reduced summarize/llm_backends.py from 516 lines to a 25-line re-export shim, eliminated research_scout.py sys.path hacks, migrated 4 files under research/ and mcp_server.py, eliminated ~400 lines of duplicate LLM calls and JSON parsing code
- ✅ gadget: outputs/ unified output directory refactor — Created common/paths.py defining 6 path constants (GADGET_ROOT/OUTPUTS_DIR/REPORTS_DIR/LOGS_DIR/CACHE_DIR/DATA_DIR/SITE_OUTPUTS_DIR), batch-updated daily_summary.py (12 path replacements), monthly_summary.py, research_scout.py (5 module-level constants), 4 research profiler submodules, and 3 benchmark files; updated .gitignore to a single line (outputs/) and updated 4 CLAUDE.md files
- ✅ gadget: Homepage-based student discovery implementation — Implemented the new homepage_discovery.py module (~200 lines), modified 9 existing files, refactored discover_students into 4 stages (homepage-first + co-authorship supplement); multi-strategy URL discovery (S2 homepage field + LLM suggestions + --homepage parameter); HTMLParser subclass for text extraction; 2MB read limit, 50K-character truncation, 7-day cache TTL
- ✅ ErrorRecoveryBenchmark: v4 code archival to archive/v4/ — Moved 19 v4 framework modules (detectors/injectors/validators/classifiers etc.), 15 pipeline scripts, 5 config files, 6 test files, v4 outputs, and documentation to archive/v4/; fixed a policy_adapter.py cross-dependency on archived collector.py (inlined BasePolicy/PolicyResult); updated init.py, Makefile, CLAUDE.md, and README.md; all 94 v5 unit tests pass
- ✅ ErrorRecoveryBenchmark: Three bug fixes (coffee machine collision penetration, injection video frame skipping, output path cleanup) — ① Added margin=0.002, changed solimp to 0.95, changed solref to 0.002 in coffee_body/lid/base.xml ② Propagated render_fn callback through base_skill.inject(), three motion methods in env_wrapper, context_replay.execute(), and inject() in 13 error skills (17 files total) ③ Moved old backup directories and root-level scripts for 6 tasks to archive/v5_old_20260316/. All 105 unit tests pass
- 🔄 ErrorRecoveryBenchmark: First-round D0 scene generation (6 tasks, 207 scenes) and failure root cause diagnosis — Scanned for opportunities on an53 GPU and executed injections, generating 207 scenes (target: 600); diagnosed 5 systemic failure root causes: grasp_misalignment (insufficient gripper steps), 3 drop skills (OSC controller compensation cancels direct qpos manipulation), trajectory_regression (phase_labels pipeline not activated), wrong_object (target_object context missing)
- ✅ gadget: Batch deep profiling of 7 embodied AI scholars — Analyzed Mingyu Ding, Ruoshi Liu, Xiaolong Wang, Shuran Song, Yunzhu Li, Yuke Zhu, Chelsea Finn, Sergey Levine, and Pieter Abbeel via researcher profiler; identified complete advisor relationship network; some scholars (Xiaolong Wang, Shuran Song) encountered severe S2 name disambiguation issues; identified landmark award-winning works including VIN, TrajOpt, DDPM, and MineDojo
- 🔄 ErrorRecoveryBenchmark: Investigating coffee machine parts falling apart — User observed via screenshot that coffee machine lid was floating and base/cup were displaced from main body; AI launched 3 parallel Explore sub-agents to investigate XML file structure, Python assembly code, and CompositeBodyObject architecture; kinematic tree assembly logic diagnosis not yet complete
- ✅ CalendarPro: Open-source ecosystem research and 7-phase optimization plan design — Web search of comparable open-source projects (FluidCalendar/CoPaw/Khoj/OpenDAN etc.); identified energy-aware scheduling + three-layer architecture + dual intent verification + integrated life management as CalendarPro’s unique feature combination (confirmed gap in open-source ecosystem); translated research findings into a 7-phase optimization plan
Implementation & Fixes
- ✅ ErrorRecoveryBenchmark: v5 GPU pipeline full run generating 231 scenes and videos — Ran full D0 pipeline on an53 node (8× A800 80GB), generated 231 scenes and 231 MP4 videos in 42 minutes, an ~11.6% improvement over the pre-fix 207 scenes
- 🔄 UniVLA: CALVIN evaluation script single-GPU container compatibility fix — Added a --single_gpu mode to bypass torchrun/Accelerator/DDP initialization; added a GenerateConfig.window_size field (default 12); fixed the MAPBloc typo; installed the missing braceexpand dependency; fixed an evaluate_policy hardcoded absolute path pointing to another user; adjusted the GIF frame rate from 60 to 120 fps. The script now starts; iterative debugging still ongoing
Problems & Solutions
Key Issues
1. MIHD cross-sample embedding fundamental methodological flaw: AI drew the incorrect conclusion “weak input features” from “PCA worse than STAIG” without actively questioning the validity of the experimental design
Solution: After the user questioned the validity of cosine similarity, import chain tracing revealed that both PCA and STAIG have the same dual non-comparability issue (per-section independent HVG selection + independent PCA fitting), making both comparisons invalid; switched to a raw_shared approach using shared HVG intersection (1137 genes) as the correct baseline
Key insight: The prerequisite for valid cross-sample embedding comparison is a shared feature space; cosine similarity across sections from independently fitted embeddings is mathematically meaningless, regardless of the model used
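A minimal sketch of what the corrected baseline looks like, using plain NumPy matrices and gene-name lists in place of the pipeline's AnnData objects (all names here are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.decomposition import PCA

def shared_hvg_joint_pca(sections, n_components=8):
    """Embed multiple sections in ONE comparable space.

    sections: list of (matrix [cells x genes], gene_name list).
    Returns per-section embeddings produced by a single joint PCA,
    instead of per-section independent HVG selection + PCA fitting.
    """
    # 1. Shared feature space: intersect gene names across all sections
    shared = set(sections[0][1])
    for _, genes in sections[1:]:
        shared &= set(genes)
    shared = sorted(shared)

    # 2. Align every section's columns to the shared gene order
    aligned = []
    for mat, genes in sections:
        idx = [genes.index(g) for g in shared]
        aligned.append(mat[:, idx])

    # 3. Fit ONE PCA on the concatenation, then transform each section
    pca = PCA(n_components=n_components).fit(np.vstack(aligned))
    return [pca.transform(m) for m in aligned]
```

The key property is that a single PCA is fitted on the stacked data, so cosine similarities between sections live in one shared space rather than in independently fitted ones.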
2. Drop skill objects not actually falling: directly setting qpos to open the gripper caused env.step()’s OSC controller to reapply gripping force and “pull the object back”
Solution: After opening the gripper and setting the object’s initial velocity, call mujoco.mj_step() to run 15 physics steps first (completely bypassing the OSC controller), allowing the object to complete initial separation before entering the standard control loop
Key insight: sim.forward() only updates kinematic state without advancing dynamics; only mujoco.mj_step() actually steps the MuJoCo physics engine, thereby bypassing all high-level controllers. There is a fundamental conflict between direct state manipulation and feedback controllers — simulation injection design must explicitly choose one path
3. Stack task body name resolution silently failing: stack.yaml used cubeA/cubeB, but MuJoCo’s actual names were cubeA_main/cubeB_main; _sim_body_name2id returned -1, and Python’s negative index body_xpos[-1] read the last body, causing all task phase detection to be misclassified as pre_reach
Solution: Fixed body_name fields in stack.yaml, added _main/_body0 suffix fallback logic to env_wrapper._sim_body_name2id, outputting a WARNING instead of silently returning -1 on lookup failure
Key insight: body_xpos[-1] negative indexing always returns the same position for both cubes, making this silent bug nearly invisible; any parsing failure should immediately trigger an alert rather than returning a sentinel value
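A sketch of the fallback resolution, with a plain dict standing in for the simulator's name-to-id lookup (names and signature are illustrative):

```python
import logging

log = logging.getLogger(__name__)

def resolve_body_id(name2id, name, suffixes=("_main", "_body0")):
    """Resolve a MuJoCo body name with suffix fallbacks.

    Tries `name`, then `name + suffix` for each fallback suffix.
    Returns None and logs a WARNING on failure instead of a silent -1
    sentinel, which Python would happily use as a negative index into
    body_xpos.
    """
    for candidate in (name, *(name + s for s in suffixes)):
        if candidate in name2id:
            return name2id[candidate]
    log.warning("body name %r not found (tried suffixes %s)", name, suffixes)
    return None
```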
4. VLA context replay architecture assumption error: AI designed a complete N-1 frame replay mechanism believing it was necessary for providing correct observation history to VLAs; also designed multiple data sources (demo + VLA rollout + BC rollout) while ignoring the controllability differences between sources
Solution: User pointed out that most VLAs don’t have a context window, making context replay wasteful overhead; refactored ContextReplayEngine into InjectionEngine that directly restores injected frame sim state; data source restricted to MimicGen demo data (higher controllability)
Key insight: A general benchmark must work efficiently for context-free models (BC-RNN, ACT, etc.) as well; over-engineering for the few VLAs that support history input is wrong. The user’s knowledge of the actual model landscape outweighs AI’s theoretical reasoning
5. CalendarPro intent misclassification: semantic router had no confidence thresholds (0.52 treated as valid); sentences with time expressions were misrouted by keyword matching; short confirmations like “ok” lacked contextual understanding; system prompt too long (530 lines) sent in full, Chinese token estimation off by 3×
Solution: Added per-route thresholds (0.40–0.60), falling back to LLM below threshold; introduced keyword scorer (time regex boost for schedule) mixed with embedding at 70/30 ratio; split SYSTEM_PROMPT into BASE (~50 lines) + 11 intent-specific fragments injected on demand; Chinese token estimation changed to chinese_chars×1.5+other_chars/4
Key insight: Embedding nearest-neighbor routing cannot express “I’m not sure.” Confidence threshold + fallback LLM + keyword scorer hybrid is the most practical fix pattern, generalizable to all vector-retrieval classification systems (RAG routing, tool selection, etc.). Chinese characters consume ~6× as many tokens as English characters, so omitting this correction systematically underestimates context length
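The corrected token heuristic from the solution above can be sketched in a few lines; the CJK range test here is a simplification (it covers only the main Unicode CJK block):

```python
def estimate_tokens(text: str) -> float:
    """Estimate LLM token count for mixed Chinese/English text.

    Chinese characters tokenize far more densely than English, so the
    correction counts 1.5 tokens per CJK character and ~1 token per 4
    characters for everything else.
    """
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return cjk * 1.5 + other / 4
```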
6. S2 co-author analysis completely failing for top-tier researchers like Levine/Abbeel/Finn (depth-2 entirely empty), with severe name disambiguation issues (Xiaolong Wang matching a veterinarian/geologist, Shuran Song showing only 2 papers from 2025)
Solution: Refactored to a homepage-first strategy: prioritize extracting student lists directly from researcher homepages/lab pages, using co-authorship only as a supplement; multi-strategy URL discovery (S2 homepage field + LLM suggestions + --homepage parameter); marked name disambiguation warnings and suggested using S2 authorId for precise queries
Key insight: Academic homepages explicitly list students, making them an order of magnitude more reliable than inferring from co-authorship. Top-tier professors publish 500+ papers, diluting the first-author signal across vast numbers of collaborators — co-author analysis has fundamental applicability limits for this use case
7. Research toolchain fragmentation: paper scout and researcher profiler had overlapping functionality and scattered commands; the New feature/ directory contained fully duplicated code; the citation relationship dimension was missing from the toolchain
Solution: Used research_scout.py as the unified CLI entry point, integrated the modular profiler as a library via lazy import, added profile/citations subcommands, deleted New feature/ directory, added Semantic Scholar citation graph API, three-phase reports automatically run citation analysis on top-5 papers
Key insight: Integrating new modules via lazy import while retaining the original CLI entry point, rather than a full rewrite, balances backward compatibility; citation relationships (forward citations + backward references) are an undervalued core feature in research toolchains
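The lazy-import pattern can be sketched generically: a subcommand handler defers its import until the subcommand actually runs, so the unified CLI entry point stays fast for unrelated commands (demonstrated here with the stdlib json module, since the real profiler package is project-internal):

```python
import importlib

def lazy_handler(module_name, func_name):
    """Return a CLI handler that imports its implementation on first call."""
    def handler(*args, **kwargs):
        mod = importlib.import_module(module_name)  # deferred until invocation
        return getattr(mod, func_name)(*args, **kwargs)
    return handler
```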
8. Demo video script using action replay caused open-loop error accumulation, with phase detection errors in later frames
Solution: Switched to set_sim_state() to directly restore each frame’s MuJoCo state vector, completely avoiding open-loop error accumulation
Key insight: Stored clean trajectories contain full sim state vectors; direct state restoration is far more accurate than action replay. Action replay is suited for real-time control; state restoration is suited for offline analysis
9. trajectory_regression unable to find any injection opportunities: can_inject() requires prev_phases length ≥ 10, but the pipeline never called replay_and_label_phases(), leaving phase_labels as None throughout
Solution: Added --label_phases flag as default in Step 0 of run_v5_all_tasks.py
Key insight: Implicit dependencies (specific skills requiring certain pipeline steps to be explicitly activated) only surface when that skill fails; end-to-end integration tests surface this class of pipeline-level defects far better than unit tests
10. pi05 training error: CheckpointWeightLoader structure mismatch — newly added progress layers (progress_mlp_in/out/cond_proj) are absent from the checkpoint and don’t match the hardcoded ‘.lora.’ regex
Solution: Added a configurable missing_regex field to CheckpointWeightLoader (default '.lora.' for backward compatibility); changed the 4 progress experiment configs to use '.lora.|.progress.'
Key insight: Adding new experimental modules when loading pretrained weights is a common scenario; missing_regex should be a configurable parameter. This is identical to the same need in LoRA fine-tuning — a universal design pattern for transfer learning
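A sketch of the partition behavior such a configurable missing_regex enables. This is not the actual CheckpointWeightLoader code; the default pattern string is copied verbatim from the report (its dots act as single-character regex wildcards):

```python
import re

def split_missing_keys(missing, missing_regex=r".lora.|.progress."):
    """Partition checkpoint-missing parameter names into expected vs fatal.

    Keys matching missing_regex (new LoRA / progress-head layers that are
    absent from the pretrained checkpoint by design) are tolerated and
    left at their fresh initialization; anything else is a real load error.
    """
    pat = re.compile(missing_regex)
    expected = [k for k in missing if pat.search(k)]
    fatal = [k for k in missing if not pat.search(k)]
    return expected, fatal
```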
11. Incorrect aux_targets shape assumption in pi0.py: AI inferred that LeRobot preserves (b,1) shape after loading shape=(1,) features and modified code accordingly; in reality LeRobot squeezes to scalar (b,), causing shape mismatch during training
Solution: Confirmed the actual shape by running training and observing logs ('aux_targets[…]: (32,)@float32'), then restored the original [:, None] and jnp.stack operations
Key insight: LeRobot automatically squeezes scalar features with shape=(1,) at DataLoader time — this is framework-level behavior. Assumptions about third-party framework internals must be verified through actual execution, not pure inference
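A sketch of the restored shape handling, with NumPy standing in for the training stack's JAX arrays (the function name is illustrative):

```python
import numpy as np

def prepare_aux_targets(batch_progress):
    """Lift DataLoader scalars back to the (b, 1) shape the loss expects.

    LeRobot squeezes shape=(1,) features to scalars at load time, so a
    batch arrives as (b,), not (b, 1); restore the trailing axis with
    [:, None] before stacking with other auxiliary targets.
    """
    assert batch_progress.ndim == 1, "expected squeezed (b,) batch"
    return batch_progress[:, None]  # (b,) -> (b, 1)
```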
General Issues
12. adata_hvg cache bug: section 151673’s HVG AnnData var_names were reset to integer indices (‘0’,‘1’,‘2’…), causing the gene name intersection to be empty
Solution: Abandoned reliance on adata_hvg cache; loaded directly from raw HDF5 data (via load_dlpfc_data), manually performing normalization and HVG selection
Key insight: Critical fields in cached data may undergo silent transformation during write; a sanity check should be performed before use (e.g., verifying var_names contain gene symbols rather than integers)
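A sketch of the kind of sanity check suggested above (illustrative, not the project's code):

```python
def looks_like_gene_symbols(var_names, sample=20):
    """Cheap sanity check that cached var_names are gene symbols.

    A silently reset AnnData index turns names into '0', '1', '2', ...;
    reject the cache if all sampled names are plain integers.
    """
    checked = list(var_names)[:sample]
    return not all(str(v).isdigit() for v in checked)
```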
13. Chinese long-form JSON generated by LLM contains mixed Chinese quotation marks ("") inside JSON string values, causing parse failures — occurred repeatedly in profiles for Chelsea Finn, Yuke Zhu, Mingyu Ding, and others
Solution: Resubmitted malformed JSON to Claude requesting only the repaired pure JSON back, automated via the repair_json_with_llm mechanism; plan to explicitly require English quotation marks in the prompt as a long-term fix
Key insight: When generating JSON with rich Chinese content, prompts should explicitly require English quotation marks, or format validation should be applied immediately after generation with JSON repair as a fixed pipeline step — more reliable than depending on generation quality
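The validate-then-repair step can be sketched as a small loop; repair_fn stands in for the report's repair_json_with_llm resubmission hook and is any str -> str callable here:

```python
import json

def parse_with_repair(raw, repair_fn, max_rounds=3):
    """Parse LLM output as JSON, invoking a repair round-trip on failure.

    A fixed validate-then-repair pipeline step is more reliable than
    trusting generation quality, especially for Chinese text where CJK
    quotation marks leak into JSON string values.
    """
    for _ in range(max_rounds):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            raw = repair_fn(raw)  # e.g. resubmit malformed JSON to the LLM
    return json.loads(raw)  # final attempt; raises if still malformed
```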
Human Thinking vs. AI Thinking
Strategic Level
Methodological questioning of cross-sample cosine similarity validity (MIHD)
| Role | Thinking |
|---|---|
| Human | After AI concluded “PCA worse → weak input features,” the human intuitively asked “could there be a problem with this cross-sample embedding cosine similarity?”, directly targeting the methodological flaw rather than the numerical results |
| AI | AI tended to directly attribute cause from experimental numbers, without proactively questioning the validity prerequisite of the experimental design (whether feature spaces are comparable) |
Analysis: Humans possess prior methodological skepticism — when seeing anomalous results, they ask “is the experiment designed correctly?”. AI is better at analyzing data under given assumptions; reflection on the assumptions themselves requires external triggering
VLA context window necessity and error scene data source design
| Role | Thinking |
|---|---|
| Human | User proactively pointed out that most VLAs don’t have a context window, making context replay wasteful overhead; also explicitly required using only MimicGen demo data, prohibiting VLA/BC-RNN rollout data (uncontrollable randomness) |
| AI | AI designed a complete N-1 frame replay mechanism believing it was a necessary step, and planned multiple data sources believing diversity was beneficial — both lacking understanding of actual model scope and data controllability |
Analysis: User identified over-engineering from the practical perspective of model characteristics and data controllability; AI approached from theoretical correctness and needed user practical experience for correction
E2 drop semantic split and CalendarPro optimization plan design
| Role | Thinking |
|---|---|
| Human | User proactively split E2 into three independent skills by semantic difference in recovery strategy (the three drop types require completely different recovery actions); similarly for CalendarPro, independently completed problem diagnosis (4 real misclassification root cause analyses) and 7-phase technical specification, providing the complete solution as input |
| AI | In the benchmark, AI handled different cases of the same injector via parameterization without proactively proposing semantic-level subdivision; in CalendarPro, AI primarily served an implementation and verification role |
Analysis: The highest-value design work (semantic categorization, solution design) was completed by the human; AI contributed value in parallel execution and edge case handling. The user’s domain expertise is irreplaceable; AI’s parallel execution capability significantly accelerated delivery
Citation relationships as a core feature of the research toolchain
| Role | Thinking |
|---|---|
| Human | User proactively raised that citation links between papers are very important — high citation counts indicate popular directions, and one needs to analyze “why popular” and “what follow-up work has done”; also clarified that citation count is suitable for ranking but should not affect relevance scoring |
| AI | AI’s initial consolidation plan focused on merging functionality of two tools (CLI unification), treating citation features as optional extensions without proactively identifying the citation graph as a core feature; scoring decoupling waited for user decision |
Analysis: Users have a clearer research methodology perspective — citation relationships are a core tool for understanding research impact evolution, not just metadata; “relevance” and “popularity” are different dimensions, and humans more clearly understand their different purposes in research workflows
Student discovery strategy: debugging S2 co-author logic vs. switching to professor homepages
| Role | Thinking |
|---|---|
| Human | User directly proposed: instead of debugging the existing co-author analysis logic, scrape student lists directly from professor personal homepages, since homepage information is more direct and authoritative |
| AI | After depth-2 failures, AI began deeply debugging the scoring logic and threshold settings in student_discovery.py, trying to fix within the existing framework |
Analysis: AI tends to look for bugs or tune parameters within an existing solution; users more quickly identify methodological applicability limits and propose more efficient alternative paths, bypassing the fundamental limitations of S2 data quality
Identifying the coffee machine parts falling apart problem
| Role | Thinking |
|---|---|
| Human | Identified the specific phenomenon of lid floating and base displaced via visual inspection of screenshot, directly proposing three diagnostic directions: missing joint definitions, coordinate offsets, loading logic errors |
| AI | AI only focused on contact parameter-level fixes (margin/solimp) without proactively checking whether the model’s kinematic tree was correctly assembled |
Analysis: Human identified a new problem AI didn’t proactively discover using visual intuition, providing a higher-level structural diagnostic framework; AI’s fix only addressed “contact too soft” without addressing “parts not connected to each other”
VLA training debugging delegation pattern
| Role | Thinking |
|---|---|
| Human | Adopted a goal-driven delegation strategy: “execute training commands yourself, fix all errors, keep going until there are no more errors,” providing a clear termination condition without intervening in specific steps |
| AI | AI iterated scientifically: run → observe error → read source code to locate → minimal fix → re-run. But made an error on the LeRobot shape assumption, requiring actual run logs for correction |
Analysis: The human’s delegation pattern allowed AI to debug independently; the incorrect shape assumption was naturally exposed through execution. The human’s choice not to intervene in specific decisions was correct — the error correction mechanism was built into the iterative loop
AI Limitations
Key Limitations
- Insufficient ability to reflect on experimental conclusions: After MIHD PCA diagnostic experiments, directly drew incorrect conclusions from surface-level numbers without proactively examining the validity prerequisites of the experimental design (the comparability problem with per-section independent PCA); required external user prompting to correct
- Silent failure patterns causing serious bugs to persist: When stack.yaml body name parsing failed, -1 was silently returned without any warning or assertion, making the Python negative index bug completely invisible; a similar issue (adata_hvg cache var_names integer conversion) was also not proactively discovered due to lack of sanity checks
- Over-engineering and architecture assumption errors: v5 context replay was over-designed based on the incorrect assumption that “all VLAs need a context window”; made incorrect assumptions about third-party framework internal behavior (LeRobot squeezing shape=(1,) features) and modified code accordingly — both required user correction or actual execution to validate
- Insufficient ability to proactively question methodological applicability limits: When S2 student discovery completely failed for top-tier researchers, AI continued debugging code logic (reading student_discovery.py, analyzing thresholds) without proactively questioning the methodological applicability boundary; required user prompting to pivot to the homepage approach
- Insufficient Semantic Scholar entity disambiguation: For common Chinese-to-English transliterated names like Xiaolong Wang, Shuran Song, Ming Yu, nearly always matched to the wrong researcher; LLM analysis also couldn’t automatically identify “this is not the same person”; could only annotate warnings after the fact, lacking proactive entity disambiguation capability
General Limitations
- Unstable LLM output format for Chinese long-form JSON: outputs containing large numbers of Chinese quotation marks ("") failed even after three rounds of haiku→sonnet→opus repair attempts, recurring on three researchers in the same pipeline; repair_json_with_llm has insufficient handling for this specific pattern
- Inaccurate judgment of container network constraints: In UniVLA debugging, initially suggested a MASTER_ADDR approach believing it could bypass DNS resolution; it actually couldn't resolve the Kubernetes Pod IPv6 issues; the correct --single_gpu solution was only triggered after the user reported failure
- Incomplete coverage of recent academic conference award records: Systematic blind spots for non-top awards (CoRL/ICLR spotlight etc.) and recent 2023–2025 paper awards; noticeably weaker grasp of robotics conferences (CoRL/RSS/ICRA) compared to general AI top conferences (NeurIPS/ICML), prone to underreporting
Today’s Takeaways
Core Takeaways
- Necessary prerequisite for cross-sample embedding comparison: feature spaces must be shared. Per-section independent HVG selection + independent PCA fitting = dual non-comparability; a valid cross-sample baseline must use shared HVG intersection + joint PCA, or use a foundation model with fixed pretrained weights
- Standard method for bypassing high-level controllers in MuJoCo: call mujoco.mj_step() (which advances the dynamics) rather than sim.forward() (which only updates kinematics) to complete physics state changes before the OSC controller intervenes. There is a fundamental conflict between direct state manipulation and feedback controllers; simulation injection design must explicitly choose one path
- Composite objects generated by MuJoCo CompositeBodyObject typically have body names with _main suffix (e.g., cubeA_main not cubeA). env_wrapper body name resolution functions need fallback logic for multiple candidate names ({name}→{name}_main→{name}_body0), outputting WARNING instead of silently returning -1 on lookup failure
- Error type semantic splitting should be based on “whether recovery strategies differ” rather than “whether injection mechanisms differ”: drop_in_transit/drop_at_wrong_place/drop_with_interaction have completely different detection conditions and recovery logic — even if injection actions are the same, they must be modeled separately, which is more meaningful for curriculum design in the training phase
- Architectural flaw of Semantic Router: embedding nearest-neighbor always produces a result and cannot express “I’m uncertain.” Confidence threshold + fallback LLM + keyword scorer hybrid is the most practical fix pattern, generalizable to all vector-retrieval classification systems (RAG routing, tool selection, etc.)
- For top-tier researchers with 500+ publications, S2 co-author frequency analysis cannot reliably identify students — the first-author signal is diluted across vast numbers of collaborators. Professor homepages explicitly list students, making them an order of magnitude more reliable than inferring from co-authorship for discovering top-tier researchers’ students
- Citation graph (forward citations + backward references) is an undervalued core feature in research toolchains: analyzing “who cited this paper” reveals research impact evolution and popular follow-on directions; in scoring systems, “relevance” and “citation count/popularity” should be decoupled — citation count for ranking, not scoring, to prevent highly-cited but low-relevance papers from distorting project direction filtering
- Embodied AI scholar advisor lineage: Mingyu Ding←Jitendra Malik (Berkeley), Ruoshi Liu←Carl Vondrick (Columbia), Xiaolong Wang←Abhinav Gupta (CMU), Shuran Song←Thomas Funkhouser (Princeton), Yunzhu Li←Antonio Torralba (MIT), Yuke Zhu←Li Fei-Fei (Stanford) — revealing a systematic pattern: a cohort of top perception/robotics advisors producing the students who move into embodied AI
- In offline trajectory analysis, directly restoring each frame’s complete state vector with set_sim_state() is far more accurate and reliable than action replay, completely avoiding open-loop error accumulation. Storing states alongside clean trajectories is the correct design decision
- LeRobot dataset framework automatically squeezes scalar features with shape=(1,) to (batch_size,) at DataLoader time. Model code needs to explicitly use [:, None] to add back the dimension; data should be stored as np.float32(scalar) rather than np.array([scalar]). This is framework-level behavior that must be verified through actual logs, not inferred
- On-demand injection strategy for prompt engineering: splitting system prompt into base (always included) + intent-specific fragments (dynamically injected based on classification result) can reduce token usage by 40–60%. Chinese character token density is ~6× that of English characters (1.5 tokens/char vs 0.25 tokens/char); not correcting this systematically underestimates context length
- sys.path.insert hacks are a brittle cross-module dependency approach: any function rename causes runtime ImportError. The correct approach is a common/ package + pip install -e .; the Python re-export shim pattern (the module contains only from x import y; __all__ = [...]) is an elegant migration approach that maintains backward compatibility
- In projects with multiple tools sharing output, organizing by "file type first" rather than "tool name first" (outputs/reports/summarize/ rather than summarize/reports/) allows .gitignore to be simplified from multiple scattered rules to a single-line outputs/ entry, friendlier for CI/CD and disk quota management
- Validating simulation pipeline feasibility with a small batch (~100 total) is the correct iterative strategy: 207 scenes exposed 5 systemic defects; going straight to 2900 would have wasted massive GPU time on doomed injections. End-to-end integration tests surface implicit pipeline dependencies far better than unit tests
- Flow Matching is becoming the mainstream action decoding architecture for robot VLAs: works like π₀ have converged on “pretrained VLM backbone + flow matching action head,” outperforming diffusion models for multimodal modeling in continuous high-dimensional action spaces. Shuran Song’s Im2Flow2Act and UMI are two major breakthroughs in robot data efficiency in 2024
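The shared-feature-space requirement in the first takeaway can be sketched numerically. This is a minimal illustration with synthetic data, not the MIHD code: the section names, gene counts, and HVG sizes below are stand-ins (the real pipeline intersects to 1137 genes from raw HDF5).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic "sections" over the same 200 genes (stand-ins for real data).
sections = {"Layer_1": rng.poisson(3.0, (50, 200)).astype(float),
            "Layer_5": rng.poisson(3.0, (40, 200)).astype(float)}

def top_hvg(x, k=100):
    """Indices of the k highest-variance genes within one section."""
    return set(np.argsort(x.var(axis=0))[-k:])

# Shared feature space: intersect the per-section HVG sets ...
shared = sorted(set.intersection(*(top_hvg(x) for x in sections.values())))

# ... then fit ONE PCA jointly on the stacked matrix (numpy SVD, no sklearn),
# instead of fitting an independent PCA per section.
stacked = np.vstack([x[:, shared] for x in sections.values()])
centered = stacked - stacked.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
embedding = centered @ vt[:20].T  # one 20-D space shared by all sections

print(embedding.shape)  # (90, 20): both sections live in the same space
```

Per-section HVG selection plus per-section PCA would instead produce two embeddings with different axes and different genes, which is exactly the dual non-comparability described above.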
Session Summaries
MIHD
🔄 MIHD cross-sample embedding comparability diagnosis: PCA baseline implementation, per-section dual non-comparability root cause identification, raw_shared approach design
19:33:45.000 | claude_code
Starting from project state confirmation (current best ARI=0.546, PCA+UNI2+STAIG), implemented three-mode --embedding_source support. PCA diagnostics showed SL@50=0 for all 14 combinations; after the user challenged AI’s initial incorrect conclusion, import chain tracing revealed that both PCA and STAIG have the same per-section independent processing flaw (dual non-comparability). Discovered adata_hvg cache has a gene name integer conversion bug; ultimately switched to a raw_shared approach loading HVG intersection (1137 genes) from raw HDF5, diagnosis running. Also revealed STAIG’s layer-specific behavior: SL@50=0.94–1.0 on Layer_1/Layer_5, complete failure on intermediate layers.
ErrorRecoveryBenchmark
✅ v4 archival, E2 semantic split, v5.1 architecture planning, 5 skill fixes, D0 scene generation, three bug fixes, Stack demo videos, failure root cause diagnosis
20:20:54.000 | claude_code
Multiple sessions throughout the day completing major benchmark framework advances: ① Full v4 code archival to archive/v4/ (19 framework modules, fixed policy_adapter cross-dependency), 94 v5 tests pass ② E2 drop semantically split into drop_in_transit/drop_at_wrong_place/drop_with_interaction (13 skills/29 subtypes, 105 tests pass) ③ v5.1 technical planning complete (removed context replay, velocity limits, human demo collection; April 1 training milestone confirmed) ④ Fixed 5 failing error skills (mujoco.mj_step() to bypass OSC controller / gripper steps / phase labels / target_object propagation), generated 231 scenes and 231 MP4s on an53 ⑤ Three bug fixes (coffee contact params / render_fn propagation across 17 files / output path cleanup), generated a coffee demo video to verify the effect ⑥ First-round D0 generated 207 scenes exposing 5 systemic failure root causes ⑦ Stack body name bug fix + 11 demo videos (switched from action replay to state restore)
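The Stack body-name fix in ⑦ follows the fallback pattern from the takeaways ({name}→{name}_main→{name}_body0). A minimal sketch, assuming a plain list of body names; resolve_body_id and the name table are hypothetical, not the actual env_wrapper code:

```python
import warnings

def resolve_body_id(body_names, name):
    """Try candidate names in order: {name} -> {name}_main -> {name}_body0.

    MuJoCo CompositeBodyObject often registers 'cubeA_main' rather than
    'cubeA'; warn (instead of silently returning -1) when nothing matches.
    """
    for candidate in (name, f"{name}_main", f"{name}_body0"):
        if candidate in body_names:
            return body_names.index(candidate)
    warnings.warn(f"body '{name}' not found under any candidate name")
    return -1

# Hypothetical name table from a composite-object scene.
names = ["world", "table", "cubeA_main", "cubeB"]
print(resolve_body_id(names, "cubeA"))  # -> 2 (falls back to cubeA_main)
print(resolve_body_id(names, "cubeB"))  # -> 3 (direct hit)
```

The WARNING-instead-of-silent-(-1) choice matters because a silent -1 later indexes the wrong body and fails far from the root cause.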
VLA-RoboTwin
✅ pi05 progress prediction experiment training pipeline full-chain debugging and validation
01:40:13.000 | claude_code
Continuing from previous session, completed shape adaptation for five progress fields in the HDF5→LeRobot format conversion script. Fixed three independent issues: CheckpointWeightLoader’s missing_regex not supporting progress layers (added configurable field), incorrect aux_targets shape handling in pi0.py (actual logs confirmed LeRobot squeezes (1,)→scalar, restored [:, None]), and invisible action_loss/aux_loss logging (has_aux=True + logging.info). Validated to step 100 with action_loss=0.37, aux_loss=0.22, loss curves decreasing normally; all four experiment configs ready.
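The (1,)→scalar squeeze fixed above can be reproduced in miniature. This numpy sketch only illustrates the shape arithmetic; the np.stack collation is a stand-in for the LeRobot DataLoader, not its actual code:

```python
import numpy as np

# Per-frame progress stored as a true scalar (np.float32), not np.array([x]).
frames = [np.float32(p) for p in (0.0, 0.25, 0.5, 1.0)]

# Collating scalars yields shape (batch_size,), mirroring the observed
# framework behavior of squeezing shape=(1,) features at DataLoader time.
batch = np.stack(frames)
assert batch.shape == (4,)

# Model-side fix: explicitly restore the feature dimension with [:, None]
# so losses broadcast against (batch_size, 1) predictions.
targets = batch[:, None]
print(targets.shape)  # -> (4, 1)
```

As the takeaway notes, this is framework-level behavior that had to be confirmed from actual logs rather than inferred from the model code.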
UniVLA
🔄 CALVIN data format investigation and evaluation script single-GPU container compatibility fix
12:34:04.000 | claude_code
Clarified the data usage difference between training script (DiskCalvinDataset reads CALVIN npz format directly) and evaluation script (online rollout via calvin_env, using only validation/ for scene initialization). Fixed multiple issues in run_calvin_eval_ddp.py: added --single_gpu mode to bypass Kubernetes container IPv6 DNS resolution issues, fixed GenerateConfig missing window_size field, MAPBloc typo, missing braceexpand dependency, evaluate_policy hardcoded absolute path to another user, adjusted GIF frame rate to 120fps. Script now starts; debugging ongoing.
CalendarPro
✅ Open-source ecosystem research + 7-phase optimization plan design + full implementation (230 tests passing)
21:29:45.000 | claude_code
Three-phase work: ① CLAUDE.md review concluded accurate and comprehensive, no changes needed ② Web search found open-source ecosystem lacks complete implementation of energy-aware scheduling + integrated life management; designed 7-phase comprehensive optimization plan using 4 real misclassification records as root cause evidence ③ Fully implemented Phase 1–7 via parallel sub-Agents (semantic router confidence thresholds + hybrid routing, prompt consolidation + Chinese token correction, provider retry, configurable scheduling scoring, automatic threshold tuning, ThoughtStore cache); all 4 misclassification scenarios fixed, 230 tests all passing.
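The confidence-threshold + fallback routing in ③ (and the Semantic Router takeaway above) can be sketched with toy embeddings. The vectors, threshold value, and fallback_llm stub below are invented for illustration; the real router's routes and values differ:

```python
import numpy as np

# Two hypothetical intent prototypes in a toy 2-D embedding space.
routes = {"schedule": np.array([1.0, 0.0]), "journal": np.array([0.0, 1.0])}

def fallback_llm(text):
    """Stub for the LLM classifier consulted when the router is uncertain."""
    return "journal" if "felt" in text else "schedule"

def route(vec, text, threshold=0.8):
    # Cosine nearest-neighbour ALWAYS returns some route ...
    sims = {k: vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v))
            for k, v in routes.items()}
    best = max(sims, key=sims.get)
    # ... so the confidence threshold is what lets the system say
    # "I'm uncertain" and defer to the fallback LLM.
    return best if sims[best] >= threshold else fallback_llm(text)

print(route(np.array([0.9, 0.1]), ""))              # confident -> schedule
print(route(np.array([0.6, 0.6]), "I felt tired"))  # uncertain -> fallback
```

The same pattern generalizes to any vector-retrieval classifier: nearest-neighbour for the cheap confident path, a stronger model only for the ambiguous residue.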
gadget/Research Toolchain Architecture
✅ common/ package extraction refactor, outputs/ unification, comprehensive CLAUDE.md/README.md/TUTORIAL.md updates, MCP server bug fix
21:11:57.000 | claude_code
Two major architectural refactors: ① Implemented common/ package (6 modules), reduced summarize/llm_backends.py from 516 lines to a 25-line re-export shim, eliminated research_scout.py sys.path hacks, unified ~400 lines of duplicate LLM/IO code ② Consolidated scattered output directories from each tool into outputs/{reports,logs,cache,data}/, created common/paths.py, modified 10+ files, simplified .gitignore to single-line outputs/ ③ Fixed MCP server old function name _load_known_arxiv_ids→_load_known_paper_ids ④ Multiple rounds of updating CLAUDE.md, README.md, TUTORIAL.md (expanded from 10 to 13 chapters in complete Chinese documentation)
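The 516→25-line shim in ① relies on Python's re-export pattern. This self-contained sketch simulates it with in-memory modules; common_llm, llm_backends, and chat are stand-in names for illustration, not the actual gadget modules:

```python
import sys
import types

# Stand-in for the new common/ package module holding the real code.
backend = types.ModuleType("common_llm")
backend.chat = lambda prompt: f"echo: {prompt}"  # dummy implementation
sys.modules["common_llm"] = backend

# The ENTIRE legacy module body after the refactor: a re-export shim.
shim_src = "from common_llm import chat\n__all__ = ['chat']\n"
shim = types.ModuleType("llm_backends")
exec(shim_src, shim.__dict__)
sys.modules["llm_backends"] = shim

# Old call sites importing from llm_backends keep working unchanged.
from llm_backends import chat
print(chat("hi"))  # -> echo: hi
```

The shim keeps backward compatibility during migration while making the real implementation importable from exactly one place, which is what eliminates the duplicated LLM/IO code.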
gadget/Research Profiler
✅ Homepage-based student discovery implementation, new Hugo research section, CLI consolidation (citation graph + three-backend LLM), batch deep profiling of 7 embodied AI scholars
20:53:14.000 | claude_code
Four core workstreams: ① Implemented new homepage_discovery.py module, refactored discover_students into homepage-first four-stage strategy, modified 9 files, resolved S2 co-author analysis completely failing for top-tier researchers ② Added research section to Hugo, separating scholar profiles from bugJournal, implemented deploy_to_hugo(), added --deploy parameter ③ Used research_scout.py as unified CLI entry point, deleted New feature/ duplicate directory, added semantic_scholar citation graph API (get_paper_by_id/citations/references), added profile/citations subcommands, llm.py supporting three backends, three-phase reports automatically running citation analysis on top-5 papers ④ Batch analysis of Mingyu Ding/Ruoshi Liu/Xiaolong Wang/Shuran Song/Yunzhu Li/Yuke Zhu/Chelsea Finn/Sergey Levine/Pieter Abbeel; identified complete advisor relationship network, recognized VIN/TrajOpt/MineDojo award-winning works; annotated warnings for S2 name disambiguation (Xiaolong Wang etc.) with deduplication suggestions
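The "citations for ranking, not scoring" rule from the takeaways amounts to a filter-then-sort with two separate keys. A minimal sketch; the paper titles, relevance values, and cutoff are invented:

```python
# Hypothetical candidates: (title, relevance in [0, 1], citation count).
papers = [
    ("highly cited, off-topic", 0.30, 9000),
    ("on-topic, modest impact", 0.90, 120),
    ("on-topic, well cited", 0.85, 800),
]

RELEVANCE_CUTOFF = 0.5  # scoring: relevance alone decides inclusion

# Filter by relevance FIRST, then order the survivors by citations, so a
# highly-cited but irrelevant paper can never crowd out on-topic work.
shortlist = sorted((p for p in papers if p[1] >= RELEVANCE_CUTOFF),
                   key=lambda p: p[2], reverse=True)

for title, rel, cites in shortlist:
    print(f"{title}: relevance={rel}, citations={cites}")
```

Folding citations into a single combined score would instead let the 9000-citation off-topic paper dominate, which is the distortion the takeaway warns against.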
Token Usage
Overview
| Metric | Value |
|---|---|
| Total Tokens | 135,295,142 |
| Input Tokens | 103,531 |
| Output Tokens | 406,349 |
| Cache Created | 9,686,371 |
| Cache Read | 125,098,891 |
| Cache Hit Rate | 92.8% |
| Total Cost (USD) | $100.6978 |
Model Breakdown
| Model | Input | Output | Cache Created | Cache Read | Cost | Share |
|---|---|---|---|---|---|---|
| claude-opus-4-6 | 58,259 | 235,273 | 5,485,227 | 97,079,253 | $88.9954 | 88.4% |
| claude-haiku-4-5-20251001 | 45,076 | 170,341 | 3,204,784 | 26,770,930 | $7.5799 | 7.5% |
| claude-sonnet-4-6 | 196 | 735 | 996,360 | 1,248,708 | $4.1226 | 4.1% |
Usage by Device
| Device | Total Tokens | Input | Output | Cost |
|---|---|---|---|---|
| DCC | 16,204,814 | 35,329 | 53,093 | $12.8258 |
| tianhe | 43,863,063 | 37,017 | 130,536 | $30.4748 |
| TzJsDesktop | 75,227,265 | 31,185 | 222,720 | $57.3972 |