Yurun Chen1,2,   Tianyuan Gao3,   Yizhong Ge1,   Shikun Ban4
Yizhou Wang3,   Hongkai Xiong2,5✉,   Wenjun Zeng1,6✉,   Wentao Zhu1,6✉
1Eastern Institute of Technology, Ningbo   2Shanghai Jiao Tong University   3Peking University
4Carnegie Mellon University   5East China Normal University   6Ningbo Institute of Digital Twin
✉ Corresponding authors

TL;DR. We build a framework for a humanoid robot to distinguish itself from humans or similar robots through proprioceptive-visual correspondence, without any predefined kinematic model or identity label. This self-other distinction then bootstraps a predictive self-model, enabling various real-world applications.

Overview

Distinguishing self from others is a fundamental capability for social intelligence. Before an agent can imitate a demonstrator, coordinate with a partner, or simply avoid colliding with a bystander, it must resolve a prior question: Which body is mine?

Cognitive science points to proprioceptive-visual correspondence as a key mechanism.

We therefore frame self-other distinction as an operational self-instance assignment problem:

From several visible bodies, the robot must select the candidate whose configuration matches its proprioceptive state, without identity labels or prior knowledge of its morphology.
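The assignment step can be illustrated with a minimal sketch. It assumes the proprioceptive state and each visible body have already been encoded into a shared embedding space; the function names, the cosine score, and the toy vectors below are illustrative assumptions, not the framework's actual encoders or scoring rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def select_self(proprio_embedding, candidate_embeddings):
    """Return (index, scores) for the candidate that best matches proprioception."""
    scores = [cosine(proprio_embedding, c) for c in candidate_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy embeddings, made up for illustration: candidate 1 points in nearly the
# same direction as the proprioceptive embedding, so it is assigned as "self".
proprio = [0.9, 0.1, 0.4]
candidates = [[-0.2, 0.8, 0.1], [0.8, 0.2, 0.5], [0.1, -0.7, 0.6]]
index, scores = select_self(proprio, candidates)
```

No identity label enters the selection: the only signal is how well each candidate's visual embedding agrees with the robot's own proprioceptive embedding.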

Self-other distinction problem setting

Humanoid robots are entering social environments, where they coexist with humans and morphologically identical peers. To act effectively in such settings, they need both self-other distinction, to identify which body in the scene is their own, and self-modeling, to acquire a predictive representation of that body and how it changes with action.

Self-other distinction

Self-other distinction task

Self-modeling

Self-modeling task

From only proprioceptive states and visual observations, and without any predefined kinematic model or identity label, our framework achieves robust self-other distinction, learns a self-model that predicts 3D body occupancy, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting.

Pipeline from proprioceptive-visual correspondence to self-modeling
Pipeline overview

Results

Self-other distinction

We train the self-other distinction model with self-supervised contrastive learning, exploiting the intrinsic correspondence between proprioception and vision. Across diverse poses, the model consistently distinguishes self from others, selecting the robot rather than the human distractor.
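As a rough illustration of contrastive training on proprioceptive-visual pairs, the sketch below computes a plain InfoNCE loss in which the i-th proprioceptive embedding and the i-th visual embedding form the positive pair. This is a generic formulation under assumed inputs; it does not include the attention guidance used in the actual method, and the temperature value is an arbitrary choice.

```python
import math

def info_nce(proprio, visual, temperature=0.1):
    """Mean InfoNCE loss; proprio[i] and visual[i] are treated as the positive pair."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    p = [normalize(v) for v in proprio]
    q = [normalize(v) for v in visual]
    loss = 0.0
    for i, pi in enumerate(p):
        # Similarity of proprioceptive sample i to every visual sample in the batch.
        logits = [sum(a * b for a, b in zip(pi, qj)) / temperature for qj in q]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(p)

# Toy batches: matched pairs give a low loss, mismatched pairs a high one.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each proprioceptive embedding toward the visual embedding recorded at the same moment and pushes it away from the others in the batch, which is what makes similarity usable as a self-selection score.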

Self-modeling

Using the selected robot masks as supervision, we learn a self-model that predicts the 3D body occupancy field from proprioception. Its predictions are coherent across poses: the visualized point cloud tracks the robot and preserves the self-other boundary.

From self-model to physical interaction

Target reaching, collision-aware motion planning, and human-to-robot motion retargeting tasks
Each real-world task can be cast as a spatial constraint on the robot’s body: a target to reach, an obstacle to avoid, or a demonstrated pose to imitate. Because the self-model provides a differentiable mapping from joint configurations to 3D body occupancy, these constraints translate into gradients over joint angles. We evaluate this approach on three tasks: target reaching, collision-aware motion planning, and human-to-robot motion retargeting.
Target reaching

A human moves a night-light to different positions, and the robot optimizes its seven left-arm joint angles to bring the center of a hand-specific self-model to the target.
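The optimization can be sketched as gradient descent on the squared distance between a predicted hand center and the target. Everything in the block is an illustrative assumption: `hand_center` is a hypothetical three-joint planar function standing in for the learned hand-specific self-model (the framework itself learns this mapping without a kinematic model), and the gradient is taken by finite differences rather than by backpropagation through a network.

```python
import math

def hand_center(q, link_lengths=(0.3, 0.25, 0.1)):
    """Hypothetical stand-in for the learned hand self-model: joint angles -> 2D center."""
    x = y = angle = 0.0
    for qi, length in zip(q, link_lengths):
        angle += qi
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y

def reach_cost(q, target):
    """Squared distance between the predicted hand center and the target."""
    hx, hy = hand_center(q)
    return (hx - target[0]) ** 2 + (hy - target[1]) ** 2

def numeric_grad(f, q, eps=1e-6):
    """Central-difference gradient of f with respect to each joint angle."""
    grad = []
    for i in range(len(q)):
        hi, lo = list(q), list(q)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def reach(target, q0, lr=0.5, steps=2000):
    """Plain gradient descent over joint angles."""
    q = list(q0)
    for _ in range(steps):
        g = numeric_grad(lambda z: reach_cost(z, target), q)
        q = [qi - lr * gi for qi, gi in zip(q, g)]
    return q

target = (0.35, 0.30)  # made-up target position, in meters
q_final = reach(target, q0=[0.3, 0.3, 0.3])
final_error = math.sqrt(reach_cost(q_final, target))
```

With a learned, differentiable self-model in place of `hand_center`, the same loop would run over the seven left-arm joint angles, and the gradient would come from automatic differentiation.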

Collision-aware motion planning

The target is behind a board with a circular aperture, so a direct trajectory would collide. The robot uses learned body and hand occupancy inside a motion planner to find a collision-free path.
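A planner of this kind needs a collision test between predicted body occupancy and the obstacle. The sketch below shows one possible test for a board with a circular aperture; the board position, thickness, aperture radius, and sample points are all made-up values, the board is treated as an unbounded slab, and the planner that would call this check is not shown.

```python
import math

def hits_board(point, board_x=0.5, thickness=0.02, center=(0.0, 1.0), radius=0.12):
    """True if a 3D point lies in the board material, i.e. in the slab but outside the hole."""
    x, y, z = point
    inside_slab = abs(x - board_x) <= thickness / 2
    outside_hole = math.hypot(y - center[0], z - center[1]) > radius
    return inside_slab and outside_hole

def path_is_collision_free(body_points_per_waypoint):
    """A path is free if no occupied point at any waypoint touches the board."""
    return not any(
        hits_board(p) for points in body_points_per_waypoint for p in points
    )

# Each path is a list of waypoints; each waypoint lists points that a self-model
# might mark as occupied (a single point per waypoint here, for brevity).
through_hole = [[(0.40, 0.0, 1.0)], [(0.50, 0.02, 1.01)], [(0.60, 0.0, 1.0)]]
direct = [[(0.40, 0.3, 0.8)], [(0.50, 0.3, 0.8)], [(0.60, 0.3, 0.8)]]
```

Here `path_is_collision_free(through_hole)` holds while the direct path is rejected, mirroring the situation in which the straight-line trajectory would hit the board.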

Human-to-robot motion retargeting

Given a human demonstration, we extract 3D keypoints for body parts such as hands and feet, map them to robot-compatible targets, and optimize the 29 robot joints so that each predicted part reaches its target. The resulting robot joint sequence reproduces the human motion.

Methods

Self-other distinction method
Self-other distinction. The model compares proprioceptive and visual embeddings, selects the self-mask, and trains the alignment with attention-guided contrastive learning.
Kinematics-free self-modeling method
Kinematics-free self-modeling. The selected self-mask supervises a pose-conditioned density and visibility field, learned through bounded volumetric mask rendering.
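One way to read volumetric mask rendering is as standard volume-rendering opacity supervised by a binary mask. The sketch below accumulates density along a single camera ray, restricted to a bounded segment, and compares the resulting opacity with the mask value at that pixel. The sampling bounds, density values, and binary cross-entropy loss are assumptions for illustration, and the visibility component of the field is not modeled.

```python
import math

def render_mask_value(densities, step):
    """Opacity of one ray: 1 - exp(-sum(sigma_i * step)), with negative densities clamped."""
    optical_depth = sum(max(d, 0.0) for d in densities) * step
    return 1.0 - math.exp(-optical_depth)

def mask_bce(predicted, target, eps=1e-7):
    """Binary cross-entropy between rendered opacity and a mask pixel in {0, 1}."""
    p = min(max(predicted, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Samples are taken only inside a bounded segment [near, far] of the ray.
near, far, n_samples = 0.5, 2.5, 8
step = (far - near) / n_samples

# Made-up densities that a pose-conditioned field might return along two rays.
ray_through_body = [0.0, 0.0, 6.0, 9.0, 7.0, 0.0, 0.0, 0.0]
ray_through_empty = [0.0] * n_samples

alpha_body = render_mask_value(ray_through_body, step)
alpha_empty = render_mask_value(ray_through_empty, step)
```

A ray that crosses dense samples renders as nearly opaque and matches a self-mask pixel of 1, while an empty ray renders as transparent; the loss penalizes the field wherever rendered opacity disagrees with the selected mask.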

Summary Video