Vision–Language Models for Robot Vision and Navigation

This project employs vision-language models (VLMs) for efficient robot perception and dynamic path planning. First, I fuse probabilistic Hough transforms with semantic information about the surrounding scene, queried from a VLM, to sharpen edge detection and obstacle identification. My system then partitions the environment and ranks candidate regions, using the same VLM scene semantics, to select the best path to follow. Next, a VLM-guided terrain analysis assigns friction scores to mixed floor surfaces and tailors per-segment motor power profiles for optimal traction. Finally, the robot continuously streams egocentric video frames to a VLM, extracting real-time proximity and traversability information so it can re-evaluate and adapt its path on the fly.
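
As a rough sketch of the first step, the snippet below fuses a probabilistic Hough transform with scene semantics returned by a VLM. OpenCV's cv2.Canny and cv2.HoughLinesP stand in for the edge detector and Hough transform; the query_vlm helper, its response format, and the specific thresholds are illustrative assumptions rather than the exact interface used on the robot.

    import cv2
    import numpy as np

    def query_vlm(frame, prompt):
        """Stand-in for the actual VLM call (assumed interface): returns scene
        semantics as a dict. Here it returns a fixed mock response so the
        sketch runs end to end."""
        return {"clutter": "high",
                "obstacle_regions": [[0, 0, frame.shape[1], frame.shape[0]]]}

    def detect_obstacle_edges(frame):
        # Ask the VLM which image regions contain likely obstacles and how
        # cluttered the scene is (assumed response format).
        semantics = query_vlm(frame, "List obstacle bounding boxes and clutter level.")

        # Adjust Canny thresholds based on the VLM's clutter assessment
        # (illustrative policy, not a tuned one).
        low, high = (30, 90) if semantics["clutter"] == "high" else (50, 150)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, low, high)

        # Probabilistic Hough transform for line segments on the edge map.
        segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                                   threshold=40, minLineLength=30, maxLineGap=10)
        if segments is None:
            return []

        # Keep only segments whose midpoints fall inside VLM-flagged obstacle regions.
        obstacle_segments = []
        for x1, y1, x2, y2 in segments[:, 0]:
            mx, my = (x1 + x2) / 2, (y1 + y2) / 2
            for bx, by, bw, bh in semantics["obstacle_regions"]:
                if bx <= mx <= bx + bw and by <= my <= by + bh:
                    obstacle_segments.append((x1, y1, x2, y2))
                    break
        return obstacle_segments

    # Quick demo on a synthetic frame with a bright box acting as an obstacle.
    frame = np.zeros((240, 320, 3), dtype=np.uint8)
    cv2.rectangle(frame, (100, 80), (220, 180), (255, 255, 255), thickness=-1)
    print(detect_obstacle_edges(frame))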

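To make the terrain-handling step concrete, here is a minimal sketch of how VLM-assigned friction scores could be mapped to per-segment motor power profiles. The PathSegment structure, the [0, 1] score range, and the linear power-scaling rule are illustrative assumptions, not the robot's actual control code.

    from dataclasses import dataclass

    @dataclass
    class PathSegment:
        surface: str           # VLM surface label, e.g. "carpet" or "gravel"
        friction_score: float  # VLM-assigned score in [0, 1]; higher = more grip
        length_m: float

    def power_profile(segments, base_power=0.8, min_power=0.3):
        """Scale motor power per segment: low-friction surfaces get gentler
        throttle to limit wheel slip, high-friction surfaces stay near base power.
        The linear scaling rule is an assumption for illustration."""
        profile = []
        for seg in segments:
            power = min_power + (base_power - min_power) * seg.friction_score
            profile.append({"surface": seg.surface,
                            "length_m": seg.length_m,
                            "power": round(power, 2)})
        return profile

    # Example: a path crossing carpet, smooth tile, and loose gravel.
    path = [PathSegment("carpet", 0.9, 2.0),
            PathSegment("tile", 0.6, 1.5),
            PathSegment("gravel", 0.3, 3.0)]
    print(power_profile(path))

In practice the friction scores, power bounds, and scaling rule would come from the VLM terrain analysis and motor calibration rather than the fixed constants used here.
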
Videos and Photos


Below are videos of the system deployed on a layered obstacle course with rough floor conditions. Two different dynamically generated paths are shown.