VLMs handle 2D imagery well, but the visual world is 3D. MindJourney provides better viewpoints of real-world scenarios and ultimately aims to forecast how scenes change over time, according to the Microsoft researchers.
MindJourney “sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration,” the researchers wrote in a paper.
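The loop the researchers describe can be illustrated with a small Python sketch. This is not MindJourney's code: the `Pose` type, the three action names, the step sizes, and the stub `fake_world_model` and `fake_vlm` functions are all hypothetical stand-ins, assumed here only to show the shape of "plan a short camera trajectory, synthesize a view at each step, then reason over all the views."

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pose:
    """Hypothetical camera pose: position on the ground plane plus heading."""
    x: float
    z: float
    yaw_deg: float


def step(pose: Pose, action: str) -> Pose:
    """Apply one camera move. The action set and step sizes are assumptions."""
    if action == "forward":
        rad = math.radians(pose.yaw_deg)
        return Pose(pose.x + 0.25 * math.sin(rad),
                    pose.z + 0.25 * math.cos(rad),
                    pose.yaw_deg)
    if action == "turn_left":
        return Pose(pose.x, pose.z, pose.yaw_deg + 30.0)
    if action == "turn_right":
        return Pose(pose.x, pose.z, pose.yaw_deg - 30.0)
    raise ValueError(f"unknown action: {action}")


def explore(trajectory: List[str],
            render_view: Callable[[Pose], str],
            start: Pose = Pose(0.0, 0.0, 0.0)) -> List[str]:
    """Walk the trajectory, asking the world model for a view at each pose."""
    pose = start
    views = [render_view(pose)]  # the original viewpoint counts as evidence too
    for action in trajectory:
        pose = step(pose, action)
        views.append(render_view(pose))
    return views


def fake_world_model(pose: Pose) -> str:
    """Stub: a real world model would synthesize an image for this pose."""
    return f"view@({pose.x:.2f},{pose.z:.2f},{pose.yaw_deg:.0f})"


def fake_vlm(question: str, views: List[str]) -> str:
    """Stub: a real VLM would reason over the question and all gathered views."""
    return f"answered '{question}' using {len(views)} views"


views = explore(["forward", "turn_left"], fake_world_model)
print(fake_vlm("Which object is behind the chair?", views))
```

In a real system the two stubs would be replaced by an actual world model and VLM, and the trajectory would be chosen during the exploration rather than fixed up front.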
MindJourney’s technologies could improve assistive robots and remote inspection, and enrich virtual and augmented reality experiences, the researchers wrote in the paper.