A mechanical hand is on display at the Robot Mall, the world’s first embodied intelligent robot 4S store, on August 13, 2025 in Beijing, China.
Vcg | Visual China Group | Getty Images
BEIJING — Alibaba Cloud is investing in a new type of artificial intelligence designed to better replicate the real world using a different approach from chatbots such as OpenAI’s ChatGPT.
The shift recognizes the limits of “large language models” trained primarily on text. Instead, developers are starting to focus more on “world models” built on videos and real-life physical scenarios.
To jump on the trend, Alibaba led a 2 billion yuan ($290 million) investment in ShengShu, the startup behind the AI video generation tool Vidu, the company announced Friday. TAL Education and Baidu Ventures also participated in the series B funding round.
The investment comes about two months after ShengShu raised 600 million yuan from Qiming Venture Partners and other backers. The startup declined to disclose its valuation.
ShengShu said the latest funding will support the development of a “general world model” that uses AI to bridge two currently separate domains: the digital world of games and AI-generated video, and the physical world of autonomous driving and robots.
“ShengShu believes that a general world model, built on multimodal data such as vision, audio, and touch, more naturally captures how the physical world works than large language models,” the three-year-old startup said in a statement.
“We aim to connect perception and action,” Zhu Jun, founder of ShengShu, added in a statement, saying this would allow AI systems to model and predict real-world behavior more consistently.
ShengShu’s latest Vidu Q3 Pro model, released in January, ranks among the top 10 AI models for generating videos from text and images, according to Artificial Analysis.
The company launched Vidu globally months before OpenAI made its now-shuttered Sora tool for AI video generation widely available. Chinese short-video companies Kuaishou and ByteDance have also released competing AI video-generation tools.
World model competition
Alibaba has expanded its investments in related startups.
The Chinese tech giant and Baidu Ventures last month led a $50 million investment in Tripo AI, a platform that uses AI to quickly generate digital 3D models from photographs. Tripo said it is also moving away from techniques used by language models toward AI tools grounded in physical space and is developing its own world model.
In September, Alibaba also led a $60 million investment in PixVerse, which released an AI world model earlier this year that allows users to direct how a video unfolds while it is being generated.
Alibaba, which got its start in e-commerce, has also released free, open-source AI models for video generation and, in February, launched one for powering robots.
ShengShu said Friday it has strategic partnerships with companies developing embodied AI (systems such as humanoid robots that interact with the physical world) for use across industrial, commercial and home settings.
World models are critical for robotics because the technology needs more than LLMs to work, Kevin Kelly, co-founder of the U.S. tech magazine Wired, wrote last month on his Substack.
Ultimately, to replicate human intelligence, AI will need three things: reasoning, an understanding of the physical world and continuous learning, Kelly said. While AI capable of continuous learning hasn’t been developed yet, LLM-powered chatbots have created the knowledge element, he said, making world models a key area requiring a breakthrough.

