Twelve Labs is a video understanding AI company that provides APIs for searching, analysing, and reasoning over video content. Unlike tools that only transcribe audio or extract still frames, it builds multimodal foundation models that combine visual actions, on-screen objects, spoken words, and emotional context into a single representation, which enables semantic search and question answering across video.

The company was founded in 2021 by a Korean-American team including CEO Jae Lee, Aiden Lee, SJ Kim, Soyoung Lee, and Dave Chung. Twelve Labs is headquartered in the United States with a significant engineering presence in Korea. The team has positioned the company as a research-led foundation model lab focused specifically on video, a modality that has historically been harder to model than text or images.

Funding totals roughly $107M, including a $50M Series A co-led by NEA and NVIDIA's NVentures in June 2024, and a $30M strategic round in late 2024 with participation from Databricks, SK Telecom, Snowflake Ventures, HubSpot Ventures, and In-Q-Tel. The backing from NVIDIA and from data-platform vendors is notable: it signals both technical credibility and an intent to embed the models inside major data and ML platforms.

Twelve Labs' two flagship models are Marengo, a multimodal embedding model for any-to-any search across video, image, audio, and text, and Pegasus, a video-language model that answers questions and generates descriptions from video content. Marengo powers semantic video search and content moderation use cases; Pegasus powers analytic and generative video understanding, including chapter generation, highlight detection, and detailed scene description.
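To make the idea of any-to-any search concrete, here is a minimal sketch of how retrieval over a shared embedding space generally works: every item, whatever its modality, is mapped to a vector, and a query is answered by ranking items by cosine similarity. This is not the Twelve Labs API or Marengo's actual method. The vectors, item names, and the `search` helper are hypothetical and hand-written for illustration; in practice the vectors would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Return the ids of the top_k items most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), item_id)
        for item_id, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical toy embeddings. A real embedding model would produce
# much higher-dimensional vectors; these are made up for the example.
index = {
    "video:goal_celebration": [0.9, 0.1, 0.0],
    "image:stadium_crowd": [0.7, 0.3, 0.1],
    "audio:press_conference": [0.1, 0.2, 0.9],
}

# Stand-in for the embedding of a text query such as
# "player celebrating a goal" (also invented).
query = [1.0, 0.0, 0.0]

results = search(query, index)
print(results)
```

Because video, image, audio, and text all share one vector space in this scheme, the same ranking function serves every query and result modality; only the step that produces the vectors differs.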

The company sells primarily through APIs and a developer platform, with enterprise customers in media, sports, advertising, and intelligence applications. It competes with Google's video AI offerings, Amazon Rekognition, and open-source video-language models, but its bet is that purpose-built video foundation models, rather than image models stitched together with audio transcription, will define the category.