The most capable open-source AI model with visual capabilities yet could see more developers, researchers, and startups develop AI agents that can perform useful tasks on your computer for you.
Released today by the Allen Institute for AI (Ai2), the Multimodal Open Language Model, or Molmo, can interpret images as well as converse via a chat interface. That means it can make sense of what is on a computer screen, potentially helping an AI agent perform tasks such as browsing the web, navigating file directories, and drafting documents.
“With this release, many more people can deploy a multimodal model,” says Ali Farhadi, CEO of Ai2, a research organization based in Seattle, Washington, and a computer scientist at the University of Washington. “This should be an enabler for next-generation applications.”
So-called AI agents are widely touted as the next big thing in AI, with OpenAI, Google, and others racing to develop them. The term has become a buzzword of late, but the grand vision is for AI to go well beyond chatting to reliably take complex, sophisticated actions on computers when given a command. That capability has yet to materialize at any kind of scale.
Some powerful AI models already have visual capabilities, including OpenAI’s GPT-4, Anthropic’s Claude, and Google DeepMind’s Gemini. These models can be used to power some experimental AI agents, but they are hidden from view, accessible only through a paid application programming interface, or API.
Meta has released a family of AI models called Llama under a license that restricts their commercial use, but it has yet to provide developers with a multimodal version. Meta is expected to announce several new products, perhaps including new Llama AI models, at its Connect event today.
“Having an open-source multimodal model means that any startup or researcher who has an idea can try to do it,” says Ofir Press, a postdoctoral fellow at Princeton University who works on AI agents.
Press says that because Molmo is open source, developers will be able to more easily fine-tune their agents for specific tasks, such as working with spreadsheets, by providing additional training data. Models like GPT-4 can be fine-tuned only to a limited degree through their APIs, whereas a fully open model can be modified extensively. “When you have an open source model like this, you have a lot more options,” Press says.
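To make that concrete, here is a minimal sketch of what adapter-style fine-tuning on an open checkpoint could look like, using the Hugging Face transformers and peft libraries. The checkpoint name and the attention-projection module names are illustrative assumptions, not details from Ai2’s release.

```python
# A rough sketch of preparing an open multimodal checkpoint for LoRA
# fine-tuning. The model ID and target_modules are assumptions for
# illustration; the real names depend on the published checkpoint.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumed Hugging Face checkpoint name

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,    # open releases often ship custom model code
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach low-rank adapters so only a small set of new weights is trained
# on the task-specific data (e.g., spreadsheet screenshots and commands).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

The appeal of the adapter approach is that only a small fraction of the weights is trained, so a startup can specialize the model for a narrow task without the cost of retraining all of its parameters; none of this is possible when a model sits behind a closed API.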
Ai2 is releasing several sizes of Molmo today, including a 70-billion-parameter model and a 1-billion-parameter model that is small enough to run on a mobile device. A model’s parameter count refers to the number of units it contains for storing and manipulating data, and it roughly corresponds to the model’s capabilities.
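For a rough sense of what a parameter count means in practice, the parameters of any checkpoint loaded through transformers can be tallied directly; the model ID below is the same assumed one as in the sketch above.

```python
# Tally the parameters of a loaded checkpoint (model ID is an assumption).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924",   # assumed Hugging Face checkpoint name
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # halves memory use relative to float32
)

total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e9:.1f}B parameters")
```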