The generative AI hype has vastly accelerated the development of new AI models, many of which are not generative Large Language Models (LLMs). AI platform Hugging Face now lists 380,000 open source models, up from 80,000 a year ago. And this does not include the thousands of proprietary models making their way to enterprises.
The big challenge is that each model comes with tradeoffs in performance, accuracy, cost, transparency, bias, and explainability, and those tradeoffs vary by task and data type. Martian hopes to tame this messy landscape with a new architecture, called a model router, that dynamically routes each AI workload to the best model for the job based on its requirements.
The new approach promises to increase performance, reduce total cost of ownership, and smooth migration to newer models. In some cases, Martian reports achieving similar performance at one-half to one-hundredth the cost of compute-heavy models like OpenAI’s GPT-4.
Over the last several years, dozens of new tools for observing, securing, and managing model risks have emerged. These have been characterized under new categories like AI Trust, Risk Management and Security (TRiSM), AI Observability, and Model Ops. These tend to focus on managing individual models, whereas the new approach looks at governing access to a collection of models. Martian co-founder and co-CEO Etan Ginsberg says:
“Most TRiSM, AI Observability, and Model Ops tools are based around using a single model. They might tell you about the average performance of a model on a particular task, for example. Martian augments that by allowing the use of multiple models – enabling increased performance, reducing the total cost of AI ownership, and future-proofing. We also enable greater security and resilience by letting companies select what models they use and providing backups.”
Mapping the model landscape
One big challenge in working with so many different models lies in efficiently and automatically distilling the merits of various models across tasks. A key Martian innovation is an approach they call model mapping, which provides a general framework for a set of techniques used to understand models. The core idea is to convert each model into other representations that make its key properties easier to analyze.
Researchers and engineers have long experimented with model distillation, which is the process of training a small model to mimic a larger one for a given use case. Many engineers are already distilling open-source models from larger, closed-source ones. Companies also distill larger models into smaller ones to improve performance or reduce costs while achieving the required accuracy.
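At its core, distillation trains the student to reproduce the teacher's softened output distribution. The sketch below illustrates the standard knowledge-distillation objective in plain Python; it is a generic textbook formulation, not Martian's method, and all logits are invented for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature yields softer targets."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's:
    the quantity a distilled (student) model is trained to minimize."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(t, s))

# Invented logits: a student that tracks the teacher incurs a lower loss
# than one that disagrees, so minimizing this loss pulls the small model
# toward the large model's behavior on the target task.
teacher = [3.0, 1.0, 0.2]
close_student = [2.8, 1.1, 0.3]
far_student = [0.2, 1.0, 3.0]
print(distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student))
```

In practice the student is a smaller network trained with gradient descent on this loss over many examples; the point here is only the shape of the objective.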
What’s novel about the Martian approach is that it can distill a directly analyzable representation not just from open-source models but also from closed models offered as a service, by analyzing input-output pairs from the provider’s API. The result is that the technique can help efficiently analyze tradeoffs across both open AI models and proprietary services.
Martian’s other co-founder and co-CEO, Shriyash Upadhyay, said their approach builds on AI model interpretability research from others in this space. For example, Watchful pioneered techniques for understanding how changes in prompts can affect LLM results, and Anthropic has done some interesting work on decomposing LLMs into understandable components.
While most maps tend to be smaller than the territory they represent, in the case of LLMs, larger maps that spell out the different roles of a given neuron can be easier to interpret. Upadhyay says they take this idea one step further:
We can map models into programs. This conversion allows us to read out the algorithms that these models are implementing and allows us to apply all the tools normally used to understand programs (refactoring, testing, dynamic/static analysis, IDEs, formal verification, etc.) in understanding AI models instead. We think this is a more scalable approach, and one which can enable entirely new forms of tooling for AI.
Some of the model properties that can currently be represented in these model maps include cost, speed, and performance metrics such as accuracy, bias, and uncertainty. Users set the criteria they care about, such as baseline performance and how much they are willing to spend for each 10% improvement, and the model router can dynamically route a request to the best model for the job. The routing system can also work with existing enterprise efforts to improve results through fine-tuning, prompt engineering, and retrieval-augmented generation (RAG).
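As a rough illustration of the routing idea, a router can score candidate models against user-set criteria and dispatch each request to the best-scoring one. This is a minimal sketch, not Martian's implementation; the model names, accuracy figures, and costs below are all invented:

```python
# Hypothetical model catalog: each entry carries the mapped properties
# (accuracy, cost) a router would use to score candidates.
MODELS = {
    "large-proprietary": {"accuracy": 0.92, "cost_per_1k_tokens": 0.060},
    "mid-open-source":   {"accuracy": 0.88, "cost_per_1k_tokens": 0.002},
    "small-distilled":   {"accuracy": 0.80, "cost_per_1k_tokens": 0.0004},
}

def route(models, min_accuracy, cost_weight):
    """Pick the model maximizing accuracy minus a cost penalty,
    subject to a minimum-accuracy floor set by the user."""
    scores = {
        name: props["accuracy"] - cost_weight * props["cost_per_1k_tokens"]
        for name, props in models.items()
        if props["accuracy"] >= min_accuracy
    }
    if not scores:
        raise ValueError("no model meets the accuracy floor")
    return max(scores, key=scores.get)

# A quality-sensitive task tolerates cost; a bulk task penalizes it heavily.
print(route(MODELS, min_accuracy=0.90, cost_weight=1.0))    # large-proprietary
print(route(MODELS, min_accuracy=0.75, cost_weight=100.0))  # small-distilled
```

A production router would also weigh latency, bias, and uncertainty estimates, and would score per request rather than from a static table, but the selection logic follows the same shape.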
It’s important to note that this approach is still a work in progress, and the team is looking to improve it going forward. At the moment, the new tools can help select models that outperform GPT-4, currently one of the leaders but also among the most costly, on specific tasks at a significantly reduced cost.
More importantly, new tools for model mapping across the industry could give enterprises more control over their AI services, support new governance frameworks, and improve safety measures. Ginsberg says:
If we look at many of the problems in deploying and ensuring the safety of AI models, they come down to the fact that AI models are black boxes. When companies are hesitant to adopt AI in mission-critical systems, it’s because they’re black boxes that nobody can trust. When LLMs hallucinate and give undesirable output, it’s because they’re black boxes that nobody can fix. When people are worried about AI being unsafe, being difficult to regulate or protect society from, and being a potential threat to the very existence of humanity – it’s because they’re black boxes that nobody can understand.
Another long-term goal is to inspire the AI industry to think about managing AI models the same way we manage traditional app development efforts. Upadhyay says:
In a world where models are as transparent as programs, we can have much more powerful testing. For example, programs today can be formally verified, i.e., we can write computer-verifiable proofs that prove these programs are correct. Imagine if enterprises could have that level of certainty and security in their models.
The generative AI hype will quickly lead to AI sticker shock as enterprises discover their buzzy new AI app dramatically increases costs. The same thing happened as enterprises rushed to take advantage of new cloud services with unforeseen cost implications. This eventually led to a market for cloud cost optimization tools to help mitigate these costs in cloud infrastructure. At first, these were relatively manual but grew more dynamic and automated over time.
A similar evolution will occur with AI cost optimization strategies. But this is a very hard problem to solve. Currently, there are at least five metrics for quantifying hallucinations and dozens more for measuring other important properties of AI models related to security, bias, confidentiality, and risk management. NIST AI standardization efforts will certainly help narrow the field to the best metrics.
What’s more, the best model available when you launch a service may not remain the best. Just last week, Google DeepMind researchers rolled out a new weather model that can outperform classic weather models while running for a few minutes on a single machine, where predecessors took hours on hundreds of machines. Tools that help automate migration to these new models will be critical for capturing such improvements cost-effectively. Martian is onto a new category of AI cost optimization tools, but it will take time to prove itself in the market.