Gimlet Labs joins MLCommons to shape agentic inference
Tue, 30th Jun 2026
Gimlet Labs has joined MLCommons to help develop new benchmarks for agentic inference.
The move brings Gimlet Labs into the industry consortium behind the MLPerf benchmark suite, widely used to compare machine learning systems.
MLCommons brings together companies, researchers and other groups to create shared measures for artificial intelligence quality, performance and safety. Its MLPerf benchmarks provide peer-reviewed, repeatable tests for AI hardware and software.
As part of the arrangement, Gimlet Labs will support work on vendor-agnostic standards for machine learning performance across industry and research. It will also contribute to benchmark development for agentic inference, an area drawing more attention as AI systems take on increasingly complex, multi-step tasks.
Benchmark push
MLPerf benchmarks are set by a working group that defines the model to run, the data set for testing, the rules on permitted model changes and the way hardware performance is measured. The process is intended to give vendors and researchers a common basis for comparison.
Gimlet Labs' involvement reflects a broader shift in AI infrastructure discussions toward mixed hardware environments rather than single-chip or single-vendor approaches. That shift has grown more prominent as inference workloads expand and diversify.
David Kanter commented on the decision to bring Gimlet Labs into the consortium.
"Gimlet Labs is doing something truly transformative with their multi-silicon approach for faster and more efficient inference. Because they orchestrate AI workloads across diverse hardware, the Gimlet Labs team has an invaluable perspective on agentic inference and what it needs from hardware. We're thrilled to welcome them officially to MLCommons and to help us accelerate progress in inference performance," said David Kanter, Founder of MLCommons and Head of MLPerf.
Kanter's remarks highlight the role MLCommons sees for companies that work across different types of processors, at a time when AI developers are seeking more consistent ways to measure real-world inference performance.
Industry shift
Gimlet Labs describes itself as an applied AI research and product company. Its work centres on inference infrastructure, including workload orchestration across different hardware and research in areas such as automated GPU kernel generation and heterogeneous execution.
The market is moving toward infrastructure that assigns different phases of inference to different hardware, according to Gimlet Labs. That trend is closely tied to the rise of agentic systems, which often involve more complex chains of reasoning, retrieval and action than earlier AI applications.
Zain Asgar outlined the company's view of that shift.
"The industry is moving away from homogeneous infrastructure and toward heterogeneous systems that target each phase of inference to the best hardware as agentic inference becomes the dominant software workload. We're collaborating with MLCommons on a new set of benchmarks that reflect these growing workloads and the unique infrastructure that powers them," said Zain Asgar, Co-Founder and Chief Executive Officer of Gimlet Labs.
That work could matter for AI buyers as well as chipmakers and cloud providers. Benchmarking has become central to how companies assess system performance, cost and suitability for specific workloads, especially as inference drives more commercial AI spending.
MLCommons has more than 125 members and affiliates. The consortium began with MLPerf in 2018 and has since expanded its work on AI measurement to cover broader questions of transparency, accuracy, safety, speed and efficiency.
For Gimlet Labs, membership provides a formal role in shaping the standards many in the sector use to judge machine learning systems. For MLCommons, the addition brings in a company focused on the technical challenge of running AI workloads across varied hardware rather than relying on a single architecture.
The development also highlights how benchmarking is changing. As AI systems grow more complex, simple measures based on one model running on one type of chip may no longer capture how production inference is actually deployed.
Agentic inference is emerging as one of the areas where that mismatch is most visible, because these workloads can span multiple stages and place different demands on compute, memory and orchestration.
By adding Gimlet Labs, MLCommons is advancing work on benchmarks intended to reflect those newer patterns of use.