Magentic Marketplace: Open-source platform to study agentic markets

Compatibilità
Salva(0)
Condividi

Autonomous AI agents are here, and they’re poised to reshape the economy. By automating discovery, negotiation, and transactions, agents can overcome inefficiencies like information asymmetries and platform lock-in, enabling faster, more transparent, and more competitive markets.

We are already seeing early signs of this transformation in digital marketplaces. Customer-facing assistants like OpenAI’s Operator and Anthropic’s Computer Use can navigate websites and complete purchases. On the business side, Shopify Sidekick, Salesforce Einstein, and Meta’s Business AI help merchants with operations and customer engagement. These examples hint at a future where agents become active market participants, but the structure of these markets remains uncertain.

Several scenarios are possible. We might see one-sided markets where only customers or businesses deploy agents; closed platforms (known as walled gardens) where companies tightly control agent interactions; or even open two-sided marketplaces where customer and business agents transact freely across ecosystems. Each path carries different trade-offs for security, openness, convenience, and competition, which will shape how value flows in the digital economy. For a deeper exploration of these dynamics, see our paper, The Agentic Economy.

To help navigate this uncertainty, we built Magentic Marketplace (opens in new tab)— an open-source simulation environment for exploring the numerous possibilities of agentic markets and their societal implications at scale. It provides a foundation for studying these markets and guiding them toward outcomes that benefit everyone.

This matters because most AI agent research focuses on isolated scenarios—a single agent completing a task or two agents negotiating a simple transaction. But real markets involve a large number of agents simultaneously searching, communicating, and transacting, creating complex dynamics that can’t be understood by studying agents in isolation. Capturing this complexity is essential because real-world deployments raise critical questions about consumer welfare, market efficiency, fairness, manipulation resistance, and bias—questions that can’t be safely answered in production environments.

To explore these dynamics in depth, the Magentic Marketplace platform enables controlled experimentation across diverse agentic marketplace scenarios. Its current focus is on two-sided markets, but the environment is modular and extensible, supporting future exploration of mixed human–agent systems, one-sided markets, and complex communication protocols.

Figure 1. With Magentic Marketplace, researchers can model how agents representing customers and businesses interact—shedding light on the dynamics that could shape future digital markets.

What is Magentic Marketplace?

Magentic Marketplace’s environment manages market-wide capabilities like maintaining catalogs of available goods and services, implementing discovery algorithms, facilitating agent-to-agent communication, and handling simulated payments through a centralized transaction layer at its core, which ensures transaction integrity across all marketplace interactions. Additionally, the platform enables systematic, reproducible research. As demonstrated in the following video, it supports a wide range of agent implementations and evolving marketplace features, allowing researchers to integrate diverse agent architectures and adapt the environment as new capabilities emerge.

We built Magentic Marketplace around three core architectural choices:

HTTP/REST client-server architecture: Agents operate as independent clients while the Marketplace Environment serves as a central server. This mirrors real-world platforms and supports clear separation of customer and business agent roles.

Minimal three-endpoint market protocol: Just three endpoints—register, protocol discovery, and action execution—lets agents dynamically discover available actions. New capabilities can be added without disrupting existing experiments.

Rich action protocol: Specific message types support the complete transaction lifecycle: search, negotiation, proposals, and payments. The protocol is designed for extensibility. New actions like refunds, reviews, or ratings can be added seamlessly, allowing researchers to evolve marketplace capabilities and study emerging agent behaviors while remaining compatible.

Figure 2. Magentic Marketplace includes two agent types: Assistant Agents (customers) and Service Agents (businesses). Both interact with a central Market Environment via REST APIs for registration, service discovery, communication, and transaction execution. Action Routers manage message flow and protocol requests, enabling autonomous negotiation and commerce in a two-sided marketplace.

Additionally, a visualization module lets users observe marketplace dynamics and review individual conversation threads between customer and business agents.

Setting up the experiments

To ensure reproducibility, we instantiated the marketplace with fully synthetic data, available in our open-source repository (opens in new tab). The experiments modeled transactions such as ordering food and engaging with home improvement services, where agents represented customers and businesses engaging in marketplace transactions. This setup enabled precise measurement of behavior and systematic comparison against theoretical upper bounds.

Each experiment was run using 100 customers and 300 businesses and included both proprietary models (GPT-4o, GPT-4.1, GPT-5, and Gemini-2.5-Flash) and open-source models (OSS-20b, Qwen3-14b, and Qwen3-4b-Instruct-2507).

Our scenarios focused on simple all-or-nothing requests: Each customer had a list of desired items and amenities that needed to be present for a transaction to be satisfying. For those transactions, utility was computed as the sum of the customer’s internal item valuations minus actual prices paid. Consumer welfare, defined as the sum of utilities across all completed transactions, served as our key metric for comparing agent performance.

While this experimental setup provides a useful starting point, it is not intended to be definitive. We encourage researchers to extend the framework with richer, more nuanced measures and request types that better capture real consumer welfare, fairness, and other societal considerations.

PODCAST SERIES

The AI Revolution in Medicine, Revisited

Join Microsoft’s Peter Lee on a journey to discover how AI is impacting healthcare and what it means for the future of medicine.

What did we find?

Agents can improve consumer welfare—but only with good discovery

We explored whether two-sided agentic markets—where AI agents interact with each other and with service providers—can improve consumer welfare by reducing information gaps. Unlike traditional markets, which do not provide agentic support and place the full burden of overcoming information asymmetries on customers, agentic markets shift much of that effort to agents. This change matters because as agents gain better tools for discovery and communication, they relieve customers of the heavy cognitive load of filling any information gaps. This lowers the cost of making informed decisions and improves customer outcomes.

We compared several marketplace setups. Under realistic conditions (Agentic: Lexical search), agents faced real-world challenges like building queries, navigating paginated lists, identifying the right businesses to send inquiries to, and negotiating transactions.

Despite these complexities, advanced proprietary models and some medium-sized open-source models like GPTOSS-20b outperformed simple baselines like randomly choosing or simply choosing the cheapest option. Notably, GPT-5 achieved near-optimal performance, demonstrating its ability to effectively gather and utilize decision-relevant information in realistic marketplace conditions.

Figure 3. Table comparing experimental setups for welfare outcomes in the restaurant industry. Each row shows a different way agents or baselines make decisions, from random picks to fully coordinated agentic strategies. Cell colors indicate how much information is available: green, at the top left, represents complete information, red, at the top right, represents limited information, and yellow at the bottom represents decisions that depend on agent communication.

Performance increased considerably under the Agentic: Perfect search condition, where agents started with the top three matches without needing to search and navigate among the choices. In this setting, Sonnet-4.0, Sonnet-4.5, GPT-5, and GPT-4.1 nearly reached the theoretical optimum and beat baselines with full amenity details but without agent-to-agent coordination.

Open-source models were mixed: GPTOSS-20b performed strongly under both Perfect search and Lexical search conditions, even exceeding GPT-4o’s performance with Perfect search. This suggests that relatively compact models can exhibit robust information-gathering and decision-making capabilities in complex multi-agent environments. Qwen3-4b-2507 faltered when discovery involved irrelevant options (Lexical search), while Qwen3-14b lagged in both cases due to fundamental limitations in reasoning.

Figure 4. Chart showing consumer welfare outcomes in the restaurant industry under different marketplace setups. Blue bars show Agentic: Lexical search, where agents navigate realistic discovery challenges; yellow bars show Agentic: Perfect search, where agents started with ideal matches. Proprietary models approached optimum consumer welfare under perfect search, while open-source models and baselines lagged behind.

Paradox of Choice

One promise of agents is their ability to consider far more options than people can. However, our experiments revealed a surprising limitation: providing agents with more options does not necessarily lead to more thorough exploration. We designed experiments that varied the search results limit from 3 to 100. Except for Gemini-2.5-Flash and GPT-5, the models contacted only a small fraction of available businesses regardless of the search limit. This suggests that most models do not conduct exhaustive comparisons and instead easily accept the initial “good enough” options.

Recapiti
stclarke