Microsoft has advised that artificial intelligence (AI) agents are not yet reliable enough to make financial decisions for users after tests showed they are prone to errors, manipulation, and poor judgement when handling purchases.
The findings were released following research carried out using Magentic Marketplace, a simulation developed by Microsoft in collaboration with Arizona State University. The platform was used to observe how AI agents behave when acting as customers and sellers in a digital marketplace.
Research Method
The study involved 100 AI agents representing customers and 300 AI agents acting as businesses. Several leading AI models were tested, including GPT-4o, GPT-5, Gemini-2.5-Flash, and selected open-source models. Tasks included ordering food, arranging services, comparing products, communicating with vendors, and making payments.
In a statement summarising the results, Microsoft said:
“Agents should assist, not replace, human decision-making.”
Key Findings
Microsoft reported that the agents often failed to evaluate options properly, especially when presented with a large number of choices.
Researchers observed a pattern they described as “first-proposal bias”, where many models accepted the first reasonable option rather than comparing alternatives.
One section of the report stated:
“Loading agents with more options led to a decline in comparison. Models tended to accept the initial ‘good enough’ options rather than search for better value.”
The tests also revealed that some AI agents could be manipulated. Six manipulation methods were attempted, including fake reviews, fabricated credentials, and prompt-injection attacks.
The researchers noted:
“These findings highlight a critical security concern for agentic marketplaces.”
In certain cases, malicious agents succeeded in redirecting payments. According to the report, GPT-4o and GPTOSS-20b were highly vulnerable, while Gemini-2.5-Flash and Claude Sonnet 4 showed stronger resistance.
Microsoft also tested whether groups of AI agents could work together to complete shared goals. The study found that some agents struggled to coordinate roles or complete tasks without human instructions.
The report added:
“Our current study focused on static markets, but real-world environments are dynamic, with agents and users learning over time. Oversight is critical for high-stakes transactions.”
Microsoft stressed that autonomous AI agents are not ready to be trusted with payments or sensitive financial activity. It advised that AI should be used to support user choices rather than make independent decisions.
The report concluded:
“A simulation environment like Magentic Marketplace is crucial for understanding the interplay between market components and agents before deploying them at scale.”

