OpenAI, under increasing competitive pressure from Google and Anthropic, has debuted a new AI model, GPT-5.2, that it says beats all existing models by a substantial margin across a wide range of tasks.
The new model, which is being released less than a month after OpenAI debuted its predecessor, GPT-5.1, performed particularly well on a benchmark of complicated professional tasks across a range of “knowledge work”—from law to accounting to finance—as well as on evaluations involving coding and mathematical reasoning, according to data OpenAI released.
Fidji Simo, the former Instacart CEO who now serves as OpenAI’s CEO of applications, told reporters that the model should not be seen as a direct response to Google’s Gemini 3 Pro AI model, which was released last month. That release prompted OpenAI CEO Sam Altman to issue a “code red,” delaying the rollout of several initiatives in order to focus more staff and computing resources on improving the company’s core product, ChatGPT.
“I would say that [the code red] helps with the release of this model, but that’s not the reason it is coming out this week in particular; it has been in the works for a while,” she said.
She said the company had been building GPT-5.2 “for many months.” “We don’t turn around these models in just a week. It’s the result of a lot of work,” she said. The model had been known internally by the code name “Garlic,” according to a story in The Information. The day before the model’s release, Altman teased its imminent rollout by posting to social media a video clip of himself cooking a dish with a large amount of garlic.
OpenAI executives said that the model had been in the hands of “alpha customers” who helped test its performance for “several weeks”—a time period that would mean the model was completed prior to Altman’s “code red” declaration.
These testers included legal AI startup Harvey, note-taking app Notion, and file-management software company Box, as well as Shopify and Zoom.
OpenAI said these customers found GPT-5.2 demonstrated a “state of the art” ability to use other software tools to complete tasks, as well as excelling at writing and debugging code.
Coding has become one of the most competitive use cases for AI model deployment within companies. Although OpenAI had an early lead in the space, Anthropic’s Claude model has proved especially popular among enterprises, exceeding OpenAI’s market share according to some figures. OpenAI is no doubt hoping GPT-5.2 will convince customers to turn back to its models for coding.
Simo said the “code red” was helping OpenAI focus on improving ChatGPT. “Code Red is really a signal to the company that we want to marshal resources in one particular area, and that’s a way to really define priorities and define things that can be deprioritized,” she said. “So we have had an increase in resources focused on ChatGPT in general.”
The company also said its new model is better than the company’s earlier ones at providing “safe completions”—which it defines as providing users with helpful answers while not saying things that might contribute to or worsen mental health crises.
“On the safety side, as you saw through the benchmarks, we are improving on pretty much every dimension of safety, whether that’s self harm, whether that’s different types of mental health, whether that’s emotional reliance,” Simo said. “We’re very proud of the work that we’re doing here. It is a top priority for us, and we only release models when we’re confident that the safety protocols have been followed, and we feel proud of our work.”
The release of the new model came on the same day a new lawsuit was filed against the company alleging that ChatGPT’s interactions with a psychologically troubled user had contributed to a murder-suicide in Connecticut. The company also faces several other lawsuits alleging ChatGPT contributed to people’s suicides. The company called the Connecticut murder-suicide “incredibly heartbreaking” and said it is continuing to improve “ChatGPT’s training to recognize and respond to signs of mental or emotional distress, de-escalate conversations and guide people toward real-world support.”
GPT-5.2 showed a large jump in performance across several benchmark tests of interest to enterprise customers. It met or exceeded human expert performance on a wide range of difficult professional tasks, as measured by OpenAI’s GDPval benchmark, 70.9% of the time. That compares to just 38.8% of the time for GPT-5, a model that OpenAI released in August; 59.6% for Anthropic’s Claude Opus 4.5; and 53.3% for Google’s Gemini 3 Pro.
On the software development benchmark SWE-Bench Pro, GPT-5.2 scored 55.6%, almost 5 percentage points better than its predecessor, GPT-5.1, and more than 12 percentage points better than Gemini 3 Pro.
OpenAI’s Aidan Clark, vice president of research (training), declined to answer questions about exactly what training methods had been used to upgrade GPT-5.2’s performance, although he said that the company had made improvements across the board, including in pretraining, the initial step in creating an AI model.
When Google released its Gemini 3 Pro model last month, its researchers also said the company had made improvements in pretraining as well as post-training. This surprised some in the field who believed that AI companies had largely exhausted the ability to wring substantial improvements out of the pretraining stage of model building, and some speculated that OpenAI may have been caught off guard by Google’s progress in this area.
