In brief
- OpenAI open-sourced two mega-models under Apache 2.0, shattering licensing limits
- The 120B-parameter model runs on a $17K GPU; 20B-parameter version works on high-end gaming cards
- Performance approaches o4-mini and o3, beating similarly sized open models on math, coding, and medical benchmarks
OpenAI released two open-weight language models Tuesday that deliver performance matching its commercial offerings while running on consumer hardware—the gpt-oss-120b needs a single 80GB GPU and the gpt-oss-20b operates on devices with just 16GB of memory.
The models, available under Apache 2.0 licensing, achieve near-parity with OpenAI’s o4-mini on reasoning benchmarks. The 120-billion parameter version activates only 5.1 billion parameters per token through its mixture-of-experts architecture, while the 20-billion parameter model activates 3.6 billion. Both handle context lengths up to 128,000 tokens—the same as GPT-4o.
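A mixture-of-experts model keeps its active-parameter count far below its total because a router sends each token to only a handful of experts. The toy sketch below illustrates the routing idea only; the sizes, the scalar "experts," and all names are hypothetical, not OpenAI's implementation:

```python
import math
import random

def moe_forward(x, experts, gate_w, top_k=2):
    """Route one token vector through only the top_k highest-scoring experts."""
    # Router: score every expert against the token
    logits = [sum(g * xi for g, xi in zip(row, x)) for row in gate_w]
    # Keep only the top_k experts; the rest stay inactive for this token
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-top_k:]
    # Softmax over the selected experts only
    m = max(logits[i] for i in top)
    raw = {i: math.exp(logits[i] - m) for i in top}
    z = sum(raw.values())
    out = [0.0] * len(x)
    for i in top:
        w = raw[i] / z
        # Toy "expert": a single scalar multiply (real experts are MLPs)
        for j, xj in enumerate(x):
            out[j] += w * experts[i] * xj
    return out, sorted(top)

# Demo: 8 experts total, but only 2 are active per token
random.seed(0)
n_experts, d = 8, 4
gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
experts = [float(i + 1) for i in range(n_experts)]
x = [random.gauss(0, 1) for _ in range(d)]
y, active = moe_forward(x, experts, gate_w, top_k=2)
print(len(active), "of", n_experts, "experts active")
```

The same principle is how a 120B-parameter model can activate only 5.1 billion parameters per token: most expert weights sit idle on any given forward pass.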
Release under that specific license is a big deal: it means anyone can use, modify, and profit from the models without restriction, from individual developers to OpenAI's competitors, such as Chinese startup DeepSeek.
The release comes as speculation mounts about GPT-5’s imminent arrival and competition intensifies in the open-source AI space. The gpt-oss models are OpenAI’s first open-weight language models since GPT-2 in 2019.
There is no firm release date for GPT-5, but Sam Altman has hinted it could arrive sooner rather than later. “We have a lot of new stuff for you over the next few days,” he tweeted early today, promising “a big upgrade later this week.”
we have a lot of new stuff for you over the next few days!
something big-but-small today.
and then a big upgrade later this week.
— Sam Altman (@sama) August 5, 2025
The open-source models that dropped today are very powerful. “These models outperform similarly sized open models on reasoning tasks, demonstrate strong tool use capabilities, and are optimized for efficient deployment on consumer hardware,” OpenAI stated in its announcement. The company trained them using reinforcement learning and techniques from its o3 and other frontier systems.
On Codeforces competition coding, gpt-oss-120b scored an Elo rating of 2622 with tools and 2463 without, approaching o4-mini’s 2719 rating and o3’s 2706. The model hit 96.6% accuracy on AIME 2024 mathematics competitions compared to o4-mini’s 87.3% and achieved 57.6% on the HealthBench evaluation, beating o3’s 50.1% score.
The smaller gpt-oss-20b matched or exceeded o3-mini across these benchmarks despite its size. It scored 2516 Elo on Codeforces with tools, reached 95.2% on AIME 2024, and hit 42.5% on HealthBench—all while fitting in memory constraints that would make it viable for edge deployment.
Both models support three reasoning effort levels—low, medium, and high—that trade latency for performance. Developers can adjust these settings with a single sentence in the system message. The models were post-trained using processes similar to o4-mini, including supervised fine-tuning and what OpenAI described as a “high-compute RL stage.”
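Per OpenAI's description, the effort level is selected with a single sentence in the system message rather than a dedicated API parameter. A sketch of what such a request payload might look like; the model name, message wording, and payload shape here are illustrative assumptions, not taken from OpenAI's documentation:

```python
import json

def build_request(user_prompt: str, effort: str = "medium") -> dict:
    """Build a chat-style payload selecting reasoning effort via the system message."""
    if effort not in ("low", "medium", "high"):
        raise ValueError("effort must be low, medium, or high")
    return {
        "model": "gpt-oss-120b",  # hypothetical model identifier
        "messages": [
            # One plain sentence in the system message picks the effort level
            {"role": "system", "content": f"Reasoning effort: {effort}."},
            {"role": "user", "content": user_prompt},
        ],
    }

payload = build_request("Prove that the sum of two even numbers is even.", effort="high")
print(json.dumps(payload, indent=2))
```

Higher effort buys more deliberate reasoning at the cost of latency, which is why a per-request toggle is useful for mixing quick lookups with harder problems.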
But the fact that anyone can modify these models at will doesn’t mean repurposing them for harm will be easy. OpenAI filtered out certain harmful data related to chemical, biological, radiological, and nuclear threats during pre-training. The post-training phase used deliberative alignment and instruction hierarchy to teach refusal of unsafe prompts and defense against prompt injections.
In other words, OpenAI claims to have designed its models so that they cannot generate harmful responses even after modification.
Eric Wallace, an OpenAI alignment expert, revealed the company conducted unprecedented safety testing before release. “We fine-tuned the models to intentionally maximize their bio and cyber capabilities,” Wallace posted on X. The team curated domain-specific data for biology and trained the models in coding environments to solve capture-the-flag challenges.
Today we release gpt-oss-120b and gpt-oss-20b—two open-weight LLMs that deliver strong performance and agentic tool use.
Before release, we ran a first of its kind safety analysis where we fine-tuned the models to intentionally maximize their bio and cyber capabilities 🧵 pic.twitter.com/err2mBcggx
— Eric Wallace (@Eric_Wallace_) August 5, 2025
The adversarially fine-tuned versions underwent evaluation by three independent expert groups. “On our frontier risk evaluations, our malicious-finetuned gpt-oss underperforms OpenAI o3, a model below Preparedness High capability,” Wallace stated. The testing indicated that even with robust fine-tuning using OpenAI’s training stack, the models couldn’t reach dangerous capability levels according to the company’s Preparedness Framework.
That said, the models maintain unsupervised chain-of-thought reasoning, which OpenAI said is of paramount importance for keeping a wary eye on the AI. “We did not put any direct supervision on the CoT for either gpt-oss model,” the company stated. “We believe this is critical to monitor model misbehavior, deception and misuse.”
OpenAI hides the full chain of thought on its best models to prevent competitors from replicating their results—and to avoid another DeepSeek episode, which open weights now make even easier.
The models are available on Hugging Face. But as we said in the beginning, you’ll need a behemoth of a GPU with at least 80GB of VRAM (like the $17K Nvidia A100) to run the 120-billion-parameter version. The smaller 20-billion-parameter version requires at least 16GB of VRAM (like the $3K Nvidia RTX 4090), which is a lot, but not that crazy for consumer-grade hardware.
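A rough back-of-the-envelope check shows why those VRAM figures are plausible. The calculation below assumes roughly 4.25 bits per weight (typical of 4-bit quantization with scaling overhead); the article doesn't state the actual quantization, and real memory use also includes activations and the KV cache:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float = 4.25) -> float:
    """Approximate GB needed just to hold the weights at a given quantization."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Roughly 64 GB of weights for the 120B model: fits a single 80GB GPU
print(weight_memory_gb(120))
# Roughly 11 GB for the 20B model: fits within a 16GB card
print(weight_memory_gb(20))
```

The headroom between the weight footprint and the stated GPU memory is what leaves room for the context window and runtime buffers.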