- OpenAI’s latest model, ChatGPT-o1, is being hailed for its human-like reasoning abilities and for setting new standards in AI.
- ChatGPT-o1 is said to possess true general reasoning capabilities, excelling across a range of benchmark tests.
- With o1’s competitive edge in advanced reasoning, the crowded LLM landscape is likely to shift in OpenAI’s favor.
On September 12, OpenAI unveiled its latest model, ChatGPT-o1, hailed for its human-like reasoning abilities and for setting new standards in AI. Dubbed the “Strawberry Model”, it had been long teased by OpenAI leadership, including CEO Sam Altman and former Chief Scientist Ilya Sutskever, and was highly anticipated. ChatGPT-o1 is said to possess true general reasoning capabilities, excelling across a range of benchmark tests.
Benchmark Comparison of GPT-4o, GPT-o1 Preview and GPT-o1 (to be released) in the Fields of Math, Coding and Science
Source: OpenAI

The chart above shows how o1, compared to GPT-4o, elevates the potential of LLM reasoning from a merely acceptable level to an outstanding one. Without specialized training, it can achieve gold-medal results in mathematical Olympiads and rank in the 89th percentile in coding. In PhD-level science Q&A tests, it even surpasses human experts. What makes o1 a breakthrough?
- Adding Reinforcement Learning to LLM training
Reinforcement Learning (RL) trains AI through a cycle of rewards and consequences. By exploring various actions, and learning from the outcomes via a reward system, AI adjusts its behavior to optimize results. This process naturally creates a data flywheel, continuously generating training datasets with both positive and negative feedback, further refining the AI’s performance. A notable example is the essential role of RL in AlphaGo’s training process. This method significantly improves the reliability and accuracy of LLM outputs.
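To make the reward cycle concrete, here is a minimal multi-armed-bandit loop in Python. It is an illustrative sketch only: the action names and reward values are invented, and OpenAI’s actual RL training pipeline is unpublished.

```python
import random

# Toy reward-learning loop: an agent picks among candidate "actions",
# observes a reward, and shifts its estimates toward actions that score
# well -- the reward-and-penalty cycle described above, in miniature.

ACTIONS = ["strategy_a", "strategy_b", "strategy_c"]
TRUE_REWARD = {"strategy_a": 0.2, "strategy_b": 0.8, "strategy_c": 0.5}  # hidden from the agent

def pull(action: str) -> float:
    """Environment: returns a noisy reward for the chosen action."""
    return TRUE_REWARD[action] + random.gauss(0, 0.1)

estimates = {a: 0.0 for a in ACTIONS}  # learned value of each action
counts = {a: 0 for a in ACTIONS}
epsilon = 0.1  # exploration rate

for step in range(1000):
    # Explore occasionally; otherwise exploit the best-known action.
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=estimates.get)
    reward = pull(action)
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # estimates converge toward TRUE_REWARD
```

After enough iterations, the estimates converge toward the hidden reward values, mirroring how feedback steadily refines behavior.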
- Hidden chain of thought (CoT) in the ‘thinking’ process
One of the key challenges with LLMs is their tendency to hallucinate, often criticized as “stochastic parroting”. To address this, OpenAI’s o1 model employs a structured, step-by-step reasoning process, akin to deliberate human thought. By breaking complex tasks into simpler components, it improves accuracy and problem-solving efficiency. If one approach fails, the model tries alternative strategies, which enhances its overall reasoning capabilities and versatility.
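As a loose analogy for this decompose-verify-fall-back pattern, the sketch below solves a simple problem with two strategies and moves to the second only when the first fails verification. It illustrates the pattern itself, not o1’s actual hidden reasoning, which OpenAI does not expose.

```python
import math

# Toy illustration of "decompose, verify, fall back": solve x^2 = c for
# x >= 0 two ways; if a strategy's answer fails the check, move on to
# the next one. An analogy for hidden CoT, not its actual mechanism.

def analytic(c: float) -> float:
    return math.sqrt(c)

def bisection(c: float, lo: float = 0.0, hi: float = 1e6) -> float:
    for _ in range(200):  # halve the interval until it is tiny
        mid = (lo + hi) / 2
        if mid * mid < c:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def verify(x: float, c: float) -> bool:
    """Check a candidate answer by substituting it back in."""
    return abs(x * x - c) < 1e-6

def solve(c: float) -> float:
    for strategy in (analytic, bisection):  # try strategies in order
        x = strategy(c)
        if verify(x, c):
            return x
    raise ValueError("all strategies failed")

print(solve(2.0))  # ~1.41421356
```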
- Scaling law on inference phase
The chart below shows how o1’s (or “Strawberry’s”) inference-time compute allocation compares to that of most LLMs. We often talk about the scaling law: how model performance improves as key parameters increase, such as model size (number of parameters), dataset size and compute power. In this case, OpenAI shifted more of the compute budget to o1’s inference (thinking) stage.
Inference Time Compute Allocation: o1 vs Most LLMs
Sources: Jim Fan, NVIDIA

Therefore, with more time spent thinking, the scaling law still holds: o1’s reasoning capabilities improve significantly as inference-time compute increases.
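One stylized way to express this borrows the power-law form of familiar training scaling laws and adds an inference-compute term. The expression below is an illustrative sketch with assumed coefficients, not a formula OpenAI has published:

```latex
% Illustrative sketch only -- OpenAI has not published its scaling formula.
% Error falls as a power law in both training compute C_train and
% inference ("thinking") compute C_inf.
\mathrm{Error}(C_{\mathrm{train}}, C_{\mathrm{inf}})
  \approx a\,C_{\mathrm{train}}^{-\alpha} + b\,C_{\mathrm{inf}}^{-\beta},
  \qquad \alpha,\ \beta > 0
```

What are the implications?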
- With o1’s competitive edge in advanced reasoning, the crowded LLM landscape is likely to shift in OpenAI’s favor. Such a lead will last until competitors deploy similar architectures and techniques. The chart below shows how o1 outperforms its peers on IQ tests. This technological breakthrough should also bolster OpenAI’s current funding efforts at a valuation of $150 billion.
Leading LLMs’ Performance Comparison
Source: Maxim Lott on X, Mensa
- The advancements of o1 are poised to rapidly accelerate the deployment of LLMs in industries that rely heavily on complex reasoning, such as STEM education, legal services, scientific discovery and research. In STEM education, o1’s enhanced reasoning abilities will enable adaptive learning platforms that guide students through intricate problems, fostering a deeper understanding of advanced subjects like mathematics, physics and engineering. In the legal field, o1 can support reasoning-heavy legal research and case analysis. In science, o1 will expedite the analysis of data and literature, uncovering new insights and driving innovation across disciplines.
- OpenAI’s o1 has excelled in complex coding tasks, setting the stage for more advanced AI agents and streamlined agentic workflows. As reasoning is a key component of AI agents, o1’s superior reasoning abilities enable them to break difficult coding challenges into manageable steps and discover optimal solutions. For instance, Devin, a popular coding AI agent, has been found more adept at diagnosing the root causes of issues in long codebases when powered by o1 (a minimal sketch of such an agentic loop follows the chart below).
OpenAI o1 Improves Coding Agent’s Performance Compared to GPT-4o
Source: Cognition AI
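The sketch below outlines what such a plan-execute-verify loop can look like. It assumes a hypothetical `call_llm` helper standing in for a reasoning-model API call; Devin’s real architecture is proprietary and not what is shown here.

```python
from typing import List

# Minimal agentic coding loop: plan, execute one step, verify, retry.
# `call_llm` is a hypothetical placeholder, not a real client library.

def call_llm(prompt: str) -> str:
    """Placeholder for a reasoning-model API call."""
    raise NotImplementedError("wire up your model client here")

def plan_steps(task: str) -> List[str]:
    # Ask the model to decompose the task into ordered, checkable steps.
    reply = call_llm(f"Break this coding task into numbered steps:\n{task}")
    return [line for line in reply.splitlines() if line.strip()]

def run_agent(task: str, max_retries: int = 3) -> None:
    for step in plan_steps(task):
        for attempt in range(max_retries):
            patch = call_llm(f"Write a code change for step: {step}")
            ok = call_llm(f"Does this change satisfy '{step}'? Answer yes/no:\n{patch}")
            if ok.strip().lower().startswith("yes"):
                break  # step verified; move to the next one
            # Otherwise retry -- the try-alternatives behavior that
            # stronger reasoning models handle more reliably.
```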
- Humanoids and robotics will benefit from the OpenAI o1 model due to its reasoning abilities. Through enhanced CoT processing, o1 enables robots to make complex decisions with greater accuracy, allowing them to tackle tasks that demand deeper, multi-step reasoning. This is particularly impactful in scenarios where robots must navigate dynamic environments or carry out intricate tasks.
- OpenAI o1 demonstrates that compute can be scaled through two main avenues: increasing it during training and increasing it during inference. Since greater inference-time compute yields stronger reasoning, we foresee a shift in emphasis from training compute to inference compute, representing a fundamental change in how AI systems approach different tasks.
- This opens a significant opportunity for NVIDIA’s competitors, such as Groq and SambaNova Systems, whose chances of success are much stronger in the inference compute market than in the training space.
- However, the o1 model’s demand for more compute during inference leads to longer processing times, a direct consequence of its advanced reasoning capabilities. This highlights the need for a careful balance between speed and precision during the engineering stage, ensuring the model is applied in scenarios where its strengths, such as detailed reasoning, outweigh the slower response times.
- Moreover, the o1 model is also notably more expensive, costing roughly 3-4 times as much as GPT-4o. Therefore, for applications that prioritize speed, such as customer service or simple data analysis, the increased cost and slower response time may outweigh its advanced reasoning benefits.
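As a back-of-the-envelope illustration of that trade-off, the snippet below compares per-request costs under the two models. The per-1M-token prices are assumptions roughly matching launch-time list prices; check OpenAI’s pricing page for current figures.

```python
# Back-of-the-envelope cost comparison. Prices are illustrative
# assumptions (per 1M tokens); check OpenAI's pricing page for current
# numbers. Note that o1 also bills its hidden reasoning tokens as
# output tokens, which inflates output counts.

PRICES = {  # (input $/1M tokens, output $/1M tokens) -- assumed
    "gpt-4o": (5.00, 15.00),
    "o1-preview": (15.00, 60.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# A simple customer-service reply: short prompt, short answer.
print(f"gpt-4o:     ${cost('gpt-4o', 500, 200):.4f}")
# The same job on o1, assuming ~2,000 extra hidden reasoning tokens.
print(f"o1-preview: ${cost('o1-preview', 500, 200 + 2000):.4f}")
```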