OpenAI O1 is here – how will you use it?

Zvonimir Petkovic

After months of fruity hints from Sam Altman, OpenAI has launched Orion-1 (O1). So, what's new?

After months of speculation, along with playful fruit references across social networks from Sam Altman himself, OpenAI has released Orion-1 (O1).

Previously referred to as Project Strawberry, this release highlights OpenAI’s flair for creative model names, a refreshing contrast to Meta’s more straightforward naming convention with models like Llama 3, 3.1, and 3.2.

But, what exactly is new?

AI’s next big thing? It’s all about “reasoning”

You might be wondering, why the drastic change in naming (where’s the GPT reference?).

The answer lies in “reasoning.”

It appears that reasoning is becoming the new focus for the next generation of AI models. This is the key distinction between these new models and their predecessors. While OpenAI hasn’t disclosed the exact architecture, it is reportedly still based on transformers, much like previous GPT models.

What’s behind this?

So far (and still today), you can often get better responses from most GenAI models simply by ending your prompt with:

Take a deep breath and think through this step by step.

This invokes a kind of reasoning. But why does it work? It comes down to the fundamental building blocks of GPT models and how they process information.
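To make the trick concrete, here’s a minimal sketch of appending that nudge to a prompt, assuming the official OpenAI Python SDK and an OPENAI_API_KEY in the environment; the model name and question are just illustrative:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
                "more than the ball. How much does the ball cost?")

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any chat model works here
        messages=[{
            "role": "user",
            # The suffix nudges the model into emitting intermediate steps.
            "content": question + "\n\nTake a deep breath and think through this step by step.",
        }],
    )
    print(response.choices[0].message.content)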

Think, correct, repeat!

The transformer architecture behind most of today’s models processes all tokens (let’s say words, for the sake of clarity) and calculates the probability of the next one, the token that continues the “answer” to your question or the input text.

The problem is that it may output something incorrect and then never get the chance to fix it, even though it could, simply by reviewing its own output. Since these models process all tokens in the context (including the ones they generate), they have the ability to “correct” themselves.
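As a toy illustration of why that self-correction is even possible, here is a bare-bones autoregressive loop; next_token_probs is a hypothetical stub standing in for a real transformer forward pass:

    import random

    VOCAB = ["the", "ball", "costs", "$0.10", "wait,", "actually", "$0.05", "."]

    def next_token_probs(context: list[str]) -> dict[str, float]:
        # Stub: a real model would run attention over the *whole* context,
        # including its own previous outputs, and return a distribution.
        weights = {tok: random.random() for tok in VOCAB}
        total = sum(weights.values())
        return {tok: w / total for tok, w in weights.items()}

    context = ["how", "much", "does", "the", "ball", "cost", "?"]
    for _ in range(8):
        probs = next_token_probs(context)
        token = max(probs, key=probs.get)  # greedy: take the most likely token
        context.append(token)              # generated output re-enters the context

    print(" ".join(context))

The key point is the context.append(token) line: because generated tokens flow back in as input, a model can, in principle, notice and revise an earlier mistake.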

With the reasoning approach, you allow the GenAI model to correct itself, either factually or by planning more extensively.

Factual output correction is the easiest to implement, and you can already achieve similar results with clever prompting, e.g.,

You are a helpful assistant that before answering a question does internal reasoning and outputs it to the user.

When you get a question, start reasoning in this way (showing the whole thought process): [This is where you define how you would like the model to reason]
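Wired into an API call, that scaffold could look like the sketch below (again assuming the OpenAI Python SDK; the reasoning format and model name are just one possible choice):

    from openai import OpenAI

    client = OpenAI()

    # An illustrative reasoning scaffold; tailor the steps to your domain.
    SYSTEM_PROMPT = (
        "You are a helpful assistant that, before answering a question, "
        "does internal reasoning and outputs it to the user.\n"
        "When you get a question, reason in this format:\n"
        "Thought: restate the problem and list what you know.\n"
        "Check: verify each intermediate claim against the question.\n"
        "Answer: the final answer, on its own line."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; O1's preview does not accept system messages
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Is 3599 a prime number?"},
        ],
    )
    print(response.choices[0].message.content)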

On the other hand, OpenAI has implemented a more complex strategy that includes planning.

While the end user may not fully understand what happens in the background, they do receive a status update on the steps the model took to answer their question. A simplified output is shown to the user (as seen in the image below), but this does not represent all the thinking tokens consumed by the model.

What’s happening behind the curtain?

As mentioned, the details of the reasoning process are still unclear. In fact, you can’t see it, and that’s by design. There are several reasons for this:

  • As a client, you wouldn’t want to see a model making mistakes and planning how to address your question – you just want the answer.
  • OpenAI prefers that others don’t see how they handle this, as it’s a competitive advantage. They don’t want Meta to launch Llama 4 next month.
  • You are paying for these tokens, and OpenAI is still working on a pricing strategy that will satisfy everyone. (You can already inspect that usage yourself, as sketched below.)
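A minimal sketch of checking the hidden reasoning-token usage, assuming the OpenAI Python SDK, access to o1-preview, and the usage fields as they exist at the time of writing (they may change):

    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": "Plan a three-step migration from REST to gRPC."}],
    )

    usage = response.usage
    details = usage.completion_tokens_details  # reported for O-class models
    print("visible output tokens:", usage.completion_tokens - details.reasoning_tokens)
    print("hidden reasoning tokens:", details.reasoning_tokens)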

And what about use cases?

Use cases, on the other hand, will be explored in the coming months, as this is still very new; however, some of them are already clear:

  • Fine-tuning smaller models 
  • Coding assistants!
  • Agents

For fine-tuning, having a smart model teach a smaller one is key. GPT-5 (or whatever fruit name it will have) is already in the pretraining phase, possibly even in fine-tuning. The best synthetic data for this would be generated by O1, as its outputs are significantly higher in quality compared to other models. Since synthetic (GenAI-generated) data is extensively used for training and fine-tuning, this will raise the quality of the new models.
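A distillation-style pipeline for this could be as simple as the sketch below: a strong teacher model answers seed questions, and the pairs are written in the JSONL chat format that OpenAI’s fine-tuning endpoint expects (the model name and seed questions are illustrative):

    import json
    from openai import OpenAI

    client = OpenAI()

    seed_questions = [
        "Explain eventual consistency to a junior engineer.",
        "When would you pick a message queue over a direct API call?",
    ]

    with open("synthetic_train.jsonl", "w") as f:
        for q in seed_questions:
            teacher = client.chat.completions.create(
                model="o1-preview",  # the high-quality "teacher"
                messages=[{"role": "user", "content": q}],
            )
            answer = teacher.choices[0].message.content
            # One training example per line, in chat fine-tuning format.
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": q},
                    {"role": "assistant", "content": answer},
                ]
            }) + "\n")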

Coding assistants are a specific use case, but they arguably represent one of the most effective applications of GenAI technology. When it comes to writing code, latency isn’t a major concern if the goal is to solve complex problems more easily. We’re willing to wait for a relevant response rather than receiving a fast but irrelevant one!

Remember Devin, the “virtual software engineer”? It was largely a marketing spin on the coding assistant industry, but now they’re back.

Here are some benchmarks comparing Devin’s performance with O1 versus GPT-4o.

These are still just benchmarks, but the improvement looks impressive.

Agents are the next big thing in GenAI

Agents represent the next frontier for the adoption of GenAI, and we are somewhat halfway there, having witnessed the rise of a new market called “Conversational AI” over the past year. In short, the primary product in this space is Generative AI agents, which can mimic the operations of a typical support center:

  • We have virtual agents that are now language models.
  • The actions these virtual agents can take are exposed as tools, i.e., software functions that wrap local code or remote API calls; a minimal sketch of this pattern follows below.
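Here is that sketch: a chat model wired to a single tool, using the OpenAI SDK’s function-calling interface. It uses GPT-4o because, as discussed below, O1’s preview can’t call tools yet; the lookup_order function and its schema are hypothetical:

    import json
    from openai import OpenAI

    client = OpenAI()

    def lookup_order(order_id: str) -> dict:
        # Stand-in for a local function or a remote API call.
        return {"order_id": order_id, "status": "shipped"}

    tools = [{
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Look up the shipping status of an order.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Where is my order 1234?"}]
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools,
    )

    call = response.choices[0].message.tool_calls[0]
    if call.function.name == "lookup_order":
        print(lookup_order(**json.loads(call.function.arguments)))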

Virtual agents need to handle everything a typical human can throw at them, which is unpredictable, to say the least. However, with O1, we now have a model that can “think” through problems much better than previous models (GPT-4o, we’re looking at you). A comparison on a suite of complex tasks is shown in the image below.

credits: https://futuresearch.ai/llm-agent-eval

As visible in the chart above, the only model currently coming close to O1 is the impressive Claude 3.5 Sonnet from Anthropic. However, keep in mind that O1 is still in preview, and the generally available version is expected to be even better – not to mention the next models that will be released at some point. OpenAI’s benchmarks across various tasks are shown in the image below:

When we consider the size of the performance uplift compared to GPT-4o, it becomes much clearer why OpenAI created a distinct class of O models, although this comes with its own challenges.

O1 is almost there – just a few tweaks away

Since O1 is in preview, it still can’t use:

  • Tools
  • System instructions (which would guide the model to behave as a developer would want, giving it accuracy and personality in the case of an external-facing agent)
  • Some other generation parameters, such as temperature

These limitations will be addressed at some point, either through updates or by a different model altogether. For now, we can be creative and make O1 work alongside other, more flexible models to get the best results.
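One common stopgap, sketched below, is to fold the would-be system instructions into the user message itself (the instruction text is illustrative, and note the absence of both a system role and a temperature setting):

    from openai import OpenAI

    client = OpenAI()

    instructions = "You are a support agent for Acme Corp. Be concise and factual."
    question = "Can I return a product after 45 days?"

    response = client.chat.completions.create(
        model="o1-preview",
        # No system message and no temperature: both are unsupported in
        # preview, so the instructions ride along inside the user message.
        messages=[{"role": "user", "content": f"{instructions}\n\n{question}"}],
    )
    print(response.choices[0].message.content)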

Proposed setup:

This allows us to incorporate O1 for complex tasks and leverage its reasoning capabilities while still being able to utilize different tools and integrate seamlessly with the architecture we already have in place (third-party API calls, RAG over data, etc.). GPT-4o-mini is also cost-effective and fast enough to avoid being a deal breaker in production.
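A minimal sketch of that routing idea is below; the is_complex heuristic is hypothetical, and in practice the routing decision could itself be a cheap classification call:

    from openai import OpenAI

    client = OpenAI()

    def is_complex(question: str) -> bool:
        # Hypothetical heuristic; replace with a real classifier or router model.
        return len(question.split()) > 40 or "step by step" in question.lower()

    def answer(question: str) -> str:
        if is_complex(question):
            # Delegate the heavy reasoning to O1: plain user message, no tools.
            response = client.chat.completions.create(
                model="o1-preview",
                messages=[{"role": "user", "content": question}],
            )
        else:
            # Fast, cheap path that also supports tools, RAG, and the rest
            # of the existing architecture.
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": question}],
            )
        return response.choices[0].message.content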

On the other hand, O1 is expensive and slow. However, the same was once true of the models that now run at impressive speeds and reasonable cost, so it’s likely and expected that the O-class models will follow the same trajectory.

A new chapter in AI?

If the models continue evolving like this, it makes a lot of sense to revisit some of the things that didn’t work out as we had hoped in 2023.

Computing (token generation) is cheaper than ever, and as you can see, this is opening up new opportunities to explore more complex use cases than we could before.

While we are somewhat brute-forcing intelligence by applying more computing power, this approach is tied to the architecture of today’s language models, and historically, deep learning has advanced this way more often than not.
