On Monday, Anthropic unveiled Claude 3, a family of three AI language models similar to those that power ChatGPT. Anthropic says the models set new industry benchmarks across a range of cognitive tasks, in some cases approaching “near-human” capability. The models are available now through Anthropic’s website, with the most powerful one limited to subscribers; they are also accessible to developers through an API.
Claude 3’s three models, Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus, represent increasing levels of complexity and parameter count. Sonnet currently powers the Claude.ai chatbot for free with an email sign-in. But as mentioned above, Opus is only available through Anthropic’s web chat interface with a “Claude Pro” subscription, a $20-per-month service offered through the Anthropic website. All three feature a 200,000-token context window. (The context window is the number of tokens, or fragments of words, that an AI language model can process at once.)
We covered the launch of Claude in March 2023 and Claude 2 in July 2023. Each time, Anthropic pulled ahead of OpenAI’s top models in context window length but lagged significantly behind them in capability. With Claude 3, there is not yet any consensus among experts over whether Anthropic has surpassed OpenAI’s publicly released models in performance, and the presentation of AI benchmarks is notoriously prone to cherry-picking.
Claude 3 reportedly demonstrates advanced performance across a variety of cognitive tasks, including reasoning, expert knowledge, mathematics, and language fluency. (Despite ongoing disagreement over whether large language models truly “know” or “reason,” the AI research community commonly uses those terms.) The company claims that Opus, the most capable of the three models, exhibits “near-human levels of comprehension and fluency on complex tasks.”
That’s quite a bold claim, and it deserves careful scrutiny. Opus may well be “near-human” on some specific benchmarks, but that doesn’t mean Opus possesses human-level general intelligence (consider that pocket calculators are superhuman at arithmetic). So it’s a purposely eye-catching claim that softens considerably once the qualifications are added.
Anthropic claims that Claude 3 Opus beats GPT-4 on 10 AI benchmarks, including MMLU (undergraduate-level knowledge), GSM8K (grade school math), HumanEval (coding), and the amusingly named HellaSwag (common knowledge). Some of the wins are narrow, such as 86.8 percent for Opus versus 86.4 percent for GPT-4 on a five-shot trial of MMLU, and some are wide, such as 84.9 percent on HumanEval versus GPT-4’s 67.0 percent. But what that might mean for you as a customer is difficult to say.
“LLM benchmarks should be viewed with a degree of caution as always,” AI researcher Simon Willison told Ars in an interview about Claude 3. “The model’s performance on benchmarks provides little insight into how the model ‘feels’ to use. However, no other model has outperformed GPT-4 on a variety of commonly used benchmarks, so this is still a big accomplishment.”
Anthropic says the Claude 3 models outperform Claude 2 in areas such as analysis, forecasting, content creation, code generation, and multilingual conversation. The models also reportedly feature improved vision capabilities that let them process visual formats like photos, charts, and diagrams, similar to Google’s Gemini and OpenAI’s GPT-4V, which is available in subscription versions of ChatGPT.
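To give a rough sense of how a developer might feed an image to one of the new models, here is a minimal sketch using Anthropic’s Python SDK. The model ID and message format shown are our own assumptions based on Anthropic’s launch-era documentation, and the image file is hypothetical; treat this as illustrative rather than official example code.

```python
# Minimal sketch (not from Anthropic) of sending an image to a Claude 3 model
# through the Messages API. Model ID and message layout are assumptions based
# on Anthropic's public documentation at launch.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("chart.png", "rb") as f:  # hypothetical local image file
    image_data = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-opus-20240229",  # assumed Opus model ID
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": "What trend does this chart show?"},
            ],
        }
    ],
)

print(message.content[0].text)  # first content block holds the text reply
```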
Anthropic emphasizes how much faster and cheaper the three models are compared to earlier generations and rival models. The pricing breaks down like this: Opus, the largest, costs $15 per million input tokens and $75 per million output tokens; Sonnet, the mid-sized model, costs $3 per million input tokens and $15 per million output tokens; and Haiku, the smallest and fastest, costs $0.25 per million input tokens and $1.25 per million output tokens. By comparison, OpenAI’s GPT-4 Turbo costs $10 per million input tokens and $30 per million output tokens through its API, while GPT-3.5 Turbo costs $0.50 per million input tokens and $1.50 per million output tokens.
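To make the per-token pricing concrete, here is a quick back-of-the-envelope comparison we put together. The request size (10,000 input tokens and 1,000 output tokens) is entirely made up for illustration; the prices are just the list figures quoted above.

```python
# Rough cost comparison using the per-million-token list prices quoted above.
# The hypothetical request size is purely for illustration.
PRICES = {  # (input $ per 1M tokens, output $ per 1M tokens)
    "Claude 3 Opus": (15.00, 75.00),
    "Claude 3 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "GPT-4 Turbo": (10.00, 30.00),
    "GPT-3.5 Turbo": (0.50, 1.50),
}

input_tokens, output_tokens = 10_000, 1_000

for model, (in_price, out_price) in PRICES.items():
    cost = (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price
    print(f"{model}: ${cost:.4f} per request")
```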
When we asked, Willison said he hadn’t yet gotten a feel for Claude 3’s performance, but he had immediately noticed the API pricing for each model. “The unreleased cheapest one looks radically competitive,” Willison says. “The best quality one is super expensive.”
In other news, Anthropic reports that the Opus model surpassed 99 percent accuracy on a recall benchmark, and the Claude 3 models can reportedly handle up to 1 million tokens for select customers (similar to Gemini 1.5 Pro). The company also says the Claude 3 models give more accurate answers, produce fewer incorrect ones, and are less likely to refuse harmless prompts.
According to a model card released alongside the models, Anthropic achieved Claude 3’s capability gains in part by using synthetic data during the training process. Synthetic data is data generated internally by another AI language model, and it can broaden the depth of a training dataset by including scenarios that might be missing from scraped data. “The synthetic data thing is a big deal,” Willison says.
Anthropic plans to release frequent updates to the Claude 3 model family over the coming months, including new features such as tool use, interactive coding, and “advanced agentic capabilities.” The company says it remains committed to ensuring that safety measures keep pace with advances in AI capability, and it states that the Claude 3 models “present negligible potential for catastrophic risk at this time.”
The Opus and Sonnet models are available now through Anthropic’s API, with Haiku to follow soon. Sonnet is also available through Amazon Bedrock and in private preview on Google Cloud’s Vertex AI Model Garden.
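For developers, a basic streaming call to Sonnet through Anthropic’s Python SDK might look something like the sketch below. The Sonnet model ID and the streaming helper shown are assumptions based on the SDK’s documentation at the time, so consider this illustrative rather than definitive.

```python
# Minimal sketch of streaming a reply from Claude 3 Sonnet via Anthropic's
# Python SDK. Model ID and streaming helper are assumptions from the SDK docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-3-sonnet-20240229",  # assumed Sonnet model ID
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize the Claude 3 announcement in two sentences."}],
) as stream:
    for text in stream.text_stream:  # print tokens as they arrive
        print(text, end="", flush=True)
```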
To see for ourselves, we signed up for Claude Pro and ran Opus through some informal tests. In terms of capability, Opus feels roughly on par with GPT-4. It isn’t great at writing original dad jokes (all of them appear to have been lifted from the web), but it does reasonably well at summarizing information and composing text in different styles. It also handles the logical analysis of word problems fairly well, and while confabulation rates seem low, a few did slip in when we asked about more obscure topics.
None of this is a clear pass or fail, which can be frustrating in a world where computer products usually come with hard numbers and definitive benchmarks. As Willison told us, it’s “yet another case of ‘vibes’ as a key concept in modern AI.”
AI benchmarks are tricky to use because the performance of any AI assistant varies greatly depending on the prompts used and how the underlying AI model has been conditioned. AI models can perform well “on the test,” so to speak, yet fail to transfer those skills to novel scenarios.
Willison’s “vibes” also come into play because the effectiveness of an AI assistant is highly personal. Whether a model accomplishes what you want is hard to capture with a benchmark metric, since the task you give it could be almost anything in any intellectual field. Different models may work well for different people, depending on the task and the prompting style.
This applies not just to Claude 3 but to all large language models from vendors like Google, OpenAI, and Meta. Over time, people have found that each model has its own quirks, and that a given model’s strengths and weaknesses can be played to or worked around with the right prompting strategies. For now, the major AI assistants seem to be settling into a roughly similar set of capabilities.
The upshot is that Anthropic’s claim that Claude 3 can exceed GPT-4 Turbo, which is still widely regarded as the leader in general capability and low hallucination rates, should be approached with caution (or a dose of vibes). If you’re evaluating multiple models, it’s essential to test each one yourself against your own application, because no one else is likely to replicate the exact set of conditions under which you would use it.