
Inside Google’s evaluation process for its next-gen AI Gemini


In its latest effort to improve its Gemini AI, Google is using Anthropic’s model, Claude, as a benchmark for comparison. According to internal correspondence obtained by TechCrunch, contractors working on Gemini are tasked with comparing its responses against those generated by Claude. Google has not disclosed whether it obtained Anthropic’s permission to use Claude in this way.

The Competitive AI Landscape

As the race to build more advanced AI models heats up, tech companies are experimenting with different ways to assess and improve their systems. Typically, this assessment is done against industry benchmarks rather than through head-to-head comparisons with competitors’ models. Google’s contractors, however, are comparing Gemini’s outputs directly against Claude’s, grading each response on criteria such as accuracy, verbosity, and truthfulness. A single assessment can take up to 30 minutes per prompt, a sign of how painstaking the process is.

Recently, some contractors working on Gemini noticed references to Claude on the internal comparison platform. In one case, an output presented to contractors identified itself outright: “I am Claude, developed by Anthropic.”

Contractors say their internal discussions suggest Claude’s safety settings are stricter than Gemini’s. Claude, for example, will decline prompts it deems unsafe, such as role-playing as another AI assistant. Gemini, by contrast, has raised red flags: in one instance where Claude refused to answer a sensitive prompt, Gemini’s response was flagged as a “huge safety violation” for inappropriate content.


Ethical and Legal Implications

Anthropic’s terms of service bar customers from building competing products or training competing AI models without permission, and Google, a major investor in Anthropic, would not say whether it obtained such permission. Shira McNamara, a spokesperson for Google DeepMind, said that while the company does compare Gemini’s outputs with those of other models as part of its routine evaluation, it does not train Gemini on models provided by Anthropic.

In alignment with industry standards, “we compare model outputs during our evaluation process,” McNamara said, “but any statement or suggestion that we would or did use Anthropic models to train Gemini would not be correct.”

Concerns Over Contractor Expertise

TechCrunch previously reported that Google contractors evaluating Gemini’s responses are being asked to assess outputs in areas beyond their expertise. This practice has raised concerns about the potential for inaccuracies, particularly on sensitive topics like healthcare. Such issues highlight the challenges of maintaining high standards of safety and reliability in rapidly advancing AI systems.
