# Thursday, Sept. 5, 2024: Initial lofty claims of Reflection 70B’s superior performance on benchmarks
On Thursday, September 5, 2024, Matt Shumer, co-founder and CEO of OthersideAI (the company behind the HyperWrite writing assistant), released a new large language model (LLM) called Reflection 70B. Shumer claimed that Reflection 70B was the world’s top open-source model and presented a chart showing state-of-the-art results on third-party benchmarks. He attributed the impressive performance to a technique he called “Reflection Tuning,” which he said allows the model to assess and refine its responses for correctness before presenting them to users.
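Shumer did not publish the training recipe behind Reflection Tuning, but the behavior he described (draft an answer, critique it, and revise before replying) can be illustrated with a simple inference-time loop. The sketch below is purely illustrative and is not Shumer’s actual method; `call_model` is a hypothetical stand-in for whatever LLM completion API a reader might wire in.

```python
# Minimal sketch of a generate-critique-revise loop, assuming a generic LLM
# completion API. This is an illustration of the general "reflection" pattern,
# not Shumer's Reflection Tuning training pipeline.

def call_model(prompt: str) -> str:
    """Hypothetical placeholder for an LLM completion call."""
    raise NotImplementedError("Connect this to your own model endpoint.")

def reflective_answer(question: str, max_revisions: int = 2) -> str:
    """Draft an answer, ask the model to critique it, and revise before returning."""
    draft = call_model(f"Answer the question step by step:\n{question}")
    for _ in range(max_revisions):
        critique = call_model(
            "Review the draft answer below for factual or logical errors.\n"
            f"Question: {question}\nDraft: {draft}\n"
            "Reply with 'OK' if it is correct; otherwise list the problems."
        )
        if critique.strip().upper().startswith("OK"):
            break  # the model found no issues, so stop revising
        draft = call_model(
            "Rewrite the draft answer, fixing the listed problems.\n"
            f"Question: {question}\nDraft: {draft}\nProblems: {critique}"
        )
    return draft
```

Reflection 70B was claimed to build this kind of self-checking behavior into the model itself through fine-tuning, rather than running it as an external loop like the sketch above.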
VentureBeat interviewed Shumer and reported his benchmark figures as he presented them, attributing the claims to him. However, independent third-party evaluators and members of the open-source AI community soon began questioning the model’s performance. They were unable to replicate the results on their own and even found evidence suggesting that the version of Reflection 70B being served might actually be Anthropic’s Claude 3.5 Sonnet model.
Criticism mounted when Artificial Analysis, an independent AI evaluation organization, tested Reflection 70B and obtained significantly lower scores than Shumer and HyperWrite had initially claimed. It also emerged that Shumer held an investment in Glaive, the AI startup whose platform he used for synthetic data generation, a stake he failed to disclose when releasing Reflection 70B.
Shumer attributed the discrepancies to problems during the model’s upload to Hugging Face, the AI code and model-hosting platform where Reflection 70B was released. He promised to correct the model weights but has yet to do so. At least one member of the AI research community openly accused Shumer of fraud, further fueling skepticism about the model’s performance.
# Friday, Sept. 6-Monday, Sept. 9: Third-party evaluations fail to reproduce Reflection 70B’s impressive results; Shumer accused of fraud
From Friday, September 6, through Monday, September 9, independent third-party evaluators and members of the open-source AI community continued to struggle to replicate Reflection 70B’s claimed results. Doubts about the model’s performance grew, with some suggesting that Shumer had intentionally misrepresented its capabilities.
Artificial Analysis, which had obtained lower scores than Shumer and HyperWrite claimed, publicly posted its Reflection 70B test results, which differed significantly from the figures Shumer had presented. The revelation of Shumer’s undisclosed stake in Glaive further eroded confidence in the model’s credibility.
Despite promising to address the issues and correct the model weights, Shumer went silent on Sunday evening and did not respond to requests for comment. The AI research community grew increasingly skeptical, with some researchers pointing out that even less capable models can be trained to score well on third-party benchmarks, making benchmark results alone a weak guarantee of real-world performance.
# Tuesday, Sept. 10: Shumer responds and apologizes — but doesn’t explain discrepancies
Finally, on Tuesday, September 10, Shumer broke his silence with a statement on the social network X. In it, he apologized and acknowledged the skepticism surrounding Reflection 70B’s performance. He said a team was working tirelessly to understand what had gone wrong and promised to be transparent with the community once all the facts were in.
Shumer also shared a post by Sahil Chaudhary, founder of Glaive AI, the platform Shumer said he used to generate synthetic training data. In it, Chaudhary acknowledged that even he could not explain some of Reflection 70B’s behavior, including the outputs suggesting a connection to Anthropic’s Claude model, and admitted that the benchmark scores he had shared with Shumer had not yet been reproduced.
However, the responses from Shumer and Chaudhary were not enough to convince skeptics and critics. Yuchen Jin, co-founder and CTO of Hyperbolic Labs, detailed his efforts to host a version of Reflection 70B and troubleshoot its errors, expressed disappointment in Shumer’s lack of transparency, and urged him to provide a more thorough explanation.
Despite Shumer’s apology and promise of transparency, many in the AI research community remain unconvinced, and the saga continues to raise questions within the online generative AI community.
In short, the release of Reflection 70B has sparked controversy and skepticism in the AI research community: independent evaluators could not replicate the model’s claimed benchmark performance, Shumer has been accused of fraud, and his apologies have not satisfactorily explained the discrepancies. The future of Reflection 70B, and trust in Shumer’s claims, now hinge on a more transparent accounting.