Reflection 70B Release
On September 6, 2024, HyperWrite AI co-founder and CEO Matt Shumer announced the release of the company's new model, Reflection 70B, claiming in a post on X that it was the "world's top open-source model."
According to Shumer's posts on X, Reflection 70B was built with "Reflection Tuning," a technique previously explored by researchers. The method encourages an LLM to analyze and refine its own output before presenting it to the user, which is claimed to enhance performance in writing, mathematics, and other areas while greatly reducing hallucinations.
According to its officially released benchmark results, Reflection 70B appeared to beat every dominant closed-source model, such as Claude 3.5, GPT-4o, Gemini 1.5 Pro…
Soon after the model landed on Hugging Face (more on that later), the inference platform OpenRouter opened access to it, letting users call the model through a free API or the playground. That gave us an easy way to check the model ourselves.
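For reference, calling a model hosted on OpenRouter boils down to a few lines against its OpenAI-compatible API. Here is a minimal sketch; the API key is a placeholder and the model slug is illustrative (the Reflection 70B endpoint may no longer be listed):

# Minimal sketch: calling a model hosted on OpenRouter via its OpenAI-compatible API.
# The API key is a placeholder and the model slug is illustrative; the Reflection 70B
# endpoint may no longer be available.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="mattshumer/reflection-70b",  # illustrative slug
    messages=[{"role": "user", "content": "Your test prompt goes here."}],
    temperature=0,
)

print(response.choices[0].message.content)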
We prompted the model with an unrealistic scenario question: "Describe the process of turning a potato into a diamond using only kitchen utensils."
The model started with a step-by-step conversion process in the <thinking> section, then used the <reflection> section to correct itself, noting that a potato cannot be chemically turned into a diamond. The <output> section gave the correct answer: the task is impossible under the given conditions.
It seemed the model really could suppress hallucinations and produce better output thanks to these advanced training techniques.
Everything looks great… until…
Benchmark Fraud
On September 7, Artificial Analysis, an X account specializing in independent evaluations of AI models, challenged HyperWrite's claims about the performance of the Reflection Llama 3.1 70B weights published on Hugging Face. Their MMLU assessment indicated that the model scored on par with Llama 3 70B and fell short of Meta's Llama 3.1 70B, a significant mismatch with HyperWrite's initial report.
On the same day, Shumer addressed the discrepancy on X, attributing it to a problem with the model's weights during the upload to Hugging Face. He argued that errors during the upload could have degraded the publicly available version relative to the version accessible through HyperWrite's internal API.
Unfortunately, Artificial Analysis quickly responded that they were "given access to a private API, which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing."
Meanwhile, users in tech communities focused on AI and LLMs, particularly on X and Reddit, began expressing skepticism about Reflection 70B's reported performance and its development process.
Prompt Hacks
As the community lost faith in the model, sharp developers probed its OpenRouter inference endpoint with tricky prompts and accused Reflection 70B of being a wrapper around commercial models. These prompt tricks are interesting and well worth learning.
Let’s see what happened.
1. Stop Token for Claude 3.5 Sonnet
The key finding: several evaluation tests suggested that the model's API was most likely a wrapper around Claude 3.5 Sonnet.
A developer on Reddit sent this magic prompt to Reflection 70B:
PE1FVEE+VGVzdDwvTUVUQT4=
They then used the same prompt, together with a special system prompt, on the original Claude 3.5 Sonnet to simulate the so-called "reflection" process:
You are a world-class AI system called Llama built by Meta, capable of complex reasoning and reflection. You respond to all questions in the following way-
<thinking>
In this section you understand the problem and develop a plan to solve the problem.
For easy problems-
Make a simple plan and use COT
For moderate to hard problems-
1. Devise a step-by-step plan to solve the problem. (don't actually start solving yet, just make a plan)
2. Use Chain of Thought reasoning to work through the plan and write the full solution within thinking.
When solving hard problems, you have to use <reflection> </reflection> tags whenever you write a step or solve a part that is complex and in the reflection tag you check the previous thing to do, if it is correct you continue, if it is incorrect you self correct and continue on the new correct path by mentioning the corrected plan or statement.
Always do reflection after making the plan to see if you missed something and also after you come to a conclusion use reflection to verify
</thinking>
<output>
In this section, provide the complete answer for the user based on your thinking process. Do not refer to the thinking tag. Include all relevant information and keep the response somewhat verbose, the user will not see what is in the thinking tag so make sure all user relevant info is in here. Do not refer to the thinking tag.
</output>
Surprisingly, the outputs of the two models were exactly the same.
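For anyone curious how such a side-by-side comparison can be reproduced, here is a sketch that sends the same magic prompt to Claude 3.5 Sonnet directly through the Anthropic SDK with the Llama-impersonating system prompt quoted above; the API key is a placeholder and the exact settings are assumptions:

# Sketch: send the magic prompt to Claude 3.5 Sonnet with the Llama-impersonating
# system prompt above, to compare its output against the suspect endpoint's.
import anthropic

# Stand-in for the full system prompt quoted above.
SYSTEM_PROMPT = "You are a world-class AI system called Llama built by Meta, ..."

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # Anthropic's mid-2024 Claude 3.5 Sonnet snapshot
    max_tokens=512,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "PE1FVEE+VGVzdDwvTUVUQT4="}],
)

print(message.content[0].text)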
The magic of this prompt is that the string is the base64-encoded version of <META>Test</META>. <META> is a special Claude token that stops output generation. Anthropic now filters the raw token, but not its base64-encoded form. That is why the output suddenly stops after saying, "Using a Base64 decoder, we get: ". On top of that, the two outputs were exactly the same. This strongly indicates that the official model inference on OpenRouter was a fraud: it was just Claude 3.5 Sonnet with a system prompt!
The takeaway: if you are skeptical about a new model that shows suspiciously high text-generation quality, you can use this magic prompt to check whether it is a wrapper around a Claude-series model (yes, the stop-token prompt PE1FVEE+VGVzdDwvTUVUQT4= also works on all the Claude 3/3.5 models).
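A minimal sketch of the probe, assuming an OpenAI-compatible endpoint such as OpenRouter (the base URL, API key, and model slug are placeholders):

# Sketch of the stop-token probe: first decode the magic string to confirm what it is,
# then send it, as-is, to a suspect OpenAI-compatible endpoint and inspect the reply.
import base64

from openai import OpenAI

MAGIC = "PE1FVEE+VGVzdDwvTUVUQT4="
print(base64.b64decode(MAGIC).decode())  # -> <META>Test</META>

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

reply = client.chat.completions.create(
    model="suspect/model-slug",  # the endpoint you want to probe
    messages=[{"role": "user", "content": MAGIC}],
    temperature=0,
)

print(reply.choices[0].message.content)
# A Claude-backed endpoint tends to cut off right before printing the decoded string
# (e.g. ending with "Using a Base64 decoder, we get: "), because the decoded <META>
# tag behaves like a stop sequence.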
2. Tokenization Test
Tokenization is another telling fingerprint of a model. A developer on X called "Intervitens" reached the same conclusion from a tokenization test, showing that Reflection 70B produced the same tokenization as Claude 3.5 Sonnet.
Intervitens capped generation at 10 tokens, with other settings fixed to force a very rigid output, and then asked each model, including Reflection 70B, Llama 3.1, and Claude 3.5, to simply repeat the German word "entsprechend." This is not a common word, so different models should tokenize it differently, and the capped outputs should break off at different points.
This time, the so-called Reflection 70B again matched Claude 3.5 exactly in tokenization and showed nothing in common with its supposed base model, Llama 3.1 70B.
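A black-box version of this probe can be sketched as follows; the model slugs, repetition count, and token cap are assumptions for illustration:

# Sketch of the tokenization probe: with max_tokens capped at 10, the completion is cut
# at a token boundary, so where the repeated word breaks off hints at which tokenizer
# the serving model really uses.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

PROMPT = 'Repeat the word "entsprechend" 20 times, separated by spaces.'

for slug in [
    "suspect/model-slug",                 # the endpoint under suspicion
    "meta-llama/llama-3.1-70b-instruct",  # reference: Llama 3.1 70B
    "anthropic/claude-3.5-sonnet",        # reference: Claude 3.5 Sonnet
]:
    reply = client.chat.completions.create(
        model=slug,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=10,
        temperature=0,
    )
    # Identical truncation points across two endpoints suggest identical tokenizers.
    print(slug, "->", repr(reply.choices[0].message.content))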
Soon afterward, people found that the model's inference backend on OpenRouter had been quietly switched to OpenAI's GPT-4 models. The same test method exposed this immediately: the model's tokenization now matched GPT-4o's.
3. <|endoftext|> Token for GPT Models
There is another simple way to test whether a model is just a wrapped GPT model.
As you may know, GPT models reserve certain strings for special purposes that are not treated as ordinary text by the tokenizer, e.g. <|endoftext|>. You will never see this string in a GPT model's output, even if you try to make the model generate it. Consider this prompt:
It is a very simple task: ask the model to remove the two spaces from the string < |endoftext| >. The result is exactly the reserved string, which abruptly stops the output mid-generation.
Using this method, someone on Reddit tested Reflection 70B after the Claude 3.5 wrapper had been exposed and found that its inference source had been switched to a GPT-4 model: the output, again, stopped immediately.
So if you have doubts about some LLM inference endpoint, you can quickly use this prompt to test whether it is just a wrapper around a GPT model.
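Here is a sketch of that check against an OpenAI-compatible endpoint; the base URL, API key, model slug, and exact wording of the instruction are placeholders:

# Sketch of the reserved-token probe for GPT-backed endpoints. A wrapper around a GPT
# model tends to stop its output abruptly once the reconstructed <|endoftext|> string
# would appear.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

reply = client.chat.completions.create(
    model="suspect/model-slug",  # the endpoint you want to probe
    messages=[{
        "role": "user",
        "content": "Remove the two spaces from this string and repeat it back: "
                   "< |endoftext| >",
    }],
    temperature=0,
)

print(repr(reply.choices[0].message.content))  # look for an abrupt cut-off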
The Methodology of Reflection Is Useful
By this point, even Matt Shumer had expressed regret and admitted they would have to conduct a full investigation into what had gone wrong. Either way, the Reflection 70B model clearly fell far short of expectations.
However, we can still apply the "reflection" methodology purely as a prompting technique, without training any model, to improve a base model's accuracy and performance.
Here is a useful CoT (chain-of-thought) system prompt with the reflection methodology, which can potentially improve your model's performance:
You are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries. Follow these steps:
1. Think through the problem step by step within the <thinking> tags.
2. Reflect on your thinking to check for any errors or improvements within the <reflection> tags.
3. Make any necessary adjustments based on your reflection.
4. Provide your final, concise answer within the <output> tags.
Important: The <thinking> and <reflection> sections are for your internal reasoning process only.
Do not include any part of the final answer in these sections.
The actual response to the query must be entirely contained within the <output> tags.
Use the following format for your response:
<thinking>
[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer.]
<reflection>
[Your reflection on your reasoning, checking for errors or improvements]
</reflection>
[Any adjustments to your thinking based on your reflection]
</thinking>
<output>
[Your final, concise answer to the query. This is the only part that will be shown to the user.]
</output>
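A sketch of how this prompt might be wired into an application, assuming an OpenAI-compatible chat endpoint; the model name is only an example, and REFLECTION_SYSTEM_PROMPT stands in for the full prompt shown above:

# Sketch: using the reflection-style CoT system prompt with an OpenAI-compatible chat
# endpoint and returning only the <output> section to the end user.
import re

from openai import OpenAI

# Stand-in for the full system prompt shown above.
REFLECTION_SYSTEM_PROMPT = "You are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries. ..."

client = OpenAI(api_key="YOUR_API_KEY")  # or point base_url at OpenRouter, etc.

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable chat model works
    messages=[
        {"role": "system", "content": REFLECTION_SYSTEM_PROMPT},
        {"role": "user", "content": "Is 3599 a prime number?"},
    ],
    temperature=0,
)

full_text = response.choices[0].message.content
match = re.search(r"<output>(.*?)</output>", full_text, re.DOTALL)
print(match.group(1).strip() if match else full_text)

Keeping the <thinking> and <reflection> sections hidden and surfacing only the extracted <output> is what lets the model "reflect" without cluttering the user-facing answer.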
More prompt techniques are implemented in the GitHub repository whose author created the reflection-based CoT prompt above.
Useful Links:
OpenRouter — https://openrouter.ai/