Model Discrepancy: Performance Gap Between Native Platforms and Merlin AI
under review
Samuel Jackson
Hi! I'm noticing a recurring issue: models on Merlin — for some reason, maybe due to custom instructions or internal tuning — seem to perform noticeably worse than their native versions. They're getting dumber and dumber, especially on reasoning tasks.
For example, Gemini 2.5 Pro managed to solve a complex task instantly on Google's own platform, but on Merlin, it failed completely — even when spoon-fed the correct path. The same behavior applies to other models like Claude or DeepSeek when it comes to reasoning.
Would love to understand what’s going on and whether this can be improved, because right now it feels like we’re not getting the true power of these models.
Vijay Bharadwaj
Hi, could you attach your chats (if devoid of sensitive information) here or mail them to me at vj@foyer.work, so that we can see what's wrong?
We actually don't restrict context windows for Pro users up to 100K tokens, for models that support it. Claude 4 Opus is the only exception: if we didn't cap it, people would exhaust their Fair Use limit in just a handful of requests, because it is an extremely compute-heavy model. cc: Gabriele Monni
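For clarity, here is a minimal sketch of what a per-model context cap like the one described above could look like. Every name and number in it (MODEL_CONTEXT_CAPS, truncate_to_cap, the assumed Opus cap) is an illustrative assumption, not Merlin's actual implementation:

```python
# Hypothetical sketch of per-model context caps, as described above.
# Names and numbers are assumptions for illustration, not Merlin internals.

DEFAULT_CAP_TOKENS = 100_000  # Pro users get up to 100K where the model supports it

MODEL_CONTEXT_CAPS = {
    "claude-4-opus": 32_000,  # assumed lower cap for the compute-heavy exception
}

def cap_for(model: str) -> int:
    """Return the context cap for a model, falling back to the 100K default."""
    return MODEL_CONTEXT_CAPS.get(model, DEFAULT_CAP_TOKENS)

def truncate_to_cap(tokens: list[str], model: str) -> list[str]:
    """Keep only the most recent tokens that fit under the model's cap."""
    cap = cap_for(model)
    return tokens[-cap:] if len(tokens) > cap else tokens
```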
If this is a recent occurrence, it may be the prompt or the agent misbehaving, since it is in beta. We'll try to fix this ASAP once I have context on your exact failures.
Thanks for bringing this up!
Танджиро Фан
Vijay Bharadwaj
Okay, here's how everything happened in chronological order.
Previously (about a month ago), to get a satisfactory response, I simply wrote my idea for the scene and asked the model to carefully read the files in the project (if you don't explicitly ask it to read them, it doesn't see them and starts writing nonsense).
The first two images: an example of a message and the number of characters in the scene (circled in red).
Recently (a couple of weeks ago), to get responses of the previous length, I had to ask it both to read the files carefully and to write a reply of at least 10,000 characters. It didn't always comply, but at least the output was close to what I asked for. Oh, and by that time it had stopped remembering anything: there was a case when I asked it to summarise the information I wrote in the character questionnaire, then asked it to write a scene, and when I asked in the next message, ‘Was there a questionnaire?’, it replied, ‘No, there wasn't.’
The following two images (3 and 4): An example of my request and the number of characters in the response.
Now, even when I ask it to write a scene that is 10,000 characters long, it is unable to do so. When I asked why it couldn't do this, it replied, "The character limit is related to technical aspects of text processing in the system. Although I can create long and detailed texts, too much data may exceed the limits set for the safe and effective functioning of the model. If longer texts are needed, they can be divided into separate parts."
Photos 5, 6, 7: An example of my request, the number of characters in the response, and the answer to the question of why it cannot write as much as I ask it to.
Update 1: A quick addition. I tried to generate the scene I wanted again, and in the end it gave me this: 4,623 characters.
It seems that everything went wrong right after the Agentic Chat update was released. (Image 8)
Update 2: I have a hunch about what the problem might be. I usually provide information about my characters in text files, but this time I experimented: I gave it no character information at all and simply asked it to come up with a scene of 10,000 characters. The result was surprising. The scene was cut off at 17,945 characters (it didn't finish properly, it just stopped mid-answer, but that's something). So the problem seems to be that it gets very confused when given files. (Image 9)
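For anyone who wants to repeat this with-files vs. without-files comparison more systematically, here is a rough Python sketch: run the same 10,000-character scene prompt several times with and without file context and log the response lengths. Note that send_chat is a made-up placeholder, not a real Merlin API call; you'd need to wire it up to whatever client you're testing:

```python
# Rough reproduction sketch for the with-files vs. without-files comparison.
# send_chat() is a hypothetical placeholder, not a real Merlin API;
# swap in an actual call before running.

PROMPT = "Write a scene of at least 10,000 characters based on the following idea: ..."

def send_chat(prompt: str, files: list[str] | None = None) -> str:
    """Placeholder: send a prompt (optionally with attached files), return the reply."""
    raise NotImplementedError("wire this up to the chat client you are testing")

def run_trial(files: list[str] | None, runs: int = 5) -> list[int]:
    """Repeat the same prompt several times and record response lengths in characters."""
    return [len(send_chat(PROMPT, files)) for _ in range(runs)]

if __name__ == "__main__":
    no_files = run_trial(files=None)
    with_files = run_trial(files=["characters.txt"])
    print("without files:", no_files)    # lengths near 17,945 would match Update 2
    print("with files:   ", with_files)  # lengths near 4,623 would match Update 1
```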
Gabriele Monni
Same problem here, and it needs to be fixed. It is not at all acceptable that model performance ends up being limited in order to reduce costs; otherwise it is misleading marketing.
Танджиро Фан
Oh my God, the same thing is happening to me. I use Claude models to generate scenarios with my characters. Claude 3.7 Sonnet (Thinking) used to easily write scenes of 12,000 characters or more, but over time I had to state in the request itself that it should write scenes of at least 10,000 characters. Its memory also deteriorated: it could no longer remember the previous message. Then the new Claude 4 (Thinking) model was released, but the problem wasn't solved; it has only gotten worse. Today I noticed a new issue: it started saying something like ‘I'll write the scene you asked for now’ before it started writing. And now, even when I ask for a scene of at least 10,000 characters, the most I can get is 5,000. Bots are getting dumber and dumber.
Mohit
Yeah, I'm seeing the same thing with Claude 4.