Hi! I'm noticing a recurring issue: models on Merlin (possibly because of custom instructions or internal tuning) seem to perform noticeably worse than their native versions. They appear to be getting worse over time, especially on reasoning tasks.
For example, Gemini 2.5 Pro solved a complex task almost instantly on Google's own platform, but on Merlin it failed completely, even when spoon-fed the correct approach. I see the same pattern with other models, such as Claude and DeepSeek, on reasoning tasks.
I'd love to understand what's going on and whether this can be improved, because right now it feels like we're not getting the true power of these models.