AI Founders

Евгений Vasin (КиберЕвгений) · 3 days ago · 236 words · 1 min.
TL;DR
Microsoft Research & Salesforce tested 15 top models across 200,000+ simulated conversations: performance that sits at 90% on single-turn prompts plummets to 65% in multi-turn conversation, driven by a 112% explosion in unreliability.
Microsoft Research & Salesforce have a paper that should scare every AI builder right now. They tested 15 of the top models (GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet, o3, DeepSeek R1, Llama 4) across 200,000+ simulated conversations, and the results are genuinely alarming. 😐

Give a model a single-turn prompt and it hits 90% performance. Have a multi-turn conversation instead, and it plummets to 65%. Same model, same task; the only difference is talking normally, turn by turn.

The crazy part is that the AI isn't getting dumber (aptitude dropped only 15%). The problem is that unreliability EXPLODED by 112%. Here is exactly why the models break:

- They answer before you finish explaining, and those wrong assumptions get baked in permanently.
- They fall in love with their first wrong answer and just keep building on it.
- They completely forget the middle of your conversation.
- Longer responses introduce more assumptions, which means more errors.

Even the new reasoning models failed: o3 and DeepSeek R1 performed just as badly. Giving them extra thinking tokens did absolutely nothing, and setting temperature to 0 left them just as broken.

Every benchmark we celebrate is tested in perfect, single-prompt lab conditions. But real conversations break every model on the market, and nobody is talking about it.

The only fix right now? Stop chatting. Give your AI everything upfront in one massive message instead of going back and forth.

The paper, if you want to read it: arxiv.org/pdf/2505.06120
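The "everything upfront" fix can be sketched as a tiny helper that collects all the requirements you would otherwise drip-feed across turns and merges them into one consolidated prompt before the first model call. This is a minimal illustration of the idea, not code from the paper; the function name, task text, and requirement strings are all made up for the example.

```python
def consolidate(task: str, requirements: list[str]) -> str:
    """Merge a task description and every known requirement into a single
    prompt, so the model sees the full specification in one turn instead
    of discovering it piecemeal across a conversation."""
    lines = [task, "", "Requirements:"]
    lines += [f"- {r}" for r in requirements]
    return "\n".join(lines)

# Instead of sending these as three separate chat turns, batch them:
shards = [
    "the function must accept a list of ints",
    "return the median, not the mean",
    "raise ValueError on an empty list",
]

prompt = consolidate("Write a Python function `median(xs)`.", shards)
print(prompt)
```

The resulting string is what you would send as the one "massive message": the model never gets a chance to commit to a wrong early assumption, because nothing is revealed late.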