Discussion about this post

Neural Foundry:

Brilliant breakdown of how DeepSeek is making scalable architecture innovations actually matter. The RL distillation approach, where they train domain experts separately and then merge that knowledge, is a clever way around limited compute budgets. What stands out is how token inefficiency might erode their cost advantage on harder problems even with the lower per-token pricing. The sparse attention mechanism is fascinating, but when Speciale needs far more thinking tokens than Gemini, being 25x cheaper per token can still mean a similar total cost.
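To make that last point concrete, here is a minimal back-of-the-envelope sketch in Python. Only the 25x price ratio comes from the comment; the dollar price and token counts are hypothetical placeholders, not published figures.

```python
# Back-of-the-envelope: why "25x cheaper per token" can still mean
# a similar total bill when a model burns far more thinking tokens.
# All numbers below are hypothetical placeholders for illustration.

GEMINI_PRICE_PER_MTOK = 10.00                        # hypothetical $/1M output tokens
SPECIALE_PRICE_PER_MTOK = GEMINI_PRICE_PER_MTOK / 25  # the "25x cheaper" from the comment

gemini_tokens = 2_000     # hypothetical thinking tokens on a hard problem
speciale_tokens = 50_000  # hypothetical: 25x more thinking tokens

gemini_cost = gemini_tokens / 1e6 * GEMINI_PRICE_PER_MTOK
speciale_cost = speciale_tokens / 1e6 * SPECIALE_PRICE_PER_MTOK

print(f"Gemini total:   ${gemini_cost:.4f}")   # $0.0200
print(f"Speciale total: ${speciale_cost:.4f}") # $0.0200

# If the token multiplier matches the price ratio, the totals converge:
# the per-token discount is exactly cancelled by the extra reasoning tokens.
```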
