> MultiOn’s recently announced AgentQ uses MCTS (Monte Carlo Tree Search), self-critique, and DPO-based fine-tuning to enhance AI reasoning. This is similar to the approach for AI reasoning in “Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing.”
Thanks for spotting AgentQ last week, Patrick. It's disappointing that there are only two in this category, and none open-source. But I believe this third step is key. In trying to solve a task and traversing the search tree toward a solution, a model has to learn from the process, just the way we would. Parameters have to be tweaked each time, absorbing the tricky aspects of every success and failure.
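To make that concrete, here's a toy sketch of the idea: run several search rollouts over a task, score them, and pair successes with failures as DPO-style preference data. Everything here is invented for illustration (the step names, the success criterion, the function itself); it is not AgentQ's actual pipeline.

```python
import random

def search_and_collect(task, n_rollouts=8, seed=0):
    """Toy stand-in for tree search over reasoning steps.

    Sample rollouts, score each one, and pair winning trajectories
    with losing ones as (prompt, chosen, rejected) preference pairs,
    the format DPO fine-tuning consumes. All names are illustrative.
    """
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n_rollouts):
        # A "trajectory" is just a short sequence of reasoning steps.
        steps = [f"step-{rng.randint(0, 3)}" for _ in range(3)]
        # Invented success criterion: the trajectory must visit step-0.
        reward = 1.0 if "step-0" in steps else 0.0
        trajectories.append((steps, reward))

    successes = [t for t, r in trajectories if r > 0]
    failures = [t for t, r in trajectories if r == 0]

    # Each success/failure pairing becomes one DPO preference example.
    return [{"prompt": task, "chosen": s, "rejected": f}
            for s, f in zip(successes, failures)]
```

The key point is that the search itself manufactures the training signal: no human labels, just the contrast between trajectories that worked and those that didn't.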
It's sobering to think that true reasoning is going to need real-time training; inference alone won't be good enough. Won't that upend a lot of AI cloud strategies...
I agree - this looks to be the recipe that will take AI forward, perhaps to AGI. Others are pursuing some combination of MCTS + self-critique + PRM (process reward model) fine-tuning: AlphaCode 2 and other DeepMind projects, and Project Strawberry, are in this vein. BTW, even the Llama 3.1 models used some kind of PRM in their instruction tuning; I suspect it will become standard for best-in-class reasoning in AI models. It's a very big challenge, and the message of my article is that it will be eaten in small bites, baby steps at a time.
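For readers new to the PRM idea, the contrast with an outcome reward model (ORM) fits in a few lines. This is a minimal illustration, not any lab's actual implementation; aggregating step scores by taking the minimum is one common choice, and the function names are hypothetical.

```python
def orm_score(answer_ok):
    """Outcome reward model: only the final answer matters."""
    return 1.0 if answer_ok else 0.0

def prm_score(step_scores):
    """Process reward model: every intermediate step is scored.

    Aggregating by the minimum means one bad step sinks the whole
    chain, even if the final answer happens to come out right.
    """
    return min(step_scores) if step_scores else 0.0
```

A chain with a flawed middle step but a lucky correct answer shows the difference: the ORM gives it full credit, while the PRM flags the weak link, which is exactly the signal step-level fine-tuning needs.

```python
step_scores = [0.9, 0.2, 0.95]  # hypothetical per-step scores
orm_score(answer_ok=True)       # 1.0 - looks perfect
prm_score(step_scores)          # 0.2 - the flawed step is exposed
```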