AI Research Review 25.07.03
GLM-4.1V-9B-Thinking. ASTRO teaches LLMs to reason with search. Can LLMs learn Strategic Reasoning? A chess study. Causal Reasoning in LLMs is a mirage, but G2-Reasoner and contextual knowledge helps.
Introduction – New Frontiers in AI Reasoning Models
Since the introduction of the o1 reasoning model, there have been significant advances in AI reasoning. DeepSeek shared the RL post-training process used to instill reasoning into DeepSeek-R1, and this year many papers have presented refined RL post-training algorithms for AI reasoning.
This week’s AI research review covers papers that expose the limits of these methods and extend AI reasoning capabilities by bringing reasoning to the visual domain with RL techniques, building structured reasoning from the ground up, instilling causal reasoning, and examining strategic reasoning:
GLM-4.1V-9B-Thinking
ASTRO: Teaching LLMs to Reason with Search
Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
Causal Reasoning in LLMs: Reality or Mirage?
GLM-4.1V-9B-Thinking – Versatile Multimodal Reasoning from RL
Researchers from Zhipu AI & Tsinghua University have introduced GLM-4.1V-Thinking, a vision-language model (VLM) engineered for general-purpose multimodal reasoning. To address the challenge of achieving broad-spectrum reasoning capabilities in VLMs, the researchers present a reasoning-centric training framework in the paper “GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.”
The GLM-4.1V-Thinking architecture leverages a ViT Encoder (AIMv2-Huge) for visual processing, an MLP Projector for feature alignment, and the GLM4 LLM as the decoder. It handles native image and video resolutions and incorporates 3D-RoPE for enhanced spatial and temporal awareness.
The training pipeline progresses through three stages: multimodal pre-training on a diverse, knowledge-intensive corpus; supervised fine-tuning on carefully curated long Chain-of-Thought (CoT) data to establish the reasoning style; and a critical RL phase that uses both Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from Human Feedback (RLHF), underpinned by a robust, multi-domain reward system that is crucial for preventing training collapse.
They further boost RL results using a novel technique called Reinforcement Learning with Curriculum Sampling (RLCS). RLCS dynamically adjusts sampling difficulty based on the model's evolving competence, significantly boosting learning efficiency.
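To make the curriculum idea concrete, here is a minimal Python sketch of difficulty-aware sampling. It is not the paper's implementation: the function names, the running pass-rate tracking, and the weighting rule (favor prompts the model currently solves about half the time) are all assumptions chosen for illustration.

```python
import random

def sampling_weights(pass_rates, floor=0.05):
    """Weight each prompt by how informative it is likely to be.

    Assumption (not from the paper): prompts solved about half the time
    carry the most learning signal, so weight = 4 * p * (1 - p), which
    peaks at p = 0.5 and shrinks toward 0 for prompts that are always or
    never solved. A small floor keeps every prompt reachable.
    """
    return [max(floor, 4.0 * p * (1.0 - p)) for p in pass_rates]

def update_pass_rate(old_rate, solved, momentum=0.9):
    """Exponential moving average of the model's success on one prompt."""
    return momentum * old_rate + (1.0 - momentum) * (1.0 if solved else 0.0)

def sample_batch(prompts, pass_rates, batch_size, rng=random):
    """Draw a training batch, favoring prompts of intermediate difficulty."""
    weights = sampling_weights(pass_rates)
    return rng.choices(prompts, weights=weights, k=batch_size)

prompts = ["easy", "medium", "hard"]
pass_rates = [0.95, 0.50, 0.02]  # hypothetical running success rates
weights = sampling_weights(pass_rates)
batch = sample_batch(prompts, pass_rates, batch_size=8, rng=random.Random(0))
```

As the model improves, `update_pass_rate` moves a prompt's estimate upward, its weight falls, and sampling shifts toward prompts that are still hard. That is the sense in which the sampling difficulty tracks the model's competence.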

Their use of RLCS and domain-specific reward system innovations in RL post-training substantially boosts the model’s performance, with gains of up to 7.3% on reasoning benchmarks. As a result, GLM-4.1V-9B-Thinking demonstrates state-of-the-art performance among models of comparable size:
“In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities.”

GLM-4.1V-9B-Thinking is open-source and available on HuggingFace.
ASTRO: Teaching LLMs to Reason with Search
"Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs." - ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context
Research from Meta AI, published in “ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context,” addresses the challenge of systematically teaching LLMs to internalize structured reasoning. The authors introduce the ASTRO ("Autoregressive Search-Taught Reasoner") framework, which teaches an LLM to reason like a classical search algorithm by building self-reflection, backtracking, and exploration into the reasoning training process.
The ASTRO framework instills these reasoning capabilities by training on search processes written out as text. The key innovation is that the model internalizes the entire search process (exploration, self-reflection on intermediate steps, and backtracking from errors) within a single, continuous autoregressive generation, so an ASTRO-trained model produces the whole search trajectory, complete with its twists and turns, as one coherent stream of thought.
Training a model with ASTRO proceeds in three key stages. First, ASTRO generates a synthetic dataset of search trajectories by applying Monte Carlo Tree Search (MCTS) to mathematical problem-solving. These search traces are then linearized and converted into natural language Chain-of-Thoughts (CoTs), a step that crucially injects explicit self-reflection and backtracking phrases from the search into the training data.
This dataset subsequently informs a supervised fine-tuning (SFT) stage, bootstrapping models with a rich prior for autoregressive search. Finally, reinforcement learning (RL) with verifiable rewards further optimizes the model's search and reasoning proficiencies.
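The linearization step is the easiest part to picture in code. The sketch below is an illustrative assumption rather than ASTRO's actual pipeline: it uses a hand-built tree instead of MCTS output, a plain depth-first walk, and two fixed phrases where the paper converts traces into natural language. The `Node` class and `linearize` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    step: str                      # one reasoning step, written as text
    is_solution: bool = False      # True only for a leaf that reaches a correct answer
    children: List["Node"] = field(default_factory=list)

def linearize(node: Node, trace: List[str]) -> bool:
    """Depth-first walk that keeps dead ends in the trace instead of hiding them.

    Returns True once a solution leaf is reached.
    """
    trace.append(node.step)
    if not node.children:
        if node.is_solution:
            return True
        trace.append("But wait, this does not look right.")  # self-reflection
        return False
    for i, child in enumerate(node.children):
        if linearize(child, trace):
            return True
        if i < len(node.children) - 1:
            # Backtracking phrase, emitted only when another branch remains.
            trace.append(f"Let me go back to: {node.step}")
    return False

tree = Node("Solve x^2 - 5x + 6 = 0.", children=[
    Node("Try factoring as (x - 1)(x - 6).", children=[
        Node("Expanding gives x^2 - 7x + 6."),
    ]),
    Node("Try factoring as (x - 2)(x - 3).", children=[
        Node("Expanding gives x^2 - 5x + 6, so x = 2 or x = 3.", is_solution=True),
    ]),
])

trace: List[str] = []
found = linearize(tree, trace)
chain_of_thought = "\n".join(trace)
```

The resulting `chain_of_thought` reads as one continuous attempt that tries a wrong factoring, notices the mismatch, returns to the problem, and then succeeds. That is the kind of text the SFT stage would train on.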
Applying ASTRO to the Llama 3 family of models yielded significant performance improvements on challenging mathematical reasoning benchmarks. Llama-3.1-70B-ASTRO-RL achieved absolute gains of 16% on MATH-500, 26.9% on AMC 2023, and 20% on AIME 2024, surpassing other advanced baselines. A critical finding is that search-based reasoning traces are essential: Models trained with explicit self-reflection and backtracking significantly outperformed those without.
Can Large Language Models Develop Strategic Reasoning?
This paper poses its key question in the title: Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess. Strategic reasoning involves “the ability to plan, anticipate adversary actions, and make decisions in multiagent environments.”
This paper rigorously investigates whether LLMs can be trained for strategic reasoning by applying RL to LLMs in the domain of chess. Its surprising conclusion is that while RL with dense, expert-derived rewards improves tactical performance, the models consistently plateau far below human expert levels:
“Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. However, surprisingly, all models plateau far below expert levels.”
The experimental setup was designed to isolate the impact of RL on strategic reasoning. The methodology involved fine-tuning Qwen 2.5 and Llama 3.1 models with Group Relative Policy Optimization (GRPO) on a Lichess puzzle dataset. A novel aspect was employing a pre-trained chess expert network to provide dense, continuous reward signals based on move quality, effectively a knowledge distillation process.
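A small numerical sketch shows why a dense reward can matter under GRPO. It assumes the commonly described group-relative advantage, where rewards are standardized within a group of sampled answers; whether the paper uses exactly this normalization is an assumption. The reward values are made up for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled answers (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: four sampled moves for one puzzle, none of them the solution.
sparse_rewards = [0.0, 0.0, 0.0, 0.0]     # binary reward: 1 only for the exact solution
dense_rewards = [0.10, 0.45, 0.30, 0.05]  # made-up move-quality scores from an expert network

sparse_adv = group_relative_advantages(sparse_rewards)
dense_adv = group_relative_advantages(dense_rewards)
```

With the sparse reward every advantage is zero, so this group contributes no policy gradient. With the dense reward the second move still receives a positive advantage even though no sampled move solved the puzzle.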

The performance of this approach was compared against training with sparse binary rewards. Key results are that distillation-based dense rewards substantially outperform sparse binary rewards, yet all models plateau at 25-30% puzzle accuracy, well below expert performance (60-80%). Even with additional supervised fine-tuning on expert reasoning traces, performance did not improve, as models struggled with basic chess rules and board state comprehension.
This leads to the paper's critical insight that this failure stems from a deficit in the pretrained model's internal world model:
"RL alone may not be able to fully overcome [the] deficit in the pretrained models’ internal understanding of chess."
RL primarily amplifies capabilities that already exist in pre-trained LLMs rather than teaching new domain knowledge; it cannot create strategic understanding that is absent from the foundation model. This suggests that adequate domain-specific exposure during pre-training is essential for developing advanced strategic reasoning in complex new environments.
Causal Reasoning in LLMs: Reality or Mirage?
“Specifically, LLMs are only capable of performing shallow (level-1) causal reasoning, primarily attributed to the causal knowledge embedded in their parameters, but they lack the capacity for genuine human-like (level-2) causal reasoning.” – From “Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?”
The research paper Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? critically assesses whether LLMs exhibit genuine human-like causal reasoning or merely leverage memorized knowledge. LLMs often appear to demonstrate causal reasoning, correctly identifying cause-and-effect relationships in text, but a critical open question is whether this is genuine reasoning, or a “mirage” created by retrieving causal patterns memorized from training data.
The authors propose a distinction between “level-1” (shallow, knowledge-retrieval based) and “level-2” (genuine, deduction-based, new knowledge generation) causal reasoning, arguing that current LLMs primarily operate at level-1.
They first empirically validate this hypothesis with a new causal Q&A benchmark called CausalProbe-2024. They find that:
“The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning.”

To bridge this gap in causal reasoning, the paper proposes G2-Reasoner, a framework inspired by human reasoning that integrates external general knowledge via Retrieval-Augmented Generation (RAG) and goal-oriented thinking via prompts to guide LLMs. These steer the model towards a causal inference process.
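The two ingredients described above, retrieved general knowledge and a goal-oriented instruction, can be sketched as a prompt builder. This is a toy illustration and not the G2-Reasoner implementation: the word-overlap retriever, the prompt wording, and the `retrieve` and `build_prompt` names are assumptions.

```python
import re

def tokenize(text):
    """Lowercase word set with punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, knowledge_base, top_k=2):
    """Toy retriever: rank snippets by word overlap with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        knowledge_base,
        key=lambda snippet: len(q_words & tokenize(snippet)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, knowledge_base):
    """Combine retrieved knowledge with a goal-oriented instruction."""
    context = "\n".join(f"- {s}" for s in retrieve(question, knowledge_base))
    return (
        "General knowledge that may be relevant:\n"
        f"{context}\n\n"
        "Goal: identify the most plausible cause-and-effect relationship. "
        "Reason step by step toward that goal.\n\n"
        f"Question: {question}\nAnswer:"
    )

knowledge_base = [  # hypothetical snippets standing in for an external knowledge source
    "Heavy rain can cause rivers to overflow.",
    "Overflowing rivers can flood nearby roads.",
    "Chess openings are studied by experts.",
]
prompt = build_prompt(
    "Why were the roads near the river flooded after heavy rain?", knowledge_base
)
```

A real system would replace the overlap scorer with a proper retriever and send `prompt` to an LLM. The point here is only the shape of the input: retrieved context first, then an explicit reasoning goal, then the question.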
In evaluations, G2-Reasoner demonstrated that it “significantly enhances LLMs’ causal reasoning capability,” particularly in fresh and counterfactual contexts, outperforming vanilla, CoT, and RAG baselines. This suggests that while LLMs may not possess innate causal reasoning, their capabilities can be substantially enhanced by augmenting them with external knowledge and more structured reasoning frameworks.
While G2-Reasoner offers a promising initial step towards fostering more genuine, deductive causal reasoning, achieving full human-like level-2 capability remains a significant challenge requiring further exploration into broader knowledge integration and sophisticated reasoning mechanisms.
Conclusion – Training for Deeper Reasoning
These research papers show that current AI reasoning models are limited in deeper forms of reasoning, such as strategic and causal reasoning. Today, these models learn to reason by being trained via RL on reasoning traces, a form of “cognitive distillation” that teaches the model specific patterns of thinking. This trains models to follow known chains of thought, but it is not sufficient to build models that can think in more complex, deeper, and more powerful ways.
We’ll need more breakthroughs to get to AGI-level reasoning. However, these results give possible directions on what those breakthroughs might include:
Breadth: GLM-4.1V-Thinking improves its reasoning with cross-domain reasoning challenges covering a broad range of tasks in several modalities.
Self-learning: ASTRO distills the algorithmic process of MCTS into a natural language format, effectively teaching the model how to bootstrap its own learning.
Knowledge context support: The causal reasoning paper proposes a new framework, G2-Reasoner, which integrates external knowledge via RAG to augment the model's core capabilities. This may overcome the barrier identified in the chess paper, which showed that RL training cannot make up for gaps in a foundation model's domain knowledge.