Key ideas
- Generating more intermediate tokens improves reasoning more than scaling up the model does.
- Constant-depth transformers can solve any inherently serial problem by generating a sufficient number of intermediate tokens (Li et al., 2024).
- If the model has to generate the final answer directly, the same problems either require a huge depth to solve or cannot be solved at all.
- When a step-by-step reasoning path is present, LLMs decode the final answer with much higher confidence than with direct-answer decoding (Wang et al., 2024).
- Adaptive, relevant exemplars help.
- That’s why self-generated exemplars > few-shot > zero-shot (Yasunaga et al., 2024).
- 👀 We can use a non-greedy decoding strategy to elicit reasoning in pre-trained LLMs (Wang et al., 2024).
- Self-consistency (Wang et al., 2024) helps by sampling multiple reasoning paths and majority-voting over the final answers (see the sketch after this list):
- Q1: When the LLM outputs a direct answer without intermediate steps, does it still help to sample several times and pick the most common answer? My guess: it doesn't, because direct-answer samples are too noisy (?).
- Q2: What about modifying self-consistency so the LLM generates multiple responses in a single pass, instead of being sampled multiple times, and we then pick the most common answer? Does this make sense?
- Future work:
- LLMs Can Be Easily Distracted by Irrelevant Context (Shi et al., 2023).
- LLMs Cannot Self-Correct Reasoning Yet, including multi-LLM debate (Huang et al., 2024).
- Premise order matters (Chen et al., 2024).
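A minimal sketch of self-consistency with non-greedy (temperature) sampling, as referenced in the decoding/self-consistency bullets above. The client, model name, prompt template, and answer-extraction regex are illustrative assumptions (an OpenAI-compatible chat API), not anything prescribed by Wang et al. (2024).

```python
# Self-consistency: sample k diverse chain-of-thought paths (non-greedy,
# temperature > 0), extract each final answer, and majority-vote over them.
# Model name and answer format are placeholders.
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

COT_PROMPT = (
    "Q: {question}\n"
    "Let's think step by step, then end with 'Answer: <final answer>'."
)

def extract_answer(text: str) -> str | None:
    match = re.search(r"Answer:\s*(.+)", text)
    return match.group(1).strip() if match else None

def self_consistency(question: str, k: int = 10, temperature: float = 0.7) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                       # placeholder model
        messages=[{"role": "user",
                   "content": COT_PROMPT.format(question=question)}],
        temperature=temperature,                   # non-greedy decoding
        n=k,                                       # k independent reasoning paths
    )
    answers = [extract_answer(c.message.content) for c in response.choices]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return ""                                  # no parseable answer; caller decides
    return votes.most_common(1)[0][0]              # most common final answer wins
```

Regarding Q1: with greedy decoding all k samples collapse to the same direct answer, so voting adds nothing; with temperature sampling but no intermediate steps, the vote is over noisy guesses, which matches the hunch in Q1.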
Sources:
People say
Q* aka Strawberry = STaR (Self-Taught Reasoner) with dynamic Self-Discover + something like DSPy for RAG optimization
State-of-the-art search = GoT (Graph of Thoughts) + MCTS (Monte Carlo tree search) + DSPy- / CLIN-inspired tooling for "self-play"-style optimization at build time and runtime + graph-based knowledge bases
Continuous-learning agents are clever graph retrieval inspired by GraphRAG, CLIN, DSPy, and dynamic reasoning modules
The debate about whether LLMs can or can’t reason really bugs me a lot. I just don’t think it’s the right way to approach the topic, because reasoning is neither a binary property nor an isolated process. Here’s my take:
First, let’s start with a general definition: reasoning is the process involved in almost any complex problem-solving, regardless of the model used (LLM or otherwise). In the LLM context, reasoning typically involves consuming more tokens during inference. I quite like this definition by OpenAI (starting from 1:10): https://x.com/OpenAI/status/1834320155989664067.
To understand the capabilities required for reasoning, I think it’s necessary to break it down into three components for better insights: execution, intuition, and planning. Complex problem-solving involves finding the right path in a huge search space and executing each subtask along that path. Let’s dive deeper:
a) Execution: Imagine the model already knows how to solve a task. Execution is about filling in the intermediate results for each step of the answer. Multi-hop QA is a great example - it's one of the most straightforward reasoning tasks. We usually know the steps to answer the question; it's more about recalling the right answer for each step. This is probably the most basic capability required in reasoning, and LLMs can clearly handle many execution jobs (like retrieving a single-hop answer or doing some basic arithmetic). Even when they can't, they can often offload the task to an external solver.
b) Intuition: For more complex tasks like math reasoning, we don’t always have a clear path to the solution. This is where intuition comes in, quickly identifying the most promising next action from a large set of possibilities. Personally, I think this is where LLMs really shine. The prior knowledge modeled in natural language provides good intuitions for many tasks. We definitely can’t say LLMs have no intuition.
c) Planning: Finally, for the most challenging tasks (think IMO-level math problems), we can’t rely solely on intuition. We need more deliberate thinking, exploration, and trial-and-error. This is typically achieved by human-coded workflows such as search algorithms, and I think this is what LLMs mostly struggled with previously. However, recent advancements, particularly in O1, have made significant strides in this aspect.
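To make (c) concrete, here is a toy sketch of the kind of human-coded search workflow mentioned above: best-first search over partial reasoning chains, where an LLM-style proposer supplies candidate next steps (intuition), a scorer ranks them, and the loop itself does the deliberate exploration (planning). The callables `propose_steps`, `score`, and `is_solved` are hypothetical stand-ins for LLM calls; this is not a description of how O1 actually works.

```python
# Toy best-first search over reasoning steps: the proposer supplies candidate
# next steps, the scorer ranks partial chains, and the loop explores the most
# promising chain first until the problem is solved or the budget runs out.
import heapq
from typing import Callable

def best_first_reasoning(
    question: str,
    propose_steps: Callable[[str, list[str]], list[str]],  # (question, steps so far) -> candidate next steps
    score: Callable[[str, list[str]], float],               # higher = more promising partial chain
    is_solved: Callable[[str, list[str]], bool],             # does this chain answer the question?
    max_expansions: int = 50,
) -> list[str] | None:
    # Max-heap via negated scores; each entry is (priority, steps so far).
    frontier: list[tuple[float, list[str]]] = [(-score(question, []), [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, steps = heapq.heappop(frontier)
        if is_solved(question, steps):
            return steps
        for step in propose_steps(question, steps):
            new_steps = steps + [step]
            heapq.heappush(frontier, (-score(question, new_steps), new_steps))
    return None  # search budget exhausted
```

Swapping the priority queue for rollouts and backed-up value estimates gives MCTS-flavoured variants, roughly the GoT + MCTS combinations mentioned under "People say".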
In conclusion, reasoning in LLMs is a complex, multi-faceted capability. It’s not accurate to make blanket statements about whether LLMs can or can’t reason. Instead, we should consider how well they perform in each of these components and how they combine them to solve complex problems.
Exciting Ideas
- From Minh Nhat Nguyen: Use SAEs (sparse autoencoders) to steer reasoning chains, and then correct reasoning chains, for LLMs (rough sketch below).
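A rough sketch of what the steering half could look like, assuming a PyTorch Llama-style Hugging Face model (exposing `model.model.layers[i]`) and an already-trained SAE whose decoder rows are feature directions in the residual-stream basis. The layer index, `sae_decoder`, `feature_idx`, and `alpha` are all hypothetical placeholders; this is a sketch of activation steering with an SAE direction, not the original idea's implementation.

```python
# Hypothetical sketch: steer generation by adding a (scaled, unit-norm) SAE
# feature direction to one layer's residual-stream output via a forward hook.
import torch

def add_sae_steering_hook(model, sae_decoder: torch.Tensor, feature_idx: int,
                          layer: int = 12, alpha: float = 4.0):
    # One decoder row = one interpretable feature direction (assumption).
    direction = sae_decoder[feature_idx]
    direction = direction / direction.norm()

    def hook(module, inputs, outputs):
        # Decoder layers typically return a tuple; hidden states come first.
        hidden = outputs[0] if isinstance(outputs, tuple) else outputs
        steered = hidden + alpha * direction.to(device=hidden.device,
                                                dtype=hidden.dtype)
        if isinstance(outputs, tuple):
            return (steered,) + outputs[1:]
        return steered

    # Returned handle lets you call .remove() to stop steering.
    return model.model.layers[layer].register_forward_hook(hook)
```

The "correct reasoning chains" half would presumably compare steered and unsteered generations and fix the chains where the targeted feature fires; that part is left out here.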