PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning
Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings.
In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context.
Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.
To scale the training of larger and more capable test-time learning models, PERK uses parameter-efficient test-time adaptation to reduce both the number of parameters adapted at test time and the amount of gradient unrolling required during optimization. Specifically, in the inner and outer loops, we only optimize a parameter-efficient adapter (LoRA) to encode the contexts, and we apply truncated gradient unrolling to reduce the memory cost of backpropagating through the inner loop.
We set the test-time learning objective (and thus the inner-loop objective) to Causal Language Modeling (CLM) of the context, and set the adaptation algorithm $Alg$ to update only the LoRA parameters. The adaptation algorithm computes gradients on a batch of sub-sequences drawn from the full context sequence. Processing the context as a parallel batch (or multiple batches) of sub-sequences allows us to handle lengthy inputs beyond the model's context window.
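As a concrete illustration of this inner-loop adaptation, the PyTorch sketch below splits a long context into sub-sequence chunks, processes them as one parallel batch, and takes a few CLM gradient steps on the LoRA parameters only. It is a minimal sketch under our own assumptions (a Hugging Face-style causal LM whose LoRA weights are substituted functionally via `torch.func.functional_call`), not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def inner_loop_adapt(model, lora_params, context_ids, chunk_len=1024,
                     n_steps=4, lr=1e-3, pad_id=0):
    """Encode a long context into LoRA parameters via causal-LM gradient steps.

    `lora_params` maps LoRA parameter names to tensors (requires_grad=True);
    only these are updated, while the frozen base weights come from `model`.
    The context is split into fixed-length sub-sequences processed as one
    parallel batch, so the full sequence never has to fit in the model's
    context window at once.
    """
    # (1, L) context -> list of sub-sequences -> padded (num_chunks, chunk_len) batch.
    chunks = [c.squeeze(0) for c in context_ids.split(chunk_len, dim=-1)]
    batch = torch.nn.utils.rnn.pad_sequence(
        chunks, batch_first=True, padding_value=pad_id)

    adapted = dict(lora_params)
    for _ in range(n_steps):
        # Run the base model with the current LoRA tensors substituted in.
        logits = functional_call(model, adapted, (batch,)).logits
        # Next-token (CLM) loss over the context sub-sequences.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch[:, 1:].reshape(-1),
            ignore_index=pad_id)
        grads = torch.autograd.grad(loss, list(adapted.values()),
                                    create_graph=True)  # keep graph for meta-training
        # One gradient step on the LoRA parameters only.
        adapted = {name: p - lr * g
                   for (name, p), g in zip(adapted.items(), grads)}
    return adapted
```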
In the outer loop, we optimize the meta parameters of the LoRA module over the corresponding distribution of reasoning problems. The optimal meta parameters minimize the expected reasoning loss obtained after the adapter is adapted to each reasoning problem, where the adaptation is the parameter-efficient CLM update of the LoRA meta parameters on that problem's context.
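Written out, the bi-level objective described above takes the following form (our notation, reconstructed from this description rather than quoted from the paper), where $\phi$ denotes the frozen base-model parameters, $\theta$ the LoRA meta parameters, and $\mathcal{D}$ the distribution of (context, question, answer) triples:

$$\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{(c,\, x,\, y) \sim \mathcal{D}} \Big[ \mathcal{L}_{\text{reason}}\big(f(x;\, \phi,\, Alg(\theta, c)),\, y\big) \Big], \qquad Alg(\theta, c) = \text{$N$ steps of gradient descent on } \mathcal{L}_{\text{CLM}}(\theta;\, c).$$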
To optimize the outer-loop objective with gradient-based methods, we must differentiate through the inner-loop adaptation algorithm $Alg$, which involves higher-order derivatives and requires storing the full adaptation trajectory to compute the meta-gradient. The resulting memory cost limits the context lengths and model sizes the method can handle. To reduce it, we truncate backpropagation in the meta-gradient computation: we run the inner-loop optimization for all $N$ specified update steps, but only store the computational graph for the last $T \leq N$ steps.
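The sketch below illustrates one way truncated unrolling could be implemented on top of the earlier inner-loop sketch, with $N$ = `n_inner` and $T$ = `n_tracked`. The `clm_loss` and `reasoning_loss` helpers are hypothetical wrappers that run the model with the given LoRA tensors (e.g., via `functional_call`), and the identity re-attachment of the truncated prefix is our own implementation choice for letting the meta-gradient reach the LoRA meta parameters; this is an assumption-laden sketch, not the authors' code.

```python
def meta_train_step(model, lora_params, context_ids, query_ids, answer_ids,
                    n_inner=8, n_tracked=2, inner_lr=1e-3):
    """One outer-loop update with truncated backprop through the inner loop.

    All `n_inner` CLM adaptation steps are executed, but only the last
    `n_tracked` steps are kept in the autograd graph; the truncated prefix is
    re-attached to the meta parameters with an identity Jacobian so that the
    meta-gradient still reaches them.
    """
    names = list(lora_params.keys())
    adapted = dict(lora_params)

    # Phase 1: first N - T inner steps, first-order only (graph discarded).
    for _ in range(n_inner - n_tracked):
        loss = clm_loss(model, adapted, context_ids)          # hypothetical helper
        grads = torch.autograd.grad(loss, [adapted[n] for n in names])
        adapted = {n: (adapted[n] - inner_lr * g).detach().requires_grad_(True)
                   for n, g in zip(names, grads)}

    # Re-attach the truncated prefix: the first N - T updates contribute an
    # identity Jacobian, so gradients can still flow into `lora_params`.
    adapted = {n: lora_params[n] + (adapted[n] - lora_params[n]).detach()
               for n in names}

    # Phase 2: last T steps keep the full graph (higher-order derivatives).
    for _ in range(n_tracked):
        loss = clm_loss(model, adapted, context_ids)
        grads = torch.autograd.grad(loss, [adapted[n] for n in names],
                                    create_graph=True)
        adapted = {n: adapted[n] - inner_lr * g for n, g in zip(names, grads)}

    # Outer loop: reasoning loss computed with the adapted LoRA memory.
    outer_loss = reasoning_loss(model, adapted, query_ids, answer_ids)  # hypothetical
    outer_loss.backward()   # populates .grad on the LoRA meta parameters
    return outer_loss.detach()
```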
PERK demonstrates strong performance on long-context reasoning tasks, significantly outperforming standard prompt-based baselines. Our evaluations show that PERK achieves substantial performance gains across different model sizes and reasoning complexities. We empirically evaluate PERK's long-context reasoning capabilities in two challenging scenarios: reasoning over (1) Needles-in-a-Haystack (NIAH, with BabiLong) and (2) Drops-in-the-Ocean (DIO), a novel long-context evaluation setting we propose.
We first evaluate PERK on reasoning over Needles-in-a-Haystack using the BabiLong framework, which assesses models' ability to reason over relevant facts scattered among distractors in very long documents. We mainly focus on tasks involving single-hop (QA1), two-hop (QA2), and three-hop (QA3) reasoning.
PERK outperforms all baselines in all cases we test (across all models), including when compared to larger models. PERK (GPT-2) outperforms even FT-ICR (Qwen) and FT-ICR (Mamba-1.4B) for all task complexities and context lengths, with an average 20% absolute performance gain. PERK (Qwen) achieves the highest accuracy across all task complexities (QA1, QA2, QA3) and context lengths (1K to 8K). PERK is also more robust to increasing context length and task complexity than the baselines.
To address the fundamental limitation of Needle-in-a-Haystack datasets, where the target information often exhibits stylistic differences from the surrounding irrelevant text, forming an artificially simple test for identifying relevant information, we propose Drops-in-the-Ocean (DIO), a new evaluation setting that forms long contexts from structurally similar documents. We construct a synthetic dataset, Student Records, where each context simulates a database containing multiple student records. Each record includes several attributes: ID, name, school, major, and grade. Context length scales directly with the number of student records included. We define evaluation tasks of varying complexity: (1) Recall (retrieving attributes for a specific ID), (2) Relation (comparing attributes between two IDs), and (3) Aggregate (computing the maximum, minimum, and average grade across all students).
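To make the construction concrete, here is a minimal sketch of how such a Student Records context and its three query types could be generated; the attribute values, record template, and query wording are our own illustrative choices, not necessarily those used to build the actual dataset.

```python
import random

def make_records(num_students, seed=0):
    """Generate synthetic student records; context length scales with num_students."""
    rng = random.Random(seed)
    records = []
    for sid in range(1000, 1000 + num_students):
        records.append({
            "id": sid,
            "name": f"Student-{sid}",  # illustrative placeholder values
            "school": rng.choice(["North", "South", "East", "West"]),
            "major": rng.choice(["CS", "Math", "Physics", "History"]),
            "grade": round(rng.uniform(60, 100), 1),
        })
    return records

def render_context(records):
    """Flatten the records into one long, structurally uniform context."""
    return "\n".join(
        f"ID {r['id']}: name={r['name']}, school={r['school']}, "
        f"major={r['major']}, grade={r['grade']}"
        for r in records)

def make_queries(records, seed=0):
    """One query per task type: Recall, Relation, and Aggregate."""
    rng = random.Random(seed)
    a, b = rng.sample(records, 2)
    grades = [r["grade"] for r in records]
    return [
        # Recall: retrieve an attribute of a specific ID.
        (f"What is the major of student {a['id']}?", a["major"]),
        # Relation: compare attributes between two IDs.
        (f"Do students {a['id']} and {b['id']} attend the same school?",
         "yes" if a["school"] == b["school"] else "no"),
        # Aggregate: compute a statistic over all students.
        ("What is the average grade across all students?",
         round(sum(grades) / len(grades), 1)),
    ]
```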
PERK models again outperform all baselines. PERK (Qwen) and PERK (GPT-2) consistently achieve high accuracy (often $\ge$85%) and are more robust to increasing context length, even on the more challenging Aggregate task at the 8K context length, which requires aggregating information across all 8K tokens.
We evaluate PERK's ability to generalize to input lengths different from those seen in training, testing both interpolation (test lengths shorter than the training length) and extrapolation (test lengths longer than the training length). Using the BabiLong tasks, we train a Qwen-2.5-0.5B model with PERK (as well as an FT-ICR baseline) on fixed-length contexts ranging from 1K to 8K tokens, then evaluate each model on contexts ranging from 1K to 32K tokens to measure generalization to unseen lengths.
For extrapolation, FT-ICR's performance drops drastically as the inference context length exceeds the training context length, especially for models trained on shorter contexts. While PERK's accuracy also decreases (from a higher starting point), the degradation is dampened at longer test lengths. For QA1, PERK trained on 1K sequences shows only a 42% drop on 32K sequences, while FT-ICR exhibits a 52% decline. When trained on 8K sequences, PERK drops only 5% on 32K sequences, while FT-ICR still exhibits a 32% drop. Similar dynamics are observed on the more challenging QA2 task.
To stress-test the limits of PERK's length extrapolation, we further test on sequences that exceed the 32K-token context window of the Qwen-2.5 model, with lengths increasing from 64K to 128K tokens. PERK consistently shows stronger extrapolation performance than the FT-ICR baseline. Although at 128K tokens PERK's accuracy drops noticeably, to 61.4% for QA1 and 44.4% for QA2, these scores remain substantially better than the 0% absolute performance of the FT-ICR baselines.

For interpolation, FT-ICR continues to lose performance on shorter contexts when finetuned on longer ones, matching observations from Gao et al. (2025). In contrast, PERK maintains its performance and even improves on the QA2 task.
Prior work has shown that the position of relevant information in long contexts affects a language model's ability to utilize that information. We test PERK's robustness to positional biases by evaluating its test-time performance on contexts where the position of relevant information is distributed differently from the training distribution.
When trained on contexts with relevant documents located at random positions (Rnd) throughout the context, PERK outperforms FT-ICR. FT-ICR shows large performance drops when the relevant document appears at different locations (Pre, Mid, and Post) at test time, whereas shifting the position of relevant information at test time has minimal effect (within 1-2%) on PERK.
To further stress-test positional generalization, we also fix relevant documents to particular positions during training and test with positions across the full context. We find that FT-ICR easily overfits to the training position pattern, completely failing to generalize to test-time position changes (performance drops to close to 0% when the position shifts at test time).
We first show PERK's training scalability compared to the prior test-time learning method RECKONING by measuring each method's peak training memory across different context lengths. We then evaluate PERK's inference efficiency on long contexts by measuring its inference memory cost and runtime, compared to in-context reasoning with finetuned models.
As the training context length grows from 1K to 8K tokens, RECKONING quickly runs into Out-Of-Memory (OOM) errors at 2K, while PERK continues to fit within the available GPU memory even as the number of truncation steps increases. These results validate PERK's stronger training scalability on long contexts.
PERK provides more efficient memory and runtime scaling for extremely long contexts compared to FT-ICR.
While FT-ICR is initially more efficient, its memory and runtime grow rapidly, leading to OOM errors at a context length of 128K.
In contrast, PERK can manage the long sequences through gradient accumulation, which, while increasing runtime, reduces the memory footprint.
Ultimately, at 128K tokens, where FT-ICR fails, PERK with 16 steps successfully processes the context using 35.2GB in 20.9s,
showing that PERK provides a practical path to handle extreme context lengths efficiently in both memory and runtime when compared to standard approaches.
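As an illustration of this trade-off, the sketch below (under the same assumptions as the earlier inner-loop sketch, with a hypothetical micro-batching scheme of our own) accumulates the CLM gradient over small micro-batches of context chunks before applying a single LoRA update, reducing peak memory at the cost of extra forward/backward passes.

```python
def encode_with_grad_accumulation(model, lora_params, context_ids,
                                  chunk_len=1024, micro_bsz=4, lr=1e-3,
                                  pad_id=0):
    """Test-time context encoding with gradient accumulation over chunk batches.

    Instead of forwarding every sub-sequence of a very long context at once,
    the CLM gradient is accumulated over small micro-batches before one LoRA
    update, trading runtime for a much smaller peak-memory footprint.
    """
    chunks = [c.squeeze(0) for c in context_ids.split(chunk_len, dim=-1)]
    adapted = {n: p.detach().clone().requires_grad_(True)
               for n, p in lora_params.items()}

    accum = [torch.zeros_like(p) for p in adapted.values()]
    n_micro = 0
    for i in range(0, len(chunks), micro_bsz):
        batch = torch.nn.utils.rnn.pad_sequence(
            chunks[i:i + micro_bsz], batch_first=True, padding_value=pad_id)
        logits = functional_call(model, adapted, (batch,)).logits
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch[:, 1:].reshape(-1), ignore_index=pad_id)
        grads = torch.autograd.grad(loss, list(adapted.values()))
        accum = [a + g for a, g in zip(accum, grads)]
        n_micro += 1

    # A single LoRA update using the gradient averaged over all micro-batches.
    with torch.no_grad():
        adapted = {name: p - lr * (a / n_micro)
                   for (name, p), a in zip(adapted.items(), accum)}
    return adapted
```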
@article{chen2025perklongcontextreasoningparameterefficient,
title={PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning},
author={Zeming Chen and Angelika Romanou and Gail Weiss and Antoine Bosselut},
year={2025},
eprint={2507.06415},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.06415},
}