PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning

PERK inference figure.

A PERK model performs reasoning by first encoding a given long context into its parameters through a few gradient steps, and then using the updated parameters to generate a response to a given query. The long context is formulated as a batch of segment-level sequences that are encoded in parallel by adapting the model's parameters through backpropagation on a causal language modeling objective.
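
As a rough illustration of this inference procedure, the sketch below adapts a LoRA module on context segments with a few causal language modeling gradient steps and then generates an answer with the adapted model. It is a minimal sketch assuming the Hugging Face transformers and peft libraries; the checkpoint name, target modules, segment length, step count, and learning rate are illustrative choices, not the settings used in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-0.5B"  # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
base = AutoModelForCausalLM.from_pretrained(model_name)
model = get_peft_model(base, LoraConfig(r=8, lora_alpha=16,
                                        target_modules=["q_proj", "v_proj"],
                                        task_type="CAUSAL_LM"))

def encode_context(context, seg_len=512, steps=3, lr=1e-4):
    """Encode a long context into the LoRA parameters with a few CLM gradient steps."""
    ids = tokenizer(context, return_tensors="pt").input_ids[0]
    segments = [ids[i:i + seg_len] for i in range(0, len(ids), seg_len)]
    pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
    # Padding of the last segment is handled naively here for simplicity.
    batch = torch.nn.utils.rnn.pad_sequence(segments, batch_first=True, padding_value=pad_id)
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=lr)
    model.train()
    for _ in range(steps):  # a few gradient steps on the parallel batch of segments
        loss = model(input_ids=batch, labels=batch).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

def answer(query, max_new_tokens=32):
    """Generate a response to the query using the context-adapted parameters."""
    model.eval()
    inputs = tokenizer(query, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)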

Summary

Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings.

In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context.

Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.

PERK method overview

To scale the training of larger and more capable test-time learning models, we propose PERK, which uses parameter-efficient test-time adaptation to shrink the computation graph that must be unrolled and truncation to limit how much of it is unrolled during optimization. Specifically, in the inner and outer loops, we only optimize a parameter-efficient adapter (LoRA) to encode the contexts, and we apply truncated gradient unrolling to reduce the memory cost of backpropagating through the inner loop.

Test-Time Learning and Inner Loop Training

We set the test-time learning objective (and thus the inner-loop objective) to causal language modeling (CLM) of the context, and restrict the adaptation algorithm $Alg$ to updating only the LoRA parameters. The adaptation algorithm computes gradients on a batch of sub-sequences drawn from the full context sequence. Formulating the context as a parallel batch (or multiple batches) of segments allows us to process lengthy sequences beyond the model's context window.
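
For concreteness, the snippet below shows one way to reshape a long token sequence into a parallel batch of fixed-length segments that the inner loop can encode in a single forward/backward pass; the segment length and padding id are illustrative assumptions, not the paper's settings.

import torch

def segment_context(token_ids, seg_len=512, pad_id=0):
    """Reshape a 1-D token sequence of arbitrary length into a (num_segments, seg_len) batch."""
    n = token_ids.numel()
    num_segments = (n + seg_len - 1) // seg_len
    padded = torch.full((num_segments * seg_len,), pad_id, dtype=token_ids.dtype)
    padded[:n] = token_ids
    return padded.view(num_segments, seg_len)

# Example: an 8K-token context becomes 16 segments of 512 tokens each, which the
# inner loop encodes into the LoRA parameters via the CLM loss on the whole batch.
batch = segment_context(torch.randint(0, 50_000, (8_192,)))
print(batch.shape)  # torch.Size([16, 512])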

PERK train figure.

Meta-learning PERK for long-context reasoning: The training procedure involves a nested inner and outer loop. The inner loop optimizes the likelihood of a batch of long context segments with respect to the parameters of the LoRA-based memory scratchpad. In the outer loop, the model uses the encoded information in the memory scratchpad to answer questions. In both cases, only the memory scratchpad parameters are updated while the base LLM parameters are frozen.

Outer Loop Training

In the outer loop, we optimize the meta parameters of the LoRA module over a distribution of reasoning problems. The optimal meta parameters minimize the expected reasoning loss obtained after adapting the adapter to each problem, where the adaptation is the parameter-efficient CLM adaptation of the LoRA meta parameters to that problem's context.
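
One plausible formalization of this objective (assuming SGD inner updates; the notation is ours, not necessarily the paper's):

\[
  \phi^{*} = \arg\min_{\phi}\;
  \mathbb{E}_{(c,\,q,\,a)\sim\mathcal{D}}
  \Big[\mathcal{L}_{\text{reason}}\big(f_{\theta,\,Alg(\phi,\,c)}(q),\, a\big)\Big],
  \qquad
  Alg(\phi, c) = \phi_N, \quad
  \phi_t = \phi_{t-1} - \alpha\,\nabla_{\phi}\mathcal{L}_{\text{CLM}}(\phi_{t-1};\, c), \quad
  \phi_0 = \phi,
\]

where $\phi$ are the LoRA meta parameters, $\theta$ the frozen base model parameters, $\mathcal{D}$ the distribution of reasoning problems with context $c$, query $q$, and answer $a$, and $\alpha$ the inner-loop learning rate.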

Truncated Gradient Unrolling (TGU)

To optimize the outer-loop objective with gradient-based methods, we need to differentiate through the inner-loop adaptation algorithm $Alg$, which involves higher-order derivatives and requires storing the complete optimization trajectory to compute the meta-gradient. The resulting memory cost limits the method's applicability to long contexts. To reduce it, we truncate backpropagation for the meta-gradient computation: we run the inner-loop optimization for all $N$ specified update steps, but only store the computation graph for the last $T \leq N$ steps.
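
A minimal, self-contained sketch of this idea is shown below, using a toy quadratic objective in place of the real CLM and reasoning losses. The explicit gradient hand-off at the truncation point is one way to realize truncated unrolling, not necessarily the paper's exact implementation, and all names and hyperparameters are illustrative.

import torch

def inner_loss(adapter, context):
    """Stand-in for the CLM loss of the adapter on a batch of context segments."""
    return ((context @ adapter) ** 2).mean()

def outer_loss(adapter, query, answer):
    """Stand-in for the reasoning loss on (query, answer) under the adapted parameters."""
    return ((query @ adapter - answer) ** 2).mean()

def tgu_meta_gradient(meta_adapter, context, query, answer,
                      n_steps=8, truncate=2, inner_lr=0.1):
    # Steps 1 .. N-T: ordinary inner updates; detaching drops their computation
    # graph, which is exactly the memory saving of truncation.
    adapter = meta_adapter.detach().clone()
    for _ in range(n_steps - truncate):
        adapter.requires_grad_(True)
        grad, = torch.autograd.grad(inner_loss(adapter, context), adapter)
        adapter = (adapter - inner_lr * grad).detach()
    # Last T steps: differentiable updates (create_graph=True), so the outer
    # gradient can flow back through them.
    leaf = adapter.requires_grad_(True)  # truncation point
    adapter = leaf
    for _ in range(truncate):
        grad, = torch.autograd.grad(inner_loss(adapter, context), adapter, create_graph=True)
        adapter = adapter - inner_lr * grad
    # Outer (reasoning) loss on the adapted parameters; its gradient reaches the
    # truncation point through the last T unrolled steps only, and is then used
    # as the meta-gradient for the LoRA meta parameters.
    meta_grad, = torch.autograd.grad(outer_loss(adapter, query, answer), leaf)
    return meta_grad

torch.manual_seed(0)
D, K = 16, 4
meta = torch.randn(D, K, requires_grad=True)
ctx, qry, ans = torch.randn(32, D), torch.randn(8, D), torch.randn(8, K)
g = tgu_meta_gradient(meta, ctx, qry, ans)
with torch.no_grad():
    meta -= 1e-2 * g  # one outer (meta) update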

Long-Context Reasoning Evaluation

PERK demonstrates strong performance on long-context reasoning tasks, significantly outperforming standard prompt-based baselines. Our evaluations show that PERK achieves substantial performance gains across different model sizes and reasoning complexities. We empirically evaluate PERK's long-context reasoning capabilities in two challenging scenarios: reasoning over (1) Needles-in-a-Haystack (NIAH with BabiLong) and (2) Drops-in-an-Ocean (DIO), a novel long context evaluation setting we propose.

Long context results

Performance on NIAH with BabiLong and DIO with Student Records. All models are trained and tested on sequences ranging from 1K to 8K tokens. All PERK models are trained to first generate the supporting facts relevant to a query, followed by the final answer prediction. In contrast, baseline models directly generate the answer, as this approach yields better performance for them in this setting. The number of trainable parameters for each method is indicated in the legend.

Reasoning Over Needles-in-a-Haystack

We first evaluate PERK on reasoning over Needles-in-a-Haystack using the BabiLong framework for assessing models' reasoning abilities across relevant and distractor facts scattered in very long documents. We mainly focus on tasks involving single-hop (QA1), two-hop (QA2), and three-hop (QA3) reasoning.

PERK outperforms all baselines in all cases we test (across all models), including when compared to larger models. PERK (GPT-2) outperforms even FT-ICR (Qwen) and FT-ICR (Mamba-1.4B) for all task complexities and context lengths, with an average 20% absolute performance gain. PERK (Qwen) achieves the highest accuracy across all task complexities (QA1, QA2, QA3) and context lengths (1K to 8K). PERK is also more robust to increasing context length and task complexity than the baselines.

Reasoning Over Drops-in-an-Ocean

Needle-in-a-Haystack datasets have a fundamental limitation: the target information often exhibits stylistic differences from the surrounding irrelevant text, making it artificially easy to identify the relevant information. To address this, we propose Drops-in-an-Ocean (DIO), a new evaluation setting that forms long contexts from structurally similar documents.

We construct a synthetic dataset, Student Records, where each context simulates a database containing multiple student records. Each record includes several attributes: ID, name, school, major, and grade. Context length scales directly with the number of student records included. We define evaluation tasks of varying complexity: (1) Recall (retrieving attributes for a specific ID), (2) Relation (comparing attributes between two IDs), and (3) Aggregate (calculating the maximum, minimum, and average grade across all students).
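
A rough sketch of how such contexts could be constructed is shown below; field names, value ranges, and phrasing are illustrative, and the actual dataset construction may differ.

import random

MAJORS = ["Physics", "Biology", "History", "Computer Science", "Economics"]
SCHOOLS = ["North High", "South High", "East High", "West High"]
NAMES = ["Alex", "Jordan", "Sam", "Taylor", "Morgan", "Riley"]

def make_records(num_students, seed=0):
    rng = random.Random(seed)
    ids = rng.sample(range(10_000, 100_000), num_students)  # unique student IDs
    return [
        {"id": f"S{sid}", "name": rng.choice(NAMES), "school": rng.choice(SCHOOLS),
         "major": rng.choice(MAJORS), "grade": round(rng.uniform(2.0, 4.0), 2)}
        for sid in ids
    ]

def render_context(records):
    """Flatten records into one long context; length scales with the number of records."""
    return "\n".join(
        f"ID: {r['id']} | Name: {r['name']} | School: {r['school']} | "
        f"Major: {r['major']} | Grade: {r['grade']}"
        for r in records
    )

records = make_records(200)
context = render_context(records)
# Example tasks over this context:
recall_q = f"What is the major of student {records[0]['id']}?"                        # Recall
relation_q = f"Do {records[0]['id']} and {records[1]['id']} attend the same school?"  # Relation
aggregate_q = "What is the average grade across all students?"                        # Aggregate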

PERK models again outperform all baselines. PERK (Qwen) and PERK (GPT-2) consistently achieve high accuracy (often $\ge$85%) and are more robust to increasing context length, even on the more challenging aggregation task at the 8K length, which requires the model to aggregate information across all 8K tokens.

Robustness Analysis

Test-Time Length Generalization

We evaluate PERK's ability to generalize to inputs of different lengths than those seen in training. We test for both interpolation (test lengths shorter than the training length) and extrapolation (test lengths longer than the training length).

We evaluate using the BabiLong tasks and train a Qwen-2.5-0.5B model using PERK on fixed-length contexts ranging from 1K to 8K tokens (as well as an FT-ICR baseline). We evaluate each model on contexts ranging from 1K to 32K tokens, measuring each approach's ability to generalize to new lengths (longer or shorter than the training length).

test-time length extrapolation

Test-time context length generalization on BabiLong QA1 (a) and QA2 (b) comparison between PERK and FT-ICR on the Qwen-2.5-0.5B model. The y-axis represents the training context lengths, while the x-axis indicates various test-time context lengths. We test for both interpolation (test lengths shorter than the training length) and extrapolation (test lengths longer than the training length). Bordered cells denote the boundary: evaluation on context lengths equal to those in training. PERK shows stronger generalization across both settings.


length extrapolation beyond 32K

Test-time length extrapolation beyond 32K on BabiLong QA1 (a) and QA2 (b). Both PERK and FT-ICR are trained on 8K-token sequences. The context length for inference grows from 64K to 128K. PERK extrapolates substantially better than FT-ICR.


PERK generalizes to new context lengths at test time.

For extrapolation, FT-ICR's performance drops drastically as the inference context length exceeds the training context length, especially for models trained on shorter contexts. While PERK's accuracy also decreases (from a higher starting point), the degradation is dampened at longer test lengths. For QA1, PERK trained on 1K sequences shows only a 42% drop on 32K sequences while FT-ICR exhibits a 52% decline. When trained on 8K sequences, PERK drops only 5% on 32K sequences, while FT-ICR still exhibits a 32% drop. Similar dynamics are observed on the more challenging QA2 task.

To stress-test the limit of PERK's length extrapolation ability, we further test on sequences longer than the 32K-token context window of the Qwen-2.5 model. Specifically, we test on contexts ranging from 64K to 128K tokens. PERK consistently shows stronger extrapolation performance than the FT-ICR baseline. Although PERK's accuracy drops noticeably at 128K tokens, to 61.4% for QA1 and 44.4% for QA2, these scores are still substantially better than the FT-ICR baselines, which fall to 0%.

For interpolation, FT-ICR loses performance on shorter contexts when finetuned on longer ones, matching observations from Gao et al. (2025). In contrast, PERK maintains its performance and even improves on the QA2 task.


Robustness to Positional Biases

Prior work has shown that the position of relevant information in long contexts affects a language model's ability to use that information. We test PERK's robustness to positional biases by evaluating its test-time performance on contexts where the position of relevant information is distributed differently from that seen in training.

positional bias

Positional Bias Comparison of PERK and FT‑ICR on 4K and 8K contexts, on Qwen‑2.5‑0.5B. We train on problems where the relevant information appears in the beginning (Pre), middle (Mid), or end (Post) of the context, and evaluate on all three positional settings. We also train models on contexts where the relevant information is randomly located (Rnd), testing these on all four positional distributions (Pre, Post, Mid, Rnd). Bordered cells show in-distribution performances. PERK demonstrates strong positional robustness.


PERK generalizes robustly regardless of information position in the context.

When trained on contexts where the relevant documents are randomly located (Rnd) throughout the context, PERK outperforms FT-ICR. FT-ICR shows large performance drops when the relevant document appears at different locations (Pre, Mid, and Post) at test time, whereas shifting the position of relevant information at test time has only a minimal effect (within 1-2%) on PERK.

To further stress-test positional generalization, we also train on contexts where relevant documents are forced into particular positions and then test with the relevant information placed across the full context. We find that FT-ICR easily overfits to the position pattern seen in training and completely fails to generalize to position changes at test time (performance drops to close to 0% when the position shifts).

Efficiency Analysis

We first demonstrate PERK's training scalability compared to RECKONING, a prior test-time learning method, by measuring each method's peak training memory across different context lengths. We then evaluate PERK's inference efficiency on long contexts by measuring its memory cost and runtime, compared to in-context reasoning with finetuned models (FT-ICR).

training and inference efficiency

(a): Peak GPU memory usage during training with context lengths ranging from 1K to 8K tokens for RECKONING and PERK. While PERK successfully scales to 8K tokens, RECKONING encounters out-of-memory (OOM) errors at shorter context lengths. (b): Memory footprint and wall‑clock runtime during inference as context length increases (up to 128K tokens), comparing PERK with FT-ICR. Curves that terminate before 128K indicate that the method failed with an OOM error at longer context lengths, preventing further measurement. PERK demonstrates more efficient scaling in both memory and runtime, particularly for extremely long sequences.


PERK scales more efficiently in memory and runtime on extremely long contexts

As the training context length grows from 1K to 8K tokens, RECKONING quickly runs into out-of-memory (OOM) errors at 2K, while PERK, even with an increasing number of truncated unrolling steps, continues to fit within the available GPU memory, validating PERK's stronger scalability to long contexts during training.

PERK provides more efficient memory and runtime scaling for extremely long contexts compared to FT-ICR. While FT-ICR is initially more efficient, its memory and runtime grow rapidly, leading to OOM errors at a context length of 128K. In contrast, PERK handles long sequences through gradient accumulation, which reduces the memory footprint at the cost of additional runtime. At 128K tokens, where FT-ICR fails, PERK with 16 steps successfully processes the context using 35.2GB in 20.9s, showing that PERK offers a practical path to handling extreme context lengths efficiently in both memory and runtime compared to standard approaches.
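
A minimal sketch of the gradient-accumulation idea at inference time is given below, with a toy loss standing in for the CLM loss of the base model plus adapter; the chunk size is an illustrative knob that trades runtime for peak memory, in the spirit of the step counts reported above.

import torch

def clm_loss(adapter, segment_batch):
    """Stand-in for the causal LM loss of the adapter on a chunk of context segments."""
    return ((segment_batch @ adapter) ** 2).mean()

def encode_with_accumulation(adapter, segments, chunk_size=4, steps=3, lr=1e-2):
    opt = torch.optim.SGD([adapter], lr=lr)
    num_chunks = (segments.shape[0] + chunk_size - 1) // chunk_size
    for _ in range(steps):
        opt.zero_grad()
        # Only one chunk's activations are held in memory at a time; gradients
        # are accumulated across chunks before a single parameter update.
        for chunk in segments.split(chunk_size):
            (clm_loss(adapter, chunk) / num_chunks).backward()
        opt.step()  # one update per full pass over all segments
    return adapter

adapter = torch.zeros(64, 8, requires_grad=True)
segments = torch.randn(32, 64)  # e.g., 32 segments of a long context
encode_with_accumulation(adapter, segments)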


BibTeX

@article{chen2025perklongcontextreasoningparameterefficient,
      title={PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning},
      author={Zeming Chen and Angelika Romanou and Gail Weiss and Antoine Bosselut},
      year={2025},
      eprint={2507.06415},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.06415},
}