Recoverable but Not Stationary: Local Linear Structures in Weights and Activations

New research indicates that transformer-based large language models contain local, low-rank task-gradient structures in their weights and activations. The finding supports the idea that learned behaviors can be shifted by linear manipulation, as explored through task vectors, Low-Rank Adaptation (LoRA), activation steering, and localized random search around pre-trained weights. These recoverable linear pathways were identified in experiments on a synthetic multitask transformer and on LoRA adapters for DistilGPT-2 and GPT-2¹. The same study rejected the hypothesis of a globally fixed, or stationary, task plane: the structure is recoverable near a given set of weights but does not stay the same across weight space. That such local linear structures can be found and manipulated suggests a significant shift in the potential threat landscape for AI. Security concerns around transformer-based systems may be moving beyond conventional criminal activity toward more complex geopolitical, state-aligned operations, which would call for a fundamental change in defensive posture.
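To make the distinction between "recoverable" and "stationary" concrete, here is a minimal NumPy sketch. It is a toy construction and not the study's method or data: the loss family, the dimensions, and the helper names (`task_gradients`, `top_subspace`, `subspace_overlap`) are all assumptions chosen for illustration. Twenty hypothetical task gradients are stacked at one weight vector, their dominant subspace is recovered with an SVD, and that subspace is compared with the one recovered at a nearby and at a distant weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_tasks = 16, 2, 20  # weight dimension, true local rank, number of toy tasks

# Hypothetical tasks: every task direction lies in one shared k-dimensional span.
basis = rng.standard_normal((k, d))
coeffs = rng.standard_normal((n_tasks, k))
task_dirs = coeffs @ basis  # shape (n_tasks, d)

def task_gradients(w):
    """Gradients of the toy losses L_i(w) = 0.5 * (a_i . tanh(w) - 1)^2, one row per task."""
    h = np.tanh(w)
    residual = task_dirs @ h - 1.0
    # Chain rule: the factor (1 - h^2) rescales each coordinate and depends on w,
    # so the span of the gradients changes as w moves.
    return residual[:, None] * task_dirs * (1.0 - h**2)

def top_subspace(grads, rank):
    """Singular values and an orthonormal basis (rows) for the top-`rank` gradient subspace."""
    _, s, vt = np.linalg.svd(grads, full_matrices=False)
    return s, vt[:rank]

def subspace_overlap(a, b):
    """Mean squared cosine of the principal angles between two row-orthonormal bases."""
    cosines = np.linalg.svd(a @ b.T, compute_uv=False)
    return float(np.mean(cosines**2))

w0 = rng.standard_normal(d)
w_near = w0 + 1e-3 * rng.standard_normal(d)  # small local perturbation
w_far = w0 + 2.0 * rng.standard_normal(d)    # distant point in weight space

s0, u0 = top_subspace(task_gradients(w0), k)
_, u_near = top_subspace(task_gradients(w_near), k)
_, u_far = top_subspace(task_gradients(w_far), k)

# Fraction of gradient energy captured by the top-k directions at w0.
energy = float(np.sum(s0[:k] ** 2) / np.sum(s0**2))
overlap_near = subspace_overlap(u0, u_near)
overlap_far = subspace_overlap(u0, u_far)

print(f"energy in top-{k} directions at w0: {energy:.6f}")
print(f"overlap with nearby subspace:  {overlap_near:.6f}")
print(f"overlap with distant subspace: {overlap_far:.6f}")
```

In this toy setting the gradients at any single point span only `k` directions, so the local structure is recoverable from an SVD, while the overlap with the subspace at a distant point is lower than the overlap with a nearby one. That is the sense in which a structure can be locally low-rank without forming a single global task plane.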

References