Hyena Model (deep learning)

The Hyena^[1] model is a neural network architecture that was developed to address the scalability issues associated with traditional self‐attention^[2] mechanisms. It is designed to efficiently handle very long sequences by replacing the quadratic-complexity self‐attention with a sub-quadratic operator that interleaves implicit long convolutions with data-controlled gating.

Architecture

At the core of the Hyena model is the concept of implicit long convolutions. Traditional convolutions use fixed kernels that are explicitly defined and stored, resulting in a parameter count that scales linearly with the kernel size. In contrast, Hyena generates convolutional filters implicitly using a parameterized function—typically implemented as a small feed-forward network. This allows the model to synthesize long filters on the fly, effectively decoupling the filter length from the number of parameters.
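The decoupling of filter length from parameter count can be illustrated with a minimal sketch. The network below is a hypothetical stand-in, not the filter network from the Hyena paper: the hidden width, the sine activation, the normalized position input, and the function names are all assumptions made for illustration.

```python
import math
import random


def make_filter_network(hidden=8, seed=0):
    """Build weights for a tiny 1 -> hidden -> 1 network (sizes are illustrative)."""
    rng = random.Random(seed)
    w1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]           # input-to-hidden weights
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]           # hidden biases
    w2 = [rng.gauss(0.0, 1.0) / hidden for _ in range(hidden)]  # hidden-to-output weights
    return w1, b1, w2


def count_parameters(network):
    """Total number of scalar weights in the network."""
    return sum(len(p) for p in network)


def synthesize_filter(network, length):
    """Evaluate the network at `length` normalized positions, one tap per position."""
    w1, b1, w2 = network
    taps = []
    for t in range(length):
        pos = t / length
        hidden = [math.sin(w * pos + b) for w, b in zip(w1, b1)]
        taps.append(sum(w * h for w, h in zip(w2, hidden)))
    return taps


network = make_filter_network()
for length in (16, 4096):
    taps = synthesize_filter(network, length)
    # An explicit kernel would store `length` values; the network size never changes.
    print(f"length {length}: explicit = {length} params, "
          f"implicit = {count_parameters(network)} params")
```

The same fixed set of weights produces both the 16-tap and the 4096-tap filter, so the benefit of the implicit form appears only once the filter is longer than the network is large.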

In addition to implicit convolutions, the Hyena operator incorporates data-controlled multiplicative gating. In this mechanism, each token is modulated by gating signals that are derived from learned linear projections of the input. The gating operation is performed element-wise and serves to dynamically adjust the influence of the convolutional output, effectively tailoring the operator to the specific input context.
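A minimal sketch of this gating step follows. The helper names `project` and `gate`, the random weights, and the all-ones stand-in for the convolution output are assumptions chosen for illustration; in a trained model the projection weights would be learned.

```python
import random


def project(tokens, weight):
    """Apply the same linear map (d_in x d_out matrix) to every token vector."""
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok)))
         for j in range(len(weight[0]))]
        for tok in tokens
    ]


def gate(gating_signal, conv_output):
    """Element-wise product of the gating projection and the convolution output."""
    return [
        [g * c for g, c in zip(g_row, c_row)]
        for g_row, c_row in zip(gating_signal, conv_output)
    ]


rng = random.Random(0)
seq_len, width = 4, 3
# Input tokens and a hypothetical projection matrix (random here, learned in practice).
u = [[rng.uniform(-1, 1) for _ in range(width)] for _ in range(seq_len)]
w_x = [[rng.uniform(-1, 1) for _ in range(width)] for _ in range(width)]

x = project(u, w_x)                                    # gating signal derived from the input
conv_output = [[1.0] * width for _ in range(seq_len)]  # stand-in for a convolution output
y = gate(x, conv_output)                               # the input decides how much passes through
```

Because the gate `x` is computed from the input itself, two different inputs scale the same convolution output differently, which is what "data-controlled" refers to.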

The overall Hyena operator is defined as a recurrence that alternates between implicit long convolutions and element-wise gating. For an order-N Hyena operator, the recurrence is expressed as follows:

$z_{1} [t] = v [t]$ , where $v$ is one of the linear projections of the input.
For $n = 1, \dots, N$ :
- $z_{n + 1} [t] = x_{n} [t] \cdot ((h_{n} * z_{n}) [t])$ , where $x_{n}$ represents a gating projection and $h_{n}$ is an implicitly parameterized long convolution filter.
The final output is given by $y [t] = z_{N + 1} [t]$ .

In these expressions:

- $z_{n} [t]$ is the intermediate state at recurrence step $n$ and time position $t$ .
- $v [t]$ is a linear projection of the input at time position $t$ , analogous to the "value" in self-attention.
- $x_{n} [t]$ is the gating projection at recurrence step $n$ .
- $h_{n}$ is the implicit long convolution filter for step $n$ .
- The operator $*$ denotes convolution, so $(h_{n} * z_{n}) [t]$ is the result of convolving filter $h_{n}$ with the signal $z_{n}$ , evaluated at time $t$ .
- The dot " $\cdot$ " indicates element-wise multiplication.
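The recurrence can be sketched for a single channel as follows. This is an illustrative sketch rather than a reference implementation: it assumes a causal convolution (each output depends only on current and earlier positions), uses a direct quadratic-time sum for readability, and the names `causal_conv` and `hyena_operator` are hypothetical.

```python
def causal_conv(h, z):
    """(h * z)[t] = sum over s = 0..t of h[s] * z[t - s], same length as z."""
    return [
        sum(h[s] * z[t - s] for s in range(min(t, len(h) - 1) + 1))
        for t in range(len(z))
    ]


def hyena_operator(v, gates, filters):
    """Order-N recurrence: z_1 = v, then z_{n+1} = x_n . (h_n * z_n), output z_{N+1}."""
    z = list(v)
    for x_n, h_n in zip(gates, filters):
        conv = causal_conv(h_n, z)               # long convolution step
        z = [x * c for x, c in zip(x_n, conv)]   # element-wise gating step
    return z


v = [1.0, 2.0, 3.0]
filters = [[1.0, 1.0, 1.0],   # h_1: running sum of the signal
           [0.0, 1.0, 0.0]]   # h_2: shift by one position
gates = [[2.0, 0.0, 1.0],     # x_1
         [1.0, 1.0, 1.0]]     # x_2
y = hyena_operator(v, gates, filters)
```

Here the number of `(gate, filter)` pairs sets the order $N$ of the operator; the example uses $N = 2$ .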

Mathematical Formulation

The implicit convolution filters in Hyena are typically parameterized as functions of time. For each filter $h_{n}$ , the response at time $t$ is given by:

$h_{n} [t] = \mathrm{Window} (t) \cdot (\mathrm{FFN} \circ \mathrm{PositionalEncoding}) (t)$

where $\circ$ denotes function composition: the positional encoding is applied to $t$ first, and the result is then passed through the FFN.

Here, the window function serves to modulate the filter (for example, by imposing an exponential decay), and the feed-forward network (FFN) together with positional encodings generate the filter values. This implicit parameterization is a key design choice that allows Hyena to capture long-range dependencies without a proportional increase in parameter count.
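The formula above can be sketched directly in code. Every concrete choice here is an assumption for illustration: the sinusoidal positional features, the sine activation in the FFN, the exponential decay rate, the layer sizes, and all function names.

```python
import math
import random


def positional_encoding(t, length, bands=4):
    """Map position t to [pos, sin(2*pi*k*pos), cos(2*pi*k*pos) for k = 1..bands]."""
    pos = t / length
    feats = [pos]
    for k in range(1, bands + 1):
        feats.append(math.sin(2 * math.pi * k * pos))
        feats.append(math.cos(2 * math.pi * k * pos))
    return feats


def make_ffn(in_dim, hidden=16, seed=0):
    """Random weights for a small in_dim -> hidden -> 1 network (learned in practice)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 1.0) / hidden for _ in range(hidden)]
    return w1, w2


def ffn(feats, w1, w2):
    """One hidden layer with a sine activation, followed by a linear readout."""
    hidden = [math.sin(sum(w * f for w, f in zip(row, feats))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))


def window(t, length, decay=4.0):
    """Exponential decay window: 1 at t = 0, shrinking toward later positions."""
    return math.exp(-decay * t / length)


def hyena_filter(length, w1, w2):
    """h[t] = Window(t) * FFN(PositionalEncoding(t)) for t = 0..length-1."""
    return [
        window(t, length) * ffn(positional_encoding(t, length), w1, w2)
        for t in range(length)
    ]


w1, w2 = make_ffn(in_dim=9)   # 1 + 2 * bands features per position
h = hyena_filter(64, w1, w2)
```

The window bounds the magnitude of each tap, so under this choice later positions contribute less regardless of what the FFN outputs.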

Efficiency and scalability

By replacing the quadratic self-attention^[2] mechanism with a sequence of FFT-based convolutions and element-wise multiplications, the Hyena operator achieves an overall time complexity of $O (N L \log L)$ , where $L$ is the sequence length and $N$ is the order of the operator (the number of recurrence steps), compared with $O (L^{2})$ for self-attention. Because $N$ is a small constant chosen in advance, the cost grows subquadratically in $L$ . This scaling is particularly advantageous for long sequences, allowing the model to process inputs considerably longer than those practical with conventional attention.
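The source of the $L \log L$ factor can be sketched with a textbook radix-2 FFT. The code below is a pure-Python illustration, not an optimized kernel; it assumes a causal convolution, zero-pads both signals to a power of two of at least $2L$ so that the circular convolution computed by the FFT matches the linear one, and uses hypothetical function names.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. No 1/n scaling applied."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_causal_conv(h, z):
    """Causal convolution of h with z in O(L log L) time via zero-padded FFTs."""
    length = len(z)
    n = 1
    while n < 2 * length:   # pad enough to avoid circular wrap-around
        n *= 2
    h_trunc = list(h[:length])
    h_pad = h_trunc + [0.0] * (n - len(h_trunc))
    z_pad = list(z) + [0.0] * (n - length)
    spectrum = [a * b for a, b in zip(fft(h_pad), fft(z_pad))]
    y = fft(spectrum, invert=True)
    return [val.real / n for val in y[:length]]   # apply the 1/n inverse scaling


def direct_causal_conv(h, z):
    """Reference O(L^2) implementation used only to check the FFT version."""
    return [
        sum(h[s] * z[t - s] for s in range(min(t, len(h) - 1) + 1))
        for t in range(len(z))
    ]


h = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5]
z = [1.0, 2.0, 3.0, -1.0, 0.5, 4.0]
fast = fft_causal_conv(h, z)
slow = direct_causal_conv(h, z)
```

Both functions return the same values up to floating-point error; an order- $N$ operator repeats the FFT-based step $N$ times, which gives the $O (N L \log L)$ total.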

Both the implicit convolutions and the gating functions in the Hyena model are highly parallelizable and amenable to optimization on modern hardware accelerators. Computing the long convolutions with the fast Fourier transform (FFT) avoids the quadratic cost of direct convolution, which makes the model well suited to large-scale applications where both speed and memory efficiency are critical.

References

[1] Poli, Michael; Massaroli, Stefano; Nguyen, Eric; Fu, Daniel Y.; Dao, Tri; Baccus, Stephen; Bengio, Yoshua; Ermon, Stefano; Ré, Christopher (2023-04-19), Hyena Hierarchy: Towards Larger Convolutional Language Models, arXiv:2302.10866
[2] Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2023-08-02), Attention Is All You Need, arXiv:1706.03762

This article "Hyena Model (deep learning)" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Hyena Model (deep learning). Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.
