>The TPU is so inefficient at FTs that the researchers did not use the FFT algorithm on sequences < 4096 elements, instead opting for a quadratic-scaling FT implementation using a pre-computed DFT matrix.
> on an Nvidia Quadro P6000 GPU, the FT was responsible for up to 30% of the inference time on the FNet architecture [0]
This company [0] claimed in 2021 that they could cut inference time by 40% if Google used their light-based chips with the TPU. Perhaps more if FFTNet does more of the heavy lifting.
I have been entertaining myself a bit lately by thinking about the ways in which some improvements to a design are very, very interesting to people when it takes 1.2 machines to do a task, not worth paying attention to when it's 6 machines to do the task, and suddenly very interesting again when it's 120 machines to do the task. There's that weird saddle point in the middle where I cannot get anyone else interested in my 20% resource improvements. It's just crickets.
I would guess that the FFT scales better as you increase the number of tokens in the context window. Interesting that Google's models outperform their competitors on context size.
I'm glad someone else had the same thought. I have been wondering what their "secret sauce" is for a while given how their model doesn't degrade for long-context nearly as much as other LLMs that are otherwise competitive. It could also just be that they used longer-context training data than anyone else though.
Yeah, but a comparison of power utilization is needed too. You can build hardware that is better than a GPU at something, e.g. really efficient and fast matmul. However, actual FFT hardware would win decisively on power and speed at large enough n, simply because the number of multiplications matmul does is O(n^3) as opposed to the O(n log n) multiplies that the FFT does (complex versus real multiplies notwithstanding).
FFT is only O(N log N) for a vector of length N
WRT matrices: for an N by N matrix it would be O(N^2 log N), since you perform an FFT for each row or column.
I still think we are comparing ASIC matmul hardware to non-ASIC FFT hardware. The given TPU hardware is doing 256x256 matrix multiplication in linear time by using 256x256 multiplier grids. An FFT ASIC could do the same thing but be able to handle a much larger N before memory becomes the bottleneck.
Part of the FFT can be accelerated on GPU hardware, which is full of butterfly-like instructions within warps. Using overlap-and-add/overlap-and-save and cuFFTDx can also make use of heavy parallelism within shared memory. I had a hard time reproducing the tcFFT paper (for lack of time and tensor core skills, I guess), but apparently you can also keep your data in Tensor Core registers.
The downside of a dedicated ASIC, besides the fixed size of the multipliers (which isn't that big of a deal because matrix multiplication can be broken down into blocks anyway), is that the precision (16-bit, 8-bit) and data format (floating point vs. integer/fixed) are immutable.
The Fourier transform is taken along the “token” dimension. However, in many applications, this dimension is not meaningful. That’s why transformers are a great option for consuming data which is permutation invariant.
I would like to see additional experiments using the lesser known Fourier transform over finite groups [1], which is permutation invariant but shares many properties with the standard Fourier transform.
I also wonder: if this becomes the next big thing for LLMs, how easy will it be for inference engines (e.g. vLLM, llama.cpp) to integrate it?
That's true in the realm of LLMs. But even in this case, the position information is added only into the first layer. Tokens in later layers can choose to "forget" this information. In addition there are applications of transformers in other domains. See https://github.com/cvg/LightGlue or https://facebookresearch.github.io/3detr/
That is the traditional Fourier transform, except it can be a cyclic group of any size, doesn't need to be a power of 2. (Though FFTs with 2^n input size are particularly easy to implement.)
I was careless in my thinking, thanks for the correction. I was imagining since you sum the group elements in any order, there is a permutation invariance. But the group elements themselves play the role of the "token index" and group elements are not interchangeable. To actually make this idea interesting, one would have to use a group in which the choice of group element assigned to each input token would not affect the result.
You mean for the group operation to be standard modular addition? In that case (as a sibling comment says) you'll recover the classic discrete Fourier transform.
OK, I admit that the math flies way over my head and I barely understand the text around the math. Can someone please explain (in basic English) how this is equivalent to the attention mechanism? What frequencies does it talk about? How does it encode positional relations between tokens?
- The Fourier Transform is an invertible operator (i.e. it acts on functions; in the case of matrices, both the functions and the operators are themselves matrices). It transforms into what we call frequency space.
- This is most intuitive for signal analysis or images [1].
- Frequency space is inherently "complex", i.e. represented by complex numbers.
- Frequencies have the advantage that they take a "global" view of the problem.
- This mechanism is not equivalent to the attention mechanism. There is definitely a trade-off.
- But it is possible that it captures many of the important relationships that attention captures.
- I do not have good intuition for modReLU right away, but it seems important because it modifies the frequencies but preserves the inverse Fourier transform.
The actual mechanism at least is quite simple. Essentially it takes the FFT of the input embeddings, multiplies it elementwise with weights obtained from the input embeddings via an MLP (plus a constant but learnable bias), then runs it through an activation function, and finally takes the inverse FFT.
The "frequencies" are probably something quite abstract. FFT is often used in ways where there aren't really clear frequency interpretation. The use is due to convenient mathematical properties (e.g. the convolution theorem).
Rather amazing if this really works well. Very elegant.
That sounds just like Particle Mesh Ewald, which we use in molecular dynamics to approximate the forces of pairwise interactions (interpolated on a grid). https://en.wikipedia.org/wiki/P3M
It's similar but I worked on magnetic spin systems with dipole-dipole interactions, so there wasn't the interpolation part, and as I understand it in Ewald summation you're always assuming periodic boundary conditions.
In our spin systems you basically pre-compute the interaction kernel tensor and can either take into account periodicity or ignore it depending on what sort of system you're looking at. Often you don't want the periodic effect, since the dipole-dipole interaction is only one of many; much of the interesting phenomena in magnetics is in the interplay between the short-range and the long-range forces. At each time step you FFT the magnetisation tensor, multiply with the interaction tensor, then iFFT.
I’m still confused. Does it treat the input tokens as a sampled waveform?
I mean, say I have some text file in ASCII. Do I then just pretend it's a raw wav and do an FFT on it? I guess it can give me some useful information (like whether it looks like any particular natural language or is just random; sometimes used in encryption analysis of simple substitution ciphers). It feels surprising that an inverse FFT can give a coherent output after fiddling with the distribution.
As I understand it, the token embedding stream would be equivalent to multi-channel sampled waveforms. The model either needs to learn the embeddings by back-propagating through FFT and IFFT, or use some suitable tokenization scheme which the paper doesn't discuss (?).
No. The FFT is an operation on a discrete domain; it is not the continuous FT. In the same way audio waveforms are processed by an FFT, you bucket the frequencies, which is conceptually a vector. Once you have a vector, you do machine learning like you would with any vector (except you do some FT in this case, I haven't read the paper).
I am not an expert by _any_ means, but to provide _some_ intuition — self-attention is ultimately just a parameterised token mixer (see https://medium.com/optalysys/attention-fourier-transforms-a-...) i.e. each vector in the output depends upon the corresponding input vector transformed by some function of all the other input vectors.
I don't see how you could fit causal masking into this framework without having to do n different FFTs, and there's no mention of positional embeddings either, so I guess the self-attention implementation being compared against is noncausal NoPE, which would make this a case of baseline sandbagging and maybe not so impressive.
If the results were close to state-of-the-art, probably the author would've mentioned it?
They do show their model winning every category in the Long Range Arena (LRA) benchmark. Hopefully they have not excluded losing categories or better models.
Does anyone have an intuition about why looking at things in the frequency domain is helpful here? The DC term I can see, but I wouldn't expect the input data is periodic enough that other frequencies would be meaningful.
I think in the age of telemetry we are also missing a substantial trick by not applying FFTs to cloud telemetry to smoke out epicycles and metastable systems before rather than after they trigger drama. This is unfortunately within my level of notice but not within my level of skill, and my dance card is already full.
"SLA's are most likely to be violated 23-25 minutes after a service deployment. Hmm, I wonder why that is... Oh no."
"I'm afraid I cannot deploy your application, Dave"
Jokes aside, one area where this could really be worth money is predicting cycles of traffic and saving by ramping server instances up and down. It's the kind of work the company would never green-light if you proposed doing it yourself, but if you packaged it as an off-the-shelf product they would totally buy it.
Yep. I worked on a SaaS for one particular retail industry, and I would bet you anything they see more traffic when biweekly paydays and the calendar line up in certain ways. Either a couple days before (window shopping) or a few days after, once other bills are paid.
That extra half paycheck when a month has 5 Mondays or Fridays in it...
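To make the traffic-cycle idea a couple of comments up concrete, a toy numpy version (synthetic hourly load standing in for real telemetry; a real pipeline would pull the series from a metrics store instead):

```python
import numpy as np

# Synthetic hourly request counts for 8 weeks: daily + weekly cycles plus noise.
hours = np.arange(24 * 7 * 8)
load = (100
        + 30 * np.sin(2 * np.pi * hours / 24)          # daily cycle
        + 15 * np.sin(2 * np.pi * hours / (24 * 7))    # weekly cycle
        + 5 * np.random.default_rng(0).standard_normal(hours.size))

spectrum = np.abs(np.fft.rfft(load - load.mean()))     # drop DC, look at magnitudes
freqs = np.fft.rfftfreq(hours.size, d=1.0)             # cycles per hour

for k in np.argsort(spectrum)[-2:]:                    # two strongest components
    print(f"dominant period of ~{1 / freqs[k]:.0f} hours")   # expect ~168 and ~24
```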
I sort of grasp big O notation...but this is sort of over my head like most stuff that has to do with computer or electrical engineering.
As someone who is absolutely terrible at math, I envy the people who grasp or at least can learn this type of stuff and get an engineering degree and license.
All I really know about the FFT is that it changes a signal, it's somehow used in processing signals of some kind, and apparently, from what I heard, it was the key to detecting nuclear detonations back in the day.
Having a decent intuitive notion of Fourier transforms is an incredibly useful tool in your toolbox, even if you can't derive a Fourier transform by hand or write a fast Fourier transform (FFT) algorithm.
The basic idea is this: (almost) any (useful) signal can be represented as a sum of sine waves with different frequencies and phases. For example, an electrical signal or a sound wave is a one-dimensional signal where the x-axis is time. This might look like a really complex squiggly line that's hard to work with. Using a Fourier transform, you can separate the individual frequencies of that time-based signal. Then, you can modify the specific frequencies however you want. For example, if you have a lot of random, spiky noise in the signal, that will show up as high frequencies. To clean it up, just do a Fourier transform, throw out any data with a frequency above a certain threshold, and then run an inverse Fourier transform on the remaining data to get back a smoother version of the original signal. This is called a low-pass filter, and it's more or less equivalent to taking a moving average of the original signal.
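In numpy that low-pass recipe is only a few lines (the 10 Hz cutoff here is arbitrary):

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)             # one second at 1 kHz
signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.random.default_rng(0).standard_normal(t.size)

spectrum = np.fft.rfft(signal)                           # to the frequency domain
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])           # frequency of each bin, in Hz
spectrum[freqs > 10] = 0                                 # throw away everything above 10 Hz
smoothed = np.fft.irfft(spectrum, n=t.size)              # back to the time domain
```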
Where it gets really fun is that you can extend this, in a pretty straightforward way, to higher dimensions. A two-dimensional signal, where both the x- and y-axes are space, is just an image. JPEG compression is based on this concept: it removes the high-frequency signal in the image in order to store the data in a more compact form, at the expense of losing some fine detail (or creating those ring-like artifacts, if you throw out too much data). Add a third dimension for time, and now you have video. And so on.
The nice thing about all this is that it's very visual, so you can get a good intuition for it without having to know all the math inside and out. Here's a good page with lots of visualizations and interactive examples: https://www.jezzamon.com/fourier/index.html
The simple explanation: let's say you have a one-dimensional time-domain signal, e.g. an audio signal measured by a microphone, which measures the displacement of air vs. time at a fixed point (assuming you don't move the microphone, I guess).
The Fourier transform (of which the FFT is a discrete version) decomposes that 1D time-domain signal (e.g. an audio signal, time vs. displacement) into frequency vs. magnitude and phase components.
The frequency is basically the pitch. So for a pure sine wave, or pure tone (which sounds like those "off air" TV signals we used to get late at night back in the day), you get a bunch of zeros and a single "spike" at the frequency of the tone. The larger the amplitude of the signal, the larger the magnitude of the spike will be. As the pitch (frequency) increases/decreases, the location of this spike moves up/down along the horizontal axis.
The phase is basically the time offset of the signal. A tone which was delayed somehow will show up as a different phase. Note this is a relative measure, not absolute. So you won't be able to tell if the signal was offset by 1s or 2s, etc., because it has units of radians (angle), which "reset" as the angle wraps around the circle.
So for one signal (time vs. amplitude), you actually get two pieces of information (frequency vs. magnitude/phase)
However, if you understand imaginary numbers/complex variables, those two signals are really just the magnitude and argument of the FFT output, which is a complex-valued function.
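A tiny numpy check of the pure-tone picture (numbers picked arbitrarily): a 50 Hz sine of amplitude 2 gives one spike, whose location, height, and angle recover the pitch, amplitude, and phase.

```python
import numpy as np

fs = 1000                                            # sample rate, Hz
t = np.arange(fs) / fs                               # one second of samples
tone = 2.0 * np.sin(2 * np.pi * 50 * t + 0.3)        # 50 Hz, amplitude 2, phase 0.3

spectrum = np.fft.rfft(tone)
freqs = np.fft.rfftfreq(tone.size, d=1 / fs)
peak = np.argmax(np.abs(spectrum))

print(freqs[peak])                                   # 50.0  -> the pitch
print(2 * np.abs(spectrum[peak]) / tone.size)        # ~2.0  -> the amplitude
print(np.angle(spectrum[peak]))                      # ~-1.27, i.e. 0.3 - pi/2 (sine vs. cosine convention)
```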
At a high level, it's very unintuitive to me that transforming to the frequency domain for analyzing long sequences of tokens--when there is in general no expectation that the sequences have a periodic structure--can improve efficiency.
Stated another way, how can it be possible that it is more efficient to translate the sequence into a series of N variables, where the nth variable is the sum of every nth term of the sequence, if it is unlikely that any relationship between these variables holds for any fixed period? If I combine the 1st 4th 7th 10th .... elements of the sequence, how do we expect the addition of anything beyond the first two elements to add anything but noise?
Stated yet another way, if I'm going to approximate a function as a sum of sine waves, this is most efficient when the function is periodic, and it requires more and more sine waves to approximate the function on a larger and larger domain when the function is not periodic.
What I don't get about attention is why it would be necessary when a fully connected layer can also "attend" to all of the input. With very small datasets (think 0 - 500 tokens), I found that attention makes training longer and results worse. I guess the benefits show up with much larger datasets. Note that I'm an AI noob just doing some personal AI projects, so I'm not exactly a reference.
A fully connected layer has different weights for each feature (or position in input in your formulation). So the word "hello" would be treated completely differently if it were to appear in position 15 vs. 16, for example.
Attention, by contrast, would treat those two occurrences similarly, with the only difference depending on positional encoding - so you can learn generalized patterns more easily.
This is the case with most clever neural architectures: in theory, you could always replace them with dense layers that would perform better with enough resources/training, but that's just it, efficiency matters (number of parameters, training data, training time, FLOPS) and dense layers aren't as efficient (to put it mildly).
You have seen this play out on a small scale, but if you calculate the size of the dense layers necessary to even theoretically replicate a big attention layer or even convolution, to say nothing of the data needed to train them without the help of the architecture's inductive bias, you will see that the clever architectures are quite necessary at scale.
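Rough numbers for that argument (a back-of-the-envelope sketch; ignoring biases, heads, and the attention matrix itself):

```python
# Parameter counts for mixing a sequence of L tokens with d-dimensional embeddings.
L, d = 1024, 768

# Self-attention: Q, K, V and output projections, each d x d, shared across positions.
attention_params = 4 * d * d                    # ~2.4 million, independent of L

# A dense layer over the flattened sequence, able to learn arbitrary position-specific mixing.
dense_params = (L * d) ** 2                     # ~6.2e11

print(attention_params, dense_params, dense_params // attention_params)
```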
Working in the Fourier domain has been a staple of scientific and engineering applications. Learning those interactions rather than just hardcoding them has been fairly widely explored as well - the term to look for is Fourier Neural Operators [1][2]. It turns out you can prove universality even if the nonlinearity remains in the real domain [3].
The concept is fairly mainstream nowadays, to the degree that Jensen talked about it in his GTC keynote in 2021 [4] and there’s even a mainstage TED talk about its applications [5].
A nice property of doing things this way is that your model ends up being resolution-invariant which is particularly interesting for engineering domains. Scaling these methods has sparked the "let’s do a fully deep-learning-based weather model"-race [6][7].
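For reference, the core spectral-convolution layer of an FNO looks roughly like this (a sketch, not any particular library's implementation; mode count and initialization are arbitrary). Because the learned weights live on a fixed number of low Fourier modes, the same layer can be applied to inputs of any resolution, which is where the resolution invariance comes from.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    # Keep only the lowest `modes` Fourier modes and multiply each retained
    # mode by a learned complex weight mixing input channels into output channels.
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes                               # must be <= n//2 + 1
        scale = 1.0 / (channels * channels)
        self.w_real = nn.Parameter(scale * torch.randn(channels, channels, modes))
        self.w_imag = nn.Parameter(scale * torch.randn(channels, channels, modes))

    def forward(self, x):                                # x: (batch, channels, n)
        X = torch.fft.rfft(x, dim=-1)                    # (batch, channels, n//2 + 1)
        w = torch.complex(self.w_real, self.w_imag)
        out = torch.zeros_like(X)
        out[..., :self.modes] = torch.einsum('bim,iom->bom', X[..., :self.modes], w)
        return torch.fft.irfft(out, n=x.shape[-1], dim=-1)
```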
As for using this on text data: my intuition would be that it is not going to work as well, because of a fairly unique property of text. For image, video, and scientific data, each individual element is of approximately equal importance, whereas in text you can have discrete tokens, like a "not" somewhere in there, that change the meaning of everything around them fairly significantly, and you'd want that all-to-all interaction to capture that. Any kind of mixing that smoothes things out is going to be inherently at a disadvantage - probably true to some degree for most of these efficiency-saving methods, and why we're seeing more limited adoption on text.
How does this compare with Mamba? Intuitively it feels like this should have similar performance to Mamba, but the math is a lot simpler and easier to understand.
Since I believe consciousness itself is made of EMF waves, generated by neural activity (rather than synaptic firings themselves, which I view merely as signal carriers like the I/O to/from brains), I'm glad to see it any time FFTs are used in any way in NNs or AI research.
I started to develop my own custom type of MLP (multilayer perceptron) that was going to use frequencies and phase angles (FFT) as the "model weights", but then I decided it would probably only outperform the standard MLP if the training data itself was periodic in nature, rather than language tokens or even image data. Not sure if that's correct or not, since the Fourier series shows us ANY arbitrary function can be represented via a superposition of waves.
I still believe that if we do achieve something amazing (i.e. competitive with SOTA AI models) with a wave-based NN, it won't create any 'true' qualia, however, because simulating EMF waves in a computer is not the same as real EMF waves existing. I think even a 100% perfect simulation of a brain in a computer, for example, will always be a 'zombie' (no qualia). This is obvious if consciousness is indeed made of waves; but it's astounding how many NN researchers are illiterate enough in the field of neuroscience that they don't realize how much evidence there is that consciousness is a wave phenomenon.
Amazing. I had a similar insight and did exactly this some time ago, transforming the input into the frequency domain in order to preserve input energy and encode information in the complex space, plus filtering based on global context (different filter implementation than discussed in this paper) but only trained the model locally and didn't bother seeing how it scaled. Congrats on the hard work!
I'm interested in how this would work for generative models. It's not obvious how you'd implement causal masking in the frequency domain. And the modReLU activation seems critical but adds implementation complexity.
Would love to see how this scales on truly massive context lengths where the theoretical advantages should really shine.
This seems to follow the similar FNO work by Nvidia, and switching to the frequency domain is in any computer scientist's toolbox at this point. However, I'm curious whether this translates to real gains on real architectures. The FFT makes use of imaginary numbers to encode the harmonics of the signal, and these are generally not amenable to GPU architectures. Would a fast Walsh-Hadamard transform suffice? Sometimes the 'signal mixing' is more important than the harmonics of a composition of sines. Or do we go further down the rabbit hole of trained transformations and try out wavelets? I am an avid FFT fan (love the fast Johnson-Lindenstrauss transform using the embedded uncertainty principle for RIP), but sometimes real hardware and good theory don't always align (e.g. there are sub-ternary matrix multiplies, but they are rarely used in DL).
Complex numbers work just fine on a GPU. You just represent the data as a real-part matrix and an imaginary-part matrix and do the (ac-bd)+(ad+bc)i stuff in matrix land instead of on complex scalars.
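e.g. in numpy terms, the same bookkeeping a kernel would do with two real buffers:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 4, 4))   # real and imaginary parts of one matrix
c, d = rng.standard_normal((2, 4, 4))   # real and imaginary parts of another

# Complex matrix product done entirely in real arithmetic:
# (A + iB)(C + iD) = (AC - BD) + i(AD + BC)
real_part = a @ c - b @ d
imag_part = a @ d + b @ c

# Same result using native complex matrices.
z = (a + 1j * b) @ (c + 1j * d)
assert np.allclose(real_part, z.real) and np.allclose(imag_part, z.imag)
```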
What's wrong with complex numbers on GPUs? You don't have to do anything special. It's obviously faster if you can make simplifying assumptions like "the input signal is purely real" but otherwise at worst you're dealing with pairs of reals (or floats) and don't have to think about philosophical implications.
GPUs don't implement complex-number FP math; you have to bolt it on as extra logic. cuFFT works because you can recursively predict the imaginary and real component paths in the butterfly network. Between layers you have FFT -> iFFT; is this cost worth it memory-locality-wise, or is it better to find ways to tamp down n in n^2 self-attention by windowing, batching, gating, and many other solutions? I'm not saying this work isn't cool. FNOs are really cool, especially for solving PINNs and related continuous problems. But are LLMs continuous problems? Does n have to span the entire context window? I'll probably end up experimenting with this since they've made the code available, but sometimes good theory is good theory and not necessarily practical.
Hm. Nothing that happens in the Fourier domain can touch the phases, which seems like a constraint that would change the behavior of other layers.
The default bias of -0.1 with ReLUs, and what I would expect to be a flattish spectrum, also seems like it would make for a sparse representation in the Fourier domain.
I assume this is learning the text embeddings at training time; if so, I'd be curious how the constraints of going through the FFT and filtering magnitudes would/could change how the embeddings look.
I am very out of my depth here haha with the math, appreciate those below taking the time to baby walk me along with the understanding!
Great links too!
In image processing at least, NNs typically learn a Fourier or wavelet representation in their first layers. The biggest benefit of applying a transformation beforehand is to reduce training time / obtain better generalization by "removing the dimension that doesn't matter".
E.g. in a suitable space, one coordinate could represent the rotation of an object. You could do the transform and discard this dimension if your NN should be rotation invariant.
In image processing I thought there was a whole host of specialized algorithms, such as edge detection, SCC, etc. that were run before the data was even fed into the ANN.
Overall, I like this guy's papers, but they strike me as someone who is very smart but hasn't looked through the literature carefully. Many of the techniques he is proposing were already done about 5-6 years ago. However, I imagine that because the field is flooded with new humans, they are not aware of this research, or they think it will lead to a fruitful end (when many other researchers have already thought of it and it didn't lead anywhere, hence why it was abandoned). Overall, it seems we are starting to recycle ideas because there isn't enough lit review and/or mentoring from senior deep learning / ML folks who can quickly look at a paper and tell the author where the work has already been investigated.
Reviving old ideas and comparing them to SOTA is not necessarily bad, especially if they provide benefits over the SOTA model. It brings the old ideas into the community idea cache, if you will. It's somewhat annoying if the authors do it thinking it's a novel idea when in fact it's a 20-30 year old one.
This reminds me of some HN comments about rocketry ideas and in the thread one of the comments was “Everything in rocket science has been theorized/tried by some Russian scientist 40-50 years ago” and it still gives me a chuckle.
> Overall, it seems we are starting to recycle ideas because there isnt enough lit review and or mentoring from senior deep learning / ML folks who can quickly look at a paper and tell the author where the work has been already investigated.
Arguably, the literature synthesis and knowledge discovery problem has been overwhelming in many fields for a long time; but I wonder if, in ML lately, an accelerated (if not frantic) level of competition may be working against the collegial spirit.
I think it's been accelerated by the review community being overwhelmed and the lack of experienced researchers with the combination of classic ML, deep learning, transformers, and DSP backgrounds -- a rare breed but sorely needed.
Where L stands for the sequence length, T denotes a fixed constant that accounts for compression and selection time in Mamba's autoregressive inference, and C reflects the fixed size of the SSM (State Space Model) latent state in Mamba.
Yes, in Mamba accuracy seems to go down and it has trouble with exact token recall. But I would say it might be good for very power-efficient edge deployment, and ultra-long contexts.
Mamba is solving a different problem than transformers.
What Mamba does is take an initial state s_0 and an input u_0, to produce a new state s_1 and an output o_1. It's basically modeling a very complicated state machine. I can easily think of half a dozen applications where this is exactly what you want and it is better than transformers, but LLMs are not among them. Essentially most control problems boil down to what Mamba does. In fact, I would say that Mamba as an architecture is probably the non-plus ultra for modeling mechanical system dynamics.
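The recurrence being described is basically the textbook (linear, time-invariant) state-space step below; Mamba's twist, left out of this sketch, is making the transition matrices input-dependent ("selective") and scanning them efficiently.

```python
import numpy as np

def ssm_step(s, u, A, B, C, D):
    # One step: new state from old state and input, plus an output.
    s_next = A @ s + B @ u
    o = C @ s_next + D @ u
    return s_next, o

rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((4, 4))       # 4-dim hidden state
B = rng.standard_normal((4, 2))             # 2-dim input
C = rng.standard_normal((2, 4))             # 2-dim output
D = rng.standard_normal((2, 2))

s = np.zeros(4)
for u in rng.standard_normal((10, 2)):      # a short input sequence
    s, o = ssm_step(s, u, A, B, C, D)
```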
Deep learning actually simplifies the extremely complex math of previous machine learning and statistics/stochastics into a very reasonable set of operations:
matrix multiplications and some very simple activation functions
(plus automatic derivatives, some magic and some scientific glasses which you can ignore)
Yes, the cochlea for one basically performs an FT on the incoming input signal, at least with respect to magnitude. The phase portion is still in the time domain
Basically leverages the convolution theorem [0]: expensive convolutions in direct space become simple multiplications in reciprocal space, and vice versa.
Wherever you have a convolution operation on your data, transform it to the conjugate domain to turn it into multiplication.
In other words, work in the domain that is natural to your data.
[0] https://en.wikipedia.org/wiki/Convolution_theorem
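Concretely, for circular convolution of two length-8 sequences (a toy numpy check of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
h = rng.standard_normal(8)

# Circular convolution computed directly in the original domain...
direct = np.array([sum(x[j] * h[(i - j) % 8] for j in range(8)) for i in range(8)])

# ...and as a plain elementwise product in the Fourier domain.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

assert np.allclose(direct, via_fft)
```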
This is a great way to put it. That said, it was not obvious to me that the attention space (as it is structured in LLMs) is a frequency domain.
I wrote down the following in the internal Slack chat on 01.06.2025, though of course actually carrying the effort out is much more work than writing it down.
Large language models (LLMs) operate in a high-dimensional token space, where tokens (words, subwords, or characters) can be viewed as discrete signals covering a multi-dimensional knowledge space. FFT methods can therefore be applied to move from a time-domain to a frequency-domain representation, with the aim of reducing computational complexity: we map token signals into the frequency domain. This transformation allows us to analyze token dynamics, such as their frequency of occurrence, temporal correlations, and interactions across contexts, with computational efficiency. In this approach, embeddings are treated as signals, and their relationships in sequence are captured as patterns in the frequency domain. The FFT could be used to decompose token streams into dominant frequency components, revealing periodic or recurrent patterns in language usage. These patterns are repeatable across human-generated knowledge and generally follow a predefined set of rules, so the signals are not just white noise; they are predictable. By analyzing these frequency components, predictions of the next token can be made by emphasizing high-energy components in the frequency spectrum, reducing noise and focusing on statistically probable outcomes. This method can reduce computational overhead during training and inference by enabling lightweight spectral analysis rather than heavy attention mechanisms, especially for long-context or repetitive sequences. Classical signal filtering techniques (low-pass, high-pass, band-pass) could also help align model behavior with human linguistic patterns, refine token embeddings, and improve efficiency in both training and inference.
A cartoon:
To form a coherent idea you need to coordinate a lot of tokens. In other words, ideas are long-distance correlations between tokens. Ideas are the long-wavelength features of streams of tokens.
Is it exactly right? No. But as a cartoon it can motivate exploring an idea like this.
Right. This makes sense. But why Fourier space in particular? Why not, for example, a wavelet transform?
> Why not, for example, a wavelet transform.
That is a great idea for a paper. Work on it, write it up and please be sure to put my name down as a co-author ;-)
Or for that matter, a transform that's learned from the data :) A neural net for the transform itself!
That would be super cool if it works! I’ve also wondered the same thing about activation functions. Why not let the algorithm learn the activation function?
Mostly because of computational efficiency, iirc; the non-linearity doesn't seem to have much impact, so picking one that's fast is a more efficient use of limited computational resources.
This idea exists (the broad field is called neural architecture search), although you have to parameterize it somehow to allow gradient descent to happen.
Here are examples:
https://arxiv.org/abs/2009.04759
https://arxiv.org/abs/1906.09529
Now you're talking efficiency: certainly a wavelet transform may also work. But wavelets tend to be more localized than FTs.
This way you end up with time dilated convolutional networks [1].
[1] https://openreview.net/pdf?id=rk8wKk-R-
I like this. Anything that connects new synapses in my skull via analogy is a good post.
This is really a very interesting way of visualizing it.
Exactly. Exploiting the structure of the matrix (e.g., it is well approximated by a circulant matrix) is natural if there is structure to exploit. If everything in the preprint holds up, that might suggest some symmetries (e.g., approximate stationarity in time) in the data at hand.
Yeah, the basic math space-transformation sandwich: 1) turn data into another space, 2) operate in that space, 3) transform back into the original space. To optimize this, optimize each step and work as much as possible in the most efficient space.
Remains to be seen how lossy the transformations are. We lose a lot of data in DSP (aka “media”) doing too much.
ReLU throws away half of the data. Yet it works.
> In other words, work in the domain that is natural to your data.
Why would multiplication be more "natural" to a domain than convolution, as opposed to just simpler to calculate?
As an example, if you're trying to "smooth out" some signal, what you're really trying to do is remove the high frequency components. So using a Fourier transform to convert it to the frequency domain lets you directly work with the frequency data, rather than indirectly in the time/space/whatever domain. The fact that the operation is simpler in the frequency domain is a good hint that you've picked the "natural" space in which to look at the problem. Of course, there's no formal definition of "naturalness," and at the end of the day, you get the same result either way.
One way to think about things is in terms of diagonalization. A generic linear operator is a fairly complicated thing that mixes information from different dimensions. When you diagonalize an operator, it's the same operator, but you're choosing a coordinate system where it becomes clear that all it was really doing was stretching along each coordinate axis, so you've broken it down into something that acts independently in a simple way on each dimension. The Fourier transform is unitary, so the intuition is that you're basically doing something like a rigid transformation of space (e.g. a rotation), and when you look from the correct angle, you see that your original operator wasn't some complicated "add a stretched version of this dimension to this other dimension minus a stretched version of a third dimension", but just "stretch this by this, this by this, etc."
On the other hand, convolution itself is already "just" multiplication. e.g. multiplying polynomials is convolution of their coefficients (to get the x^n coefficient, you need to add up all the combinations of a_i a_j x^i x^j where i+j=n), and this point of view also applies to e.g. linear time-invariant systems[0] by thinking of your function as the weights of an infinite weighted sum (so sort of an infinite polynomial) of time-shift operators (and this point of view works for other groups, not just time shifts). So f(t) is then the "coefficient" for the t-shift, and multiplying two such weighted sums again has you convolve the coefficients (so your original functions). The jargon way to say this is that your space of G-invariant functions is secretly the free algebra generated by G (G being a group). From that point of view, convolution is the "natural" multiplication on G-invariant functions. One can then ask whether there's a Fourier transform for other groups, which leads to abstract harmonic analysis. e.g. the Mellin transform is the Fourier transform for scale/stretch invariant functions as opposed to shift invariant.
[0] The typical systems that one studies in signal processing contexts where convolution and Fourier transforms are your bread and butter: https://en.wikipedia.org/wiki/Linear_time-invariant_system
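A small numpy illustration of that diagonalization story, using a circulant matrix (the structure mentioned elsewhere in the thread): its eigenvalues are exactly the FFT of its first column. A symmetric first column is used so the eigenvalues come out real and are easy to compare.

```python
import numpy as np

c = np.array([4.0, 1.0, 0.0, 1.0])                     # first column (symmetric on purpose)
C = np.array([np.roll(c, k) for k in range(4)]).T      # C[j, k] = c[(j - k) mod 4], a circulant matrix

eigvals = np.linalg.eigvals(C)                         # what the operator "stretches by"
fft_of_column = np.fft.fft(c)                          # [6, 4, 2, 4]

# Same multiset of eigenvalues, just in a different order.
assert np.allclose(np.sort(eigvals.real), np.sort(fft_of_column.real))
```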
This is a good general tool for less-mathematically-deep folks to keep in their pocket: look for well-behaved objects that do something nice under the operation you're interested in. Typical "well-behaved" objects do things like stay where they are, or end up as a constant multiple of themselves, or end up as 0 or 1, or something like that. Then try to represent everything else in terms of those objects, so that you can take advantage of their convenient behavior. Less-difficult examples include:
- Prime factorization: primes have nice properties, and you can turn every integer into a product of primes (polynomial factorization is an extension of this idea) and work with the nice prime properties
- Vector spaces: basis vectors have nice properties, so you write vectors as sums of them and do operations on the coefficients instead of the vectors themselves
- The exponential function: it's the unique function with f'(x) = f(x), so you try to turn everything else into exponentials anytime you have to solve some painful differential equation because you know those terms will go away
- Fixed points in dynamical systems: if you don't want to analyze how arbitrary things change, find the points that don't, then think of the other points as (fixed point) + (small perturbation) and reduce your work to handling the perturbation
- Taylor series: polynomials are easy, smooth functions are hard, so turn your smooth function into a polynomial and do polynomial things with it
Yeah, this mindset is often called "mathematical maturity" in books, and you've laid out a good practical subset of it.
A nice generalization.
An example in statistics is the expectation operator. You can throw away a lot of detail if you only care about one central moment. And if you need more information about a distribution, add more moments.
Also, this works for public policy. Frame everything as a well functioning market and hope for the best. /s
But seriously, a nice intuition.
Well, “think of the other points as (fixed point) + (small perturbation) and reduce your work to handling the perturbation” is literally how modern economic models (DSGE) are studied and then used for public policy.
Here is an example that shows how high dimensional data can be sparse (aka simple aka natural) in some projections. Since FFTs are just a reprojection of your data, this provides useful intuition.
https://bsky.app/profile/bsky.tdunning.com/post/3lgvuzuju3k2...
> just simpler to calculate
That's all that "natural" means in this context. It's like "elegant" -- if it just takes less effort to get the result u need then why wouldn't you take the easier route?
I think they just meant simpler to calculate.
Because multiplication gives rise to simpler algebras than convolution does.
Is reciprocal space always just 1/space as in frequency=1/t?
In the case of FFTs, no. Which is why I prefer the term Fourier space. I don't like "frequency domain" because I frequently work with 3-D and 5-D FFTs, while I've always felt frequency is connected to the single-dimension FFT.
Maybe something like "recurrence space" would be better. Frequency does have a physical interpretation which could be misconstrued, e.g. the FFT of a wave in the space domain yields the wavenumber in the independent variable, not the frequency.
The formal, general name is “Dual space” I think.
Unfortunately that’s very vague because there’s many notions of “dual space” within applied mathematics, even just considering the parts relevant to engineering applications.
It's called reciprocal because the Fourier transformation is its own inverse, and the input and output space have the same 'shape' (functions from the reals to the complex numbers).
So they are considered two sides of the same coin. And reciprocal in that sense.
Yes. 1/space is usually called the wavenumber (k).
Yeah, but the savings are theoretical. You turn an O(n^2) operation into O(n log n). Sounds great until you realize that n is three on average. To boot, you have to use complex numbers for calculations, which are also less numerically stable. So, to the best of my knowledge, FFT is not a win for ordinary convolution.
Maybe for self-attention and for their use cases n is much larger, I didn't read the article. But you still have to deal with complex numbers.
> You turn a O(n^2) operation into O(nlog n). Sounds great until you realize that n is three on average.
Sure, but are long convolutions avoided precisely because they're expensive? This paper is talking about an alternative to an attention mechanism, which covers the entire context window, no? Isn't this paper saying: you could use a long convolution for this instead, and long convolutions don't have to be slow?
> you have to use complex numbers for calculations which are also less numerically stable
I haven't heard numerical stability being a big deal in neural nets; in fact don't people often use 16-bit floats as weights to save on space? Does the numerical stability of complex numbers exceed the precision dropped off by quantization anyway? Are complex numbers really inherently less numerically stable, or are we just not as good at using them yet?
3^2 / (3*log(3)) = >6x performance improvement and, if three is a linear average, I'd expect the average improvement to be even higher. I know real world computation doesn't answer to the simple scaling equations and there may well be a constant factor >6 that eats the gains, but I don't think the two Big Os and a n=3 are sufficient to make your case.
That's not how O(f(n)) notation works. You can't just plug in an n to O(f(n)) / O(g(n)) and claim a specific performance improvement. You have to actually know all the factors that are stripped off by big-O to do that, and you never really do. For instance, you're ignoring the cost to transform between domains.
> I know real world computation doesn't answer to the simple scaling equations ... but
No, no "but". This defeats the entire claim, and you can't just "but" it back.
Also, you appear to have used base-10 log for Log(3). It's almost certain that base-2 is more appropriate, leading to a factor of 1.8x, not 6x. But of course Log_1000(n) and Log_2(n) have the same Big-O, which is why the base is left off, so you really just cannot say anything specific at all. O(n^2) might be faster than O(n*log(n)) up to n = Graham's number.
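(Redoing the arithmetic from the thread, purely to show where the 6x vs. roughly-2x numbers come from; as argued above, neither says anything about real performance.)

```python
import math

n = 3
print(n**2 / (n * math.log10(n)))   # ~6.3  (the accidental base-10 version)
print(n**2 / (n * math.log2(n)))    # ~1.9  (base 2)
```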
>This defeats the entire claim, and you can't just "but" it back.
You may have missed what the "but" is doing - it's agreeing with you. My entire claim is defeated, and it uses the same reasoning that the parent used to make their claim. I'm not attempting to show that there is an improvement, only that the lack of improvement has not been demonstrated by listing two Big-Os and setting n.
But yes, the log base 10 is my bad.
FYI you’re using log(3) to the base 10 there. That’s less than one so your figures look artificially good.
There are integer-only FFT analogues that enable the same tricks. Cf. https://en.wikipedia.org/wiki/Discrete_Fourier_transform_ove...
It is true it isn't numerically stable and an FFT isn't entirely reversible. I think a good way to get an idea of how frequency data relates to attention is to look at JPEG; images tend to make the concept easier to understand. For JPEG a cosine transformation is used instead of a Fourier transformation, but the principle is the same.
The FFT is unitary, so it has really good numerical stability.
Google introduced this idea in 2022 with "FNet: Mixing Tokens with Fourier Transforms" [0].
Later they found that their TPUs' matrix multiplication was faster than the FFT in most scenarios.
[0]: https://arxiv.org/abs/2105.03824
Referenced in this paper:
"Overall, while approaches such as FNet, Performer, and sparse transformers demonstrate that either fixed or approximate token mixing can reduce computational overhead, our adaptive spectral filtering strategy uniquely merges the efficiency of the FFT with a learnable, input-dependent spectral filter. This provides a compelling combination of scalability and adaptability, which is crucial for complex sequence modeling tasks."
And a comparison section after that.
Except that the paper is written as if they discovered that you can use an FFT for attention. They even have a "proof". It's in the title. Then you discover everyone already knew this and all they do is add some extra learnable parameters.
Pretty lame.
Search engines don't always turn up prior art the way you'd like. Simple jargon discrepancies can cause a lot of mischief. Though I'm sure a case could be made about it being confirmation bias. It's hard to get people to search in earnest for bad news. If it's not in your face they declare absence of evidence as evidence of absence.
That seems like an odd comparison, specialty hardware is often better, right?
Hey, do DSPs have special hardware to help with FFTs? (I’m actually asking, this isn’t a rhetorical question, I haven’t used one of the things but it seems like it could vaguely be helpful).
Xilinx has a very highly optimized core for the FFT. You are restricted to power-of-2 sizes, which usually isn't a problem because it's fairly common to zero-pad an FFT anyway to avoid highly aliased (i.e. hard-edged) binning.
The downside of implementing it directly in hardware is that the size is fixed.
They usually have dedicated acceleration hardware, yes: https://www.ti.com/lit/an/sprabb6b/sprabb6b.pdf?ts=174057874...
Yes, almost all DSPs I know have native HW support for FFT, since it's the bread and butter of signal processing.
I remember hearing about logic to help with deinterleaving the results of the butterfly network after the FFT is done.
Yeah, bit-reversed addressing mode as seen on the dsPIC is an example of this.
(Discrete) Fast Fourier Transform implementations:
https://fftw.org/ ; FFTW: https://en.wikipedia.org/wiki/FFTW
gh topic: fftw: https://github.com/topics/fftw
xtensor-stack/xtensor-fftw is similar to numpy.fft: https://github.com/xtensor-stack/xtensor-fftw
NVIDIA cuFFTW, AMD amd-fftw, Intel MKL FFTW
NVIDIA CuFFT (GPU FFT) https://docs.nvidia.com/cuda/cufft/index.html
ROCm/rocFFT (GPU FFT) https://github.com/ROCm/rocFFT .. docs: https://rocm.docs.amd.com/projects/rocFFT/en/latest/
AMD FFT, Intel FFT: https://www.google.com/search?q=AMD+FFT , https://www.google.com/search?q=Intel+FFT
project-gemmi/benchmarking-fft: https://github.com/project-gemmi/benchmarking-fft
"An FFT Accelerator Using Deeply-coupled RISC-V Instruction Set Extension for Arbitrary Number of Points" (2023) https://ieeexplore.ieee.org/document/10265722 :
> with data loading from either specially designed vector registers (V-mode) or RAM off-the-core (R-mode). The evaluation shows the proposed FFT acceleration scheme achieves a performance gain of 118 times in V-mode and 6.5 times in R-mode respectively, with only 16% power consumption required as compared to the vanilla NutShell RISC-V microprocessor
"CSIFA: A Configurable SRAM-based In-Memory FFT Accelerator" (2024) https://ieeexplore.ieee.org/abstract/document/10631146
/? dsp hardware FFT: https://www.google.com/search?q=dsp+hardware+fft
GPU saw a 10% improvement over the TPU
>The TPU is so inefficient at FTs that the researchers did not use the FFT algorithm on sequences < 4096 elements, instead opting for a quadratic-scaling FT implementation using a pre-computed DFT matrix.
> on an Nvidia Quadro P6000 GPU, the FT was responsible for up to 30% of the inference time on the FNet architecture [0]
This company [0] claimed in 2021 they could squash inference time by 40% if Google would use their light chips on the TPU. Perhaps more if FFTNet does more heavy lifting.
[0]: https://scribe.rip/optalysys/attention-fourier-transforms-a-...
I have been entertaining myself a bit lately by thinking about the ways in which some improvements to a design are very, very interesting to people when it takes 1.2 machines to do a task, not worth paying attention to when it's 6 machines to do the task, and suddenly very interesting again when it's 120 machines to do the task. There's that weird saddle point in the middle where I cannot get anyone else interested in my 20% resource improvements. It's just crickets.
4096 tokens is pretty short by today's standards for transformers too.
I would guess that the FFT scales better as you increase the number of tokens in the context window. Interestingly, Google's models outperform their competitors on context size.
I'm glad someone else had the same thought. I have been wondering what their "secret sauce" is for a while given how their model doesn't degrade for long-context nearly as much as other LLMs that are otherwise competitive. It could also just be that they used longer-context training data than anyone else though.
> faster than FFT
Not only that, but FFT support on TPU has always been best effort. Last I tried this, there were serious precision issues.
Reminds me how CNNs were also not implemented using FFTs.
But this would only work on a very limited number of tokens, right?
Reference for the later part?
The section "3.3 Implementation" is mostly about hardware level speedups, which basically says:
On GPUs the FFT is consistently faster, but on TPUs, for shorter sequences, matrix multiplication was faster.
Yeah, but a comparison of power utilization is needed too. You can build hardware that is better than a GPU at something, e.g. really efficient and fast matmul. However, actual FFT hardware would win on power and speed at large enough n, simply because the number of multiplications matmul does is O(n^3) as opposed to the O(n log n) multiplies the FFT does (complex versus real multiplies notwithstanding).
The FFT is only O(N log N) for a vector of length N. For an N-by-N matrix it would be more like O(N^2 log N), since you perform an FFT for each row or column.
Thank you for that catch.
I still think we are comparing ASIC matmul hardware to non-ASIC FFT hardware. The given TPU hardware does 256x256 matrix multiplication in linear time by using 256x256 multiplier grids. An FFT ASIC could do the same sort of thing but handle a much larger N before memory becomes the bottleneck.
Part of the FFT can be accelerated on GPU hardware, which is full of butterfly-like instructions within warps. Using overlap-and-add/overlap-and-save and cuFFTDx can also make use of heavy parallelism within shared memory. I had a hard time reproducing the tcFFT paper (for lack of time and tensor core skills I guess) but you can also keep your data in Tensor Core registers too, apparently.
The downside of a dedicated ASIC, besides the fixed size of the multipliers (which isn't that big of a deal because matrix multiplication can be broken down into blocks anyway), is that the precision (16-bit, 8-bit) and data format (floating point vs. integer/fixed) are immutable.
Anything is "constant" time if you build big enough hardware and the size is fixed.
The Fourier transform is taken along the “token” dimension. However, in many applications, this dimension is not meaningful. That’s why transformers are a great option for consuming data which is permutation invariant.
I would like to see additional experiments using the lesser known Fourier transform over finite groups [1], which is permutation invariant but shares many properties with the standard Fourier transform.
I also wonder, if this becomes the next big thing for LLMs, how easy it will be for inference engines (e.g. vLLM, llama.cpp) to integrate it.
[1] https://en.wikipedia.org/wiki/Fourier_transform_on_finite_gr...
Not an expert in this space.
Aren't tokens transformed with position dependent information in most models?
I believe llama applies a rotation to the vector based on the position in the input.
That's true in the realm of LLMs. But even in this case, the position information is added only into the first layer. Tokens in later layers can choose to "forget" this information. In addition there are applications of transformers in other domains. See https://github.com/cvg/LightGlue or https://facebookresearch.github.io/3detr/
Transformers like Llama use rotary embeddings which are applied in every single attention layer
https://github.com/huggingface/transformers/blob/222505c7e4d...
What's the finite group in this case?
I'm thinking the integers mod 2^n where n is something computers are good at (8, 32, 64). You have hardware support for the group operation.
That is the traditional Fourier transform, except it can be a cyclic group of any size, doesn't need to be a power of 2. (Though FFTs with 2^n input size are particularly easy to implement.)
And it's not permutation invariant.
I was careless in my thinking, thanks for the correction. I was imagining since you sum the group elements in any order, there is a permutation invariance. But the group elements themselves play the role of the "token index" and group elements are not interchangeable. To actually make this idea interesting, one would have to use a group in which the choice of group element assigned to each input token would not affect the result.
You mean for the group operation to be standard modular addition? In that case (as a sibling comment says) you'll recover the classic discrete Fourier transform.
OK, I admit that the math flies way over my head and I barely understand the text around the math. Can someone please explain (in basic English) how this is equivalent to the attention mechanism? What frequencies does it talk about? How does it encode positional relations between tokens?
- The Fourier Transform is an invertible operator (i.e. it acts on functions; in the discrete case, both the functions and the operators are themselves matrices). It transforms into what we call frequency space.
- This is most intuitive for signal analysis or images [1].
- Frequency space is inherently "complex", i.e. represented by complex numbers.
- Frequencies have the advantage that they take a "global" view of the problem.
- This mechanism is not equivalent to the attention mechanism. There is definitely a trade-off.
- But it is possible that it captures many of the important relationships that attention captures.
- I do not have good intuition for modReLU right away, but it seems important because it modifies the frequencies but preserves the inverse Fourier transform.
[1]: https://homepages.inf.ed.ac.uk/rbf/HIPR2/fourier.htm
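A quick numpy toy to make the first few bullets concrete (nothing from the paper; the shapes are invented):

```python
import numpy as np

x = np.random.randn(8, 64)           # say, 8 "tokens" with 64-dim embeddings
X = np.fft.fft(x, axis=0)            # frequency space along the token axis
print(X.dtype)                       # complex128: frequency space is complex

# The operator is invertible: ifft(fft(x)) recovers x up to float rounding,
# and every frequency component mixes all positions (the "global" view).
x_back = np.fft.ifft(X, axis=0).real
assert np.allclose(x, x_back)
```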
Worth noting that frequency space is often considered one dimensional. Adding the phase is what gives the other dimension.
modReLU seems to just shift the magnitude of the input value while keeping its original polar angle, clipping off negative magnitudes.
Or equivalently rotates a (real) bias term with the input angle and adds that into the original.
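For reference, the usual modReLU definition from the unitary-RNN literature matches that reading; a tiny sketch (not code from the paper):

```python
import numpy as np

def modrelu(z, b):
    # Offset the magnitude by a real bias, clip at zero, keep the phase z/|z|.
    mag = np.abs(z)
    return np.where(mag + b > 0, (mag + b) * z / (mag + 1e-9), 0.0)

z = np.fft.fft(np.random.randn(16))
print(modrelu(z, b=-0.1)[:3])
```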
The actual mechanism at least is quite simple. Essentially it takes the FFT of the input embeddings, multiplies it elementwise with weights that are gotten from the input embeddings using an MLP (plus a constant (but learnable) bias) and then runs it through an activation function and finally takes the inverse FFT.
The "frequencies" are probably something quite abstract. FFT is often used in ways where there aren't really clear frequency interpretation. The use is due to convenient mathematical properties (e.g. the convolution theorem).
Rather amazing if this really works well. Very elegant.
Essentially leveraging Convolution theorem. Same philosophy pops up in many places, e.g. DFT calculations
Yep, it's very common to use this. We used it for grid-based pairwise interactions since it turns an N^2 op into N log N.
That sounds just like Particle Mesh Ewald, which we use in molecular dynamics to approximate the forces of pairwise interactions (interpolated on a grid). https://en.wikipedia.org/wiki/P3M
It's similar but I worked on magnetic spin systems with dipole-dipole interactions, so there wasn't the interpolation part, and as I understand it in Ewald summation you're always assuming periodic boundary conditions.
In our spin systems you basically pre-compute the interaction kernel tensor and can either take into account periodicity or ignore it depending on what sort of system you're looking at. Often you don't want the periodic effect since the dipole-dipole interaction is only one of many, much of the interesting phenomena in magnetics is in the interplay between short range forces and the long range forces. At each time step you FFT to the magnetisation tensor and then multiply with the interaction tensor, then iFFT.
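The generic pattern being described (circular convolution of a field with a precomputed interaction kernel via the convolution theorem) looks roughly like this 1D toy, with an invented kernel and names:

```python
import numpy as np

n = 256
field = np.random.randn(n)                 # e.g. a field sampled on a periodic grid
kernel = 1.0 / (1.0 + np.arange(n))        # some precomputed long-range interaction kernel

K = np.fft.rfft(kernel)                    # precompute the kernel's spectrum once

def interaction(f):
    # Convolution theorem: conv(f, kernel) = ifft(fft(f) * fft(kernel)),
    # i.e. O(n log n) per step instead of the O(n^2) pairwise sum.
    return np.fft.irfft(np.fft.rfft(f) * K, n=n)

print(interaction(field).shape)            # (256,)
```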
Sorry, added the convolution theorem part in an edit after your comment.
I’m still confused. Does it treat the input tokens as a sampled waveform?
I mean, say I have some text file in ASCII. Do I then just pretend it's raw wav and do an FFT on it? I guess it can give me some useful information (like does it look like any particular natural language or is it just random; sometimes used in encryption analysis of simple substitution ciphers). It feels surprising that a reverse FFT can give a coherent output after fiddling with the distribution.
Do keep in mind that FFT is a lossless, equivalent representation of the original data.
As I understand it, the token embedding stream would be equivalent to multi-channel sampled waveforms. The model either needs to learn the embeddings by back-propagating through FFT and IFFT, or use some suitable tokenization scheme which the paper doesn't discuss (?).
It seems unlikely to work for language.
It embeds them first into vectors. The input is a real matrix with (context length)x(embedding size) dimensions.
No. The FFT is an operation on a discrete domain; it is not the continuous FT. In the same way audio waveforms are processed by an FFT, you bucket frequencies, which is conceptually a vector. Once you have a vector, you do machine learning like you would with any vector (except you do some FT in this case; I haven't read the paper).
Most likely the embedding of the token is passed through FFT
I am not an expert by _any_ means, but to provide _some_ intuition — self-attention is ultimately just a parameterised token mixer (see https://medium.com/optalysys/attention-fourier-transforms-a-...) i.e. each vector in the output depends upon the corresponding input vector transformed by some function of all the other input vectors.
You can see conceptually how this is similar to a convolution with some simplification, e.g. https://openreview.net/pdf?id=8l5GjEqGiRG
Convolutions are often used in contexts where you want to account for global state in some way. - https://openreview.net/pdf?id=8l5GjEqGiRG
I don't see how you could fit causal masking into this framework without having to do n different FFTs, and there's no mention of positional embeddings either, so I guess the self-attention implementation being compared against is noncausal NoPE, which would make this a case of baseline sandbagging and maybe not so impressive.
If the results were close to state-of-the-art, probably the author would've mentioned it?
They do show their model as winning every category in Long Range Arena (LRA) benchmark. Hopefully they have not excluded losing categories or better models.
Winning against their own baseline, not against the current best-performing model. Which apparently is S5 currently https://paperswithcode.com/sota/long-range-modeling-on-lra with 87.46 overall vs. 58.31 here.
Should be a relevant reference: https://arxiv.org/abs/2111.13587
"Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers" by John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro
Does anyone have an intuition about why looking at things in the frequency domain is helpful here? The DC term I can see, but I wouldn't expect the input data is periodic enough that other frequencies would be meaningful.
I see no mention of the prior work on the Hyena operator, which already demonstrated O(n log n) full-context mixing several years ago.
https://arxiv.org/abs/2302.10866
Hyena came out of Albert Gu's prior work in the same lab. https://arxiv.org/abs/2111.00396
I think in the age of telemetry we are also missing a substantial trick by not applying FFTs to cloud telemetry to smoke out epicycles and metastable systems before rather than after they trigger drama. This is unfortunately within my level of notice but not within my level of skill, and my dance card is already full.
"SLA's are most likely to be violated 23-25 minutes after a service deployment. Hmm, I wonder why that is... Oh no."
"I'm afraid I cannot deploy your application, Dave"
Jokes aside, one area where this could be really worth money is predicting cycles of traffic and saving with ramp-up and ramp-down of server instances. It's the kind of work the company would never greenlight if you proposed doing it yourself, but if you packaged it as a shelf product they would totally buy it.
Yep. I worked on a SaaS for one particular retail industry, and I would bet you anything they see more traffic when biweekly paydays and the calendar line up in certain ways. Either a couple days before (window shopping) or a few days after, once other bills are paid.
That extra half paycheck when a month has 5 Mondays or Fridays in it...
I sort of grasp big O notation...but this is sort of over my head like most stuff that has to do with computer or electrical engineering.
As someone who is absolutely terrible at math, I envy the people who grasp or at least can learn this type of stuff and get an engineering degree and license.
All I really know about the FFT is that it changes a signal, it's somehow used in processing signals of some kind, and, from what I heard, it was apparently the key to detecting nuclear detonations back in the day.
Having a decent intuitive notion of Fourier transforms is an incredibly useful tool in your toolbox, even if you can't derive a Fourier transform by hand or write a fast Fourier transform (FFT) algorithm.
The basic idea is this: (almost) any (useful) signal can be represented as a sum of sine waves with different frequencies and phases. For example, an electrical signal or a sound wave is a one-dimensional signal where the x-axis is time. This might look like a really complex squiggly line that's hard to work with. Using a Fourier transform, you can separate the individual frequencies of that time-based signal. Then, you can modify the specific frequencies however you want. For example, if you have a lot of random, spiky noise in the signal, that will show up as high frequencies. To clean it up, just do a Fourier transform, throw out any data with a frequency above a certain threshold, and then run an inverse Fourier transform on the remaining data to get back a smoother version of the original signal. This is called a low-pass filter, and it's more or less equivalent to taking a moving average of the original signal.
Where it gets really fun is that you can extend this, in a pretty straightforward way, to higher dimensions. A two-dimensional signal, where both the x- and y-axes are space, is just an image. JPEG compression is based on this concept: it removes the high-frequency signal in the image in order to store the data in a more compact form, at the expense of losing some fine detail (or creating those ring-like artifacts, if you throw out too much data). Add a third dimension for time, and now you have video. And so on.
The nice thing about all this is that it's very visual, so you can get a good intuition for it without having to know all the math inside and out. Here's a good page with lots of visualizations and interactive examples: https://www.jezzamon.com/fourier/index.html
And this 3Blue1Brown video does a good job of explaining it: https://youtu.be/spUNpyF58BY?si=dz0z-s8NftW3Htun
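To make the low-pass example concrete, a toy numpy sketch (unrelated to the paper):

```python
import numpy as np

t = np.linspace(0, 1, 500, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t) + 0.3 * np.random.randn(t.size)  # 3 Hz tone + noise

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
spectrum[freqs > 10] = 0                     # throw out everything above 10 Hz
smoothed = np.fft.irfft(spectrum, n=t.size)  # back to the time domain: a low-pass filter
```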
The simple explanation: Lets say you have a one-dimensional time-domain signal, e.g. an audio signal measured by a microphone, which measures the displacement of air vs. time at a fixed point (assuming you dont move the microphone I guess)
The Fourier transform (of which the FFT is a discrete version) decomposes that 1D time-domain signal (e.g. an audio signal, time vs. displacement) into frequency vs. magnitude and phase components.
The frequency is basically the pitch. So for a pure sine wave, or pure tone (which sounds like those "off air" TV signals we used to get late at night back in the day), you get a bunch of zeros and a single "spike" at the frequency of the tone. The larger the amplitude of the signal, the larger the magnitude of the spike will be. As the pitch (frequency) increases/decreases, the location of this spike moves up/down along the horizontal axis.
The phase is basically the time offset of the signal. A tone which was delayed somehow will show up as a different phase. Note this is a relative measure - not absolute. So you won't be able to tell if the signal was offset by 1s or 2s, etc. because it has units of radians(angle), which have to "reset" as the angle wraps around the circle.
So for one signal (time vs. amplitude), you actually get two pieces of information (frequency vs. magnitude/phase)
However, if you understand imaginary numbers/complex variables, those two signals are really just the magnitude and argument of the FFT output, which is a complex-valued function.
https://m.youtube.com/watch?v=spUNpyF58BY
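Roughly, in numpy (a toy example with made-up numbers):

```python
import numpy as np

fs = 1000                               # sample rate, Hz
t = np.arange(fs) / fs                  # one second of samples
tone = np.sin(2 * np.pi * 50 * t)       # a pure 50 Hz tone

spectrum = np.fft.rfft(tone)
freqs = np.fft.rfftfreq(tone.size, 1 / fs)
peak = np.argmax(np.abs(spectrum))
print(freqs[peak])                      # 50.0: a single spike at the tone's pitch
print(np.angle(spectrum[peak]))         # its phase (relative time offset), in radians
```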
At a high level, it's very unintuitive to me that transforming to the frequency domain for analyzing long sequences of tokens--when there is in general no expectation that the sequences have a periodic structure--can improve efficiency.
Stated another way, how can it be possible that it is more efficient to translate the sequence into a series of N variables, where the nth variable is the sum of every nth term of the sequence, if it is unlikely that any relationship between these variables holds for any fixed period? If I combine the 1st, 4th, 7th, 10th, ... elements of the sequence, how do we expect the addition of anything beyond the first two elements to add anything but noise?
Stated yet another way, if I'm going to approximate a function as a sum of sine waves, this is most efficient when the function is periodic, and it requires more and more sine waves to approximate the function on a larger and larger domain when the function is not periodic.
What I don't get about attention is why it would be necessary when a fully connected layer can also "attend" to all of the input. With very small datasets (think 0 - 500 tokens), I found that attention makes training longer and results worse. I guess the benefits show up with much larger datasets. Note that I'm an AI noob just doing some personal AI projects, so I'm not exactly a reference.
A fully connected layer has different weights for each feature (or position in input in your formulation). So the word "hello" would be treated completely differently if it were to appear in position 15 vs. 16, for example.
Attention, by contrast, would treat those two occurrences similarly, with the only difference depending on positional encoding - so you can learn generalized patterns more easily.
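One way to see the difference: without positional encodings a single attention head is permutation-equivariant, so a token gets the same treatment wherever it appears, whereas a dense layer over the flattened sequence ties its weights to specific positions. A toy check (hypothetical shapes and random weights):

```python
import numpy as np

def attn(X, Wq, Wk, Wv):
    # One attention head, no positional encoding.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = np.exp(Q @ K.T / np.sqrt(K.shape[1]))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

perm = np.array([4, 2, 0, 1, 3])
out, out_perm = attn(X, Wq, Wk, Wv), attn(X[perm], Wq, Wk, Wv)
print(np.allclose(out[perm], out_perm))          # True: position doesn't change the treatment
```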
I think that this is the explanation I needed, thanks!
This is the case with most clever neural architectures: in theory, you could always replace them with dense layers that would perform better with enough resources/training, but that's just it, efficiency matters (number of parameters, training data, training time, FLOPS) and dense layers aren't as efficient (to put it mildly).
You have seen this play out on a small scale, but if you calculate the size of the dense layers necessary to even theoretically replicate a big attention layer or even convolution, to say nothing of the data needed to train them without the help of the architecture's inductive bias, you will see that the clever architectures are quite necessary at scale.
Attention grows dynamically with the input size; MLPs do not.
Working in the Fourier domain has been a staple of scientific and engineering applications. Learning those interactions rather than just hardcoding them has been fairly widely explored as well - the term to look for is Fourier Neural Operators [1][2]. It turns out you can prove universality even if the nonlinearity remains in the real domain [3].
The concept is fairly mainstream nowadays, to the degree that Jensen talked about it in his GTC keynote in 2021 [4] and there’s even a mainstage TED talk about its applications [5].
A nice property of doing things this way is that your model ends up being resolution-invariant which is particularly interesting for engineering domains. Scaling these methods has sparked the "let’s do a fully deep-learning-based weather model"-race [6][7].
As for using this on text data: my intuition would be that is going to not work as well because of a fairly unique property of text: for image, video and scientific data each individual element is of approximately equal importance, whereas in text you can have discrete tokens like a "not" somewhere in there that change the meaning of everything around it fairly significantly and you’d want that all to all interaction to capture that. Any kind of mixing that smoothes things out is going to inherently be at a disadvantage - probably true to some degree for most of those efficiency saving methods and why we’re seeing more limited adoption on text.
[1] https://arxiv.org/abs/2010.08895
[2] https://www.nature.com/articles/s42254-024-00712-5
[3] https://jmlr.org/papers/v22/21-0806.html
[4] https://www.youtube.com/watch?v=jhDiaUL_RaM&t=2472s
[5] https://www.ted.com/talks/anima_anandkumar_ai_that_connects_...
[6] https://arxiv.org/abs/2202.11214 (Feb 2022)
[7] https://www.wired.com/story/ai-hurricane-predictions-are-sto...
Well done!
Besides immediate speed gains, I guess this opens the door to ultra-long contexts. Larger than say, 16M tokens.
How does this compare with Mamba? Intuitively it feels like this should have similar performance to Mamba, but the math is a lot simpler and easier to understand.
Am I wrong that this will be incompatible with the OrthoGrad [1] optimizer?
[1] https://arxiv.org/abs/2501.04697
These seem like orthogonal developments to me that could easily be combined. What made you think they might be incompatible?
Isn't FlashAttention already O(n log n)?
No https://arxiv.org/abs/2402.07443
Only in space complexity, not in time complexity.
Since I believe consciousness itself is made of EMF waves, generated by neural activity (rather than synaptic firings themselves, which I view merely as signal carriers like the I/O to/from brains), I'm glad to see it any time FFTs are used in any way in NNs or AI research.
I started to develop my own custom type of MLP (multilayer perceptron) that was going to use frequencies and phase angles (FFT) as the "model weights", but then I decided it would probably only outperform the standard MLP if the training data itself was periodic in nature, rather than language tokens or even image data. Not sure if that's correct, since the Fourier series shows us that ANY arbitrary function can be represented as a superposition of waves.
I still believe that if we do achieve something amazing (i.e. competitive with SOTA AI models) with a wave-based NN, it won't create any 'true' qualia, however, because simulating EMF waves in a computer is not the same as real EMF waves existing. I think even a 100% perfect simulation of a brain in a computer, for example, will always be a 'zombie' (no qualia). This is obvious if consciousness is indeed made of waves; but it's astounding how many NN researchers are illiterate enough in the field of neuroscience that they don't realize how much evidence there is that consciousness is a wave phenomenon.
Amazing. I had a similar insight and did exactly this some time ago, transforming the input into the frequency domain in order to preserve input energy and encode information in the complex space, plus filtering based on global context (different filter implementation than discussed in this paper) but only trained the model locally and didn't bother seeing how it scaled. Congrats on the hard work!
I'm interested in how this would work for generative models. It's not obvious how you'd implement causal masking in the frequency domain. And the modReLU activation seems critical but adds implementation complexity. Would love to see how this scales on truly massive context lengths where the theoretical advantages should really shine.
This seems to follow the similar FNO work by Nvidia, and switching to the frequency domain is usually in any computer scientist's toolbox at this point. However, I'm curious whether this translates to real gains for real architectures. The FFT makes use of imaginary numbers to encode the harmonics of the signal, and these are generally not amenable to GPU architectures. Would a fast Walsh-Hadamard transform suffice? Sometimes the 'signal mixing' is more important than the harmonics of a composition of sines. Or do we go further down the rabbit hole of trained transformations and try out wavelets? I am an avid FFT fan (love the fast Johnson-Lindenstrauss transform using the embedded uncertainty principle for RIP), but sometimes real hardware and good theory don't always align (e.g. there are sub-ternary matrix multiplies, but they are rarely used in DL).
Complex numbers work just fine on a GPU. You just represent the data as a real-part matrix and an imaginary-part matrix and do the (ac-bd)+(ad+bc)i stuff in matrix land instead of on complex scalars.
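For example (a toy demonstration of the identity, not a GPU kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))

a, b = A.real, A.imag
c, d = B.real, B.imag
real_part = a @ c - b @ d      # (ac - bd)
imag_part = a @ d + b @ c      # (ad + bc)

assert np.allclose(A @ B, real_part + 1j * imag_part)
```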
What's wrong with complex numbers on GPUs? You don't have to do anything special. It's obviously faster if you can make simplifying assumptions like "the input signal is purely real" but otherwise at worst you're dealing with pairs of reals (or floats) and don't have to think about philosophical implications.
https://docs.nvidia.com/cuda/cufft/
GPUs don't implement complex-number FP math; you have to bolt it on as extra logic. cuFFT works because you can recursively predict the imaginary and real component paths in the butterfly network. Between layers you have fft->ifft; is this cost worth it memory-locality-wise, or is it better to find ways to tamp down n in n^2 self-attention by windowing, batching, gating, and many other solutions? I'm not saying this work isn't cool. FNOs are really cool, especially for solving PINNs and related continuous problems. Are LLMs continuous problems? Does n have to span the entire context window? I'll probably end up experimenting with this since they've made the code available, but sometimes good theory is good theory, and not necessarily practical.
Hm. Nothing that happens in the Fourier domain can touch the phases, which seems like a constraint that would change the behavior of other layers.
The default bias of -0.1 with ReLUs, and what I would expect to be a flattish spectrum, also seems like it would make for a sparse representation in the Fourier domain.
I assume this is learning the text embeddings at training time; if so, I'd be curious how the constraints of going through the FFT and filtering magnitudes would/could change how the embeddings look.
I am very out of my depth here haha with the math, appreciate those below taking the time to baby walk me along with the understanding! Great links too!
Makes me think what if NNs are treated as black box signal processing units. What other techniques can we borrow from signal processing?
In image processing at least, NNs typically learn a Fourier or wavelet representation in their first layers. The biggest benefit of applying a transformation beforehand is to reduce training time / obtain better generalization by "removing the dimension that doesn't matter".
E.g. in a suitable space, one coordinate could represent the rotation of an object. You could do the transform and discard this dimension if your NN should be rotation invariant.
In image processing I thought there was a whole host of specialized algorithms, such as edge detection, SCC, etc. that were run before the data was even fed into the ANN.
This guy has had a lot of bangers lately.
Overall, I like this guy's papers, but they strike me as coming from someone who is very smart but hasn't looked through the literature carefully. Many of the techniques he is proposing were already done about 5-6 years ago. However, I imagine that because the field is flooded with new humans, they are not aware of this research, or think it will lead to a fruitful end (when many other researchers have already thought of it and it didn't lead anywhere, hence why it was abandoned). Overall, it seems we are starting to recycle ideas because there isn't enough lit review, and/or mentoring from senior deep learning / ML folks who can quickly look at a paper and tell the author where the work has already been investigated.
Reviving old ideas and comparing them to SOTA is not necessarily bad, especially if they provide benefits over the SOTA model. It brings the old ideas into the community idea cache, if you will. It's somewhat annoying if the authors do it thinking it's a novel idea when in fact it's a 20-30 year old one.
This reminds me of some HN comments about rocketry ideas and in the thread one of the comments was “Everything in rocket science has been theorized/tried by some Russian scientist 40-50 years ago” and it still gives me a chuckle.
Yes, but they should at least reference the older papers and explain "whats new"
> Overall, it seems we are starting to recycle ideas because there isnt enough lit review and or mentoring from senior deep learning / ML folks who can quickly look at a paper and tell the author where the work has been already investigated.
Arguably, the literature synthesis and knowledge discovery problem has been overwhelming in many fields for a long time; but I wonder if, in ML lately, an accelerated (if not frantic) level of competition may be working against the collegial spirit.
I think it's been accelerated by the review community being overwhelmed and the lack of experienced researchers with the combination of classic ML, deep learning, transformers, and DSP backgrounds -- a rare breed but sorely needed.
But does it allow unlimited context windows, like RoPE does, which LLaMa 2 now features?
Can someone confirm the big O time complexities of
1. traditional Self-Attention;
2. Flash-Attention?
3. Any novel others?
Ok, what FlashAttention changes is space complexity: from O(N^2) to O(N). Time complexity is still ~O(N^2) as with standard Self-Attention.
In other words, it optimizes practical runtime through I/O reduction without altering the asymptotic complexity.
Mamba is O(n). But I guess it has other drawbacks.
Actually it's a little more nuanced:

Operation Type                      | Mamba Complexity | Transformer Complexity
Training (per iteration)            | O(L)             | O(L^2)
Autoregressive inference (per step) | O(T)             | O(L)
Memory requirements                 | O(C)             | O(L)
Where: L stands for the sequence length. T denotes a fixed constant that accounts for compression and selection time in Mamba's autoregressive inference. C reflects the fixed size of the SSM (State Space Model) latent state in Mamba
Per: https://github.com/state-spaces/mamba/issues/196
Yes, in Mamba accuracy seems to go down, and it has trouble with exact token recall. But I would say it might be good for very power-efficient edge deployment and ultra-long contexts.
Mamba is solving a different problem than transformers.
What Mamba does is take an initial state s_0 and an input u_0 and produce a new state s_1 and an output o_1. It's basically modeling a very complicated state machine. I can easily think of half a dozen applications where this is exactly what you want and it is better than transformers, but LLMs are not among them. Essentially most control problems boil down to what Mamba does. In fact, I would say that Mamba as an architecture is probably the ne plus ultra for modeling mechanical system dynamics.
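That "complicated state machine" view is essentially a learned state-space recurrence; a stripped-down toy sketch (ignoring Mamba's input-dependent selectivity and discretization, with invented dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
state_dim, in_dim, out_dim = 16, 4, 4
A = rng.normal(size=(state_dim, state_dim)) * 0.05   # state transition (small for stability)
B = rng.normal(size=(state_dim, in_dim))             # how the input enters the state
C = rng.normal(size=(out_dim, state_dim))            # how the output is read from the state

def step(s, u):
    s_next = A @ s + B @ u       # new state from old state and input
    return s_next, C @ s_next    # and an output

s = np.zeros(state_dim)
for _ in range(10):
    s, o = step(s, rng.normal(size=in_dim))
print(o.shape)                   # (4,)
```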
My EE education keeps coming back.
I'll never not read FFT as Final Fantasy Tactics.
I came here just to see if anyone else shared this same experience. HN did not disappoint.
Very interesting idea.
Is this why biological neural networks are spiking?
If simping is bad, doesn't that mean polynomial time complexity would be the best?
TL;DR:
1. Take FNet (https://arxiv.org/abs/2105.03824).
2. Replace the fixed (frequency-domain) convolution filter with one that is dynamically computed from the data.
3. Apply non-linear functions to both real and imaginary components, before mapping the convolved data back to the time domain.
Man, this all really gets wrapped in increasingly complex math…
This is actually simpler than most self attention methods.
Deep learning actually simplifies the extremely complex math of previous machine learning and statistics/stochastics into a very reasonable set of operations:
matrix multiplications and some very simple activation functions (plus automatic derivatives, some magic, and some scientific glasses which you can ignore)
Machines going to be thinking in the frequency domain. We cooked.
Our bodies have parts that do processing in frequency domain. Who knows, maybe it'll turn out we are thinking in the frequency domain, too.
Yes, the cochlea for one basically performs an FT on the incoming input signal, at least with respect to magnitude. The phase portion is still in the time domain