All else being equal, FTs have two computational advantages over attention.
FTs can be computed fairly efficiently on graphics processing units (GPUs) thanks to the Fast Fourier Transform (FFT) algorithm, which uses a divide-and-conquer approach to reduce the quadratic algorithm written above to O(n log(n)) time. Attention cannot be accelerated in this way and thus has undesirable scaling properties of the order O(n²).
FTs are not parameterised, meaning that we can reduce a model’s memory footprint by replacing attention with FTs, as we do not need to store key, query and value matrices.
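Both advantages can be seen in a minimal sketch of FNet-style token mixing: a 2D FFT over the sequence and hidden dimensions, keeping only the real part. Note this is an illustrative NumPy reconstruction of the mixing step, not the authors' code; the real FNet applies it inside a full transformer block.

```python
import numpy as np

def fourier_mixing(x):
    """FNet-style token mixing: FFT over the sequence and hidden
    dimensions, keeping the real part. No learned parameters, so
    there are no key, query or value matrices to store."""
    return np.fft.fft2(x).real

# Toy input: a sequence of 8 tokens with hidden size 4.
x = np.random.randn(8, 4)
y = fourier_mixing(x)
assert y.shape == x.shape  # mixing preserves the input shape
```

Because the mixing is a fixed transform, swapping it in for attention removes parameters rather than adding them, and the FFT gives the O(n log(n)) runtime mentioned above.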
The researchers reported that FNet trained faster than BERT (80% faster on GPUs, 70% faster on TPUs) and with much greater stability. Peak performance, measured by accuracy on the GLUE benchmark, was 92% that of BERT.
They also proposed a hybrid architecture that replaced all but the last two attention layers in BERT with FT layers, which reached 97% relative accuracy with only a minor penalty in training time and stability. This is a highly encouraging result: it seems the undesirable scaling properties of transformers can be avoided if we employ more tactful approaches to token mixing. This is especially important for resource-constrained environments.
How optical FTs can change the game
Optalysys have created the world's most efficient FT core, based on an optical process. It not only reduces the algorithmic complexity of the FT from the FFT's O(n log(n)) scaling to constant time, O(1); it also runs at a fraction of the power consumption of a digital electronic circuit. Compared with the most efficient GPU implementation we have seen, on the Nvidia A100, the optical approach is two orders of magnitude better.
We ran some experiments on the Huggingface implementation of FNet and found that on an Nvidia Quadro P6000 GPU, the FT was responsible for up to 30% of the inference time of the FNet architecture. Taking this into account, any GPU that integrates optical FT functionality would receive an instant speedup/efficiency gain on these networks of around 40%… and this is on an architecture that has been designed with traditional GPUs in mind.
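The 40% figure follows from Amdahl's law: if the 30% of runtime spent on FTs drops to (near) zero, the remaining 70% determines the new runtime, giving roughly a 1.43× speedup.

```python
def amdahl_speedup(fraction_removed):
    """Overall speedup when a fraction of the runtime is
    reduced to (near) zero, per Amdahl's law."""
    return 1.0 / (1.0 - fraction_removed)

# Offloading the FT (~30% of FNet inference time) to optics:
print(round(amdahl_speedup(0.30), 2))  # prints 1.43, i.e. ~40% faster
```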
Interestingly, Google’s TPU was not able to run FNet as efficiently as a GPU (relative to BERT): the GPU trained FNet 80% faster than BERT, while the TPU’s improvement was 70%. This is because the TPU is not at all optimised for FTs.
The TPU’s primary objective is to accelerate the multiply-and-accumulate operations (MACs) used for matrix multiplication. The TPU is so inefficient at FTs that the researchers did not use the FFT algorithm on sequences shorter than 4096 elements, instead opting for a quadratic-scaling FT implementation using a pre-computed DFT matrix. If the next-generation TPU were to integrate something like the technology we are developing at Optalysys, then its FT processing efficiency could be raised by a factor of 1000.
Optalysys vision and Beta program
We see a huge potential to combine the strengths of free-space optical computing with most existing AI accelerators. Indeed, optical hardware can work in tandem with a whole range of electronic processors, providing faster and more efficient processing solutions.
In order to realise this ambition, we are interested in working with third parties to create next-generation AI and encryption systems leveraging the optical FT. Access to the Optalysys optical systems is not yet open to everyone, but for those interested, please contact us at optalysys.com for access to the beta program for benchmarking and evaluation.