EP26 - Exponentially Faster Language Modelling

Download the paper - Read the paper on Hugging Face

Charlie: Hey everyone, welcome to episode 26 of Paper Brief, where we dive into the details of cutting-edge research papers. I’m Charlie, your podcast host, and with me today I have Clio. She’s a wizard when it comes to machine learning and is going to help us unpack the intricacies of this incredible paper we’re discussing today.

Clio: Thanks for the introduction, Charlie! Excited to be here and chat about this fascinating paper titled ‘Exponentially Faster Language Modelling’.

Charlie: Let’s jump right in. Can you tell us a bit more about the main idea behind UltraFastBERT, the variant of BERT that this paper introduces?

Clio: Absolutely! UltraFastBERT is a game-changer because it engages only a tiny fraction of its neurons, about 0.3%, during inference, while still delivering performance on par with comparable BERT models. Imagine that: just 12 out of 4095 neurons per layer for each inference, all thanks to replacing the traditional feedforward networks with fast feedforward networks, or FFFs.
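
For listeners who want to see the shape of this in code, here’s a minimal PyTorch sketch of a fast feedforward layer as described above. It’s illustrative only: the class name, the sizes, the GELU choice, and the routing rule are assumptions for the sake of the example, not the authors’ implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFFSketch(nn.Module):
    """Toy fast feedforward (FFF) layer: neurons sit in a balanced binary tree,
    and each token only evaluates the neurons along one root-to-leaf path."""

    def __init__(self, d_model: int, path_len: int = 12):
        super().__init__()
        self.path_len = path_len                  # neurons touched per token
        n_nodes = 2 ** path_len - 1               # total neurons, e.g. 4095
        self.w_in = nn.Parameter(torch.randn(n_nodes, d_model) / math.sqrt(d_model))
        self.w_out = nn.Parameter(torch.randn(n_nodes, d_model) / math.sqrt(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); handled token by token for clarity, not speed
        rows = []
        for xb in x:
            out = torch.zeros_like(xb)
            node = 0                                           # start at the root neuron
            for _ in range(self.path_len):
                pre = torch.dot(xb, self.w_in[node])           # neuron pre-activation
                out = out + F.gelu(pre) * self.w_out[node]     # contribute to the output
                # the sign of the pre-activation picks the next (child) neuron
                node = 2 * node + (1 if pre.item() > 0 else 2)
            rows.append(out)
        return torch.stack(rows)

layer = FFFSketch(d_model=768, path_len=12)   # 4095 neurons, 12 used per token
print(layer(torch.randn(4, 768)).shape)       # torch.Size([4, 768])
```

The key point of the sketch is that only the weights along one path through the tree are ever read for a given token, even though the layer as a whole holds all 4095 neurons.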

Charlie: That’s insanely efficient. But how do they achieve this kind of speedup?

Clio: The paper mentions high-level CPU code that achieves a 78-times speedup over traditional feedforward implementations, and even a PyTorch implementation that delivers a 40-times speedup. They’ve essentially trimmed the forward-pass time complexity down to the order of the logarithm of the number of neurons.

Charlie: Logarithmic time complexity? That’s the dream for any computer scientist! Does this compress the size of the model too?

Clio: Not really. The model still holds 4095 neurons in each feedforward layer; it just engages far fewer of them for any single inference.
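
To ground those numbers with quick arithmetic: a tree of 2^d − 1 neurons only evaluates the d neurons along one root-to-leaf path, which is where the 12-out-of-4095, roughly 0.3%, figure comes from. A tiny illustrative snippet:

```python
# Neurons evaluated per token along one root-to-leaf path of an FFF tree,
# versus the total number of neurons in the tree.
for d in (8, 10, 12):
    total = 2 ** d - 1
    print(f"path of {d:2d} neurons out of {total} total ({d / total:.2%})")
# the d = 12 case gives 12 of 4095, i.e. about 0.29%, the figure quoted above
```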

Charlie: I’m curious, though. They mention something about conditional neural execution. Can you expand on that?

Clio: Conditional neural execution is essentially how UltraFastBERT works. It performs matrix multiplication conditionally, so the outcome of earlier operations determines which neurons are used next. Different inputs can end up engaging different neurons, but no single input requires more than a handful of them.
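
Here’s a rough NumPy illustration of that contrast, with toy sizes and a ReLU-style activation picked purely for the example: in a dense layer every weight row touches the input, while in conditional matrix multiplication the result of each dot product decides which row is used next.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, path_len = 64, 6
n_neurons = 2 ** path_len - 1                    # 63 neurons in this toy layer
W_in = rng.standard_normal((n_neurons, d_model))
W_out = rng.standard_normal((n_neurons, d_model))
x = rng.standard_normal(d_model)

# Dense matrix multiplication: every neuron's weights are applied to the input.
dense_out = np.maximum(W_in @ x, 0.0) @ W_out

# Conditional matrix multiplication: each dot product selects the next row,
# so only `path_len` of the rows of W_in / W_out are ever touched.
cmm_out = np.zeros(d_model)
node, touched = 0, 0
for _ in range(path_len):
    pre = W_in[node] @ x                         # one row at a time
    cmm_out += max(pre, 0.0) * W_out[node]
    node = 2 * node + (1 if pre > 0 else 2)      # the result picks the next neuron
    touched += 1
print(f"rows touched: {touched} of {n_neurons}") # rows touched: 6 of 63
```

The two outputs aren’t meant to match; the point is that the conditional version only ever looks at a logarithmic number of weight rows.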

Charlie: That’s pretty wild. But with this kind of optimization, why aren’t they seeing even greater speedups, like hundreds or thousands of times faster?

Clio: It’s all about implementation maturity. Dense matrix multiplication is one of the most heavily optimized operations in the history of software, whereas there’s no efficient native implementation of CMM, conditional matrix multiplication, yet. The current speedup is limited by having to build CMM on top of high-level linear-algebra routines.

Charlie: Got it. And for all the fellow nerds listening in, is there a way for us to tinker with this model?

Clio: Absolutely, they’ve made everything available. You can find the training code, benchmarking setup, and even the model weights on their GitHub page.

Charlie: Great stuff! Thanks, Clio, for shedding light on UltraFastBERT. This has been quite enlightening. To our listeners, check out the paper if you’re into the details of language models, and perhaps take UltraFastBERT for a spin yourself.

Clio: My pleasure, Charlie. Thanks for having me, and I hope everyone’s inspired to explore the potential of language models further.

Charlie: And that wraps up episode 26 of Paper Brief. Keep it locked here for more discussions on the hottest research papers. Until next time, stay curious!