The Register

AI + ML

Boffins detail new algorithms to losslessly boost AI perf by up to 2.8x

New spin on speculative decoding works with any model - now built into Transformers

Tobias Mann
Thu 17 Jul 2025 // 10:03 UTC

We all know that AI is expensive, but a new set of algorithms developed by researchers at the Weizmann Institute of Science, Intel Labs, and d-Matrix could significantly reduce the cost of serving up your favorite large language model (LLM) with just a few lines of code.

Presented at the International Conference on Machine Learning this week and detailed in this paper, the algorithms offer a new spin on speculative decoding that they say can boost token generation rates by as much as 2.8x while also eliminating the need for specialized draft models.

Speculative decoding, if you're not familiar, isn't a new concept. It works by using a small "draft" model ("drafter" for short) to predict the outputs of larger, slower, but higher quality "target" models.

If the draft model can successfully predict, say, the next four tokens in the sequence, that's four tokens the bigger model doesn't have to generate, and so we get a speed-up. If it's wrong, the larger model discards the draft tokens and generates new ones itself. That last bit is important as it means the entire process is lossless — there's no trade-off in quality required to get that speed-up.

The whole concept is a bit like predictive text on a modern smartphone. As you type, it tries to guess what you're going to say next. When it's right, you can complete the sentence with a single tap; when it's wrong, you just type it out yourself.
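The draft-then-verify cycle can be sketched in a few lines. This is a toy illustration: stand-in functions play both models (real systems run neural LLMs for both roles, and use a rejection-sampling acceptance rule when sampling rather than greedy decoding):

```python
import random

def target_next(seq):
    # Deterministic stand-in for the large, slow target model.
    return (seq[-1] * 31 + 7) % 100

def draft_next(seq):
    # Stand-in drafter: right most of the time, occasionally wrong.
    guess = target_next(seq)
    return guess if random.random() < 0.8 else (guess + 1) % 100

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        # 1. The drafter cheaply proposes k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target verifies them, accepting the longest correct prefix.
        accepted = []
        for tok in draft:
            if tok == target_next(seq + accepted):
                accepted.append(tok)
            else:
                break
        # 3. The target then emits one token itself, so the output is
        #    identical to pure target decoding -- the speed-up is lossless.
        accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:len(prompt) + n_tokens]
```

Because every accepted token was verified against the target, the result matches what the target would have generated on its own; the drafter only changes how fast you get there.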

In practice, speculative decoding can effectively double or even triple token generation, depending on the application. But as wonderful as 3x the tokens for the same amount of compute might sound, the trick is finding a compatible draft model.

One of the challenges to the adoption of speculative decoding is that the two models' vocabularies — i.e. their dictionaries — have to match. Unless the model you're trying to run happens to have a smaller variant, taking advantage of speculative decoding has often required training specialized draft models. Making matters worse, these specialized draft models have to be retrained every time a new target model, say a new version of Llama, comes out, Nadav Timor, a PhD student at the Weizmann Institute, tells El Reg.

Universal draft model

The algorithms aim to overcome this limitation by enabling any model to serve draft duty regardless of whether the vocabularies are the same or not.

To do this, the researchers explored three distinct approaches to the problem. The first of these, called Token-Level Intersection (TLI), is essentially the equivalent of running diff on the two models' vocabularies to figure out which words the drafter should avoid. This way the draft model only predicts tokens that are also in the target model's vocabulary.

So long as there's sufficient overlap in the models' vocabularies, the rate at which the draft model's predictions are accepted stays high. Using this approach, the researchers observed a 1.7x speed-up over conventional autoregressive decoding, where the entirety of the model weights are read from memory every time a token is generated.
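The masking step can be illustrated with made-up vocabularies (real ones map tens of thousands of token strings to integer ids, but the set arithmetic is the same):

```python
# Sketch of the Token-Level Intersection (TLI) idea: restrict the drafter
# to tokens that also exist in the target's vocabulary.
draft_vocab  = {"the", "cat", "sat", "##ting", "mat"}
target_vocab = {"the", "cat", "sat", "mat", "on"}

# The "diff": tokens the drafter knows but the target does not.
forbidden = draft_vocab - target_vocab

def mask_draft_probs(probs):
    """Drop draft probabilities for tokens outside the shared vocabulary,
    then renormalize so the drafter still emits a valid distribution over
    tokens the target is able to verify."""
    kept = {tok: p for tok, p in probs.items() if tok not in forbidden}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

draft_probs = {"the": 0.25, "cat": 0.25, "##ting": 0.5}
masked = mask_draft_probs(draft_probs)
print(masked)  # {'the': 0.5, 'cat': 0.5}
```

The drafter can never propose "##ting", so every token it does propose is at least representable in the target's vocabulary, keeping acceptance rates up.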

The second algorithm, called String-Level Exact Match (SLEM), works more like a translation layer between the draft and target model's tokenizers. 

Tokenizers, if you're not familiar, are how large language models break up words, punctuation, and other expressions into chunks they can understand. OpenAI has a great demo showing this in practice, which you can find here.

Draft predictions using the SLEM algorithm generate a complete string of tokens, which are converted into an intermediary format — in this case, plain text — that both models can understand. The output is then retokenized by the target model for review.

This approach, Timor notes, "replaces the standard verification method of speculative decoding with exact string matching, which is an even stricter verification method."

This introduced certain challenges for the team as differences in how the tokenizers handle text could introduce nearly imperceptible changes. "For example, if you have leading white spaces, it might squash them," he explained.

That might not sound like a big deal, but the string must match exactly, or it will be rejected and any potential speedup will be lost. To get around this, SLEM introduces a heuristic function to help smooth out the differences and drive up acceptance rates. And, at least in long-context tasks like summarization and programming, the improvements can be dramatic: up to 2.8x in the team's testing.
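The round trip from draft tokens to plain text and back can be sketched with invented tokenizers (real models use BPE-style tokenizers, but the handoff is the same shape):

```python
import re

# Toy sketch of the String-Level Exact Match (SLEM) handoff.

def draft_detokenize(tokens):
    # Drafter's detokenizer: its tokens carry their own trailing spaces.
    return "".join(tokens)

def target_tokenize(text):
    # Target's tokenizer attaches the leading space to each word,
    # a common BPE convention.
    return re.findall(r"\s*\S+", text)

def normalize(text):
    # Heuristic smoothing: collapse whitespace runs that the two tokenizers
    # would otherwise handle differently -- without this, an imperceptible
    # double space is enough to fail the exact string match.
    return " ".join(text.split())

def slem_handoff(draft_tokens):
    """Draft tokens -> plain text -> target tokens, ready for the target
    model to verify via exact string matching."""
    return target_tokenize(normalize(draft_detokenize(draft_tokens)))

# A stray double space from the drafter no longer breaks the match:
print(slem_handoff(["The ", "cat  ", "sat"]))  # ['The', ' cat', ' sat']
```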

It's a single line change for developers

Neither of these algorithms, Timor emphasizes, is theoretical. Both SLEM and TLI are already part of Hugging Face's Transformers library, which is among the most widely deployed frameworks for running LLMs at scale today. "It's a single line change for developers," he said.
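In Transformers, the feature is surfaced through assisted generation: you hand `generate` a drafter via `assistant_model`. The sketch below uses placeholder checkpoint names, and the exact keyword arguments for mismatched vocabularies should be checked against the Transformers documentation for your version:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints; per the universal assisted generation docs,
# the drafter's vocabulary no longer needs to match the target's.
tok = AutoTokenizer.from_pretrained("big-org/target-model")
assistant_tok = AutoTokenizer.from_pretrained("small-org/draft-model")
model = AutoModelForCausalLM.from_pretrained("big-org/target-model")
assistant = AutoModelForCausalLM.from_pretrained("small-org/draft-model")

inputs = tok("Speculative decoding is", return_tensors="pt")
out = model.generate(
    **inputs,
    assistant_model=assistant,          # the one-line change: add a drafter
    tokenizer=tok,                      # needed when vocabularies differ,
    assistant_tokenizer=assistant_tok,  # per the assisted-generation API
)
print(tok.batch_decode(out, skip_special_tokens=True)[0])
```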

Which of these you should use is going to depend on what exactly you're doing with these models, Timor said. "Sometimes the first one works better, sometimes the second one does. You have to check it on your specific configuration."

In some cases, it may still be worth training a dedicated drafter. But as Timor points out, the algorithms researchers have developed significantly reduce the barrier to adoption for speculative decoding.

More research to be done

Timor's research into speculative decoding doesn't stop here. As we mentioned earlier, the team developed three algorithms.

The third, called String-Level Rejection Sampling (SLRS), aimed to address the relatively poor acceptance rates associated with string-verification-based approaches.

"It uses a generalized drafter that considers probabilities over strings rather than tokens, and we proved that it boosts acceptance rates," Timor said. "The problem is that computing this generalized drafter in runtime, it's computationally expensive, so you have to redesign vocabularies to make this algorithm practical."

The team is also looking at ways to address the explosive growth of model vocabularies and make the draft models even faster.

"The vocabularies are getting huge. Llama 4, for example, is like 200,000 tokens," Timor said, adding that most of that isn't actually used, driving up latency. "We're currently working on shrinking the vocabulary."

That research, he says, is ongoing. ®

