What’s so BIG about Small Language Models?
Whether you’re deep into the latest tech trends or not, you’ve probably noticed that acronyms like ‘LLM’ are everywhere.
Without diving too deep: a language model is essentially a deep learning model built by feeding vast amounts of ‘language-like’ data into massive compute – clusters of thousands of GPUs, though let’s skip the hardware details.
Let me introduce you to Small Language Models (SLMs).
From massive models to on-site solutions
There is a scaling problem, and it bit us hard in 2023. Let’s take a look at our champions from 2022 and 2023:
No surprise that data center power consumption is skyrocketing, renewable-powered or not, and compute resources are scarce. For instance, Microsoft’s Azure, which hosts OpenAI’s models, has been severely capacity-constrained – and might still be.
While it’s possible to load these models into RAM and run them on a CPU, it’s painfully slow – which is why LLMs perform so well on unified-memory architectures like Apple’s M-series, where the CPU and GPU share fast, low-latency memory. As a result, many turn to cloud resources, given the scarcity and high cost of GPUs. Despite this, GenAI is still finding its optimal place in the market.
But the catch with these massive, cloud-hosted models is that they always need an active internet connection.
This works fine until a client requires an on-site deployment, and your cloud connection is suddenly out of reach. Now you have to think small, rather than big.
What’s a Small Language Model (SLM)?
There’s no precise definition for when a language model is considered ‘big’ – it’s often a subjective distinction.
However, for practical purposes, we can think of models that can be loaded onto client devices, like Gemini Nano in Google Chrome Canary, as smaller.
Image credit: https://simonwillison.net/2024/Jul/3/chrome-prompt-playground/
Yes, your browser will soon have a small language model (SLM) embedded in it – should we use this term instead of ‘LLM’?
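For a taste of what that looks like in practice, here is roughly the snippet the linked post demonstrates. It’s a sketch of an experimental, flag-gated API in Chrome Canary (mid-2024), so the exact names may well have changed by the time you read this:

```typescript
// Rough sketch of the experimental prompt API Chrome Canary exposed in
// mid-2024 (per the post linked above). It sits behind feature flags,
// ships no official TypeScript types, and is likely to change.
const session = await (window as any).ai.createTextSession();
const answer: string = await session.prompt("Who are you?");
console.log(answer); // the on-device model replies without any network call
```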
When considering LMs from an Edge AI perspective, a model with as few as 8 billion parameters can be classified as ‘small’ if it’s feasible to load onto a client’s device. Perhaps the clearest small-versus-large comparison is GPT-4o and GPT-4o mini: one is a full-blown LLM, the other its smaller sibling – many times smaller, yet not that much less capable.
A model with 8 billion parameters, when quantized to 4 bits, requires about 4 GB of space, which is manageable for 2024-era devices, including mobile phones. Reducing precision further would decrease space requirements, but it could significantly increase perplexity – that is, degrade the quality of the model’s predictions.
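To make that arithmetic concrete, here is a back-of-the-envelope sketch (the helper function is mine, purely for illustration):

```typescript
// Back-of-the-envelope size of a model's weights after quantization.
// Weights only – the KV cache and runtime overhead add more on top.
function modelSizeGB(paramsBillions: number, bitsPerWeight: number): number {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1e9; // decimal gigabytes
}

console.log(modelSizeGB(8, 16)); // 16 GB at FP16 – too heavy for most client devices
console.log(modelSizeGB(8, 4));  //  4 GB at 4-bit – fits a 2024-era phone
console.log(modelSizeGB(8, 2));  //  2 GB at 2-bit – smaller still, but perplexity climbs
```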
The case for SLMs
Advancements in training strategies for language models are making smaller models increasingly effective with each new generation (currently, Meta’s Llama 3.1 8B is a state-of-the-art open-weight model). This opens up breakthrough possibilities:
- Easy fine-tuning of the model (i.e., further training it on your own data) for a specific task:
  - On modest hardware (we’re talking M-series chips and a decent amount of RAM, or GPUs with enough dedicated VRAM)
  - Or in the cloud (OpenAI, for instance, offers fine-tuning for GPT-4o mini)
- Shipping your integrations with SLMs on board when:
  - The client’s on-premises environment prevents online access.
  - The client prefers not to use big tech models due to concerns about data leaks, or has regulations in place (increasingly common) forbidding models trained on specific datasets or by specific vendors.
- Offloading the model to your clients’ machines (not entirely feasible yet, but hardware vendors are moving fast). Recently, WebGPU (which exposes your GPU’s compute power to browser-based tasks) became popular, and building on top of it, WebLLM now brings that same approach to LLMs living inside the browser – or your client’s SaaS products (see the sketch below).
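Here is a minimal sketch of what that can look like with WebLLM’s OpenAI-style API. The model ID and options are illustrative assumptions – check the WebLLM documentation for the current model list:

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Downloads the quantized weights on first run, then runs inference
  // entirely in the browser via WebGPU. Model ID is illustrative.
  const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
    initProgressCallback: (report) => console.log(report.text),
  });

  // OpenAI-style chat completion, executed on the client's GPU –
  // no server round-trip, no data leaving the machine.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Explain what an SLM is in one sentence." }],
  });
  console.log(reply.choices[0].message.content);
}

main();
```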
In the end, you have a model that you can customize and deploy as you see fit. Plus, using open models provides additional benefits, including:
- Predictable performance (no more “Oh no, they’ve updated/broken the model on [insert cloud provider name here]”)
- Data ownership (total privacy)
Outlook for the end of 2024
Big tech is making rapid strides this year, setting ambitious goals for the market.
Microsoft is set to roll out Phi Silica, a Phi-3-derived model, across Copilot+ Windows 11 machines, and Apple plans to integrate similar technology into its devices. Google is already bundling small models with Chrome and Android, hinting at further expansion.
From a performance standpoint, I hope all vendors will be wise enough to manage power consumption effectively, especially on battery-powered devices, given the significant energy demands.