Deploy SmolLM2 Models in the Cloud

Ayush Kumar

Nov 20, 2024 — 10 min read

The SmolLM2 series offers a range of compact language models crafted by HuggingFaceTB, available in three sizes: 135M, 360M, and 1.7B parameters. These models are designed to handle diverse language tasks while remaining lightweight enough to operate efficiently on devices with limited computational resources, making them ideal for both general and specialized applications.
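To make "lightweight" concrete, the sketch below estimates how much memory the weights alone would occupy at each size. It assumes the nominal parameter counts quoted above and 4 bytes per parameter for float32 (2 for float16); the names `PARAM_COUNTS` and `weight_memory_gb` are illustrative, and activations, the KV cache, and framework overhead are not counted.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: nominal parameter counts; real checkpoints differ slightly.
PARAM_COUNTS = {
    "SmolLM2-135M": 135_000_000,
    "SmolLM2-360M": 360_000_000,
    "SmolLM2-1.7B": 1_700_000_000,
}
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: int, dtype: str = "float32") -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, count in PARAM_COUNTS.items():
    fp32 = weight_memory_gb(count)
    fp16 = weight_memory_gb(count, "float16")
    print(f"{name}: ~{fp32:.2f} GB in float32, ~{fp16:.2f} GB in float16")
```

By this estimate, even the largest model's weights fit in well under 10 GB, which is why a GPU with less VRAM than the one used later in this guide can also work.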

Model Options and Features

SmolLM2-135M

Size: 135 million parameters
Training Dataset: Built using an extensive corpus of 2 trillion tokens from a wide range of sources, including FineWeb-Edu, DCLM, and The Stack.
Strengths: Excels in tasks like content creation, answering questions, and generating code. The model is particularly strong in commonsense reasoning and general knowledge tasks.
Enhancements: Fine-tuned with advanced optimization techniques, including supervised learning and preference-based fine-tuning, to improve its ability to follow instructions effectively.

SmolLM2-360M

Size: 360 million parameters
Overview: This version builds on the 135M variant with expanded capacity, enabling it to process more complex tasks and larger inputs. It strikes a fine balance between computational efficiency and task performance.
Best Fit: Works well for scenarios requiring advanced contextual understanding while maintaining efficiency.

Instruction-Focused Models

SmolLM2-135M Instruction Variant

Specialization: Specifically optimized for instruction-based applications, this model uses both publicly available and tailored datasets to enhance its ability to respond accurately to prompts.
Performance: Demonstrates improved precision in benchmark tests, showcasing enhanced understanding and response quality tailored to specific requests.

SmolLM2-360M Instruction Variant

Specialization: Similar to the 135M instruction model but with the added benefits of a larger size, enabling better handling of nuanced queries and more intricate problem-solving tasks.
Applications: Particularly suitable for tasks requiring multi-step reasoning and advanced response generation.

Common Strengths Across the SmolLM2 Series

Foundation: Built using a state-of-the-art transformer-based architecture for superior performance.
Training Approach: Incorporates advanced methods such as preference-based optimization to align model outputs with user expectations.
Versatility: Designed for a wide range of applications, including text summarization, rephrasing, problem-solving, and more.

Whether you are creating engaging content, solving complex queries, or generating concise summaries, the SmolLM2 family offers an efficient solution tailored to your needs, all while ensuring compatibility with limited-resource environments.

SmolLM2-135M Benchmark

Base pre-trained model

Metrics	SmolLM2-135M-8k	SmolLM-135M
HellaSwag	42.1	41.2
ARC (Average)	43.9	42.4
PIQA	68.4	68.4
MMLU (cloze)	31.5	30.2
CommonsenseQA	33.9	32.7
TriviaQA	4.1	4.3
Winogrande	51.3	51.3
OpenBookQA	34.6	34.0
GSM8K (5-shot)	1.4	1.0

Instruction model

Metric	SmolLM2-135M-Instruct	SmolLM-135M-Instruct
IFEval (Average prompt/inst)	29.9	17.2
MT-Bench	1.98	1.68
HellaSwag	40.9	38.9
ARC (Average)	37.3	33.9
PIQA	66.3	64.0
MMLU (cloze)	29.3	28.3
BBH (3-shot)	28.2	25.2
GSM8K (5-shot)	1.4	1.4

SmolLM2-360M Benchmark

Base Pre-Trained Model

Metrics	SmolLM2-360M	Qwen2.5-0.5B	SmolLM-360M
HellaSwag	54.5	51.2	51.8
ARC (Average)	53.0	45.4	50.1
PIQA	71.7	69.9	71.6
MMLU (cloze)	35.8	33.7	34.4
CommonsenseQA	38.0	31.6	35.3
TriviaQA	16.9	4.3	9.1
Winogrande	52.5	54.1	52.8
OpenBookQA	37.4	37.4	37.2
GSM8K (5-shot)	3.2	33.4	1.6

Instruction Model

Metric	SmolLM2-360M-Instruct	Qwen2.5-0.5B-Instruct	SmolLM-360M-Instruct
IFEval (Average prompt/inst)	41.0	31.6	19.8
MT-Bench	3.66	4.16	3.37
HellaSwag	52.1	48.0	47.9
ARC (Average)	43.7	37.3	38.8
PIQA	70.8	67.2	69.4
MMLU (cloze)	32.8	31.7	30.6
BBH (3-shot)	27.3	30.7	24.4
GSM8K (5-shot)	7.43	26.8	1.36
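One way to read the instruction-model table above is to compute the per-benchmark gap between SmolLM2-360M-Instruct and Qwen2.5-0.5B-Instruct. The sketch below copies the scores from that table; the names `SCORES` and `deltas` are illustrative only.

```python
# Scores copied from the instruction-model table above, as
# (SmolLM2-360M-Instruct, Qwen2.5-0.5B-Instruct).
SCORES = {
    "IFEval": (41.0, 31.6),
    "MT-Bench": (3.66, 4.16),
    "HellaSwag": (52.1, 48.0),
    "ARC (Average)": (43.7, 37.3),
    "PIQA": (70.8, 67.2),
    "MMLU (cloze)": (32.8, 31.7),
    "BBH (3-shot)": (27.3, 30.7),
    "GSM8K (5-shot)": (7.43, 26.8),
}


def deltas(scores):
    """Return SmolLM2 minus Qwen for each benchmark, rounded to 2 decimals."""
    return {name: round(a - b, 2) for name, (a, b) in scores.items()}


for name, diff in deltas(SCORES).items():
    leader = "SmolLM2" if diff > 0 else "Qwen2.5"
    print(f"{name:16s} {diff:+7.2f}  ({leader} ahead)")
```

On these numbers, SmolLM2-360M-Instruct leads on five of the eight benchmarks, while Qwen2.5-0.5B-Instruct leads on MT-Bench, BBH, and, by a wide margin, GSM8K.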

SmolLM2-135M-Instruct Benchmark

Base pre-trained model

Metrics	SmolLM2-135M-8k	SmolLM-135M
HellaSwag	42.1	41.2
ARC (Average)	43.9	42.4
PIQA	68.4	68.4
MMLU (cloze)	31.5	30.2
CommonsenseQA	33.9	32.7
TriviaQA	4.1	4.3
Winogrande	51.3	51.3
OpenBookQA	34.6	34.0
GSM8K (5-shot)	1.4	1.0

Instruction model

Metric	SmolLM2-135M-Instruct	SmolLM-135M-Instruct
IFEval (Average prompt/inst)	29.9	17.2
MT-Bench	1.98	1.68
HellaSwag	40.9	38.9
ARC (Average)	37.3	33.9
PIQA	66.3	64.0
MMLU (cloze)	29.3	28.3
BBH (3-shot)	28.2	25.2
GSM8K (5-shot)	1.4	1.4

SmolLM2-360M-Instruct Benchmark

Base Pre-Trained Model

Metrics	SmolLM2-360M	Qwen2.5-0.5B	SmolLM-360M
HellaSwag	54.5	51.2	51.8
ARC (Average)	53.0	45.4	50.1
PIQA	71.7	69.9	71.6
MMLU (cloze)	35.8	33.7	34.4
CommonsenseQA	38.0	31.6	35.3
TriviaQA	16.9	4.3	9.1
Winogrande	52.5	54.1	52.8
OpenBookQA	37.4	37.4	37.2
GSM8K (5-shot)	3.2	33.4	1.6

Instruction Model

Metric	SmolLM2-360M-Instruct	Qwen2.5-0.5B-Instruct	SmolLM-360M-Instruct
IFEval (Average prompt/inst)	41.0	31.6	19.8
MT-Bench	3.66	4.16	3.37
HellaSwag	52.1	48.0	47.9
ARC (Average)	43.7	37.3	38.8
PIQA	70.8	67.2	69.4
MMLU (cloze)	32.8	31.7	30.6
BBH (3-shot)	27.3	30.7	24.4
GSM8K (5-shot)	7.43	26.8	1.36

Prerequisites for deploying SmolLM2 Models

GPU: RTX A6000
RAM: 50 GB (minimum)
Disk Space: 50 GB (recommended)
CPU: 28 Cores
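Once a machine is running, a quick check like the one below can confirm that it meets the disk and CPU figures listed above. This is a minimal sketch using only the Python standard library; the function name `check_resources` and its default thresholds are assumptions for illustration, and RAM is not checked because the standard library has no portable call for it.

```python
import os
import shutil


def check_resources(path: str = ".", min_disk_gb: float = 50, min_cores: int = 1) -> dict:
    """Report free disk space and CPU cores against the given minimums."""
    free_gb = shutil.disk_usage(path).free / 1e9
    cores = os.cpu_count() or 0  # cpu_count() can return None
    return {
        "free_disk_gb": round(free_gb, 1),
        "cpu_cores": cores,
        "disk_ok": free_gb >= min_disk_gb,
        "cores_ok": cores >= min_cores,
    }


print(check_resources())
```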

Step-by-Step Process to Deploy SmolLM2 Models on Cloud VM

For this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift offers affordable Virtual Machines at scale and meets GDPR, SOC2, and ISO27001 requirements.

Step 1: Sign Up and Set Up a NodeShift Cloud Account

Visit the NodeShift Platform and create an account. Once you've signed up, log into your account.

Follow the account setup process and provide the necessary details and information.

Step 2: Create a GPU Node (Virtual Machine)

GPU Nodes are NodeShift's GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.

Navigate to the menu on the left side and select the GPU Nodes option. In the Dashboard, click the Create GPU Node button to create your first Virtual Machine deployment.

Step 3: Select a Model, Region, and Storage

In the "GPU Nodes" tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.

We will use 1x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.

Step 4: Select Authentication Method

There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.

Step 5: Choose an Image

Next, choose an image for your Virtual Machine. We will deploy the SmolLM2 models on a Jupyter Virtual Machine. Jupyter is an open-source platform that lets you install and run the SmolLM2 models on your GPU node. Running the models in a Jupyter Notebook avoids working directly in the terminal, which simplifies setup and lets you configure a model in just a few steps.

Note: NodeShift provides multiple image template options, such as TensorFlow, PyTorch, NVIDIA CUDA, Deepo, Whisper ASR Webservice, and Jupyter Notebook. With these options, you don’t need to install additional libraries or packages to run Jupyter Notebook. You can start Jupyter Notebook in just a few simple clicks.

After choosing the image, click the 'Create' button, and your Virtual Machine will be deployed.

Step 6: Virtual Machine Successfully Deployed

You will get visual confirmation that your node is up and running.

Step 7: Connect to Jupyter Notebook

Once your GPU Virtual Machine deployment has been created and has reached the 'RUNNING' status, navigate to the page of your GPU Deployment Instance. Then, click the 'Connect' button in the top right corner.

After clicking the 'Connect' button, you can view the Jupyter Notebook.

Now open a Python 3 (ipykernel) Notebook.

Next, if you want to check the GPU details, run the following command in the Jupyter Notebook cell:

!nvidia-smi
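If you would rather read the GPU details from Python than scan the `nvidia-smi` table by eye, a sketch along these lines may help. It relies on the `--query-gpu` and `--format=csv,noheader` options of `nvidia-smi`; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and the function returns an empty list when `nvidia-smi` is missing or fails.

```python
import subprocess


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse 'name, memory.total' CSV lines into a list of dictionaries."""
    gpus = []
    for line in text.strip().splitlines():
        name, memory = [field.strip() for field in line.split(",", 1)]
        gpus.append({"name": name, "memory_total": memory})
    return gpus


def query_gpus() -> list[dict]:
    """Ask nvidia-smi for GPU names and total memory; return [] on failure."""
    command = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, check=True, timeout=10
        )
    except (FileNotFoundError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return []
    return parse_gpu_csv(result.stdout)


print(query_gpus())
```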

Step 8: Install CUDA

Run the following command in the Jupyter Notebook cell to install the CUDA toolkit:

!sudo apt install nvidia-cuda-toolkit -y

Step 9: Install the Required Packages

Run the following command in the Jupyter Notebook cell to install the required packages:

%pip install transformers torch accelerate

Transformers: Transformers provides APIs and tools to download, run, and train pre-trained models.

Torch: The torch package is PyTorch, an open-source machine learning library for Python. It provides tensor computation with GPU acceleration (via CUDA) and automatic differentiation for building and running deep learning models.

Accelerate: Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code.
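After the install finishes, you may want to confirm that the three packages are actually visible to the notebook kernel. The sketch below uses `importlib.metadata` from the standard library; the helper name `installed_versions` is illustrative, and a value of `None` means the package was not found.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions(["transformers", "torch", "accelerate"]))
```

If any entry is `None`, restarting the kernel after installation is often enough for the package to be picked up.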

Step 10: Import Required Libraries and Load the SmolLM2-135M Model

First, run the following command in the Jupyter Notebook cell to import required libraries:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

Next, run the following code in the Jupyter Notebook cell to load the SmolLM2-135M model:

# Define the model checkpoint and device (GPU or CPU)
checkpoint = "HuggingFaceTB/SmolLM2-135M"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
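As a sanity check that the expected checkpoint was loaded, you can count the model's parameters. The helper below is a small sketch that works on any object exposing `parameters()` whose items expose `numel()`, as PyTorch modules and tensors do; the names `count_parameters` and `describe` are not part of Transformers.

```python
def count_parameters(model) -> int:
    """Sum the element counts of every parameter tensor in the model."""
    return sum(p.numel() for p in model.parameters())


def describe(model) -> str:
    """Format the parameter count in a human-readable way."""
    total = count_parameters(model)
    return f"{total:,} parameters (~{total / 1e6:.1f}M)"


# In the notebook, after the model has been loaded:
# print(describe(model))
```

For the 135M checkpoint, the printed figure should be close to 135 million.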

Step 11: Run Prompt and Generate Output

Now that the model and tokenizer are loaded, encode an input text and generate an output.

# Encode the input text
input_text = "Gravity is"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate the output
outputs = model.generate(inputs, max_length=50)  # Set max_length as per your requirement
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
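Note that `max_length` in `generate` counts the prompt tokens as well as the generated ones, whereas `max_new_tokens` counts only the generated ones. The arithmetic can be sketched as follows; `new_token_budget` is a hypothetical helper, and the prompt length of 3 is an assumed figure rather than the real token count of "Gravity is".

```python
def new_token_budget(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for generation when max_length includes the prompt."""
    return max(0, max_length - prompt_tokens)


# Assumed prompt length of 3 tokens, with max_length=50 as in the cell above.
print(new_token_budget(prompt_tokens=3, max_length=50))  # prints 47
```

If you want a fixed amount of generated text regardless of prompt length, passing `max_new_tokens` instead of `max_length` is usually the simpler choice.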

Step 12: Load the SmolLM2-360M Model and Generate Output

Run the following code in the Jupyter Notebook cell to load the SmolLM2-360M model and generate output:

# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-360M"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
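Because the steps for each model differ only in the checkpoint name, the loading code can be wrapped in one helper. This is a sketch, not part of the original notebook: `CHECKPOINTS` and `load` are illustrative names, and `transformers` is imported inside the function so the registry can be inspected without it.

```python
# Checkpoint names used in this guide, keyed by a short label.
CHECKPOINTS = {
    "135M": "HuggingFaceTB/SmolLM2-135M",
    "360M": "HuggingFaceTB/SmolLM2-360M",
    "135M-Instruct": "HuggingFaceTB/SmolLM2-135M-Instruct",
    "360M-Instruct": "HuggingFaceTB/SmolLM2-360M-Instruct",
}


def load(label: str, device: str = "cpu"):
    """Return (tokenizer, model) for one of the labels in CHECKPOINTS."""
    checkpoint = CHECKPOINTS[label]  # raises KeyError for unknown labels
    # Imported lazily so this cell runs even before transformers is installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
    return tokenizer, model


# Example usage in the notebook:
# tokenizer, model = load("360M", device="cuda")
```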

Step 13: Load the SmolLM2-135M-Instruct Model and Generate Output

Run the following code in the Jupyter Notebook cell to load the SmolLM2-135M-Instruct model and generate output:

from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-135M-Instruct"

device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "What is gravity?"}]
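# add_generation_prompt=True appends the assistant header so the model starts its reply directly
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)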
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))
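The text printed by `print(input_text)` is the chat-formatted prompt. The sketch below imitates that formatting by hand, assuming the ChatML-style `<|im_start|>` and `<|im_end|>` markers; the function `to_chatml` is hypothetical, and the tokenizer's own template remains authoritative (it may, for example, prepend a default system message).

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts in ChatML-style markup."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        text += "<|im_start|>assistant\n"
    return text


print(to_chatml([{"role": "user", "content": "What is gravity?"}]))
```

Seeing the markup spelled out makes it clearer why instruction-tuned checkpoints should be prompted through `apply_chat_template` rather than with raw text.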

Step 14: Load the SmolLM2-360M-Instruct Model and Generate Output

Run the following code in the Jupyter Notebook cell to load the SmolLM2-360M-Instruct model and generate output:

from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-360M-Instruct"

device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "What is the capital of France?"}]
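# add_generation_prompt=True appends the assistant header so the model starts its reply directly
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)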
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))
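The decoded output contains the whole conversation, including the prompt and the chat markers. If you only want the assistant's answer, a small post-processing step like the one below can pull it out. This is a sketch that assumes ChatML-style markers in the decoded text; `extract_reply` is a hypothetical helper and the sample string is made up for illustration, not real model output.

```python
def extract_reply(decoded: str) -> str:
    """Return the text of the last assistant turn, without chat markers."""
    marker = "<|im_start|>assistant\n"
    _, found, tail = decoded.rpartition(marker)
    if not found:
        # No assistant marker present: return the text unchanged (trimmed).
        return decoded.strip()
    return tail.split("<|im_end|>", 1)[0].strip()


# Made-up example of a decoded conversation:
sample = (
    "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\nThe capital of France is Paris.<|im_end|>"
)
print(extract_reply(sample))
```

An alternative is to slice the generated token IDs after the prompt length before decoding, which avoids string parsing altogether.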

Conclusion

The SmolLM2 models are a series of compact open-source models from Hugging Face that offer strong capabilities for their size to developers and researchers. By following this step-by-step guide, you can deploy SmolLM2 models on a cloud-based virtual machine, using a GPU-powered setup from NodeShift to get the most out of them. NodeShift provides a user-friendly, secure, and cost-effective platform to run your models efficiently. It’s a good choice for those exploring SmolLM2 and other recent models.

