AMD has introduced Instella, a family of fully open-source language models featuring 3 billion parameters, trained from scratch on AMD Instinct™ MI300X GPUs. The models demonstrate significant improvements over existing fully open models and aim to be competitive with state-of-the-art open-weight models.

AMD introduces Instella: Open-source language models with 3 billion parameters

Instella is an autoregressive transformer with 36 decoder layers and 32 attention heads. The architecture supports sequences of up to 4,096 tokens. The vocabulary holds approximately 50,000 tokens and uses the OLMo tokenizer.
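
As a rough illustration, the architecture figures can be collected into a small configuration object. This is a sketch, not AMD's code: the class name `InstellaArchSketch` and its helper methods are hypothetical, the vocabulary size is the article's approximate figure, and the hidden size is left as a caller-supplied argument because the article does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstellaArchSketch:
    """Architecture figures quoted in the article (hypothetical container)."""

    n_decoder_layers: int = 36
    n_attention_heads: int = 32
    max_sequence_length: int = 4096
    approx_vocab_size: int = 50_000  # "approximately 50,000" per the article

    def head_dim(self, hidden_size: int) -> int:
        """Per-head width for a given hidden size (hidden size is NOT from the article)."""
        if hidden_size % self.n_attention_heads != 0:
            raise ValueError("hidden size must be divisible by the number of heads")
        return hidden_size // self.n_attention_heads

    def fits_context(self, n_tokens: int) -> bool:
        """True if a tokenized input fits in the supported sequence length."""
        return 0 <= n_tokens <= self.max_sequence_length


arch = InstellaArchSketch()
print(arch.fits_context(4096))  # True
print(arch.fits_context(5000))  # False
```

Inputs longer than 4,096 tokens would have to be truncated or split before being passed to the model.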

Training ran on AMD Instinct MI300X GPUs, which AMD uses to showcase its hardware-software integration. Instella scales up AMD’s earlier 1-billion-parameter AMD OLMo models, moving from 64 MI250 GPUs and 1.3 trillion tokens to 128 MI300X GPUs and 4.15 trillion tokens.
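
For a sense of scale, the jump from AMD OLMo to Instella can be worked out directly from the quoted figures. The variable names below are illustrative only.

```python
# Figures quoted in the article.
olmo_gpus, olmo_tokens = 64, 1.3e12            # AMD OLMo: MI250 GPUs, training tokens
instella_gpus, instella_tokens = 128, 4.15e12  # Instella: MI300X GPUs, training tokens

gpu_ratio = instella_gpus / olmo_gpus
token_ratio = instella_tokens / olmo_tokens

print(f"GPU count grew {gpu_ratio:.1f}x")       # GPU count grew 2.0x
print(f"Token budget grew {token_ratio:.2f}x")  # Token budget grew 3.19x
```

The GPU ratio compares device counts only. MI250 and MI300X are different accelerators, so it says nothing about the relative compute used.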

AMD’s Instella training pipeline consisted of four stages, which incrementally enhanced the model’s capabilities from general natural language understanding to instruction following and alignment toward human preferences. The first stage involved training on 4.065 trillion tokens from diverse datasets including DCLM-baseline and Dolma 1.7, while the second stage incorporated an additional 57.575 billion tokens from high-quality datasets like Dolmino-Mix-1124 and SmolLM-Corpus.

Model versions and training details

The Instella models released include:

  • Instella-3B-Stage1: Pre-training Stage 1 with 4.065 trillion tokens for foundational natural language proficiency.
  • Instella-3B: Pre-training Stage 2 with an additional 57.575 billion tokens to enhance problem-solving capabilities.
  • Instella-3B-SFT: Supervised Fine-tuning (SFT) using 8.902 billion tokens across three epochs to improve instruction-following abilities.
  • Instella-3B-Instruct: Alignment for human preferences using 760 million tokens with Direct Preference Optimization (DPO).
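
The per-stage token counts can be checked against the 4.15 trillion total quoted earlier. The sketch below sums the four stages under two readings of the SFT figure, since "8.902 billion tokens across three epochs" is ambiguous: either 8.902 billion tokens are seen in each of three epochs (an assumption), or 8.902 billion is already the three-epoch total.

```python
# Token counts per stage in billions, as quoted in the article.
STAGE_TOKENS_B = {
    "Instella-3B-Stage1": 4065.0,
    "Instella-3B": 57.575,
    "Instella-3B-SFT": 8.902,
    "Instella-3B-Instruct": 0.760,
}
SFT_EPOCHS = 3


def total_tokens_b(sft_count_is_per_epoch: bool) -> float:
    """Sum tokens over all four stages, in billions."""
    total = 0.0
    for stage, tokens in STAGE_TOKENS_B.items():
        if stage == "Instella-3B-SFT" and sft_count_is_per_epoch:
            tokens *= SFT_EPOCHS  # ASSUMPTION: 8.902B tokens seen in each epoch
        total += tokens
    return total


print(f"{total_tokens_b(True) / 1000:.3f} trillion")   # 4.150 trillion
print(f"{total_tokens_b(False) / 1000:.3f} trillion")  # 4.132 trillion
```

Only the per-epoch reading reproduces the 4.15 trillion figure, which suggests that the headline total counts every token seen across all four stages, including repeated SFT epochs.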

For efficiency, training used FlashAttention-2, torch.compile, and bfloat16 mixed-precision training, along with fully sharded data parallelism (FSDP) with hybrid sharding to optimize resource utilization across a large cluster.
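
In hybrid sharding, model state is sharded within a group of GPUs and replicated across groups, so the heaviest communication stays inside a group. The pure-Python sketch below shows how ranks would be partitioned under that scheme. It is a conceptual illustration, not AMD's training code: the function name `hybrid_shard_groups` is hypothetical, and the group size of 8 GPUs is an assumption, since the article gives only the total of 128 GPUs.

```python
def hybrid_shard_groups(world_size: int, shard_group_size: int):
    """Partition ranks for hybrid sharding.

    Returns (shard_groups, replica_groups):
    - each shard group jointly holds one full copy of the model, split across its ranks;
    - each replica group links the ranks that hold the same shard in different copies,
      which is where gradients for that shard are averaged.
    """
    if world_size % shard_group_size != 0:
        raise ValueError("world size must be a multiple of the shard group size")
    shard_groups = [
        list(range(start, start + shard_group_size))
        for start in range(0, world_size, shard_group_size)
    ]
    replica_groups = [
        list(range(offset, world_size, shard_group_size))
        for offset in range(shard_group_size)
    ]
    return shard_groups, replica_groups


# 128 GPUs is from the article; 8 GPUs per shard group is an ASSUMPTION.
shard_groups, replica_groups = hybrid_shard_groups(world_size=128, shard_group_size=8)
print(len(shard_groups), len(shard_groups[0]))      # 16 8
print(len(replica_groups), len(replica_groups[0]))  # 8 16
```

Under these assumptions there would be 16 model replicas, each sharded 8 ways. Full sharding would instead split the model across all 128 ranks, lowering memory per GPU but requiring cluster-wide communication to gather parameters.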

Performance benchmarks

Instella models outperform existing fully open models of a similar size. The final pre-trained model, Instella-3B, leads existing top-performing fully open pre-trained models by an average of 8.08%, with notable improvements in benchmarks such as ARC Challenge (+8.02%), ARC Easy (+3.51%), and GSM8K (+48.98%).

Instella-3B models also perform well on standard benchmarks, including MMLU and BBH, and are competitive with open-weight models such as Llama-3.2-3B and Gemma-2-2B. Among instruction-tuned models, Instella-3B-Instruct holds a 14.37% score lead over the next-best fully open instruction-tuned models.

The models were evaluated using standard tasks from OLMES, FastChat MT-Bench, and Alpaca, with results indicating strong performance relative to existing state-of-the-art open-weight models. The instruction-tuned models in particular narrow the gap to those open-weight models.

Open-source availability

AMD has fully open-sourced all artifacts related to Instella models, including model weights, training configurations, datasets, and code, promoting collaboration and innovation within the AI community. Resources are available through Hugging Face model cards and GitHub repositories.


Featured image credit: Timothy Dykes/Unsplash