🦅 Eagle 7B : Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages (RWKV-v5)

A brand new era for the RWKV-v5 architecture and linear transformers has arrived - with the strongest multi-lingual model in open source today

Eugene Cheah
Jan 29, 2024
An eagle, flying past a transformer-looking robot

Eagle 7B - in short

Eagle 7B is a 7.52B parameter model that:

  • Is built on the RWKV-v5 architecture
    (a linear transformer with 10-100x+ lower inference cost)

  • Ranks as the world’s greenest 7B model (per token)

  • Is trained on 1.1 trillion tokens across 100+ languages

  • Outperforms all 7B class models in multi-lingual benchmarks

  • Approaches Falcon (1.5T), LLaMA2 (2T), Mistral (>2T?) level of performance in English evals

  • Trades blows with MPT-7B (1T) in English evals

  • All while being an “Attention-Free Transformer”

  • Is a foundation model, with a very small instruct tune - further fine-tuning is required for various use cases!

We are releasing RWKV-v5 Eagle 7B under the Apache 2.0 license, hosted by the Linux Foundation, and it can be used personally or commercially without restrictions

  • Download from Huggingface, and use it anywhere (even locally)

  • Use our reference pip inference package, or any other community inference options (Desktop App, RWKV.cpp, etc) - see the short sketch after this list

  • Fine-tune using our Infctx trainer

  • Try it online on Huggingface

  • [Pending PR] Get it merged into Huggingface transformers!
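
As a quick illustration of the pip inference path mentioned above, here is a minimal local-inference sketch using the `rwkv` pip package. The checkpoint filename, strategy string, and sampling settings are illustrative assumptions, not official defaults - point them at the Eagle 7B weights downloaded from Huggingface.

```python
# Minimal sketch of local inference with the reference `rwkv` pip package
# (pip install rwkv). The checkpoint path, strategy string, and sampling
# settings below are illustrative assumptions, not official defaults.
import os
os.environ["RWKV_JIT_ON"] = "1"  # optional: enable the TorchScript path

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# Point this at the .pth weights downloaded from Huggingface
model = RWKV(model="RWKV-v5-Eagle-7B.pth", strategy="cuda fp16")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")  # tokenizer used by the "world" models

args = PIPELINE_ARGS(temperature=1.0, top_p=0.7)
print(pipeline.generate("The eagle soars above the transformer because", token_count=128, args=args))
```

The same weights also run CPU-only with a strategy string like "cpu fp32", though far more slowly.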

Multi-Lingual Performance details

We measured multi-lingual performance across the following benchmarks: xLAMBDA, xStoryCloze, xWinograd, and xCopa - covering a total of 23 languages.

Most of these benchmarks cover common sense reasoning in their respective languages, and they show a huge overall jump in multi-lingual performance from the RWKV v4 to v5 architecture, together with the v2 World dataset.

It should also be noted that there is a lack of multi-lingual benchmarks, as the above only covers approximately the top 23 languages.

This makes it hard to directly evaluate model language performance over the remaining 75+ of the 100+ trained languages - a shortcoming we hope to improve on in future models.
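
For readers who want to run this kind of multi-lingual comparison on their own checkpoints, a sketch along the lines below, using EleutherAI's lm-evaluation-harness, is one way to do it. The task names and the Huggingface repo id shown are our assumptions (they follow the public harness naming and may vary between harness versions), not a record of the exact commands behind the numbers above.

```python
# Hedged sketch: scoring a checkpoint on some of the multi-lingual suites with
# EleutherAI's lm-evaluation-harness (pip install lm-eval). Task names and the
# repo id are assumptions and may differ between harness versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",  # Huggingface-backed model loader
    model_args="pretrained=RWKV/v5-Eagle-7B-HF,trust_remote_code=True",
    tasks=["xstorycloze", "xwinograd", "xcopa"],
    batch_size=8,
)

# Print the per-task accuracy numbers
for task, metrics in results["results"].items():
    print(task, metrics)
```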

English Performance details

English performance was measured across 12 separate benchmarks, covering commonsense reasoning and world knowledge.

Once again, we see a huge overall jump from the RWKV v4 to v5 architecture, together with the v2 World dataset.

Where v4 previously lost out to MPT-7B, the top model in the 1T token tier, v5 now trades blows in benchmarks, in some cases even coming out on top in certain benchmarks (LAMBADA, StoryCloze16, WinoGrande, HeadQA_en, SciQ) over Falcon, or even LLaMA2.

In addition, v5 performance starts to fall in line with the expected transformer performance level for its approximate training token count.

Mistral-7B maintains its lead, with its rumored 2~7 trillion tokens of training.

We expect to narrow the gap as we train an additional 1T tokens, to cross the LLaMA2 line and hopefully reach the Mistral line.

Additionally, as a base model that is only lightly tuned (a really small instruct set mixed in), we are eager to see how the various community and instruct-tuned variants perform.


Perhaps a good dataset + scalable architecture is all you need?

A notable observation was that our checkpoints near the 300 billion token point show similar performance to Pythia-6.9B.

This is consistent with previous Pile-based experiments on our RWKV-v4 architecture: linear transformers like RWKV scale similarly in performance to transformers when trained on the same token count.

If so, it raises the question again: does the exact architecture matter less than the data for model eval performance?

CUDA computational cost for RWKV-based architectures vs transformer models - that quadratic-vs-linear difference really scales!

If true, perhaps we should seek more efficient and scalable architectures to increase accessibility, drive the cost of AI down for everyone, and lessen the impact on our environment.
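
To make the quadratic-vs-linear point concrete, here is a rough back-of-envelope sketch (an illustration under simplified assumptions, not a measured benchmark): it compares only the context-dependent term of generating one more token for softmax attention versus an RWKV-style fixed-size recurrent state.

```python
# Back-of-envelope sketch (not a measured benchmark): per-token cost of the
# context-dependent term only, ignoring the MLP/projection FLOPs shared by
# both architectures.

def attention_cost_per_token(context_len: int, d_model: int) -> int:
    # Softmax attention reads every cached key/value: O(T * d) per new token,
    # i.e. O(T^2 * d) over a whole sequence.
    return context_len * d_model

def rwkv_cost_per_token(d_model: int) -> int:
    # An RWKV-style recurrence updates a fixed-size state: O(d) per new token,
    # independent of how long the context already is.
    return d_model

d = 4096  # illustrative hidden size
for t in (1_000, 10_000, 100_000):
    ratio = attention_cost_per_token(t, d) / rwkv_cost_per_token(d)
    print(f"context {t:>7,} tokens -> attention/RWKV per-token cost ratio ~ {ratio:,.0f}x")
```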


Building inclusive AI for everyone in this world - not just the English-speaking world

Common feedback we receive on the RWKV multi-lingual approach is that:

  • it hurts our English evaluation scores and slows the growth of linear transformers

  • it is not fair to compare the multi-lingual performance of a multi-lingual model vs a purely English model

And for the most part, we agree on both points.

But we have no plans to change this, as we are building AI for the world - which is not just an English world.

As of 2023, only 17% of the world's population speaks English (1.3 billion people).

World Map showing the distribution of regions and people who are fluent in English (source: Wikipedia)

However, by ensuring support for the top 25 languages in the world and beyond, we can cover approximately 4 billion people, or 50% of the world

Flawed map, highlighting where the Eagle language model will have full or partial support - the goal is to be able to paint the whole map green with confidence

This aligns well with the team's common goal of getting AI to support everyone - not just by allowing it to run cheaply and affordably even on lower-end hardware, but also by supporting their language.

Over time, we intend to grow the multi-lingual dataset, to support a wider variety of languages, and to slowly grow that coverage to 100% of the world - to ensure no language gets left behind.

The RWKV Discord community (https://discord.com/invite/T5JGfMvWA5) has grown thanks to our low inference cost and its wide range of support for various languages.

A major example of this in our community is the Indonesian-NLP discord group, which fine-tunes Indonesian language models from the RWKV line of base models.

This allows them to build strong language-specific models cheaply and affordably (i.e. on a single node), without needing to spend half a million dollars on pre-training.


Future Plans

This marks the release of the strongest linear transformer (in terms of eval benchmarks) to date.

While it may not have succeeded in passing LLaMA2 and Mistral, it provides strong evidence of the following:

  • The RWKV-v5 model architecture scales in performance similarly to transformers, given a similar training token count

  • You can achieve a near LLaMA2-like level of performance, with a substantially lower inference cost

  • While also supporting multi-lingual performance

We plan to follow up by pushing further ahead with:

  • [Feb 2024] An updated RWKV v5: Eagle paper, where we will go in-depth on the architecture changes since v4, along with the model benchmarks and evals

  • [Feb 2024] A further 1T tokens of training (2T total), for direct comparison with the LLaMA2 7B model

  • [Mar 2024] An MoE model based on the v5 Eagle 2T model

  • [Mar 2024] RWKV-v6: “Finch” 1.5B, 3B world models

Disclaimer: All dates are approximate, and are heavily subject to compute availability from our sponsors/providers

Find more about the RWKV Project at

  • Wiki: https://wiki.rwkv.com/

  • Discord: https://discord.gg/bDSBUMeFpc


Acknowledgments

We are grateful and would like to thank the following key groups:

  • StabilityAI for the bulk of the computing provided to train this foundation model

  • EleutherAI for their support, especially in the ongoing paper-writing process

  • Linux Foundation AI & Data group for supporting and hosting the RWKV project

We would also like to thank the various developers working on the growing collection of RWKV-related projects.
