Aiyush Gupta


Is GPT-3 still King? Introducing GPT-J-6B


An Open version of "Open-AI's" closed-source GPT-3

Aiyush Gupta
·Aug 7, 2021·


What are GPTs?

Rather than tell, I’ll show. GPTs (Generative Pre-trained Transformers) are capable of writing stories, composing music, creating art, writing code (see GitHub Copilot), translating text and much more.

What is GPT-J-6B?

GPT-J-6B is a 6-billion-parameter language model from EleutherAI, a project founded in July 2020 with the mission of recreating OpenAI’s previously created models in the open.

EleutherAI competes with AI giants by drawing on cloud computing from Google and CoreWeave. OpenAI has not released the full GPT-3 model (its name, OPEN-AI, is rather misleading), so researchers have been trying to uncover the mystery behind these models through their own extensive knowledge combined with the published papers (GPT-1, GPT-2, GPT-3, and others).

GPT-J is the best-performing publicly available Transformer LM in terms of zero-shot performance on various downstream tasks.


To me, that says it all. It also required substantially less time to train than GPT-3 and closely followed GPT-3’s hyperparameter structure.

  • The model was trained on 400 billion tokens from The Pile, a dataset of roughly 800 GB of text.
  • Efficient attention (linear, local or sliding-window, etc.) was not used, for simplicity, as it would not have significantly improved ‘throughput’ at this scale.
  • The dimension of each ‘attention head’ was set to 256, which noticeably improved ‘throughput’ with little performance degradation.


Model Details - Metrics from GitHub Repo

Hyperparameter | Value
n_layers | 28*
d_model | 4,096
d_ff | 16,384
n_heads | 16
d_head | 256
n_vocab | 50,257 (same tokenizer as GPT-2/3)
position encoding | Rotary position encodings (RoPE)
RoPE dimensions | 64

* Each layer consists of one feedforward block and one self-attention block.
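To make the two RoPE rows concrete, here is a minimal pure-Python sketch of rotating only the first 64 of the 256 dimensions in one attention head. The function names, the base of 10000 and the choice to pair adjacent entries are my assumptions for illustration, not code taken from the GPT-J repo.

```python
import math

HEAD_DIM = 256     # per-head dimension quoted in the article
ROTARY_DIM = 64    # "RoPE dimensions" in the model details
BASE = 10000.0     # assumption: base frequency, chosen for illustration


def apply_partial_rope(vec, position, rotary_dim=ROTARY_DIM, base=BASE):
    """Rotate the first `rotary_dim` entries of one head vector; leave the rest alone."""
    if rotary_dim % 2 or rotary_dim > len(vec):
        raise ValueError("rotary_dim must be even and no larger than the vector")
    out = list(vec)
    for i in range(0, rotary_dim, 2):
        # Each adjacent pair (i, i + 1) is rotated at its own frequency.
        theta = position * base ** (-i / rotary_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors for a single head.
q = [math.sin(i) for i in range(HEAD_DIM)]
k = [math.cos(i) for i in range(HEAD_DIM)]

# Both pairs of positions are two steps apart, so the scores should agree
# up to floating-point rounding.
near = dot(apply_partial_rope(q, 5), apply_partial_rope(k, 3))
far = dot(apply_partial_rope(q, 12), apply_partial_rope(k, 10))
print(abs(near - far) < 1e-9)
```

The point of the rotation is that the score between a query and a key depends on how far apart they are, not on where they sit in the sequence. The remaining 192 dimensions of each head carry no positional signal at all in this sketch.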

The model consists of 28 layers with a model dimension of 4096 and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary position encodings (RoPE) were applied to 64 dimensions of each head. The model was trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
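Those numbers are enough for a back-of-the-envelope check on the “6B” in the name. The sketch below is a rough estimate only: it ignores biases and layer norms, and it assumes the input embedding and the output head are separate matrices, which is my assumption and not something stated above.

```python
N_LAYERS = 28
D_MODEL = 4096
D_FF = 16384
N_HEADS = 16
D_HEAD = 256
N_VOCAB = 50257


def estimate_parameters(n_layers, d_model, d_ff, n_vocab, tied_embeddings=False):
    """Rough weight count for a decoder-only transformer (no biases, no layer norms)."""
    attention = 4 * d_model * d_model    # query, key, value and output projections
    feedforward = 2 * d_model * d_ff     # up-projection and down-projection
    embeddings = n_vocab * d_model
    if not tied_embeddings:
        embeddings *= 2                  # assumption: separate input and output matrices
    return n_layers * (attention + feedforward) + embeddings


# The heads should tile the model dimension exactly.
assert N_HEADS * D_HEAD == D_MODEL

total = estimate_parameters(N_LAYERS, D_MODEL, D_FF, N_VOCAB)
print(f"{total / 1e9:.2f} billion parameters")  # prints "6.05 billion parameters"
```

Almost all of the weights sit in the 28 layers; the vocabulary matrices add only a few hundred million.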

Zero-Shot Evaluations

Models roughly sorted by performance, or by FLOPs if not available.

Model | Weights | Training FLOPs | LAMBADA PPL ↓ | LAMBADA Acc ↑ | Winogrande ↑ | Hellaswag ↑ | PIQA ↑ | Dataset Size (GB)
Chance | | 0 | ~a lot | ~0% | 50% | 25% | 25% | 0

* represents evaluation numbers reported by their respective authors, all other numbers are provided by running the lm-evaluation-harness either with the released weights or with API access. Due to subtle implementation differences as well as different zero shot task framing, these might not be directly comparable. See this blog post for more details.

The Megatron-11B model provides no comparable metrics, and several implementations using the released weights do not reproduce the generation quality and evaluations. (see 1 2 3) Thus, evaluation was not attempted.

These models have been trained on data that may contain test set contamination. The OpenAI GPT-3 models failed to deduplicate training data for certain test sets, while the GPT-Neo models, as well as this one, are trained on The Pile, which has not been deduplicated against any test sets.
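To give a feel for what deduplicating against a test set involves, here is a minimal sketch that flags test examples sharing a long run of words with the training text. The helper names, the whitespace tokenisation and the default window of 13 words are all my assumptions for illustration; this is not the pipeline used by OpenAI or EleutherAI.

```python
def ngrams(text, n):
    """Set of all n-word windows in `text`, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(test_examples, training_docs, n=13):
    """Return the test examples that share at least one n-gram with the training docs."""
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc, n)
    return [example for example in test_examples if ngrams(example, n) & seen]


# Toy data with a short window so the overlap is easy to see.
training_docs = ["the quick brown fox jumps over the lazy dog"]
test_examples = [
    "a quick brown fox jumps high",
    "completely unrelated sentence here now",
]
print(contaminated(test_examples, training_docs, n=4))
```

On the toy data only the first test example is flagged, because it shares the four-word run “quick brown fox jumps” with the training document. At the scale of The Pile you would need something far more memory-efficient than a Python set, but the idea is the same.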

What is The Pile?

The Pile is an 825 GB language-modelling dataset compiled from sources including Wikipedia, arXiv, GitHub, PubMed, HackerNews and Stack Exchange. The varied data sources make it the perfect candidate for cross-domain models.


How can I use it?

Check out the source code in the Colab notebook and try the free web demo here. Since GPT-J is fairly new, there aren’t many implementations of the model yet, so it’ll be more than interesting to see what the community creates with such a powerful tool.
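If you would rather run it yourself, a rough sketch could look like the following. It assumes the Hugging Face transformers library with PyTorch installed and the `EleutherAI/gpt-j-6B` checkpoint on the Hugging Face Hub, neither of which comes from the notebook or demo above. The helper names and sampling values are purely illustrative, and the download runs to tens of gigabytes.

```python
MODEL_ID = "EleutherAI/gpt-j-6B"  # assumption: checkpoint name on the Hugging Face Hub


def sampling_settings(max_new_tokens=50, temperature=0.8, top_p=0.9):
    """Illustrative sampling settings; tune them for your own prompts."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
    }


def generate(prompt, **overrides):
    """Load the model (a very large download) and continue `prompt`."""
    # Imported lazily so the settings helper works without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **sampling_settings(**overrides))
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("GPT-J is an open-source language model that")` should return the prompt followed by the model’s continuation. Expect it to be slow without a large GPU.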

Where can I learn more?

What is GPT-3?

GPT-3 (Generative Pre-trained Transformer 3) was thought to be one of the most advanced autoregressive language models available. It has 175 billion parameters, but OpenAI (founded in 2015 as a non-profit) did not release it, breaking with its earlier open-source practices on the grounds that “a powerful model could easily generate fake news”. The model OpenAI did release to the public, GPT-2, has 1.5 billion parameters, which is under 1% of GPT-3’s size. GPT-3 also raised major environmental concerns, as outlined in “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”.

Yet GPT-3 never fails to amaze: its simple prompt interface allows us to ‘program’ it using just words, over 300 applications are using it, and tens of thousands of developers (including myself) have access to it.

How is GPT-3 Being Used in the Wild?


Viable helps companies better understand their customers by using GPT-3 to provide useful insights from customer feedback in easy-to-understand summaries.

Fable Studio

Fable Studio is creating a new genre of interactive stories and using GPT-3 to help power their story-driven “Virtual Beings.”



Algolia uses GPT-3 in their Algolia Answers product to offer relevant, lightning-fast semantic search for their customers.


In the interests of this article, I’ve listed some other applications to learn more about GPT-3:

Find more examples of how it is being used here:

And articles explaining how it works here:

Conclusion & Comparison

Both language models are awe-inspiring examples of what researchers are able to create, yet there are some subtle differences that lead me to favour GPT-J.

  1. It’s open source. This is the biggest reason, and it will let future developers build upon previous research without having to reinvent the wheel as EleutherAI had to.
  2. It’s roughly 30 times smaller. At first that may seem like a bad thing; however, GPT-J is more suitable for developers, and in code generation it wins since “that is what it was optimised to do”.
  3. It has less impact on the environment. There are tonnes of concerns surrounding the ethics of AI, but the most obvious issue in the middle of our planet’s climate crisis is the amount of energy required to train such models. “OpenAI reports that training GPT-3 consumed several thousand petaflop/s-days of computing power. A petaflop/s-day is a unit of power consumption that consists of performing 10^15 (that's one thousand trillion, or a quadrillion) neural-network computations per second for a day.”
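The petaflop/s-day in that quote is easier to grasp as arithmetic. In this small sketch, the 3,640 petaflop/s-day figure for GPT-3 is the one I recall from the GPT-3 paper (the quote only says “several thousand”). The GPT-J number uses the common rule of thumb of 6 × parameters × tokens, which is an approximation I am bringing in here and not a figure reported by EleutherAI.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PFS_DAY = 10**15 * SECONDS_PER_DAY   # operations in one petaflop/s-day


def pfs_days(total_operations):
    """Convert a raw operation count into petaflop/s-days."""
    return total_operations / PFS_DAY


# Assumption: 3,640 petaflop/s-days for GPT-3, as recalled from the GPT-3 paper.
gpt3_operations = 3640 * PFS_DAY

# Rule of thumb (an approximation): 6 * parameters * training tokens.
gptj_operations = 6 * 6_000_000_000 * 400_000_000_000

print(f"GPT-3: {gpt3_operations:.2e} operations")
print(f"GPT-J: {gptj_operations:.2e} operations, about {pfs_days(gptj_operations):.0f} petaflop/s-days")
print(f"GPT-3 used roughly {gpt3_operations / gptj_operations:.0f}x the compute")
```

Under those assumptions GPT-J lands at well under a tenth of GPT-3’s training compute, which is the gap the third point is getting at.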

Thanks for reading. If you have any questions or would like to reach out, please do so in the comments or on LinkedIn. Please subscribe to my newsletter, where you will be updated with new articles.

