Note: This was written before API pricing was announced.
Do you know how much the GPT3 API will cost?
A rough calculation tells me you can get at most about 790 requests/$.
GPT3 is pretty huge (175B parameters = 700GB in fp32), and you know how costly GPU inference can be. Even if we find a use case for it, we still need to justify the ROI. There are many blogs on the potential applications, but I haven’t found anything on its pricing.
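To sanity-check the 700GB figure, here is a minimal sketch. It assumes the weights are stored as 32-bit floats (4 bytes per parameter) and counts 1 GB as 10^9 bytes; `weight_size_gb` is just an illustrative helper name, not something from any library.

```python
def weight_size_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Raw size of the model weights in GB (1 GB = 1e9 bytes).

    Assumption: 4 bytes per parameter (fp32) unless told otherwise.
    """
    return num_params * bytes_per_param / 1e9


gpt3_fp32_gb = weight_size_gb(175e9)                     # fp32 weights
gpt3_fp16_gb = weight_size_gb(175e9, bytes_per_param=2)  # fp16 weights
print(f"fp32: {gpt3_fp32_gb:.0f} GB, fp16: {gpt3_fp16_gb:.0f} GB")
```

Either way, the weights alone are far larger than the memory of a single V100 (16GB or 32GB), so the model has to be split across GPUs.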
Let’s try to guess it with the fundamentals of cloud pricing.
Note: You can use this methodology to calculate the API cost for any model. People also like to use the AWS TCO (total cost of ownership) calculator, but I enjoy doing it manually.
STEP 0 - What’s the use case?
Transformer self-attention is quadratic in sequence length, so it’s crucial to decide on the use case first: the use case determines the sequence length.
The best use case for GPT3 is text generation from a prompt.
The prompt can be of any length, but 128 tokens is a sensible guess. People also generate recursively, appending the previously generated text to the prompt to generate more.
GPT3 accepts a seq_length of up to 2048 tokens (GPT2, which the benchmark numbers here come from, tops out at 1024), but due to the quadratic cost of attention, longer sequences make inference even costlier.
Let’s fix the seq length at 128 and then scale the result to 1024.
STEP 1 - Getting GPT2 inferences per hour
Seq length - 128
GPU + XLA inference on TensorFlow
V100 GPU instance
12 vCPUs, 40GB of RAM
Batch size - 8
From the HuggingFace experiment sheet, GPT2 takes 0.02s per batch of 8 on TensorFlow GPU + XLA.
Hence it can serve 8*3600/0.02 = 1440000 inferences/hour.
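The Step 1 arithmetic can be sketched as below. It assumes the GPU runs batches back to back at full load with no idle time or queueing overhead; `inferences_per_hour` is a hypothetical helper name.

```python
def inferences_per_hour(batch_size: int, batch_latency_s: float) -> float:
    """Sequences served per hour if the GPU runs batches back to back."""
    batches_per_hour = 3600 / batch_latency_s
    return batch_size * batches_per_hour


# Figures quoted above: batch of 8 in 0.02s at seq length 128.
gpt2_per_hour = inferences_per_hour(batch_size=8, batch_latency_s=0.02)
print(f"GPT2: {gpt2_per_hour:,.0f} inferences/hour")
```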
STEP 2 - Getting GPT3 inferences per hour
GPT2 - 1.5B parameters
GPT3 - 175B parameters
Since GPT3 cannot fit on 1 GPU, it’s split across many. For simplicity, let’s assume inference time scales linearly with parameter count, although multi-GPU inference can be slower because activations have to be passed from one GPU to another.
Equivalent GPT3 inferences/hour/GPU = 1440000*1.5/175 ≈ 12343, rounded up to ~12400
STEP 3 - Inference optimisation
HuggingFace mentions that AMP (fp16) can increase throughput by 1.5x.
New inferences/hour/GPU = 12400*1.5 = 18600
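Steps 2 and 3 together can be sketched like this. The linear scaling by parameter count is the simplifying assumption made above, not a measured result, and the constant names are only illustrative.

```python
import math

GPT2_PARAMS = 1.5e9
GPT3_PARAMS = 175e9
GPT2_INFERENCES_PER_HOUR = 1_440_000  # from Step 1
FP16_SPEEDUP = 1.5                    # AMP (fp16) figure quoted above

# Assumption: throughput falls in proportion to parameter count.
gpt3_raw = GPT2_INFERENCES_PER_HOUR * GPT2_PARAMS / GPT3_PARAMS

# The estimate rounds up to the nearest hundred before applying the speedup.
gpt3_rounded = math.ceil(gpt3_raw / 100) * 100

gpt3_fp16 = gpt3_rounded * FP16_SPEEDUP
print(f"raw: {gpt3_raw:.0f}, rounded: {gpt3_rounded}, with fp16: {gpt3_fp16:.0f}")
```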
STEP 4 - Cost per hour at full load
An AWS p3.2xlarge (one V100) costs $3.06/hour on demand. A 1-year reserved instance with all-upfront payment can give up to a 36% discount.
Discounted cost = $3.06 * (1 - 0.36) = $1.96/hour
(Azure V100 1 year reserved instance costs $1.72/hour)
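The discount step as a small sketch. The $3.06 rate and 36% discount are the figures quoted above; `reserved_hourly_cost` is a hypothetical helper, and real reserved pricing depends on region and term.

```python
def reserved_hourly_cost(on_demand_hourly: float, discount: float) -> float:
    """Effective hourly rate after a fractional reserved-instance discount."""
    return on_demand_hourly * (1 - discount)


aws_reserved = reserved_hourly_cost(on_demand_hourly=3.06, discount=0.36)
print(f"AWS reserved: ${aws_reserved:.2f}/hour")
```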
STEP 5 - Cost per inference
Cost per inference
= instance cost per hour / inferences per hour
= $1.96 / 18600 = $0.00010537634
So a GPT3 API call will cost a minimum of about $0.000105.
For $1, you will be able to serve about 9490 API requests.
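Step 5 as a minimal sketch, using the hourly cost from Step 4 and the throughput from Step 3. It assumes the instance is fully utilized every hour, which is why this is a lower bound on cost.

```python
HOURLY_COST_USD = 1.96        # reserved V100 instance, from Step 4
INFERENCES_PER_HOUR = 18_600  # GPT3 estimate with fp16, from Step 3

cost_per_inference = HOURLY_COST_USD / INFERENCES_PER_HOUR
requests_per_dollar = 1 / cost_per_inference

print(f"${cost_per_inference:.8f} per request")
print(f"{requests_per_dollar:.0f} requests/$")
```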
Longer sequence API
GPT2 with seq length 1024 and batch size 8 takes 0.195s, which is roughly 10x the time for seq length 128 (0.195/0.02 = 9.75).
Hence you will be able to serve about 9490/10 = 949 requests/$.
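A quick sketch of the sequence-length scaling. It compares the measured GPT2 slowdown with the rounded 10x used in the estimate, and with what a purely quadratic cost model would predict; the assumption that GPT3 slows down by the same factor as GPT2 is carried over from the text.

```python
LATENCY_128_S = 0.02    # GPT2, batch 8, seq length 128
LATENCY_1024_S = 0.195  # GPT2, batch 8, seq length 1024
REQUESTS_PER_DOLLAR_128 = 9490

measured_slowdown = LATENCY_1024_S / LATENCY_128_S  # from the benchmark
quadratic_slowdown = (1024 / 128) ** 2              # if cost were purely quadratic

# The estimate rounds the measured slowdown up to 10x.
requests_rounded = REQUESTS_PER_DOLLAR_128 / 10
requests_exact = REQUESTS_PER_DOLLAR_128 / measured_slowdown

print(f"measured: {measured_slowdown:.2f}x, purely quadratic: {quadratic_slowdown:.0f}x")
print(f"requests/$ at 1024: {requests_rounded:.0f} (10x) or {requests_exact:.0f} (measured)")
```

The measured slowdown is far below the purely quadratic figure, so using the benchmark number rather than a theoretical one matters here.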
I hope this gives you a good idea on how to justify the use case for your business.
We haven’t added any profit margin for OpenAI to the API cost. Adding a 20% markup on top of cost means $1 will buy 949/1.2 ≈ 790 requests.
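The margin step, sketched under two readings of "20%". Dividing by 1.2 treats it as a markup on cost, which is what the estimate does; the second line shows, for comparison only, what a 20% margin on the selling price would give. The variable names are illustrative.

```python
import math

COST_ONLY_REQUESTS_PER_DOLLAR = 949
RATE = 0.20

# 20% markup on cost: price = cost * 1.2
with_markup = COST_ONLY_REQUESTS_PER_DOLLAR / (1 + RATE)

# 20% margin on price: price = cost / 0.8 (not what the estimate uses)
with_margin = COST_ONLY_REQUESTS_PER_DOLLAR * (1 - RATE)

print(f"markup: {math.floor(with_markup)} requests/$")
print(f"margin: {math.floor(with_margin)} requests/$")
```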
790 requests/$ sounds very realistic. That said, I think OpenAI won’t price by operational cost but by the value the API can generate for clients in the major use cases, which they know well thanks to people making amazing demos.
Would you pay $1 per 790 requests for your business?
Update (3rd September)
Explore: free tier, 100K [BPE] tokens or a 3-month trial, whichever comes first
Create: $100/mo, 2M tokens/mo, 8 cents per additional 1k tokens
Build: $400/mo, 10M tokens/mo, 6 cents per additional 1k tokens
Scale: Contact Us
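Cost on the Build plan (tier 3)
Taking one request as roughly 1k tokens (the 1024 seq length used above) at 6 cents per additional 1k tokens: 1/0.06 ≈ 17 requests/$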
Difference from pure cloud cost
This means OpenAI is charging about 55x the cloud cost (949/17 ≈ 55). My cloud estimate was based on AWS, but they use Azure, which is even cheaper, so they are charging roughly 60x the cloud cost or more.
This looks very profitable.
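The comparison with the announced pricing can be sketched as follows. It assumes one request is about 1k tokens and uses the overage price on the Build plan; the Azure figure simply rescales the AWS-based estimate by the ratio of hourly rates.

```python
AWS_HOURLY = 1.96                    # reserved AWS V100 rate used above
AZURE_HOURLY = 1.72                  # reserved Azure V100 rate quoted above
CLOUD_REQUESTS_PER_DOLLAR_AWS = 949  # cost-only estimate at seq length 1024
PRICE_PER_1K_TOKENS = 0.06           # Build plan overage price

# Assumption: one request is roughly 1k tokens.
api_requests_per_dollar = round(1 / PRICE_PER_1K_TOKENS)

ratio_aws = CLOUD_REQUESTS_PER_DOLLAR_AWS / api_requests_per_dollar

# Cheaper hardware means more requests per dollar at cost, so a larger ratio.
cloud_requests_azure = CLOUD_REQUESTS_PER_DOLLAR_AWS * AWS_HOURLY / AZURE_HOURLY
ratio_azure = cloud_requests_azure / api_requests_per_dollar

print(f"API: {api_requests_per_dollar} requests/$")
print(f"vs AWS cost: {ratio_aws:.1f}x, vs Azure cost: {ratio_azure:.1f}x")
```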
One model to rule them all?