It looks like the vision costs the same for GPT-4o vs mini. Both start with 150x...

MasterScrat · on July 18, 2024

It almost sounds shady... "it's 30x cheaper per token but you now need 30x more tokens per image"?

Has anyone already validated this based on billed cost? running a batch myself to check

EDIT:

Ok so I captioned 500 images in "low resolution" mode with GPT-4o-mini

Each one took approximately: "completion_tokens=84, prompt_tokens=2989, total_tokens=3073"

Reported GPT-4o-mini cost is $0.25

Using GPT-4o this would cost me $1.33 (also in "low resolution" mode), with this breakdown:

"completion_tokens=98, prompt_tokens=239, total_tokens=337"

MasterScrat · on July 18, 2024

Ok I now understand better what happened:

The price for using images as part of your prompt has indeed not changed between GPT-4o-mini and GPT-4o

Yet overall, captioning 500 images now costs me 5x less. This is because when I'm captioning an image, I'm providing both an image and a text prompt. The cost of using the image in the prompt stays the same, but the cost of the text dramatically dropped.

minimaxir · on July 18, 2024

Good catch: the calculators here are bizarre. For GPT-4o, a 512x512 image uses 170 tile tokens. For GPT-4o mini, a 512x512 image uses 5,667 tile tokens. How does that even work in the context of a ViT? The patches and its image encoder should be the same size/output.

Since the base token counts increase proportionally (which makes even less sense) I have a hunch there's a JavaScript bug instead.

bryanh · on July 18, 2024

Confirmed that mini uses ~30x more tokens than base gpt-4o using same image/same prompt: { completionTokens: 46, promptTokens: 14207, totalTokens: 14253 } vs. { completionTokens: 82, promptTokens: 465, totalTokens: 547 }.

minimaxir · on July 18, 2024

Huh. I am so confused.