Prompt Caching: Comparing OpenAI, Anthropic, and Gemini

In recent years, the rapid development of large language models (LLMs) has led to significant increases in context window sizes. A context window is the amount of information a model can process at one time, and techniques such as Retrieval-Augmented Generation (RAG) and multimodal inputs like video and images mean that far more content is now packed into each prompt. This evolution is aimed at handling more complex tasks and a wider range of information.

In response, major providers have introduced “prompt caching” for more efficient prompt handling. Prompt caching stores previously processed prompt content so it can be reused, avoiding reprocessing the same tokens on every request. This leads to faster response times and cost savings.

In this article, we will compare the prompt caching features of the key LLM providers: OpenAI, Anthropic, and Gemini, focusing on their specifications and differences.

Models Supporting Prompt Caching

Prompt caching is available in relatively new models.

OpenAI

  • gpt-4o
  • gpt-4o-mini
  • o1-preview
  • o1-mini

Anthropic

  • Claude 3.5 Sonnet
  • Claude 3 Opus
  • Claude 3 Haiku

Gemini

  • Stable versions of Gemini 1.5 Flash (e.g., gemini-1.5-flash-001)
  • Stable versions of Gemini 1.5 Pro (e.g., gemini-1.5-pro-001)

Time to Live (TTL) for Cache Storage

OpenAI

Cached prefixes typically persist for 5–10 minutes of inactivity, and may last up to an hour during off-peak periods.

Anthropic

By default, the cache lasts 5 minutes; the timer is refreshed each time the cached content is used.

Gemini

The default TTL is 1 hour, but you can specify a custom TTL (additional charges apply if extended).
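
A custom TTL is passed at cache-creation time. A minimal sketch (the file name and content are placeholders, and the API key is assumed to be configured as in the full example in the usage section below):

import datetime
from google.generativeai import caching

# 'large_document.txt' is a placeholder for content above the minimum
# cacheable size; storage is billed for as long as the cache is kept.
long_document = open('large_document.txt').read()

cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    contents=[long_document],
    ttl=datetime.timedelta(hours=2),  # custom 2-hour TTL
)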

Pricing

OpenAI

Cached input tokens are discounted by 50% across all supported models, while output token costs remain the same.

Anthropic

Anthropic prices cache writes and cache reads separately: writing to the cache costs 25% more than the base input rate, while reading from it is heavily discounted. Output tokens are billed as usual.

  • Claude 3.5 Sonnet: cache reads are 90% off input tokens
  • Claude 3 Opus: cache reads are 90% off input tokens
  • Claude 3 Haiku: cache reads are 88% off input tokens

Gemini

Gemini has a complex pricing structure with costs including:

  • Regular input/output costs when the cache is missed
  • 75% discount on input costs when the cache is used
  • Cache storage costs

Unlike OpenAI and Anthropic, Gemini also charges for cache storage. For details and an example cost calculation, refer to the official pricing documentation. A rough illustration of how the pieces add up is sketched below.
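
A minimal sketch of such a calculation, assuming the initial cache write is billed at the normal input rate; every rate below is a placeholder, not an official price:

# Placeholder (not official) per-1M-token rates -- substitute current pricing.
input_rate = 0.075                      # uncached input, $/1M tokens
cached_input_rate = input_rate * 0.25   # 75% discount when the cache is hit
storage_rate_per_hour = 1.00            # cache storage, $/1M tokens/hour
output_rate = 0.30                      # output, $/1M tokens

cached_tokens = 700_000   # tokens held in the cache (e.g., a long video)
fresh_tokens = 500        # new tokens sent with each request
output_tokens = 300       # tokens generated per request
hours_cached = 1
requests = 20

cost = (
    cached_tokens / 1e6 * input_rate                              # initial cache write (assumption: billed as normal input)
    + cached_tokens / 1e6 * storage_rate_per_hour * hours_cached  # cache storage
    + requests * (
        cached_tokens / 1e6 * cached_input_rate   # discounted cached input per request
        + fresh_tokens / 1e6 * input_rate         # uncached input per request
        + output_tokens / 1e6 * output_rate       # output per request
    )
)
print(f"Estimated cost: ${cost:.2f}")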

How to Use Prompt Caching

OpenAI

No code changes are necessary.
Prompts of 1,024 tokens or more are automatically cached. Cache hits are matched in 128-token increments beyond that threshold (1,024, 1,152, 1,280, and so on).
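
Because caching is automatic, the only visible signal is the usage metadata in the response. A minimal sketch (the system prompt here is a stand-in for anything long enough to cross the 1,024-token threshold):

from openai import OpenAI

client = OpenAI()

# Long, static prefix followed by a short dynamic question. Caching only
# applies once the prompt reaches 1,024 tokens, so the prefix must be
# long enough in practice.
long_system_prompt = "You are a meticulous code reviewer. ..."  # imagine 1,024+ tokens

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},
        {"role": "user", "content": "Review this function for bugs."},
    ],
)

# 0 on the first call; non-zero on repeat calls within the cache's lifetime.
print(response.usage.prompt_tokens_details.cached_tokens)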

Anthropic

With Anthropic, you must explicitly mark the content you want cached using cache_control; at the time of writing, the feature is exposed through a beta endpoint.

import anthropic

client = anthropic.Anthropic()

response = client.beta.prompt_caching.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
      {
        "type": "text", 
        "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
      },
      {
        "type": "text",
        # Mark this large, static block as cacheable ("ephemeral" = 5-minute TTL)
        "text": "<the entire contents of 'Pride and Prejudice'>",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    messages=[{"role": "user", "content": "Analyze the major themes in 'Pride and Prejudice'."}],
)
print(response)
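
Whether the request wrote to or read from the cache can be checked in the response's usage fields (continuing the example above; these field names come from the prompt caching beta):

# Tokens written to the cache on this request (non-zero on the first call)
print(response.usage.cache_creation_input_tokens)
# Tokens served from the cache (non-zero on repeat calls within the 5-minute TTL)
print(response.usage.cache_read_input_tokens)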

Minimum tokens for cache usage:

  • Claude 3.5 Sonnet and Claude 3 Opus: 1,024 tokens
  • Claude 3 Haiku: 2,048 tokens

Gemini

For Gemini, you must first create a cache using CachedContent.create, and then specify it when defining the model.

import os
import google.generativeai as genai
from google.generativeai import caching
import datetime
import time

# Get your API key from https://aistudio.google.com/app/apikey
genai.configure(api_key=os.environ['API_KEY'])

# Download video file
# curl -O https://storage.googleapis.com/generativeai-downloads/data/Sherlock_Jr_FullMovie.mp4

path_to_video_file = 'Sherlock_Jr_FullMovie.mp4'

# Upload the video using the Files API
video_file = genai.upload_file(path=path_to_video_file)

# Wait for the file to finish processing
while video_file.state.name == 'PROCESSING':
  print('Waiting for video to be processed.')
  time.sleep(2)
  video_file = genai.get_file(video_file.name)

print(f'Video processing complete: {video_file.uri}')

# Create a cache with a 5-minute TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='sherlock jr movie',
    system_instruction=(
        'You are an expert video analyzer, and your job is to answer '
        'the user\'s query based on the video file you have access to.'
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=5),
)

# Construct a GenerativeModel which uses the created cache.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query the model
response = model.generate_content([(
    'Introduce different characters in the movie by describing '
    'their personality, looks, and names. Also list the timestamps '
    'they were introduced for the first time.')])

print(response.usage_metadata)

# The output should look something like this:
#
# prompt_token_count: 696219
# cached_content_token_count: 696190
# candidates_token_count: 214
# total_token_count: 696433

print(response.text)

The minimum token count for cache usage in Gemini is 32,768.
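
A cache can also be managed after creation. Continuing the example above, a minimal sketch using the same caching module (check the current SDK documentation for the exact methods and fields):

# Extend the TTL of the cache created above to two hours
cache.update(ttl=datetime.timedelta(hours=2))

# List all caches in the project
for c in caching.CachedContent.list():
    print(c.display_name, c.expire_time)

# Delete the cache when it is no longer needed, stopping storage charges
cache.delete()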

Best Practices for Cache Usage

Static content used for caching should be placed at the beginning of the prompt to maximize cache hit rates, as cache searches start from the beginning of the prompt.

In Anthropic’s case, cache entries must be added explicitly, and with a short 5-minute TTL it is best to cache frequently reused elements such as system instructions, tool definitions, and RAG context.
Gemini offers longer TTLs, but since cache storage itself is billed, it is best suited to large content such as code repositories, long videos, or extensive documents.
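
As a concrete illustration of the "static content first" rule, a request can be structured so the unchanging parts form a shared prefix and only the tail varies (the variable contents below are placeholders):

# Static, reusable parts first so consecutive requests share the same prefix
static_instructions = "You are a support assistant for ACME Corp. ..."  # long, unchanging
rag_context = "<retrieved documents>"                                   # changes rarely
user_question = "How do I reset my password?"                           # changes every request

messages = [
    {"role": "system", "content": static_instructions + "\n" + rag_context},
    {"role": "user", "content": user_question},
]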

Thank you for reading. I hope this article was helpful. If you notice any inaccuracies, feel free to reach out.
