Semantic Cache
Semantic caching stores LLM responses indexed by the semantic meaning of the input. When a similar query arrives, the cached response is returned without calling the LLM — reducing costs and latency.
Quick Start
import { Agent, openai, InMemoryVectorStore, OpenAIEmbedding } from "@agentium/core";

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new InMemoryVectorStore(new OpenAIEmbedding()),
    embedding: new OpenAIEmbedding(),
    similarityThreshold: 0.92,
    scope: "agent",
  },
});

// First call: LLM call, result cached
await agent.run("What is the capital of France?");

// Second call: returns from cache (no LLM call)
await agent.run("What's the capital of France?");
Configuration
interface SemanticCacheConfig {
  vectorStore: VectorStore;                // Any vector store backend
  embedding: EmbeddingProvider;            // Embedding model for similarity
  similarityThreshold?: number;            // 0-1, default 0.92
  ttl?: number;                            // Cache expiry in ms
  collection?: string;                     // Vector collection name
  scope?: "global" | "agent" | "session";
}
Scope
| Scope | Behavior |
|---|---|
| global | All agents share one cache |
| agent | Each agent has its own cache partition |
| session | Each session has its own cache partition |
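One way to picture the three scopes is as a partition key that narrows which cached entries a lookup can see. The sketch below is purely illustrative: `partitionKey`, `ScopeContext`, and the key format are hypothetical names invented for this example, not the library's documented internals.

```typescript
// Hypothetical illustration only: the library's real partitioning scheme
// is not documented on this page.
type CacheScope = "global" | "agent" | "session";

interface ScopeContext {
  agentName: string;
  sessionId?: string;
}

// Map a scope to the key that would isolate its cache entries.
function partitionKey(scope: CacheScope, ctx: ScopeContext): string {
  switch (scope) {
    case "global":
      // Every agent reads and writes the same partition.
      return "global";
    case "agent":
      // One partition per agent name.
      return `agent:${ctx.agentName}`;
    case "session":
      // One partition per session; a session id is required here.
      if (!ctx.sessionId) {
        throw new Error("session scope requires a sessionId");
      }
      return `session:${ctx.sessionId}`;
  }
}

console.log(partitionKey("global", { agentName: "assistant" }));
console.log(partitionKey("agent", { agentName: "assistant" }));
console.log(partitionKey("session", { agentName: "assistant", sessionId: "s-1" }));
```

Under this model, a narrower scope trades hit rate for isolation: two sessions asking the same question would each pay for one LLM call.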
How It Works
- Before calling the LLM, the input is embedded and searched against the vector store
- If a result exceeds the similarityThreshold, it’s returned as a cache hit
- Output guardrails still run on cached responses
- After an LLM call, the input + output are stored in the vector store (fire-and-forget)
- TTL is enforced on lookup — expired entries are evicted lazily
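The steps above can be sketched as a small in-memory cache. This is a minimal sketch under stated assumptions, not the library's implementation: `TinySemanticCache`, `CacheEntry`, and `cosineSimilarity` are hypothetical names, the vectors are hand-written stand-ins for real embeddings, timestamps are passed in explicitly to keep the example deterministic, and the inclusive `>=` comparison against the threshold is an assumption.

```typescript
// Hypothetical sketch: hand-written vectors stand in for real embeddings.
interface CacheEntry {
  id: string;
  vector: number[];
  output: string;
  storedAt: number; // timestamp in ms
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class TinySemanticCache {
  private entries: CacheEntry[] = [];
  private nextId = 1;
  private readonly similarityThreshold: number;
  private readonly ttl: number | undefined;

  constructor(similarityThreshold = 0.92, ttl?: number) {
    this.similarityThreshold = similarityThreshold;
    this.ttl = ttl;
  }

  // Called after an LLM call: remember the input vector and the output.
  store(vector: number[], output: string, now: number): string {
    const id = `entry-${this.nextId++}`;
    this.entries.push({ id, vector, output, storedAt: now });
    return id;
  }

  // Called before an LLM call: evict expired entries, then find the best match.
  lookup(vector: number[], now: number): CacheEntry | undefined {
    const ttl = this.ttl;
    if (ttl !== undefined) {
      // Lazy eviction: expired entries are only dropped when a lookup happens.
      this.entries = this.entries.filter((e) => now - e.storedAt <= ttl);
    }
    let best: CacheEntry | undefined;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(vector, entry.vector);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    // Assumption: a score equal to the threshold counts as a hit.
    return best !== undefined && bestScore >= this.similarityThreshold ? best : undefined;
  }

  get size(): number {
    return this.entries.length;
  }
}

const cache = new TinySemanticCache(0.92, 60_000);
cache.store([1, 0, 0], "Paris", 0);

const hit = cache.lookup([0.99, 0.1, 0], 1_000); // near-identical direction
const miss = cache.lookup([0, 1, 0], 1_000);     // orthogonal direction
const expired = cache.lookup([1, 0, 0], 61_000); // past the 60 s TTL

console.log(hit?.output, miss, expired, cache.size);
```

The third lookup misses even though the vector is identical, because the entry is evicted at lookup time once its age exceeds the TTL.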
Events
| Event | Payload |
|---|---|
| cache.hit | { agentName, input, cachedId } |
| cache.miss | { agentName, input } |
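These two events are enough to measure a hit rate. The following is a minimal sketch: `CacheStats` is a hypothetical helper, only the event names and payload field names come from the table above, and the payload field types (in particular `cachedId` as a string) are assumptions. The calls are simulated so the example runs without an agent.

```typescript
// Hypothetical helper. Payload field types are assumed, not documented.
interface CacheHitPayload {
  agentName: string;
  input: string;
  cachedId: string;
}

interface CacheMissPayload {
  agentName: string;
  input: string;
}

class CacheStats {
  private hits = 0;
  private misses = 0;

  recordHit(_payload: CacheHitPayload): void {
    this.hits++;
  }

  recordMiss(_payload: CacheMissPayload): void {
    this.misses++;
  }

  // Fraction of lookups served from the cache (0 when nothing was recorded).
  get hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

const stats = new CacheStats();

// With a real agent, the handlers would be wired up as:
//   agent.on("cache.hit", (payload) => stats.recordHit(payload));
//   agent.on("cache.miss", (payload) => stats.recordMiss(payload));
// Here the events are simulated by hand.
stats.recordMiss({ agentName: "assistant", input: "What is the capital of France?" });
stats.recordHit({ agentName: "assistant", input: "What's France's capital city?", cachedId: "example-id" });
stats.recordMiss({ agentName: "assistant", input: "What is the population of France?" });

console.log(stats.hitRate.toFixed(2)); // 0.33
```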
Supported Backends
Any VectorStore implementation works: InMemoryVectorStore, QdrantVectorStore, MongoDBVectorStore, PgVectorStore.
Backend Examples
InMemory (Development)
import { Agent, openai, InMemoryVectorStore, OpenAIEmbedding } from "@agentium/core";

const embedding = new OpenAIEmbedding();

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new InMemoryVectorStore(embedding),
    embedding,
    similarityThreshold: 0.92,
  },
});
Fast and zero-config. The cache lives in process memory and is lost when the process restarts, so this backend is best suited to development and testing.
Qdrant (Production)
import { Agent, openai, QdrantVectorStore, OpenAIEmbedding } from "@agentium/core";

const embedding = new OpenAIEmbedding();

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new QdrantVectorStore({
      url: "http://localhost:6333",
      collection: "semantic_cache",
      embedding,
    }),
    embedding,
    similarityThreshold: 0.90,
    ttl: 3_600_000, // 1 hour
  },
});
PgVector (PostgreSQL)
import { Agent, openai, PgVectorStore, OpenAIEmbedding } from "@agentium/core";

const embedding = new OpenAIEmbedding();

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new PgVectorStore({
      connectionString: "postgresql://localhost:5432/myapp",
      table: "semantic_cache",
      embedding,
    }),
    embedding,
    similarityThreshold: 0.92,
  },
});
Cache Hit vs Miss Behavior
import { Agent, openai, InMemoryVectorStore, OpenAIEmbedding } from "@agentium/core";

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new InMemoryVectorStore(new OpenAIEmbedding()),
    embedding: new OpenAIEmbedding(),
    similarityThreshold: 0.92,
    ttl: 60_000, // 1 minute
  },
});

// Listen to cache events
agent.on("cache.hit", ({ input, cachedId }) => {
  console.log(`Cache HIT for: "${input}" (id: ${cachedId})`);
});
agent.on("cache.miss", ({ input }) => {
  console.log(`Cache MISS for: "${input}"`);
});

// First call: MISS. Calls the LLM and stores the result
await agent.run("What is the capital of France?");
// → Cache MISS for: "What is the capital of France?"

// Semantically similar: HIT. Returns the cached result (no LLM call)
await agent.run("What's France's capital city?");
// → Cache HIT for: "What's France's capital city?"

// Different enough: MISS
await agent.run("What is the population of France?");
// → Cache MISS for: "What is the population of France?"

// After TTL expires: MISS again
// (wait 60 seconds...)
await agent.run("What is the capital of France?");
// → Cache MISS for: "What is the capital of France?"
Tuning similarityThreshold
| Threshold | Behavior |
|---|---|
| 0.98+ | Nearly exact matches only |
| 0.92-0.95 | Good default; catches rephrasings |
| 0.85-0.90 | Aggressive caching; may return irrelevant results |
| < 0.85 | Not recommended; too many false matches |
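Start with 0.92 and adjust based on your observed cache hit rate and the quality of the cached answers: raise the threshold if hits return wrong answers, and lower it if obvious rephrasings miss.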
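One way to ground that adjustment is to replay a small labeled sample against several candidate thresholds. The sketch below assumes you have logged, for each query, the similarity score of its nearest cached entry and a manual judgment of whether the cached answer would have been acceptable. `ScoredPair`, `evaluateThreshold`, and the sample values are all hypothetical and made up for illustration; they are not measurements.

```typescript
// Hypothetical evaluation helper; the sample data below is invented.
interface ScoredPair {
  similarity: number;  // score of the nearest cached entry
  sameIntent: boolean; // manual label: would the cached answer be acceptable?
}

interface ThresholdReport {
  threshold: number;
  hitRate: number;      // share of queries served from cache
  falseHitRate: number; // share of hits that returned an unacceptable answer
}

function evaluateThreshold(pairs: ScoredPair[], threshold: number): ThresholdReport {
  let hits = 0;
  let falseHits = 0;
  for (const pair of pairs) {
    if (pair.similarity >= threshold) {
      hits++;
      if (!pair.sameIntent) falseHits++;
    }
  }
  return {
    threshold,
    hitRate: pairs.length === 0 ? 0 : hits / pairs.length,
    falseHitRate: hits === 0 ? 0 : falseHits / hits,
  };
}

// Made-up sample: six queries with their nearest-neighbor scores and labels.
const samples: ScoredPair[] = [
  { similarity: 0.99, sameIntent: true },
  { similarity: 0.95, sameIntent: true },
  { similarity: 0.93, sameIntent: true },
  { similarity: 0.89, sameIntent: false },
  { similarity: 0.87, sameIntent: true },
  { similarity: 0.80, sameIntent: false },
];

for (const threshold of [0.98, 0.92, 0.85]) {
  console.log(evaluateThreshold(samples, threshold));
}
```

On this toy sample, lowering the threshold from 0.92 to 0.85 raises the hit rate from 3 of 6 to 5 of 6 but lets one wrong answer through, which is the trade-off the table describes.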
Cross-References
- Tool Caching — Cache individual tool results (different from semantic cache)
- Cost Tracking — Semantic cache reduces LLM costs; track savings with CostTracker