Engineering
February 12, 2026
9 min read

Optimizing Cereby AI: From 5-8 Seconds to Sub-Second Responses

How we transformed our context-aware learning assistant with multi-layered caching, intelligent context compression, and query optimization

Introduction

When we first launched Cereby AI, we had built a powerful, context-aware learning assistant with a sophisticated plugin-based architecture. However, as users began interacting with the system, we quickly discovered a critical performance bottleneck: each request was taking 5-8 seconds, with context aggregation alone consuming 2-3 seconds.

The problem wasn't just slow response times — it was that our beautifully architected system was being held back by inefficient data access patterns and redundant AI API calls. Users noticed the lag, especially when asking follow-up questions or requesting multiple capabilities in quick succession.

After implementing a comprehensive performance optimization strategy, we've achieved:

  • 60-70% faster context loading with cache hits (from 2-3s to 200-400ms)
  • 30-50% reduction in AI token usage through intelligent context compression
  • 60-80% reduction in database queries per request
  • 40-60% faster overall response times for cached requests

This post details the technical optimizations that transformed Cereby AI from a slow, expensive system into a fast, cost-effective learning assistant.

The Performance Challenge: Context Aggregation at Scale

The Initial Architecture Flow

Cereby AI's Context Aggregator component queries multiple data sources to build a comprehensive user context:

User Request → CerebyAIController
  ↓
ContextAggregator (5-7 sequential DB queries)
  ├── Quiz Performance (800ms)
  ├── Learning Paths (600ms)
  ├── Calendar Events (400ms)
  ├── Notes Summary (500ms)
  └── Weak Points Calculation (600ms)
  ↓
IntentClassifier (AI API call: 800-1200ms)
  ↓
ToolOrchestrator → Tool Handler (AI API call: 2000-4000ms)
  ↓
Response (100ms)
─────────────────────────────────────────────
Total: 5-8 seconds per request

The Core Bottlenecks:
  1. Sequential Database Queries — Each context source was queried independently
  2. No Caching Layer — Every request triggered full context aggregation
  3. Redundant AI Calls — Similar intents were classified repeatedly without caching
  4. Oversized Context Payloads — Sending entire user history to AI for every request
  5. Inefficient Query Patterns — N+1 queries and missing database indexes

User Impact

The performance issues had real consequences: students faced 5-8 second wait times, while we faced high API costs and a steadily increasing database load.

Solution Architecture: Multi-Layered Caching Strategy

We implemented a three-tier caching architecture:

Request Flow with Caching:
┌─────────────────────────────────────────┐
│ 1. In-Memory Cache                      │ ← Sub-millisecond access
│    - Context cache (2 min TTL)          │
│    - Intent classification cache         │
└─────────────────────────────────────────┘
           ↓ (cache miss)
┌─────────────────────────────────────────┐
│ 2. Database Cache                        │ ← 50-100ms access
│    - Persistent across restarts          │
│    - 5-minute TTL                        │
│    - Structured storage                  │
└─────────────────────────────────────────┘
           ↓ (cache miss)
┌─────────────────────────────────────────┐
│ 3. Context Aggregation (Parallel)       │ ← 200-400ms
│    - Parallel queries                    │
│    - Optimized with indexes              │
│    - Query limits                        │
└─────────────────────────────────────────┘

Tier 1: In-Memory Cache

We built a custom cache manager that provides ultra-fast in-memory caching with:

  • Automatic expiration and cleanup
  • LRU eviction when cache is full
  • Pattern-based invalidation for bulk operations
  • Sub-millisecond access times
Cache Keys:
  • Context cache per user (2 min TTL)
  • Intent classifications per user/message (5 min TTL)
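To make the design concrete, here's a minimal sketch of such a cache manager in TypeScript. The class name, size limit, and TTL values are illustrative, not our production implementation:

```typescript
// Minimal in-memory cache with TTL expiry and LRU eviction.
// Relies on Map preserving insertion order to track recency.
type Entry<V> = { value: V; expiresAt: number };

class CacheManager<V> {
  private store = new Map<string, Entry<V>>();

  constructor(private maxEntries = 1000, private ttlMs = 120_000) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazy expiration on read
      return undefined;
    }
    // Re-insert to mark as most recently used
    this.store.delete(key);
    this.store.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.store.size >= this.maxEntries && !this.store.has(key)) {
      // Evict the least recently used entry (first key in insertion order)
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  /** Invalidate every key matching a prefix, e.g. all entries for one user. */
  invalidatePrefix(prefix: string): void {
    for (const key of this.store.keys()) {
      if (key.startsWith(prefix)) this.store.delete(key);
    }
  }
}
```

Using `Map` insertion order for LRU keeps both `get` and `set` effectively O(1) without a separate linked list.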

Tier 2: Database Cache (Persistent)

For persistence across server restarts and multi-instance deployments, we leverage database caching:

Design Decisions:
  • 5-minute TTL (balance between freshness and performance)
  • Flexible storage for schema evolution
  • Automatic expiration with background cleanup
  • Dual caching (in-memory + database) provides the best of both worlds
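A read-through lookup across the tiers might look like the sketch below; `dbCacheGet`, `dbCacheSet`, and `aggregateContext` are hypothetical stand-ins for the real storage and aggregation calls:

```typescript
// Read-through lookup: in-memory tier, then database tier, then full
// aggregation, populating the faster tiers on the way back.
type Context = Record<string, unknown>;

const memoryCache = new Map<string, { value: Context; expiresAt: number }>();
const MEMORY_TTL_MS = 2 * 60_000; // 2-minute in-memory TTL
const DB_TTL_MS = 5 * 60_000;     // 5-minute database TTL

async function getContext(
  userId: string,
  dbCacheGet: (key: string) => Promise<Context | null>,
  dbCacheSet: (key: string, value: Context, ttlMs: number) => Promise<void>,
  aggregateContext: (userId: string) => Promise<Context>,
): Promise<Context> {
  const key = `context:${userId}`;

  // Tier 1: in-memory (sub-millisecond)
  const hit = memoryCache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value;

  // Tier 2: database cache (50-100ms), survives restarts
  const cached = await dbCacheGet(key);
  if (cached) {
    memoryCache.set(key, { value: cached, expiresAt: Date.now() + MEMORY_TTL_MS });
    return cached;
  }

  // Tier 3: full parallel aggregation (200-400ms), then fill both tiers
  const fresh = await aggregateContext(userId);
  await dbCacheSet(key, fresh, DB_TTL_MS);
  memoryCache.set(key, { value: fresh, expiresAt: Date.now() + MEMORY_TTL_MS });
  return fresh;
}
```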

Tier 3: Intent Classification Caching

Users often ask similar questions or make follow-up requests. Caching intent classifications reduced redundant AI calls significantly.

Impact:
  • 30-40% reduction in intent classification API calls
  • Faster response times for similar queries
  • Lower API costs, since each skipped classification call saves both money and latency
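One way to key such a cache is to hash the user ID together with a normalized form of the message, so near-identical phrasings share an entry. The normalization rules below are illustrative, not our exact ones:

```typescript
import { createHash } from "node:crypto";

// Build a cache key for intent classification: lowercase the message,
// strip punctuation, and collapse whitespace before hashing, so that
// "What's my weak point?" and "whats my weak point" hit the same entry.
function intentCacheKey(userId: string, message: string): string {
  const normalized = message
    .toLowerCase()
    .replace(/[^\w\s]/g, "") // drop punctuation
    .replace(/\s+/g, " ")    // collapse whitespace
    .trim();
  const digest = createHash("sha256")
    .update(`${userId}:${normalized}`)
    .digest("hex");
  // Prefix with the user id so per-user invalidation stays a prefix scan
  return `intent:${userId}:${digest.slice(0, 16)}`;
}
```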

Context Compression: Reducing Token Usage by 50%

One of our biggest wins was implementing intelligent context compression. Initially, we sent the entire user context (hundreds of quiz attempts, all learning paths, all events) to the AI for every request.

The Compression Strategy

We built a context compression system that intelligently filters context based on the user's intent:

The Process:
  1. Extract relevant subjects/topics from intent
  2. Filter weak points to only relevant ones
  3. Filter quiz history to relevant subject/topics
  4. Filter learning paths to relevant ones
  5. Filter events to upcoming and relevant ones
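The steps above can be sketched as a pure filtering function. The field names (`weakPoints`, `quizHistory`, severity scores) are assumed shapes, not our exact schema, and learning paths and events would be filtered the same way:

```typescript
// Compress a user context down to what the current intent actually needs:
// subject/topic relevance, recency ordering, and a top-N cap per category.
interface Intent { subjects: string[]; topics: string[] }
interface QuizAttempt { subject: string; topic: string; completedAt: number }
interface WeakPoint { subject: string; severity: number }
interface UserContext {
  weakPoints: WeakPoint[];
  quizHistory: QuizAttempt[];
}

function compressContext(ctx: UserContext, intent: Intent, topN = 10): UserContext {
  const subjects = new Set(intent.subjects);
  const topics = new Set(intent.topics);
  return {
    // Keep only weak points in the subjects the request is about,
    // highest severity first
    weakPoints: ctx.weakPoints
      .filter((w) => subjects.has(w.subject))
      .sort((a, b) => b.severity - a.severity)
      .slice(0, topN),
    // Keep only relevant quiz attempts, most recent first
    quizHistory: ctx.quizHistory
      .filter((q) => subjects.has(q.subject) || topics.has(q.topic))
      .sort((a, b) => b.completedAt - a.completedAt)
      .slice(0, topN),
  };
}
```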

Results

Before Compression:
  • Average context size: ~15,000 tokens
  • Cost per request: ~$0.15-0.25
  • Processing time: 2-3 seconds
  • Most context irrelevant to specific request
After Compression:
  • Average context size: ~5,000-8,000 tokens (50% reduction)
  • Cost per request: ~$0.08-0.15 (40% reduction)
  • Processing time: 1-1.5 seconds (50% faster)
  • Only relevant context sent to AI

Smart Filtering Logic

The compressor uses several strategies:

  1. Subject Relevance — If intent mentions "calculus", only include calculus-related data
  2. Topic Matching — Match topics mentioned in the request
  3. Recency Priority — Keep most recent quiz attempts and events
  4. Severity Filtering — Prioritize high-severity weak points
  5. Top-N Selection — Limit to top N items per category

Query Optimization: From Sequential to Parallel

Parallel Fetching

We optimized queries to run in parallel:

Before: Sequential queries (slow) — total time = sum of all query times
After: Parallel queries (fast) — total time = the slowest single query
Result: 65% faster context aggregation
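In miniature, the change is swapping sequential `await`s for a single `Promise.all`, so wall time becomes the slowest query rather than the sum. The fetcher signatures below are hypothetical stand-ins for the real repository calls:

```typescript
// Kick off all five context queries at once and await them together.
interface AggregatedContext {
  quizPerformance: unknown;
  learningPaths: unknown;
  calendarEvents: unknown;
  notesSummary: unknown;
  weakPoints: unknown;
}

type Fetcher = (userId: string) => Promise<unknown>;

async function aggregateContext(
  userId: string,
  fetchers: {
    quizPerformance: Fetcher;
    learningPaths: Fetcher;
    calendarEvents: Fetcher;
    notesSummary: Fetcher;
    weakPoints: Fetcher;
  },
): Promise<AggregatedContext> {
  // All five queries start immediately; one await covers them all
  const [quizPerformance, learningPaths, calendarEvents, notesSummary, weakPoints] =
    await Promise.all([
      fetchers.quizPerformance(userId),
      fetchers.learningPaths(userId),
      fetchers.calendarEvents(userId),
      fetchers.notesSummary(userId),
      fetchers.weakPoints(userId),
    ]);
  return { quizPerformance, learningPaths, calendarEvents, notesSummary, weakPoints };
}
```

With the old timings (800 + 600 + 400 + 500 + 600 ms), the parallel version is bounded by the slowest query (~800 ms) instead of the ~2.9 s sum.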

Query Limits and Intelligent Filtering

We added intelligent limits to prevent excessive data loading:

  • Quiz history: Limit to most recent
  • Calendar events: Only upcoming events
  • Notes: Most recent
  • Learning paths: Active paths only
Rationale:
  • Most recent data is most relevant
  • Older data has less impact
  • Limits prevent context bloat

Database Indexes

We ensured all frequently queried columns were indexed:

  • User + completion date indexes
  • User + event date indexes
  • User + expiration indexes
  • User + subject indexes
Impact:
  • 40-60% faster database queries
  • Reduced database load
  • Better scalability

Frontend Optimization: React Component Performance

On the frontend, we optimized React components to reduce unnecessary re-renders:

Memoization Strategy

  • Memoize expensive computations
  • Memoize event handlers
  • Prevent recreation on every render

Benefits

  • Reduced re-renders
  • Smoother UI
  • Better performance on slower devices
  • Improved perceived performance

Performance Metrics: Before and After

Before Optimization

| Metric | Value |
| --- | --- |
| Average response time | 5-8 seconds |
| Context loading time | 2-3 seconds |
| Database queries per request | 5-7 |
| Token usage per request | ~15,000 tokens |
| Cache hit rate | 0% |
| Cost per request | $0.15-0.25 |

After Optimization

| Metric | Value | Improvement |
| --- | --- | --- |
| Average response time (cache hit) | 1-2 seconds | 60-75% faster |
| Context loading time (cache hit) | 200-400ms | 85-90% faster |
| Database queries per request (cache hit) | 0-1 | 85-100% reduction |
| Token usage per request | ~5,000-8,000 tokens | 40-50% reduction |
| Cache hit rate | 60-70% | New capability |
| Cost per request (cache hit) | $0.05-0.10 | 50-60% reduction |

Real-World Impact

User Experience:
  • Users report "instant" responses for follow-up questions
  • Reduced perceived latency
  • Smoother chat interactions
  • Higher engagement
Infrastructure:
  • 60% reduction in database load
  • 40% reduction in AI API costs
  • Better scalability
  • Lower infrastructure costs per user

Implementation Details: Cache Management

Cache Invalidation Strategy

We implemented smart cache invalidation that triggers when user data changes:

When to Invalidate:
  • User completes a quiz
  • User creates/updates a learning path
  • User adds/updates calendar events
  • User creates new notes
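A sketch of this event-driven invalidation, with illustrative event names and key prefixes:

```typescript
// Map data-changing events to the cache entries they make stale.
// The context entry depends on all four sources, so any event clears it;
// quiz completions additionally clear cached intent classifications,
// since the user's weak points may have shifted.
type UserEvent =
  | "quiz.completed"
  | "learningPath.updated"
  | "calendar.updated"
  | "note.created";

function invalidateOnEvent(
  cache: Map<string, unknown>,
  userId: string,
  event: UserEvent,
): void {
  cache.delete(`context:${userId}`);
  if (event === "quiz.completed") {
    // Intent cache keys are prefixed per user; clear them all
    for (const key of cache.keys()) {
      if (key.startsWith(`intent:${userId}:`)) cache.delete(key);
    }
  }
}
```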

Cache Warming

For frequently accessed users, we pre-warm the cache to ensure fast responses for active users.

Monitoring and Metrics

We track comprehensive cache performance metrics:

  • Hit rate (% of requests served from cache)
  • Miss rate (% requiring full aggregation)
  • Average access time
  • Cache size
  • Eviction count
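A minimal hit/miss counter of this kind might look like the following; in production these numbers would presumably be exported to a metrics backend rather than read in-process:

```typescript
// Track cache hits, misses, and evictions, and derive the hit rate.
class CacheMetrics {
  private hits = 0;
  private misses = 0;
  private evictions = 0;

  recordHit(): void { this.hits++; }
  recordMiss(): void { this.misses++; }
  recordEviction(): void { this.evictions++; }

  /** Fraction of lookups served from cache; 0 when no lookups yet. */
  hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }

  snapshot() {
    return {
      hits: this.hits,
      misses: this.misses,
      evictions: this.evictions,
      hitRate: this.hitRate(),
    };
  }
}
```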

Lessons Learned

1. Cache Early, Cache Often

Building caching from day one would have saved significant refactoring time. However, the modular architecture made it relatively straightforward to add caching later.

2. Multi-Layer Caching is Essential

A single caching layer isn't enough. In-memory cache provides speed, database cache provides persistence, and intent caching reduces AI calls.

3. Context Compression is a Game-Changer

Reducing token usage by 40-50% through intelligent filtering had massive impact on both cost and performance.

4. Measure Everything

Comprehensive metrics from day one allowed us to:

  • Identify bottlenecks quickly
  • Measure improvement impact accurately
  • Make data-driven optimization decisions
  • Track cost savings over time

5. User Experience > Technical Metrics

While we improved technical metrics significantly, the real win was user experience. Users noticed the speed improvements immediately.

6. Integration with Existing Architecture

The performance optimizations integrated seamlessly with our existing plugin-based architecture, validating our design decision to build a modular, extensible system.

Future Optimizations

We're exploring additional optimizations:

1. Streaming Responses

For long-form content generation, implement streaming for better perceived performance.

2. Predictive Caching

Use machine learning to predict likely requests and pre-cache.

3. Edge Caching

Cache at CDN edge locations for global users.

4. Request Deduplication

Detect duplicate requests and return cached response.

5. Database Connection Pooling

Optimize connection management for better resource utilization.

Conclusion

The performance optimizations we implemented transformed Cereby AI from a slow, expensive system into a fast, cost-effective learning assistant. The key was a multi-layered approach:

  1. In-memory caching for speed
  2. Database caching for persistence
  3. Context compression for efficiency
  4. Query optimization for database performance
  5. Component optimization for UI responsiveness

These optimizations resulted in:

  • 60-70% faster response times
  • 40-50% reduction in costs
  • 60-80% reduction in database load
  • Dramatically improved user experience

For teams building similar AI systems, our key takeaway is: invest in caching and optimization from the start, but design your architecture to make it easy to add later.


Want to learn more about Cereby AI's architecture? Check out our Building Cereby AI post or reach out on Twitter.

Visual Summary

flowchart TD
    A[Incoming Request] --> B[Cache Lookup]
    B -->|Hit| C[Fast Response Path]
    B -->|Miss| D[Parallel Data Fetch]
    D --> E[Optimized Processing]
    E --> F[Store Computed Cache]
    F --> G[Return Response]
    C --> G