Optimizing Cereby AI: From 5-8 Seconds to Sub-Second Responses
Introduction
When we first launched Cereby AI, we had built a powerful, context-aware learning assistant with a sophisticated plugin-based architecture. However, as users began interacting with the system, we quickly discovered a critical performance bottleneck: each request was taking 5-8 seconds, with context aggregation alone consuming 2-3 seconds.
The problem wasn't just slow response times — it was that our beautifully architected system was being held back by inefficient data access patterns and redundant AI API calls. Users noticed the lag, especially when asking follow-up questions or requesting multiple capabilities in quick succession.
After implementing a comprehensive performance optimization strategy, we've achieved:
- 60-70% faster context loading with cache hits (from 2-3s to 200-400ms)
- 30-50% reduction in AI token usage through intelligent context compression
- 60-80% reduction in database queries per request
- 40-60% faster overall response times for cached requests
This post details the technical optimizations that transformed Cereby AI from a slow, expensive system into a fast, cost-effective learning assistant.
The Performance Challenge: Context Aggregation at Scale
The Initial Architecture Flow
Cereby AI's Context Aggregator component queries multiple data sources to build a comprehensive user context:
User Request → CerebyAIController
↓
ContextAggregator (5-7 sequential DB queries)
├── Quiz Performance (800ms)
├── Learning Paths (600ms)
├── Calendar Events (400ms)
├── Notes Summary (500ms)
└── Weak Points Calculation (600ms)
↓
IntentClassifier (AI API call: 800-1200ms)
↓
ToolOrchestrator → Tool Handler (AI API call: 2000-4000ms)
↓
Response (100ms)
─────────────────────────────────────────────
Total: 5-8 seconds per request
The Core Bottlenecks:
- Sequential Database Queries — Each context source was queried independently
- No Caching Layer — Every request triggered full context aggregation
- Redundant AI Calls — Similar intents were classified repeatedly without caching
- Oversized Context Payloads — Sending entire user history to AI for every request
- Inefficient Query Patterns — N+1 queries and missing database indexes
User Impact
The performance issues had real consequences: students faced 5-8 second wait times, while we faced high API costs and mounting database load.
Solution Architecture: Multi-Layered Caching Strategy
We implemented a three-tier caching architecture:
Request Flow with Caching:
┌─────────────────────────────────────────┐
│ 1. In-Memory Cache │ ← Sub-millisecond access
│ - Context cache (2 min TTL) │
│ - Intent classification cache │
└─────────────────────────────────────────┘
↓ (cache miss)
┌─────────────────────────────────────────┐
│ 2. Database Cache │ ← 50-100ms access
│ - Persistent across restarts │
│ - 5-minute TTL │
│ - Structured storage │
└─────────────────────────────────────────┘
↓ (cache miss)
┌─────────────────────────────────────────┐
│ 3. Context Aggregation (Parallel) │ ← 200-400ms
│ - Parallel queries │
│ - Optimized with indexes │
│ - Query limits │
└─────────────────────────────────────────┘
Tier 1: In-Memory Cache
We built a custom cache manager that provides ultra-fast in-memory caching with:
- Context cache per user (2 min TTL)
- Intent classifications per user/message (5 min TTL)
- Sub-millisecond access times for cache hits
- Automatic expiration and background cleanup
- Pattern-based invalidation for bulk operations
- Memory-efficient LRU eviction when the cache is full
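A minimal sketch of such a cache manager is below. The class and method names are illustrative assumptions, not Cereby AI's actual API; it leans on the fact that a JavaScript `Map` preserves insertion order, which gives a simple LRU ordering.

```typescript
// Minimal in-memory cache with TTL and LRU eviction (illustrative sketch;
// class and method names are assumptions, not Cereby AI's actual API).
class InMemoryCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries = 1000, private defaultTtlMs = 120_000) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // expired: drop lazily on read
      return undefined;
    }
    // Re-insert to mark as most recently used (Map preserves insertion order).
    this.store.delete(key);
    this.store.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V, ttlMs = this.defaultTtlMs): void {
    if (this.store.size >= this.maxEntries && !this.store.has(key)) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
  }

  // Pattern-based invalidation, e.g. invalidate("context:user42:*")
  invalidate(pattern: string): void {
    const prefix = pattern.replace(/\*$/, "");
    for (const key of this.store.keys()) {
      if (key.startsWith(prefix)) this.store.delete(key);
    }
  }
}
```

Keys like `context:<userId>` get the 2-minute TTL by default; intent entries can pass a longer TTL explicitly.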
Tier 2: Database Cache (Persistent)
For persistence across server restarts and multi-instance deployments, we leverage database caching:
Design Decisions:
- 5-minute TTL (a balance between freshness and performance)
- Flexible storage for schema evolution
- Automatic expiration with background cleanup
- Dual caching (in-memory + database) provides the best of both worlds
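The dual-tier lookup is a read-through pattern: check the fast tier, fall back to the persistent tier (promoting on a hit), and only then pay for full aggregation. A sketch, with tier interfaces that are assumptions rather than our actual types:

```typescript
// Read-through lookup across the two cache tiers, falling back to full
// aggregation on a miss. Tier interfaces are illustrative assumptions.
interface CacheTier<V> {
  get(key: string): Promise<V | undefined>;
  set(key: string, value: V, ttlMs: number): Promise<void>;
}

async function getUserContext<V>(
  key: string,
  memory: CacheTier<V>,       // tier 1: in-memory, ~2 min TTL
  database: CacheTier<V>,     // tier 2: persistent, ~5 min TTL
  aggregate: () => Promise<V> // tier 3: full parallel aggregation
): Promise<V> {
  const hot = await memory.get(key);
  if (hot !== undefined) return hot;      // sub-millisecond path

  const warm = await database.get(key);
  if (warm !== undefined) {
    await memory.set(key, warm, 120_000); // promote to tier 1
    return warm;                          // 50-100ms path
  }

  const fresh = await aggregate();        // 200-400ms path
  await database.set(key, fresh, 300_000);
  await memory.set(key, fresh, 120_000);
  return fresh;
}
```

Promotion on a tier-2 hit means a restarted instance warms its in-memory cache organically from the database tier.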
Tier 3: Intent Classification Caching
Users often ask similar questions or make follow-up requests. Caching intent classifications reduced redundant AI calls significantly.
Impact:
- 30-40% reduction in intent classification API calls
- Faster response times for similar queries
- Lower API costs from fewer classification calls
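For this cache to pay off, trivially different phrasings of the same request should hit the same entry. One way to do that (a sketch; the key scheme and normalization rules are assumptions, not our exact implementation) is to normalize the message before keying:

```typescript
// Sketch of an intent-classification cache key: normalize the message so
// casing, punctuation, and extra whitespace do not cause cache misses.
// The key scheme is an illustrative assumption.

// Simple non-cryptographic string hash, just to keep keys short.
function shortHash(s: string): string {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (Math.imul(31, h) + s.charCodeAt(i)) | 0;
  }
  return (h >>> 0).toString(16);
}

function intentCacheKey(userId: string, message: string): string {
  const normalized = message
    .toLowerCase()
    .replace(/[^\w\s]/g, "") // strip punctuation
    .replace(/\s+/g, " ")    // collapse whitespace
    .trim();
  return `intent:${userId}:${shortHash(normalized)}`;
}
```

With this scheme, "Quiz me on calculus!" and "quiz me on Calculus" share an entry, while different users never collide because the user ID is part of the key.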
Context Compression: Reducing Token Usage by 50%
One of our biggest wins was implementing intelligent context compression. Initially, we sent the entire user context (hundreds of quiz attempts, all learning paths, all events) to the AI for every request.
The Compression Strategy
We built a context compression system that intelligently filters context based on the user's intent:
The Process:
- Extract relevant subjects/topics from the intent
- Filter weak points to only relevant ones
- Filter quiz history to relevant subject/topics
- Filter learning paths to relevant ones
- Filter events to upcoming and relevant ones
Results
Before Compression:
- Average context size: ~15,000 tokens
- Cost per request: ~$0.15-0.25
- Processing time: 2-3 seconds
- Most context irrelevant to the specific request

After Compression:
- Average context size: ~5,000-8,000 tokens (50% reduction)
- Cost per request: ~$0.08-0.15 (40% reduction)
- Processing time: 1-1.5 seconds (50% faster)
- Only relevant context sent to the AI
Smart Filtering Logic
The compressor uses several strategies:
- Subject Relevance — If intent mentions "calculus", only include calculus-related data
- Topic Matching — Match topics mentioned in the request
- Recency Priority — Keep most recent quiz attempts and events
- Severity Filtering — Prioritize high-severity weak points
- Top-N Selection — Limit to top N items per category
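The strategies above combine naturally into a single filter pass. A sketch is below; the type shapes and limits are illustrative assumptions, not Cereby AI's actual data model:

```typescript
// Sketch of intent-driven context compression: keep only items matching the
// subjects in the classified intent, ordered by recency or severity and
// capped at top N. Type shapes are illustrative assumptions.
interface QuizAttempt { subject: string; completedAt: number; score: number; }
interface WeakPoint { subject: string; severity: number; }
interface UserContext { quizzes: QuizAttempt[]; weakPoints: WeakPoint[]; }

function compressContext(
  ctx: UserContext,
  intentSubjects: string[],
  topN = 10
): UserContext {
  const relevant = new Set(intentSubjects.map((s) => s.toLowerCase()));
  // Subject relevance: if the intent names subjects, drop everything else.
  const bySubject = <T extends { subject: string }>(items: T[]) =>
    relevant.size === 0
      ? items
      : items.filter((i) => relevant.has(i.subject.toLowerCase()));

  return {
    // Recency priority: newest attempts first, capped at topN.
    quizzes: bySubject(ctx.quizzes)
      .sort((a, b) => b.completedAt - a.completedAt)
      .slice(0, topN),
    // Severity filtering: highest-severity weak points first, capped at topN.
    weakPoints: bySubject(ctx.weakPoints)
      .sort((a, b) => b.severity - a.severity)
      .slice(0, topN),
  };
}
```

If the intent names no subject at all, the filter degrades gracefully to plain recency/severity ranking with the top-N cap still applied.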
Query Optimization: From Sequential to Parallel
Parallel Fetching
We optimized queries to run in parallel:
Before: Sequential queries (slow). Total time = sum of all query times.
After: Parallel queries (fast). Total time = max of all query times.
Result: 65% faster context aggregation.

Query Limits and Intelligent Filtering
We added intelligent limits to prevent excessive data loading:
- Quiz history: limit to the most recent attempts
- Calendar events: only upcoming events
- Notes: most recent only
- Learning paths: active paths only

Why this works:
- Most recent data is most relevant
- Older data has less impact
- Limits prevent context bloat
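Parallel fetching and per-source limits combine in one aggregation call. A sketch, with fetcher names and limit values that are illustrative assumptions:

```typescript
// Sketch of parallel context aggregation with per-source limits: total time
// becomes the slowest single query instead of the sum. Fetcher names and
// limit values are illustrative assumptions.
interface DataSources {
  fetchQuizzes(userId: string, limit: number): Promise<unknown[]>;
  fetchLearningPaths(userId: string, activeOnly: boolean): Promise<unknown[]>;
  fetchUpcomingEvents(userId: string, limit: number): Promise<unknown[]>;
  fetchRecentNotes(userId: string, limit: number): Promise<unknown[]>;
}

async function aggregateContext(userId: string, db: DataSources) {
  // Promise.all starts all four queries concurrently and awaits them together.
  const [quizzes, paths, events, notes] = await Promise.all([
    db.fetchQuizzes(userId, 20),         // most recent attempts only
    db.fetchLearningPaths(userId, true), // active paths only
    db.fetchUpcomingEvents(userId, 10),  // upcoming only
    db.fetchRecentNotes(userId, 5),      // most recent only
  ]);
  return { quizzes, paths, events, notes };
}
```

With the earlier sequential timings (800 + 600 + 400 + 500 + 600 ms ≈ 2.9 s), running the queries concurrently bounds the wall-clock cost by the slowest one, around 800 ms even before index and limit improvements.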
Database Indexes
We ensured all frequently queried columns were indexed:
- User + completion date indexes
- User + event date indexes
- User + expiration indexes
- User + subject indexes

Results:
- 40-60% faster database queries
- Reduced database load
- Better scalability
Frontend Optimization: React Component Performance
On the frontend, we optimized React components to reduce unnecessary re-renders:
Memoization Strategy
- Memoize expensive computations
- Memoize event handlers
- Prevent recreation on every render
Benefits
- Reduced re-renders
- Smoother UI
- Better performance on slower devices
- Improved perceived performance
Performance Metrics: Before and After
Before Optimization
| Metric | Value |
|---|---|
| Average response time | 5-8 seconds |
| Context loading time | 2-3 seconds |
| Database queries per request | 5-7 |
| Token usage per request | ~15,000 tokens |
| Cache hit rate | 0% |
| Cost per request | $0.15-0.25 |
After Optimization
| Metric | Value | Improvement |
|---|---|---|
| Average response time (cache hit) | 1-2 seconds | 60-75% faster |
| Context loading time (cache hit) | 200-400ms | 85-90% faster |
| Database queries per request (cache hit) | 0-1 | 85-100% reduction |
| Token usage per request | ~5,000-8,000 tokens | 40-50% reduction |
| Cache hit rate | 60-70% | New capability |
| Cost per request (cache hit) | $0.05-0.10 | 50-60% reduction |
Real-World Impact
User Experience:
- Users report "instant" responses for follow-up questions
- Reduced perceived latency
- Smoother chat interactions
- Higher engagement

System Impact:
- 60% reduction in database load
- 40% reduction in AI API costs
- Better scalability
- Lower infrastructure costs per user
Implementation Details: Cache Management
Cache Invalidation Strategy
We implemented smart cache invalidation that triggers when user data changes:
When to Invalidate:
- User completes a quiz
- User creates/updates a learning path
- User adds/updates calendar events
- User creates new notes
Cache Warming
For frequently accessed users, we pre-warm the cache to ensure fast responses for active users.
Monitoring and Metrics
We track comprehensive cache performance metrics:
- Hit rate (% of requests served from cache)
- Miss rate (% requiring full aggregation)
- Average access time
- Cache size
- Eviction count
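The bookkeeping behind those metrics is a handful of counters. A minimal sketch (field and method names are assumptions, not our actual metrics API):

```typescript
// Sketch of cache metrics tracking: hit/miss/eviction counters plus a
// running average access time. Names are illustrative assumptions.
class CacheMetrics {
  hits = 0;
  misses = 0;
  evictions = 0;
  private totalAccessMs = 0;

  recordHit(accessMs: number) { this.hits++; this.totalAccessMs += accessMs; }
  recordMiss(accessMs: number) { this.misses++; this.totalAccessMs += accessMs; }
  recordEviction() { this.evictions++; }

  // Fraction of requests served from cache.
  get hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }

  // Mean access time across hits and misses.
  get avgAccessMs(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.totalAccessMs / total;
  }
}
```

Watching hit rate and eviction count together is what tells you whether the cache is undersized (high evictions, falling hit rate) or the TTL is too short.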
Lessons Learned
1. Cache Early, Cache Often
Building caching from day one would have saved significant refactoring time. However, the modular architecture made it relatively straightforward to add caching later.
2. Multi-Layer Caching is Essential
A single caching layer isn't enough. In-memory cache provides speed, database cache provides persistence, and intent caching reduces AI calls.
3. Context Compression is a Game-Changer
Reducing token usage by 40-50% through intelligent filtering had massive impact on both cost and performance.
4. Measure Everything
Comprehensive metrics from day one allowed us to:
- Identify bottlenecks quickly
- Measure improvement impact accurately
- Make data-driven optimization decisions
- Track cost savings over time
5. User Experience > Technical Metrics
While we improved technical metrics significantly, the real win was user experience. Users noticed the speed improvements immediately.
6. Integration with Existing Architecture
The performance optimizations integrated seamlessly with our existing plugin-based architecture, validating our design decision to build a modular, extensible system.
Future Optimizations
We're exploring additional optimizations:
1. Streaming Responses
For long-form content generation, implement streaming for better perceived performance.
2. Predictive Caching
Use machine learning to predict likely requests and pre-cache.
3. Edge Caching
Cache at CDN edge locations for global users.
4. Request Deduplication
Detect duplicate requests and return cached response.
5. Database Connection Pooling
Optimize connection management for better resource utilization.
Conclusion
The performance optimizations we implemented transformed Cereby AI from a slow, expensive system into a fast, cost-effective learning assistant. The key was a multi-layered approach:
- In-memory caching for speed
- Database caching for persistence
- Context compression for efficiency
- Query optimization for database performance
- Component optimization for UI responsiveness
These optimizations resulted in:
- 60-70% faster response times
- 40-50% reduction in costs
- 60-80% reduction in database load
- Dramatically improved user experience
For teams building similar AI systems, our key takeaway is: invest in caching and optimization from the start, but design your architecture to make it easy to add later.
Want to learn more about Cereby AI's architecture? Check out our Building Cereby AI post or reach out on Twitter.
Visual Summary
```mermaid
flowchart TD
    A[Incoming Request] --> B[Cache Lookup]
    B -->|Hit| C[Fast Response Path]
    B -->|Miss| D[Parallel Data Fetch]
    D --> E[Optimized Processing]
    E --> F[Store Computed Cache]
    F --> G[Return Response]
    C --> G
```