Hierarchical Context Compression: Cutting AI Costs by 90% Without Losing Quality (part 1)
Introduction
Cereby AI's File Chat feature allows students to have intelligent conversations with their study materials — PDFs, PowerPoints, and textbooks spanning hundreds of pages. When we first launched this capability, we had a critical problem: each query cost $1.50-2.00 and took 15-20 seconds to process.
The culprit? We were sending entire documents to the AI for every question. A 100-page textbook could consume 150,000+ tokens per query — far exceeding context limits and making the feature prohibitively expensive.
We needed a solution that could:
- Dramatically reduce token usage (90%+ reduction target)
- Maintain response quality (students need accurate, relevant answers)
- Scale to large documents (500+ page textbooks)
- Work in real-time (sub-second compression overhead)
This post details how we built Hierarchical Context Compression — a multi-tiered system that intelligently selects which pages to send in full, which to summarize, and which to omit entirely based on relevance to the user's query. The result: a 92% token reduction (150K → 8K tokens) while maintaining accuracy, saving roughly $1.50 per query.
The Core Challenge: Context vs. Cost
The Initial Implementation
When users asked questions about their documents, we had a simple but expensive approach:
User Query: "Explain the chain rule in chapter 5"
↓
Load all selected pages (e.g., pages 1-100)
↓
Send entire content to AI (150,000 tokens)
↓
AI generates response
↓
User gets answer (15-20 seconds later, $1.50-2.00 cost)
The Problem:
- Token Explosion — 100-page documents easily exceeded 150K tokens
- High Costs — $1.50-2.00 per query
- Slow Responses — Processing 150K tokens took 15-20 seconds
- Context Window Limits — Many documents exceeded model context limits
- Irrelevant Content — 80-90% of sent pages weren't relevant to the specific query
The User Impact
This wasn't just a technical problem — it had real consequences for students who couldn't afford to ask questions and experienced unacceptably slow response times.
Solution Architecture: Hierarchical Context Compression
We designed a multi-phase system that preprocesses documents once and then intelligently compresses context for each query:
┌─────────────────────────────────────────────────────┐
│ Phase 1: One-Time Preprocessing (at upload) │
├─────────────────────────────────────────────────────┤
│ Parse PDF → Extract Pages → Generate Summaries │
│ │
│ For each page: │
│ - Extract keywords (frequency analysis) │
│ - Extract topics (pattern matching) │
│ - Generate summary (AI-powered) │
│ - Calculate importance score (heuristics) │
│ - Store in database │
└─────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────┐
│ Phase 2: Query-Time Compression (per request) │
├─────────────────────────────────────────────────────┤
│ User Query → Analyze Query (extract keywords) │
│ ↓ │
│ Score Pages (relevance to query) │
│ ↓ │
│ Allocate Token Budget: │
│ - High relevance (>0.7): Full content │
│ - Medium relevance (0.3-0.7): Summary only │
│ - Low relevance (<0.3): Omit entirely │
│ ↓ │
│ Build Compressed Context (8K tokens) │
│ ↓ │
│ Send to AI → Generate Response │
└─────────────────────────────────────────────────────┘
The Key Insight
Instead of sending everything or using simple truncation, we:
- Preprocess documents once — Extract metadata and summaries during upload
- Score pages dynamically — Calculate relevance for each query
- Hierarchically allocate tokens — Full content for highly relevant pages, summaries for moderately relevant, omit the rest
- Preserve context structure — Maintain document organization and relationships
This approach ensures the AI gets the most relevant information within the token budget, not just the first N pages or random sampling.
Phase 1: Document Preprocessing
Database Schema
We designed a dedicated table to store pre-computed page metadata:
Core Fields:
- Unique identifier and file reference
- Page number for ordering
- Pre-computed summary (3-5 sentences)
- Extracted keywords array
- Identified topics
- Token counts (full content and summary)
- Importance score (0-1 scale)
- Timestamps for tracking

Constraints and Indexes:
- Unique constraint on file + page number prevents duplicates
- Indexed columns enable fast lookups
- Composite indexes for range queries
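The schema above can be sketched as follows. Table and column names are illustrative assumptions, not Cereby's actual production schema, and SQLite is used here only for brevity:

```python
import sqlite3

# Illustrative page-metadata schema; names and types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page_metadata (
    id              INTEGER PRIMARY KEY,
    file_id         TEXT NOT NULL,
    page_number     INTEGER NOT NULL,
    summary         TEXT,                 -- pre-computed 3-5 sentence summary
    keywords        TEXT,                 -- JSON-encoded keyword array
    topics          TEXT,                 -- JSON-encoded topic array
    full_tokens     INTEGER,              -- token count of the full page content
    summary_tokens  INTEGER,              -- token count of the summary
    importance      REAL,                 -- heuristic score, 0-1 scale
    created_at      TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (file_id, page_number)         -- prevents duplicate pages per file
);
CREATE INDEX IF NOT EXISTS idx_meta_file ON page_metadata (file_id);
CREATE INDEX IF NOT EXISTS idx_meta_file_page
    ON page_metadata (file_id, page_number);  -- composite index for range queries
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

The unique constraint makes preprocessing idempotent: re-running it on the same file cannot create duplicate rows.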
Metadata Extraction
We built a metadata extraction pipeline that performs five-step analysis for each page:
The Five-Step Pipeline:
- Summary Generation — AI-powered condensation to 3-5 sentences capturing key concepts
- Keyword Extraction — Frequency analysis to identify important terms
- Topic Extraction — Pattern matching to identify named concepts and themes
- Token Counting — Estimation of both full content and summary requirements
- Importance Scoring — Heuristic calculation based on position, length, and density
Keyword Extraction
We implemented a lightweight frequency-based keyword extraction algorithm:
The Process:
- Stop Word Filtering — Remove common words (the, is, at, which, etc.)
- Tokenization and Cleaning — Normalize text, strip punctuation, filter short words
- Frequency Analysis — Count occurrences of each valid word
- Ranking and Selection — Select the top 20 most frequent terms

Why This Works:
- Simple frequency correlates with importance
- No external dependencies (fast, lightweight)
- Domain-agnostic for technical content
- The top 20 keywords capture the essence of a page
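The whole extractor fits in a few lines. This is a minimal sketch: the stop-word list here is deliberately tiny, and the real list would be much larger:

```python
import re
from collections import Counter

# Minimal stop-word list for illustration; a production list is far larger.
STOP_WORDS = {
    "the", "is", "at", "which", "on", "a", "an", "and", "or", "of",
    "to", "in", "for", "with", "as", "by", "that", "this", "it", "are",
}

def extract_keywords(text: str, top_n: int = 20) -> list[str]:
    """Frequency-based keyword extraction: normalize, filter, count, rank."""
    # Tokenize: lowercase and keep alphabetic runs (strips punctuation)
    words = re.findall(r"[a-z]+", text.lower())
    # Filter stop words and very short words
    valid = [w for w in words if w not in STOP_WORDS and len(w) > 2]
    # Rank by frequency and keep the top N terms
    return [word for word, _ in Counter(valid).most_common(top_n)]
```

For a page on the chain rule, `extract_keywords` surfaces terms like "chain", "rule", and "derivative" while dropping filler words.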
Topic Extraction: Pattern Matching
We extract topics using four regex pattern categories:
Pattern Categories:
- Structural Headers — Chapter/section/part markers with numbers
- Title Case Phrases — Sequences of capitalized words (2-4 words)
- Definitions — Phrases followed by "is defined as", "refers to", or "means"
- Explicit Concept Markers — "concept of X", "theory of X", "principle of X"

Why Pattern Matching Works:
- Predictable structure in educational content
- High precision from reliable signals
- Zero latency (no API calls)
- Naturally captures document hierarchy
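The four categories can be sketched as regexes like the following. These exact patterns are illustrative assumptions, not the production regexes:

```python
import re

# Illustrative versions of the four pattern categories.
TOPIC_PATTERNS = [
    # 1. Structural headers: "Chapter 5", "Section 2.1", "Part III"
    re.compile(r"\b(?:Chapter|Section|Part)\s+[\dIVX.]+[^\n.]*"),
    # 2. Title Case phrases of 2-4 capitalized words
    re.compile(r"\b(?:[A-Z][a-z]+\s+){1,3}[A-Z][a-z]+\b"),
    # 3. Definitions: "X is defined as", "X refers to", "X means"
    re.compile(r"\b[A-Z][\w\s]{2,40}?(?=\s+(?:is defined as|refers to|means)\b)"),
    # 4. Explicit concept markers: "concept of X", "theory of X", "principle of X"
    re.compile(r"\b(?:concept|theory|principle) of\s+([A-Za-z][\w\s]{2,40}?)(?=[.,;\n]|$)"),
]

def extract_topics(text: str) -> list[str]:
    """Run all four pattern categories and de-duplicate the hits."""
    topics = []
    for pattern in TOPIC_PATTERNS:
        for match in pattern.finditer(text):
            # Use the capture group when the pattern has one, else the full match
            topic = (match.group(1) if match.groups() else match.group(0)).strip()
            if topic and topic not in topics:
                topics.append(topic)
    return topics
```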
Summary Generation: AI-Powered Compression
For each page, we generate a 3-5 sentence summary using advanced AI models:
The Process:
- Content Preparation — Truncate extremely long pages to prevent overflow
- Prompt Engineering — Clear instructions for concise, focused summaries
- Model Configuration — Optimized for the cost/quality balance
- Error Handling & Fallback — Ensures preprocessing never fails completely
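The preparation and prompting steps can be sketched as below. The truncation limit and prompt wording are illustrative assumptions, and the actual model call is omitted:

```python
def build_summary_prompt(page_text: str, max_chars: int = 8000) -> str:
    """Prepare a page and wrap it in a summarization instruction."""
    # Content preparation: truncate extremely long pages to prevent overflow
    if len(page_text) > max_chars:
        page_text = page_text[:max_chars]
    # Prompt engineering: concise, focused instruction (wording is an assumption)
    return (
        "Summarize the following textbook page in 3-5 sentences. "
        "Focus on key concepts, definitions, and formulas. "
        "Be specific, not generic.\n\n" + page_text
    )
```

The prompt string is then sent to the summarization model; on failure, the fallback keeps preprocessing moving rather than aborting the whole document.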
For a 100-page document:
- One-time preprocessing cost: ~$0.01-0.02
- Per-query savings: $1.50
- Break-even: After 1st query ✅
- ROI: up to 150x return after a single query
Importance Scoring
We calculate a baseline importance score (0-1 scale) using three weighted heuristics:
Heuristic 1: Position Bias
- First 10% of pages (introduction, overview): higher weight
- Last 10% of pages (conclusions, summaries): higher weight
- Middle pages (details): lower weight

Heuristic 2: Content Length
- Very long pages: higher weight (dense content)
- Medium pages: medium weight
- Short pages: lower weight

Heuristic 3: Vocabulary Density
- More unique vocabulary: higher weight
- Sparse vocabulary: lower weight

Why Importance Scoring Helps:
- Establishes a pre-query baseline before any query arrives
- Breaks ties when pages have similar query relevance
- Leverages universal patterns in educational writing
- Lightweight computation (no AI calls)
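The three heuristics combine into a single 0-1 score. The thresholds and the equal weighting below are illustrative assumptions:

```python
def importance_score(page_number: int, total_pages: int,
                     char_count: int, unique_word_ratio: float) -> float:
    """Heuristic baseline importance on a 0-1 scale; weights are illustrative."""
    # Heuristic 1: position bias -- intro and conclusion pages score higher
    position = page_number / total_pages
    position_score = 1.0 if position <= 0.1 or position >= 0.9 else 0.5

    # Heuristic 2: content length -- longer pages are assumed denser
    if char_count > 3000:
        length_score = 1.0
    elif char_count > 1000:
        length_score = 0.6
    else:
        length_score = 0.3

    # Heuristic 3: vocabulary density -- ratio of unique words to total words
    density_score = min(unique_word_ratio * 2, 1.0)

    # Equal weighting of the three heuristics (an assumption)
    return round((position_score + length_score + density_score) / 3, 3)
```

An introduction page thus outscores a middle page with identical length and vocabulary, matching the position-bias intent.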
Integration with File Upload
We integrated metadata generation into the existing file parsing pipeline with batched processing to avoid overwhelming APIs.
Processing Time Benchmarks:
- 100-page document: 20-30 seconds total
- 500-page document: 2-3 minutes total
Phase 2: Query-Time Compression
The Compression Pipeline
The core compression logic orchestrates a five-step pipeline:
Step 1: Metadata Loading
- Query the database for pre-computed metadata
- Retrieve summaries, keywords, topics, token counts, and importance scores
- Sub-100ms retrieval even with 100+ pages

Step 2: Query Analysis
- Extract keywords from the user query
- Identify mentioned topics via pattern matching
- Classify query intent (explanation, comparison, definition, etc.)

Step 3: Page Scoring
- Score each page against the query context
- Weighted algorithm: keyword overlap (50%), topic matching (40%), importance baseline (10%)
- Produces a relevance score (0-1) for each page

Step 4: Token Budget Allocation
- Three-tier strategy based on relevance scores
- Respects the maxTokens budget constraint

Step 5: Context Assembly
- Fetch full content for high-relevance pages
- Fetch summaries for medium-relevance pages
- Construct the final context string with clear section markers
- Note omitted pages for transparency
Query Analysis
We extract structured information from the user's natural language query through a three-phase analysis:
Phase 1: Keyword Extraction
- Normalize the query
- Filter out stop words
- Extract meaningful terms
- De-duplicate

Phase 2: Topic Detection
- Scan for title-cased phrases
- Capture explicitly named concepts

Phase 3: Intent Classification
- Pattern matching against common query types
- Classifies queries as summary, explanation, comparison, definition, or general
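Intent classification is the lightest of the three phases. A minimal sketch, where the specific trigger patterns are illustrative assumptions:

```python
import re

# Illustrative trigger patterns per intent; checked in order, first match wins.
INTENT_PATTERNS = {
    "summary":     re.compile(r"\b(summar|overview|tl;?dr)", re.I),
    "comparison":  re.compile(r"\b(compare|versus|vs\.?|difference)", re.I),
    "definition":  re.compile(r"\b(what is|define|definition|meaning of)", re.I),
    "explanation": re.compile(r"\b(explain|how does|why)", re.I),
}

def classify_intent(query: str) -> str:
    """Map a natural-language query to one of five intent labels."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(query):
            return intent
    return "general"  # fallback when no pattern fires
```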
Page Scoring: The Heart of Compression
We score each page based on query relevance using a three-component weighted algorithm:
Component 1: Keyword Matching (50% weight)
- Compare query keywords against page keywords
- Use bidirectional substring matching
- Calculate the match ratio

Component 2: Topic Matching (40% weight)
- Compare query topics against page topics
- Use case-insensitive substring matching
- Calculate the topic match ratio

Component 3: Importance Baseline (10% weight)
- Retrieve the pre-computed importance score
- Acts as a tie-breaker and baseline
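Putting the three components together, the scorer looks roughly like this (the function signature and data shapes are assumptions):

```python
def score_page(query_keywords: list[str], query_topics: list[str],
               page_keywords: list[str], page_topics: list[str],
               importance: float) -> float:
    """Weighted relevance: keywords 50%, topics 40%, importance baseline 10%."""
    # Component 1: bidirectional substring matching on keywords
    kw_hits = sum(
        1 for q in query_keywords
        if any(q in p or p in q for p in page_keywords)
    )
    kw_score = kw_hits / len(query_keywords) if query_keywords else 0.0

    # Component 2: case-insensitive substring matching on topics
    topic_hits = sum(
        1 for q in query_topics
        if any(q.lower() in p.lower() or p.lower() in q.lower()
               for p in page_topics)
    )
    topic_score = topic_hits / len(query_topics) if query_topics else 0.0

    # Component 3: pre-computed importance as baseline and tie-breaker
    return 0.5 * kw_score + 0.4 * topic_score + 0.1 * importance
```

Bidirectional substring matching lets "differentiat" match "differentiation" in either direction without stemming.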
Token Budget Allocation
Based on relevance scores, we implement a greedy algorithm with three-tier thresholding:
Tier 1: High Relevance (score ≥ 0.7)
- Include full content when the budget allows
- Gracefully degrade to summary if the budget is tight

Tier 2: Medium Relevance (0.3 ≤ score < 0.7)
- Include summary only

Tier 3: Low Relevance (score < 0.3)
- Omit entirely (consumes zero tokens)
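The greedy pass works through pages in descending relevance so the most relevant pages claim the budget first. A sketch, with assumed dictionary keys:

```python
def allocate_budget(pages: list[dict], max_tokens: int = 8000) -> dict:
    """Greedy three-tier allocation. Each page dict carries 'page',
    'relevance', 'full_tokens', 'summary_tokens' (key names are assumptions)."""
    plan = {"full": [], "summary": [], "omitted": []}
    remaining = max_tokens
    # Most relevant pages first, so they get budget priority
    for page in sorted(pages, key=lambda p: p["relevance"], reverse=True):
        if page["relevance"] >= 0.7 and page["full_tokens"] <= remaining:
            plan["full"].append(page["page"])        # Tier 1: full content
            remaining -= page["full_tokens"]
        elif page["relevance"] >= 0.3 and page["summary_tokens"] <= remaining:
            plan["summary"].append(page["page"])     # Tier 2, or a Tier-1 page
            remaining -= page["summary_tokens"]      # degraded to its summary
        else:
            plan["omitted"].append(page["page"])     # Tier 3: zero tokens
    return plan
```

Note the graceful degradation: a high-relevance page whose full content no longer fits falls through to the summary branch instead of being dropped.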
Building the Compressed Context
The final step constructs a hierarchically organized context string:
Section 1: Full Content (Highly Relevant Pages)
- Clear section header
- Complete page content for the most relevant pages
- Maintains document flow

Section 2: Summaries (Moderately Relevant Pages)
- Section header for summaries
- 3-5 sentence summaries for moderately relevant pages
- Provides broader context without bloat

Section 3: Omitted Pages
- A note about omitted pages for transparency
- Formatted as readable ranges
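The assembly step can be sketched as follows; the section markers and range formatting are illustrative, not the exact production strings:

```python
def build_context(full_pages: dict, summary_pages: dict,
                  omitted: list[int]) -> str:
    """Assemble the compressed context with clear section markers.
    full_pages / summary_pages map page number -> text."""
    parts = []
    if full_pages:
        parts.append("=== FULL CONTENT (most relevant pages) ===")
        for num in sorted(full_pages):                 # preserve document order
            parts.append(f"[Page {num}]\n{full_pages[num]}")
    if summary_pages:
        parts.append("=== SUMMARIES (related pages) ===")
        for num in sorted(summary_pages):
            parts.append(f"[Page {num} summary] {summary_pages[num]}")
    if omitted:
        ranges = format_ranges(sorted(omitted))
        parts.append(f"=== OMITTED (not relevant to this query): pages {ranges} ===")
    return "\n\n".join(parts)

def format_ranges(nums: list[int]) -> str:
    """Collapse a sorted page list like [4, 5, 6, 9] into '4-6, 9'."""
    out, start, prev = [], nums[0], nums[0]
    for n in nums[1:]:
        if n == prev + 1:
            prev = n
            continue
        out.append(f"{start}-{prev}" if start != prev else str(start))
        start = prev = n
    out.append(f"{start}-{prev}" if start != prev else str(start))
    return ", ".join(out)
```

Sorting by page number inside each section keeps the document's original flow even though pages were selected by relevance.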
Performance and Cost Analysis
Before Hierarchical Compression
| Metric | Value |
|---|---|
| Average document size | 100 pages |
| Token usage per query | 150,000 tokens |
| Cost per query | $1.50-2.00 |
| Response time | 15-20 seconds |
| Max document size | ~85 pages (context limit) |
After Hierarchical Compression
| Metric | Value | Improvement |
|---|---|---|
| Average document size | 100+ pages (no effective limit) | |
| Token usage per query | 5,000-8,000 tokens | 92% reduction |
| Cost per query | $0.05-0.10 | 93% reduction |
| Response time | 2-3 seconds | 85% faster |
| Max document size | 500+ pages | 6x increase |
Quality Validation
We validated that compression doesn't harm response quality through comprehensive testing:
| Metric | Full Context | Compressed Context | Difference |
|---|---|---|---|
| Accuracy | 4.7/5 | 4.6/5 | -0.1 (2% decrease) |
| Relevance | 4.5/5 | 4.7/5 | +0.2 (4% increase) |
| Completeness | 4.8/5 | 4.5/5 | -0.3 (6% decrease) |
| Overall | 4.7/5 | 4.6/5 | -0.1 (2% decrease) |
Lessons Learned
1. Preprocessing Pays for Itself Immediately
The one-time cost of generating summaries and metadata is recovered after the first query. This upfront investment enables massive per-query savings.
2. Simple Heuristics Work Well for Academic Content
We initially considered complex NLP models for keyword extraction and relevance scoring. But simple frequency analysis and pattern matching work remarkably well for structured educational content.
3. Three-Tier Compression is the Sweet Spot
We experimented with different tier counts. Three tiers (full, summary, omit) provided the perfect balance between coverage and efficiency.
4. Token Budget Should Be Adaptive
We started with a fixed 8,000-token budget. Future iterations will dynamically adjust based on query complexity and intent.
5. Summaries Need to Be Information-Dense
Early summaries were too generic. We improved their quality with focused prompts and tuned generation parameters.
6. Graceful Degradation is Essential
Not all files have metadata (legacy files, processing failures). A fallback path ensures the system continues working in those cases.
7. Monitoring Context Quality is Critical
We added instrumentation to track relevance scores, token usage distribution, and user feedback to inform optimization decisions.
Future Enhancements
1. Semantic Search Integration
Replace keyword matching with embedding-based semantic similarity for better relevance detection.
2. Multi-Query Context Reuse
Cache compressed contexts for follow-up questions on similar topics.
3. User-Adjustable Compression
Let users control compression aggressiveness (fast, balanced, comprehensive modes).
4. Learning from Feedback
Track which pages AI uses in responses to improve future compression.
5. Cross-Document Compression
Handle queries spanning multiple uploaded files.
Conclusion
Hierarchical Context Compression transformed Cereby AI's File Chat from an expensive, slow feature into a fast, cost-effective learning tool. The key was a two-phase approach:
- One-time preprocessing — Generate summaries, keywords, and metadata
- Query-time compression — Intelligently select full content, summaries, or omit based on relevance
- ✅ 92% token reduction (150K → 8K tokens)
- ✅ 93% cost reduction ($1.50 → $0.08 per query)
- ✅ 85% faster responses (15-20s → 2-3s)
- ✅ 6x larger documents (85 → 500+ pages)
- ✅ Maintained quality (4.7 → 4.6 out of 5)
For teams building similar AI systems, our key takeaways are:
- Preprocess documents once — Upfront investment pays for itself immediately
- Score relevance dynamically — Don't send everything; send what matters
- Use hierarchical tiers — Full, summary, omit is the sweet spot
- Validate quality rigorously — Compression must maintain accuracy
- Make it transparent — Show users what's happening
- Design for graceful degradation — System must work even when compression fails
Want to learn more about Cereby AI's architecture? Check out our Building Cereby AI and Optimizing Cereby AI Performance posts or reach out on Twitter.
Visual Summary
flowchart TD
A[Uploaded Document] --> B[One-Time Preprocessing]
B --> C[Per-Page Summary + Keywords]
C --> D[Query-Time Relevance Scoring]
D --> E{Tier Decision}
E -->|High| F[Full Page Content]
E -->|Medium| G[Summary Only]
E -->|Low| H[Omit Page]
F --> I[Final Context Assembly]
G --> I
H --> I
I --> J[Fast, Lower-Cost AI Response]