Engineering
February 15, 2026
12 min read

Hierarchical Context Compression: Cutting AI Costs by 90% Without Losing Quality (part 1)

How we engineered an intelligent compression system that reduced token usage from 150K to 8K while maintaining high-quality AI responses

Introduction

Cereby AI's File Chat feature allows students to have intelligent conversations with their study materials, from PDFs and PowerPoints to textbooks spanning hundreds of pages. When we first launched this capability, we had a critical problem: each query cost $1.50-2.00 and took 15-20 seconds to process.

The culprit? We were sending entire documents to the AI for every question. A 100-page textbook could consume 150,000+ tokens per query — far exceeding context limits and making the feature prohibitively expensive.

We needed a solution that could:

  • Dramatically reduce token usage (90%+ reduction target)
  • Maintain response quality (students need accurate, relevant answers)
  • Scale to large documents (500+ page textbooks)
  • Work in real-time (sub-second compression overhead)

This post details how we built Hierarchical Context Compression — a multi-tiered system that intelligently selects which pages to send in full, which to summarize, and which to omit entirely based on relevance to the user's query. The result: 92% token reduction (150K → 8K tokens) with accuracy maintained and roughly $1.50 saved per query.

The Core Challenge: Context vs. Cost

The Initial Implementation

When users asked questions about their documents, we had a simple but expensive approach:

User Query: "Explain the chain rule in chapter 5"
  ↓
Load all selected pages (e.g., pages 1-100)
  ↓
Send entire content to AI (150,000 tokens)
  ↓
AI generates response
  ↓
User gets answer (15-20 seconds later, $1.50-2.00 cost)
The Problem:
  1. Token Explosion — 100-page documents easily exceeded 150K tokens
  2. High Costs — $1.50-2.00 per query
  3. Slow Responses — Processing 150K tokens took 15-20 seconds
  4. Context Window Limits — Many documents exceeded model context limits
  5. Irrelevant Content — 80-90% of sent pages weren't relevant to the specific query

The User Impact

This wasn't just a technical problem: the per-query cost made the feature too expensive to offer at scale, and students experienced unacceptably slow response times.

Solution Architecture: Hierarchical Context Compression

We designed a multi-phase system that preprocesses documents once and then intelligently compresses context for each query:

┌─────────────────────────────────────────────────────┐
│ Phase 1: One-Time Preprocessing (at upload)         │
├─────────────────────────────────────────────────────┤
│ Parse PDF → Extract Pages → Generate Summaries      │
│                                                     │
│ For each page:                                      │
│  - Extract keywords (frequency analysis)            │
│  - Extract topics (pattern matching)                │
│  - Generate summary (AI-powered)                    │
│  - Calculate importance score (heuristics)          │
│  - Store in database                                │
└─────────────────────────────────────────────────────┘
           ↓
┌─────────────────────────────────────────────────────┐
│ Phase 2: Query-Time Compression (per request)       │
├─────────────────────────────────────────────────────┤
│ User Query → Analyze Query (extract keywords)       │
│           ↓                                         │
│ Score Pages (relevance to query)                    │
│           ↓                                         │
│ Allocate Token Budget:                              │
│  - High relevance (>0.7): Full content              │
│  - Medium relevance (0.3-0.7): Summary only         │
│  - Low relevance (<0.3): Omit entirely              │
│           ↓                                         │
│ Build Compressed Context (8K tokens)                │
│           ↓                                         │
│ Send to AI → Generate Response                      │
└─────────────────────────────────────────────────────┘

The Key Insight

Instead of sending everything or using simple truncation, we:

  1. Preprocess documents once — Extract metadata and summaries during upload
  2. Score pages dynamically — Calculate relevance for each query
  3. Hierarchically allocate tokens — Full content for highly relevant pages, summaries for moderately relevant, omit the rest
  4. Preserve context structure — Maintain document organization and relationships

This approach ensures the AI gets the most relevant information within the token budget, not just the first N pages or random sampling.

Phase 1: Document Preprocessing

Database Schema

We designed a dedicated table to store pre-computed page metadata:

Core Fields:
  • Unique identifier and file reference
  • Page number for ordering
  • Pre-computed summary (3-5 sentences)
  • Extracted keywords array
  • Identified topics
  • Token counts (full content and summary)
  • Importance score (0-1 scale)
  • Timestamps for tracking
Performance Optimizations:
  • Unique constraint on file + page number prevents duplicates
  • Indexed columns enable fast lookups
  • Composite indexes for range queries
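The table above can be sketched as a TypeScript record type. The field names here are illustrative, not the actual production column names:

```typescript
// Hypothetical shape of one per-page metadata row; names are
// illustrative stand-ins for the production schema.
interface PageMetadata {
  id: string;            // unique identifier
  fileId: string;        // reference to the uploaded file
  pageNumber: number;    // ordering within the document
  summary: string;       // pre-computed 3-5 sentence summary
  keywords: string[];    // extracted keywords array
  topics: string[];      // identified topics
  contentTokens: number; // token count of the full page content
  summaryTokens: number; // token count of the summary
  importance: number;    // importance score, 0-1 scale
  createdAt: Date;       // timestamp for tracking
}
```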

Metadata Extraction

We built a metadata extraction pipeline that performs five-step analysis for each page:

The Five-Step Pipeline:
  1. Summary Generation — AI-powered condensation to 3-5 sentences capturing key concepts
  2. Keyword Extraction — Frequency analysis to identify important terms
  3. Topic Extraction — Pattern matching to identify named concepts and themes
  4. Token Counting — Estimation of both full content and summary requirements
  5. Importance Scoring — Heuristic calculation based on position, length, and density

Keyword Extraction

We implemented a lightweight frequency-based keyword extraction algorithm:

The Process:
  1. Stop Word Filtering — Remove common words (the, is, at, which, etc.)
  2. Tokenization and Cleaning — Normalize text, strip punctuation, filter short words
  3. Frequency Analysis — Count occurrence of each valid word
  4. Ranking and Selection — Select top 20 most frequent terms
Why This Works:
  • Simple frequency correlates with importance
  • No external dependencies (fast, lightweight)
  • Domain-agnostic for technical content
  • Top 20 keywords capture essence of page
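The four steps above can be sketched in a few lines of TypeScript. The stop-word list and the minimum token length are illustrative, not the exact production values:

```typescript
// Illustrative stop-word list; the real one is much longer.
const STOP_WORDS = new Set([
  "the", "is", "at", "which", "on", "a", "an", "and", "or", "of",
  "to", "in", "for", "with", "as", "by", "that", "this", "it", "are",
]);

// Frequency-based keyword extraction: normalize, filter stop words and
// short tokens, count occurrences, return the top N terms.
function extractKeywords(text: string, topN = 20): string[] {
  const counts = new Map<string, number>();
  for (const token of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (token.length < 4 || STOP_WORDS.has(token)) continue; // strip short/common words
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // most frequent first
    .slice(0, topN)
    .map(([word]) => word);
}
```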

Topic Extraction: Pattern Matching

We extract topics using four regex pattern categories:

Pattern Categories:
  1. Structural Headers — Chapter/section/part markers with numbers
  2. Title Case Phrases — Sequences of capitalized words (2-4 words)
  3. Definitions — Phrases followed by "is defined as", "refers to", or "means"
  4. Explicit Concept Markers — "concept of X", "theory of X", "principle of X"
Why Pattern Matching:
  • Predictable structure in educational content
  • High precision with reliable signals
  • Zero latency (no API calls)
  • Naturally captures document hierarchy
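A minimal sketch of the four pattern categories, with illustrative regexes (the production patterns are more refined):

```typescript
// Pattern-based topic extraction mirroring the four categories above.
// The exact regexes here are simplified assumptions.
function extractTopics(text: string): string[] {
  const patterns: RegExp[] = [
    /\b(?:Chapter|Section|Part)\s+\d+[^.\n]{0,60}/g,                  // 1. structural headers
    /\b(?:[A-Z][a-z]+\s){1,3}[A-Z][a-z]+\b/g,                         // 2. Title Case phrases (2-4 words)
    /\b[\w\s]{3,40}?(?=\s+(?:is defined as|refers to|means)\b)/g,     // 3. definitions
    /\b(?:concept|theory|principle) of\s+[\w\s]{3,40}/g,              // 4. explicit concept markers
  ];
  const topics = new Set<string>(); // de-duplicate across patterns
  for (const pattern of patterns) {
    for (const match of text.match(pattern) ?? []) {
      topics.add(match.trim());
    }
  }
  return [...topics];
}
```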

Summary Generation: AI-Powered Compression

For each page, we generate a 3-5 sentence summary using advanced AI models:

The Process:
  1. Content Preparation — Truncate extremely long pages to prevent overflow
  2. Prompt Engineering — Clear instruction for concise, focused summaries
  3. Model Configuration — Optimized for cost and quality balance
  4. Error Handling & Fallback — Ensures preprocessing never fails completely
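A sketch of that flow, with the AI client injected as a plain function so the fallback path is visible. The `callModel` signature, character limit, and prompt wording are assumptions, not the production values:

```typescript
// Assumed truncation limit to keep very long pages within model input bounds.
const MAX_INPUT_CHARS = 12_000;

async function summarizePage(
  content: string,
  callModel: (prompt: string) => Promise<string>, // injected AI client stand-in
): Promise<string> {
  // 1. Content preparation: truncate extremely long pages.
  const input = content.slice(0, MAX_INPUT_CHARS);
  // 2. Prompt engineering: concise, focused instruction.
  const prompt =
    `Summarize the following page in 3-5 sentences, focusing on the key ` +
    `concepts a student would need:\n\n${input}`;
  try {
    return await callModel(prompt);
  } catch {
    // 4. Fallback: first few sentences, so preprocessing never fails outright.
    return input.split(/(?<=[.!?])\s+/).slice(0, 3).join(" ");
  }
}
```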
Cost Analysis:

For a 100-page document:

  • One-time preprocessing cost: ~$0.01-0.02
  • Per-query savings: $1.50
  • Break-even: After 1st query
  • ROI: up to 150x return after a single query
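The break-even arithmetic, using the lower-bound preprocessing cost from the figures above:

```typescript
const preprocessingCost = 0.01; // one-time cost for a 100-page document (lower bound)
const perQuerySavings = 1.5;    // roughly $1.50 saved on every query

// Savings exceed the preprocessing cost after the very first query.
const breakEvenQueries = Math.ceil(preprocessingCost / perQuerySavings);
// Return relative to the upfront cost after a single query.
const roi = Math.round(perQuerySavings / preprocessingCost);
```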

Importance Scoring

We calculate a baseline importance score (0-1 scale) using three weighted heuristics:

Heuristic 1: Position Bias
  • First 10% of pages (introduction, overview): Higher weight
  • Last 10% of pages (conclusions, summaries): Higher weight
  • Middle pages (details): Lower weight
Heuristic 2: Content Length
  • Very long pages: Higher weight (dense content)
  • Medium pages: Medium weight
  • Short pages: Lower weight
Heuristic 3: Keyword Density
  • More unique vocabulary: Higher weight
  • Sparse vocabulary: Lower weight
Why This Matters:
  • Pre-query baseline established before any query
  • Tie-breaking when pages have similar relevance
  • Leverages universal patterns in educational writing
  • Lightweight computation (no AI calls)
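The three heuristics combine into a single score; the weights and normalization constants below are illustrative assumptions:

```typescript
// Heuristic baseline importance score (0-1). Weights and thresholds
// are illustrative, not the production tuning.
function importanceScore(
  pageNumber: number,
  totalPages: number,
  contentLength: number,   // characters on the page
  uniqueKeywords: number,  // distinct extracted keywords
): number {
  const position = pageNumber / totalPages;
  // 1. Position bias: first and last 10% of pages score higher.
  const positionScore = position <= 0.1 || position >= 0.9 ? 1.0 : 0.5;
  // 2. Content length: longer pages tend to be denser.
  const lengthScore = Math.min(contentLength / 3000, 1.0);
  // 3. Keyword density: more unique vocabulary, higher weight.
  const densityScore = Math.min(uniqueKeywords / 20, 1.0);
  return 0.4 * positionScore + 0.3 * lengthScore + 0.3 * densityScore;
}
```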

Integration with File Upload

We integrated metadata generation into the existing file parsing pipeline with batched processing to avoid overwhelming APIs.

Processing Time Benchmarks:
  • 100-page document: 20-30 seconds total
  • 500-page document: 2-3 minutes total
Critical Insight: This happens once at upload, not per query. The upfront investment pays for itself after the first query.
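The batching can be expressed as a small generic helper: pages within a batch run concurrently, while batches run sequentially so the summarization API is never hit with hundreds of simultaneous calls. The batch size here is an assumption:

```typescript
// Process items in small sequential batches to avoid overwhelming APIs.
async function preprocessInBatches<T, R>(
  items: T[],
  processOne: (item: T) => Promise<R>,
  batchSize = 5, // assumed batch size
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Concurrency within a batch, sequencing between batches.
    results.push(...(await Promise.all(batch.map(processOne))));
  }
  return results;
}
```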

Phase 2: Query-Time Compression

The Compression Pipeline

The core compression logic orchestrates a five-step pipeline:

Step 1: Metadata Loading
  • Query database for pre-computed metadata
  • Retrieve summaries, keywords, topics, token counts, importance scores
  • Sub-100ms retrieval even with 100+ pages
Step 2: Query Analysis
  • Extract keywords from user query
  • Identify mentioned topics via pattern matching
  • Classify query intent (explanation, comparison, definition, etc.)
Step 3: Page Scoring
  • Score each page against query context
  • Weighted algorithm: keyword overlap (50%), topic matching (40%), importance baseline (10%)
  • Produces relevance score (0-1) for each page
Step 4: Token Budget Allocation
  • Three-tier strategy based on relevance scores:
    - High relevance (≥0.7): Attempt full content inclusion
    - Medium relevance (0.3-0.7): Include summary only
    - Low relevance (<0.3): Omit entirely
  • Respects maxTokens budget constraint
Step 5: Context Building
  • Construct final context string with clear section markers
  • Fetch full content for high-relevance pages
  • Fetch summaries for medium-relevance pages
  • Note omitted pages for transparency

Query Analysis

We extract structured information from the user's natural language query through a three-phase analysis:

Phase 1: Keyword Extraction
  • Normalize query
  • Filter out stop words
  • Extract meaningful terms
  • De-duplicate
Phase 2: Topic Identification
  • Scan for title-cased phrases
  • Capture explicitly named concepts
Phase 3: Intent Classification
  • Pattern matching against common query types
  • Summary, explanation, comparison, definition, or general
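The intent classifier in Phase 3 can be sketched as a cascade of pattern tests; the trigger phrases below are illustrative examples of each category:

```typescript
type QueryIntent = "summary" | "explanation" | "comparison" | "definition" | "general";

// Pattern-matching intent classification. Trigger phrases are
// illustrative, not the full production pattern set.
function classifyIntent(query: string): QueryIntent {
  const q = query.toLowerCase();
  if (/\b(summar(y|ize)|overview|tl;?dr)\b/.test(q)) return "summary";
  if (/\b(compare|versus|vs\.?|difference)\b/.test(q)) return "comparison";
  if (/\b(what is|define|definition|meaning of)\b/.test(q)) return "definition";
  if (/\b(explain|how does|why)\b/.test(q)) return "explanation";
  return "general"; // no pattern matched
}
```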

Page Scoring: The Heart of Compression

We score each page based on query relevance using a three-component weighted algorithm:

Component 1: Keyword Matching (50% weight)
  • Compare query keywords against page keywords
  • Use bidirectional substring matching
  • Calculate match ratio
Component 2: Topic Matching (40% weight)
  • Compare query topics against page topics
  • Use case-insensitive substring matching
  • Calculate topic match ratio
Component 3: Page Importance (10% weight)
  • Retrieve pre-computed importance score
  • Acts as tie-breaker and baseline
Final Score: Combined weighted score, capped at 1.0
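Putting the three components together, a minimal sketch of the weighted scorer (the page shape is illustrative):

```typescript
// Weighted relevance score: keyword overlap (50%), topic overlap (40%),
// pre-computed importance (10%), capped at 1.0.
function scorePage(
  queryKeywords: string[],
  queryTopics: string[],
  page: { keywords: string[]; topics: string[]; importance: number },
): number {
  // Component 1: bidirectional substring matching, e.g. "derivative" ~ "derivatives".
  const keywordMatches = queryKeywords.filter((qk) =>
    page.keywords.some((pk) => pk.includes(qk) || qk.includes(pk)),
  ).length;
  const keywordScore =
    queryKeywords.length > 0 ? keywordMatches / queryKeywords.length : 0;

  // Component 2: case-insensitive topic overlap.
  const topicMatches = queryTopics.filter((qt) =>
    page.topics.some((pt) => {
      const a = pt.toLowerCase();
      const b = qt.toLowerCase();
      return a.includes(b) || b.includes(a);
    }),
  ).length;
  const topicScore =
    queryTopics.length > 0 ? topicMatches / queryTopics.length : 0;

  // Component 3: pre-computed importance as baseline and tie-breaker.
  return Math.min(0.5 * keywordScore + 0.4 * topicScore + 0.1 * page.importance, 1.0);
}
```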

Token Budget Allocation

Based on relevance scores, we implement a greedy algorithm with three-tier thresholding:

Tier 1: High Relevance (score ≥ 0.7)
  • Include full content when budget allows
  • Gracefully degrade to summary if budget is tight
Tier 2: Medium Relevance (score 0.3-0.7)
  • Include summary only
Tier 3: Low Relevance (score < 0.3)
  • Omit entirely (consume zero tokens)
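The greedy allocation described above might look like this, with the most relevant pages claiming budget first. The page shape and thresholds mirror the tiers above; the rest is an assumption:

```typescript
type Allocation = { pageNumber: number; mode: "full" | "summary" | "omit" };

// Greedy three-tier token budget allocation in descending relevance order.
function allocateBudget(
  pages: { pageNumber: number; score: number; contentTokens: number; summaryTokens: number }[],
  maxTokens: number,
): Allocation[] {
  let remaining = maxTokens;
  const allocations: Allocation[] = [];
  // Most relevant pages claim budget first.
  for (const page of [...pages].sort((a, b) => b.score - a.score)) {
    if (page.score < 0.3) {
      allocations.push({ pageNumber: page.pageNumber, mode: "omit" }); // Tier 3
    } else if (page.score >= 0.7 && page.contentTokens <= remaining) {
      remaining -= page.contentTokens; // Tier 1: full content fits
      allocations.push({ pageNumber: page.pageNumber, mode: "full" });
    } else if (page.summaryTokens <= remaining) {
      remaining -= page.summaryTokens; // Tier 2, or Tier 1 degraded to summary
      allocations.push({ pageNumber: page.pageNumber, mode: "summary" });
    } else {
      allocations.push({ pageNumber: page.pageNumber, mode: "omit" }); // budget exhausted
    }
  }
  return allocations;
}
```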

Building the Compressed Context

The final step constructs a hierarchically organized context string:

Section 1: Full Content (Highly Relevant Pages)
  • Clear section header
  • Complete page content for most relevant pages
  • Maintains document flow
Section 2: Summaries (Additional Context)
  • Section header for summaries
  • 3-5 sentence summaries for moderately relevant pages
  • Provides broader context without bloat
Section 3: Omission Notice
  • Note about omitted pages for transparency
  • Formatted as readable ranges
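The three sections above can be assembled like this; the section markers and templates are illustrative, not the exact production format:

```typescript
// Assemble the final compressed context string with clear section markers.
function buildContext(
  fullPages: { pageNumber: number; content: string }[],
  summaryPages: { pageNumber: number; summary: string }[],
  omittedPages: number[],
): string {
  const parts: string[] = [];
  // Section 1: full content for the most relevant pages.
  if (fullPages.length > 0) {
    parts.push("=== FULL CONTENT (most relevant pages) ===");
    for (const p of fullPages) parts.push(`[Page ${p.pageNumber}]\n${p.content}`);
  }
  // Section 2: summaries for moderately relevant pages.
  if (summaryPages.length > 0) {
    parts.push("=== SUMMARIES (additional context) ===");
    for (const p of summaryPages) parts.push(`[Page ${p.pageNumber}] ${p.summary}`);
  }
  // Section 3: omission notice for transparency.
  if (omittedPages.length > 0) {
    parts.push(`(Pages omitted as not relevant to this query: ${omittedPages.join(", ")})`);
  }
  return parts.join("\n\n");
}
```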

Performance and Cost Analysis

Before Hierarchical Compression

| Metric | Value |
| --- | --- |
| Average document size | 100 pages |
| Token usage per query | 150,000 tokens |
| Cost per query | $1.50-2.00 |
| Response time | 15-20 seconds |
| Max document size | ~85 pages (context limit) |

After Hierarchical Compression

| Metric | Value | Improvement |
| --- | --- | --- |
| Average document size | 100+ pages | No effective limit |
| Token usage per query | 5,000-8,000 tokens | 92% reduction |
| Cost per query | $0.05-0.10 | 93% reduction |
| Response time | 2-3 seconds | 85% faster |
| Max document size | 500+ pages | 6x increase |

Quality Validation

We validated that compression doesn't harm response quality through comprehensive testing:

| Metric | Full Context | Compressed Context | Difference |
| --- | --- | --- | --- |
| Accuracy | 4.7/5 | 4.6/5 | -0.1 (2% decrease) |
| Relevance | 4.5/5 | 4.7/5 | +0.2 (4% increase) |
| Completeness | 4.8/5 | 4.5/5 | -0.3 (6% decrease) |
| Overall | 4.7/5 | 4.6/5 | -0.1 (2% decrease) |
Key Finding: Compression maintains quality while dramatically improving cost and performance.

Lessons Learned

1. Preprocessing Pays for Itself Immediately

The one-time cost of generating summaries and metadata is recovered after the first query. This upfront investment enables massive per-query savings.

2. Simple Heuristics Work Well for Academic Content

We initially considered complex NLP models for keyword extraction and relevance scoring. But simple frequency analysis and pattern matching work remarkably well for structured educational content.

3. Three-Tier Compression is the Sweet Spot

We experimented with different tier counts. Three tiers (full, summary, omit) provided the perfect balance between coverage and efficiency.

4. Token Budget Should Be Adaptive

We started with a fixed 8,000-token budget. Future iterations will dynamically adjust based on query complexity and intent.

5. Summaries Need to Be Information-Dense

Early summaries were too generic. We improved their quality with more focused prompts and tuned generation parameters.

6. Graceful Degradation is Essential

Not all files have metadata (legacy files, processing failures). A fallback path ensures the system continues working even when compression isn't available.

7. Monitoring Context Quality is Critical

We added instrumentation to track relevance scores, token usage distribution, and user feedback to inform optimization decisions.

Future Enhancements

1. Semantic Search Integration

Replace keyword matching with embedding-based semantic similarity for better relevance detection.

2. Multi-Query Context Reuse

Cache compressed contexts for follow-up questions on similar topics.

3. User-Adjustable Compression

Let users control compression aggressiveness (fast, balanced, comprehensive modes).

4. Learning from Feedback

Track which pages AI uses in responses to improve future compression.

5. Cross-Document Compression

Handle queries spanning multiple uploaded files.

Conclusion

Hierarchical Context Compression transformed Cereby AI's File Chat from an expensive, slow feature into a fast, cost-effective learning tool. The key was a two-phase approach:

  1. One-time preprocessing — Generate summaries, keywords, and metadata
  2. Query-time compression — Intelligently select full content, summaries, or omit based on relevance
The Results:

  ✅ 92% token reduction (150K → 8K tokens)
  ✅ 93% cost reduction ($1.50 → $0.08 per query)
  ✅ 85% faster responses (15-20s → 2-3s)
  ✅ 6x larger documents (85 → 500+ pages)
  ✅ Maintained quality (4.7 → 4.6 out of 5)

For teams building similar AI systems, our key takeaways are:

  1. Preprocess documents once — Upfront investment pays for itself immediately
  2. Score relevance dynamically — Don't send everything; send what matters
  3. Use hierarchical tiers — Full, summary, omit is the sweet spot
  4. Validate quality rigorously — Compression must maintain accuracy
  5. Make it transparent — Show users what's happening
  6. Design for graceful degradation — System must work even when compression fails

Want to learn more about Cereby AI's architecture? Check out our Building Cereby AI and Optimizing Cereby AI Performance posts or reach out on Twitter.

Visual Summary

flowchart TD
    A[Uploaded Document] --> B[One-Time Preprocessing]
    B --> C[Per-Page Summary + Keywords]
    C --> D[Query-Time Relevance Scoring]
    D --> E{Tier Decision}
    E -->|High| F[Full Page Content]
    E -->|Medium| G[Summary Only]
    E -->|Low| H[Omit Page]
    F --> I[Final Context Assembly]
    G --> I
    H --> I
    I --> J[Fast, Lower-Cost AI Response]