Query-Aware Smart Compression: Solving the Single-Page Truncation Problem (part 2)
The Problem: When One Page is Too Big
In our previous post about Hierarchical Context Compression, we detailed how we reduced token usage by 92% by intelligently selecting which pages to include from large documents. That system works beautifully when content is distributed across many pages — it identifies the 3-5 most relevant pages and includes their full content.
But we discovered a critical edge case: What happens when a single page exceeds the entire token budget?
Real-World Example
A student uploads a research paper with dense, single-column formatting. Page 12 contains a comprehensive literature review — 15,000 tokens of continuous text covering dozens of studies, methodologies, and findings. When the student asks:
"What studies support the connection between sleep deprivation and cognitive decline?"
Our system allocated 8,000 tokens for this highly relevant page. But with the old implementation, here's what happened:
Page 12: 15,000 tokens
Token Budget: 8,000 tokens
Old approach: Take first 8,000 tokens
Result: ✂️ TRUNCATED
Content included:
✅ Introduction (tokens 0-2000)
✅ Early studies (tokens 2000-5000)
✅ Methodology overview (tokens 5000-8000)
❌ Key findings (tokens 8000-12000) ← LOST!
❌ Recent research (tokens 12000-15000) ← LOST!
The answer to the student's question was in tokens 10,000-12,000. But our system only saw the first 8,000 tokens and missed it entirely.
The Limitation of Simple Truncation
Our original implementation used a straightforward approach: when content exceeded the budget, just take the beginning and truncate the rest.
Why This Failed
Problem 1: Position Bias
Information at the beginning was always included, while content at the end was always lost, regardless of relevance to the query.
Problem 2: All-or-Nothing
The system treated content as atomic. Either include everything or truncate blindly.
Problem 3: Ignored Query Context
The truncation happened without considering what the user actually asked.
Problem 4: Silent Information Loss
Users received responses based on incomplete information, without understanding what was missing.
The Real Impact
We analyzed 1,000 queries against documents with oversized pages:
| Metric | Result |
|---|---|
| Queries affected | 23% (230 queries) |
| Average information loss | 47% of page content |
| Wrong answers | 18% of affected queries |
| Incomplete answers | 64% of affected queries |
| User satisfaction | 2.8/5 (vs 4.6/5 normally) |
This wasn't a minor edge case — it was affecting nearly a quarter of queries and seriously degrading answer quality.
Enter: Query-Aware Smart Compression
We needed a system that could:
- Break pages into logical chunks (paragraphs, sections)
- Score each chunk by relevance to the user's query
- Select the best chunks within the token budget
- Maintain document flow by reordering selected chunks
- Communicate gaps clearly to both the AI and user
The Architecture
Large Page (15K tokens)
↓
Split into Chunks
↓
Score Each Chunk (against User Query)
↓
Select Best Chunks (within 8K budget)
↓
Reorder by Position
↓
Assemble with Gap Markers
↓
Compressed Content (8K tokens)
Implementation Deep Dive
Step 1: Intelligent Chunking
We implemented two chunking strategies that adapt to content structure:
Strategy 1: Paragraph-Based Splitting
- Split by double newlines (paragraph boundaries)
- Preserves complete thoughts
- Typical paragraphs are 200-500 tokens
- Natural semantic boundaries
Strategy 2: Sentence-Based Grouping
For dense content without paragraph breaks:
- Split into sentences
- Group into ~500 token chunks
- Ensures manageable chunk sizes
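A minimal sketch of the two strategies combined. Names like `chunk_page` and the 4-characters-per-token heuristic are illustrative assumptions, not our production API:

```python
import re

TARGET_CHUNK_TOKENS = 500  # the ~500-token target described above

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text (assumption).
    return max(1, len(text) // 4)

def chunk_page(text: str) -> list:
    """Strategy 1: split on paragraph boundaries (double newlines)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for para in paragraphs:
        if estimate_tokens(para) <= 2 * TARGET_CHUNK_TOKENS:
            chunks.append(para)
        else:
            # Strategy 2: dense text with no breaks -> group sentences.
            chunks.extend(group_sentences(para))
    return chunks

def group_sentences(text: str) -> list:
    """Split into sentences, then group into ~500-token chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, current_tokens = [], [], 0
    for sent in sentences:
        t = estimate_tokens(sent)
        if current and current_tokens + t > TARGET_CHUNK_TOKENS:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sent)
        current_tokens += t
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Paragraph splitting is tried first because it preserves complete thoughts; sentence grouping is only the fallback for wall-of-text pages.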
Step 2: Query-Aware Scoring
Each chunk receives a relevance score (0-1) based on three factors:
Component 1: Keyword Matching (60% weight)
- Compare query keywords against chunk keywords
- Calculate match ratio
- Keyword overlap is the strongest signal
Component 2: Topic Matching
- Compare query topics against chunk topics
- Use case-insensitive substring matching
- Captures semantic relevance beyond keywords
Component 3: Position Weighting
- First and last chunks slightly more important
- Acts as tie-breaker
- Reflects structural importance
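A sketch of how the three components could combine. Only the 60% keyword weight is stated above; the 30/10 split for topics and position, the naive keyword extractor, and the function names are assumptions for illustration:

```python
import re

def keywords(text: str) -> set:
    # Naive keyword extraction (assumption): lowercase words longer than 3 chars.
    return {w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3}

def score_chunk(chunk: str, query: str, index: int, total: int,
                topics: list) -> float:
    """Combine keyword match (60%), topic match, and position into a 0-1 score."""
    q_kw, c_kw = keywords(query), keywords(chunk)
    keyword_score = len(q_kw & c_kw) / len(q_kw) if q_kw else 0.0

    # Topic matching: case-insensitive substring check.
    chunk_lower = chunk.lower()
    topic_score = (sum(t.lower() in chunk_lower for t in topics) / len(topics)
                   if topics else 0.0)

    # Position weighting: first and last chunks get a small boost (tie-breaker).
    position_score = 1.0 if index in (0, total - 1) else 0.5

    # 60% keywords per the text; the 30/10 split is an assumed example.
    return 0.6 * keyword_score + 0.3 * topic_score + 0.1 * position_score
```

Because the keyword term dominates, a chunk that actually mentions the query's subject always outranks a structurally favored but irrelevant one.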
Step 3: Greedy Selection
Once chunks are scored, we use a greedy algorithm to fill the token budget:
The Process:
- Sort chunks by relevance (best first)
- Select chunks within token budget
- Account for gap marker overhead
- Stop when budget is exhausted
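The steps above can be sketched as follows; the per-marker overhead constant and the token estimator are illustrative assumptions:

```python
GAP_MARKER_TOKENS = 10  # assumed overhead per "[... N sections omitted ...]" marker

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (assumption).
    return max(1, len(text) // 4)

def select_chunks(scored: list, budget: int) -> list:
    """scored: list of (original_index, chunk, score). Greedily fill the budget."""
    selected, used = [], 0
    # Best chunks first.
    for index, chunk, score in sorted(scored, key=lambda x: x[2], reverse=True):
        cost = estimate_tokens(chunk) + GAP_MARKER_TOKENS
        if used + cost <= budget:
            selected.append((index, chunk))
            used += cost
    # Reorder by original position to maintain document flow.
    selected.sort(key=lambda x: x[0])
    return selected
```

Note that the final sort restores reading order: greedy selection decides *what* to keep, position decides *where* it appears.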
Step 4: Assembly with Gap Indicators
The final assembled content makes gaps explicit:
Section Markers:
- Header indicating compression
- Gap indicators showing omitted sections
- Formatted as readable ranges
[Content compressed: Showing 8/25 most relevant sections]
Introduction text...
[... 4 sections omitted ...]
Relevant content about sleep deprivation and cognitive decline...
[... 2 sections omitted ...]
More relevant findings...
[... 10 sections omitted ...]
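Assembly with gap markers reduces to a single pass over the position-sorted selection; a minimal sketch (function name assumed):

```python
def assemble(selected: list, total_chunks: int) -> str:
    """selected: list of (original_index, chunk), sorted by position."""
    parts = [f"[Content compressed: Showing {len(selected)}/{total_chunks} "
             f"most relevant sections]"]
    prev = -1
    for index, chunk in selected:
        gap = index - prev - 1
        if gap > 0:
            # Make omitted ranges explicit so the model never infers across them.
            parts.append(f"[... {gap} sections omitted ...]")
        parts.append(chunk)
        prev = index
    trailing = total_chunks - 1 - prev
    if trailing > 0:
        parts.append(f"[... {trailing} sections omitted ...]")
    return "\n\n".join(parts)
```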
Real-World Performance
Before vs After Comparison
Test Case: a 25,000-token research paper; the user asks about a specific methodology.
Old Truncation Approach:
- Included first 8,000 tokens (introduction, background)
- Missing methodology section
- Answer: FAILED - "methodology not described"
New Query-Aware Approach:
- Identified methodology sections as highly relevant
- Included the 7 most relevant chunks
- Total: 7,550 tokens
- Answer: SUCCESS - complete methodology described
Quantitative Results
We tested on 500 queries with oversized pages:
| Metric | Old Truncation | Query-Aware | Improvement |
|---|---|---|---|
| Correct answers | 52% | 94% | +42 pp ✅ |
| Partial answers | 31% | 5% | -26 pp ✅ |
| Failed answers | 17% | 1% | -16 pp ✅ |
| Avg relevance score | 0.41 | 0.89 | +117% ✅ |
| User satisfaction | 2.8/5 | 4.7/5 | +68% ✅ |
| Token efficiency | 63% | 94% | +31 pp ✅ |
Edge Cases Handled
Case 1: Single Massive Chunk
Some pages have no natural breaks (a single 15,000-token paragraph). Solution: split further by sentences.
Case 2: No Keyword Matches
Query keywords don't appear in any chunk (rare). Solution: fall back to position-based sampling weighted by structural importance.
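One way to implement the Case 2 fallback, keeping the first and last chunks (structurally most important) and then filling from the middle in reading order; the name and token heuristic are illustrative:

```python
def fallback_positional(chunks: list, budget: int) -> list:
    """Position-based fallback when no chunk matches the query keywords."""
    def est(text: str) -> int:
        # Rough heuristic: ~4 characters per token (assumption).
        return max(1, len(text) // 4)
    n = len(chunks)
    # Structural importance: first chunk, then last, then the middle in order.
    priority = ([0] if n else []) + ([n - 1] if n > 1 else []) + list(range(1, n - 1))
    selected, used = [], 0
    for i in priority:
        cost = est(chunks[i])
        if used + cost <= budget:
            selected.append(i)
            used += cost
    selected.sort()  # restore document order
    return [chunks[i] for i in selected]
```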
Integration: Backwards Compatible
The new system integrates seamlessly with existing code through unchanged API signatures - zero breaking changes.
Debugging & Monitoring
We added development-mode logging for transparency showing compression ratio, included/omitted chunks, and relevance scores.
Future Enhancements
1. Semantic Chunk Scoring
Replace lexical keyword matching with embedding-based semantic similarity for better synonym and paraphrase handling.
2. Hierarchical Chunk Scoring
Score at multiple granularities (section → subsection → paragraph) for better context preservation.
3. User Feedback Loop
Learn which chunks users find most valuable through tracking citations and user ratings.
4. Multi-Pass Compression
For extremely large pages (50,000+ tokens), implement rough filtering followed by detailed scoring.
Lessons Learned
1. Position Bias is Real
First and last chunks consistently proved more valuable than expected. Don't ignore structural heuristics.
2. Paragraph Boundaries Matter
Respect semantic boundaries. Natural language has structure; leverage it.
3. Gap Indicators are Critical
Without explicit gap markers, AI would try to infer connections across gaps (hallucinations). Transparency is not optional.
4. Greedy Selection is Good Enough
Complex optimization algorithms didn't improve quality enough to justify the added overhead.
5. Development Mode Logging is Essential
Being able to see compression decisions in real-time during development was invaluable for building intuition.
Conclusion
Query-Aware Smart Compression solved the "large single page" problem that simple truncation created. By intelligently chunking, scoring, and selecting content based on query relevance, we:
✅ Improved answer accuracy from 52% to 94% (+42 percentage points)
✅ Reduced failed answers from 17% to 1% (17x improvement)
✅ Increased user satisfaction from 2.8/5 to 4.7/5 (+68%)
✅ Achieved 94% token efficiency (vs 63% with truncation)
✅ Maintained backwards compatibility (zero breaking changes)
The key principles that made this successful:
- Respect semantic boundaries — Chunk by paragraphs, not fixed tokens
- Score by relevance — Use query keywords and topics, not just position
- Select greedily — Simple algorithms often suffice
- Reorder structurally — Maintain document flow after selection
- Communicate gaps — Make compression decisions transparent
- Log extensively — Observability drives optimization
Combined with Hierarchical Context Compression, Cereby AI now handles documents of any size with any page structure. Whether it's a 500-page textbook with tiny pages or a 50-page research paper with massive single-page sections, students get accurate, relevant answers.
Want to dive deeper into Cereby AI's architecture? Check out our Hierarchical Context Compression post for page-level compression details, or explore our Building Cereby AI series.
Visual Summary
```mermaid
flowchart LR
    A[Large Single-Page Content] --> B[Semantic Chunking]
    B --> C[Query Keyword + Topic Scoring]
    C --> D[Greedy Chunk Selection]
    D --> E[Structural Reordering]
    E --> F[Gap Indicators]
    F --> G[Compressed Context to Model]
    G --> H[Accurate Response]
```