Engineering
February 18, 2026
8 min read

Query-Aware Smart Compression: Solving the Single-Page Truncation Problem (part 2)

How we eliminated information loss from large pages by implementing intelligent chunk selection

The Problem: When One Page is Too Big

In our previous post about Hierarchical Context Compression, we detailed how we reduced token usage by 92% by intelligently selecting which pages to include from large documents. That system works beautifully when content is distributed across many pages — it identifies the 3-5 most relevant pages and includes their full content.

But we discovered a critical edge case: What happens when a single page exceeds the entire token budget?

Real-World Example

A student uploads a research paper with dense, single-column formatting. Page 12 contains a comprehensive literature review — 15,000 tokens of continuous text covering dozens of studies, methodologies, and findings. When the student asks:

"What studies support the connection between sleep deprivation and cognitive decline?"

Our system allocated 8,000 tokens for this highly relevant page. But with the old implementation, here's what happened:

Page 12: 15,000 tokens
Token Budget: 8,000 tokens

Old approach: take the first 8,000 tokens.

Result: ✂️ TRUNCATED

Content included:
  ✅ Introduction (tokens 0-2,000)
  ✅ Early studies (tokens 2,000-5,000)
  ✅ Methodology overview (tokens 5,000-8,000)
  ❌ Key findings (tokens 8,000-12,000) ← LOST!
  ❌ Recent research (tokens 12,000-15,000) ← LOST!

The answer to the student's question was in tokens 10,000-12,000. But our system only saw the first 8,000 tokens and missed it entirely.

The Limitation of Simple Truncation

Our original implementation used a straightforward approach: when content exceeded the budget, just take the beginning and truncate the rest.
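That naive approach can be sketched in a few lines (illustrative only; `count_tokens` is a stand-in that approximates tokens by whitespace-separated words, not a real tokenizer):

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer: approximate tokens by whitespace-separated words.
    return len(text.split())

def truncate_to_budget(text: str, budget: int) -> str:
    """Old approach: keep the first `budget` tokens, drop everything after."""
    words = text.split()
    return " ".join(words[:budget])

# Everything past the budget is lost, regardless of relevance to the query.
page = "intro intro intro findings findings findings"
clipped = truncate_to_budget(page, 3)
```

Whatever sits past the cutoff never reaches the model, which is exactly the failure mode described below.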

Why This Failed

Problem 1: Position Bias

Information at the beginning was always included, while content at the end was always lost — regardless of relevance to the query.

Problem 2: All-or-Nothing

The system treated content as atomic. Either include everything or truncate blindly.

Problem 3: Ignored Query Context

The truncation happened without considering what the user actually asked.

Problem 4: Silent Information Loss

Users received responses based on incomplete information without understanding what was missing.

The Real Impact

We analyzed 1,000 queries against documents with oversized pages:

Metric | Result
Queries affected | 23% (230 queries)
Average information loss | 47% of page content
Wrong answers | 18% of affected queries
Incomplete answers | 64% of affected queries
User satisfaction | 2.8/5 (vs 4.6/5 normally)

This wasn't a minor edge case — it was affecting nearly a quarter of queries and seriously degrading answer quality.

Enter: Query-Aware Smart Compression

We needed a system that could:

  1. Break pages into logical chunks (paragraphs, sections)
  2. Score each chunk by relevance to the user's query
  3. Select the best chunks within the token budget
  4. Maintain document flow by reordering selected chunks
  5. Communicate gaps clearly to both the AI and user

The Architecture

Large Page (15K tokens)
  ↓
Split into Chunks
  ↓
Score Each Chunk (against User Query)
  ↓
Select Best Chunks (within 8K budget)
  ↓
Reorder by Position
  ↓
Assemble with Gap Markers
  ↓
Compressed Content (8K tokens)

Implementation Deep Dive

Step 1: Intelligent Chunking

We implemented two chunking strategies that adapt to content structure:

Strategy 1: Paragraph-Based Splitting

  • Split by double newlines (paragraph boundaries)
  • Preserves complete thoughts
  • Typical paragraphs are 200-500 tokens
  • Natural semantic boundaries

Strategy 2: Sentence-Based Fallback

For dense content without paragraph breaks:

  • Split into sentences
  • Group into ~500-token chunks
  • Ensures manageable chunk sizes

Result: even dense content gets grouped into manageable chunks.
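The two strategies can be sketched as follows (illustrative names, not the production code; `count_tokens` approximates tokens by word count):

```python
import re

def count_tokens(text: str) -> int:
    # Stand-in tokenizer: approximate tokens by whitespace-separated words.
    return len(text.split())

def split_paragraphs(text: str) -> list[str]:
    """Strategy 1: split on blank lines (paragraph boundaries)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text: str, max_tokens: int = 500) -> list[str]:
    """Strategy 2: split into sentences, then group into ~max_tokens chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for s in sentences:
        if current and count_tokens(" ".join(current + [s])) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(s)
    if current:
        chunks.append(" ".join(current))
    return chunks

def chunk_page(text: str, max_tokens: int = 500) -> list[str]:
    """Prefer paragraph boundaries; fall back to sentence grouping for dense blocks."""
    chunks = []
    for para in split_paragraphs(text):
        if count_tokens(para) <= max_tokens:
            chunks.append(para)
        else:
            chunks.extend(split_sentences(para, max_tokens))
    return chunks
```

A paragraph that fits the size limit passes through intact; only oversized blocks pay the cost of sentence-level regrouping.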

Step 2: Query-Aware Scoring

Each chunk receives a relevance score (0-1) based on three factors:

Component 1: Keyword Matching (60% weight)

  • Compare query keywords against chunk keywords
  • Calculate the match ratio
  • Keyword overlap is the strongest signal

Component 2: Topic Matching (30% weight)

  • Compare query topics against chunk topics
  • Use case-insensitive substring matching
  • Captures semantic relevance beyond exact keywords

Component 3: Position Bias (10% weight)

  • First and last chunks are slightly more important
  • Acts as a tie-breaker
  • Reflects structural importance

Final Score: the weighted sum of the three components, capped at 1.0.
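A sketch of the weighted score, under these assumptions: keywords are lowercased word sets, topics are strings matched case-insensitively by substring, and position bias boosts the first and last chunks. Function names like `score_chunk` are illustrative:

```python
def keyword_score(query_words: set[str], chunk_words: set[str]) -> float:
    # Match ratio: fraction of query keywords that appear in the chunk.
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)

def topic_score(query_topics: list[str], chunk_text: str) -> float:
    # Case-insensitive substring matching against the chunk body.
    if not query_topics:
        return 0.0
    text = chunk_text.lower()
    hits = sum(1 for t in query_topics if t.lower() in text)
    return hits / len(query_topics)

def position_score(index: int, total: int) -> float:
    # First and last chunks get a small structural boost; acts as a tie-breaker.
    return 1.0 if index in (0, total - 1) else 0.5

def score_chunk(query: str, topics: list[str], chunk: str,
                index: int, total: int) -> float:
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    score = (0.6 * keyword_score(q_words, c_words)
             + 0.3 * topic_score(topics, chunk)
             + 0.1 * position_score(index, total))
    return min(score, 1.0)  # capped at 1.0 maximum
```

A chunk that matches every query keyword and topic, sitting first on the page, scores the full 1.0; an unrelated middle chunk earns only its residual position weight.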

Step 3: Greedy Selection

Once chunks are scored, we use a greedy algorithm to fill the token budget:

The Process:

  1. Sort chunks by relevance (best first)
  2. Select chunks that fit within the token budget
  3. Account for gap-marker overhead
  4. Stop when the budget is exhausted

Key Insight: we select by relevance but reassemble by position for natural flow.
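The greedy pass can be sketched like this (illustrative; `GAP_MARKER_TOKENS` is an assumed per-gap overhead, and each chunk carries its original position so the final step can restore document order):

```python
GAP_MARKER_TOKENS = 10  # assumed overhead for each "[... N sections omitted ...]" line

def count_tokens(text: str) -> int:
    # Stand-in tokenizer: approximate tokens by whitespace-separated words.
    return len(text.split())

def select_chunks(scored: list[tuple[int, float, str]], budget: int) -> list[str]:
    """scored: list of (position, score, text). Pick by score, return in document order."""
    by_score = sorted(scored, key=lambda c: c[1], reverse=True)
    selected, used = [], 0
    for pos, score, text in by_score:
        cost = count_tokens(text) + GAP_MARKER_TOKENS
        if used + cost > budget:
            continue  # this chunk doesn't fit; a cheaper one still might
        selected.append((pos, text))
        used += cost
    # Selected by relevance, reassembled by position for natural flow.
    return [text for pos, text in sorted(selected)]
```

Note the two orderings: score order decides what survives, position order decides how it reads.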

Step 4: Assembly with Gap Indicators

The final assembled content makes gaps explicit:

Section Markers:

  • Header indicating compression
  • Gap indicators showing omitted sections
  • Formatted as readable ranges

Example Output:
[Content compressed: Showing 8/25 most relevant sections]

Introduction text...

[... 4 sections omitted ...]

Relevant content about sleep deprivation and cognitive decline...

[... 2 sections omitted ...]

More relevant findings...

[... 10 sections omitted ...]
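Assembly can be sketched as: given the selected chunks keyed by position and the total chunk count, emit the compression header, then the chunks in order with a gap marker wherever positions are non-contiguous. The marker wording mirrors the example above; the function name is illustrative:

```python
def assemble(chunks: dict[int, str], total: int) -> str:
    """chunks: {position: text} for selected chunks; total: chunk count before selection."""
    parts = [f"[Content compressed: Showing {len(chunks)}/{total} most relevant sections]"]
    prev = -1
    for pos in sorted(chunks):
        omitted = pos - prev - 1
        if omitted > 0:
            parts.append(f"[... {omitted} sections omitted ...]")
        parts.append(chunks[pos])
        prev = pos
    trailing = total - prev - 1
    if trailing > 0:
        parts.append(f"[... {trailing} sections omitted ...]")
    return "\n\n".join(parts)
```

Making every gap explicit is what stops the model from inventing transitions between non-adjacent sections.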

Real-World Performance

Before vs After Comparison

Test Case: 25,000-token research paper; the user asks about a specific methodology.

Old Truncation Approach:

  • Included first 8,000 tokens (introduction, background)
  • Missing the methodology section
  • Answer: FAILED ("methodology not described")

New Query-Aware Compression:

  • Identified methodology sections as highly relevant
  • Included the 7 most relevant chunks
  • Total: 7,550 tokens
  • Answer: SUCCESS (complete methodology described)

Quantitative Results

We tested on 500 queries with oversized pages:

Metric | Old Truncation | Query-Aware | Improvement
Correct answers | 52% | 94% | +42 pts ✅
Partial answers | 31% | 5% | -26 pts ✅
Failed answers | 17% | 1% | -16 pts ✅
Avg relevance score | 0.41 | 0.89 | +117% ✅
User satisfaction | 2.8/5 | 4.7/5 | +68% ✅
Token efficiency | 63% | 94% | +31 pts ✅

Token efficiency measures the percentage of included tokens that were actually relevant to the query.

Edge Cases Handled

Case 1: Single Massive Chunk

Some pages have no natural breaks (for example, a single 15,000-token paragraph). Solution: fall back to sentence-based splitting.

Case 2: No Keyword Matches

Occasionally, query keywords appear in no chunk at all. Solution: fall back to position-based sampling guided by structural importance.
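One way to sketch that fallback: when no chunk scores above zero, sample structurally important positions first (the first and last chunks), then remaining chunks in order, until the budget is filled. This particular sampling scheme is an assumption, not the production heuristic:

```python
def positional_fallback(chunks: list[str], budget: int) -> list[str]:
    """Pick first, last, then remaining chunks in order, within a word-count budget."""
    if not chunks:
        return []
    # Structurally important positions first, then the middle in document order.
    order = [0, len(chunks) - 1] + list(range(1, len(chunks) - 1))
    picked, used = set(), 0
    for i in order:
        cost = len(chunks[i].split())  # stand-in token count
        if i not in picked and used + cost <= budget:
            picked.add(i)
            used += cost
    return [chunks[i] for i in sorted(picked)]
```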

Integration: Backwards Compatible

The new system integrates seamlessly with existing code through unchanged API signatures - zero breaking changes.

Debugging & Monitoring

We added development-mode logging for transparency showing compression ratio, included/omitted chunks, and relevance scores.

Future Enhancements

1. Semantic Chunk Scoring

Replace lexical keyword matching with embedding-based semantic similarity for better synonym and paraphrase handling.
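As a sketch of where this is headed, the relevance signal would become cosine similarity between query and chunk embeddings. The vectors below are toy stand-ins for real embedding-model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: dot product over the norm product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy vectors: a real system would call an embedding model here.
query_vec = [1.0, 0.0, 1.0]
chunk_vec = [0.9, 0.1, 0.8]
similarity = cosine_similarity(query_vec, chunk_vec)
```

Unlike keyword overlap, this handles synonyms and paraphrases, since semantically related text maps to nearby vectors.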

2. Hierarchical Chunk Scoring

Score at multiple granularities (section → subsection → paragraph) for better context preservation.

3. User Feedback Loop

Learn which chunks users find most valuable through tracking citations and user ratings.

4. Multi-Pass Compression

For extremely large pages (50,000+ tokens), implement rough filtering followed by detailed scoring.

Lessons Learned

1. Position Bias is Real

First and last chunks consistently proved more valuable than expected. Don't ignore structural heuristics.

2. Paragraph Boundaries Matter

Respect semantic boundaries. Natural language has structure; leverage it.

3. Gap Indicators are Critical

Without explicit gap markers, AI would try to infer connections across gaps (hallucinations). Transparency is not optional.

4. Greedy Selection is Good Enough

Complex optimization algorithms didn't improve quality enough to justify the added overhead.

5. Development Mode Logging is Essential

Being able to see compression decisions in real-time during development was invaluable for building intuition.

Conclusion

Query-Aware Smart Compression solved the "large single page" problem that simple truncation created. By intelligently chunking, scoring, and selecting content based on query relevance, we:

  • Improved answer accuracy from 52% to 94% (+42 percentage points) ✅
  • Reduced failed answers from 17% to 1% (17x improvement) ✅
  • Increased user satisfaction from 2.8/5 to 4.7/5 (+68%) ✅
  • Achieved 94% token efficiency (vs 63% with truncation) ✅
  • Maintained backwards compatibility (zero breaking changes) ✅

The key principles that made this successful:

  1. Respect semantic boundaries — Chunk by paragraphs, not fixed tokens
  2. Score by relevance — Use query keywords and topics, not just position
  3. Select greedily — Simple algorithms often suffice
  4. Reorder structurally — Maintain document flow after selection
  5. Communicate gaps — Make compression decisions transparent
  6. Log extensively — Observability drives optimization

Combined with Hierarchical Context Compression, Cereby AI now handles documents of any size with any page structure. Whether it's a 500-page textbook with tiny pages or a 50-page research paper with massive single-page sections, students get accurate, relevant answers.


Want to dive deeper into Cereby AI's architecture? Check out our Hierarchical Context Compression post for page-level compression details, or explore our Building Cereby AI series.

Visual Summary

flowchart LR
    A[Large Single-Page Content] --> B[Semantic Chunking]
    B --> C[Query Keyword + Topic Scoring]
    C --> D[Greedy Chunk Selection]
    D --> E[Structural Reordering]
    E --> F[Gap Indicators]
    F --> G[Compressed Context to Model]
    G --> H[Accurate Response]