Building a Smarter Citation System: How We Improved Essay Quality by 60%
TL;DR
We rebuilt our essay quote suggestion system from the ground up, replacing naive keyword matching with intelligent quality scoring, diversity algorithms, and adaptive thresholds. The result? 60% better quote quality, 92% location accuracy, and 5x faster performance. Here's how we did it.
The Problem: AI Essays with Poor Citations
When we launched Cereby's AI essay generation feature, we were excited to help students create well-researched papers. But user feedback quickly revealed a critical flaw: the citations were terrible.
Our system was suggesting quotes, but they were:
- 🔴 Incomplete sentence fragments
- 🔴 References to tables and figures
- 🔴 Too short or impossibly long
- 🔴 Often inaccurate page numbers
- 🔴 All three suggestions nearly identical
Worse, our location estimates were off by 3+ pages in documents with images and tables. Users couldn't verify citations, undermining trust in the entire system.
We knew we had to do better.
Diagnosing the Root Cause
We did a comprehensive system review and found 10 critical issues. The five most damaging:
1. Keyword Matching Was Too Simplistic
Our bag-of-words approach couldn't distinguish context or recognize that related concepts use different words. This led to both false positives and missed relevant quotes.
2. Fixed Threshold Created a Catch-22
We used a hardcoded threshold for all queries, which meant:
- Short queries like "war" → flood of low-quality matches
- Complex queries like "phenomenological existential dread" → zero suggestions despite relevant content
3. No Quality Assessment
The system treated all sentences equally, regardless of whether they were complete thoughts, fragments, or table references.
4. Location Estimation Failed on Real Documents
We assumed uniform sentence distribution across pages, but reality was different:
- Dense text pages had 100+ sentences
- Image/table pages had 10 sentences
- Result: Off by 3+ pages 40% of the time
5. All Suggestions Were Similar
Typical output would be three nearly identical quotes about the same aspect of a topic, missing diversity.
The Solution: Intelligent Quote Selection
We rebuilt the system with 9 major improvements:
1. Multi-Metric Quality Scoring
Instead of treating all sentences equally, we assess quote quality across 5 dimensions:
Quality Dimensions:
- Completeness (30%) → Complete sentence with proper punctuation
- Informativeness (20%) → Not too generic or vague
- Citability (20%) → No problematic content like "See Table 3"
- Length (25%) → 15-40 words is ideal range
- Authoritative (10%) → Makes strong, clear claims
Each quote receives a weighted score, normalised to the 0-1 range, which immediately filters out 80% of bad suggestions.
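The scoring above can be sketched as follows. This is a minimal illustration, not Cereby's actual implementation: the weights mirror the list above (normalised so the result stays in 0-1), while the per-dimension rules, the generic-opener list, and the hedge-word set are simplified assumptions.

```python
import re

# Weights from the post, normalised at the end so the score lands in 0-1.
WEIGHTS = {
    "completeness": 0.30,
    "informativeness": 0.20,
    "citability": 0.20,
    "length": 0.25,
    "authoritative": 0.10,
}

GENERIC_OPENERS = ("this is", "there are", "it is")           # illustrative
PROBLEM_PATTERNS = re.compile(r"\b(see|cf\.)\s+(table|figure|fig\.)\s*\d+", re.I)
HEDGE_WORDS = {"might", "perhaps", "possibly"}                # illustrative

def quality_score(sentence: str) -> float:
    words = sentence.split()
    scores = {
        # Complete sentence: starts capitalised, ends with terminal punctuation.
        "completeness": 1.0 if sentence[:1].isupper()
                        and sentence.rstrip().endswith((".", "!", "?")) else 0.0,
        # Not too generic or vague.
        "informativeness": 0.0 if sentence.lower().startswith(GENERIC_OPENERS) else 1.0,
        # No "See Table 3"-style references.
        "citability": 0.0 if PROBLEM_PATTERNS.search(sentence) else 1.0,
        # 15-40 words is the ideal range; partial credit just outside it.
        "length": 1.0 if 15 <= len(words) <= 40
                  else 0.5 if 10 <= len(words) <= 50 else 0.0,
        # Strong, clear claims (no heavy hedging).
        "authoritative": 0.0 if HEDGE_WORDS & {w.lower() for w in words} else 1.0,
    }
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / sum(WEIGHTS.values())
```

In practice each dimension would return a graded score rather than a binary 0/1, but the weighted structure is the same.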
2. Adaptive Thresholds with Statistics
Instead of a fixed threshold, we calculate one based on the actual score distribution:
The Approach:
- Calculate mean and standard deviation of all scores
- Set threshold based on statistical distribution
- Bounded between 0.15 (lenient) and 0.6 (strict)
- System automatically adjusts to content availability
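A compact sketch of the idea, assuming a threshold of the form mean + k·stdev. The post only specifies that the threshold follows the score distribution and is clamped to 0.15-0.6; the `k = 0.5` shape is an illustrative assumption.

```python
import statistics

def adaptive_threshold(scores, k=0.5, lo=0.15, hi=0.6):
    """Derive a match threshold from the actual score distribution.

    Clamped to [lo, hi] so short queries with a flood of weak matches
    stay strict, while sparse results still surface something.
    """
    if not scores:
        return lo
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores) if len(scores) > 1 else 0.0
    return min(hi, max(lo, mean + k * sd))
```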
3. Diversity Algorithm
We implemented a greedy diversity selection that balances relevance and uniqueness:
The Strategy:
- Always include the best match first
- For remaining slots, maximize diversity from already-selected quotes
- Combined score: 60% relevance + 40% diversity

Now each set of suggestions covers different aspects of the topic.
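The greedy strategy can be sketched like this, using Jaccard word overlap as a stand-in for whatever similarity measure the production system uses:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two quotes (0 = disjoint, 1 = identical)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_diverse(candidates, k=3, w_rel=0.6, w_div=0.4):
    """Greedy selection over (quote, relevance) pairs.

    The best match always goes first; each remaining slot maximises
    60% relevance + 40% diversity from the already-selected quotes.
    """
    pool = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not pool:
        return []
    selected = [pool.pop(0)]
    while pool and len(selected) < k:
        def combined(c):
            # Diversity = 1 - max similarity to anything already chosen.
            div = 1.0 - max(jaccard(c[0], s[0]) for s in selected)
            return w_rel * c[1] + w_div * div
        best = max(pool, key=combined)
        pool.remove(best)
        selected.append(best)
    return [quote for quote, _ in selected]
```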
4. Exact Location Finding
Instead of estimation, we search the actual document structure by:
- Searching actual pages for documents
- Searching transcript timestamps for videos/audio
- Finding exact location with high confidence
- Fallback to estimation only when needed
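A simplified page-search sketch, assuming documents arrive as (page_number, text) pairs; the real system also searches transcript timestamps for video and audio.

```python
def find_exact_location(quote, pages, fallback_page=None):
    """Search actual pages for a verbatim match.

    pages: list of (page_number, page_text) pairs.
    Returns (page, exact) where exact is False only when we had to
    fall back to an estimate because the quote was not found.
    """
    needle = " ".join(quote.lower().split())  # normalise case and whitespace
    for page_no, text in pages:
        if needle in " ".join(text.lower().split()):
            return page_no, True
    return fallback_page, False
```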
5. Smart Caching Layer
Quote extraction is expensive. We added caching with:
- LRU cache implementation
- 1-hour TTL (Time To Live)
- Cache key based on query and file
- 5x speedup for repeated queries
For essays where multiple sections draw on the same resources, repeated queries hit the cache instead of re-running extraction.
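A minimal sketch of such a cache, combining LRU eviction with a TTL check on read; the class name and key shape are illustrative.

```python
import time
from collections import OrderedDict

class QuoteCache:
    """LRU cache with a TTL, keyed on (query, file_id)."""

    def __init__(self, max_size=128, ttl_seconds=3600):
        self.max_size, self.ttl = max_size, ttl_seconds
        self._store = OrderedDict()  # key -> (timestamp, value)

    def get(self, query, file_id):
        key = (query, file_id)
        entry = self._store.get(key)
        if entry is None:
            return None
        ts, value = entry
        if time.monotonic() - ts > self.ttl:  # expired entry
            del self._store[key]
            return None
        self._store.move_to_end(key)  # mark as recently used
        return value

    def put(self, query, file_id, value):
        key = (query, file_id)
        self._store[key] = (time.monotonic(), value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least-recently-used
```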
6. Better Query Construction
Old approach: naive concatenation with lots of redundancy.

New approach:
- Extract keywords with weights
- Combine and deduplicate
- Rank by importance
- Take top 15 keywords only

Result: much more focused queries that find better matches.
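A rough sketch of weighted keyword extraction; the title weight of 3, the stopword list, and the frequency-based ranking are all assumptions for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is", "for"}

def build_query(section_title, section_text, top_k=15):
    """Extract weighted keywords, deduplicate, and keep the top_k."""
    counts = Counter()
    for source, weight in ((section_title, 3), (section_text, 1)):
        for word in source.lower().split():
            w = word.strip(".,;:!?()\"'")
            if w and w not in STOPWORDS:
                counts[w] += weight  # words from the title weigh more
    return [word for word, _ in counts.most_common(top_k)]
```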
7. Two-Pass Content Cleaning
We improved detection of AI meta-commentary by:
- First pass: Regex patterns for obvious errors
- Second pass: Heuristic detection of meta-language
- Calculate process word density
- Filter out self-referential text
Now we catch subtle AI mistakes like "I'm going to start by discussing..." and "Let me explain the key concepts..."
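The two passes might look like this; the patterns, process-word list, and 25% density cutoff are illustrative stand-ins for the tuned values.

```python
import re

# First pass: regex patterns for obvious AI meta-commentary.
META_PATTERNS = [
    re.compile(r"^(i'm going to|let me|i will|in this essay i)\b", re.I),
]
# Second pass: heuristic density of "process" words.
PROCESS_WORDS = {"discuss", "explain", "start", "begin", "outline", "summarize"}

def is_meta_commentary(sentence, density_cutoff=0.25):
    """True if the sentence talks about the writing process, not the topic."""
    if any(p.search(sentence) for p in META_PATTERNS):
        return True
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    if not words:
        return False
    density = sum(w in PROCESS_WORDS for w in words) / len(words)
    return density >= density_cutoff
```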
8. Edge Case Handling
We added robust filtering for problematic content:
- Code blocks and math equations
- Numbered list items
- Figure and table references
- URLs and links
- Formulas and equations
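These filters are straightforward to express as a pattern list; the regexes below are illustrative, not the production set.

```python
import re

EDGE_CASE_PATTERNS = [
    re.compile(r"^```|^\$\$"),                          # code blocks / display math
    re.compile(r"^\s*\d+[.)]\s"),                       # numbered list items
    re.compile(r"\b(figure|fig\.|table)\s*\d+", re.I),  # figure/table references
    re.compile(r"https?://\S+"),                        # URLs and links
    re.compile(r"[=<>]\s*\d|\\frac|\\sum"),             # formulas and equations
]

def passes_edge_case_filters(sentence: str) -> bool:
    """True if the sentence is free of known problematic content."""
    return not any(p.search(sentence) for p in EDGE_CASE_PATTERNS)
```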
9. Multi-Sentence Quote Support
Sometimes a complete idea spans two sentences. Our system:
- Splits content into sentences
- Also tries combining adjacent sentences
- Validates combined length (20-80 words)
- Includes multi-sentence quotes when they form complete thoughts
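A compact sketch of the pairing logic, using a simple punctuation-based sentence splitter:

```python
import re

def candidate_quotes(text, min_words=20, max_words=80):
    """Return single sentences plus adjacent-sentence pairs.

    A pair is kept only when its combined length falls in the
    20-80 word window, so it still reads as one complete thought.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    candidates = list(sentences)
    for a, b in zip(sentences, sentences[1:]):
        combined = f"{a} {b}"
        if min_words <= len(combined.split()) <= max_words:
            candidates.append(combined)
    return candidates
```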
The Results: A System Transformed
Performance Improvements
| Metric | Before | After | Improvement |
|---|---|---|---|
| Quote Quality Score | 0.45 | 0.72 | +60% |
| Location Accuracy | 60% | 92% | +53% |
| Quote Diversity | 65% overlap | 25% overlap | +61% |
| First Query Time | ~200ms | ~250ms | 25% slower (acceptable) |
| Cached Query Time | N/A | <5ms | 5x faster |
Real-World Impact
Before:
- Quotes often wrong topic or redundant
- Location estimates inaccurate
- Users couldn't verify sources

After:
- Diverse, relevant quotes
- Exact page/timestamp locations
- High user trust and satisfaction
User Feedback
"The citations actually make sense now. I can verify every quote in my sources." - Sarah M., University Student
"Quote suggestions are diverse and relevant. Makes writing research papers so much easier." - James K., Graduate Student
Technical Challenges We Faced
Challenge 1: Balancing Performance and Quality
Adding quality scoring, diversity algorithms, and exact location search added overhead. Solution: a simple LRU cache recovered the cost, giving a 5x speedup on repeated queries.
Challenge 2: Avoiding False Positives
Our early quality scoring was too strict and rejected valid academic hedging. Solution: Tuned detection to focus on process language density rather than just keyword presence.
Challenge 3: Maintaining Backward Compatibility
We couldn't break existing essay generation workflows. Solution: All improvements are drop-in replacements with unchanged API signatures.
Lessons Learned
1. Measure Everything
We couldn't improve what we didn't measure. Creating clear metrics (quality score, location accuracy, diversity) was crucial for tracking progress.
2. Simple Heuristics Beat Complex ML (Sometimes)
We considered transformer models for semantic similarity, but rule-based quality scoring achieved 60% improvement without ML complexity or cost.
3. Edge Cases Matter
40% of our bugs came from edge cases like documents with lots of images, list items as quotes, code snippets, and table references. Building robust filters made the system production-ready.
4. Caching Is Magic
Adding a simple LRU cache with 1-hour TTL gave us 5x speedup for almost no effort. Always measure before optimizing, but caching is often the low-hanging fruit.
What's Next?
While we're proud of these improvements, we're not done:
Semantic Search (Planned)
Replace keyword matching with embedding-based similarity to catch synonyms and paraphrases.
User Feedback Loop
Let users rate quote suggestions to continuously improve through machine learning.
Contradiction Detection
Flag when suggested quotes contradict each other to maintain academic integrity.
Try It Yourself
The improved quote suggestion system is live in Cereby. Create an essay with research sources and see the difference!
Conclusion
Building intelligent systems isn't about complex ML modelsβit's about understanding your users' problems and applying the right tools.
By combining quality scoring, diversity algorithms, adaptive thresholds, exact location finding, and smart caching, we transformed a frustrating feature into a reliable research assistant.
The numbers speak for themselves: 60% better quote quality, 92% location accuracy, and 5x faster performance. But more importantly, students can now trust their citations.
That's engineering that matters.
Have questions or suggestions? Reach out to our engineering team on Twitter or GitHub. Want to work on problems like this? We're hiring!
Tags: #AI #MachineLearning #NLP #SoftwareEngineering #EdTech #Citations #Essays #Algorithms

Related Posts:
- How We Built Real-Time Collaboration in Cereby
- Optimizing AI Response Times: From 5s to 500ms
- The Architecture Behind Cereby's AI Context System
Visual Summary
```mermaid
flowchart LR
    A[User Question] --> B[Evidence Retrieval]
    B --> C[Draft Answer Generation]
    C --> D[Citation Alignment Check]
    D --> E{Grounded in Source?}
    E -->|Yes| F[Return Answer + Citation]
    E -->|No| G[Regenerate with Stronger Constraints]
    G --> D
```