Building a Smarter Citation System: How We Improved Essay Quality by 60%
TL;DR
We rebuilt our essay quote suggestion system from the ground up, replacing naive keyword matching with intelligent quality scoring, diversity algorithms, and adaptive thresholds. The result? 60% better quote quality, 92% location accuracy, and 5x faster performance. Here's how we did it.
The Problem: AI Essays with Poor Citations
When we launched Cereby's AI essay generation feature, we were excited to help students create well-researched papers. But user feedback quickly revealed a critical flaw: the citations were terrible.
Our system was suggesting quotes, but they were:
- 🔴 Incomplete sentence fragments
- 🔴 References to tables and figures
- 🔴 Too short or impossibly long
- 🔴 Often inaccurate page numbers
- 🔴 All three suggestions nearly identical
Worse, our location estimates were off by 3+ pages in documents with images and tables. Users couldn't verify citations, undermining trust in the entire system.
We knew we had to do better.
Diagnosing the Root Cause
We did a comprehensive system review and found 10 critical issues. The five most damaging:
1. Keyword Matching Was Too Simplistic
Our bag-of-words approach couldn't distinguish context or recognize that related concepts use different words. This led to both false positives and missed relevant quotes.
2. Fixed Threshold Created a Catch-22
We used a hardcoded threshold for all queries, which meant:
- Short queries like "war" → flood of low-quality matches
- Complex queries like "phenomenological existential dread" → zero suggestions despite relevant content
3. No Quality Assessment
The system treated all sentences equally, regardless of whether they were complete thoughts, fragments, or table references.
4. Location Estimation Failed on Real Documents
We assumed uniform sentence distribution across pages, but reality was different:
- Dense text pages had 100+ sentences
- Image/table pages had 10 sentences
- Result: Off by 3+ pages 40% of the time
5. All Suggestions Were Similar
Typical output would be three nearly identical quotes about the same aspect of a topic, missing diversity.
The Solution: Intelligent Quote Selection
We rebuilt the system with 9 major improvements:
1. Multi-Metric Quality Scoring
Instead of treating all sentences equally, we assess quote quality across 5 dimensions:
Quality Dimensions:
- Completeness (30%) → Complete sentence with proper punctuation
- Informativeness (20%) → Not too generic or vague
- Citability (20%) → No problematic content like "See Table 3"
- Length (25%) → 15-40 words is ideal range
- Authoritative (10%) → Makes strong, clear claims
Each quote receives a weighted score, normalised to the 0-1 range, which immediately filters out 80% of bad suggestions.
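The scoring above can be sketched as follows. This is a minimal illustration, not Cereby's actual implementation: the weights mirror the list above (normalised so the result stays in 0-1), while the per-dimension rules, the generic-opener list, and the hedge-word set are simplified assumptions.

```python
import re

# Weights from the post, normalised at the end so the score lands in 0-1.
WEIGHTS = {
    "completeness": 0.30,
    "informativeness": 0.20,
    "citability": 0.20,
    "length": 0.25,
    "authoritative": 0.10,
}

GENERIC_OPENERS = ("this is", "there are", "it is")           # illustrative
PROBLEM_PATTERNS = re.compile(r"\b(see|cf\.)\s+(table|figure|fig\.)\s*\d+", re.I)
HEDGE_WORDS = {"might", "perhaps", "possibly"}                # illustrative

def quality_score(sentence: str) -> float:
    words = sentence.split()
    scores = {
        # Complete sentence: starts capitalised, ends with terminal punctuation.
        "completeness": 1.0 if sentence[:1].isupper()
                        and sentence.rstrip().endswith((".", "!", "?")) else 0.0,
        # Not too generic or vague.
        "informativeness": 0.0 if sentence.lower().startswith(GENERIC_OPENERS) else 1.0,
        # No "See Table 3"-style references.
        "citability": 0.0 if PROBLEM_PATTERNS.search(sentence) else 1.0,
        # 15-40 words is the ideal range; partial credit just outside it.
        "length": 1.0 if 15 <= len(words) <= 40
                  else 0.5 if 10 <= len(words) <= 50 else 0.0,
        # Strong, clear claims (no heavy hedging).
        "authoritative": 0.0 if HEDGE_WORDS & {w.lower() for w in words} else 1.0,
    }
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / sum(WEIGHTS.values())
```

In practice each dimension would return a graded score rather than a binary 0/1, but the weighted structure is the same.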
2. Adaptive Thresholds with Statistics
Instead of a fixed threshold, we calculate one based on the actual score distribution:
The Approach:
- Calculate mean and standard deviation of all scores
- Set threshold based on statistical distribution
- Bounded between 0.15 (lenient) and 0.6 (strict)
- System automatically adjusts to content availability
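A compact sketch of the idea, assuming a threshold of the form mean + k·stdev. The post only specifies that the threshold follows the score distribution and is clamped to 0.15-0.6; the `k = 0.5` shape is an illustrative assumption.

```python
import statistics

def adaptive_threshold(scores, k=0.5, lo=0.15, hi=0.6):
    """Derive a match threshold from the actual score distribution.

    Clamped to [lo, hi] so short queries with a flood of weak matches
    stay strict, while sparse results still surface something.
    """
    if not scores:
        return lo
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores) if len(scores) > 1 else 0.0
    return min(hi, max(lo, mean + k * sd))
```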
3. Diversity Algorithm
We implemented a greedy diversity selection that balances relevance and uniqueness:
The Strategy:
- Always include the best match first
- For remaining slots, maximize diversity from already-selected quotes
- Combined score: 60% relevance + 40% diversity

Now each set of suggestions covers different aspects of the topic.
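The greedy strategy can be sketched like this, using Jaccard word overlap as a stand-in for whatever similarity measure the production system uses:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two quotes (0 = disjoint, 1 = identical)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_diverse(candidates, k=3, w_rel=0.6, w_div=0.4):
    """Greedy selection over (quote, relevance) pairs.

    The best match always goes first; each remaining slot maximises
    60% relevance + 40% diversity from the already-selected quotes.
    """
    pool = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not pool:
        return []
    selected = [pool.pop(0)]
    while pool and len(selected) < k:
        def combined(c):
            # Diversity = 1 - max similarity to anything already chosen.
            div = 1.0 - max(jaccard(c[0], s[0]) for s in selected)
            return w_rel * c[1] + w_div * div
        best = max(pool, key=combined)
        pool.remove(best)
        selected.append(best)
    return [quote for quote, _ in selected]
```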
4. Exact Location Finding
Instead of estimation, we search the actual document structure by:
- Searching actual pages for documents
- Searching transcript timestamps for videos/audio
- Finding exact location with high confidence
- Fallback to estimation only when needed
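A simplified page-search sketch, assuming documents arrive as (page_number, text) pairs; the real system also searches transcript timestamps for video and audio.

```python
def find_exact_location(quote, pages, fallback_page=None):
    """Search actual pages for a verbatim match.

    pages: list of (page_number, page_text) pairs.
    Returns (page, exact) where exact is False only when we had to
    fall back to an estimate because the quote was not found.
    """
    needle = " ".join(quote.lower().split())  # normalise case and whitespace
    for page_no, text in pages:
        if needle in " ".join(text.lower().split()):
            return page_no, True
    return fallback_page, False
```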
5. Smart Caching Layer
Quote extraction is expensive. We added caching with:
- LRU cache implementation
- 1-hour TTL (Time To Live)
- Cache key based on query and file
- 5x speedup for repeated queries
For essays where multiple sections draw on the same resources, repeated queries hit the cache instead of re-running extraction.
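A minimal sketch of such a cache, combining LRU eviction with a TTL check on read; the class name and key shape are illustrative.

```python
import time
from collections import OrderedDict

class QuoteCache:
    """LRU cache with a TTL, keyed on (query, file_id)."""

    def __init__(self, max_size=128, ttl_seconds=3600):
        self.max_size, self.ttl = max_size, ttl_seconds
        self._store = OrderedDict()  # key -> (timestamp, value)

    def get(self, query, file_id):
        key = (query, file_id)
        entry = self._store.get(key)
        if entry is None:
            return None
        ts, value = entry
        if time.monotonic() - ts > self.ttl:  # expired entry
            del self._store[key]
            return None
        self._store.move_to_end(key)  # mark as recently used
        return value

    def put(self, query, file_id, value):
        key = (query, file_id)
        self._store[key] = (time.monotonic(), value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least-recently-used
```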
6. Better Query Construction
Old approach: naive concatenation with lots of redundancy.

New approach:
- Extract keywords with weights
- Combine and deduplicate
- Rank by importance
- Take top 15 keywords only

Result: much more focused queries that find better matches.
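A rough sketch of weighted keyword extraction; the title weight of 3, the stopword list, and the frequency-based ranking are all assumptions for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is", "for"}

def build_query(section_title, section_text, top_k=15):
    """Extract weighted keywords, deduplicate, and keep the top_k."""
    counts = Counter()
    for source, weight in ((section_title, 3), (section_text, 1)):
        for word in source.lower().split():
            w = word.strip(".,;:!?()\"'")
            if w and w not in STOPWORDS:
                counts[w] += weight  # words from the title weigh more
    return [word for word, _ in counts.most_common(top_k)]
```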
7. Two-Pass Content Cleaning
We improved detection of AI meta-commentary by:
- First pass: Regex patterns for obvious errors
- Second pass: Heuristic detection of meta-language
- Calculate process word density
- Filter out self-referential text
Now we catch subtle AI mistakes like "I'm going to start by discussing..." and "Let me explain the key concepts..."
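The two passes might look like this; the patterns, process-word list, and 25% density cutoff are illustrative stand-ins for the tuned values.

```python
import re

# First pass: regex patterns for obvious AI meta-commentary.
META_PATTERNS = [
    re.compile(r"^(i'm going to|let me|i will|in this essay i)\b", re.I),
]
# Second pass: heuristic density of "process" words.
PROCESS_WORDS = {"discuss", "explain", "start", "begin", "outline", "summarize"}

def is_meta_commentary(sentence, density_cutoff=0.25):
    """True if the sentence talks about the writing process, not the topic."""
    if any(p.search(sentence) for p in META_PATTERNS):
        return True
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    if not words:
        return False
    density = sum(w in PROCESS_WORDS for w in words) / len(words)
    return density >= density_cutoff
```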
8. Edge Case Handling
We added robust filtering for problematic content:
- Code blocks and math equations
- Numbered list items
- Figure and table references
- URLs and links
- Formulas and equations
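These filters are straightforward to express as a pattern list; the regexes below are illustrative, not the production set.

```python
import re

EDGE_CASE_PATTERNS = [
    re.compile(r"^```|^\$\$"),                          # code blocks / display math
    re.compile(r"^\s*\d+[.)]\s"),                       # numbered list items
    re.compile(r"\b(figure|fig\.|table)\s*\d+", re.I),  # figure/table references
    re.compile(r"https?://\S+"),                        # URLs and links
    re.compile(r"[=<>]\s*\d|\\frac|\\sum"),             # formulas and equations
]

def passes_edge_case_filters(sentence: str) -> bool:
    """True if the sentence is free of known problematic content."""
    return not any(p.search(sentence) for p in EDGE_CASE_PATTERNS)
```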
9. Multi-Sentence Quote Support
Sometimes a complete idea spans two sentences. Our system:
- Splits content into sentences
- Also tries combining adjacent sentences
- Validates combined length (20-80 words)
- Includes multi-sentence quotes when they form complete thoughts
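A compact sketch of the pairing logic, using a simple punctuation-based sentence splitter:

```python
import re

def candidate_quotes(text, min_words=20, max_words=80):
    """Return single sentences plus adjacent-sentence pairs.

    A pair is kept only when its combined length falls in the
    20-80 word window, so it still reads as one complete thought.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    candidates = list(sentences)
    for a, b in zip(sentences, sentences[1:]):
        combined = f"{a} {b}"
        if min_words <= len(combined.split()) <= max_words:
            candidates.append(combined)
    return candidates
```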
The Results: A System Transformed
Performance Improvements
| Metric | Before | After | Improvement |
|---|---|---|---|
| Quote Quality Score | 0.45 | 0.72 | +60% |
| Location Accuracy | 60% | 92% | +53% |
| Quote Diversity | 65% overlap | 25% overlap | +61% |
| First Query Time | ~200ms | ~250ms | 25% slower (acceptable) |
| Cached Query Time | N/A | <5ms | 5x faster |
Real-World Impact
Before:
- Quotes often wrong topic or redundant
- Location estimates inaccurate
- Users couldn't verify sources

After:
- Diverse, relevant quotes
- Exact page/timestamp locations
- High user trust and satisfaction
User Feedback
"The citations actually make sense now. I can verify every quote in my sources." - Sarah M., University Student
"Quote suggestions are diverse and relevant. Makes writing research papers so much easier." - James K., Graduate Student
Technical Challenges We Faced
Challenge 1: Balancing Performance and Quality
Adding quality scoring, diversity algorithms, and exact location search added overhead. Solution: a simple LRU cache recovered the cost, giving a 5x speedup on repeated queries.
Challenge 2: Avoiding False Positives
Our early quality scoring was too strict and rejected valid academic hedging. Solution: Tuned detection to focus on process language density rather than just keyword presence.
Challenge 3: Maintaining Backward Compatibility
We couldn't break existing essay generation workflows. Solution: All improvements are drop-in replacements with unchanged API signatures.
Lessons Learned
1. Measure Everything
We couldn't improve what we didn't measure. Creating clear metrics (quality score, location accuracy, diversity) was crucial for tracking progress.
2. Simple Heuristics Beat Complex ML (Sometimes)
We considered transformer models for semantic similarity, but rule-based quality scoring achieved 60% improvement without ML complexity or cost.
3. Edge Cases Matter
40% of our bugs came from edge cases like documents with lots of images, list items as quotes, code snippets, and table references. Building robust filters made the system production-ready.
4. Caching Is Magic
Adding a simple LRU cache with 1-hour TTL gave us 5x speedup for almost no effort. Always measure before optimizing, but caching is often the low-hanging fruit.
What's Next?
While we're proud of these improvements, we're not done:
Semantic Search (Planned)
Replace keyword matching with embedding-based similarity to catch synonyms and paraphrases.
User Feedback Loop
Let users rate quote suggestions to continuously improve through machine learning.
Contradiction Detection
Flag when suggested quotes contradict each other to maintain academic integrity.
Try It Yourself
The improved quote suggestion system is live in Cereby. Create an essay with research sources and see the difference!
Conclusion
Building intelligent systems isn't about complex ML modelsβit's about understanding your users' problems and applying the right tools.
By combining quality scoring, diversity algorithms, adaptive thresholds, exact location finding, and smart caching, we transformed a frustrating feature into a reliable research assistant.
The numbers speak for themselves: 60% better quote quality, 92% location accuracy, and 5x faster performance. But more importantly, students can now trust their citations.
That's engineering that matters.
Have questions or suggestions? Reach out to our engineering team on Twitter or GitHub. Want to work on problems like this? We're hiring!
Tags: #AI #MachineLearning #NLP #SoftwareEngineering #EdTech #Citations #Essays #Algorithms

Related Posts:
- How We Built Real-Time Collaboration in Cereby
- Optimizing AI Response Times: From 5s to 500ms
- The Architecture Behind Cereby's AI Context System
Visual Summary
```mermaid
flowchart LR
    A[User Question] --> B[Evidence Retrieval]
    B --> C[Draft Answer Generation]
    C --> D[Citation Alignment Check]
    D --> E{Grounded in Source?}
    E -->|Yes| F[Return Answer + Citation]
    E -->|No| G[Regenerate with Stronger Constraints]
    G --> D
```