What is the difference between extractive and abstractive summarization?

**Extractive summarization** selects and concatenates the most important sentences directly from the source document. The output is composed entirely of original sentences. Think of it as highlighting a textbook -- you are choosing which parts to keep, not rewriting anything. Algorithms like TextRank, LexRank, and BERT-based sentence scoring fall in this category.

**Abstractive summarization** generates new sentences that paraphrase and compress the source content. The output may contain words and phrases not present in the original. This is like taking notes in your own words. Models like BART, T5, PEGASUS, and large language models (GPT-4, Claude) are abstractive.

The key tradeoff: extractive methods are **safer** (no hallucination risk since all words come from the source) but **less fluent** (stitched sentences may not flow well). Abstractive methods are **more fluent** but carry **hallucination risk** (the model may generate information not in the source).

In practice, many production systems use a hybrid approach: extractive pre-selection identifies the most important passages, then an abstractive model rewrites them into a coherent summary. This limits the generation space and reduces hallucination while maintaining fluency.

How do I handle documents that are too long for my summarization model?

Most transformer-based summarization models have a maximum input length -- 1024 tokens for BART, 512 tokens for T5-base, up to 16384 tokens for LED (Longformer Encoder-Decoder). For documents exceeding this limit, you have three main strategies:

**1. Map-Reduce**: Split the document into overlapping chunks, summarize each chunk independently ('map' step), then combine the chunk summaries into a final summary ('reduce' step). This parallelizes well but may lose cross-chunk context.

**2. Refine**: Start with the first chunk's summary, then iteratively refine it by incorporating each subsequent chunk. This preserves sequential context better than map-reduce but is inherently sequential and slower.

**3. Long-context models**: Use LED (16K tokens), BigBird (4K tokens), or an LLM with 128K+ context (GPT-4o, Claude). This avoids chunking entirely but is more expensive. Even with 128K context, very long documents (100K+ tokens) benefit from hierarchical summarization for cost reasons.

For most practical applications, **map-reduce with 900-token chunks and 100-token overlap** is the reliable default. Use LangChain's `MapReduceDocumentsChain` or implement your own -- the pattern is straightforward.
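The map-reduce and refine strategies above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `Summarize` is a hypothetical interface for whatever model you call, `first_words` is a toy stand-in summarizer, and tokens are approximated by whitespace splitting rather than the model's real tokenizer.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps text to a shorter text,
# e.g. a wrapper around a fine-tuned model or an LLM API call.
Summarize = Callable[[str], str]


def chunk_tokens(tokens: List[str], size: int = 900, overlap: int = 100) -> List[List[str]]:
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def map_reduce(chunks: List[str], summarize: Summarize) -> str:
    """Summarize every chunk independently (map), then summarize the summaries (reduce)."""
    partial = [summarize(chunk) for chunk in chunks]  # independent, so parallelizable
    return summarize("\n".join(partial))


def refine(chunks: List[str], summarize: Summarize) -> str:
    """Carry a running summary forward, folding in one chunk at a time."""
    summary = ""
    for chunk in chunks:
        summary = summarize(f"{summary}\n{chunk}".strip())
    return summary


def first_words(text: str) -> str:
    """Toy stand-in summarizer: keeps the first word of every line."""
    return " ".join(line.split()[0] for line in text.splitlines() if line.strip())
```

With a real model, you would build the chunk strings as `[" ".join(c) for c in chunk_tokens(document.split())]` and pass your own `summarize` function. The map step can run concurrently; the refine loop cannot, because each call depends on the previous summary.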

What are ROUGE scores and why are they not sufficient for evaluation?

**ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between a generated summary and a human-written reference summary. The three most reported variants are:

- **ROUGE-1**: Unigram overlap -- how many individual words match
- **ROUGE-2**: Bigram overlap -- how many two-word phrases match
- **ROUGE-L**: Longest common subsequence -- captures sentence-level structure

Each variant reports precision (what fraction of the generated summary overlaps with the reference), recall (what fraction of the reference is covered), and F1 (harmonic mean).

ROUGE is **necessary but not sufficient** for several reasons:

1. **It cannot detect hallucinations**: A summary that adds fabricated facts from the model's training data can still match the reference on common n-grams.
2. **It penalizes valid paraphrases**: If the reference says 'climate change' and the summary says 'global warming', ROUGE sees no match despite equivalent meaning.
3. **It does not measure coherence**: A summary with all the right words in random order would score well on ROUGE-1.

Supplement ROUGE with **BERTScore** (semantic similarity via contextual embeddings), **SummaC** or **AlignScore** (factual consistency via NLI), and periodic **human evaluation** for a complete quality picture.
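Two of these failure modes are easy to reproduce with a small ROUGE-1 calculation. The sketch below is a simplified unigram scorer (lowercased whitespace tokens, no stemming), not the official ROUGE toolkit, and the example sentences are invented for illustration.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram precision, recall, and F1 with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection clips repeated words
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the central bank raised interest rates to fight inflation"

# Same words in scrambled order: unreadable, yet a perfect ROUGE-1 score.
shuffled = "inflation rates fight to raised the bank interest central"

# A faithful paraphrase: only "the" and "to" overlap, so the score is low.
paraphrase = "the monetary authority increased borrowing costs to combat rising prices"

print(rouge1(shuffled, reference))
print(rouge1(paraphrase, reference))
```

The scrambled sentence gets an F1 of 1.0 while the readable paraphrase scores far lower, which is why ROUGE-1 alone says little about coherence or meaning.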

How much does it cost to run a summarization system in production?

Costs vary dramatically based on your approach, scale, and quality requirements. Here are concrete numbers for a system processing 10,000 documents per day (average 3,000 tokens each). The monthly API figures count input tokens only; output tokens add to them.

**Self-hosted fine-tuned model (BART-large on A10G GPU)**:
- GPU instance: ~INR 2,100/month ($25/month) on spot pricing
- Throughput: ~50 docs/sec, handles 10K docs in ~3 minutes
- One-time cost: 4-8 hours of fine-tuning (~INR 800/$10)
- Total: ~INR 3,000/month ($35/month)

**LLM API (GPT-4o-mini)**:
- Cost: $0.15 per 1M input tokens + $0.60 per 1M output tokens
- 10K docs x 3K tokens = 30M input tokens/day = $4.50/day
- Monthly: ~INR 11,300/month ($135/month)
- No infrastructure management

**LLM API (Claude 3.5 Haiku)**:
- Cost: $0.25 per 1M input tokens + $1.25 per 1M output tokens
- 30M input tokens/day = $7.50/day
- Monthly: ~INR 18,800/month ($225/month)

**LLM API (GPT-4o)**:
- Cost: $2.50 per 1M input tokens + $10 per 1M output tokens
- 30M input tokens/day = $75/day
- Monthly: ~INR 188,000/month ($2,250/month)
- Highest quality but roughly 17x more expensive than GPT-4o-mini

For Indian startups, the sweet spot is often a fine-tuned T5-base or BART-base on a budget GPU instance, supplemented by LLM API calls for complex documents that the fine-tuned model handles poorly.
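The API arithmetic is easy to rerun for your own volumes. The sketch below reproduces the GPT-4o-mini input-token figure from the answer above; the 150-token summary length used for the output side is an assumption for illustration, not a number from this article.

```python
def api_cost_per_day(
    docs_per_day: int,
    input_tokens_per_doc: int,
    input_price_per_million: float,
    output_tokens_per_doc: int = 0,
    output_price_per_million: float = 0.0,
) -> float:
    """Daily API cost in USD from per-million-token prices."""
    input_cost = docs_per_day * input_tokens_per_doc / 1_000_000 * input_price_per_million
    output_cost = docs_per_day * output_tokens_per_doc / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# GPT-4o-mini, input tokens only: 30M tokens/day at $0.15 per 1M.
daily_input_only = api_cost_per_day(10_000, 3_000, 0.15)
monthly_input_only = daily_input_only * 30

# Same workload including output tokens, ASSUMING ~150-token summaries.
daily_with_output = api_cost_per_day(
    10_000, 3_000, 0.15, output_tokens_per_doc=150, output_price_per_million=0.60
)

print(f"input only:  ${daily_input_only:.2f}/day, ${monthly_input_only:.2f}/month")
print(f"with output: ${daily_with_output:.2f}/day")
```

Because summaries are short relative to their sources, input tokens usually dominate the bill, but the output side grows quickly if you generate several candidate summaries per document.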

How do I detect and prevent hallucinations in abstractive summaries?

Hallucination -- where the summary contains information not present in the source -- is the biggest risk in abstractive summarization. Here is a multi-layered defense strategy:

**Layer 1: Model-level mitigation**
- Use beam search (`num_beams >= 4`) instead of greedy or sampling -- it produces more faithful outputs
- Set `length_penalty > 1.0` to avoid overly short, generic summaries that tend to hallucinate less but are less useful
- Consider extractive-then-abstractive pipelines: first select relevant sentences, then rewrite them. This bounds the generation space.

**Layer 2: Post-generation detection**
- **SummaC**: An NLI-based tool that checks whether each summary sentence is entailed by the source document. Scores below 0.5 indicate likely hallucination.
- **AlignScore**: Aligns summary claims with source spans and scores consistency.
- **LLM-as-judge**: Ask a separate LLM to verify each claim in the summary against the source. More expensive but more flexible.

**Layer 3: Production monitoring**
- Sample 5% of production summaries for automated faithfulness checking
- Route low-confidence summaries (faithfulness score < threshold) to human review
- Track hallucination rates over time and alert on increases

**Layer 4: Human-in-the-loop**
- For high-stakes domains (medical, legal, financial), always have human reviewers verify summaries before they reach end users
- Use human annotations to build a domain-specific evaluation dataset for continuous monitoring

No single technique eliminates hallucination entirely. The goal is to reduce it to an acceptable rate for your use case and catch the remaining cases before they reach users.
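The sampling and routing logic of Layer 3 might look like the sketch below. The scorer is pluggable: in practice it would wrap SummaC, AlignScore, or an LLM judge. The `word_containment` function here is only a crude lexical stand-in so the example runs without a model, and the record fields (`id`, `source`, `summary`) are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]  # (source, summary) -> faithfulness in [0, 1]


def in_sample(doc_id: str, rate: float = 0.05) -> bool:
    """Deterministically select about `rate` of IDs, so reruns pick the same documents."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < int(rate * 2**64)


def route_for_review(
    records: List[Dict[str, str]],
    score_faithfulness: Scorer,
    threshold: float = 0.5,
    rate: float = 0.05,
) -> List[Dict[str, object]]:
    """Score the sampled summaries and return those below the threshold."""
    review_queue = []
    for record in records:
        if not in_sample(record["id"], rate):
            continue
        score = score_faithfulness(record["source"], record["summary"])
        if score < threshold:
            review_queue.append({**record, "faithfulness": score})
    return review_queue


def word_containment(source: str, summary: str) -> float:
    """Toy scorer (NOT an NLI model): share of summary words that occur in the source."""
    source_words = set(source.lower().split())
    words = summary.lower().split()
    return sum(w in source_words for w in words) / max(len(words), 1)
```

Hash-based sampling keeps the selection stable across reprocessing, which makes hallucination rates comparable over time. The length of the returned queue divided by the number of sampled records gives the rate to track and alert on.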

Which model should I choose: BART, T5, or PEGASUS?

The choice depends on your data availability, domain, and constraints:

**BART** (`facebook/bart-large-cnn`): The best general-purpose choice. Pre-trained as a denoising autoencoder with a bidirectional encoder and autoregressive decoder. The CNN/DailyMail fine-tuned version works well out-of-the-box for news-style summarization. Choose BART when you have a moderate amount of fine-tuning data (5K+ examples) and need strong abstractive quality across domains.

**PEGASUS** (`google/pegasus-large`): Purpose-built for summarization via the Gap Sentences Generation pre-training objective. Excels in **low-resource settings** -- it achieves state-of-the-art results with only 1000 training examples on many benchmarks. Choose PEGASUS when you have limited domain-specific training data. It consistently outperforms BART and T5 on summarization benchmarks.

**T5** (`google-t5/t5-base` or `t5-large`): The most versatile option. Its text-to-text framework means you can use the same model for summarization, translation, classification, and more. Choose T5 when you want a single model for multiple NLP tasks, or when you are working with the Flan-T5 variants that have better instruction-following capabilities.

For **long documents**, consider **LED** (Longformer Encoder-Decoder) which supports up to 16K tokens, or use any of the above with a map-reduce strategy.

A practical comparison on CNN/DailyMail (ROUGE-L F1): PEGASUS ~44.2, BART-large ~42.1, T5-large ~42.5. The differences are relatively small -- your fine-tuning data and domain match will matter more than the base model choice.

How does summarization fit into a RAG pipeline?

In a RAG (Retrieval-Augmented Generation) pipeline, summarization serves as a **compression layer** between retrieval and generation. Here is why it matters:

A typical RAG pipeline retrieves 10-20 relevant passages from a vector store, each 200-500 tokens long. That is 2,000-10,000 tokens of context. If your LLM has a 4K token prompt budget (after accounting for the system prompt, user query, and output space), you cannot fit all retrieved passages.

**Without summarization**: You truncate to the top 3-5 passages, losing information from the rest. The LLM generates answers based on incomplete context.

**With summarization**: You summarize all 20 passages into a condensed form that fits within the token budget. The LLM gets a compressed but comprehensive view of all relevant information.

The summarizer can operate in two modes within RAG:

1. **Per-passage compression**: Each retrieved passage is individually summarized to ~50 tokens, and all compressed passages are included in the prompt.
2. **Collective summarization**: All passages are concatenated and summarized together into a single coherent context block.

Mode 1 preserves source-level granularity (useful for citation). Mode 2 produces more coherent context but loses individual source attribution.

The context assembler typically coordinates this process, invoking the summarizer as needed based on the total token count of retrieved passages versus the available prompt budget.
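A context assembler that applies per-passage compression (Mode 1) only when the budget is exceeded could be sketched as follows. The `compress` callable is a hypothetical interface for your summarizer, `truncate` is a toy stand-in for it, and token counts are approximated by whitespace splitting rather than the LLM's tokenizer.

```python
from typing import Callable, List

Compress = Callable[[str, int], str]  # (passage, max_tokens) -> shorter passage


def count_tokens(text: str) -> int:
    """Whitespace token count; an assumption standing in for the real tokenizer."""
    return len(text.split())


def assemble_context(passages: List[str], budget: int, compress: Compress) -> List[str]:
    """Return passages unchanged if they fit, otherwise compress the oversized ones."""
    if sum(count_tokens(p) for p in passages) <= budget:
        return list(passages)  # everything fits; skip the summarizer entirely
    per_passage = max(budget // len(passages), 1)  # equal share of the budget
    return [
        p if count_tokens(p) <= per_passage else compress(p, per_passage)
        for p in passages
    ]


def truncate(passage: str, max_tokens: int) -> str:
    """Toy stand-in for a summarizer: keeps the first `max_tokens` words."""
    return " ".join(passage.split()[:max_tokens])
```

The output keeps one entry per retrieved passage, so each compressed passage can still be cited back to its source. An equal share per passage is the simplest policy; weighting the share by retrieval score is a natural refinement.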

What is chain-of-density prompting and when should I use it?

**Chain of Density (CoD)** is a prompting technique introduced by researchers from Salesforce, MIT, and Columbia in 2023. Instead of asking an LLM to produce a summary in one shot, CoD asks for multiple iterations (typically 5), where each iteration must:

1. Identify 1-3 salient entities that are missing from the previous summary
2. Add those entities by compressing, fusing, or rephrasing existing content
3. Keep the total summary length constant

The result is a sequence of summaries that progress from sparse (entity-light, generic) to dense (entity-rich, specific). Human evaluators preferred summaries from iteration 3 -- a sweet spot between too sparse and too dense.

**Use CoD when:**
- You are using an LLM API (GPT-4, Claude, Gemini) for summarization and want higher quality than vanilla prompting
- You need entity-dense summaries that pack maximum information into a fixed length (e.g., news cards, executive briefs)
- You want more abstractive summaries with less lead bias

**Avoid CoD when:**
- Latency is critical -- CoD generates roughly 5x more text than single-shot summarization
- Cost is a primary concern -- 5 iterations means roughly 5x the output tokens
- You are using fine-tuned models rather than LLM APIs -- CoD is a prompting technique, not a training strategy

For production systems, you can often approximate the iteration-3 result in a single call (rather than generating all 5 iterations) by instructing the model to produce an 'entity-dense, highly informative' summary -- capturing the spirit of CoD at lower cost.
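A prompt that encodes the three steps above might be assembled like this. The wording is a paraphrase written for illustration, not the prompt published with the technique, and `build_cod_prompt` is a hypothetical helper name.

```python
def build_cod_prompt(article: str, iterations: int = 5, max_entities: int = 3) -> str:
    """Return one prompt that asks for `iterations` increasingly dense summaries."""
    return (
        f"Article:\n{article}\n\n"
        f"Write {iterations} summaries of the article, one after another, "
        "all of the same length.\n"
        "For each summary after the first:\n"
        f"1. Identify 1-{max_entities} salient entities from the article that are "
        "missing from the previous summary.\n"
        "2. Rewrite the previous summary to include them by compressing, fusing, "
        "or rephrasing existing content.\n"
        "3. Do not let the summary grow longer.\n"
        "Number the summaries so a specific iteration can be extracted."
    )


prompt = build_cod_prompt("UPI processed 12.02 billion transactions in October 2024.")
print(prompt)
```

Numbering the outputs lets downstream code pick the iteration it wants (for example the third) without a second API call.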


Summarizer in Machine Learning

A summarizer is an NLP component that condenses long text into a shorter version while preserving the essential meaning, key facts, and salient arguments. It sits at the heart of any system that needs to distill information -- whether that is compressing a 50-page legal contract into a one-page brief, turning a two-hour earnings call transcript into bullet points, or generating the "TL;DR" for a Slack thread.

Summarization comes in two fundamental flavors: extractive (selecting and stitching together existing sentences from the source) and abstractive (generating entirely new sentences that paraphrase the original content). Modern systems increasingly blur this line, using extractive methods to select candidate passages and then abstractive models to rewrite them into fluent, coherent summaries.

The rise of large language models has transformed summarization from a research benchmark into a production necessity. From Inshorts condensing Indian news articles into 60-word cards, to enterprise platforms summarizing customer support tickets for agents, to RAG pipelines compressing retrieved context before feeding it to a generator -- summarizers are everywhere. Understanding the tradeoffs between extraction and abstraction, the right model for your latency and cost budget, and how to evaluate output quality is essential for any ML engineer building information-dense applications.

Concept Snapshot

What It Is: An NLP component that condenses source text into a shorter representation while preserving key information, using either extractive (sentence selection) or abstractive (novel generation) techniques.
Category: Natural Language Processing
Complexity: Intermediate
Inputs / Outputs: Input: one or more source documents (raw text, tokenized sequences, or structured passages). Output: a shorter text summary (single sentence to multi-paragraph) with optional metadata like confidence scores and source attributions.
System Placement: Sits downstream of tokenizers and text preprocessors; upstream of context assemblers, prompt templates, or end-user interfaces in ML pipelines.
Also Known As: text summarizer, document summarizer, auto-summarization engine, TL;DR generator, condensation module, digest generator
Typical Users: ML Engineers, NLP Engineers, Data Scientists, Product Engineers, Content Platform Developers
Prerequisites: Tokenization and text preprocessing, Transformer architecture fundamentals, Sequence-to-sequence models, Attention mechanisms, Basic information retrieval concepts
Key Terms: extractive summarization, abstractive summarization, ROUGE score, BERTScore, seq2seq, encoder-decoder, TextRank, chain of density, map-reduce summarization, faithfulness, hallucination

Why This Concept Exists

The Information Overload Problem

Humans are drowning in text. A single enterprise might generate 10,000 customer support tickets per day, each averaging 500 words. That is 5 million words daily -- no human team can read all of it. Legal teams review contracts that run 100+ pages. Analysts digest earnings call transcripts exceeding 15,000 words. Researchers face thousands of papers published monthly on arXiv alone.

Manual summarization does not scale. A skilled human summarizer processes roughly 5,000 words per hour with high quality. At that rate, summarizing a day's worth of support tickets would require 1,000 person-hours -- an impossibility for any operations team.

From Rule-Based Heuristics to Neural Generation

Early summarization systems were purely extractive, using statistical heuristics: sentences at the beginning of documents, sentences containing high-frequency terms, or sentences connected to many others in a graph. Luhn's 1958 method -- arguably the first automatic summarizer -- simply scored sentences by the frequency of "significant" words.

The TextRank algorithm (2004) brought graph-based ranking to summarization, treating sentences as nodes and using PageRank-style voting to identify the most "central" sentences. This was a leap forward, but extractive methods inherently suffer from coherence problems -- you are stitching together sentences written in different contexts.

Abstractive summarization became practical with sequence-to-sequence models. The attention mechanism (Bahdanau et al., 2015) and the copy mechanism (See et al., 2017) allowed models to both generate novel words and copy important terms from the source. Then came the transformer era: BART (2019), T5 (2019), and PEGASUS (2019) set new state-of-the-art results on the major summarization benchmarks of the time.

The LLM Inflection Point

Today, large language models like GPT-4, Claude, and Gemini can produce remarkably fluent summaries with zero-shot or few-shot prompting -- no fine-tuning required. This has democratized summarization: any developer can call an API and get a reasonable summary. But it has also raised the bar. Production systems now need to worry about faithfulness (does the summary contain only information from the source?), controllability (can you specify length, style, and focus?), and cost (LLM API calls at scale get expensive fast).

Key Takeaway: Summarizers exist because information grows faster than human attention. The evolution from frequency heuristics to TextRank to transformers to LLMs reflects the field's relentless pursuit of summaries that are not just shorter, but genuinely faithful and useful.

Core Intuition & Mental Model

Two Ways to Summarize -- and Why It Matters

Imagine you are taking notes in a lecture. Extractive summarization is like highlighting sentences in the textbook -- you are selecting existing content. Abstractive summarization is like writing the notes in your own words -- you are generating new text that captures the meaning.

Extractive methods are safer: they cannot hallucinate information that is not in the source, because every word in the output came from the input. But they are often clunky -- sentences ripped from different paragraphs may not flow well together, and they cannot combine ideas from multiple passages into a single concise statement.

Abstractive methods are more powerful: they can rephrase, compress, and synthesize. A 200-word paragraph can become a single elegant sentence. But they carry the risk of hallucination -- generating plausible-sounding claims that are not supported by the source document. This is the central tension in summarization: fluency vs. faithfulness.

The Mental Model: Compression With a Fidelity Guarantee

Think of summarization as lossy compression for natural language. Just as JPEG reduces image file size by discarding imperceptible visual details, a summarizer reduces text length by discarding less important information. The compression ratio (input length / output length) is typically 5:1 to 20:1 for single-document summarization.

But unlike JPEG, where the quality metric is well-defined (PSNR, SSIM), summarization quality is fuzzy. A summary can be fluent but unfaithful, or faithful but unreadable. The best summarizers optimize along multiple axes simultaneously: informativeness (did it capture the key points?), faithfulness (did it avoid making things up?), coherence (does it read well?), and conciseness (is it appropriately short?).

Expert Insight: When someone says "our summarizer works great," always ask: great by what metric? ROUGE scores can be high even when summaries hallucinate. Faithfulness evaluation is the harder and more important problem in production.

Technical Foundations

Mathematical Formulation

Let $D = (s_1, s_2, \ldots, s_n)$ be a source document consisting of $n$ sentences. A summarizer produces a summary $S$ subject to a length constraint $|S| \leq L$.

Extractive Summarization selects a subset of indices $I \subseteq \{1, 2, \ldots, n\}$ and constructs $S = (s_{i_1}, s_{i_2}, \ldots, s_{i_k})$ where $i_1 < i_2 < \ldots < i_k$ preserves document order. The optimization objective is:

$I^* = \arg\max_{I \subseteq [n], |I| \leq k} \sum_{i \in I} \text{score}(s_i, D) - \lambda \sum_{i, j \in I} \text{redundancy}(s_i, s_j)$

where $\text{score}(s_i, D)$ measures sentence importance and $\lambda$ controls the redundancy penalty (Maximal Marginal Relevance).
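Solving this objective exactly is combinatorial, so in practice it is usually approximated greedily. The sketch below picks, at each step, the sentence with the largest marginal gain under the objective as written (importance minus $\lambda$ times its summed redundancy with already-selected sentences). The scores and redundancy matrix are made-up inputs; a real system would compute them from TextRank scores and embedding similarities.

```python
from typing import List, Sequence


def greedy_select(
    scores: Sequence[float],
    redundancy: Sequence[Sequence[float]],
    k: int,
    lam: float = 0.5,
) -> List[int]:
    """Greedily choose up to k sentence indices, returned in document order."""
    selected: List[int] = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < k:
        def gain(i: int) -> float:
            # Marginal change in the objective if sentence i is added now.
            return scores[i] - lam * sum(redundancy[i][j] for j in selected)

        best = max(sorted(remaining), key=gain)  # ties go to the earliest sentence
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


# Sentences 0 and 1 are near-duplicates; sentence 2 is less important but novel.
scores = [0.9, 0.85, 0.4]
redundancy = [
    [0.0, 0.95, 0.1],
    [0.95, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]

print(greedy_select(scores, redundancy, k=2, lam=1.0))  # redundancy penalized
print(greedy_select(scores, redundancy, k=2, lam=0.0))  # importance only
```

With the penalty switched on, the second near-duplicate is skipped in favor of the novel sentence. Note that the classic Maximal Marginal Relevance criterion penalizes the maximum similarity to any selected sentence rather than the sum used here.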

Abstractive Summarization generates $S = (w_1, w_2, \ldots, w_m)$ token by token using a conditional language model:

$P(S | D) = \prod_{t=1}^{m} P(w_t | w_{<t}, D; \theta)$

where $\theta$ are the model parameters, typically a transformer encoder-decoder. The model is trained to maximize log-likelihood over reference summaries:

$\mathcal{L}(\theta) = \sum_{(D_i, S_i^*) \in \mathcal{T}} \log P(S_i^* | D_i; \theta)$

Evaluation: ROUGE Metrics

The standard evaluation framework is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). For a candidate summary $C$ and reference summary $R$:

$\text{ROUGE-N} = \frac{\sum_{\text{gram}_n \in R} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{\text{gram}_n \in R} \text{Count}(\text{gram}_n)}$

where $\text{gram}_n$ denotes n-grams of length $n$. The most commonly reported variants are:

ROUGE-1: Unigram overlap -- captures content coverage
ROUGE-2: Bigram overlap -- captures fluency and phrase-level matching
ROUGE-L: Longest common subsequence -- captures sentence-level structure

State-of-the-art models on CNN/DailyMail achieve approximately ROUGE-1: 44-47, ROUGE-2: 21-23, ROUGE-L: 40-43.
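The ROUGE-N formula above translates almost directly into code: count the reference n-grams, clip each match by how often the n-gram occurs in the candidate, and divide by the reference total. The sketch below uses whitespace tokens without stemming, so its numbers will not match the official toolkit exactly; it also includes the longest-common-subsequence length that ROUGE-L is built on.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    """Share of reference n-grams that also occur in the candidate (clipped counts)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 unigrams matched
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 bigrams matched
print(lcs_length(candidate.split(), reference.split()))  # "the cat on the mat"
```

One changed word costs a single unigram but two bigrams, which is why ROUGE-2 is consistently lower than ROUGE-1 in reported results.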

Beyond ROUGE: Semantic Evaluation

BERTScore computes token-level cosine similarity between contextual embeddings of the candidate and reference:

$\text{BERTScore}_{\text{F1}} = 2 \cdot \frac{P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$

where $P_{\text{BERT}}$ and $R_{\text{BERT}}$ are precision and recall computed by greedily matching each token in the candidate to the most similar token in the reference (and vice versa) using BERT embeddings. BERTScore correlates more strongly with human judgments than ROUGE, particularly for abstractive summaries where paraphrasing is common.
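The greedy matching step can be illustrated without loading a model. In the sketch below, each token is represented by a hand-made 2-D vector, which is purely an assumption for illustration; real BERTScore uses contextual embeddings from a BERT-family model and optionally weights tokens by IDF.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(cand: List[Vector], ref: List[Vector]) -> Dict[str, float]:
    """Precision, recall, and F1 from greedy best-match cosine similarities."""
    if not cand or not ref:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy "embeddings": two orthogonal token meanings.
reference_vecs = [(1.0, 0.0), (0.0, 1.0)]
candidate_vecs = [(1.0, 0.0)]  # covers only the first reference token

print(greedy_match_score(candidate_vecs, reference_vecs))
```

The candidate is fully supported by the reference (precision 1.0) but covers only half of it (recall 0.5). Because similarity is computed in embedding space, a paraphrase with nearby vectors would still score well, unlike exact n-gram matching.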

Internal Architecture

A production summarization system is more than a single model -- it is a pipeline that handles document preprocessing, chunking for long inputs, model inference, post-processing, and quality validation. Here is the typical architecture:

[Figure: Summarizer in ML Systems Architecture]

The architecture branches based on document length. Short documents (under the model's context window) go through a single forward pass. Long documents are split into chunks, each chunk is summarized independently (the "map" step), and the chunk summaries are combined into a final summary (the "reduce" step). This map-reduce pattern is the standard approach for handling documents that exceed model context limits -- whether you are using a fine-tuned BART model with a 1024-token window or an LLM with a 128K-token window that you want to use cost-efficiently.

Key Components

Preprocessor / Tokenizer

Cleans input text (removes boilerplate, normalizes whitespace, handles encoding), segments into sentences, and tokenizes. For extractive methods, sentence boundaries are critical. For abstractive methods, subword tokenization (BPE, SentencePiece) prepares input for the encoder.

Document Chunker

Splits long documents into overlapping segments that fit within the model's maximum input length. Chunk boundaries should respect sentence or paragraph boundaries to avoid splitting semantic units. Typical chunk sizes are 512-1024 tokens with 10-20% overlap.
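A chunker that respects sentence boundaries might be sketched as below. It is a simplification under stated assumptions: sentences are split with a punctuation regex, tokens are counted by whitespace rather than the model's tokenizer, and overlap is expressed in whole sentences instead of a percentage.

```python
import re
from typing import List


def n_tokens(sentences: List[str]) -> int:
    """Whitespace token count across a list of sentences (tokenizer stand-in)."""
    return sum(len(s.split()) for s in sentences)


def chunk_by_sentences(text: str, max_tokens: int = 512, overlap_sentences: int = 1) -> List[str]:
    """Pack whole sentences into chunks of at most max_tokens, repeating the
    last `overlap_sentences` sentences at the start of the next chunk."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        if current and n_tokens(current + [sentence]) > max_tokens:
            chunks.append(" ".join(current))
            carry = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would push the next chunk over the limit.
            current = carry if n_tokens(carry + [sentence]) <= max_tokens else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_by_sentences(text, max_tokens=6, overlap_sentences=1):
    print(chunk)
```

A single sentence longer than `max_tokens` is kept whole rather than split mid-sentence, so a production version would need a fallback (for example, a hard token split) for such cases.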

Extractive Selector

Scores and ranks sentences by importance using graph-based methods (TextRank), embedding similarity (BERT-based), or learned classifiers. Selects top-k sentences while minimizing redundancy. Used standalone or as a pre-filter for abstractive models.

Abstractive Generator

A sequence-to-sequence transformer (BART, T5, PEGASUS) or an LLM (GPT-4, Claude, Gemini) that generates novel summary text conditioned on the source. Controls output length via max_length, min_length, and length_penalty parameters.

Map-Reduce Orchestrator

Manages the two-phase summarization of long documents: dispatches chunks to the summarizer in parallel (map phase), collects intermediate summaries, and feeds them back for final consolidation (reduce phase). May use iterative refinement instead of a single reduce step.

Post-Processor

Cleans generated output: removes repeated phrases, fixes grammatical artifacts from beam search, ensures proper sentence boundaries, and formats the summary according to application requirements (bullet points, paragraphs, specific length).
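Two of these clean-up steps, removing repeated sentences and dropping a trailing fragment, can be sketched with the standard library. The rules here are illustrative assumptions; real post-processors are usually tuned to the artifacts a specific model produces.

```python
import re


def postprocess(summary: str) -> str:
    """Normalize whitespace, drop repeated sentences, and trim a cut-off ending."""
    text = re.sub(r"\s+", " ", summary).strip()
    sentences = re.split(r"(?<=[.!?])\s+", text)

    seen = set()
    kept = []
    for sentence in sentences:
        # Compare sentences case-insensitively and ignoring punctuation.
        key = re.sub(r"\W+", " ", sentence).strip().lower()
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(sentence)

    # A final piece without terminal punctuation was probably cut off by the
    # generation length limit; drop it unless it is the only sentence.
    if len(kept) > 1 and not kept[-1].endswith((".", "!", "?")):
        kept.pop()
    return " ".join(kept)


raw = "The bank raised rates. The bank raised rates.  Inflation fell to 4%. Analysts expect"
print(postprocess(raw))
```

Exact-match deduplication only catches verbatim repeats; near-duplicate sentences would need a similarity threshold instead.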

Quality Validator

Checks the summary for factual consistency against the source document using NLI-based detectors (SummaC, AlignScore) or LLM-as-judge. Flags or rejects summaries that contain hallucinated information. This is the most underinvested component in most production systems.

Data Flow

Single-Document Flow: Source text enters the preprocessor, gets tokenized and (if needed) chunked. Each chunk passes through the summarizer model. If multiple chunks exist, their summaries are combined in the reduce step. The post-processor cleans the output, and the quality validator checks factual consistency before the summary is returned.

Multi-Document Flow: Multiple source documents are first deduplicated and clustered by subtopic. Each cluster is summarized independently, then cluster summaries are merged into a coherent multi-document summary. This requires additional logic for handling contradictions between sources and ensuring balanced coverage.

RAG Integration Flow: In a RAG pipeline, the summarizer sits between the retriever and the prompt template. Retrieved passages are summarized to fit within the LLM's context window, maximizing the amount of relevant information that can be passed to the generator while staying within token budgets.

A directed flow from Source Document through Preprocessor to a decision point based on document length. Short documents go directly to the Single-Pass Summarizer. Long documents are chunked, each chunk is summarized (map phase), and results are combined (reduce phase) before going to the summarizer. Output passes through Post-Processor and Quality Validator to produce the Final Summary.

How to Implement

Choosing Your Approach

Implementation strategy depends on three factors: latency requirements, cost budget, and quality bar.

Option 1: Fine-tuned transformer models (BART, T5, PEGASUS). Best when you have domain-specific training data, need low latency (<500ms), and want predictable per-query costs. A facebook/bart-large-cnn model on a single GPU processes ~50 documents/second. Cost: ~INR 25,000/month ($300/month) for a GPU instance.

Option 2: LLM API-based summarization (GPT-4o, Claude, Gemini). Best for rapid prototyping, zero-shot quality on diverse domains, and when you do not have training data. Cost: ~$0.50-2.00 per 1M input tokens (~INR 42-168). At 1,000 documents/day of 2,000 tokens each (about 60M input tokens/month), monthly cost is roughly $30-120 (~INR 2,500-10,000).

Option 3: Hybrid extractive-then-abstractive. Use an extractive pre-filter (TextRank, BERT-based scoring) to select the most important passages, then feed those to an abstractive model. This reduces input length by 3-5x, cutting LLM API costs proportionally while maintaining quality.

For Indian startups operating under tight compute budgets, Option 1 with a fine-tuned T5-small or T5-base offers the best cost-quality tradeoff. For enterprise applications where quality is paramount, Option 2 with GPT-4o-mini or Claude 3.5 Haiku provides excellent results at reasonable cost.

Cost Note: Summarizing 10,000 documents/day (avg 3000 tokens each) with GPT-4o-mini costs approximately $4.50/day (~INR 375/day) or ~$135/month (~INR 11,300/month). With a fine-tuned BART-large on a single A10G GPU, the same workload costs ~$25/month (~INR 2,100/month) in compute but requires upfront fine-tuning effort.
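The API figure above can be reproduced with a short calculation. This is a sketch that counts input tokens only; the rate of $0.15 per 1M input tokens is an assumption (it is the rate implied by the $4.50/day figure), so check current pricing before relying on it. `monthly_api_cost` is an illustrative helper name.

```python
def monthly_api_cost(
    docs_per_day: int,
    tokens_per_doc: int,
    usd_per_million_input_tokens: float,
    days: int = 30,
) -> float:
    """Estimate input-token cost in USD; output tokens and retries are ignored."""
    total_tokens = docs_per_day * tokens_per_doc * days
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# Assumed rate: $0.15 per 1M input tokens (implied by the figures above).
daily = monthly_api_cost(10_000, 3_000, 0.15, days=1)
monthly = monthly_api_cost(10_000, 3_000, 0.15)
print(f"daily: ${daily:.2f}, monthly: ${monthly:.2f}")
```

Output tokens are usually priced higher than input tokens, so a summary-heavy workload will cost somewhat more than this estimate.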

Extractive Summarization with TextRank (No Training Data Required)

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import networkx as nx
import re

def textrank_summarize(text: str, num_sentences: int = 3) -> str:
    """Extractive summarization using TextRank with sentence embeddings."""
    # Split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    if len(sentences) <= num_sentences:
        return text

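    # Encode sentences using a lightweight model
    # (in production, load the model once outside this function)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(sentences)

    # Build similarity matrix
    sim_matrix = cosine_similarity(embeddings)
    sim_matrix = np.clip(sim_matrix, 0.0, None)  # PageRank expects non-negative edge weights
    np.fill_diagonal(sim_matrix, 0)  # No self-loops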

    # Apply PageRank
    graph = nx.from_numpy_array(sim_matrix)
    scores = nx.pagerank(graph, alpha=0.85, max_iter=100)

    # Select top sentences (preserve original order)
    ranked_indices = sorted(scores, key=scores.get, reverse=True)
    selected = sorted(ranked_indices[:num_sentences])

    return ' '.join(sentences[i] for i in selected)

# Usage
text = """India's digital payments ecosystem has seen remarkable growth.
UPI processed 12.02 billion transactions in October 2024 alone.
PhonePe and Google Pay dominate with a combined 85% market share.
The Reserve Bank of India continues to expand the infrastructure.
Rural adoption remains a challenge despite government incentives.
New features like UPI Lite and credit-on-UPI are driving further adoption."""

print(textrank_summarize(text, num_sentences=2))

This example demonstrates extractive summarization using TextRank with modern sentence embeddings instead of traditional TF-IDF. The all-MiniLM-L6-v2 model from sentence-transformers produces 384-dimensional embeddings that capture semantic similarity far better than bag-of-words approaches. The PageRank algorithm then identifies the most "central" sentences -- those most similar to all other sentences in the document. This approach requires no training data and works across domains.
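To make the "central sentence" idea concrete, here is a dependency-free sketch of PageRank by power iteration on a toy 3-sentence similarity matrix. The matrix values are made up for illustration, and `pagerank` here is a hand-written helper, not the NetworkX function used above.

```python
from typing import List


def pagerank(sim: List[List[float]], damping: float = 0.85, iters: int = 100) -> List[float]:
    """Power-iteration PageRank over a symmetric, non-negative similarity matrix."""
    n = len(sim)
    col_sums = [sum(sim[i][j] for i in range(n)) for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            # Sentence j passes its score to i in proportion to sim[i][j].
            inflow = sum(
                sim[i][j] / col_sums[j] * scores[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * inflow)
        scores = new_scores
    return scores


# Toy example: sentence 2 is strongly similar to both others,
# while sentences 0 and 1 are only weakly similar to each other.
sim = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
    [0.9, 0.9, 0.0],
]
scores = pagerank(sim)
most_central = max(range(len(scores)), key=scores.__getitem__)
print(most_central, [round(s, 3) for s in scores])
```

Sentence 2 ends up with the highest score because it is the one most similar to everything else, which is the selection behaviour TextRank relies on.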

Abstractive Summarization with BART (Hugging Face Transformers)

from transformers import pipeline, BartForConditionalGeneration, BartTokenizer
import torch

# Option 1: Quick setup with pipeline API
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0 if torch.cuda.is_available() else -1
)

text = """The Indian Space Research Organisation (ISRO) successfully launched 
the Chandrayaan-3 mission, which achieved a historic soft landing on the 
lunar south pole on August 23, 2023. India became the fourth country to 
land on the Moon and the first to land near the south pole. The Pragyan 
rover deployed from the Vikram lander conducted multiple experiments 
including thermal measurements and seismic activity detection. The mission 
cost approximately Rs 615 crore (about $75 million), making it one of the 
most cost-effective lunar missions in history. The success was celebrated 
across India and positioned ISRO as a leader in cost-effective space 
exploration. The data collected continues to provide insights about the 
lunar surface composition and potential water ice deposits."""

result = summarizer(
    text,
    max_length=80,
    min_length=30,
    do_sample=False,
    num_beams=4,
    length_penalty=2.0,
    no_repeat_ngram_size=3
)
print(result[0]['summary_text'])

# Option 2: Fine-grained control with model + tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=1024,
    truncation=True
)

summary_ids = model.generate(
    inputs["input_ids"],
    max_length=80,
    min_length=30,
    num_beams=4,
    length_penalty=2.0,
    no_repeat_ngram_size=3,
    early_stopping=True
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

Two approaches to using BART for abstractive summarization. The pipeline API is the quickest way to get started -- two lines of code for a working summarizer. The model + tokenizer approach gives you full control over generation parameters. Key parameters: num_beams=4 uses beam search for better quality, length_penalty=2.0 encourages longer outputs (reduce for shorter summaries), and no_repeat_ngram_size=3 prevents the model from repeating trigrams. The facebook/bart-large-cnn checkpoint is fine-tuned on CNN/DailyMail and works well for news-style summarization out of the box.
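Why a larger length_penalty favours longer outputs is easier to see with numbers. The sketch below follows the rule described in the Hugging Face generation documentation: a finished beam is ranked by its summed log-probability divided by its length raised to length_penalty. The two hypotheses and their log-probabilities are made up for illustration, and `beam_score` is an illustrative helper, not a library function.

```python
def beam_score(sum_logprob: float, length: int, length_penalty: float) -> float:
    """Score used to rank a finished beam: sum of log-probs / length ** penalty."""
    return sum_logprob / (length ** length_penalty)


# Hypothetical hypotheses: the short one has the higher raw log-probability.
short_hyp = {"sum_logprob": -4.0, "length": 5}
long_hyp = {"sum_logprob": -6.0, "length": 10}

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(**short_hyp, length_penalty=penalty)
    l = beam_score(**long_hyp, length_penalty=penalty)
    winner = "long" if l > s else "short"
    print(f"length_penalty={penalty}: short={s:.3f} long={l:.3f} -> {winner} wins")
```

Because log-probabilities are negative, dividing by a larger power of the length pulls long hypotheses closer to zero, so they outrank shorter ones as the penalty grows.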

LLM-Based Summarization with Chain-of-Density Prompting

from openai import OpenAI
import json

client = OpenAI()  # Uses OPENAI_API_KEY env variable

def chain_of_density_summarize(
    article: str,
    iterations: int = 5,
    target_words: int = 80,
    model: str = "gpt-4o-mini"
) -> dict:
    """Chain of Density summarization: iteratively increases entity density."""

    prompt = f"""Article: {article}

You will generate {iterations} increasingly concise, entity-dense summaries
of the above article. Each summary should be approximately {target_words} words.

Guidelines:
- Summary 1: Write a sparse summary covering only 1-2 main entities.
- Each subsequent summary: Identify 1-3 informative entities from the article
  that are MISSING from the previous summary. Add them by:
  (a) Replacing less specific phrases with more specific ones
  (b) Fusing multiple sentences to make room
  (c) Compressing existing information
- NEVER increase the summary length. Every word must earn its place.
- NEVER drop entities mentioned in earlier summaries.
- An entity is a real-world object: named events, people, organizations,
  specific numbers, dates, technical terms.

Return a JSON object with a single key "summaries" whose value is an array
of {iterations} objects, each with keys:
  "summary": the summary text,
  "missing_entities": list of entities added in this iteration,
  "density_score": number of entities per 100 words
"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        response_format={"type": "json_object"}
    )

    result = json.loads(response.choices[0].message.content)
    return result

# Usage
article = """India's Unified Payments Interface (UPI) has revolutionized
digital payments in the country. In 2024, UPI processed over 130 billion
transactions worth Rs 200 lakh crore ($2.4 trillion). PhonePe leads with
48% market share, followed by Google Pay at 37%. The National Payments
Corporation of India (NPCI) has expanded UPI internationally, with
Singapore, UAE, France, and Sri Lanka now accepting UPI payments.
The RBI introduced UPI Lite for small transactions under Rs 500 without
PIN verification, processing 300 million transactions monthly.
Credit-on-UPI, launched in partnership with RuPay, has seen 15 million
users adopt the feature within six months."""

result = chain_of_density_summarize(article)
for i, step in enumerate(result.get("summaries", [result])):
    print(f"\n--- Iteration {i+1} ---")
    print(step.get("summary", step))

Chain of Density (CoD) prompting, introduced by Adams et al. (2023), produces summaries that are more entity-dense and abstractive than vanilla prompting. The key insight is that iterative refinement forces the model to prioritize information more carefully with each pass. Human evaluators on CNN/DailyMail preferred CoD summaries at step 3 (out of 5) -- a sweet spot between sparsity and information overload. This technique works with any capable LLM and requires no fine-tuning. Cost per summarization with GPT-4o-mini: approximately $0.001 (~INR 0.08) per article.
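The density_score in the prompt is self-reported by the model, so it can be worth recomputing. A minimal sketch, assuming you already have a list of candidate entities (for example from an NER step or from the returned missing_entities fields); `entity_density` is an illustrative helper and uses simple case-insensitive substring matching rather than real entity linking.

```python
from typing import Iterable


def entity_density(summary: str, entities: Iterable[str]) -> float:
    """Entities found in the summary per 100 words (case-insensitive substring match)."""
    words = summary.split()
    if not words:
        return 0.0
    lowered = summary.lower()
    found = [e for e in entities if e.lower() in lowered]
    return 100.0 * len(found) / len(words)


summary = "UPI processed 130 billion transactions in 2024 and PhonePe leads the market"
entities = ["UPI", "PhonePe", "Google Pay", "2024"]
print(entity_density(summary, entities))
```

Tracking this value across the iterations shows whether density is actually rising while length stays flat, which is the property Chain of Density is meant to enforce.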

Map-Reduce Summarization for Long Documents

from transformers import pipeline
import torch
from typing import List
import re

class MapReduceSummarizer:
    """Summarize documents longer than the model's context window."""

    def __init__(
        self,
        model_name: str = "facebook/bart-large-cnn",
        chunk_size: int = 900,
        chunk_overlap: int = 100,
        map_max_length: int = 150,
        reduce_max_length: int = 200
    ):
        self.summarizer = pipeline(
            "summarization",
            model=model_name,
            device=0 if torch.cuda.is_available() else -1
        )
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.map_max_length = map_max_length
        self.reduce_max_length = reduce_max_length

    def _split_into_chunks(self, text: str) -> List[str]:
        """Split text into overlapping chunks at sentence boundaries.

        Lengths are counted in model tokens (not words) so that the default
        chunk_size of 900 stays under BART's 1024-token input limit.
        A single sentence longer than chunk_size still becomes its own chunk.
        """
        tokenizer = self.summarizer.tokenizer

        def n_tokens(s: str) -> int:
            return len(tokenizer.encode(s, add_special_tokens=False))

        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        chunks = []
        current_chunk = []
        current_length = 0

        for sentence in sentences:
            tokens = n_tokens(sentence)
            if current_length + tokens > self.chunk_size and current_chunk:
                chunks.append(' '.join(current_chunk))
                # Keep last few sentences for overlap
                overlap_sentences = []
                overlap_len = 0
                for s in reversed(current_chunk):
                    if overlap_len + n_tokens(s) > self.chunk_overlap:
                        break
                    overlap_sentences.insert(0, s)
                    overlap_len += n_tokens(s)
                current_chunk = overlap_sentences
                current_length = overlap_len
            current_chunk.append(sentence)
            current_length += tokens

        if current_chunk:
            chunks.append(' '.join(current_chunk))

        return chunks

    def summarize(self, text: str) -> str:
        """Map-reduce summarization pipeline."""
        chunks = self._split_into_chunks(text)

        if len(chunks) == 1:
            result = self.summarizer(
                chunks[0],
                max_length=self.reduce_max_length,
                min_length=50,
                do_sample=False
            )
            return result[0]['summary_text']

        # Map phase: summarize each chunk
        print(f"Map phase: summarizing {len(chunks)} chunks...")
        chunk_summaries = []
        for i, chunk in enumerate(chunks):
            result = self.summarizer(
                chunk,
                max_length=self.map_max_length,
                min_length=30,
                do_sample=False
            )
            chunk_summaries.append(result[0]['summary_text'])

        # Reduce phase: combine chunk summaries
        combined = ' '.join(chunk_summaries)
        print(f"Reduce phase: combining {len(chunk_summaries)} summaries...")

        # If combined summaries are still too long (in model tokens), recurse
        combined_tokens = len(self.summarizer.tokenizer.encode(combined))
        if combined_tokens > self.chunk_size:
            return self.summarize(combined)

        result = self.summarizer(
            combined,
            max_length=self.reduce_max_length,
            min_length=50,
            do_sample=False
        )
        return result[0]['summary_text']

# Usage
summarizer = MapReduceSummarizer()
long_document = "..."  # Your long document here
print(summarizer.summarize(long_document))

This map-reduce implementation handles documents of arbitrary length by recursively chunking and summarizing. The overlap between chunks (100 tokens by default) reduces the chance that context at chunk boundaries is lost, although it cannot recover dependencies that span several chunks. The recursion in the reduce phase handles cases where even the combined chunk summaries exceed the model's context window -- rare but possible for very long documents (50,000+ words). This is the same pattern used by LangChain's MapReduceDocumentsChain but implemented from scratch for transparency and customization.

Evaluation: Computing ROUGE and BERTScore

from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_summary(
    candidate: str,
    reference: str,
    verbose: bool = True
) -> dict:
    """Evaluate a summary using ROUGE and BERTScore metrics."""

    # ROUGE scores
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],
        use_stemmer=True
    )
    rouge_results = scorer.score(reference, candidate)

    # BERTScore
    P, R, F1 = bert_score(
        [candidate],
        [reference],
        model_type="microsoft/deberta-xlarge-mnli",
        lang="en",
        verbose=False
    )

    results = {
        "rouge1_f1": rouge_results['rouge1'].fmeasure,
        "rouge2_f1": rouge_results['rouge2'].fmeasure,
        "rougeL_f1": rouge_results['rougeL'].fmeasure,
        "bertscore_precision": P.item(),
        "bertscore_recall": R.item(),
        "bertscore_f1": F1.item()
    }

    if verbose:
        print("=== ROUGE Scores ===")
        for key in ['rouge1', 'rouge2', 'rougeL']:
            r = rouge_results[key]
            print(f"  {key}: P={r.precision:.4f} R={r.recall:.4f} F1={r.fmeasure:.4f}")
        print(f"\n=== BERTScore ===")
        print(f"  P={P.item():.4f} R={R.item():.4f} F1={F1.item():.4f}")

    return results

# Example
reference = "India's ISRO achieved a historic lunar south pole landing with Chandrayaan-3 at a cost of $75 million."
candidate = "ISRO's Chandrayaan-3 mission successfully landed near the Moon's south pole, making India the first country to achieve this feat, at a cost of approximately Rs 615 crore."

results = evaluate_summary(candidate, reference)

This evaluation function computes both ROUGE (lexical overlap) and BERTScore (semantic similarity). ROUGE is the standard metric reported in papers and leaderboards, but BERTScore with DeBERTa-xlarge correlates better with human judgments -- it gives credit to paraphrases that ROUGE misses. For production monitoring, track ROUGE-L F1 as the primary metric and BERTScore F1 as a secondary signal. A ROUGE-L F1 above 0.35 and BERTScore F1 above 0.85 typically indicates acceptable quality for news-style summarization.
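To see what ROUGE-L actually measures, here is a simplified, dependency-free sketch based on the longest common subsequence (LCS) of words. It assumes lowercased whitespace tokenization and no stemming, so its numbers will not match the rouge-score package exactly; `lcs_length` and `rouge_l_f1` are illustrative helper names.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1: LCS-based precision and recall over words."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

The example pair shares the subsequence "the cat on the mat" (5 of 6 words on each side), which is why a paraphrase with different wording scores poorly on ROUGE even when the meaning is preserved.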

Configuration Example

# Summarization pipeline config (YAML)
pipeline:
  name: document-summarizer
  strategy: map-reduce  # options: single-pass, map-reduce, refine

preprocessor:
  max_input_tokens: 100000
  sentence_splitter: spacy  # options: spacy, nltk, regex
  clean_html: true
  remove_boilerplate: true

chunker:
  chunk_size_tokens: 900
  overlap_tokens: 100
  respect_sentence_boundaries: true

model:
  name: facebook/bart-large-cnn
  device: cuda:0
  quantization: int8  # options: none, int8, int4
  generation:
    max_length: 150
    min_length: 40
    num_beams: 4
    length_penalty: 2.0
    no_repeat_ngram_size: 3
    early_stopping: true

quality:
  faithfulness_checker: summac  # options: summac, alignscore, llm-judge
  min_rouge_l: 0.30
  max_hallucination_score: 0.15
  reject_on_failure: false  # log warning instead of rejecting

monitoring:
  log_rouge_scores: true
  log_latency_p99: true
  alert_on_quality_drop: true
  evaluation_sample_rate: 0.05  # evaluate 5% of outputs
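One way the `quality` block above might be applied at runtime is sketched below. The semantics are an assumption: a summary fails if its ROUGE-L is below min_rouge_l or its hallucination score is above max_hallucination_score, and reject_on_failure decides whether a failure blocks the output or only produces warnings. `apply_quality_gate` is a hypothetical helper, and the scores are assumed to come from your own evaluation step.

```python
from typing import Dict, List, Tuple


def apply_quality_gate(
    rouge_l: float,
    hallucination_score: float,
    quality_cfg: Dict,
) -> Tuple[bool, List[str]]:
    """Return (accepted, problems) for one summary under the `quality` config."""
    problems: List[str] = []
    if rouge_l < quality_cfg["min_rouge_l"]:
        problems.append(
            f"rouge_l {rouge_l:.2f} below minimum {quality_cfg['min_rouge_l']:.2f}"
        )
    if hallucination_score > quality_cfg["max_hallucination_score"]:
        problems.append(
            f"hallucination_score {hallucination_score:.2f} above maximum "
            f"{quality_cfg['max_hallucination_score']:.2f}"
        )
    # With reject_on_failure=False, problems are logged but the summary is kept.
    accepted = not problems or not quality_cfg["reject_on_failure"]
    return accepted, problems


cfg = {"min_rouge_l": 0.30, "max_hallucination_score": 0.15, "reject_on_failure": False}
accepted, problems = apply_quality_gate(0.25, 0.20, cfg)
print(accepted, problems)
```

Starting with reject_on_failure set to false, as in the config, lets you collect failure rates before deciding whether blocking outputs is acceptable for your product.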

Common Implementation Mistakes

● Ignoring input length limits: Feeding a 10,000-token document into a BART model with a 1024-token limit either fails or, with truncation enabled, silently drops everything past the limit. The model then summarizes only the beginning, missing critical information from the middle and end. Always check model.config.max_position_embeddings and implement chunking for longer inputs.
● Treating ROUGE as a quality guarantee: A ROUGE-1 F1 of 0.45 tells you the summary has good unigram overlap with the reference, but it says nothing about factual accuracy. Models can achieve high ROUGE scores while hallucinating specific numbers, dates, or causal relationships. Always pair ROUGE with faithfulness evaluation.
● Using greedy decoding for abstractive models: Greedy decoding (selecting the highest-probability token at each step) often produces repetitive, degenerate summaries. Use beam search with num_beams=4 and no_repeat_ngram_size=3 as a baseline. For more diverse summaries, try nucleus sampling with do_sample=True and top_p=0.9.
● Not controlling output length explicitly: Without min_length and max_length constraints, models may produce summaries that are too short (a single generic sentence) or too long (barely shorter than the input). Always set explicit length bounds based on your application requirements.
● Applying the same summarizer across all domains without evaluation: A model fine-tuned on CNN/DailyMail news articles will underperform on legal documents, medical records, or code documentation. Domain transfer is not free -- evaluate on your target domain and fine-tune if quality is insufficient.
● Neglecting post-processing: Abstractive models sometimes generate incomplete sentences (especially with aggressive max_length truncation), repeated phrases, or formatting artifacts. A simple post-processing step that removes incomplete trailing sentences and deduplicates content significantly improves perceived quality.
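The post-processing step mentioned in the last item can be quite small. A minimal sketch, assuming English-style sentence punctuation and exact-match (case-insensitive) deduplication; `postprocess_summary` is an illustrative helper, and near-duplicate detection would need something stronger.

```python
import re
from typing import List


def postprocess_summary(summary: str) -> str:
    """Drop an unfinished trailing sentence and exact duplicate sentences."""
    text = " ".join(summary.split())  # collapse newlines and repeated spaces
    sentences: List[str] = re.split(r"(?<=[.!?])\s+", text)

    # A final fragment without terminal punctuation was probably cut by max_length.
    # Keep it only if it is the sole sentence, so we never return an empty string.
    last = sentences[-1].rstrip("\"')]")
    if len(sentences) > 1 and not last.endswith((".", "!", "?")):
        sentences = sentences[:-1]

    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.lower()
        if key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)


raw = "ISRO landed on the Moon. ISRO landed on the Moon. The rover then deployed and"
print(postprocess_summary(raw))
```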

When Should You Use This?

Use When

You need to condense long documents (reports, articles, transcripts) into digestible formats for end users or downstream systems
Your RAG pipeline retrieves more context than fits in the LLM's prompt -- a summarizer compresses retrieved passages to maximize information density within token budgets
Customer support teams need quick overviews of long ticket histories or conversation threads before responding
You are building a news aggregation or content curation platform (like Inshorts or Google News) that needs automated digest generation
Legal, medical, or financial documents need to be summarized for non-expert stakeholders while preserving key facts
Multi-document settings require synthesizing information from multiple sources into a single coherent overview
You want to condense meeting recordings, podcasts, or video transcripts into shorter notes or summaries

Avoid When

The source text is already short (under 100 words) -- summarization adds latency and cost with minimal benefit. Just display the original.
Exact wording matters (legal clauses, regulatory language, medical dosage instructions) -- any paraphrasing risks changing the meaning, and extractive methods may miss critical context
You need structured information extraction (entity extraction, relation extraction) rather than a prose summary -- use an NER or IE pipeline instead
The input is highly technical with domain-specific terminology that your summarizer has not been trained on -- output quality will be poor and potentially dangerous (e.g., medical summaries with wrong drug names)
Real-time latency requirements are under 50ms -- even fast summarization models add 100-500ms per document. Consider pre-computing summaries offline instead.
The task is better served by keyword extraction or topic labeling rather than full prose summarization

Key Tradeoffs

Extractive vs. Abstractive

Dimension	Extractive	Abstractive
Faithfulness	High (words come from source)	Lower (may hallucinate)
Fluency	Variable (stitched sentences)	High (generated prose)
Compression ratio	3:1 to 5:1 typical	5:1 to 20:1 achievable
Latency	10-50ms (no generation)	100ms-5s (model dependent)
Cost	Low (CPU inference)	Higher (GPU or API)
Training data needed	None (unsupervised)	Significant (supervised)

Fine-Tuned Models vs. LLM APIs

Fine-tuned models (BART, T5, PEGASUS) offer lower latency (50-200ms), predictable costs ($25-100/month for a GPU), and domain specialization. But they require training data, model management, and cannot generalize to new domains without retraining.

LLM APIs (GPT-4o, Claude, Gemini) offer zero-shot versatility, better instruction following (control length, style, format), and no infrastructure management. But they are more expensive at scale ($100-500/month at moderate volume), have higher latency (1-10s), and raise data privacy concerns for sensitive content.

The Sweet Spot for Most Teams

Start with an LLM API for prototyping and validation. Once you have confirmed that summarization adds value to your product and you have collected evaluation data, fine-tune a smaller model (T5-base, BART-base) for cost efficiency. Keep the LLM as a fallback for edge cases that the fine-tuned model handles poorly.

Alternatives & Comparisons

Text Classifier

A text classifier assigns labels to documents (topic, sentiment, priority) but does not generate shorter representations of the content. If you need a categorical understanding of a document rather than a condensed version, classification is the right tool. However, summarization and classification are often used together -- classify first, then summarize within each category.

Context Assembler

A context assembler selects and arranges retrieved passages for an LLM prompt. While it performs a form of content selection (similar to extractive summarization), it focuses on fitting within token budgets and maintaining relevance ordering rather than producing human-readable summaries. Use a summarizer when the output needs to be coherent prose; use a context assembler when the output feeds directly into an LLM.

Prompt Template

Prompt templates can include summarization instructions (e.g., 'Summarize the following in 3 bullet points'), effectively embedding summarization as part of a larger LLM call. This works for simple cases but lacks the modularity, caching, and quality monitoring of a dedicated summarizer component. For production systems, separate the summarization step from the generation step.

Tokenizer

The tokenizer is a prerequisite for the summarizer, not an alternative. It converts raw text into tokens that the summarization model can process. If your goal is simply to truncate text to fit a token budget without preserving meaning, a tokenizer with truncation is simpler than a summarizer -- but the result will be much lower quality.

Pros, Cons & Tradeoffs

Advantages

Massive time savings: Reduces hours of reading to seconds of scanning. A 10,000-word document becomes a 200-word summary in under a second with a fine-tuned model.
Enables downstream processing: Summarized text fits within LLM context windows, reduces embedding costs, and improves retrieval precision in RAG pipelines.
Scalable information access: Democratizes access to long, complex documents -- non-expert stakeholders can understand legal contracts, research papers, or financial reports through summaries.
Flexible granularity: Can produce anything from a single-sentence headline to a multi-paragraph executive summary, controlled by generation parameters.
Domain adaptability: Fine-tuned models can specialize in legal, medical, scientific, or financial summarization with domain-specific training data, achieving quality that matches human experts.
Cost-effective at scale: A single GPU running BART can summarize 50+ documents per second. Even LLM APIs at $0.001/summary are far cheaper than human summarizers (INR 500-2000 per human-written summary).

Disadvantages

Hallucination risk: Abstractive models may generate plausible-sounding facts not present in the source. This is especially dangerous in medical, legal, and financial domains where accuracy is non-negotiable.
Evaluation is hard: ROUGE captures lexical overlap but not semantic accuracy or faithfulness. Human evaluation is expensive (INR 200-500 per evaluation). BERTScore helps but is not sufficient alone.
Long document handling is complex: Most transformer models have context limits (1024 tokens for BART, 512 for base T5). Map-reduce strategies add complexity, latency, and potential information loss at chunk boundaries.
Domain transfer is not free: A model trained on news articles will struggle with medical discharge summaries, legal briefs, or code documentation. Fine-tuning requires labeled data that may not exist for your domain.
Loss of nuance: Summarization inherently discards information. Critical qualifying statements ('under certain conditions', 'except in cases of') may be dropped, changing the meaning of the remaining content.
Position bias: Many models disproportionately weight the beginning of the document (lead bias), under-representing important information in the middle or end. This is a known weakness of models trained on news data.

Evaluate on your target domain before deploying. If quality is insufficient, fine-tune on domain-specific data (even 1000 examples can make a significant difference with PEGASUS). For LLM-based systems, include domain-specific few-shot examples in the prompt.

Placement in an ML System

Where Does a Summarizer Sit in the ML Pipeline?

In a RAG pipeline, the summarizer typically sits between the retriever and the context assembler. After relevant passages are retrieved from a vector store, the summarizer compresses them to fit within the LLM's context window. This is especially valuable when retrieval returns 20+ passages but the prompt template only has room for 5 -- summarizing the top 20 into a condensed form preserves more information than simply truncating to the top 5.
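A minimal sketch of this compression step, assuming passages arrive ranked by relevance, a caller-supplied `summarize` function (hypothetical: any `str -> str` summarizer), and whitespace word counts as a rough proxy for tokens. `compress_passages` is an illustrative name, not a library function.

```python
from typing import Callable, List


def compress_passages(
    passages: List[str],
    summarize: Callable[[str], str],
    token_budget: int,
    count_tokens: Callable[[str], int] = lambda t: len(t.split()),
) -> List[str]:
    """Fit ranked passages into a token budget, summarizing only when needed."""
    # If everything already fits, pass the passages through untouched.
    if sum(count_tokens(p) for p in passages) <= token_budget:
        return list(passages)

    kept: List[str] = []
    used = 0
    for passage in passages:  # assumed to be in relevance order
        summary = summarize(passage)
        cost = count_tokens(summary)
        if used + cost > token_budget:
            break
        kept.append(summary)
        used += cost
    return kept


# Stub summarizer for illustration: keep the first 5 words of each passage.
passages = [" ".join(f"p{i}w{j}" for j in range(20)) for i in range(4)]
compressed = compress_passages(passages, lambda t: " ".join(t.split()[:5]), token_budget=12)
print(len(compressed))
```

In this toy setup no raw 20-word passage fits the 12-token budget, while two compressed passages do, which is the trade the paragraph above describes. A real implementation should use the generator model's own tokenizer for `count_tokens`.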

In a content platform (news, research, e-commerce), the summarizer sits in the serving layer, generating summaries on-demand or during batch processing. Flipkart might summarize product reviews for display on product pages; Inshorts condenses full news articles into 60-word cards for mobile consumption.

In a customer support system, the summarizer processes conversation history before it is shown to an agent or fed into a response-generation model. Freshworks or Zendesk-style platforms use summarization to give agents quick context on long ticket threads.

Key Insight: The summarizer is a compression layer. It appears wherever there is a mismatch between available information and the capacity of the next consumer -- whether that consumer is a human, an LLM, or a downstream model.

Pipeline Stage

Post-Processing / Serving

Upstream

tokenizer
text-classifier
document-loader

Downstream

context-assembler
prompt-template
embedding-model

Scaling Bottlenecks

Where It Gets Tight

The primary bottleneck is GPU compute for abstractive summarization. A single BART-large model on an A10G GPU processes ~50 documents/second for short inputs (512 tokens). For map-reduce over 10-chunk documents, throughput drops to ~5 documents/second per GPU.

LLM API rate limits become the bottleneck for API-based systems. GPT-4o at 30K tokens/minute limits you to ~15 long-document summaries per minute. At 10,000 documents/day, you need careful request scheduling and batching.

Memory is the bottleneck for extractive methods at scale. Building a TextRank similarity matrix for a 10,000-sentence document requires $O(n^2)$ memory -- ~400MB for a dense float32 matrix (10,000 x 10,000 x 4 bytes), on top of the sentence embeddings themselves. For very long documents, hierarchical approaches are necessary.
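A quick back-of-the-envelope check of the memory figures, assuming float32 values (4 bytes each) and the 384-dimensional all-MiniLM-L6-v2 embeddings used in the TextRank example. `dense_matrix_mb` is an illustrative helper.

```python
def dense_matrix_mb(rows: int, cols: int, bytes_per_value: int = 4) -> float:
    """Memory for a dense matrix in megabytes (1 MB = 1,000,000 bytes)."""
    return rows * cols * bytes_per_value / 1_000_000


n_sentences, embedding_dim = 10_000, 384

similarity_mb = dense_matrix_mb(n_sentences, n_sentences)   # n x n similarity matrix
embeddings_mb = dense_matrix_mb(n_sentences, embedding_dim)  # n x d sentence embeddings

print(f"similarity matrix: {similarity_mb:.1f} MB, embeddings: {embeddings_mb:.2f} MB")
```

The quadratic similarity matrix dominates: doubling the number of sentences quadruples its size, while the embeddings only double.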

Some concrete numbers for a system processing 50,000 documents/day:

Fine-tuned BART-large on 4x A10G GPUs: ~INR 80,000/month ($960/month)
GPT-4o-mini API: ~INR 19,000/month ($225/month) but higher latency
Hybrid extractive pre-filter + BART: 2x A10G GPUs, ~INR 40,000/month ($480/month)

Production Case Studies

Inshorts | News & Media (India)

Inshorts, one of India's most popular news apps with 10M+ downloads, developed Rapid60 -- an AI-backed algorithm trained on 500,000+ manually-produced summaries that automatically condenses full-length news articles into 60-word 'shorts'. The system generates over 100K summaries per month, blending AI output with editorial oversight for the top 20% most-read content.

Outcome:

The company achieved operational profitability with revenue growing from INR 3 crore ($0.4M) in 2016 to INR 100+ crore ($14M) in 2019-20. The AI summarization pipeline handles 80% of content, freeing editorial staff to focus on high-impact stories.

Google (Gemini) | Technology

Google Cloud published a detailed reference architecture for long-document summarization using Gemini models with iterative refinement and map-reduce patterns. The system handles documents exceeding 100,000 tokens by splitting into chunks, summarizing with Gemini Flash, and consolidating with a refinement pass. Deployed on Vertex AI Workflows for orchestration.

Outcome:

The architecture demonstrates production-grade long-document summarization with configurable chunk strategies, achieving consistent quality across documents ranging from 10K to 500K tokens while managing API costs through model selection (Gemini Flash for chunks, Gemini Pro for final consolidation).

Salesforce (Chain of Density Research) | Enterprise Software / Research

Researchers from Salesforce, MIT, and Columbia developed the Chain of Density prompting technique for LLM-based summarization. The method iteratively refines summaries to increase entity density without increasing length. Human evaluators on CNN/DailyMail preferred CoD summaries over vanilla GPT-4 summaries, selecting the 3rd iteration as the optimal density level.

Outcome:

CoD summaries were rated as more abstractive, exhibiting more fusion and less lead bias than standard GPT-4 summaries. The technique has been widely adopted in production summarization systems using LLM APIs, becoming a de facto prompting standard for high-quality summarization.

BBC (XSum Dataset) | News & Media

The BBC's article corpus (2010-2017) was used to create the XSum (Extreme Summarization) dataset -- 226,711 articles each paired with a single-sentence summary. This dataset pushed the field toward highly abstractive summarization, as the one-sentence summaries cannot be produced by simple sentence extraction. It remains a key benchmark for evaluating summarization models.

Outcome:

XSum became one of the most widely used summarization benchmarks alongside CNN/DailyMail. State-of-the-art ROUGE-1 scores improved from ~29 (pre-transformer) to ~47 (PEGASUS) on this dataset, demonstrating the dramatic impact of pre-trained models on abstractive summarization quality.

Tooling & Ecosystem

Hugging Face Transformers

Python | Open Source

The go-to library for summarization model inference and fine-tuning. Provides pre-trained BART, T5, PEGASUS, LED, and dozens of other summarization models with a unified pipeline('summarization') API. Supports GPU acceleration, quantization, and easy model switching.

sumy

Python | Open Source

Lightweight Python library for extractive summarization. Implements TextRank, LexRank, LSA, Luhn, and KL-Sum algorithms. Excellent for quick extractive baselines with no GPU required. Ideal for prototyping or low-resource environments.

LangChain Summarization Chains

Python | Open Source

Provides ready-made chains for stuff, map-reduce, and refine summarization strategies using any LLM backend (OpenAI, Anthropic, local models). Handles document splitting, prompt templating, and chain orchestration. Best for LLM-based summarization workflows.

rouge-score

Python | Open Source

Google's official ROUGE implementation in Python. Computes ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum with optional stemming. The de facto standard for reproducible summarization evaluation.

BERTScore

Python | Open Source

Computes semantic similarity between candidate and reference summaries using contextual embeddings. Correlates more strongly with human judgments than ROUGE, especially for abstractive summaries. Supports multiple backbone models including DeBERTa.

PEGASUS (Google Research)

Python / TensorFlow | Open Source

Google's PEGASUS model implementation, pre-trained with Gap Sentences Generation (GSG) -- a self-supervised objective specifically designed for summarization. Achieves strong results on 12 diverse benchmarks and particularly excels in low-resource settings (surpassing the previous state of the art on 6 datasets with only 1000 examples).

SummaC

Python | Open Source

Factual consistency evaluation tool for summarization. Uses NLI models to detect hallucinations by checking whether each sentence in the summary is entailed by the source document. Essential for production systems where faithfulness matters.

Research & References

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov, Zettlemoyer (2020) · ACL 2020

Introduced BART, a denoising autoencoder that combines a bidirectional encoder (like BERT) with an autoregressive decoder (like GPT). Achieved state-of-the-art on CNN/DailyMail and XSum summarization benchmarks, with reported gains of up to 6 ROUGE over prior work.

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

Zhang, Zhao, Saleh, Liu (2020) · ICML 2020

Proposed Gap Sentences Generation (GSG), a pre-training objective tailored for summarization where important sentences are masked and the model learns to generate them. Achieved state-of-the-art on all 12 evaluated summarization benchmarks and, in the low-resource experiments, surpassed the previous state of the art on 6 of them with only 1,000 training examples.
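A rough sketch of how a GSG training pair can be built: score each sentence by ROUGE-1 F1 against the rest of the document, mask the top-scoring sentences in the input, and use them as the generation target. This follows the independent-scoring idea described in the paper but is heavily simplified. The whitespace tokenization, the `make_gsg_example` name, and the default `gap_ratio` are assumptions for illustration, and the original recipe has further details that are omitted here.

```python
import math
from collections import Counter
from typing import List, Tuple


def rouge1_f1(a_tokens: List[str], b_tokens: List[str]) -> float:
    """Unigram-overlap F1 between two token lists."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    overlap = sum((a & b).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)


def make_gsg_example(sentences: List[str], gap_ratio: float = 0.3,
                     mask_token: str = "[MASK1]") -> Tuple[str, str]:
    """Mask the most 'important' sentences and return (model_input, target)."""
    n = len(sentences)
    if n < 2:
        raise ValueError("need at least two sentences")
    tokenized = [s.lower().split() for s in sentences]

    # Importance proxy: ROUGE-1 F1 of each sentence against all other sentences.
    scores = []
    for i in range(n):
        rest = [t for j, toks in enumerate(tokenized) if j != i for t in toks]
        scores.append(rouge1_f1(tokenized[i], rest))

    # Mask at least one sentence, but always leave at least one visible.
    m = max(1, min(n - 1, math.ceil(gap_ratio * n)))
    selected = set(sorted(range(n), key=lambda i: scores[i], reverse=True)[:m])

    model_input = " ".join(mask_token if i in selected else s
                           for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(selected))
    return model_input, target


doc = ["transformers process text",
       "transformers summarize text and news",
       "news arrives daily"]
print(make_gsg_example(doc))
```

Because the masked sentences are the ones that overlap most with the rest of the document, generating them resembles writing a summary, which is the intuition behind using GSG as a pre-training objective for summarization.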

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu (2020) · JMLR 2020

Introduced T5, which frames all NLP tasks as text-to-text problems. The systematic study of pre-training objectives, model architectures, and data strategies established T5 as a versatile foundation for summarization and many other tasks.

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

Adams, Fabbri, Ladhak, Lehman, Elhadad (2023) · New Frontiers in Summarization Workshop, EMNLP 2023

Introduced Chain of Density prompting where GPT-4 iteratively produces summaries with increasing entity density. Human evaluators preferred CoD summaries at step 3 (of 5), finding them more abstractive and less lead-biased than vanilla GPT-4 summaries.

Longformer: The Long-Document Transformer

Beltagy, Peters, Cohan (2020) · arXiv preprint

Introduced an attention mechanism that scales linearly with sequence length, enabling processing of documents with thousands of tokens. The Longformer Encoder-Decoder (LED) variant achieved state-of-the-art on long-document summarization benchmarks (arXiv, PubMed).

Get To The Point: Summarization with Pointer-Generator Networks

See, Liu, Manning (2017) · ACL 2017

Popularized the copy mechanism for neural abstractive summarization with a hybrid pointer-generator network that can both generate new words and copy words from the source. The pointer-generator architecture and coverage mechanism became foundational building blocks for subsequent summarization models.

TextRank: Bringing Order into Texts

Mihalcea, Tarau (2004) · EMNLP 2004

Proposed an unsupervised graph-based ranking algorithm for text extraction, applying PageRank to sentence graphs for extractive summarization. Remains widely used as a strong unsupervised baseline and a component in hybrid summarization systems.
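A compact sketch of the idea for sentence extraction: build a graph whose nodes are sentences, weight edges by word overlap normalized by sentence length, and run a weighted PageRank-style iteration. This is a simplified illustration, not a reference implementation. The whitespace tokenization, fixed iteration count, and the `textrank` function name are assumptions; libraries such as sumy add stop-word removal, stemming, and convergence checks.

```python
import math
from typing import List


def sentence_similarity(a: List[str], b: List[str]) -> float:
    """Shared distinct words, normalized by the log lengths of both sentences."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom <= 0:  # both sentences are a single token
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences: List[str], top_k: int = 2, damping: float = 0.85,
             iterations: int = 50) -> List[str]:
    """Return the top_k highest-ranked sentences in their original order."""
    tokens = [s.lower().split() for s in sentences]
    n = len(sentences)

    # Symmetric edge weights between every pair of sentences.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = sentence_similarity(tokens[i], tokens[j])
    out_weight = [sum(row) for row in w]

    # Weighted PageRank update, run for a fixed number of iterations.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]

    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(ranked)]


sents = ["the cat sat on the mat",
         "the cat ate the fish on the mat",
         "stocks fell sharply today"]
print(textrank(sents, top_k=2))
```

In this toy input the third sentence shares no words with the others, so it receives only the baseline score and is left out of the extract.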

Benchmarking Large Language Models for News Summarization

Zhang, Ladhak, Durmus, Liang, McKeown, Hashimoto (2024) · TACL 2024

Human evaluation of ten LLMs on news summarization across pretraining methods, prompts, and model scales. Found that instruction tuning, rather than model size, is the key driver of zero-shot summarization quality, and that low-quality reference summaries in standard benchmarks make reference-based metrics such as ROUGE misleading -- LLM summaries were judged on par with summaries written by freelance writers.

Interview & Evaluation Perspective

Common Interview Questions

● What is the difference between extractive and abstractive summarization? When would you choose each?
● How do you handle summarization of documents that exceed the model's context window?
● Explain the ROUGE metric and its limitations. What alternatives exist?
● How would you detect and mitigate hallucinations in an abstractive summarizer?
● Design a summarization system for a news app like Inshorts that needs to condense articles into 60-word summaries at scale.
● Compare BART, T5, and PEGASUS for summarization. What are the architectural differences and when would you choose each?
● How would you evaluate whether your summarizer is production-ready? What metrics and processes would you put in place?
● Describe the map-reduce approach to long-document summarization. What are the tradeoffs?

Key Points to Mention

● The extractive-abstractive spectrum is not binary -- modern systems often use extractive pre-selection followed by abstractive rewriting. This combines the faithfulness of extraction with the fluency of abstraction.
● ROUGE measures lexical overlap, not semantic accuracy. A summary can score high on ROUGE while hallucinating facts. Always supplement it with a semantic similarity metric (BERTScore), a faithfulness check (SummaC), and human evaluation. BERTScore alone does not verify faithfulness.
● PEGASUS's Gap Sentences Generation pre-training objective makes it the strongest model for summarization when fine-tuning data is limited (around 1,000 examples or fewer). BART is the best general-purpose choice. T5 offers flexibility across tasks.
● Map-reduce summarization handles long documents but introduces information loss at chunk boundaries and in the reduction step. The refine strategy (iteratively updating a running summary) can produce more coherent results at the cost of sequential processing.
● Chain of Density prompting produces higher-quality LLM summaries than vanilla prompting and requires no fine-tuning -- essential knowledge for production LLM-based summarization systems.
● Position bias (lead bias) is a real problem: models trained on news over-weight early sentences. For non-news domains, this must be explicitly addressed through training data or prompting strategies.

Pitfalls to Avoid

● Treating summarization as a solved problem because LLMs can generate fluent summaries -- faithfulness, controllability, and cost-efficiency are unsolved engineering challenges.
● Reporting only ROUGE scores without discussing their limitations. Senior interviewers will push back on ROUGE-only evaluation.
● Ignoring the latency and cost implications of different approaches. Always frame the solution in terms of the production constraints: 'For 10K docs/day at <200ms p99, a fine-tuned BART-base on GPU costs INR 25K/month.'
● Failing to discuss post-processing and quality validation as part of the system. The model is one component; the system includes preprocessing, chunking, inference, post-processing, and monitoring.
● Assuming one model fits all domains. Domain-specific fine-tuning or few-shot prompting is almost always needed for production quality.

Senior-Level Expectation

A senior candidate should discuss the full system design: document preprocessing and chunking strategy, model selection with quantitative justification (not just 'I would use BART'), output quality validation (faithfulness checking, ROUGE monitoring, human evaluation loops), latency and cost analysis at the expected scale, and graceful degradation strategies (fallback from abstractive to extractive if the model produces low-confidence output). They should also address operational concerns: how to handle model updates without downtime, how to detect quality regressions in production, and how to build evaluation datasets for a new domain. The ability to reason about the tradeoff between LLM API costs and self-hosted model costs -- especially in the context of Indian startup budgets -- distinguishes senior engineers from mid-level ones. A staff-level candidate would additionally discuss multi-document summarization, cross-lingual summarization for India's multilingual context, and how summarization quality metrics feed into broader product KPIs.

Summary

A summarizer is the information compression layer in ML systems, transforming long documents into concise representations while preserving key facts, arguments, and nuance. The field spans a spectrum from extractive methods (TextRank, BERT-based sentence scoring) that select existing sentences to abstractive methods (BART, T5, PEGASUS, LLMs) that generate novel text. Modern production systems increasingly use hybrid approaches -- extractive pre-selection followed by abstractive rewriting -- to balance faithfulness with fluency.

The practical engineering of summarization involves navigating several interconnected tradeoffs: extractive vs. abstractive (safety vs. fluency), fine-tuned models vs. LLM APIs (cost and latency vs. versatility), and quality vs. speed (beam search with faithfulness checking vs. single-pass generation). For long documents, map-reduce and refine strategies handle inputs that exceed model context windows, while techniques like Chain of Density prompting push LLM-based summarization quality beyond vanilla prompting. Evaluation remains a core challenge: ROUGE measures lexical overlap but misses semantic accuracy, BERTScore captures meaning better but ignores faithfulness, and factual consistency checkers (SummaC, AlignScore) address hallucination but add latency.

In production, the summarizer is wherever there is a mismatch between available information and the capacity of the next consumer. Whether compressing retrieved passages for a RAG pipeline's context window, condensing news articles into 60-word cards for an Inshorts-like app, or generating ticket summaries for customer support agents -- the summarizer bridges the gap between information abundance and attention scarcity. Choose your approach based on your domain, scale, and quality requirements; start with an LLM API for validation, then optimize toward fine-tuned models as your understanding of the problem matures.
