Skip to main content
Agentic Topic Modeling: Embedding Pipelines, LLMs, and Human-in-the-Loop Engineering Trade-offs
  1. Posts/

Agentic Topic Modeling: Embedding Pipelines, LLMs, and Human-in-the-Loop Engineering Trade-offs

·2106 words·10 mins
NeatGuyCoding
Author
NeatGuyCoding

Agentic Topic Modeling: Embedding Pipelines, LLMs, and Human-in-the-Loop Engineering Trade-offs
#

“Topics” over large corpora are no longer the exclusive domain of LDA bag-of-words era. Vector search, RAG, and agent toolchains have pushed document-level embeddings to center stage, yet operations still need browsable, filterable, iterable topic-layer metadata—supporting analysts reading reports and serving as a coarse index at million-document scale. Recent discussions often place agentic topic modeling alongside TopicGPT and BERTopic; stripping away podcast narrative, the core tension is a three-way trade-off among static embedding clustering, LLMs maintaining topic tables and assigning documents, and human-in-the-loop steering of granularity in natural language. Below is organized by engineering themes; evidence follows official docs and papers. Figures and product throughput not found in primary sources are labeled speaker/host view.


Problem space: what role topics play in the RAG stack
#

Why: Pure document-embedding retrieval is already strong semantically, but users still ask “what are these complaints mainly about?” and “what new issues appeared last quarter?”—that needs mid-level structure between full text and single sentences. A well-built topic layer can act as metadata partitions on a vector store or as the first hop in two-stage retrieval; a poor one introduces wrong priors and is worse than none.

Mechanisms/constraints: “Topic” has no unique form—it may be BERTopic c-TF-IDF keyword lists, or natural-language labels and descriptions in the TopicGPT style. Speaker view: presentation is highly subjective; there is no universally “correct topic.”

How to proceed: Architecturally decouple topic production from downstream consumers—first define consumers (analyst UI, retrieval router, evaluation pipeline), then choose representation (word weights vs LLM summaries vs hierarchy).

Common pitfall: Equating “topic” with “one LLM-generated headline”; without comparable scores, the eye over-trusts fluent wording (see Evaluation below).


Modular embedding pipeline: swappable building blocks
#

Why: Embedding models, clusterers, and dimensionality reducers iterate quickly; if topic modeling is bound to one stack, every swap of SentenceTransformer or density clusterer rewrites the pipeline.

Mechanisms/constraints: The BERTopic algorithm page describes five steps: Embeddings → Dimensionality Reduction → Clustering → Tokenizer → c-TF-IDF (class-based TF-IDF) → optional Representation. Each step is replaceable; arXiv:2203.05794 likewise states: cluster document embeddings, then generate topic representations with class-based TF-IDF.

Figure: slide “Build Your Topic Model”—bottom-up: Convert document into embeddings (Transformers / SBERT / SpaCy), Reduce embedding dimensionality (TruncatedSVD / PCA / UMAP), Cluster reduced embeddings (HDBSCAN / BIRCH / k-Means), Tokenize documents.

Mermaid diagram 1

How to proceed (minimal example, aligned with official defaults):

from bertopic import BERTopic
from umap import UMAP
import hdbscan

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=15)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

Common pitfall: Treating presentation 2D scatter plots as the true clustering space. Official examples use n_components=5; 5–10 dimensions is the guest’s experience range, not UMAP’s documented default (UMAP parameters). 2D DataMapPlot suits narrative; when explaining cluster composition, return to high-dimensional space, word weights, and sample documents.


LLM-native topic discovery: TopicGPT and TnT-LLM
#

Why: Clustering + c-TF-IDF topic words skew toward “statistically salient,” not necessarily business language; LLMs can emit readable labels and hierarchical taxonomies directly.

Mechanisms/constraints:

  • TopicGPT (arXiv:2311.01449): prompt framework using LLMs to discover latent topics (natural-language labels + descriptions); README includes generate_topic_lvl1, refine_topics, assign_topics, etc.—generation and assignment separated.
  • TnT-LLM (arXiv:2403.12173): two-phase—phase one generates/refines a label taxonomy, phase two trains a lightweight classifier with LLM labels for scale assignment.

Speaker view: Repeated per-document LLM calls over the full corpus are wasteful; safer is embedding pre-clustering + LLM for guidance or top-level summaries. If inference cost nears zero, a full-corpus LLM inspector is logically plausible—then embeddings may retreat to visualization and hierarchical analysis (conditional forward look, no benchmark).

How to proceed: Export the LLM topic table as label vector y, plug into BERTopic supervised/semi-supervised paths (next section), rather than starting over.

Common pitfall: Assuming TopicGPT and BERTopic are structurally incompatible; the guest argues TopicGPT topics can be fed back into BERTopic for inference and visualization (interview workflow; capability supported by Supervised / Semi-supervised).

Figure: split-screen interview; left background shows Weaviate podcast branding and waveform motif (~10-minute segment, typical shot while discussing TopicGPT-style routes).


Decoupling assignment and generation
#

Why: Rewriting the representation layer over the full corpus with LLMs is expensive; “split 100k documents into 200 clusters” and “write one human-readable title per cluster” have different compute shapes.

Mechanisms/constraints: BERTopic does assignment (clustering) first, then generation (c-TF-IDF and optional Representation models). TopicGPT / TnT-LLM also separate topic generation from assignment. Speaker view: iterating representation touches only topic-count volume—cheap vs full-corpus LLM—“~1% of data” is order-of-magnitude intuition, no primary formula; do not cite as paper data.

Mermaid diagram 2

How to proceed: After fixing assignment, only replace representation_model or rerun c-TF-IDF; avoid recomputing UMAP/HDBSCAN.

Common pitfall: One end-to-end prompt for both assignment and naming, preventing separate evaluation of “assignment correct?” vs “name sounds good.”


“Agentic”: tool use, human-in-the-loop, and granularity
#

Why: In enterprise settings the biggest friction is often not the algorithm but granularity—too coarse misses sub-issues, too fine fails to align with retrieval agents.

Mechanisms/constraints: In this episode, agentic topic modeling ≈ tool calling + human-in-the-loop + natural-language steer, not a formal definition from a generic agent framework (interview view). Speaker view: the flow remains user-driven; stronger models still will not guess the granularity you want. Host adds: too much choice causes choice overload—humans excel at limited options (e.g. “merge Topic 3 and 7,” “only 12 top-level topics”).

How to proceed: Design agent interfaces as propose–confirm loops: system outputs candidate topics and sample documents → user issues structured merge/split/rename instructions → trigger semi-supervised re-clustering or representation-only rerun.

Common pitfall: Equating “LLM that can call APIs” with “automatically converges to business-satisfying topics”; ignoring hierarchy vs single-layer retrieval agents—speaker view: retrieval can ignore hierarchy and use fine-grained topic metadata; host argues hierarchy adds interface complexity with query agents (no unified conclusion from either).

Figure: interview split-screen; guest side while discussing granularity and human feedback (Weaviate podcast branding visible).


Clustering choices: HDBSCAN, k-means, and parameter trial
#

Why: Density clustering finds arbitrary shapes and labels noise, but hyperparameters are opaque; k-means fixes K, easing comparison of neighboring clusters with LLMs.

Mechanisms/constraints: HDBSCAN min_cluster_size is the smallest grouping still treated as a cluster; increasing it tends toward larger, fewer clusters, but total cluster count still needs experiments—not a direct “I want K topics” knob. Speaker view: min_cluster_size is hard to set a priori—main enterprise friction; some switch to k-means + post-hoc density filtering (preference statement; HDBSCAN docs do not discuss this).

How to proceed:

import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, min_samples=10)
labels = clusterer.fit_predict(reduced_embeddings)

For k-means, LLMs can read keyword differences between adjacent clusters to help pick K (interview idea, not verified against k-LLM-means or similar).

Common pitfall: Recursive HDBSCAN on sub-clusters yields stable hierarchy—guest finds sub-cluster parameters sensitive and hard to maintain in practice (speaker view).


c-TF-IDF: cluster contrast and readable scores
#

Why: A single LLM headline lacks cross-topic comparable scores; analysts struggle to judge “how much does this word belong to this topic?”

Mechanisms/constraints: ClassTfidfTransformer treats each cluster as one “class document,” l1-normalized term frequency × IDF, highlighting words unique to this cluster vs others; the paper calls it a class-based variation of TF-IDF (arXiv:2203.05794). Speaker view: analogous to simple “contrastive learning”; opposes word clouds—font size misleads human weight perception, so BERTopic visualization omits word clouds (product behavior matches oral account; official page does not literally say “opposes word clouds”).

How to proceed: Expose term score ranked lists in UI, not area-encoded word clouds alone.

Common pitfall: Stopping at a fluent LLM topic name without reading representative documents.


Evaluation: automated coherence vs “topics users want”
#

Why: Offline metrics screen models, but classic topic coherence on neural/embedding topic models may diverge from human judgment.

Mechanisms/constraints:

The podcast’s spoken title “are neural topic modeling evaluation metrics broken?” does not match the exact arXiv titles above—cite arXiv titles in writing. On-screen overlay ProxAnn: Use-Oriented Evaluations… aligns with use-oriented evaluation; distinct from O’Reilly Hands-On Large Language Models (Jay Alammar & Maarten Grootendorst).

Figure: O’REILLY Hands-On Large Language Models cover (Jay Alammar & Maarten Grootendorst), overlaid with paper title Use-Oriented Evaluation of Topic Models and Document Clustering (Alexander Hoyle et al.).

Speaker view: coherent + diverse may still miss business-relevant dimensions; add user constraints (minimum topic count, must cover a class, banned words).

Common pitfall: Model selection from a single NPMI/coherence leaderboard; no human document reading or task-oriented (use-oriented) evaluation.

Figure: mid-episode split-screen (~24 minutes), typical shot while discussing evaluation metrics and neural topic models.


Retrieval acceleration: two-stage and IVF analogy
#

Why: Million documents × full-corpus vector ANN is costly; if documents carry topic IDs, shrink the candidate set first.

Mechanisms/constraints: Speaker view: retrieve Top-K topic embeddings first, then document-level search within relevant topics; semantic overlap with pure embedding search may exist, but speed is the clear benefit. Host verbally estimates ~5% of documents enter stage two—no benchmark, unverified. Analogous to FAISS IndexIVF coarse quantization; Weaviate vector index docs list HNSW / flat, not IVF as primary framing—state the analogy explicitly.

Mermaid diagram 3

How to proceed: Build a separate embedding index for topic labels/descriptions; write topic_id metadata on document objects; measure two-stage recall on a held-out query set.

Common pitfall: Shipping without measuring two-stage recall loss; treating 5% as a universal speedup ratio.

Figure: late-episode split-screen (~49 minutes), shot while discussing retrieval and vector stores.


Hybrid pipelines: Supervised and Semi-supervised BERTopic
#

Why: When TopicGPT labels or a business taxonomy already exist, HDBSCAN need not “rediscover” from scratch.

Mechanisms/constraints:

  • Supervised BERTopic: skip dimensionality reduction, use a classifier instead of clustering, then run c-TF-IDF on labels. Speaker view: called a “hack,” but simple and effective.
  • Semi-supervised: partial labels drive semi-supervised UMAP, then HDBSCAN; fit(docs, y=...), unlabeled may use -1.
  • Serialization: safetensors does not save clustering/dimensionality weights; inference can rely on topic embeddings; Dynamic Topic Modeling can slide c-TF-IDF per time slice without re-running clustering.

How to proceed:

# External labels already available (e.g. TopicGPT output)
topic_model = BERTopic(umap_model=None, hdbscan_model=your_classifier)
topic_model.fit(docs, y=labels)
topic_model.save("model_dir", serialization="safetensors")

Semi-supervised: pass y for known-topic documents so UMAP supervised guides the manifold, then cluster.

Common pitfall: Expecting supervised mode to “fix wrong labels”—it assumes assignment is given and only rebuilds representation and inference interface.

Figure: late-episode split-screen (~40 minutes), shot while discussing supervised and dynamic topics.

Figure: corner bug Weaviate podcast (OCR anchor: Weaviate podcast so).


Unresolved tensions (no need to force unity)
#

AxisCommon approachOther pole in guest/discussionEvidence status
Main routeEmbed → UMAP → HDBSCAN → c-TF-IDFFull-corpus LLM TopicGPT-style assignmentEmbedding route has official docs; full-corpus LLM cost argument is interview
Hierarchy vs single layerHierarchical BERTopic / TopicGPT lvl2Single layer + fine-grained metadata for retrievalBoth sides’ views, no experimental comparison
EvaluationNPMI / coherenceLLM-as-judge + humanSee 2107.02173, 2305.12152
Agent definitionFixed DAG workflowTool use + human-in-the-loopInterview view

Figure: corner bug m@& Weaviate (OCR anchor).


If you are shipping this
#

  1. Define consumers first: analysts reading reports, retrieval router, or offline evaluation—then choose “word-weight topics” vs “LLM label topics”; avoid end-to-end black boxes.
  2. Default to modular BERTopic: start with n_components=5 (official example); 2D plots for narrative only; QA with c-TF-IDF scores + sample documents, not LLM titles alone.
  3. Put LLMs on generation or guidance: after TopicGPT/TnT-LLM produce y, plug in via Supervised/Semi-supervised; estimate cost before repeated full-corpus assign.
  4. Layer task constraints on evaluation: beyond coherence critique and LLM judging, add business rules like “must cover class,” “minimum topic count”; cite papers by exact arXiv titles.
  5. Measure recall before two-stage retrieval: topic Top-K cascaded with document ANN needs held-out query validation; do not use unverified 5% document share for capacity planning.

References and further reading
#

Related

Compound AI: When a Single LLM Call Is Not Enough

·2083 words·10 mins
Compound AI: When a single LLM call is not enough—multiple model calls, retrievers, tools, and business logic as a graph; structured output, specialist pipelines, inference stacks, and deployment granularity from a Weaviate podcast with Baseten’s Philip Kiely.