Knowledge Base PRO
The Knowledge Base turns your agents from generic conversationalists into domain experts. Upload your documentation, product guides, FAQs, or any reference material, and your agents will answer questions from your actual content instead of guessing.
How RAG Works
The Knowledge Base is powered by Retrieval-Augmented Generation (RAG), a technique that combines search with AI generation. Here is exactly how it works, step by step:
- Upload — You upload documents (PDF, TXT, MD, CSV) or provide URLs to crawl. The raw content is extracted and stored on your VPS.
- Chunk — Each document is split into smaller pieces called chunks. ClawHQ uses a chunk size of 2,000 characters with a 200-character overlap between consecutive chunks. The overlap reduces information loss at chunk boundaries: a sentence that straddles a boundary and is shorter than the overlap still appears whole in the following chunk.
- Embed — Each chunk is converted into a numerical vector (an embedding) using the all-MiniLM-L6-v2 model, which produces 384-dimensional vectors. These vectors capture the semantic meaning of each chunk, so chunks about similar topics end up close together in vector space.
- Search — When a user asks a question, the question is also embedded into a vector. The system searches for chunks whose vectors are most similar to the question vector. This is called semantic search — it finds relevant content even when the exact words differ.
- Inject Context — The most relevant chunks are inserted into the prompt as context, placed before the user's question. The model sees both the retrieved content and the question together.
- Model Answers — The AI model generates its response using the injected context. Because the relevant information is right there in the prompt, the model can give accurate, grounded answers instead of relying solely on its training data.
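The Chunk step above can be sketched in a few lines of Python. This is a minimal illustration that assumes plain fixed-size character windows with the stated 2,000-character size and 200-character overlap; the function name chunk_text is hypothetical, and ClawHQ's actual splitter may behave differently (for example, by respecting sentence or Markdown boundaries).

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbours."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, each chunk starts 1,800 characters later
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks


# A 4,500-character stand-in document.
document = "".join(str(i % 10) for i in range(4500))
chunks = chunk_text(document)
print([len(c) for c in chunks])  # [2000, 2000, 900]
```

The last 200 characters of each chunk are repeated as the first 200 characters of the next one, which is what keeps short boundary-spanning sentences intact.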
Supported File Types
The Knowledge Base accepts four file formats:
- PDF — Product manuals, research papers, contracts, any document saved as PDF. Text is extracted automatically; scanned images within PDFs are not supported.
- TXT — Plain text files of any kind.
- MD — Markdown files, including those with frontmatter. Markdown formatting is preserved during chunking.
- CSV — Tabular data. Each row is treated as a potential chunk boundary, making CSV ideal for FAQ lists or structured reference data.
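To illustrate what treating each CSV row as a potential chunk boundary can mean in practice, here is a hedged sketch that packs whole rows into chunks and never splits a row. The function name chunk_csv and the "column: value" rendering are assumptions made for the example, not ClawHQ's documented behavior.

```python
import csv
import io


def chunk_csv(csv_text: str, chunk_size: int = 2000) -> list[str]:
    """Pack whole CSV rows into chunks, breaking only between rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []
    chunks, current, current_len = [], [], 0
    for row in reader:
        # Render each row as "column: value" pairs so a chunk is readable on its own.
        line = "; ".join(f"{name}: {value}" for name, value in zip(header, row))
        # Start a new chunk if adding this row would exceed the size limit.
        if current and current_len + len(line) + 1 > chunk_size:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return chunks


faq = (
    "question,answer\n"
    "How do I reset my password?,Use the reset link.\n"
    'What formats are supported?,"PDF, TXT, MD, CSV"\n'
)
# A small chunk_size forces one row per chunk in this demo.
for chunk in chunk_csv(faq, chunk_size=60):
    print(chunk)
```

A row longer than the limit still becomes a chunk of its own in this sketch, since splitting a question from its answer would defeat the purpose.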
Upload Files or Crawl URLs
You can populate your Knowledge Base in two ways:
- File upload — Drag and drop files directly into the dashboard, or click to browse. Multiple files can be uploaded simultaneously. Each file is processed (extracted, chunked, embedded) automatically.
- URL crawl — Provide a URL and ClawHQ will fetch the page content, extract the text, and process it through the same chunking and embedding pipeline. This is useful for ingesting existing web documentation, help center articles, or blog posts.
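The fetch-and-extract part of a URL crawl could look roughly like the sketch below, which uses only the Python standard library. The class and function names are hypothetical, and a production crawler would also need to handle redirects, robots rules, and boilerplate such as navigation menus; none of that is claimed here about ClawHQ's crawler.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_page_text(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its visible text (performs network I/O)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))


sample = (
    "<html><head><style>p{}</style></head><body><h1>Help</h1>"
    "<p>Reset your <b>password</b>.</p><script>x()</script></body></html>"
)
print(html_to_text(sample))
```

The extracted text would then go through the same chunking and embedding steps as an uploaded file.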
Hybrid Search
ClawHQ uses hybrid search, which combines two search strategies:
- Vector search — Finds semantically similar content using embedding cosine similarity. Catches relevant results even when the wording differs from the query.
- Keyword search — Traditional text matching that finds exact terms, product names, error codes, and other specific strings that semantic search might rank lower.
Both searches run simultaneously, and their results are merged using a weighted combination. A chunk can therefore rank highly because it is semantically close to the query, because it contains the exact terms, or both.
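The following sketch shows one way a weighted combination of vector and keyword scores can work. The 0.7 vector weight, the term-overlap keyword score, and the function names are all assumptions for illustration; the weights and keyword-scoring method ClawHQ actually uses are not documented here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear verbatim in the chunk text."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def hybrid_score(query, query_vec, chunk_text, chunk_vec, vector_weight=0.7):
    # vector_weight is a hypothetical setting; the remaining weight goes to keywords.
    return (vector_weight * cosine_similarity(query_vec, chunk_vec)
            + (1.0 - vector_weight) * keyword_score(query, chunk_text))


query = "error E1042"
query_vec = [1.0, 0.0]  # toy 2-dimensional vectors; real embeddings have 384 dimensions
chunks = [
    ("Error E1042 means the license key has expired", [0.6, 0.8]),
    ("Renewing a subscription after it lapses", [0.8, 0.6]),
]
ranked = sorted(
    chunks,
    key=lambda c: hybrid_score(query, query_vec, c[0], c[1]),
    reverse=True,
)
print(ranked[0][0])
```

In this toy data the second chunk has the higher cosine similarity, but the first chunk contains the exact error code, so the combined score ranks it first. That is the case keyword search is meant to cover.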
Relevance Threshold
The default relevance threshold is a cosine similarity score of 0.3. Chunks scoring below this threshold are excluded from the context injected into the model prompt. This prevents low-quality matches from polluting the context window.
You can adjust the threshold in the Knowledge Base settings. A higher threshold (e.g., 0.5) means only very closely matching chunks are used, which is more precise but may miss relevant content. A lower threshold (e.g., 0.2) casts a wider net but may include marginally relevant chunks.
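As a rough sketch of how the threshold interacts with context injection, the code below drops chunks that score under the threshold and places the survivors ahead of the user's question. The function names, the max_chunks cap, and the prompt layout are assumptions for the example.

```python
def select_context(scored_chunks: list[tuple[float, str]], threshold: float = 0.3,
                   max_chunks: int = 5) -> list[str]:
    """Keep chunks scoring at or above the threshold, best first."""
    kept = [(score, text) for score, text in scored_chunks if score >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in kept[:max_chunks]]  # max_chunks is a hypothetical cap


def build_prompt(question: str, context_chunks: list[str]) -> str:
    # Retrieved content goes first, then the question (layout is illustrative).
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


scored = [
    (0.24, "Shipping policy"),
    (0.62, "Password reset steps"),
    (0.31, "Account settings overview"),
]
print(select_context(scored))                  # default threshold of 0.3 keeps two chunks
print(select_context(scored, threshold=0.5))   # stricter: one chunk
print(select_context(scored, threshold=0.2))   # looser: all three
print(build_prompt("How do I reset my password?", select_context(scored)))
```

Running the same scored list through several thresholds like this is a quick way to see what a setting change would add to or remove from the prompt.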
Test Your Knowledge Base
The "Test Your KB" interface lets you run search queries against your Knowledge Base and see exactly which chunks are returned, along with their relevance scores. Use it for:
- Verifying that your documents are chunked correctly
- Tuning the relevance threshold
- Identifying gaps in your content (queries that return no results)
- Confirming that the right content surfaces for expected questions
Type a question in the search bar, and the results panel shows each matching chunk with its source document, chunk index, similarity score, and the full chunk text.
Retrieval Tracking
Retrieval tracking records which documents and chunks are used most frequently in agent responses. Over time, this data reveals which parts of your Knowledge Base are most valuable and which documents are rarely accessed.
Use retrieval tracking to prioritize content updates. Documents that are retrieved frequently but receive negative feedback may need rewriting. Documents that are never retrieved may need better titles or additional context to improve their discoverability.
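Conceptually, retrieval tracking is a counter keyed by document and chunk. The sketch below is a simplified in-memory model, with a hypothetical RetrievalTracker class, showing how usage counts and never-retrieved documents fall out of that data; it does not describe how ClawHQ stores this information.

```python
from collections import Counter


class RetrievalTracker:
    def __init__(self, document_ids):
        self.document_ids = list(document_ids)
        self.chunk_hits = Counter()  # keyed by (document_id, chunk_index)

    def record(self, document_id, chunk_index):
        """Call once each time a chunk is used in an agent response."""
        self.chunk_hits[(document_id, chunk_index)] += 1

    def document_hits(self):
        totals = Counter()
        for (document_id, _), count in self.chunk_hits.items():
            totals[document_id] += count
        return totals

    def never_retrieved(self):
        hits = self.document_hits()
        return [d for d in self.document_ids if hits[d] == 0]


tracker = RetrievalTracker(["pricing.pdf", "faq.csv", "legacy-api.md"])
tracker.record("faq.csv", 3)
tracker.record("faq.csv", 3)
tracker.record("faq.csv", 7)
tracker.record("pricing.pdf", 0)

print(tracker.document_hits().most_common())  # most used documents first
print(tracker.never_retrieved())              # candidates for better titles or context
```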
Chunk Viewer
The Chunk Viewer lets you inspect individual chunks within any document. See how your content was split, verify that chunk boundaries make sense, and identify cases where important information was split across chunks in an unhelpful way.
Each chunk displays its index number, character count, the first and last few words of overlap with adjacent chunks, and its embedding vector magnitude. You can navigate through chunks sequentially or jump to a specific chunk by index.
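Two of the values shown per chunk, the overlap with a neighbouring chunk and the embedding vector magnitude, are easy to compute by hand if you want to sanity-check them. The helper names below are hypothetical, and the toy vector stands in for a real 384-dimensional embedding.

```python
import math


def shared_overlap(previous: str, current: str, max_overlap: int = 200) -> str:
    """Longest suffix of `previous` that is also a prefix of `current`."""
    longest = min(max_overlap, len(previous), len(current))
    for size in range(longest, 0, -1):
        if previous[-size:] == current[:size]:
            return current[:size]
    return ""


def vector_magnitude(vector: list[float]) -> float:
    """Euclidean length (L2 norm) of an embedding vector."""
    return math.sqrt(sum(x * x for x in vector))


print(shared_overlap("the quick brown fox", "brown fox jumps over"))  # brown fox
print(vector_magnitude([3.0, 4.0]))  # 5.0
```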
Connectors
Connect external data sources to keep your Knowledge Base synchronized with content that lives outside of ClawHQ:
- Google Drive — Connect your Google Drive account and select specific files or folders to sync. When the source document changes in Drive, the Knowledge Base automatically re-processes it.
- Notion — Connect your Notion workspace and select databases or pages to sync. Notion content is extracted, chunked, and embedded just like uploaded files.
Connectors check for updates periodically and re-index changed content automatically. You can also trigger a manual sync at any time from the Connectors panel.
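One common way to decide which synced documents need re-indexing is to compare a content fingerprint against the one stored at the last sync. The sketch below assumes that approach; how ClawHQ connectors actually detect changes (for example, via modification timestamps reported by Google Drive or Notion) is not specified here, and the function names are hypothetical.

```python
import hashlib


def content_fingerprint(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def find_changed(remote_docs: dict[str, str], indexed_fingerprints: dict[str, str]) -> list[str]:
    """Return IDs of remote documents that are new or whose content changed."""
    return [
        doc_id
        for doc_id, content in remote_docs.items()
        if indexed_fingerprints.get(doc_id) != content_fingerprint(content)
    ]


# Fingerprints recorded at the previous sync.
indexed = {
    "doc-1": content_fingerprint("Refunds take 5 days."),
    "doc-2": content_fingerprint("Contact support by email."),
}
# Current state of the external source.
remote = {
    "doc-1": "Refunds take 3 days.",       # edited
    "doc-2": "Contact support by email.",  # unchanged
    "doc-3": "New onboarding guide.",      # newly added
}
print(find_changed(remote, indexed))  # ['doc-1', 'doc-3']
```

Only the returned documents would need to pass through extraction, chunking, and embedding again.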
Feedback
When a user receives an answer that was sourced from the Knowledge Base, they can provide thumbs-up or thumbs-down feedback. This feedback is linked back to the specific chunks that were used, giving you a direct signal about content quality.
Over time, feedback data helps you identify documents that are helpful versus those that lead to poor answers. Combined with retrieval tracking, feedback creates a continuous improvement loop for your Knowledge Base content.
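Because feedback is linked to the chunks behind each answer, it can be rolled up per document to find content worth rewriting. The sketch below assumes a simple event shape and two hypothetical cut-offs (min_votes and max_approval); it illustrates the idea and is not ClawHQ's scoring logic.

```python
from collections import defaultdict


def feedback_by_document(events):
    """events: iterable of (is_positive, [(document_id, chunk_index), ...])."""
    tallies = defaultdict(lambda: {"up": 0, "down": 0})
    for is_positive, chunks_used in events:
        # Count each document once per answer, even if several of its chunks were used.
        for document_id in {doc for doc, _ in chunks_used}:
            tallies[document_id]["up" if is_positive else "down"] += 1
    return dict(tallies)


def needs_review(tallies, min_votes=3, max_approval=0.5):
    """Documents with enough votes and an approval rate below max_approval."""
    flagged = []
    for document_id, votes in tallies.items():
        total = votes["up"] + votes["down"]
        if total >= min_votes and votes["up"] / total < max_approval:
            flagged.append(document_id)
    return sorted(flagged)


events = [
    (True, [("faq.csv", 1)]),
    (False, [("old-guide.pdf", 4), ("old-guide.pdf", 5)]),
    (False, [("old-guide.pdf", 4), ("faq.csv", 2)]),
    (False, [("old-guide.pdf", 9)]),
]
tallies = feedback_by_document(events)
print(tallies)
print(needs_review(tallies))  # ['old-guide.pdf']
```

Requiring a minimum number of votes keeps a single thumbs-down from flagging an otherwise healthy document.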
Reindexing Documents
If you update the embedding model, change chunk settings, or modify a document's content, you can trigger a reindex. Reindexing re-processes the document through the full pipeline: extraction, chunking, and embedding. The old chunks are replaced with the new ones atomically, so there is no downtime during reindexing.
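An atomic replacement of this kind is typically achieved by building the new chunks off to the side and then swapping them in with a single step. The following in-memory sketch shows that pattern with a hypothetical ChunkIndex class; the real storage layer on the VPS is not described in this document.

```python
import threading


class ChunkIndex:
    def __init__(self):
        self._chunks = {}  # document_id -> list of chunk strings
        self._lock = threading.Lock()

    def get_chunks(self, document_id):
        return self._chunks.get(document_id, [])

    def reindex(self, document_id, text, chunker):
        # The slow work (splitting, and in a real system embedding) happens first,
        # while readers continue to see the old chunks.
        new_chunks = chunker(text)
        with self._lock:
            # One reference swap: readers see either the old list or the new one,
            # never a mixture of both.
            self._chunks[document_id] = new_chunks


def split_paragraphs(text):
    return [part for part in text.split("\n\n") if part]


index = ChunkIndex()
index.reindex("guide.md", "Intro\n\nSetup", split_paragraphs)
in_flight = index.get_chunks("guide.md")  # a search that started before the reindex

index.reindex("guide.md", "Intro\n\nSetup\n\nTroubleshooting", split_paragraphs)
print(in_flight)                     # ['Intro', 'Setup']
print(index.get_chunks("guide.md"))  # ['Intro', 'Setup', 'Troubleshooting']
```

A search that was already running keeps a consistent view of the old chunks, and any search that starts after the swap sees only the new ones.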
Data Privacy
All Knowledge Base data — uploaded files, extracted text, chunks, and embedding vectors — is stored on your VPS. Your data never leaves your instance. The embedding model runs locally on the VPS, so document content is never sent to external services for processing.
Related Documentation
- Pro Features Overview — Full list of Pro capabilities
- Analytics — Track how KB-sourced answers impact satisfaction
- Agent Builder — Create agents that leverage your Knowledge Base
- API Access — Query the Knowledge Base programmatically via the API