
RAG Document Ingestion

Ingest documents, chunk them, generate embeddings, and build a searchable knowledge base.

When to Use

  • You have company docs you want to make searchable
  • You want to build a Q&A system over your knowledge base
  • You need cited answers from internal documentation
  • You're setting up a RAG pipeline

Inputs

Documents (PDF, DOCX, TXT, MD) or URLs, plus chunking and embedding model configuration.

Outputs

Vector index in pgvector, chunk summary, ingestion stats.

Tools Required

pgvector, OpenAI Embeddings, PostgreSQL, Document parser

Skill Safety

Every 4M Labs skill is designed to be readable, auditable, and easy to modify before use. Treat skills like code: review them before running, check tool permissions, and keep secrets out of prompts.

SKILL.md

---
name: rag-document-ingestion
description: Gives an agent the ability to ingest documents, chunk them, generate embeddings, and build a searchable knowledge base.
inputs:
  - documents: Files (PDF, DOCX, TXT, MD) or URLs to process
  - chunk_strategy: Chunk size and overlap configuration
  - embedding_model: Which embedding model to use
outputs:
  - vector_index: Searchable embeddings in pgvector
  - chunk_summary: Overview of processed chunks and their sources
  - stats: Document count, chunk count, storage used
tools:
  - pgvector: Vector storage and similarity search
  - openai_embeddings: Text embedding generation
  - pdf_parser: Document text extraction
  - postgresql: Metadata and chunk storage
safety:
  - Review documents for PII before ingestion
  - Set access controls on the knowledge base
  - Monitor embedding API costs
  - Do not ingest sensitive credentials or secrets
---

# RAG Document Ingestion Skill

Ingest documents, chunk them, generate embeddings, and build a searchable knowledge base for RAG applications.

## When to Use

- You have company docs you want to make searchable
- You want to build a Q&A system over your knowledge base
- You need cited answers from internal documentation
- You're setting up a RAG pipeline

## How It Works

1. **Upload**: Accept PDF, DOCX, TXT, and Markdown files
2. **Extract**: Parse text content from each document
3. **Chunk**: Split into overlapping chunks (1000 characters, 200-character overlap)
4. **Embed**: Generate vector embeddings for each chunk
5. **Index**: Store in pgvector with an HNSW index for fast search
6. **Verify**: Test retrieval quality with sample queries
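
Steps 5 and 6 could look roughly like the sketch below. It only builds the SQL and leaves execution to whichever PostgreSQL driver is in use. The table name `doc_chunks`, its columns, the helper names, the `%s` placeholder style, and the 1536-dimension default are assumptions for illustration, not part of this skill; set the dimension to match the embedding model you choose.

```python
def pgvector_schema(table: str = "doc_chunks", dim: int = 1536) -> list[str]:
    """Return DDL for a chunk table plus an HNSW index (hypothetical schema)."""
    # Identifiers cannot be passed as query parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    if dim <= 0:
        raise ValueError("dim must be positive")
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id bigserial PRIMARY KEY, "
        "source text NOT NULL, "
        "chunk_index integer NOT NULL, "
        "content text NOT NULL, "
        f"embedding vector({dim}) NOT NULL);",
        # vector_cosine_ops pairs with the <=> (cosine distance) operator.
        f"CREATE INDEX IF NOT EXISTS {table}_embedding_hnsw "
        f"ON {table} USING hnsw (embedding vector_cosine_ops);",
    ]


def to_vector_literal(embedding) -> str:
    """Format a sequence of floats in pgvector's text form, e.g. [0.5,-1.0]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def search_sql(table: str = "doc_chunks") -> str:
    """Top-k query ordered by cosine distance.

    Parameters, in order: the query vector literal, then k.
    """
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    return (
        f"SELECT source, chunk_index, content FROM {table} "
        "ORDER BY embedding <=> %s::vector LIMIT %s;"
    )


if __name__ == "__main__":
    for statement in pgvector_schema():
        print(statement)
    print(search_sql())
```

For the verify step, run `search_sql()` with the embedding of each sample query and check that the returned `source` values point at the documents you expect.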

## Chunk Strategy

- Size: 1000 characters per chunk
- Overlap: 200 characters between chunks
- Split on: paragraph breaks, then sentences, then characters
- Preserve: Headers and section context
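
A minimal sketch of this strategy, assuming the input is already-extracted plain text. `split_text` and `_best_break` are illustrative names, not part of the skill, and header or section preservation is left out to keep the example short.

```python
import re

_SENTENCE_END = re.compile(r"[.!?]\s")


def _best_break(window: str, min_pos: int) -> int:
    """Pick a cut position in window: paragraph break, then sentence end, then hard cut."""
    para = window.rfind("\n\n")
    if para >= min_pos:
        return para + 2  # cut just after the blank line
    last_sentence = -1
    for match in _SENTENCE_END.finditer(window):
        if match.end() >= min_pos:
            last_sentence = match.end()
    if last_sentence != -1:
        return last_sentence
    return len(window)  # no natural boundary: cut at the character limit


def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first starts `overlap` characters before the previous cut.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Only accept cuts past the overlap so every iteration moves forward.
            end = start + _best_break(text[start:end], min_pos=overlap + 1)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        start = end - overlap
    return chunks


if __name__ == "__main__":
    sample = ("First paragraph. " * 40) + "\n\n" + ("Second paragraph. " * 40)
    for i, chunk in enumerate(split_text(sample)):
        print(i, len(chunk))
```

Because the overlap is measured in characters, a chunk can begin mid-word; a production splitter would likely snap the overlap start to a word or sentence boundary.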

## Example Prompt

"Ingest these 10 PDF documents into a RAG knowledge base. Chunk them with 1000-char chunks and 200-char overlap. Generate embeddings and store in pgvector. Then test retrieval with 3 sample queries."

## Related

- Recipe: /recipes/rag-company-docs
