Doc2Agent: How I Built a Fully Offline Document Agent in Less Than a Week

A multi-agent PDF Q&A system that runs fully on-device with Ollama, Pydantic AI, Chainlit, and SQLite—privacy-preserving, offline-first, and cheap to run.

Originally published on Medium.

Doc2Agent intro animation

A story about turning static contracts into interactive knowledge, all while keeping your data on your own machine.

Living in Germany means I’m regularly handed 30‑page contracts in German.

Sure, I could shove those PDFs into ChatGPT and hope for the best, but I think it already knows too much about me, and cloud APIs can disappear, change, or cost more than they’re worth.

With a few quiet days over the holidays, a MacBook Air (M1, 16 GB) and a blank GitHub repository, I decided to see how far I could get in a week building a local, offline document Q&A assistant.

The result is Doc2Agent: a multi‑agent PDF system that runs entirely on your machine using Ollama for local inference.

In this post, I’ll share the journey, the design decisions, and the lessons learned.

Why build yet another PDF assistant?

This is not a game-changing project, nor is it just a wrapper around a remote API. I wanted something different:

  • Privacy‑preserving: personal data never leaves your laptop.
  • Offline‑first: works in airplane mode, no internet required.
  • Multi‑lingual: understands German, English, and anything in between.
  • Repeatable: deterministic behaviour that can be tested and debugged.
  • Cheap: local inference means no per‑token costs.

This wasn’t about beating GPT‑4 or building the next Silicon Valley unicorn.

It was about scratching an itch and seeing how far modern tooling can take a solo developer in a few days.

Local setup and testing: what runs on an M1?

The first step was making sure I could run a decent model locally.

I originally wanted to use vLLM because of its throughput, but it requires CUDA, and on an M1 Mac, that’s a non‑starter. Instead, I installed Ollama, which wraps multiple models behind an OpenAI‑compatible API and provides Apple Silicon binaries.

Ollama local setup

I pulled a few candidates:

  • gemma2:2b — small, fast, and multi‑lingual.
  • ministral-3:3b — supports tool calling and multi‑lingual.
  • deepseek-r1:8b — stronger reasoning but can’t call tools.

I started with a simple script that loads the model and runs a few prompts through it.

import logging
import os
import time

import ollama

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

model = os.environ.get("MODEL_NAME", "gemma2:2b")
temperature = float(os.environ.get("TEMPERATURE", "0.2"))
prompt = (
    "You are a helpful assistant.\n"
    "Explain in 5 bullet points what RAG is and when to use it."
)
logger.info("llm.generate.start model=%s temperature=%s", model, temperature)
t0 = time.time()
response = ollama.generate(
    model=model,
    prompt=prompt,
    options={"temperature": temperature},
)
logger.info("llm.generate.done elapsed=%.2fs", time.time() - t0)

I wired up a quick Pydantic AI agent with a /hello tool to test things out.

Gemma2:2b ran smoothly on the M1 but struggled with tool usage.

ministral-3:3b struck a sweet spot for tool calling and multi‑language, while deepseek-r1:8b served well for pure reasoning.

Meanwhile, I set up a new Chainlit project — it’s essentially a chat UI out of the box and saved me from writing HTML.

The combination of Chainlit + Pydantic AI + Ollama became the foundation of Doc2Agent.

From one big agent to many small ones

My initial design was naïve: one agent would call a translation API, another would parse PDFs, another would search a vector database, and so on.

It looked something like this:

Initial multi-service design

I quickly realised that this wouldn't work. Translation turned out to be unnecessary: local models like ministral-3:3b handle German well enough, especially when you pre-process the document.

I also decided to skip vector databases for now and rely on SQLite's full-text search (FTS5), keeping the first iteration's stack as small as possible.

The breakthrough came when I introduced multiple cooperating agents.

Rather than one giant agent juggling everything, why not have specialised agents that talk to each other?

Version 1: Task‑oriented agents

The first refactor split the logic into four agents: an Orchestrator, a Reader, an Extractor, and a Formatter. It worked, but it felt wrong: most of the time the Orchestrator ended up doing everything itself.

Version 1: four task-oriented agents

Version 2: Main/Reviewer/Validator

After experimenting, I landed on a clean three‑agent design:

Version 2: Main, Reviewer, and Validator

Main Agent orchestrates the chat. It receives the user’s question, queries the document store, decides which tools to call, and drafts an answer. I mainly used ministral-3:3b.

Reviewer Agent uses deepseek-r1:8b for pure reasoning. It checks the draft for coherence, hallucinations, and formatting. It can reject the main agent’s answer.

Validator Agent again uses ministral-3:3b but with a different task. It extracts personal claims (names, dates, account numbers) from the draft and compares them against a user‑provided JSON file.

If something doesn't match (e.g. the contract spells my name incorrectly), the validator points it out before the answer is presented.

This architecture was simple, scalable, and easy to reason about. Each agent’s responsibilities were well defined, and because they run locally, you can fan them out without worrying about API costs.
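The Validator Agent's comparison step can be sketched in plain Python. This is a minimal, hypothetical illustration, not Doc2Agent's actual code: `check_claims` and its field names are my own, and loading the user-provided JSON file into the `profile` dict is assumed to happen elsewhere.

```python
def check_claims(extracted: dict[str, str], profile: dict[str, str]) -> list[str]:
    """Compare claims extracted from a draft answer against the user's
    ground-truth profile (loaded from the user-provided JSON file).

    Fields absent from the profile are skipped; comparison is
    case- and whitespace-insensitive. Returns human-readable mismatches.
    """
    mismatches: list[str] = []
    for field, claimed in extracted.items():
        expected = profile.get(field)
        if expected is None:
            continue  # nothing to check this claim against
        if claimed.strip().lower() != str(expected).strip().lower():
            mismatches.append(
                f"{field}: draft says {claimed!r}, profile says {expected!r}"
            )
    return mismatches
```

Any non-empty result is surfaced to the user before the answer is shown, which is exactly the "points it out" behaviour described above.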

Ingesting and storing documents

Making documents queryable required two parts:

  1. Parsing PDF pages into structured text.
  2. Enriching pages with semantic metadata (people, dates, headings).

For parsing, I chose PyMuPDF (fitz). It’s fast and extracts text, tables and images without the headaches of pypdf or pdfplumber.

For enrichment, I wrote a small ingestion agent that runs an LLM over each page and returns a PageSchema (a Pydantic model) with extracted names, dates, and headings.

Everything is stored in SQLite tables with FTS5 enabled, and I keep a JSON export of small files so you can inspect the intermediate data.

After ingestion, the following models would be populated (there are more definitions in the code):

from pydantic import BaseModel, Field


class StructuredDocument(BaseModel):
    pages: list[DocumentPage]
    sections: list[str]
    citable_spans: list[CitableSpan] = []
    doc_type: str | None = None


# New schemas for refactored ingestion pipeline
class DocumentMetadata(BaseModel):
    doc_id: str
    file_path: str
    file_name: str
    file_size_bytes: int
    page_count: int
    title: str | None = None
    author: str | None = None
    subject: str | None = None
    file_mod_time: float | None = None  # Unix timestamp for cache invalidation
    file_hash: str | None = None  # MD5 hash of file content


class Heading(BaseModel):
    text: str
    level: int  # 1 = H1, 2 = H2, 3 = H3
    start_pos: int | None = None


class PageSchema(BaseModel):
    page_num: int
    char_count: int
    word_count: int
    has_tables: bool
    has_images: bool
    contains_names: bool = False
    contains_dates: bool = False
    contains_locations: bool = False
    contains_signatures: bool = False
    contains_personal_info: bool = False
    headings: list[Heading] = Field(default_factory=list)
    languages: list[str] = Field(default_factory=list)
    keywords: list[str] = Field(default_factory=list)
    text: str


class DocumentSchema(BaseModel):
    metadata: DocumentMetadata
    pages: list[PageSchema]

The sequence diagram of the ingestion process appears in the original article.

Nothing here depends on an internet connection. Once the file is ingested, subsequent queries hit the local database instead of re‑calling the LLM.

Why SQLite instead of a vector database?

  • It’s built‑in, zero configuration, no external service.
  • It supports prefix queries out of the box, and stemming via the built-in porter tokenizer.
  • It’s plenty fast for documents under a few hundred pages.
  • You can inspect the tables manually with any SQL client.

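As a minimal illustration of what FTS5 gives you with zero setup (illustrative schema and data, not Doc2Agent's actual tables):

```python
import sqlite3

# In-memory database for illustration; Doc2Agent persists to a file-backed store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(doc_id, page_num, text)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("lease", "1", "The tenant agrees to a notice period of three months."),
        ("lease", "2", "The monthly rent includes heating and water."),
    ],
)
# MATCH supports prefix queries ("notice*"); rank orders results by relevance.
rows = conn.execute(
    "SELECT doc_id, page_num FROM pages WHERE pages MATCH ? ORDER BY rank",
    ("notice*",),
).fetchall()
print(rows)
```

Because it is just a table, the same data is one `SELECT` away in any SQL client, which is the inspectability point above.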
In future iterations, I may add a lightweight vector index.

Querying: orchestrating the agents

When you ask a question, the ChatAssistant (a thin Python class) executes the following steps:

  1. Load or ingest: if the PDF hasn’t been seen before or its modification time has changed, run the ingestion pipeline. Otherwise, load from the cache.
  2. Prepare turn: compile the conversation history into a structured prompt.
  3. Query storage: call query_pages and/or search_fts on the SQLite store.
  4. Draft an answer: ask Main Agent to call the appropriate tools and produce a draft answer.
  5. Review: have Reviewer Agent check the draft.
  6. Validate: extract any personal claims and compare against the user‑provided JSON via the Validator Agent.

  7. Return: assemble the final answer and stream it back to the Chainlit UI.

All of these calls happen locally and concurrently, so the system feels snappy even on a laptop. A short excerpt from my logs shows the orchestration in action:

2025-12-25 22:59:49,236 INFO chat_assistant Initializing ChatAssistant backend=local
2025-12-25 22:59:49,236 DEBUG agents Using Ollama model: qwen3:8b
2025-12-25 23:00:08,440 INFO chat_assistant Loading PDF: …/fe7a6251-01fe-4660-9a4b-bd21a137db0e.pdf
2025-12-25 23:00:08,647 INFO chat_assistant Loaded pages=3 spans=3 chars=3908
2025-12-25 23:00:09,012 INFO main_agent Running tool: query_pages
2025-12-25 23:00:09,314 INFO reviewer_agent Reviewed draft (OK)
2025-12-25 23:00:09,600 INFO validator_agent No personal claims found

You can enable or disable logging to a file via environment variables (LOG_TO_FILE and LOG_LEVEL), making debugging and performance tuning much easier.
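Wiring those environment variables into Python's standard logging might look like the sketch below. The variable names (LOG_LEVEL, LOG_TO_FILE) come from the post; the handler setup and log file name are assumptions of mine:

```python
import logging
import os


def configure_logging() -> logging.Logger:
    """Configure logging from environment variables (sketch).

    LOG_LEVEL picks the verbosity; LOG_TO_FILE additionally mirrors
    log lines into a file next to the console output.
    """
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    handlers: list[logging.Handler] = [logging.StreamHandler()]
    if os.environ.get("LOG_TO_FILE", "").lower() in {"1", "true", "yes"}:
        handlers.append(logging.FileHandler("doc2agent.log"))
    logging.basicConfig(
        level=getattr(logging, level_name, logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        handlers=handlers,
        force=True,  # override any earlier basicConfig call
    )
    return logging.getLogger("chat_assistant")
```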

The full end-to-end system diagram is in the original article.

User experience: Chainlit as the UI

Chainlit gave me a clean chat UI, document upload widget, document selection menu, and slash commands (/docs, /reset) out of the box. It saved hours of work on HTML/CSS/JS and let me focus on the backend.

Running Doc2Agent is as simple as:

ollama pull ministral-3:3b
ollama pull deepseek-r1:8b
uv sync
uv run chainlit run app/chainlit_app.py

Open your browser to the printed URL, upload a PDF, and start asking questions. You can switch between cached documents, flush the query cache, and inspect the enriched pages via /docs.

Chainlit document UI

You can ask general questions about the document, validate personal information against your profile, and request summaries or translations, all while keeping the models fully under your control and every log visible locally.

Chainlit chat example

What’s next?

Doc2Agent is far from done. Here are some things I’d like to explore:

  • Quality and user experience: I focused on getting a minimal system working, so systematic quality evaluation and UX polish are still ahead.
  • Vision models: use a small vision LLM to extract information from scanned images or tables that PyMuPDF can’t parse.
  • Fine‑tuning: adapt a compact model (e.g. Gemma) to my personal style and domain vocabulary.
  • Better prompts and tools: refine the system prompts and add more domain‑specific tools or even general tools.
  • Vector search: optionally add a small vector index alongside FTS5 for semantic retrieval.

Project code

The full project is open source: github.com/dallal9/Doc2Agent

Feel free to clone it, run it locally, and adapt it to your own use case. Pull requests, issues, and feedback are more than welcome. If you try it on your machine, I’d genuinely love to hear what worked and what didn’t.

Thanks for reading. I hope the project encourages you to experiment with your own local-first, offline AI systems.