Overview
Purpose and Scope
This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system's purpose, core capabilities, and high-level architecture.
For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.
Sources: README.md:1-3
Problem Statement
DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The converter addresses the following limitations of that platform:
| Problem | Solution |
|---|---|
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook's built-in search |
| Platform-specific formatting | Conversion to standard Markdown |
Sources: README.md:3-15
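The "JavaScript payload extraction" row can be illustrated with a minimal sketch. It assumes that diagram source is embedded in the page's JavaScript as Markdown-style mermaid fences inside string literals, with newlines escaped as a backslash followed by `n`; the actual DeepWiki payload format may differ. The name `extract_mermaid_blocks` is hypothetical and only the standard library is used.

```python
import re
from typing import List

# Built at runtime so the fence marker is explicit.
FENCE = "`" * 3

# Assumption: a diagram looks like <fence>mermaid\n ... <fence> in the payload,
# where "\n" is the two-character escape sequence, not a real newline.
MERMAID_BLOCK = re.compile(
    re.escape(FENCE) + r"mermaid\\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_mermaid_blocks(js_payload: str) -> List[str]:
    """Return the unescaped source of every mermaid fence found in the payload."""
    blocks = []
    for match in MERMAID_BLOCK.finditer(js_payload):
        escaped = match.group(1)
        # Undo only the two escapes this sketch cares about.
        source = escaped.replace("\\n", "\n").replace('\\"', '"')
        blocks.append(source.strip())
    return blocks
```

Each returned string is plain Mermaid source that can then be matched against the converted Markdown.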
Core Capabilities
The system provides the following capabilities through environment variable configuration:
- Generic Repository Support: Works with any GitHub repository indexed by DeepWiki via REPO environment variable
- Auto-Detection: Extracts repository metadata from Git remotes when available
- Hierarchy Preservation: Maintains wiki page numbering and section structure
- Diagram Intelligence: Extracts ~461 total diagrams, matches ~48 with sufficient context using fuzzy matching
- Dual Output Modes: Full mdBook build or markdown-only extraction via MARKDOWN_ONLY flag
- No Authentication: Public HTTP scraping without API keys or credentials
- Containerized Deployment: Single Docker image with all dependencies
Sources: README.md:5-15 README.md:42-51
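As an illustration of the Auto-Detection capability, the sketch below derives an owner/repo value from a Git remote URL. The regular expression and the function name `repo_from_remote` are assumptions for illustration; the source does not describe how the converter parses remotes.

```python
import re
from typing import Optional

# Accepts SSH (git@github.com:owner/repo.git) and HTTPS GitHub remotes.
GITHUB_REMOTE = re.compile(
    r"github\.com[:/](?P<owner>[^/\s]+)/(?P<repo>[^/\s]+?)(?:\.git)?/?$"
)


def repo_from_remote(remote_url: str) -> Optional[str]:
    """Return 'owner/repo' for a GitHub remote URL, or None for anything else."""
    match = GITHUB_REMOTE.search(remote_url.strip())
    if match is None:
        return None
    return f"{match['owner']}/{match['repo']}"
```

The input would typically come from `git config --get remote.origin.url`, and the result can serve as a fallback when REPO is not set.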
System Components
The system consists of three primary executable components coordinated by a shell orchestrator:
Main Components
graph TB
User["docker run"]
subgraph Container["deepwiki-scraper Container"]
BuildDocs["build-docs.sh\n(Shell Orchestrator)"]
Scraper["deepwiki-scraper.py\n(Python)"]
MdBook["mdbook\n(Rust Binary)"]
MermaidPlugin["mdbook-mermaid\n(Rust Binary)"]
end
subgraph External["External Systems"]
DeepWiki["deepwiki.com\n(HTTP Scraping)"]
GitHub["github.com\n(Edit Links)"]
end
subgraph Output["Output Directory"]
MarkdownDir["markdown/\n(.md files)"]
BookDir["book/\n(HTML site)"]
ConfigFile["book.toml"]
end
User -->|Environment Variables| BuildDocs
BuildDocs -->|Step 1: Execute| Scraper
BuildDocs -->|Step 4: Execute| MdBook
Scraper -->|HTTP GET| DeepWiki
Scraper -->|Writes| MarkdownDir
MdBook -->|Preprocessor| MermaidPlugin
MdBook -->|Generates| BookDir
BookDir -.->|Edit links| GitHub
BuildDocs -->|Copies| ConfigFile
style BuildDocs fill:#fff4e1
style Scraper fill:#e8f5e9
style MdBook fill:#f3e5f5
| Component | Language | Purpose | Key Functions |
|---|---|---|---|
| build-docs.sh | Shell | Orchestration | Parse env vars, generate configs, call executables |
| deepwiki-scraper.py | Python 3.12 | Content extraction | HTTP scraping, HTML parsing, diagram matching |
| mdbook | Rust | Site generation | Markdown to HTML, navigation, search |
| mdbook-mermaid | Rust | Diagram rendering | Inject JavaScript/CSS for Mermaid.js |
Sources: README.md:146-157 Diagram 1, Diagram 5
Processing Pipeline
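The system executes a three-phase pipeline. Phases 2 and 3 run only when the MARKDOWN_ONLY environment variable is false or unset; with MARKDOWN_ONLY=true, only phase 1 runs and the Markdown files are copied to the output directory.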
Phase Details
stateDiagram-v2
[*] --> ParseEnvVars
ParseEnvVars --> ExecuteScraper : build-docs.sh phase 1
state ExecuteScraper {
[*] --> FetchHTML
FetchHTML --> ConvertMarkdown : html2text
ConvertMarkdown --> ExtractDiagrams : Regex on JS payload
ExtractDiagrams --> FuzzyMatch : Progressive chunks
FuzzyMatch --> WriteMarkdown : output/markdown/
WriteMarkdown --> [*]
}
ExecuteScraper --> CheckMode
state CheckMode <<choice>>
CheckMode --> GenerateBookToml : MARKDOWN_ONLY=false
CheckMode --> CopyOutput : MARKDOWN_ONLY=true
GenerateBookToml --> GenerateSummary : build-docs.sh phase 2
GenerateSummary --> ExecuteMdbook : build-docs.sh phase 3
state ExecuteMdbook {
[*] --> InitBook
InitBook --> CopyMarkdown : mdbook init
CopyMarkdown --> InstallMermaid : mdbook-mermaid install
InstallMermaid --> BuildHTML : mdbook build
BuildHTML --> [*] : output/book/
}
ExecuteMdbook --> CopyOutput
CopyOutput --> [*]
| Phase | Script | Key Operations | Output |
|---|---|---|---|
| 1 | deepwiki-scraper.py | HTTP fetch, BeautifulSoup4 parse, html2text conversion, fuzzy diagram matching | markdown/*.md |
| 2 | build-docs.sh | Generate book.toml, generate SUMMARY.md | Configuration files |
| 3 | mdbook + mdbook-mermaid | Markdown processing, Mermaid.js asset injection, HTML generation | book/ directory |
Sources: README.md:121-145 Diagram 2
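The "progressive chunks" step of fuzzy diagram matching can be sketched as follows. This is an illustration under stated assumptions, not the scraper's actual code: the chunk sizes, the minimum chunk length, the normalization rules, and the names `normalize` and `find_anchor` are all hypothetical. The idea is to take the text that preceded a diagram, then search the page for progressively shorter tails of it until one is found.

```python
import re
from typing import Optional

# Hypothetical chunk sizes, tried from most to least specific.
CHUNK_SIZES = (300, 200, 150, 100, 80)
MIN_CHUNK = 20  # assumed lower bound; shorter chunks would match too easily


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_anchor(context: str, page_markdown: str) -> Optional[int]:
    """Return the offset in the normalized page just after the matched context.

    Returns None when no chunk of the context occurs in the page.
    """
    norm_context = normalize(context)
    norm_page = normalize(page_markdown)
    for size in CHUNK_SIZES:
        chunk = norm_context[-size:]  # the text closest to the diagram
        if len(chunk) < MIN_CHUNK:
            break
        pos = norm_page.find(chunk)
        if pos != -1:
            return pos + len(chunk)
    return None
```

Starting with the longest chunk favors unambiguous placements; shorter chunks are a fallback when HTML-to-Markdown conversion has altered part of the surrounding text.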
Input and Output
Input Requirements
| Input | Format | Source | Example |
|---|---|---|---|
| REPO | owner/repo | Environment variable | facebook/react |
| BOOK_TITLE | String | Environment variable (optional) | React Documentation |
| BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
| MARKDOWN_ONLY | true/false | Environment variable (optional) | false |
Sources: README.md:42-51
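The inputs in the table above can be modeled as a small configuration object. The sketch below is written in Python for illustration only (the source states that build-docs.sh parses the environment variables), and the default title and author values are assumptions, as are the names `BuildConfig` and `load_config`.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BuildConfig:
    repo: str
    book_title: str
    book_authors: str
    markdown_only: bool


def load_config(env: Mapping[str, str], detected_repo: Optional[str] = None) -> BuildConfig:
    """Build a config from environment-style variables, falling back to a detected repo."""
    repo = env.get("REPO") or detected_repo
    if not repo or repo.count("/") != 1:
        raise ValueError("REPO must look like 'owner/repo'")
    owner, _name = repo.split("/")
    return BuildConfig(
        repo=repo,
        book_title=env.get("BOOK_TITLE") or "Documentation",  # assumed default
        book_authors=env.get("BOOK_AUTHORS") or owner,  # assumed default
        markdown_only=env.get("MARKDOWN_ONLY", "false").strip().lower() == "true",
    )
```

Calling `load_config(os.environ)` would read the real process environment; passing a plain dictionary keeps the function easy to test.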
Output Artifacts
Full Build Mode (MARKDOWN_ONLY=false or unset):
output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml
Markdown-Only Mode (MARKDOWN_ONLY=true):
output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...
Sources: README.md:89-119
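The file layout shown above suggests a mapping from wiki page numbers to output paths: top-level pages sit directly under markdown/, and numbered subpages go into a section-N/ directory. The sketch below reproduces that pattern. The slug rules and the names `slugify` and `page_path` are assumptions inferred from the example tree, not taken from the scraper.

```python
import re
from pathlib import PurePosixPath


def slugify(title: str) -> str:
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def page_path(number: str, title: str) -> PurePosixPath:
    """Map a wiki page number such as '3.1' and its title to a relative output path."""
    parts = number.split(".")
    filename = f"{'-'.join(parts)}-{slugify(title)}.md"
    if len(parts) == 1:
        return PurePosixPath(filename)  # top-level page
    return PurePosixPath(f"section-{parts[0]}") / filename  # subpage of section N
```

For example, page 3.1 titled "Workspace" maps to section-3/3-1-workspace.md, matching the tree above.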
Technical Stack
The system combines multiple technology stacks in a single container using Docker multi-stage builds:
Runtime Dependencies
| Component | Version | Purpose | Installation Method |
|---|---|---|---|
| Python | 3.12-slim | Scraping runtime | Base image |
| requests | Latest | HTTP client | uv pip install |
| beautifulsoup4 | Latest | HTML parser | uv pip install |
| html2text | Latest | HTML to Markdown | uv pip install |
| mdbook | Latest | Documentation builder | Compiled from source (Rust) |
| mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |
Build Architecture
The Dockerfile uses a two-stage build:
- Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
- Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)
Sources: README.md:146-157 Diagram 3
File System Interaction
The system interacts with three key filesystem locations:
Temporary Directory Workflow:
- deepwiki-scraper.py writes initial markdown to /tmp/wiki_temp/
- After diagram enhancement, files move atomically to /output/markdown/
- build-docs.sh copies final HTML to /output/book/
This ensures no partial states exist in the output directory.
Sources: README.md:220-227 README.md:136
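The staging-then-publish pattern described above can be sketched as follows. This is an illustration, not the scraper's code: the name `publish_files` is hypothetical, and the sketch assumes the staging and final directories are on the same filesystem, because `os.replace` is only atomic in that case. Across filesystems (for example, a container's /tmp and a mounted volume) a move degrades to copy-and-delete.

```python
import os
from pathlib import Path
from typing import List


def publish_files(staging: Path, final: Path) -> List[Path]:
    """Move every staged .md file into the final directory, one atomic rename per file."""
    final.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(staging.rglob("*.md")):
        dest = final / src.relative_to(staging)
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Atomic on the same filesystem: readers see the old file or the new one,
        # never a partially written one.
        os.replace(src, dest)
        moved.append(dest)
    return moved
```

Because files are fully written in the staging directory before being renamed, an interrupted run leaves incomplete work in staging rather than in the output directory.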
Configuration Philosophy
The system operates on three configuration principles:
- Environment-Driven: All customization via environment variables, no file editing required
- Auto-Detection: Intelligent defaults from Git remotes (repository URL, author name)
- Zero-Configuration: Minimal required inputs (REPO or auto-detect from current directory)
Minimal Example:
This single command triggers the complete extraction, transformation, and build pipeline.
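To show how little input an environment-driven invocation needs, the sketch below assembles a `docker run` argument list in Python. The image name `deepwiki-scraper`, the /output mount point, and the function name `build_docker_command` are assumptions for illustration; only the REPO and MARKDOWN_ONLY variables come from the configuration described on this page.

```python
from typing import List


def build_docker_command(
    repo: str,
    output_dir: str,
    image: str = "deepwiki-scraper",  # assumed image name
    markdown_only: bool = False,
) -> List[str]:
    """Return an argument list for `docker run` that passes config via environment variables."""
    cmd = ["docker", "run", "--rm", "-e", f"REPO={repo}"]
    if markdown_only:
        cmd += ["-e", "MARKDOWN_ONLY=true"]
    # Assumed mount: the host directory receives the container's /output tree.
    cmd += ["-v", f"{output_dir}:/output", image]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell-quoting issues with paths that contain spaces.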
For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.
Sources: README.md:22-51 README.md:220-227