This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
Purpose and Scope
This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system’s purpose, core capabilities, and high-level architecture.
For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.
Sources: README.md:1-3
Problem Statement
DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The system addresses the following limitations:
| Problem | Solution |
|---|---|
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook’s built-in search |
| Platform-specific formatting | Conversion to standard Markdown |
Sources: README.md:3-15
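The table names fuzzy matching as the way diagrams are tied back to page content, but does not say which algorithm is used. The following is a minimal sketch of one way such matching could work, assuming each extracted diagram carries a snippet of nearby context text. It uses the standard library's difflib; the function names and the 0.6 threshold are hypothetical and not taken from the project.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def context_score(context: str, page_text: str) -> float:
    """Fraction of the context covered by its longest common run with the page."""
    ctx, page = normalize(context), normalize(page_text)
    if not ctx:
        return 0.0
    matcher = SequenceMatcher(None, ctx, page, autojunk=False)
    match = matcher.find_longest_match(0, len(ctx), 0, len(page))
    return match.size / len(ctx)


def best_page_for(
    context: str, pages: Dict[str, str], threshold: float = 0.6
) -> Optional[str]:
    """Pick the page that best matches the diagram context, or None if too weak.

    The threshold is an illustrative assumption, not a value from the project.
    """
    scored = {name: context_score(context, text) for name, text in pages.items()}
    name = max(scored, key=scored.get, default=None)
    if name is None or scored[name] < threshold:
        return None
    return name
```

Returning None for weak matches mirrors the idea that only diagrams with sufficient context are placed into a page.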
Core Capabilities
The system provides the following capabilities through environment variable configuration:
- Generic Repository Support: Works with any GitHub repository indexed by DeepWiki via the REPO environment variable
- Auto-Detection: Extracts repository metadata from Git remotes when available
- Hierarchy Preservation: Maintains wiki page numbering and section structure
- Diagram Intelligence: Extracts ~461 total diagrams, matches ~48 with sufficient context using fuzzy matching
- Dual Output Modes: Full mdBook build or markdown-only extraction via the MARKDOWN_ONLY flag
- No Authentication: Public HTTP scraping without API keys or credentials
- Containerized Deployment: Single Docker image with all dependencies
Sources: README.md:5-15 README.md:42-51
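Auto-detection from a Git remote can be illustrated with a short sketch. It assumes the remote URL has already been read (for example from `git config --get remote.origin.url`) and only handles GitHub-style SSH and HTTPS remotes. The function name and regular expression are hypothetical; the project's own detection logic may differ.

```python
import re
from typing import Optional

# Matches both SSH (git@github.com:owner/repo.git) and HTTPS remotes.
_GITHUB_REMOTE = re.compile(
    r"github\.com[:/](?P<owner>[^/\s]+)/(?P<repo>[^/\s]+?)(?:\.git)?/?$"
)


def repo_from_remote(remote_url: str) -> Optional[str]:
    """Return 'owner/repo' for a GitHub remote URL, or None if it is not one."""
    match = _GITHUB_REMOTE.search(remote_url.strip())
    if match is None:
        return None
    return f"{match.group('owner')}/{match.group('repo')}"
```

The result has the same owner/repo shape that the REPO environment variable expects, so it could serve as a fallback when REPO is unset.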
System Components
The system consists of three primary executable components coordinated by a shell orchestrator:
Main Components
| Component | Language | Purpose | Key Functions |
|---|---|---|---|
| build-docs.sh | Shell | Orchestration | Parse env vars, generate configs, call executables |
| deepwiki-scraper.py | Python 3.12 | Content extraction | HTTP scraping, HTML parsing, diagram matching |
| mdbook | Rust | Site generation | Markdown to HTML, navigation, search |
| mdbook-mermaid | Rust | Diagram rendering | Inject JavaScript/CSS for Mermaid.js |
Sources: README.md:146-157 Diagram 1, Diagram 5
Processing Pipeline
The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable:
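The conditional execution can be sketched as follows. This is an illustration only: it assumes MARKDOWN_ONLY is considered set when its value is the string "true", and that phases 2 and 3 are both skipped in that mode, which is consistent with the markdown-only output layout. The actual orchestration lives in build-docs.sh, and the names below are hypothetical.

```python
from typing import List, Mapping

PHASES = [
    "scrape: deepwiki-scraper.py writes markdown",
    "configure: build-docs.sh generates book.toml and SUMMARY.md",
    "build: mdbook and mdbook-mermaid render HTML",
]


def markdown_only(env: Mapping[str, str]) -> bool:
    """Treat MARKDOWN_ONLY as enabled only for the value 'true' (an assumption)."""
    return env.get("MARKDOWN_ONLY", "false").strip().lower() == "true"


def phases_to_run(env: Mapping[str, str]) -> List[str]:
    """Phase 1 always runs; phases 2 and 3 are skipped in markdown-only mode."""
    return PHASES[:1] if markdown_only(env) else list(PHASES)
```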
Phase Details
| Phase | Script | Key Operations | Output |
|---|---|---|---|
| 1 | deepwiki-scraper.py | HTTP fetch, BeautifulSoup4 parse, html2text conversion, fuzzy diagram matching | markdown/*.md |
| 2 | build-docs.sh | Generate book.toml, generate SUMMARY.md | Configuration files |
| 3 | mdbook + mdbook-mermaid | Markdown processing, Mermaid.js asset injection, HTML generation | book/ directory |
Sources: README.md:121-145 Diagram 2
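Phase 2 generates SUMMARY.md, the file mdBook reads to build its table of contents. A minimal sketch of how such a file could be derived from the scraped file names is shown below, written in Python for illustration even though the project does this step in build-docs.sh. It assumes every file name starts with its page number, that titles can be recovered from the file name, and that files inside a section directory nest one level deeper. The helper names and the file 3-architecture.md are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import List, Tuple


def title_from_path(path: str) -> str:
    """Derive a chapter title from a file name such as '3-1-workspace.md'."""
    stem = PurePosixPath(path).stem
    words = re.sub(r"^\d+(?:-\d+)?-", "", stem)  # drop the numeric prefix
    return words.replace("-", " ").title()


def sort_key(path: str) -> Tuple[int, int]:
    """Order by page number, then subsection number (0 for top-level pages)."""
    nums = re.match(r"(\d+)(?:-(\d+))?", PurePosixPath(path).name)
    if nums is None:
        raise ValueError(f"file name does not start with a page number: {path}")
    return int(nums.group(1)), int(nums.group(2) or 0)


def build_summary(paths: List[str]) -> str:
    """Render an mdBook SUMMARY.md with one list item per markdown file."""
    lines = ["# Summary", ""]
    for path in sorted(paths, key=sort_key):
        # One directory level below the root means one level of list nesting.
        indent = "  " * (len(PurePosixPath(path).parts) - 1)
        lines.append(f"{indent}- [{title_from_path(path)}]({path})")
    return "\n".join(lines) + "\n"
```

Sorting by the numeric prefix rather than alphabetically keeps page 10 after page 9, which matters for preserving the wiki's numbering.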
Input and Output
Input Requirements
| Input | Format | Source | Example |
|---|---|---|---|
| REPO | owner/repo | Environment variable | facebook/react |
| BOOK_TITLE | String | Environment variable (optional) | React Documentation |
| BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
| MARKDOWN_ONLY | true/false | Environment variable (optional) | false |
Sources: README.md:42-51
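To show how these inputs could flow into mdBook's configuration, here is a minimal sketch that resolves the variables and renders a small book.toml. The default title and author values are assumptions made for the example, Git-remote auto-detection is left out, and the file the project actually generates may contain additional keys. The `[book]` and `[preprocessor.mermaid]` tables are standard mdBook and mdbook-mermaid configuration.

```python
import json
from typing import Dict, Mapping


def resolve_config(env: Mapping[str, str]) -> Dict[str, str]:
    """Read REPO, BOOK_TITLE and BOOK_AUTHORS, filling in assumed defaults."""
    repo = env.get("REPO", "").strip()
    if "/" not in repo:
        raise ValueError("REPO must look like owner/repo")
    owner, name = repo.split("/", 1)
    return {
        "repo": repo,
        "title": env.get("BOOK_TITLE") or f"{name} Documentation",  # assumed default
        "authors": env.get("BOOK_AUTHORS") or owner,  # assumed default
    }


def toml_string(value: str) -> str:
    """Quote a value as a TOML basic string; JSON escaping suffices for typical titles."""
    return json.dumps(value, ensure_ascii=False)


def render_book_toml(config: Mapping[str, str]) -> str:
    """Render a minimal book.toml with the Mermaid preprocessor enabled."""
    return (
        "[book]\n"
        f"title = {toml_string(config['title'])}\n"
        f"authors = [{toml_string(config['authors'])}]\n"
        "\n"
        "[preprocessor.mermaid]\n"
        'command = "mdbook-mermaid"\n'
    )
```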
Output Artifacts
Full Build Mode (MARKDOWN_ONLY=false or unset):
output/
├── markdown/
│ ├── 1-overview.md
│ ├── 2-quick-start.md
│ ├── section-3/
│ │ ├── 3-1-workspace.md
│ │ └── 3-2-parser.md
│ └── ...
├── book/
│ ├── index.html
│ ├── searchindex.json
│ ├── mermaid.min.js
│ └── ...
└── book.toml
Markdown-Only Mode (MARKDOWN_ONLY=true):
output/
└── markdown/
├── 1-overview.md
├── 2-quick-start.md
└── ...
Sources: README.md:89-119
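The naming pattern in the trees above can be expressed as a small mapping from a wiki page number and title to an output path. The sketch below infers the rule from the example layout only: top-level pages sit in the root of markdown/, and subsection pages go into a section-N directory. Representing page numbers as strings like "3.1" and the function names are assumptions.

```python
import re


def slugify(title: str) -> str:
    """Turn a page title into a lowercase, hyphen-separated slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def output_path(page_number: str, title: str) -> str:
    """Map a page number such as '3.1' and a title to a relative markdown path."""
    parts = page_number.split(".")
    filename = f"{'-'.join(parts)}-{slugify(title)}.md"
    if len(parts) > 1:
        # Subsection pages are grouped under a directory named after their parent.
        return f"section-{parts[0]}/{filename}"
    return filename
```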
Technical Stack
The system combines multiple technology stacks in a single container using Docker multi-stage builds:
Runtime Dependencies
| Component | Version | Purpose | Installation Method |
|---|---|---|---|
| Python | 3.12-slim | Scraping runtime | Base image |
| requests | Latest | HTTP client | uv pip install |
| beautifulsoup4 | Latest | HTML parser | uv pip install |
| html2text | Latest | HTML to Markdown | uv pip install |
| mdbook | Latest | Documentation builder | Compiled from source (Rust) |
| mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |
Build Architecture
The Dockerfile uses a two-stage build:
- Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
- Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)
Sources: README.md:146-157 Diagram 3
File System Interaction
The system interacts with three key filesystem locations:
Temporary Directory Workflow:
- deepwiki-scraper.py writes initial markdown to /tmp/wiki_temp/
- After diagram enhancement, files move atomically to /output/markdown/
- build-docs.sh copies final HTML to /output/book/
This ensures no partial states exist in the output directory.
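The stage-then-publish idea can be sketched as below. This is an illustration under stated assumptions, not the project's code: it relies on `os.rename`, which swaps a directory into place in a single step only when the staging and destination directories are on the same filesystem, and raises OSError otherwise. The function name and the ".old" backup suffix are hypothetical.

```python
import os
import shutil
from pathlib import Path


def publish(staging: Path, final: Path) -> None:
    """Swap a fully written staging directory into place at `final`.

    Assumes both paths are on the same filesystem so each rename is one step.
    """
    final.parent.mkdir(parents=True, exist_ok=True)
    backup = final.with_name(final.name + ".old")
    if backup.exists():
        shutil.rmtree(backup)  # clear leftovers from an interrupted earlier run
    if final.exists():
        os.rename(final, backup)  # move the previous output aside
    os.rename(staging, final)  # readers now see only the complete new output
    shutil.rmtree(backup, ignore_errors=True)
```

Because the destination is only ever replaced by a directory that was finished beforehand, a reader never observes a half-written set of markdown files.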
Sources: README.md:220-227 README.md:136
Configuration Philosophy
The system operates on three configuration principles:
- Environment-Driven: All customization via environment variables, no file editing required
- Auto-Detection: Intelligent defaults from Git remotes (repository URL, author name)
- Zero-Configuration: Minimal required inputs (REPO or auto-detect from current directory)
Minimal Example :
This single command triggers the complete extraction, transformation, and build pipeline.
For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.
Sources: README.md:22-51 README.md:220-227