This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
Purpose and Scope
This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system’s purpose, core capabilities, and high-level architecture.
For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.
Sources: README.md:1-3
Problem Statement
DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The converter addresses the following limitations of that platform:
| Problem | Solution |
|---|---|
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook’s built-in search |
| Platform-specific formatting | Conversion to standard Markdown |
Sources: README.md:3-15
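The fuzzy matching named in the table can be pictured with a small sketch. It assumes each extracted diagram carries a snippet of the text that surrounded it, and it uses Python's standard difflib as a stand-in for whatever the scraper actually does. The function name best_anchor, the 0.6 threshold, and the sample paragraphs are illustrative and not taken from the project.

```python
from difflib import SequenceMatcher


def best_anchor(context, paragraphs, threshold=0.6):
    """Return the index of the paragraph most similar to `context`, or None.

    Hypothetical helper: the real scraper's matching logic may differ.
    """
    best_index, best_score = None, 0.0
    # Normalize case and whitespace so formatting differences do not matter.
    needle = " ".join(context.lower().split())
    for index, paragraph in enumerate(paragraphs):
        candidate = " ".join(paragraph.lower().split())
        score = SequenceMatcher(None, needle, candidate).ratio()
        if score > best_score:
            best_index, best_score = index, score
    # Diagrams without a sufficiently similar paragraph stay unmatched.
    return best_index if best_score >= threshold else None


paragraphs = [
    "The system consists of three primary executable components.",
    "The Dockerfile uses a two-stage build.",
]
print(best_anchor("system consists of three primary  executable components", paragraphs))
```

A threshold of this kind is one way to explain why only a fraction of extracted diagrams end up placed: those with too little surrounding context never reach the cutoff.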
Core Capabilities
The system provides the following capabilities through environment variable configuration:
- Generic Repository Support: Works with any GitHub repository indexed by DeepWiki via the REPO environment variable
- Auto-Detection: Extracts repository metadata from Git remotes when available
- Hierarchy Preservation: Maintains wiki page numbering and section structure
- Diagram Intelligence: Extracts ~461 total diagrams, matches ~48 with sufficient context using fuzzy matching
- Dual Output Modes: Full mdBook build or markdown-only extraction via the MARKDOWN_ONLY flag
- No Authentication: Public HTTP scraping without API keys or credentials
- Containerized Deployment: Single Docker image with all dependencies
Sources: README.md:5-15 README.md:42-51
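Auto-detection from a Git remote amounts to turning a remote URL into the owner/repo form that REPO expects. The sketch below assumes the URL has already been read, for example from `git config --get remote.origin.url`. The helper name repo_from_remote and the regular expression are illustrative, not the project's own code.

```python
import re

# Matches both SSH (git@github.com:owner/repo.git) and HTTPS GitHub remotes.
_GITHUB_REMOTE = re.compile(
    r"github\.com[:/](?P<owner>[^/\s]+)/(?P<repo>[^/\s]+?)(?:\.git)?/?$"
)


def repo_from_remote(remote_url):
    """Return 'owner/repo' for a GitHub remote URL, or None if it does not match."""
    match = _GITHUB_REMOTE.search(remote_url.strip())
    if match is None:
        return None
    return f"{match.group('owner')}/{match.group('repo')}"


print(repo_from_remote("git@github.com:facebook/react.git"))
print(repo_from_remote("https://github.com/facebook/react"))
```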
System Components
The system consists of three primary executable components coordinated by a shell orchestrator. The following diagram maps user interaction to specific code entities:
Component Architecture with Code Entities
graph TB
DockerRun["docker run\nwith env vars"]
subgraph Container["/usr/local/bin/ executables"]
BuildDocs["build-docs.sh"]
Scraper["deepwiki-scraper.py\nmain()\nscrape_wiki()\nextract_mermaid_from_nextjs_data()"]
MdBook["mdbook binary\n(Rust)"]
MermaidPlugin["mdbook-mermaid binary\n(Rust)"]
end
subgraph External["External HTTP Endpoints"]
DeepWikiAPI["deepwiki.com/$REPO"]
GitHubEdit["github.com/$REPO/edit/"]
end
subgraph OutputVol["/output volume mount"]
MarkdownDir["markdown/\nnumbered .md files"]
RawMarkdownDir["raw_markdown/\npre-enhancement .md"]
BookDir["book/\nindex.html + search"]
ConfigFile["book.toml"]
SummaryFile["SUMMARY.md"]
end
DockerRun -->|REPO BOOK_TITLE BOOK_AUTHORS MARKDOWN_ONLY| BuildDocs
BuildDocs -->|python3 deepwiki-scraper.py| Scraper
BuildDocs -->|mdbook init| MdBook
BuildDocs -->|mdbook build| MdBook
BuildDocs -->|generates| ConfigFile
BuildDocs -->|generates| SummaryFile
Scraper -->|requests.get| DeepWikiAPI
Scraper -->|writes| RawMarkdownDir
Scraper -->|writes enhanced| MarkdownDir
MdBook -->|preprocessor chain| MermaidPlugin
MdBook -->|generates| BookDir
BookDir -.->|edit links point to| GitHubEdit
Executable Components
| Component | Type | Entry Point | Key Operations |
|---|---|---|---|
| build-docs.sh | Shell script | CMD in Dockerfile | Parse $REPO, $BOOK_TITLE, generate book.toml, invoke Python and Rust tools |
| deepwiki-scraper.py | Python 3.12 module | main() function | scrape_wiki(), extract_mermaid_from_nextjs_data(), inject_mermaid_diagrams_into_markdown() |
| mdbook | Rust binary | CLI invocation | mdbook init, mdbook build with book.toml configuration |
| mdbook-mermaid | Rust preprocessor | mdBook plugin chain | Asset injection for Mermaid.js runtime |
Sources: README.md:1-27 README.md:84-88 Diagram 1, Diagram 3
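Generating book.toml from environment values is essentially string templating. build-docs.sh does this in shell; the Python below only illustrates the shape of the result. The keys shown are standard mdBook options, but which keys the script actually writes is an assumption, and render_book_toml is a hypothetical name.

```python
import json


def render_book_toml(title, authors, repo):
    """Render a small book.toml as a string.

    Hypothetical helper: the exact keys written by build-docs.sh are assumed.
    """
    # json.dumps yields double-quoted, escaped strings, which are also valid
    # TOML basic strings for ordinary titles and author names.
    lines = [
        "[book]",
        f"title = {json.dumps(title)}",
        f"authors = [{json.dumps(authors)}]",
        'language = "en"',
        'src = "src"',
        "",
        "[output.html]",
        f"git-repository-url = {json.dumps('https://github.com/' + repo)}",
        "",
    ]
    return "\n".join(lines)


print(render_book_toml("React Documentation", "Meta Open Source", "facebook/react"))
```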
Processing Pipeline
The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable. Each phase invokes specific executables and functions:
Three-Phase Execution Flow with Code Entities
stateDiagram-v2
[*] --> ParseEnv : build-docs.sh reads env
state ParseEnv {
[*] --> ReadREPO : $REPO
ReadREPO --> ReadBOOKTITLE : $BOOK_TITLE
ReadBOOKTITLE --> ReadMARKDOWNONLY : $MARKDOWN_ONLY
ReadMARKDOWNONLY --> [*]
}
ParseEnv --> Phase1 : python3 deepwiki-scraper.py
state Phase1 {
[*] --> scrape_wiki
scrape_wiki --> BeautifulSoup4 : parse HTML
BeautifulSoup4 --> html2text : convert to .md
html2text --> extract_mermaid : extract_mermaid_from_nextjs_data()
extract_mermaid --> normalize_mermaid : 7-step normalization
normalize_mermaid --> inject_diagrams : inject_mermaid_diagrams_into_markdown()
inject_diagrams --> write_files : /output/markdown/*.md
write_files --> [*]
}
Phase1 --> CheckMode
state CheckMode <<choice>>
CheckMode --> Phase2 : if MARKDOWN_ONLY != true
CheckMode --> Exit : if MARKDOWN_ONLY == true
Phase2 --> GenerateBookToml : build-docs.sh writes book.toml
GenerateBookToml --> GenerateSummary : build-docs.sh writes SUMMARY.md
GenerateSummary --> Phase3
state Phase3 {
[*] --> mdbook_init : mdbook init
mdbook_init --> mdbook_mermaid_install : mdbook-mermaid install
mdbook_mermaid_install --> mdbook_build : mdbook build
mdbook_build --> [*] : /output/book/
}
Phase3 --> Exit
Exit --> [*]
Phase Execution Details
| Phase | Primary Executable | Key Functions/Commands | Artifacts |
|---|---|---|---|
| 1: Extract | deepwiki-scraper.py | scrape_wiki(), extract_mermaid_from_nextjs_data(), normalize_mermaid_code(), inject_mermaid_diagrams_into_markdown() | /output/markdown/*.md, /output/raw_markdown/*.md |
| 2: Configure | build-docs.sh | Template string generation for book.toml and SUMMARY.md | /output/book.toml, /output/SUMMARY.md |
| 3: Build | mdbook, mdbook-mermaid | mdbook init, mdbook-mermaid install, mdbook build | /output/book/index.html, /output/book/searchindex.json |
Sources: README.md:72-77 Diagram 2, Diagram 4
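The conditional execution reduces to one branch after Phase 1. The sketch below mirrors the phase names from the table; parse_flag and plan_phases are hypothetical names, and the exact string comparison against "true" follows the state diagram rather than the script's source.

```python
def parse_flag(value):
    """Treat only the exact string 'true' as enabled, as in the state diagram."""
    return value == "true"


def plan_phases(markdown_only):
    """List the pipeline phases that run for a given MARKDOWN_ONLY setting."""
    phases = ["1: Extract"]
    if not markdown_only:
        # Configure and Build are skipped entirely in markdown-only mode.
        phases += ["2: Configure", "3: Build"]
    return phases


print(plan_phases(parse_flag("true")))
print(plan_phases(parse_flag(None)))
```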
Input and Output
Input Requirements
| Input | Format | Source | Example |
|---|---|---|---|
| REPO | owner/repo | Environment variable | facebook/react |
| BOOK_TITLE | String | Environment variable (optional) | React Documentation |
| BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
| MARKDOWN_ONLY | true/false | Environment variable (optional) | false |
Sources: README.md:42-51
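The inputs in the table can be read and validated in one place. This is a minimal sketch under stated assumptions: Settings and load_settings are hypothetical names, the owner/repo pattern is a guess at reasonable validation, optional values are left as None rather than given invented defaults, and the Git-remote fallback for a missing REPO is left out.

```python
import re
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed shape of a valid REPO value: two path segments joined by a slash.
_REPO_PATTERN = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


@dataclass(frozen=True)
class Settings:
    repo: str
    book_title: Optional[str]
    book_authors: Optional[str]
    markdown_only: bool


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an environment-like mapping such as os.environ."""
    repo = env.get("REPO", "")
    if not _REPO_PATTERN.match(repo):
        raise ValueError(f"REPO must look like owner/repo, got {repo!r}")
    return Settings(
        repo=repo,
        book_title=env.get("BOOK_TITLE") or None,
        book_authors=env.get("BOOK_AUTHORS") or None,
        markdown_only=env.get("MARKDOWN_ONLY", "false") == "true",
    )


print(load_settings({"REPO": "facebook/react", "MARKDOWN_ONLY": "true"}))
```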
Output Artifacts
Full Build Mode (MARKDOWN_ONLY=false or unset):
output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml
Markdown-Only Mode (MARKDOWN_ONLY=true):
output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...
Sources: README.md:89-119
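The file names in the trees above follow a visible pattern: the page number with hyphens, then a slug of the title, with sub-pages grouped under a section-N directory. The sketch below reproduces that pattern. Both helper names and the dotted input form such as "3.1" are assumptions, not the scraper's actual interface.

```python
import re
from pathlib import PurePosixPath


def slugify(title):
    """Lower-case a title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def page_path(number, title):
    """Map a wiki page number such as '3.1' and its title to a relative path."""
    parts = number.split(".")
    filename = f"{'-'.join(parts)}-{slugify(title)}.md"
    if len(parts) == 1:
        # Top-level pages sit directly in markdown/.
        return PurePosixPath(filename)
    # Sub-pages are grouped under a directory named after their parent number.
    return PurePosixPath(f"section-{parts[0]}") / filename


print(page_path("2", "Quick Start"))
print(page_path("3.1", "Workspace"))
```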
Technical Stack
The system combines multiple technology stacks in a single container using Docker multi-stage builds:
Runtime Dependencies
| Component | Version | Purpose | Installation Method |
|---|---|---|---|
| Python | 3.12-slim | Scraping runtime | Base image |
| requests | Latest | HTTP client | uv pip install |
| beautifulsoup4 | Latest | HTML parser | uv pip install |
| html2text | Latest | HTML to Markdown | uv pip install |
| mdbook | Latest | Documentation builder | Compiled from source (Rust) |
| mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |
Build Architecture
The Dockerfile uses a two-stage build:
- Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
- Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)
Sources: README.md:146-157 Diagram 3
File System Interaction
The system uses a temporary directory pattern to ensure atomic writes to the output volume:
Filesystem Write Pattern
graph TB
subgraph HostFS["Host Filesystem"]
HostOutput["./output/\n(bind mount)"]
end
subgraph ContainerFS["Container Filesystem"]
BuildScript["/usr/local/bin/build-docs.sh"]
ScraperScript["/usr/local/bin/deepwiki-scraper.py"]
TmpWiki["/tmp/wiki_temp/\n(write buffer)"]
OutputMount["/output/\n(volume mount)"]
WorkspaceTemplates["/workspace/templates/\nheader.html, footer.html"]
end
subgraph WriteOperations["File Write Operations"]
WriteRaw["write_markdown_file()\nraw .md to /tmp"]
WriteEnhanced["write enhanced .md\nafter diagram injection"]
AtomicMove["shutil.move()\nor mv command"]
CopyBook["cp -r mdbook_output/"]
end
subgraph OutputStructure["/output/ final structure"]
OutMarkdown["/output/markdown/"]
OutRaw["/output/raw_markdown/"]
OutBook["/output/book/"]
OutConfig["/output/book.toml"]
end
HostOutput -.->|-v bind mount| OutputMount
ScraperScript -->|Phase 1| WriteRaw
WriteRaw --> TmpWiki
ScraperScript --> WriteEnhanced
WriteEnhanced --> TmpWiki
TmpWiki -->|atomic| AtomicMove
AtomicMove --> OutMarkdown
AtomicMove --> OutRaw
BuildScript -->|Phase 2| OutConfig
BuildScript -->|Phase 3| CopyBook
CopyBook --> OutBook
WorkspaceTemplates -.->|process-template.py reads| BuildScript
OutMarkdown --> OutputMount
OutRaw --> OutputMount
OutBook --> OutputMount
OutConfig --> OutputMount
Write Sequence
1. deepwiki-scraper.py writes raw markdown to /tmp/wiki_temp/ using the write_markdown_file() function
2. After diagram injection via inject_mermaid_diagrams_into_markdown(), enhanced markdown moves to /output/markdown/
3. build-docs.sh generates /output/book.toml from environment variables
4. mdbook build writes HTML to an internal directory, which build-docs.sh copies to /output/book/
This pattern ensures atomicity: partial writes never appear in /output/.
Sources: README.md:19-26 README.md:54-58 Diagram 3
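The stage-then-move pattern can be sketched with the standard library alone. The function name publish_markdown and its arguments are hypothetical, and the real scraper stages files under /tmp/wiki_temp/ rather than a generated temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def publish_markdown(pages, output_dir):
    """Write pages to a temporary directory, then move them into output_dir.

    `pages` maps relative file names to markdown text. Hypothetical helper.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as staging:
        staging_dir = Path(staging)
        # Stage every file first so a failure here leaves output_dir untouched.
        for name, text in pages.items():
            target = staging_dir / name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(text, encoding="utf-8")
        # Only fully written files are moved into the output directory.
        for name in pages:
            destination = output_dir / name
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(staging_dir / name), str(destination))
    return sorted(str(p.relative_to(output_dir)) for p in output_dir.rglob("*.md"))
```

One caveat on the design: shutil.move renames the file when source and destination share a filesystem, and otherwise copies it and then removes the source. With a staging directory under /tmp and a bind-mounted /output, the move is typically a copy, so the guarantee is that no half-generated page is published rather than strict per-file atomicity.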
Configuration Philosophy
The system operates on three configuration principles:
- Environment-Driven: All customization via environment variables, no file editing required
- Auto-Detection: Intelligent defaults from Git remotes (repository URL, author name)
- Zero-Configuration: Minimal required inputs (REPO or auto-detect from current directory)
Minimal Example:
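A minimal invocation could take the following shape. The sketch builds the `docker run` argument list in Python and prints it as a shell command. The image tag deepwiki-to-mdbook and the host path are placeholders, while REPO and the /output mount come from the sections above.

```python
import shlex


def docker_command(repo, output_dir, image="deepwiki-to-mdbook"):
    """Build a `docker run` argument list; the image tag is a placeholder."""
    return [
        "docker", "run", "--rm",
        "-e", f"REPO={repo}",          # repository to convert, as owner/repo
        "-v", f"{output_dir}:/output",  # host directory that receives the results
        image,
    ]


print(shlex.join(docker_command("facebook/react", "/tmp/output")))
```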
This single command triggers the complete extraction, transformation, and build pipeline.
For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.
Sources: README.md:22-51 README.md:220-227