System Architecture
This document provides a comprehensive overview of the DeepWiki-to-mdBook Converter's system architecture, explaining how the major components interact and how data flows through the system. It describes the containerized polyglot design, the orchestration model, and the technology integration strategy.
For detailed information about the three-phase processing model, see Three-Phase Pipeline. For Docker containerization specifics, see Docker Multi-Stage Build. For individual component implementation details, see Component Reference.
Architectural Overview
The system follows a layered orchestration architecture where a shell script coordinator invokes specialized tools in sequence. The entire system runs within a single Docker container that combines Python web scraping tools with Rust documentation building tools.
Design Principles
| Principle | Implementation |
|---|---|
| Single Responsibility | Each component (shell, Python, Rust tools) has one clear purpose |
| Language-Specific Tools | Python for web scraping, Rust for documentation building, Shell for orchestration |
| Stateless Processing | No persistent state between runs; all configuration via environment variables |
| Atomic Operations | Temporary directory workflow ensures no partial output states |
| Generic Design | No hardcoded repository details; works with any DeepWiki repository |
Sources: README.md:218-227 build-docs.sh:1-206
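Because all configuration flows through environment variables and the `/output` volume mount, a complete run is a single `docker run` invocation. The sketch below is illustrative; the image tag is an assumption, not a name defined by this repository:

```bash
# Illustrative invocation; the image tag is an assumption, not project-defined.
docker run --rm \
  -e REPO=owner/repo \
  -e BOOK_TITLE="My Project" \
  -e MARKDOWN_ONLY=false \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook:latest
```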
Container Architecture
The system uses a two-stage Docker build to create a hybrid Python-Rust runtime environment while minimizing image size.
```mermaid
graph TB
subgraph Stage1["Stage 1: Rust Builder (rust:latest)"]
RustBase["rust:latest base image"]
CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
BinariesOut["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
RustBase --> CargoInstall
CargoInstall --> BinariesOut
end
subgraph Stage2["Stage 2: Final Image (python:3.12-slim)"]
PyBase["python:3.12-slim base"]
UVInstall["COPY --from=ghcr.io/astral-sh/uv"]
PipInstall["uv pip install\nrequirements.txt"]
CopyBins["COPY --from=builder\nRust binaries"]
CopyScripts["COPY scripts:\ndeepwiki-scraper.py\nbuild-docs.sh"]
PyBase --> UVInstall
UVInstall --> PipInstall
PipInstall --> CopyBins
CopyBins --> CopyScripts
end
BinariesOut -.->|"Extract binaries only,\ndiscard ~1.5GB toolchain"| CopyBins
CopyScripts --> FinalImage["Final Image: ~300-400MB\nPython + Rust binaries\nNo build tools"]
subgraph RuntimeContents["Runtime Contents"]
direction LR
Python["Python 3.12 runtime"]
Packages["requests, BeautifulSoup4,\nhtml2text"]
Tools["mdbook, mdbook-mermaid\nbinaries"]
end
```
Docker Multi-Stage Build Topology
Stage 1 (Dockerfile:1-5) compiles the Rust tools using the full rust:latest image (~1.5 GB), but only the compiled binaries are extracted. Stage 2 (Dockerfile:7-32) builds the final image on a minimal Python base, copying in only the Rust binaries and the Python scripts, which yields a compact image.
Sources: Dockerfile:1-33 README.md:156
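To observe the size benefit in practice, build the image and inspect its reported size (the tag name is illustrative):

```bash
# Build the image and check its final size (tag is illustrative).
docker build -t deepwiki-to-mdbook .
docker images deepwiki-to-mdbook --format '{{.Size}}'   # expect roughly 300-400MB
```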
Component Topology and Code Mapping
This diagram maps the system's logical components to their actual code implementations:
```mermaid
graph TB
subgraph User["User Interface"]
CLI["Docker CLI"]
EnvVars["Environment Variables:\nREPO, BOOK_TITLE,\nBOOK_AUTHORS, etc."]
Volume["/output volume mount"]
end
subgraph Orchestrator["Orchestration Layer"]
BuildScript["build-docs.sh"]
MainLoop["Main execution flow:\nLines 55-206"]
ConfigGen["Configuration generation:\nLines 84-103, 108-159"]
AutoDetect["Auto-detection logic:\nLines 8-19, 40-45"]
end
subgraph ScraperLayer["Content Acquisition Layer"]
ScraperMain["deepwiki-scraper.py\nmain()
function"]
ExtractStruct["extract_wiki_structure()\nLine 78"]
ExtractContent["extract_page_content()\nLine 453"]
ExtractDiagrams["extract_and_enhance_diagrams()\nLine 596"]
FetchPage["fetch_page()\nLine 27"]
ConvertHTML["convert_html_to_markdown()\nLine 175"]
CleanFooter["clean_deepwiki_footer()\nLine 127"]
FixLinks["fix_wiki_link()\nLine 549"]
end
subgraph BuildLayer["Documentation Generation Layer"]
MdBookInit["mdbook init"]
MdBookBuild["mdbook build\n(Line 176)"]
MermaidInstall["mdbook-mermaid install\n(Line 171)"]
end
subgraph Output["Output Artifacts"]
TempDir["/workspace/wiki/\n(temp directory)"]
OutputMD["/output/markdown/\nEnhanced .md files"]
OutputBook["/output/book/\nHTML documentation"]
BookToml["/output/book.toml"]
end
CLI --> EnvVars
EnvVars --> BuildScript
BuildScript --> AutoDetect
BuildScript --> MainLoop
MainLoop --> ScraperMain
MainLoop --> ConfigGen
MainLoop --> MdBookInit
ScraperMain --> ExtractStruct
ScraperMain --> ExtractContent
ScraperMain --> ExtractDiagrams
ExtractStruct --> FetchPage
ExtractContent --> FetchPage
ExtractContent --> ConvertHTML
ConvertHTML --> CleanFooter
ExtractContent --> FixLinks
ExtractDiagrams --> TempDir
ExtractContent --> TempDir
ConfigGen --> MdBookInit
TempDir --> MdBookBuild
MdBookBuild --> MermaidInstall
TempDir --> OutputMD
MdBookBuild --> OutputBook
ConfigGen --> BookToml
OutputMD --> Volume
OutputBook --> Volume
BookToml --> Volume
```
This diagram shows the complete code-to-component mapping, making it easy to locate specific functionality in the codebase.
Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920
```mermaid
stateDiagram-v2
[*] --> ValidateConfig
ValidateConfig : build-docs.sh - Lines 8-53 Parse REPO, auto-detect if needed Set BOOK_TITLE, BOOK_AUTHORS defaults
ValidateConfig --> Phase1
Phase1 : Phase 1 - Scrape Wiki build-docs.sh - Line 58 Calls - deepwiki-scraper.py
state Phase1 {
[*] --> ExtractStructure
ExtractStructure : extract_wiki_structure() Parse main page, discover subsections
ExtractStructure --> ExtractPages
ExtractPages : extract_page_content() Fetch HTML, convert to markdown
ExtractPages --> EnhanceDiagrams
EnhanceDiagrams : extract_and_enhance_diagrams() Fuzzy match and inject diagrams
EnhanceDiagrams --> [*]
}
Phase1 --> CheckMode
CheckMode : Check MARKDOWN_ONLY flag build-docs.sh - Line 61
state CheckMode <<choice>>
CheckMode --> CopyMarkdown : MARKDOWN_ONLY=true
CheckMode --> Phase2 : MARKDOWN_ONLY=false
CopyMarkdown : Copy to /output/markdown build-docs.sh - Lines 63-75
CopyMarkdown --> Done
Phase2 : Phase 2 - Initialize mdBook build-docs.sh - Lines 79-106
state Phase2 {
[*] --> CreateBookToml
CreateBookToml : Generate book.toml Lines 85-103
CreateBookToml --> GenerateSummary
GenerateSummary : Generate SUMMARY.md Lines 113-159
GenerateSummary --> [*]
}
Phase2 --> Phase3
Phase3 : Phase 3 - Build Documentation build-docs.sh - Lines 164-191
state Phase3 {
[*] --> InstallMermaid
InstallMermaid : mdbook-mermaid install Line 171
InstallMermaid --> BuildBook
BuildBook : mdbook build Line 176
BuildBook --> CopyOutputs
CopyOutputs : Copy to /output Lines 184-191
CopyOutputs --> [*]
}
Phase3 --> Done
Done --> [*]
```
Execution Flow
The system executes through a well-defined sequence orchestrated by build-docs.sh:
Primary Execution Path
The execution flow, shown in the state diagram above, has a fast path (markdown-only mode) and a complete path (full documentation build). The decision point at line 61 of build-docs.sh determines which path is taken based on the MARKDOWN_ONLY environment variable.
Sources: build-docs.sh:55-206 tools/deepwiki-scraper.py:790-916
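In shell terms, the branch and the Phase 3 commands reduce to something like the following sketch; it mirrors the shape of build-docs.sh rather than reproducing it verbatim:

```bash
# Simplified sketch of the branch at build-docs.sh line 61 (not the verbatim script).
if [ "$MARKDOWN_ONLY" = "true" ]; then
  mkdir -p /output/markdown
  cp -r /workspace/wiki/. /output/markdown/   # fast path: stop after Phase 1
  exit 0
fi

# Complete path, Phase 3: standard mdBook CLI invocations
mdbook-mermaid install /workspace/book   # add Mermaid assets and book.toml entries
mdbook build /workspace/book             # render HTML into the book/ subdirectory
```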
Technology Stack and Integration Points
Core Technologies
| Layer | Technology | Purpose | Code Reference |
|---|---|---|---|
| Orchestration | Bash | Script coordination, environment handling | build-docs.sh:1-206 |
| Web Scraping | Python 3.12 | HTTP requests, HTML parsing | tools/deepwiki-scraper.py:1-920 |
| HTML Parsing | BeautifulSoup4 | DOM navigation, content extraction | tools/deepwiki-scraper.py:18-19 |
| HTML→MD Conversion | html2text | Clean markdown generation | tools/deepwiki-scraper.py:175-190 |
| Documentation Build | mdBook (Rust) | HTML site generation | build-docs.sh:176 |
| Diagram Rendering | mdbook-mermaid | Mermaid diagram support | build-docs.sh:171 |
| Package Management | uv | Fast Python dependency installation | Dockerfile:13-17 |
Python Dependencies Integration
The scraper uses three primary Python libraries, installed via uv:
Integration points:
- requests session with retry logic: tools/deepwiki-scraper.py:27-42
- BeautifulSoup for content extraction: tools/deepwiki-scraper.py:463-487
- html2text with body_width=0 for no wrapping: tools/deepwiki-scraper.py:175-181
Sources: Dockerfile:16-17 tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42
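Inside the Dockerfile this amounts to a single uv invocation; a representative form (the exact flags at Dockerfile:13-17 may differ) is:

```bash
# Representative dependency install step; exact flags in Dockerfile:13-17 may differ.
uv pip install --system -r requirements.txt
```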
```mermaid
graph TB
subgraph Docker["Docker Container Filesystem"]
subgraph Workspace["/workspace"]
WikiTemp["/workspace/wiki\n(temporary)\nScraper output"]
BookBuild["/workspace/book\nmdBook build directory"]
BookSrc["/workspace/book/src\nMarkdown source files"]
end
subgraph Binaries["/usr/local/bin"]
MdBook["mdbook"]
MdBookMermaid["mdbook-mermaid"]
Scraper["deepwiki-scraper.py"]
BuildScript["build-docs.sh"]
end
subgraph Output["/output (volume mount)"]
OutputMD["/output/markdown\nFinal markdown files"]
OutputBook["/output/book\nHTML documentation"]
OutputConfig["/output/book.toml"]
end
end
Scraper -.->|Phase 1: Write| WikiTemp
WikiTemp -.->|Phase 2: Enhance in-place| WikiTemp
WikiTemp -.->|Copy| BookSrc
BookSrc -.->|mdbook build| OutputBook
WikiTemp -.->|Move| OutputMD
```
File System Structure
The system uses a temporary directory workflow to ensure atomic operations; the diagram above shows the directory layout at runtime.
Workflow:
- Lines 808-877: The scraper writes to a temporary directory in /tmp (created by tempfile.TemporaryDirectory())
- Line 880: Diagram enhancement modifies files in the temporary directory
- Lines 887-908: Completed files are moved atomically to /output
- Line 166: build-docs.sh copies the markdown into the mdBook source directory
- Line 176: mdBook builds the HTML into /workspace/book/book
- Lines 184-191: Outputs are copied to the /output volume
This pattern ensures no partial or corrupted output is visible to users.
Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:164-191
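The scraper realizes this pattern in Python via tempfile.TemporaryDirectory() (tools/deepwiki-scraper.py:804-916); expressed as a generic shell idiom, the same write-then-move pattern looks like this (an illustration of the pattern, not the project's code; produce_markdown is a hypothetical work step):

```bash
# Generic write-then-move idiom (illustration of the pattern, not project code).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT       # discard partial work on any failure
produce_markdown "$tmp"         # hypothetical work step writing into the temp dir
mkdir -p /output/markdown
mv "$tmp"/* /output/markdown/   # publish only once everything has completed
```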
Configuration Management
Configuration flows from environment variables through shell script processing to generated config files:
Configuration Flow
| Input | Processor | Output | Code Reference |
|---|---|---|---|
| REPO | build-docs.sh:8-19 | Auto-detected from Git or required | build-docs.sh:8-36 |
| BOOK_TITLE | build-docs.sh:23 | Defaults to "Documentation" | build-docs.sh:23 |
| BOOK_AUTHORS | build-docs.sh:24,44 | Defaults to repo owner | build-docs.sh:24-44 |
| GIT_REPO_URL | build-docs.sh:25,45 | Constructed from REPO | build-docs.sh:25-45 |
| MARKDOWN_ONLY | build-docs.sh:26,61 | Controls pipeline execution | build-docs.sh:26-61 |
| All config | build-docs.sh:85-103 | book.toml generation | build-docs.sh:85-103 |
| File structure | build-docs.sh:113-159 | SUMMARY.md generation | build-docs.sh:113-159 |
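The book.toml generation at build-docs.sh:85-103 boils down to a heredoc interpolating these variables. A minimal sketch, assuming standard mdBook and mdbook-mermaid configuration keys:

```bash
# Minimal sketch of book.toml generation (simplified; see build-docs.sh:85-103).
cat > book.toml <<EOF
[book]
title = "${BOOK_TITLE}"
authors = ["${BOOK_AUTHORS}"]
src = "src"

[output.html]
git-repository-url = "${GIT_REPO_URL}"

[preprocessor.mermaid]
command = "mdbook-mermaid"
EOF
```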
Auto-Detection Logic
The system can automatically detect repository information from Git remotes.
This enables zero-configuration usage in CI/CD environments where the code is already checked out.
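A plausible shape for that detection, assuming a GitHub-style remote URL (the actual logic lives at build-docs.sh:8-19 and may differ):

```bash
# Plausible auto-detection sketch; the actual logic is in build-docs.sh:8-19.
if [ -z "${REPO:-}" ]; then
  url=$(git remote get-url origin 2>/dev/null || true)
  # Normalize git@github.com:owner/repo.git or https://github.com/owner/repo.git
  REPO=$(printf '%s' "$url" | sed -E 's#^(git@|https://)github\.com[:/]##; s#\.git$##')
fi
: "${REPO:?REPO is required when no Git remote can be detected}"
```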
Sources: build-docs.sh:8-45 README.md:47-53
Summary
The DeepWiki-to-mdBook Converter architecture demonstrates several key design patterns:
- Polyglot Orchestration: Shell coordinates Python and Rust tools, each optimized for its specific task
- Multi-Stage Container Build: Separates build-time tooling from runtime dependencies for a minimal image size
- Temporary Directory Workflow: Ensures atomic operations and prevents partial output states
- Progressive Processing: Three distinct phases (extract, enhance, build) with an optional fast path
- Zero-Configuration Capability: Intelligent defaults and auto-detection minimize required configuration
The architecture prioritizes maintainability (clear separation of concerns), reliability (atomic operations), and usability (intelligent defaults) while remaining fully generic and portable.
Sources: README.md:1-233 Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920