
System Architecture

This document provides a comprehensive overview of the DeepWiki-to-mdBook Converter's system architecture, explaining how the major components interact and how data flows through the system. It describes the containerized polyglot design, the orchestration model, and the technology integration strategy.

For detailed information about the three-phase processing model, see Three-Phase Pipeline. For Docker containerization specifics, see Docker Multi-Stage Build. For individual component implementation details, see Component Reference.

Architectural Overview

The system follows a layered orchestration architecture where a shell script coordinator invokes specialized tools in sequence. The entire system runs within a single Docker container that combines Python web scraping tools with Rust documentation building tools.
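
As a conceptual sketch of that sequencing (a simplification, not the literal build-docs.sh; the scraper's command-line arguments are assumed, while the paths match the runtime layout described below):

```bash
#!/bin/sh
# Conceptual orchestration order; a sketch, not the literal build-docs.sh.
# The scraper's arguments are assumed here.
python3 /usr/local/bin/deepwiki-scraper.py "$REPO" /workspace/wiki   # Phase 1: scrape and enhance
cp -r /workspace/wiki/. /workspace/book/src/                         # Phase 2: stage markdown for mdBook
mdbook-mermaid install /workspace/book                               # Phase 3: add Mermaid support...
mdbook build /workspace/book                                         # ...and build the HTML site
```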

Design Principles

| Principle | Implementation |
| --- | --- |
| Single Responsibility | Each component (shell, Python, Rust tools) has one clear purpose |
| Language-Specific Tools | Python for web scraping, Rust for documentation building, Shell for orchestration |
| Stateless Processing | No persistent state between runs; all configuration via environment variables |
| Atomic Operations | Temporary directory workflow ensures no partial output states |
| Generic Design | No hardcoded repository details; works with any DeepWiki repository |

Sources: README.md:218-227 build-docs.sh:1-206

Container Architecture

The system uses a two-stage Docker build to create a hybrid Python-Rust runtime environment while minimizing image size.

```mermaid
graph TB
    subgraph Stage1["Stage 1: Rust Builder (rust:latest)"]
        RustBase["rust:latest base image"]
        CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
        BinariesOut["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
        RustBase --> CargoInstall
        CargoInstall --> BinariesOut
    end

    subgraph Stage2["Stage 2: Final Image (python:3.12-slim)"]
        PyBase["python:3.12-slim base"]
        UVInstall["COPY --from=ghcr.io/astral-sh/uv"]
        PipInstall["uv pip install\nrequirements.txt"]
        CopyBins["COPY --from=builder\nRust binaries"]
        CopyScripts["COPY scripts:\ndeepwiki-scraper.py\nbuild-docs.sh"]
        PyBase --> UVInstall
        UVInstall --> PipInstall
        PipInstall --> CopyBins
        CopyBins --> CopyScripts
    end

    BinariesOut -.->|"Extract binaries only; discard ~1.5 GB toolchain"| CopyBins
    CopyScripts --> FinalImage["Final Image: ~300-400MB\nPython + Rust binaries\nNo build tools"]

    subgraph RuntimeContents["Runtime Contents"]
        direction LR
        Python["Python 3.12 runtime"]
        Packages["requests, BeautifulSoup4,\nhtml2text"]
        Tools["mdbook, mdbook-mermaid\nbinaries"]
    end
```

Docker Multi-Stage Build Topology

Stage 1 (Dockerfile:1-5) compiles the Rust tools using the full rust:latest image (~1.5 GB), but only the compiled binaries are extracted. Stage 2 (Dockerfile:7-32) builds the final image on a minimal Python base, copying in only the Rust binaries and Python scripts, which yields a compact image.
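
Because all tooling is baked into the image, usage reduces to a single docker run. A representative invocation (the image tag and environment values are illustrative; the variables and /output mount come from the configuration section below) might be:

```bash
# Build the two-stage image, then generate documentation into ./output.
# The tag "deepwiki-mdbook" and the REPO value are illustrative.
docker build -t deepwiki-mdbook .
docker run --rm \
  -e REPO="owner/repo" \
  -e BOOK_TITLE="My Project Docs" \
  -v "$(pwd)/output:/output" \
  deepwiki-mdbook
```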

Sources: Dockerfile:1-33 README.md:156

Component Topology and Code Mapping

This diagram maps the system's logical components to their actual code implementations:

```mermaid
graph TB
    subgraph User["User Interface"]
        CLI["Docker CLI"]
        EnvVars["Environment Variables:\nREPO, BOOK_TITLE,\nBOOK_AUTHORS, etc."]
        Volume["/output volume mount"]
    end

    subgraph Orchestrator["Orchestration Layer"]
        BuildScript["build-docs.sh"]
        MainLoop["Main execution flow:\nLines 55-206"]
        ConfigGen["Configuration generation:\nLines 84-103, 108-159"]
        AutoDetect["Auto-detection logic:\nLines 8-19, 40-45"]
    end

    subgraph ScraperLayer["Content Acquisition Layer"]
        ScraperMain["deepwiki-scraper.py\nmain() function"]
        ExtractStruct["extract_wiki_structure()\nLine 78"]
        ExtractContent["extract_page_content()\nLine 453"]
        ExtractDiagrams["extract_and_enhance_diagrams()\nLine 596"]
        FetchPage["fetch_page()\nLine 27"]
        ConvertHTML["convert_html_to_markdown()\nLine 175"]
        CleanFooter["clean_deepwiki_footer()\nLine 127"]
        FixLinks["fix_wiki_link()\nLine 549"]
    end

    subgraph BuildLayer["Documentation Generation Layer"]
        MdBookInit["mdbook init"]
        MdBookBuild["mdbook build\n(Line 176)"]
        MermaidInstall["mdbook-mermaid install\n(Line 171)"]
    end

    subgraph Output["Output Artifacts"]
        TempDir["/workspace/wiki/\n(temp directory)"]
        OutputMD["/output/markdown/\nEnhanced .md files"]
        OutputBook["/output/book/\nHTML documentation"]
        BookToml["/output/book.toml"]
    end

    CLI --> EnvVars
    EnvVars --> BuildScript

    BuildScript --> AutoDetect
    BuildScript --> MainLoop
    MainLoop --> ScraperMain
    MainLoop --> ConfigGen
    MainLoop --> MdBookInit

    ScraperMain --> ExtractStruct
    ScraperMain --> ExtractContent
    ScraperMain --> ExtractDiagrams

    ExtractStruct --> FetchPage
    ExtractContent --> FetchPage
    ExtractContent --> ConvertHTML
    ConvertHTML --> CleanFooter
    ExtractContent --> FixLinks

    ExtractDiagrams --> TempDir
    ExtractContent --> TempDir

    ConfigGen --> MdBookInit
    TempDir --> MdBookBuild
    MdBookBuild --> MermaidInstall

    TempDir --> OutputMD
    MdBookBuild --> OutputBook
    ConfigGen --> BookToml

    OutputMD --> Volume
    OutputBook --> Volume
    BookToml --> Volume
```

This diagram shows the complete code-to-component mapping, making it easy to locate specific functionality in the codebase.

Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920

Execution Flow

The system executes through a well-defined sequence orchestrated by build-docs.sh:

Primary Execution Path

```mermaid
stateDiagram-v2
    [*] --> ValidateConfig

    ValidateConfig : build-docs.sh - Lines 8-53 Parse REPO, auto-detect if needed Set BOOK_TITLE, BOOK_AUTHORS defaults

    ValidateConfig --> Phase1

    Phase1 : Phase 1 - Scrape Wiki build-docs.sh - Line 58 Calls - deepwiki-scraper.py

    state Phase1 {
        [*] --> ExtractStructure
        ExtractStructure : extract_wiki_structure() Parse main page, discover subsections
        ExtractStructure --> ExtractPages
        ExtractPages : extract_page_content() Fetch HTML, convert to markdown
        ExtractPages --> EnhanceDiagrams
        EnhanceDiagrams : extract_and_enhance_diagrams() Fuzzy match and inject diagrams
        EnhanceDiagrams --> [*]
    }

    Phase1 --> CheckMode

    CheckMode : Check MARKDOWN_ONLY flag build-docs.sh - Line 61

    state CheckMode <<choice>>
    CheckMode --> CopyMarkdown : MARKDOWN_ONLY=true
    CheckMode --> Phase2 : MARKDOWN_ONLY=false

    CopyMarkdown : Copy to /output/markdown build-docs.sh - Lines 63-75
    CopyMarkdown --> Done

    Phase2 : Phase 2 - Initialize mdBook build-docs.sh - Lines 79-106

    state Phase2 {
        [*] --> CreateBookToml
        CreateBookToml : Generate book.toml Lines 85-103
        CreateBookToml --> GenerateSummary
        GenerateSummary : Generate SUMMARY.md Lines 113-159
        GenerateSummary --> [*]
    }

    Phase2 --> Phase3

    Phase3 : Phase 3 - Build Documentation build-docs.sh - Lines 164-191

    state Phase3 {
        [*] --> InstallMermaid
        InstallMermaid : mdbook-mermaid install Line 171
        InstallMermaid --> BuildBook
        BuildBook : mdbook build Line 176
        BuildBook --> CopyOutputs
        CopyOutputs : Copy to /output Lines 184-191
        CopyOutputs --> [*]
    }

    Phase3 --> Done
    Done --> [*]
```

The execution flow has a fast path (markdown-only mode) and a complete path (full documentation build). The decision point at line 61 of build-docs.sh determines which path to take, based on the MARKDOWN_ONLY environment variable.
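
A simplified sketch of that decision point (the actual logic spans build-docs.sh:61-75):

```bash
# Simplified sketch of the MARKDOWN_ONLY branch (build-docs.sh, around line 61).
if [ "$MARKDOWN_ONLY" = "true" ]; then
    mkdir -p /output/markdown
    cp -r /workspace/wiki/. /output/markdown/   # fast path: skip Phases 2 and 3
    exit 0
fi
# ...otherwise fall through to Phase 2 (mdBook init) and Phase 3 (build)
```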

Sources: build-docs.sh:55-206 tools/deepwiki-scraper.py:790-916

Technology Stack and Integration Points

Core Technologies

| Layer | Technology | Purpose | Code Reference |
| --- | --- | --- | --- |
| Orchestration | Bash | Script coordination, environment handling | build-docs.sh:1-206 |
| Web Scraping | Python 3.12 | HTTP requests, HTML parsing | tools/deepwiki-scraper.py:1-920 |
| HTML Parsing | BeautifulSoup4 | DOM navigation, content extraction | tools/deepwiki-scraper.py:18-19 |
| HTML→MD Conversion | html2text | Clean markdown generation | tools/deepwiki-scraper.py:175-190 |
| Documentation Build | mdBook (Rust) | HTML site generation | build-docs.sh:176 |
| Diagram Rendering | mdbook-mermaid | Mermaid diagram support | build-docs.sh:171 |
| Package Management | uv | Fast Python dependency installation | Dockerfile:13-17 |
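
A quick way to confirm that the hybrid runtime holds together inside the container (purely illustrative; the Python module names are the libraries' import names):

```bash
# Sanity-check that both toolchains coexist in the final image.
command -v python3 mdbook mdbook-mermaid
python3 -c "import requests, bs4, html2text; print('scraper deps OK')"
```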

Python Dependencies Integration

The scraper uses three primary Python libraries, installed via uv: requests for HTTP session handling in fetch_page(), BeautifulSoup4 for DOM navigation and content extraction, and html2text for HTML-to-Markdown conversion in convert_html_to_markdown().
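
The corresponding installation step uses uv for speed; a representative command (the requirements.txt path is an assumption) is:

```bash
# Install the scraper's dependencies into the system interpreter,
# as the Dockerfile does via uv (Dockerfile:13-17). Path assumed.
uv pip install --system -r requirements.txt
```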

Sources: Dockerfile:16-17 tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42

File System Structure

The system uses a temporary directory workflow to ensure atomic operations:

Directory Layout at Runtime

```mermaid
graph TB
    subgraph Docker["Docker Container Filesystem"]
        subgraph Workspace["/workspace"]
            WikiTemp["/workspace/wiki\n(temporary)\nScraper output"]
            BookBuild["/workspace/book\nmdBook build directory"]
            BookSrc["/workspace/book/src\nMarkdown source files"]
        end

        subgraph Binaries["/usr/local/bin"]
            MdBook["mdbook"]
            MdBookMermaid["mdbook-mermaid"]
            Scraper["deepwiki-scraper.py"]
            BuildScript["build-docs.sh"]
        end

        subgraph Output["/output (volume mount)"]
            OutputMD["/output/markdown\nFinal markdown files"]
            OutputBook["/output/book\nHTML documentation"]
            OutputConfig["/output/book.toml"]
        end
    end

    Scraper -.->|Phase 1: Write| WikiTemp
    WikiTemp -.->|Phase 2: Enhance in-place| WikiTemp
    WikiTemp -.->|Copy| BookSrc
    BookSrc -.->|mdbook build| OutputBook
    WikiTemp -.->|Move| OutputMD
```

Workflow:

  1. Lines 808-877: The scraper writes to a temporary directory under /tmp (created by tempfile.TemporaryDirectory())
  2. Line 880: Diagram enhancement modifies files in the temporary directory
  3. Lines 887-908: Completed files are moved atomically to /output
  4. Line 166: build-docs.sh copies them to the mdBook source directory
  5. Line 176: mdBook builds HTML to /workspace/book/book
  6. Lines 184-191: Outputs are copied to the /output volume

This pattern ensures no partial or corrupted output is visible to users.
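
A minimal shell sketch of the same pattern (the scraper implements it in Python with tempfile.TemporaryDirectory(); paths here are illustrative):

```bash
# Atomic temp-directory pattern: work in scratch space, publish only at the end.
WORK_DIR=$(mktemp -d)
trap 'rm -rf "$WORK_DIR"' EXIT        # scratch space is removed even on failure

# ... write and enhance markdown files under "$WORK_DIR" ...

mkdir -p /output/markdown
mv "$WORK_DIR"/* /output/markdown/    # completed files become visible in one step
```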

Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:164-191

Configuration Management

Configuration flows from environment variables through shell script processing to generated config files:

Configuration Flow

| Input | Processor | Output | Code Reference |
| --- | --- | --- | --- |
| REPO | build-docs.sh:8-19 | Auto-detected from Git or required | build-docs.sh:8-36 |
| BOOK_TITLE | build-docs.sh:23 | Defaults to "Documentation" | build-docs.sh:23 |
| BOOK_AUTHORS | build-docs.sh:24,44 | Defaults to repo owner | build-docs.sh:24-44 |
| GIT_REPO_URL | build-docs.sh:25,45 | Constructed from REPO | build-docs.sh:25-45 |
| MARKDOWN_ONLY | build-docs.sh:26,61 | Controls pipeline execution | build-docs.sh:26-61 |
| All config | build-docs.sh:85-103 | book.toml generation | build-docs.sh:85-103 |
| File structure | build-docs.sh:113-159 | SUMMARY.md generation | build-docs.sh:113-159 |
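
A condensed sketch of how these defaults and the generated book.toml are produced (REPO_OWNER is an assumed intermediate variable; the exact config contents live at build-docs.sh:85-103):

```bash
# Defaults via shell parameter expansion, in the style of build-docs.sh.
REPO_OWNER="${REPO%%/*}"                               # "owner/repo" -> "owner" (name assumed)
BOOK_TITLE="${BOOK_TITLE:-Documentation}"
BOOK_AUTHORS="${BOOK_AUTHORS:-$REPO_OWNER}"
GIT_REPO_URL="${GIT_REPO_URL:-https://github.com/$REPO}"

# Sketch of book.toml generation; field set abbreviated.
cat > book.toml <<EOF
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]
EOF
```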

Auto-Detection Logic

The system can automatically detect repository information from Git remotes when REPO is not provided explicitly.
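
An illustrative version of that detection (the exact parsing lives at build-docs.sh:8-19; this sketch handles the two common GitHub remote URL forms):

```bash
# Derive owner/repo from the origin remote when REPO is not set.
if [ -z "$REPO" ] && git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    REMOTE_URL=$(git remote get-url origin)
    # Handles https://github.com/owner/repo(.git) and git@github.com:owner/repo(.git)
    REPO=$(printf '%s\n' "$REMOTE_URL" \
        | sed -E 's#^(git@github\.com:|https://github\.com/)##; s#\.git$##')
fi
```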

This enables zero-configuration usage in CI/CD environments where the code is already checked out.

Sources: build-docs.sh:8-45 README.md:47-53

Summary

The DeepWiki-to-mdBook Converter architecture demonstrates several key design patterns:

  1. Polyglot Orchestration: Shell coordinates Python and Rust tools, each optimized for its specific task
  2. Multi-Stage Container Build: Separates build-time tooling from runtime dependencies for minimal image size
  3. Temporary Directory Workflow: Ensures atomic operations and prevents partial output states
  4. Progressive Processing: Three distinct phases (extract, enhance, build) with an optional fast path
  5. Zero-Configuration Capability: Intelligent defaults and auto-detection minimize required configuration

The architecture prioritizes maintainability (clear separation of concerns), reliability (atomic operations), and usability (intelligent defaults) while remaining fully generic and portable.

Sources: README.md:1-233 Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920