This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Overview

Purpose and Scope

This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system’s purpose, core capabilities, and high-level architecture.

For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.

Sources: README.md:1-3

Problem Statement

DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The converter addresses the following limitations of that platform:

| Problem | Solution |
| --- | --- |
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook’s built-in search |
| Platform-specific formatting | Conversion to standard Markdown |

Sources: README.md:3-15
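
The table mentions fuzzy matching for placing extracted diagrams, but this page does not describe the matching strategy. The following is a minimal sketch of one way such matching could work, using the standard library's `difflib`. The function name `best_anchor`, the whitespace normalization, and the `0.6` threshold are illustrative assumptions, not the project's actual implementation.

```python
from difflib import SequenceMatcher
from typing import Optional


def best_anchor(context: str, lines: list[str], threshold: float = 0.6) -> Optional[int]:
    """Return the index of the line most similar to `context`, or None.

    Hypothetical helper: `context` stands for text found near a diagram in
    the page payload, and `lines` for the lines of the converted Markdown.
    """
    # Collapse whitespace and case so formatting differences do not dominate.
    needle = " ".join(context.lower().split())
    best_index, best_score = None, 0.0
    for index, line in enumerate(lines):
        candidate = " ".join(line.lower().split())
        if not candidate:
            continue  # blank lines can never be anchors
        score = SequenceMatcher(None, needle, candidate).ratio()
        if score > best_score:
            best_index, best_score = index, score
    # Only accept the best candidate if it is similar enough (assumed cutoff).
    return best_index if best_score >= threshold else None
```

A diagram whose context matches no line above the threshold would simply be left out, which is consistent with only a subset of extracted diagrams being placed.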

Core Capabilities

The system provides the following capabilities through environment variable configuration:

  • Generic Repository Support: Works with any GitHub repository indexed by DeepWiki, selected via the REPO environment variable
  • Auto-Detection: Extracts repository metadata from Git remotes when available
  • Hierarchy Preservation: Maintains wiki page numbering and section structure
  • Diagram Intelligence: Extracts ~461 total diagrams and matches ~48 with sufficient context using fuzzy matching
  • Dual Output Modes: Full mdBook build or Markdown-only extraction via the MARKDOWN_ONLY flag
  • No Authentication: Public HTTP scraping without API keys or credentials
  • Containerized Deployment: Single Docker image with all dependencies

Sources: README.md:5-15 README.md:42-51
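
As an illustration of the auto-detection capability, the sketch below derives an `owner/repo` value from a Git remote URL. The function names and the regular expression are assumptions made for this example; the page does not show how the project itself parses remotes.

```python
import re
import subprocess
from typing import Optional

# Matches HTTPS and SSH forms such as
# https://github.com/owner/repo(.git) and git@github.com:owner/repo(.git).
_REMOTE = re.compile(
    r"github\.com[:/](?P<owner>[^/\s]+)/(?P<repo>[^/\s]+?)(?:\.git)?/?$"
)


def repo_from_remote(url: str) -> Optional[str]:
    """Return 'owner/repo' for a GitHub remote URL, or None if it does not match."""
    match = _REMOTE.search(url.strip())
    if match is None:
        return None
    return f"{match['owner']}/{match['repo']}"


def detect_repo(cwd: str = ".") -> Optional[str]:
    """Read the 'origin' remote of the checkout at `cwd` (hypothetical helper)."""
    try:
        result = subprocess.run(
            ["git", "config", "--get", "remote.origin.url"],
            cwd=cwd, capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return None  # git is not installed
    return repo_from_remote(result.stdout) if result.returncode == 0 else None
```

An explicit REPO value would normally take precedence, with detection used only as a fallback.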

System Components

The system consists of three primary executable components coordinated by a shell orchestrator. The following diagram maps user interaction to specific code entities:

Component Architecture with Code Entities

graph TB
    DockerRun["docker run\nwith env vars"]

    subgraph Container["/usr/local/bin/ executables"]
        BuildDocs["build-docs.sh"]
        Scraper["deepwiki-scraper.py\nmain()\nscrape_wiki()\nextract_mermaid_from_nextjs_data()"]
        MdBook["mdbook binary\n(Rust)"]
        MermaidPlugin["mdbook-mermaid binary\n(Rust)"]
    end

    subgraph External["External HTTP Endpoints"]
        DeepWikiAPI["deepwiki.com/$REPO"]
        GitHubEdit["github.com/$REPO/edit/"]
    end

    subgraph OutputVol["/output volume mount"]
        MarkdownDir["markdown/\nnumbered .md files"]
        RawMarkdownDir["raw_markdown/\npre-enhancement .md"]
        BookDir["book/\nindex.html + search"]
        ConfigFile["book.toml"]
        SummaryFile["SUMMARY.md"]
    end

    DockerRun -->|REPO BOOK_TITLE BOOK_AUTHORS MARKDOWN_ONLY| BuildDocs

    BuildDocs -->|python3 deepwiki-scraper.py| Scraper
    BuildDocs -->|mdbook init| MdBook
    BuildDocs -->|mdbook build| MdBook
    BuildDocs -->|generates| ConfigFile
    BuildDocs -->|generates| SummaryFile

    Scraper -->|requests.get| DeepWikiAPI
    Scraper -->|writes| RawMarkdownDir
    Scraper -->|writes enhanced| MarkdownDir

    MdBook -->|preprocessor chain| MermaidPlugin
    MdBook -->|generates| BookDir

    BookDir -.->|edit links point to| GitHubEdit

Executable Components

| Component | Type | Entry Point | Key Operations |
| --- | --- | --- | --- |
| build-docs.sh | Shell script | CMD in Dockerfile | Parse $REPO, $BOOK_TITLE, generate book.toml, invoke Python and Rust tools |
| deepwiki-scraper.py | Python 3.12 module | main() function | scrape_wiki(), extract_mermaid_from_nextjs_data(), inject_mermaid_diagrams_into_markdown() |
| mdbook | Rust binary | CLI invocation | mdbook init, mdbook build with book.toml configuration |
| mdbook-mermaid | Rust preprocessor | mdBook plugin chain | Asset injection for Mermaid.js runtime |

Sources: README.md:1-27 README.md:84-88 Diagram 1, Diagram 3
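
To make the "generate book.toml" step concrete, here is a hedged sketch of template-string generation, written in Python rather than the shell used by build-docs.sh. The keys shown are standard mdBook and mdbook-mermaid configuration keys, but which keys the project actually writes is an assumption, as is the function name `render_book_toml`.

```python
import json


def render_book_toml(repo: str, title: str, authors: str) -> str:
    """Render a minimal book.toml; the field selection is illustrative only."""
    # json.dumps yields a double-quoted, escaped string, which is also a
    # valid TOML basic string for typical titles and author names.
    return "\n".join([
        "[book]",
        f"title = {json.dumps(title)}",
        f"authors = [{json.dumps(authors)}]",
        'language = "en"',
        'src = "src"',
        "",
        "[output.html]",
        f"git-repository-url = {json.dumps('https://github.com/' + repo)}",
        "",
        "[preprocessor.mermaid]",
        'command = "mdbook-mermaid"',
        "",
    ])


if __name__ == "__main__":
    print(render_book_toml("facebook/react", "React Documentation", "Meta Open Source"))
```

Quoting values through a serializer rather than plain interpolation avoids a broken configuration file when a title contains a double quote.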

Processing Pipeline

The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable. Each phase invokes specific executables and functions:

Three-Phase Execution Flow with Code Entities

stateDiagram-v2
    [*] --> ParseEnv : build-docs.sh reads env
    
    state ParseEnv {
        [*] --> ReadREPO : $REPO
        ReadREPO --> ReadBOOKTITLE : $BOOK_TITLE
        ReadBOOKTITLE --> ReadMARKDOWNONLY : $MARKDOWN_ONLY
        ReadMARKDOWNONLY --> [*]
    }
    
    ParseEnv --> Phase1 : python3 deepwiki-scraper.py
    
    state Phase1 {
        [*] --> scrape_wiki
        scrape_wiki --> BeautifulSoup4 : parse HTML
        BeautifulSoup4 --> html2text : convert to .md
        html2text --> extract_mermaid : extract_mermaid_from_nextjs_data()
        extract_mermaid --> normalize_mermaid : 7-step normalization
        normalize_mermaid --> inject_diagrams : inject_mermaid_diagrams_into_markdown()
        inject_diagrams --> write_files : /output/markdown/*.md
        write_files --> [*]
    }
    
    Phase1 --> CheckMode
    
    state CheckMode <<choice>>
    CheckMode --> Phase2 : if MARKDOWN_ONLY != true
    CheckMode --> Exit : if MARKDOWN_ONLY == true
    
    Phase2 --> GenerateBookToml : build-docs.sh writes book.toml
    GenerateBookToml --> GenerateSummary : build-docs.sh writes SUMMARY.md
    
    GenerateSummary --> Phase3
    
    state Phase3 {
        [*] --> mdbook_init : mdbook init
        mdbook_init --> mdbook_mermaid_install : mdbook-mermaid install
        mdbook_mermaid_install --> mdbook_build : mdbook build
        mdbook_build --> [*] : /output/book/
    }
    
    Phase3 --> Exit
    Exit --> [*]

Phase Execution Details

| Phase | Primary Executable | Key Functions/Commands | Artifacts |
| --- | --- | --- | --- |
| 1: Extract | deepwiki-scraper.py | scrape_wiki(), extract_mermaid_from_nextjs_data(), normalize_mermaid_code(), inject_mermaid_diagrams_into_markdown() | /output/markdown/*.md, /output/raw_markdown/*.md |
| 2: Configure | build-docs.sh | Template string generation for book.toml and SUMMARY.md | /output/book.toml, /output/SUMMARY.md |
| 3: Build | mdbook, mdbook-mermaid | mdbook init, mdbook-mermaid install, mdbook build | /output/book/index.html, /output/book/searchindex.json |

Sources: README.md:72-77 Diagram 2, Diagram 4
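
The conditional flow in the state diagram can be summarized as a small planning function. This is a sketch only: the function name `plan_pipeline` is hypothetical, scraper arguments are omitted because this page does not list them, and treating the flag case-insensitively is an assumption.

```python
from typing import Mapping


def plan_pipeline(env: Mapping[str, str]) -> list[tuple[str, str]]:
    """Return (phase, step) pairs in execution order for the given environment."""
    steps = [("1: Extract", "python3 deepwiki-scraper.py")]

    # Assumption: only the literal value "true" (any case) enables the mode.
    markdown_only = env.get("MARKDOWN_ONLY", "false").strip().lower() == "true"
    if markdown_only:
        return steps  # phases 2 and 3 are skipped entirely

    steps += [
        ("2: Configure", "write book.toml"),
        ("2: Configure", "write SUMMARY.md"),
        ("3: Build", "mdbook init"),
        ("3: Build", "mdbook-mermaid install"),
        ("3: Build", "mdbook build"),
    ]
    return steps


if __name__ == "__main__":
    for phase, step in plan_pipeline({"REPO": "facebook/react"}):
        print(f"{phase:<13} {step}")
```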

Input and Output

Input Requirements

| Input | Format | Source | Example |
| --- | --- | --- | --- |
| REPO | owner/repo | Environment variable | facebook/react |
| BOOK_TITLE | String | Environment variable (optional) | React Documentation |
| BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
| MARKDOWN_ONLY | true/false | Environment variable (optional) | false |

Sources: README.md:42-51
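
The inputs above could be read and validated along the following lines. This is a sketch under stated assumptions: the `Settings` class, the `load_settings` function, and the character set accepted for `owner/repo` are all illustrative, and the Git-remote fallback for REPO is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed shape of a valid value: two non-empty segments separated by "/".
_REPO_PATTERN = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


@dataclass(frozen=True)
class Settings:
    repo: str
    book_title: Optional[str]
    book_authors: Optional[str]
    markdown_only: bool


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an environment-like mapping, e.g. os.environ."""
    repo = env.get("REPO", "").strip()
    if not _REPO_PATTERN.match(repo):
        raise ValueError(f"REPO must look like owner/repo, got {repo!r}")
    return Settings(
        repo=repo,
        # Empty strings are treated the same as unset optional values.
        book_title=env.get("BOOK_TITLE") or None,
        book_authors=env.get("BOOK_AUTHORS") or None,
        markdown_only=env.get("MARKDOWN_ONLY", "false").strip().lower() == "true",
    )
```

Passing a mapping instead of reading `os.environ` directly keeps the function easy to exercise with test data.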

Output Artifacts

Full Build Mode (MARKDOWN_ONLY=false or unset):

output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml

Markdown-Only Mode (MARKDOWN_ONLY=true):

output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...

Sources: README.md:89-119
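
The file names in the trees above suggest a mapping from wiki page numbers and titles to output paths. The sketch below reproduces the listed names; the dotted number format (`"3.1"`), the slug rule, and the function name `page_path` are inferred from the examples and should be read as assumptions.

```python
import re
from pathlib import PurePosixPath


def page_path(number: str, title: str) -> PurePosixPath:
    """Map a page number such as '3.1' and its title to a relative output path."""
    parts = number.split(".")
    # Lowercase the title and collapse every non-alphanumeric run into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    filename = f"{'-'.join(parts)}-{slug}.md"
    if len(parts) > 1:
        # Subpages are grouped under a directory named after the parent number.
        return PurePosixPath(f"section-{parts[0]}") / filename
    return PurePosixPath(filename)


if __name__ == "__main__":
    for number, title in [("1", "Overview"), ("2", "Quick Start"), ("3.1", "Workspace")]:
        print(page_path(number, title))
```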

Technical Stack

The system combines multiple technology stacks in a single container using Docker multi-stage builds:

Runtime Dependencies

| Component | Version | Purpose | Installation Method |
| --- | --- | --- | --- |
| Python | 3.12-slim | Scraping runtime | Base image |
| requests | Latest | HTTP client | uv pip install |
| beautifulsoup4 | Latest | HTML parser | uv pip install |
| html2text | Latest | HTML to Markdown | uv pip install |
| mdbook | Latest | Documentation builder | Compiled from source (Rust) |
| mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |

Build Architecture

The Dockerfile uses a two-stage build:

  1. Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
  2. Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)

Sources: README.md:146-157 Diagram 3

File System Interaction

The system uses a temporary directory pattern to ensure atomic writes to the output volume:

Filesystem Write Pattern

graph TB
    subgraph HostFS["Host Filesystem"]
        HostOutput["./output/\n(bind mount)"]
    end

    subgraph ContainerFS["Container Filesystem"]
        BuildScript["/usr/local/bin/build-docs.sh"]
        ScraperScript["/usr/local/bin/deepwiki-scraper.py"]
        TmpWiki["/tmp/wiki_temp/\n(write buffer)"]
        OutputMount["/output/\n(volume mount)"]
        WorkspaceTemplates["/workspace/templates/\nheader.html, footer.html"]
    end

    subgraph WriteOperations["File Write Operations"]
        WriteRaw["write_markdown_file()\nraw .md to /tmp"]
        WriteEnhanced["write enhanced .md\nafter diagram injection"]
        AtomicMove["shutil.move()\nor mv command"]
        CopyBook["cp -r mdbook_output/"]
    end

    subgraph OutputStructure["/output/ final structure"]
        OutMarkdown["/output/markdown/"]
        OutRaw["/output/raw_markdown/"]
        OutBook["/output/book/"]
        OutConfig["/output/book.toml"]
    end

    HostOutput -.->|-v bind mount| OutputMount

    ScraperScript -->|Phase 1| WriteRaw
    WriteRaw --> TmpWiki
    ScraperScript --> WriteEnhanced
    WriteEnhanced --> TmpWiki

    TmpWiki -->|atomic| AtomicMove
    AtomicMove --> OutMarkdown
    AtomicMove --> OutRaw

    BuildScript -->|Phase 2| OutConfig
    BuildScript -->|Phase 3| CopyBook
    CopyBook --> OutBook

    WorkspaceTemplates -.->|process-template.py reads| BuildScript

    OutMarkdown --> OutputMount
    OutRaw --> OutputMount
    OutBook --> OutputMount
    OutConfig --> OutputMount

Write Sequence

  1. deepwiki-scraper.py writes raw Markdown to /tmp/wiki_temp/ using the write_markdown_file() function
  2. After diagram injection via inject_mermaid_diagrams_into_markdown(), the enhanced Markdown is moved to /output/markdown/
  3. build-docs.sh generates /output/book.toml from environment variables
  4. mdbook build writes HTML to an internal directory, which build-docs.sh then copies to /output/book/

This pattern ensures atomicity: partial writes never appear in /output/.

Sources: README.md:19-26 README.md:54-58 Diagram 3
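
The stage-then-move pattern described in this section can be sketched as follows. The function name `publish_markdown` and the use of `tempfile.mkdtemp` for the staging directory are assumptions for illustration; the project itself is described as staging under /tmp/wiki_temp/.

```python
import shutil
import tempfile
from pathlib import Path


def publish_markdown(pages: dict[str, str], output_dir: Path) -> None:
    """Write all pages to a staging directory, then move it into place."""
    output_dir.parent.mkdir(parents=True, exist_ok=True)
    staging = Path(tempfile.mkdtemp(prefix="wiki_temp_"))
    try:
        for relative_name, text in pages.items():
            target = staging / relative_name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(text, encoding="utf-8")
        # Nothing has touched output_dir so far; a failure above leaves the
        # previous output intact.
        if output_dir.exists():
            shutil.rmtree(output_dir)
        shutil.move(str(staging), str(output_dir))
    finally:
        # After a successful move the staging path is gone; after a failure
        # this removes the partial staging directory.
        shutil.rmtree(staging, ignore_errors=True)
```

Note that `shutil.move` is a plain rename only when source and destination are on the same filesystem; across filesystems it falls back to copying, so the window in which partial output is visible shrinks but does not strictly vanish.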

Configuration Philosophy

The system operates on three configuration principles:

  1. Environment-Driven: All customization via environment variables, no file editing required
  2. Auto-Detection: Intelligent defaults from Git remotes (repository URL, author name)
  3. Zero-Configuration: Minimal required inputs (REPO or auto-detect from current directory)

Minimal Example:

This single command triggers the complete extraction, transformation, and build pipeline.
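
For readers scripting the container from Python, the sketch below assembles a `docker run` argument list from the inputs documented on this page. The `--rm`, `-e`, and `-v` flags are standard Docker CLI options, but the image name `deepwiki-to-mdbook` is a placeholder assumption, as is the helper name `docker_run_args`; substitute the tag you actually built.

```python
import shlex
from pathlib import Path


def docker_run_args(repo: str, output_dir: Path, image: str = "deepwiki-to-mdbook") -> list[str]:
    """Build the argv for one container run (the image name is a placeholder)."""
    return [
        "docker", "run", "--rm",
        "-e", f"REPO={repo}",
        # Bind-mount the host directory onto the container's /output volume.
        "-v", f"{output_dir.resolve()}:/output",
        image,
    ]


if __name__ == "__main__":
    args = docker_run_args("facebook/react", Path("output"))
    # Print a shell-safe rendering; pass `args` to subprocess.run to execute it.
    print(shlex.join(args))
```

Building the command as a list avoids shell quoting problems when the output path contains spaces.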

For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.

Sources: README.md:22-51 README.md:220-227