This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Overview

Purpose and Scope

This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system’s purpose, core capabilities, and high-level architecture.

For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.

Sources: README.md:1-3

Problem Statement

DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The converter addresses the following limitations of that platform:

| Problem | Solution |
| --- | --- |
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook’s built-in search |
| Platform-specific formatting | Conversion to standard Markdown |

Sources: README.md:3-15
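
The fuzzy matching mentioned in the table can be illustrated with a minimal sketch. It assumes each extracted diagram comes with a snippet of surrounding context text and that the goal is to find the Markdown paragraph most similar to it. The function names, the 0.6 threshold, and the use of `difflib` are illustrative assumptions, not the scraper's actual implementation.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not affect scoring."""
    return " ".join(text.lower().split())


def best_match(context: str, paragraphs: List[str], threshold: float = 0.6) -> Optional[int]:
    """Return the index of the paragraph most similar to `context`.

    Returns None when no paragraph reaches `threshold` (a hypothetical cutoff).
    """
    target = normalize(context)
    best_index, best_score = None, 0.0
    for index, paragraph in enumerate(paragraphs):
        score = SequenceMatcher(None, target, normalize(paragraph)).ratio()
        if score > best_score:
            best_index, best_score = index, score
    return best_index if best_score >= threshold else None
```

A diagram whose context scores below the threshold against every paragraph is left unplaced, which is one plausible reason only a subset of extracted diagrams ends up matched.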

Core Capabilities

The system provides the following capabilities through environment variable configuration:

  • Generic Repository Support: Works with any GitHub repository indexed by DeepWiki, selected via the REPO environment variable
  • Auto-Detection: Extracts repository metadata from Git remotes when available
  • Hierarchy Preservation: Maintains wiki page numbering and section structure
  • Diagram Intelligence: Extracts ~461 diagrams in total and matches ~48 that have sufficient context using fuzzy matching
  • Dual Output Modes: Full mdBook build or Markdown-only extraction via the MARKDOWN_ONLY flag
  • No Authentication: Public HTTP scraping without API keys or credentials
  • Containerized Deployment: Single Docker image with all dependencies

Sources: README.md:5-15 README.md:42-51
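
To make the auto-detection capability concrete, here is a small sketch of how an `owner/repo` value could be derived from a Git remote URL. The regular expression and the function name `repo_from_remote` are assumptions made for illustration; the project's own detection logic may differ.

```python
import re
from typing import Optional

# Accepts SSH remotes (git@github.com:owner/repo.git) and HTTPS remotes
# (https://github.com/owner/repo), with or without a trailing ".git" or "/".
_GITHUB_REMOTE = re.compile(
    r"github\.com[:/](?P<owner>[^/\s]+)/(?P<repo>[^/\s]+?)(?:\.git)?/?$"
)


def repo_from_remote(remote_url: str) -> Optional[str]:
    """Return 'owner/repo' for a GitHub remote URL, or None if it is not recognized."""
    match = _GITHUB_REMOTE.search(remote_url.strip())
    if match is None:
        return None
    return f"{match.group('owner')}/{match.group('repo')}"
```

In practice the input would typically be the output of `git remote get-url origin`, and a `None` result would mean REPO has to be supplied explicitly.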

System Components

The system consists of three primary executable components coordinated by a shell orchestrator:

Main Components

| Component | Language | Purpose | Key Functions |
| --- | --- | --- | --- |
| build-docs.sh | Shell | Orchestration | Parse env vars, generate configs, call executables |
| deepwiki-scraper.py | Python 3.12 | Content extraction | HTTP scraping, HTML parsing, diagram matching |
| mdbook | Rust | Site generation | Markdown to HTML, navigation, search |
| mdbook-mermaid | Rust | Diagram rendering | Inject JavaScript/CSS for Mermaid.js |

Sources: README.md:146-157 Diagram 1, Diagram 5

Processing Pipeline

The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable:

Phase Details

| Phase | Script | Key Operations | Output |
| --- | --- | --- | --- |
| 1 | deepwiki-scraper.py | HTTP fetch, BeautifulSoup4 parse, html2text conversion, fuzzy diagram matching | markdown/*.md |
| 2 | build-docs.sh | Generate book.toml, generate SUMMARY.md | Configuration files |
| 3 | mdbook + mdbook-mermaid | Markdown processing, Mermaid.js asset injection, HTML generation | book/ directory |

Sources: README.md:121-145 Diagram 2
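
Phase 2 generates SUMMARY.md, the file mdBook reads to build its table of contents. In the project this step lives in build-docs.sh; the Python sketch below only illustrates the idea. It assumes that titles can be derived from file names and that directory depth maps to list indentation, neither of which is confirmed by this page.

```python
from pathlib import PurePosixPath
from typing import List


def title_from_stem(stem: str) -> str:
    """Turn '3-1-workspace' into 'Workspace' by dropping the numeric prefix parts."""
    words = [part for part in stem.split("-") if not part.isdigit()]
    return " ".join(words).title()


def build_summary(relative_paths: List[str]) -> str:
    """Render an mdBook-style SUMMARY.md from Markdown paths, keeping the given order."""
    lines = ["# Summary", ""]
    for path in relative_paths:
        posix = PurePosixPath(path)
        # One indentation level per directory component (an assumed convention).
        indent = "  " * (len(posix.parts) - 1)
        lines.append(f"{indent}- [{title_from_stem(posix.stem)}]({posix.as_posix()})")
    return "\n".join(lines) + "\n"
```

Note that an indented entry in SUMMARY.md becomes a child of the entry above it, so the order of the input paths determines which top-level chapter a nested page attaches to.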

Input and Output

Input Requirements

| Input | Format | Source | Example |
| --- | --- | --- | --- |
| REPO | owner/repo | Environment variable | facebook/react |
| BOOK_TITLE | String | Environment variable (optional) | React Documentation |
| BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
| MARKDOWN_ONLY | true/false | Environment variable (optional) | false |

Sources: README.md:42-51
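
The inputs in the table can be modeled as a small configuration object. The sketch below is a hypothetical Python rendering of that idea: the `BuildConfig` class, the `load_config` function, and the `owner/repo` validation pattern are assumptions for illustration, and the sketch does not cover the Git-remote fallback used when REPO is unset.

```python
import re
from dataclasses import dataclass
from typing import Mapping, Optional

# Loose check for the owner/repo shape shown in the table (an assumed rule).
_REPO_PATTERN = re.compile(r"^[^/\s]+/[^/\s]+$")


@dataclass(frozen=True)
class BuildConfig:
    repo: str
    book_title: Optional[str]
    book_authors: Optional[str]
    markdown_only: bool


def load_config(env: Mapping[str, str]) -> BuildConfig:
    """Build a config from an environment-like mapping such as os.environ."""
    repo = env.get("REPO", "").strip()
    if not _REPO_PATTERN.match(repo):
        raise ValueError(f"REPO must look like owner/repo, got {repo!r}")
    return BuildConfig(
        repo=repo,
        book_title=env.get("BOOK_TITLE") or None,
        book_authors=env.get("BOOK_AUTHORS") or None,
        # Unset or any value other than "true" selects the full build.
        markdown_only=env.get("MARKDOWN_ONLY", "false").strip().lower() == "true",
    )
```

Calling `load_config(os.environ)` at startup would fail fast on a malformed REPO value instead of surfacing the problem midway through scraping.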

Output Artifacts

Full Build Mode (MARKDOWN_ONLY=false or unset):

output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml

Markdown-Only Mode (MARKDOWN_ONLY=true):

output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...

Sources: README.md:89-119
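
The naming pattern visible in the trees above (top-level pages at the root, sub-pages under a section-N/ directory) can be sketched as a small path-mapping function. The dotted page-number input format and the slug rule are assumptions inferred from the example file names, not a documented contract.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def output_path(page_number: str, title: str) -> str:
    """Map a page number such as '3.1' and a title to a relative Markdown path."""
    parts = page_number.split(".")
    filename = f"{'-'.join(parts)}-{slugify(title)}.md"
    if len(parts) > 1:
        # Sub-pages are grouped under a directory named after their top-level section.
        return f"section-{parts[0]}/{filename}"
    return filename
```

For example, `output_path("3.1", "Workspace")` yields `section-3/3-1-workspace.md`, matching the layout shown in the full-build tree.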

Technical Stack

The system combines multiple technology stacks in a single container using Docker multi-stage builds:

Runtime Dependencies

| Component | Version | Purpose | Installation Method |
| --- | --- | --- | --- |
| Python | 3.12-slim | Scraping runtime | Base image |
| requests | Latest | HTTP client | uv pip install |
| beautifulsoup4 | Latest | HTML parser | uv pip install |
| html2text | Latest | HTML to Markdown | uv pip install |
| mdbook | Latest | Documentation builder | Compiled from source (Rust) |
| mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |

Build Architecture

The Dockerfile uses a two-stage build:

  1. Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
  2. Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)

Sources: README.md:146-157 Diagram 3

File System Interaction

The system interacts with three key filesystem locations:

Temporary Directory Workflow:

  1. deepwiki-scraper.py writes initial markdown to /tmp/wiki_temp/
  2. After diagram enhancement, files move atomically to /output/markdown/
  3. build-docs.sh copies final HTML to /output/book/

This ensures no partial states exist in the output directory.

Sources: README.md:220-227 README.md:136
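
The stage-then-publish pattern described above can be sketched as follows. The `publish` function and its parameters are hypothetical; the sketch assumes the staging directory has been completely written before it is called.

```python
import shutil
from pathlib import Path


def publish(staging_dir: Path, final_dir: Path) -> None:
    """Move a fully written staging directory into its final location."""
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    if final_dir.exists():
        # Drop any previous result so the move below lands on a fresh path.
        shutil.rmtree(final_dir)
    # shutil.move renames when both paths share a filesystem and falls back
    # to copy-then-delete when they do not.
    shutil.move(str(staging_dir), str(final_dir))
```

One caveat worth keeping in mind: a rename is only atomic within a single filesystem. If the temporary directory and the output directory live on different mounts, as can happen with a Docker volume, the move degrades to a copy, so staging next to the destination is the safer choice when strict atomicity matters.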

Configuration Philosophy

The system operates on three configuration principles:

  1. Environment-Driven: All customization via environment variables, no file editing required
  2. Auto-Detection: Intelligent defaults from Git remotes (repository URL, author name)
  3. Zero-Configuration: Minimal required inputs (REPO or auto-detect from current directory)

Minimal Example:

This single command triggers the complete extraction, transformation, and build pipeline.
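
As a rough illustration of what such an invocation might look like, the sketch below assembles a `docker run` command line in Python. The image name `deepwiki-to-mdbook` and the host output path are placeholders invented for this example, and mounting the host directory at `/output` is an assumption based on the output paths described earlier.

```python
import shlex
from typing import List


def docker_command(repo: str, host_output_dir: str, image: str = "deepwiki-to-mdbook") -> List[str]:
    """Build an argument list for running the converter container.

    `image` is a hypothetical tag; substitute whatever name the image was built with.
    """
    return [
        "docker", "run", "--rm",
        "-e", f"REPO={repo}",               # the only required input
        "-v", f"{host_output_dir}:/output",  # assumed mount point for results
        image,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(docker_command("facebook/react", "/tmp/out")))
```

Passing the resulting list to `subprocess.run` would execute it; printing it first makes the placeholders easy to inspect and replace.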

For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.

Sources: README.md:22-51 README.md:220-227
