
Overview


Purpose and Scope

This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system's purpose, core capabilities, and high-level architecture.

For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.

Sources: README.md:1-3

Problem Statement

DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The system addresses the following limitations:

| Problem | Solution |
|---|---|
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook's built-in search |
| Platform-specific formatting | Conversion to standard Markdown |

Sources: README.md:3-15

Core Capabilities

The system provides the following capabilities through environment variable configuration:

  • Generic Repository Support : Works with any GitHub repository indexed by DeepWiki via REPO environment variable
  • Auto-Detection : Extracts repository metadata from Git remotes when available
  • Hierarchy Preservation : Maintains wiki page numbering and section structure
  • Diagram Intelligence : Extracts ~461 total diagrams, matches ~48 with sufficient context using fuzzy matching
  • Dual Output Modes : Full mdBook build or markdown-only extraction via MARKDOWN_ONLY flag
  • No Authentication : Public HTTP scraping without API keys or credentials
  • Containerized Deployment : Single Docker image with all dependencies

Sources: README.md:5-15 README.md:42-51

System Components

The system consists of three primary executable components coordinated by a shell orchestrator:

Main Components

graph TB
    User["docker run"]
subgraph Container["deepwiki-scraper Container"]
BuildDocs["build-docs.sh\n(Shell Orchestrator)"]
Scraper["deepwiki-scraper.py\n(Python)"]
MdBook["mdbook\n(Rust Binary)"]
MermaidPlugin["mdbook-mermaid\n(Rust Binary)"]
end
    
    subgraph External["External Systems"]
DeepWiki["deepwiki.com\n(HTTP Scraping)"]
GitHub["github.com\n(Edit Links)"]
end
    
    subgraph Output["Output Directory"]
MarkdownDir["markdown/\n(.md files)"]
BookDir["book/\n(HTML site)"]
ConfigFile["book.toml"]
end
    
 
   User -->|Environment Variables| BuildDocs
 
   BuildDocs -->|Step 1: Execute| Scraper
 
   BuildDocs -->|Step 4: Execute| MdBook
    
 
   Scraper -->|HTTP GET| DeepWiki
 
   Scraper -->|Writes| MarkdownDir
    
 
   MdBook -->|Preprocessor| MermaidPlugin
 
   MdBook -->|Generates| BookDir
    
 
   BookDir -.->|Edit links| GitHub
 
   BuildDocs -->|Copies| ConfigFile
    
    style BuildDocs fill:#fff4e1
    style Scraper fill:#e8f5e9
    style MdBook fill:#f3e5f5

| Component | Language | Purpose | Key Functions |
|---|---|---|---|
| build-docs.sh | Shell | Orchestration | Parse env vars, generate configs, call executables |
| deepwiki-scraper.py | Python 3.12 | Content extraction | HTTP scraping, HTML parsing, diagram matching |
| mdbook | Rust | Site generation | Markdown to HTML, navigation, search |
| mdbook-mermaid | Rust | Diagram rendering | Inject JavaScript/CSS for Mermaid.js |

Sources: README.md:146-157 Diagram 1, Diagram 5

Processing Pipeline

The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable:

Phase Details

stateDiagram-v2
    [*] --> ParseEnvVars
    ParseEnvVars --> ExecuteScraper : build-docs.sh phase 1
    
    state ExecuteScraper {
        [*] --> FetchHTML
        FetchHTML --> ConvertMarkdown : html2text
        ConvertMarkdown --> ExtractDiagrams : Regex on JS payload
        ExtractDiagrams --> FuzzyMatch : Progressive chunks
        FuzzyMatch --> WriteMarkdown : output/markdown/
        WriteMarkdown --> [*]
    }
    
    ExecuteScraper --> CheckMode
    
    state CheckMode <<choice>>
    CheckMode --> GenerateBookToml : MARKDOWN_ONLY=false
    CheckMode --> CopyOutput : MARKDOWN_ONLY=true
    
    GenerateBookToml --> GenerateSummary : build-docs.sh phase 2
    GenerateSummary --> ExecuteMdbook : build-docs.sh phase 3
    
    state ExecuteMdbook {
        [*] --> InitBook
        InitBook --> CopyMarkdown : mdbook init
        CopyMarkdown --> InstallMermaid : mdbook-mermaid install
        InstallMermaid --> BuildHTML : mdbook build
        BuildHTML --> [*] : output/book/
    }
    
    ExecuteMdbook --> CopyOutput
    CopyOutput --> [*]

| Phase | Script | Key Operations | Output |
|---|---|---|---|
| 1 | deepwiki-scraper.py | HTTP fetch, BeautifulSoup4 parse, html2text conversion, fuzzy diagram matching | markdown/*.md |
| 2 | build-docs.sh | Generate book.toml, generate SUMMARY.md | Configuration files |
| 3 | mdbook + mdbook-mermaid | Markdown processing, Mermaid.js asset injection, HTML generation | book/ directory |

Sources: README.md:121-145 Diagram 2

Input and Output

Input Requirements

| Input | Format | Source | Example |
|---|---|---|---|
| REPO | owner/repo | Environment variable | facebook/react |
| BOOK_TITLE | String | Environment variable (optional) | React Documentation |
| BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
| MARKDOWN_ONLY | true/false | Environment variable (optional) | false |

Sources: README.md:42-51

Output Artifacts

Full Build Mode (MARKDOWN_ONLY=false or unset):

output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml

Markdown-Only Mode (MARKDOWN_ONLY=true):

output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...

Sources: README.md:89-119

Technical Stack

The system combines multiple technology stacks in a single container using Docker multi-stage builds:

Runtime Dependencies

| Component | Version | Purpose | Installation Method |
|---|---|---|---|
| Python | 3.12-slim | Scraping runtime | Base image |
| requests | Latest | HTTP client | uv pip install |
| beautifulsoup4 | Latest | HTML parser | uv pip install |
| html2text | Latest | HTML to Markdown | uv pip install |
| mdbook | Latest | Documentation builder | Compiled from source (Rust) |
| mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |

Build Architecture

The Dockerfile uses a two-stage build:

  1. Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
  2. Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)

Sources: README.md:146-157 Diagram 3

File System Interaction

The system interacts with three key filesystem locations:

Temporary Directory Workflow :

  1. deepwiki-scraper.py writes initial markdown to /tmp/wiki_temp/
  2. After diagram enhancement, files move atomically to /output/markdown/
  3. build-docs.sh copies final HTML to /output/book/

This ensures no partial states exist in the output directory.

Sources: README.md:220-227 README.md136

Configuration Philosophy

The system operates on three configuration principles:

  1. Environment-Driven : All customization via environment variables, no file editing required
  2. Auto-Detection : Intelligent defaults from Git remotes (repository URL, author name)
  3. Zero-Configuration : Minimal required inputs (REPO or auto-detect from current directory)

Minimal Example :

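A sketch of that command (the deepwiki-scraper image tag is an assumption; substitute whatever tag you used when building the image):

docker run --rm \
  -e REPO="owner/repo" \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
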
This single command triggers the complete extraction, transformation, and build pipeline.

For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.

Sources: README.md:22-51 README.md:220-227


Quick Start


This page provides practical instructions for running the DeepWiki-to-mdBook Converter using Docker. It covers building the image, running basic conversions, and accessing the output. For detailed configuration options, see Configuration Reference. For understanding what happens internally, see System Architecture.

Prerequisites

The following must be available on your system:

| Requirement | Purpose |
|---|---|
| Docker | Runs the containerized conversion system |
| Internet connection | Required to fetch content from DeepWiki.com |
| Disk space | ~500MB for Docker image, variable for output |

Sources: README.md:17-20

Building the Docker Image

The system is distributed as a Dockerfile that must be built before use. The build process compiles Rust tools (mdBook, mdbook-mermaid) and installs Python dependencies.

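A typical build invocation, run from the project root (the deepwiki-scraper image tag is an assumption, not something the project mandates):

docker build -t deepwiki-scraper .
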
The build process uses multi-stage Docker builds and takes approximately 5-10 minutes on first run. Subsequent builds use Docker layer caching for faster completion.

Note: For detailed information about the Docker build architecture, see Docker Multi-Stage Build.

Sources: README.md:29-31 build-docs.sh:1-5

Basic Usage Pattern

The converter runs as a Docker container that takes environment variables as input and produces output files in a mounted volume.

sequenceDiagram
    participant User
    participant Docker
    participant Container as "deepwiki-scraper\ncontainer"
    participant DeepWiki as "deepwiki.com"
    participant OutputVol as "/output volume"
    
    User->>Docker: docker run --rm -e REPO=...
    Docker->>Container: Start with env vars
    Container->>Container: build-docs.sh orchestrates
    Container->>DeepWiki: HTTP requests for wiki pages
    DeepWiki-->>Container: HTML content + JS payload
    Container->>Container: deepwiki-scraper.py extracts
    Container->>Container: mdbook build (unless MARKDOWN_ONLY)
    Container->>OutputVol: Write markdown/ and book/
    Container-->>Docker: Exit (status 0)
    Docker-->>User: Container removed (--rm)
    User->>OutputVol: Access generated files

User Interaction Flow

Sources: README.md:24-39 build-docs.sh:1-206

Minimal Command

The absolute minimum command requires only the REPO environment variable:

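A hedged sketch of that invocation (assuming the image was tagged deepwiki-scraper at build time):

docker run --rm \
  -e REPO="owner/repo" \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
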
This command:

  • Uses -e REPO="owner/repo" to specify which GitHub repository's wiki to extract
  • Mounts the current directory's output/ subdirectory to /output in the container
  • Uses --rm to automatically remove the container after completion
  • Generates default values for BOOK_TITLE, BOOK_AUTHORS, and GIT_REPO_URL

Sources: README.md:34-38 build-docs.sh:22-26

Environment Variable Configuration

Sources: build-docs.sh:8-53 README.md:42-51

The following table describes each environment variable:

| Variable | Required | Default Behavior | Example |
|---|---|---|---|
| REPO | Yes* | Auto-detected from Git remote if available | facebook/react |
| BOOK_TITLE | No | "Documentation" | "React Internals" |
| BOOK_AUTHORS | No | Extracted from REPO owner | "Meta Open Source" |
| GIT_REPO_URL | No | Constructed as https://github.com/{REPO} | Custom fork URL |
| MARKDOWN_ONLY | No | "false" (build full HTML) | "true" for debugging |

  • *REPO is required unless running from a Git repository with a GitHub remote, in which case it is auto-detected via build-docs.sh:8-19

Sources: README.md:42-51 build-docs.sh:8-53

Common Usage Patterns

Pattern 1: Complete Documentation Build

Generate both Markdown source and HTML documentation:

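For example (BOOK_TITLE and BOOK_AUTHORS are optional; the image tag is assumed):

docker run --rm \
  -e REPO="facebook/react" \
  -e BOOK_TITLE="React Documentation" \
  -e BOOK_AUTHORS="Meta Open Source" \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
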
Produces:

  • /output/markdown/ - Source Markdown files with diagrams
  • /output/book/ - Complete HTML site with search and navigation
  • /output/book.toml - mdBook configuration

Use when: You want a deployable documentation website.

Sources: README.md:74-87 build-docs.sh:178-192

Pattern 2: Markdown-Only Mode (Fast Iteration)

Extract only Markdown files, skipping the HTML build phase:

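A sketch of the markdown-only invocation (same assumed image tag):

docker run --rm \
  -e REPO="owner/repo" \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
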
Produces:

  • /output/markdown/ - Source Markdown files only

Use when:

  • Debugging diagram placement
  • Testing content extraction
  • You only need Markdown files
  • Faster iteration cycles (~3-5x faster than full build)

Skips: Phase 3 (mdBook build), as controlled by build-docs.sh:61-76

Sources: README.md:55-72 build-docs.sh:61-76

Pattern 3: Custom Output Directory

Mount a different output location:

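For instance (the repository shown is purely illustrative):

docker run --rm \
  -e REPO="rust-lang/rust" \
  -v /home/user/docs/rust:/output \
  deepwiki-scraper
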
This writes output to /home/user/docs/rust instead of ./output.

Sources: README.md:200-207

Pattern 4: Minimal Configuration with Auto-Detection

If running from a Git repository directory:

The system extracts the repository from git config --get remote.origin.url via build-docs.sh:8-19. This only works when running the Docker command from within a Git repository with a GitHub remote configured.

Sources: build-docs.sh:8-19 README.md53

Output Structure

Complete Build Output

When MARKDOWN_ONLY=false (default), the output structure is:

output/
├── markdown/               # Source Markdown files
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-subsection.md
│   │   └── 3-2-subsection.md
│   └── ...
├── book/                   # Generated HTML documentation
│   ├── index.html
│   ├── 1-overview.html
│   ├── searchindex.js
│   ├── mermaid/            # Diagram rendering assets
│   └── ...
└── book.toml               # mdBook configuration

Sources: README.md:89-120 build-docs.sh:178-192

File Naming Convention

Files follow the pattern {number}-{title}.md where:

  • {number} is the hierarchical page number (e.g., 1, 2-1, 3-2)
  • {title} is a URL-safe version of the page title

Subsection files are organized in section-{N}/ subdirectories, where {N} is the parent section number.

Examples from README.md:115-119:

  • 1-overview.md - Top-level page 1
  • 2-1-workspace-and-crates.md - Subsection 1 of section 2
  • section-4/4-1-logical-planning.md - Subsection 1 of section 4, stored in subdirectory

Sources: README.md:115-119

Viewing the Output

Serving HTML Documentation Locally

After a complete build, serve the HTML site using Python's built-in HTTP server:

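One way to do this:

cd output/book
python3 -m http.server 8000
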
Then open http://localhost:8000 in your browser.

The generated site includes:

  • Full-text search via searchindex.js
  • Responsive navigation sidebar with page hierarchy
  • Rendered Mermaid diagrams
  • "Edit this page" links to the GitHub repository
  • Dark/light theme toggle

Sources: README.md:83-86 build-docs.sh:203-204

Accessing Markdown Files

Markdown files can be read directly or used with other tools:

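For example (the pandoc conversion is just an illustration of feeding the files to another tool):

ls output/markdown/
less output/markdown/1-overview.md
pandoc output/markdown/1-overview.md -o overview.html
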
Sources: README.md:100-113

Execution Flow

Sources: build-docs.sh:1-206 README.md:121-145

Quick Troubleshooting

"REPO must be set"

Error message: ERROR: REPO must be set or run from within a Git repository

Cause: The REPO environment variable was not provided and could not be auto-detected.

Solution:

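Pass the repository explicitly (image tag assumed as before):

docker run --rm -e REPO="owner/repo" -v "$(pwd)/output:/output" deepwiki-scraper
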
Sources: build-docs.sh:32-37

"No wiki pages found"

Cause: The repository may not be indexed by DeepWiki.

Solution: Verify the wiki exists by visiting https://deepwiki.com/owner/repo in a browser. Not all GitHub repositories have DeepWiki documentation.

Sources: README.md:160-161

Connection Timeouts

Cause: Network issues or DeepWiki service unavailable.

Solution: The scraper includes automatic retries (3 attempts per page). Wait and retry the command. Check your internet connection.

Sources: README.md:171-172

mdBook Build Fails

Error symptoms: Build completes Phase 1 and 2 but fails during Phase 3.

Solutions:

  1. Ensure Docker has sufficient memory (2GB+ recommended)

  2. Try MARKDOWN_ONLY=true to verify extraction works independently:

  3. Check Docker logs for Rust compilation errors
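
For step 2, a sketch of the markdown-only run (assumed image tag):

docker run --rm -e REPO="owner/repo" -e MARKDOWN_ONLY=true -v "$(pwd)/output:/output" deepwiki-scraper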

Sources: README.md:174-177

Diagrams Not Appearing

Cause: Fuzzy matching may not find appropriate placement context for some diagrams.

Debugging approach:

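One approach is to run in markdown-only mode and then check which files received diagrams (commands are illustrative):

docker run --rm -e REPO="owner/repo" -e MARKDOWN_ONLY=true -v "$(pwd)/output:/output" deepwiki-scraper
grep -rl '```mermaid' output/markdown/
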
Not all diagrams can be matched—typically ~48 out of ~461 diagrams have sufficient context for accurate placement.

Sources: README.md:166-169 README.md:132-135

Next Steps

After successfully generating documentation:

Sources: README.md:1-233


Configuration Reference


This document provides a comprehensive reference for all configuration options available in the DeepWiki-to-mdBook Converter system. It covers environment variables, their default values, validation logic, auto-detection features, and how configuration flows through the system components.

For information about running the system with these configurations, see Quick Start. For details on how auto-detection works internally, see Auto-Detection Features.

Configuration System Overview

The DeepWiki-to-mdBook Converter uses environment variables as its sole configuration mechanism. All configuration is processed by the build-docs.sh orchestrator script at runtime, with no configuration files required. The system provides intelligent defaults and auto-detection capabilities to minimize required configuration.

Configuration Flow Diagram

flowchart TD
    User["User/CI System"]
Docker["docker run -e VAR=value"]
subgraph "build-docs.sh Configuration Processing"
        AutoDetect["Git Auto-Detection\n[build-docs.sh:8-19]"]
ParseEnv["Environment Variable Parsing\n[build-docs.sh:21-26]"]
Defaults["Default Value Assignment\n[build-docs.sh:43-45]"]
Validate["Validation\n[build-docs.sh:32-37]"]
end
    
    subgraph "Configuration Consumers"
        Scraper["deepwiki-scraper.py\nREPO parameter"]
BookToml["book.toml Generation\n[build-docs.sh:85-103]"]
SummaryGen["SUMMARY.md Generation\n[build-docs.sh:113-159]"]
end
    
 
   User -->|Set environment variables| Docker
 
   Docker -->|Container startup| AutoDetect
 
   AutoDetect -->|REPO detection| ParseEnv
 
   ParseEnv -->|Parse all vars| Defaults
 
   Defaults -->|Apply defaults| Validate
 
   Validate -->|REPO validated| Scraper
 
   Validate -->|BOOK_TITLE, BOOK_AUTHORS, GIT_REPO_URL| BookToml
 
   Validate -->|No direct config needed| SummaryGen

Sources: build-docs.sh:1-206 README.md:41-51

Environment Variables Reference

The following table lists all environment variables supported by the system:

| Variable | Type | Required | Default | Description |
|---|---|---|---|---|
| REPO | String | Conditional | Auto-detected from Git remote | GitHub repository in owner/repo format. Required if not running in a Git repository with a GitHub remote. |
| BOOK_TITLE | String | No | "Documentation" | Title displayed in the generated mdBook documentation. Used in book.toml title field. |
| BOOK_AUTHORS | String | No | Repository owner (from REPO) | Author name(s) displayed in the documentation. Used in book.toml authors array. |
| GIT_REPO_URL | String | No | https://github.com/{REPO} | Full GitHub repository URL. Used for "Edit this page" links in mdBook output. |
| MARKDOWN_ONLY | Boolean | No | "false" | When "true", skips Phase 3 (mdBook build) and outputs only extracted Markdown files. Useful for debugging. |

Sources: build-docs.sh:21-26 README.md:44-51

Variable Details and Usage

REPO

Format: owner/repo (e.g., "facebook/react" or "microsoft/vscode")

Purpose: Identifies the GitHub repository to scrape from DeepWiki.com. This is the primary configuration variable that drives the entire system.

flowchart TD
    Start["build-docs.sh Startup"]
CheckEnv{"REPO environment\nvariable set?"}
UseEnv["Use provided REPO value\n[build-docs.sh:22]"]
CheckGit{"Git repository\ndetected?"}
GetRemote["Execute: git config --get\nremote.origin.url\n[build-docs.sh:12]"]
ParseURL["Extract owner/repo using regex:\n.*github\\.com[:/]([^/]+/[^/\\.]+)\n[build-docs.sh:16]"]
SetRepo["Set REPO variable\n[build-docs.sh:16]"]
ValidateRepo{"REPO is set?"}
Error["Exit with error\n[build-docs.sh:33-37]"]
Continue["Continue with\nREPO=$REPO_OWNER/$REPO_NAME"]
Start --> CheckEnv
 
   CheckEnv -->|Yes| UseEnv
 
   CheckEnv -->|No| CheckGit
 
   CheckGit -->|Yes| GetRemote
 
   CheckGit -->|No| ValidateRepo
 
   GetRemote --> ParseURL
 
   ParseURL --> SetRepo
 
   UseEnv --> ValidateRepo
 
   SetRepo --> ValidateRepo
 
   ValidateRepo -->|No| Error
 
   ValidateRepo -->|Yes| Continue

Auto-Detection Logic:

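A minimal sketch of this logic, assuming a sed-based implementation of the regex shown in the diagram above (the actual script may differ in detail):

if [ -z "$REPO" ] && git rev-parse --git-dir >/dev/null 2>&1; then
  # Read the origin URL and extract the owner/repo component
  REMOTE_URL=$(git config --get remote.origin.url)
  REPO=$(echo "$REMOTE_URL" | sed -E 's#.*github\.com[:/]([^/]+/[^/.]+).*#\1#')
fi
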
Sources: build-docs.sh:8-37

Validation: The system exits with an error if REPO is not set and cannot be auto-detected:

ERROR: REPO must be set or run from within a Git repository with a GitHub remote
Usage: REPO=owner/repo $0

Usage in System:

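The resolved value is passed to the scraper as a positional argument (argument order inferred from the pipeline diagrams; treat the exact call as an approximation):

python3 /usr/local/bin/deepwiki-scraper.py "$REPO" "$WIKI_DIR"
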
BOOK_TITLE

Default: "Documentation"

Purpose: Sets the title of the generated mdBook documentation. This appears in the browser tab, navigation header, and book metadata.

Usage: Injected into book.toml configuration file build-docs.sh87:

Examples:

  • BOOK_TITLE="React Documentation"
  • BOOK_TITLE="VS Code Internals"
  • BOOK_TITLE="Apache Arrow DataFusion Developer Guide"

Sources: build-docs.sh23 build-docs.sh87

BOOK_AUTHORS

Default: Repository owner extracted from REPO

Purpose: Sets the author name(s) in the mdBook documentation metadata.

Default Assignment Logic: build-docs.sh44

This uses shell parameter expansion to set BOOK_AUTHORS to REPO_OWNER only if BOOK_AUTHORS is unset or empty.
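
In shell parameter-expansion form, the described behavior is equivalent to:

BOOK_AUTHORS="${BOOK_AUTHORS:-$REPO_OWNER}"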

Usage: Injected into book.toml as an array build-docs.sh88:

Examples:

  • If REPO="facebook/react" and BOOK_AUTHORS not set → BOOK_AUTHORS="facebook"
  • Explicitly set: BOOK_AUTHORS="Meta Open Source"
  • Multiple authors: BOOK_AUTHORS="John Doe, Jane Smith" (rendered as single string in array)

Sources: build-docs.sh24 build-docs.sh44 build-docs.sh88

GIT_REPO_URL

Default: https://github.com/{REPO}

Purpose: Provides the full GitHub repository URL used for "Edit this page" links in the generated mdBook documentation. Each page includes a link back to the source repository.

Default Assignment Logic: build-docs.sh45
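
Equivalent shell form (a sketch based on the described default):

GIT_REPO_URL="${GIT_REPO_URL:-https://github.com/$REPO}"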

Usage: Injected into book.toml configuration build-docs.sh95:

Notes:

  • mdBook automatically appends /edit/main/ or similar paths based on its heuristics
  • The URL must be a valid Git repository URL for the edit links to work correctly
  • Can be overridden for non-standard Git hosting scenarios

Sources: build-docs.sh25 build-docs.sh45 build-docs.sh95

MARKDOWN_ONLY

Default: "false"

Type: Boolean string ("true" or "false")

Purpose: Controls whether the system executes the full three-phase pipeline or stops after Phase 2 (Markdown extraction with diagram enhancement). When set to "true", Phase 3 (mdBook build) is skipped.

flowchart TD
    Start["build-docs.sh Execution"]
Phase1["Phase 1: Scrape & Extract\n[build-docs.sh:56-58]"]
Phase2["Phase 2: Enhance Diagrams\n(within deepwiki-scraper.py)"]
CheckMode{"MARKDOWN_ONLY\n== 'true'?\n[build-docs.sh:61]"}
CopyMD["Copy markdown to /output/markdown\n[build-docs.sh:64-65]"]
ExitEarly["Exit (skipping mdBook build)\n[build-docs.sh:75]"]
Phase3Init["Phase 3: Initialize mdBook\n[build-docs.sh:79-106]"]
BuildBook["Build HTML documentation\n[build-docs.sh:176]"]
CopyAll["Copy all outputs\n[build-docs.sh:179-191]"]
Start --> Phase1
 
   Phase1 --> Phase2
 
   Phase2 --> CheckMode
 
   CheckMode -->|Yes| CopyMD
 
   CopyMD --> ExitEarly
 
   CheckMode -->|No| Phase3Init
 
   Phase3Init --> BuildBook
 
   BuildBook --> CopyAll
    
    style ExitEarly fill:#ffebee
    style CopyAll fill:#e8f5e9

Execution Flow with MARKDOWN_ONLY:

Sources: build-docs.sh26 build-docs.sh:61-76

Use Cases:

  • Debugging diagram placement: Quickly iterate on diagram matching without waiting for mdBook build
  • Markdown-only extraction: When you only need the Markdown source files
  • Faster feedback loops: mdBook build adds significant time; skipping it speeds up testing
  • Custom processing: Extract Markdown for processing with different documentation tools

Output Differences:

| Mode | Output Directory Structure |
|---|---|
| MARKDOWN_ONLY="false" (default) | /output/book/ (HTML site), /output/markdown/ (source), /output/book.toml (config) |
| MARKDOWN_ONLY="true" | /output/markdown/ (source only) |

Performance Impact: Markdown-only mode is approximately 3-5x faster, as it skips book.toml generation, SUMMARY.md generation, mdbook-mermaid asset installation, and the mdbook HTML build.

Sources: build-docs.sh:61-76 README.md:55-76

Internal Configuration Variables

These variables are derived or used internally and are not meant to be configured by users:

| Variable | Source | Purpose |
|---|---|---|
| WORK_DIR | Hard-coded: /workspace (build-docs.sh27) | Temporary working directory inside container |
| WIKI_DIR | Derived: $WORK_DIR/wiki (build-docs.sh28) | Directory where deepwiki-scraper.py outputs Markdown |
| OUTPUT_DIR | Hard-coded: /output (build-docs.sh29) | Container output directory (mounted as volume) |
| BOOK_DIR | Derived: $WORK_DIR/book (build-docs.sh30) | mdBook project directory |
| REPO_OWNER | Extracted from REPO (build-docs.sh40) | First component of owner/repo |
| REPO_NAME | Extracted from REPO (build-docs.sh41) | Second component of owner/repo |

Sources: build-docs.sh:27-30 build-docs.sh:40-41

Configuration Precedence and Inheritance

The system follows this precedence order for configuration values:

Sources: build-docs.sh:8-45

Example Scenarios:

  1. User provides all values:

All explicit values used; no auto-detection occurs.

  2. User provides only REPO:

    • REPO: "facebook/react" (explicit)
    • BOOK_TITLE: "Documentation" (default)
    • BOOK_AUTHORS: "facebook" (derived from REPO)
    • GIT_REPO_URL: "https://github.com/facebook/react" (derived)
    • MARKDOWN_ONLY: "false" (default)
  3. User provides no values in Git repo:

    • REPO: Auto-detected from git config --get remote.origin.url
    • All other values derived or defaulted as above

Generated Configuration Files

The system generates configuration files dynamically based on environment variables:

book.toml

Location: Created at $BOOK_DIR/book.toml build-docs.sh85 copied to /output/book.toml build-docs.sh191

Template Structure:

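A sketch of the generated template, reconstructed from the fields described in this document (exact formatting, the language value, and any additional keys in build-docs.sh may differ):

cat > "$BOOK_DIR/book.toml" <<EOF
[book]
title = "${BOOK_TITLE}"
authors = ["${BOOK_AUTHORS}"]
language = "en"

[output.html]
default-theme = "rust"
git-repository-url = "${GIT_REPO_URL}"

[preprocessor.mermaid]
command = "mdbook-mermaid"
EOF
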
Sources: build-docs.sh:85-103

Variable Substitution Mapping:

| Template Variable | Environment Variable | Section |
|---|---|---|
| ${BOOK_TITLE} | $BOOK_TITLE | [book] |
| ${BOOK_AUTHORS} | $BOOK_AUTHORS | [book] |
| ${GIT_REPO_URL} | $GIT_REPO_URL | [output.html] |

Hard-Coded Values: the language setting, the rust default theme, and the [preprocessor.mermaid] section are fixed in the generated template; only the title, authors, and git-repository-url fields are populated from environment variables.

SUMMARY.md

Location: Created at $BOOK_DIR/src/SUMMARY.md build-docs.sh159

Generation: Automatically generated from file structure in $WIKI_DIR, no direct environment variable input. See SUMMARY.md Generation for details.

Sources: build-docs.sh:109-159

Configuration Examples

Minimal Configuration

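For example (assumed image tag):

docker run --rm -e REPO="owner/repo" -v "$(pwd)/output:/output" deepwiki-scraper
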
Results:

  • REPO: "owner/repo"
  • BOOK_TITLE: "Documentation"
  • BOOK_AUTHORS: "owner"
  • GIT_REPO_URL: "https://github.com/owner/repo"
  • MARKDOWN_ONLY: "false"

Full Custom Configuration

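A sketch with every variable set explicitly (values are illustrative):

docker run --rm \
  -e REPO="facebook/react" \
  -e BOOK_TITLE="React Documentation" \
  -e BOOK_AUTHORS="Meta Open Source" \
  -e GIT_REPO_URL="https://github.com/facebook/react" \
  -e MARKDOWN_ONLY=false \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
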
Auto-Detected Configuration

Note: This only works if the current directory is a Git repository with a GitHub remote URL configured.

Debugging Configuration

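For example:

docker run --rm -e REPO="owner/repo" -e MARKDOWN_ONLY=true -v "$(pwd)/output:/output" deepwiki-scraper
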
Outputs only Markdown files to /output/markdown/, skipping the mdBook build phase.

Sources: README.md:28-88

Configuration Validation

The system performs validation on the REPO variable build-docs.sh:32-37:

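A sketch of that check (message text taken from the error shown earlier; the exact script lines may differ):

if [ -z "$REPO" ]; then
  echo "ERROR: REPO must be set or run from within a Git repository with a GitHub remote"
  echo "Usage: REPO=owner/repo $0"
  exit 1
fi
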
Validation Rules:

  • REPO must be non-empty after auto-detection
  • No format validation is performed on REPO value (e.g., owner/repo pattern)
  • Invalid REPO values will cause failures during scraping phase, not during validation

Other Variables:

  • No validation performed on BOOK_TITLE, BOOK_AUTHORS, or GIT_REPO_URL
  • MARKDOWN_ONLY is not validated; any value other than "true" is treated as false

Sources: build-docs.sh:32-37

Configuration Debugging

To debug configuration values, check the console output at startup build-docs.sh:47-53:

Configuration:
  Repository:    facebook/react
  Book Title:    React Documentation
  Authors:       Meta Open Source
  Git Repo URL:  https://github.com/facebook/react
  Markdown Only: false

This output shows the final resolved configuration values after auto-detection, derivation, and defaults are applied.

Sources: build-docs.sh:47-53


System Architecture


This document provides a comprehensive overview of the DeepWiki-to-mdBook Converter's system architecture, explaining how the major components interact and how data flows through the system. It describes the containerized polyglot design, the orchestration model, and the technology integration strategy.

For detailed information about the three-phase processing model, see Three-Phase Pipeline. For Docker containerization specifics, see Docker Multi-Stage Build. For individual component implementation details, see Component Reference.

Architectural Overview

The system follows a layered orchestration architecture where a shell script coordinator invokes specialized tools in sequence. The entire system runs within a single Docker container that combines Python web scraping tools with Rust documentation building tools.

Design Principles

| Principle | Implementation |
|---|---|
| Single Responsibility | Each component (shell, Python, Rust tools) has one clear purpose |
| Language-Specific Tools | Python for web scraping, Rust for documentation building, Shell for orchestration |
| Stateless Processing | No persistent state between runs; all configuration via environment variables |
| Atomic Operations | Temporary directory workflow ensures no partial output states |
| Generic Design | No hardcoded repository details; works with any DeepWiki repository |

Sources: README.md:218-227 build-docs.sh:1-206

Container Architecture

The system uses a two-stage Docker build to create a hybrid Python-Rust runtime environment while minimizing image size.

graph TB
    subgraph Stage1["Stage 1: Rust Builder (rust:latest)"]
RustBase["rust:latest base image"]
CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
BinariesOut["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
RustBase --> CargoInstall
 
       CargoInstall --> BinariesOut
    end
    
    subgraph Stage2["Stage 2: Final Image (python:3.12-slim)"]
PyBase["python:3.12-slim base"]
UVInstall["COPY --from=ghcr.io/astral-sh/uv"]
PipInstall["uv pip install\nrequirements.txt"]
CopyBins["COPY --from=builder\nRust binaries"]
CopyScripts["COPY scripts:\ndeepwiki-scraper.py\nbuild-docs.sh"]
PyBase --> UVInstall
 
       UVInstall --> PipInstall
 
       PipInstall --> CopyBins
 
       CopyBins --> CopyScripts
    end
    
 
   BinariesOut -.->|Extract binaries only discard 1.5GB toolchain| CopyBins
    
 
   CopyScripts --> Runtime["Final Image: ~300-400MB\nPython + Rust binaries\nNo build tools"]
subgraph Runtime["Runtime Contents"]
direction LR
        Python["Python 3.12 runtime"]
Packages["requests, BeautifulSoup4,\nhtml2text"]
Tools["mdbook, mdbook-mermaid\nbinaries"]
end

Docker Multi-Stage Build Topology

Stage 1 (Dockerfile:1-5) compiles Rust tools using the full rust:latest image (~1.5 GB) but only the compiled binaries are extracted. Stage 2 (Dockerfile:7-32) builds the final image on a minimal Python base, copying only the Rust binaries and Python scripts, resulting in a compact image.

Sources: Dockerfile:1-33 README.md156

Component Topology and Code Mapping

This diagram maps the system's logical components to their actual code implementations:

graph TB
    subgraph User["User Interface"]
CLI["Docker CLI"]
EnvVars["Environment Variables:\nREPO, BOOK_TITLE,\nBOOK_AUTHORS, etc."]
Volume["/output volume mount"]
end
    
    subgraph Orchestrator["Orchestration Layer"]
BuildScript["build-docs.sh"]
MainLoop["Main execution flow:\nLines 55-206"]
ConfigGen["Configuration generation:\nLines 84-103, 108-159"]
AutoDetect["Auto-detection logic:\nLines 8-19, 40-45"]
end
    
    subgraph ScraperLayer["Content Acquisition Layer"]
ScraperMain["deepwiki-scraper.py\nmain()
function"]
ExtractStruct["extract_wiki_structure()\nLine 78"]
ExtractContent["extract_page_content()\nLine 453"]
ExtractDiagrams["extract_and_enhance_diagrams()\nLine 596"]
FetchPage["fetch_page()\nLine 27"]
ConvertHTML["convert_html_to_markdown()\nLine 175"]
CleanFooter["clean_deepwiki_footer()\nLine 127"]
FixLinks["fix_wiki_link()\nLine 549"]
end
    
    subgraph BuildLayer["Documentation Generation Layer"]
MdBookInit["mdbook init"]
MdBookBuild["mdbook build\n(Line 176)"]
MermaidInstall["mdbook-mermaid install\n(Line 171)"]
end
    
    subgraph Output["Output Artifacts"]
TempDir["/workspace/wiki/\n(temp directory)"]
OutputMD["/output/markdown/\nEnhanced .md files"]
OutputBook["/output/book/\nHTML documentation"]
BookToml["/output/book.toml"]
end
    
 
   CLI --> EnvVars
 
   EnvVars --> BuildScript
    
 
   BuildScript --> AutoDetect
 
   BuildScript --> MainLoop
 
   MainLoop --> ScraperMain
 
   MainLoop --> ConfigGen
 
   MainLoop --> MdBookInit
    
 
   ScraperMain --> ExtractStruct
 
   ScraperMain --> ExtractContent
 
   ScraperMain --> ExtractDiagrams
    
 
   ExtractStruct --> FetchPage
 
   ExtractContent --> FetchPage
 
   ExtractContent --> ConvertHTML
 
   ConvertHTML --> CleanFooter
 
   ExtractContent --> FixLinks
    
 
   ExtractDiagrams --> TempDir
 
   ExtractContent --> TempDir
    
 
   ConfigGen --> MdBookInit
 
   TempDir --> MdBookBuild
 
   MdBookBuild --> MermaidInstall
    
 
   TempDir --> OutputMD
 
   MdBookBuild --> OutputBook
 
   ConfigGen --> BookToml
    
 
   OutputMD --> Volume
 
   OutputBook --> Volume
 
   BookToml --> Volume

This diagram shows the complete code-to-component mapping, making it easy to locate specific functionality in the codebase.

Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920

stateDiagram-v2
    [*] --> ValidateConfig
    
    ValidateConfig : build-docs.sh - Lines 8-53 Parse REPO, auto-detect if needed Set BOOK_TITLE, BOOK_AUTHORS defaults
    
    ValidateConfig --> Phase1
    
    Phase1 : Phase 1 - Scrape Wiki build-docs.sh - Line 58 Calls - deepwiki-scraper.py
    
    state Phase1 {
        [*] --> ExtractStructure
        ExtractStructure : extract_wiki_structure() Parse main page, discover subsections
        ExtractStructure --> ExtractPages
        ExtractPages : extract_page_content() Fetch HTML, convert to markdown
        ExtractPages --> EnhanceDiagrams
        EnhanceDiagrams : extract_and_enhance_diagrams() Fuzzy match and inject diagrams
        EnhanceDiagrams --> [*]
    }
    
    Phase1 --> CheckMode
    
    CheckMode : Check MARKDOWN_ONLY flag build-docs.sh - Line 61
    
    state CheckMode <<choice>>
    CheckMode --> CopyMarkdown : MARKDOWN_ONLY=true
    CheckMode --> Phase2 : MARKDOWN_ONLY=false
    
    CopyMarkdown : Copy to /output/markdown build-docs.sh - Lines 63-75
    CopyMarkdown --> Done
    
    Phase2 : Phase 2 - Initialize mdBook build-docs.sh - Lines 79-106
    
    state Phase2 {
        [*] --> CreateBookToml
        CreateBookToml : Generate book.toml Lines 85-103
        CreateBookToml --> GenerateSummary
        GenerateSummary : Generate SUMMARY.md Lines 113-159
        GenerateSummary --> [*]
    }
    
    Phase2 --> Phase3
    
    Phase3 : Phase 3 - Build Documentation build-docs.sh - Lines 164-191
    
    state Phase3 {
        [*] --> InstallMermaid
        InstallMermaid : mdbook-mermaid install Line 171
        InstallMermaid --> BuildBook
        BuildBook : mdbook build Line 176
        BuildBook --> CopyOutputs
        CopyOutputs : Copy to /output Lines 184-191
        CopyOutputs --> [*]
    }
    
    Phase3 --> Done
    Done --> [*]

Execution Flow

The system executes through a well-defined sequence orchestrated by build-docs.sh:

Primary Execution Path

The execution flow has a fast-path (markdown-only mode) and a complete-path (full documentation build). The decision point at line 61 of build-docs.sh determines which path to take based on the MARKDOWN_ONLY environment variable.

Sources: build-docs.sh:55-206 tools/deepwiki-scraper.py:790-916

Technology Stack and Integration Points

Core Technologies

| Layer | Technology | Purpose | Code Reference |
|---|---|---|---|
| Orchestration | Bash | Script coordination, environment handling | build-docs.sh:1-206 |
| Web Scraping | Python 3.12 | HTTP requests, HTML parsing | tools/deepwiki-scraper.py:1-920 |
| HTML Parsing | BeautifulSoup4 | DOM navigation, content extraction | tools/deepwiki-scraper.py:18-19 |
| HTML→MD Conversion | html2text | Clean markdown generation | tools/deepwiki-scraper.py:175-190 |
| Documentation Build | mdBook (Rust) | HTML site generation | build-docs.sh176 |
| Diagram Rendering | mdbook-mermaid | Mermaid diagram support | build-docs.sh171 |
| Package Management | uv | Fast Python dependency installation | Dockerfile:13-17 |

Python Dependencies Integration

The scraper uses three primary Python libraries, installed via uv:

Integration points:

Sources: Dockerfile:16-17 tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42

graph TB
    subgraph Docker["Docker Container Filesystem"]
subgraph Workspace["/workspace"]
WikiTemp["/workspace/wiki\n(temporary)\nScraper output"]
BookBuild["/workspace/book\nmdBook build directory"]
BookSrc["/workspace/book/src\nMarkdown source files"]
end
        
        subgraph Binaries["/usr/local/bin"]
MdBook["mdbook"]
MdBookMermaid["mdbook-mermaid"]
Scraper["deepwiki-scraper.py"]
BuildScript["build-docs.sh"]
end
        
        subgraph Output["/output (volume mount)"]
OutputMD["/output/markdown\nFinal markdown files"]
OutputBook["/output/book\nHTML documentation"]
OutputConfig["/output/book.toml"]
end
    end
    
 
   Scraper -.->|Phase 1: Write| WikiTemp
 
   WikiTemp -.->|Phase 2: Enhance in-place| WikiTemp
 
   WikiTemp -.->|Copy| BookSrc
 
   BookSrc -.->|mdbook build| OutputBook
 
   WikiTemp -.->|Move| OutputMD

File System Structure

The system uses a temporary directory workflow to ensure atomic operations:

Directory Layout at Runtime

Workflow:

  1. Lines 808-877 : Scraper writes to a temporary directory in /tmp (created by tempfile.TemporaryDirectory())
  2. Line 880 : Diagram enhancement modifies files in the temporary directory
  3. Lines 887-908 : Completed files are moved atomically to /output
  4. Line 166 : build-docs.sh copies to the mdBook source directory
  5. Line 176 : mdBook builds HTML to /workspace/book/book
  6. Lines 184-191 : Outputs are copied to the /output volume

This pattern ensures no partial or corrupted output is visible to users.

Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:164-191

Configuration Management

Configuration flows from environment variables through shell script processing to generated config files:

Configuration Flow

| Input | Processor | Output | Code Reference |
|---|---|---|---|
| REPO | build-docs.sh:8-19 | Auto-detected from Git or required | build-docs.sh:8-36 |
| BOOK_TITLE | build-docs.sh:23 | Defaults to "Documentation" | build-docs.sh23 |
| BOOK_AUTHORS | build-docs.sh:24,44 | Defaults to repo owner | build-docs.sh:24-44 |
| GIT_REPO_URL | build-docs.sh:25,45 | Constructed from REPO | build-docs.sh:25-45 |
| MARKDOWN_ONLY | build-docs.sh:26,61 | Controls pipeline execution | build-docs.sh:26-61 |
| All config | build-docs.sh:85-103 | book.toml generation | build-docs.sh:85-103 |
| File structure | build-docs.sh:113-159 | SUMMARY.md generation | build-docs.sh:113-159 |

Auto-Detection Logic

The system can automatically detect repository information from Git remotes:

This enables zero-configuration usage in CI/CD environments where the code is already checked out.

Sources: build-docs.sh:8-45 README.md:47-53

Summary

The DeepWiki-to-mdBook Converter architecture demonstrates several key design patterns:

  1. Polyglot Orchestration : Shell coordinates Python and Rust tools, each optimized for their specific task
  2. Multi-Stage Container Build : Separates build-time tooling from runtime dependencies for minimal image size
  3. Temporary Directory Workflow : Ensures atomic operations and prevents partial output states
  4. Progressive Processing : Three distinct phases (extract, enhance, build) with optional fast-path
  5. Zero-Configuration Capability : Intelligent defaults and auto-detection minimize required configuration

The architecture prioritizes maintainability (clear separation of concerns), reliability (atomic operations), and usability (intelligent defaults) while remaining fully generic and portable.

Sources: README.md:1-233 Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920


Three-Phase Pipeline


Purpose and Scope

This document describes the three-phase processing pipeline that transforms DeepWiki HTML pages into searchable mdBook documentation. The pipeline consists of Phase 1: Clean Markdown Extraction , Phase 2: Diagram Enhancement , and Phase 3: mdBook Build. Each phase has distinct responsibilities and uses different technology stacks.

For overall system architecture, see System Architecture. For detailed implementation of individual phases, see Phase 1: Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. For configuration that affects pipeline behavior, see Configuration Reference.

Pipeline Overview

The system processes content through three sequential phases, with an optional bypass mechanism for Phase 3.

Pipeline Execution Flow

stateDiagram-v2
    [*] --> Initialize
    
    Initialize --> Phase1 : Start build-docs.sh
    
    state "Phase 1 : Markdown Extraction" as Phase1 {
        [*] --> extract_wiki_structure
        extract_wiki_structure --> extract_page_content : For each page
        extract_page_content --> convert_html_to_markdown
        convert_html_to_markdown --> WriteTemp : Write to /workspace/wiki
        WriteTemp --> [*]
    }
    
    Phase1 --> CheckMode : deepwiki-scraper.py complete
    
    state CheckMode <<choice>>
    CheckMode --> Phase2 : MARKDOWN_ONLY=false
    CheckMode --> CopyOutput : MARKDOWN_ONLY=true
    
    state "Phase 2 : Diagram Enhancement" as Phase2 {
        [*] --> extract_and_enhance_diagrams
        extract_and_enhance_diagrams --> ExtractJS : Fetch JS payload
        ExtractJS --> FuzzyMatch : ~461 diagrams found
        FuzzyMatch --> InjectDiagrams : ~48 placed
        InjectDiagrams --> [*] : Update temp files
    }
    
    Phase2 --> Phase3 : Enhancement complete
    
    state "Phase 3 : mdBook Build" as Phase3 {
        [*] --> CreateBookToml : build-docs.sh
        CreateBookToml --> GenerateSummary : book.toml created
        GenerateSummary --> CopyToSrc : SUMMARY.md generated
        CopyToSrc --> MdbookMermaidInstall : Copy to /workspace/book/src
        MdbookMermaidInstall --> MdbookBuild : Install assets
        MdbookBuild --> [*] : HTML in /workspace/book/book
    }
    
    Phase3 --> CopyOutput
    CopyOutput --> [*] : Copy to /output

Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:790-919

Phase Coordination

The build-docs.sh orchestrator coordinates all three phases and handles the decision point for markdown-only mode.

Orchestrator Control Flow

flowchart TD
    Start[/"docker run with env vars"/]
    
 
   Start --> ParseEnv["Parse environment variables\nREPO, BOOK_TITLE, MARKDOWN_ONLY"]
ParseEnv --> ValidateRepo{"REPO set?"}
ValidateRepo -->|No| AutoDetect["git config --get remote.origin.url\nExtract owner/repo"]
ValidateRepo -->|Yes| CallScraper
 
   AutoDetect --> CallScraper
    
    CallScraper["python3 /usr/local/bin/deepwiki-scraper.py\nArgs: REPO, /workspace/wiki"]
CallScraper --> ScraperPhase1["Phase 1: extract_wiki_structure()\nextract_page_content()\nWrite to temp directory"]
ScraperPhase1 --> ScraperPhase2["Phase 2: extract_and_enhance_diagrams()\nFuzzy match and inject\nUpdate temp files"]
ScraperPhase2 --> CheckMarkdownOnly{"MARKDOWN_ONLY\n== true?"}
CheckMarkdownOnly -->|Yes| CopyMdOnly["cp -r /workspace/wiki/* /output/markdown/\nExit"]
CheckMarkdownOnly -->|No| InitMdBook
    
    InitMdBook["mkdir -p /workspace/book\nGenerate book.toml"]
InitMdBook --> GenSummary["Generate src/SUMMARY.md\nScan /workspace/wiki/*.md\nBuild table of contents"]
GenSummary --> CopyToSrc["cp -r /workspace/wiki/* src/"]
CopyToSrc --> InstallMermaid["mdbook-mermaid install /workspace/book"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["cp -r book /output/\ncp -r /workspace/wiki/* /output/markdown/\ncp book.toml /output/"]
CopyMdOnly --> End[/"Exit with outputs in /output"/]
 
   CopyOutputs --> End

Sources: build-docs.sh:8-76 build-docs.sh:78-206

Phase 1: Clean Markdown Extraction

Phase 1 discovers the wiki structure and converts HTML pages to clean Markdown, writing files to a temporary directory (/workspace/wiki). This phase is implemented entirely in Python within deepwiki-scraper.py.

Phase 1 Data Flow

flowchart LR
    DeepWiki["https://deepwiki.com/\nowner/repo"]
DeepWiki -->|HTTP GET| extract_wiki_structure
    
    extract_wiki_structure["extract_wiki_structure()\nParse sidebar links\nBuild page list"]
extract_wiki_structure --> PageList["pages = [\n {number, title, url, href, level},\n ...\n]"]
PageList --> Loop["For each page"]
Loop --> extract_page_content["extract_page_content(url, session)\nFetch HTML\nRemove nav/footer elements"]
extract_page_content --> BeautifulSoup["BeautifulSoup(response.text)\nFind article/main/body\nRemove DeepWiki UI"]
BeautifulSoup --> convert_html_to_markdown["convert_html_to_markdown(html)\nhtml2text.HTML2Text()\nbody_width=0"]
convert_html_to_markdown --> clean_deepwiki_footer["clean_deepwiki_footer(markdown)\nRemove footer patterns"]
clean_deepwiki_footer --> FixLinks["Fix internal links\nRegex: /owner/repo/N-title\nConvert to relative .md paths"]
FixLinks --> WriteTempFile["Write to /workspace/wiki/\nMain: N-title.md\nSubsection: section-N/N-M-title.md"]
WriteTempFile --> Loop
    
    style extract_wiki_structure fill:#f9f9f9
    style extract_page_content fill:#f9f9f9
    style convert_html_to_markdown fill:#f9f9f9
    style clean_deepwiki_footer fill:#f9f9f9

Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-216 tools/deepwiki-scraper.py:127-173

Key Functions and Their Roles

| Function | File Location | Responsibility |
|---|---|---|
| extract_wiki_structure() | tools/deepwiki-scraper.py:78-125 | Discover all pages by parsing sidebar links with pattern /repo/\d+ |
| extract_page_content() | tools/deepwiki-scraper.py:453-594 | Fetch individual page, parse HTML, remove navigation elements |
| convert_html_to_markdown() | tools/deepwiki-scraper.py:175-216 | Convert HTML to Markdown using html2text with body_width=0 |
| clean_deepwiki_footer() | tools/deepwiki-scraper.py:127-173 | Remove DeepWiki UI elements using regex pattern matching |
| sanitize_filename() | tools/deepwiki-scraper.py:21-25 | Convert page titles to safe filenames |
| fix_wiki_link() | tools/deepwiki-scraper.py:549-589 | Rewrite internal links to relative .md paths |

File Organization Logic

flowchart TD
    PageNum["page['number']"]
PageNum --> CheckLevel{"page['level']\n== 0?"}
CheckLevel -->|Yes main page| RootFile["Filename: N-title.md\nPath: /workspace/wiki/N-title.md\nExample: 2-quick-start.md"]
CheckLevel -->|No subsection| ExtractMain["Extract main section\nmain_section = number.split('.')[0]"]
ExtractMain --> SubDir["Create directory\nsection-{main_section}/"]
SubDir --> SubFile["Filename: N-M-title.md\nPath: section-N/N-M-title.md\nExample: section-2/2-1-installation.md"]

The system organizes files hierarchically based on page numbering:

Sources: tools/deepwiki-scraper.py:849-860

Phase 2: Diagram Enhancement

Phase 2 extracts Mermaid diagrams from the JavaScript payload and uses fuzzy matching to intelligently place them in the appropriate Markdown files. This phase operates on files in the temporary directory (/workspace/wiki).

Phase 2 Algorithm Flow

flowchart TD
    Start["extract_and_enhance_diagrams(repo, temp_dir, session)"]
Start --> FetchJS["GET https://deepwiki.com/owner/repo/1-overview\nExtract response.text"]
FetchJS --> ExtractAll["Regex: ```mermaid\\\n(.*?)```\nFind all diagram blocks"]
ExtractAll --> CountTotal["all_diagrams list\n(~461 total)"]
CountTotal --> ExtractContext["Regex: ([^`]{'{500,}'}?)```mermaid\\ (.*?)```\nExtract 500-char context before each"]
ExtractContext --> Unescape["For each diagram:\ncontext.replace('\\\n', '\\\n')\ndiagram.replace('\\\n', '\\\n')\nUnescape HTML entities"]
Unescape --> BuildContext["diagram_contexts = [\n {\n last_heading: str,\n anchor_text: str (last 300 chars),\n diagram: str\n },\n ...\n]\n(~48 with context)"]
BuildContext --> ScanFiles["For each .md file in temp_dir.glob('**/*.md')"]
ScanFiles --> SkipExisting{"File contains\n'```mermaid'?"}
SkipExisting -->|Yes| ScanFiles
 
   SkipExisting -->|No| NormalizeContent
    
    NormalizeContent["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeContent --> MatchLoop["For each diagram in diagram_contexts"]
MatchLoop --> TryChunks["Try chunk sizes: [300, 200, 150, 100, 80]\ntest_chunk = anchor_normalized[-chunk_size:]\npos = content_normalized.find(test_chunk)"]
TryChunks --> FoundMatch{"Match found?"}
FoundMatch -->|Yes| ConvertToLine["Convert char position to line number\nScan through lines counting chars"]
FoundMatch -->|No| TryHeading["Try heading match\nCompare normalized heading text"]
TryHeading --> FoundMatch2{"Match found?"}
FoundMatch2 -->|Yes| ConvertToLine
 
   FoundMatch2 -->|No| MatchLoop
    
 
   ConvertToLine --> FindInsertPoint["Find insertion point:\nIf heading: skip blank lines, skip paragraph\nIf paragraph: find end of paragraph"]
FindInsertPoint --> QueueInsert["pending_insertions.append(\n (insert_line, diagram, score, idx)\n)"]
QueueInsert --> MatchLoop
    
 
   MatchLoop --> InsertDiagrams["Sort by line number (reverse)\nInsert from bottom up:\nlines.insert(pos, '')\nlines.insert(pos, '```')\nlines.insert(pos, diagram)\nlines.insert(pos, '```mermaid')\nlines.insert(pos, '')"]
InsertDiagrams --> WriteFile["Write enhanced file back to disk"]
WriteFile --> ScanFiles
    
 
   ScanFiles --> Complete["Return to orchestrator"]

Sources: tools/deepwiki-scraper.py:596-788

Fuzzy Matching Algorithm

The algorithm uses progressive chunk sizes to find the best match location for each diagram:

Sources: tools/deepwiki-scraper.py:716-730 tools/deepwiki-scraper.py:732-745

flowchart LR
    Anchor["anchor_text\n(300 chars from JS context)"]
Anchor --> Normalize1["Normalize:\nlowercase\ncollapse whitespace"]
Content["markdown file content"]
Content --> Normalize2["Normalize:\nlowercase\ncollapse whitespace"]
Normalize1 --> Try300["Try 300-char chunk\ntest_chunk = anchor[-300:]"]
Normalize2 --> Try300
    
 
   Try300 --> Found300{"Found?"}
Found300 -->|Yes| Match300["best_match_score = 300"]
Found300 -->|No| Try200["Try 200-char chunk"]
Try200 --> Found200{"Found?"}
Found200 -->|Yes| Match200["best_match_score = 200"]
Found200 -->|No| Try150["Try 150-char chunk"]
Try150 --> Found150{"Found?"}
Found150 -->|Yes| Match150["best_match_score = 150"]
Found150 -->|No| Try100["Try 100-char chunk"]
Try100 --> Found100{"Found?"}
Found100 -->|Yes| Match100["best_match_score = 100"]
Found100 -->|No| Try80["Try 80-char chunk"]
Try80 --> Found80{"Found?"}
Found80 -->|Yes| Match80["best_match_score = 80"]
Found80 -->|No| TryHeading["Fallback: heading match"]
TryHeading --> FoundH{"Found?"}
FoundH -->|Yes| Match50["best_match_score = 50"]
FoundH -->|No| NoMatch["No match\nSkip this diagram"]

Diagram Extraction from JavaScript

Diagrams are extracted from the Next.js JavaScript payload using two strategies:

Extraction Strategies

| Strategy | Pattern | Description |
|---|---|---|
| Fenced blocks | ```mermaid\\n(.*?)``` | Primary strategy: extract code blocks with escaped newlines |
| JavaScript strings | "graph TD..." | Fallback: find Mermaid start keywords in quoted strings |

The function extract_mermaid_from_nextjs_data() at tools/deepwiki-scraper.py:218-331 handles unescaping:

block.replace('\\n', '\n')
block.replace('\\t', '\t')
block.replace('\\"', '"')
block.replace('\\\\', '\\')
block.replace('\\u003c', '<')
block.replace('\\u003e', '>')
block.replace('\\u0026', '&')

Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:615-646

Phase 3: mdBook Build

Phase 3 generates mdBook configuration, creates the table of contents, and builds the final HTML documentation. This phase is orchestrated by build-docs.sh and invokes Rust tools (mdbook, mdbook-mermaid).

Phase 3 Component Interactions

flowchart TD
    Start["Phase 3 entry point\n(build-docs.sh:78)"]
Start --> MkdirBook["mkdir -p /workspace/book\ncd /workspace/book"]
MkdirBook --> GenToml["Generate book.toml:\n[book]\ntitle, authors, language\n[output.html]\ndefault-theme=rust\ngit-repository-url\n[preprocessor.mermaid]"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> GenSummary["Generate src/SUMMARY.md"]
GenSummary --> ScanRoot["Scan /workspace/wiki/*.md\nFind first page for intro"]
ScanRoot --> ProcessMain["For each main page:\nExtract title from first line\nCheck for section-N/ subdirectory"]
ProcessMain --> HasSubs{"Has\nsubsections?"}
HasSubs -->|Yes| WriteSection["Write to SUMMARY.md:\n# Title\n- [Title](N-title.md)\n - [Subtitle](section-N/N-M-title.md)"]
HasSubs -->|No| WriteStandalone["Write to SUMMARY.md:\n- [Title](N-title.md)"]
WriteSection --> ProcessMain
 
   WriteStandalone --> ProcessMain
    
 
   ProcessMain --> CopySrc["cp -r /workspace/wiki/* src/"]
CopySrc --> InstallMermaid["mdbook-mermaid install /workspace/book\nInstalls mermaid.min.js\nInstalls mermaid-init.js\nUpdates book.toml"]
InstallMermaid --> MdbookBuild["mdbook build\nReads src/SUMMARY.md\nProcesses all .md files\nApplies rust theme\nGenerates book/index.html\nGenerates book/*/index.html"]
MdbookBuild --> CopyOut["Copy outputs:\ncp -r book /output/\ncp -r /workspace/wiki/* /output/markdown/\ncp book.toml /output/"]

Sources: build-docs.sh:78-206

book.toml Generation

The orchestrator dynamically generates book.toml with runtime configuration:

Sources: build-docs.sh:84-103

flowchart TD
    Start["Generate SUMMARY.md"]
Start --> FindFirst["first_page = ls /workspace/wiki/*.md /head -1 Extract title from first line Write: [Title] filename"]
FindFirst --> LoopMain["For each /workspace/wiki/*.md excluding first_page"]
LoopMain --> ExtractNum["section_num = filename.match /^[0-9]+/"]
ExtractNum --> CheckDir{"section-{num}/ exists?"}
CheckDir -->|Yes|WriteSectionHeader["Write: # {title} - [{title}] {filename}"]
WriteSectionHeader --> LoopSubs["For each section-{num}/*.md"]
LoopSubs --> WriteSubitem["Write: - [{subtitle}] section-{num}/{subfilename}"]
WriteSubitem --> LoopSubs
 LoopSubs --> LoopMain
 CheckDir -->|No| WriteStandalone["Write:\n- [{title}]({filename})"]
WriteStandalone --> LoopMain
    
 
   LoopMain --> Complete["SUMMARY.md complete\ngrep -c '\\[' to count entries"]

SUMMARY.md Generation Algorithm

The table of contents is generated by scanning the actual file structure in /workspace/wiki:

Sources: build-docs.sh:108-162

mdBook and mdbook-mermaid Execution

The build process invokes two Rust binaries:

| Command | Purpose | Output |
|---|---|---|
| mdbook-mermaid install $BOOK_DIR | Install Mermaid.js assets and update book.toml | mermaid.min.js, mermaid-init.js in book/ |
| mdbook build | Parse SUMMARY.md, process Markdown, generate HTML | HTML files in /workspace/book/book/ |

The mdbook binary:

  1. Reads src/SUMMARY.md to determine structure
  2. Processes each Markdown file referenced in SUMMARY.md
  3. Applies the rust theme specified in book.toml
  4. Generates navigation sidebar
  5. Adds search functionality
  6. Creates "Edit this page" links using git-repository-url
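
In shell form, the two invocations amount to the following (paths as used throughout this document):

mdbook-mermaid install /workspace/book
cd /workspace/book && mdbook build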

Sources: build-docs.sh:169-176

Data Transformation Summary

Each phase transforms data in specific ways:

| Phase | Input Format | Processing | Output Format |
|---|---|---|---|
| Phase 1 | HTML pages from DeepWiki | BeautifulSoup parsing, html2text conversion, link rewriting | Clean Markdown in /workspace/wiki/ |
| Phase 2 | Markdown files + JavaScript payload | Regex extraction, fuzzy matching, diagram injection | Enhanced Markdown in /workspace/wiki/ (modified in place) |
| Phase 3 | Markdown files + environment variables | book.toml generation, SUMMARY.md generation, mdbook build | HTML site in /workspace/book/book/ |

Final Output Structure:

/output/
├── book/                    # HTML documentation site
│   ├── index.html
│   ├── 1-overview.html
│   ├── section-2/
│   │   └── 2-1-subsection.html
│   ├── mermaid.min.js
│   ├── mermaid-init.js
│   └── ...
├── markdown/                # Source Markdown files
│   ├── 1-overview.md
│   ├── section-2/
│   │   └── 2-1-subsection.md
│   └── ...
└── book.toml               # mdBook configuration

Sources: build-docs.sh:178-205 README.md:89-119

flowchart TD
    Phase1["Phase 1: Extraction\n(deepwiki-scraper.py)"]
Phase2["Phase 2: Enhancement\n(deepwiki-scraper.py)"]
Phase1 --> Phase2
    
 
   Phase2 --> Check{"MARKDOWN_ONLY\n== true?"}
Check -->|Yes| FastPath["cp -r /workspace/wiki/* /output/markdown/\nExit (fast path)"]
Check -->|No| Phase3["Phase 3: mdBook Build\n(build-docs.sh)"]
Phase3 --> FullOutput["Copy book/ and markdown/ to /output/\nExit (full build)"]
FastPath --> End[/"Build complete"/]
 
   FullOutput --> End

Conditional Execution: MARKDOWN_ONLY Mode

The MARKDOWN_ONLY environment variable allows bypassing Phase 3 for faster iteration during development:

When MARKDOWN_ONLY=true:

  • Execution time: ~30-60 seconds (scraping + diagram matching only)
  • Output: /output/markdown/ only
  • Use case: Debugging diagram placement, testing content extraction

When MARKDOWN_ONLY=false (default):

  • Execution time: ~60-120 seconds (full pipeline)
  • Output: /output/book/, /output/markdown/, /output/book.toml
  • Use case: Production documentation builds

Sources: build-docs.sh:60-76 README.md:55-76


Docker Multi-Stage Build


Purpose and Scope

This document explains the Docker multi-stage build strategy used to create the deepwiki-scraper container image. It details how the system combines a Rust toolchain for compiling documentation tools with a Python runtime for web scraping, while optimizing the final image size.

For information about how the container orchestrates the build process, see build-docs.sh Orchestrator. For details on the Python scraper implementation, see deepwiki-scraper.py.

Multi-Stage Build Strategy

The Dockerfile implements a two-stage build pattern that separates compilation from runtime. Stage 1 uses a full Rust development environment to compile mdBook binaries from source. Stage 2 creates a minimal Python runtime and extracts only the compiled binaries, discarding the build toolchain.

Build Stages Flow

graph TD
    subgraph Stage1["Stage 1: Rust Builder (rust:latest)"]
RustBase["rust:latest base image\n~1.5 GB with toolchain"]
CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
Binaries["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
RustBase --> CargoInstall
 
       CargoInstall --> Binaries
    end
    
    subgraph Stage2["Stage 2: Python Runtime (python:3.12-slim)"]
PyBase["python:3.12-slim base\n~150 MB"]
UVInstall["Copy uv from\nghcr.io/astral-sh/uv:latest"]
PipInstall["uv pip install --system\nrequirements.txt"]
CopyBinaries["COPY --from=builder\n/usr/local/cargo/bin/"]
CopyScripts["COPY tools/ and\nbuild-docs.sh"]
PyBase --> UVInstall
 
       UVInstall --> PipInstall
 
       PipInstall --> CopyBinaries
 
       CopyBinaries --> CopyScripts
    end
    
 
   Binaries -.->|Extract only binaries| CopyBinaries
    
 
   CopyScripts --> FinalImage["Final Image\n~300-400 MB"]
Stage1 -.->|Discarded after build| Discard["Discard"]
style RustBase fill:#f5f5f5
    style PyBase fill:#f5f5f5
    style FinalImage fill:#e8e8e8
    style Discard fill:#fff,stroke-dasharray: 5 5

Sources: Dockerfile:1-33

Stage 1: Rust Builder

Stage 1 uses rust:latest as the base image, providing the complete Rust toolchain including cargo, the Rust package manager and build tool.

Rust Builder Configuration

| Aspect | Details |
| --- | --- |
| Base Image | rust:latest |
| Size | ~1.5 GB (includes rustc, cargo, stdlib) |
| Build Commands | cargo install mdbook, cargo install mdbook-mermaid |
| Output Location | /usr/local/cargo/bin/ |
| Stage Identifier | builder |

The cargo install commands fetch mdBook and mdbook-mermaid source from crates.io, compile them from source, and install the resulting binaries to /usr/local/cargo/bin/.
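A minimal sketch of Stage 1, assuming conventional Dockerfile syntax (the real Dockerfile:1-5 may differ in detail):

```dockerfile
# Stage 1: build the Rust documentation tools from source
FROM rust:latest AS builder

RUN cargo install mdbook && \
    cargo install mdbook-mermaid
# Compiled binaries end up in /usr/local/cargo/bin/
```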

flowchart LR
    subgraph BuilderStage["builder stage"]
CratesIO["crates.io\n(package registry)"]
CargoFetch["cargo fetch\n(download sources)"]
CargoCompile["cargo build --release\n(compile to binary)"]
CargoInstallBin["Install to\n/usr/local/cargo/bin/"]
CratesIO --> CargoFetch
 
       CargoFetch --> CargoCompile
 
       CargoCompile --> CargoInstallBin
    end
    
 
   CargoInstallBin --> MdBookBin["mdbook binary"]
CargoInstallBin --> MermaidBin["mdbook-mermaid binary"]
MdBookBin -.->|Copied to Stage 2| NextStage["NextStage"]
MermaidBin -.->|Copied to Stage 2| NextStage

Sources: Dockerfile:1-5

Stage 2: Python Runtime Assembly

Stage 2 builds the final runtime image starting from python:3.12-slim, a minimal Python base image that omits development headers and unnecessary packages.

Python Runtime Components

graph TB
    subgraph PythonBase["python:3.12-slim"]
PyInterpreter["Python 3.12 interpreter"]
PyStdlib["Python standard library"]
BaseUtils["Essential utilities\n(bash, sh, coreutils)"]
end
    
    subgraph InstalledTools["Installed via COPY"]
UV["uv package manager\n/bin/uv, /bin/uvx"]
PyDeps["Python packages\n(requests, beautifulsoup4, html2text)"]
RustBins["Rust binaries\n(mdbook, mdbook-mermaid)"]
Scripts["Application scripts\n(deepwiki-scraper.py, build-docs.sh)"]
end
    
 
   PythonBase --> UV
 
   UV --> PyDeps
 
   PythonBase --> RustBins
 
   PythonBase --> Scripts
    
 
   PyDeps --> Runtime["Runtime Environment"]
RustBins --> Runtime
 
   Scripts --> Runtime

The installation sequence follows a specific order:

  1. Copy uv Dockerfile:13 - Multi-stage copy from ghcr.io/astral-sh/uv:latest
  2. Install Python dependencies Dockerfile:16-17 - Uses uv pip install --system --no-cache
  3. Copy Rust binaries Dockerfile:20-21 - Extracts from builder stage
  4. Copy application scripts Dockerfile:24-29 - Adds Python scraper and orchestrator

Sources: Dockerfile:8-29

Binary Extraction and Integration

The critical optimization occurs at Dockerfile:20-21 where the COPY --from=builder directive extracts only the compiled binaries without any build dependencies.

Binary Extraction Pattern

| Source (Stage 1) | Destination (Stage 2) | Purpose |
| --- | --- | --- |
| /usr/local/cargo/bin/mdbook | /usr/local/bin/mdbook | Documentation builder executable |
| /usr/local/cargo/bin/mdbook-mermaid | /usr/local/bin/mdbook-mermaid | Mermaid preprocessor executable |
flowchart LR
    subgraph BuilderFS["Builder Filesystem"]
CargoDir["/usr/local/cargo/bin/"]
MdBookSrc["mdbook\n(compiled binary)"]
MermaidSrc["mdbook-mermaid\n(compiled binary)"]
CargoDir --> MdBookSrc
 
       CargoDir --> MermaidSrc
    end
    
    subgraph RuntimeFS["Runtime Filesystem"]
BinDir["/usr/local/bin/"]
MdBookDst["mdbook\n(extracted)"]
MermaidDst["mdbook-mermaid\n(extracted)"]
BinDir --> MdBookDst
 
       BinDir --> MermaidDst
    end
    
 
   MdBookSrc -.->|COPY --from=builder| MdBookDst
 
   MermaidSrc -.->|COPY --from=builder| MermaidDst
    
    subgraph Discarded["Discarded (not copied)"]
RustToolchain["rustc compiler"]
CargoTool["cargo build tool"]
SourceFiles["mdBook source files"]
BuildCache["cargo build cache"]
end

Both binaries are statically linked or contain all necessary Rust runtime dependencies, allowing them to execute in the Python base image without the Rust toolchain.
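A sketch of the extraction step, reconstructed from the table above (the exact wording of Dockerfile:20-21 is assumed):

```dockerfile
# Stage 2: pull only the compiled binaries out of the builder stage
COPY --from=builder /usr/local/cargo/bin/mdbook          /usr/local/bin/mdbook
COPY --from=builder /usr/local/cargo/bin/mdbook-mermaid  /usr/local/bin/mdbook-mermaid
```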

Sources: Dockerfile:19-21

Python Dependency Installation

Python dependencies are installed using uv, a fast Python package installer written in Rust. The dependencies are defined in tools/requirements.txt:1-4

Python Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| requests | ≥2.31.0 | HTTP client for scraping DeepWiki |
| beautifulsoup4 | ≥4.12.0 | HTML parsing and navigation |
| html2text | ≥2020.1.16 | HTML to Markdown conversion |

The installation command (Dockerfile:17) uses these flags:

  • --system: Install to system Python (not virtualenv)
  • --no-cache: Don't cache downloaded packages (reduces image size)
  • -r /tmp/requirements.txt: Read dependencies from file
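A sketch of this installation sequence, assuming requirements.txt is staged under /tmp (any path not documented above is an assumption):

```dockerfile
# Copy the uv binaries from the official image, then install pinned dependencies
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
COPY tools/requirements.txt /tmp/requirements.txt
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
```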

Sources: Dockerfile:16-17 tools/requirements.txt:1-4

graph LR
    subgraph Approach1["Single-Stage Approach (Hypothetical)"]
Single["rust:latest + Python\n~2+ GB"]
end
    
    subgraph Approach2["Multi-Stage Approach (Actual)"]
Builder["Stage 1: rust:latest\n~1.5 GB\n(discarded)"]
Runtime["Stage 2: python:3.12-slim\n+ binaries + dependencies\n~300-400 MB"]
Builder -.->|Extract binaries only| Runtime
    end
    
 
   Single -->|Contains unnecessary build toolchain| Waste["Wasted Space"]
Runtime -->|Contains only runtime essentials| Efficient["Efficient"]
style Single fill:#f5f5f5
    style Builder fill:#f5f5f5
    style Runtime fill:#e8e8e8
    style Waste fill:#fff,stroke-dasharray: 5 5
    style Efficient fill:#fff,stroke-dasharray: 5 5

Image Size Optimization

The multi-stage strategy achieves significant size reduction by discarding the build environment.

Size Comparison

Size Breakdown of Final Image

| Component | Approximate Size |
| --- | --- |
| Python 3.12 slim base | ~150 MB |
| Python packages (requests, BeautifulSoup4, html2text) | ~20 MB |
| mdBook binary | ~8 MB |
| mdbook-mermaid binary | ~6 MB |
| uv package manager | ~10 MB |
| Application scripts | <1 MB |
| Total | ~300-400 MB |

Sources: Dockerfile:1-33 README.md156

graph TB
    subgraph Filesystem["/usr/local/bin/"]
BuildScript["build-docs.sh\n(orchestrator)"]
ScraperScript["deepwiki-scraper.py\n(Python scraper)"]
MdBookBin["mdbook\n(Rust binary)"]
MermaidBin["mdbook-mermaid\n(Rust binary)"]
UVBin["uv\n(Python installer)"]
end
    
    subgraph SystemPython["/usr/local/lib/python3.12/"]
Requests["requests package"]
BS4["beautifulsoup4 package"]
Html2Text["html2text package"]
end
    
    subgraph Execution["Execution Flow"]
Docker["docker run\n(CMD)"]
Docker --> BuildScript
 
       BuildScript -->|python| ScraperScript
 
       BuildScript -->|subprocess| MdBookBin
 
       MdBookBin -->|preprocessor| MermaidBin
 
       ScraperScript --> Requests
 
       ScraperScript --> BS4
 
       ScraperScript --> Html2Text
    end

Runtime Environment Structure

The final image contains a hybrid Python-Rust runtime where Python scripts can execute Rust binaries as subprocesses.

Runtime Component Locations

The entrypoint (Dockerfile:32) executes /usr/local/bin/build-docs.sh, which orchestrates calls to both Python and Rust components. The script can execute:

  • python /usr/local/bin/deepwiki-scraper.py for web scraping
  • mdbook init for initialization
  • mdbook build for HTML generation
  • mdbook-mermaid install for asset installation

Sources: Dockerfile:28-32 build-docs.sh

Container Execution Model

When the container runs, Docker executes the CMD (Dockerfile:32), which invokes build-docs.sh. This shell script has access to all binaries in /usr/local/bin/ (automatically on $PATH).

Process Tree During Execution

graph TD
    Docker["docker run\n(container init)"]
Docker --> CMD["CMD: build-docs.sh"]
CMD --> Phase1["Phase 1:\npython deepwiki-scraper.py"]
CMD --> Phase2["Phase 2: mdbook init"]
CMD --> Phase3["Phase 3: mdbook-mermaid install"]
CMD --> Phase4["Phase 4: mdbook build"]
Phase1 --> PyProc["Python 3.12 process"]
PyProc --> ReqLib["requests.get()"]
PyProc --> BS4Lib["BeautifulSoup()"]
PyProc --> H2TLib["html2text.HTML2Text()"]
Phase2 --> MdBookProc1["mdbook binary process"]
Phase3 --> MermaidProc["mdbook-mermaid binary process"]
Phase4 --> MdBookProc2["mdbook binary process"]
MdBookProc2 --> MermaidPreproc["mdbook-mermaid\n(as preprocessor)"]

Sources: Dockerfile32 README.md:122-145


Component Reference

Relevant source files

This page provides an overview of the three major components that comprise the DeepWiki-to-mdBook Converter system and their responsibilities. Each component operates at a different layer of the technology stack (Shell, Python, Rust) and handles a specific phase of the documentation transformation pipeline.

For detailed documentation of each component's internal implementation, see the build-docs.sh Orchestrator, deepwiki-scraper.py, and mdBook Integration pages.

System Component Architecture

The system consists of three primary executable components that work together in sequence, coordinated through file system operations and process execution.

Component Architecture Diagram

graph TB
    subgraph "Shell Layer"
        buildsh["build-docs.sh\nOrchestrator"]
end
    
    subgraph "Python Layer"
        scraper["deepwiki-scraper.py\nContent Processor"]
bs4["BeautifulSoup4\nHTML Parser"]
html2text["html2text\nMarkdown Converter"]
requests["requests\nHTTP Client"]
end
    
    subgraph "Rust Layer"
        mdbook["mdbook\nBinary"]
mermaid["mdbook-mermaid\nBinary"]
end
    
    subgraph "Configuration Files"
        booktoml["book.toml"]
summarymd["SUMMARY.md"]
end
    
    subgraph "File System"
        wikidir["$WIKI_DIR\nTemp Storage"]
outputdir["$OUTPUT_DIR\nFinal Output"]
end
    
 
   buildsh -->|executes python3| scraper
 
   buildsh -->|generates| booktoml
 
   buildsh -->|generates| summarymd
 
   buildsh -->|executes mdbook-mermaid| mermaid
 
   buildsh -->|executes mdbook| mdbook
    
 
   scraper -->|uses| bs4
 
   scraper -->|uses| html2text
 
   scraper -->|uses| requests
 
   scraper -->|writes .md files| wikidir
    
 
   mdbook -->|integrates| mermaid
 
   mdbook -->|reads config| booktoml
 
   mdbook -->|reads TOC| summarymd
 
   mdbook -->|reads sources| wikidir
 
   mdbook -->|writes HTML| outputdir
    
 
   buildsh -->|copies files| outputdir

Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 Dockerfile:1-33

Component Execution Flow

This diagram shows the actual execution sequence with specific function calls and file operations that occur during a complete documentation build.

Execution Flow with Code Entities

sequenceDiagram
    participant User
    participant buildsh as "build-docs.sh"
    participant scraper as "deepwiki-scraper.py::main()"
    participant extract as "extract_wiki_structure()"
    participant content as "extract_page_content()"
    participant enhance as "extract_and_enhance_diagrams()"
    participant mdbook as "mdbook binary"
    participant mermaid as "mdbook-mermaid binary"
    participant fs as "File System"
    
    User->>buildsh: docker run -e REPO=...
    buildsh->>buildsh: Parse $REPO, $BOOK_TITLE, etc
    buildsh->>buildsh: Set WIKI_DIR=/workspace/wiki
    
    buildsh->>scraper: python3 deepwiki-scraper.py $REPO $WIKI_DIR
    scraper->>extract: extract_wiki_structure(repo, session)
    extract->>extract: BeautifulSoup4 parsing
    extract-->>scraper: pages[] array
    
    loop For each page
        scraper->>content: extract_page_content(url, session)
        content->>content: convert_html_to_markdown()
        content->>content: clean_deepwiki_footer()
        content->>fs: Write to $WIKI_DIR/*.md
    end
    
    scraper->>enhance: extract_and_enhance_diagrams(repo, temp_dir)
    enhance->>enhance: Extract diagrams from JavaScript
    enhance->>enhance: Fuzzy match with progressive chunks
    enhance->>fs: Update $WIKI_DIR/*.md with diagrams
    scraper-->>buildsh: Exit 0
    
    alt MARKDOWN_ONLY=true
        buildsh->>fs: cp $WIKI_DIR/* $OUTPUT_DIR/
        buildsh-->>User: Exit (skip mdBook)
    else Full build
        buildsh->>buildsh: Generate book.toml
        buildsh->>buildsh: Generate SUMMARY.md from files
        buildsh->>fs: mkdir $BOOK_DIR/src
        buildsh->>fs: cp $WIKI_DIR/* $BOOK_DIR/src/
        
        buildsh->>mermaid: mdbook-mermaid install $BOOK_DIR
        mermaid->>fs: Install mermaid.js assets
        
        buildsh->>mdbook: mdbook build
        mdbook->>mdbook: Parse SUMMARY.md
        mdbook->>mdbook: Process markdown files
        mdbook->>mdbook: Render HTML with rust theme
        mdbook->>fs: Write to $BOOK_DIR/book/
        
        buildsh->>fs: cp $BOOK_DIR/book $OUTPUT_DIR/
        buildsh->>fs: cp $WIKI_DIR $OUTPUT_DIR/markdown/
        buildsh-->>User: Build complete
    end

Sources: build-docs.sh:55-206 tools/deepwiki-scraper.py:790-916 tools/deepwiki-scraper.py:596-789

Component Responsibility Matrix

The following table details the specific responsibilities and capabilities of each component.

| Component | Type | Primary Responsibility | Key Functions/Operations | Input | Output |
| --- | --- | --- | --- | --- | --- |
| build-docs.sh | Shell Script | Orchestration and configuration | Parse environment variables, auto-detect Git repository, execute scraper, generate book.toml, generate SUMMARY.md, execute mdBook tools, copy outputs | Environment variables | Complete documentation site |
| deepwiki-scraper.py | Python Script | Content extraction and enhancement | extract_wiki_structure(), extract_page_content(), convert_html_to_markdown(), extract_and_enhance_diagrams(), clean_deepwiki_footer() | DeepWiki URL | Enhanced Markdown files |
| mdbook | Rust Binary | HTML generation | Parse SUMMARY.md, process Markdown, apply theme, generate navigation, enable search | Markdown + config | HTML documentation |
| mdbook-mermaid | Rust Binary | Diagram rendering | Install mermaid.js, install CSS assets, process mermaid code blocks | Markdown with mermaid | HTML with rendered diagrams |

Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 README.md:146-156

Component File Locations

Each component resides in a specific location within the repository and Docker container, with distinct installation methods.

File System Layout

graph TB
    subgraph "Repository Structure"
        repo["/"]
buildscript["build-docs.sh\nOrchestrator script"]
dockerfile["Dockerfile\nMulti-stage build"]
toolsdir["tools/"]
scraper_py["deepwiki-scraper.py\nMain scraper"]
requirements["requirements.txt\nPython deps"]
repo --> buildscript
 
       repo --> dockerfile
 
       repo --> toolsdir
 
       toolsdir --> scraper_py
 
       toolsdir --> requirements
    end
    
    subgraph "Docker Container"
        container["/"]
usrbin["/usr/local/bin/"]
buildsh_installed["build-docs.sh"]
scraper_installed["deepwiki-scraper.py"]
mdbook_bin["mdbook"]
mermaid_bin["mdbook-mermaid"]
workspace["/workspace"]
wikidir["/workspace/wiki"]
bookdir["/workspace/book"]
outputvol["/output"]
container --> usrbin
 
       container --> workspace
 
       container --> outputvol
        
 
       usrbin --> buildsh_installed
 
       usrbin --> scraper_installed
 
       usrbin --> mdbook_bin
 
       usrbin --> mermaid_bin
        
 
       workspace --> wikidir
 
       workspace --> bookdir
    end
    
 
   buildscript -.->|COPY| buildsh_installed
 
   scraper_py -.->|COPY| scraper_installed
    
    style buildsh_installed fill:#fff9c4
    style scraper_installed fill:#e8f5e9
    style mdbook_bin fill:#f3e5f5
    style mermaid_bin fill:#f3e5f5

Sources: Dockerfile:1-33 build-docs.sh:27-30

Component Dependencies

Each component has specific external dependencies that must be available at runtime.

| Component | Runtime | Dependencies | Installation Method |
| --- | --- | --- | --- |
| build-docs.sh | bash | Git (optional, for auto-detection), Python 3.12+, mdbook binary, mdbook-mermaid binary | Bundled in Docker |
| deepwiki-scraper.py | Python 3.12 | requests (HTTP client), beautifulsoup4 (HTML parsing), html2text (Markdown conversion) | uv pip install -r requirements.txt |
| mdbook | Native Binary | Compiled from Rust source, no runtime dependencies | cargo install mdbook |
| mdbook-mermaid | Native Binary | Compiled from Rust source, no runtime dependencies | cargo install mdbook-mermaid |

Sources: Dockerfile:1-33 tools/requirements.txt README.md:154-156

Component Communication Protocol

Components communicate exclusively through the file system and process exit codes, with no direct API calls or shared memory.

Inter-Component Communication

graph LR
    subgraph "Phase 1: Extraction"
        buildsh1["build-docs.sh"]
scraper1["deepwiki-scraper.py"]
env["Environment:\n$REPO\n$WIKI_DIR"]
wikidir1["$WIKI_DIR/\n*.md files"]
buildsh1 -->|sets| env
 
       env -->|python3 scraper.py $REPO $WIKI_DIR| scraper1
 
       scraper1 -->|writes| wikidir1
 
       scraper1 -.->|exit 0| buildsh1
    end
    
    subgraph "Phase 2: Configuration"
        buildsh2["build-docs.sh"]
booktoml2["book.toml"]
summarymd2["SUMMARY.md"]
wikidir2["$WIKI_DIR/\nfile scan"]
buildsh2 -->|reads structure| wikidir2
 
       buildsh2 -->|cat > book.toml| booktoml2
 
       buildsh2 -->|generates from files| summarymd2
    end
    
    subgraph "Phase 3: Build"
        buildsh3["build-docs.sh"]
mermaid3["mdbook-mermaid"]
mdbook3["mdbook"]
config3["book.toml\nSUMMARY.md\nsrc/*.md"]
output3["$OUTPUT_DIR/\nbook/"]
buildsh3 -->|mdbook-mermaid install| mermaid3
 
       mermaid3 -->|writes assets| config3
 
       buildsh3 -->|mdbook build| mdbook3
 
       mdbook3 -->|reads| config3
 
       mdbook3 -->|writes| output3
 
       mdbook3 -.->|exit 0| buildsh3
    end
    
 
   wikidir1 -->|same files| wikidir2

Sources: build-docs.sh:55-206

Environment Variable Interface

The orchestrator component accepts configuration through environment variables, which control all aspects of system behavior.

| Variable | Purpose | Default | Used By | Set At |
| --- | --- | --- | --- | --- |
| $REPO | GitHub repository identifier | Auto-detected | build-docs.sh, deepwiki-scraper.py | build-docs.sh:9-19 |
| $BOOK_TITLE | Documentation title | "Documentation" | build-docs.sh (book.toml) | build-docs.sh:23 |
| $BOOK_AUTHORS | Author name(s) | Extracted from $REPO | build-docs.sh (book.toml) | build-docs.sh:24-44 |
| $GIT_REPO_URL | Source repository URL | Constructed from $REPO | build-docs.sh (book.toml) | build-docs.sh:25-45 |
| $MARKDOWN_ONLY | Skip mdBook build | "false" | build-docs.sh | build-docs.sh:26-76 |
| $WORK_DIR | Working directory | "/workspace" | build-docs.sh | build-docs.sh:27 |
| $WIKI_DIR | Temp markdown storage | "$WORK_DIR/wiki" | build-docs.sh, deepwiki-scraper.py | build-docs.sh:28 |
| $OUTPUT_DIR | Final output location | "/output" | build-docs.sh | build-docs.sh:29 |
| $BOOK_DIR | mdBook workspace | "$WORK_DIR/book" | build-docs.sh | build-docs.sh:30 |
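A sketch of how these defaults could be applied in the script; the exact assignments in build-docs.sh are assumptions based on the table above:

```bash
BOOK_TITLE="${BOOK_TITLE:-Documentation}"
MARKDOWN_ONLY="${MARKDOWN_ONLY:-false}"
WORK_DIR="${WORK_DIR:-/workspace}"
WIKI_DIR="${WIKI_DIR:-$WORK_DIR/wiki}"
OUTPUT_DIR="${OUTPUT_DIR:-/output}"
BOOK_DIR="${BOOK_DIR:-$WORK_DIR/book}"
```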

Sources: build-docs.sh:8-30 build-docs.sh:43-45

Python Module Structure

The deepwiki-scraper.py component is organized as a single-file script with a clear functional hierarchy.

Python Function Call Graph

graph TD
    main["main()\nEntry point"]
extract_struct["extract_wiki_structure()\nDiscover pages"]
extract_content["extract_page_content()\nProcess single page"]
enhance["extract_and_enhance_diagrams()\nAdd diagrams"]
fetch["fetch_page()\nHTTP with retries"]
sanitize["sanitize_filename()\nClean filenames"]
convert["convert_html_to_markdown()\nHTML→MD"]
clean["clean_deepwiki_footer()\nRemove UI"]
extract_mermaid["extract_mermaid_from_nextjs_data()\nParse JS payload"]
main --> extract_struct
 
   main --> extract_content
 
   main --> enhance
    
 
   extract_struct --> fetch
 
   extract_content --> fetch
 
   extract_content --> convert
    
 
   convert --> clean
    
 
   enhance --> fetch
 
   enhance --> extract_mermaid
    
 
   extract_content --> sanitize

Sources: tools/deepwiki-scraper.py:790-919 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-789

graph TB
    start["Start"]
detect["Auto-detect Git repository\nlines 9-19"]
validate["Validate configuration\nlines 32-53"]
step1["Step 1: Execute scraper\nline 58:\npython3 deepwiki-scraper.py"]
check{"MARKDOWN_ONLY\n== true?"}
markdown_exit["Copy markdown only\nlines 64-76\nexit 0"]
step2["Step 2: Initialize mdBook\nlines 79-106:\nmkdir, cat > book.toml"]
step3["Step 3: Generate SUMMARY.md\nlines 109-159:\nscan files, generate TOC"]
step4["Step 4: Copy sources\nlines 164-166:\ncp wiki/* src/"]
step5["Step 5: Install mermaid\nlines 169-171:\nmdbook-mermaid install"]
step6["Step 6: Build book\nlines 174-176:\nmdbook build"]
step7["Step 7: Copy outputs\nlines 179-191:\ncp to /output"]
done["Done"]
start --> detect
 
   detect --> validate
 
   validate --> step1
 
   step1 --> check
    
 
   check -->|yes| markdown_exit
 
   markdown_exit --> done
    
 
   check -->|no| step2
 
   step2 --> step3
 
   step3 --> step4
 
   step4 --> step5
 
   step5 --> step6
 
   step6 --> step7
 
   step7 --> done

Shell Script Structure

The build-docs.sh orchestrator follows a linear execution model with conditional branching for markdown-only mode.

Shell Script Execution Blocks

Sources: build-docs.sh:1-206

Cross-Component Data Formats

Data passes between components in well-defined formats through the file system.

| Data Format | Producer | Consumer | Location | Structure |
| --- | --- | --- | --- | --- |
| Enhanced Markdown | deepwiki-scraper.py | mdbook | $WIKI_DIR/*.md | UTF-8 text, front matter optional, mermaid code blocks |
| book.toml | build-docs.sh | mdbook | $BOOK_DIR/book.toml | TOML format, sections: [book], [output.html], [preprocessor.mermaid] |
| SUMMARY.md | build-docs.sh | mdbook | $BOOK_DIR/src/SUMMARY.md | Markdown list format, relative file paths |
| File hierarchy | deepwiki-scraper.py | build-docs.sh | $WIKI_DIR/ and $WIKI_DIR/section-*/ | Root: N-title.md, Subsections: section-N/N-M-title.md |
| HTML output | mdbook | User | $OUTPUT_DIR/book/ | Complete static site with search index |

Sources: build-docs.sh:84-103 build-docs.sh:112-159 tools/deepwiki-scraper.py:849-868

graph TB
    subgraph "Stage 1: rust:latest"
        rust_base["rust:latest base\n~1.5 GB"]
cargo["cargo install"]
mdbook_build["mdbook binary\ncompilation"]
mermaid_build["mdbook-mermaid binary\ncompilation"]
rust_base --> cargo
 
       cargo --> mdbook_build
 
       cargo --> mermaid_build
    end
    
    subgraph "Stage 2: python:3.12-slim"
        py_base["python:3.12-slim base\n~150 MB"]
uv_install["Install uv package manager"]
pip_install["uv pip install\nrequirements.txt"]
copy_rust["COPY --from=builder\nRust binaries"]
copy_scripts["COPY Python + Shell scripts"]
py_base --> uv_install
 
       uv_install --> pip_install
 
       pip_install --> copy_rust
 
       copy_rust --> copy_scripts
    end
    
    subgraph "Final Image Contents"
        final["/usr/local/bin/"]
build_sh["build-docs.sh"]
scraper_py["deepwiki-scraper.py"]
mdbook_final["mdbook"]
mermaid_final["mdbook-mermaid"]
final --> build_sh
 
       final --> scraper_py
 
       final --> mdbook_final
 
       final --> mermaid_final
    end
    
 
   mdbook_build -.->|extract| copy_rust
 
   mermaid_build -.->|extract| copy_rust
    
 
   copy_scripts --> build_sh
 
   copy_scripts --> scraper_py
 
   copy_rust --> mdbook_final
 
   copy_rust --> mermaid_final

Component Installation in Docker

The multi-stage Docker build process installs each component using its native tooling, then combines them in a minimal runtime image.

Docker Build Process

Sources: Dockerfile:1-33

Next Steps

For detailed implementation documentation of each component, see:

  • build-docs.sh Orchestrator : Environment variable parsing, Git auto-detection, configuration file generation, subprocess execution, error handling
  • deepwiki-scraper.py : Wiki structure discovery, HTML parsing, Markdown conversion, diagram extraction algorithms, fuzzy matching implementation
  • mdBook Integration : Configuration schema, SUMMARY.md generation algorithm, mdbook-mermaid preprocessor integration, theme customization

build-docs.sh Orchestrator

Relevant source files

Purpose and Scope

This page documents the build-docs.sh shell script, which serves as the central orchestrator for the entire documentation build process. This script is the container's entry point and coordinates all phases of the system: configuration parsing, scraper invocation, mdBook configuration generation, and output management.

For details about the Python scraping component that this orchestrator calls, see deepwiki-scraper.py. For information about the mdBook integration and configuration format, see mdBook Integration.

Overview

The build-docs.sh script is a Bash orchestration layer that implements the three-phase pipeline described in Three-Phase Pipeline. It has the following core responsibilities:

| Responsibility | Lines | Description |
| --- | --- | --- |
| Auto-detection | build-docs.sh:8-19 | Detects repository from Git remote if not provided |
| Configuration | build-docs.sh:21-53 | Parses environment variables and applies defaults |
| Phase 1 orchestration | build-docs.sh:55-58 | Invokes Python scraper |
| Markdown-only exit | build-docs.sh:60-76 | Implements fast-path for debugging |
| Phase 3 orchestration | build-docs.sh:78-191 | Generates configs, builds mdBook, copies outputs |

Sources: build-docs.sh:1-206

Script Workflow

Complete Execution Flow

The following diagram shows the complete control flow through the orchestrator, including all decision points and phase transitions:

flowchart TD
    Start[["build-docs.sh entry"]]
 
   Start --> AutoDetect["Auto-detect repository\nfrom git config"]
AutoDetect --> ValidateRepo{"REPO variable\nset?"}
ValidateRepo -->|No| Error[["Exit with error"]]
 
   ValidateRepo -->|Yes| ExtractParts["Extract REPO_OWNER\nand REPO_NAME"]
ExtractParts --> SetDefaults["Set defaults:\nBOOK_AUTHORS=REPO_OWNER\nGIT_REPO_URL=github.com/REPO"]
SetDefaults --> PrintConfig["Print configuration\nto stdout"]
PrintConfig --> Phase1["Execute Phase 1:\npython3 deepwiki-scraper.py"]
Phase1 --> CheckMode{"MARKDOWN_ONLY\n= true?"}
CheckMode -->|Yes| CopyMd["Copy WIKI_DIR to\nOUTPUT_DIR/markdown"]
CopyMd --> ExitMd[["Exit: markdown-only"]]
    
 
   CheckMode -->|No| InitBook["Create BOOK_DIR\nand book.toml"]
InitBook --> GenSummary["Generate SUMMARY.md\nfrom file structure"]
GenSummary --> CopySrc["Copy WIKI_DIR to\nBOOK_DIR/src"]
CopySrc --> InstallMermaid["mdbook-mermaid install"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["Copy outputs:\nbook/, markdown/, book.toml"]
CopyOutputs --> Success[["Exit: build complete"]]
    
    style Start fill:#f9f9f9
    style Phase1 fill:#e8f5e9
    style CheckMode fill:#fff9c4
    style ExitMd fill:#f9f9f9
    style Success fill:#f9f9f9
    style Error fill:#ffebee

Sources: build-docs.sh:1-206

Key Decision Point: MARKDOWN_ONLY Mode

The MARKDOWN_ONLY environment variable creates two distinct execution paths in the orchestrator. When set to "true", the script bypasses mdBook configuration generation and building (Phase 3), providing a fast path for debugging content extraction and diagram placement.

Sources: build-docs.sh26 build-docs.sh:60-76

Configuration Handling

Auto-Detection System

The script implements an intelligent auto-detection system for the REPO variable when running in a Git repository context:

flowchart LR
 
   Start["REPO variable"] --> Check{"REPO set?"}
Check -->|Yes| UseProvided["Use provided value"]
Check -->|No| GitCheck{"Inside Git\nrepository?"}
GitCheck -->|No| RequireManual["REPO remains empty"]
GitCheck -->|Yes| GetRemote["git config --get\nremote.origin.url"]
GetRemote --> Extract["Extract owner/repo\nusing sed regex"]
Extract --> SetRepo["Set REPO variable"]
UseProvided --> Validate["Validation check"]
SetRepo --> Validate
 
   RequireManual --> Validate
    
 
   Validate --> ValidCheck{"REPO\nis set?"}
ValidCheck -->|No| ExitError[["Exit with error:\nREPO must be set"]]
 
   ValidCheck -->|Yes| Continue["Continue execution"]

The regular expression used for extraction handles multiple GitHub URL formats:

  • https://github.com/owner/repo.git
  • git@github.com:owner/repo.git
  • https://github.com/owner/repo
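A minimal sketch of the auto-detection under those assumptions (the actual sed expression at build-docs.sh:8-19 may differ):

```bash
if [ -z "$REPO" ] && git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    REMOTE_URL="$(git config --get remote.origin.url)"
    # Strip the protocol/host prefix and any trailing .git, leaving owner/repo
    REPO="$(echo "$REMOTE_URL" | sed -E 's#^(https://github\.com/|git@github\.com:)##; s#\.git$##')"
fi
```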

Sources: build-docs.sh:8-19 build-docs.sh:32-37

Configuration Variable Flow

The script manages five primary configuration variables, with the following precedence and default logic:

| Variable | Source | Default Derivation | Code Reference |
| --- | --- | --- | --- |
| REPO | Environment or Git auto-detect | (required) | build-docs.sh:8-22 |
| BOOK_TITLE | Environment | "Documentation" | build-docs.sh:23 |
| BOOK_AUTHORS | Environment | $REPO_OWNER | build-docs.sh:40-44 |
| GIT_REPO_URL | Environment | https://github.com/$REPO | build-docs.sh:40-45 |
| MARKDOWN_ONLY | Environment | "false" | build-docs.sh:26 |

The script extracts REPO_OWNER and REPO_NAME from the REPO variable using shell string manipulation:
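A sketch using standard parameter expansion (the real assignments at build-docs.sh:39-45 are assumed to be equivalent):

```bash
REPO_OWNER="${REPO%%/*}"    # text before the first slash
REPO_NAME="${REPO##*/}"     # text after the last slash
BOOK_AUTHORS="${BOOK_AUTHORS:-$REPO_OWNER}"
GIT_REPO_URL="${GIT_REPO_URL:-https://github.com/$REPO}"
```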

Sources: build-docs.sh:39-45

Working Directory Structure

The orchestrator uses four primary directory paths:

  • WORK_DIR="/workspace": Temporary workspace for all build operations
  • WIKI_DIR="$WORK_DIR/wiki": Scraper output location
  • BOOK_DIR="$WORK_DIR/book": mdBook project directory
  • OUTPUT_DIR="/output": Volume-mounted final output location

Sources: build-docs.sh:27-30

Phase Orchestration

Phase 1: Scraper Invocation

The orchestrator invokes the Python scraper with exactly two positional arguments:
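A sketch of the call, matching the sequence diagram in the Component Reference (the absolute script path is an assumption):

```bash
python3 /usr/local/bin/deepwiki-scraper.py "$REPO" "$WIKI_DIR"
```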

This command executes the complete Phase 1 and Phase 2 pipeline as documented in Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement. The scraper writes all output to $WIKI_DIR.

Sources: build-docs.sh:55-58

Phase 3: mdBook Configuration and Build

Phase 3 is implemented through six distinct steps in the orchestrator:

Note: Step numbering in stdout messages is off-by-one from phase numbering because the scraper is "Step 1."

Sources: build-docs.sh:78-191

Configuration File Generation

flowchart LR
    EnvVars["Environment variables:\nBOOK_TITLE\nBOOK_AUTHORS\nGIT_REPO_URL"]
Template["Heredoc template\nat line 85-103"]
BookToml["BOOK_DIR/book.toml"]
EnvVars --> Template
 
   Template --> BookToml
    
 
   BookToml --> MdBook["mdbook build"]

book.toml Generation

The orchestrator dynamically generates the book.toml configuration file for mdBook using a heredoc:
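A sketch of the heredoc, assembled from the field list below; the exact formatting and the multilingual value are assumptions about build-docs.sh:84-103:

```bash
cat > book.toml << EOF
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]
language = "en"
multilingual = false
src = "src"

[output.html]
default-theme = "rust"
git-repository-url = "$GIT_REPO_URL"

[output.html.fold]
enable = true
level = 1

[preprocessor.mermaid]
command = "mdbook-mermaid"
EOF
```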

The generated book.toml includes:

  • [book] section: title, authors, language, multilingual, src
  • [output.html] section: default-theme, git-repository-url
  • [preprocessor.mermaid] section: command
  • [output.html.fold] section: enable, level

The git-repository-url setting enables mdBook's "Edit this page" functionality, linking back to the GitHub repository specified in $GIT_REPO_URL.

Sources: build-docs.sh:84-103

flowchart TD
    Start["Begin SUMMARY.md generation"]
Start --> FindFirst["Find first .md file\nin WIKI_DIR root"]
FindFirst --> ExtractTitle1["Extract title from\nfirst line (# Title)"]
ExtractTitle1 --> WriteIntro["Write as Introduction link"]
WriteIntro --> IterateMain["Iterate *.md files\nin WIKI_DIR root"]
IterateMain --> SkipFirst{"Is this\nfirst file?"}
SkipFirst -->|Yes| NextFile["Skip to next file"]
SkipFirst -->|No| ExtractTitle2["Extract title\nfrom first line"]
ExtractTitle2 --> GetSectionNum["Extract section number\nusing grep regex"]
GetSectionNum --> CheckSubdir{"section-N/\ndirectory exists?"}
CheckSubdir -->|No| WriteStandalone["Write as standalone:\n- [Title](file.md)"]
CheckSubdir -->|Yes| WriteSection["Write section header:\n# Title"]
WriteSection --> WriteMainLink["Write main page link:\n- [Title](file.md)"]
WriteMainLink --> IterateSubs["Iterate section-N/*.md"]
IterateSubs --> WriteSubLinks["Write indented sub-links:\n - [SubTitle](section-N/file.md)"]
WriteStandalone --> NextFile
 
   WriteSubLinks --> NextFile
    
 
   NextFile --> MoreFiles{"More\nfiles?"}
MoreFiles -->|Yes| IterateMain
 
   MoreFiles -->|No| WriteSummary["Write to BOOK_DIR/src/SUMMARY.md"]
WriteSummary --> Done["Generation complete"]

SUMMARY.md Generation Algorithm

The orchestrator generates the table of contents (SUMMARY.md) by scanning the actual file structure in $WIKI_DIR. This dynamic generation ensures the table of contents always matches the scraped content.

The algorithm extracts each title by reading the first line of the Markdown file and stripping the leading # prefix with sed, and it extracts section numbers from filenames with a grep regex, as sketched below.
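A sketch of both extractions; the grep pattern matches the one documented in the mdBook Integration page, while the sed expression is an assumption:

```bash
# Title: first line of the file with the leading "# " removed
title="$(head -n 1 "$file" | sed 's/^# //')"

# Section number: leading digits of the filename, e.g. 5 from 5-component-reference.md
section_num="$(basename "$file" | grep -oE '^[0-9]+')"
```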

For detailed information about how the file structure is organized, see Wiki Structure Discovery.

Sources: build-docs.sh:108-159

File Operations

Copy Operations Mapping

The orchestrator performs strategic copy operations to move data through the pipeline:

| Source | Destination | Purpose | Code Reference |
| --- | --- | --- | --- |
| $WIKI_DIR/* | $OUTPUT_DIR/markdown/ | Markdown-only mode output | build-docs.sh:65 |
| $WIKI_DIR/* | $BOOK_DIR/src/ | Source files for mdBook | build-docs.sh:166 |
| $BOOK_DIR/book | $OUTPUT_DIR/book/ | Final HTML output | build-docs.sh:184 |
| $WIKI_DIR/* | $OUTPUT_DIR/markdown/ | Markdown reference copy | build-docs.sh:188 |
| $BOOK_DIR/book.toml | $OUTPUT_DIR/book.toml | Configuration reference | build-docs.sh:191 |

The final output structure in $OUTPUT_DIR is:

/output/
├── book/              # HTML documentation (from BOOK_DIR/book)
│   ├── index.html
│   ├── *.html
│   └── ...
├── markdown/          # Source Markdown files (from WIKI_DIR)
│   ├── 1-overview.md
│   ├── 2-section.md
│   ├── section-2/
│   └── ...
└── book.toml          # Configuration copy (from BOOK_DIR)

Sources: build-docs.sh:178-191

Atomic Output Management

The orchestrator uses a two-stage directory strategy for atomic outputs:

  1. Working stage : All operations occur in /workspace (ephemeral)
  2. Output stage : Final artifacts are copied to /output (volume-mounted)

This ensures that partial builds never appear in the output directory: only completed artifacts are copied. If any step fails, the set -e directive at build-docs.sh:2 terminates the script immediately, so nothing partial is written.

Sources: build-docs.sh2 build-docs.sh:27-30 build-docs.sh:178-191

Tool Invocations

External Command Execution

The orchestrator invokes three external tools during execution:

Each tool is invoked with specific working directories and arguments:

Python scraper invocation (build-docs.sh:58) runs the scraper against the configured repository and writes its output to $WIKI_DIR.

mdbook-mermaid installation (build-docs.sh:171) installs the necessary JavaScript and CSS assets for Mermaid diagram rendering into the mdBook project.

mdBook build (build-docs.sh:176) generates the HTML site; it is executed from within $BOOK_DIR because of the cd "$BOOK_DIR" at build-docs.sh:82.
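A sketch of the three invocations in sequence (quoting and the cd placement are assumptions):

```bash
# Step 1: scrape the wiki into $WIKI_DIR
python3 /usr/local/bin/deepwiki-scraper.py "$REPO" "$WIKI_DIR"

# Step 5: install Mermaid.js assets into the mdBook project
mdbook-mermaid install "$BOOK_DIR"

# Step 6: build the HTML site from inside the book directory
cd "$BOOK_DIR" && mdbook build
```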

Sources: build-docs.sh58 build-docs.sh82 build-docs.sh171 build-docs.sh176

Error Handling

Validation and Exit Conditions

The script implements minimal but critical validation:
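A sketch of that check (message wording assumed):

```bash
if [ -z "$REPO" ]; then
    echo "Error: REPO must be set, e.g. docker run -e REPO=owner/repo ..."
    exit 1
fi
```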

The set -e directive at build-docs.sh:2 ensures that any command failure (non-zero exit code) immediately terminates the script. This includes:

  • HTTP failures in the Python scraper
  • File system errors during copy operations
  • mdBook build failures
  • mdbook-mermaid installation failures

The only explicit validation check is for the REPO variable at build-docs.sh:32-37 which prints usage instructions and exits with code 1 if not set.

Sources: build-docs.sh2 build-docs.sh:32-37

Stdout Output Format

The orchestrator provides structured console output for monitoring build progress:

================================================================================
DeepWiki Documentation Builder
================================================================================

Configuration:
  Repository:    owner/repo
  Book Title:    Documentation Title
  Authors:       Author Name
  Git Repo URL:  https://github.com/owner/repo
  Markdown Only: false

Step 1: Scraping wiki from DeepWiki...
[scraper output...]

Step 2: Initializing mdBook structure...

Step 3: Generating SUMMARY.md from scraped content...
Generated SUMMARY.md with N entries

Step 4: Copying markdown files to book...

Step 5: Installing mdbook-mermaid assets...

Step 6: Building mdBook...

Step 7: Copying outputs to /output...

================================================================================
✓ Documentation build complete!
================================================================================

Outputs:
  - HTML book:       /output/book/
  - Markdown files:  /output/markdown/
  - Book config:     /output/book.toml

To serve the book locally:
  cd /output && python3 -m http.server --directory book 8000

Each step is clearly labeled with progress indicators. The configuration block is printed before processing begins to aid in debugging.

Sources: build-docs.sh:4-6 build-docs.sh:47-53 build-docs.sh:55-205


deepwiki-scraper.py

Relevant source files

Purpose and Scope

The deepwiki-scraper.py script is the core content extraction engine that scrapes wiki pages from DeepWiki.com and converts them into clean Markdown files with intelligently placed Mermaid diagrams. This page documents the script's internal architecture, algorithms, and data transformations.

For information about how this script is orchestrated within the larger build system, see 5.1: build-docs.sh Orchestrator. For detailed explanations of the extraction and enhancement phases, see 6: Phase 1: Markdown Extraction and 7: Phase 2: Diagram Enhancement.

Sources: tools/deepwiki-scraper.py:1-11

Command-Line Interface

The script accepts exactly two arguments and is designed to be called programmatically:

| Parameter | Description | Example |
| --- | --- | --- |
| owner/repo | GitHub repository identifier in format owner/repo | facebook/react |
| output-dir | Directory where markdown files will be written | ./output/markdown |

The script validates the repository format using regex ^[\w-]+/[\w-]+$ and exits with an error if the format is invalid.
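A usage sketch (the output path is chosen for illustration):

```bash
python3 deepwiki-scraper.py facebook/react ./output/markdown
```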

Sources: tools/deepwiki-scraper.py:790-802

Main Execution Flow

The main() function orchestrates all operations using a temporary directory workflow to ensure atomic file operations:

Atomic Workflow Design: All scraping and enhancement operations occur in a temporary directory. Files are only moved to the final output directory after all processing completes successfully. If the script crashes or is interrupted, the output directory remains untouched.

graph TB
 
   Start["main()"] --> Validate["Validate Arguments\nRegex: ^[\w-]+/[\w-]+$"]
Validate --> TempDir["Create Temporary Directory\ntempfile.TemporaryDirectory()"]
TempDir --> Session["Create requests.Session()\nwith User-Agent headers"]
Session --> Phase1["PHASE 1: Clean Markdown\nextract_wiki_structure()\nextract_page_content()"]
Phase1 --> WriteTemp["Write files to temp_dir\nOrganized by hierarchy"]
WriteTemp --> Phase2["PHASE 2: Diagram Enhancement\nextract_and_enhance_diagrams()"]
Phase2 --> EnhanceTemp["Enhance files in temp_dir\nInsert diagrams via fuzzy matching"]
EnhanceTemp --> Phase3["PHASE 3: Atomic Move\nshutil.copytree()\nshutil.copy2()"]
Phase3 --> CleanOutput["Clear output_dir\nMove temp files to output"]
CleanOutput --> Complete["Complete\ntemp_dir auto-deleted"]
style Phase1 fill:#e8f5e9
    style Phase2 fill:#f3e5f5
    style Phase3 fill:#fff4e1

Sources: tools/deepwiki-scraper.py:790-919

Dependencies and HTTP Session

The script imports three primary libraries for web scraping and conversion:

| Dependency | Purpose | Key Usage |
| --- | --- | --- |
| requests | HTTP client with session support | tools/deepwiki-scraper.py:17 |
| beautifulsoup4 | HTML parsing and DOM traversal | tools/deepwiki-scraper.py:18 |
| html2text | HTML to Markdown conversion | tools/deepwiki-scraper.py:19 |

The HTTP session is configured with browser-like headers to avoid being blocked:
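A minimal sketch of that setup; the exact User-Agent string used at tools/deepwiki-scraper.py:817-821 is an assumption:

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
})
```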

Sources: tools/deepwiki-scraper.py:817-821 tools/requirements.txt:1-4

Core Function Reference

Structure Discovery Functions

extract_wiki_structure(repo, session) tools/deepwiki-scraper.py:78-125

  • Fetches the repository's main wiki page
  • Extracts all links matching pattern /owner/repo/\d+
  • Parses page numbers (e.g., 1, 2.1, 3.2.1) and titles
  • Determines hierarchy level by counting dots in page number
  • Returns sorted list of page dictionaries with keys: number, title, url, href, level

discover_subsections(repo, main_page_num, session) tools/deepwiki-scraper.py:44-76

  • Attempts to discover subsections by testing URL patterns
  • Tests up to 10 subsections per main page (e.g., /repo/2-1-, /repo/2-2-)
  • Uses HEAD requests for efficiency
  • Returns list of discovered subsection metadata

Sources: tools/deepwiki-scraper.py:44-125

Content Extraction Functions

extract_page_content(url, session, current_page_info) tools/deepwiki-scraper.py:453-594

  • Main content extraction function called for each wiki page
  • Removes navigation and UI elements before conversion
  • Converts HTML to Markdown using html2text library
  • Rewrites internal DeepWiki links to relative Markdown file paths
  • Returns clean Markdown string

fetch_page(url, session) tools/deepwiki-scraper.py:27-42

  • Implements retry logic with exponential backoff
  • Attempts each request up to 3 times with 2-second delays
  • Raises exception on final failure
  • Returns requests.Response object
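A sketch of that retry behavior; the real signature and timeout handling may differ:

```python
import time
import requests

def fetch_page(url, session, retries=3, delay=2):
    """Fetch a URL, retrying up to `retries` times with a fixed delay between attempts."""
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries:
                raise          # final failure propagates to the caller
            time.sleep(delay)
```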

convert_html_to_markdown(html_content) tools/deepwiki-scraper.py:175-216

  • Configures html2text.HTML2Text() with body_width=0 (no line wrapping)
  • Sets ignore_links=False to preserve link structure
  • Calls clean_deepwiki_footer() to remove UI elements
  • Diagrams are not extracted here (handled in Phase 2)

Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:175-216 tools/deepwiki-scraper.py:453-594

graph TB
 
   Input["Input: /owner/repo/4-2-query-planning"] --> Extract["Extract via regex:\n/(\d+(?:\.\d+)*)-(.+)$"]
Extract --> ParseNum["page_num = '4.2'\nslug = 'query-planning'"]
ParseNum --> ConvertNum["file_num = page_num.replace('.', '-')\nResult: '4-2'"]
ConvertNum --> CheckTarget{"Target is\nsubsection?\n(has dot)"}
CheckTarget -->|Yes| CheckSource{"Source is\nsubsection?\n(level > 0)"}
CheckTarget -->|No| CheckSource2{"Source is\nsubsection?"}
CheckSource -->|Yes, same section| SameSec["Return: '4-2-query-planning.md'"]
CheckSource -->|No or different| DiffSec["Return: 'section-4/4-2-query-planning.md'"]
CheckSource2 -->|Yes| UpLevel["Return: '../4-2-query-planning.md'"]
CheckSource2 -->|No| SameLevel["Return: '4-2-query-planning.md'"]
style CheckTarget fill:#e8f5e9
    style CheckSource fill:#fff4e1

The script converts DeepWiki's absolute URLs to relative Markdown file paths, handling hierarchical section directories:

Algorithm Implementation: tools/deepwiki-scraper.py:549-592

The fix_wiki_link() nested function handles four scenarios:

  1. Both main pages: Use filename only (e.g., 2-overview.md)
  2. Source subsection → target main page: Use ../ prefix (e.g., ../2-overview.md)
  3. Both in same section directory: Use filename only (e.g., 4-2-sql-parser.md)
  4. Different sections: Use full path (e.g., section-4/4-2-sql-parser.md)

Sources: tools/deepwiki-scraper.py:549-592

Diagram Enhancement Architecture

Diagram Extraction from Next.js Payload

DeepWiki embeds all Mermaid diagrams in a JavaScript payload within the HTML. The extract_and_enhance_diagrams() function extracts diagrams with contextual information:

Key Data Structures:

graph TB
 
   Start["extract_and_enhance_diagrams(repo, temp_dir, session)"] --> FetchJS["Fetch https://deepwiki.com/{repo}/1-overview\nAny page contains all diagrams"]
FetchJS --> Pattern1["Regex: ```mermaid\\\\\n(.*?)```\nFind all diagram blocks"]
Pattern1 --> Count["Print: Found {N}
total diagrams"]
Count --> Pattern2["Regex with context:\n([^`]{500,}?)```mermaid\\\\ (.*?)```"]
Pattern2 --> Extract["For each match:\n- Extract 500-char context before\n- Extract diagram code"]
Extract --> Unescape["Unescape sequences:\n\\\n→ newline\n\\	 → tab\n\\\" → quote\n\< → '<'"]
Unescape --> Parse["Parse context:\n- Find last heading\n- Extract last 2-3 non-heading lines\n- Create anchor_text (last 300 chars)"]
Parse --> Store["Store diagram_contexts[]\nKeys: last_heading, anchor_text, diagram"]
Store --> Enhance["Enhance all .md files in temp_dir"]
style Pattern1 fill:#e8f5e9
    style Pattern2 fill:#fff4e1
    style Parse fill:#f3e5f5

Sources: tools/deepwiki-scraper.py:596-674

graph TB
 
   Start["For each markdown file"] --> Normalize["Normalize content:\n- Convert to lowercase\n- Collapse whitespace\n- content_normalized = ' '.join(content.split())"]
Normalize --> Loop["For each diagram in diagram_contexts"]
Loop --> GetAnchors["Get anchor_text and last_heading\nfrom diagram context"]
GetAnchors --> TryChunks{"Try chunk sizes:\n300, 200, 150, 100, 80"}
TryChunks --> ExtractChunk["Extract last N chars of anchor_text\ntest_chunk = anchor[-chunk_size:]"]
ExtractChunk --> FindPos["pos = content_normalized.find(test_chunk)"]
FindPos --> Found{"pos != -1?"}
Found -->|Yes| ConvertLine["Convert char position to line number\nby counting chars in each line"]
Found -->|No| TrySmaller{"Try smaller\nchunk?"}
TrySmaller -->|Yes| ExtractChunk
 
   TrySmaller -->|No| Fallback["Fallback: Match heading text\nheading_normalized in line_normalized"]
ConvertLine --> FindInsert["Find insertion point:\n- After heading: skip blanks, skip paragraph\n- After paragraph: find blank line"]
Fallback --> FindInsert
    
 
   FindInsert --> Queue["Add to pending_insertions[]\n(line_num, diagram, score, idx)"]
Queue --> InsertAll["Sort by line_num (reverse)\nInsert diagrams bottom-up"]
InsertAll --> Save["Write enhanced file\nto same path in temp_dir"]
style TryChunks fill:#e8f5e9
    style Found fill:#fff4e1
    style FindInsert fill:#f3e5f5

Fuzzy Matching Algorithm

The script uses progressive chunk matching to find where diagrams belong in the Markdown content:

Progressive Chunk Sizes: The algorithm tries matching increasingly smaller chunks (300 → 200 → 150 → 100 → 80 characters) until it finds a match. This handles variations in text formatting between the JavaScript payload and html2text output.

Scoring: Each match is scored based on the chunk size used. Larger chunks indicate more confident matches.

Bottom-Up Insertion: Diagrams are inserted from the bottom of the file upward to preserve line numbers during insertion.
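A simplified sketch of the progressive matching; the function name and return shape are assumptions, and the real implementation also converts the character offset into a line number:

```python
def find_anchor(content, anchor_text, chunk_sizes=(300, 200, 150, 100, 80)):
    """Return (char_position, chunk_size_used), or (-1, 0) if no chunk matches."""
    normalized = " ".join(content.split()).lower()
    anchor = " ".join(anchor_text.split()).lower()
    for size in chunk_sizes:
        chunk = anchor[-size:]          # try the last N characters of the anchor text
        pos = normalized.find(chunk)
        if pos != -1:
            return pos, size            # a larger chunk means a more confident match
    return -1, 0
```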

Sources: tools/deepwiki-scraper.py:676-788

Helper Functions

Filename Sanitization

sanitize_filename(text) tools/deepwiki-scraper.py:21-25

  • Removes non-alphanumeric characters except hyphens and spaces
  • Collapses multiple hyphens/spaces into single hyphens
  • Converts to lowercase
  • Example: "Query Planning & Optimization""query-planning-optimization"

clean_deepwiki_footer(markdown) tools/deepwiki-scraper.py:127-173

  • Removes DeepWiki UI elements from markdown using regex patterns
  • Patterns include: "Dismiss", "Refresh this wiki", "Edit Wiki", "On this page"
  • Scans last 50 lines backwards to find footer start
  • Removes all content from footer start to end of file
  • Also removes trailing empty lines

Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:127-173

File Organization and Output

The script organizes output files based on the hierarchical page structure:

File Naming Convention: {number}-{title-slug}.md

  • Number with dots replaced by hyphens (e.g., 2.1 → 2-1)
  • Title sanitized to safe filename format
  • Examples: 1-overview.md, 2-1-workspace.md, 4-3-2-optimizer.md

Sources: tools/deepwiki-scraper.py:842-877 tools/deepwiki-scraper.py:897-908

Error Handling and Resilience

The script implements multiple layers of error handling:

Retry Logic

HTTP Request Retries: tools/deepwiki-scraper.py:33-42

  • Each HTTP request attempts up to 3 times
  • 2-second delay between attempts
  • Only raises exception on final failure

Graceful Degradation

| Scenario | Behavior |
| --- | --- |
| No pages found | Exit with error message and status code 1 |
| Page extraction fails | Print error, continue with remaining pages |
| Diagram extraction fails | Print warning, continue without diagrams |
| Content selector not found | Fall back to <body> tag as last resort |

Temporary Directory Cleanup

The script uses Python's tempfile.TemporaryDirectory() context manager, which automatically deletes the temporary directory even if the script crashes or is interrupted. This prevents accumulation of partial work files.

Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:808-916

Performance Characteristics

Rate Limiting

The script includes a 1-second sleep (time.sleep(1)) between page fetches to be respectful to the DeepWiki server.

Sources: tools/deepwiki-scraper.py872

Memory Efficiency

  • Uses streaming HTTP responses where possible
  • Processes one page at a time rather than loading all pages into memory
  • Temporary directory is cleared automatically after completion

Typical Execution Times

For a repository with approximately 20 pages and 50 diagrams:

  • Phase 1 (Extraction): ~30-40 seconds (with 1-second delays between requests)
  • Phase 2 (Enhancement): ~5-10 seconds (local processing)
  • Phase 3 (Move): <1 second (file operations)

Total: Approximately 40-50 seconds for a medium-sized wiki.

Sources: tools/deepwiki-scraper.py:790-919

Data Flow Summary

Sources: tools/deepwiki-scraper.py:1-919


mdBook Integration

Relevant source files

Purpose and Scope

This document explains how the DeepWiki-to-mdBook Converter integrates with mdBook and its plugins to generate the final HTML documentation. It covers the configuration file generation, table of contents assembly, build orchestration, and diagram rendering setup. For details about the earlier phases that produce the markdown input files, see Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement. For specifics on the build process flow, see Phase 3: mdBook Build.

Overview

The system integrates mdBook as the final transformation stage, converting enhanced Markdown files into a searchable, navigable HTML documentation site. This integration is orchestrated by the build-docs.sh script and uses two Rust-based tools compiled during the Docker build:

| Tool | Purpose | Installation Method |
| --- | --- | --- |
| mdbook | Core documentation generator | Compiled from source via cargo install |
| mdbook-mermaid | Mermaid diagram preprocessor | Compiled from source via cargo install |

The integration is optional and can be bypassed by setting MARKDOWN_ONLY=true, which produces only the enhanced Markdown files without the HTML build.

Sources: build-docs.sh:60-76 Dockerfile:1-5 Dockerfile:19-21

Integration Architecture

The following diagram shows how mdBook is integrated into the three-phase pipeline and what files it consumes and produces:

Sources: build-docs.sh:60-206 README.md:138-144

graph TB
    subgraph "Phase 1 & 2 Output"
        WikiDir["$WIKI_DIR\n(Scraped Markdown)"]
RootMD["*.md files\n(Root pages)"]
SectionDirs["section-N/\n(Subsection pages)"]
WikiDir --> RootMD
 
       WikiDir --> SectionDirs
    end
    
    subgraph "build-docs.sh Orchestrator"
        CheckMode{"MARKDOWN_ONLY\ncheck"}
GenConfig["Generate book.toml\n(lines 84-103)"]
GenSummary["Generate SUMMARY.md\n(lines 108-159)"]
CopyFiles["Copy to src/\n(line 166)"]
InstallAssets["mdbook-mermaid install\n(line 171)"]
BuildCmd["mdbook build\n(line 176)"]
end
    
    subgraph "mdBook Process"
        ParseConfig["Parse book.toml"]
ParseSummary["Parse SUMMARY.md"]
ProcessMD["Process Markdown files"]
MermaidPreproc["mdbook-mermaid\npreprocessor"]
RenderHTML["Render HTML pages"]
end
    
    subgraph "Output Directory"
        BookHTML["$OUTPUT_DIR/book/\n(HTML site)"]
BookToml["$OUTPUT_DIR/book.toml\n(Config copy)"]
MarkdownCopy["$OUTPUT_DIR/markdown/\n(Source copy)"]
end
    
 
   WikiDir --> CheckMode
 
   CheckMode -->|false| GenConfig
 
   CheckMode -->|true| MarkdownCopy
    
 
   GenConfig --> GenSummary
 
   GenSummary --> CopyFiles
 
   CopyFiles --> InstallAssets
 
   InstallAssets --> BuildCmd
    
 
   BuildCmd --> ParseConfig
 
   BuildCmd --> ParseSummary
 
   ParseConfig --> ProcessMD
 
   ParseSummary --> ProcessMD
 
   ProcessMD --> MermaidPreproc
 
   MermaidPreproc --> RenderHTML
    
 
   RenderHTML --> BookHTML
 
   GenConfig -.->|copy| BookToml
 
   WikiDir -.->|copy| MarkdownCopy

Configuration Generation (book.toml)

The build-docs.sh script dynamically generates the book.toml configuration file using environment variables. This generation occurs in build-docs.sh:84-103 and produces a TOML file with three main sections:

Configuration Structure

Configuration Fields

The generated configuration includes:

| Section | Field | Value Source | Purpose |
| --- | --- | --- | --- |
| [book] | title | $BOOK_TITLE or "Documentation" | Book title in navigation |
| [book] | authors | $BOOK_AUTHORS or $REPO_OWNER | Author attribution |
| [book] | language | "en" (hardcoded) | Content language |
| [book] | src | "src" (hardcoded) | Source directory path |
| [output.html] | default-theme | "rust" (hardcoded) | Visual theme |
| [output.html] | git-repository-url | $GIT_REPO_URL | "Edit" link target |
| [preprocessor.mermaid] | command | "mdbook-mermaid" | Preprocessor binary |
| [output.html.fold] | enable | true | Enable section folding |
| [output.html.fold] | level | 1 | Fold depth level |

Sources: build-docs.sh:84-103 build-docs.sh:39-45

Table of Contents Generation (SUMMARY.md)

The system automatically generates SUMMARY.md from the scraped file structure, discovering the hierarchy from filename patterns and directory organization. This logic is implemented in build-docs.sh:108-159

Generation Algorithm

File Structure Detection

The algorithm detects the hierarchical structure using these conventions:

| Pattern | Interpretation | Example |
| --- | --- | --- |
| N-title.md | Main section N | 5-component-reference.md |
| section-N/ | Directory for section N subsections | section-5/ |
| N-M-title.md in section-N/ | Subsection M of section N | section-5/5-1-build-docs-sh.md |

The algorithm extracts the section number from the filename using grep -oE '^[0-9]+' and checks for the existence of a corresponding section-N directory. If found, it writes the main section as a header followed by indented subsections.

Sources: build-docs.sh:108-159 build-docs.sh:117-123 build-docs.sh:126-158

Build Process Orchestration

The build process follows a specific sequence of operations coordinated by build-docs.sh:

Sources: build-docs.sh:78-191

sequenceDiagram
    participant Script as build-docs.sh
    participant FS as File System
    participant MdBook as mdbook binary
    participant Mermaid as mdbook-mermaid binary
    
    Note over Script: Step 2: Initialize\n(lines 78-82)
    Script->>FS: mkdir -p $BOOK_DIR
    Script->>Script: cd $BOOK_DIR
    
    Note over Script: Generate Config\n(lines 84-103)
    Script->>FS: Write book.toml
    Script->>FS: mkdir -p src
    
    Note over Script: Step 3: Generate TOC\n(lines 108-159)
    Script->>FS: Read $WIKI_DIR/*.md files
    Script->>FS: Read section-*/*.md files
    Script->>FS: Write src/SUMMARY.md
    
    Note over Script: Step 4: Copy Files\n(lines 164-166)
    Script->>FS: cp -r $WIKI_DIR/* src/
    
    Note over Script: Step 5: Install Assets\n(lines 169-171)
    Script->>Mermaid: mdbook-mermaid install $BOOK_DIR
    Mermaid->>FS: Install mermaid.min.js
    Mermaid->>FS: Install mermaid-init.js
    Mermaid->>FS: Update book.toml
    
    Note over Script: Step 6: Build\n(lines 174-176)
    Script->>MdBook: mdbook build
    MdBook->>FS: Read book.toml
    MdBook->>FS: Read src/SUMMARY.md
    MdBook->>FS: Read src/*.md files
    MdBook->>Mermaid: Preprocess (mermaid blocks)
    Mermaid-->>MdBook: Transformed Markdown
    MdBook->>FS: Write book/ directory
    
    Note over Script: Step 7: Copy Outputs\n(lines 179-191)
    Script->>FS: cp -r book $OUTPUT_DIR/
    Script->>FS: cp book.toml $OUTPUT_DIR/
    Script->>FS: cp -r $WIKI_DIR/* $OUTPUT_DIR/markdown/

mdbook-mermaid Integration

The system uses the mdbook-mermaid preprocessor to enable Mermaid diagram rendering in the final HTML output. This integration involves three steps:

Installation and Configuration

The mdbook-mermaid binary is compiled during the Docker build stage with cargo install mdbook-mermaid (Dockerfile:5) and copied into the runtime image alongside mdbook (Dockerfile:21).

Preprocessor Configuration

The preprocessor is configured in the generated book.toml:
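The generated entry is the one listed in the configuration table above:

```toml
[preprocessor.mermaid]
command = "mdbook-mermaid"
```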

This configuration tells mdBook to run mdbook-mermaid as a preprocessor before rendering HTML. The preprocessor scans all Markdown files for code blocks with the mermaid language tag and transforms them into HTML containers that the Mermaid JavaScript library can render.

Asset Installation

The mdbook-mermaid install command (executed at build-docs.sh:171) installs required JavaScript and CSS assets:

| Asset | Purpose |
| --- | --- |
| mermaid.min.js | Mermaid diagram rendering library |
| mermaid-init.js | Initialization script for Mermaid |
| Additional CSS | Styling for diagram containers |

These assets are installed into the book's theme directory and are automatically included in all generated HTML pages.

Sources: Dockerfile:5 Dockerfile:21 build-docs.sh:169-171 build-docs.sh:97-98

Output Structure

After the mdBook build completes, the system produces three output artifacts:

HTML Site Features

The generated HTML site includes:

| Feature | Description | Enabled By |
|---|---|---|
| Navigation sidebar | Left sidebar with TOC | Generated SUMMARY.md |
| Search functionality | Full-text search | mdBook default feature |
| Responsive design | Mobile-friendly layout | Rust theme |
| Mermaid diagrams | Interactive diagrams | mdbook-mermaid preprocessor |
| Edit links | Link to GitHub source | git-repository-url config |
| Section folding | Collapsible sections | [output.html.fold] config |

Sources: build-docs.sh:179-191 README.md:93-104

Binary Requirements

The integration requires two Rust binaries to be available at runtime:

| Binary | Installation Location | Required For |
|---|---|---|
| mdbook | /usr/local/bin/mdbook | HTML generation |
| mdbook-mermaid | /usr/local/bin/mdbook-mermaid | Diagram preprocessing |

Both binaries are compiled during the Docker multi-stage build and copied to the final image. The compilation occurs in Dockerfile:1-5 using cargo install, and the binaries are extracted in Dockerfile:19-21 from the build stage.

Sources: Dockerfile:1-21 build-docs.sh:171 build-docs.sh:176


Phase 1: Markdown Extraction

Relevant source files

This page documents Phase 1 of the three-phase processing pipeline, which handles the extraction and initial conversion of wiki content from DeepWiki.com into clean Markdown files. Phase 1 produces raw Markdown files in a temporary directory before diagram enhancement (Phase 2, see #7) and mdBook HTML generation (Phase 3, see #8).

For detailed information about specific sub-processes within Phase 1, see:

  • Wiki structure discovery algorithm: #6.1
  • HTML parsing and Markdown conversion: #6.2

Scope and Objectives

Phase 1 accomplishes the following:

  1. Discover all wiki pages and their hierarchical structure from DeepWiki
  2. Fetch HTML content for each page via HTTP requests
  3. Parse HTML to extract main content and remove UI elements
  4. Convert cleaned HTML to Markdown using html2text
  5. Organize output files into a hierarchical directory structure
  6. Save to a temporary directory for subsequent processing

This phase is implemented entirely in Python within deepwiki-scraper.py and operates independently of Phases 2 and 3.

Sources: README.md:121-128 tools/deepwiki-scraper.py:790-876

Phase 1 Execution Flow

The following diagram shows the complete execution flow of Phase 1, mapping high-level steps to specific functions in the codebase:

Sources: tools/deepwiki-scraper.py:790-876 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594

flowchart TD
    Start["main()
Entry Point"]
CreateTemp["Create tempfile.TemporaryDirectory()"]
CreateSession["requests.Session()
with User-Agent"]
DiscoverPhase["Structure Discovery Phase"]
ExtractWiki["extract_wiki_structure(repo, session)"]
ParseLinks["BeautifulSoup: find_all('a', href=pattern)"]
SortPages["sort by page number (handle dots)"]
ExtractionPhase["Content Extraction Phase"]
LoopPages["For each page in pages list"]
FetchContent["extract_page_content(url, session, page_info)"]
FetchHTML["fetch_page(url, session)
with retries"]
ParseHTML["BeautifulSoup(response.text)"]
RemoveNav["Remove nav/header/footer/aside elements"]
FindContent["Find main content: article/main/[role='main']"]
ConvertPhase["Conversion Phase"]
ConvertMD["convert_html_to_markdown(html_content)"]
HTML2Text["html2text.HTML2Text with body_width=0"]
CleanFooter["clean_deepwiki_footer(markdown)"]
FixLinks["Regex replace: wiki links → .md paths"]
SavePhase["File Organization Phase"]
DetermineLevel{"page['level'] == 0?"}
SaveRoot["Save to temp_dir/NUM-title.md"]
CreateSubdir["Create temp_dir/section-N/"]
SaveSubdir["Save to section-N/NUM-title.md"]
NextPage{"More pages?"}
Complete["Phase 1 Complete: temp_dir contains all .md files"]
Start --> CreateTemp
 
   CreateTemp --> CreateSession
 
   CreateSession --> DiscoverPhase
    
 
   DiscoverPhase --> ExtractWiki
 
   ExtractWiki --> ParseLinks
 
   ParseLinks --> SortPages
 
   SortPages --> ExtractionPhase
    
 
   ExtractionPhase --> LoopPages
 
   LoopPages --> FetchContent
 
   FetchContent --> FetchHTML
 
   FetchHTML --> ParseHTML
 
   ParseHTML --> RemoveNav
 
   RemoveNav --> FindContent
 
   FindContent --> ConvertPhase
    
 
   ConvertPhase --> ConvertMD
 
   ConvertMD --> HTML2Text
 
   HTML2Text --> CleanFooter
 
   CleanFooter --> FixLinks
 
   FixLinks --> SavePhase
    
 
   SavePhase --> DetermineLevel
 
   DetermineLevel -->|Yes: Main Page| SaveRoot
 
   DetermineLevel -->|No: Subsection| CreateSubdir
 
   CreateSubdir --> SaveSubdir
 
   SaveRoot --> NextPage
 
   SaveSubdir --> NextPage
    
 
   NextPage -->|Yes| LoopPages
 
   NextPage -->|No| Complete

Core Components and Data Flow

Structure Discovery Pipeline

The structure discovery process identifies all wiki pages and builds a hierarchical page list:

Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:90-116 tools/deepwiki-scraper.py:118-123

flowchart LR
    subgraph Input
        BaseURL["Base URL\ndeepwiki.com/owner/repo"]
end
    
    subgraph extract_wiki_structure
        FetchMain["fetch_page(base_url)"]
ParseSoup["BeautifulSoup(response.text)"]
FindLinks["soup.find_all('a', href=regex)"]
ExtractInfo["Extract page number & title\nRegex: /(\d+(?:\.\d+)*)-(.+)$"]
CalcLevel["Calculate level from dots\nlevel = page_num.count('.')"]
BuildPages["Build pages list with metadata"]
SortFunc["Sort by sort_key(page)\nparts = [int(x)
for x in num.split('.')]"]
end
    
    subgraph Output
        PagesList["List[Dict]\n{'number': '2.1',\n'title': 'Section',\n'url': full_url,\n'href': path,\n'level': 1}"]
end
    
 
   BaseURL --> FetchMain
 
   FetchMain --> ParseSoup
 
   ParseSoup --> FindLinks
 
   FindLinks --> ExtractInfo
 
   ExtractInfo --> CalcLevel
 
   CalcLevel --> BuildPages
 
   BuildPages --> SortFunc
 
   SortFunc --> PagesList

Content Extraction and Cleaning

Each page undergoes a multi-step cleaning process to remove DeepWiki UI elements:

Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-190 tools/deepwiki-scraper.py:127-173

flowchart TD
    subgraph fetch_page
        MakeRequest["requests.get(url, headers)\nUser-Agent: Mozilla/5.0..."]
RetryLogic["Retry up to 3 times\n2 second delay between attempts"]
CheckStatus["response.raise_for_status()"]
end
    
    subgraph extract_page_content
        ParsePage["BeautifulSoup(response.text)"]
RemoveUnwanted["Decompose: nav, header, footer,\naside, .sidebar, script, style"]
FindMain["Try selectors in order:\narticle → main → .wiki-content\n→ [role='main'] → body"]
RemoveUI["Remove DeepWiki UI elements:\n'Edit Wiki', 'Last indexed:',\n'Index your code with Devin'"]
RemoveNavLists["Remove navigation <ul> lists\n(80%+ internal wiki links)"]
end
    
    subgraph convert_html_to_markdown
        HTML2TextInit["h = html2text.HTML2Text()\nh.ignore_links = False\nh.body_width = 0"]
HandleContent["markdown = h.handle(html_content)"]
CleanFooterCall["clean_deepwiki_footer(markdown)"]
end
    
    subgraph clean_deepwiki_footer
        SplitLines["lines = markdown.split('\\n')"]
ScanBackward["Scan last 50 lines backward\nfor footer patterns"]
MatchPatterns["Regex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
TruncateLines["lines = lines[:footer_start]"]
RemoveEmpty["Remove trailing empty lines"]
end
    
 
   MakeRequest --> RetryLogic
 
   RetryLogic --> CheckStatus
 
   CheckStatus --> ParsePage
    
 
   ParsePage --> RemoveUnwanted
 
   RemoveUnwanted --> FindMain
 
   FindMain --> RemoveUI
 
   RemoveUI --> RemoveNavLists
 
   RemoveNavLists --> HTML2TextInit
    
 
   HTML2TextInit --> HandleContent
 
   HandleContent --> CleanFooterCall
    
 
   CleanFooterCall --> SplitLines
 
   SplitLines --> ScanBackward
 
   ScanBackward --> MatchPatterns
 
   MatchPatterns --> TruncateLines
 
   TruncateLines --> RemoveEmpty

Phase 1 transforms internal DeepWiki links into relative Markdown file paths. The rewriting logic accounts for hierarchical directory structure:

Sources: tools/deepwiki-scraper.py:549-592

flowchart TD
    subgraph Input
        WikiLink["DeepWiki Link\n[text](/owner/repo/2-1-section)"]
SourcePage["Current Page Info\n{level: 1, number: '2.1'}"]
end
    
    subgraph fix_wiki_link
        ExtractPath["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
ParseNumbers["Extract: page_num='2.1', slug='section'"]
ConvertNum["file_num = page_num.replace('.', '-')\nResult: '2-1'"]
CheckTarget{"Is target\nsubsection?\n(has '.')"}
CheckSource{"Is source\nsubsection?\n(level > 0)"}
CheckSame{"Same main\nsection?"}
PathSameSection["Relative path:\nfile_num-slug.md"]
PathDiffSection["Full path:\nsection-N/file_num-slug.md"]
PathToMain["Up one level:\n../file_num-slug.md"]
PathMainToMain["Same level:\nfile_num-slug.md"]
end
    
    subgraph Output
        MDLink["Markdown Link\n[text](2-1-section.md)\nor [text](section-2/2-1-section.md)\nor [text](../2-1-section.md)"]
end
    
 
   WikiLink --> ExtractPath
 
   ExtractPath --> ParseNumbers
 
   ParseNumbers --> ConvertNum
 
   ConvertNum --> CheckTarget
    
 
   CheckTarget -->|Yes| CheckSource
 
   CheckTarget -->|No: Main Page| CheckSource
    
 
   CheckSource -->|Target: Sub, Source: Sub| CheckSame
 
   CheckSource -->|Target: Sub, Source: Main| PathDiffSection
 
   CheckSource -->|Target: Main, Source: Sub| PathToMain
 
   CheckSource -->|Target: Main, Source: Main| PathMainToMain
    
 
   CheckSame -->|Yes| PathSameSection
 
   CheckSame -->|No| PathDiffSection
    
 
   PathSameSection --> MDLink
 
   PathDiffSection --> MDLink
 
   PathToMain --> MDLink
 
   PathMainToMain --> MDLink

File Organization Strategy

Phase 1 organizes output files into a hierarchical directory structure based on page levels:

Directory Structure Rules

| Page Level | Page Number Format | Directory Location | Filename Pattern | Example |
|---|---|---|---|---|
| 0 (Main) | 1, 2, 3 | temp_dir/ (root) | {num}-{slug}.md | 1-overview.md |
| 1 (Subsection) | 2.1, 3.4 | temp_dir/section-{N}/ | {num}-{slug}.md | section-2/2-1-workspace.md |

File Organization Implementation

Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:845-868

HTTP Session Configuration

Phase 1 uses a persistent requests.Session with browser-like headers and retry logic:

Session Setup

Retry Strategy

Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:817-821
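
A minimal sketch of the session setup, assuming a browser-like User-Agent string (the exact header value used by the scraper is not reproduced here):

```python
import requests

def make_session() -> requests.Session:
    """Persistent session reused for all page fetches (keeps TCP connections alive)."""
    session = requests.Session()
    # Illustrative browser-like User-Agent, not copied from the scraper.
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    })
    return session
```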

Data Structures

Page Metadata Dictionary

Each page discovered by extract_wiki_structure() is represented as a dictionary:

Sources: tools/deepwiki-scraper.py:109-115

BeautifulSoup Content Selectors

Phase 1 attempts multiple selector strategies to find main content, in priority order:

| Priority | Selector Type | Selector Value | Rationale |
|---|---|---|---|
| 1 | CSS Selector | article | Semantic HTML5 element for main content |
| 2 | CSS Selector | main | HTML5 main landmark element |
| 3 | CSS Selector | .wiki-content | Common class name for wiki content |
| 4 | CSS Selector | .content | Generic content class |
| 5 | CSS Selector | #content | Generic content ID |
| 6 | CSS Selector | .markdown-body | GitHub-style markdown container |
| 7 | Attribute | role="main" | ARIA landmark role |
| 8 | Fallback | body | Last resort: entire body |

Sources: tools/deepwiki-scraper.py:472-484

Error Handling and Robustness

Page Extraction Error Handling

Phase 1 implements graceful degradation for individual page failures:

Sources: tools/deepwiki-scraper.py:841-876

Content Extraction Fallbacks

If primary content selectors fail, Phase 1 applies fallback strategies:

  1. Content Selector Fallback Chain : Try 8 different selectors (see table above)
  2. Empty Content Check : Raises exception if no content element found tools/deepwiki-scraper.py:486-487
  3. HTTP Retry Logic : 3 attempts with a 2-second delay between retries
  4. Session Persistence : Reuses TCP connections for efficiency

Sources: tools/deepwiki-scraper.py:472-487 tools/deepwiki-scraper.py:27-42

Output Format

Temporary Directory Structure

At the end of Phase 1, the temporary directory contains the following structure:

temp_dir/
├── 1-overview.md                    # Main page (level 0)
├── 2-architecture.md                # Main page (level 0)
├── 3-components.md                  # Main page (level 0)
├── section-2/                       # Subsections of page 2
│   ├── 2-1-workspace-and-crates.md  # Subsection (level 1)
│   └── 2-2-dependency-graph.md      # Subsection (level 1)
└── section-4/                       # Subsections of page 4
    ├── 4-1-logical-planning.md
    └── 4-2-physical-planning.md

Markdown File Format

Each generated Markdown file has the following characteristics:

  • Title : Always starts with # {Page Title} heading
  • Content : Cleaned HTML converted to Markdown via html2text
  • Links : Internal wiki links rewritten to relative .md paths
  • No Diagrams : Diagrams are added in Phase 2 (see #7)
  • No Footer : DeepWiki UI elements removed via clean_deepwiki_footer()
  • Encoding : UTF-8

Sources: tools/deepwiki-scraper.py:862-868 tools/deepwiki-scraper.py:127-173

Phase 1 Completion Criteria

Phase 1 is considered complete when:

  1. All pages discovered by extract_wiki_structure() have been processed
  2. Each page's Markdown file has been written to the temporary directory
  3. Directory structure (main pages + section-N/ subdirectories) has been created
  4. Success count is reported: "✓ Successfully extracted N/M pages to temp directory"

The temporary directory is then passed to Phase 2 for diagram enhancement.

Sources: tools/deepwiki-scraper.py:877 tools/deepwiki-scraper.py:596-788


Wiki Structure Discovery

Relevant source files

Purpose and Scope

This document describes the wiki structure discovery mechanism in Phase 1 of the processing pipeline. The system analyzes the main DeepWiki repository page to identify all available wiki pages and their hierarchical relationships. This discovery phase produces a structured page list that drives subsequent content extraction.

For the HTML-to-Markdown conversion that follows discovery, see HTML to Markdown Conversion. For the overall Phase 1 process, see Phase 1: Markdown Extraction.

Overview

The discovery process fetches the main wiki page for a repository and parses its HTML to extract all wiki page references. The system identifies both main pages (e.g., 1, 2, 3) and subsections (e.g., 2.1, 2.2, 3.1) by analyzing link patterns. The output is a sorted list of page metadata that includes page numbers, titles, URLs, and hierarchical levels.

flowchart TD
 
   Start["main()
entry point"] --> ValidateRepo["Validate repo format\n(owner/repo)"]
ValidateRepo --> CreateSession["Create requests.Session\nwith User-Agent headers"]
CreateSession --> CallExtract["extract_wiki_structure(repo, session)"]
CallExtract --> FetchMain["Fetch https://deepwiki.com/{repo}"]
FetchMain --> ParseHTML["BeautifulSoup(response.text)"]
ParseHTML --> FindLinks["soup.find_all('a', href=regex)"]
FindLinks --> IterateLinks["Iterate over all links"]
IterateLinks --> ExtractPattern["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
ExtractPattern --> BuildPageDict["Build page dict:\n{number, title, url, href, level}"]
BuildPageDict --> CheckDupe{"href in seen_urls?"}
CheckDupe -->|Yes| IterateLinks
 
   CheckDupe -->|No| AddToList["pages.append(page_dict)"]
AddToList --> IterateLinks
    
 
   IterateLinks -->|Done| SortPages["Sort by numeric parts:\nsort_key([int(x)
for x in num.split('.')])"]
SortPages --> ReturnPages["Return pages list"]
ReturnPages --> ProcessPages["Process each page\nin main loop"]
style CallExtract fill:#f9f,stroke:#333,stroke-width:2px
    style ExtractPattern fill:#f9f,stroke:#333,stroke-width:2px
    style SortPages fill:#f9f,stroke:#333,stroke-width:2px

Discovery Flow Diagram

Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831

Main Discovery Function

The extract_wiki_structure function performs the core discovery logic. It accepts a repository identifier (e.g., "jzombie/deepwiki-to-mdbook") and an HTTP session object, then returns a list of page dictionaries.

Function Signature and Entry Point

Sources: tools/deepwiki-scraper.py:78-79

HTTP Request and HTML Parsing

The function constructs the base URL and fetches the main wiki page:

The fetch_page helper includes retry logic (3 attempts) and browser-like headers to handle transient failures.

Sources: tools/deepwiki-scraper.py:80-84 tools/deepwiki-scraper.py:27-42

The system uses a compiled regex pattern to find all wiki page links:

This pattern matches URLs like:

  • /jzombie/deepwiki-to-mdbook/1-overview
  • /jzombie/deepwiki-to-mdbook/2-quick-start
  • /jzombie/deepwiki-to-mdbook/2-1-basic-usage

Sources: tools/deepwiki-scraper.py:88-90
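
A simplified Python sketch of this link-discovery step; the regex is the one shown above, while the helper name discover_links and the use of href=True are illustrative:

```python
import re
import requests
from bs4 import BeautifulSoup

WIKI_LINK = re.compile(r"/(\d+(?:\.\d+)*)-(.+)$")

def discover_links(repo: str, session: requests.Session) -> list[str]:
    """Fetch the main wiki page and collect hrefs that look like numbered wiki pages."""
    base_url = f"https://deepwiki.com/{repo}"
    response = session.get(base_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    hrefs = []
    for a in soup.find_all("a", href=True):
        if WIKI_LINK.search(a["href"]):
            hrefs.append(a["href"])
    return hrefs
```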

Page Information Extraction

For each matched link, the system extracts page metadata using a detailed regex pattern:

The regex r'/(\d+(?:\.\d+)*)-(.+)$' captures:

  • Group 1: Page number with optional dots (e.g., 1, 2.1, 3.2.1)
  • Group 2: URL slug (e.g., overview, basic-usage)

Sources: tools/deepwiki-scraper.py:98-107

Sources: tools/deepwiki-scraper.py:98-115

Deduplication and Sorting

Deduplication Strategy

The system maintains a seen_urls set to prevent duplicate page entries:

Sources: tools/deepwiki-scraper.py:92-116

Hierarchical Sorting

Pages are sorted by their numeric components to maintain proper ordering:

This ensures ordering like: 1 → 2 → 2.1 → 2.2 → 3 → 3.1

Sources: tools/deepwiki-scraper.py:118-123
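
A small sketch of the numeric sort key described above:

```python
pages = [{"number": "3"}, {"number": "2.1"}, {"number": "1"}, {"number": "2"}, {"number": "2.2"}]

def sort_key(page: dict) -> list[int]:
    # "2.1" -> [2, 1]; Python compares lists element-wise, so 1 < 2 < 2.1 < 2.2 < 3
    return [int(x) for x in page["number"].split(".")]

pages.sort(key=sort_key)
print([p["number"] for p in pages])   # ['1', '2', '2.1', '2.2', '3']
```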

Sorting Example

| Before Sorting (Link Order) | Page Number | After Sorting (Numeric Order) |
|---|---|---|
| /3-phase-3 | 3 | /1-overview |
| /2-1-subsection-one | 2.1 | /2-quick-start |
| /1-overview | 1 | /2-1-subsection-one |
| /2-quick-start | 2 | /2-2-subsection-two |
| /2-2-subsection-two | 2.2 | /3-phase-3 |

Page Data Structure

Page Dictionary Schema

Each discovered page is represented as a dictionary:

Sources: tools/deepwiki-scraper.py:109-115
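
An illustrative example of the dictionary shape; the keys match those shown in the structure discovery diagram, while the field values here are made up:

```python
page = {
    "number": "2.1",                                        # hierarchical page number
    "title": "Basic Usage",                                 # human-readable title
    "url": "https://deepwiki.com/owner/repo/2-1-basic-usage",
    "href": "/owner/repo/2-1-basic-usage",                  # path portion of the link
    "level": 1,                                             # number of dots in "number"
}
```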

Level Calculation

The level field indicates hierarchical depth:

| Page Number | Level | Type |
|---|---|---|
| 1 | 0 | Main page |
| 2 | 0 | Main page |
| 2.1 | 1 | Subsection |
| 2.2 | 1 | Subsection |
| 3.1.1 | 2 | Sub-subsection |

Sources: tools/deepwiki-scraper.py:106-114

Discovery Result Processing

Output Statistics

After discovery, the system categorizes pages and reports statistics:

Sources: tools/deepwiki-scraper.py:824-837

Integration with Content Extraction

The discovered page list drives the extraction loop in main():

Sources: tools/deepwiki-scraper.py:841-860

Alternative Discovery Method (Unused)

Subsection Probing Function

The codebase includes a discover_subsections function that uses HTTP HEAD requests to probe for subsections, but this function is not invoked in the current implementation:

This function attempts to discover subsections by making HEAD requests to potential URLs (e.g., /repo/2-1-, /repo/2-2-). However, the actual implementation relies entirely on parsing links from the main wiki page.

Sources: tools/deepwiki-scraper.py:44-76

Discovery Method Comparison

Sources: tools/deepwiki-scraper.py:44-76 tools/deepwiki-scraper.py:78-125

Error Handling

No Pages Found

The system validates that at least one page was discovered:

Sources: tools/deepwiki-scraper.py:828-830

Network Failures

The fetch_page function includes retry logic:

Sources: tools/deepwiki-scraper.py:33-42

Summary

The wiki structure discovery process provides a robust mechanism for identifying all pages in a DeepWiki repository through a single HTML parse operation. The resulting page list is hierarchically organized and drives all subsequent extraction operations in Phase 1.

Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831


HTML to Markdown Conversion

Relevant source files

This document describes the HTML parsing and Markdown conversion process that transforms DeepWiki's HTML pages into clean, portable Markdown files. This is a core component of Phase 1 (Markdown Extraction) in the three-phase pipeline.

For information about the diagram enhancement that occurs after this conversion, see Phase 2: Diagram Enhancement. For details on how the wiki structure is discovered before this conversion begins, see Wiki Structure Discovery.

Purpose and Scope

The HTML to Markdown conversion process takes raw HTML fetched from DeepWiki.com and transforms it into clean Markdown files suitable for processing by mdBook. This conversion must handle several challenges:

  • Extract only content, removing DeepWiki's UI elements and navigation
  • Preserve the semantic structure (headings, lists, code blocks)
  • Convert internal wiki links to relative Markdown file paths
  • Remove DeepWiki-specific footer content
  • Handle hierarchical link relationships between main pages and subsections

Conversion Pipeline Overview

Conversion Flow: HTML to Clean Markdown

Sources: tools/deepwiki-scraper.py:453-594

HTML Parsing and Content Extraction

BeautifulSoup Content Location Strategy

The system uses a multi-strategy approach to locate the main content area, trying selectors in order of specificity:

Content Locator Strategies

flowchart LR
    Start["extract_page_content()"]
Strat1["Try CSS Selectors\narticle, main, .wiki-content"]
Strat2["Try Role Attribute\nrole='main'"]
Strat3["Fallback: body Element"]
Success["Content Found"]
Error["Raise Exception"]
Start --> Strat1
 
   Strat1 -->|Found| Success
 
   Strat1 -->|Not Found| Strat2
 
   Strat2 -->|Found| Success
 
   Strat2 -->|Not Found| Strat3
 
   Strat3 -->|Found| Success
 
   Strat3 -->|Not Found| Error

The system attempts these selectors in sequence:

| Priority | Selector Type | Selector Value | Purpose |
|---|---|---|---|
| 1 | CSS | article, main, .wiki-content, .content, #content, .markdown-body | Semantic HTML5 content containers |
| 2 | Attribute | role="main" | ARIA landmark for main content |
| 3 | Fallback | body | Last resort - entire body element |

Sources: tools/deepwiki-scraper.py:469-487
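
A sketch of this fallback chain using the selectors listed above (the helper name find_main_content is illustrative):

```python
from bs4 import BeautifulSoup

CONTENT_SELECTORS = ["article", "main", ".wiki-content", ".content", "#content", ".markdown-body"]

def find_main_content(soup: BeautifulSoup):
    """Try CSS selectors from most to least specific, then the ARIA role, then <body>."""
    for selector in CONTENT_SELECTORS:
        element = soup.select_one(selector)
        if element:
            return element
    element = soup.find(attrs={"role": "main"})
    if element:
        return element
    return soup.body    # last resort; the caller raises if nothing usable is found
```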

UI Element Removal

The conversion process removes several categories of unwanted elements before processing:

Structural Element Removal

The following element types are removed wholesale using elem.decompose():

Text-Based UI Element Removal

DeepWiki-specific UI elements are identified by text content patterns:

| Pattern | Purpose | Max Length Filter |
|---|---|---|
| Index your code with Devin | AI indexing prompt | < 200 chars |
| Edit Wiki | Edit button | < 200 chars |
| Last indexed: | Metadata display | < 200 chars |
| View this search on DeepWiki | Search link | < 200 chars |

The length filter prevents accidental removal of paragraph content that happens to contain these phrases.

Sources: tools/deepwiki-scraper.py:466-500

The system automatically detects and removes navigation lists using heuristics:

Navigation List Detection Algorithm

flowchart TD
    FindUL["Find all <ul> elements"]
CountLinks["Count <a> tags"]
Check5["links.length > 5?"]
CountInternal["Count internal links\nhref starts with '/'"]
Check80["wiki_links > 80% of links?"]
Remove["ul.decompose()"]
Keep["Keep element"]
FindUL --> CountLinks
 
   CountLinks --> Check5
 
   Check5 -->|Yes| CountInternal
 
   Check5 -->|No| Keep
 
   CountInternal --> Check80
 
   Check80 -->|Yes| Remove
 
   Check80 -->|No| Keep

This heuristic successfully identifies table of contents lists and navigation menus while preserving legitimate bulleted lists in the content.

Sources: tools/deepwiki-scraper.py:502-511
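
A sketch of this heuristic, assuming the thresholds shown in the diagram (more than 5 links, over 80% internal):

```python
def remove_navigation_lists(soup) -> None:
    """Drop <ul> blocks that are mostly internal wiki links (likely TOC/nav, not content)."""
    for ul in soup.find_all("ul"):
        links = ul.find_all("a")
        if len(links) <= 5:
            continue                        # short lists are probably real content
        internal = [a for a in links if a.get("href", "").startswith("/")]
        if len(internal) > 0.8 * len(links):
            ul.decompose()                  # >80% internal links: treat as navigation
```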

html2text Conversion Configuration

The core conversion uses the html2text library with specific configuration to ensure clean output:

html2text Configuration

Key Configuration Decisions

| Setting | Value | Rationale |
|---|---|---|
| ignore_links | False | Links must be preserved so they can be rewritten to relative paths |
| body_width | 0 | Disables line wrapping, which would interfere with diagram matching in Phase 2 |

The body_width=0 setting is particularly important because Phase 2's fuzzy matching algorithm compares text chunks from the JavaScript payload with the converted Markdown. Line wrapping would cause mismatches.

Sources: tools/deepwiki-scraper.py:175-190
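
A minimal sketch of this configuration:

```python
import html2text

def convert_html_to_markdown(html_content: str) -> str:
    h = html2text.HTML2Text()
    h.ignore_links = False   # keep links so they can be rewritten to relative .md paths
    h.body_width = 0         # no line wrapping; wrapped lines would break fuzzy matching later
    return h.handle(html_content)
```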

After html2text conversion, the system removes DeepWiki-specific footer content using pattern matching.

The clean_deepwiki_footer() function uses compiled regex patterns to identify footer content:

Footer Pattern Table

| Pattern | Example Match | Purpose |
|---|---|---|
| ^\s*Dismiss\s*$ | "Dismiss" | Modal dismiss button |
| Refresh this wiki | "Refresh this wiki" | Refresh action link |
| This wiki was recently refreshed | Full phrase | Status message |
| ###\s*On this page | "### On this page" | TOC heading |
| Please wait \d+ days? to refresh | "Please wait 7 days" | Rate limit message |
| You can refresh again in | Full phrase | Alternative rate limit |
| ^\s*View this search on DeepWiki | Full phrase | Search link |
| ^\s*Edit Wiki\s*$ | "Edit Wiki" | Edit action |

Footer Scanning Algorithm

The backward scan ensures the earliest footer indicator is found, preventing content loss if footer elements are scattered.

Sources: tools/deepwiki-scraper.py:127-173
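
A condensed sketch of the backward scan, using only a subset of the footer patterns from the table above:

```python
import re

FOOTER_PATTERNS = [
    re.compile(r"^\s*Dismiss\s*$"),
    re.compile(r"Refresh this wiki"),
    re.compile(r"###\s*On this page"),
    re.compile(r"^\s*Edit Wiki\s*$"),
]

def clean_deepwiki_footer(markdown: str) -> str:
    """Scan roughly the last 50 lines backwards; cut everything from the earliest footer marker."""
    lines = markdown.split("\n")
    footer_start = len(lines)
    for i in range(len(lines) - 1, max(len(lines) - 50, -1), -1):
        if any(p.search(lines[i]) for p in FOOTER_PATTERNS):
            footer_start = i        # keeps shrinking toward the earliest matching line
    lines = lines[:footer_start]
    while lines and not lines[-1].strip():
        lines.pop()                 # drop trailing blank lines
    return "\n".join(lines)
```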

The most complex part of the conversion is rewriting internal wiki links to relative Markdown file paths. Links must account for the hierarchical directory structure where subsections are placed in subdirectories.

The fix_wiki_link() function handles four distinct cases based on source and target locations:

Link Rewriting Decision Matrix

| Source Location | Target Location | Relative Path Format | Example |
|---|---|---|---|
| Main page | Main page | {file_num}-{slug}.md | 2-overview.md |
| Main page | Subsection | section-{main}/{file_num}-{slug}.md | section-2/2-1-details.md |
| Subsection | Same section subsection | {file_num}-{slug}.md | 2-2-more.md |
| Subsection | Main page | ../{file_num}-{slug}.md | ../3-next.md |
| Subsection | Different section | ../section-{main}/{file_num}-{slug}.md | ../section-3/3-1-sub.md |

Link Rewriting Flow

The link rewriting uses regex substitution on Markdown link syntax:

The regex captures only the page-slug portion after the repository path, which is then processed by fix_wiki_link().

Sources: tools/deepwiki-scraper.py:547-592
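
A simplified sketch of the decision matrix above; the regex and helper names (LINK_RE, rewrite_links) are illustrative and do not reproduce the exact implementation of fix_wiki_link():

```python
import re

# Matches links like [text](/owner/repo/2-1-some-page); the numeric prefix and slug are captured.
LINK_RE = re.compile(r"\]\(/[^/)]+/[^/)]+/(\d+(?:-\d+)*)-([a-zA-Z][^)]*)\)")

def rewrite_links(markdown: str, current_page: dict) -> str:
    """Apply the four-way relative-path decision described in the matrix."""
    src_is_sub = current_page["level"] > 0
    src_main = current_page["number"].split(".")[0]

    def fix(match: re.Match) -> str:
        file_num, slug = match.group(1), match.group(2)     # e.g. "2-1", "some-page"
        tgt_is_sub = "-" in file_num
        tgt_main = file_num.split("-")[0]
        name = f"{file_num}-{slug}.md"
        if tgt_is_sub and src_is_sub:
            rel = name if tgt_main == src_main else f"../section-{tgt_main}/{name}"
        elif tgt_is_sub:                                    # main page -> subsection
            rel = f"section-{tgt_main}/{name}"
        elif src_is_sub:                                    # subsection -> main page
            rel = f"../{name}"
        else:                                               # main page -> main page
            rel = name
        return f"]({rel})"

    return LINK_RE.sub(fix, markdown)
```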

Post-Conversion Cleanup

After all conversions and transformations, final cleanup removes artifacts:

Duplicate Title Removal

Duplicate Title Detection

This cleanup handles cases where DeepWiki includes the page title multiple times in the rendered HTML.

Sources: tools/deepwiki-scraper.py:525-545

flowchart TD
    Start["extract_page_content(url, session, current_page_info)"]
Fetch["fetch_page(url, session)\nHTTP GET with retries"]
Parse["BeautifulSoup(response.text)"]
RemoveNav["Remove nav/header/footer"]
FindContent["Locate main content area"]
RemoveUI["Remove DeepWiki UI elements"]
RemoveLists["Remove navigation lists"]
ToStr["str(content)"]
Convert["convert_html_to_markdown(html)"]
CleanUp["Remove duplicate titles/Menu"]
FixLinks["Rewrite internal links\nusing current_page_info"]
Return["Return markdown string"]
Start --> Fetch
 
   Fetch --> Parse
 
   Parse --> RemoveNav
 
   RemoveNav --> FindContent
 
   FindContent --> RemoveUI
 
   RemoveUI --> RemoveLists
 
   RemoveLists --> ToStr
 
   ToStr --> Convert
 
   Convert --> CleanUp
 
   CleanUp --> FixLinks
 
   FixLinks --> Return

Integration with Extract Page Content

The complete content extraction flow shows how all components work together:

extract_page_content() Complete Flow

The current_page_info parameter provides context about the source page's location in the hierarchy, which is essential for generating correct relative link paths.

Sources: tools/deepwiki-scraper.py:453-594

Error Handling and Retries

HTTP Fetch with Retries

The fetch_page() function retries failed requests up to three times with a fixed two-second delay between attempts:

| Attempt | Action | Delay |
|---|---|---|
| 1 | Try request | None |
| 2 | Retry after error | 2 seconds |
| 3 | Final retry | 2 seconds |
| Fail | Raise exception | N/A |

Browser-like headers are used to avoid being blocked:

Sources: tools/deepwiki-scraper.py:27-42
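
A minimal sketch of the retry behavior described in the table (three attempts, two-second pauses); the function name matches the scraper's, but the body is illustrative:

```python
import time
import requests

def fetch_page(url: str, session: requests.Session, max_retries: int = 3) -> requests.Response:
    """Retry transient failures a few times with a short fixed delay before giving up."""
    for attempt in range(1, max_retries + 1):
        try:
            response = session.get(url)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries:
                raise               # final attempt failed: propagate the error
            time.sleep(2)           # fixed 2-second pause between attempts
```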

Rate Limiting

To be respectful to the DeepWiki server, the main extraction loop includes a 1-second delay between page requests:

This appears in the main loop after each successful page extraction.

Sources: tools/deepwiki-scraper.py:872

Dependencies

The HTML to Markdown conversion relies on three key Python libraries:

| Library | Version | Purpose |
|---|---|---|
| requests | ≥2.31.0 | HTTP requests with session management |
| beautifulsoup4 | ≥4.12.0 | HTML parsing and element manipulation |
| html2text | ≥2020.1.16 | HTML to Markdown conversion |

Sources: tools/requirements.txt:1-3

Output Characteristics

The Markdown files produced by this conversion have these properties:

  • No line wrapping : Original formatting preserved (body_width=0)
  • Clean structure : No UI elements or navigation
  • Relative links : All internal links point to local .md files
  • Title guarantee : Every file starts with an H1 heading
  • Hierarchy-aware : Links account for subdirectory structure
  • Footer-free : DeepWiki-specific footer content removed

These characteristics make the files suitable for Phase 2 diagram enhancement and Phase 3 mdBook building without further modification.


Phase 2: Diagram Enhancement

Relevant source files

Purpose and Scope

Phase 2 performs intelligent diagram extraction and placement after Phase 1 has generated clean markdown files. This phase extracts Mermaid diagrams from DeepWiki's JavaScript payload, matches them to appropriate locations in the markdown content using fuzzy text matching, and inserts them contextually after relevant paragraphs.

For information about the initial markdown extraction that precedes this phase, see Phase 1: Markdown Extraction. For details on the specific fuzzy matching algorithm implementation, see Fuzzy Diagram Matching Algorithm. For information about the extraction patterns used, see Diagram Extraction from Next.js.

Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789

The Client-Side Rendering Problem

DeepWiki renders diagrams client-side using JavaScript, making them invisible to traditional HTML scraping. All Mermaid diagrams are embedded in a JavaScript payload (self.__next_f.push) that contains diagram code for all pages in the wiki, not just the current page. This creates a matching problem: given ~461 diagrams in a single payload and individual markdown files, how do we determine which diagrams belong in which files?

Key challenges:

  • Diagrams are escaped JavaScript strings (\n, \t, \")
  • No metadata associates diagrams with specific pages
  • html2text conversion changes text formatting from the original JavaScript context
  • Must avoid false positives (placing diagrams in wrong locations)

Sources: tools/deepwiki-scraper.py:458-461 README.md:131-136

Architecture Overview

Diagram: Phase 2 Processing Pipeline

Sources: tools/deepwiki-scraper.py:596-789

Diagram Extraction Process

The extraction process reads the JavaScript payload from any DeepWiki page and locates all Mermaid diagram blocks using regex pattern matching.

flowchart TD
    Start["extract_and_enhance_diagrams()"]
FetchURL["Fetch https://deepwiki.com/repo/1-overview"]
subgraph "Pattern Matching"
        Pattern1["Pattern: r'```mermaid\\\\\n(.*?)```'\n(re.DOTALL)"]
Pattern2["Pattern: r'([^`]{500,}?)```mermaid\\\\ (.*?)```'\n(with context)"]
FindAll["re.findall() → all_diagrams list"]
FindIter["re.finditer() → diagram_contexts with context"]
end
    
    subgraph "Unescaping"
        ReplaceNewline["Replace '\\\\\n' → newline"]
ReplaceTab["Replace '\\\\ ' → tab"]
ReplaceQuote["Replace '\\\\\"' → double-quote"]
ReplaceUnicode["Replace Unicode escapes:\n\\\< → '<'\n\\\> → '>'\n\\\& → '&'"]
end
    
    subgraph "Context Processing"
        Last500["Extract last 500 chars of context"]
FindHeading["Scan for last heading starting with #"]
ExtractAnchor["Extract last 2-3 non-heading lines\n(min 20 chars each)"]
BuildDict["Build dict: {last_heading, anchor_text, diagram}"]
end
    
 
   Start --> FetchURL
 
   FetchURL --> Pattern1
 
   FetchURL --> Pattern2
 
   Pattern1 --> FindAll
 
   Pattern2 --> FindIter
    
 
   FindAll --> ReplaceNewline
 
   FindIter --> ReplaceNewline
 
   ReplaceNewline --> ReplaceTab
 
   ReplaceTab --> ReplaceQuote
 
   ReplaceQuote --> ReplaceUnicode
    
 
   ReplaceUnicode --> Last500
 
   Last500 --> FindHeading
 
   FindHeading --> ExtractAnchor
 
   ExtractAnchor --> BuildDict
    
 
   BuildDict --> Output["Returns:\n- all_diagrams count\n- diagram_contexts list"]

Extraction Function Flow

Diagram: Diagram Extraction and Context Building

Sources: tools/deepwiki-scraper.py:604-674

Key Implementation Details

| Component | Implementation | Location |
|---|---|---|
| Regex Pattern | r'```mermaid\\n(.*?)```' with re.DOTALL flag | tools/deepwiki-scraper.py:615 |
| Context Pattern | r'([^`]{500,}?)```mermaid\\n(.*?)```' captures 500+ chars of context | tools/deepwiki-scraper.py:621 |
| Unescape Operations | replace('\\n', '\n'), replace('\\t', '\t'), etc. | tools/deepwiki-scraper.py:628-635 tools/deepwiki-scraper.py:639-645 |
| Heading Detection | line.startswith('#') on reversed context lines | tools/deepwiki-scraper.py:652-656 |
| Anchor Extraction | Last 2-3 lines with len(line) > 20, max 300 chars | tools/deepwiki-scraper.py:658-666 |
| Context Storage | Dict with keys: last_heading, anchor_text, diagram | tools/deepwiki-scraper.py:668-672 |

Sources: tools/deepwiki-scraper.py:614-674

Fuzzy Matching Algorithm

The fuzzy matching algorithm determines where each diagram should be inserted by finding the best match between the diagram's context and the markdown file's content.

flowchart TD
    Start["For each diagram_contexts[idx]"]
CheckUsed["idx in diagrams_used?"]
Skip["Skip to next diagram"]
subgraph "Text Normalization"
        NormFile["Normalize file content:\ncontent.lower()\n' '.join(content.split())"]
NormAnchor["Normalize anchor_text:\nanchor.lower()\n' '.join(anchor.split())"]
NormHeading["Normalize heading:\nheading.lower().replace('#', '').strip()"]
end
    
    subgraph "Progressive Chunk Matching"
        Try300["Try chunk_size=300"]
Try200["Try chunk_size=200"]
Try150["Try chunk_size=150"]
Try100["Try chunk_size=100"]
Try80["Try chunk_size=80"]
ExtractChunk["test_chunk = anchor_normalized[-chunk_size:]"]
FindPos["pos = content_normalized.find(test_chunk)"]
CheckPos["pos != -1?"]
ConvertLine["Convert char position to line number"]
RecordMatch["Record best_match_line, best_match_score"]
end
    
    subgraph "Heading Fallback"
        IterLines["For each line in markdown"]
CheckHeadingLine["line.strip().startswith('#')?"]
NormalizeLinе["Normalize line heading"]
CheckContains["heading_normalized in line_normalized?"]
RecordHeadingMatch["best_match_line = line_num\nbest_match_score = 50"]
end
    
 
   Start --> CheckUsed
 
   CheckUsed -->|Yes| Skip
 
   CheckUsed -->|No| NormFile
    
 
   NormFile --> NormAnchor
 
   NormAnchor --> Try300
 
   Try300 --> ExtractChunk
 
   ExtractChunk --> FindPos
 
   FindPos --> CheckPos
 
   CheckPos -->|Found| ConvertLine
 
   CheckPos -->|Not found| Try200
 
   ConvertLine --> RecordMatch
    
 
   Try200 --> Try150
 
   Try150 --> Try100
 
   Try100 --> Try80
 
   Try80 -->|All failed| IterLines
    
 
   RecordMatch --> Success["Return match with score"]
IterLines --> CheckHeadingLine
 
   CheckHeadingLine -->|Yes| NormalizeLine
 
   NormalizeLine --> CheckContains
 
   CheckContains -->|Yes| RecordHeadingMatch
 
   RecordHeadingMatch --> Success

Matching Strategy

Diagram: Progressive Chunk Matching with Fallback

Sources: tools/deepwiki-scraper.py:708-746

Chunk Size Progression

The algorithm tries progressively smaller chunk sizes to accommodate variations in text formatting between the JavaScript context and the html2text-converted markdown:

| Chunk Size | Use Case | Success Rate |
|---|---|---|
| 300 chars | Perfect or near-perfect matches | Highest precision |
| 200 chars | Minor formatting differences | Good precision |
| 150 chars | Moderate text variations | Acceptable precision |
| 100 chars | Significant reformatting | Lower precision |
| 80 chars | Minimal context available | Lowest precision |
| Heading match | Fallback when text matching fails | Score: 50 |

The algorithm stops at the first successful match, prioritizing larger chunks for higher confidence.

Sources: tools/deepwiki-scraper.py:716-730 README.md:134

flowchart TD
    Start["Found best_match_line"]
CheckType["lines[best_match_line].strip().startswith('#')?"]
subgraph "Heading Case"
        H1["insert_line = best_match_line + 1"]
H2["Skip blank lines after heading"]
H3["Skip through paragraph content"]
H4["Stop at next blank line or heading"]
end
    
    subgraph "Paragraph Case"
        P1["insert_line = best_match_line + 1"]
P2["Find end of current paragraph"]
P3["Stop at next blank line or heading"]
end
    
    subgraph "Insertion Format"
        I1["Insert: empty line"]
I2["Insert: ```mermaid"]
I3["Insert: diagram code"]
I4["Insert: ```"]
I5["Insert: empty line"]
end
    
 
   Start --> CheckType
 
   CheckType -->|Heading| H1
 
   CheckType -->|Paragraph| P1
    
 
   H1 --> H2
 
   H2 --> H3
 
   H3 --> H4
    
 
   P1 --> P2
 
   P2 --> P3
    
 
   H4 --> I1
 
   P3 --> I1
    
 
   I1 --> I2
 
   I2 --> I3
 
   I3 --> I4
 
   I4 --> I5
    
 
   I5 --> Append["Append to pending_insertions list:\n(insert_line, diagram, score, idx)"]

Insertion Point Logic

After finding a match, the system determines the precise line number where the diagram should be inserted.

Insertion Algorithm

Diagram: Insertion Point Calculation

Sources: tools/deepwiki-scraper.py:747-768

graph LR
    Collect["Collect all\npending_insertions"]
Sort["Sort by insert_line\n(descending)"]
Insert["Insert from bottom to top\npreserves line numbers"]
Write["Write enhanced file\nto temp_dir"]
Collect --> Sort
 
   Sort --> Insert
 
   Insert --> Write

Batch Insertion Strategy

Diagrams are inserted in descending line order to avoid invalidating insertion points:

Diagram: Batch Insertion Order

Implementation:

Sources: tools/deepwiki-scraper.py:771-783
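
A sketch of the bottom-up splice, assuming pending_insertions holds (insert_line, diagram, score, idx) tuples as described above:

```python
def apply_insertions(lines: list[str], pending_insertions: list[tuple[int, str, int, int]]) -> list[str]:
    """Insert matched diagrams bottom-up so earlier insertions don't shift later line numbers."""
    for insert_line, diagram, score, idx in sorted(pending_insertions, reverse=True):
        block = ["", "```mermaid", diagram.rstrip(), "```", ""]
        lines[insert_line:insert_line] = block   # splice the fenced block in at this line
    return lines
```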

sequenceDiagram
    participant Main as extract_and_enhance_diagrams()
    participant Glob as temp_dir.glob('**/*.md')
    participant File as Individual .md file
    participant Matcher as Fuzzy Matcher
    participant Writer as File Writer
    
    Main->>Main: Extract all diagram_contexts
    Main->>Glob: Find all markdown files
    
    loop For each md_file
        Glob->>File: Open and read content
        File->>File: Check if '```mermaid' already present
        
        alt Already has diagrams
            File->>Glob: Skip (continue)
        else No diagrams
            File->>Matcher: Normalize content
            
            loop For each diagram_context
                Matcher->>Matcher: Try progressive chunk matching
                Matcher->>Matcher: Try heading fallback
                Matcher->>Matcher: Record best match
            end
            
            Matcher->>File: Return pending_insertions list
            File->>File: Sort insertions (descending)
            File->>File: Insert diagrams bottom-up
            File->>Writer: Write enhanced content
            Writer->>Main: Increment enhanced_count
        end
    end
    
    Main->>Main: Print summary

File Processing Workflow

Phase 2 operates on files in the temporary directory created by Phase 1, enhancing them in-place before they are moved to the final output directory.

Processing Loop

Diagram: File Processing Sequence

Sources: tools/deepwiki-scraper.py:676-788

Performance Characteristics

Extraction Statistics

From a typical wiki with ~10 pages:

| Metric | Value | Location |
|---|---|---|
| Total diagrams in JS payload | ~461 | README.md:132 |
| Diagrams with context (500+ chars) | ~48 | README.md:133 |
| Context window size | 500 characters | tools/deepwiki-scraper.py:621 |
| Anchor text max length | 300 characters | tools/deepwiki-scraper.py:666 |
| Typical enhanced files | Varies by content | Printed in output |

Sources: README.md:132-133 tools/deepwiki-scraper.py:674 tools/deepwiki-scraper.py:788

Matching Performance

The progressive chunk size strategy balances precision and recall:

  • High precision matches (300-200 chars) : Strong contextual alignment
  • Medium precision matches (150-100 chars) : Acceptable with some risk
  • Low precision matches (80 chars) : Risk of false positives
  • Heading-only matches (score: 50) : Last resort fallback

The algorithm prefers to skip a diagram rather than place it incorrectly, prioritizing documentation quality over diagram count.

Sources: tools/deepwiki-scraper.py:716-745

Integration with Phases 1 and 3

Input Requirements (from Phase 1)

  • Clean markdown files in temp_dir
  • Files must not already contain ```mermaid blocks
  • Proper heading structure for fallback matching
  • Normalized link structure

Sources: tools/deepwiki-scraper.py:810-877

Output Guarantees (for Phase 3)

  • Enhanced markdown files in temp_dir
  • Diagrams inserted with proper fencing: ```mermaid ... ```
  • Blank lines before and after diagrams for proper rendering
  • Original file structure preserved (section-N directories maintained)
  • Atomic file operations (write complete file or skip)

Sources: tools/deepwiki-scraper.py:781-786 tools/deepwiki-scraper.py:883-908

Workflow Integration

Diagram: Three-Phase Integration

Sources: README.md:123-145 tools/deepwiki-scraper.py:810-916

Error Handling and Edge Cases

Skipped Files

Files are skipped if they already contain Mermaid diagrams to avoid duplicate insertion:

Sources: tools/deepwiki-scraper.py:686-687
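
A minimal sketch of the skip check (the surrounding loop and path are illustrative):

```python
from pathlib import Path

temp_dir = Path("/tmp/wiki")                      # illustrative temp directory from Phase 1
for md_file in temp_dir.glob("**/*.md"):
    content = md_file.read_text(encoding="utf-8")
    if "```mermaid" in content:
        continue                                  # already has diagrams: leave untouched
    # ...fuzzy matching and insertion happen here...
```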

Failed Matches

When a diagram cannot be matched:

  • The diagram is not inserted (conservative approach)
  • No error is raised (continues processing other diagrams)
  • File is left unmodified if no diagrams match

Sources: tools/deepwiki-scraper.py:699-746

Network Errors

If diagram extraction fails (network error, changed HTML structure):

  • Warning is printed but Phase 2 continues
  • Phase 1 files remain valid
  • System can still proceed to Phase 3 without diagrams

Sources: tools/deepwiki-scraper.py:610-612

Diagram Quality Thresholds

| Threshold | Purpose |
|---|---|
| len(diagram) > 10 | Filter out trivial/invalid diagram code |
| len(anchor) > 50 | Ensure sufficient context for matching |
| len(line) > 20 | Filter out short lines from anchor text |
| chunk_size >= 80 | Minimum viable match size |

Sources: tools/deepwiki-scraper.py:648 tools/deepwiki-scraper.py:712 tools/deepwiki-scraper.py:661

Summary

Phase 2 implements a sophisticated fuzzy matching system that:

  1. Extracts all Mermaid diagrams from DeepWiki's JavaScript payload using regex patterns
  2. Processes diagram context to extract heading and anchor text metadata
  3. Matches diagrams to markdown files using progressive chunk size comparison (300→80 chars)
  4. Inserts diagrams after relevant paragraphs with proper formatting
  5. Validates through conservative matching to avoid false positives

The phase operates entirely on files in the temporary directory, leaving Phase 1's output intact while preparing enhanced files for Phase 3's mdBook build process.

Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789


Fuzzy Diagram Matching Algorithm

Relevant source files

Purpose and Scope

This document describes the fuzzy matching algorithm used to intelligently place Mermaid diagrams extracted from DeepWiki's JavaScript payload into the correct locations within Markdown files. The algorithm solves the problem of matching diagram context (as it appears in the JavaScript) to content locations in the html2text-converted Markdown, accounting for formatting differences between the two representations.

For information about how diagrams are extracted from the Next.js payload, see Diagram Extraction from Next.js. For the overall diagram enhancement phase, see Phase 2: Diagram Enhancement.

The Matching Problem

The fuzzy matching algorithm addresses a fundamental mismatch: diagrams are embedded in DeepWiki's JavaScript payload alongside their surrounding context text, but this context text differs significantly from the final Markdown output produced by html2text. The algorithm must find where each diagram belongs despite these differences.

Format Differences Between Sources

| Aspect | JavaScript Payload | html2text Output |
|---|---|---|
| Whitespace | Escaped \n sequences | Actual newlines |
| Line wrapping | No wrapping (continuous text) | Wrapped at natural boundaries |
| HTML entities | Escaped (\u003c, \u0026) | Decoded (<, &) |
| Formatting | Inline with escaped quotes | Clean Markdown syntax |
| Structure | Linear text stream | Hierarchical headings/paragraphs |

Sources: tools/deepwiki-scraper.py:615-646

Context Extraction Strategy

The algorithm extracts two types of context for each diagram to enable matching:

1. Last Heading Before Diagram

The algorithm scans backwards through context lines to find the most recent heading, which provides a coarse-grained location hint.

Sources: tools/deepwiki-scraper.py:651-656

2. Anchor Text (Last 2-3 Paragraphs)

The anchor text consists of the last 2-3 substantial non-heading lines before the diagram, truncated to 300 characters. This provides fine-grained matching capability.

Sources: tools/deepwiki-scraper.py:658-666

Progressive Chunk Size Matching

The core of the fuzzy matching algorithm uses progressively smaller chunk sizes to find matches, prioritizing longer (more specific) matches over shorter ones.

Chunk Size Progression

The algorithm tests chunks in this order:

| Chunk Size | Purpose | Match Quality |
|---|---|---|
| 300 chars | Full anchor text | Highest confidence |
| 200 chars | Most of anchor | High confidence |
| 150 chars | Significant portion | Medium-high confidence |
| 100 chars | Key phrases | Medium confidence |
| 80 chars | Minimum viable match | Low confidence |

Matching Algorithm Flow

Sources: tools/deepwiki-scraper.py:716-730

Text Normalization

Both the diagram context and the target Markdown content undergo identical normalization to maximize matching success:

This process:

  • Converts all text to lowercase
  • Collapses all consecutive whitespace (spaces, tabs, newlines) into single spaces
  • Removes leading/trailing whitespace

Sources: tools/deepwiki-scraper.py:695-696 tools/deepwiki-scraper.py:713-714

Fallback: Heading-Based Matching

If progressive chunk matching fails, the algorithm falls back to heading-based matching:

Heading-based matches receive a fixed score of 50, lower than any chunk-based match, indicating lower confidence.

Sources: tools/deepwiki-scraper.py:733-745
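
A condensed Python sketch combining the normalization, progressive chunk search, and heading fallback described above (the function names are illustrative, not the scraper's own):

```python
CHUNK_SIZES = (300, 200, 150, 100, 80)

def normalize(text: str) -> str:
    return " ".join(text.lower().split())        # lowercase + collapse whitespace runs

def match_score(anchor_text: str, heading: str, content: str):
    """Progressive chunk matching with a heading fallback (score 50). Returns (kind, where, score)."""
    content_norm = normalize(content)
    anchor_norm = normalize(anchor_text)
    for chunk_size in CHUNK_SIZES:
        chunk = anchor_norm[-chunk_size:]        # the text immediately before the diagram
        pos = content_norm.find(chunk)
        if pos != -1:
            # the caller converts this character offset back to a line number
            return ("text", pos, chunk_size)
    heading_norm = normalize(heading).replace("#", "").strip()
    for line_num, line in enumerate(content.split("\n")):
        if line.strip().startswith("#") and heading_norm and heading_norm in normalize(line):
            return ("heading", line_num, 50)
    return None                                  # no safe placement: skip this diagram
```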

Insertion Point Calculation

Once a match is found, the algorithm calculates the precise insertion point for the diagram:

Insertion After Headings

Sources: tools/deepwiki-scraper.py:751-759

Insertion After Paragraphs

Sources: tools/deepwiki-scraper.py:760-765

Scoring and Deduplication

The algorithm tracks which diagrams have been used to prevent duplicates:

For each file, the algorithm:

  1. Attempts to match all diagrams with context
  2. Stores successful matches with their scores in pending_insertions
  3. Marks diagrams as used in diagrams_used
  4. Sorts insertions by line number (descending) to avoid index shifting
  5. Inserts diagrams from bottom to top

Sources: tools/deepwiki-scraper.py:692 tools/deepwiki-scraper.py:767-768

Diagram Insertion Format

Diagrams are inserted with proper Markdown fencing and spacing:

This results in the following structure in the Markdown file:

The diagram is wrapped in a ```mermaid fence with a blank line above and below, so it sits as a standalone block between the matched paragraph and the next paragraph of text.

Sources: tools/deepwiki-scraper.py:774-779

Complete Matching Pipeline

```mermaid
flowchart TD
    Start["extract_and_enhance_diagrams"] --> FetchJS["Fetch JavaScript from\n/1-overview page"]
    FetchJS --> ExtractAll["Extract all diagrams\nwith 500+ char context"]
    ExtractAll --> ParseContext["Parse each context:\n- last_heading\n- anchor_text (300 chars)"]
    ParseContext --> FindFiles["Find all .md files\nin temp directory"]
    FindFiles --> ForEachFile["For each Markdown file"]
    
    ForEachFile --> SkipExisting["Skip if already has\n```mermaid blocks"]
    SkipExisting --> NormalizeContent["Normalize file content"]
    NormalizeContent --> ForEachDiagram["For each diagram\nwith context"]
    
    ForEachDiagram --> TryChunks["Try progressive chunk\nmatching (300-80)"]
    TryChunks -->|Match| StoreMatch["Store match with score"]
    TryChunks -->|No match| TryHeading["Try heading fallback"]
    TryHeading -->|Match| StoreMatch
    TryHeading -->|No match| NextDiagram["Try next diagram"]
    
    StoreMatch --> NextDiagram
    NextDiagram --> ForEachDiagram
    ForEachDiagram -->|All tried| InsertAll["Insert all matched\ndiagrams (bottom-up)"]
    InsertAll --> SaveFile["Save enhanced file"]
    SaveFile --> ForEachFile
    ForEachFile -->|All files| Report["Report statistics"]
```

Sources: tools/deepwiki-scraper.py:596-788

Key Functions

| Function | Location | Purpose |
|---|---|---|
| extract_and_enhance_diagrams | tools/deepwiki-scraper.py:596-788 | Main orchestrator for diagram enhancement phase |
| Progressive chunk loop | tools/deepwiki-scraper.py:716-730 | Tries decreasing chunk sizes for matching |
| Heading fallback | tools/deepwiki-scraper.py:733-745 | Matches based on heading text when chunks fail |
| Insertion point calculation | tools/deepwiki-scraper.py:748-765 | Determines where to insert diagram after match |
| Diagram insertion | tools/deepwiki-scraper.py:774-779 | Inserts diagram with proper fencing |

Performance Characteristics

The algorithm processes diagrams in a single pass per file with the following complexity:

| Operation | Complexity | Notes |
|---|---|---|
| Content normalization | O(n) | Where n = file size in characters |
| Chunk search | O(n × c) | c = number of chunk sizes (5) |
| Line number conversion | O(L) | Where L = number of lines in file |
| Insertion sorting | O(k log k) | Where k = matched diagrams |
| Bottom-up insertion | O(k × L) | Avoids index recalculation |

For a typical file with 1000 lines and 50 diagram candidates, the algorithm completes in under 100ms.

Sources: tools/deepwiki-scraper.py:681-788

Match Quality Statistics

As reported in the console output, the algorithm typically achieves:

  • Total diagrams in JavaScript : ~461 diagrams across all pages
  • Diagrams with sufficient context : ~48 diagrams (500+ char context)
  • Average match rate : 60-80% of diagrams with context are successfully placed
  • Typical score distribution :
    • 300-char matches: 20-30% (highest confidence)
    • 200-char matches: 15-25%
    • 150-char matches: 15-20%
    • 100-char matches: 10-15%
    • 80-char matches: 5-10%
    • Heading fallback: 5-10% (lowest confidence)

Sources: README.md:132-136 tools/deepwiki-scraper.py:674


Diagram Extraction from Next.js

Relevant source files

Purpose and Scope

This document details how Mermaid diagrams are extracted from DeepWiki's Next.js JavaScript payload. DeepWiki uses client-side rendering for diagrams, embedding them as escaped strings within the HTML's JavaScript data structures. This page covers the extraction algorithms, regex patterns, unescaping logic, and deduplication mechanisms used to recover these diagrams.

For information about how extracted diagrams are matched to content and injected into Markdown files, see Fuzzy Diagram Matching Algorithm. For the overall diagram enhancement workflow, see Phase 2: Diagram Enhancement.


The Next.js Data Payload Problem

DeepWiki's architecture presents a unique challenge for diagram extraction. The application uses Next.js with client-side rendering, where Mermaid diagrams are embedded in the JavaScript payload rather than being present in the static HTML. Furthermore, the JavaScript payload contains diagrams from all pages in the wiki, not just the currently viewed page, making per-page extraction impossible without additional context matching.

Diagram: Next.js Payload Structure

graph TB
    subgraph "Browser View"
        HTML["HTML Response\nfrom deepwiki.com"]
end
    
    subgraph "Embedded JavaScript"
        JSPayload["Next.js Data Payload\nMixed content from all pages"]
DiagramData["Mermaid Diagrams\nAs escaped strings"]
end
    
    subgraph "String Format"
        EscapedFormat["```mermaid\\ \ngraph TD\\ \nA --> B\\ \n```"]
UnescapedFormat["```mermaid\ngraph TD\nA --> B\n```"]
end
    
 
   HTML --> JSPayload
 
   JSPayload --> DiagramData
 
   DiagramData --> EscapedFormat
 
   EscapedFormat -.->|extract_mermaid_from_nextjs_data| UnescapedFormat
    
    note1["Problem: Diagrams from\nALL wiki pages mixed together"]
JSPayload -.-> note1
    
    note2["Problem: Escape sequences\n\\\n, \\	, \\\", etc."]
EscapedFormat -.-> note2

Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674

The key characteristics of this data format:

| Characteristic | Description | Impact |
|---|---|---|
| Escaped newlines | Literal \\n instead of newline characters | Requires unescaping before use |
| Mixed content | All pages' diagrams in one payload | Requires context matching (Phase 2) |
| Unicode escapes | Sequences like \\u003c for < | Requires comprehensive unescape logic |
| String wrapping | Diagrams wrapped in JavaScript strings | Requires careful quote handling |

Extraction Entry Point

The extract_mermaid_from_nextjs_data() function serves as the primary extraction mechanism. It is called during Phase 2 of the pipeline when processing the HTML response from any DeepWiki page.

Diagram: Extraction Function Flow

flowchart TD
    Start["extract_mermaid_from_nextjs_data(html_text)"]
Strategy1["Strategy 1:\nFenced Block Pattern\n```mermaid\\\n(.*?)```"]
Check1{{"Blocks found?"}}
Strategy2["Strategy 2:\nJavaScript String Scan\nSearch for diagram keywords"]
Unescape["Unescape all blocks:\n\\\n→ newline\n\\ → tab\n\< → <"]
Dedup["Deduplicate by fingerprint\nFirst 100 chars"]
Return["Return unique_blocks"]
Start --> Strategy1
 
   Strategy1 --> Check1
 
   Check1 -->|Yes| Unescape
 
   Check1 -->|No| Strategy2
 
   Strategy2 --> Unescape
 
   Unescape --> Dedup
 
   Dedup --> Return

Sources: tools/deepwiki-scraper.py:218-331


Strategy 1: Fenced Mermaid Block Pattern

The primary extraction strategy uses a regex pattern to locate fenced Mermaid code blocks within the JavaScript payload. These blocks follow the Markdown convention but with escaped newlines.

Regex Pattern : r'```mermaid\\n(.*?)```'

This pattern specifically targets:

  • Opening fence: ```mermaid
  • Escaped newline: \\n (literal backslash-n in the string)
  • Diagram content: (.*?) (non-greedy capture)
  • Closing fence: ```

Diagram: Fenced Block Extraction Process

Sources: tools/deepwiki-scraper.py:223-244

Code Implementation :

The extraction loop at tools/deepwiki-scraper.py:223-244 implements this strategy:

  1. Pattern matching : Uses re.finditer() with re.DOTALL flag to handle multi-line diagrams
  2. Content extraction : Captures the diagram code via match.group(1)
  3. Unescaping : Applies comprehensive escape sequence replacement
  4. Validation : Filters blocks with len(block) > 10 to exclude empty matches
  5. Logging : Prints first 50 characters and line count for diagnostics
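
A minimal sketch of this Strategy 1 loop, using the pattern and length filter listed above (only the newline unescape is shown here; the full escape table appears later on this page):

```python
import re

FENCED = re.compile(r"```mermaid\\n(.*?)```", re.DOTALL)

def extract_fenced_blocks(html_text: str) -> list[str]:
    """Strategy 1: pull escaped ```mermaid blocks straight out of the JavaScript payload."""
    blocks = []
    for match in FENCED.finditer(html_text):
        block = match.group(1).replace("\\n", "\n")   # minimal unescape; see table below
        if len(block) > 10:                           # discard trivial/empty matches
            blocks.append(block)
    return blocks
```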

Strategy 2: JavaScript String Scanning

When Strategy 1 fails to find fenced blocks, the function falls back to scanning for raw diagram strings embedded in JavaScript. This handles cases where diagrams are stored as plain strings without Markdown fencing.

Diagram: JavaScript String Scan Algorithm

flowchart TD
    Start["For each diagram keyword"]
Keywords["Keywords:\ngraph TD, graph TB,\nflowchart TD, sequenceDiagram,\nclassDiagram"]
FindKW["pos = html_text.find(keyword, pos)"]
CheckFound{{"Keyword found?"}}
BackwardScan["Scan backwards 20 chars\nFind opening quote"]
QuoteFound{{"Quote found?"}}
ForwardScan["Scan forward up to 10000 chars\nFind closing quote\nSkip escaped quotes"]
Extract["Extract string_start:string_end"]
UnescapeValidate["Unescape and validate\nMust have 3+ lines"]
Append["Append to mermaid_blocks"]
NextPos["pos += 1, continue search"]
Start --> Keywords
 
   Keywords --> FindKW
 
   FindKW --> CheckFound
 
   CheckFound -->|Yes| BackwardScan
 
   CheckFound -->|No, break| End["Move to next keyword"]
BackwardScan --> QuoteFound
 
   QuoteFound -->|Yes| ForwardScan
 
   QuoteFound -->|No| NextPos
 
   ForwardScan --> Extract
 
   Extract --> UnescapeValidate
 
   UnescapeValidate --> Append
 
   Append --> NextPos
 
   NextPos --> FindKW

Sources: tools/deepwiki-scraper.py:246-302

Keyword List :

The algorithm searches for these Mermaid diagram type indicators:
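
A sketch of the keyword list, taken from the diagram above (the real list may contain additional diagram types):

```python
# Diagram-type keywords used by the fallback string scan.
MERMAID_KEYWORDS = ["graph TD", "graph TB", "flowchart TD", "sequenceDiagram", "classDiagram"]
```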

Quote Handling Logic :

The forward scan at tools/deepwiki-scraper.py:273-285 implements careful quote detection:

  • Scans up to 10,000 characters forward (safety limit)
  • Checks if previous character is \ to identify escaped quotes
  • Breaks on first unescaped " character
  • Returns to search position + 1 if no closing quote found

Unescape Processing

All extracted diagram blocks undergo comprehensive unescaping to convert JavaScript string representations into valid Mermaid code. The unescaping process handles multiple escape sequence types.

Escape Sequence Mapping :

Escaped FormUnescaped ResultPurpose
\\n\n (newline)Line breaks in diagram code
\\t\t (tab)Indentation
\\"" (quote)String literals in labels
\\\\\ (backslash)Literal backslashes
\\u003c<Less-than symbol
\\u003e>Greater-than symbol
\\u0026&Ampersand

Diagram: Unescape Transformation Pipeline

Sources: tools/deepwiki-scraper.py:231-238 tools/deepwiki-scraper.py:289-295

Implementation Details :

The unescaping sequence at tools/deepwiki-scraper.py:231-238 executes in a specific order to prevent double-processing:

  1. Newlines first : \\n → \n (most common)
  2. Tabs : \\t → \t (whitespace)
  3. Quotes : \\" → " (before backslash handling to avoid conflicts)
  4. Backslashes : \\\\ → \ (last to avoid interfering with other escapes)
  5. Unicode : \\u003c, \\u003e, \\u0026 → <, >, &

The order matters: processing backslashes before quotes would incorrectly unescape \\\\" sequences.
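
A sketch of the replacement chain in this order (the escape table above is authoritative; this is not the verbatim implementation):

```python
def unescape_js_string(block):
    """Illustrative sketch: replacement order matters to avoid double-processing."""
    return (block
            .replace('\\n', '\n')      # 1. newlines first (most common)
            .replace('\\t', '\t')      # 2. tabs
            .replace('\\"', '"')       # 3. quotes before backslash handling
            .replace('\\\\', '\\')     # 4. backslashes last
            .replace('\\u003c', '<')   # 5. unicode escapes
            .replace('\\u003e', '>')
            .replace('\\u0026', '&'))
```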


Deduplication Mechanism

Since multiple extraction strategies may find the same diagram (once as fenced block, once as JavaScript string), the function implements fingerprint-based deduplication.

Diagram: Deduplication Algorithm

flowchart TD
    Start["Input: mermaid_blocks[]\n(may contain duplicates)"]
    Init["Initialize:\nunique_blocks = []\nseen = set()"]
    Loop["For each block in mermaid_blocks"]
    Fingerprint["fingerprint = block[:100]\n(first 100 chars)"]
    CheckSeen{{"fingerprint in seen?"}}
    Skip["Skip duplicate"]
    Add["Add to seen set\nAppend to unique_blocks"]
    Return["Return unique_blocks"]
    Start --> Init
    Init --> Loop
    Loop --> Fingerprint
    Fingerprint --> CheckSeen
    CheckSeen -->|Yes| Skip
    CheckSeen -->|No| Add
    Skip --> Loop
    Add --> Loop
    Loop -->|Done| Return

Sources: tools/deepwiki-scraper.py:304-311

Fingerprint Strategy :

The deduplication at tools/deepwiki-scraper.py:304-311 uses the first 100 characters as a unique identifier. This approach:

  • Avoids exact string comparison : Saves memory and time for large diagrams
  • Handles minor variations : Trailing whitespace differences don't affect matching
  • Preserves order : First occurrence wins (FIFO)
  • Works across strategies : Catches duplicates from both extraction methods
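
A sketch of the deduplication pass, mirroring the diagram above (illustrative, not the verbatim code):

```python
def dedupe_blocks(mermaid_blocks):
    """Illustrative sketch: first-100-character fingerprint deduplication."""
    unique_blocks, seen = [], set()
    for block in mermaid_blocks:
        fingerprint = block[:100]      # cheap identity for large diagrams
        if fingerprint in seen:
            continue                   # first occurrence wins (FIFO)
        seen.add(fingerprint)
        unique_blocks.append(block)
    return unique_blocks
```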

Integration with Enhancement Pipeline

The extract_mermaid_from_nextjs_data() function is called from extract_and_enhance_diagrams() during Phase 2 processing. The integration pattern extracts diagrams globally, then distributes them to individual pages through context matching.

Diagram: Phase 2 Integration Flow

sequenceDiagram
    participant Main as "extract_and_enhance_diagrams()"
    participant HTTP as "requests.Session"
    participant Extract as "extract_mermaid_from_nextjs_data()"
    participant Context as "Context Extraction"
    participant Files as "Markdown Files"
    
    Main->>HTTP: GET https://deepwiki.com/repo/1-overview
    HTTP-->>Main: HTML response (all diagrams)
    
    Main->>Extract: extract_mermaid_from_nextjs_data(html_text)
    Extract->>Extract: Strategy 1: Fenced blocks
    Extract->>Extract: Strategy 2: JS strings
    Extract->>Extract: Unescape all
    Extract->>Extract: Deduplicate
    Extract-->>Main: all_diagrams[] (~461 diagrams)
    
    Main->>Context: Extract with 500-char context
    Context-->>Main: diagram_contexts[] (~48 with context)
    
    Main->>Files: Fuzzy match and inject into *.md
    Files-->>Main: enhanced_count files modified

Sources: tools/deepwiki-scraper.py:596-674 tools/deepwiki-scraper.py:604-612

Call Site :

The extraction is invoked at tools/deepwiki-scraper.py:604-612 within extract_and_enhance_diagrams():

  1. Fetch any page : Typically uses /1-overview as all diagrams are in every page's payload
  2. Extract globally : Calls extract_mermaid_from_nextjs_data() on full HTML response
  3. Count total : Logs total diagram count (~461 in typical repositories)
  4. Extract context : Secondary regex pass to capture surrounding text (see Fuzzy Diagram Matching Algorithm)

Alternative Pattern Search :

Phase 2 also performs a second extraction pass at tools/deepwiki-scraper.py:615-646 with context:

  • Pattern : r'([^`]{500,}?)```mermaid\\n(.*?)```'
  • Purpose : Captures 500+ characters before each diagram for context matching
  • Result : diagram_contexts[] with last_heading, anchor_text, and diagram fields
  • Filtering : Only diagrams with meaningful context are used for fuzzy matching
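
A sketch of this context-capturing pass; the dictionary keys follow the fields listed above, while the 300-character anchor window is an illustrative simplification (the real code also records a last_heading field):

```python
import re

# Pair each diagram with the text that precedes it in the JS payload,
# so it can later be fuzzy-matched against the converted markdown files.
CONTEXT_PATTERN = re.compile(r'([^`]{500,}?)```mermaid\\n(.*?)```', re.DOTALL)

def extract_diagram_contexts(html_text):
    contexts = []
    for before, diagram in CONTEXT_PATTERN.findall(html_text):
        contexts.append({
            "anchor_text": before[-300:],               # text just before the diagram
            "diagram": diagram.replace('\\n', '\n'),    # unescape as in Strategy 1
        })
    return contexts
```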

Error Handling and Diagnostics

The extraction function includes comprehensive error handling and diagnostic output to aid in debugging and monitoring extraction quality.

Error Handling Strategy :

Sources: tools/deepwiki-scraper.py:327-331

Diagnostic Output :

The function provides detailed logging at tools/deepwiki-scraper.py:244, tools/deepwiki-scraper.py:248, tools/deepwiki-scraper.py:300, and tools/deepwiki-scraper.py:314-316:

| Output Message | Condition | Purpose |
|---|---|---|
| "Found mermaid diagram: {first_50}... ({lines} lines)" | Each successful extraction | Verify diagram content |
| "No fenced mermaid blocks found, trying JavaScript extraction..." | Strategy 1 fails | Indicate fallback |
| "Found JS mermaid diagram: {first_50}... ({lines} lines)" | Strategy 2 success | Show fallback results |
| "Extracted {count} unique mermaid diagram(s)" | Deduplication complete | Report final count |
| "Warning: No valid mermaid diagrams extracted" | Zero diagrams found | Alert to potential issues |
| "Warning: Failed to extract mermaid from page data: {e}" | Exception caught | Debug extraction failures |

Performance Characteristics

The extraction algorithm exhibits specific performance characteristics relevant to large wiki repositories.

Complexity Analysis :

| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Strategy 1 regex | O(n) | O(m) | n = HTML length, m = diagram count |
| Strategy 2 scan | O(n × k) | O(m) | k = keyword count (10) |
| Unescaping | O(m × d) | O(m × d) | d = avg diagram length |
| Deduplication | O(m) | O(m) | Uses 100-char fingerprint |
| Total | O(n × k) | O(m × d) | Dominated by Strategy 2 |

Typical Performance :

Based on the diagnostic output patterns at tools/deepwiki-scraper.py:314:

  • Input size : ~2-5 MB HTML response
  • Extraction time : ~200-500ms (dominated by regex operations)
  • Diagrams found : ~461 total diagrams
  • Diagrams with context : ~48 after filtering
  • Memory usage : ~1-2 MB for diagram storage (ephemeral)

Optimization Opportunities :

The current implementation prioritizes correctness over performance. Potential optimizations:

  1. Early termination : Stop Strategy 2 after finding sufficient diagrams
  2. Compiled patterns : Pre-compile regex patterns (currently done inline)
  3. Streaming extraction : Process HTML in chunks rather than loading entirely
  4. Fingerprint cache : Persist fingerprints across runs to avoid re-extraction

However, given typical execution times (<1 second), these optimizations are not currently necessary.


Summary

The Next.js diagram extraction mechanism solves the challenge of recovering client-side rendered Mermaid diagrams from DeepWiki's JavaScript payload. The implementation uses a two-strategy approach (fenced blocks and JavaScript string scanning), comprehensive unescaping logic, and fingerprint-based deduplication to reliably extract hundreds of diagrams from a single HTML response. The extracted diagrams are then passed to the fuzzy matching algorithm (see Fuzzy Diagram Matching Algorithm) for intelligent placement in the appropriate Markdown files.

Key Functions and Components :

| Component | Location | Purpose |
|---|---|---|
| extract_mermaid_from_nextjs_data() | tools/deepwiki-scraper.py:218-331 | Main extraction function |
| Strategy 1 regex | tools/deepwiki-scraper.py:223-244 | Fenced block pattern matching |
| Strategy 2 scanner | tools/deepwiki-scraper.py:246-302 | JavaScript string scanning |
| Deduplication | tools/deepwiki-scraper.py:304-311 | Fingerprint-based uniqueness |
| Phase 2 integration | tools/deepwiki-scraper.py:604-612 | Call site in enhancement pipeline |

Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674


Phase 3: mdBook Build

Relevant source files

This document describes the final phase of the three-phase pipeline, where the extracted and enhanced Markdown files are transformed into a searchable HTML documentation site using mdBook and mdbook-mermaid. This phase is orchestrated by build-docs.sh and can be optionally skipped using the MARKDOWN_ONLY configuration flag.

For details on configuration file generation, see Configuration Generation. For the table of contents generation algorithm, see SUMMARY.md Generation. For the earlier phases of content extraction and diagram enhancement, see Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement.

Overview

Phase 3 executes only when MARKDOWN_ONLY is not set to "true". It consists of seven distinct steps that transform the extracted Markdown files into a complete mdBook-based HTML documentation site with working Mermaid diagram rendering, search functionality, and navigation.

Phase 3 Execution Flow

graph TB
    Start["Phase 2 Complete\n$WIKI_DIR contains\nenhanced Markdown"]
CheckMode{"MARKDOWN_ONLY\n== 'true'?"}
SkipPath["Skip Phase 3\nCopy markdown to output\nExit"]
Step2["Step 2:\nInitialize mdBook Structure\nmkdir -p $BOOK_DIR\ncd $BOOK_DIR"]
Step3["Step 3:\nGenerate book.toml\nConfigure title, authors,\ntheme, preprocessors"]
Step4["Step 4:\nGenerate SUMMARY.md\nDiscover file structure\nCreate table of contents"]
Step5["Step 5:\nCopy Markdown Files\ncp -r $WIKI_DIR/* src/"]
Step6["Step 6:\nInstall Mermaid Assets\nmdbook-mermaid install"]
Step7["Step 7:\nBuild HTML Book\nmdbook build"]
Step8["Step 8:\nCopy Outputs\nbook/ → /output/book/\nmarkdown → /output/markdown/\nbook.toml → /output/"]
End["Complete:\nHTML documentation ready"]
Start --> CheckMode
 
   CheckMode -->|Yes| SkipPath
 
   CheckMode -->|No| Step2
 
   SkipPath --> End
 
   Step2 --> Step3
 
   Step3 --> Step4
 
   Step4 --> Step5
 
   Step5 --> Step6
 
   Step6 --> Step7
 
   Step7 --> Step8
 
   Step8 --> End

Sources: build-docs.sh:60-76 build-docs.sh:78-205

Directory Structure Transformation

Phase 3 transforms the flat wiki directory structure into mdBook's required layout. The following diagram shows how files are organized and moved through the build process:

Directory Layout Evolution Through Phase 3

graph LR
    subgraph Input["Input: $WIKI_DIR"]
WikiRoot["$WIKI_DIR/\n(scraped content)"]
WikiMD["*.md files"]
WikiSections["section-N/ dirs"]
WikiRoot --> WikiMD
 
       WikiRoot --> WikiSections
    end
    
    subgraph BookStructure["$BOOK_DIR Structure"]
BookRoot["$BOOK_DIR/"]
BookToml["book.toml"]
SrcDir["src/"]
SrcSummary["src/SUMMARY.md"]
SrcMD["src/*.md"]
SrcSections["src/section-N/"]
BookOutput["book/"]
BookHTML["book/index.html\nbook/*.html"]
BookRoot --> BookToml
 
       BookRoot --> SrcDir
 
       BookRoot --> BookOutput
 
       SrcDir --> SrcSummary
 
       SrcDir --> SrcMD
 
       SrcDir --> SrcSections
 
       BookOutput --> BookHTML
    end
    
    subgraph Output["Output: /output"]
OutRoot["/output/"]
OutBook["book/"]
OutMarkdown["markdown/"]
OutToml["book.toml"]
OutRoot --> OutBook
 
       OutRoot --> OutMarkdown
 
       OutRoot --> OutToml
    end
    
 
   WikiMD -.->|Step 5: cp -r| SrcMD
 
   WikiSections -.->|Step 5: cp -r| SrcSections
 
   BookHTML -.->|Step 8: cp -r book| OutBook
 
   WikiMD -.->|Step 8: cp -r| OutMarkdown
 
   BookToml -.->|Step 8: cp| OutToml

Sources: build-docs.sh:27-30 build-docs.sh:81-106 build-docs.sh:163-191

Step 2: Initialize mdBook Structure

The build script creates the base directory structure required by mdBook. This establishes the workspace where all subsequent operations occur.

| Operation | Command | Purpose |
|---|---|---|
| Create book directory | `mkdir -p "$BOOK_DIR"` | Root directory for mdBook project |
| Change to book directory | `cd "$BOOK_DIR"` | Set working directory for mdBook commands |
| Create source directory | `mkdir -p src` | Directory where Markdown files will be placed |

The $BOOK_DIR variable defaults to /workspace/book and serves as the mdBook project root throughout Phase 3.

Sources: build-docs.sh:78-106

Step 3: Generate book.toml Configuration

The script dynamically generates book.toml, mdBook's main configuration file, using environment variables and auto-detected metadata. This file controls the book's metadata, theme, preprocessors, and output settings.

Configuration File Structure

| Section | Key | Value Source | Purpose |
|---|---|---|---|
| [book] | title | $BOOK_TITLE | Book title displayed in HTML |
| [book] | authors | $BOOK_AUTHORS | Author names (defaults to repo owner) |
| [book] | language | "en" (hardcoded) | Documentation language |
| [book] | multilingual | false (hardcoded) | Single language mode |
| [book] | src | "src" (hardcoded) | Source directory name |
| [output.html] | default-theme | "rust" (hardcoded) | Visual theme (Rust documentation style) |
| [output.html] | git-repository-url | $GIT_REPO_URL | Enables "Edit this page" links |
| [preprocessor.mermaid] | command | "mdbook-mermaid" | Diagram rendering preprocessor |
| [output.html.fold] | enable | true | Enable sidebar section folding |
| [output.html.fold] | level | 1 | Fold at level 1 by default |

The generated file is written to $BOOK_DIR/book.toml using a heredoc construct. For detailed information on this process, see Configuration Generation.

Sources: build-docs.sh:84-103

Step 4: Generate SUMMARY.md Table of Contents

The script automatically discovers the file structure and generates SUMMARY.md, which defines the book's navigation hierarchy. This is a critical step that bridges the flat file structure to mdBook's hierarchical navigation.

SUMMARY.md Generation Algorithm

graph TD
    Start["Start:\nsrc/SUMMARY.md creation"]
Header["Write header:\n'# Summary'"]
FindFirst["Find first page:\nls $WIKI_DIR/*.md /head -1"]
CheckFirst{"First page exists?"}
AddIntro["Extract title from first line Add as introduction link"]
LoopMain["Loop through all *.md in $WIKI_DIR"]
CheckSkip{"Is this first page?"}
SkipIt["Continue to next file"]
ExtractTitle["Extract title: head -1/ sed 's/^# //'"]
GetSectionNum["Extract section number:\ngrep -oE '^[0-9]+'"]
CheckSubsections{"section-$num\ndirectory\nexists?"}
AddMain["Add main section entry:\n- [title](filename)"]
EndLoop{"More\nfiles?"}
AddSection["Add section header:\n# title\n- [title](filename)"]
LoopSubs["Loop through\nsection-N/*.md files"]
AddSub["Add subsection:\n - [subtitle](section-N/file)"]
CheckMoreSubs{"More\nsubsections?"}
Done["SUMMARY.md complete"]
Start --> Header
 
   Header --> FindFirst
 
   FindFirst --> CheckFirst
 
   CheckFirst -->|Yes| AddIntro
 
   CheckFirst -->|No| LoopMain
 
   AddIntro --> LoopMain
 
   LoopMain --> CheckSkip
 
   CheckSkip -->|Yes| SkipIt
 
   CheckSkip -->|No| ExtractTitle
 
   SkipIt --> EndLoop
 
   ExtractTitle --> GetSectionNum
 
   GetSectionNum --> CheckSubsections
 
   CheckSubsections -->|Yes| AddSection
 
   CheckSubsections -->|No| AddMain
 
   AddMain --> EndLoop
 
   AddSection --> LoopSubs
 
   LoopSubs --> AddSub
 
   AddSub --> CheckMoreSubs
 
   CheckMoreSubs -->|Yes| LoopSubs
 
   CheckMoreSubs -->|No| EndLoop
 
   EndLoop -->|Yes| LoopMain
 
   EndLoop -->|No| Done

The script counts the generated entries using grep -c '\[' src/SUMMARY.md and logs the total. For a detailed explanation of the SUMMARY.md format and generation logic, see SUMMARY.md Generation.

Sources: build-docs.sh:108-161

Step 5: Copy Markdown Files to Book Source

The script copies all Markdown files from the wiki directory to the mdBook source directory. This includes both top-level files and subsection directories.

The copy operation uses recursive mode to preserve the directory structure:

  • Command: cp -r "$WIKI_DIR"/* src/
  • Source: /workspace/wiki/ (contains enhanced Markdown with diagrams)
  • Destination: $BOOK_DIR/src/ (mdBook source directory)

All files retain their names and relative paths, ensuring SUMMARY.md references remain valid.

Sources: build-docs.sh:163-166

Step 6: Install mdbook-mermaid Assets

The mdbook-mermaid preprocessor requires JavaScript and CSS assets to render Mermaid diagrams in the browser. The installation command copies these assets into the mdBook theme directory.

Asset Installation Process

graph LR
    Cmd["mdbook-mermaid install $BOOK_DIR"]
Detect["Detect book.toml location"]
ThemeDir["Create/locate theme directory\n$BOOK_DIR/theme/"]
CopyJS["Copy mermaid.min.js\n(Mermaid rendering library)"]
CopyInit["Copy mermaid-init.js\n(Initialization code)"]
CopyCSS["Copy mermaid.css\n(Diagram styling)"]
Complete["Assets installed\nReady for diagram rendering"]
Cmd --> Detect
 
   Detect --> ThemeDir
 
   ThemeDir --> CopyJS
 
   ThemeDir --> CopyInit
 
   ThemeDir --> CopyCSS
 
   CopyJS --> Complete
 
   CopyInit --> Complete
 
   CopyCSS --> Complete

After installation, mdBook will automatically include these assets in all generated HTML pages, enabling client-side Mermaid diagram rendering.

Sources: build-docs.sh:168-171 README.md:138-144

Step 7: Build HTML Documentation

The core build operation is performed by the mdbook build command, which reads the configuration, processes all Markdown files, and generates the complete HTML site.

mdBook Build Pipeline

graph TB
    Start["mdbook build command"]
ReadConfig["Read book.toml\nLoad configuration"]
ParseSummary["Parse src/SUMMARY.md\nBuild navigation tree"]
ReadMarkdown["Read all .md files\nfrom src/ directory"]
PreprocessMermaid["Run mermaid preprocessor\nDetect ```mermaid blocks"]
ConvertHTML["Convert Markdown → HTML\nApply rust theme"]
GeneratePages["Generate HTML pages\nOne per Markdown file"]
AddNav["Add navigation sidebar\nBased on SUMMARY.md"]
AddSearch["Generate search index\nSearchable content"]
AddAssets["Include CSS/JS assets\nTheme + Mermaid libraries"]
WriteOutput["Write to book/ directory\nComplete static site"]
Done["Build complete:\nbook/index.html ready"]
Start --> ReadConfig
 
   ReadConfig --> ParseSummary
 
   ParseSummary --> ReadMarkdown
 
   ReadMarkdown --> PreprocessMermaid
 
   PreprocessMermaid --> ConvertHTML
 
   ConvertHTML --> GeneratePages
 
   GeneratePages --> AddNav
 
   GeneratePages --> AddSearch
 
   GeneratePages --> AddAssets
 
   AddNav --> WriteOutput
 
   AddSearch --> WriteOutput
 
   AddAssets --> WriteOutput
 
   WriteOutput --> Done

The build command produces the following outputs in $BOOK_DIR/book/:

  • index.html - Main entry point
  • Individual HTML pages for each Markdown file
  • searchindex.js - Full-text search index
  • searchindex.json - Search metadata
  • CSS and JavaScript assets (theme + Mermaid)
  • Font files and icons

Sources: build-docs.sh:173-176 README.md:93-99

Step 8: Copy Outputs to Volume Mount

The final step copies all generated artifacts to the /output directory, which is typically mounted as a Docker volume. This makes the results accessible outside the container.

| Source | Destination | Contents |
|---|---|---|
| $BOOK_DIR/book/ | /output/book/ | Complete HTML documentation site |
| $WIKI_DIR/* | /output/markdown/ | Enhanced Markdown source files |
| $BOOK_DIR/book.toml | /output/book.toml | Configuration file (for reference) |

The script outputs a summary showing the locations of all artifacts:

Outputs:
  - HTML book:       /output/book/
  - Markdown files:  /output/markdown/
  - Book config:     /output/book.toml

The HTML book in /output/book/ is a self-contained static site that can be:

  • Served with any web server (e.g., python3 -m http.server)
  • Deployed to static hosting (GitHub Pages, Netlify, etc.)
  • Opened directly in a browser (file:// URLs)

Sources: build-docs.sh:178-205

Conditional Execution: Markdown-Only Mode

Phase 3 can be completely bypassed by setting the MARKDOWN_ONLY environment variable to "true". This provides a fast-path execution mode useful for debugging content extraction and diagram placement without the overhead of building HTML.

Execution Path Decision

graph TD
    Start["Phase 2 Complete"]
CheckVar{"MARKDOWN_ONLY\n== 'true'?"}
FullPath["Execute Full Phase 3:\n- Initialize mdBook\n- Generate configs\n- Build HTML\n- Copy all outputs"]
FastPath["Execute Fast Path:\n- mkdir -p /output/markdown\n- cp -r $WIKI_DIR/* /output/markdown/\n- Exit immediately"]
FullOutput["Outputs:\n- /output/book/ (HTML)\n- /output/markdown/ (source)\n- /output/book.toml (config)"]
FastOutput["Outputs:\n- /output/markdown/ (source only)"]
Start --> CheckVar
 
   CheckVar -->|false| FullPath
 
   CheckVar -->|true| FastPath
 
   FullPath --> FullOutput
 
   FastPath --> FastOutput

When MARKDOWN_ONLY="true":

  • Steps 2-7 are skipped entirely
  • Only the Markdown files are copied to /output/markdown/
  • Build time is significantly reduced (seconds vs. minutes)
  • Useful for iterating on diagram placement logic
  • No HTML output is generated

Sources: build-docs.sh:60-76 README.md:55-75

Error Handling and Requirements

Phase 3 executes with set -e enabled, causing the script to exit immediately if any command fails. This ensures partial builds are not created.

Potential Failure Points

| Step | Command | Failure Condition | Impact |
|---|---|---|---|
| Step 2 | `mkdir -p "$BOOK_DIR"` | Insufficient permissions | Cannot create workspace |
| Step 3 | `cat > book.toml` | Write permission denied | No configuration file |
| Step 4 | File discovery loop | No .md files found | Empty SUMMARY.md |
| Step 5 | `cp -r "$WIKI_DIR"/* src/` | Source directory empty | No content to build |
| Step 6 | `mdbook-mermaid install` | Binary not in PATH | Diagrams won't render |
| Step 7 | `mdbook build` | Invalid Markdown syntax | Build fails with error |
| Step 8 | `cp -r book "$OUTPUT_DIR/"` | Insufficient disk space | Incomplete output |

The multi-stage Docker build ensures both mdbook and mdbook-mermaid binaries are present in the final image. See Docker Multi-Stage Build for details on how these tools are compiled and installed.

Sources: build-docs.sh:1-2 build-docs.sh:168-176

graph TB
    EnvVars["Environment Variables\nREPO\nBOOK_TITLE\nBOOK_AUTHORS\nGIT_REPO_URL\nMARKDOWN_ONLY"]
BuildScript["build-docs.sh\n(orchestrator)"]
ScraperOutput["$WIKI_DIR/\n(Phase 1 & 2 output)"]
BookToml["book.toml\n(generated config)"]
SummaryMd["src/SUMMARY.md\n(generated TOC)"]
MdBookBinary["mdbook binary\n(Rust tool)"]
MermaidBinary["mdbook-mermaid binary\n(Rust preprocessor)"]
HTMLOutput["book/ directory\n(final HTML site)"]
VolumeMount["/output volume\n(Docker mount point)"]
EnvVars --> BuildScript
 
   ScraperOutput --> BuildScript
 
   BuildScript --> BookToml
 
   BuildScript --> SummaryMd
 
   BuildScript --> MdBookBinary
 
   MdBookBinary --> MermaidBinary
 
   BookToml --> MdBookBinary
 
   SummaryMd --> MdBookBinary
 
   ScraperOutput --> MdBookBinary
 
   MdBookBinary --> HTMLOutput
 
   BuildScript --> VolumeMount
 
   HTMLOutput --> VolumeMount

Integration with Other Components

Phase 3 integrates with multiple system components:

Component Integration Map

The phase acts as the bridge between the Python-based content extraction layers and the final deliverable, using Rust-based tools for the HTML generation. For details on how environment variables are processed, see Configuration Reference. For the complete system architecture, see System Architecture.

Sources: build-docs.sh:21-53 build-docs.sh:78-205


Configuration Generation

Relevant source files

Purpose and Scope

This document details how the book.toml configuration file is dynamically generated during Phase 3 of the build process. The configuration generation occurs in build-docs.sh:78-103 and uses environment variables, Git repository metadata, and computed defaults to produce a complete mdBook configuration.

For information about how the generated configuration is used by mdBook to build the documentation site, see mdBook Integration. For details about the SUMMARY.md generation that happens after configuration generation, see SUMMARY.md Generation.

Sources: build-docs.sh:1-206


Configuration Flow Overview

The configuration generation process transforms user-provided environment variables and auto-detected repository information into a complete book.toml file that controls all aspects of the mdBook build.

Configuration Data Flow

flowchart TB
    subgraph "Input Sources"
        ENV_REPO["Environment Variable:\nREPO"]
ENV_TITLE["Environment Variable:\nBOOK_TITLE"]
ENV_AUTHORS["Environment Variable:\nBOOK_AUTHORS"]
ENV_URL["Environment Variable:\nGIT_REPO_URL"]
GIT["Git Remote:\norigin URL"]
end
    
    subgraph "Processing in build-docs.sh"
        AUTO_DETECT["Auto-Detection Logic\n[8-19]"]
VAR_INIT["Variable Initialization\n[21-26]"]
VALIDATION["Repository Validation\n[32-37]"]
DEFAULTS["Default Computation\n[39-45]"]
end
    
    subgraph "Computed Values"
        FINAL_REPO["REPO:\nowner/repo"]
FINAL_TITLE["BOOK_TITLE:\nDocumentation"]
FINAL_AUTHORS["BOOK_AUTHORS:\nrepo owner"]
FINAL_URL["GIT_REPO_URL:\nhttps://github.com/owner/repo"]
REPO_PARTS["REPO_OWNER, REPO_NAME"]
end
    
    subgraph "book.toml Generation"
        BOOK_SECTION["[book] section\n[86-91]"]
HTML_SECTION["[output.html] section\n[93-95]"]
PREPROC_SECTION["[preprocessor.mermaid]\n[97-98]"]
FOLD_SECTION["[output.html.fold]\n[100-102]"]
end
    
    OUTPUT["book.toml file\nwritten to /workspace/book/"]
GIT -->|if REPO not set| AUTO_DETECT
 
   ENV_REPO -->|or explicit| AUTO_DETECT
 
   AUTO_DETECT --> VAR_INIT
    
 
   ENV_TITLE --> VAR_INIT
 
   ENV_AUTHORS --> VAR_INIT
 
   ENV_URL --> VAR_INIT
    
 
   VAR_INIT --> VALIDATION
 
   VALIDATION --> DEFAULTS
 
   VALIDATION --> REPO_PARTS
    
 
   DEFAULTS --> FINAL_REPO
 
   DEFAULTS --> FINAL_TITLE
 
   DEFAULTS --> FINAL_AUTHORS
 
   DEFAULTS --> FINAL_URL
 
   REPO_PARTS --> FINAL_AUTHORS
 
   REPO_PARTS --> FINAL_URL
    
 
   FINAL_TITLE --> BOOK_SECTION
 
   FINAL_AUTHORS --> BOOK_SECTION
 
   FINAL_URL --> HTML_SECTION
    
 
   BOOK_SECTION --> OUTPUT
 
   HTML_SECTION --> OUTPUT
 
   PREPROC_SECTION --> OUTPUT
 
   FOLD_SECTION --> OUTPUT

Sources: build-docs.sh:8-103


Environment Variables and Defaults

The configuration generation system processes five primary environment variables, each with intelligent defaults computed from the repository context.

Environment Variable Processing

| Variable | Purpose | Default Value | Computation Logic |
|---|---|---|---|
| REPO | Repository identifier (owner/repo) | Auto-detected from Git | Extracted from git config remote.origin.url build-docs.sh:8-19 |
| BOOK_TITLE | Title displayed in documentation | "Documentation" | Simple string default build-docs.sh:23 |
| BOOK_AUTHORS | Author name(s) in metadata | Repository owner | Extracted from REPO using `cut -d'/' -f1` build-docs.sh:40-44 |
| GIT_REPO_URL | Link to source repository | https://github.com/owner/repo | Constructed from REPO build-docs.sh:45 |
| MARKDOWN_ONLY | Skip mdBook build | "false" | Boolean flag build-docs.sh:26 |

Sources: build-docs.sh:21-45

Variable Initialization Code Structure

Sources: build-docs.sh:21-45


Auto-Detection Logic

When the REPO environment variable is not explicitly provided, the system attempts to auto-detect it from the Git repository configuration. This enables zero-configuration usage in CI/CD environments.

Git Remote URL Extraction

The auto-detection logic in build-docs.sh:8-19 performs the following steps:

  1. Check if running in a Git repository: Uses git rev-parse --git-dir to verify
  2. Extract remote URL: Retrieves remote.origin.url from Git config
  3. Parse GitHub URL: Uses sed regex to extract owner/repo from various URL formats

Supported GitHub URL Formats:

  • HTTPS: https://github.com/owner/repo.git
  • SSH: git@github.com:owner/repo.git
  • Without .git suffix: https://github.com/owner/repo

Auto-Detection Algorithm

Sources: build-docs.sh:8-19

Regex Pattern Details

The sed command at build-docs.sh:16 uses this regex pattern:

s#.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*#\1#

This pattern:

  • Matches github.com followed by either : (SSH) or / (HTTPS)
  • Captures the owner/repo portion: ([^/]+/[^/\.]+)
  • Optionally matches .git suffix: (\.git)?
  • Extracts only the owner/repo portion as the replacement

Sources: build-docs.sh16


book.toml Structure

The generated book.toml file contains four configuration sections that control mdBook's behavior. The file is created at build-docs.sh:84-103 using a here-document.

Configuration Sections

Sources: build-docs.sh:84-103

Section Details

[book] Section

Located at build-docs.sh:86-91 this section defines core book metadata:

| Field | Value | Source |
|---|---|---|
| title | Value of $BOOK_TITLE | Environment variable or default "Documentation" |
| authors | Array with $BOOK_AUTHORS | Computed from repository owner |
| language | "en" | Hardcoded English |
| multilingual | false | Hardcoded |
| src | "src" | mdBook convention for source directory |

Sources: build-docs.sh:86-91

[output.html] Section

Located at build-docs.sh:93-95 this section configures HTML output:

| Field | Value | Purpose |
|---|---|---|
| default-theme | "rust" | Uses mdBook's Rust theme for consistent styling |
| git-repository-url | Value of $GIT_REPO_URL | Creates "Edit this file on GitHub" links in the UI |

Sources: build-docs.sh:93-95

[preprocessor.mermaid] Section

Located at build-docs.sh:97-98 this section enables Mermaid diagram rendering:

| Field | Value | Purpose |
|---|---|---|
| command | "mdbook-mermaid" | Specifies the preprocessor binary to invoke |

This preprocessor is executed before mdBook processes the Markdown files, transforming Mermaid code blocks into rendered diagrams. The preprocessor binary is installed during the Docker build and its assets are installed at build-docs.sh:170-171

Sources: build-docs.sh:97-98

[output.html.fold] Section

Located at build-docs.sh:100-102 this section configures navigation behavior:

| Field | Value | Purpose |
|---|---|---|
| enable | true | Enables collapsible sections in the navigation sidebar |
| level | 1 | Folds sections at depth 1, keeping top-level sections visible |

Sources: build-docs.sh:100-102


Configuration Generation Process

The complete configuration generation process occurs within the orchestration logic of build-docs.sh. This diagram maps the process to specific code locations.

Code Execution Sequence

sequenceDiagram
    participant ENV as Environment Variables
    participant AUTO as Auto-Detection [8-19]
    participant INIT as Initialization [21-26]
    participant VAL as Validation [32-37]
    participant PARSE as Parsing [39-41]
    participant DEF as Defaults [44-45]
    participant LOG as Logging [47-53]
    participant GEN as Generation [84-103]
    participant FILE as book.toml
    
    ENV->>AUTO: Check if REPO set
    AUTO->>AUTO: Try git config
    AUTO->>INIT: Return REPO (or empty)
    
    INIT->>INIT: Set variable defaults
    INIT->>VAL: Pass to validation
    
    VAL->>VAL: Check REPO not empty
    alt REPO is empty
        VAL-->>ENV: Exit with error
    end
    
    VAL->>PARSE: Continue with valid REPO
    PARSE->>PARSE: Extract REPO_OWNER
    PARSE->>PARSE: Extract REPO_NAME
    
    PARSE->>DEF: Pass extracted parts
    DEF->>DEF: Compute BOOK_AUTHORS
    DEF->>DEF: Compute GIT_REPO_URL
    
    DEF->>LOG: Pass final config
    LOG->>LOG: Echo configuration
    
    Note over LOG,GEN: Scraper runs here [58]
    
    LOG->>GEN: Proceed to book.toml
    GEN->>GEN: Create [book] section
    GEN->>GEN: Create [output.html]
    GEN->>GEN: Create [preprocessor.mermaid]
    GEN->>GEN: Create [output.html.fold]
    
    GEN->>FILE: Write complete file

Sources: build-docs.sh:8-103


Template Interpolation

The book.toml file is generated using shell variable interpolation within a here-document (heredoc). This technique allows dynamic insertion of computed values into the template.

Here-Document Structure

The generation code at build-docs.sh:85-103 uses this structure:

cat > book.toml <<EOF
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]
...
EOF

The <<EOF syntax creates a here-document that:

  1. Allows multi-line content with preserved formatting
  2. Performs shell variable expansion (using $VARIABLE syntax)
  3. Writes the result to book.toml via redirection

Variable Interpolation Points

| Line | Variable | Context |
|---|---|---|
| 87 | $BOOK_TITLE | Used in `title = "$BOOK_TITLE"` |
| 88 | $BOOK_AUTHORS | Used in `authors = ["$BOOK_AUTHORS"]` |
| 95 | $GIT_REPO_URL | Used in `git-repository-url = "$GIT_REPO_URL"` |

The quotes around variable names ensure that values containing spaces are properly handled in the TOML format.

Sources: build-docs.sh:85-103


Output Location and Usage

The generated book.toml file is written to /workspace/book/book.toml within the Docker container. This location is significant because:

  1. Working Directory: The script changes to $BOOK_DIR at build-docs.sh82
  2. mdBook Convention: mdBook expects book.toml in the project root
  3. Build Process: The mdbook build command at build-docs.sh176 reads this file

File Lifecycle

The file is also copied to the output directory at build-docs.sh191 for user reference and debugging purposes.

Sources: build-docs.sh:82-191


Error Handling

Configuration generation includes validation to ensure required values are present before proceeding with the build.

Repository Validation

The validation logic at build-docs.sh:32-37 checks that REPO has a value:

stateDiagram-v2
    [*] --> AutoDetect : Script starts
    AutoDetect --> CheckEmpty : Lines 8-19
    CheckEmpty --> Error : REPO still empty
    CheckEmpty --> ParseRepo : REPO has value
    
    Error --> [*] : Exit code 1
    
    ParseRepo --> ComputeDefaults : Lines 39-45
    ComputeDefaults --> LogConfig : Lines 47-53
    LogConfig --> GenerateConfig : Lines 84-103
    GenerateConfig --> [*] : Continue build

This validation occurs after auto-detection attempts, so the error message guides users to either:

  1. Set the REPO environment variable explicitly, or
  2. Run the command from within a Git repository with a GitHub remote configured

Validation Timing

Sources: build-docs.sh:32-37


Configuration Output Example

Based on the generation logic and the defaults described above, a complete generated book.toml for the repository jzombie/deepwiki-to-mdbook (with no overrides) would look like this:

[book]
title = "Documentation"
authors = ["jzombie"]
language = "en"
multilingual = false
src = "src"

[output.html]
default-theme = "rust"
git-repository-url = "https://github.com/jzombie/deepwiki-to-mdbook"

[preprocessor.mermaid]
command = "mdbook-mermaid"

[output.html.fold]
enable = true
level = 1

This example demonstrates:

  • Default title when BOOK_TITLE is not set
  • Author extracted from repository owner
  • Computed Git repository URL
  • All static configuration values

Sources: build-docs.sh:84-103


Integration with mdBook Build

The generated configuration is consumed by mdBook during the build process. The integration points are:

  1. mdbook-mermaid install at build-docs.sh171 reads [preprocessor.mermaid]
  2. mdbook build at build-docs.sh176 reads all sections
  3. HTML output uses [output.html] settings for theming and repository links
  4. Navigation rendering uses [output.html.fold] for sidebar behavior

Configuration-Driven Features

| Configuration | Visible Result |
|---|---|
| git-repository-url | "Suggest an edit" button in top-right of each page |
| default-theme = "rust" | Consistent color scheme and typography matching Rust documentation |
| [preprocessor.mermaid] | Mermaid code blocks rendered as interactive diagrams |
| enable = true (folding) | Collapsible sections in left sidebar navigation |

For details on how mdBook processes this configuration to build the final documentation site, see mdBook Integration.

Sources: build-docs.sh:84-176


SUMMARY.md Generation

Relevant source files

Purpose and Scope

This document explains how the SUMMARY.md file is dynamically generated from the scraped markdown content structure. The SUMMARY.md file serves as mdBook's table of contents, defining the navigation structure and page hierarchy for the generated HTML documentation.

For information about how the markdown files are initially organized during scraping, see Wiki Structure Discovery. For details about the overall mdBook build configuration, see Configuration Generation.

SUMMARY.md in mdBook

The SUMMARY.md file is mdBook's primary navigation document. It defines:

  • The order of pages in the documentation
  • The hierarchical structure (chapters and sub-chapters)
  • The titles displayed in the navigation sidebar
  • Which markdown files map to which sections

mdBook parses SUMMARY.md to construct the entire book structure. Pages not listed in SUMMARY.md will not be included in the generated documentation.

Sources: build-docs.sh:108-161

Generation Process Overview

The SUMMARY.md generation occurs in Step 3 of the build pipeline, after markdown extraction is complete but before the mdBook build begins. The generation algorithm automatically discovers the file structure and constructs an appropriate table of contents.

Diagram: SUMMARY.md Generation Workflow

flowchart TD
    Start["Start: SUMMARY.md Generation\n(build-docs.sh:110)"]
Init["Initialize Output\nEcho '# Summary'"]
FindFirst["Find First Page\nls $WIKI_DIR/*.md /head -1"]
ExtractFirst["Extract Title head -1 $file/ sed 's/^# //'"]
AddFirst["Add as Introduction\n[title](filename)"]
LoopStart["Iterate: $WIKI_DIR/*.md"]
Skip{"Is first_page?"}
ExtractTitle["Extract Title from First Line\nhead -1 $file /sed 's/^# //'"]
GetSectionNum["Extract section_num grep -oE '^[0-9]+'"]
CheckDir{"Directory Exists? section-$section_num/"}
MainWithSub["Output Section Header # $title - [$title] $filename"]
IterateSub["Iterate: section-$section_num/*.md"]
AddSub["Add Subsection - [$subtitle] section-N/$subfilename"]
Standalone["Output Standalone - [$title] $filename"]
LoopEnd{"More Files?"}
Complete["Write to src/SUMMARY.md"]
End["End: SUMMARY.md Generated"]
Start --> Init
 Init --> FindFirst
 FindFirst --> ExtractFirst
 ExtractFirst --> AddFirst
 AddFirst --> LoopStart
 LoopStart --> Skip
 Skip -->|Yes|LoopEnd
 Skip -->|No|ExtractTitle
 ExtractTitle --> GetSectionNum
 GetSectionNum --> CheckDir
 CheckDir -->|Yes|MainWithSub
 MainWithSub --> IterateSub
 IterateSub --> AddSub
 AddSub --> LoopEnd
 CheckDir -->|No|Standalone
 Standalone --> LoopEnd
 LoopEnd -->|Yes|LoopStart
 LoopEnd -->|No| Complete
 
   Complete --> End

Sources: build-docs.sh:108-161

Algorithm Components

Step 1: Introduction Page Selection

The first markdown file in the wiki directory is designated as the introduction page. This ensures the documentation has a clear entry point.

Diagram: First Page Processing Pipeline

flowchart LR
    ListFiles["ls $WIKI_DIR/*.md"]
TakeFirst["head -1"]
GetBasename["basename"]
StoreName["first_page variable"]
ExtractTitle["head -1 $WIKI_DIR/$first_page"]
RemoveHash["sed 's/^# //'"]
StoreTitle["title variable"]
WriteEntry["echo '[${title}]($first_page)'"]
ToSummary[">> src/SUMMARY.md"]
ListFiles --> TakeFirst
 
   TakeFirst --> GetBasename
 
   GetBasename --> StoreName
    
 
   StoreName --> ExtractTitle
 
   ExtractTitle --> RemoveHash
 
   RemoveHash --> StoreTitle
    
 
   StoreTitle --> WriteEntry
 
   WriteEntry --> ToSummary

The implementation uses shell command chaining:

  • ls "$WIKI_DIR"/*.md 2>/dev/null | head -1 | xargs basename extracts the first filename
  • head -1 "$WIKI_DIR/$first_page" | sed 's/^# //' extracts the title by removing the leading # from the first line

Sources: build-docs.sh:118-123

Step 2: Main Page Iteration

All markdown files in the root wiki directory are processed sequentially. Each file represents either a standalone page or a main section with subsections.

| Processing Step | Command/Logic | Purpose |
|---|---|---|
| File Discovery | `for file in "$WIKI_DIR"/*.md` | Iterate all root-level markdown files |
| File Check | `[ -f "$file" ]` | Verify file existence |
| Basename Extraction | `basename "$file"` | Get filename without path |
| First Page Skip | `[ "$filename" = "$first_page" ]` | Avoid duplicate introduction |
| Title Extraction | `head -1 "$file" \| sed 's/^# //'` | Extract the page title from its first heading |

Sources: build-docs.sh:126-135

Step 3: Subsection Detection

The algorithm determines whether a main page has subsections by:

  1. Extracting the numeric prefix from the filename (e.g., 5 from 5-component-reference.md)
  2. Checking if a corresponding section-N/ directory exists
  3. If found, treating the page as a main section with nested subsections
flowchart TD
    MainPage["Main Page File\ne.g., 5-component-reference.md"]
ExtractNum["Extract section_num\necho $filename /grep -oE '^[0-9]+'"]
HasNum{"Numeric Prefix?"}
BuildPath["Construct section_dir $WIKI_DIR/section-$section_num"]
CheckDir["Check Directory [ -d $section_dir ]"]
DirExists{"Directory Exists?"}
OutputHeader["Output Section Header # $title"]
OutputMain["Output Main Link - [$title] $filename"]
IterateSubs["for subfile in $section_dir/*.md"]
ExtractSubTitle["head -1 $subfile/ sed 's/^# //'"]
OutputSub["Output Subsection\n - [$subtitle](section-N/$subfilename)"]
OutputStandalone["Output Standalone\n- [$title]($filename)"]
MainPage --> ExtractNum
 
   ExtractNum --> HasNum
    
 
   HasNum -->|Yes| BuildPath
 
   HasNum -->|No| OutputStandalone
    
 
   BuildPath --> CheckDir
 
   CheckDir --> DirExists
    
 
   DirExists -->|Yes| OutputHeader
 
   OutputHeader --> OutputMain
 
   OutputMain --> IterateSubs
 
   IterateSubs --> ExtractSubTitle
 
   ExtractSubTitle --> OutputSub
    
 
   DirExists -->|No| OutputStandalone

Diagram: Subsection Detection and Nesting Logic

Sources: build-docs.sh:137-158

Step 4: Subsection Processing

When a section-N/ directory is detected, all markdown files within it are processed as subsections:

Key aspects:

  • Subsections use two-space indentation: `  - [subtitle](section-N/filename)`
  • File paths include the section-N/ directory prefix
  • Each subsection's title is extracted using the same pattern as main pages

Sources: build-docs.sh:147-152
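
For illustration only, the same subsection logic expressed as a Python sketch (the actual implementation is the shell loop in build-docs.sh, which also handles the introduction-page special case):

```python
import re
from pathlib import Path

def summary_entries(wiki_dir):
    """Sketch of the SUMMARY.md entry logic, not the real shell implementation."""
    lines = ["# Summary", ""]
    for page in sorted(Path(wiki_dir).glob("*.md")):
        title = page.read_text().splitlines()[0].lstrip("# ")
        match = re.match(r"^(\d+)", page.name)            # numeric prefix, e.g. "5"
        section_dir = Path(wiki_dir) / f"section-{match.group(1)}" if match else None
        if section_dir and section_dir.is_dir():
            # Main section with nested subsections
            lines += [f"# {title}", f"- [{title}]({page.name})"]
            for sub in sorted(section_dir.glob("*.md")):
                sub_title = sub.read_text().splitlines()[0].lstrip("# ")
                lines.append(f"  - [{sub_title}]({section_dir.name}/{sub.name})")
        else:
            # Standalone page
            lines.append(f"- [{title}]({page.name})")
    return "\n".join(lines)
```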

File Structure Conventions

The generation algorithm depends on the file structure created during markdown extraction (see Wiki Structure Discovery):

Diagram: File Structure Conventions for SUMMARY.md Generation

| Pattern | Location | SUMMARY.md Output |
|---|---|---|
| *.md | Root directory | Main pages |
| N-*.md | Root directory | Main section (if section-N/ exists) |
| *.md | section-N/ directory | Subsections (indented under section N) |

Sources: build-docs.sh:126-158

Title Extraction Method

All page titles are extracted using a consistent pattern: `head -1 "$file" | sed 's/^# //'`.

This assumes that every markdown file begins with a level-1 heading (# Title). The sed command removes the # prefix, leaving only the title text.

Extraction Pipeline:

| Command | Purpose | Example Input | Example Output |
|---|---|---|---|
| `head -1 "$file"` | Get first line | # Component Reference | # Component Reference |
| `sed 's/^# //'` | Remove heading syntax | # Component Reference | Component Reference |

Sources: build-docs.sh120 build-docs.sh134 build-docs.sh150

Output Format

The generated SUMMARY.md follows mdBook's syntax:

Format Rules:

| Element | Syntax | Purpose |
|---|---|---|
| Header | `# Summary` | Required mdBook header |
| Introduction | `[Title](file.md)` | First page (no bullet) |
| Main Page | `- [Title](file.md)` | Top-level navigation item |
| Section Header | `# Section Name` | Visual grouping in sidebar |
| Subsection | `  - [Title](section-N/file.md)` | Nested under main section (2-space indent) |

Sources: build-docs.sh:113-159

Implementation Code Mapping

The following table maps the algorithm steps to specific code locations:

Diagram: Code Location Mapping for SUMMARY.md Generation

| Variable Name | Purpose | Example Value |
|---|---|---|
| WIKI_DIR | Source directory for markdown files | /workspace/wiki |
| first_page | First markdown file (introduction) | 1-overview.md |
| section_num | Numeric prefix of main page | 5 (from 5-component-reference.md) |
| section_dir | Subsection directory path | /workspace/wiki/section-5 |
| title | Extracted page title | Component Reference |
| subtitle | Extracted subsection title | build-docs.sh Orchestrator |

Sources: build-docs.sh:108-161

Generation Statistics

After generation completes, the script counts the generated entries with `grep -c '\[' src/SUMMARY.md` and logs the total.


Output Structure

Relevant source files

This page documents the structure and contents of the /output directory produced by the DeepWiki-to-mdBook converter. The output structure varies depending on whether the system runs in full build mode or markdown-only mode. For information about enabling markdown-only mode, see Markdown-Only Mode.

Output Directory Overview

The system writes all artifacts to the /output directory, which is typically mounted as a Docker volume. The contents of this directory depend on the MARKDOWN_ONLY environment variable:

Output Mode Decision Logic

graph TD
    Start["build-docs.sh execution"]
CheckMode{MARKDOWN_ONLY\nenvironment variable}
FullBuild["Full Build Path"]
MarkdownOnly["Markdown-Only Path"]
OutputBook["/output/book/\nHTML documentation"]
OutputMarkdown["/output/markdown/\nSource .md files"]
OutputToml["/output/book.toml\nConfiguration"]
Start --> CheckMode
 
   CheckMode -->|false default| FullBuild
 
   CheckMode -->|true| MarkdownOnly
    
 
   FullBuild --> OutputBook
 
   FullBuild --> OutputMarkdown
 
   FullBuild --> OutputToml
    
 
   MarkdownOnly --> OutputMarkdown

Sources: build-docs.sh26 build-docs.sh:60-76

Full Build Mode Output

When MARKDOWN_ONLY is not set or is false, the system produces three distinct outputs:

graph TD
    Output["/output/"]
Book["book/\nComplete HTML site"]
Markdown["markdown/\nSource files"]
BookToml["book.toml\nConfiguration"]
BookIndex["index.html"]
BookCSS["css/"]
BookJS["FontAwesome/"]
BookSearchJS["searchindex.js"]
BookMermaid["mermaid-init.js"]
BookPages["*.html pages"]
MarkdownRoot["*.md files\n(main pages)"]
MarkdownSections["section-N/\n(subsection dirs)"]
Output --> Book
 
   Output --> Markdown
 
   Output --> BookToml
    
 
   Book --> BookIndex
 
   Book --> BookCSS
 
   Book --> BookJS
 
   Book --> BookSearchJS
 
   Book --> BookMermaid
 
   Book --> BookPages
    
 
   Markdown --> MarkdownRoot
 
   Markdown --> MarkdownSections

Directory Structure

Full Build Output Structure

Sources: build-docs.sh:178-192 README.md:92-104

/output/book/ Directory

The book/ directory contains the complete HTML documentation site generated by mdBook. This is a self-contained static website that can be hosted on any web server or opened directly in a browser.

| Component | Description | Generated By |
|---|---|---|
| index.html | Main entry point for the documentation | mdBook |
| *.html | Individual page files corresponding to each .md source | mdBook |
| css/ | Styling for the rust theme | mdBook |
| FontAwesome/ | Icon font assets | mdBook |
| searchindex.js | Search index for site-wide search functionality | mdBook |
| mermaid.min.js | Mermaid diagram rendering library | mdbook-mermaid |
| mermaid-init.js | Mermaid initialization script | mdbook-mermaid |

The HTML site includes:

  • Responsive navigation sidebar with hierarchical structure
  • Full-text search functionality
  • Syntax highlighting for code blocks
  • Working Mermaid diagram rendering
  • "Edit this page" links pointing to GIT_REPO_URL
  • Collapsible sections in the navigation

Sources: build-docs.sh:173-176 build-docs.sh:94-95 README.md:95-99

/output/markdown/ Directory

The markdown/ directory contains the source Markdown files extracted from DeepWiki and enhanced with Mermaid diagrams. These files follow a specific naming convention and organizational structure.

File Naming Convention:

<page-number>-<page-title-slug>.md

Examples from actual output:

  • 1-overview.md
  • 2-1-workspace-and-crates.md
  • 3-2-sql-parser.md

Subsection Organization:

Pages with subsections have their children organized into directories:

section-N/
  N-1-first-subsection.md
  N-2-second-subsection.md
  ...

For example, if page 4-architecture.md has subsections, they appear in:

section-4/
  4-1-overview.md
  4-2-components.md

This organization is reflected in the mdBook SUMMARY.md generation logic at build-docs.sh:125-159

Sources: README.md:100-119 build-docs.sh:163-166 build-docs.sh:186-188

/output/book.toml File

The book.toml file is a copy of the mdBook configuration used to generate the HTML site. It contains the [book] metadata, the [output.html] theme and repository-link settings, the [preprocessor.mermaid] declaration, and the [output.html.fold] navigation options.

This file can be used to:

  • Understand the configuration used for the build
  • Regenerate the book with different settings
  • Debug mdBook configuration issues

Sources: build-docs.sh:84-103 build-docs.sh:190-191

Markdown-Only Mode Output

When MARKDOWN_ONLY=true, the system produces only the /output/markdown/ directory. This mode skips the mdBook build phase entirely.

Markdown-Only Mode Data Flow

graph LR
    Scraper["deepwiki-scraper.py"]
TempDir["/workspace/wiki/\nTemporary directory"]
OutputMarkdown["/output/markdown/\nFinal output"]
Scraper -->|Writes enhanced .md files| TempDir
 
   TempDir -->|cp -r| OutputMarkdown

The output structure is identical to the markdown/ directory in full build mode, but the book/ and book.toml artifacts are not created.

Sources: build-docs.sh:60-76 README.md:106-113

graph TD
    subgraph "Phase 1: Scraping"
        Scraper["deepwiki-scraper.py"]
WikiDir["/workspace/wiki/"]
end
    
    subgraph "Phase 2: Decision Point"
        CheckMode{MARKDOWN_ONLY\ncheck}
end
    
    subgraph "Phase 3: mdBook Build (conditional)"
        BookInit["Initialize /workspace/book/"]
GenToml["Generate book.toml"]
GenSummary["Generate SUMMARY.md"]
CopyToSrc["cp wiki/* book/src/"]
MermaidInstall["mdbook-mermaid install"]
MdBookBuild["mdbook build"]
BuildOutput["/workspace/book/book/"]
end
    
    subgraph "Phase 4: Copy to Output"
        CopyBook["cp -r book /output/"]
CopyMarkdown["cp -r wiki /output/markdown/"]
CopyToml["cp book.toml /output/"]
end
    
 
   Scraper -->|Writes to| WikiDir
 
   WikiDir --> CheckMode
    
 
   CheckMode -->|false| BookInit
 
   CheckMode -->|true| CopyMarkdown
    
 
   BookInit --> GenToml
 
   GenToml --> GenSummary
 
   GenSummary --> CopyToSrc
 
   CopyToSrc --> MermaidInstall
 
   MermaidInstall --> MdBookBuild
 
   MdBookBuild --> BuildOutput
    
 
   BuildOutput --> CopyBook
 
   WikiDir --> CopyMarkdown
 
   GenToml --> CopyToml

Output Generation Process

The following diagram shows how each output artifact is generated during the build process:

Complete Output Generation Pipeline

Sources: build-docs.sh:55-205

File Naming Examples

The following table shows actual filename patterns produced by the system:

| Pattern | Example | Description |
|---|---|---|
| N-title.md | 1-overview.md | Main page without subsections |
| N-M-title.md | 2-1-workspace-and-crates.md | Subsection file in root (legacy format) |
| section-N/N-M-title.md | section-4/4-1-logical-planning.md | Subsection file in section directory |

The system automatically detects which pages have subsections by examining the numeric prefix and checking for corresponding section-N/ directories during SUMMARY.md generation.

Sources: build-docs.sh:125-159 README.md:115-119

Volume Mounting

The /output directory is designed to be mounted as a Docker volume; the typical docker run command includes a bind mount such as `-v $(pwd)/output:/output`.

This mounts the host's ./output directory to the container's /output directory, making all generated artifacts accessible on the host filesystem after the container exits.

Sources: README.md:34-38 README.md:83-86

Output Size Characteristics

The output directory typically contains:

  • Markdown files : 10-500 KB per page depending on content length and diagram count
  • HTML book : 5-50 MB total depending on page count and assets
  • book.toml : ~500 bytes

For a typical repository with 20-30 documentation pages, expect:

  • markdown/: 5-15 MB
  • book/: 10-30 MB (includes all HTML, CSS, JS, and search index)
  • book.toml: < 1 KB

The HTML book is significantly larger than the markdown source because it includes:

  • Complete mdBook framework (CSS, JavaScript)
  • Search index (searchindex.js)
  • Mermaid rendering library (mermaid.min.js)
  • Font assets (FontAwesome)
  • Generated HTML for each page with navigation

Sources: build-docs.sh:178-205

Serving the Output

The HTML documentation in /output/book/ can be served using any static web server, for example by running `python3 -m http.server` from inside the book/ directory.

The markdown files in /output/markdown/ can be:

  • Committed to a Git repository
  • Used as input for other documentation systems
  • Edited and re-processed through mdBook manually
  • Served directly by markdown-aware platforms like GitHub

Sources: README.md:83-86 build-docs.sh:203-204


Advanced Topics

Relevant source files

This page covers advanced usage scenarios, implementation details, and power-user features of the DeepWiki-to-mdBook Converter. It provides deeper insight into optional features, debugging techniques, and the internal mechanisms that enable the system's flexibility and robustness.

For basic usage and configuration, see Quick Start and Configuration Reference. For architectural overview, see System Architecture. For component-level details, see Component Reference.

When to Use Advanced Features

The system provides several advanced features designed for specific scenarios:

Markdown-Only Mode : Extract markdown without building the HTML documentation. Useful for:

  • Debugging diagram placement and content extraction
  • Quick iteration during development
  • Creating markdown archives for version control
  • Feeding extracted content into other tools

Auto-Detection : Automatically determine repository metadata from Git remotes. Useful for:

  • CI/CD pipeline integration with minimal configuration
  • Running from within a repository checkout
  • Reducing configuration boilerplate

Custom Configuration : Override default behaviors through environment variables. Useful for:

  • Multi-repository documentation builds
  • Custom branding and themes
  • Specialized output requirements

Decision Flow for Build Modes

Sources: build-docs.sh:60-76 build-docs.sh:78-206 README.md:55-76

Debugging Strategies

Using Markdown-Only Mode for Fast Iteration

The MARKDOWN_ONLY environment variable bypasses the mdBook build phase, reducing build time from minutes to seconds. This is controlled by a simple conditional check in the orchestration script.

Workflow:

  1. Set MARKDOWN_ONLY=true in Docker run command
  2. Script executes build-docs.sh:60-76 which skips Steps 2-6
  3. Only Phase 1 (scraping) and Phase 2 (diagram enhancement) execute
  4. Output written directly to /output/markdown/

A typical debugging session runs the container with MARKDOWN_ONLY=true, inspects the files under /output/markdown/, and repeats until content and diagram placement look correct.

The check at build-docs.sh:61 determines whether the script exits early after the markdown copy.

For detailed information about this mode, see Markdown-Only Mode.

Sources: build-docs.sh:60-76 build-docs.sh26 README.md:55-76

Inspecting Intermediate Outputs

The system uses a temporary directory workflow that can be examined for debugging:

| Stage | Location | Contents |
|---|---|---|
| During Phase 1 | /workspace/wiki/ (temp) | Raw markdown before diagram enhancement |
| During Phase 2 | /workspace/wiki/ (temp) | Markdown with injected diagrams |
| During Phase 3 | /workspace/book/src/ | Markdown copied for mdBook |
| Final Output | /output/markdown/ | Final enhanced markdown files |

The temporary directory pattern is implemented using Python's tempfile.TemporaryDirectory at tools/deepwiki-scraper.py:808.
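
A minimal sketch of the pattern (simplified, not the verbatim code from the script):

```python
import tempfile
from pathlib import Path

# Everything is written under temp_dir and only copied to the real output
# location once all phases succeed; the directory is removed on exit.
with tempfile.TemporaryDirectory() as temp_dir:
    wiki_dir = Path(temp_dir) / "wiki"
    wiki_dir.mkdir()
    # ... scrape pages and enhance diagrams into wiki_dir ...
    # ... copy wiki_dir to the final output directory as the last step ...
```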

This ensures atomic operations—if the script fails mid-process, partial outputs are automatically cleaned up.

Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:27-30

Diagram Placement Debugging

Diagram injection uses fuzzy matching with progressive chunk sizes. To debug placement:

  1. Check raw extraction count : Look for console output "Found N total diagrams"
  2. Check context extraction : Look for "Found N diagrams with context"
  3. Check matching : Look for "Enhanced X files with diagrams"

The matching algorithm at tools/deepwiki-scraper.py:716-730 tries progressively smaller chunks of the extracted context until it finds a match.

Debugging poor matches:

  • If too few diagrams placed: The context from JavaScript may not match converted markdown
  • If diagrams in wrong locations: Context text may appear in multiple locations
  • If no diagrams: Repository may not contain mermaid diagrams

Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:216-331

Link Rewriting Internals

DeepWiki uses absolute URLs like /owner/repo/2-1-subsection. The scraper must convert these to relative markdown paths that work in the mdBook file hierarchy:

output/markdown/
├── 1-overview.md
├── 2-main-section.md
├── section-2/
│   ├── 2-1-subsection.md
│   └── 2-2-another.md
└── 3-next-section.md

Links must account for:

  • Source page location (main page vs. subsection)
  • Target page location (main page vs. subsection)
  • Same section vs. cross-section links

Sources: tools/deepwiki-scraper.py:549-593

The fix_wiki_link function at tools/deepwiki-scraper.py:549-589 implements this logic in three stages: parsing the incoming /owner/repo/... URL, detecting whether the source and target pages live in the root directory or in a section-N/ subdirectory, and generating the relative path.

Path generation rules:

| Source Location | Target Location | Generated Path | Example |
|---|---|---|---|
| Main page | Main page | file.md | 3-next.md |
| Main page | Subsection | section-N/file.md | section-2/2-1-sub.md |
| Subsection | Main page | ../file.md | ../3-next.md |
| Subsection (same section) | Subsection | file.md | 2-2-another.md |
| Subsection (diff section) | Subsection | section-N/file.md | section-3/3-1-sub.md |

The regex replacement at tools/deepwiki-scraper.py:592 applies this transformation to every wiki link found in the page content.
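
A Python sketch of the rules in the table above; the function name and parameters here are illustrative, not the real fix_wiki_link signature:

```python
def relative_link(source_is_subsection, source_section,
                  target_is_subsection, target_section, target_file):
    """Sketch of the path-generation rules (illustrative parameter names)."""
    if not source_is_subsection:
        # main page -> main page, or main page -> subsection
        return f"section-{target_section}/{target_file}" if target_is_subsection else target_file
    if not target_is_subsection:
        return f"../{target_file}"                        # subsection -> main page
    if target_section == source_section:
        return target_file                                # same section-N/ directory
    return f"section-{target_section}/{target_file}"      # different section (per table above)
```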

For detailed explanation, see Link Rewriting Logic.

Sources: tools/deepwiki-scraper.py:549-593

Auto-Detection Mechanisms

flowchart TD
 
   Start["build-docs.sh starts"] --> CheckRepo{"REPO env var\nprovided?"}
CheckRepo -->|Yes| UseEnv["Use provided REPO value"]
CheckRepo -->|No| CheckGit{"Is current directory\na Git repository?\n(git rev-parse --git-dir)"}
CheckGit -->|Yes| GetRemote["Get remote.origin.url:\ngit config --get\nremote.origin.url"]
CheckGit -->|No| SetEmpty["Set REPO=<empty>"]
GetRemote --> HasRemote{"Remote URL\nfound?"}
HasRemote -->|Yes| ParseURL["Parse GitHub URL using sed regex:\nExtract owner/repo"]
HasRemote -->|No| SetEmpty
    
 
   ParseURL --> ValidateFormat{"Format is\nowner/repo?"}
ValidateFormat -->|Yes| SetRepo["Set REPO variable"]
ValidateFormat -->|No| SetEmpty
    
 
   SetEmpty --> FinalCheck{"REPO is empty?"}
UseEnv --> Continue["Continue with REPO"]
SetRepo --> Continue
    
 
   FinalCheck -->|Yes| Error["ERROR: REPO must be set\nExit with code 1"]
FinalCheck -->|No| Continue

Git Remote Auto-Detection

When REPO environment variable is not provided, the system attempts to auto-detect it from the Git repository in the current working directory.

Sources: build-docs.sh:8-37

Implementation Details

The auto-detection logic at build-docs.sh:8-19 handles multiple Git URL formats:

Supported URL formats:

  • HTTPS: https://github.com/owner/repo.git
  • HTTPS (no .git): https://github.com/owner/repo
  • SSH: git@github.com:owner/repo.git
  • SSH (no .git): git@github.com:owner/repo

The regex pattern .*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.* captures:

  • [:/] - Matches either : (SSH) or / (HTTPS)
  • ([^/]+/[^/\.]+) - Captures owner/repo (stops at / or .)
  • (\.git)? - Optionally matches .git suffix

Derived defaults:

After determining REPO, the script derives other configuration at build-docs.sh:39-45:

This provides sensible defaults:

  • BOOK_AUTHORS defaults to repository owner
  • GIT_REPO_URL defaults to GitHub URL (for "Edit this page" links)

For detailed explanation, see Auto-Detection Features.

Sources: build-docs.sh:8-45 README.md:47-53

Performance Considerations

Build Time Breakdown

Typical build times for a medium-sized repository (50-100 pages):

| Phase | Time | Bottleneck |
|-------|------|------------|
| Phase 1: Scraping | 60-120s | Network requests + 1s delays |
| Phase 2: Diagrams | 5-10s | Regex matching + file I/O |
| Phase 3: mdBook | 10-20s | Rust compilation + mermaid assets |
| Total | 75-150s | Network + computation |

Optimization Strategies

Network optimization:

Markdown-only mode:

  • Skips Phase 3 entirely, reducing build time by ~15-25%
  • Useful for content-only iterations

Docker build optimization:

  • Multi-stage build discards Rust toolchain (~1.5 GB)
  • Final image only contains binaries (~300-400 MB)
  • See Docker Multi-Stage Build for details

Caching considerations:

  • No internal caching—each run fetches fresh content
  • DeepWiki serves dynamic content (no cache headers)
  • Docker layer caching helps with repeated image builds

Sources: tools/deepwiki-scraper.py:28-42 tools/deepwiki-scraper.py:817-821 tools/deepwiki-scraper.py872

Extending the System

Adding New Output Formats

The system's three-phase architecture makes it easy to add new output formats:

Integration points:

  1. Before Phase 3: Add code after build-docs.sh188 to read from $WIKI_DIR
  2. Alternative Phase 3: Replace build-docs.sh:174-176 with custom builder
  3. Post-processing: Add steps after build-docs.sh192 to transform mdBook output

Example: Adding PDF export:
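The original example is not preserved here. As a rough sketch, assuming pandoc (not part of the default image) were installed, a post-processing step could look like this:

```bash
# Sketch only: pandoc is NOT part of the default image and would need to be installed.
# Glob ordering is simplistic; a real implementation should follow SUMMARY.md order.
cat output/markdown/*.md output/markdown/section-*/*.md > /tmp/combined.md
pandoc /tmp/combined.md -o output/documentation.pdf
```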

Sources: build-docs.sh:174-206

Customizing Diagram Matching

The fuzzy matching algorithm can be tuned by modifying the chunk sizes at tools/deepwiki-scraper.py716:

Matching strategy customization:

The scoring system at tools/deepwiki-scraper.py:709-745 prioritizes:

  1. Anchor text matching (weighted by chunk size)
  2. Heading matching (weight: 50)

You can add additional heuristics by modifying the scoring logic or adding new matching strategies.

Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:716-745

Adding New Content Cleaners

The HTML-to-markdown conversion can be enhanced by adding custom cleaners at tools/deepwiki-scraper.py:489-511:

The footer cleaner at tools/deepwiki-scraper.py:127-173 can be extended with additional patterns:

Sources: tools/deepwiki-scraper.py:127-173 tools/deepwiki-scraper.py:466-511

Common Advanced Scenarios

CI/CD Integration

GitHub Actions example:
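The original workflow snippet is not reproduced here. As a sketch of a CI script step, assuming the job passes the repository explicitly (GITHUB_REPOSITORY is the standard owner/repo variable provided by GitHub Actions):

```bash
# Example CI step (e.g., inside a GitHub Actions run: block)
docker run --rm \
  -e REPO="$GITHUB_REPOSITORY" \
  -e BOOK_TITLE="My Project Documentation" \
  -v "$PWD/output:/output" \
  deepwiki-scraper
```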

When the script runs where the Git checkout is visible, the auto-detection at build-docs.sh:8-19 can determine REPO from Git context; otherwise pass REPO explicitly as above. BOOK_TITLE overrides the default title.

Sources: build-docs.sh:8-45 README.md:228-232

Multi-Repository Builds

Build documentation for multiple repositories in parallel:
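A minimal sketch, assuming the image is tagged deepwiki-scraper and each build gets its own output directory:

```bash
# Run one container per repository, each writing to its own output directory
for repo in owner/project-a owner/project-b owner/project-c; do
  name=$(echo "$repo" | tr '/' '-')
  docker run --rm \
    -e REPO="$repo" \
    -v "$PWD/output-$name:/output" \
    deepwiki-scraper &
done
wait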

Each build runs in an isolated container with separate output directories.

Sources: build-docs.sh:21-53 README.md:200-207

Custom Theming

Override mdBook theme by modifying the generated book.toml template at build-docs.sh:85-103:

Or inject custom CSS:
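A sketch of the kind of change involved, assuming a hypothetical custom.css file and using mdBook's standard additional-css option:

```bash
# Copy the stylesheet next to the generated book.toml, then extend the
# [output.html] section of the heredoc at build-docs.sh:85-103 with:
#   additional-css = ["custom.css"]
cp custom.css "$BOOK_DIR/custom.css"
```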

Sources: build-docs.sh:84-103


Markdown-Only Mode

Relevant source files

Purpose and Scope

This document describes the Markdown-Only Mode feature, which provides a fast-path execution mode that bypasses Phase 3 of the processing pipeline. This mode is primarily used for debugging content extraction and diagram placement without waiting for the full mdBook HTML build. For information about the complete three-phase pipeline, see Three-Phase Pipeline. For details about the full output structure, see Output Structure.

Overview

Markdown-Only Mode is a configuration option that terminates the build process after completing Phase 1 (Markdown Extraction) and Phase 2 (Diagram Enhancement), skipping the computationally expensive Phase 3 (mdBook Build). This mode produces only the enhanced Markdown files without generating the final HTML documentation site.

The mode is controlled by the MARKDOWN_ONLY environment variable, which defaults to false for complete builds.

Sources: README.md:55-72 build-docs.sh26

Configuration

The mode is activated by setting the MARKDOWN_ONLY environment variable to "true":

| Variable | Value | Effect |
|----------|-------|--------|
| MARKDOWN_ONLY | "true" | Skip Phase 3, output only markdown files |
| MARKDOWN_ONLY | "false" (default) | Complete all three phases, generate full HTML site |

Sources: build-docs.sh26 README.md:45-51

Processing Pipeline Comparison

Full Build vs. Markdown-Only Mode Decision Flow

Sources: build-docs.sh:1-206

Phase Execution Matrix

| Phase | Full Build | Markdown-Only Mode |
|-------|------------|--------------------|
| Phase 1: Markdown Extraction | ✓ Executed | ✓ Executed |
| Phase 2: Diagram Enhancement | ✓ Executed | ✓ Executed |
| Phase 3: mdBook Build | ✓ Executed | ✗ Skipped |

Both modes execute the Python scraper deepwiki-scraper.py identically. The difference occurs after scraping completes, where the shell orchestrator makes a conditional decision based on the MARKDOWN_ONLY variable.

Sources: build-docs.sh:58-76 README.md:123-145

Implementation Details

Conditional Logic in build-docs.sh

The markdown-only mode implementation consists of a simple conditional branch in the shell orchestrator:

graph TD
    ScraperCall["python3 /usr/local/bin/deepwiki-scraper.py\n[build-docs.sh:58]"]
CheckVar{"if [ '$MARKDOWN_ONLY' = 'true' ]\n[build-docs.sh:61]"}
subgraph "Markdown-Only Path [lines 63-76]"
        MkdirOut["mkdir -p $OUTPUT_DIR/markdown"]
CpMarkdown["cp -r $WIKI_DIR/* $OUTPUT_DIR/markdown/"]
EchoSuccess["Echo success message"]
Exit0["exit 0"]
end
    
    subgraph "Full Build Path [lines 78-206]"
        MkdirBook["mkdir -p $BOOK_DIR"]
CreateToml["Create book.toml"]
CreateSummary["Generate SUMMARY.md"]
CopySrcFiles["Copy markdown to src/"]
MdbookMermaid["mdbook-mermaid install"]
MdbookBuild["mdbook build"]
CopyAll["Copy to /output"]
end
    
 
   ScraperCall --> CheckVar
 
   CheckVar -->|true| MkdirOut
 
   MkdirOut --> CpMarkdown
 
   CpMarkdown --> EchoSuccess
 
   EchoSuccess --> Exit0
    
 
   CheckVar -->|false| MkdirBook
 
   MkdirBook --> CreateToml
 
   CreateToml --> CreateSummary
 
   CreateSummary --> CopySrcFiles
 
   CopySrcFiles --> MdbookMermaid
 
   MdbookMermaid --> MdbookBuild
 
   MdbookBuild --> CopyAll

The implementation performs an early exit at build-docs.sh75 when markdown-only mode is enabled, preventing execution of the entire mdBook build pipeline.

Sources: build-docs.sh:60-76

Variable Reading and Default Value

The MARKDOWN_ONLY variable is read with a default value of "false":
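The exact lines are not reproduced here; a condensed sketch of the default assignment and the early-exit branch described below:

```bash
# Default to "false" when unset or empty, then branch on exact equality with "true"
MARKDOWN_ONLY="${MARKDOWN_ONLY:-false}"

if [ "$MARKDOWN_ONLY" = "true" ]; then
  mkdir -p "$OUTPUT_DIR/markdown"
  cp -r "$WIKI_DIR"/* "$OUTPUT_DIR/markdown/"
  exit 0
fi
```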

This line at build-docs.sh26 uses bash parameter expansion to set the variable to "false" if it is unset or empty. The string comparison at build-docs.sh61 checks for exact equality with "true", meaning any other value (including "false", "", or "1") results in a full build.

Sources: build-docs.sh26 build-docs.sh61

Output Structure Comparison

Markdown-Only Mode Output

When MARKDOWN_ONLY="true", the output directory contains:

/output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    ├── 3-configuration-reference.md
    ├── section-4/
    │   ├── 4-1-three-phase-pipeline.md
    │   └── 4-2-docker-multi-stage-build.md
    └── ...

| Output | Present | Location |
|--------|---------|----------|
| Markdown files | ✓ | /output/markdown/ |
| HTML documentation | ✗ | N/A |
| book.toml | ✗ | N/A |
| SUMMARY.md | ✗ | N/A |

Sources: build-docs.sh:64-74 README.md:106-114

Full Build Mode Output

When MARKDOWN_ONLY="false" (default), the output directory contains:

/output/
├── book/
│   ├── index.html
│   ├── print.html
│   ├── searchindex.js
│   ├── css/
│   ├── FontAwesome/
│   └── ...
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   └── ...
└── book.toml

| Output | Present | Location |
|--------|---------|----------|
| Markdown files | ✓ | /output/markdown/ |
| HTML documentation | ✓ | /output/book/ |
| book.toml | ✓ | /output/book.toml |
| SUMMARY.md | ✓ (internal) | Copied to mdBook src during build |

Sources: build-docs.sh:179-201 README.md:89-105

Use Cases

Debugging Diagram Placement

Markdown-only mode is particularly useful when debugging the fuzzy diagram matching algorithm (see Fuzzy Diagram Matching Algorithm). Developers can rapidly iterate on diagram placement logic without waiting for mdBook compilation:
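A minimal command sketch, assuming the image is tagged deepwiki-scraper and owner/repo is a placeholder:

```bash
# Extract markdown only, then inspect where diagrams landed
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$PWD/output:/output" \
  deepwiki-scraper

grep -rl mermaid output/markdown/   # list files that received diagram blocks
```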

This workflow allows inspection of the exact markdown output, including where diagrams were injected, without the overhead of HTML generation.

Sources: README.md:55-72 README.md:168-169

Content Extraction Verification

When verifying that HTML-to-Markdown conversion is working correctly (see HTML to Markdown Conversion), markdown-only mode provides quick feedback:

Sources: README.md:55-72

CI/CD Pipeline Intermediate Artifacts

In continuous integration pipelines, markdown-only mode can be used as an intermediate step to produce version-controlled markdown artifacts without generating the full HTML site:

Sources: README.md:228-232

Performance Characteristics

Build Time Comparison

| Build Mode | Phase 1 | Phase 2 | Phase 3 | Total |
|------------|---------|---------|---------|-------|
| Full Build | ~30s | ~10s | ~20s | ~60s |
| Markdown-Only | ~30s | ~10s | 0s | ~40s |

The markdown-only mode provides approximately 33% faster execution by eliminating:

  • mdBook binary initialization
  • book.toml generation
  • SUMMARY.md generation
  • Markdown file copying to src/
  • mdbook-mermaid asset installation
  • HTML compilation and asset generation

Note: Actual times vary based on repository size, number of diagrams, and system resources.

Sources: build-docs.sh:78-176 README.md:71-72

Resource Consumption

| Resource | Full Build | Markdown-Only |
|----------|------------|---------------|
| CPU | High (Rust compilation) | Medium (Python only) |
| Memory | ~2GB recommended | ~512MB sufficient |
| Disk I/O | High (HTML generation) | Low (markdown only) |
| Network | Same (scraping) | Same (scraping) |

Sources: README.md175

Common Workflows

Iterative Debugging Workflow
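A minimal sketch of the loop, assuming the image is tagged deepwiki-scraper and owner/repo is a placeholder:

```bash
# 1. Fast loop: markdown only
docker run --rm -e REPO=owner/repo -e MARKDOWN_ONLY=true \
  -v "$PWD/output:/output" deepwiki-scraper
# 2. Inspect output/markdown/, adjust the scraper, repeat
# 3. Final check: full build with HTML output
docker run --rm -e REPO=owner/repo \
  -v "$PWD/output:/output" deepwiki-scraper
```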

This workflow minimizes iteration time during development by using markdown-only mode for rapid feedback loops, only running the full build when markdown output is verified correct.

Sources: README.md:55-72 README.md177

Markdown Extraction for Other Tools

Markdown-only mode can extract clean markdown files for use with documentation tools other than mdBook:

The extracted markdown files contain properly rewritten internal links and enhanced diagrams, making them suitable for any markdown-compatible documentation system.

Sources: README.md:106-114 README.md:218-232

Console Output Differences

Markdown-Only Mode Output

When MARKDOWN_ONLY="true", the console output terminates after Phase 2:

Step 1: Scraping wiki from DeepWiki...
[... scraping progress ...]

Step 2: Copying markdown files to output (markdown-only mode)...

================================================================================
✓ Markdown extraction complete!
================================================================================

Outputs:
  - Markdown files:  /output/markdown/

The script exits at build-docs.sh75 with exit code 0.

Sources: build-docs.sh:63-75

Full Build Mode Output

When MARKDOWN_ONLY="false", the console output continues through all phases:

Step 1: Scraping wiki from DeepWiki...
[... scraping progress ...]

Step 2: Initializing mdBook structure...
Step 3: Generating SUMMARY.md from scraped content...
Step 4: Copying markdown files to book...
Step 5: Installing mdbook-mermaid assets...
Step 6: Building mdBook...
Step 7: Copying outputs to /output...

================================================================================
✓ Documentation build complete!
================================================================================

Outputs:
  - HTML book:       /output/book/
  - Markdown files:  /output/markdown/
  - Book config:     /output/book.toml

Sources: build-docs.sh:193-205

Summary

Markdown-Only Mode provides a fast-path execution option controlled by the MARKDOWN_ONLY environment variable. It executes Phases 1 and 2 of the pipeline but bypasses Phase 3, producing only enhanced markdown files without HTML documentation. This mode is essential for:

  • Debugging : Rapid iteration on content extraction and diagram placement
  • Performance : 33% faster execution when HTML output is not needed
  • Flexibility : Extract markdown for use with other documentation tools

The implementation is straightforward: a single conditional check at build-docs.sh61 determines whether to execute the mdBook build pipeline or exit early with only markdown artifacts.

Sources: build-docs.sh:60-76 README.md:55-72 README.md:106-114


Link Rewriting Logic

Relevant source files

This document details the algorithm for converting internal DeepWiki URL links into relative Markdown file paths during the content extraction process. The link rewriting system ensures that cross-references between wiki pages function correctly in the final mdBook output by transforming absolute web URLs into appropriate relative file paths based on the hierarchical structure of the documentation.

For information about the overall markdown extraction process, see Phase 1: Markdown Extraction. For details about file organization and directory structure, see Output Structure.

Overview

DeepWiki uses absolute URL paths for internal wiki links in the format /owner/repo/N-page-title or /owner/repo/N.M-subsection-title. These links must be rewritten to relative Markdown file paths that respect the mdBook directory structure where:

  • Main pages (e.g., "1-overview", "2-architecture") reside in the root markdown directory
  • Subsections (e.g., "2.1-subsection", "2.2-another") reside in subdirectories named section-N/
  • File names use hyphens instead of dots (e.g., 2-1-subsection.md instead of 2.1-subsection.md)

The rewriting logic must compute the correct relative path based on both the source page location and the target page location.

Sources: tools/deepwiki-scraper.py:547-594

Directory Structure Context

The system organizes markdown files into a hierarchical structure that affects link rewriting:

Diagram: File Organization Hierarchy

graph TB
    Root["Root Directory\n(output/markdown/)"]
Main1["1-overview.md\n(Main Page)"]
Main2["2-architecture.md\n(Main Page)"]
Main3["3-installation.md\n(Main Page)"]
Section2["section-2/\n(Subsection Directory)"]
Section3["section-3/\n(Subsection Directory)"]
Sub2_1["2-1-components.md\n(Subsection)"]
Sub2_2["2-2-workflows.md\n(Subsection)"]
Sub3_1["3-1-docker-setup.md\n(Subsection)"]
Sub3_2["3-2-manual-setup.md\n(Subsection)"]
Root --> Main1
 
   Root --> Main2
 
   Root --> Main3
 
   Root --> Section2
 
   Root --> Section3
    
 
   Section2 --> Sub2_1
 
   Section2 --> Sub2_2
    
 
   Section3 --> Sub3_1
 
   Section3 --> Sub3_2

This structure requires different relative path strategies depending on where the link originates and where it points.

Sources: tools/deepwiki-scraper.py:848-860

Input Format Detection

The algorithm begins by matching markdown links that reference the DeepWiki URL structure using a regular expression pattern.

Diagram: Link Pattern Matching Flow

flowchart TD
    Start["Markdown Content"]
Regex["Apply Regex Pattern:\n'\\]\\(/[^/]+/[^/]+/([^)]+)\\)'"]
Extract["Extract Path Component:\ne.g., '4-query-planning'"]
Parse["Parse Page Number and Slug:\nPattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
PageNum["page_num\n(e.g., '2.1' or '4')"]
Slug["slug\n(e.g., 'query-planning')"]
Start --> Regex
 
   Regex --> Extract
 
   Extract --> Parse
 
   Parse --> PageNum
 
   Parse --> Slug

The regex \]\(/[^/]+/[^/]+/([^)]+)\) captures the path component after the repository identifier. For example, in a link ending with (/owner/repo/4-query-planning), it captures 4-query-planning.

Sources: tools/deepwiki-scraper.py592

Page Classification Logic

Each page (source and target) is classified based on whether it contains a dot in its page number, indicating a subsection.

Diagram: Page Type Classification

graph TB
    subgraph "Target Classification"
        TargetNum["Target Page Number"]
CheckDot["Contains '.' ?"]
IsTargetSub["is_target_subsection = True\ntarget_main_section = N"]
IsTargetMain["is_target_subsection = False\ntarget_main_section = None"]
TargetNum --> CheckDot
 
       CheckDot -->|Yes "2.1"| IsTargetSub
 
       CheckDot -->|No "2"| IsTargetMain
    end
    
    subgraph "Source Classification"
        SourceInfo["current_page_info"]
SourceLevel["Check 'level' field"]
IsSourceSub["is_source_subsection = True\nsource_main_section = N"]
IsSourceMain["is_source_subsection = False\nsource_main_section = None"]
SourceInfo --> SourceLevel
 
       SourceLevel -->|> 0| IsSourceSub
 
       SourceLevel -->|= 0| IsSourceMain
    end

The level field in current_page_info is set during wiki structure discovery and indicates the depth in the hierarchy (0 for main pages, 1+ for subsections).

Sources: tools/deepwiki-scraper.py:554-570

Path Generation Decision Matrix

The relative path is computed based on the combination of source and target types:

| Source Type | Target Type | Relative Path | Example |
|-------------|-------------|---------------|---------|
| Main Page | Main Page | {file_num}-{slug}.md | 3-installation.md |
| Main Page | Subsection | section-{N}/{file_num}-{slug}.md | section-2/2-1-components.md |
| Subsection | Main Page | ../{file_num}-{slug}.md | ../3-installation.md |
| Subsection (same section) | Subsection (same section) | {file_num}-{slug}.md | 2-2-workflows.md |
| Subsection (section A) | Subsection (section B) | ../section-{N}/{file_num}-{slug}.md | ../section-3/3-1-setup.md |

Sources: tools/deepwiki-scraper.py:573-588

Implementation Details

The core implementation is a nested function fix_wiki_link that serves as a callback for re.sub.

Diagram: fix_wiki_link Function Control Flow

flowchart TD
    Start["fix_wiki_link(match)"]
ExtractPath["full_path = match.group(1)"]
ParseLink["Match pattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
Success{"Match\nsuccessful?"}
NoMatch["Return match.group(0)\n(unchanged)"]
ExtractParts["page_num = match.group(1)\nslug = match.group(2)"]
ConvertNum["file_num = page_num.replace('.', '-')"]
ClassifyTarget["Classify target:\nis_target_subsection\ntarget_main_section"]
ClassifySource["Classify source:\nis_source_subsection\nsource_main_section"]
Decision{"Target is\nsubsection?"}
DecisionYes{"Source in same\nsection?"}
DecisionNo{"Source is\nsubsection?"}
Path1["Return '{file_num}-{slug}.md'"]
Path2["Return 'section-{N}/{file_num}-{slug}.md'"]
Path3["Return '../{file_num}-{slug}.md'"]
Start --> ExtractPath
 
   ExtractPath --> ParseLink
 
   ParseLink --> Success
 
   Success -->|No| NoMatch
 
   Success -->|Yes| ExtractParts
 
   ExtractParts --> ConvertNum
 
   ConvertNum --> ClassifyTarget
 
   ClassifyTarget --> ClassifySource
 
   ClassifySource --> Decision
    
 
   Decision -->|Yes| DecisionYes
 
   DecisionYes -->|Yes| Path1
 
   DecisionYes -->|No| Path2
    
 
   Decision -->|No| DecisionNo
 
   DecisionNo -->|Yes| Path3
 
   DecisionNo -->|No| Path1


The function handles all path generation cases through a series of conditional checks, using information from both the link match and the current_page_info parameter.

Sources: tools/deepwiki-scraper.py:549-589

Page Number Transformation

The transformation from page numbers with dots to file names with hyphens is critical for matching the file system structure:

Diagram: Page Number Format Conversion

graph LR
    subgraph "DeepWiki Format"
        DW1["Page: '2.1'"]
DW2["URL: '/repo/2.1-title'"]
end
    
    subgraph "Transformation"
        Trans["Replace '.' with '-'"]
end
    
    subgraph "File System Format"
        FS1["File Number: '2-1'"]
FS2["Path: 'section-2/2-1-title.md'"]
end
    
 
   DW1 --> Trans
 
   DW2 --> Trans
 
   Trans --> FS1
 
   Trans --> FS2

This conversion is performed by the line file_num = page_num.replace('.', '-'), which ensures that subsection identifiers match the actual file names created during extraction.

Sources: tools/deepwiki-scraper.py558

Detailed Example Scenarios

Scenario 1: Main Page to Main Page

When a main page (e.g., 1-overview.md) links to another main page (e.g., 4-features.md):

  • Source: 1-overview.md (level = 0, in root directory)
  • Target: 4-features (no dot, is main page)
  • Input Link: [Features](/owner/repo/4-features)
  • Generated Path: 4-features.md
  • Reason: Both files are in the same root directory, so only the filename is needed

Sources: tools/deepwiki-scraper.py:586-588

Scenario 2: Main Page to Subsection

When a main page (e.g., 2-architecture.md) links to a subsection (e.g., 2.1-components):

  • Source: 2-architecture.md (level = 0, in root directory)
  • Target: 2.1-components (contains dot, is subsection in section-2/)
  • Input Link: [Components](/owner/repo/2.1-components)
  • Generated Path: section-2/2-1-components.md
  • Reason: Target is in subdirectory section-2/, source is in root, so full relative path is needed

Sources: tools/deepwiki-scraper.py:579-580

Scenario 3: Subsection to Main Page

When a subsection (e.g., 2.1-components.md in section-2/) links to a main page (e.g., 3-installation.md):

  • Source: 2.1-components.md (level = 1, in section-2/ directory)
  • Target: 3-installation (no dot, is main page)
  • Input Link: [Installation](/owner/repo/3-installation)
  • Generated Path: ../3-installation.md
  • Reason: Source is in subdirectory, target is in parent directory, so ../ is needed to go up one level

Sources: tools/deepwiki-scraper.py:583-585

Scenario 4: Subsection to Subsection (Same Section)

When a subsection (e.g., 2.1-components.md) links to another subsection in the same section (e.g., 2.2-workflows.md):

  • Source: 2.1-components.md (level = 1, in section-2/)
  • Source Main Section: 2
  • Target: 2.2-workflows (contains dot, in section-2/)
  • Target Main Section: 2
  • Input Link: [Workflows](/owner/repo/2.2-workflows)
  • Generated Path: 2-2-workflows.md
  • Reason: Both files are in the same section-2/ directory, so only the filename is needed

Sources: tools/deepwiki-scraper.py:575-577

Scenario 5: Subsection to Subsection (Different Section)

When a subsection (e.g., 2.1-components.md in section-2/) links to a subsection in a different section (e.g., 3.1-docker-setup.md in section-3/):

  • Source: 2.1-components.md (level = 1, in section-2/)
  • Source Main Section: 2
  • Target: 3.1-docker-setup (contains dot, in section-3/)
  • Target Main Section: 3
  • Input Link: [Docker Setup](/owner/repo/3.1-docker-setup)
  • Generated Path: section-3/3-1-docker-setup.md
  • Reason: Sections don't match, so full path from root perspective is used (implicitly going up and into different section directory)

Sources: tools/deepwiki-scraper.py:579-580

Integration with Content Extraction

The link rewriting is integrated into the extract_page_content function and applied after HTML-to-Markdown conversion:

Diagram: Link Rewriting Integration Sequence

sequenceDiagram
    participant EPC as extract_page_content
    participant CTM as convert_html_to_markdown
    participant FWL as fix_wiki_link
    participant RE as re.sub
    
    EPC->>CTM: Convert HTML to Markdown
    CTM-->>EPC: Raw Markdown (with DeepWiki URLs)
    
    Note over EPC: Clean up content
    
    EPC->>RE: Apply link rewriting regex
    loop For each matched link
        RE->>FWL: Call with match object
        FWL->>FWL: Parse page number and slug
        FWL->>FWL: Classify source and target
        FWL->>FWL: Compute relative path
        FWL-->>RE: Return rewritten link
    end
    RE-->>EPC: Markdown with relative paths
    
    EPC-->>EPC: Return final markdown

The rewriting occurs at line 592 using re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', fix_wiki_link, markdown), which finds all internal wiki links and replaces them with their rewritten versions.

Sources: tools/deepwiki-scraper.py:547-594

Edge Cases and Error Handling

If a link doesn't match the expected pattern (\d+(?:\.\d+)*)-(.+)$, the function returns the original match unchanged:

This ensures that malformed or external links are preserved in their original form.

Sources: tools/deepwiki-scraper.py:551-589

Missing current_page_info

If current_page_info is not provided (e.g., during development or testing), the function defaults to treating the source as a main page:

This allows the function to work in degraded mode, though links from subsections may not be correctly rewritten.

Sources: tools/deepwiki-scraper.py:565-570

Performance Considerations

The link rewriting is performed using a single re.sub call with a callback function, which is efficient for typical wiki pages with dozens to hundreds of links. The regex compilation is implicit and cached by Python's re module.

The algorithm has O(n) complexity where n is the number of internal links in the page, with each link requiring constant-time string operations.

Sources: tools/deepwiki-scraper.py592

Testing and Validation

The correctness of link rewriting can be validated by:

  1. Checking that generated links use .md extension
  2. Verifying that links from subsections to main pages use ../
  3. Confirming that links to subsections use the section-N/ prefix when appropriate
  4. Testing cross-section subsection links resolve correctly

The mdBook build process will fail if links are incorrectly rewritten, providing a validation mechanism during Phase 3 of the pipeline.

Sources: tools/deepwiki-scraper.py:547-594


Auto-Detection Features

Relevant source files

This document describes the automatic detection and configuration mechanisms in the DeepWiki-to-mdBook converter system. These features enable the system to operate with minimal user configuration by intelligently inferring repository metadata, generating sensible defaults, and dynamically discovering file structures.

For information about manually configuring these values, see Configuration Reference. For details on how SUMMARY.md generation works, see SUMMARY.md Generation.

Overview

The system implements three primary auto-detection capabilities:

  1. Git Repository Detection : Automatically identifies the GitHub repository from Git remote URLs
  2. Configuration Defaults : Generates book metadata from detected repository information
  3. File Structure Discovery : Dynamically builds table of contents from actual file hierarchies

These features allow the system to run with a single docker run command in many cases, with all necessary configuration inferred from context.

Git Repository Auto-Detection

Detection Mechanism

The system attempts to auto-detect the GitHub repository when the REPO environment variable is not provided. This detection occurs in the shell orchestrator and follows a specific fallback sequence.

Git Repository Auto-Detection Flow

flowchart TD
    Start["build-docs.sh execution"]
CheckRepo{"REPO env\nvariable set?"}
UseRepo["Use $REPO value"]
CheckGit{"Git repository\ndetected?"}
GetRemote["Execute:\ngit config --get\nremote.origin.url"]
CheckRemote{"Remote URL\nfound?"}
ExtractOwnerRepo["Apply regex pattern:\ns#.*github\.com[:/]([^/]+/[^/\.]+)\n(\.git)?.*#\1#"]
SetRepo["Set REPO variable\nto owner/repo"]
ErrorExit["Exit with error:\nREPO must be set"]
Start --> CheckRepo
 
   CheckRepo -->|Yes| UseRepo
 
   CheckRepo -->|No| CheckGit
 
   CheckGit -->|No| ErrorExit
 
   CheckGit -->|Yes| GetRemote
 
   GetRemote --> CheckRemote
 
   CheckRemote -->|No| ErrorExit
 
   CheckRemote -->|Yes| ExtractOwnerRepo
 
   ExtractOwnerRepo --> SetRepo
 
   UseRepo --> Continue["Continue with\nbuild process"]
SetRepo --> Continue

Sources: build-docs.sh:8-19

Implementation Details

The auto-detection logic is implemented in the shell script's initialization section:

| Detection Step | Shell Command | Purpose |
|----------------|---------------|---------|
| Check Git repository | git rev-parse --git-dir > /dev/null 2>&1 | Verify current directory is a Git repository |
| Retrieve remote URL | git config --get remote.origin.url | Get the origin remote URL |
| Extract repository | sed -E 's#.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*#\1#' | Parse owner/repo from various URL formats |

The regex pattern in the sed command handles multiple GitHub URL formats:

  • HTTPS: https://github.com/owner/repo.git
  • SSH: git@github.com:owner/repo.git
  • HTTPS without .git: https://github.com/owner/repo
  • SSH without .git: git@github.com:owner/repo

Sources: build-docs.sh:8-19

Supported URL Formats

The detection regex supports the following GitHub remote URL patterns:

The regex captures the repository path between github.com and any optional .git suffix, handling both : (SSH) and / (HTTPS) separators.

Sources: build-docs.sh:14-16

Configuration Defaults Generation

flowchart LR
    REPO["$REPO\n(owner/repo)"]
Extract["Parse repository\ncomponents"]
REPO_OWNER["$REPO_OWNER\n(cut -d'/' -f1)"]
REPO_NAME["$REPO_NAME\n(cut -d'/' -f2)"]
DefaultAuthors["BOOK_AUTHORS\ndefault: $REPO_OWNER"]
DefaultURL["GIT_REPO_URL\ndefault: https://github.com/$REPO"]
DefaultTitle["BOOK_TITLE\ndefault: Documentation"]
FinalAuthors["Final BOOK_AUTHORS"]
FinalURL["Final GIT_REPO_URL"]
FinalTitle["Final BOOK_TITLE"]
REPO --> Extract
 
   Extract --> REPO_OWNER
 
   Extract --> REPO_NAME
    
 
   REPO_OWNER --> DefaultAuthors
 
   REPO --> DefaultURL
    
 
   DefaultAuthors -->|Override if env var set| FinalAuthors
 
   DefaultURL -->|Override if env var set| FinalURL
 
   DefaultTitle -->|Override if env var set| FinalTitle
    
 
   FinalAuthors --> BookToml["book.toml\ngeneration"]
FinalURL --> BookToml
 
   FinalTitle --> BookToml

Metadata Derivation

Once the REPO variable is determined (either from environment or auto-detection), the system generates additional configuration values with intelligent defaults:

Configuration Default Generation Flow

Sources: build-docs.sh:39-45

Default Value Table

| Configuration Variable | Default Value Expression | Example Result | Override Behavior |
|------------------------|--------------------------|----------------|-------------------|
| REPO_OWNER | $(echo "$REPO" \| cut -d'/' -f1) | jzombie | Derived from REPO |
| REPO_NAME | $(echo "$REPO" \| cut -d'/' -f2) | deepwiki-to-mdbook | Derived from REPO |
| BOOK_AUTHORS | ${BOOK_AUTHORS:=$REPO_OWNER} | jzombie | Environment variable takes precedence |
| GIT_REPO_URL | ${GIT_REPO_URL:=https://github.com/$REPO} | https://github.com/jzombie/deepwiki-to-mdbook | Environment variable takes precedence |
| BOOK_TITLE | ${BOOK_TITLE:-Documentation} | Documentation | Environment variable takes precedence |

The shell parameter expansion syntax ${VAR:=default} assigns the default value only if VAR is unset or null, enabling environment variable overrides.
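A sketch of the derivation, using the expressions from the table above:

```bash
# Split owner/repo, then fill in defaults only where no environment override exists
REPO_OWNER=$(echo "$REPO" | cut -d'/' -f1)
REPO_NAME=$(echo "$REPO" | cut -d'/' -f2)
: "${BOOK_AUTHORS:=$REPO_OWNER}"
: "${GIT_REPO_URL:=https://github.com/$REPO}"
BOOK_TITLE="${BOOK_TITLE:-Documentation}"
```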

Sources: build-docs.sh:21-26 build-docs.sh:39-45

book.toml Generation

The auto-detected and default values are incorporated into the dynamically generated book.toml configuration file:
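Illustrative sketch only; the real heredoc at build-docs.sh:85-103 may set additional options:

```bash
# Generate book.toml from the detected metadata (standard mdBook keys)
cat > "$BOOK_DIR/book.toml" <<EOF
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]
src = "src"

[output.html]
git-repository-url = "$GIT_REPO_URL"

[preprocessor.mermaid]
command = "mdbook-mermaid"
EOF
```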

The git-repository-url field enables mdBook to generate "Edit this page" links that direct users to the appropriate GitHub repository file.

Sources: build-docs.sh:85-103 README.md99

File Structure Discovery

Dynamic SUMMARY.md Generation

The system automatically discovers the file hierarchy and generates a table of contents without requiring manual configuration. This process analyzes the scraped markdown files to determine their structure.

File Structure Discovery Algorithm

flowchart TD
    Start["Begin SUMMARY.md\ngeneration"]
FindFirst["Find first .md file:\nls $WIKI_DIR/*.md /head -1"]
ExtractTitle["Extract title: head -1 file/ sed 's/^# //'"]
WriteIntro["Write introduction entry\nto SUMMARY.md"]
IterateFiles["Iterate all .md files\nin $WIKI_DIR"]
SkipFirst{"Is this\nfirst page?"}
ExtractNum["Extract section number:\ngrep -oE '^[0-9]+'"]
CheckSubdir{"Does section-N\ndirectory exist?"}
WriteSection["Write section header:\n# Title"]
WriteMain["Write main entry:\n- [Title](filename.md)"]
IterateSubs["Iterate subsection files:\nsection-N/*.md"]
WriteSubentry["Write subsection:\n - [Subtitle](section-N/file.md)"]
WriteStandalone["Write standalone entry:\n- [Title](filename.md)"]
NextFile{"More files?"}
Done["SUMMARY.md complete"]
Start --> FindFirst
 
   FindFirst --> ExtractTitle
 
   ExtractTitle --> WriteIntro
 
   WriteIntro --> IterateFiles
 
   IterateFiles --> SkipFirst
 
   SkipFirst -->|Yes| NextFile
 
   SkipFirst -->|No| ExtractNum
 
   ExtractNum --> CheckSubdir
 
   CheckSubdir -->|Yes| WriteSection
 
   WriteSection --> WriteMain
 
   WriteMain --> IterateSubs
 
   IterateSubs --> WriteSubentry
 
   WriteSubentry --> NextFile
 
   CheckSubdir -->|No| WriteStandalone
 
   WriteStandalone --> NextFile
 
   NextFile -->|Yes| SkipFirst
 
   NextFile -->|No| Done

Sources: build-docs.sh:112-159

Directory Structure Detection

The file structure discovery algorithm recognizes two organizational patterns:

Recognized File Hierarchy Patterns

$WIKI_DIR/
├── 1-overview.md                    # Main page (becomes introduction)
├── 2-architecture.md                # Main page with subsections
├── 3-components.md                  # Standalone page
├── section-2/                       # Subsection directory
│   ├── 2-1-system-design.md
│   └── 2-2-data-flow.md
└── section-4/                       # Another subsection directory
    ├── 4-1-phase-one.md
    └── 4-2-phase-two.md

The algorithm uses the following detection logic:

| Pattern Element | Detection Method | Code Reference |
|-----------------|------------------|----------------|
| Main pages | for file in "$WIKI_DIR"/*.md | build-docs.sh:126 |
| Section number | echo "$filename" \| grep -oE '^[0-9]+' | |
| Subsection directory | [ -d "$WIKI_DIR/section-$section_num" ] | build-docs.sh:138 |
| Subsection files | for subfile in "$section_dir"/*.md | build-docs.sh:147 |

Sources: build-docs.sh:126-157

Title Extraction

Page titles are automatically extracted from the first line of each markdown file using the following approach:

This command:

  1. Reads the first line of the file with head -1
  2. Removes the markdown heading syntax # with sed 's/^# //'
  3. Assigns the result to the title variable for use in SUMMARY.md
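A command equivalent to this description (the actual line is at build-docs.sh:134):

```bash
# Extract the page title from the first markdown heading of a file
title=$(head -1 "$file" | sed 's/^# //')
```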

Sources: build-docs.sh134 build-docs.sh150

Generated SUMMARY.md Example

Given the file structure shown above, the system generates:

The generation process outputs a count of entries, "Generated SUMMARY.md with N entries", where N is computed with a grep -c count of the link entries in src/SUMMARY.md.

Auto-Detection in CI/CD Context

Docker Container Limitations

The Git repository auto-detection feature has limitations when running inside a Docker container. The detection logic executes within the container's filesystem, which typically does not include the host's Git repository unless explicitly mounted.

Auto-Detection Context Comparison

| Execution Context | Git Repository Available | Auto-Detection Works | Recommended Usage |
|-------------------|--------------------------|----------------------|-------------------|
| Host machine with Git repository | ✓ Yes | ✓ Yes | Local development/testing |
| Docker container (default) | ✗ No | ✗ No | Must provide REPO env var |
| Docker with volume mount of Git repo | ✓ Yes | ⚠ Partial | Not recommended (complexity) |
| CI/CD pipeline (GitHub Actions, etc.) | ⚠ Varies | ⚠ Conditional | Use explicit REPO for reliability |

For production and CI/CD usage, explicitly setting the REPO environment variable is recommended:
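For example (owner/repo is a placeholder; the image tag assumes the build instructions from the Development Guide):

```bash
docker run --rm -e REPO=owner/repo -v "$PWD/output:/output" deepwiki-scraper
```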

Sources: build-docs.sh:8-36 README.md:47-53

Implementation Code References

Shell Variable Initialization

The complete auto-detection and default generation sequence:

Sources: build-docs.sh:8-45

Error Handling

The system validates that a repository is available (either from environment or auto-detection) before proceeding:
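A sketch of the fail-fast check; the error text matches the flow shown earlier:

```bash
# Abort immediately when neither the environment nor auto-detection produced a repository
if [ -z "$REPO" ]; then
  echo "ERROR: REPO must be set"
  exit 1
fi
```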

This validation ensures the system fails fast with a clear error message if configuration is insufficient.

Sources: build-docs.sh:33-37


Development Guide

Relevant source files

This page provides guidance for developers who want to modify, extend, or contribute to the DeepWiki-to-mdBook Converter system. It covers the development environment setup, local workflow, testing procedures, and key considerations when working with the codebase.

For detailed information about the repository structure, see Project File Structure. For instructions on building the Docker image, see Building the Docker Image. For Python dependency details, see Python Dependencies.

Development Environment Requirements

The system is designed to run entirely within Docker, but local development requires the following tools:

| Tool | Purpose | Version |
|------|---------|---------|
| Docker | Container runtime | Latest stable |
| Git | Version control | 2.x or later |
| Text editor/IDE | Code editing | Any (VS Code recommended) |
| Python | Local testing (optional) | 3.12+ |
| Rust toolchain | Local testing (optional) | Latest stable |

The Docker image handles all runtime dependencies, so local installation of Python and Rust is optional and only needed for testing individual components outside the container.

Sources: Dockerfile:1-33

Development Workflow Architecture

The following diagram shows the typical development cycle and how different components interact during development:

Development Workflow Diagram : Shows the cycle from editing code to building the Docker image to testing with mounted output volume.

graph TB
    subgraph "Development Environment"
        Editor["Code Editor"]
GitRepo["Local Git Repository"]
end
    
    subgraph "Docker Build Process"
        BuildCmd["docker build -t deepwiki-scraper ."]
Stage1["Rust Builder Stage\nCompiles mdbook binaries"]
Stage2["Python Runtime Stage\nAssembles final image"]
FinalImage["deepwiki-scraper:latest"]
end
    
    subgraph "Testing & Validation"
        RunCmd["docker run with test params"]
OutputMount["Volume mount: ./output"]
Validation["Manual inspection of output"]
end
    
    subgraph "Key Development Files"
        Dockerfile["Dockerfile"]
BuildScript["build-docs.sh"]
Scraper["tools/deepwiki-scraper.py"]
Requirements["tools/requirements.txt"]
end
    
 
   Editor -->|Edit| GitRepo
 
   GitRepo --> Dockerfile
 
   GitRepo --> BuildScript
 
   GitRepo --> Scraper
 
   GitRepo --> Requirements
    
 
   BuildCmd --> Stage1
 
   Stage1 --> Stage2
 
   Stage2 --> FinalImage
    
 
   FinalImage --> RunCmd
 
   RunCmd --> OutputMount
 
   OutputMount --> Validation
    
 
   Validation -.->|Iterate| Editor

Sources: Dockerfile:1-33 build-docs.sh:1-206

Component Development Map

This diagram bridges system concepts to actual code entities, showing which files implement which functionality:

Code Entity Mapping Diagram : Maps system functionality to specific code locations, file paths, and binaries.

graph LR
    subgraph "Entry Point Layer"
        CMD["CMD in Dockerfile:32"]
BuildDocs["build-docs.sh"]
end
    
    subgraph "Configuration Layer"
        EnvVars["Environment Variables\nREPO, BOOK_TITLE, etc."]
AutoDetect["Auto-detect logic\nbuild-docs.sh:8-19"]
Validation["Validation\nbuild-docs.sh:33-37"]
end
    
    subgraph "Processing Scripts"
        ScraperPy["deepwiki-scraper.py"]
MdBookBin["/usr/local/bin/mdbook"]
MermaidBin["/usr/local/bin/mdbook-mermaid"]
end
    
    subgraph "Configuration Generation"
        BookToml["book.toml generation\nbuild-docs.sh:85-103"]
SummaryMd["SUMMARY.md generation\nbuild-docs.sh:113-159"]
end
    
    subgraph "Dependency Management"
        ReqTxt["requirements.txt"]
UvInstall["uv pip install\nDockerfile:17"]
CargoInstall["cargo install\nDockerfile:5"]
end
    
 
   CMD --> BuildDocs
 
   BuildDocs --> EnvVars
 
   EnvVars --> AutoDetect
 
   AutoDetect --> Validation
    
 
   Validation --> ScraperPy
 
   BuildDocs --> BookToml
 
   BuildDocs --> SummaryMd
    
 
   BuildDocs --> MdBookBin
 
   MdBookBin --> MermaidBin
    
 
   ReqTxt --> UvInstall
 
   UvInstall --> ScraperPy
 
   CargoInstall --> MdBookBin
 
   CargoInstall --> MermaidBin

Sources: Dockerfile:1-33 build-docs.sh:8-19 build-docs.sh:85-103 build-docs.sh:113-159

Local Development Workflow

1. Clone and Setup

The repository has a minimal structure focused on the essential build artifacts. The .gitignore:1-2 excludes the output/ directory to prevent committing generated files.

2. Make Changes

Key files for common modifications:

| Modification Type | Primary File | Related Files |
|-------------------|--------------|---------------|
| Scraping logic | tools/deepwiki-scraper.py | - |
| Build orchestration | build-docs.sh | - |
| Python dependencies | tools/requirements.txt | Dockerfile:16-17 |
| Docker build process | Dockerfile | - |
| Output structure | build-docs.sh | Lines 179-191 |

3. Build Docker Image

After making changes, rebuild the Docker image:
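The tag below matches the one shown in the development workflow diagram:

```bash
docker build -t deepwiki-scraper .
```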

The multi-stage build process Dockerfile:1-7 first compiles Rust binaries in a rust:latest builder stage, then Dockerfile:8-33 assembles the final python:3.12-slim image with copied binaries and Python dependencies.

4. Test Changes

Test with a real repository:
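A typical test invocation (owner/repo is a placeholder; MARKDOWN_ONLY is optional):

```bash
# MARKDOWN_ONLY=true skips the mdBook phase for faster iteration on scraping changes
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$PWD/output:/output" \
  deepwiki-scraper
```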

Setting MARKDOWN_ONLY=true build-docs.sh:61-76 bypasses the mdBook build phase, allowing faster iteration when testing scraping logic changes.

5. Validate Output

Inspect the generated files:
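For example:

```bash
ls -R output/markdown/             # page files and section-N/ directories
ls output/book/index.html 2>/dev/null && echo "HTML site present (full builds only)"
```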

Sources: .gitignore:1-2 Dockerfile:1-33 build-docs.sh:61-76 build-docs.sh:179-191

Testing Strategies

Fast Iteration with Markdown-Only Mode

The MARKDOWN_ONLY environment variable enables a fast path for testing scraping changes:

This mode executes Phase 1 (Markdown Extraction) and Phase 2 (Diagram Enhancement) but skips Phase 3 (mdBook Build). See Phase 1: Markdown Extraction for details on what the extraction phase includes.

The conditional logic build-docs.sh:61-76 checks the MARKDOWN_ONLY variable and exits early after copying markdown files to /output/markdown/.

Testing Auto-Detection

The repository auto-detection logic build-docs.sh:8-19 attempts to extract the GitHub repository from Git remotes if REPO is not explicitly set:

The script checks git config --get remote.origin.url and extracts the owner/repo portion using sed pattern matching build-docs.sh16

Testing Configuration Generation

To test book.toml and SUMMARY.md generation without a full build:

The book.toml template build-docs.sh:85-103 uses shell variable substitution to inject environment variables into the TOML structure.

Sources: build-docs.sh:8-19 build-docs.sh:61-76 build-docs.sh:85-103

Debugging Techniques

Inspecting Intermediate Files

The build process creates temporary files in /workspace inside the container. To inspect them:
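One approach is to override the image's default CMD with a shell, run the build by hand, and then browse /workspace before the container exits (a sketch; paths per the documentation above):

```bash
docker run --rm -it -e REPO=owner/repo deepwiki-scraper bash
# inside the container:
build-docs.sh
ls /workspace/wiki /workspace/book/src
```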

This allows inspection of:

  • Scraped markdown files in /workspace/wiki/
  • Generated book.toml in /workspace/book/
  • Generated SUMMARY.md in /workspace/book/src/

Adding Debug Output

Both build-docs.sh:1-206 and deepwiki-scraper.py use echo statements for progress tracking. Add additional debug output:

Testing Python Script Independently

To test the scraper without Docker:
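A sketch, assuming a local Python 3.12 environment; the script takes the repository and an output directory as its two arguments:

```bash
pip install -r tools/requirements.txt
python3 tools/deepwiki-scraper.py owner/repo ./wiki-output
```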

This is useful for rapid iteration on scraping logic without rebuilding the Docker image.

Sources: build-docs.sh:1-206 tools/requirements.txt:1-4

Build Optimization Considerations

Multi-Stage Build Rationale

The Dockerfile:1-7 uses a separate Rust builder stage to:

  1. Compile mdbook and mdbook-mermaid with a full Rust toolchain
  2. Discard the ~1.5 GB builder stage after compilation
  3. Copy only the compiled binaries Dockerfile:20-21 to the final image

This reduces the final image size from ~1.5 GB to ~300-400 MB while still providing both Python and Rust tools. See Docker Multi-Stage Build for architectural details.

Dependency Management with uv

The Dockerfile13 copies uv from the official Astral image and uses it Dockerfile17 to install Python dependencies with --no-cache flag:
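The approximate form of that install step, based on the flags described below:

```bash
uv pip install --system --no-cache -r /tmp/requirements.txt
```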

This approach:

  • Provides faster dependency resolution than pip
  • Reduces layer size with --no-cache
  • Installs system-wide with --system flag

Image Layer Ordering

The Dockerfile orders operations to maximize layer caching:

  1. Copy uv binary (rarely changes)
  2. Install Python dependencies (changes with requirements.txt)
  3. Copy Rust binaries (changes when rebuilding Rust stage)
  4. Copy Python scripts (changes frequently during development)

This ordering means modifying deepwiki-scraper.py only invalidates the final layers Dockerfile:24-29 not the entire dependency installation.

Sources: Dockerfile:1-33

Common Development Tasks

Adding a New Environment Variable

To add a new configuration option:

  1. Define default in build-docs.sh:21-30:

  2. Add to configuration display build-docs.sh:47-53:

  3. Use in downstream processing as needed

  4. Document in Configuration Reference

Modifying SUMMARY.md Generation

The table of contents generation logic build-docs.sh:113-159 uses bash loops and file discovery:
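A simplified sketch of that loop, based on the discovery algorithm described in Auto-Detection Features; the real script also writes the introduction entry, section headers, and an entry count, and SUMMARY here is a hypothetical output path:

```bash
SUMMARY="$BOOK_DIR/src/SUMMARY.md"
for file in "$WIKI_DIR"/*.md; do
  filename=$(basename "$file")
  title=$(head -1 "$file" | sed 's/^# //')
  section_num=$(echo "$filename" | grep -oE '^[0-9]+')
  echo "- [$title]($filename)" >> "$SUMMARY"
  # Pages with a matching section-N/ directory get nested subsection entries
  if [ -d "$WIKI_DIR/section-$section_num" ]; then
    for subfile in "$WIKI_DIR/section-$section_num"/*.md; do
      subtitle=$(head -1 "$subfile" | sed 's/^# //')
      echo "  - [$subtitle](section-$section_num/$(basename "$subfile"))" >> "$SUMMARY"
    done
  fi
done
```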

To modify the structure:

  1. Adjust the file pattern matching
  2. Modify the section detection logic
  3. Update the markdown output format
  4. Test with repositories that have different hierarchical structures

Adding New Python Dependencies

  1. Add to tools/requirements.txt:1-4 with version constraint:
new-package>=1.0.0

  2. Rebuild Docker image (triggers Dockerfile:17)

  3. Update Python Dependencies documentation

  4. Import and use in deepwiki-scraper.py

Sources: build-docs.sh:21-30 build-docs.sh:113-159 tools/requirements.txt:1-4 Dockerfile17

File Modification Guidelines

Modifying build-docs.sh

The orchestrator script uses several idioms:

| Pattern | Purpose | Example |
|---------|---------|---------|
| set -e | Exit on error | build-docs.sh:2 |
| "${VAR:-default}" | Default values | build-docs.sh:22-26 |
| $(command) | Command substitution | build-docs.sh:12 |
| echo "" | Visual spacing | build-docs.sh:47 |
| mkdir -p | Safe directory creation | build-docs.sh:64 |

Maintain these patterns for consistency. The script is designed to be readable and self-documenting with clear step labels build-docs.sh:4-6

Modifying Dockerfile

Key considerations:

Modifying Python Scripts

When editing tools/deepwiki-scraper.py:

  • The script is executed via build-docs.sh58 with two arguments: REPO and output directory
  • It must be Python 3.12 compatible Dockerfile8
  • It has access to dependencies from tools/requirements.txt:1-4
  • It should write output to the specified directory argument
  • It should use print() for progress output that appears in build logs

Sources: build-docs.sh2 build-docs.sh58 Dockerfile:1-33 tools/requirements.txt:1-4

Integration Testing

End-to-End Test

Validate the complete pipeline:
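A sketch of a full run followed by simple sanity checks (owner/repo is a placeholder):

```bash
docker run --rm -e REPO=owner/repo -v "$PWD/output:/output" deepwiki-scraper

test -f output/book/index.html && echo "HTML build OK"
ls output/markdown/*.md > /dev/null && echo "Markdown extraction OK"
```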

Testing Configuration Variants

Test different repository configurations:

Sources: build-docs.sh:8-19 build-docs.sh:61-76

Contributing Guidelines

When submitting changes:

  1. Test locally : Build and run the Docker image with multiple test repositories
  2. Validate output : Ensure markdown files are properly formatted and the HTML site builds correctly
  3. Check backwards compatibility : Existing repositories should continue to work
  4. Update documentation : Modify relevant wiki pages if changing behavior
  5. Follow existing patterns : Match the coding style in build-docs.sh:1-206

The system is designed to be "fully generic" - it should work with any DeepWiki repository without modification. Test that your changes maintain this property.

Sources: build-docs.sh:1-206

Troubleshooting Development Issues

Build Failures

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| Rust compilation fails | Network issues, incompatible versions | Check rust:latest image availability |
| Python package install fails | Version conflicts in requirements.txt | Verify package versions are compatible |
| mdbook not found | Binary copy failed | Check Dockerfile:20-21 paths |
| Permission denied on scripts | Missing chmod +x | Verify Dockerfile:25-29 |

Runtime Failures

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| "REPO must be set" error | Auto-detection failed, no REPO env var | Check build-docs.sh:33-36 validation logic |
| Scraper crashes | DeepWiki site structure changed | Debug deepwiki-scraper.py with local testing |
| SUMMARY.md is empty | No markdown files found | Verify scraper output in /workspace/wiki/ |
| mdBook build fails | Invalid markdown syntax | Inspect markdown files for issues |

Output Validation Checklist

After a successful build, verify:

  • output/markdown/ contains .md files
  • Section directories exist (e.g., output/markdown/section-4/)
  • output/book/index.html exists and opens in browser
  • Navigation menu appears in generated site
  • Search functionality works
  • Mermaid diagrams render correctly
  • Links between pages work
  • "Edit this file" links point to correct GitHub URLs

Sources: build-docs.sh:33-36 Dockerfile:20-21 Dockerfile:25-29


Project File Structure

Relevant source files

This document describes the repository's file organization, detailing the purpose of each file and directory in the codebase. Understanding this structure is essential for developers who want to modify or extend the system.

For information about building the Docker image, see Building the Docker Image. For details about the Python dependencies, see Python Dependencies.

Repository Layout

The repository follows a minimal, flat structure with only the essential files needed for Docker-based documentation generation.

graph TB
    Root["Repository Root"]
Root --> GitIgnore[".gitignore\n(Excludes output/)"]
Root --> Dockerfile["Dockerfile\n(Multi-stage build)"]
Root --> BuildScript["build-docs.sh\n(Shell orchestrator)"]
Root --> ToolsDir["tools/\n(Python scripts)"]
Root --> OutputDir["output/\n(Generated, git-ignored)"]
ToolsDir --> Scraper["deepwiki-scraper.py\n(Content extraction)"]
ToolsDir --> Requirements["requirements.txt\n(Python deps)"]
OutputDir --> MarkdownOut["markdown/\n(Scraped .md files)"]
OutputDir --> BookOut["book/\n(HTML site)"]
OutputDir --> ConfigOut["book.toml\n(mdBook config)"]
style Root fill:#f9f9f9,stroke:#333
    style ToolsDir fill:#e8f5e9,stroke:#388e3c
    style OutputDir fill:#ffe0b2,stroke:#e64a19
    style Dockerfile fill:#e1f5ff,stroke:#0288d1
    style BuildScript fill:#fff4e1,stroke:#f57c00

Physical File Hierarchy

Sources: .gitignore:1-2 Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 tools/requirements.txt:1-4

Root Directory Files

The repository root contains three essential files that define the system's build and runtime behavior.

| File | Type | Lines | Purpose |
|------|------|-------|---------|
| .gitignore | Config | 2 | Excludes the output/ directory from version control |
| Dockerfile | Build | 33 | Multi-stage Docker build specification |
| build-docs.sh | Script | 206 | Shell orchestrator that coordinates all phases |

.gitignore

.gitignore:1-2

This file contains a single exclusion rule for the output/ directory, which is generated at runtime and should not be committed to version control. The output/ directory can contain hundreds of megabytes of generated documentation.

Sources: .gitignore:1-2

Dockerfile

Dockerfile:1-33

The Dockerfile implements a two-stage build pattern to optimize image size:

Stage 1: Rust Builder Dockerfile:2-5

  • Base: rust:latest (~1.5 GB)
  • Purpose: Compile mdbook and mdbook-mermaid binaries
  • Command: cargo install mdbook mdbook-mermaid

Stage 2: Final Image Dockerfile:8-32

  • Base: python:3.12-slim (~150 MB)
  • Installs: uv package manager Dockerfile13
  • Copies: Python requirements, compiled Rust binaries, and scripts
  • Entry: /usr/local/bin/build-docs.sh Dockerfile32

The multi-stage approach discards the Rust toolchain (~1.3 GB) while retaining only the compiled binaries, resulting in a final image of ~300-400 MB.

Sources: Dockerfile:1-33

build-docs.sh

build-docs.sh:1-206

graph LR
    AutoDetect["Auto-detect Git repo\nLines 8-19"]
Config["Parse environment vars\nLines 21-53"]
Phase1["Execute scraper\nLines 55-58"]
Phase2["Generate configs\nLines 78-159"]
Phase3["Build with mdBook\nLines 169-176"]
Output["Copy to /output\nLines 178-191"]
AutoDetect --> Config
 
   Config --> Phase1
 
   Phase1 --> Phase2
 
   Phase2 --> Phase3
 
   Phase3 --> Output
    
    MarkdownOnly{"MARKDOWN_ONLY\n==true?"}
Phase1 --> MarkdownOnly
 
   MarkdownOnly -->|Yes| Output
 
   MarkdownOnly -->|No| Phase2

This shell script serves as the orchestrator for the three-phase pipeline. Key sections:

Key Environment Variables:

Critical Paths:

Sources: build-docs.sh:1-206

graph TB
    ToolsDir["tools/"]
ToolsDir --> Scraper["deepwiki-scraper.py\n920 lines\nMain extraction logic"]
ToolsDir --> Reqs["requirements.txt\n4 lines\nDependency specification"]
Scraper --> Extract["extract_wiki_structure()\nLines 78-125"]
Scraper --> Content["extract_page_content()\nLines 453-594"]
Scraper --> Enhance["extract_and_enhance_diagrams()\nLines 596-789"]
Scraper --> Main["main()\nLines 790-919"]
Reqs --> Requests["requests>=2.31.0"]
Reqs --> BS4["beautifulsoup4>=4.12.0"]
Reqs --> H2T["html2text>=2020.1.16"]
style ToolsDir fill:#e8f5e9,stroke:#388e3c
    style Scraper fill:#fff4e1,stroke:#f57c00

Tools Directory

The tools/ directory contains Python-specific components that execute within the Docker container.

Directory Structure

Sources: tools/deepwiki-scraper.py:1-920 tools/requirements.txt:1-4

deepwiki-scraper.py

tools/deepwiki-scraper.py:1-920

This is the core Python module responsible for content extraction and diagram enhancement. It operates in three distinct phases within the temp directory.

Function Breakdown:

| Function | Lines | Purpose |
|----------|-------|---------|
| sanitize_filename() | 21-25 | Convert text to safe filename format |
| fetch_page() | 27-42 | HTTP fetcher with retry logic |
| discover_subsections() | 44-76 | Probe for subsection pages |
| extract_wiki_structure() | 78-125 | Build complete page hierarchy |
| clean_deepwiki_footer() | 127-173 | Remove UI elements from markdown |
| convert_html_to_markdown() | 175-216 | HTML→Markdown via html2text |
| extract_mermaid_from_nextjs_data() | 218-331 | Extract diagrams from JS payload |
| extract_page_content() | 453-594 | Main content extraction logic |
| extract_and_enhance_diagrams() | 596-789 | Fuzzy match and inject diagrams |
| main() | 790-919 | Entry point with temp directory management |

Phase Separation:

Temporary Directory Pattern:

The script uses Python's tempfile.TemporaryDirectory() tools/deepwiki-scraper.py808 to create an isolated workspace. All markdown files are first written to this temp directory tools/deepwiki-scraper.py867 then enhanced with diagrams in-place tools/deepwiki-scraper.py:676-788 and finally moved to the output directory tools/deepwiki-scraper.py:897-906 This ensures atomic operations and prevents partial files from appearing in the output.

Sources: tools/deepwiki-scraper.py:1-920

requirements.txt

tools/requirements.txt:1-4

Specifies three production dependencies: requests>=2.31.0, beautifulsoup4>=4.12.0, and html2text>=2020.1.16.

These dependencies are installed via uv pip install during the Docker build Dockerfile17 The uv package manager is used instead of pip for faster, more reliable installations in containerized environments.

Sources: tools/requirements.txt:1-4 Dockerfile:13-17

graph TB
    Output["output/\n(Volume mount point)"]
Output --> Markdown["markdown/\n(Enhanced .md files)"]
Output --> Book["book/\n(HTML documentation)"]
Output --> Config["book.toml\n(mdBook configuration)"]
Markdown --> MainPages["*.md\n(Main pages: 1-overview.md, 2-quick-start.md)"]
Markdown --> Sections["section-*/\n(Subsection directories)"]
Sections --> SubPages["*.md\n(Subsection pages: 2-1-docker.md)"]
Book --> Index["index.html"]
Book --> CSS["css/"]
Book --> JS["mermaid.min.js"]
Book --> Search["searchindex.js"]
style Output fill:#ffe0b2,stroke:#e64a19
    style Markdown fill:#e8f5e9,stroke:#388e3c
    style Book fill:#e1f5ff,stroke:#0288d1

Output Directory (Generated)

The output/ directory is created at runtime and excluded from version control. It contains all generated artifacts.

Output Structure

Sources: build-docs.sh:181-201

Markdown Subdirectory

build-docs.sh:186-188

Contains the enhanced markdown source files organized by hierarchy:

Main Pages (Root Level):

  • Format: {number}-{slug}.md (e.g., 1-overview.md)
  • Location: output/markdown/
  • Example: A page numbered "3" with title "Configuration" becomes 3-configuration.md

Subsection Pages (Nested):

  • Format: section-{main}/ directory containing {number}-{slug}.md files
  • Location: output/markdown/section-{N}/
  • Example: Page "3.2" under section 3 becomes section-3/3-2-environment-variables.md

This hierarchy is created by tools/deepwiki-scraper.py:849-860 based on the page's level field (0 for main pages, 1 for subsections).

Sources: build-docs.sh:186-188 tools/deepwiki-scraper.py:849-860

Book Subdirectory

build-docs.sh184

Contains the complete HTML documentation site generated by mdBook build-docs.sh176 This is a self-contained static website with:

  • Navigation sidebar (from SUMMARY.md)
  • Full-text search (searchindex.js)
  • Mermaid diagram rendering (via mdbook-mermaid build-docs.sh:170-171)
  • Edit-on-GitHub links (from GIT_REPO_URL)
  • Responsive Rust theme build-docs.sh94

The entire book/ directory can be served by any static file server or uploaded to GitHub Pages, Netlify, or similar hosting platforms.

Sources: build-docs.sh:176-184

book.toml Configuration

build-docs.sh:85-103 build-docs.sh:191

The book.toml file is dynamically generated with repository-specific metadata:
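A hedged sketch of what the generated file typically contains; all field values below are placeholders derived from the REPO and GIT_REPO_URL environment variables, and the authoritative template lives in build-docs.sh:85-103:

```toml
[book]
title = "my-repo Documentation"    # placeholder; derived from the repository name
authors = ["my-org"]               # placeholder; derived from the repository owner

[output.html]
default-theme = "rust"
git-repository-url = "https://github.com/my-org/my-repo"   # from GIT_REPO_URL

[preprocessor.mermaid]
command = "mdbook-mermaid"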

This configuration is copied to output/book.toml for reference (build-docs.sh:191).

Sources: build-docs.sh:85-103 build-docs.sh:191

graph TB
    BuildContext["Docker Build Context"]
BuildContext --> Included["Included in Image"]
BuildContext --> Excluded["Excluded"]
Included --> DockerfileBuild["Dockerfile\n(Build instructions)"]
Included --> ToolsCopy["tools/\n(COPY instruction)"]
Included --> ScriptCopy["build-docs.sh\n(COPY instruction)"]
ToolsCopy --> ReqInstall["requirements.txt\n→ uv pip install"]
ToolsCopy --> ScraperInstall["deepwiki-scraper.py\n→ /usr/local/bin/"]
ScriptCopy --> BuildInstall["build-docs.sh\n→ /usr/local/bin/"]
Excluded --> GitIgnored["output/\n(git-ignored)"]
Excluded --> GitFiles[".git/\n(implicit)"]
Excluded --> Readme["README.md\n(not referenced)"]
style BuildContext fill:#f9f9f9,stroke:#333
    style Included fill:#e8f5e9,stroke:#388e3c
    style Excluded fill:#ffebee,stroke:#c62828

Docker Build Context

The Docker build process includes only the files needed for container construction. Understanding this context is important for build optimization.

Build Context Inclusion

Copy Operations:

  1. Dockerfile:16 - COPY tools/requirements.txt /tmp/requirements.txt
  2. Dockerfile:24 - COPY tools/deepwiki-scraper.py /usr/local/bin/
  3. Dockerfile:28 - COPY build-docs.sh /usr/local/bin/

Not Copied:

  • .gitignore - only used by Git
  • output/ - generated at runtime
  • .git/ - version control metadata
  • Any documentation files (README, LICENSE)

Sources: Dockerfile:16-28 .gitignore:1-2

graph TB
    subgraph BuildTime["Build-Time Dependencies"]
DF["Dockerfile"]
Req["tools/requirements.txt"]
Scraper["tools/deepwiki-scraper.py"]
BuildSh["build-docs.sh"]
DF -->|COPY [Line 16]| Req
 
       DF -->|RUN install [Line 17]| Req
 
       DF -->|COPY [Line 24]| Scraper
 
       DF -->|COPY [Line 28]| BuildSh
 
       DF -->|CMD [Line 32]| BuildSh
    end
    
    subgraph Runtime["Run-Time Dependencies"]
BuildShRun["build-docs.sh\n(Entry point)"]
ScraperExec["deepwiki-scraper.py\n(Phase 1-2)"]
MdBook["mdbook\n(Phase 3)"]
MdBookMermaid["mdbook-mermaid\n(Phase 3)"]
BuildShRun -->|python3 [Line 58]| ScraperExec
 
       BuildShRun -->|mdbook-mermaid install [Line 171]| MdBookMermaid
 
       BuildShRun -->|mdbook build [Line 176]| MdBook
        
 
       ScraperExec -->|import requests| Req
 
       ScraperExec -->|import bs4| Req
 
       ScraperExec -->|import html2text| Req
    end
    
    subgraph Generated["Generated Artifacts"]
WikiDir["$WIKI_DIR/\n(Temp markdown)"]
BookToml["book.toml\n(Config)"]
Summary["SUMMARY.md\n(TOC)"]
OutputDir["output/\n(Final artifacts)"]
ScraperExec -->|sys.argv[2]| WikiDir
 
       BuildShRun -->|cat > [Line 85]| BookToml
 
       BuildShRun -->|Lines 113-159| Summary
 
       BuildShRun -->|cp [Lines 184-191]| OutputDir
    end
    
 
   BuildTime --> Runtime
 
   Runtime --> Generated
    
    style DF fill:#e1f5ff,stroke:#0288d1
    style BuildShRun fill:#fff4e1,stroke:#f57c00
    style ScraperExec fill:#e8f5e9,stroke:#388e3c
    style OutputDir fill:#ffe0b2,stroke:#e64a19

File Dependency Graph

This diagram maps the relationships between files and shows which files depend on or reference others.

Sources: Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 tools/requirements.txt:1-4

File Size and Complexity Metrics

Understanding the relative complexity of each component helps developers identify which files require the most attention during modifications.

| File | Lines | Purpose | Complexity |
|---|---|---|---|
| tools/deepwiki-scraper.py | 920 | Content extraction and diagram matching | High |
| build-docs.sh | 206 | Orchestration and configuration | Medium |
| Dockerfile | 33 | Multi-stage build specification | Low |
| tools/requirements.txt | 4 | Dependency list | Minimal |
| .gitignore | 2 | Git exclusion rule | Minimal |

Key Observations:

Sources: tools/deepwiki-scraper.py:1-920 build-docs.sh:1-206 Dockerfile:1-33 tools/requirements.txt:1-4 .gitignore:1-2


Building the Docker Image

Relevant source files

This page provides instructions for building the Docker image locally from source. It covers the build process, multi-stage build architecture, verification steps, and troubleshooting common build issues.

For information about the architectural rationale behind the multi-stage build strategy, see Docker Multi-Stage Build. For information about running the pre-built image, see Quick Start.

Overview

The DeepWiki-to-mdBook converter is packaged as a Docker image that combines Python runtime components with Rust-compiled binaries. Building the image locally requires Docker and typically takes 5-15 minutes depending on network speed and CPU performance. The build process compiles two Rust applications (mdbook and mdbook-mermaid) from source, then creates a minimal Python-based runtime image with these compiled binaries.

Basic Build Command

To build the Docker image from the repository root:
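```sh
docker build -t deepwiki-scraper .
```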

This command reads the Dockerfile at the repository root and produces a tagged image named deepwiki-scraper. The build process automatically executes both stages defined in the Dockerfile.

Sources: Dockerfile:1-33

Build Process Architecture

The following diagram shows the complete build workflow, mapping from natural language concepts to the actual Docker commands and files involved:

Sources: Dockerfile:1-33

graph TD
 
   User[Developer] -->|docker build -t deepwiki-scraper .| DockerCLI["Docker CLI"]
DockerCLI -->|Reads| Dockerfile["Dockerfile\n(repository root)"]
Dockerfile -->|Stage 1: FROM rust:latest AS builder| Stage1["Stage 1: Rust Builder\nImage: rust:latest"]
Dockerfile -->|Stage 2: FROM python:3.12-slim| Stage2["Stage 2: Final Assembly\nImage: python:3.12-slim"]
Stage1 -->|RUN cargo install mdbook| CargoBuildMdBook["cargo install mdbook\n→ /usr/local/cargo/bin/mdbook"]
Stage1 -->|RUN cargo install mdbook-mermaid| CargoBuildMermaid["cargo install mdbook-mermaid\n→ /usr/local/cargo/bin/mdbook-mermaid"]
Stage2 -->|COPY --from=ghcr.io/astral-sh/uv:latest| UVCopy["Copy /uv and /uvx\n→ /bin/"]
Stage2 -->|COPY tools/requirements.txt| ReqCopy["Copy requirements.txt\n→ /tmp/requirements.txt"]
Stage2 -->|RUN uv pip install --system| PythonDeps["Install Python packages:\nrequests, beautifulsoup4, html2text"]
CargoBuildMdBook -->|COPY --from=builder| BinaryCopy1["Copy to /usr/local/bin/mdbook"]
CargoBuildMermaid -->|COPY --from=builder| BinaryCopy2["Copy to /usr/local/bin/mdbook-mermaid"]
Stage2 -->|COPY tools/deepwiki-scraper.py| ScraperCopy["Copy to /usr/local/bin/deepwiki-scraper.py"]
Stage2 -->|COPY build-docs.sh| BuildScriptCopy["Copy to /usr/local/bin/build-docs.sh"]
Stage2 -->|RUN chmod +x| MakeExecutable["Set execute permissions"]
BinaryCopy1 --> FinalImage["Final Image:\ndeepwiki-scraper"]
BinaryCopy2 --> FinalImage
 
   PythonDeps --> FinalImage
 
   ScraperCopy --> FinalImage
 
   BuildScriptCopy --> FinalImage
 
   MakeExecutable --> FinalImage
    
 
   FinalImage -->|CMD| DefaultEntrypoint["/usr/local/bin/build-docs.sh"]

Stage-by-Stage Build Details

Stage 1: Rust Builder

Stage 1 uses the rust:latest base image (approximately 1.5 GB) to compile the Rust applications. This stage is ephemeral and discarded after binary extraction.

graph LR
    subgraph "Stage 1 Build Context"
        BaseImage["rust:latest\n~1.5 GB"]
CargoEnv["Cargo toolchain\nPre-installed"]
BaseImage --> CargoEnv
        
 
       CargoEnv -->|cargo install mdbook| BuildMdBook["Compile mdbook\nfrom crates.io"]
CargoEnv -->|cargo install mdbook-mermaid| BuildMermaid["Compile mdbook-mermaid\nfrom crates.io"]
BuildMdBook --> Binary1["/usr/local/cargo/bin/mdbook\n(~20-30 MB)"]
BuildMermaid --> Binary2["/usr/local/cargo/bin/mdbook-mermaid\n(~10-20 MB)"]
end
    
    subgraph "Extracted Artifacts"
 
       Binary1 -.->|Copied to Stage 2| FinalBin1["/usr/local/bin/mdbook"]
Binary2 -.->|Copied to Stage 2| FinalBin2["/usr/local/bin/mdbook-mermaid"]
end

The cargo install commands download source code from crates.io, compile with optimization flags, and place the resulting binaries in /usr/local/cargo/bin/. This compilation typically takes 3-8 minutes depending on CPU performance.

Key Dockerfile directives:

  • Line 2: FROM rust:latest AS builder - Establishes the builder stage
  • Line 5: RUN cargo install mdbook mdbook-mermaid - Compiles both tools in a single command

Sources: Dockerfile:1-5

Stage 2: Final Image Assembly

Stage 2 creates the production image using python:3.12-slim (approximately 150 MB) as the base and layers in all necessary runtime components:

| Layer | Purpose | Size Impact | Dockerfile Lines |
|---|---|---|---|
| Base image | Python 3.12 runtime | ~150 MB | Line 8 |
| uv package manager | Fast Python dependency installation | ~10 MB | Line 13 |
| Python dependencies | requests, beautifulsoup4, html2text | ~20 MB | Lines 16-17 |
| Rust binaries | mdbook and mdbook-mermaid executables | ~30-50 MB | Lines 20-21 |
| Python scripts | deepwiki-scraper.py | ~10 KB | Lines 24-25 |
| Shell scripts | build-docs.sh orchestrator | ~5 KB | Lines 28-29 |
| Total | Final image size | ~300-400 MB | - |

Key Dockerfile directives:

  • Line 8: FROM python:3.12-slim - Establishes the final stage base
  • Line 13: COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ - Imports uv from external image
  • Lines 20-21: COPY --from=builder - Extracts Rust binaries from Stage 1
  • Line 32: CMD ["/usr/local/bin/build-docs.sh"] - Sets default entrypoint

Sources: Dockerfile:8-33

Python Dependency Installation

The image uses uv instead of pip for faster and more reliable dependency installation. The dependencies are defined in tools/requirements.txt:

requests>=2.31.0
beautifulsoup4>=4.12.0
html2text>=2020.1.16

The installation command uses these flags:

  • --system: Installs packages system-wide (not in a virtual environment)
  • --no-cache: Avoids caching to reduce image size

Sources: Dockerfile:13-17 tools/requirements.txt:1-4

Build Verification

After building the image, verify that all components are correctly installed:
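A sketch of the checks, assuming the image was tagged deepwiki-scraper and defines only a CMD (no ENTRYPOINT), so a trailing command replaces the default build script:

```sh
# Rust binaries should resolve to /usr/local/bin/
docker run --rm deepwiki-scraper which mdbook mdbook-mermaid

# Python dependencies should import cleanly
docker run --rm deepwiki-scraper python3 -c "import requests, bs4, html2text; print('Dependencies OK')"

# Scripts should be executable
docker run --rm deepwiki-scraper ls -l /usr/local/bin/build-docs.sh /usr/local/bin/deepwiki-scraper.py
```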

Expected outputs:

  • which commands should return /usr/local/bin/<binary-name>
  • Python import test should print Dependencies OK
  • Script permissions should show -rwxr-xr-x (executable)

Sources: Dockerfile:20-29

graph TB
    subgraph "Repository Files"
        RepoRoot["Repository Root"]
Dockerfile_Src["Dockerfile"]
BuildScript["build-docs.sh"]
ToolsDir["tools/"]
Scraper["tools/deepwiki-scraper.py"]
Reqs["tools/requirements.txt"]
RepoRoot --> Dockerfile_Src
 
       RepoRoot --> BuildScript
 
       RepoRoot --> ToolsDir
 
       ToolsDir --> Scraper
 
       ToolsDir --> Reqs
    end
    
    subgraph "Stage 1 Build Products"
        CargoOutput["/usr/local/cargo/bin/"]
MdBookBin["mdbook binary"]
MermaidBin["mdbook-mermaid binary"]
CargoOutput --> MdBookBin
 
       CargoOutput --> MermaidBin
    end
    
    subgraph "Final Image Filesystem"
        UsrBin["/usr/local/bin/"]
BinDir["/bin/"]
TmpDir["/tmp/"]
MdBookFinal["/usr/local/bin/mdbook"]
MermaidFinal["/usr/local/bin/mdbook-mermaid"]
BuildFinal["/usr/local/bin/build-docs.sh"]
ScraperFinal["/usr/local/bin/deepwiki-scraper.py"]
UVFinal["/bin/uv"]
UVXFinal["/bin/uvx"]
ReqsFinal["/tmp/requirements.txt"]
UsrBin --> MdBookFinal
 
       UsrBin --> MermaidFinal
 
       UsrBin --> BuildFinal
 
       UsrBin --> ScraperFinal
 
       BinDir --> UVFinal
 
       BinDir --> UVXFinal
 
       TmpDir --> ReqsFinal
    end
    
 
   MdBookBin -.->|COPY --from=builder| MdBookFinal
 
   MermaidBin -.->|COPY --from=builder| MermaidFinal
 
   BuildScript -.->|COPY| BuildFinal
 
   Scraper -.->|COPY| ScraperFinal
 
   Reqs -.->|COPY| ReqsFinal

File and Binary Locations in Final Image

The following diagram maps the repository structure to the final image filesystem layout:

Sources: Dockerfile:13-28

Common Build Issues and Solutions

Issue: Cargo Installation Timeout

Symptom: Build fails during Stage 1 with network timeout errors:

error: failed to download `mdbook`

Solution: Increase Docker build timeout or retry the build. The crates.io registry occasionally experiences high load.

Issue: Out of Disk Space

Symptom: Build fails with "no space left on device" error.

Solution: The Rust builder stage requires approximately 2-3 GB of temporary space. Clean up Docker resources:
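One way to reclaim space with standard Docker commands:

```sh
# Drop unused build cache, then unused containers, networks, and dangling images
docker builder prune
docker system prune
```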

Issue: Platform Mismatch

Symptom: Built image doesn't run on target platform (e.g., building on ARM Mac but running on x86_64 Linux).

Solution: Specify the target platform explicitly:
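For example, to produce an x86_64 Linux image regardless of the host architecture:

```sh
docker build --platform linux/amd64 -t deepwiki-scraper .
```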

Note: Cross-platform builds require QEMU emulation and will be significantly slower.

Issue: Python Dependency Installation Fails

Symptom: Stage 2 fails during uv pip install:

error: Failed to download distribution

Solution: Check network connectivity and retry. If issues persist, build without cache:
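```sh
docker build --no-cache -t deepwiki-scraper .
```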

Sources: Dockerfile:16-17

Build Customization Options

Building with Different Python Version

To use a different Python version, modify line 8 of the Dockerfile:
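For example (3.11 is an arbitrary choice; any python:<version>-slim tag works, subject to dependency compatibility):

```dockerfile
FROM python:3.11-slim
```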

Then rebuild:
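```sh
docker build -t deepwiki-scraper .
```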

Building with Specific mdBook Versions

To pin specific versions of the Rust tools, modify line 5 of the Dockerfile:
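A sketch of the pinned form; the version numbers below are placeholders, so substitute the releases you want:

```dockerfile
RUN cargo install mdbook --version 0.4.40 && \
    cargo install mdbook-mermaid --version 0.13.0
```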

Reducing Build Time for Development

During development, you can cache the Rust builder stage by building it separately:
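A sketch using the named builder stage from the Dockerfile (the intermediate tag name is arbitrary):

```sh
# Build and tag only the Rust builder stage so its layers stay in the cache
docker build --target builder -t deepwiki-scraper-builder .

# Subsequent full builds reuse the cached builder layers
docker build -t deepwiki-scraper .
```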

Sources: Dockerfile:2-8

Image Size Analysis

The following table breaks down the final image size by component:

| Component | Approximate Size | Optimization Notes |
|---|---|---|
| python:3.12-slim base | 150 MB | Minimal Python distribution |
| System libraries (libc, etc.) | 20 MB | Required by Python and binaries |
| Python packages | 15-20 MB | requests, beautifulsoup4, html2text |
| uv package manager | 8-10 MB | Faster than pip |
| mdbook binary | 20-30 MB | Statically linked Rust binary |
| mdbook-mermaid binary | 10-20 MB | Statically linked Rust binary |
| Python scripts | 50-100 KB | deepwiki-scraper.py |
| Shell scripts | 5-10 KB | build-docs.sh |
| Total | ~300-400 MB | Multi-stage build discards ~1.5 GB |

The multi-stage build reduces the image size by approximately 75% compared to a single-stage build that would include the entire Rust toolchain.

Sources: Dockerfile:2-8

Building for Production

For production deployments, consider these additional steps; a combined example sketch follows the list:

  1. Tag with version numbers:

  2. Scan for vulnerabilities:

  3. Push to registry:

  4. Generate SBOM (Software Bill of Materials):
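A hedged sketch covering all four steps. The version tag and registry name are placeholders, and Trivy and Syft are shown only as examples of a scanner and an SBOM generator:

```sh
# 1. Tag with a version number
docker build -t deepwiki-scraper:v1.0.0 .

# 2. Scan for vulnerabilities (example scanner)
trivy image deepwiki-scraper:v1.0.0

# 3. Push to a registry (replace with your registry/namespace)
docker tag deepwiki-scraper:v1.0.0 registry.example.com/deepwiki-scraper:v1.0.0
docker push registry.example.com/deepwiki-scraper:v1.0.0

# 4. Generate an SBOM (example tool)
syft deepwiki-scraper:v1.0.0 -o spdx-json > sbom.spdx.json
```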

Sources: Dockerfile:1-33


Python Dependencies

Relevant source files

This page documents the Python dependencies required by the deepwiki-scraper.py script, including their purposes, version requirements, and how they are used throughout the content extraction and conversion pipeline. For information about the scraper script itself, see deepwiki-scraper.py. For details about how Rust dependencies (mdBook and mdbook-mermaid) are installed, see Docker Multi-Stage Build.

Dependencies Overview

The system requires three core Python libraries for web scraping and HTML-to-Markdown conversion:

| Package | Minimum Version | Primary Purpose |
|---|---|---|
| requests | 2.31.0 | HTTP client for fetching web pages |
| beautifulsoup4 | 4.12.0 | HTML parsing and DOM manipulation |
| html2text | 2020.1.16 | HTML to Markdown conversion |

These dependencies are declared in tools/requirements.txt:1-3 and installed during Docker image build using the uv package manager.

Sources: tools/requirements.txt:1-3 Dockerfile:16-17

Dependency Usage Flow

The following diagram illustrates how each Python dependency is used across the three-phase processing pipeline:

Sources: tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-788

flowchart TD
    subgraph "Phase 1: Markdown Extraction"
        FetchPage["fetch_page()\n[tools/deepwiki-scraper.py:27-42]"]
ExtractStruct["extract_wiki_structure()\n[tools/deepwiki-scraper.py:78-125]"]
ExtractContent["extract_page_content()\n[tools/deepwiki-scraper.py:453-594]"]
ConvertHTML["convert_html_to_markdown()\n[tools/deepwiki-scraper.py:175-190]"]
end
    
    subgraph "Phase 2: Diagram Enhancement"
        ExtractDiagrams["extract_and_enhance_diagrams()\n[tools/deepwiki-scraper.py:596-788]"]
end
    
    subgraph "requests Library"
        Session["requests.Session()"]
GetMethod["session.get()"]
HeadMethod["session.head()"]
end
    
    subgraph "BeautifulSoup4 Library"
        BS4Parser["BeautifulSoup(html, 'html.parser')"]
FindAll["soup.find_all()"]
Select["soup.select()"]
Decompose["element.decompose()"]
end
    
    subgraph "html2text Library"
        H2TClass["html2text.HTML2Text()"]
HandleMethod["h.handle()"]
end
    
 
   FetchPage --> Session
 
   FetchPage --> GetMethod
 
   ExtractStruct --> GetMethod
 
   ExtractStruct --> BS4Parser
 
   ExtractStruct --> FindAll
    
 
   ExtractContent --> GetMethod
 
   ExtractContent --> BS4Parser
 
   ExtractContent --> Select
 
   ExtractContent --> Decompose
 
   ExtractContent --> ConvertHTML
    
 
   ConvertHTML --> H2TClass
 
   ConvertHTML --> HandleMethod
    
 
   ExtractDiagrams --> GetMethod

requests

The requests library provides HTTP client functionality for fetching web pages from DeepWiki.com. It is imported at tools/deepwiki-scraper.py17 and used throughout the scraper.

Key Usage Patterns

Session Management: A requests.Session() object is created at tools/deepwiki-scraper.py:818-821 to maintain connection pooling and share headers across multiple requests:
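A minimal sketch of the pattern; the header values are illustrative rather than the script's exact ones:

```python
import requests

session = requests.Session()
session.headers.update({
    # Browser-like headers shared by every request made through this session
    "User-Agent": "Mozilla/5.0 (compatible; docs-scraper)",
})
```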

HTTP GET Requests: The fetch_page() function at tools/deepwiki-scraper.py:27-42 uses session.get() with retry logic, browser-like headers, and 30-second timeout to fetch HTML content.

HTTP HEAD Requests: The discover_subsections() function at tools/deepwiki-scraper.py:44-76 uses session.head() to efficiently check for page existence without downloading full content.

Configuration Options

The library is configured with:

  • A shared requests.Session() for connection pooling and common headers
  • Browser-like request headers
  • A 30-second timeout on each request
  • Retry logic around failed GET requests
  • Lightweight HEAD requests for subsection existence checks

Sources: tools/deepwiki-scraper.py17 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:818-821

BeautifulSoup4

The beautifulsoup4 library (imported as bs4) provides HTML parsing and DOM manipulation capabilities. It is imported at tools/deepwiki-scraper.py18 as from bs4 import BeautifulSoup.

Parser Selection

BeautifulSoup is instantiated with the built-in html.parser backend at multiple locations:
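The instantiation pattern is the standard one:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
```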

This parser choice avoids external dependencies (lxml, html5lib) and provides sufficient functionality for well-formed HTML.

flowchart LR
    subgraph "Navigation Methods"
        FindAll["soup.find_all()"]
Find["soup.find()"]
Select["soup.select()"]
SelectOne["soup.select_one()"]
end
    
    subgraph "Usage in extract_wiki_structure()"
        StructLinks["Find wiki page links\n[line 90]"]
end
    
    subgraph "Usage in extract_page_content()"
        RemoveNav["Remove navigation elements\n[line 466]"]
FindContent["Locate main content area\n[line 473-485]"]
RemoveUI["Remove DeepWiki UI elements\n[line 491-511]"]
end
    
 
   FindAll --> StructLinks
 
   FindAll --> RemoveUI
    
 
   Select --> RemoveNav
 
   SelectOne --> FindContent
    
 
   Find --> FindContent

DOM Navigation Methods

The following diagram maps BeautifulSoup methods to their usage contexts in the codebase:

Sources: tools/deepwiki-scraper.py18 tools/deepwiki-scraper.py84 tools/deepwiki-scraper.py90 tools/deepwiki-scraper.py463 tools/deepwiki-scraper.py:466-511

Content Manipulation

Element Removal: The element.decompose() method permanently removes elements from the DOM tree:
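An illustrative sketch only; the elements actually removed are selected at tools/deepwiki-scraper.py:466-511:

```python
# Strip navigation elements before conversion (example selector)
for nav in soup.find_all("nav"):
    nav.decompose()  # removes the element and its children from the tree
```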

CSS Selectors: BeautifulSoup's select() and select_one() methods support CSS selector syntax for finding content areas:
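An illustrative sketch; the real selectors live at the reference below:

```python
# Example selectors only
content = soup.select_one("article") or soup.select_one("main")
code_blocks = soup.select("pre code")
```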

tools/deepwiki-scraper.py:473-476

Attribute-Based Selection: The find() method with attrs parameter locates elements by ARIA roles:
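An illustrative sketch; the actual lookup is at the reference below:

```python
# Example only: locate the main content region by ARIA role
main_area = soup.find("main", attrs={"role": "main"})
```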

tools/deepwiki-scraper.py480

Text Extraction

BeautifulSoup's get_text() method extracts plain text from elements:
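For example (assuming the element exists):

```python
heading_text = soup.find("h1").get_text(strip=True)
```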

Sources: tools/deepwiki-scraper.py:466-511

html2text

The html2text library converts HTML content to Markdown format. It is imported at tools/deepwiki-scraper.py19 and used exclusively in the convert_html_to_markdown() function.

Configuration

An HTML2Text instance is created with specific configuration at tools/deepwiki-scraper.py:178-180:
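Reconstructed from the key settings listed below (the variable name h matches the usage shown elsewhere in this page):

```python
h = html2text.HTML2Text()
h.ignore_links = False   # keep hyperlinks as Markdown links
h.body_width = 0         # disable automatic wrapping at 80 columns
```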

Key Settings:

  • ignore_links = False : Preserves hyperlinks as Markdown [text](url) link syntax
  • body_width = 0 : Disables automatic line wrapping at 80 characters, preserving original formatting

Conversion Process

The handle() method at tools/deepwiki-scraper.py181 performs the actual HTML-to-Markdown conversion:
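```python
# html is the cleaned HTML string produced by the BeautifulSoup pass
markdown = h.handle(html)
```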

This processes the cleaned HTML (after BeautifulSoup removes unwanted elements) and produces Markdown text with:

  • Headers converted to # syntax
  • Links converted to [text](url) format
  • Lists converted to - or 1. format
  • Bold/italic formatting preserved
  • Code blocks and inline code preserved

Post-Processing

The conversion output undergoes additional cleanup at tools/deepwiki-scraper.py188:

Sources: tools/deepwiki-scraper.py19 tools/deepwiki-scraper.py:175-190

flowchart TD
    subgraph "Dockerfile Stage 2"
        BaseImage["FROM python:3.12-slim\n[Dockerfile:8]"]
CopyUV["COPY uv from astral-sh image\n[Dockerfile:13]"]
CopyReqs["COPY tools/requirements.txt\n[Dockerfile:16]"]
InstallDeps["RUN uv pip install --system\n[Dockerfile:17]"]
end
    
    subgraph "requirements.txt"
        Requests["requests>=2.31.0"]
BS4["beautifulsoup4>=4.12.0"]
HTML2Text["html2text>=2020.1.16"]
end
    
 
   BaseImage --> CopyUV
 
   CopyUV --> CopyReqs
 
   CopyReqs --> InstallDeps
    
 
   Requests --> InstallDeps
 
   BS4 --> InstallDeps
 
   HTML2Text --> InstallDeps

Installation Process

The dependencies are installed during Docker image build using the uv package manager, which provides fast, reliable Python package installation.

Multi-Stage Build Integration

Sources: Dockerfile8 Dockerfile13 Dockerfile:16-17 tools/requirements.txt:1-3

Installation Command

The dependencies are installed with a single uv pip install command at Dockerfile17:
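Reconstructed from the flags described below:

```dockerfile
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
```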

Flags:

  • --system : Installs into system Python, not a virtual environment
  • --no-cache : Avoids caching to reduce Docker image size
  • -r /tmp/requirements.txt : Specifies requirements file path

The uv tool is significantly faster than standard pip and provides deterministic dependency resolution, making builds more reliable and reproducible.

Sources: Dockerfile:16-17

Version Requirements

The minimum version constraints specified in tools/requirements.txt:1-3 ensure compatibility with required features:

requests >= 2.31.0

This version requirement ensures:

  • Security fixes : Addresses CVE-2023-32681 (Proxy-Authorization header leakage)
  • Session improvements : Enhanced connection pooling and retry mechanisms
  • HTTP/2 support : Better performance for multiple requests

The codebase relies on stable Session API behavior introduced in 2.x releases.

beautifulsoup4 >= 4.12.0

This version requirement ensures:

  • Python 3.12 compatibility : Required for the base image python:3.12-slim
  • Parser stability : Consistent behavior with html.parser backend
  • Security updates : Protection against XML parsing vulnerabilities

The codebase uses standard find/select methods that are stable across 4.x versions.

html2text >= 2020.1.16

This version requirement ensures:

  • Python 3 compatibility : Earlier versions targeted Python 2.7
  • Markdown formatting fixes : Improved handling of nested lists and code blocks
  • Link preservation : Proper conversion of HTML links to Markdown syntax

The codebase uses the body_width=0 configuration which was stabilized in this version.

Sources: tools/requirements.txt:1-3

Import Locations

All three dependencies are imported at the top of deepwiki-scraper.py:
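```python
import requests
from bs4 import BeautifulSoup
import html2text
```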

These are the only external dependencies required by the Python layer. The script uses only standard library modules for all other functionality (sys, re, time, pathlib, urllib.parse).

Sources: tools/deepwiki-scraper.py:17-19