Overview
Purpose and Scope
This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system's purpose, core capabilities, and high-level architecture.
For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.
Sources: README.md:1-3
Problem Statement
DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The system addresses the following limitations:
| Problem | Solution |
|---|---|
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook's built-in search |
| Platform-specific formatting | Conversion to standard Markdown |
Sources: README.md:3-15
Core Capabilities
The system provides the following capabilities through environment variable configuration:
- Generic Repository Support : Works with any GitHub repository indexed by DeepWiki via the REPO environment variable
- Auto-Detection : Extracts repository metadata from Git remotes when available
- Hierarchy Preservation : Maintains wiki page numbering and section structure
- Diagram Intelligence : Extracts ~461 total diagrams and matches ~48 with sufficient context using fuzzy matching
- Dual Output Modes : Full mdBook build or markdown-only extraction via the MARKDOWN_ONLY flag
- No Authentication : Public HTTP scraping without API keys or credentials
- Containerized Deployment : Single Docker image with all dependencies
Sources: README.md:5-15 README.md:42-51
System Components
The system consists of three primary executable components coordinated by a shell orchestrator:
Main Components
graph TB
User["docker run"]
subgraph Container["deepwiki-scraper Container"]
BuildDocs["build-docs.sh\n(Shell Orchestrator)"]
Scraper["deepwiki-scraper.py\n(Python)"]
MdBook["mdbook\n(Rust Binary)"]
MermaidPlugin["mdbook-mermaid\n(Rust Binary)"]
end
subgraph External["External Systems"]
DeepWiki["deepwiki.com\n(HTTP Scraping)"]
GitHub["github.com\n(Edit Links)"]
end
subgraph Output["Output Directory"]
MarkdownDir["markdown/\n(.md files)"]
BookDir["book/\n(HTML site)"]
ConfigFile["book.toml"]
end
User -->|Environment Variables| BuildDocs
BuildDocs -->|Step 1: Execute| Scraper
BuildDocs -->|Step 4: Execute| MdBook
Scraper -->|HTTP GET| DeepWiki
Scraper -->|Writes| MarkdownDir
MdBook -->|Preprocessor| MermaidPlugin
MdBook -->|Generates| BookDir
BookDir -.->|Edit links| GitHub
BuildDocs -->|Copies| ConfigFile
style BuildDocs fill:#fff4e1
style Scraper fill:#e8f5e9
style MdBook fill:#f3e5f5
| Component | Language | Purpose | Key Functions |
|---|---|---|---|
build-docs.sh | Shell | Orchestration | Parse env vars, generate configs, call executables |
deepwiki-scraper.py | Python 3.12 | Content extraction | HTTP scraping, HTML parsing, diagram matching |
mdbook | Rust | Site generation | Markdown to HTML, navigation, search |
mdbook-mermaid | Rust | Diagram rendering | Inject JavaScript/CSS for Mermaid.js |
Sources: README.md:146-157 Diagram 1, Diagram 5
Processing Pipeline
The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable:
Phase Details
stateDiagram-v2
[*] --> ParseEnvVars
ParseEnvVars --> ExecuteScraper : build-docs.sh phase 1
state ExecuteScraper {
[*] --> FetchHTML
FetchHTML --> ConvertMarkdown : html2text
ConvertMarkdown --> ExtractDiagrams : Regex on JS payload
ExtractDiagrams --> FuzzyMatch : Progressive chunks
FuzzyMatch --> WriteMarkdown : output/markdown/
WriteMarkdown --> [*]
}
ExecuteScraper --> CheckMode
state CheckMode <<choice>>
CheckMode --> GenerateBookToml : MARKDOWN_ONLY=false
CheckMode --> CopyOutput : MARKDOWN_ONLY=true
GenerateBookToml --> GenerateSummary : build-docs.sh phase 2
GenerateSummary --> ExecuteMdbook : build-docs.sh phase 3
state ExecuteMdbook {
[*] --> InitBook
InitBook --> CopyMarkdown : mdbook init
CopyMarkdown --> InstallMermaid : mdbook-mermaid install
InstallMermaid --> BuildHTML : mdbook build
BuildHTML --> [*] : output/book/
}
ExecuteMdbook --> CopyOutput
CopyOutput --> [*]
| Phase | Script | Key Operations | Output |
|---|---|---|---|
| 1 | deepwiki-scraper.py | HTTP fetch, BeautifulSoup4 parse, html2text conversion, fuzzy diagram matching | markdown/*.md |
| 2 | build-docs.sh | Generate book.toml, generate SUMMARY.md | Configuration files |
| 3 | mdbook + mdbook-mermaid | Markdown processing, Mermaid.js asset injection, HTML generation | book/ directory |
Sources: README.md:121-145 Diagram 2
Input and Output
Input Requirements
| Input | Format | Source | Example |
|---|---|---|---|
REPO | owner/repo | Environment variable | facebook/react |
BOOK_TITLE | String | Environment variable (optional) | React Documentation |
BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
MARKDOWN_ONLY | true/false | Environment variable (optional) | false |
Sources: README.md:42-51
Output Artifacts
Full Build Mode (MARKDOWN_ONLY=false or unset):
output/
├── markdown/
│ ├── 1-overview.md
│ ├── 2-quick-start.md
│ ├── section-3/
│ │ ├── 3-1-workspace.md
│ │ └── 3-2-parser.md
│ └── ...
├── book/
│ ├── index.html
│ ├── searchindex.json
│ ├── mermaid.min.js
│ └── ...
└── book.toml
Markdown-Only Mode (MARKDOWN_ONLY=true):
output/
└── markdown/
├── 1-overview.md
├── 2-quick-start.md
└── ...
Sources: README.md:89-119
Technical Stack
The system combines multiple technology stacks in a single container using Docker multi-stage builds:
Runtime Dependencies
| Component | Version | Purpose | Installation Method |
|---|---|---|---|
| Python | 3.12-slim | Scraping runtime | Base image |
requests | Latest | HTTP client | uv pip install |
beautifulsoup4 | Latest | HTML parser | uv pip install |
html2text | Latest | HTML to Markdown | uv pip install |
mdbook | Latest | Documentation builder | Compiled from source (Rust) |
mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |
Build Architecture
The Dockerfile uses a two-stage build:
- Stage 1 (rust:latest): Compiles the mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
- Stage 2 (python:3.12-slim): Copies the binaries into the Python runtime (~300-400 MB final image)
Sources: README.md:146-157 Diagram 3
File System Interaction
The system interacts with three key filesystem locations:
Temporary Directory Workflow :
- deepwiki-scraper.py writes initial markdown to /tmp/wiki_temp/
- After diagram enhancement, files move atomically to /output/markdown/
- build-docs.sh copies the final HTML to /output/book/
This ensures no partial states exist in the output directory.
Sources: README.md:220-227 README.md136
Configuration Philosophy
The system operates on three configuration principles:
- Environment-Driven : All customization via environment variables, no file editing required
- Auto-Detection : Intelligent defaults from Git remotes (repository URL, author name)
- Zero-Configuration : Minimal required inputs (REPO, or auto-detection from the current directory)
Minimal Example :
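A representative invocation, assuming the image has been built locally under the tag deepwiki-scraper (the tag is an assumption, not specified in this section):

```bash
# Only REPO is required; the image tag is a placeholder
docker run --rm \
  -e REPO="owner/repo" \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
```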
This single command triggers the complete extraction, transformation, and build pipeline.
For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.
Sources: README.md:22-51 README.md:220-227
Quick Start
This page provides practical instructions for running the DeepWiki-to-mdBook Converter using Docker. It covers building the image, running basic conversions, and accessing the output. For detailed configuration options, see Configuration Reference. For understanding what happens internally, see System Architecture.
Prerequisites
The following must be available on your system:
| Requirement | Purpose |
|---|---|
| Docker | Runs the containerized conversion system |
| Internet connection | Required to fetch content from DeepWiki.com |
| Disk space | ~500MB for Docker image, variable for output |
Sources: README.md:17-20
Building the Docker Image
The system is distributed as a Dockerfile that must be built before use. The build process compiles Rust tools (mdBook, mdbook-mermaid) and installs Python dependencies.
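Assuming the Dockerfile sits at the repository root and the image is tagged deepwiki-scraper (the tag is a placeholder), a typical build command is:

```bash
# Build from the repository root; the tag name is a placeholder
docker build -t deepwiki-scraper .
```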
The build process uses multi-stage Docker builds and takes approximately 5-10 minutes on first run. Subsequent builds use Docker layer caching for faster completion.
Note: For detailed information about the Docker build architecture, see Docker Multi-Stage Build.
Sources: README.md:29-31 build-docs.sh:1-5
Basic Usage Pattern
The converter runs as a Docker container that takes environment variables as input and produces output files in a mounted volume.
sequenceDiagram
participant User
participant Docker
participant Container as "deepwiki-scraper\ncontainer"
participant DeepWiki as "deepwiki.com"
participant OutputVol as "/output volume"
User->>Docker: docker run --rm -e REPO=...
Docker->>Container: Start with env vars
Container->>Container: build-docs.sh orchestrates
Container->>DeepWiki: HTTP requests for wiki pages
DeepWiki-->>Container: HTML content + JS payload
Container->>Container: deepwiki-scraper.py extracts
Container->>Container: mdbook build (unless MARKDOWN_ONLY)
Container->>OutputVol: Write markdown/ and book/
Container-->>Docker: Exit (status 0)
Docker-->>User: Container removed (--rm)
User->>OutputVol: Access generated files
User Interaction Flow
Sources: README.md:24-39 build-docs.sh:1-206
Minimal Command
The absolute minimum command requires only the REPO environment variable:
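The following sketch assumes the image was built locally as deepwiki-scraper:

```bash
# Minimal invocation: REPO is the only required variable
docker run --rm \
  -e REPO="owner/repo" \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
```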
This command:
- Uses -e REPO="owner/repo" to specify which GitHub repository's wiki to extract
- Mounts the current directory's output/ subdirectory to /output in the container
- Uses --rm to automatically remove the container after completion
- Generates default values for BOOK_TITLE, BOOK_AUTHORS, and GIT_REPO_URL
Sources: README.md:34-38 build-docs.sh:22-26
Environment Variable Configuration
Sources: build-docs.sh:8-53 README.md:42-51
The following table describes each environment variable:
| Variable | Required | Default Behavior | Example |
|---|---|---|---|
REPO | Yes* | Auto-detected from Git remote if available | facebook/react |
BOOK_TITLE | No | "Documentation" | "React Internals" |
BOOK_AUTHORS | No | Extracted from REPO owner | "Meta Open Source" |
GIT_REPO_URL | No | Constructed as https://github.com/{REPO} | Custom fork URL |
MARKDOWN_ONLY | No | "false" (build full HTML) | "true" for debugging |
*REPO is required unless running from a Git repository with a GitHub remote, in which case it is auto-detected via build-docs.sh:8-19
Sources: README.md:42-51 build-docs.sh:8-53
Common Usage Patterns
Pattern 1: Complete Documentation Build
Generate both Markdown source and HTML documentation:
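A representative command (the image tag and the BOOK_TITLE value are placeholders):

```bash
docker run --rm \
  -e REPO="owner/repo" \
  -e BOOK_TITLE="My Project Documentation" \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
```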
Produces:
- /output/markdown/ - Source Markdown files with diagrams
- /output/book/ - Complete HTML site with search and navigation
- /output/book.toml - mdBook configuration
Use when: You want a deployable documentation website.
Sources: README.md:74-87 build-docs.sh:178-192
Pattern 2: Markdown-Only Mode (Fast Iteration)
Extract only Markdown files, skipping the HTML build phase:
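A sketch with the same image-tag assumption as above, adding the MARKDOWN_ONLY flag:

```bash
docker run --rm \
  -e REPO="owner/repo" \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
```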
Produces:
- /output/markdown/ - Source Markdown files only
Use when:
- Debugging diagram placement
- Testing content extraction
- You only need Markdown files
- Faster iteration cycles (~3-5x faster than full build)
Skips: Phase 3 (mdBook build), as controlled by build-docs.sh:61-76
Sources: README.md:55-72 build-docs.sh:61-76
Pattern 3: Custom Output Directory
Mount a different output location:
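For example (the repository and image tag are placeholders), mounting /home/user/docs/rust as the output volume:

```bash
docker run --rm \
  -e REPO="owner/repo" \
  -v /home/user/docs/rust:/output \
  deepwiki-scraper
```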
This writes output to /home/user/docs/rust instead of ./output.
Sources: README.md:200-207
Pattern 4: Minimal Configuration with Auto-Detection
If running from a Git repository directory:
The system extracts the repository from git config --get remote.origin.url via build-docs.sh:8-19. This only works when running the Docker command from within a Git repository with a GitHub remote configured.
Sources: build-docs.sh:8-19 README.md53
Output Structure
Complete Build Output
When MARKDOWN_ONLY=false (default), the output structure is:
output/
├── markdown/ # Source Markdown files
│ ├── 1-overview.md
│ ├── 2-quick-start.md
│ ├── section-3/
│ │ ├── 3-1-subsection.md
│ │ └── 3-2-subsection.md
│ └── ...
├── book/ # Generated HTML documentation
│ ├── index.html
│ ├── 1-overview.html
│ ├── searchindex.js
│ ├── mermaid/ # Diagram rendering assets
│ └── ...
└── book.toml # mdBook configuration
Sources: README.md:89-120 build-docs.sh:178-192
File Naming Convention
Files follow the pattern {number}-{title}.md where:
- {number} is the hierarchical page number (e.g., 1, 2-1, 3-2)
- {title} is a URL-safe version of the page title
Subsection files are organized in section-{N}/ subdirectories, where {N} is the parent section number.
Examples from README.md:115-119:
- 1-overview.md - Top-level page 1
- 2-1-workspace-and-crates.md - Subsection 1 of section 2
- section-4/4-1-logical-planning.md - Subsection 1 of section 4, stored in a subdirectory
Sources: README.md:115-119
Viewing the Output
Serving HTML Documentation Locally
After a complete build, serve the HTML site using Python's built-in HTTP server:
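For example, from the directory containing the generated output:

```bash
cd output/book
python3 -m http.server 8000
```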
Then open http://localhost:8000 in your browser.
The generated site includes:
- Full-text search via searchindex.js
- Responsive navigation sidebar with page hierarchy
- Rendered Mermaid diagrams
- "Edit this page" links to the GitHub repository
- Dark/light theme toggle
Sources: README.md:83-86 build-docs.sh:203-204
Accessing Markdown Files
Markdown files can be read directly or used with other tools:
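For example (paths assume the default ./output mount; pandoc is just one possible downstream tool):

```bash
# Read a page directly
less output/markdown/1-overview.md

# Or feed it to another tool, e.g. pandoc
pandoc output/markdown/1-overview.md -o overview.html
```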
Sources: README.md:100-113
Execution Flow
Sources: build-docs.sh:1-206 README.md:121-145
Quick Troubleshooting
"REPO must be set"
Error message: ERROR: REPO must be set or run from within a Git repository
Cause: The REPO environment variable was not provided and could not be auto-detected.
Solution: Pass the variable explicitly with -e REPO="owner/repo" (see the Minimal Command section above).
Sources: build-docs.sh:32-37
"No wiki pages found"
Cause: The repository may not be indexed by DeepWiki.
Solution: Verify the wiki exists by visiting https://deepwiki.com/owner/repo in a browser. Not all GitHub repositories have DeepWiki documentation.
Sources: README.md:160-161
Connection Timeouts
Cause: Network issues or DeepWiki service unavailable.
Solution: The scraper includes automatic retries (3 attempts per page). Wait and retry the command. Check your internet connection.
Sources: README.md:171-172
mdBook Build Fails
Error symptoms: The build completes Phases 1 and 2 but fails during Phase 3.
Solutions:
- Ensure Docker has sufficient memory (2GB+ recommended)
- Try MARKDOWN_ONLY=true to verify extraction works independently (see Pattern 2 above)
- Check Docker logs for Rust compilation errors
Sources: README.md:174-177
Diagrams Not Appearing
Cause: Fuzzy matching may not find appropriate placement context for some diagrams.
Debugging approach: run the scraper in markdown-only mode and inspect the generated Markdown files directly, as in the sketch below.
Not all diagrams can be matched; typically only ~48 out of ~461 extracted diagrams have sufficient context for accurate placement.
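A sketch of that workflow, reusing the markdown-only invocation from Pattern 2 (image tag is a placeholder) and then listing which files received diagrams:

```bash
# Re-run extraction only, then list the files that contain mermaid blocks
docker run --rm -e REPO="owner/repo" -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" deepwiki-scraper
grep -rl '```mermaid' output/markdown/
```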
Sources: README.md:166-169 README.md:132-135
Next Steps
After successfully generating documentation:
- Review Configuration Reference for advanced configuration options
- Explore System Architecture to understand the three-phase processing model
- See Output Structure for detailed information about generated files
- Read Markdown-Only Mode for debugging and iteration workflows
Sources: README.md:1-233
Configuration Reference
This document provides a comprehensive reference for all configuration options available in the DeepWiki-to-mdBook Converter system. It covers environment variables, their default values, validation logic, auto-detection features, and how configuration flows through the system components.
For information about running the system with these configurations, see Quick Start. For details on how auto-detection works internally, see Auto-Detection Features.
Configuration System Overview
The DeepWiki-to-mdBook Converter uses environment variables as its sole configuration mechanism. All configuration is processed by the build-docs.sh orchestrator script at runtime, with no configuration files required. The system provides intelligent defaults and auto-detection capabilities to minimize required configuration.
Configuration Flow Diagram
flowchart TD
User["User/CI System"]
Docker["docker run -e VAR=value"]
subgraph "build-docs.sh Configuration Processing"
AutoDetect["Git Auto-Detection\n[build-docs.sh:8-19]"]
ParseEnv["Environment Variable Parsing\n[build-docs.sh:21-26]"]
Defaults["Default Value Assignment\n[build-docs.sh:43-45]"]
Validate["Validation\n[build-docs.sh:32-37]"]
end
subgraph "Configuration Consumers"
Scraper["deepwiki-scraper.py\nREPO parameter"]
BookToml["book.toml Generation\n[build-docs.sh:85-103]"]
SummaryGen["SUMMARY.md Generation\n[build-docs.sh:113-159]"]
end
User -->|Set environment variables| Docker
Docker -->|Container startup| AutoDetect
AutoDetect -->|REPO detection| ParseEnv
ParseEnv -->|Parse all vars| Defaults
Defaults -->|Apply defaults| Validate
Validate -->|REPO validated| Scraper
Validate -->|BOOK_TITLE, BOOK_AUTHORS, GIT_REPO_URL| BookToml
Validate -->|No direct config needed| SummaryGen
Sources: build-docs.sh:1-206 README.md:41-51
Environment Variables Reference
The following table lists all environment variables supported by the system:
| Variable | Type | Required | Default | Description |
|---|---|---|---|---|
REPO | String | Conditional | Auto-detected from Git remote | GitHub repository in owner/repo format. Required if not running in a Git repository with a GitHub remote. |
BOOK_TITLE | String | No | "Documentation" | Title displayed in the generated mdBook documentation. Used in book.toml title field. |
BOOK_AUTHORS | String | No | Repository owner (from REPO) | Author name(s) displayed in the documentation. Used in book.toml authors array. |
GIT_REPO_URL | String | No | https://github.com/{REPO} | Full GitHub repository URL. Used for "Edit this page" links in mdBook output. |
MARKDOWN_ONLY | Boolean | No | "false" | When "true", skips Phase 3 (mdBook build) and outputs only extracted Markdown files. Useful for debugging. |
Sources: build-docs.sh:21-26 README.md:44-51
Variable Details and Usage
REPO
Format: owner/repo (e.g., "facebook/react" or "microsoft/vscode")
Purpose: Identifies the GitHub repository to scrape from DeepWiki.com. This is the primary configuration variable that drives the entire system.
flowchart TD
Start["build-docs.sh Startup"]
CheckEnv{"REPO environment\nvariable set?"}
UseEnv["Use provided REPO value\n[build-docs.sh:22]"]
CheckGit{"Git repository\ndetected?"}
GetRemote["Execute: git config --get\nremote.origin.url\n[build-docs.sh:12]"]
ParseURL["Extract owner/repo using regex:\n.*github\\.com[:/]([^/]+/[^/\\.]+)\n[build-docs.sh:16]"]
SetRepo["Set REPO variable\n[build-docs.sh:16]"]
ValidateRepo{"REPO is set?"}
Error["Exit with error\n[build-docs.sh:33-37]"]
Continue["Continue with\nREPO=$REPO_OWNER/$REPO_NAME"]
Start --> CheckEnv
CheckEnv -->|Yes| UseEnv
CheckEnv -->|No| CheckGit
CheckGit -->|Yes| GetRemote
CheckGit -->|No| ValidateRepo
GetRemote --> ParseURL
ParseURL --> SetRepo
UseEnv --> ValidateRepo
SetRepo --> ValidateRepo
ValidateRepo -->|No| Error
ValidateRepo -->|Yes| Continue
Auto-Detection Logic:
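A minimal sketch of that logic, assuming the shape suggested by the diagram above (the actual script may differ in detail):

```bash
# Sketch only: derive REPO from the Git remote when it is not provided
if [ -z "$REPO" ] && git rev-parse --git-dir >/dev/null 2>&1; then
    REMOTE_URL=$(git config --get remote.origin.url)
    REPO=$(echo "$REMOTE_URL" | sed -nE 's#.*github\.com[:/]([^/]+/[^/.]+).*#\1#p')
fi
```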
Sources: build-docs.sh:8-37
Validation: The system exits with an error if REPO is not set and cannot be auto-detected:
ERROR: REPO must be set or run from within a Git repository with a GitHub remote
Usage: REPO=owner/repo $0
Usage in System:
- Passed as the first argument to deepwiki-scraper.py build-docs.sh58
- Used to derive REPO_OWNER and REPO_NAME build-docs.sh:40-41
- Used to construct the GIT_REPO_URL default build-docs.sh45
BOOK_TITLE
Default: "Documentation"
Purpose: Sets the title of the generated mdBook documentation. This appears in the browser tab, navigation header, and book metadata.
Usage: Injected into book.toml configuration file build-docs.sh87:
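In the generated template, the value is presumably substituted into the [book] section along these lines:

```toml
title = "${BOOK_TITLE}"
```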
Examples:
BOOK_TITLE="React Documentation"BOOK_TITLE="VS Code Internals"BOOK_TITLE="Apache Arrow DataFusion Developer Guide"
Sources: build-docs.sh23 build-docs.sh87
BOOK_AUTHORS
Default: Repository owner extracted from REPO
Purpose: Sets the author name(s) in the mdBook documentation metadata.
Default Assignment Logic: build-docs.sh44
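Presumably a shell parameter-expansion default of the following shape:

```bash
BOOK_AUTHORS="${BOOK_AUTHORS:-$REPO_OWNER}"
```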
This uses shell parameter expansion to set BOOK_AUTHORS to REPO_OWNER only if BOOK_AUTHORS is unset or empty.
Usage: Injected into book.toml as an array build-docs.sh88:
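Roughly:

```toml
authors = ["${BOOK_AUTHORS}"]
```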
Examples:
- If REPO="facebook/react" and BOOK_AUTHORS is not set → BOOK_AUTHORS="facebook"
- Explicitly set: BOOK_AUTHORS="Meta Open Source"
- Multiple authors: BOOK_AUTHORS="John Doe, Jane Smith" (rendered as a single string in the array)
Sources: build-docs.sh24 build-docs.sh44 build-docs.sh88
GIT_REPO_URL
Default: https://github.com/{REPO}
Purpose: Provides the full GitHub repository URL used for "Edit this page" links in the generated mdBook documentation. Each page includes a link back to the source repository.
Default Assignment Logic: build-docs.sh45
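Presumably:

```bash
GIT_REPO_URL="${GIT_REPO_URL:-https://github.com/$REPO}"
```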
Usage: Injected into book.toml configuration build-docs.sh95:
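Roughly:

```toml
git-repository-url = "${GIT_REPO_URL}"
```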
Notes:
- mdBook automatically appends /edit/main/ or similar paths based on its heuristics
- The URL must be a valid Git repository URL for the edit links to work correctly
- Can be overridden for non-standard Git hosting scenarios
Sources: build-docs.sh25 build-docs.sh45 build-docs.sh95
MARKDOWN_ONLY
Default: "false"
Type: Boolean string ("true" or "false")
Purpose: Controls whether the system executes the full three-phase pipeline or stops after Phase 2 (Markdown extraction with diagram enhancement). When set to "true", Phase 3 (mdBook build) is skipped.
flowchart TD
Start["build-docs.sh Execution"]
Phase1["Phase 1: Scrape & Extract\n[build-docs.sh:56-58]"]
Phase2["Phase 2: Enhance Diagrams\n(within deepwiki-scraper.py)"]
CheckMode{"MARKDOWN_ONLY\n== 'true'?\n[build-docs.sh:61]"}
CopyMD["Copy markdown to /output/markdown\n[build-docs.sh:64-65]"]
ExitEarly["Exit (skipping mdBook build)\n[build-docs.sh:75]"]
Phase3Init["Phase 3: Initialize mdBook\n[build-docs.sh:79-106]"]
BuildBook["Build HTML documentation\n[build-docs.sh:176]"]
CopyAll["Copy all outputs\n[build-docs.sh:179-191]"]
Start --> Phase1
Phase1 --> Phase2
Phase2 --> CheckMode
CheckMode -->|Yes| CopyMD
CopyMD --> ExitEarly
CheckMode -->|No| Phase3Init
Phase3Init --> BuildBook
BuildBook --> CopyAll
style ExitEarly fill:#ffebee
style CopyAll fill:#e8f5e9
Execution Flow with MARKDOWN_ONLY:
Sources: build-docs.sh26 build-docs.sh:61-76
Use Cases:
- Debugging diagram placement: Quickly iterate on diagram matching without waiting for mdBook build
- Markdown-only extraction: When you only need the Markdown source files
- Faster feedback loops: mdBook build adds significant time; skipping it speeds up testing
- Custom processing: Extract Markdown for processing with different documentation tools
Output Differences:
| Mode | Output Directory Structure |
|---|---|
| MARKDOWN_ONLY="false" (default) | /output/book/ (HTML site), /output/markdown/ (source), /output/book.toml (config) |
| MARKDOWN_ONLY="true" | /output/markdown/ (source only) |
Performance Impact: Markdown-only mode is approximately 3-5x faster, as it skips:
- mdBook initialization build-docs.sh:79-106
- SUMMARY.md generation build-docs.sh:109-159
- File copying to book/src build-docs.sh:164-166
- mdbook-mermaid asset installation build-docs.sh:169-171
- mdBook HTML build build-docs.sh:174-176
Sources: build-docs.sh:61-76 README.md:55-76
Internal Configuration Variables
These variables are derived or used internally and are not meant to be configured by users:
| Variable | Source | Purpose |
|---|---|---|
WORK_DIR | Hard-coded: /workspace build-docs.sh27 | Temporary working directory inside container |
WIKI_DIR | Derived: $WORK_DIR/wiki build-docs.sh28 | Directory where deepwiki-scraper.py outputs Markdown |
OUTPUT_DIR | Hard-coded: /output build-docs.sh29 | Container output directory (mounted as volume) |
BOOK_DIR | Derived: $WORK_DIR/book build-docs.sh30 | mdBook project directory |
REPO_OWNER | Extracted from REPO build-docs.sh40 | First component of owner/repo |
REPO_NAME | Extracted from REPO build-docs.sh41 | Second component of owner/repo |
Sources: build-docs.sh:27-30 build-docs.sh:40-41
Configuration Precedence and Inheritance
The system follows this precedence order for configuration values:
Sources: build-docs.sh:8-45
Example Scenarios:
1. User provides all values: all explicit values are used; no auto-detection occurs.
2. User provides only REPO:
   - REPO: "facebook/react" (explicit)
   - BOOK_TITLE: "Documentation" (default)
   - BOOK_AUTHORS: "facebook" (derived from REPO)
   - GIT_REPO_URL: "https://github.com/facebook/react" (derived)
   - MARKDOWN_ONLY: "false" (default)
3. User provides no values in a Git repository:
   - REPO: Auto-detected from git config --get remote.origin.url
   - All other values derived or defaulted as above
Generated Configuration Files
The system generates configuration files dynamically based on environment variables:
book.toml
Location: Created at $BOOK_DIR/book.toml build-docs.sh85 copied to /output/book.toml build-docs.sh191
Template Structure:
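A sketch of the generated template, assuming the hard-coded values listed below and standard mdBook/mdbook-mermaid configuration keys (values are substituted from environment variables at runtime):

```toml
[book]
title = "${BOOK_TITLE}"
authors = ["${BOOK_AUTHORS}"]
language = "en"

[output.html]
default-theme = "rust"
git-repository-url = "${GIT_REPO_URL}"

[output.html.fold]
enable = true
level = 1

[preprocessor.mermaid]
command = "mdbook-mermaid"
```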
Sources: build-docs.sh:85-103
Variable Substitution Mapping:
| Template Variable | Environment Variable | Section |
|---|---|---|
${BOOK_TITLE} | $BOOK_TITLE | [book] |
${BOOK_AUTHORS} | $BOOK_AUTHORS | [book] |
${GIT_REPO_URL} | $GIT_REPO_URL | [output.html] |
Hard-Coded Values:
language = "en"build-docs.sh89default-theme = "rust"build-docs.sh94[preprocessor.mermaid]configuration build-docs.sh:97-98- Sidebar folding enabled at level 1 build-docs.sh:100-102
SUMMARY.md
Location: Created at $BOOK_DIR/src/SUMMARY.md build-docs.sh159
Generation: Automatically generated from file structure in $WIKI_DIR, no direct environment variable input. See SUMMARY.md Generation for details.
Sources: build-docs.sh:109-159
Configuration Examples
Minimal Configuration
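A sketch, assuming a locally built image tagged deepwiki-scraper:

```bash
docker run --rm -e REPO="owner/repo" -v "$(pwd)/output:/output" deepwiki-scraper
```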
Results:
REPO:"owner/repo"BOOK_TITLE:"Documentation"BOOK_AUTHORS:"owner"GIT_REPO_URL:"https://github.com/owner/repo"MARKDOWN_ONLY:"false"
Full Custom Configuration
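With every variable set explicitly (same image-tag assumption; the values mirror the debugging example shown later in this page):

```bash
docker run --rm \
  -e REPO="facebook/react" \
  -e BOOK_TITLE="React Documentation" \
  -e BOOK_AUTHORS="Meta Open Source" \
  -e GIT_REPO_URL="https://github.com/facebook/react" \
  -e MARKDOWN_ONLY=false \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
```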
Auto-Detected Configuration
Note: This only works if the current directory is a Git repository with a GitHub remote URL configured.
Debugging Configuration
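For example (image tag is a placeholder):

```bash
docker run --rm -e REPO="owner/repo" -e MARKDOWN_ONLY=true -v "$(pwd)/output:/output" deepwiki-scraper
```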
Outputs only Markdown files to /output/markdown/, skipping the mdBook build phase.
Sources: README.md:28-88
Configuration Validation
The system performs validation on the REPO variable build-docs.sh:32-37:
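A sketch of that check, assuming it mirrors the error text shown under Quick Troubleshooting:

```bash
if [ -z "$REPO" ]; then
    echo "ERROR: REPO must be set or run from within a Git repository with a GitHub remote"
    echo "Usage: REPO=owner/repo $0"
    exit 1
fi
```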
Validation Rules:
- REPO must be non-empty after auto-detection
- No format validation is performed on the REPO value (e.g., the owner/repo pattern)
- Invalid REPO values cause failures during the scraping phase, not during validation
Other Variables:
- No validation is performed on BOOK_TITLE, BOOK_AUTHORS, or GIT_REPO_URL
- MARKDOWN_ONLY is not validated; any value other than "true" is treated as false
Sources: build-docs.sh:32-37
Configuration Debugging
To debug configuration values, check the console output at startup build-docs.sh:47-53:
Configuration:
Repository: facebook/react
Book Title: React Documentation
Authors: Meta Open Source
Git Repo URL: https://github.com/facebook/react
Markdown Only: false
This output shows the final resolved configuration values after auto-detection, derivation, and defaults are applied.
Sources: build-docs.sh:47-53
System Architecture
This document provides a comprehensive overview of the DeepWiki-to-mdBook Converter's system architecture, explaining how the major components interact and how data flows through the system. It describes the containerized polyglot design, the orchestration model, and the technology integration strategy.
For detailed information about the three-phase processing model, see Three-Phase Pipeline. For Docker containerization specifics, see Docker Multi-Stage Build. For individual component implementation details, see Component Reference.
Architectural Overview
The system follows a layered orchestration architecture where a shell script coordinator invokes specialized tools in sequence. The entire system runs within a single Docker container that combines Python web scraping tools with Rust documentation building tools.
Design Principles
| Principle | Implementation |
|---|---|
| Single Responsibility | Each component (shell, Python, Rust tools) has one clear purpose |
| Language-Specific Tools | Python for web scraping, Rust for documentation building, Shell for orchestration |
| Stateless Processing | No persistent state between runs; all configuration via environment variables |
| Atomic Operations | Temporary directory workflow ensures no partial output states |
| Generic Design | No hardcoded repository details; works with any DeepWiki repository |
Sources: README.md:218-227 build-docs.sh:1-206
Container Architecture
The system uses a two-stage Docker build to create a hybrid Python-Rust runtime environment while minimizing image size.
graph TB
subgraph Stage1["Stage 1: Rust Builder (rust:latest)"]
RustBase["rust:latest base image"]
CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
BinariesOut["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
RustBase --> CargoInstall
CargoInstall --> BinariesOut
end
subgraph Stage2["Stage 2: Final Image (python:3.12-slim)"]
PyBase["python:3.12-slim base"]
UVInstall["COPY --from=ghcr.io/astral-sh/uv"]
PipInstall["uv pip install\nrequirements.txt"]
CopyBins["COPY --from=builder\nRust binaries"]
CopyScripts["COPY scripts:\ndeepwiki-scraper.py\nbuild-docs.sh"]
PyBase --> UVInstall
UVInstall --> PipInstall
PipInstall --> CopyBins
CopyBins --> CopyScripts
end
BinariesOut -.->|Extract binaries only discard 1.5GB toolchain| CopyBins
CopyScripts --> Runtime["Final Image: ~300-400MB\nPython + Rust binaries\nNo build tools"]
subgraph Runtime["Runtime Contents"]
direction LR
Python["Python 3.12 runtime"]
Packages["requests, BeautifulSoup4,\nhtml2text"]
Tools["mdbook, mdbook-mermaid\nbinaries"]
end
Docker Multi-Stage Build Topology
Stage 1 (Dockerfile:1-5) compiles Rust tools using the full rust:latest image (~1.5 GB) but only the compiled binaries are extracted. Stage 2 (Dockerfile:7-32) builds the final image on a minimal Python base, copying only the Rust binaries and Python scripts, resulting in a compact image.
Sources: Dockerfile:1-33 README.md156
Component Topology and Code Mapping
This diagram maps the system's logical components to their actual code implementations:
graph TB
subgraph User["User Interface"]
CLI["Docker CLI"]
EnvVars["Environment Variables:\nREPO, BOOK_TITLE,\nBOOK_AUTHORS, etc."]
Volume["/output volume mount"]
end
subgraph Orchestrator["Orchestration Layer"]
BuildScript["build-docs.sh"]
MainLoop["Main execution flow:\nLines 55-206"]
ConfigGen["Configuration generation:\nLines 84-103, 108-159"]
AutoDetect["Auto-detection logic:\nLines 8-19, 40-45"]
end
subgraph ScraperLayer["Content Acquisition Layer"]
ScraperMain["deepwiki-scraper.py\nmain()
function"]
ExtractStruct["extract_wiki_structure()\nLine 78"]
ExtractContent["extract_page_content()\nLine 453"]
ExtractDiagrams["extract_and_enhance_diagrams()\nLine 596"]
FetchPage["fetch_page()\nLine 27"]
ConvertHTML["convert_html_to_markdown()\nLine 175"]
CleanFooter["clean_deepwiki_footer()\nLine 127"]
FixLinks["fix_wiki_link()\nLine 549"]
end
subgraph BuildLayer["Documentation Generation Layer"]
MdBookInit["mdbook init"]
MdBookBuild["mdbook build\n(Line 176)"]
MermaidInstall["mdbook-mermaid install\n(Line 171)"]
end
subgraph Output["Output Artifacts"]
TempDir["/workspace/wiki/\n(temp directory)"]
OutputMD["/output/markdown/\nEnhanced .md files"]
OutputBook["/output/book/\nHTML documentation"]
BookToml["/output/book.toml"]
end
CLI --> EnvVars
EnvVars --> BuildScript
BuildScript --> AutoDetect
BuildScript --> MainLoop
MainLoop --> ScraperMain
MainLoop --> ConfigGen
MainLoop --> MdBookInit
ScraperMain --> ExtractStruct
ScraperMain --> ExtractContent
ScraperMain --> ExtractDiagrams
ExtractStruct --> FetchPage
ExtractContent --> FetchPage
ExtractContent --> ConvertHTML
ConvertHTML --> CleanFooter
ExtractContent --> FixLinks
ExtractDiagrams --> TempDir
ExtractContent --> TempDir
ConfigGen --> MdBookInit
TempDir --> MdBookBuild
MdBookBuild --> MermaidInstall
TempDir --> OutputMD
MdBookBuild --> OutputBook
ConfigGen --> BookToml
OutputMD --> Volume
OutputBook --> Volume
BookToml --> Volume
This diagram shows the complete code-to-component mapping, making it easy to locate specific functionality in the codebase.
Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920
stateDiagram-v2
[*] --> ValidateConfig
ValidateConfig : build-docs.sh - Lines 8-53 Parse REPO, auto-detect if needed Set BOOK_TITLE, BOOK_AUTHORS defaults
ValidateConfig --> Phase1
Phase1 : Phase 1 - Scrape Wiki build-docs.sh - Line 58 Calls - deepwiki-scraper.py
state Phase1 {
[*] --> ExtractStructure
ExtractStructure : extract_wiki_structure() Parse main page, discover subsections
ExtractStructure --> ExtractPages
ExtractPages : extract_page_content() Fetch HTML, convert to markdown
ExtractPages --> EnhanceDiagrams
EnhanceDiagrams : extract_and_enhance_diagrams() Fuzzy match and inject diagrams
EnhanceDiagrams --> [*]
}
Phase1 --> CheckMode
CheckMode : Check MARKDOWN_ONLY flag build-docs.sh - Line 61
state CheckMode <<choice>>
CheckMode --> CopyMarkdown : MARKDOWN_ONLY=true
CheckMode --> Phase2 : MARKDOWN_ONLY=false
CopyMarkdown : Copy to /output/markdown build-docs.sh - Lines 63-75
CopyMarkdown --> Done
Phase2 : Phase 2 - Initialize mdBook build-docs.sh - Lines 79-106
state Phase2 {
[*] --> CreateBookToml
CreateBookToml : Generate book.toml Lines 85-103
CreateBookToml --> GenerateSummary
GenerateSummary : Generate SUMMARY.md Lines 113-159
GenerateSummary --> [*]
}
Phase2 --> Phase3
Phase3 : Phase 3 - Build Documentation build-docs.sh - Lines 164-191
state Phase3 {
[*] --> InstallMermaid
InstallMermaid : mdbook-mermaid install Line 171
InstallMermaid --> BuildBook
BuildBook : mdbook build Line 176
BuildBook --> CopyOutputs
CopyOutputs : Copy to /output Lines 184-191
CopyOutputs --> [*]
}
Phase3 --> Done
Done --> [*]
Execution Flow
The system executes through a well-defined sequence orchestrated by build-docs.sh:
Primary Execution Path
The execution flow has a fast-path (markdown-only mode) and a complete-path (full documentation build). The decision point at line 61 of build-docs.sh determines which path to take based on the MARKDOWN_ONLY environment variable.
Sources: build-docs.sh:55-206 tools/deepwiki-scraper.py:790-916
Technology Stack and Integration Points
Core Technologies
| Layer | Technology | Purpose | Code Reference |
|---|---|---|---|
| Orchestration | Bash | Script coordination, environment handling | build-docs.sh:1-206 |
| Web Scraping | Python 3.12 | HTTP requests, HTML parsing | tools/deepwiki-scraper.py:1-920 |
| HTML Parsing | BeautifulSoup4 | DOM navigation, content extraction | tools/deepwiki-scraper.py:18-19 |
| HTML→MD Conversion | html2text | Clean markdown generation | tools/deepwiki-scraper.py:175-190 |
| Documentation Build | mdBook (Rust) | HTML site generation | build-docs.sh176 |
| Diagram Rendering | mdbook-mermaid | Mermaid diagram support | build-docs.sh171 |
| Package Management | uv | Fast Python dependency installation | Dockerfile:13-17 |
Python Dependencies Integration
The scraper uses three primary Python libraries, installed via uv:
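The dependency list (installed system-wide with uv; exact version pins, if any, are not shown here) is:

```text
requests
beautifulsoup4
html2text
```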
Integration points:
- requests session with retry logic: tools/deepwiki-scraper.py:27-42
- BeautifulSoup for content extraction: tools/deepwiki-scraper.py:463-487
- html2text with body_width=0 (no line wrapping): tools/deepwiki-scraper.py:175-181
Sources: Dockerfile:16-17 tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42
graph TB
subgraph Docker["Docker Container Filesystem"]
subgraph Workspace["/workspace"]
WikiTemp["/workspace/wiki\n(temporary)\nScraper output"]
BookBuild["/workspace/book\nmdBook build directory"]
BookSrc["/workspace/book/src\nMarkdown source files"]
end
subgraph Binaries["/usr/local/bin"]
MdBook["mdbook"]
MdBookMermaid["mdbook-mermaid"]
Scraper["deepwiki-scraper.py"]
BuildScript["build-docs.sh"]
end
subgraph Output["/output (volume mount)"]
OutputMD["/output/markdown\nFinal markdown files"]
OutputBook["/output/book\nHTML documentation"]
OutputConfig["/output/book.toml"]
end
end
Scraper -.->|Phase 1: Write| WikiTemp
WikiTemp -.->|Phase 2: Enhance in-place| WikiTemp
WikiTemp -.->|Copy| BookSrc
BookSrc -.->|mdbook build| OutputBook
WikiTemp -.->|Move| OutputMD
File System Structure
The system uses a temporary directory workflow to ensure atomic operations:
Directory Layout at Runtime
Workflow:
- Lines 808-877: The scraper writes to a temporary directory in /tmp (created by tempfile.TemporaryDirectory())
- Line 880: Diagram enhancement modifies files in the temporary directory
- Lines 887-908: Completed files are moved atomically to /output
- Line 166: build-docs.sh copies them to the mdBook source directory
- Line 176: mdBook builds HTML to /workspace/book/book
This pattern ensures no partial or corrupted output is visible to users.
Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:164-191
Configuration Management
Configuration flows from environment variables through shell script processing to generated config files:
Configuration Flow
| Input | Processor | Output | Code Reference |
|---|---|---|---|
REPO | build-docs.sh:8-19 | Auto-detected from Git or required | build-docs.sh:8-36 |
BOOK_TITLE | build-docs.sh:23 | Defaults to "Documentation" | build-docs.sh23 |
BOOK_AUTHORS | build-docs.sh:24,44 | Defaults to repo owner | build-docs.sh:24-44 |
GIT_REPO_URL | build-docs.sh:25,45 | Constructed from REPO | build-docs.sh:25-45 |
MARKDOWN_ONLY | build-docs.sh:26,61 | Controls pipeline execution | build-docs.sh:26-61 |
| All config | build-docs.sh:85-103 | book.toml generation | build-docs.sh:85-103 |
| File structure | build-docs.sh:113-159 | SUMMARY.md generation | build-docs.sh:113-159 |
Auto-Detection Logic
The system can automatically detect repository information from Git remotes, as described under the REPO variable in Configuration Reference.
This enables zero-configuration usage in CI/CD environments where the code is already checked out.
Sources: build-docs.sh:8-45 README.md:47-53
Summary
The DeepWiki-to-mdBook Converter architecture demonstrates several key design patterns:
- Polyglot Orchestration : Shell coordinates Python and Rust tools, each optimized for their specific task
- Multi-Stage Container Build : Separates build-time tooling from runtime dependencies for minimal image size
- Temporary Directory Workflow : Ensures atomic operations and prevents partial output states
- Progressive Processing : Three distinct phases (extract, enhance, build) with optional fast-path
- Zero-Configuration Capability : Intelligent defaults and auto-detection minimize required configuration
The architecture prioritizes maintainability (clear separation of concerns), reliability (atomic operations), and usability (intelligent defaults) while remaining fully generic and portable.
Sources: README.md:1-233 Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920
Three-Phase Pipeline
Purpose and Scope
This document describes the three-phase processing pipeline that transforms DeepWiki HTML pages into searchable mdBook documentation. The pipeline consists of Phase 1: Clean Markdown Extraction , Phase 2: Diagram Enhancement , and Phase 3: mdBook Build. Each phase has distinct responsibilities and uses different technology stacks.
For overall system architecture, see System Architecture. For detailed implementation of individual phases, see Phase 1: Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. For configuration that affects pipeline behavior, see Configuration Reference.
Pipeline Overview
The system processes content through three sequential phases, with an optional bypass mechanism for Phase 3.
Pipeline Execution Flow
stateDiagram-v2
[*] --> Initialize
Initialize --> Phase1 : Start build-docs.sh
state "Phase 1 : Markdown Extraction" as Phase1 {
[*] --> extract_wiki_structure
extract_wiki_structure --> extract_page_content : For each page
extract_page_content --> convert_html_to_markdown
convert_html_to_markdown --> WriteTemp : Write to /workspace/wiki
WriteTemp --> [*]
}
Phase1 --> CheckMode : deepwiki-scraper.py complete
state CheckMode <<choice>>
CheckMode --> Phase2 : MARKDOWN_ONLY=false
CheckMode --> CopyOutput : MARKDOWN_ONLY=true
state "Phase 2 : Diagram Enhancement" as Phase2 {
[*] --> extract_and_enhance_diagrams
extract_and_enhance_diagrams --> ExtractJS : Fetch JS payload
ExtractJS --> FuzzyMatch : ~461 diagrams found
FuzzyMatch --> InjectDiagrams : ~48 placed
InjectDiagrams --> [*] : Update temp files
}
Phase2 --> Phase3 : Enhancement complete
state "Phase 3 : mdBook Build" as Phase3 {
[*] --> CreateBookToml : build-docs.sh
CreateBookToml --> GenerateSummary : book.toml created
GenerateSummary --> CopyToSrc : SUMMARY.md generated
CopyToSrc --> MdbookMermaidInstall : Copy to /workspace/book/src
MdbookMermaidInstall --> MdbookBuild : Install assets
MdbookBuild --> [*] : HTML in /workspace/book/book
}
Phase3 --> CopyOutput
CopyOutput --> [*] : Copy to /output
Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:790-919
Phase Coordination
The build-docs.sh orchestrator coordinates all three phases and handles the decision point for markdown-only mode.
Orchestrator Control Flow
flowchart TD
Start[/"docker run with env vars"/]
Start --> ParseEnv["Parse environment variables\nREPO, BOOK_TITLE, MARKDOWN_ONLY"]
ParseEnv --> ValidateRepo{"REPO set?"}
ValidateRepo -->|No| AutoDetect["git config --get remote.origin.url\nExtract owner/repo"]
ValidateRepo -->|Yes| CallScraper
AutoDetect --> CallScraper
CallScraper["python3 /usr/local/bin/deepwiki-scraper.py\nArgs: REPO, /workspace/wiki"]
CallScraper --> ScraperPhase1["Phase 1: extract_wiki_structure()\nextract_page_content()\nWrite to temp directory"]
ScraperPhase1 --> ScraperPhase2["Phase 2: extract_and_enhance_diagrams()\nFuzzy match and inject\nUpdate temp files"]
ScraperPhase2 --> CheckMarkdownOnly{"MARKDOWN_ONLY\n== true?"}
CheckMarkdownOnly -->|Yes| CopyMdOnly["cp -r /workspace/wiki/* /output/markdown/\nExit"]
CheckMarkdownOnly -->|No| InitMdBook
InitMdBook["mkdir -p /workspace/book\nGenerate book.toml"]
InitMdBook --> GenSummary["Generate src/SUMMARY.md\nScan /workspace/wiki/*.md\nBuild table of contents"]
GenSummary --> CopyToSrc["cp -r /workspace/wiki/* src/"]
CopyToSrc --> InstallMermaid["mdbook-mermaid install /workspace/book"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["cp -r book /output/\ncp -r /workspace/wiki/* /output/markdown/\ncp book.toml /output/"]
CopyMdOnly --> End[/"Exit with outputs in /output"/]
CopyOutputs --> End
Sources: build-docs.sh:8-76 build-docs.sh:78-206
Phase 1: Clean Markdown Extraction
Phase 1 discovers the wiki structure and converts HTML pages to clean Markdown, writing files to a temporary directory (/workspace/wiki). This phase is implemented entirely in Python within deepwiki-scraper.py.
Phase 1 Data Flow
flowchart LR
DeepWiki["https://deepwiki.com/\nowner/repo"]
DeepWiki -->|HTTP GET| extract_wiki_structure
extract_wiki_structure["extract_wiki_structure()\nParse sidebar links\nBuild page list"]
extract_wiki_structure --> PageList["pages = [\n {number, title, url, href, level},\n ...\n]"]
PageList --> Loop["For each page"]
Loop --> extract_page_content["extract_page_content(url, session)\nFetch HTML\nRemove nav/footer elements"]
extract_page_content --> BeautifulSoup["BeautifulSoup(response.text)\nFind article/main/body\nRemove DeepWiki UI"]
BeautifulSoup --> convert_html_to_markdown["convert_html_to_markdown(html)\nhtml2text.HTML2Text()\nbody_width=0"]
convert_html_to_markdown --> clean_deepwiki_footer["clean_deepwiki_footer(markdown)\nRemove footer patterns"]
clean_deepwiki_footer --> FixLinks["Fix internal links\nRegex: /owner/repo/N-title\nConvert to relative .md paths"]
FixLinks --> WriteTempFile["Write to /workspace/wiki/\nMain: N-title.md\nSubsection: section-N/N-M-title.md"]
WriteTempFile --> Loop
style extract_wiki_structure fill:#f9f9f9
style extract_page_content fill:#f9f9f9
style convert_html_to_markdown fill:#f9f9f9
style clean_deepwiki_footer fill:#f9f9f9
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-216 tools/deepwiki-scraper.py:127-173
Key Functions and Their Roles
| Function | File Location | Responsibility |
|---|---|---|
extract_wiki_structure() | tools/deepwiki-scraper.py:78-125 | Discover all pages by parsing sidebar links with pattern /repo/\d+ |
extract_page_content() | tools/deepwiki-scraper.py:453-594 | Fetch individual page, parse HTML, remove navigation elements |
convert_html_to_markdown() | tools/deepwiki-scraper.py:175-216 | Convert HTML to Markdown using html2text with body_width=0 |
clean_deepwiki_footer() | tools/deepwiki-scraper.py:127-173 | Remove DeepWiki UI elements using regex pattern matching |
sanitize_filename() | tools/deepwiki-scraper.py:21-25 | Convert page titles to safe filenames |
fix_wiki_link() | tools/deepwiki-scraper.py:549-589 | Rewrite internal links to relative .md paths |
File Organization Logic
flowchart TD
PageNum["page['number']"]
PageNum --> CheckLevel{"page['level']\n== 0?"}
CheckLevel -->|Yes main page| RootFile["Filename: N-title.md\nPath: /workspace/wiki/N-title.md\nExample: 2-quick-start.md"]
CheckLevel -->|No subsection| ExtractMain["Extract main section\nmain_section = number.split('.')[0]"]
ExtractMain --> SubDir["Create directory\nsection-{main_section}/"]
SubDir --> SubFile["Filename: N-M-title.md\nPath: section-N/N-M-title.md\nExample: section-2/2-1-installation.md"]
The system organizes files hierarchically based on page numbering:
Sources: tools/deepwiki-scraper.py:849-860
Phase 2: Diagram Enhancement
Phase 2 extracts Mermaid diagrams from the JavaScript payload and uses fuzzy matching to intelligently place them in the appropriate Markdown files. This phase operates on files in the temporary directory (/workspace/wiki).
Phase 2 Algorithm Flow
flowchart TD
Start["extract_and_enhance_diagrams(repo, temp_dir, session)"]
Start --> FetchJS["GET https://deepwiki.com/owner/repo/1-overview\nExtract response.text"]
FetchJS --> ExtractAll["Regex: ```mermaid\\\n(.*?)```\nFind all diagram blocks"]
ExtractAll --> CountTotal["all_diagrams list\n(~461 total)"]
CountTotal --> ExtractContext["Regex: ([^`]{'{500,}'}?)```mermaid\\ (.*?)```\nExtract 500-char context before each"]
ExtractContext --> Unescape["For each diagram:\ncontext.replace('\\\n', '\\\n')\ndiagram.replace('\\\n', '\\\n')\nUnescape HTML entities"]
Unescape --> BuildContext["diagram_contexts = [\n {\n last_heading: str,\n anchor_text: str (last 300 chars),\n diagram: str\n },\n ...\n]\n(~48 with context)"]
BuildContext --> ScanFiles["For each .md file in temp_dir.glob('**/*.md')"]
ScanFiles --> SkipExisting{"File contains\n'```mermaid'?"}
SkipExisting -->|Yes| ScanFiles
SkipExisting -->|No| NormalizeContent
NormalizeContent["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeContent --> MatchLoop["For each diagram in diagram_contexts"]
MatchLoop --> TryChunks["Try chunk sizes: [300, 200, 150, 100, 80]\ntest_chunk = anchor_normalized[-chunk_size:]\npos = content_normalized.find(test_chunk)"]
TryChunks --> FoundMatch{"Match found?"}
FoundMatch -->|Yes| ConvertToLine["Convert char position to line number\nScan through lines counting chars"]
FoundMatch -->|No| TryHeading["Try heading match\nCompare normalized heading text"]
TryHeading --> FoundMatch2{"Match found?"}
FoundMatch2 -->|Yes| ConvertToLine
FoundMatch2 -->|No| MatchLoop
ConvertToLine --> FindInsertPoint["Find insertion point:\nIf heading: skip blank lines, skip paragraph\nIf paragraph: find end of paragraph"]
FindInsertPoint --> QueueInsert["pending_insertions.append(\n (insert_line, diagram, score, idx)\n)"]
QueueInsert --> MatchLoop
MatchLoop --> InsertDiagrams["Sort by line number (reverse)\nInsert from bottom up:\nlines.insert(pos, '')\nlines.insert(pos, '```')\nlines.insert(pos, diagram)\nlines.insert(pos, '```mermaid')\nlines.insert(pos, '')"]
InsertDiagrams --> WriteFile["Write enhanced file back to disk"]
WriteFile --> ScanFiles
ScanFiles --> Complete["Return to orchestrator"]
Sources: tools/deepwiki-scraper.py:596-788
Fuzzy Matching Algorithm
The algorithm uses progressive chunk sizes to find the best match location for each diagram:
Sources: tools/deepwiki-scraper.py:716-730 tools/deepwiki-scraper.py:732-745
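A simplified Python sketch of the progressive chunk search (function and variable names are illustrative; the real logic lives in extract_and_enhance_diagrams()):

```python
def find_anchor_position(content, anchor_text):
    """Return (char_position, match_score) in the normalized content, or None."""
    normalize = lambda s: " ".join(s.lower().split())
    content_norm = normalize(content)
    anchor_norm = normalize(anchor_text)
    # Try progressively smaller suffixes of the anchor context
    for chunk_size in (300, 200, 150, 100, 80):
        chunk = anchor_norm[-chunk_size:]
        pos = content_norm.find(chunk)
        if pos != -1:
            return pos, chunk_size
    return None  # caller falls back to heading matching (score 50)
```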
flowchart LR
Anchor["anchor_text\n(300 chars from JS context)"]
Anchor --> Normalize1["Normalize:\nlowercase\ncollapse whitespace"]
Content["markdown file content"]
Content --> Normalize2["Normalize:\nlowercase\ncollapse whitespace"]
Normalize1 --> Try300["Try 300-char chunk\ntest_chunk = anchor[-300:]"]
Normalize2 --> Try300
Try300 --> Found300{"Found?"}
Found300 -->|Yes| Match300["best_match_score = 300"]
Found300 -->|No| Try200["Try 200-char chunk"]
Try200 --> Found200{"Found?"}
Found200 -->|Yes| Match200["best_match_score = 200"]
Found200 -->|No| Try150["Try 150-char chunk"]
Try150 --> Found150{"Found?"}
Found150 -->|Yes| Match150["best_match_score = 150"]
Found150 -->|No| Try100["Try 100-char chunk"]
Try100 --> Found100{"Found?"}
Found100 -->|Yes| Match100["best_match_score = 100"]
Found100 -->|No| Try80["Try 80-char chunk"]
Try80 --> Found80{"Found?"}
Found80 -->|Yes| Match80["best_match_score = 80"]
Found80 -->|No| TryHeading["Fallback: heading match"]
TryHeading --> FoundH{"Found?"}
FoundH -->|Yes| Match50["best_match_score = 50"]
FoundH -->|No| NoMatch["No match\nSkip this diagram"]
Diagram Extraction from JavaScript
Diagrams are extracted from the Next.js JavaScript payload using two strategies:
Extraction Strategies
| Strategy | Pattern | Description |
|---|---|---|
| Fenced blocks | ```` ```mermaid\\n(.*?)``` ```` | Primary strategy: extract code blocks with escaped newlines |
| JavaScript strings | "graph TD..." | Fallback: find Mermaid start keywords in quoted strings |
The function extract_mermaid_from_nextjs_data() at tools/deepwiki-scraper.py:218-331 handles unescaping:
block.replace('\\n', '\n')
block.replace('\\t', '\t')
block.replace('\\"', '"')
block.replace('\\\\', '\\')
block.replace('\\u003c', '<')
block.replace('\\u003e', '>')
block.replace('\\u0026', '&')
Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:615-646
Phase 3: mdBook Build
Phase 3 generates mdBook configuration, creates the table of contents, and builds the final HTML documentation. This phase is orchestrated by build-docs.sh and invokes Rust tools (mdbook, mdbook-mermaid).
Phase 3 Component Interactions
flowchart TD
Start["Phase 3 entry point\n(build-docs.sh:78)"]
Start --> MkdirBook["mkdir -p /workspace/book\ncd /workspace/book"]
MkdirBook --> GenToml["Generate book.toml:\n[book]\ntitle, authors, language\n[output.html]\ndefault-theme=rust\ngit-repository-url\n[preprocessor.mermaid]"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> GenSummary["Generate src/SUMMARY.md"]
GenSummary --> ScanRoot["Scan /workspace/wiki/*.md\nFind first page for intro"]
ScanRoot --> ProcessMain["For each main page:\nExtract title from first line\nCheck for section-N/ subdirectory"]
ProcessMain --> HasSubs{"Has\nsubsections?"}
HasSubs -->|Yes| WriteSection["Write to SUMMARY.md:\n# Title\n- [Title](N-title.md)\n - [Subtitle](section-N/N-M-title.md)"]
HasSubs -->|No| WriteStandalone["Write to SUMMARY.md:\n- [Title](N-title.md)"]
WriteSection --> ProcessMain
WriteStandalone --> ProcessMain
ProcessMain --> CopySrc["cp -r /workspace/wiki/* src/"]
CopySrc --> InstallMermaid["mdbook-mermaid install /workspace/book\nInstalls mermaid.min.js\nInstalls mermaid-init.js\nUpdates book.toml"]
InstallMermaid --> MdbookBuild["mdbook build\nReads src/SUMMARY.md\nProcesses all .md files\nApplies rust theme\nGenerates book/index.html\nGenerates book/*/index.html"]
MdbookBuild --> CopyOut["Copy outputs:\ncp -r book /output/\ncp -r /workspace/wiki/* /output/markdown/\ncp book.toml /output/"]
Sources: build-docs.sh:78-206
book.toml Generation
The orchestrator dynamically generates book.toml with the runtime configuration values described in Configuration Reference.
Sources: build-docs.sh:84-103
flowchart TD
Start["Generate SUMMARY.md"]
Start --> FindFirst["first_page = ls /workspace/wiki/*.md /head -1 Extract title from first line Write: [Title] filename"]
FindFirst --> LoopMain["For each /workspace/wiki/*.md excluding first_page"]
LoopMain --> ExtractNum["section_num = filename.match /^[0-9]+/"]
ExtractNum --> CheckDir{"section-{num}/ exists?"}
CheckDir -->|Yes|WriteSectionHeader["Write: # {title} - [{title}] {filename}"]
WriteSectionHeader --> LoopSubs["For each section-{num}/*.md"]
LoopSubs --> WriteSubitem["Write: - [{subtitle}] section-{num}/{subfilename}"]
WriteSubitem --> LoopSubs
LoopSubs --> LoopMain
CheckDir -->|No| WriteStandalone["Write:\n- [{title}]({filename})"]
WriteStandalone --> LoopMain
LoopMain --> Complete["SUMMARY.md complete\ngrep -c '\\[' to count entries"]
SUMMARY.md Generation Algorithm
The table of contents is generated by scanning the actual file structure in /workspace/wiki:
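For a hypothetical wiki with one standalone page and one sectioned page, the generated file would look roughly like this (titles and filenames are illustrative):

```markdown
[Overview](1-overview.md)

- [Quick Start](2-quick-start.md)

# Architecture
- [Architecture](3-architecture.md)
  - [Pipeline](section-3/3-1-pipeline.md)
  - [Components](section-3/3-2-components.md)
```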
Sources: build-docs.sh:108-162
mdBook and mdbook-mermaid Execution
The build process invokes two Rust binaries:
| Command | Purpose | Output |
|---|---|---|
mdbook-mermaid install $BOOK_DIR | Install Mermaid.js assets and update book.toml | mermaid.min.js, mermaid-init.js in book/ |
mdbook build | Parse SUMMARY.md, process Markdown, generate HTML | HTML files in /workspace/book/book/ |
The mdbook binary:
- Reads src/SUMMARY.md to determine structure
- Processes each Markdown file referenced in SUMMARY.md
- Applies the rust theme specified in book.toml
- Generates the navigation sidebar
- Adds search functionality
- Creates "Edit this page" links using git-repository-url
Sources: build-docs.sh:169-176
Data Transformation Summary
Each phase transforms data in specific ways:
| Phase | Input Format | Processing | Output Format |
|---|---|---|---|
| Phase 1 | HTML pages from DeepWiki | BeautifulSoup parsing, html2text conversion, link rewriting | Clean Markdown in /workspace/wiki/ |
| Phase 2 | Markdown files + JavaScript payload | Regex extraction, fuzzy matching, diagram injection | Enhanced Markdown in /workspace/wiki/ (modified in place) |
| Phase 3 | Markdown files + environment variables | book.toml generation, SUMMARY.md generation, mdbook build | HTML site in /workspace/book/book/ |
Final Output Structure:
/output/
├── book/ # HTML documentation site
│ ├── index.html
│ ├── 1-overview.html
│ ├── section-2/
│ │ └── 2-1-subsection.html
│ ├── mermaid.min.js
│ ├── mermaid-init.js
│ └── ...
├── markdown/ # Source Markdown files
│ ├── 1-overview.md
│ ├── section-2/
│ │ └── 2-1-subsection.md
│ └── ...
└── book.toml # mdBook configuration
Sources: build-docs.sh:178-205 README.md:89-119
flowchart TD
Phase1["Phase 1: Extraction\n(deepwiki-scraper.py)"]
Phase2["Phase 2: Enhancement\n(deepwiki-scraper.py)"]
Phase1 --> Phase2
Phase2 --> Check{"MARKDOWN_ONLY\n== true?"}
Check -->|Yes| FastPath["cp -r /workspace/wiki/* /output/markdown/\nExit (fast path)"]
Check -->|No| Phase3["Phase 3: mdBook Build\n(build-docs.sh)"]
Phase3 --> FullOutput["Copy book/ and markdown/ to /output/\nExit (full build)"]
FastPath --> End[/"Build complete"/]
FullOutput --> End
Conditional Execution: MARKDOWN_ONLY Mode
The MARKDOWN_ONLY environment variable allows bypassing Phase 3 for faster iteration during development:
When MARKDOWN_ONLY=true:
- Execution time: ~30-60 seconds (scraping + diagram matching only)
- Output: /output/markdown/ only
- Use case: Debugging diagram placement, testing content extraction
When MARKDOWN_ONLY=false (default):
- Execution time: ~60-120 seconds (full pipeline)
- Output: /output/book/, /output/markdown/, /output/book.toml
- Use case: Production documentation builds
Sources: build-docs.sh:60-76 README.md:55-76
Docker Multi-Stage Build
Purpose and Scope
This document explains the Docker multi-stage build strategy used to create the deepwiki-scraper container image. It details how the system combines a Rust toolchain for compiling documentation tools with a Python runtime for web scraping, while optimizing the final image size.
For information about how the container orchestrates the build process, see build-docs.sh Orchestrator. For details on the Python scraper implementation, see deepwiki-scraper.py.
Multi-Stage Build Strategy
The Dockerfile implements a two-stage build pattern that separates compilation from runtime. Stage 1 uses a full Rust development environment to compile mdBook binaries from source. Stage 2 creates a minimal Python runtime and extracts only the compiled binaries, discarding the build toolchain.
Build Stages Flow
graph TD
subgraph Stage1["Stage 1: Rust Builder (rust:latest)"]
RustBase["rust:latest base image\n~1.5 GB with toolchain"]
CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
Binaries["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
RustBase --> CargoInstall
CargoInstall --> Binaries
end
subgraph Stage2["Stage 2: Python Runtime (python:3.12-slim)"]
PyBase["python:3.12-slim base\n~150 MB"]
UVInstall["Copy uv from\nghcr.io/astral-sh/uv:latest"]
PipInstall["uv pip install --system\nrequirements.txt"]
CopyBinaries["COPY --from=builder\n/usr/local/cargo/bin/"]
CopyScripts["COPY tools/ and\nbuild-docs.sh"]
PyBase --> UVInstall
UVInstall --> PipInstall
PipInstall --> CopyBinaries
CopyBinaries --> CopyScripts
end
Binaries -.->|Extract only binaries| CopyBinaries
CopyScripts --> FinalImage["Final Image\n~300-400 MB"]
Stage1 -.->|Discarded after build| Discard["Discard"]
style RustBase fill:#f5f5f5
style PyBase fill:#f5f5f5
style FinalImage fill:#e8e8e8
style Discard fill:#fff,stroke-dasharray: 5 5
Sources: Dockerfile:1-33
Stage 1: Rust Builder
Stage 1 uses rust:latest as the base image, providing the complete Rust toolchain including cargo, the Rust package manager and build tool.
Rust Builder Configuration
| Aspect | Details |
|---|---|
| Base Image | rust:latest |
| Size | ~1.5 GB (includes rustc, cargo, stdlib) |
| Build Commands | cargo install mdbook, cargo install mdbook-mermaid |
| Output Location | /usr/local/cargo/bin/ |
| Stage Identifier | builder |
The cargo install commands fetch mdBook and mdbook-mermaid source from crates.io, compile them from source, and install the resulting binaries to /usr/local/cargo/bin/.
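A minimal sketch of what this stage looks like, assuming the conventional multi-stage syntax (the actual Dockerfile:1-5 may pin versions or add build flags):

```dockerfile
# Stage 1: compile the Rust documentation tools from source
FROM rust:latest AS builder
RUN cargo install mdbook
RUN cargo install mdbook-mermaid
# Resulting binaries land in /usr/local/cargo/bin/
```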
flowchart LR
subgraph BuilderStage["builder stage"]
CratesIO["crates.io\n(package registry)"]
CargoFetch["cargo fetch\n(download sources)"]
CargoCompile["cargo build --release\n(compile to binary)"]
CargoInstallBin["Install to\n/usr/local/cargo/bin/"]
CratesIO --> CargoFetch
CargoFetch --> CargoCompile
CargoCompile --> CargoInstallBin
end
CargoInstallBin --> MdBookBin["mdbook binary"]
CargoInstallBin --> MermaidBin["mdbook-mermaid binary"]
MdBookBin -.->|Copied to Stage 2| NextStage["NextStage"]
MermaidBin -.->|Copied to Stage 2| NextStage
Sources: Dockerfile:1-5
Stage 2: Python Runtime Assembly
Stage 2 builds the final runtime image starting from python:3.12-slim, a minimal Python base image that omits development headers and unnecessary packages.
Python Runtime Components
graph TB
subgraph PythonBase["python:3.12-slim"]
PyInterpreter["Python 3.12 interpreter"]
PyStdlib["Python standard library"]
BaseUtils["Essential utilities\n(bash, sh, coreutils)"]
end
subgraph InstalledTools["Installed via COPY"]
UV["uv package manager\n/bin/uv, /bin/uvx"]
PyDeps["Python packages\n(requests, beautifulsoup4, html2text)"]
RustBins["Rust binaries\n(mdbook, mdbook-mermaid)"]
Scripts["Application scripts\n(deepwiki-scraper.py, build-docs.sh)"]
end
PythonBase --> UV
UV --> PyDeps
PythonBase --> RustBins
PythonBase --> Scripts
PyDeps --> Runtime["Runtime Environment"]
RustBins --> Runtime
Scripts --> Runtime
The installation sequence follows a specific order:
- Copy uv (Dockerfile:13): multi-stage copy from ghcr.io/astral-sh/uv:latest
- Install Python dependencies (Dockerfile:16-17): uses uv pip install --system --no-cache
- Copy Rust binaries (Dockerfile:20-21): extracts from the builder stage
- Copy application scripts (Dockerfile:24-29): adds the Python scraper and shell orchestrator
Sources: Dockerfile:8-29
Binary Extraction and Integration
The critical optimization occurs at Dockerfile:20-21 where the COPY --from=builder directive extracts only the compiled binaries without any build dependencies.
Binary Extraction Pattern
| Source (Stage 1) | Destination (Stage 2) | Purpose |
|---|---|---|
/usr/local/cargo/bin/mdbook | /usr/local/bin/mdbook | Documentation builder executable |
/usr/local/cargo/bin/mdbook-mermaid | /usr/local/bin/mdbook-mermaid | Mermaid preprocessor executable |
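The extraction boils down to two COPY --from=builder lines. This sketch mirrors the table above and may differ slightly from the exact wording at Dockerfile:20-21:

```dockerfile
# Stage 2: pull only the compiled binaries out of the builder stage
COPY --from=builder /usr/local/cargo/bin/mdbook /usr/local/bin/mdbook
COPY --from=builder /usr/local/cargo/bin/mdbook-mermaid /usr/local/bin/mdbook-mermaid
```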
flowchart LR
subgraph BuilderFS["Builder Filesystem"]
CargoDir["/usr/local/cargo/bin/"]
MdBookSrc["mdbook\n(compiled binary)"]
MermaidSrc["mdbook-mermaid\n(compiled binary)"]
CargoDir --> MdBookSrc
CargoDir --> MermaidSrc
end
subgraph RuntimeFS["Runtime Filesystem"]
BinDir["/usr/local/bin/"]
MdBookDst["mdbook\n(extracted)"]
MermaidDst["mdbook-mermaid\n(extracted)"]
BinDir --> MdBookDst
BinDir --> MermaidDst
end
MdBookSrc -.->|COPY --from=builder| MdBookDst
MermaidSrc -.->|COPY --from=builder| MermaidDst
subgraph Discarded["Discarded (not copied)"]
RustToolchain["rustc compiler"]
CargoTool["cargo build tool"]
SourceFiles["mdBook source files"]
BuildCache["cargo build cache"]
end
Both binaries are statically linked or contain all necessary Rust runtime dependencies, allowing them to execute in the Python base image without the Rust toolchain.
Sources: Dockerfile:19-21
Python Dependency Installation
Python dependencies are installed using uv, a fast Python package installer written in Rust. The dependencies are defined in tools/requirements.txt:1-4
Python Dependencies
| Package | Version | Purpose |
|---|---|---|
requests | ≥2.31.0 | HTTP client for scraping DeepWiki |
beautifulsoup4 | ≥4.12.0 | HTML parsing and navigation |
html2text | ≥2020.1.16 | HTML to Markdown conversion |
The installation command at Dockerfile:17 uses these flags:
- --system: Install to system Python (not a virtualenv)
- --no-cache: Don't cache downloaded packages (reduces image size)
- -r /tmp/requirements.txt: Read dependencies from file
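Putting the flags together, the install step is roughly the following (a sketch, not a verbatim copy of Dockerfile:16-17; the COPY of requirements.txt into /tmp is assumed from the -r path above):

```dockerfile
COPY tools/requirements.txt /tmp/requirements.txt
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
```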
Sources: Dockerfile:16-17 tools/requirements.txt:1-4
graph LR
subgraph Approach1["Single-Stage Approach (Hypothetical)"]
Single["rust:latest + Python\n~2+ GB"]
end
subgraph Approach2["Multi-Stage Approach (Actual)"]
Builder["Stage 1: rust:latest\n~1.5 GB\n(discarded)"]
Runtime["Stage 2: python:3.12-slim\n+ binaries + dependencies\n~300-400 MB"]
Builder -.->|Extract binaries only| Runtime
end
Single -->|Contains unnecessary build toolchain| Waste["Wasted Space"]
Runtime -->|Contains only runtime essentials| Efficient["Efficient"]
style Single fill:#f5f5f5
style Builder fill:#f5f5f5
style Runtime fill:#e8e8e8
style Waste fill:#fff,stroke-dasharray: 5 5
style Efficient fill:#fff,stroke-dasharray: 5 5
Image Size Optimization
The multi-stage strategy achieves significant size reduction by discarding the build environment.
Size Comparison
Size Breakdown of Final Image
| Component | Approximate Size |
|---|---|
| Python 3.12 slim base | ~150 MB |
| Python packages (requests, BeautifulSoup4, html2text) | ~20 MB |
| mdBook binary | ~8 MB |
| mdbook-mermaid binary | ~6 MB |
| uv package manager | ~10 MB |
| Application scripts | <1 MB |
| Total | ~300-400 MB |
Sources: Dockerfile:1-33 README.md156
graph TB
subgraph Filesystem["/usr/local/bin/"]
BuildScript["build-docs.sh\n(orchestrator)"]
ScraperScript["deepwiki-scraper.py\n(Python scraper)"]
MdBookBin["mdbook\n(Rust binary)"]
MermaidBin["mdbook-mermaid\n(Rust binary)"]
UVBin["uv\n(Python installer)"]
end
subgraph SystemPython["/usr/local/lib/python3.12/"]
Requests["requests package"]
BS4["beautifulsoup4 package"]
Html2Text["html2text package"]
end
subgraph Execution["Execution Flow"]
Docker["docker run\n(CMD)"]
Docker --> BuildScript
BuildScript -->|python| ScraperScript
BuildScript -->|subprocess| MdBookBin
MdBookBin -->|preprocessor| MermaidBin
ScraperScript --> Requests
ScraperScript --> BS4
ScraperScript --> Html2Text
end
Runtime Environment Structure
The final image contains a hybrid Python-Rust runtime where Python scripts can execute Rust binaries as subprocesses.
Runtime Component Locations
The entrypoint at Dockerfile:32 executes /usr/local/bin/build-docs.sh, which orchestrates calls to both Python and Rust components. The script can execute:
- python /usr/local/bin/deepwiki-scraper.py for web scraping
- mdbook init for initialization
- mdbook build for HTML generation
- mdbook-mermaid install for asset installation
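The entrypoint itself is a single line. A sketch of Dockerfile:32 follows; the exact form (CMD vs ENTRYPOINT, exec vs shell syntax) may differ:

```dockerfile
CMD ["/usr/local/bin/build-docs.sh"]
```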
Sources: Dockerfile:28-32 build-docs.sh
Container Execution Model
When the container runs, Docker executes the CMD Dockerfile32 which invokes build-docs.sh. This shell script has access to all binaries in /usr/local/bin/ (automatically on $PATH).
Process Tree During Execution
graph TD
Docker["docker run\n(container init)"]
Docker --> CMD["CMD: build-docs.sh"]
CMD --> Phase1["Phase 1:\npython deepwiki-scraper.py"]
CMD --> Phase2["Phase 2: mdbook init"]
CMD --> Phase3["Phase 3: mdbook-mermaid install"]
CMD --> Phase4["Phase 4: mdbook build"]
Phase1 --> PyProc["Python 3.12 process"]
PyProc --> ReqLib["requests.get()"]
PyProc --> BS4Lib["BeautifulSoup()"]
PyProc --> H2TLib["html2text.HTML2Text()"]
Phase2 --> MdBookProc1["mdbook binary process"]
Phase3 --> MermaidProc["mdbook-mermaid binary process"]
Phase4 --> MdBookProc2["mdbook binary process"]
MdBookProc2 --> MermaidPreproc["mdbook-mermaid\n(as preprocessor)"]
Sources: Dockerfile32 README.md:122-145
Component Reference
Relevant source files
This page provides an overview of the three major components that comprise the DeepWiki-to-mdBook Converter system and their responsibilities. Each component operates at a different layer of the technology stack (Shell, Python, Rust) and handles a specific phase of the documentation transformation pipeline.
For detailed documentation of each component's internal implementation, see:
- Shell orchestration logic: build-docs.sh Orchestrator
- Python scraping and enhancement algorithms: deepwiki-scraper.py
- Rust documentation building integration: mdBook Integration
System Component Architecture
The system consists of three primary executable components that work together in sequence, coordinated through file system operations and process execution.
Component Architecture Diagram
graph TB
subgraph "Shell Layer"
buildsh["build-docs.sh\nOrchestrator"]
end
subgraph "Python Layer"
scraper["deepwiki-scraper.py\nContent Processor"]
bs4["BeautifulSoup4\nHTML Parser"]
html2text["html2text\nMarkdown Converter"]
requests["requests\nHTTP Client"]
end
subgraph "Rust Layer"
mdbook["mdbook\nBinary"]
mermaid["mdbook-mermaid\nBinary"]
end
subgraph "Configuration Files"
booktoml["book.toml"]
summarymd["SUMMARY.md"]
end
subgraph "File System"
wikidir["$WIKI_DIR\nTemp Storage"]
outputdir["$OUTPUT_DIR\nFinal Output"]
end
buildsh -->|executes python3| scraper
buildsh -->|generates| booktoml
buildsh -->|generates| summarymd
buildsh -->|executes mdbook-mermaid| mermaid
buildsh -->|executes mdbook| mdbook
scraper -->|uses| bs4
scraper -->|uses| html2text
scraper -->|uses| requests
scraper -->|writes .md files| wikidir
mdbook -->|integrates| mermaid
mdbook -->|reads config| booktoml
mdbook -->|reads TOC| summarymd
mdbook -->|reads sources| wikidir
mdbook -->|writes HTML| outputdir
buildsh -->|copies files| outputdir
Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 Dockerfile:1-33
Component Execution Flow
This diagram shows the actual execution sequence with specific function calls and file operations that occur during a complete documentation build.
Execution Flow with Code Entities
sequenceDiagram
participant User
participant buildsh as "build-docs.sh"
participant scraper as "deepwiki-scraper.py::main()"
participant extract as "extract_wiki_structure()"
participant content as "extract_page_content()"
participant enhance as "extract_and_enhance_diagrams()"
participant mdbook as "mdbook binary"
participant mermaid as "mdbook-mermaid binary"
participant fs as "File System"
User->>buildsh: docker run -e REPO=...
buildsh->>buildsh: Parse $REPO, $BOOK_TITLE, etc
buildsh->>buildsh: Set WIKI_DIR=/workspace/wiki
buildsh->>scraper: python3 deepwiki-scraper.py $REPO $WIKI_DIR
scraper->>extract: extract_wiki_structure(repo, session)
extract->>extract: BeautifulSoup4 parsing
extract-->>scraper: pages[] array
loop For each page
scraper->>content: extract_page_content(url, session)
content->>content: convert_html_to_markdown()
content->>content: clean_deepwiki_footer()
content->>fs: Write to $WIKI_DIR/*.md
end
scraper->>enhance: extract_and_enhance_diagrams(repo, temp_dir)
enhance->>enhance: Extract diagrams from JavaScript
enhance->>enhance: Fuzzy match with progressive chunks
enhance->>fs: Update $WIKI_DIR/*.md with diagrams
scraper-->>buildsh: Exit 0
alt MARKDOWN_ONLY=true
buildsh->>fs: cp $WIKI_DIR/* $OUTPUT_DIR/
buildsh-->>User: Exit (skip mdBook)
else Full build
buildsh->>buildsh: Generate book.toml
buildsh->>buildsh: Generate SUMMARY.md from files
buildsh->>fs: mkdir $BOOK_DIR/src
buildsh->>fs: cp $WIKI_DIR/* $BOOK_DIR/src/
buildsh->>mermaid: mdbook-mermaid install $BOOK_DIR
mermaid->>fs: Install mermaid.js assets
buildsh->>mdbook: mdbook build
mdbook->>mdbook: Parse SUMMARY.md
mdbook->>mdbook: Process markdown files
mdbook->>mdbook: Render HTML with rust theme
mdbook->>fs: Write to $BOOK_DIR/book/
buildsh->>fs: cp $BOOK_DIR/book $OUTPUT_DIR/
buildsh->>fs: cp $WIKI_DIR $OUTPUT_DIR/markdown/
buildsh-->>User: Build complete
end
Sources: build-docs.sh:55-206 tools/deepwiki-scraper.py:790-916 tools/deepwiki-scraper.py:596-789
Component Responsibility Matrix
The following table details the specific responsibilities and capabilities of each component.
| Component | Type | Primary Responsibility | Key Functions/Operations | Input | Output |
|---|---|---|---|---|---|
| build-docs.sh | Shell Script | Orchestration and configuration | Parse environment variables, auto-detect Git repository, execute scraper, generate book.toml, generate SUMMARY.md, execute mdBook tools, copy outputs | Environment variables | Complete documentation site |
| deepwiki-scraper.py | Python Script | Content extraction and enhancement | extract_wiki_structure(), extract_page_content(), convert_html_to_markdown(), extract_and_enhance_diagrams(), clean_deepwiki_footer() | DeepWiki URL | Enhanced Markdown files |
| mdbook | Rust Binary | HTML generation | Parse SUMMARY.md, process Markdown, apply theme, generate navigation, enable search | Markdown + config | HTML documentation |
| mdbook-mermaid | Rust Binary | Diagram rendering | Install mermaid.js, install CSS assets, process mermaid code blocks | Markdown with mermaid | HTML with rendered diagrams |
Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 README.md:146-156
Component File Locations
Each component resides in a specific location within the repository and Docker container, with distinct installation methods.
File System Layout
graph TB
subgraph "Repository Structure"
repo["/"]
buildscript["build-docs.sh\nOrchestrator script"]
dockerfile["Dockerfile\nMulti-stage build"]
toolsdir["tools/"]
scraper_py["deepwiki-scraper.py\nMain scraper"]
requirements["requirements.txt\nPython deps"]
repo --> buildscript
repo --> dockerfile
repo --> toolsdir
toolsdir --> scraper_py
toolsdir --> requirements
end
subgraph "Docker Container"
container["/"]
usrbin["/usr/local/bin/"]
buildsh_installed["build-docs.sh"]
scraper_installed["deepwiki-scraper.py"]
mdbook_bin["mdbook"]
mermaid_bin["mdbook-mermaid"]
workspace["/workspace"]
wikidir["/workspace/wiki"]
bookdir["/workspace/book"]
outputvol["/output"]
container --> usrbin
container --> workspace
container --> outputvol
usrbin --> buildsh_installed
usrbin --> scraper_installed
usrbin --> mdbook_bin
usrbin --> mermaid_bin
workspace --> wikidir
workspace --> bookdir
end
buildscript -.->|COPY| buildsh_installed
scraper_py -.->|COPY| scraper_installed
style buildsh_installed fill:#fff9c4
style scraper_installed fill:#e8f5e9
style mdbook_bin fill:#f3e5f5
style mermaid_bin fill:#f3e5f5
Sources: Dockerfile:1-33 build-docs.sh:27-30
Component Dependencies
Each component has specific external dependencies that must be available at runtime.
| Component | Runtime | Dependencies | Installation Method |
|---|---|---|---|
| build-docs.sh | bash | Git (optional, for auto-detection), Python 3.12+, mdbook binary, mdbook-mermaid binary | Bundled in Docker |
| deepwiki-scraper.py | Python 3.12 | requests (HTTP client), beautifulsoup4 (HTML parsing), html2text (Markdown conversion) | uv pip install -r requirements.txt |
| mdbook | Native binary | Compiled from Rust source, no runtime dependencies | cargo install mdbook |
| mdbook-mermaid | Native binary | Compiled from Rust source, no runtime dependencies | cargo install mdbook-mermaid |
Sources: Dockerfile:1-33 tools/requirements.txt README.md:154-156
Component Communication Protocol
Components communicate exclusively through the file system and process exit codes, with no direct API calls or shared memory.
Inter-Component Communication
graph LR
subgraph "Phase 1: Extraction"
buildsh1["build-docs.sh"]
scraper1["deepwiki-scraper.py"]
env["Environment:\n$REPO\n$WIKI_DIR"]
wikidir1["$WIKI_DIR/\n*.md files"]
buildsh1 -->|sets| env
env -->|python3 scraper.py $REPO $WIKI_DIR| scraper1
scraper1 -->|writes| wikidir1
scraper1 -.->|exit 0| buildsh1
end
subgraph "Phase 2: Configuration"
buildsh2["build-docs.sh"]
booktoml2["book.toml"]
summarymd2["SUMMARY.md"]
wikidir2["$WIKI_DIR/\nfile scan"]
buildsh2 -->|reads structure| wikidir2
buildsh2 -->|cat > book.toml| booktoml2
buildsh2 -->|generates from files| summarymd2
end
subgraph "Phase 3: Build"
buildsh3["build-docs.sh"]
mermaid3["mdbook-mermaid"]
mdbook3["mdbook"]
config3["book.toml\nSUMMARY.md\nsrc/*.md"]
output3["$OUTPUT_DIR/\nbook/"]
buildsh3 -->|mdbook-mermaid install| mermaid3
mermaid3 -->|writes assets| config3
buildsh3 -->|mdbook build| mdbook3
mdbook3 -->|reads| config3
mdbook3 -->|writes| output3
mdbook3 -.->|exit 0| buildsh3
end
wikidir1 -->|same files| wikidir2
Sources: build-docs.sh:55-206
Environment Variable Interface
The orchestrator component accepts configuration through environment variables, which control all aspects of system behavior.
| Variable | Purpose | Default | Used By | Set At |
|---|---|---|---|---|
$REPO | GitHub repository identifier | Auto-detected | build-docs.sh, deepwiki-scraper.py | build-docs.sh:9-19 |
$BOOK_TITLE | Documentation title | "Documentation" | build-docs.sh (book.toml) | build-docs.sh23 |
$BOOK_AUTHORS | Author name(s) | Extracted from $REPO | build-docs.sh (book.toml) | build-docs.sh:24-44 |
$GIT_REPO_URL | Source repository URL | Constructed from $REPO | build-docs.sh (book.toml) | build-docs.sh:25-45 |
$MARKDOWN_ONLY | Skip mdBook build | "false" | build-docs.sh | build-docs.sh:26-76 |
$WORK_DIR | Working directory | "/workspace" | build-docs.sh | build-docs.sh27 |
$WIKI_DIR | Temp markdown storage | "$WORK_DIR/wiki" | build-docs.sh, deepwiki-scraper.py | build-docs.sh28 |
$OUTPUT_DIR | Final output location | "/output" | build-docs.sh | build-docs.sh29 |
$BOOK_DIR | mdBook workspace | "$WORK_DIR/book" | build-docs.sh | build-docs.sh30 |
Sources: build-docs.sh:8-30 build-docs.sh:43-45
Python Module Structure
The deepwiki-scraper.py component is organized as a single-file script with a clear functional hierarchy.
Python Function Call Graph
graph TD
main["main()\nEntry point"]
extract_struct["extract_wiki_structure()\nDiscover pages"]
extract_content["extract_page_content()\nProcess single page"]
enhance["extract_and_enhance_diagrams()\nAdd diagrams"]
fetch["fetch_page()\nHTTP with retries"]
sanitize["sanitize_filename()\nClean filenames"]
convert["convert_html_to_markdown()\nHTML→MD"]
clean["clean_deepwiki_footer()\nRemove UI"]
extract_mermaid["extract_mermaid_from_nextjs_data()\nParse JS payload"]
main --> extract_struct
main --> extract_content
main --> enhance
extract_struct --> fetch
extract_content --> fetch
extract_content --> convert
convert --> clean
enhance --> fetch
enhance --> extract_mermaid
extract_content --> sanitize
Sources: tools/deepwiki-scraper.py:790-919 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-789
graph TB
start["Start"]
detect["Auto-detect Git repository\nlines 9-19"]
validate["Validate configuration\nlines 32-53"]
step1["Step 1: Execute scraper\nline 58:\npython3 deepwiki-scraper.py"]
check{"MARKDOWN_ONLY\n== true?"}
markdown_exit["Copy markdown only\nlines 64-76\nexit 0"]
step2["Step 2: Initialize mdBook\nlines 79-106:\nmkdir, cat > book.toml"]
step3["Step 3: Generate SUMMARY.md\nlines 109-159:\nscan files, generate TOC"]
step4["Step 4: Copy sources\nlines 164-166:\ncp wiki/* src/"]
step5["Step 5: Install mermaid\nlines 169-171:\nmdbook-mermaid install"]
step6["Step 6: Build book\nlines 174-176:\nmdbook build"]
step7["Step 7: Copy outputs\nlines 179-191:\ncp to /output"]
done["Done"]
start --> detect
detect --> validate
validate --> step1
step1 --> check
check -->|yes| markdown_exit
markdown_exit --> done
check -->|no| step2
step2 --> step3
step3 --> step4
step4 --> step5
step5 --> step6
step6 --> step7
step7 --> done
Shell Script Structure
The build-docs.sh orchestrator follows a linear execution model with conditional branching for markdown-only mode.
Shell Script Execution Blocks
Sources: build-docs.sh:1-206
Cross-Component Data Formats
Data passes between components in well-defined formats through the file system.
| Data Format | Producer | Consumer | Location | Structure |
|---|---|---|---|---|
| Enhanced Markdown | deepwiki-scraper.py | mdbook | $WIKI_DIR/*.md | UTF-8 text, front matter optional, mermaid code blocks |
book.toml | build-docs.sh | mdbook | $BOOK_DIR/book.toml | TOML format, sections: [book], [output.html], [preprocessor.mermaid] |
SUMMARY.md | build-docs.sh | mdbook | $BOOK_DIR/src/SUMMARY.md | Markdown list format, relative file paths |
| File hierarchy | deepwiki-scraper.py | build-docs.sh | $WIKI_DIR/ and $WIKI_DIR/section-*/ | Root: N-title.md, Subsections: section-N/N-M-title.md |
| HTML output | mdbook | User | $OUTPUT_DIR/book/ | Complete static site with search index |
Sources: build-docs.sh:84-103 build-docs.sh:112-159 tools/deepwiki-scraper.py:849-868
graph TB
subgraph "Stage 1: rust:latest"
rust_base["rust:latest base\n~1.5 GB"]
cargo["cargo install"]
mdbook_build["mdbook binary\ncompilation"]
mermaid_build["mdbook-mermaid binary\ncompilation"]
rust_base --> cargo
cargo --> mdbook_build
cargo --> mermaid_build
end
subgraph "Stage 2: python:3.12-slim"
py_base["python:3.12-slim base\n~150 MB"]
uv_install["Install uv package manager"]
pip_install["uv pip install\nrequirements.txt"]
copy_rust["COPY --from=builder\nRust binaries"]
copy_scripts["COPY Python + Shell scripts"]
py_base --> uv_install
uv_install --> pip_install
pip_install --> copy_rust
copy_rust --> copy_scripts
end
subgraph "Final Image Contents"
final["/usr/local/bin/"]
build_sh["build-docs.sh"]
scraper_py["deepwiki-scraper.py"]
mdbook_final["mdbook"]
mermaid_final["mdbook-mermaid"]
final --> build_sh
final --> scraper_py
final --> mdbook_final
final --> mermaid_final
end
mdbook_build -.->|extract| copy_rust
mermaid_build -.->|extract| copy_rust
copy_scripts --> build_sh
copy_scripts --> scraper_py
copy_rust --> mdbook_final
copy_rust --> mermaid_final
Component Installation in Docker
The multi-stage Docker build process installs each component using its native tooling, then combines them in a minimal runtime image.
Docker Build Process
Sources: Dockerfile:1-33
Next Steps
For detailed implementation documentation of each component, see:
- build-docs.sh Orchestrator : Environment variable parsing, Git auto-detection, configuration file generation, subprocess execution, error handling
- deepwiki-scraper.py : Wiki structure discovery, HTML parsing, Markdown conversion, diagram extraction algorithms, fuzzy matching implementation
- mdBook Integration : Configuration schema, SUMMARY.md generation algorithm, mdbook-mermaid preprocessor integration, theme customization
build-docs.sh Orchestrator
Relevant source files
Purpose and Scope
This page documents the build-docs.sh shell script, which serves as the central orchestrator for the entire documentation build process. This script is the container's entry point and coordinates all phases of the system: configuration parsing, scraper invocation, mdBook configuration generation, and output management.
For details about the Python scraping component that this orchestrator calls, see deepwiki-scraper.py. For information about the mdBook integration and configuration format, see mdBook Integration.
Overview
The build-docs.sh script is a Bash orchestration layer that implements the three-phase pipeline described in Three-Phase Pipeline. It has the following core responsibilities:
| Responsibility | Lines | Description |
|---|---|---|
| Auto-detection | build-docs.sh:8-19 | Detects repository from Git remote if not provided |
| Configuration | build-docs.sh:21-53 | Parses environment variables and applies defaults |
| Phase 1 orchestration | build-docs.sh:55-58 | Invokes Python scraper |
| Markdown-only exit | build-docs.sh:60-76 | Implements fast-path for debugging |
| Phase 3 orchestration | build-docs.sh:78-191 | Generates configs, builds mdBook, copies outputs |
Sources: build-docs.sh:1-206
Script Workflow
Complete Execution Flow
The following diagram shows the complete control flow through the orchestrator, including all decision points and phase transitions:
flowchart TD
Start[["build-docs.sh entry"]]
Start --> AutoDetect["Auto-detect repository\nfrom git config"]
AutoDetect --> ValidateRepo{"REPO variable\nset?"}
ValidateRepo -->|No| Error[["Exit with error"]]
ValidateRepo -->|Yes| ExtractParts["Extract REPO_OWNER\nand REPO_NAME"]
ExtractParts --> SetDefaults["Set defaults:\nBOOK_AUTHORS=REPO_OWNER\nGIT_REPO_URL=github.com/REPO"]
SetDefaults --> PrintConfig["Print configuration\nto stdout"]
PrintConfig --> Phase1["Execute Phase 1:\npython3 deepwiki-scraper.py"]
Phase1 --> CheckMode{"MARKDOWN_ONLY\n= true?"}
CheckMode -->|Yes| CopyMd["Copy WIKI_DIR to\nOUTPUT_DIR/markdown"]
CopyMd --> ExitMd[["Exit: markdown-only"]]
CheckMode -->|No| InitBook["Create BOOK_DIR\nand book.toml"]
InitBook --> GenSummary["Generate SUMMARY.md\nfrom file structure"]
GenSummary --> CopySrc["Copy WIKI_DIR to\nBOOK_DIR/src"]
CopySrc --> InstallMermaid["mdbook-mermaid install"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["Copy outputs:\nbook/, markdown/, book.toml"]
CopyOutputs --> Success[["Exit: build complete"]]
style Start fill:#f9f9f9
style Phase1 fill:#e8f5e9
style CheckMode fill:#fff9c4
style ExitMd fill:#f9f9f9
style Success fill:#f9f9f9
style Error fill:#ffebee
Sources: build-docs.sh:1-206
Key Decision Point: MARKDOWN_ONLY Mode
The MARKDOWN_ONLY environment variable creates two distinct execution paths in the orchestrator. When set to "true", the script bypasses mdBook configuration generation and building (Phase 3), providing a fast path for debugging content extraction and diagram placement.
Sources: build-docs.sh26 build-docs.sh:60-76
Configuration Handling
Auto-Detection System
The script implements an intelligent auto-detection system for the REPO variable when running in a Git repository context:
flowchart LR
Start["REPO variable"] --> Check{"REPO set?"}
Check -->|Yes| UseProvided["Use provided value"]
Check -->|No| GitCheck{"Inside Git\nrepository?"}
GitCheck -->|No| RequireManual["REPO remains empty"]
GitCheck -->|Yes| GetRemote["git config --get\nremote.origin.url"]
GetRemote --> Extract["Extract owner/repo\nusing sed regex"]
Extract --> SetRepo["Set REPO variable"]
UseProvided --> Validate["Validation check"]
SetRepo --> Validate
RequireManual --> Validate
Validate --> ValidCheck{"REPO\nis set?"}
ValidCheck -->|No| ExitError[["Exit with error:\nREPO must be set"]]
ValidCheck -->|Yes| Continue["Continue execution"]
The regular expression used for extraction handles multiple GitHub URL formats:
- https://github.com/owner/repo.git
- git@github.com:owner/repo.git
- https://github.com/owner/repo
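A minimal sketch of the auto-detection logic, assuming the sed expression shown here rather than the exact one at build-docs.sh:8-19:

```bash
if [ -z "$REPO" ] && git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  remote_url=$(git config --get remote.origin.url)
  # Strip the protocol/host prefix and a trailing .git to get "owner/repo"
  REPO=$(echo "$remote_url" | sed -E 's#^(https://github\.com/|git@github\.com:)##; s#\.git$##')
fi
```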
Sources: build-docs.sh:8-19 build-docs.sh:32-37
Configuration Variable Flow
The script manages five primary configuration variables, with the following precedence and default logic:
| Variable | Source | Default Derivation | Code Reference |
|---|---|---|---|
REPO | Environment or Git auto-detect | (required) | build-docs.sh8-22 |
BOOK_TITLE | Environment | "Documentation" | build-docs.sh23 |
BOOK_AUTHORS | Environment | $REPO_OWNER | build-docs.sh:40-44 |
GIT_REPO_URL | Environment | https://github.com/$REPO | build-docs.sh:40-45 |
MARKDOWN_ONLY | Environment | "false" | build-docs.sh26 |
The script extracts REPO_OWNER and REPO_NAME from the REPO variable using shell string manipulation:
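A likely form of that string manipulation, using standard parameter expansion (the script at build-docs.sh:39-45 may phrase it differently):

```bash
REPO_OWNER="${REPO%%/*}"   # text before the first "/"
REPO_NAME="${REPO##*/}"    # text after the last "/"
```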
Sources: build-docs.sh:39-45
Working Directory Structure
The orchestrator uses four primary directory paths:
WORK_DIR="/workspace": Temporary workspace for all build operationsWIKI_DIR="$WORK_DIR/wiki": Scraper output locationBOOK_DIR="$WORK_DIR/book": mdBook project directoryOUTPUT_DIR="/output": Volume-mounted final output location
Sources: build-docs.sh:27-30
Phase Orchestration
Phase 1: Scraper Invocation
The orchestrator invokes the Python scraper with exactly two positional arguments:
This command executes the complete Phase 1 and Phase 2 pipeline as documented in Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement. The scraper writes all output to $WIKI_DIR.
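The invocation is a single command. This sketch of build-docs.sh:58 assumes the script is called via its installed path; the real line may rely on $PATH instead:

```bash
python3 /usr/local/bin/deepwiki-scraper.py "$REPO" "$WIKI_DIR"
```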
Sources: build-docs.sh:55-58
Phase 3: mdBook Configuration and Build
Phase 3 is implemented through six distinct steps in the orchestrator (Steps 2 through 7 in the console output).
Note: Step numbering in stdout messages is off-by-one from phase numbering because the scraper is "Step 1."
Sources: build-docs.sh:78-191
Configuration File Generation
flowchart LR
EnvVars["Environment variables:\nBOOK_TITLE\nBOOK_AUTHORS\nGIT_REPO_URL"]
Template["Heredoc template\nat line 85-103"]
BookToml["BOOK_DIR/book.toml"]
EnvVars --> Template
Template --> BookToml
BookToml --> MdBook["mdbook build"]
book.toml Generation
The orchestrator dynamically generates the book.toml configuration file for mdBook using a heredoc:
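A minimal sketch of that heredoc, with field names taken from the configuration table in the mdBook Integration page; the exact formatting at build-docs.sh:84-103 may differ:

```bash
cat > book.toml << EOF
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]
language = "en"
multilingual = false
src = "src"

[output.html]
default-theme = "rust"
git-repository-url = "$GIT_REPO_URL"

[output.html.fold]
enable = true
level = 1

[preprocessor.mermaid]
command = "mdbook-mermaid"
EOF
```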
The generated book.toml includes:
- [book] section: title, authors, language, multilingual, src
- [output.html] section: default-theme, git-repository-url
- [preprocessor.mermaid] section: command
- [output.html.fold] section: enable, level
The git-repository-url setting enables mdBook's "Edit this page" functionality, linking back to the GitHub repository specified in $GIT_REPO_URL.
Sources: build-docs.sh:84-103
flowchart TD
Start["Begin SUMMARY.md generation"]
Start --> FindFirst["Find first .md file\nin WIKI_DIR root"]
FindFirst --> ExtractTitle1["Extract title from\nfirst line (# Title)"]
ExtractTitle1 --> WriteIntro["Write as Introduction link"]
WriteIntro --> IterateMain["Iterate *.md files\nin WIKI_DIR root"]
IterateMain --> SkipFirst{"Is this\nfirst file?"}
SkipFirst -->|Yes| NextFile["Skip to next file"]
SkipFirst -->|No| ExtractTitle2["Extract title\nfrom first line"]
ExtractTitle2 --> GetSectionNum["Extract section number\nusing grep regex"]
GetSectionNum --> CheckSubdir{"section-N/\ndirectory exists?"}
CheckSubdir -->|No| WriteStandalone["Write as standalone:\n- [Title](file.md)"]
CheckSubdir -->|Yes| WriteSection["Write section header:\n# Title"]
WriteSection --> WriteMainLink["Write main page link:\n- [Title](file.md)"]
WriteMainLink --> IterateSubs["Iterate section-N/*.md"]
IterateSubs --> WriteSubLinks["Write indented sub-links:\n - [SubTitle](section-N/file.md)"]
WriteStandalone --> NextFile
WriteSubLinks --> NextFile
NextFile --> MoreFiles{"More\nfiles?"}
MoreFiles -->|Yes| IterateMain
MoreFiles -->|No| WriteSummary["Write to BOOK_DIR/src/SUMMARY.md"]
WriteSummary --> Done["Generation complete"]
SUMMARY.md Generation Algorithm
The orchestrator generates the table of contents (SUMMARY.md) by scanning the actual file structure in $WIKI_DIR. This dynamic generation ensures the table of contents always matches the scraped content.
The algorithm extracts each title by reading the first line of the Markdown file and stripping the # prefix with sed, and extracts each section number from the filename with a grep regex, roughly as sketched below.
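Both one-liners are illustrative: the grep pattern appears elsewhere in this documentation, while the sed expression is an assumption about how the title is stripped; the real loop at build-docs.sh:108-159 is more involved:

```bash
# Title: first line of the file, minus the leading "# "
title=$(head -n 1 "$file" | sed 's/^# //')

# Section number: leading digits of the filename, e.g. "5" from 5-component-reference.md
section_num=$(basename "$file" | grep -oE '^[0-9]+')
```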
For detailed information about how the file structure is organized, see Wiki Structure Discovery.
Sources: build-docs.sh:108-159
File Operations
Copy Operations Mapping
The orchestrator performs strategic copy operations to move data through the pipeline:
| Source | Destination | Purpose | Code Reference |
|---|---|---|---|
$WIKI_DIR/* | $OUTPUT_DIR/markdown/ | Markdown-only mode output | build-docs.sh65 |
$WIKI_DIR/* | $BOOK_DIR/src/ | Source files for mdBook | build-docs.sh166 |
$BOOK_DIR/book | $OUTPUT_DIR/book/ | Final HTML output | build-docs.sh184 |
$WIKI_DIR/* | $OUTPUT_DIR/markdown/ | Markdown reference copy | build-docs.sh188 |
$BOOK_DIR/book.toml | $OUTPUT_DIR/book.toml | Configuration reference | build-docs.sh191 |
The final output structure in $OUTPUT_DIR is:
/output/
├── book/ # HTML documentation (from BOOK_DIR/book)
│ ├── index.html
│ ├── *.html
│ └── ...
├── markdown/ # Source Markdown files (from WIKI_DIR)
│ ├── 1-overview.md
│ ├── 2-section.md
│ ├── section-2/
│ └── ...
└── book.toml # Configuration copy (from BOOK_DIR)
Sources: build-docs.sh:178-191
Atomic Output Management
The orchestrator uses a two-stage directory strategy for atomic outputs:
- Working stage: All operations occur in /workspace (ephemeral)
- Output stage: Final artifacts are copied to /output (volume-mounted)
This ensures that partial builds never appear in the output directory—only completed artifacts are copied. If any step fails, the set -e directive at build-docs.sh2 causes immediate script termination with no partial outputs.
Sources: build-docs.sh2 build-docs.sh:27-30 build-docs.sh:178-191
Tool Invocations
External Command Execution
The orchestrator invokes three external tools during execution:
Each tool is invoked with specific working directories and arguments; a combined sketch of the invocations follows the list below.
- Python scraper invocation (build-docs.sh:58)
- mdbook-mermaid installation (build-docs.sh:171), which installs the necessary JavaScript and CSS assets for Mermaid diagram rendering into the mdBook project
- mdBook build (build-docs.sh:176), executed from within $BOOK_DIR due to cd "$BOOK_DIR" at build-docs.sh:82
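A minimal sketch of the three invocations, assuming the paths shown here; the exact commands live at the cited lines:

```bash
# Step 1: scrape wiki content into $WIKI_DIR (build-docs.sh:58)
python3 /usr/local/bin/deepwiki-scraper.py "$REPO" "$WIKI_DIR"

# Step 5: install Mermaid JS/CSS assets into the book project (build-docs.sh:171)
mdbook-mermaid install "$BOOK_DIR"

# Step 6: build the HTML site from inside the book directory (build-docs.sh:176)
cd "$BOOK_DIR"
mdbook build
```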
Sources: build-docs.sh58 build-docs.sh82 build-docs.sh171 build-docs.sh176
Error Handling
Validation and Exit Conditions
The script implements minimal but critical validation:
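The check is roughly as follows; this is a sketch of build-docs.sh:2 and build-docs.sh:32-37, and the actual usage message differs:

```bash
set -e  # any failing command aborts the build

if [ -z "$REPO" ]; then
  echo "Error: REPO must be set (e.g. -e REPO=owner/repo) or derivable from a Git remote"
  exit 1
fi
```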
The set -e directive at build-docs.sh2 ensures that any command failure (non-zero exit code) immediately terminates the script. This includes:
- HTTP failures in the Python scraper
- File system errors during copy operations
- mdBook build failures
- mdbook-mermaid installation failures
The only explicit validation check is for the REPO variable at build-docs.sh:32-37 which prints usage instructions and exits with code 1 if not set.
Sources: build-docs.sh2 build-docs.sh:32-37
Stdout Output Format
The orchestrator provides structured console output for monitoring build progress:
================================================================================
DeepWiki Documentation Builder
================================================================================
Configuration:
Repository: owner/repo
Book Title: Documentation Title
Authors: Author Name
Git Repo URL: https://github.com/owner/repo
Markdown Only: false
Step 1: Scraping wiki from DeepWiki...
[scraper output...]
Step 2: Initializing mdBook structure...
Step 3: Generating SUMMARY.md from scraped content...
Generated SUMMARY.md with N entries
Step 4: Copying markdown files to book...
Step 5: Installing mdbook-mermaid assets...
Step 6: Building mdBook...
Step 7: Copying outputs to /output...
================================================================================
✓ Documentation build complete!
================================================================================
Outputs:
- HTML book: /output/book/
- Markdown files: /output/markdown/
- Book config: /output/book.toml
To serve the book locally:
cd /output && python3 -m http.server --directory book 8000
Each step is clearly labeled with progress indicators. The configuration block is printed before processing begins to aid in debugging.
Sources: build-docs.sh:4-6 build-docs.sh:47-53 build-docs.sh:55-205
deepwiki-scraper.py
Relevant source files
Purpose and Scope
The deepwiki-scraper.py script is the core content extraction engine that scrapes wiki pages from DeepWiki.com and converts them into clean Markdown files with intelligently placed Mermaid diagrams. This page documents the script's internal architecture, algorithms, and data transformations.
For information about how this script is orchestrated within the larger build system, see 5.1: build-docs.sh Orchestrator. For detailed explanations of the extraction and enhancement phases, see 6: Phase 1: Markdown Extraction and 7: Phase 2: Diagram Enhancement.
Sources: tools/deepwiki-scraper.py:1-11
Command-Line Interface
The script accepts exactly two arguments and is designed to be called programmatically:
| Parameter | Description | Example |
|---|---|---|
owner/repo | GitHub repository identifier in format owner/repo | facebook/react |
output-dir | Directory where markdown files will be written | ./output/markdown |
The script validates the repository format using regex ^[\w-]+/[\w-]+$ and exits with an error if the format is invalid.
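A sketch of the argument handling, assuming sys.argv parsing and the regex quoted above (the real code at tools/deepwiki-scraper.py:790-802 is structured differently):

```python
import re
import sys

def parse_args() -> tuple[str, str]:
    # Expect exactly two positional arguments: owner/repo and the output directory
    if len(sys.argv) != 3:
        sys.exit("Usage: deepwiki-scraper.py <owner/repo> <output-dir>")
    repo, output_dir = sys.argv[1], sys.argv[2]
    if not re.match(r"^[\w-]+/[\w-]+$", repo):
        sys.exit(f"Invalid repository format: {repo!r} (expected owner/repo)")
    return repo, output_dir
```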
Sources: tools/deepwiki-scraper.py:790-802
Main Execution Flow
The main() function orchestrates all operations using a temporary directory workflow to ensure atomic file operations:
Atomic Workflow Design: All scraping and enhancement operations occur in a temporary directory. Files are only moved to the final output directory after all processing completes successfully. If the script crashes or is interrupted, the output directory remains untouched.
graph TB
Start["main()"] --> Validate["Validate Arguments\nRegex: ^[\w-]+/[\w-]+$"]
Validate --> TempDir["Create Temporary Directory\ntempfile.TemporaryDirectory()"]
TempDir --> Session["Create requests.Session()\nwith User-Agent headers"]
Session --> Phase1["PHASE 1: Clean Markdown\nextract_wiki_structure()\nextract_page_content()"]
Phase1 --> WriteTemp["Write files to temp_dir\nOrganized by hierarchy"]
WriteTemp --> Phase2["PHASE 2: Diagram Enhancement\nextract_and_enhance_diagrams()"]
Phase2 --> EnhanceTemp["Enhance files in temp_dir\nInsert diagrams via fuzzy matching"]
EnhanceTemp --> Phase3["PHASE 3: Atomic Move\nshutil.copytree()\nshutil.copy2()"]
Phase3 --> CleanOutput["Clear output_dir\nMove temp files to output"]
CleanOutput --> Complete["Complete\ntemp_dir auto-deleted"]
style Phase1 fill:#e8f5e9
style Phase2 fill:#f3e5f5
style Phase3 fill:#fff4e1
Sources: tools/deepwiki-scraper.py:790-919
Dependencies and HTTP Session
The script imports three primary libraries for web scraping and conversion:
| Dependency | Purpose | Key Usage |
|---|---|---|
requests | HTTP client with session support | tools/deepwiki-scraper.py17 |
beautifulsoup4 | HTML parsing and DOM traversal | tools/deepwiki-scraper.py18 |
html2text | HTML to Markdown conversion | tools/deepwiki-scraper.py19 |
The HTTP session is configured with browser-like headers to avoid being blocked:
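A sketch of the session setup; the exact User-Agent string used at tools/deepwiki-scraper.py:817-821 will differ:

```python
import requests

session = requests.Session()
session.headers.update({
    # Browser-like User-Agent so requests are not rejected as bot traffic
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
})
```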
Sources: tools/deepwiki-scraper.py:817-821 tools/requirements.txt:1-4
Core Function Reference
Structure Discovery Functions
extract_wiki_structure(repo, session) tools/deepwiki-scraper.py:78-125
- Fetches the repository's main wiki page
- Extracts all links matching pattern /owner/repo/\d+
- Parses page numbers (e.g., 1, 2.1, 3.2.1) and titles
- Determines hierarchy level by counting dots in the page number
- Returns a sorted list of page dictionaries with keys: number, title, url, href, level
discover_subsections(repo, main_page_num, session) tools/deepwiki-scraper.py:44-76
- Attempts to discover subsections by testing URL patterns
- Tests up to 10 subsections per main page (e.g., /repo/2-1-, /repo/2-2-)
- Uses HEAD requests for efficiency
- Returns list of discovered subsection metadata
Sources: tools/deepwiki-scraper.py:44-125
Content Extraction Functions
extract_page_content(url, session, current_page_info) tools/deepwiki-scraper.py:453-594
- Main content extraction function called for each wiki page
- Removes navigation and UI elements before conversion
- Converts HTML to Markdown using the html2text library
- Rewrites internal DeepWiki links to relative Markdown file paths
- Returns clean Markdown string
fetch_page(url, session) tools/deepwiki-scraper.py:27-42
- Implements retry logic with exponential backoff
- Attempts each request up to 3 times with 2-second delays
- Raises exception on final failure
- Returns a requests.Response object
convert_html_to_markdown(html_content) tools/deepwiki-scraper.py:175-216
- Configures html2text.HTML2Text() with body_width=0 (no line wrapping)
- Sets ignore_links=False to preserve link structure
- Calls clean_deepwiki_footer() to remove UI elements
- Diagrams are not extracted here (handled in Phase 2)
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:175-216 tools/deepwiki-scraper.py:453-594
graph TB
Input["Input: /owner/repo/4-2-query-planning"] --> Extract["Extract via regex:\n/(\d+(?:\.\d+)*)-(.+)$"]
Extract --> ParseNum["page_num = '4.2'\nslug = 'query-planning'"]
ParseNum --> ConvertNum["file_num = page_num.replace('.', '-')\nResult: '4-2'"]
ConvertNum --> CheckTarget{"Target is\nsubsection?\n(has dot)"}
CheckTarget -->|Yes| CheckSource{"Source is\nsubsection?\n(level > 0)"}
CheckTarget -->|No| CheckSource2{"Source is\nsubsection?"}
CheckSource -->|Yes, same section| SameSec["Return: '4-2-query-planning.md'"]
CheckSource -->|No or different| DiffSec["Return: 'section-4/4-2-query-planning.md'"]
CheckSource2 -->|Yes| UpLevel["Return: '../4-2-query-planning.md'"]
CheckSource2 -->|No| SameLevel["Return: '4-2-query-planning.md'"]
style CheckTarget fill:#e8f5e9
style CheckSource fill:#fff4e1
Link Rewriting Logic
The script converts DeepWiki's absolute URLs to relative Markdown file paths, handling hierarchical section directories:
Algorithm Implementation: tools/deepwiki-scraper.py:549-592
The fix_wiki_link() nested function handles four scenarios:
- Both main pages: Use filename only (e.g., 2-overview.md)
- Source subsection → target main page: Use ../ prefix (e.g., ../2-overview.md)
- Both in same section directory: Use filename only (e.g., 4-2-sql-parser.md)
- Different sections: Use full path (e.g., section-4/4-2-sql-parser.md)
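The following is a hypothetical sketch of that decision logic; the real nested function in deepwiki-scraper.py differs in structure and naming:

```python
def fix_wiki_link(target_num: str, target_slug: str,
                  source_level: int, source_section: str | None = None) -> str:
    """Sketch of the four link-rewriting scenarios, not the exact implementation."""
    filename = f"{target_num.replace('.', '-')}-{target_slug}.md"
    target_section = target_num.split(".")[0]
    target_is_subsection = "." in target_num
    source_is_subsection = source_level > 0

    if not target_is_subsection:
        # Target is a main page: subsections must climb out of their section-N/ directory
        return f"../{filename}" if source_is_subsection else filename
    if source_is_subsection and source_section == target_section:
        # Both files live in the same section-N/ directory
        return filename
    # Cross-section link, or a main page linking down into a section directory
    return f"section-{target_section}/{filename}"
```

For example, fix_wiki_link("4.2", "query-planning", source_level=0) yields section-4/4-2-query-planning.md, matching the diagram above.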
Sources: tools/deepwiki-scraper.py:549-592
Diagram Enhancement Architecture
Diagram Extraction from Next.js Payload
DeepWiki embeds all Mermaid diagrams in a JavaScript payload within the HTML. The extract_and_enhance_diagrams() function extracts diagrams with contextual information:
Key Data Structures:
graph TB
Start["extract_and_enhance_diagrams(repo, temp_dir, session)"] --> FetchJS["Fetch https://deepwiki.com/{repo}/1-overview\nAny page contains all diagrams"]
FetchJS --> Pattern1["Regex: ```mermaid\\\\\n(.*?)```\nFind all diagram blocks"]
Pattern1 --> Count["Print: Found {N}
total diagrams"]
Count --> Pattern2["Regex with context:\n([^`]{500,}?)```mermaid\\\\ (.*?)```"]
Pattern2 --> Extract["For each match:\n- Extract 500-char context before\n- Extract diagram code"]
Extract --> Unescape["Unescape sequences:\n\\\n→ newline\n\\ → tab\n\\\" → quote\n\< → '<'"]
Unescape --> Parse["Parse context:\n- Find last heading\n- Extract last 2-3 non-heading lines\n- Create anchor_text (last 300 chars)"]
Parse --> Store["Store diagram_contexts[]\nKeys: last_heading, anchor_text, diagram"]
Store --> Enhance["Enhance all .md files in temp_dir"]
style Pattern1 fill:#e8f5e9
style Pattern2 fill:#fff4e1
style Parse fill:#f3e5f5
Sources: tools/deepwiki-scraper.py:596-674
graph TB
Start["For each markdown file"] --> Normalize["Normalize content:\n- Convert to lowercase\n- Collapse whitespace\n- content_normalized = ' '.join(content.split())"]
Normalize --> Loop["For each diagram in diagram_contexts"]
Loop --> GetAnchors["Get anchor_text and last_heading\nfrom diagram context"]
GetAnchors --> TryChunks{"Try chunk sizes:\n300, 200, 150, 100, 80"}
TryChunks --> ExtractChunk["Extract last N chars of anchor_text\ntest_chunk = anchor[-chunk_size:]"]
ExtractChunk --> FindPos["pos = content_normalized.find(test_chunk)"]
FindPos --> Found{"pos != -1?"}
Found -->|Yes| ConvertLine["Convert char position to line number\nby counting chars in each line"]
Found -->|No| TrySmaller{"Try smaller\nchunk?"}
TrySmaller -->|Yes| ExtractChunk
TrySmaller -->|No| Fallback["Fallback: Match heading text\nheading_normalized in line_normalized"]
ConvertLine --> FindInsert["Find insertion point:\n- After heading: skip blanks, skip paragraph\n- After paragraph: find blank line"]
Fallback --> FindInsert
FindInsert --> Queue["Add to pending_insertions[]\n(line_num, diagram, score, idx)"]
Queue --> InsertAll["Sort by line_num (reverse)\nInsert diagrams bottom-up"]
InsertAll --> Save["Write enhanced file\nto same path in temp_dir"]
style TryChunks fill:#e8f5e9
style Found fill:#fff4e1
style FindInsert fill:#f3e5f5
Fuzzy Matching Algorithm
The script uses progressive chunk matching to find where diagrams belong in the Markdown content:
Progressive Chunk Sizes: The algorithm tries matching increasingly smaller chunks (300 → 200 → 150 → 100 → 80 characters) until it finds a match. This handles variations in text formatting between the JavaScript payload and html2text output.
Scoring: Each match is scored based on the chunk size used. Larger chunks indicate more confident matches.
Bottom-Up Insertion: Diagrams are inserted from the bottom of the file upward to preserve line numbers during insertion.
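A condensed sketch of the progressive matching loop; variable names are illustrative, and the real code at tools/deepwiki-scraper.py:676-788 also scores matches, falls back to heading text, and computes the exact insertion point:

```python
def find_anchor_line(content: str, anchor_text: str) -> int | None:
    """Return the line index containing the matched chunk, or None if nothing matches."""
    lines = content.split("\n")

    def norm(s: str) -> str:
        return " ".join(s.lower().split())   # lowercase and collapse whitespace

    content_norm = norm(content)
    anchor_norm = norm(anchor_text)

    for chunk_size in (300, 200, 150, 100, 80):   # largest (most confident) chunk first
        chunk = anchor_norm[-chunk_size:]
        pos = content_norm.find(chunk)
        if pos == -1:
            continue
        # Map the character offset back to a line number (approximate, since blank
        # lines collapse during normalization)
        seen = 0
        for line_num, line in enumerate(lines):
            seen += len(norm(line)) + 1           # +1 for the joining space
            if seen > pos:
                return line_num
    return None
```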
Sources: tools/deepwiki-scraper.py:676-788
Helper Functions
Filename Sanitization
sanitize_filename(text) tools/deepwiki-scraper.py:21-25
- Removes non-alphanumeric characters except hyphens and spaces
- Collapses multiple hyphens/spaces into single hyphens
- Converts to lowercase
- Example: "Query Planning & Optimization" → "query-planning-optimization"
Footer Cleaning
clean_deepwiki_footer(markdown) tools/deepwiki-scraper.py:127-173
- Removes DeepWiki UI elements from markdown using regex patterns
- Patterns include: "Dismiss", "Refresh this wiki", "Edit Wiki", "On this page"
- Removes all content from footer start to end of file
- Also removes trailing empty lines
Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:127-173
File Organization and Output
The script organizes output files based on the hierarchical page structure:
File Naming Convention: {number}-{title-slug}.md
- Number with dots replaced by hyphens (e.g., 2.1 → 2-1)
- Title sanitized to safe filename format
- Examples: 1-overview.md, 2-1-workspace.md, 4-3-2-optimizer.md
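Combining the numbering and sanitization rules, a filename can be derived roughly as follows; sanitize_filename here is a simplified stand-in for the real helper at tools/deepwiki-scraper.py:21-25:

```python
import re

def sanitize_filename(text: str) -> str:
    # Keep alphanumerics, hyphens and spaces; collapse runs into single hyphens; lowercase
    text = re.sub(r"[^A-Za-z0-9 -]", "", text)
    return re.sub(r"[ -]+", "-", text).strip("-").lower()

def page_filename(number: str, title: str) -> str:
    return f"{number.replace('.', '-')}-{sanitize_filename(title)}.md"

# e.g. page_filename("2.1", "Workspace") -> "2-1-workspace.md"
```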
Sources: tools/deepwiki-scraper.py:842-877 tools/deepwiki-scraper.py:897-908
Error Handling and Resilience
The script implements multiple layers of error handling:
Retry Logic
HTTP Request Retries: tools/deepwiki-scraper.py:33-42
- Each HTTP request attempts up to 3 times
- 2-second delay between attempts
- Only raises exception on final failure
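A sketch of the retry wrapper follows; the real fetch_page() at tools/deepwiki-scraper.py:27-42 may structure the loop differently, and the timeout value here is an assumption:

```python
import time
import requests

def fetch_page(url: str, session: requests.Session, retries: int = 3) -> requests.Response:
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries:
                raise              # give up after the final attempt
            time.sleep(2)          # fixed 2-second delay between attempts
```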
Graceful Degradation
| Scenario | Behavior |
|---|---|
| No pages found | Exit with error message and status code 1 |
| Page extraction fails | Print error, continue with remaining pages |
| Diagram extraction fails | Print warning, continue without diagrams |
| Content selector not found | Fall back to <body> tag as last resort |
Temporary Directory Cleanup
The script uses Python's tempfile.TemporaryDirectory() context manager, which automatically deletes the temporary directory even if the script crashes or is interrupted. This prevents accumulation of partial work files.
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:808-916
Performance Characteristics
Rate Limiting
The script includes a 1-second sleep between page fetches to be respectful to the DeepWiki server:
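In the page loop this amounts to a single call (tools/deepwiki-scraper.py:872):

```python
time.sleep(1)  # pause between page fetches to avoid hammering deepwiki.com
```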
Sources: tools/deepwiki-scraper.py872
Memory Efficiency
- Uses streaming HTTP responses where possible
- Processes one page at a time rather than loading all pages into memory
- Temporary directory is cleared automatically after completion
Typical Execution Times
For a repository with approximately 20 pages and 50 diagrams:
- Phase 1 (Extraction): ~30-40 seconds (with 1-second delays between requests)
- Phase 2 (Enhancement): ~5-10 seconds (local processing)
- Phase 3 (Move): <1 second (file operations)
Total: Approximately 40-50 seconds for a medium-sized wiki.
Sources: tools/deepwiki-scraper.py:790-919
Data Flow Summary
Sources: tools/deepwiki-scraper.py:1-919
mdBook Integration
Relevant source files
Purpose and Scope
This document explains how the DeepWiki-to-mdBook Converter integrates with mdBook and its plugins to generate the final HTML documentation. It covers the configuration file generation, table of contents assembly, build orchestration, and diagram rendering setup. For details about the earlier phases that produce the markdown input files, see Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement. For specifics on the build process flow, see Phase 3: mdBook Build.
Overview
The system integrates mdBook as the final transformation stage, converting enhanced Markdown files into a searchable, navigable HTML documentation site. This integration is orchestrated by the build-docs.sh script and uses two Rust-based tools compiled during the Docker build:
| Tool | Purpose | Installation Method |
|---|---|---|
mdbook | Core documentation generator | Compiled from source via cargo install |
mdbook-mermaid | Mermaid diagram preprocessor | Compiled from source via cargo install |
The integration is optional and can be bypassed by setting MARKDOWN_ONLY=true, which produces only the enhanced Markdown files without the HTML build.
Sources: build-docs.sh:60-76 Dockerfile:1-5 Dockerfile:19-21
Integration Architecture
The following diagram shows how mdBook is integrated into the three-phase pipeline and what files it consumes and produces:
Sources: build-docs.sh:60-206 README.md:138-144
graph TB
subgraph "Phase 1 & 2 Output"
WikiDir["$WIKI_DIR\n(Scraped Markdown)"]
RootMD["*.md files\n(Root pages)"]
SectionDirs["section-N/\n(Subsection pages)"]
WikiDir --> RootMD
WikiDir --> SectionDirs
end
subgraph "build-docs.sh Orchestrator"
CheckMode{"MARKDOWN_ONLY\ncheck"}
GenConfig["Generate book.toml\n(lines 84-103)"]
GenSummary["Generate SUMMARY.md\n(lines 108-159)"]
CopyFiles["Copy to src/\n(line 166)"]
InstallAssets["mdbook-mermaid install\n(line 171)"]
BuildCmd["mdbook build\n(line 176)"]
end
subgraph "mdBook Process"
ParseConfig["Parse book.toml"]
ParseSummary["Parse SUMMARY.md"]
ProcessMD["Process Markdown files"]
MermaidPreproc["mdbook-mermaid\npreprocessor"]
RenderHTML["Render HTML pages"]
end
subgraph "Output Directory"
BookHTML["$OUTPUT_DIR/book/\n(HTML site)"]
BookToml["$OUTPUT_DIR/book.toml\n(Config copy)"]
MarkdownCopy["$OUTPUT_DIR/markdown/\n(Source copy)"]
end
WikiDir --> CheckMode
CheckMode -->|false| GenConfig
CheckMode -->|true| MarkdownCopy
GenConfig --> GenSummary
GenSummary --> CopyFiles
CopyFiles --> InstallAssets
InstallAssets --> BuildCmd
BuildCmd --> ParseConfig
BuildCmd --> ParseSummary
ParseConfig --> ProcessMD
ParseSummary --> ProcessMD
ProcessMD --> MermaidPreproc
MermaidPreproc --> RenderHTML
RenderHTML --> BookHTML
GenConfig -.->|copy| BookToml
WikiDir -.->|copy| MarkdownCopy
Configuration Generation (book.toml)
The build-docs.sh script dynamically generates the book.toml configuration file using environment variables. This generation occurs in build-docs.sh:84-103 and produces a TOML file with three main sections:
Configuration Structure
Configuration Fields
The generated configuration includes:
| Section | Field | Value Source | Purpose |
|---|---|---|---|
[book] | title | $BOOK_TITLE or "Documentation" | Book title in navigation |
[book] | authors | $BOOK_AUTHORS or $REPO_OWNER | Author attribution |
[book] | language | "en" (hardcoded) | Content language |
[book] | src | "src" (hardcoded) | Source directory path |
[output.html] | default-theme | "rust" (hardcoded) | Visual theme |
[output.html] | git-repository-url | $GIT_REPO_URL | "Edit" link target |
[preprocessor.mermaid] | command | "mdbook-mermaid" | Preprocessor binary |
[output.html.fold] | enable | true | Enable section folding |
[output.html.fold] | level | 1 | Fold depth level |
Sources: build-docs.sh:84-103 build-docs.sh:39-45
Table of Contents Generation (SUMMARY.md)
The system automatically generates SUMMARY.md from the scraped file structure, discovering the hierarchy from filename patterns and directory organization. This logic is implemented in build-docs.sh:108-159
Generation Algorithm
File Structure Detection
The algorithm detects the hierarchical structure using these conventions:
| Pattern | Interpretation | Example |
|---|---|---|
N-title.md | Main section N | 5-component-reference.md |
section-N/ | Directory for section N subsections | section-5/ |
N-M-title.md in section-N/ | Subsection M of section N | section-5/5-1-build-docs-sh.md |
The algorithm extracts the section number from the filename using grep -oE '^[0-9]+' and checks for the existence of a corresponding section-N directory. If found, it writes the main section as a header followed by indented subsections.
Sources: build-docs.sh:108-159 build-docs.sh:117-123 build-docs.sh:126-158
Build Process Orchestration
The build process follows a specific sequence of operations coordinated by build-docs.sh:
Sources: build-docs.sh:78-191
sequenceDiagram
participant Script as build-docs.sh
participant FS as File System
participant MdBook as mdbook binary
participant Mermaid as mdbook-mermaid binary
Note over Script: Step 2: Initialize\n(lines 78-82)
Script->>FS: mkdir -p $BOOK_DIR
Script->>Script: cd $BOOK_DIR
Note over Script: Generate Config\n(lines 84-103)
Script->>FS: Write book.toml
Script->>FS: mkdir -p src
Note over Script: Step 3: Generate TOC\n(lines 108-159)
Script->>FS: Read $WIKI_DIR/*.md files
Script->>FS: Read section-*/*.md files
Script->>FS: Write src/SUMMARY.md
Note over Script: Step 4: Copy Files\n(lines 164-166)
Script->>FS: cp -r $WIKI_DIR/* src/
Note over Script: Step 5: Install Assets\n(lines 169-171)
Script->>Mermaid: mdbook-mermaid install $BOOK_DIR
Mermaid->>FS: Install mermaid.min.js
Mermaid->>FS: Install mermaid-init.js
Mermaid->>FS: Update book.toml
Note over Script: Step 6: Build\n(lines 174-176)
Script->>MdBook: mdbook build
MdBook->>FS: Read book.toml
MdBook->>FS: Read src/SUMMARY.md
MdBook->>FS: Read src/*.md files
MdBook->>Mermaid: Preprocess (mermaid blocks)
Mermaid-->>MdBook: Transformed Markdown
MdBook->>FS: Write book/ directory
Note over Script: Step 7: Copy Outputs\n(lines 179-191)
Script->>FS: cp -r book $OUTPUT_DIR/
Script->>FS: cp book.toml $OUTPUT_DIR/
Script->>FS: cp -r $WIKI_DIR/* $OUTPUT_DIR/markdown/
mdbook-mermaid Integration
The system uses the mdbook-mermaid preprocessor to enable Mermaid diagram rendering in the final HTML output. This integration involves three steps:
Installation and Configuration
The mdbook-mermaid binary is compiled during the Docker build stage:
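In the builder stage this is a single cargo install line; a sketch consistent with Dockerfile:1-5:

```dockerfile
RUN cargo install mdbook-mermaid
```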
Preprocessor Configuration
The preprocessor is configured in the generated book.toml:
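The relevant fragment of the generated book.toml, matching the configuration table above:

```toml
[preprocessor.mermaid]
command = "mdbook-mermaid"
```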
This configuration tells mdBook to run mdbook-mermaid as a preprocessor before rendering HTML. The preprocessor scans all Markdown files for code blocks with the mermaid language tag and transforms them into HTML containers that the Mermaid JavaScript library can render.
Asset Installation
The mdbook-mermaid install command (executed at build-docs.sh171) installs required JavaScript and CSS assets:
| Asset | Purpose |
|---|---|
mermaid.min.js | Mermaid diagram rendering library |
mermaid-init.js | Initialization script for Mermaid |
| Additional CSS | Styling for diagram containers |
These assets are installed into the book's theme directory and are automatically included in all generated HTML pages.
Sources: Dockerfile5 Dockerfile21 build-docs.sh:169-171 build-docs.sh:97-98
Output Structure
After the mdBook build completes, the system produces three output artifacts:
HTML Site Features
The generated HTML site includes:
| Feature | Description | Enabled By |
|---|---|---|
| Navigation sidebar | Left sidebar with TOC | Generated SUMMARY.md |
| Search functionality | Full-text search | mdBook default feature |
| Responsive design | Mobile-friendly layout | Rust theme |
| Mermaid diagrams | Interactive diagrams | mdbook-mermaid preprocessor |
| Edit links | Link to GitHub source | git-repository-url config |
| Section folding | Collapsible sections | [output.html.fold] config |
Sources: build-docs.sh:179-191 README.md:93-104
Binary Requirements
The integration requires two Rust binaries to be available at runtime:
| Binary | Installation Location | Required For |
|---|---|---|
mdbook | /usr/local/bin/mdbook | HTML generation |
mdbook-mermaid | /usr/local/bin/mdbook-mermaid | Diagram preprocessing |
Both binaries are compiled during the Docker multi-stage build and copied to the final image. The compilation occurs in Dockerfile:1-5 using cargo install, and the binaries are extracted in Dockerfile:19-21 from the build stage.
Sources: Dockerfile:1-21 build-docs.sh171 build-docs.sh176
Phase 1: Markdown Extraction
Relevant source files
This page documents Phase 1 of the three-phase processing pipeline, which handles the extraction and initial conversion of wiki content from DeepWiki.com into clean Markdown files. Phase 1 produces raw Markdown files in a temporary directory before diagram enhancement (Phase 2, see #7) and mdBook HTML generation (Phase 3, see #8).
For detailed information about specific sub-processes within Phase 1, see Wiki Structure Discovery and HTML to Markdown Conversion.
Scope and Objectives
Phase 1 accomplishes the following:
- Discover all wiki pages and their hierarchical structure from DeepWiki
- Fetch HTML content for each page via HTTP requests
- Parse HTML to extract main content and remove UI elements
- Convert cleaned HTML to Markdown using `html2text`
- Organize output files into a hierarchical directory structure
- Save to a temporary directory for subsequent processing
This phase is implemented entirely in Python within deepwiki-scraper.py and operates independently of Phases 2 and 3.
Sources: README.md:121-128 tools/deepwiki-scraper.py:790-876
Phase 1 Execution Flow
The following diagram shows the complete execution flow of Phase 1, mapping high-level steps to specific functions in the codebase:
Sources: tools/deepwiki-scraper.py:790-876 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594
flowchart TD
Start["main()
Entry Point"]
CreateTemp["Create tempfile.TemporaryDirectory()"]
CreateSession["requests.Session()
with User-Agent"]
DiscoverPhase["Structure Discovery Phase"]
ExtractWiki["extract_wiki_structure(repo, session)"]
ParseLinks["BeautifulSoup: find_all('a', href=pattern)"]
SortPages["sort by page number (handle dots)"]
ExtractionPhase["Content Extraction Phase"]
LoopPages["For each page in pages list"]
FetchContent["extract_page_content(url, session, page_info)"]
FetchHTML["fetch_page(url, session)
with retries"]
ParseHTML["BeautifulSoup(response.text)"]
RemoveNav["Remove nav/header/footer/aside elements"]
FindContent["Find main content: article/main/[role='main']"]
ConvertPhase["Conversion Phase"]
ConvertMD["convert_html_to_markdown(html_content)"]
HTML2Text["html2text.HTML2Text with body_width=0"]
CleanFooter["clean_deepwiki_footer(markdown)"]
FixLinks["Regex replace: wiki links → .md paths"]
SavePhase["File Organization Phase"]
DetermineLevel{"page['level'] == 0?"}
SaveRoot["Save to temp_dir/NUM-title.md"]
CreateSubdir["Create temp_dir/section-N/"]
SaveSubdir["Save to section-N/NUM-title.md"]
NextPage{"More pages?"}
Complete["Phase 1 Complete: temp_dir contains all .md files"]
Start --> CreateTemp
CreateTemp --> CreateSession
CreateSession --> DiscoverPhase
DiscoverPhase --> ExtractWiki
ExtractWiki --> ParseLinks
ParseLinks --> SortPages
SortPages --> ExtractionPhase
ExtractionPhase --> LoopPages
LoopPages --> FetchContent
FetchContent --> FetchHTML
FetchHTML --> ParseHTML
ParseHTML --> RemoveNav
RemoveNav --> FindContent
FindContent --> ConvertPhase
ConvertPhase --> ConvertMD
ConvertMD --> HTML2Text
HTML2Text --> CleanFooter
CleanFooter --> FixLinks
FixLinks --> SavePhase
SavePhase --> DetermineLevel
DetermineLevel -->|Yes: Main Page| SaveRoot
DetermineLevel -->|No: Subsection| CreateSubdir
CreateSubdir --> SaveSubdir
SaveRoot --> NextPage
SaveSubdir --> NextPage
NextPage -->|Yes| LoopPages
NextPage -->|No| Complete
Core Components and Data Flow
Structure Discovery Pipeline
The structure discovery process identifies all wiki pages and builds a hierarchical page list:
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:90-116 tools/deepwiki-scraper.py:118-123
flowchart LR
subgraph Input
BaseURL["Base URL\ndeepwiki.com/owner/repo"]
end
subgraph extract_wiki_structure
FetchMain["fetch_page(base_url)"]
ParseSoup["BeautifulSoup(response.text)"]
FindLinks["soup.find_all('a', href=regex)"]
ExtractInfo["Extract page number & title\nRegex: /(\d+(?:\.\d+)*)-(.+)$"]
CalcLevel["Calculate level from dots\nlevel = page_num.count('.')"]
BuildPages["Build pages list with metadata"]
SortFunc["Sort by sort_key(page)\nparts = [int(x)
for x in num.split('.')]"]
end
subgraph Output
PagesList["List[Dict]\n{'number': '2.1',\n'title': 'Section',\n'url': full_url,\n'href': path,\n'level': 1}"]
end
BaseURL --> FetchMain
FetchMain --> ParseSoup
ParseSoup --> FindLinks
FindLinks --> ExtractInfo
ExtractInfo --> CalcLevel
CalcLevel --> BuildPages
BuildPages --> SortFunc
SortFunc --> PagesList
Content Extraction and Cleaning
Each page undergoes a multi-step cleaning process to remove DeepWiki UI elements:
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-190 tools/deepwiki-scraper.py:127-173
flowchart TD
subgraph fetch_page
MakeRequest["requests.get(url, headers)\nUser-Agent: Mozilla/5.0..."]
RetryLogic["Retry up to 3 times\n2 second delay between attempts"]
CheckStatus["response.raise_for_status()"]
end
subgraph extract_page_content
ParsePage["BeautifulSoup(response.text)"]
RemoveUnwanted["Decompose: nav, header, footer,\naside, .sidebar, script, style"]
FindMain["Try selectors in order:\narticle → main → .wiki-content\n→ [role='main'] → body"]
RemoveUI["Remove DeepWiki UI elements:\n'Edit Wiki', 'Last indexed:',\n'Index your code with Devin'"]
RemoveNavLists["Remove navigation <ul> lists\n(80%+ internal wiki links)"]
end
subgraph convert_html_to_markdown
HTML2TextInit["h = html2text.HTML2Text()\nh.ignore_links = False\nh.body_width = 0"]
HandleContent["markdown = h.handle(html_content)"]
CleanFooterCall["clean_deepwiki_footer(markdown)"]
end
subgraph clean_deepwiki_footer
SplitLines["lines = markdown.split('\\n')"]
ScanBackward["Scan last 50 lines backward\nfor footer patterns"]
MatchPatterns["Regex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
TruncateLines["lines = lines[:footer_start]"]
RemoveEmpty["Remove trailing empty lines"]
end
MakeRequest --> RetryLogic
RetryLogic --> CheckStatus
CheckStatus --> ParsePage
ParsePage --> RemoveUnwanted
RemoveUnwanted --> FindMain
FindMain --> RemoveUI
RemoveUI --> RemoveNavLists
RemoveNavLists --> HTML2TextInit
HTML2TextInit --> HandleContent
HandleContent --> CleanFooterCall
CleanFooterCall --> SplitLines
SplitLines --> ScanBackward
ScanBackward --> MatchPatterns
MatchPatterns --> TruncateLines
TruncateLines --> RemoveEmpty
Link Rewriting Logic
Phase 1 transforms internal DeepWiki links into relative Markdown file paths. The rewriting logic accounts for hierarchical directory structure:
Sources: tools/deepwiki-scraper.py:549-592
flowchart TD
subgraph Input
WikiLink["DeepWiki Link\n[text](/owner/repo/2-1-section)"]
SourcePage["Current Page Info\n{level: 1, number: '2.1'}"]
end
subgraph fix_wiki_link
ExtractPath["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
ParseNumbers["Extract: page_num='2.1', slug='section'"]
ConvertNum["file_num = page_num.replace('.', '-')\nResult: '2-1'"]
CheckTarget{"Is target\nsubsection?\n(has '.')"}
CheckSource{"Is source\nsubsection?\n(level > 0)"}
CheckSame{"Same main\nsection?"}
PathSameSection["Relative path:\nfile_num-slug.md"]
PathDiffSection["Full path:\nsection-N/file_num-slug.md"]
PathToMain["Up one level:\n../file_num-slug.md"]
PathMainToMain["Same level:\nfile_num-slug.md"]
end
subgraph Output
MDLink["Markdown Link\n[text](2-1-section.md)\nor [text](section-2/2-1-section.md)\nor [text](../2-1-section.md)"]
end
WikiLink --> ExtractPath
ExtractPath --> ParseNumbers
ParseNumbers --> ConvertNum
ConvertNum --> CheckTarget
CheckTarget -->|Yes| CheckSource
CheckTarget -->|No: Main Page| CheckSource
CheckSource -->|Target: Sub, Source: Sub| CheckSame
CheckSource -->|Target: Sub, Source: Main| PathDiffSection
CheckSource -->|Target: Main, Source: Sub| PathToMain
CheckSource -->|Target: Main, Source: Main| PathMainToMain
CheckSame -->|Yes| PathSameSection
CheckSame -->|No| PathDiffSection
PathSameSection --> MDLink
PathDiffSection --> MDLink
PathToMain --> MDLink
PathMainToMain --> MDLink
File Organization Strategy
Phase 1 organizes output files into a hierarchical directory structure based on page levels:
Directory Structure Rules
| Page Level | Page Number Format | Directory Location | Filename Pattern | Example |
|---|---|---|---|---|
| 0 (Main) | 1, 2, 3 | temp_dir/ (root) | {num}-{slug}.md | 1-overview.md |
| 1 (Subsection) | 2.1, 3.4 | temp_dir/section-{N}/ | {num}-{slug}.md | section-2/2-1-workspace.md |
File Organization Implementation
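The exact implementation is not reproduced here; the following minimal sketch applies the directory rules from the table above, assuming a `page` dict with `number`, `title`, and `level` keys (the slug derivation shown is illustrative):

```python
from pathlib import Path
import re

def output_path_for(temp_dir: Path, page: dict) -> Path:
    """Map a page dict to its output location per the table above (sketch)."""
    # Build a filename such as "2-1-workspace.md" from the page number and title.
    file_num = page["number"].replace(".", "-")
    slug = re.sub(r"[^a-z0-9]+", "-", page["title"].lower()).strip("-")
    filename = f"{file_num}-{slug}.md"

    if page["level"] == 0:
        return temp_dir / filename                    # main page -> temp_dir/
    main_section = page["number"].split(".")[0]
    subdir = temp_dir / f"section-{main_section}"     # subsection -> section-N/
    subdir.mkdir(parents=True, exist_ok=True)
    return subdir / filename

# Example: a level-1 page numbered 2.1 lands in temp_dir/section-2/2-1-<slug>.md
```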
Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:845-868
HTTP Session Configuration
Phase 1 uses a persistent requests.Session with browser-like headers and retry logic:
Session Setup
Retry Strategy
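A minimal sketch of the session setup and retry behaviour described here, assuming a fixed 2-second delay between attempts (the header value and timeout are illustrative):

```python
import time
import requests

# Persistent session with a browser-like User-Agent, reused across all requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; docs-scraper)"})

def fetch_page(url: str, session: requests.Session, retries: int = 3, delay: float = 2.0):
    """Fetch a page, retrying up to `retries` times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries:
                raise                 # final attempt failed: propagate the error
            time.sleep(delay)         # wait before retrying
```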
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:817-821
Data Structures
Page Metadata Dictionary
Each page discovered by extract_wiki_structure() is represented as a dictionary:
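Based on the schema shown in the structure-discovery diagram, each entry looks roughly like this (the values are illustrative):

```python
page = {
    "number": "2.1",                                  # hierarchical page number
    "title": "Workspace and Crates",                  # human-readable title
    "url": "https://deepwiki.com/owner/repo/2-1-workspace-and-crates",
    "href": "/owner/repo/2-1-workspace-and-crates",   # path portion of the link
    "level": 1,                                       # number of dots in "number"
}
```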
Sources: tools/deepwiki-scraper.py:109-115
BeautifulSoup Content Selectors
Phase 1 attempts multiple selector strategies to find main content, in priority order:
| Priority | Selector Type | Selector Value | Rationale |
|---|---|---|---|
| 1 | CSS Selector | article | Semantic HTML5 element for main content |
| 2 | CSS Selector | main | HTML5 main landmark element |
| 3 | CSS Selector | .wiki-content | Common class name for wiki content |
| 4 | CSS Selector | .content | Generic content class |
| 5 | CSS Selector | #content | Generic content ID |
| 6 | CSS Selector | .markdown-body | GitHub-style markdown container |
| 7 | Attribute | role="main" | ARIA landmark role |
| 8 | Fallback | body | Last resort: entire body |
Sources: tools/deepwiki-scraper.py:472-484
Error Handling and Robustness
Page Extraction Error Handling
Phase 1 implements graceful degradation for individual page failures:
Sources: tools/deepwiki-scraper.py:841-876
Content Extraction Fallbacks
If primary content selectors fail, Phase 1 applies fallback strategies:
- Content Selector Fallback Chain : Try 8 different selectors (see table above)
- Empty Content Check : Raises exception if no content element found tools/deepwiki-scraper.py:486-487
- HTTP Retry Logic : 3 attempts with exponential backoff
- Session Persistence : Reuses TCP connections for efficiency
Sources: tools/deepwiki-scraper.py:472-487 tools/deepwiki-scraper.py:27-42
Output Format
Temporary Directory Structure
At the end of Phase 1, the temporary directory contains the following structure:
temp_dir/
├── 1-overview.md # Main page (level 0)
├── 2-architecture.md # Main page (level 0)
├── 3-components.md # Main page (level 0)
├── section-2/ # Subsections of page 2
│ ├── 2-1-workspace-and-crates.md # Subsection (level 1)
│ └── 2-2-dependency-graph.md # Subsection (level 1)
└── section-4/ # Subsections of page 4
├── 4-1-logical-planning.md
└── 4-2-physical-planning.md
Markdown File Format
Each generated Markdown file has the following characteristics:
- Title : Always starts with a `# {Page Title}` heading
- Content : Cleaned HTML converted to Markdown via `html2text`
- Links : Internal wiki links rewritten to relative `.md` paths
- No Diagrams : Diagrams are added in Phase 2 (see #7)
- No Footer : DeepWiki UI elements removed via `clean_deepwiki_footer()`
- Encoding : UTF-8
Sources: tools/deepwiki-scraper.py:862-868 tools/deepwiki-scraper.py:127-173
Phase 1 Completion Criteria
Phase 1 is considered complete when:
- All pages discovered by `extract_wiki_structure()` have been processed
- Each page's Markdown file has been written to the temporary directory
- Directory structure (main pages + `section-N/` subdirectories) has been created
- Success count is reported: `"✓ Successfully extracted N/M pages to temp directory"`
The temporary directory is then passed to Phase 2 for diagram enhancement.
Sources: tools/deepwiki-scraper.py:877 tools/deepwiki-scraper.py:596-788
Wiki Structure Discovery
Relevant source files
Purpose and Scope
This document describes the wiki structure discovery mechanism in Phase 1 of the processing pipeline. The system analyzes the main DeepWiki repository page to identify all available wiki pages and their hierarchical relationships. This discovery phase produces a structured page list that drives subsequent content extraction.
For the HTML-to-Markdown conversion that follows discovery, see HTML to Markdown Conversion. For the overall Phase 1 process, see Phase 1: Markdown Extraction.
Overview
The discovery process fetches the main wiki page for a repository and parses its HTML to extract all wiki page references. The system identifies both main pages (e.g., 1, 2, 3) and subsections (e.g., 2.1, 2.2, 3.1) by analyzing link patterns. The output is a sorted list of page metadata that includes page numbers, titles, URLs, and hierarchical levels.
flowchart TD
Start["main()
entry point"] --> ValidateRepo["Validate repo format\n(owner/repo)"]
ValidateRepo --> CreateSession["Create requests.Session\nwith User-Agent headers"]
CreateSession --> CallExtract["extract_wiki_structure(repo, session)"]
CallExtract --> FetchMain["Fetch https://deepwiki.com/{repo}"]
FetchMain --> ParseHTML["BeautifulSoup(response.text)"]
ParseHTML --> FindLinks["soup.find_all('a', href=regex)"]
FindLinks --> IterateLinks["Iterate over all links"]
IterateLinks --> ExtractPattern["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
ExtractPattern --> BuildPageDict["Build page dict:\n{number, title, url, href, level}"]
BuildPageDict --> CheckDupe{"href in seen_urls?"}
CheckDupe -->|Yes| IterateLinks
CheckDupe -->|No| AddToList["pages.append(page_dict)"]
AddToList --> IterateLinks
IterateLinks -->|Done| SortPages["Sort by numeric parts:\nsort_key([int(x) for x in num.split('.')])"]
SortPages --> ReturnPages["Return pages list"]
ReturnPages --> ProcessPages["Process each page\nin main loop"]
style CallExtract fill:#f9f,stroke:#333,stroke-width:2px
style ExtractPattern fill:#f9f,stroke:#333,stroke-width:2px
style SortPages fill:#f9f,stroke:#333,stroke-width:2px
Discovery Flow Diagram
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831
Main Discovery Function
The extract_wiki_structure function performs the core discovery logic. It accepts a repository identifier (e.g., "jzombie/deepwiki-to-mdbook") and an HTTP session object, then returns a list of page dictionaries.
Function Signature and Entry Point
Sources: tools/deepwiki-scraper.py:78-79
HTTP Request and HTML Parsing
The function constructs the base URL and fetches the main wiki page:
The fetch_page helper includes retry logic (3 attempts) and browser-like headers to handle transient failures.
Sources: tools/deepwiki-scraper.py:80-84 tools/deepwiki-scraper.py:27-42
Link Pattern Matching
Regex-Based Link Discovery
The system uses a compiled regex pattern to find all wiki page links:
This pattern matches URLs like:
- `/jzombie/deepwiki-to-mdbook/1-overview`
- `/jzombie/deepwiki-to-mdbook/2-quick-start`
- `/jzombie/deepwiki-to-mdbook/2-1-basic-usage`
Sources: tools/deepwiki-scraper.py:88-90
Page Information Extraction
For each matched link, the system extracts page metadata using a detailed regex pattern:
The regex r'/(\d+(?:\.\d+)*)-(.+)$' captures:
- Group 1: Page number with optional dots (e.g., `1`, `2.1`, `3.2.1`)
- Group 2: URL slug (e.g., `overview`, `basic-usage`)
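A minimal sketch of this extraction, using the regex quoted above (the exact href format handled by the real code may differ, so the helper name and return shape here are illustrative):

```python
import re

# Pattern quoted above: page number with optional dot-separated parts, then slug.
PAGE_RE = re.compile(r'/(\d+(?:\.\d+)*)-(.+)$')

def parse_page_href(href: str):
    """Extract page number, slug, and hierarchy level from a wiki href (sketch)."""
    match = PAGE_RE.search(href)
    if not match:
        return None
    page_num, slug = match.group(1), match.group(2)
    # Level is the number of dots: "1" -> 0, "2.1" -> 1, "3.2.1" -> 2.
    return {"number": page_num, "slug": slug, "level": page_num.count(".")}
```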
Sources: tools/deepwiki-scraper.py:98-107
Link Extraction Data Flow
Sources: tools/deepwiki-scraper.py:98-115
Deduplication and Sorting
Deduplication Strategy
The system maintains a seen_urls set to prevent duplicate page entries:
Sources: tools/deepwiki-scraper.py:92-116
Hierarchical Sorting
Pages are sorted by their numeric components to maintain proper ordering:
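A sketch of the numeric sort key described here (variable names are illustrative):

```python
def sort_key(page: dict):
    """Split '2.1' into [2, 1] so pages sort numerically, not lexically."""
    return [int(part) for part in page["number"].split(".")]

pages = [{"number": n} for n in ["3", "2.1", "1", "2", "2.2"]]
pages.sort(key=sort_key)
print([p["number"] for p in pages])   # ['1', '2', '2.1', '2.2', '3']
```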
This ensures ordering like: 1 → 2 → 2.1 → 2.2 → 3 → 3.1
Sources: tools/deepwiki-scraper.py:118-123
Sorting Example
| Before Sorting (Link Order) | Page Number | After Sorting (Numeric Order) |
|---|---|---|
| `/3-phase-3` | 3 | `/1-overview` |
| `/2-1-subsection-one` | 2.1 | `/2-quick-start` |
| `/1-overview` | 1 | `/2-1-subsection-one` |
| `/2-quick-start` | 2 | `/2-2-subsection-two` |
| `/2-2-subsection-two` | 2.2 | `/3-phase-3` |
Page Data Structure
Page Dictionary Schema
Each discovered page is represented as a dictionary:
Sources: tools/deepwiki-scraper.py:109-115
Level Calculation
The level field indicates hierarchical depth:
| Page Number | Level | Type |
|---|---|---|
| `1` | 0 | Main page |
| `2` | 0 | Main page |
| `2.1` | 1 | Subsection |
| `2.2` | 1 | Subsection |
| `3.1.1` | 2 | Sub-subsection |
Sources: tools/deepwiki-scraper.py:106-114
Discovery Result Processing
Output Statistics
After discovery, the system categorizes pages and reports statistics:
Sources: tools/deepwiki-scraper.py:824-837
Integration with Content Extraction
The discovered page list drives the extraction loop in main():
Sources: tools/deepwiki-scraper.py:841-860
Alternative Discovery Method (Unused)
Subsection Probing Function
The codebase includes a discover_subsections function that uses HTTP HEAD requests to probe for subsections, but this function is not invoked in the current implementation:
This function attempts to discover subsections by making HEAD requests to potential URLs (e.g., /repo/2-1-, /repo/2-2-). However, the actual implementation relies entirely on parsing links from the main wiki page.
Sources: tools/deepwiki-scraper.py:44-76
Discovery Method Comparison
Sources: tools/deepwiki-scraper.py:44-76 tools/deepwiki-scraper.py:78-125
Error Handling
No Pages Found
The system validates that at least one page was discovered:
Sources: tools/deepwiki-scraper.py:828-830
Network Failures
The fetch_page function includes retry logic:
Sources: tools/deepwiki-scraper.py:33-42
Summary
The wiki structure discovery process provides a robust mechanism for identifying all pages in a DeepWiki repository through a single HTML parse operation. The resulting page list is hierarchically organized and drives all subsequent extraction operations in Phase 1.
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831
HTML to Markdown Conversion
Relevant source files
This document describes the HTML parsing and Markdown conversion process that transforms DeepWiki's HTML pages into clean, portable Markdown files. This is a core component of Phase 1 (Markdown Extraction) in the three-phase pipeline.
For information about the diagram enhancement that occurs after this conversion, see Phase 2: Diagram Enhancement. For details on how the wiki structure is discovered before this conversion begins, see Wiki Structure Discovery.
Purpose and Scope
The HTML to Markdown conversion process takes raw HTML fetched from DeepWiki.com and transforms it into clean Markdown files suitable for processing by mdBook. This conversion must handle several challenges:
- Extract only content, removing DeepWiki's UI elements and navigation
- Preserve the semantic structure (headings, lists, code blocks)
- Convert internal wiki links to relative Markdown file paths
- Remove DeepWiki-specific footer content
- Handle hierarchical link relationships between main pages and subsections
Conversion Pipeline Overview
Conversion Flow: HTML to Clean Markdown
Sources: tools/deepwiki-scraper.py:453-594
HTML Parsing and Content Extraction
BeautifulSoup Content Location Strategy
The system uses a multi-strategy approach to locate the main content area, trying selectors in order of specificity:
Content Locator Strategies
flowchart LR
Start["extract_page_content()"]
Strat1["Try CSS Selectors\narticle, main, .wiki-content"]
Strat2["Try Role Attribute\nrole='main'"]
Strat3["Fallback: body Element"]
Success["Content Found"]
Error["Raise Exception"]
Start --> Strat1
Strat1 -->|Found| Success
Strat1 -->|Not Found| Strat2
Strat2 -->|Found| Success
Strat2 -->|Not Found| Strat3
Strat3 -->|Found| Success
Strat3 -->|Not Found| Error
The system attempts these selectors in sequence:
| Priority | Selector Type | Selector Value | Purpose |
|---|---|---|---|
| 1 | CSS | article, main, .wiki-content, .content, #content, .markdown-body | Semantic HTML5 content containers |
| 2 | Attribute | role="main" | ARIA landmark for main content |
| 3 | Fallback | body | Last resort - entire body element |
Sources: tools/deepwiki-scraper.py:469-487
UI Element Removal
The conversion process removes several categories of unwanted elements before processing:
Structural Element Removal
The following element types are removed wholesale using elem.decompose():
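A minimal sketch of this removal step with BeautifulSoup, assuming the tag names and selector described in this section (the exact list in the source may differ):

```python
from bs4 import BeautifulSoup

def strip_structural_elements(soup: BeautifulSoup) -> None:
    """Remove page chrome before content extraction (sketch)."""
    for tag in soup.find_all(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()               # delete the element and its children
    for elem in soup.select(".sidebar"):
        elem.decompose()
```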
Text-Based UI Element Removal
DeepWiki-specific UI elements are identified by text content patterns:
| Pattern | Purpose | Max Length Filter |
|---|---|---|
| `Index your code with Devin` | AI indexing prompt | < 200 chars |
| `Edit Wiki` | Edit button | < 200 chars |
| `Last indexed:` | Metadata display | < 200 chars |
| `View this search on DeepWiki` | Search link | < 200 chars |
The length filter prevents accidental removal of paragraph content that happens to contain these phrases.
Sources: tools/deepwiki-scraper.py:466-500
Navigation List Detection
The system automatically detects and removes navigation lists using heuristics:
Navigation List Detection Algorithm
flowchart TD
FindUL["Find all <ul> elements"]
CountLinks["Count <a> tags"]
Check5["links.length > 5?"]
CountInternal["Count internal links\nhref starts with '/'"]
Check80["wiki_links > 80% of links?"]
Remove["ul.decompose()"]
Keep["Keep element"]
FindUL --> CountLinks
CountLinks --> Check5
Check5 -->|Yes| CountInternal
Check5 -->|No| Keep
CountInternal --> Check80
Check80 -->|Yes| Remove
Check80 -->|No| Keep
This heuristic successfully identifies table of contents lists and navigation menus while preserving legitimate bulleted lists in the content.
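A sketch of the heuristic, using the thresholds from the flowchart above (function name is illustrative):

```python
from bs4 import BeautifulSoup

def remove_navigation_lists(soup: BeautifulSoup) -> None:
    """Drop <ul> elements that look like TOC/navigation rather than content."""
    for ul in soup.find_all("ul"):
        links = ul.find_all("a")
        if len(links) <= 5:
            continue                                    # short lists are kept
        internal = [a for a in links if (a.get("href") or "").startswith("/")]
        if len(internal) / len(links) > 0.8:            # mostly internal wiki links
            ul.decompose()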
Sources: tools/deepwiki-scraper.py:502-511
html2text Conversion Configuration
The core conversion uses the html2text library with specific configuration to ensure clean output:
html2text Configuration
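A minimal sketch of the converter setup, using the two settings discussed in the table below (the wrapper name is illustrative):

```python
import html2text

def convert_html_to_markdown_sketch(html_content: str) -> str:
    """Configure html2text as described below and convert one page."""
    h = html2text.HTML2Text()
    h.ignore_links = False   # keep links so they can be rewritten later
    h.body_width = 0         # no line wrapping; Phase 2 matching depends on this
    return h.handle(html_content)
```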
Key Configuration Decisions
| Setting | Value | Rationale |
|---|---|---|
| `ignore_links` | `False` | Links must be preserved so they can be rewritten to relative paths |
| `body_width` | `0` | Disables line wrapping, which would interfere with diagram matching in Phase 2 |
The body_width=0 setting is particularly important because Phase 2's fuzzy matching algorithm compares text chunks from the JavaScript payload with the converted Markdown. Line wrapping would cause mismatches.
Sources: tools/deepwiki-scraper.py:175-190
DeepWiki Footer Cleaning
After html2text conversion, the system removes DeepWiki-specific footer content using pattern matching.
Footer Detection Patterns
The clean_deepwiki_footer() function uses compiled regex patterns to identify footer content:
Footer Pattern Table
| Pattern | Example Match | Purpose |
|---|---|---|
| `^\s*Dismiss\s*$` | "Dismiss" | Modal dismiss button |
| `Refresh this wiki` | "Refresh this wiki" | Refresh action link |
| `This wiki was recently refreshed` | Full phrase | Status message |
| `###\s*On this page` | "### On this page" | TOC heading |
| `Please wait \d+ days? to refresh` | "Please wait 7 days" | Rate limit message |
| `You can refresh again in` | Full phrase | Alternative rate limit |
| `^\s*View this search on DeepWiki` | Full phrase | Search link |
| `^\s*Edit Wiki\s*$` | "Edit Wiki" | Edit action |
Footer Scanning Algorithm
The backward scan ensures the earliest footer indicator is found, preventing content loss if footer elements are scattered.
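A sketch of the backward scan, assuming a simplified pattern list (the full pattern set is shown in the table above):

```python
import re

FOOTER_PATTERNS = [
    re.compile(r"^\s*Dismiss\s*$", re.IGNORECASE),
    re.compile(r"Refresh this wiki", re.IGNORECASE),
    re.compile(r"^#+\s*On this page", re.IGNORECASE),
    re.compile(r"^\s*Edit Wiki\s*$", re.IGNORECASE),
]

def clean_deepwiki_footer_sketch(markdown: str) -> str:
    """Truncate the document at the earliest footer marker in the last 50 lines."""
    lines = markdown.split("\n")
    footer_start = len(lines)
    # Walk the tail backwards so the earliest (lowest-index) marker wins.
    for i in range(len(lines) - 1, max(len(lines) - 50, 0) - 1, -1):
        if any(p.search(lines[i]) for p in FOOTER_PATTERNS):
            footer_start = i
    lines = lines[:footer_start]
    while lines and not lines[-1].strip():
        lines.pop()                       # drop trailing empty lines
    return "\n".join(lines)
```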
Sources: tools/deepwiki-scraper.py:127-173
Internal Link Rewriting
The most complex part of the conversion is rewriting internal wiki links to relative Markdown file paths. Links must account for the hierarchical directory structure where subsections are placed in subdirectories.
Link Rewriting Logic
The fix_wiki_link() function handles four distinct cases based on source and target locations:
Link Rewriting Decision Matrix
| Source Location | Target Location | Relative Path Format | Example |
|---|---|---|---|
| Main page | Main page | {file_num}-{slug}.md | 2-overview.md |
| Main page | Subsection | section-{main}/{file_num}-{slug}.md | section-2/2-1-details.md |
| Subsection | Same section subsection | {file_num}-{slug}.md | 2-2-more.md |
| Subsection | Main page | ../{file_num}-{slug}.md | ../3-next.md |
| Subsection | Different section | ../section-{main}/{file_num}-{slug}.md | ../section-3/3-1-sub.md |
Link Rewriting Flow
Link Pattern Matching
The link rewriting uses regex substitution on Markdown link syntax:
The regex captures only the page-slug portion after the repository path, which is then processed by fix_wiki_link().
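A sketch of the substitution step, with `fix_wiki_link` left as a placeholder callback (the real path logic implements the decision matrix above; the pattern shown here is an approximation):

```python
import re

def rewrite_internal_links(markdown: str, repo: str, fix_wiki_link) -> str:
    """Rewrite [text](/owner/repo/N-slug) links via fix_wiki_link (sketch)."""
    # Capture only the page-slug portion after the repository path.
    pattern = re.compile(r"\]\(/%s/([^)]+)\)" % re.escape(repo))
    return pattern.sub(lambda m: "](%s)" % fix_wiki_link(m.group(1)), markdown)

# Usage: rewrite_internal_links(md_text, "owner/repo", my_fix_wiki_link)
```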
Sources: tools/deepwiki-scraper.py:547-592
Post-Conversion Cleanup
After all conversions and transformations, final cleanup removes artifacts:
Duplicate Title Removal
Duplicate Title Detection
This cleanup handles cases where DeepWiki includes the page title multiple times in the rendered HTML.
Sources: tools/deepwiki-scraper.py:525-545
flowchart TD
Start["extract_page_content(url, session, current_page_info)"]
Fetch["fetch_page(url, session)\nHTTP GET with retries"]
Parse["BeautifulSoup(response.text)"]
RemoveNav["Remove nav/header/footer"]
FindContent["Locate main content area"]
RemoveUI["Remove DeepWiki UI elements"]
RemoveLists["Remove navigation lists"]
ToStr["str(content)"]
Convert["convert_html_to_markdown(html)"]
CleanUp["Remove duplicate titles/Menu"]
FixLinks["Rewrite internal links\nusing current_page_info"]
Return["Return markdown string"]
Start --> Fetch
Fetch --> Parse
Parse --> RemoveNav
RemoveNav --> FindContent
FindContent --> RemoveUI
RemoveUI --> RemoveLists
RemoveLists --> ToStr
ToStr --> Convert
Convert --> CleanUp
CleanUp --> FixLinks
FixLinks --> Return
Integration with Extract Page Content
The complete content extraction flow shows how all components work together:
extract_page_content() Complete Flow
The current_page_info parameter provides context about the source page's location in the hierarchy, which is essential for generating correct relative link paths.
Sources: tools/deepwiki-scraper.py:453-594
Error Handling and Retries
HTTP Fetch with Retries
The fetch_page() function implements exponential backoff for failed requests:
| Attempt | Action | Delay |
|---|---|---|
| 1 | Try request | None |
| 2 | Retry after error | 2 seconds |
| 3 | Final retry | 2 seconds |
| Fail | Raise exception | N/A |
Browser-like headers are used to avoid being blocked:
Sources: tools/deepwiki-scraper.py:27-42
Rate Limiting
To be respectful to the DeepWiki server, the main extraction loop includes a 1-second delay between page requests:
This appears in the main loop after each successful page extraction.
Sources: tools/deepwiki-scraper.py:872
Dependencies
The HTML to Markdown conversion relies on three key Python libraries:
| Library | Version | Purpose |
|---|---|---|
| `requests` | ≥2.31.0 | HTTP requests with session management |
| `beautifulsoup4` | ≥4.12.0 | HTML parsing and element manipulation |
| `html2text` | ≥2020.1.16 | HTML to Markdown conversion |
Sources: tools/requirements.txt:1-3
Output Characteristics
The Markdown files produced by this conversion have these properties:
- No line wrapping : Original formatting preserved (`body_width=0`)
- Clean structure : No UI elements or navigation
- Relative links : All internal links point to local `.md` files
- Title guarantee : Every file starts with an H1 heading
- Hierarchy-aware : Links account for subdirectory structure
- Footer-free : DeepWiki-specific footer content removed
These characteristics make the files suitable for Phase 2 diagram enhancement and Phase 3 mdBook building without further modification.
Phase 2: Diagram Enhancement
Relevant source files
Purpose and Scope
Phase 2 performs intelligent diagram extraction and placement after Phase 1 has generated clean markdown files. This phase extracts Mermaid diagrams from DeepWiki's JavaScript payload, matches them to appropriate locations in the markdown content using fuzzy text matching, and inserts them contextually after relevant paragraphs.
For information about the initial markdown extraction that precedes this phase, see Phase 1: Markdown Extraction. For details on the specific fuzzy matching algorithm implementation, see Fuzzy Diagram Matching Algorithm. For information about the extraction patterns used, see Diagram Extraction from Next.js.
Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789
The Client-Side Rendering Problem
DeepWiki renders diagrams client-side using JavaScript, making them invisible to traditional HTML scraping. All Mermaid diagrams are embedded in a JavaScript payload (self.__next_f.push) that contains diagram code for all pages in the wiki, not just the current page. This creates a matching problem: given ~461 diagrams in a single payload and individual markdown files, how do we determine which diagrams belong in which files?
Key challenges:
- Diagrams are escaped JavaScript strings (`\n`, `\t`, `\"`)
- No metadata associates diagrams with specific pages
- html2text conversion changes text formatting from the original JavaScript context
- Must avoid false positives (placing diagrams in wrong locations)
Sources: tools/deepwiki-scraper.py:458-461 README.md:131-136
Architecture Overview
Diagram: Phase 2 Processing Pipeline
Sources: tools/deepwiki-scraper.py:596-789
Diagram Extraction Process
The extraction process reads the JavaScript payload from any DeepWiki page and locates all Mermaid diagram blocks using regex pattern matching.
flowchart TD
Start["extract_and_enhance_diagrams()"]
FetchURL["Fetch https://deepwiki.com/repo/1-overview"]
subgraph "Pattern Matching"
Pattern1["Pattern: r'```mermaid\\\\\n(.*?)```'\n(re.DOTALL)"]
Pattern2["Pattern: r'([^`]{500,}?)```mermaid\\\\ (.*?)```'\n(with context)"]
FindAll["re.findall() → all_diagrams list"]
FindIter["re.finditer() → diagram_contexts with context"]
end
subgraph "Unescaping"
ReplaceNewline["Replace '\\\\\n' → newline"]
ReplaceTab["Replace '\\\\ ' → tab"]
ReplaceQuote["Replace '\\\\\"' → double-quote"]
ReplaceUnicode["Replace Unicode escapes:\n\\\< → '<'\n\\\> → '>'\n\\\& → '&'"]
end
subgraph "Context Processing"
Last500["Extract last 500 chars of context"]
FindHeading["Scan for last heading starting with #"]
ExtractAnchor["Extract last 2-3 non-heading lines\n(min 20 chars each)"]
BuildDict["Build dict: {last_heading, anchor_text, diagram}"]
end
Start --> FetchURL
FetchURL --> Pattern1
FetchURL --> Pattern2
Pattern1 --> FindAll
Pattern2 --> FindIter
FindAll --> ReplaceNewline
FindIter --> ReplaceNewline
ReplaceNewline --> ReplaceTab
ReplaceTab --> ReplaceQuote
ReplaceQuote --> ReplaceUnicode
ReplaceUnicode --> Last500
Last500 --> FindHeading
FindHeading --> ExtractAnchor
ExtractAnchor --> BuildDict
BuildDict --> Output["Returns:\n- all_diagrams count\n- diagram_contexts list"]
Extraction Function Flow
Diagram: Diagram Extraction and Context Building
Sources: tools/deepwiki-scraper.py:604-674
Key Implementation Details
| Component | Implementation | Location |
|---|---|---|
| Regex Pattern | `` r'```mermaid\\n(.*?)```' `` with `re.DOTALL` flag | tools/deepwiki-scraper.py:615 |
| Context Pattern | `` r'([^`]{500,}?)```mermaid\\n(.*?)```' `` captures 500+ chars of preceding context | tools/deepwiki-scraper.py:621 |
| Unescape Operations | replace('\\n', '\n'), replace('\\t', '\t'), etc. | tools/deepwiki-scraper.py:628-635 tools/deepwiki-scraper.py:639-645 |
| Heading Detection | line.startswith('#') on reversed context lines | tools/deepwiki-scraper.py:652-656 |
| Anchor Extraction | Last 2-3 lines with len(line) > 20, max 300 chars | tools/deepwiki-scraper.py:658-666 |
| Context Storage | Dict with keys: last_heading, anchor_text, diagram | tools/deepwiki-scraper.py:668-672 |
Sources: tools/deepwiki-scraper.py:614-674
Fuzzy Matching Algorithm
The fuzzy matching algorithm determines where each diagram should be inserted by finding the best match between the diagram's context and the markdown file's content.
flowchart TD
Start["For each diagram_contexts[idx]"]
CheckUsed["idx in diagrams_used?"]
Skip["Skip to next diagram"]
subgraph "Text Normalization"
NormFile["Normalize file content:\ncontent.lower()\n' '.join(content.split())"]
NormAnchor["Normalize anchor_text:\nanchor.lower()\n' '.join(anchor.split())"]
NormHeading["Normalize heading:\nheading.lower().replace('#', '').strip()"]
end
subgraph "Progressive Chunk Matching"
Try300["Try chunk_size=300"]
Try200["Try chunk_size=200"]
Try150["Try chunk_size=150"]
Try100["Try chunk_size=100"]
Try80["Try chunk_size=80"]
ExtractChunk["test_chunk = anchor_normalized[-chunk_size:]"]
FindPos["pos = content_normalized.find(test_chunk)"]
CheckPos["pos != -1?"]
ConvertLine["Convert char position to line number"]
RecordMatch["Record best_match_line, best_match_score"]
end
subgraph "Heading Fallback"
IterLines["For each line in markdown"]
CheckHeadingLine["line.strip().startswith('#')?"]
NormalizeLine["Normalize line heading"]
CheckContains["heading_normalized in line_normalized?"]
RecordHeadingMatch["best_match_line = line_num\nbest_match_score = 50"]
end
Start --> CheckUsed
CheckUsed -->|Yes| Skip
CheckUsed -->|No| NormFile
NormFile --> NormAnchor
NormAnchor --> Try300
Try300 --> ExtractChunk
ExtractChunk --> FindPos
FindPos --> CheckPos
CheckPos -->|Found| ConvertLine
CheckPos -->|Not found| Try200
ConvertLine --> RecordMatch
Try200 --> Try150
Try150 --> Try100
Try100 --> Try80
Try80 -->|All failed| IterLines
RecordMatch --> Success["Return match with score"]
IterLines --> CheckHeadingLine
CheckHeadingLine -->|Yes| NormalizeLine
NormalizeLine --> CheckContains
CheckContains -->|Yes| RecordHeadingMatch
RecordHeadingMatch --> Success
Matching Strategy
Diagram: Progressive Chunk Matching with Fallback
Sources: tools/deepwiki-scraper.py:708-746
Chunk Size Progression
The algorithm tries progressively smaller chunk sizes to accommodate variations in text formatting between the JavaScript context and the html2text-converted markdown:
| Chunk Size | Use Case | Success Rate |
|---|---|---|
| 300 chars | Perfect or near-perfect matches | Highest precision |
| 200 chars | Minor formatting differences | Good precision |
| 150 chars | Moderate text variations | Acceptable precision |
| 100 chars | Significant reformatting | Lower precision |
| 80 chars | Minimal context available | Lowest precision |
| Heading match | Fallback when text matching fails | Score: 50 |
The algorithm stops at the first successful match, prioritizing larger chunks for higher confidence.
Sources: tools/deepwiki-scraper.py:716-730 README.md:134
flowchart TD
Start["Found best_match_line"]
CheckType["lines[best_match_line].strip().startswith('#')?"]
subgraph "Heading Case"
H1["insert_line = best_match_line + 1"]
H2["Skip blank lines after heading"]
H3["Skip through paragraph content"]
H4["Stop at next blank line or heading"]
end
subgraph "Paragraph Case"
P1["insert_line = best_match_line + 1"]
P2["Find end of current paragraph"]
P3["Stop at next blank line or heading"]
end
subgraph "Insertion Format"
I1["Insert: empty line"]
I2["Insert: ```mermaid"]
I3["Insert: diagram code"]
I4["Insert: ```"]
I5["Insert: empty line"]
end
Start --> CheckType
CheckType -->|Heading| H1
CheckType -->|Paragraph| P1
H1 --> H2
H2 --> H3
H3 --> H4
P1 --> P2
P2 --> P3
H4 --> I1
P3 --> I1
I1 --> I2
I2 --> I3
I3 --> I4
I4 --> I5
I5 --> Append["Append to pending_insertions list:\n(insert_line, diagram, score, idx)"]
Insertion Point Logic
After finding a match, the system determines the precise line number where the diagram should be inserted.
Insertion Algorithm
Diagram: Insertion Point Calculation
Sources: tools/deepwiki-scraper.py:747-768
graph LR
Collect["Collect all\npending_insertions"]
Sort["Sort by insert_line\n(descending)"]
Insert["Insert from bottom to top\npreserves line numbers"]
Write["Write enhanced file\nto temp_dir"]
Collect --> Sort
Sort --> Insert
Insert --> Write
Batch Insertion Strategy
Diagrams are inserted in descending line order to avoid invalidating insertion points:
Diagram: Batch Insertion Order
Implementation:
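A minimal sketch of the bottom-up insertion, assuming tuples laid out as `(insert_line, diagram, score, idx)` per the diagram above (the helper name is illustrative):

```python
def insert_diagrams(lines: list[str], pending_insertions: list[tuple]) -> list[str]:
    """Insert matched diagrams bottom-up so line numbers stay valid (sketch)."""
    fence = "`" * 3                       # mermaid code fence
    for insert_line, diagram, score, idx in sorted(pending_insertions, reverse=True):
        block = ["", fence + "mermaid", *diagram.split("\n"), fence, ""]
        lines[insert_line:insert_line] = block
    return lines
```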
Sources: tools/deepwiki-scraper.py:771-783
sequenceDiagram
participant Main as extract_and_enhance_diagrams()
participant Glob as temp_dir.glob('**/*.md')
participant File as Individual .md file
participant Matcher as Fuzzy Matcher
participant Writer as File Writer
Main->>Main: Extract all diagram_contexts
Main->>Glob: Find all markdown files
loop For each md_file
Glob->>File: Open and read content
File->>File: Check if '```mermaid' already present
alt Already has diagrams
File->>Glob: Skip (continue)
else No diagrams
File->>Matcher: Normalize content
loop For each diagram_context
Matcher->>Matcher: Try progressive chunk matching
Matcher->>Matcher: Try heading fallback
Matcher->>Matcher: Record best match
end
Matcher->>File: Return pending_insertions list
File->>File: Sort insertions (descending)
File->>File: Insert diagrams bottom-up
File->>Writer: Write enhanced content
Writer->>Main: Increment enhanced_count
end
end
Main->>Main: Print summary
File Processing Workflow
Phase 2 operates on files in the temporary directory created by Phase 1, enhancing them in-place before they are moved to the final output directory.
Processing Loop
Diagram: File Processing Sequence
Sources: tools/deepwiki-scraper.py:676-788
Performance Characteristics
Extraction Statistics
From a typical wiki with ~10 pages:
| Metric | Value | Location |
|---|---|---|
| Total diagrams in JS payload | ~461 | README.md:132 |
| Diagrams with context (500+ chars) | ~48 | README.md:133 |
| Context window size | 500 characters | tools/deepwiki-scraper.py:621 |
| Anchor text max length | 300 characters | tools/deepwiki-scraper.py:666 |
| Typical enhanced files | Varies by content | Printed in output |
Sources: README.md:132-133 tools/deepwiki-scraper.py:674 tools/deepwiki-scraper.py:788
Matching Performance
The progressive chunk size strategy balances precision and recall:
- High precision matches (300-200 chars) : Strong contextual alignment
- Medium precision matches (150-100 chars) : Acceptable with some risk
- Low precision matches (80 chars) : Risk of false positives
- Heading-only matches (score: 50) : Last resort fallback
The algorithm prefers to skip a diagram rather than place it incorrectly, prioritizing documentation quality over diagram count.
Sources: tools/deepwiki-scraper.py:716-745
Integration with Phases 1 and 3
Input Requirements (from Phase 1)
- Clean markdown files in `temp_dir`
- Files must not already contain fenced `mermaid` blocks
- Proper heading structure for fallback matching
- Normalized link structure
Sources: tools/deepwiki-scraper.py:810-877
Output Guarantees (for Phase 3)
- Enhanced markdown files in `temp_dir`
- Diagrams inserted with proper `mermaid` code fencing
- Blank lines before and after diagrams for proper rendering
- Original file structure preserved (section-N directories maintained)
- Atomic file operations (write complete file or skip)
Sources: tools/deepwiki-scraper.py:781-786 tools/deepwiki-scraper.py:883-908
Workflow Integration
Diagram: Three-Phase Integration
Sources: README.md:123-145 tools/deepwiki-scraper.py:810-916
Error Handling and Edge Cases
Skipped Files
Files are skipped if they already contain Mermaid diagrams to avoid duplicate insertion:
Sources: tools/deepwiki-scraper.py:686-687
Failed Matches
When a diagram cannot be matched:
- The diagram is not inserted (conservative approach)
- No error is raised (continues processing other diagrams)
- File is left unmodified if no diagrams match
Sources: tools/deepwiki-scraper.py:699-746
Network Errors
If diagram extraction fails (network error, changed HTML structure):
- Warning is printed but Phase 2 continues
- Phase 1 files remain valid
- System can still proceed to Phase 3 without diagrams
Sources: tools/deepwiki-scraper.py:610-612
Diagram Quality Thresholds
| Threshold | Purpose |
|---|---|
| `len(diagram) > 10` | Filter out trivial/invalid diagram code |
| `len(anchor) > 50` | Ensure sufficient context for matching |
| `len(line) > 20` | Filter out short lines from anchor text |
| `chunk_size >= 80` | Minimum viable match size |
Sources: tools/deepwiki-scraper.py:648 tools/deepwiki-scraper.py:712 tools/deepwiki-scraper.py:661
Summary
Phase 2 implements a sophisticated fuzzy matching system that:
- Extracts all Mermaid diagrams from DeepWiki's JavaScript payload using regex patterns
- Processes diagram context to extract heading and anchor text metadata
- Matches diagrams to markdown files using progressive chunk size comparison (300→80 chars)
- Inserts diagrams after relevant paragraphs with proper formatting
- Validates through conservative matching to avoid false positives
The phase operates entirely on files in the temporary directory, leaving Phase 1's output intact while preparing enhanced files for Phase 3's mdBook build process.
Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789
Fuzzy Diagram Matching Algorithm
Relevant source files
Purpose and Scope
This document describes the fuzzy matching algorithm used to intelligently place Mermaid diagrams extracted from DeepWiki's JavaScript payload into the correct locations within Markdown files. The algorithm solves the problem of matching diagram context (as it appears in the JavaScript) to content locations in the html2text-converted Markdown, accounting for formatting differences between the two representations.
For information about how diagrams are extracted from the Next.js payload, see Diagram Extraction from Next.js. For the overall diagram enhancement phase, see Phase 2: Diagram Enhancement.
The Matching Problem
The fuzzy matching algorithm addresses a fundamental mismatch: diagrams are embedded in DeepWiki's JavaScript payload alongside their surrounding context text, but this context text differs significantly from the final Markdown output produced by html2text. The algorithm must find where each diagram belongs despite these differences.
Format Differences Between Sources
| Aspect | JavaScript Payload | html2text Output |
|---|---|---|
| Whitespace | Escaped \n sequences | Actual newlines |
| Line wrapping | No wrapping (continuous text) | Wrapped at natural boundaries |
| HTML entities | Escaped (\u003c, \u0026) | Decoded (<, &) |
| Formatting | Inline with escaped quotes | Clean Markdown syntax |
| Structure | Linear text stream | Hierarchical headings/paragraphs |
Sources: tools/deepwiki-scraper.py:615-646
Context Extraction Strategy
The algorithm extracts two types of context for each diagram to enable matching:
1. Last Heading Before Diagram
The algorithm scans backwards through context lines to find the most recent heading, which provides a coarse-grained location hint.
Sources: tools/deepwiki-scraper.py:651-656
2. Anchor Text (Last 2-3 Paragraphs)
The anchor text consists of the last 2-3 substantial non-heading lines before the diagram, truncated to 300 characters. This provides fine-grained matching capability.
Sources: tools/deepwiki-scraper.py:658-666
Progressive Chunk Size Matching
The core of the fuzzy matching algorithm uses progressively smaller chunk sizes to find matches, prioritizing longer (more specific) matches over shorter ones.
Chunk Size Progression
The algorithm tests chunks in this order:
| Chunk Size | Purpose | Match Quality |
|---|---|---|
| 300 chars | Full anchor text | Highest confidence |
| 200 chars | Most of anchor | High confidence |
| 150 chars | Significant portion | Medium-high confidence |
| 100 chars | Key phrases | Medium confidence |
| 80 chars | Minimum viable match | Low confidence |
Matching Algorithm Flow
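A condensed sketch of the progressive search and its heading fallback, using the chunk sizes from the table above (the function name is illustrative, and the character-to-line conversion here is only approximate):

```python
def find_insertion_line(lines: list[str], anchor_text: str, last_heading: str):
    """Return (line_number, score) for the best match, or None (sketch)."""
    content_norm = " ".join("\n".join(lines).lower().split())
    anchor_norm = " ".join(anchor_text.lower().split())

    # Progressive chunk matching: larger chunks first for higher confidence.
    for chunk_size in (300, 200, 150, 100, 80):
        chunk = anchor_norm[-chunk_size:]
        if not chunk:
            continue
        pos = content_norm.find(chunk)
        if pos != -1:
            # Map the character offset back to an approximate line number.
            consumed = 0
            for line_num, line in enumerate(lines):
                consumed += len(" ".join(line.lower().split())) + 1
                if consumed >= pos:
                    return line_num, chunk_size

    # Fallback: match on the last heading seen before the diagram.
    heading_norm = last_heading.lower().replace("#", "").strip()
    for line_num, line in enumerate(lines):
        if heading_norm and line.strip().startswith("#") and heading_norm in line.lower():
            return line_num, 50           # fixed low-confidence score
    return None
```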
Sources: tools/deepwiki-scraper.py:716-730
Text Normalization
Both the diagram context and the target Markdown content undergo identical normalization to maximize matching success:
This process:
- Converts all text to lowercase
- Collapses all consecutive whitespace (spaces, tabs, newlines) into single spaces
- Removes leading/trailing whitespace
Sources: tools/deepwiki-scraper.py:695-696 tools/deepwiki-scraper.py:713-714
Fallback: Heading-Based Matching
If progressive chunk matching fails, the algorithm falls back to heading-based matching:
Heading-based matches receive a fixed score of 50, lower than any chunk-based match, indicating lower confidence.
Sources: tools/deepwiki-scraper.py:733-745
Insertion Point Calculation
Once a match is found, the algorithm calculates the precise insertion point for the diagram:
Insertion After Headings
Sources: tools/deepwiki-scraper.py:751-759
Insertion After Paragraphs
Sources: tools/deepwiki-scraper.py:760-765
Scoring and Deduplication
The algorithm tracks which diagrams have been used to prevent duplicates:
For each file, the algorithm:
- Attempts to match all diagrams with context
- Stores successful matches with their scores in `pending_insertions`
- Marks diagrams as used in `diagrams_used`
- Sorts insertions by line number (descending) to avoid index shifting
- Inserts diagrams from bottom to top
Sources: tools/deepwiki-scraper.py:692 tools/deepwiki-scraper.py:767-768
Diagram Insertion Format
Diagrams are inserted with proper Markdown fencing and spacing. This produces a fenced `mermaid` block surrounded by blank lines, placed immediately after the matched paragraph, with the next paragraph of text continuing below it.
Sources: tools/deepwiki-scraper.py:774-779
Complete Matching Pipeline
```mermaid
flowchart TD
Start["extract_and_enhance_diagrams"] --> FetchJS["Fetch JavaScript from\n/1-overview page"]
FetchJS --> ExtractAll["Extract all diagrams\nwith 500+ char context"]
ExtractAll --> ParseContext["Parse each context:\n- last_heading\n- anchor_text (300 chars)"]
ParseContext --> FindFiles["Find all .md files\nin temp directory"]
FindFiles --> ForEachFile["For each Markdown file"]
ForEachFile --> SkipExisting["Skip if already has\n```mermaid blocks"]
SkipExisting --> NormalizeContent["Normalize file content"]
NormalizeContent --> ForEachDiagram["For each diagram\nwith context"]
ForEachDiagram --> TryChunks["Try progressive chunk\nmatching (300-80)"]
TryChunks -->|Match| StoreMatch["Store match with score"]
TryChunks -->|No match| TryHeading["Try heading fallback"]
TryHeading -->|Match| StoreMatch
TryHeading -->|No match| NextDiagram["Try next diagram"]
StoreMatch --> NextDiagram
NextDiagram --> ForEachDiagram
ForEachDiagram -->|All tried| InsertAll["Insert all matched\ndiagrams (bottom-up)"]
InsertAll --> SaveFile["Save enhanced file"]
SaveFile --> ForEachFile
ForEachFile -->|All files| Report["Report statistics"]
```
Sources: tools/deepwiki-scraper.py:596-788
Key Functions
| Function | Location | Purpose |
|---|---|---|
| `extract_and_enhance_diagrams` | tools/deepwiki-scraper.py:596-788 | Main orchestrator for diagram enhancement phase |
| Progressive chunk loop | tools/deepwiki-scraper.py:716-730 | Tries decreasing chunk sizes for matching |
| Heading fallback | tools/deepwiki-scraper.py:733-745 | Matches based on heading text when chunks fail |
| Insertion point calculation | tools/deepwiki-scraper.py:748-765 | Determines where to insert diagram after match |
| Diagram insertion | tools/deepwiki-scraper.py:774-779 | Inserts diagram with proper fencing |
Performance Characteristics
The algorithm processes diagrams in a single pass per file with the following complexity:
| Operation | Complexity | Notes |
|---|---|---|
| Content normalization | O(n) | Where n = file size in characters |
| Chunk search | O(n × c) | c = number of chunk sizes (5) |
| Line number conversion | O(L) | Where L = number of lines in file |
| Insertion sorting | O(k log k) | Where k = matched diagrams |
| Bottom-up insertion | O(k × L) | Avoids index recalculation |
For a typical file with 1000 lines and 50 diagram candidates, the algorithm completes in under 100ms.
Sources: tools/deepwiki-scraper.py:681-788
Match Quality Statistics
As reported in the console output, the algorithm typically achieves:
- Total diagrams in JavaScript : ~461 diagrams across all pages
- Diagrams with sufficient context : ~48 diagrams (500+ char context)
- Average match rate : 60-80% of diagrams with context are successfully placed
- Typical score distribution :
- 300-char matches: 20-30% (highest confidence)
- 200-char matches: 15-25%
- 150-char matches: 15-20%
- 100-char matches: 10-15%
- 80-char matches: 5-10%
- Heading fallback: 5-10% (lowest confidence)
Sources: README.md:132-136 tools/deepwiki-scraper.py:674
Diagram Extraction from Next.js
Relevant source files
Purpose and Scope
This document details how Mermaid diagrams are extracted from DeepWiki's Next.js JavaScript payload. DeepWiki uses client-side rendering for diagrams, embedding them as escaped strings within the HTML's JavaScript data structures. This page covers the extraction algorithms, regex patterns, unescaping logic, and deduplication mechanisms used to recover these diagrams.
For information about how extracted diagrams are matched to content and injected into Markdown files, see Fuzzy Diagram Matching Algorithm. For the overall diagram enhancement workflow, see Phase 2: Diagram Enhancement.
The Next.js Data Payload Problem
DeepWiki's architecture presents a unique challenge for diagram extraction. The application uses Next.js with client-side rendering, where Mermaid diagrams are embedded in the JavaScript payload rather than being present in the static HTML. Furthermore, the JavaScript payload contains diagrams from all pages in the wiki, not just the currently viewed page, making per-page extraction impossible without additional context matching.
Diagram: Next.js Payload Structure
graph TB
subgraph "Browser View"
HTML["HTML Response\nfrom deepwiki.com"]
end
subgraph "Embedded JavaScript"
JSPayload["Next.js Data Payload\nMixed content from all pages"]
DiagramData["Mermaid Diagrams\nAs escaped strings"]
end
subgraph "String Format"
EscapedFormat["```mermaid\\ \ngraph TD\\ \nA --> B\\ \n```"]
UnescapedFormat["```mermaid\ngraph TD\nA --> B\n```"]
end
HTML --> JSPayload
JSPayload --> DiagramData
DiagramData --> EscapedFormat
EscapedFormat -.->|extract_mermaid_from_nextjs_data| UnescapedFormat
note1["Problem: Diagrams from\nALL wiki pages mixed together"]
JSPayload -.-> note1
note2["Problem: Escape sequences\n\\\n, \\ , \\\", etc."]
EscapedFormat -.-> note2
Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674
The key characteristics of this data format:
| Characteristic | Description | Impact |
|---|---|---|
| Escaped newlines | Literal \\n instead of newline characters | Requires unescaping before use |
| Mixed content | All pages' diagrams in one payload | Requires context matching (Phase 2) |
| Unicode escapes | Sequences like \\u003c for < | Requires comprehensive unescape logic |
| String wrapping | Diagrams wrapped in JavaScript strings | Requires careful quote handling |
Extraction Entry Point
The extract_mermaid_from_nextjs_data() function serves as the primary extraction mechanism. It is called during Phase 2 of the pipeline when processing the HTML response from any DeepWiki page.
Diagram: Extraction Function Flow
flowchart TD
Start["extract_mermaid_from_nextjs_data(html_text)"]
Strategy1["Strategy 1:\nFenced Block Pattern\n```mermaid\\\n(.*?)```"]
Check1{{"Blocks found?"}}
Strategy2["Strategy 2:\nJavaScript String Scan\nSearch for diagram keywords"]
Unescape["Unescape all blocks:\n\\\n→ newline\n\\ → tab\n\< → <"]
Dedup["Deduplicate by fingerprint\nFirst 100 chars"]
Return["Return unique_blocks"]
Start --> Strategy1
Strategy1 --> Check1
Check1 -->|Yes| Unescape
Check1 -->|No| Strategy2
Strategy2 --> Unescape
Unescape --> Dedup
Dedup --> Return
Sources: tools/deepwiki-scraper.py:218-331
Strategy 1: Fenced Mermaid Block Pattern
The primary extraction strategy uses a regex pattern to locate fenced Mermaid code blocks within the JavaScript payload. These blocks follow the Markdown convention but with escaped newlines.
Regex Pattern : r'```mermaid\\n(.*?)```'
This pattern specifically targets:
- Opening fence: `` ```mermaid ``
- Escaped newline: `\\n` (a literal backslash-n in the string)
- Diagram content: `(.*?)` (non-greedy capture)
- Closing fence: `` ``` ``
Diagram: Fenced Block Extraction Process
Sources: tools/deepwiki-scraper.py:223-244
Code Implementation :
The extraction loop at tools/deepwiki-scraper.py:223-244 implements this strategy:
- Pattern matching : Uses `re.finditer()` with the `re.DOTALL` flag to handle multi-line diagrams
- Content extraction : Captures the diagram code via `match.group(1)`
- Unescaping : Applies comprehensive escape sequence replacement
- Validation : Filters blocks with `len(block) > 10` to exclude empty matches
- Logging : Prints first 50 characters and line count for diagnostics
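A sketch of this fenced-block pass, using the pattern quoted above; the unescaping is abbreviated here (the full escape mapping is covered below), and the function name is illustrative:

```python
import re

def extract_fenced_mermaid_sketch(html_text: str) -> list[str]:
    """Find fenced mermaid blocks stored as escaped JavaScript strings (sketch)."""
    fence = "`" * 3
    # Equivalent to the pattern quoted above: ```mermaid\\n(.*?)```
    pattern = re.compile(fence + r"mermaid\\n(.*?)" + fence, re.DOTALL)
    blocks = []
    for match in pattern.finditer(html_text):
        block = match.group(1).replace("\\n", "\n").replace("\\t", "\t")
        if len(block) > 10:               # skip trivial/empty matches
            blocks.append(block)
    return blocks
```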
Strategy 2: JavaScript String Scanning
When Strategy 1 fails to find fenced blocks, the function falls back to scanning for raw diagram strings embedded in JavaScript. This handles cases where diagrams are stored as plain strings without Markdown fencing.
Diagram: JavaScript String Scan Algorithm
flowchart TD
Start["For each diagram keyword"]
Keywords["Keywords:\ngraph TD, graph TB,\nflowchart TD, sequenceDiagram,\nclassDiagram"]
FindKW["pos = html_text.find(keyword, pos)"]
CheckFound{{"Keyword found?"}}
BackwardScan["Scan backwards 20 chars\nFind opening quote"]
QuoteFound{{"Quote found?"}}
ForwardScan["Scan forward up to 10000 chars\nFind closing quote\nSkip escaped quotes"]
Extract["Extract string_start:string_end"]
UnescapeValidate["Unescape and validate\nMust have 3+ lines"]
Append["Append to mermaid_blocks"]
NextPos["pos += 1, continue search"]
Start --> Keywords
Keywords --> FindKW
FindKW --> CheckFound
CheckFound -->|Yes| BackwardScan
CheckFound -->|No, break| End["Move to next keyword"]
BackwardScan --> QuoteFound
QuoteFound -->|Yes| ForwardScan
QuoteFound -->|No| NextPos
ForwardScan --> Extract
Extract --> UnescapeValidate
UnescapeValidate --> Append
Append --> NextPos
NextPos --> FindKW
Sources: tools/deepwiki-scraper.py:246-302
Keyword List :
The algorithm searches for Mermaid diagram type indicators such as graph TD, graph TB, flowchart TD, sequenceDiagram, and classDiagram.
Quote Handling Logic :
The forward scan at tools/deepwiki-scraper.py:273-285 implements careful quote detection:
- Scans up to 10,000 characters forward (safety limit)
- Checks if the previous character is `\` to identify escaped quotes
- Breaks on the first unescaped `"` character
- Returns to search position + 1 if no closing quote found
Unescape Processing
All extracted diagram blocks undergo comprehensive unescaping to convert JavaScript string representations into valid Mermaid code. The unescaping process handles multiple escape sequence types.
Escape Sequence Mapping :
| Escaped Form | Unescaped Result | Purpose |
|---|---|---|
| `\\n` | `\n` (newline) | Line breaks in diagram code |
| `\\t` | `\t` (tab) | Indentation |
| `\\"` | `"` (quote) | String literals in labels |
| `\\\\` | `\` (backslash) | Literal backslashes |
| `\\u003c` | `<` | Less-than symbol |
| `\\u003e` | `>` | Greater-than symbol |
| `\\u0026` | `&` | Ampersand |
Diagram: Unescape Transformation Pipeline
Sources: tools/deepwiki-scraper.py:231-238 tools/deepwiki-scraper.py:289-295
Implementation Details :
The unescaping sequence at tools/deepwiki-scraper.py:231-238 executes in a specific order to prevent double-processing:
- Newlines first : `\\n` → `\n` (most common)
- Tabs : `\\t` → `\t` (whitespace)
- Quotes : `\\"` → `"` (before backslash handling to avoid conflicts)
- Backslashes : `\\\\` → `\` (last to avoid interfering with other escapes)
- Unicode : `\\u003c`, `\\u003e`, `\\u0026` → `<`, `>`, `&`
The order matters: processing backslashes before quotes would incorrectly unescape \\\\" sequences.
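A minimal sketch of the ordered replacements, in the same order as the list above (the function name is illustrative):

```python
def unescape_js_string(block: str) -> str:
    """Apply escape-sequence replacements in an order that avoids conflicts."""
    replacements = [
        ("\\n", "\n"),       # newlines first (most common)
        ("\\t", "\t"),       # tabs
        ('\\"', '"'),        # quotes before backslashes
        ("\\\\", "\\"),      # backslashes last
        ("\\u003c", "<"),    # unicode escapes
        ("\\u003e", ">"),
        ("\\u0026", "&"),
    ]
    for escaped, plain in replacements:
        block = block.replace(escaped, plain)
    return block
```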
flowchart TD
Start["Input: mermaid_blocks[]\n(may contain duplicates)"]
Init["Initialize:\nunique_blocks = []\nseen = set()"]
Loop["For each block in mermaid_blocks"]
Fingerprint["fingerprint = block[:100]\n(first 100 chars)"]
CheckSeen{{"fingerprint in seen?"}}
Skip["Skip duplicate"]
Add["Add to seen set\nAppend to unique_blocks"]
Return["Return unique_blocks"]
Start --> Init
Init --> Loop
Loop --> Fingerprint
Fingerprint --> CheckSeen
CheckSeen -->|Yes| Skip
CheckSeen -->|No| Add
Skip --> Loop
Add --> Loop
Loop -->|Done| Return
Deduplication Mechanism
Since multiple extraction strategies may find the same diagram (once as fenced block, once as JavaScript string), the function implements fingerprint-based deduplication.
Diagram: Deduplication Algorithm
Sources: tools/deepwiki-scraper.py:304-311
Fingerprint Strategy :
The deduplication at tools/deepwiki-scraper.py:304-311 uses the first 100 characters as a unique identifier. This approach:
- Avoids exact string comparison : Saves memory and time for large diagrams
- Handles minor variations : Trailing whitespace differences don't affect matching
- Preserves order : First occurrence wins (FIFO)
- Works across strategies : Catches duplicates from both extraction methods
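A minimal sketch of the fingerprint deduplication described above (names are illustrative):

```python
def deduplicate_blocks(mermaid_blocks: list[str]) -> list[str]:
    """Keep the first occurrence of each diagram, keyed on its first 100 chars."""
    unique_blocks, seen = [], set()
    for block in mermaid_blocks:
        fingerprint = block[:100]
        if fingerprint in seen:
            continue                      # duplicate found by another strategy
        seen.add(fingerprint)
        unique_blocks.append(block)
    return unique_blocks
```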
Integration with Enhancement Pipeline
The extract_mermaid_from_nextjs_data() function is called from extract_and_enhance_diagrams() during Phase 2 processing. The integration pattern extracts diagrams globally, then distributes them to individual pages through context matching.
Diagram: Phase 2 Integration Flow
sequenceDiagram
participant Main as "extract_and_enhance_diagrams()"
participant HTTP as "requests.Session"
participant Extract as "extract_mermaid_from_nextjs_data()"
participant Context as "Context Extraction"
participant Files as "Markdown Files"
Main->>HTTP: GET https://deepwiki.com/repo/1-overview
HTTP-->>Main: HTML response (all diagrams)
Main->>Extract: extract_mermaid_from_nextjs_data(html_text)
Extract->>Extract: Strategy 1: Fenced blocks
Extract->>Extract: Strategy 2: JS strings
Extract->>Extract: Unescape all
Extract->>Extract: Deduplicate
Extract-->>Main: all_diagrams[] (~461 diagrams)
Main->>Context: Extract with 500-char context
Context-->>Main: diagram_contexts[] (~48 with context)
Main->>Files: Fuzzy match and inject into *.md
Files-->>Main: enhanced_count files modified
Sources: tools/deepwiki-scraper.py:596-674 tools/deepwiki-scraper.py:604-612
Call Site :
The extraction is invoked at tools/deepwiki-scraper.py:604-612 within extract_and_enhance_diagrams():
- Fetch any page : Typically uses `/1-overview`, as all diagrams are in every page's payload
- Extract globally : Calls `extract_mermaid_from_nextjs_data()` on the full HTML response
- Count total : Logs total diagram count (~461 in typical repositories)
- Extract context : Secondary regex pass to capture surrounding text (see Fuzzy Diagram Matching Algorithm)
Alternative Pattern Search :
Phase 2 also performs a second extraction pass at tools/deepwiki-scraper.py:615-646 with context:
- Pattern : `r'([^]{500,}?)mermaid\\n(.*?)'`
- Purpose : Captures 500+ characters before each diagram for context matching
- Result : `diagram_contexts[]` with `last_heading`, `anchor_text`, and `diagram` fields
- Filtering : Only diagrams with meaningful context are used for fuzzy matching
Error Handling and Diagnostics
The extraction function includes comprehensive error handling and diagnostic output to aid in debugging and monitoring extraction quality.
Error Handling Strategy :
Sources: tools/deepwiki-scraper.py:327-331
Diagnostic Output :
The function provides detailed logging at tools/deepwiki-scraper.py244 tools/deepwiki-scraper.py248 tools/deepwiki-scraper.py300 tools/deepwiki-scraper.py:314-316:
| Output Message | Condition | Purpose |
|---|---|---|
"Found mermaid diagram: {first_50}... ({lines} lines)" | Each successful extraction | Verify diagram content |
"No fenced mermaid blocks found, trying JavaScript extraction..." | Strategy 1 fails | Indicate fallback |
"Found JS mermaid diagram: {first_50}... ({lines} lines)" | Strategy 2 success | Show fallback results |
"Extracted {count} unique mermaid diagram(s)" | Deduplication complete | Report final count |
"Warning: No valid mermaid diagrams extracted" | Zero diagrams found | Alert to potential issues |
"Warning: Failed to extract mermaid from page data: {e}" | Exception caught | Debug extraction failures |
Performance Characteristics
The extraction algorithm exhibits specific performance characteristics relevant to large wiki repositories.
Complexity Analysis :
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Strategy 1 regex | O(n) | O(m) | n = HTML length, m = diagram count |
| Strategy 2 scan | O(n × k) | O(m) | k = keyword count (10) |
| Unescaping | O(m × d) | O(m × d) | d = avg diagram length |
| Deduplication | O(m) | O(m) | Uses 100-char fingerprint |
| Total | O(n × k) | O(m × d) | Dominated by Strategy 2 |
Typical Performance :
Based on the diagnostic output patterns at tools/deepwiki-scraper.py314:
- Input size : ~2-5 MB HTML response
- Extraction time : ~200-500ms (dominated by regex operations)
- Diagrams found : ~461 total diagrams
- Diagrams with context : ~48 after filtering
- Memory usage : ~1-2 MB for diagram storage (ephemeral)
Optimization Opportunities :
The current implementation prioritizes correctness over performance. Potential optimizations:
- Early termination : Stop Strategy 2 after finding sufficient diagrams
- Compiled patterns : Pre-compile regex patterns (currently done inline)
- Streaming extraction : Process HTML in chunks rather than loading entirely
- Fingerprint cache : Persist fingerprints across runs to avoid re-extraction
However, given typical execution times (<1 second), these optimizations are not currently necessary.
Summary
The Next.js diagram extraction mechanism solves the challenge of recovering client-side rendered Mermaid diagrams from DeepWiki's JavaScript payload. The implementation uses a two-strategy approach (fenced blocks and JavaScript string scanning), comprehensive unescaping logic, and fingerprint-based deduplication to reliably extract hundreds of diagrams from a single HTML response. The extracted diagrams are then passed to the fuzzy matching algorithm (see Fuzzy Diagram Matching Algorithm) for intelligent placement in the appropriate Markdown files.
Key Functions and Components :
| Component | Location | Purpose |
|---|---|---|
| extract_mermaid_from_nextjs_data() | tools/deepwiki-scraper.py:218-331 | Main extraction function |
| Strategy 1 regex | tools/deepwiki-scraper.py:223-244 | Fenced block pattern matching |
| Strategy 2 scanner | tools/deepwiki-scraper.py:246-302 | JavaScript string scanning |
| Deduplication | tools/deepwiki-scraper.py:304-311 | Fingerprint-based uniqueness |
| Phase 2 integration | tools/deepwiki-scraper.py:604-612 | Call site in enhancement pipeline |
Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674
Phase 3: mdBook Build
Relevant source files
This document describes the final phase of the three-phase pipeline, where the extracted and enhanced Markdown files are transformed into a searchable HTML documentation site using mdBook and mdbook-mermaid. This phase is orchestrated by build-docs.sh and can be optionally skipped using the MARKDOWN_ONLY configuration flag.
For details on configuration file generation, see Configuration Generation. For the table of contents generation algorithm, see SUMMARY.md Generation. For the earlier phases of content extraction and diagram enhancement, see Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement.
Overview
Phase 3 executes only when MARKDOWN_ONLY is not set to "true". It consists of seven distinct steps that transform the extracted Markdown files into a complete mdBook-based HTML documentation site with working Mermaid diagram rendering, search functionality, and navigation.
Phase 3 Execution Flow
graph TB
Start["Phase 2 Complete\n$WIKI_DIR contains\nenhanced Markdown"]
CheckMode{"MARKDOWN_ONLY\n== 'true'?"}
SkipPath["Skip Phase 3\nCopy markdown to output\nExit"]
Step2["Step 2:\nInitialize mdBook Structure\nmkdir -p $BOOK_DIR\ncd $BOOK_DIR"]
Step3["Step 3:\nGenerate book.toml\nConfigure title, authors,\ntheme, preprocessors"]
Step4["Step 4:\nGenerate SUMMARY.md\nDiscover file structure\nCreate table of contents"]
Step5["Step 5:\nCopy Markdown Files\ncp -r $WIKI_DIR/* src/"]
Step6["Step 6:\nInstall Mermaid Assets\nmdbook-mermaid install"]
Step7["Step 7:\nBuild HTML Book\nmdbook build"]
Step8["Step 8:\nCopy Outputs\nbook/ → /output/book/\nmarkdown → /output/markdown/\nbook.toml → /output/"]
End["Complete:\nHTML documentation ready"]
Start --> CheckMode
CheckMode -->|Yes| SkipPath
CheckMode -->|No| Step2
SkipPath --> End
Step2 --> Step3
Step3 --> Step4
Step4 --> Step5
Step5 --> Step6
Step6 --> Step7
Step7 --> Step8
Step8 --> End
Sources: build-docs.sh:60-76 build-docs.sh:78-205
Directory Structure Transformation
Phase 3 transforms the flat wiki directory structure into mdBook's required layout. The following diagram shows how files are organized and moved through the build process:
Directory Layout Evolution Through Phase 3
graph LR
subgraph Input["Input: $WIKI_DIR"]
WikiRoot["$WIKI_DIR/\n(scraped content)"]
WikiMD["*.md files"]
WikiSections["section-N/ dirs"]
WikiRoot --> WikiMD
WikiRoot --> WikiSections
end
subgraph BookStructure["$BOOK_DIR Structure"]
BookRoot["$BOOK_DIR/"]
BookToml["book.toml"]
SrcDir["src/"]
SrcSummary["src/SUMMARY.md"]
SrcMD["src/*.md"]
SrcSections["src/section-N/"]
BookOutput["book/"]
BookHTML["book/index.html\nbook/*.html"]
BookRoot --> BookToml
BookRoot --> SrcDir
BookRoot --> BookOutput
SrcDir --> SrcSummary
SrcDir --> SrcMD
SrcDir --> SrcSections
BookOutput --> BookHTML
end
subgraph Output["Output: /output"]
OutRoot["/output/"]
OutBook["book/"]
OutMarkdown["markdown/"]
OutToml["book.toml"]
OutRoot --> OutBook
OutRoot --> OutMarkdown
OutRoot --> OutToml
end
WikiMD -.->|Step 5: cp -r| SrcMD
WikiSections -.->|Step 5: cp -r| SrcSections
BookHTML -.->|Step 8: cp -r book| OutBook
WikiMD -.->|Step 8: cp -r| OutMarkdown
BookToml -.->|Step 8: cp| OutToml
Sources: build-docs.sh:27-30 build-docs.sh:81-106 build-docs.sh:163-191
Step 2: Initialize mdBook Structure
The build script creates the base directory structure required by mdBook. This establishes the workspace where all subsequent operations occur.
| Operation | Command | Purpose |
|---|---|---|
| Create book directory | mkdir -p "$BOOK_DIR" | Root directory for mdBook project |
| Change to book directory | cd "$BOOK_DIR" | Set working directory for mdBook commands |
| Create source directory | mkdir -p src | Directory where Markdown files will be placed |
The $BOOK_DIR variable defaults to /workspace/book and serves as the mdBook project root throughout Phase 3.
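In shell terms, this step amounts to the following sketch (not a verbatim excerpt of build-docs.sh):

```sh
BOOK_DIR="${BOOK_DIR:-/workspace/book}"   # default mdBook project root
mkdir -p "$BOOK_DIR"                      # create the workspace
cd "$BOOK_DIR"                            # subsequent mdbook commands run from here
mkdir -p src                              # mdBook's expected source directory
```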
Sources: build-docs.sh:78-106
Step 3: Generate book.toml Configuration
The script dynamically generates book.toml, mdBook's main configuration file, using environment variables and auto-detected metadata. This file controls the book's metadata, theme, preprocessors, and output settings.
Configuration File Structure
| Section | Key | Value Source | Purpose |
|---|---|---|---|
| [book] | title | $BOOK_TITLE | Book title displayed in HTML |
| [book] | authors | $BOOK_AUTHORS | Author names (defaults to repo owner) |
| [book] | language | "en" (hardcoded) | Documentation language |
| [book] | multilingual | false (hardcoded) | Single language mode |
| [book] | src | "src" (hardcoded) | Source directory name |
| [output.html] | default-theme | "rust" (hardcoded) | Visual theme (Rust documentation style) |
| [output.html] | git-repository-url | $GIT_REPO_URL | Enables "Edit this page" links |
| [preprocessor.mermaid] | command | "mdbook-mermaid" | Diagram rendering preprocessor |
| [output.html.fold] | enable | true | Enable sidebar section folding |
| [output.html.fold] | level | 1 | Fold at level 1 by default |
The generated file is written to $BOOK_DIR/book.toml using a heredoc construct. For detailed information on this process, see Configuration Generation.
Sources: build-docs.sh:84-103
Step 4: Generate SUMMARY.md Table of Contents
The script automatically discovers the file structure and generates SUMMARY.md, which defines the book's navigation hierarchy. This is a critical step that bridges the flat file structure to mdBook's hierarchical navigation.
SUMMARY.md Generation Algorithm
graph TD
Start["Start:\nsrc/SUMMARY.md creation"]
Header["Write header:\n'# Summary'"]
FindFirst["Find first page:\nls $WIKI_DIR/*.md /head -1"]
CheckFirst{"First page exists?"}
AddIntro["Extract title from first line Add as introduction link"]
LoopMain["Loop through all *.md in $WIKI_DIR"]
CheckSkip{"Is this first page?"}
SkipIt["Continue to next file"]
ExtractTitle["Extract title: head -1/ sed 's/^# //'"]
GetSectionNum["Extract section number:\ngrep -oE '^[0-9]+'"]
CheckSubsections{"section-$num\ndirectory\nexists?"}
AddMain["Add main section entry:\n- [title](filename)"]
EndLoop{"More\nfiles?"}
AddSection["Add section header:\n# title\n- [title](filename)"]
LoopSubs["Loop through\nsection-N/*.md files"]
AddSub["Add subsection:\n - [subtitle](section-N/file)"]
CheckMoreSubs{"More\nsubsections?"}
Done["SUMMARY.md complete"]
Start --> Header
Header --> FindFirst
FindFirst --> CheckFirst
CheckFirst -->|Yes| AddIntro
CheckFirst -->|No| LoopMain
AddIntro --> LoopMain
LoopMain --> CheckSkip
CheckSkip -->|Yes| SkipIt
CheckSkip -->|No| ExtractTitle
SkipIt --> EndLoop
ExtractTitle --> GetSectionNum
GetSectionNum --> CheckSubsections
CheckSubsections -->|Yes| AddSection
CheckSubsections -->|No| AddMain
AddMain --> EndLoop
AddSection --> LoopSubs
LoopSubs --> AddSub
AddSub --> CheckMoreSubs
CheckMoreSubs -->|Yes| LoopSubs
CheckMoreSubs -->|No| EndLoop
EndLoop -->|Yes| LoopMain
EndLoop -->|No| Done
The script counts the generated entries using grep -c '\[' src/SUMMARY.md and logs the total. For a detailed explanation of the SUMMARY.md format and generation logic, see SUMMARY.md Generation.
Sources: build-docs.sh:108-161
Step 5: Copy Markdown Files to Book Source
The script copies all Markdown files from the wiki directory to the mdBook source directory. This includes both top-level files and subsection directories.
The copy operation uses recursive mode to preserve the directory structure:
- Command: `cp -r "$WIKI_DIR"/* src/`
- Source: /workspace/wiki/ (contains enhanced Markdown with diagrams)
- Destination: $BOOK_DIR/src/ (mdBook source directory)
All files retain their names and relative paths, ensuring SUMMARY.md references remain valid.
Sources: build-docs.sh:163-166
Step 6: Install mdbook-mermaid Assets
The mdbook-mermaid preprocessor requires JavaScript and CSS assets to render Mermaid diagrams in the browser. The installation command copies these assets into the mdBook theme directory.
Asset Installation Process
graph LR
Cmd["mdbook-mermaid install $BOOK_DIR"]
Detect["Detect book.toml location"]
ThemeDir["Create/locate theme directory\n$BOOK_DIR/theme/"]
CopyJS["Copy mermaid.min.js\n(Mermaid rendering library)"]
CopyInit["Copy mermaid-init.js\n(Initialization code)"]
CopyCSS["Copy mermaid.css\n(Diagram styling)"]
Complete["Assets installed\nReady for diagram rendering"]
Cmd --> Detect
Detect --> ThemeDir
ThemeDir --> CopyJS
ThemeDir --> CopyInit
ThemeDir --> CopyCSS
CopyJS --> Complete
CopyInit --> Complete
CopyCSS --> Complete
After installation, mdBook will automatically include these assets in all generated HTML pages, enabling client-side Mermaid diagram rendering.
Sources: build-docs.sh:168-171 README.md:138-144
Step 7: Build HTML Documentation
The core build operation is performed by the mdbook build command, which reads the configuration, processes all Markdown files, and generates the complete HTML site.
mdBook Build Pipeline
graph TB
Start["mdbook build command"]
ReadConfig["Read book.toml\nLoad configuration"]
ParseSummary["Parse src/SUMMARY.md\nBuild navigation tree"]
ReadMarkdown["Read all .md files\nfrom src/ directory"]
PreprocessMermaid["Run mermaid preprocessor\nDetect ```mermaid blocks"]
ConvertHTML["Convert Markdown → HTML\nApply rust theme"]
GeneratePages["Generate HTML pages\nOne per Markdown file"]
AddNav["Add navigation sidebar\nBased on SUMMARY.md"]
AddSearch["Generate search index\nSearchable content"]
AddAssets["Include CSS/JS assets\nTheme + Mermaid libraries"]
WriteOutput["Write to book/ directory\nComplete static site"]
Done["Build complete:\nbook/index.html ready"]
Start --> ReadConfig
ReadConfig --> ParseSummary
ParseSummary --> ReadMarkdown
ReadMarkdown --> PreprocessMermaid
PreprocessMermaid --> ConvertHTML
ConvertHTML --> GeneratePages
GeneratePages --> AddNav
GeneratePages --> AddSearch
GeneratePages --> AddAssets
AddNav --> WriteOutput
AddSearch --> WriteOutput
AddAssets --> WriteOutput
WriteOutput --> Done
The build command produces the following outputs in $BOOK_DIR/book/:
- index.html - Main entry point
- Individual HTML pages for each Markdown file
- searchindex.js - Full-text search index
- searchindex.json - Search metadata
- CSS and JavaScript assets (theme + Mermaid)
- Font files and icons
Sources: build-docs.sh:173-176 README.md:93-99
Step 8: Copy Outputs to Volume Mount
The final step copies all generated artifacts to the /output directory, which is typically mounted as a Docker volume. This makes the results accessible outside the container.
| Source | Destination | Contents |
|---|---|---|
| $BOOK_DIR/book/ | /output/book/ | Complete HTML documentation site |
| $WIKI_DIR/* | /output/markdown/ | Enhanced Markdown source files |
| $BOOK_DIR/book.toml | /output/book.toml | Configuration file (for reference) |
The script outputs a summary showing the locations of all artifacts:
Outputs:
- HTML book: /output/book/
- Markdown files: /output/markdown/
- Book config: /output/book.toml
The HTML book in /output/book/ is a self-contained static site that can be:
- Served with any web server (e.g., `python3 -m http.server`)
- Deployed to static hosting (GitHub Pages, Netlify, etc.)
- Opened directly in a browser (`file://` URLs)
Sources: build-docs.sh:178-205
Conditional Execution: Markdown-Only Mode
Phase 3 can be completely bypassed by setting the MARKDOWN_ONLY environment variable to "true". This provides a fast-path execution mode useful for debugging content extraction and diagram placement without the overhead of building HTML.
Execution Path Decision
graph TD
Start["Phase 2 Complete"]
CheckVar{"MARKDOWN_ONLY\n== 'true'?"}
FullPath["Execute Full Phase 3:\n- Initialize mdBook\n- Generate configs\n- Build HTML\n- Copy all outputs"]
FastPath["Execute Fast Path:\n- mkdir -p /output/markdown\n- cp -r $WIKI_DIR/* /output/markdown/\n- Exit immediately"]
FullOutput["Outputs:\n- /output/book/ (HTML)\n- /output/markdown/ (source)\n- /output/book.toml (config)"]
FastOutput["Outputs:\n- /output/markdown/ (source only)"]
Start --> CheckVar
CheckVar -->|false| FullPath
CheckVar -->|true| FastPath
FullPath --> FullOutput
FastPath --> FastOutput
When MARKDOWN_ONLY="true":
- Steps 2-7 are skipped entirely
- Only the Markdown files are copied to /output/markdown/
- Build time is significantly reduced (seconds vs. minutes)
- Useful for iterating on diagram placement logic
- No HTML output is generated
Sources: build-docs.sh:60-76 README.md:55-75
Error Handling and Requirements
Phase 3 executes with set -e enabled, causing the script to exit immediately if any command fails. This ensures partial builds are not created.
Potential Failure Points
| Step | Command | Failure Condition | Impact |
|---|---|---|---|
| Step 2 | mkdir -p "$BOOK_DIR" | Insufficient permissions | Cannot create workspace |
| Step 3 | cat > book.toml | Write permission denied | No configuration file |
| Step 4 | File discovery loop | No .md files found | Empty SUMMARY.md |
| Step 5 | cp -r "$WIKI_DIR"/* src/ | Source directory empty | No content to build |
| Step 6 | mdbook-mermaid install | Binary not in PATH | Diagrams won't render |
| Step 7 | mdbook build | Invalid Markdown syntax | Build fails with error |
| Step 8 | cp -r book "$OUTPUT_DIR/" | Insufficient disk space | Incomplete output |
The multi-stage Docker build ensures both mdbook and mdbook-mermaid binaries are present in the final image. See Docker Multi-Stage Build for details on how these tools are compiled and installed.
Sources: build-docs.sh:1-2 build-docs.sh:168-176
Integration with Other Components
Phase 3 integrates with multiple system components:
Component Integration Map
graph TB
EnvVars["Environment Variables\nREPO\nBOOK_TITLE\nBOOK_AUTHORS\nGIT_REPO_URL\nMARKDOWN_ONLY"]
BuildScript["build-docs.sh\n(orchestrator)"]
ScraperOutput["$WIKI_DIR/\n(Phase 1 & 2 output)"]
BookToml["book.toml\n(generated config)"]
SummaryMd["src/SUMMARY.md\n(generated TOC)"]
MdBookBinary["mdbook binary\n(Rust tool)"]
MermaidBinary["mdbook-mermaid binary\n(Rust preprocessor)"]
HTMLOutput["book/ directory\n(final HTML site)"]
VolumeMount["/output volume\n(Docker mount point)"]
EnvVars --> BuildScript
ScraperOutput --> BuildScript
BuildScript --> BookToml
BuildScript --> SummaryMd
BuildScript --> MdBookBinary
MdBookBinary --> MermaidBinary
BookToml --> MdBookBinary
SummaryMd --> MdBookBinary
ScraperOutput --> MdBookBinary
MdBookBinary --> HTMLOutput
BuildScript --> VolumeMount
HTMLOutput --> VolumeMount
The phase acts as the bridge between the Python-based content extraction layers and the final deliverable, using Rust-based tools for the HTML generation. For details on how environment variables are processed, see Configuration Reference. For the complete system architecture, see System Architecture.
Sources: build-docs.sh:21-53 build-docs.sh:78-205
Configuration Generation
Relevant source files
Purpose and Scope
This document details how the book.toml configuration file is dynamically generated during Phase 3 of the build process. The configuration generation occurs in build-docs.sh:78-103 and uses environment variables, Git repository metadata, and computed defaults to produce a complete mdBook configuration.
For information about how the generated configuration is used by mdBook to build the documentation site, see mdBook Integration. For details about the SUMMARY.md generation that happens after configuration generation, see SUMMARY.md Generation.
Sources: build-docs.sh:1-206
Configuration Flow Overview
The configuration generation process transforms user-provided environment variables and auto-detected repository information into a complete book.toml file that controls all aspects of the mdBook build.
Configuration Data Flow
flowchart TB
subgraph "Input Sources"
ENV_REPO["Environment Variable:\nREPO"]
ENV_TITLE["Environment Variable:\nBOOK_TITLE"]
ENV_AUTHORS["Environment Variable:\nBOOK_AUTHORS"]
ENV_URL["Environment Variable:\nGIT_REPO_URL"]
GIT["Git Remote:\norigin URL"]
end
subgraph "Processing in build-docs.sh"
AUTO_DETECT["Auto-Detection Logic\n[8-19]"]
VAR_INIT["Variable Initialization\n[21-26]"]
VALIDATION["Repository Validation\n[32-37]"]
DEFAULTS["Default Computation\n[39-45]"]
end
subgraph "Computed Values"
FINAL_REPO["REPO:\nowner/repo"]
FINAL_TITLE["BOOK_TITLE:\nDocumentation"]
FINAL_AUTHORS["BOOK_AUTHORS:\nrepo owner"]
FINAL_URL["GIT_REPO_URL:\nhttps://github.com/owner/repo"]
REPO_PARTS["REPO_OWNER, REPO_NAME"]
end
subgraph "book.toml Generation"
BOOK_SECTION["[book] section\n[86-91]"]
HTML_SECTION["[output.html] section\n[93-95]"]
PREPROC_SECTION["[preprocessor.mermaid]\n[97-98]"]
FOLD_SECTION["[output.html.fold]\n[100-102]"]
end
OUTPUT["book.toml file\nwritten to /workspace/book/"]
GIT -->|if REPO not set| AUTO_DETECT
ENV_REPO -->|or explicit| AUTO_DETECT
AUTO_DETECT --> VAR_INIT
ENV_TITLE --> VAR_INIT
ENV_AUTHORS --> VAR_INIT
ENV_URL --> VAR_INIT
VAR_INIT --> VALIDATION
VALIDATION --> DEFAULTS
VALIDATION --> REPO_PARTS
DEFAULTS --> FINAL_REPO
DEFAULTS --> FINAL_TITLE
DEFAULTS --> FINAL_AUTHORS
DEFAULTS --> FINAL_URL
REPO_PARTS --> FINAL_AUTHORS
REPO_PARTS --> FINAL_URL
FINAL_TITLE --> BOOK_SECTION
FINAL_AUTHORS --> BOOK_SECTION
FINAL_URL --> HTML_SECTION
BOOK_SECTION --> OUTPUT
HTML_SECTION --> OUTPUT
PREPROC_SECTION --> OUTPUT
FOLD_SECTION --> OUTPUT
Sources: build-docs.sh:8-103
Environment Variables and Defaults
The configuration generation system processes five primary environment variables, each with intelligent defaults computed from the repository context.
Environment Variable Processing
| Variable | Purpose | Default Value | Computation Logic |
|---|---|---|---|
| REPO | Repository identifier (owner/repo) | Auto-detected from Git | Extracted from git config remote.origin.url build-docs.sh:8-19 |
| BOOK_TITLE | Title displayed in documentation | "Documentation" | Simple string default build-docs.sh23 |
| BOOK_AUTHORS | Author name(s) in metadata | Repository owner | Extracted from REPO using cut -d'/' -f1 build-docs.sh:40-44 |
| GIT_REPO_URL | Link to source repository | https://github.com/owner/repo | Constructed from REPO build-docs.sh45 |
| MARKDOWN_ONLY | Skip mdBook build | "false" | Boolean flag build-docs.sh26 |
Sources: build-docs.sh:21-45
Variable Initialization Code Structure
Sources: build-docs.sh:21-45
Auto-Detection Logic
When the REPO environment variable is not explicitly provided, the system attempts to auto-detect it from the Git repository configuration. This enables zero-configuration usage in CI/CD environments.
Git Remote URL Extraction
The auto-detection logic in build-docs.sh:8-19 performs the following steps:
- Check if running in a Git repository: Uses `git rev-parse --git-dir` to verify
- Extract remote URL: Retrieves `remote.origin.url` from Git config
- Parse GitHub URL: Uses sed regex to extract owner/repo from various URL formats
Supported GitHub URL Formats:
- HTTPS: `https://github.com/owner/repo.git`
- SSH: `git@github.com:owner/repo.git`
- Without .git suffix: `https://github.com/owner/repo`
Auto-Detection Algorithm
Sources: build-docs.sh:8-19
Regex Pattern Details
The sed command at build-docs.sh16 uses this regex pattern:
`s#.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*#\1#`
This pattern:
- Matches github.com followed by either `:` (SSH) or `/` (HTTPS)
- Captures the owner/repo portion: `([^/]+/[^/\.]+)`
- Optionally matches the `.git` suffix: `(\.git)?`
- Extracts only the owner/repo portion as the replacement
Sources: build-docs.sh16
book.toml Structure
The generated book.toml file contains four configuration sections that control mdBook's behavior. The file is created at build-docs.sh:84-103 using a here-document.
Configuration Sections
Sources: build-docs.sh:84-103
Section Details
[book] Section
Located at build-docs.sh:86-91, this section defines core book metadata:
| Field | Value | Source |
|---|---|---|
| title | Value of $BOOK_TITLE | Environment variable or default "Documentation" |
| authors | Array with $BOOK_AUTHORS | Computed from repository owner |
| language | "en" | Hardcoded English |
| multilingual | false | Hardcoded |
| src | "src" | mdBook convention for source directory |
Sources: build-docs.sh:86-91
[output.html] Section
Located at build-docs.sh:93-95, this section configures HTML output:
| Field | Value | Purpose |
|---|---|---|
| default-theme | "rust" | Uses mdBook's Rust theme for consistent styling |
| git-repository-url | Value of $GIT_REPO_URL | Creates "Edit this file on GitHub" links in the UI |
Sources: build-docs.sh:93-95
[preprocessor.mermaid] Section
Located at build-docs.sh:97-98, this section enables Mermaid diagram rendering:
| Field | Value | Purpose |
|---|---|---|
| command | "mdbook-mermaid" | Specifies the preprocessor binary to invoke |
This preprocessor runs before mdBook processes the Markdown files, transforming Mermaid code blocks into rendered diagrams. The preprocessor binary is installed during the Docker build, and its assets are installed at build-docs.sh:170-171.
Sources: build-docs.sh:97-98
[output.html.fold] Section
Located at build-docs.sh:100-102, this section configures navigation behavior:
| Field | Value | Purpose |
|---|---|---|
| enable | true | Enables collapsible sections in the navigation sidebar |
| level | 1 | Folds sections at depth 1, keeping top-level sections visible |
Sources: build-docs.sh:100-102
Configuration Generation Process
The complete configuration generation process occurs within the orchestration logic of build-docs.sh. This diagram maps the process to specific code locations.
Code Execution Sequence
sequenceDiagram
participant ENV as Environment Variables
participant AUTO as Auto-Detection [8-19]
participant INIT as Initialization [21-26]
participant VAL as Validation [32-37]
participant PARSE as Parsing [39-41]
participant DEF as Defaults [44-45]
participant LOG as Logging [47-53]
participant GEN as Generation [84-103]
participant FILE as book.toml
ENV->>AUTO: Check if REPO set
AUTO->>AUTO: Try git config
AUTO->>INIT: Return REPO (or empty)
INIT->>INIT: Set variable defaults
INIT->>VAL: Pass to validation
VAL->>VAL: Check REPO not empty
alt REPO is empty
VAL-->>ENV: Exit with error
end
VAL->>PARSE: Continue with valid REPO
PARSE->>PARSE: Extract REPO_OWNER
PARSE->>PARSE: Extract REPO_NAME
PARSE->>DEF: Pass extracted parts
DEF->>DEF: Compute BOOK_AUTHORS
DEF->>DEF: Compute GIT_REPO_URL
DEF->>LOG: Pass final config
LOG->>LOG: Echo configuration
Note over LOG,GEN: Scraper runs here [58]
LOG->>GEN: Proceed to book.toml
GEN->>GEN: Create [book] section
GEN->>GEN: Create [output.html]
GEN->>GEN: Create [preprocessor.mermaid]
GEN->>GEN: Create [output.html.fold]
GEN->>FILE: Write complete file
Sources: build-docs.sh:8-103
Template Interpolation
The book.toml file is generated using shell variable interpolation within a here-document (heredoc). This technique allows dynamic insertion of computed values into the template.
Here-Document Structure
The generation code at build-docs.sh:85-103 uses this structure:
cat > book.toml <<EOF
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]
...
EOF
The <<EOF syntax creates a here-document that:
- Allows multi-line content with preserved formatting
- Performs shell variable expansion (using `$VARIABLE` syntax)
- Writes the result to `book.toml` via redirection
Variable Interpolation Points
| Line | Variable | Context |
|---|---|---|
| 87 | $BOOK_TITLE | Used in title = "$BOOK_TITLE" |
| 88 | $BOOK_AUTHORS | Used in authors = ["$BOOK_AUTHORS"] |
| 95 | $GIT_REPO_URL | Used in git-repository-url = "$GIT_REPO_URL" |
The quotes around variable names ensure that values containing spaces are properly handled in the TOML format.
Sources: build-docs.sh:85-103
Output Location and Usage
The generated book.toml file is written to /workspace/book/book.toml within the Docker container. This location is significant because:
- Working Directory: The script changes to `$BOOK_DIR` at build-docs.sh82
- mdBook Convention: mdBook expects `book.toml` in the project root
- Build Process: The `mdbook build` command at build-docs.sh176 reads this file
File Lifecycle
The file is also copied to the output directory at build-docs.sh191 for user reference and debugging purposes.
Sources: build-docs.sh:82-191
Error Handling
Configuration generation includes validation to ensure required values are present before proceeding with the build.
Repository Validation
The validation logic at build-docs.sh:32-37 checks that REPO has a value:
stateDiagram-v2
[*] --> AutoDetect : Script starts
AutoDetect --> CheckEmpty : Lines 8-19
CheckEmpty --> Error : REPO still empty
CheckEmpty --> ParseRepo : REPO has value
Error --> [*] : Exit code 1
ParseRepo --> ComputeDefaults : Lines 39-45
ComputeDefaults --> LogConfig : Lines 47-53
LogConfig --> GenerateConfig : Lines 84-103
GenerateConfig --> [*] : Continue build
This validation occurs after auto-detection attempts, so the error message guides users to either:
- Set the REPO environment variable explicitly, or
- Run the command from within a Git repository with a GitHub remote configured
Validation Timing
Sources: build-docs.sh:32-37
Configuration Output Example
Based on the generation logic, here is an example of a complete generated book.toml file when processing the repository jzombie/deepwiki-to-mdbook:
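Assuming BOOK_TITLE, BOOK_AUTHORS, and GIT_REPO_URL are left to their computed defaults, the generated file would look roughly like this:

```toml
[book]
title = "Documentation"
authors = ["jzombie"]
language = "en"
multilingual = false
src = "src"

[output.html]
default-theme = "rust"
git-repository-url = "https://github.com/jzombie/deepwiki-to-mdbook"

[preprocessor.mermaid]
command = "mdbook-mermaid"

[output.html.fold]
enable = true
level = 1
```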
This example demonstrates:
- Default title when BOOK_TITLE is not set
- Author extracted from repository owner
- Computed Git repository URL
- All static configuration values
Sources: build-docs.sh:84-103
Integration with mdBook Build
The generated configuration is consumed by mdBook during the build process. The integration points are:
- `mdbook-mermaid install` at build-docs.sh171 reads `[preprocessor.mermaid]`
- `mdbook build` at build-docs.sh176 reads all sections
- HTML output uses `[output.html]` settings for theming and repository links
- Navigation rendering uses `[output.html.fold]` for sidebar behavior
Configuration-Driven Features
| Configuration | Visible Result |
|---|---|
| git-repository-url | "Suggest an edit" button in top-right of each page |
| default-theme = "rust" | Consistent color scheme and typography matching Rust documentation |
| [preprocessor.mermaid] | Mermaid code blocks rendered as interactive diagrams |
| enable = true (folding) | Collapsible sections in left sidebar navigation |
For details on how mdBook processes this configuration to build the final documentation site, see mdBook Integration.
Sources: build-docs.sh:84-176
SUMMARY.md Generation
Relevant source files
Purpose and Scope
This document explains how the SUMMARY.md file is dynamically generated from the scraped markdown content structure. The SUMMARY.md file serves as mdBook's table of contents, defining the navigation structure and page hierarchy for the generated HTML documentation.
For information about how the markdown files are initially organized during scraping, see Wiki Structure Discovery. For details about the overall mdBook build configuration, see Configuration Generation.
SUMMARY.md in mdBook
The SUMMARY.md file is mdBook's primary navigation document. It defines:
- The order of pages in the documentation
- The hierarchical structure (chapters and sub-chapters)
- The titles displayed in the navigation sidebar
- Which markdown files map to which sections
mdBook parses SUMMARY.md to construct the entire book structure. Pages not listed in SUMMARY.md will not be included in the generated documentation.
Sources: build-docs.sh:108-161
Generation Process Overview
The SUMMARY.md generation occurs in Step 3 of the build pipeline, after markdown extraction is complete but before the mdBook build begins. The generation algorithm automatically discovers the file structure and constructs an appropriate table of contents.
Diagram: SUMMARY.md Generation Workflow
flowchart TD
Start["Start: SUMMARY.md Generation\n(build-docs.sh:110)"]
Init["Initialize Output\nEcho '# Summary'"]
FindFirst["Find First Page\nls $WIKI_DIR/*.md /head -1"]
ExtractFirst["Extract Title head -1 $file/ sed 's/^# //'"]
AddFirst["Add as Introduction\n[title](filename)"]
LoopStart["Iterate: $WIKI_DIR/*.md"]
Skip{"Is first_page?"}
ExtractTitle["Extract Title from First Line\nhead -1 $file /sed 's/^# //'"]
GetSectionNum["Extract section_num grep -oE '^[0-9]+'"]
CheckDir{"Directory Exists? section-$section_num/"}
MainWithSub["Output Section Header # $title - [$title] $filename"]
IterateSub["Iterate: section-$section_num/*.md"]
AddSub["Add Subsection - [$subtitle] section-N/$subfilename"]
Standalone["Output Standalone - [$title] $filename"]
LoopEnd{"More Files?"}
Complete["Write to src/SUMMARY.md"]
End["End: SUMMARY.md Generated"]
Start --> Init
Init --> FindFirst
FindFirst --> ExtractFirst
ExtractFirst --> AddFirst
AddFirst --> LoopStart
LoopStart --> Skip
Skip -->|Yes|LoopEnd
Skip -->|No|ExtractTitle
ExtractTitle --> GetSectionNum
GetSectionNum --> CheckDir
CheckDir -->|Yes|MainWithSub
MainWithSub --> IterateSub
IterateSub --> AddSub
AddSub --> LoopEnd
CheckDir -->|No|Standalone
Standalone --> LoopEnd
LoopEnd -->|Yes|LoopStart
LoopEnd -->|No| Complete
Complete --> End
Sources: build-docs.sh:108-161
Algorithm Components
Step 1: Introduction Page Selection
The first markdown file in the wiki directory is designated as the introduction page. This ensures the documentation has a clear entry point.
Diagram: First Page Processing Pipeline
flowchart LR
ListFiles["ls $WIKI_DIR/*.md"]
TakeFirst["head -1"]
GetBasename["basename"]
StoreName["first_page variable"]
ExtractTitle["head -1 $WIKI_DIR/$first_page"]
RemoveHash["sed 's/^# //'"]
StoreTitle["title variable"]
WriteEntry["echo '[${title}]($first_page)'"]
ToSummary[">> src/SUMMARY.md"]
ListFiles --> TakeFirst
TakeFirst --> GetBasename
GetBasename --> StoreName
StoreName --> ExtractTitle
ExtractTitle --> RemoveHash
RemoveHash --> StoreTitle
StoreTitle --> WriteEntry
WriteEntry --> ToSummary
The implementation uses shell command chaining:
ls "$WIKI_DIR"/*.md 2>/dev/null | head -1 | xargs basenameextracts the first filenamehead -1 "$WIKI_DIR/$first_page" | sed 's/^# //'extracts the title by removing the leading#from the first line
Sources: build-docs.sh:118-123
Step 2: Main Page Iteration
All markdown files in the root wiki directory are processed sequentially. Each file represents either a standalone page or a main section with subsections.
| Processing Step | Command/Logic | Purpose |
|---|---|---|
| File Discovery | for file in "$WIKI_DIR"/*.md | Iterate all root-level markdown files |
| File Check | [ -f "$file" ] | Verify file existence |
| Basename Extraction | basename "$file" | Get filename without path |
| First Page Skip | [ "$filename" = "$first_page" ] | Avoid duplicate introduction |
| Title Extraction | `head -1 "$file" | sed 's/^# //'` |
Sources: build-docs.sh:126-135
Step 3: Subsection Detection
The algorithm determines whether a main page has subsections by:
- Extracting the numeric prefix from the filename (e.g., `5` from `5-component-reference.md`)
- Checking if a corresponding `section-N/` directory exists
- If found, treating the page as a main section with nested subsections
flowchart TD
MainPage["Main Page File\ne.g., 5-component-reference.md"]
ExtractNum["Extract section_num\necho $filename /grep -oE '^[0-9]+'"]
HasNum{"Numeric Prefix?"}
BuildPath["Construct section_dir $WIKI_DIR/section-$section_num"]
CheckDir["Check Directory [ -d $section_dir ]"]
DirExists{"Directory Exists?"}
OutputHeader["Output Section Header # $title"]
OutputMain["Output Main Link - [$title] $filename"]
IterateSubs["for subfile in $section_dir/*.md"]
ExtractSubTitle["head -1 $subfile/ sed 's/^# //'"]
OutputSub["Output Subsection\n - [$subtitle](section-N/$subfilename)"]
OutputStandalone["Output Standalone\n- [$title]($filename)"]
MainPage --> ExtractNum
ExtractNum --> HasNum
HasNum -->|Yes| BuildPath
HasNum -->|No| OutputStandalone
BuildPath --> CheckDir
CheckDir --> DirExists
DirExists -->|Yes| OutputHeader
OutputHeader --> OutputMain
OutputMain --> IterateSubs
IterateSubs --> ExtractSubTitle
ExtractSubTitle --> OutputSub
DirExists -->|No| OutputStandalone
Diagram: Subsection Detection and Nesting Logic
Sources: build-docs.sh:137-158
Step 4: Subsection Processing
When a section-N/ directory is detected, all markdown files within it are processed as subsections:
Key aspects:
- Subsections use two-space indentation: `  - [$subtitle](section-N/$subfilename)`
- File paths include the section-N/ directory prefix
- Each subsection's title is extracted using the same pattern as main pages
Sources: build-docs.sh:147-152
File Structure Conventions
The generation algorithm depends on the file structure created during markdown extraction (see Wiki Structure Discovery):
Diagram: File Structure Conventions for SUMMARY.md Generation
| Pattern | Location | SUMMARY.md Output |
|---|---|---|
| *.md | Root directory | Main pages |
| N-*.md | Root directory | Main section (if section-N/ exists) |
| *.md | section-N/ directory | Subsections (indented under section N) |
Sources: build-docs.sh:126-158
Title Extraction Method
All page titles are extracted using a consistent pattern:
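```sh
head -1 "$file" | sed 's/^# //'
```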
This assumes that every markdown file begins with a level-1 heading (# Title). The sed command removes the # prefix, leaving only the title text.
Extraction Pipeline:
| Command | Purpose | Example Input | Example Output |
|---|---|---|---|
head -1 "$file" | Get first line | # Component Reference | # Component Reference |
sed 's/^# //' | Remove heading syntax | # Component Reference | Component Reference |
Sources: build-docs.sh120 build-docs.sh134 build-docs.sh150
Output Format
The generated SUMMARY.md follows mdBook's syntax:
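A representative example of the generated file (page names here are illustrative):

```markdown
# Summary

[Overview](1-overview.md)

- [Quick Start](2-quick-start.md)

# System Architecture

- [System Architecture](3-system-architecture.md)
  - [Three-Phase Pipeline](section-3/3-1-three-phase-pipeline.md)
  - [Docker Multi-Stage Build](section-3/3-2-docker-multi-stage-build.md)
```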
Format Rules:
| Element | Syntax | Purpose |
|---|---|---|
| Header | # Summary | Required mdBook header |
| Introduction | `[Title](file.md)` | First page (no bullet) |
| Main Page | `- [Title](file.md)` | Top-level navigation item |
| Section Header | `# Section Name` | Visual grouping in sidebar |
| Subsection | `  - [Title](section-N/file.md)` | Nested under main section (2-space indent) |
Sources: build-docs.sh:113-159
Implementation Code Mapping
The following table maps the algorithm steps to specific code locations:
Diagram: Code Location Mapping for SUMMARY.md Generation
| Variable Name | Purpose | Example Value |
|---|---|---|
| WIKI_DIR | Source directory for markdown files | /workspace/wiki |
| first_page | First markdown file (introduction) | 1-overview.md |
| section_num | Numeric prefix of main page | 5 (from 5-component-reference.md) |
| section_dir | Subsection directory path | /workspace/wiki/section-5 |
| title | Extracted page title | Component Reference |
| subtitle | Extracted subsection title | build-docs.sh Orchestrator |
Sources: build-docs.sh:108-161
Generation Statistics
After generation completes, the script logs the number of entries created:
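In shell terms this is roughly (variable name and message wording are illustrative):

```sh
ENTRY_COUNT=$(grep -c '\[' src/SUMMARY.md)   # count link entries in the generated TOC
echo "Generated SUMMARY.md with $ENTRY_COUNT entries"
```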
Output Structure
Relevant source files
This page documents the structure and contents of the /output directory produced by the DeepWiki-to-mdBook converter. The output structure varies depending on whether the system runs in full build mode or markdown-only mode. For information about enabling markdown-only mode, see Markdown-Only Mode.
Output Directory Overview
The system writes all artifacts to the /output directory, which is typically mounted as a Docker volume. The contents of this directory depend on the MARKDOWN_ONLY environment variable:
Output Mode Decision Logic
graph TD
Start["build-docs.sh execution"]
CheckMode{MARKDOWN_ONLY\nenvironment variable}
FullBuild["Full Build Path"]
MarkdownOnly["Markdown-Only Path"]
OutputBook["/output/book/\nHTML documentation"]
OutputMarkdown["/output/markdown/\nSource .md files"]
OutputToml["/output/book.toml\nConfiguration"]
Start --> CheckMode
CheckMode -->|false default| FullBuild
CheckMode -->|true| MarkdownOnly
FullBuild --> OutputBook
FullBuild --> OutputMarkdown
FullBuild --> OutputToml
MarkdownOnly --> OutputMarkdown
Sources: build-docs.sh26 build-docs.sh:60-76
Full Build Mode Output
When MARKDOWN_ONLY is not set or is false, the system produces three distinct outputs:
graph TD
Output["/output/"]
Book["book/\nComplete HTML site"]
Markdown["markdown/\nSource files"]
BookToml["book.toml\nConfiguration"]
BookIndex["index.html"]
BookCSS["css/"]
BookJS["FontAwesome/"]
BookSearchJS["searchindex.js"]
BookMermaid["mermaid-init.js"]
BookPages["*.html pages"]
MarkdownRoot["*.md files\n(main pages)"]
MarkdownSections["section-N/\n(subsection dirs)"]
Output --> Book
Output --> Markdown
Output --> BookToml
Book --> BookIndex
Book --> BookCSS
Book --> BookJS
Book --> BookSearchJS
Book --> BookMermaid
Book --> BookPages
Markdown --> MarkdownRoot
Markdown --> MarkdownSections
Directory Structure
Full Build Output Structure
Sources: build-docs.sh:178-192 README.md:92-104
/output/book/ Directory
The book/ directory contains the complete HTML documentation site generated by mdBook. This is a self-contained static website that can be hosted on any web server or opened directly in a browser.
| Component | Description | Generated By |
|---|---|---|
| index.html | Main entry point for the documentation | mdBook |
| *.html | Individual page files corresponding to each .md source | mdBook |
| css/ | Styling for the rust theme | mdBook |
| FontAwesome/ | Icon font assets | mdBook |
| searchindex.js | Search index for site-wide search functionality | mdBook |
| mermaid.min.js | Mermaid diagram rendering library | mdbook-mermaid |
| mermaid-init.js | Mermaid initialization script | mdbook-mermaid |
The HTML site includes:
- Responsive navigation sidebar with hierarchical structure
- Full-text search functionality
- Syntax highlighting for code blocks
- Working Mermaid diagram rendering
- "Edit this page" links pointing to
GIT_REPO_URL - Collapsible sections in the navigation
Sources: build-docs.sh:173-176 build-docs.sh:94-95 README.md:95-99
/output/markdown/ Directory
The markdown/ directory contains the source Markdown files extracted from DeepWiki and enhanced with Mermaid diagrams. These files follow a specific naming convention and organizational structure.
File Naming Convention:
<page-number>-<page-title-slug>.md
Examples from actual output:
- 1-overview.md
- 2-1-workspace-and-crates.md
- 3-2-sql-parser.md
Subsection Organization:
Pages with subsections have their children organized into directories:
section-N/
N-1-first-subsection.md
N-2-second-subsection.md
...
For example, if page 4-architecture.md has subsections, they appear in:
section-4/
4-1-overview.md
4-2-components.md
This organization is reflected in the mdBook SUMMARY.md generation logic at build-docs.sh:125-159
Sources: README.md:100-119 build-docs.sh:163-166 build-docs.sh:186-188
/output/book.toml File
The book.toml file is a copy of the mdBook configuration used to generate the HTML site. It contains:
This file can be used to:
- Understand the configuration used for the build
- Regenerate the book with different settings
- Debug mdBook configuration issues
Sources: build-docs.sh:84-103 build-docs.sh:190-191
Markdown-Only Mode Output
When MARKDOWN_ONLY=true, the system produces only the /output/markdown/ directory. This mode skips the mdBook build phase entirely.
Markdown-Only Mode Data Flow
graph LR
Scraper["deepwiki-scraper.py"]
TempDir["/workspace/wiki/\nTemporary directory"]
OutputMarkdown["/output/markdown/\nFinal output"]
Scraper -->|Writes enhanced .md files| TempDir
TempDir -->|cp -r| OutputMarkdown
The output structure is identical to the markdown/ directory in full build mode, but the book/ and book.toml artifacts are not created.
Sources: build-docs.sh:60-76 README.md:106-113
Output Generation Process
The following diagram shows how each output artifact is generated during the build process:
Complete Output Generation Pipeline
graph TD
subgraph "Phase 1: Scraping"
Scraper["deepwiki-scraper.py"]
WikiDir["/workspace/wiki/"]
end
subgraph "Phase 2: Decision Point"
CheckMode{MARKDOWN_ONLY\ncheck}
end
subgraph "Phase 3: mdBook Build (conditional)"
BookInit["Initialize /workspace/book/"]
GenToml["Generate book.toml"]
GenSummary["Generate SUMMARY.md"]
CopyToSrc["cp wiki/* book/src/"]
MermaidInstall["mdbook-mermaid install"]
MdBookBuild["mdbook build"]
BuildOutput["/workspace/book/book/"]
end
subgraph "Phase 4: Copy to Output"
CopyBook["cp -r book /output/"]
CopyMarkdown["cp -r wiki /output/markdown/"]
CopyToml["cp book.toml /output/"]
end
Scraper -->|Writes to| WikiDir
WikiDir --> CheckMode
CheckMode -->|false| BookInit
CheckMode -->|true| CopyMarkdown
BookInit --> GenToml
GenToml --> GenSummary
GenSummary --> CopyToSrc
CopyToSrc --> MermaidInstall
MermaidInstall --> MdBookBuild
MdBookBuild --> BuildOutput
BuildOutput --> CopyBook
WikiDir --> CopyMarkdown
GenToml --> CopyToml
Sources: build-docs.sh:55-205
File Naming Examples
The following table shows actual filename patterns produced by the system:
| Pattern | Example | Description |
|---|---|---|
| N-title.md | 1-overview.md | Main page without subsections |
| N-M-title.md | 2-1-workspace-and-crates.md | Subsection file in root (legacy format) |
| section-N/N-M-title.md | section-4/4-1-logical-planning.md | Subsection file in section directory |
The system automatically detects which pages have subsections by examining the numeric prefix and checking for corresponding section-N/ directories during SUMMARY.md generation.
Sources: build-docs.sh:125-159 README.md:115-119
Volume Mounting
The /output directory is designed to be mounted as a Docker volume. The typical Docker run command specifies:
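A representative invocation (the image name is an assumption; only the REPO variable and the /output mount are documented here):

```sh
docker run --rm \
  -e REPO=owner/repo \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper            # image name/tag is an assumption
```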
This mounts the host's ./output directory to the container's /output directory, making all generated artifacts accessible on the host filesystem after the container exits.
Sources: README.md:34-38 README.md:83-86
Output Size Characteristics
The output directory typically contains:
- Markdown files : 10-500 KB per page depending on content length and diagram count
- HTML book : 5-50 MB total depending on page count and assets
- book.toml : ~500 bytes
For a typical repository with 20-30 documentation pages, expect:
- markdown/: 5-15 MB
- book/: 10-30 MB (includes all HTML, CSS, JS, and search index)
- book.toml: < 1 KB
The HTML book is significantly larger than the markdown source because it includes:
- Complete mdBook framework (CSS, JavaScript)
- Search index (searchindex.js)
- Mermaid rendering library (mermaid.min.js)
- Font assets (FontAwesome)
- Generated HTML for each page with navigation
Sources: build-docs.sh:178-205
Serving the Output
The HTML documentation in /output/book/ can be served using any static web server:
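For example, using the Python web server mentioned earlier:

```sh
cd output/book
python3 -m http.server 8000   # then browse to http://localhost:8000
```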
The markdown files in /output/markdown/ can be:
- Committed to a Git repository
- Used as input for other documentation systems
- Edited and re-processed through mdBook manually
- Served directly by markdown-aware platforms like GitHub
Sources: README.md:83-86 build-docs.sh:203-204
Advanced Topics
Relevant source files
This page covers advanced usage scenarios, implementation details, and power-user features of the DeepWiki-to-mdBook Converter. It provides deeper insight into optional features, debugging techniques, and the internal mechanisms that enable the system's flexibility and robustness.
For basic usage and configuration, see Quick Start and Configuration Reference. For architectural overview, see System Architecture. For component-level details, see Component Reference.
When to Use Advanced Features
The system provides several advanced features designed for specific scenarios:
Markdown-Only Mode : Extract markdown without building the HTML documentation. Useful for:
- Debugging diagram placement and content extraction
- Quick iteration during development
- Creating markdown archives for version control
- Feeding extracted content into other tools
Auto-Detection : Automatically determine repository metadata from Git remotes. Useful for:
- CI/CD pipeline integration with minimal configuration
- Running from within a repository checkout
- Reducing configuration boilerplate
Custom Configuration : Override default behaviors through environment variables. Useful for:
- Multi-repository documentation builds
- Custom branding and themes
- Specialized output requirements
Decision Flow for Build Modes
Sources: build-docs.sh:60-76 build-docs.sh:78-206 README.md:55-76
Debugging Strategies
Using Markdown-Only Mode for Fast Iteration
The MARKDOWN_ONLY environment variable bypasses the mdBook build phase, reducing build time from minutes to seconds. This is controlled by a simple conditional check in the orchestration script.
Workflow:
- Set MARKDOWN_ONLY=true in the Docker run command
- Script executes build-docs.sh:60-76 which skips Steps 2-6
- Only Phase 1 (scraping) and Phase 2 (diagram enhancement) execute
- Output written directly to /output/markdown/
Typical debugging session:
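A session of this kind might look like the following sketch (the image name is an assumption):

```sh
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper            # image name/tag is an assumption

ls output/markdown/           # inspect extracted pages and diagram placement
```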
The check at build-docs.sh61 determines whether to exit early:
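In outline, the branch behaves like this sketch (paraphrased from the flow described below, not a verbatim excerpt of the script):

```sh
if [ "$MARKDOWN_ONLY" = "true" ]; then
  mkdir -p "$OUTPUT_DIR/markdown"
  cp -r "$WIKI_DIR"/* "$OUTPUT_DIR/markdown/"
  exit 0                      # skip the mdBook build entirely
fi
```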
For detailed information about this mode, see Markdown-Only Mode.
Sources: build-docs.sh:60-76 build-docs.sh26 README.md:55-76
Inspecting Intermediate Outputs
The system uses a temporary directory workflow that can be examined for debugging:
| Stage | Location | Contents |
|---|---|---|
| During Phase 1 | /workspace/wiki/ (temp) | Raw markdown before diagram enhancement |
| During Phase 2 | /workspace/wiki/ (temp) | Markdown with injected diagrams |
| During Phase 3 | /workspace/book/src/ | Markdown copied for mdBook |
| Final Output | /output/markdown/ | Final enhanced markdown files |
The temporary directory pattern is implemented using Python's tempfile.TemporaryDirectory at tools/deepwiki-scraper.py808:
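In outline, the pattern looks like this sketch (not the scraper's exact code):

```python
import shutil
import tempfile
from pathlib import Path

def run_extraction(output_dir: str) -> None:
    """Sketch of the temp-directory workflow; names are illustrative."""
    with tempfile.TemporaryDirectory() as temp_dir:
        wiki_dir = Path(temp_dir)
        # Phase 1 scrapes pages into wiki_dir; Phase 2 rewrites the same files
        # with injected diagrams. Only after both phases succeed are the files
        # copied out; if an exception is raised first, the temp dir is removed.
        shutil.copytree(wiki_dir, Path(output_dir) / "markdown", dirs_exist_ok=True)
```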
This ensures atomic operations—if the script fails mid-process, partial outputs are automatically cleaned up.
Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:27-30
Diagram Placement Debugging
Diagram injection uses fuzzy matching with progressive chunk sizes. To debug placement:
- Check raw extraction count : Look for console output "Found N total diagrams"
- Check context extraction : Look for "Found N diagrams with context"
- Check matching : Look for "Enhanced X files with diagrams"
The matching algorithm tries progressively smaller chunks at tools/deepwiki-scraper.py:716-730:
Debugging poor matches:
- If too few diagrams placed: The context from JavaScript may not match converted markdown
- If diagrams in wrong locations: Context text may appear in multiple locations
- If no diagrams: Repository may not contain mermaid diagrams
Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:216-331
Link Rewriting Implementation
The Link Rewriting Problem
DeepWiki uses absolute URLs like /owner/repo/2-1-subsection. The scraper must convert these to relative markdown paths that work in the mdBook file hierarchy:
output/markdown/
├── 1-overview.md
├── 2-main-section.md
├── section-2/
│ ├── 2-1-subsection.md
│ └── 2-2-another.md
└── 3-next-section.md
Links must account for:
- Source page location (main page vs. subsection)
- Target page location (main page vs. subsection)
- Same section vs. cross-section links
Link Rewriting Algorithm
Sources: tools/deepwiki-scraper.py:549-593
Link Rewriting Code Structure
The fix_wiki_link function at tools/deepwiki-scraper.py:549-589 implements this logic:
Input parsing:
Location detection:
Path generation rules:
| Source Location | Target Location | Generated Path | Example |
|---|---|---|---|
| Main page | Main page | file.md | 3-next.md |
| Main page | Subsection | section-N/file.md | section-2/2-1-sub.md |
| Subsection | Main page | ../file.md | ../3-next.md |
| Subsection (same section) | Subsection | file.md | 2-2-another.md |
| Subsection (diff section) | Subsection | section-N/file.md | section-3/3-1-sub.md |
The regex replacement at tools/deepwiki-scraper.py592 applies this transformation to all links:
For detailed explanation, see Link Rewriting Logic.
Sources: tools/deepwiki-scraper.py:549-593
Auto-Detection Mechanisms
flowchart TD
Start["build-docs.sh starts"] --> CheckRepo{"REPO env var\nprovided?"}
CheckRepo -->|Yes| UseEnv["Use provided REPO value"]
CheckRepo -->|No| CheckGit{"Is current directory\na Git repository?\n(git rev-parse --git-dir)"}
CheckGit -->|Yes| GetRemote["Get remote.origin.url:\ngit config --get\nremote.origin.url"]
CheckGit -->|No| SetEmpty["Set REPO=<empty>"]
GetRemote --> HasRemote{"Remote URL\nfound?"}
HasRemote -->|Yes| ParseURL["Parse GitHub URL using sed regex:\nExtract owner/repo"]
HasRemote -->|No| SetEmpty
ParseURL --> ValidateFormat{"Format is\nowner/repo?"}
ValidateFormat -->|Yes| SetRepo["Set REPO variable"]
ValidateFormat -->|No| SetEmpty
SetEmpty --> FinalCheck{"REPO is empty?"}
UseEnv --> Continue["Continue with REPO"]
SetRepo --> Continue
FinalCheck -->|Yes| Error["ERROR: REPO must be set\nExit with code 1"]
FinalCheck -->|No| Continue
Git Remote Auto-Detection
When REPO environment variable is not provided, the system attempts to auto-detect it from the Git repository in the current working directory.
Sources: build-docs.sh:8-37
Implementation Details
The auto-detection logic at build-docs.sh:8-19 handles multiple Git URL formats:
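A condensed sketch of that logic (variable names are illustrative; the regex follows the description below):

```sh
if [ -z "$REPO" ] && git rev-parse --git-dir > /dev/null 2>&1; then
  REMOTE_URL=$(git config --get remote.origin.url || true)
  REPO=$(echo "$REMOTE_URL" | sed -E 's#.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*#\1#')
fi
```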
Supported URL formats:
- HTTPS: `https://github.com/owner/repo.git`
- HTTPS (no .git): `https://github.com/owner/repo`
- SSH: `git@github.com:owner/repo.git`
- SSH (no .git): `git@github.com:owner/repo`
The regex pattern `.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*` captures:
- `[:/]` - Matches either `:` (SSH) or `/` (HTTPS)
- `([^/]+/[^/\.]+)` - Captures owner/repo (stops at `/` or `.`)
- `(\.git)?` - Optionally matches the `.git` suffix
Derived defaults:
After determining REPO, the script derives other configuration at build-docs.sh:39-45:
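In outline (a sketch using the variable names referenced elsewhere in this document):

```sh
REPO_OWNER=$(echo "$REPO" | cut -d'/' -f1)
REPO_NAME=$(echo "$REPO" | cut -d'/' -f2)
BOOK_AUTHORS="${BOOK_AUTHORS:-$REPO_OWNER}"               # default: repository owner
GIT_REPO_URL="${GIT_REPO_URL:-https://github.com/$REPO}"  # default: GitHub URL for edit links
```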
This provides sensible defaults:
- BOOK_AUTHORS defaults to repository owner
- GIT_REPO_URL defaults to GitHub URL (for "Edit this page" links)
For detailed explanation, see Auto-Detection Features.
Sources: build-docs.sh:8-45 README.md:47-53
Performance Considerations
Build Time Breakdown
Typical build times for a medium-sized repository (50-100 pages):
| Phase | Time | Bottleneck |
|---|---|---|
| Phase 1: Scraping | 60-120s | Network requests + 1s delays |
| Phase 2: Diagrams | 5-10s | Regex matching + file I/O |
| Phase 3: mdBook | 10-20s | Rust compilation + mermaid assets |
| Total | 75-150s | Network + computation |
Optimization Strategies
Network optimization:
- The scraper includes `time.sleep(1)` at tools/deepwiki-scraper.py872 between pages
- Retry logic with exponential backoff at tools/deepwiki-scraper.py:33-42
- HTTP session reuse via `requests.Session()` at tools/deepwiki-scraper.py:818-821
Markdown-only mode:
- Skips Phase 3 entirely, reducing build time by ~15-25%
- Useful for content-only iterations
Docker build optimization:
- Multi-stage build discards Rust toolchain (~1.5 GB)
- Final image only contains binaries (~300-400 MB)
- See Docker Multi-Stage Build for details
Caching considerations:
- No internal caching—each run fetches fresh content
- DeepWiki serves dynamic content (no cache headers)
- Docker layer caching helps with repeated image builds
Sources: tools/deepwiki-scraper.py:28-42 tools/deepwiki-scraper.py:817-821 tools/deepwiki-scraper.py872
Extending the System
Adding New Output Formats
The system's three-phase architecture makes it easy to add new output formats:
Integration points:
- Before Phase 3: Add code after build-docs.sh188 to read from $WIKI_DIR
- Alternative Phase 3: Replace build-docs.sh:174-176 with custom builder
- Post-processing: Add steps after build-docs.sh192 to transform mdBook output
Example: Adding PDF export:
Sources: build-docs.sh:174-206
Customizing Diagram Matching
The fuzzy matching algorithm can be tuned by modifying the chunk sizes at tools/deepwiki-scraper.py716:
Matching strategy customization:
The scoring system at tools/deepwiki-scraper.py:709-745 prioritizes:
- Anchor text matching (weighted by chunk size)
- Heading matching (weight: 50)
You can add additional heuristics by modifying the scoring logic or adding new matching strategies.
Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:716-745
Adding New Content Cleaners
The HTML-to-markdown conversion can be enhanced by adding custom cleaners at tools/deepwiki-scraper.py:489-511:
The footer cleaner at tools/deepwiki-scraper.py:127-173 can be extended with additional patterns:
Sources: tools/deepwiki-scraper.py:127-173 tools/deepwiki-scraper.py:466-511
Common Advanced Scenarios
CI/CD Integration
GitHub Actions example:
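The core of such a workflow step is the `docker run` invocation below; the surrounding YAML is omitted, and `REPO` is taken from the `GITHUB_REPOSITORY` context variable rather than Git auto-detection, which is the more reliable pattern inside containers. The `BOOK_TITLE` value is illustrative.

```bash
# Command a CI step would execute against the locally built or published image
docker run --rm \
  -e REPO="$GITHUB_REPOSITORY" \
  -e BOOK_TITLE="Project Documentation" \
  -v "$PWD/output:/output" \
  deepwiki-scraper
```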
When the script has access to a checked-out Git repository, the auto-detection at build-docs.sh:8-19 can determine REPO from Git context; in containerized CI steps, passing REPO explicitly (as above) is more reliable. The BOOK_TITLE value overrides the default title.
Sources: build-docs.sh:8-45 README.md:228-232
Multi-Repository Builds
Build documentation for multiple repositories in parallel:
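A sketch using background jobs; repository names and output paths are illustrative:

```bash
# One isolated container and output directory per repository
for repo in owner1/repo-a owner2/repo-b; do
  out="output-$(echo "$repo" | tr '/' '-')"
  mkdir -p "$out"
  docker run --rm -e REPO="$repo" -v "$PWD/$out:/output" deepwiki-scraper &
done
wait
```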
Each build runs in an isolated container with separate output directories.
Sources: build-docs.sh:21-53 README.md:200-207
Custom Theming
Override mdBook theme by modifying the generated book.toml template at build-docs.sh:85-103:
Or inject custom CSS:
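For example, mdBook's `additional-css` option can be wired into the generated template; this is a hypothetical tweak, not part of the current script:

```bash
# Add the following lines inside the book.toml heredoc (build-docs.sh:85-103):
#   [output.html]
#   additional-css = ["custom.css"]
# Then ship the stylesheet next to the generated book.toml before `mdbook build`:
echo ':root { --content-max-width: 900px; }' > "$BOOK_DIR/custom.css"
```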
Sources: build-docs.sh:84-103
Markdown-Only Mode
Relevant source files
Purpose and Scope
This document describes the Markdown-Only Mode feature, which provides a fast-path execution mode that bypasses Phase 3 of the processing pipeline. This mode is primarily used for debugging content extraction and diagram placement without waiting for the full mdBook HTML build. For information about the complete three-phase pipeline, see Three-Phase Pipeline. For details about the full output structure, see Output Structure.
Overview
Markdown-Only Mode is a configuration option that terminates the build process after completing Phase 1 (Markdown Extraction) and Phase 2 (Diagram Enhancement), skipping the computationally expensive Phase 3 (mdBook Build). This mode produces only the enhanced Markdown files without generating the final HTML documentation site.
The mode is controlled by the MARKDOWN_ONLY environment variable, which defaults to false for complete builds.
Sources: README.md:55-72 build-docs.sh26
Configuration
The mode is activated by setting the MARKDOWN_ONLY environment variable to "true":
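For example, with the locally built image (the image tag and `owner/repo` are placeholders):

```bash
# Markdown-only run: scrape and enhance, skip the mdBook build
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$PWD/output:/output" \
  deepwiki-scraper
```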
| Variable | Value | Effect |
|---|---|---|
| `MARKDOWN_ONLY` | `"true"` | Skip Phase 3, output only markdown files |
| `MARKDOWN_ONLY` | `"false"` (default) | Complete all three phases, generate full HTML site |
Sources: build-docs.sh26 README.md:45-51
Processing Pipeline Comparison
Full Build vs. Markdown-Only Mode Decision Flow
Sources: build-docs.sh:1-206
Phase Execution Matrix
| Phase | Full Build | Markdown-Only Mode |
|---|---|---|
| Phase 1: Markdown Extraction | ✓ Executed | ✓ Executed |
| Phase 2: Diagram Enhancement | ✓ Executed | ✓ Executed |
| Phase 3: mdBook Build | ✓ Executed | ✗ Skipped |
Both modes execute the Python scraper deepwiki-scraper.py identically. The difference occurs after scraping completes, where the shell orchestrator makes a conditional decision based on the MARKDOWN_ONLY variable.
Sources: build-docs.sh:58-76 README.md:123-145
Implementation Details
Conditional Logic in build-docs.sh
The markdown-only mode implementation consists of a simple conditional branch in the shell orchestrator:
graph TD
ScraperCall["python3 /usr/local/bin/deepwiki-scraper.py\n[build-docs.sh:58]"]
CheckVar{"if [ '$MARKDOWN_ONLY' = 'true' ]\n[build-docs.sh:61]"}
subgraph "Markdown-Only Path [lines 63-76]"
MkdirOut["mkdir -p $OUTPUT_DIR/markdown"]
CpMarkdown["cp -r $WIKI_DIR/* $OUTPUT_DIR/markdown/"]
EchoSuccess["Echo success message"]
Exit0["exit 0"]
end
subgraph "Full Build Path [lines 78-206]"
MkdirBook["mkdir -p $BOOK_DIR"]
CreateToml["Create book.toml"]
CreateSummary["Generate SUMMARY.md"]
CopySrcFiles["Copy markdown to src/"]
MdbookMermaid["mdbook-mermaid install"]
MdbookBuild["mdbook build"]
CopyAll["Copy to /output"]
end
ScraperCall --> CheckVar
CheckVar -->|true| MkdirOut
MkdirOut --> CpMarkdown
CpMarkdown --> EchoSuccess
EchoSuccess --> Exit0
CheckVar -->|false| MkdirBook
MkdirBook --> CreateToml
CreateToml --> CreateSummary
CreateSummary --> CopySrcFiles
CopySrcFiles --> MdbookMermaid
MdbookMermaid --> MdbookBuild
MdbookBuild --> CopyAll
The implementation performs an early exit at build-docs.sh75 when markdown-only mode is enabled, preventing execution of the entire mdBook build pipeline.
Sources: build-docs.sh:60-76
Variable Reading and Default Value
The MARKDOWN_ONLY variable is read with a default value of "false":
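The line is essentially the following (whether the script uses `:-` or `:=` expansion, the behavior described below is the same):

```bash
MARKDOWN_ONLY="${MARKDOWN_ONLY:-false}"
```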
This line at build-docs.sh26 uses bash parameter expansion to set the variable to "false" if it is unset or empty. The string comparison at build-docs.sh61 checks for exact equality with "true", meaning any other value (including "false", "", or "1") results in a full build.
Sources: build-docs.sh26 build-docs.sh61
Output Structure Comparison
Markdown-Only Mode Output
When MARKDOWN_ONLY="true", the output directory contains:
/output/
└── markdown/
├── 1-overview.md
├── 2-quick-start.md
├── 3-configuration-reference.md
├── section-4/
│ ├── 4-1-three-phase-pipeline.md
│ └── 4-2-docker-multi-stage-build.md
└── ...
| Output | Present | Location |
|---|---|---|
| Markdown files | ✓ | /output/markdown/ |
| HTML documentation | ✗ | N/A |
| `book.toml` | ✗ | N/A |
| `SUMMARY.md` | ✗ | N/A |
Sources: build-docs.sh:64-74 README.md:106-114
Full Build Mode Output
When MARKDOWN_ONLY="false" (default), the output directory contains:
/output/
├── book/
│ ├── index.html
│ ├── print.html
│ ├── searchindex.js
│ ├── css/
│ ├── FontAwesome/
│ └── ...
├── markdown/
│ ├── 1-overview.md
│ ├── 2-quick-start.md
│ └── ...
└── book.toml
| Output | Present | Location |
|---|---|---|
| Markdown files | ✓ | /output/markdown/ |
| HTML documentation | ✓ | /output/book/ |
| `book.toml` | ✓ | /output/book.toml |
| `SUMMARY.md` | ✓ (internal) | Copied to mdBook src during build |
Sources: build-docs.sh:179-201 README.md:89-105
Use Cases
Debugging Diagram Placement
Markdown-only mode is particularly useful when debugging the fuzzy diagram matching algorithm (see Fuzzy Diagram Matching Algorithm). Developers can rapidly iterate on diagram placement logic without waiting for mdBook compilation:
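A typical loop, assuming the locally built `deepwiki-scraper` image and a placeholder `owner/repo`:

```bash
docker build -t deepwiki-scraper .
docker run --rm -e REPO=owner/repo -e MARKDOWN_ONLY=true \
  -v "$PWD/output:/output" deepwiki-scraper
grep -rl 'mermaid' output/markdown/ | head   # which files received diagram blocks
```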
This workflow allows inspection of the exact markdown output, including where diagrams were injected, without the overhead of HTML generation.
Sources: README.md:55-72 README.md:168-169
Content Extraction Verification
When verifying that HTML-to-Markdown conversion is working correctly (see HTML to Markdown Conversion), markdown-only mode provides quick feedback:
Sources: README.md:55-72
CI/CD Pipeline Intermediate Artifacts
In continuous integration pipelines, markdown-only mode can be used as an intermediate step to produce version-controlled markdown artifacts without generating the full HTML site:
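A sketch of such a step (the `GITHUB_REPOSITORY` variable and archive name are illustrative):

```bash
# Produce markdown artifacts only, then archive them for later jobs
docker run --rm -e REPO="$GITHUB_REPOSITORY" -e MARKDOWN_ONLY=true \
  -v "$PWD/output:/output" deepwiki-scraper
tar -czf wiki-markdown.tar.gz -C output markdown
```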
Sources: README.md:228-232
Performance Characteristics
Build Time Comparison
| Build Mode | Phase 1 | Phase 2 | Phase 3 | Total |
|---|---|---|---|---|
| Full Build | ~30s | ~10s | ~20s | ~60s |
| Markdown-Only | ~30s | ~10s | 0s | ~40s |
The markdown-only mode provides approximately 33% faster execution by eliminating:
- mdBook binary initialization
- `book.toml` generation
- `SUMMARY.md` generation
- Markdown file copying to `src/`
- `mdbook-mermaid` asset installation
- HTML compilation and asset generation
Note: Actual times vary based on repository size, number of diagrams, and system resources.
Sources: build-docs.sh:78-176 README.md:71-72
Resource Consumption
| Resource | Full Build | Markdown-Only |
|---|---|---|
| CPU | High (Rust compilation) | Medium (Python only) |
| Memory | ~2GB recommended | ~512MB sufficient |
| Disk I/O | High (HTML generation) | Low (markdown only) |
| Network | Same (scraping) | Same (scraping) |
Sources: README.md175
Common Workflows
Iterative Debugging Workflow
This workflow minimizes iteration time during development by using markdown-only mode for rapid feedback loops, only running the full build when markdown output is verified correct.
Sources: README.md:55-72 README.md177
Markdown Extraction for Other Tools
Markdown-only mode can extract clean markdown files for use with documentation tools other than mdBook:
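For example (the destination directory belongs to a hypothetical other tool):

```bash
docker run --rm -e REPO=owner/repo -e MARKDOWN_ONLY=true \
  -v "$PWD/output:/output" deepwiki-scraper
cp -r output/markdown/ my-docs-site/content/   # destination path is illustrative
```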
The extracted markdown files contain properly rewritten internal links and enhanced diagrams, making them suitable for any markdown-compatible documentation system.
Sources: README.md:106-114 README.md:218-232
Console Output Differences
Markdown-Only Mode Output
When MARKDOWN_ONLY="true", the console output terminates after Phase 2:
Step 1: Scraping wiki from DeepWiki...
[... scraping progress ...]
Step 2: Copying markdown files to output (markdown-only mode)...
================================================================================
✓ Markdown extraction complete!
================================================================================
Outputs:
- Markdown files: /output/markdown/
The script exits at build-docs.sh75 with exit code 0.
Sources: build-docs.sh:63-75
Full Build Mode Output
When MARKDOWN_ONLY="false", the console output continues through all phases:
Step 1: Scraping wiki from DeepWiki...
[... scraping progress ...]
Step 2: Initializing mdBook structure...
Step 3: Generating SUMMARY.md from scraped content...
Step 4: Copying markdown files to book...
Step 5: Installing mdbook-mermaid assets...
Step 6: Building mdBook...
Step 7: Copying outputs to /output...
================================================================================
✓ Documentation build complete!
================================================================================
Outputs:
- HTML book: /output/book/
- Markdown files: /output/markdown/
- Book config: /output/book.toml
Sources: build-docs.sh:193-205
Summary
Markdown-Only Mode provides a fast-path execution option controlled by the MARKDOWN_ONLY environment variable. It executes Phases 1 and 2 of the pipeline but bypasses Phase 3, producing only enhanced markdown files without HTML documentation. This mode is essential for:
- Debugging : Rapid iteration on content extraction and diagram placement
- Performance : 33% faster execution when HTML output is not needed
- Flexibility : Extract markdown for use with other documentation tools
The implementation is straightforward: a single conditional check at build-docs.sh61 determines whether to execute the mdBook build pipeline or exit early with only markdown artifacts.
Sources: build-docs.sh:60-76 README.md:55-72 README.md:106-114
Link Rewriting Logic
Relevant source files
This document details the algorithm for converting internal DeepWiki URL links into relative Markdown file paths during the content extraction process. The link rewriting system ensures that cross-references between wiki pages function correctly in the final mdBook output by transforming absolute web URLs into appropriate relative file paths based on the hierarchical structure of the documentation.
For information about the overall markdown extraction process, see Phase 1: Markdown Extraction. For details about file organization and directory structure, see Output Structure.
Overview
DeepWiki uses absolute URL paths for internal wiki links in the format /owner/repo/N-page-title or /owner/repo/N.M-subsection-title. These links must be rewritten to relative Markdown file paths that respect the mdBook directory structure where:
- Main pages (e.g., "1-overview", "2-architecture") reside in the root markdown directory
- Subsections (e.g., "2.1-subsection", "2.2-another") reside in subdirectories named `section-N/`
- File names use hyphens instead of dots (e.g., `2-1-subsection.md` instead of `2.1-subsection.md`)
The rewriting logic must compute the correct relative path based on both the source page location and the target page location.
Sources: tools/deepwiki-scraper.py:547-594
Directory Structure Context
The system organizes markdown files into a hierarchical structure that affects link rewriting:
Diagram: File Organization Hierarchy
graph TB
Root["Root Directory\n(output/markdown/)"]
Main1["1-overview.md\n(Main Page)"]
Main2["2-architecture.md\n(Main Page)"]
Main3["3-installation.md\n(Main Page)"]
Section2["section-2/\n(Subsection Directory)"]
Section3["section-3/\n(Subsection Directory)"]
Sub2_1["2-1-components.md\n(Subsection)"]
Sub2_2["2-2-workflows.md\n(Subsection)"]
Sub3_1["3-1-docker-setup.md\n(Subsection)"]
Sub3_2["3-2-manual-setup.md\n(Subsection)"]
Root --> Main1
Root --> Main2
Root --> Main3
Root --> Section2
Root --> Section3
Section2 --> Sub2_1
Section2 --> Sub2_2
Section3 --> Sub3_1
Section3 --> Sub3_2
This structure requires different relative path strategies depending on where the link originates and where it points.
Sources: tools/deepwiki-scraper.py:848-860
Link Transformation Algorithm
Input Format Detection
The algorithm begins by matching markdown links that reference the DeepWiki URL structure using a regular expression pattern.
Diagram: Link Pattern Matching Flow
flowchart TD
Start["Markdown Content"]
Regex["Apply Regex Pattern:\n'\\]\\(/[^/]+/[^/]+/([^)]+)\\)'"]
Extract["Extract Path Component:\ne.g., '4-query-planning'"]
Parse["Parse Page Number and Slug:\nPattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
PageNum["page_num\n(e.g., '2.1' or '4')"]
Slug["slug\n(e.g., 'query-planning')"]
Start --> Regex
Regex --> Extract
Extract --> Parse
Parse --> PageNum
Parse --> Slug
The regex `\]\(/[^/]+/[^/]+/([^)]+)\)` captures the path component after the repository identifier. For example, in a link such as `[Query Planning](/owner/repo/4-query-planning)`, it captures `4-query-planning`.
Sources: tools/deepwiki-scraper.py592
Page Classification Logic
Each page (source and target) is classified based on whether it contains a dot in its page number, indicating a subsection.
Diagram: Page Type Classification
graph TB
subgraph "Target Classification"
TargetNum["Target Page Number"]
CheckDot["Contains '.' ?"]
IsTargetSub["is_target_subsection = True\ntarget_main_section = N"]
IsTargetMain["is_target_subsection = False\ntarget_main_section = None"]
TargetNum --> CheckDot
CheckDot -->|Yes "2.1"| IsTargetSub
CheckDot -->|No "2"| IsTargetMain
end
subgraph "Source Classification"
SourceInfo["current_page_info"]
SourceLevel["Check 'level' field"]
IsSourceSub["is_source_subsection = True\nsource_main_section = N"]
IsSourceMain["is_source_subsection = False\nsource_main_section = None"]
SourceInfo --> SourceLevel
SourceLevel -->|> 0| IsSourceSub
SourceLevel -->|= 0| IsSourceMain
end
The level field in current_page_info is set during wiki structure discovery and indicates the depth in the hierarchy (0 for main pages, 1+ for subsections).
Sources: tools/deepwiki-scraper.py:554-570
Path Generation Decision Matrix
The relative path is computed based on the combination of source and target types:
| Source Type | Target Type | Relative Path | Example |
|---|---|---|---|
| Main Page | Main Page | {file_num}-{slug}.md | 3-installation.md |
| Main Page | Subsection | section-{N}/{file_num}-{slug}.md | section-2/2-1-components.md |
| Subsection | Main Page | ../{file_num}-{slug}.md | ../3-installation.md |
| Subsection (same section) | Subsection (same section) | {file_num}-{slug}.md | 2-2-workflows.md |
| Subsection (section A) | Subsection (section B) | ../section-{N}/{file_num}-{slug}.md | ../section-3/3-1-setup.md |
Sources: tools/deepwiki-scraper.py:573-588
Implementation Details
flowchart TD
Start["fix_wiki_link(match)"]
ExtractPath["full_path = match.group(1)"]
ParseLink["Match pattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
Success{"Match\nsuccessful?"}
NoMatch["Return match.group(0)\n(unchanged)"]
ExtractParts["page_num = match.group(1)\nslug = match.group(2)"]
ConvertNum["file_num = page_num.replace('.', '-')"]
ClassifyTarget["Classify target:\nis_target_subsection\ntarget_main_section"]
ClassifySource["Classify source:\nis_source_subsection\nsource_main_section"]
Decision{"Target is\nsubsection?"}
DecisionYes{"Source in same\nsection?"}
DecisionNo{"Source is\nsubsection?"}
Path1["Return '{file_num}-{slug}.md'"]
Path2["Return 'section-{N}/{file_num}-{slug}.md'"]
Path3["Return '../{file_num}-{slug}.md'"]
Start --> ExtractPath
ExtractPath --> ParseLink
ParseLink --> Success
Success -->|No| NoMatch
Success -->|Yes| ExtractParts
ExtractParts --> ConvertNum
ConvertNum --> ClassifyTarget
ClassifyTarget --> ClassifySource
ClassifySource --> Decision
Decision -->|Yes| DecisionYes
DecisionYes -->|Yes| Path1
DecisionYes -->|No| Path2
Decision -->|No| DecisionNo
DecisionNo -->|Yes| Path3
DecisionNo -->|No| Path1
The fix_wiki_link Function
The core implementation is a nested function fix_wiki_link that serves as a callback for re.sub.
Diagram: fix_wiki_link Function Control Flow
The function handles all path generation cases through a series of conditional checks, using information from both the link match and the current_page_info parameter.
Sources: tools/deepwiki-scraper.py:549-589
Page Number Transformation
The transformation from page numbers with dots to file names with hyphens is critical for matching the file system structure:
Diagram: Page Number Format Conversion
graph LR
subgraph "DeepWiki Format"
DW1["Page: '2.1'"]
DW2["URL: '/repo/2.1-title'"]
end
subgraph "Transformation"
Trans["Replace '.' with '-'"]
end
subgraph "File System Format"
FS1["File Number: '2-1'"]
FS2["Path: 'section-2/2-1-title.md'"]
end
DW1 --> Trans
DW2 --> Trans
Trans --> FS1
Trans --> FS2
This conversion is performed by the line file_num = page_num.replace('.', '-'), which ensures that subsection identifiers match the actual file names created during extraction.
Sources: tools/deepwiki-scraper.py558
Detailed Example Scenarios
Scenario 1: Main Page to Main Page Link
When a main page (e.g., 1-overview.md) links to another main page (e.g., 4-features.md):
- Source: `1-overview.md` (level = 0, in root directory)
- Target: `4-features` (no dot, is a main page)
- Input Link: `[Features](/owner/repo/4-features)`
- Generated Path: `4-features.md`
- Reason: Both files are in the same root directory, so only the filename is needed
Sources: tools/deepwiki-scraper.py:586-588
Scenario 2: Main Page to Subsection Link
When a main page (e.g., 2-architecture.md) links to a subsection (e.g., 2.1-components):
- Source: `2-architecture.md` (level = 0, in root directory)
- Target: `2.1-components` (contains dot, is a subsection in section-2/)
- Input Link: `[Components](/owner/repo/2.1-components)`
- Generated Path: `section-2/2-1-components.md`
- Reason: Target is in the subdirectory `section-2/` and the source is in the root, so the full relative path is needed
Sources: tools/deepwiki-scraper.py:579-580
Scenario 3: Subsection to Main Page Link
When a subsection (e.g., 2.1-components.md in section-2/) links to a main page (e.g., 3-installation.md):
- Source: `2.1-components.md` (level = 1, in the section-2/ directory)
- Target: `3-installation` (no dot, is a main page)
- Input Link: `[Installation](/owner/repo/3-installation)`
- Generated Path: `../3-installation.md`
- Reason: Source is in a subdirectory and the target is in the parent directory, so `../` is needed to go up one level
Sources: tools/deepwiki-scraper.py:583-585
Scenario 4: Subsection to Subsection (Same Section)
When a subsection (e.g., 2.1-components.md) links to another subsection in the same section (e.g., 2.2-workflows.md):
- Source: `2.1-components.md` (level = 1, in section-2/)
- Source Main Section: `2`
- Target: `2.2-workflows` (contains dot, in section-2/)
- Target Main Section: `2`
- Input Link: `[Workflows](/owner/repo/2.2-workflows)`
- Generated Path: `2-2-workflows.md`
- Reason: Both files are in the same `section-2/` directory, so only the filename is needed
Sources: tools/deepwiki-scraper.py:575-577
Scenario 5: Subsection to Subsection (Different Section)
When a subsection (e.g., 2.1-components.md in section-2/) links to a subsection in a different section (e.g., 3.1-docker-setup.md in section-3/):
- Source: `2.1-components.md` (level = 1, in section-2/)
- Source Main Section: `2`
- Target: `3.1-docker-setup` (contains dot, in section-3/)
- Target Main Section: `3`
- Input Link: `[Docker Setup](/owner/repo/3.1-docker-setup)`
- Generated Path: `section-3/3-1-docker-setup.md`
- Reason: Sections don't match, so the path is expressed from the root perspective (implicitly going up and into the other section directory)
Sources: tools/deepwiki-scraper.py:579-580
sequenceDiagram
participant EPC as extract_page_content
participant CTM as convert_html_to_markdown
participant FWL as fix_wiki_link
participant RE as re.sub
EPC->>CTM: Convert HTML to Markdown
CTM-->>EPC: Raw Markdown (with DeepWiki URLs)
Note over EPC: Clean up content
EPC->>RE: Apply link rewriting regex
loop For each matched link
RE->>FWL: Call with match object
FWL->>FWL: Parse page number and slug
FWL->>FWL: Classify source and target
FWL->>FWL: Compute relative path
FWL-->>RE: Return rewritten link
end
RE-->>EPC: Markdown with relative paths
EPC-->>EPC: Return final markdown
Integration with Content Extraction
The link rewriting is integrated into the extract_page_content function and applied after HTML-to-Markdown conversion:
Diagram: Link Rewriting Integration Sequence
The rewriting occurs at line 592 using re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', fix_wiki_link, markdown), which finds all internal wiki links and replaces them with their rewritten versions.
Sources: tools/deepwiki-scraper.py:547-594
Edge Cases and Error Handling
Invalid Link Format
If a link doesn't match the expected pattern (\d+(?:\.\d+)*)-(.+)$, the function returns the original match unchanged:
This ensures that malformed or external links are preserved in their original form.
Sources: tools/deepwiki-scraper.py:551-589
Missing current_page_info
If current_page_info is not provided (e.g., during development or testing), the function defaults to treating the source as a main page:
This allows the function to work in degraded mode, though links from subsections may not be correctly rewritten.
Sources: tools/deepwiki-scraper.py:565-570
Performance Considerations
The link rewriting is performed using a single re.sub call with a callback function, which is efficient for typical wiki pages with dozens to hundreds of links. The regex compilation is implicit and cached by Python's re module.
The algorithm has O(n) complexity where n is the number of internal links in the page, with each link requiring constant-time string operations.
Sources: tools/deepwiki-scraper.py592
Testing and Validation
The correctness of link rewriting can be validated by:
- Checking that generated links use the `.md` extension
- Verifying that links from subsections to main pages use `../`
- Confirming that links to subsections use the `section-N/` prefix when appropriate
- Testing that cross-section subsection links resolve correctly
The mdBook build process will fail if links are incorrectly rewritten, providing a validation mechanism during Phase 3 of the pipeline.
Sources: tools/deepwiki-scraper.py:547-594
Auto-Detection Features
Relevant source files
This document describes the automatic detection and configuration mechanisms in the DeepWiki-to-mdBook converter system. These features enable the system to operate with minimal user configuration by intelligently inferring repository metadata, generating sensible defaults, and dynamically discovering file structures.
For information about manually configuring these values, see Configuration Reference. For details on how SUMMARY.md generation works, see SUMMARY.md Generation.
Overview
The system implements three primary auto-detection capabilities:
- Git Repository Detection : Automatically identifies the GitHub repository from Git remote URLs
- Configuration Defaults : Generates book metadata from detected repository information
- File Structure Discovery : Dynamically builds table of contents from actual file hierarchies
These features allow the system to run with a single docker run command in many cases, with all necessary configuration inferred from context.
Git Repository Auto-Detection
Detection Mechanism
The system attempts to auto-detect the GitHub repository when the REPO environment variable is not provided. This detection occurs in the shell orchestrator and follows a specific fallback sequence.
Git Repository Auto-Detection Flow
flowchart TD
Start["build-docs.sh execution"]
CheckRepo{"REPO env\nvariable set?"}
UseRepo["Use $REPO value"]
CheckGit{"Git repository\ndetected?"}
GetRemote["Execute:\ngit config --get\nremote.origin.url"]
CheckRemote{"Remote URL\nfound?"}
ExtractOwnerRepo["Apply regex pattern:\ns#.*github\.com[:/]([^/]+/[^/\.]+)\n(\.git)?.*#\1#"]
SetRepo["Set REPO variable\nto owner/repo"]
ErrorExit["Exit with error:\nREPO must be set"]
Start --> CheckRepo
CheckRepo -->|Yes| UseRepo
CheckRepo -->|No| CheckGit
CheckGit -->|No| ErrorExit
CheckGit -->|Yes| GetRemote
GetRemote --> CheckRemote
CheckRemote -->|No| ErrorExit
CheckRemote -->|Yes| ExtractOwnerRepo
ExtractOwnerRepo --> SetRepo
UseRepo --> Continue["Continue with\nbuild process"]
SetRepo --> Continue
Sources: build-docs.sh:8-19
Implementation Details
The auto-detection logic is implemented in the shell script's initialization section:
| Detection Step | Shell Command | Purpose |
|---|---|---|
| Check Git repository | git rev-parse --git-dir > /dev/null 2>&1 | Verify current directory is a Git repository |
| Retrieve remote URL | git config --get remote.origin.url | Get the origin remote URL |
| Extract repository | `sed -E 's#.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*#\1#'` | Parse owner/repo from various URL formats |
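Putting the steps together, the detection sequence looks roughly like this (a condensed sketch; the actual script at build-docs.sh:8-19 may differ in wording and error handling):

```bash
# Only attempt detection when REPO was not provided and a Git repo is present
if [ -z "$REPO" ] && git rev-parse --git-dir > /dev/null 2>&1; then
  remote_url=$(git config --get remote.origin.url)
  REPO=$(echo "$remote_url" | sed -E 's#.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*#\1#')
fi
```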
The regex pattern in the sed command handles multiple GitHub URL formats:
- HTTPS: `https://github.com/owner/repo.git`
- SSH: `git@github.com:owner/repo.git`
- HTTPS without .git: `https://github.com/owner/repo`
- SSH without .git: `git@github.com:owner/repo`
Sources: build-docs.sh:8-19
Supported URL Formats
The detection regex supports the following GitHub remote URL patterns:
The regex captures the repository path between github.com and any optional .git suffix, handling both : (SSH) and / (HTTPS) separators.
Sources: build-docs.sh:14-16
Configuration Defaults Generation
flowchart LR
REPO["$REPO\n(owner/repo)"]
Extract["Parse repository\ncomponents"]
REPO_OWNER["$REPO_OWNER\n(cut -d'/' -f1)"]
REPO_NAME["$REPO_NAME\n(cut -d'/' -f2)"]
DefaultAuthors["BOOK_AUTHORS\ndefault: $REPO_OWNER"]
DefaultURL["GIT_REPO_URL\ndefault: https://github.com/$REPO"]
DefaultTitle["BOOK_TITLE\ndefault: Documentation"]
FinalAuthors["Final BOOK_AUTHORS"]
FinalURL["Final GIT_REPO_URL"]
FinalTitle["Final BOOK_TITLE"]
REPO --> Extract
Extract --> REPO_OWNER
Extract --> REPO_NAME
REPO_OWNER --> DefaultAuthors
REPO --> DefaultURL
DefaultAuthors -->|Override if env var set| FinalAuthors
DefaultURL -->|Override if env var set| FinalURL
DefaultTitle -->|Override if env var set| FinalTitle
FinalAuthors --> BookToml["book.toml\ngeneration"]
FinalURL --> BookToml
FinalTitle --> BookToml
Metadata Derivation
Once the REPO variable is determined (either from environment or auto-detection), the system generates additional configuration values with intelligent defaults:
Configuration Default Generation Flow
Sources: build-docs.sh:39-45
Default Value Table
| Configuration Variable | Default Value Expression | Example Result | Override Behavior |
|---|---|---|---|
| `REPO_OWNER` | `$(echo "$REPO" \| cut -d'/' -f1)` | `jzombie` | Derived from `REPO` |
| `REPO_NAME` | `$(echo "$REPO" \| cut -d'/' -f2)` | `deepwiki-to-mdbook` | Derived from `REPO` |
| `BOOK_AUTHORS` | `${BOOK_AUTHORS:=$REPO_OWNER}` | `jzombie` | Environment variable takes precedence |
| `GIT_REPO_URL` | `${GIT_REPO_URL:=https://github.com/$REPO}` | `https://github.com/jzombie/deepwiki-to-mdbook` | Environment variable takes precedence |
| `BOOK_TITLE` | `${BOOK_TITLE:-Documentation}` | `Documentation` | Environment variable takes precedence |
The shell parameter expansion syntax ${VAR:=default} assigns the default value only if VAR is unset or null, enabling environment variable overrides.
Sources: build-docs.sh:21-26 build-docs.sh:39-45
book.toml Generation
The auto-detected and default values are incorporated into the dynamically generated book.toml configuration file:
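A sketch of the kind of heredoc used (the real template at build-docs.sh:85-103 may include additional fields):

```bash
cat > "$BOOK_DIR/book.toml" <<EOF
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]
src = "src"

[output.html]
git-repository-url = "$GIT_REPO_URL"

[preprocessor.mermaid]
command = "mdbook-mermaid"
EOF
```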
The git-repository-url field enables mdBook to generate "Edit this page" links that direct users to the appropriate GitHub repository file.
Sources: build-docs.sh:85-103 README.md99
File Structure Discovery
Dynamic SUMMARY.md Generation
The system automatically discovers the file hierarchy and generates a table of contents without requiring manual configuration. This process analyzes the scraped markdown files to determine their structure.
File Structure Discovery Algorithm
flowchart TD
Start["Begin SUMMARY.md\ngeneration"]
FindFirst["Find first .md file:\nls $WIKI_DIR/*.md /head -1"]
ExtractTitle["Extract title: head -1 file/ sed 's/^# //'"]
WriteIntro["Write introduction entry\nto SUMMARY.md"]
IterateFiles["Iterate all .md files\nin $WIKI_DIR"]
SkipFirst{"Is this\nfirst page?"}
ExtractNum["Extract section number:\ngrep -oE '^[0-9]+'"]
CheckSubdir{"Does section-N\ndirectory exist?"}
WriteSection["Write section header:\n# Title"]
WriteMain["Write main entry:\n- [Title](filename.md)"]
IterateSubs["Iterate subsection files:\nsection-N/*.md"]
WriteSubentry["Write subsection:\n - [Subtitle](section-N/file.md)"]
WriteStandalone["Write standalone entry:\n- [Title](filename.md)"]
NextFile{"More files?"}
Done["SUMMARY.md complete"]
Start --> FindFirst
FindFirst --> ExtractTitle
ExtractTitle --> WriteIntro
WriteIntro --> IterateFiles
IterateFiles --> SkipFirst
SkipFirst -->|Yes| NextFile
SkipFirst -->|No| ExtractNum
ExtractNum --> CheckSubdir
CheckSubdir -->|Yes| WriteSection
WriteSection --> WriteMain
WriteMain --> IterateSubs
IterateSubs --> WriteSubentry
WriteSubentry --> NextFile
CheckSubdir -->|No| WriteStandalone
WriteStandalone --> NextFile
NextFile -->|Yes| SkipFirst
NextFile -->|No| Done
Sources: build-docs.sh:112-159
Directory Structure Detection
The file structure discovery algorithm recognizes two organizational patterns:
Recognized File Hierarchy Patterns
$WIKI_DIR/
├── 1-overview.md # Main page (becomes introduction)
├── 2-architecture.md # Main page with subsections
├── 3-components.md # Standalone page
├── section-2/ # Subsection directory
│ ├── 2-1-system-design.md
│ └── 2-2-data-flow.md
└── section-4/ # Another subsection directory
├── 4-1-phase-one.md
└── 4-2-phase-two.md
The algorithm uses the following detection logic:
| Pattern Element | Detection Method | Code Reference |
|---|---|---|
| Main pages | `for file in "$WIKI_DIR"/*.md` | build-docs.sh126 |
| Section number | `echo "$filename" \| grep -oE '^[0-9]+'` | |
| Subsection directory | `[ -d "$WIKI_DIR/section-$section_num" ]` | build-docs.sh138 |
| Subsection files | `for subfile in "$section_dir"/*.md` | build-docs.sh147 |
Sources: build-docs.sh:126-157
Title Extraction
Page titles are automatically extracted from the first line of each markdown file using the following approach:
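The command is essentially a `head`/`sed` pipeline (a sketch; exact quoting in build-docs.sh may differ):

```bash
# First line of the file, with the leading "# " heading marker stripped
title=$(head -1 "$file" | sed 's/^# //')
```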
This command:
- Reads the first line of the file with `head -1`
- Removes the markdown heading syntax `#` with `sed 's/^# //'`
- Assigns the result to the `title` variable for use in SUMMARY.md
Sources: build-docs.sh134 build-docs.sh150
Generated SUMMARY.md Example
Given the file structure shown above, the system generates:
The generation process reports a count of entries (`Generated SUMMARY.md with N entries`), where N is computed with a `grep -c` count over `src/SUMMARY.md`.
Auto-Detection in CI/CD Context
Docker Container Limitations
The Git repository auto-detection feature has limitations when running inside a Docker container. The detection logic executes within the container's filesystem, which typically does not include the host's Git repository unless explicitly mounted.
Auto-Detection Context Comparison
| Execution Context | Git Repository Available | Auto-Detection Works | Recommended Usage |
|---|---|---|---|
| Host machine with Git repository | ✓ Yes | ✓ Yes | Local development/testing |
| Docker container (default) | ✗ No | ✗ No | Must provide REPO env var |
| Docker with volume mount of Git repo | ✓ Yes | ⚠ Partial | Not recommended (complexity) |
| CI/CD pipeline (GitHub Actions, etc.) | ⚠ Varies | ⚠ Conditional | Use explicit REPO for reliability |
For production and CI/CD usage, explicitly setting the REPO environment variable is recommended:
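For example (the repository name is a placeholder):

```bash
docker run --rm \
  -e REPO=owner/repo \
  -v "$PWD/output:/output" \
  deepwiki-scraper
```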
Sources: build-docs.sh:8-36 README.md:47-53
Implementation Code References
Shell Variable Initialization
The complete auto-detection and default generation sequence:
Sources: build-docs.sh:8-45
Error Handling
The system validates that a repository is available (either from environment or auto-detection) before proceeding:
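A sketch of the check (the exact message at build-docs.sh:33-37 may differ):

```bash
if [ -z "$REPO" ]; then
  echo "ERROR: REPO must be set (owner/repo)"
  exit 1
fi
```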
This validation ensures the system fails fast with a clear error message if configuration is insufficient.
Sources: build-docs.sh:33-37
Development Guide
Relevant source files
This page provides guidance for developers who want to modify, extend, or contribute to the DeepWiki-to-mdBook Converter system. It covers the development environment setup, local workflow, testing procedures, and key considerations when working with the codebase.
For detailed information about the repository structure, see Project File Structure. For instructions on building the Docker image, see Building the Docker Image. For Python dependency details, see Python Dependencies.
Development Environment Requirements
The system is designed to run entirely within Docker, but local development requires the following tools:
| Tool | Purpose | Version |
|---|---|---|
| Docker | Container runtime | Latest stable |
| Git | Version control | 2.x or later |
| Text editor/IDE | Code editing | Any (VS Code recommended) |
| Python | Local testing (optional) | 3.12+ |
| Rust toolchain | Local testing (optional) | Latest stable |
The Docker image handles all runtime dependencies, so local installation of Python and Rust is optional and only needed for testing individual components outside the container.
Sources: Dockerfile:1-33
Development Workflow Architecture
The following diagram shows the typical development cycle and how different components interact during development:
Development Workflow Diagram : Shows the cycle from editing code to building the Docker image to testing with mounted output volume.
graph TB
subgraph "Development Environment"
Editor["Code Editor"]
GitRepo["Local Git Repository"]
end
subgraph "Docker Build Process"
BuildCmd["docker build -t deepwiki-scraper ."]
Stage1["Rust Builder Stage\nCompiles mdbook binaries"]
Stage2["Python Runtime Stage\nAssembles final image"]
FinalImage["deepwiki-scraper:latest"]
end
subgraph "Testing & Validation"
RunCmd["docker run with test params"]
OutputMount["Volume mount: ./output"]
Validation["Manual inspection of output"]
end
subgraph "Key Development Files"
Dockerfile["Dockerfile"]
BuildScript["build-docs.sh"]
Scraper["tools/deepwiki-scraper.py"]
Requirements["tools/requirements.txt"]
end
Editor -->|Edit| GitRepo
GitRepo --> Dockerfile
GitRepo --> BuildScript
GitRepo --> Scraper
GitRepo --> Requirements
BuildCmd --> Stage1
Stage1 --> Stage2
Stage2 --> FinalImage
FinalImage --> RunCmd
RunCmd --> OutputMount
OutputMount --> Validation
Validation -.->|Iterate| Editor
Sources: Dockerfile:1-33 build-docs.sh:1-206
Component Development Map
This diagram bridges system concepts to actual code entities, showing which files implement which functionality:
Code Entity Mapping Diagram : Maps system functionality to specific code locations, file paths, and binaries.
graph LR
subgraph "Entry Point Layer"
CMD["CMD in Dockerfile:32"]
BuildDocs["build-docs.sh"]
end
subgraph "Configuration Layer"
EnvVars["Environment Variables\nREPO, BOOK_TITLE, etc."]
AutoDetect["Auto-detect logic\nbuild-docs.sh:8-19"]
Validation["Validation\nbuild-docs.sh:33-37"]
end
subgraph "Processing Scripts"
ScraperPy["deepwiki-scraper.py"]
MdBookBin["/usr/local/bin/mdbook"]
MermaidBin["/usr/local/bin/mdbook-mermaid"]
end
subgraph "Configuration Generation"
BookToml["book.toml generation\nbuild-docs.sh:85-103"]
SummaryMd["SUMMARY.md generation\nbuild-docs.sh:113-159"]
end
subgraph "Dependency Management"
ReqTxt["requirements.txt"]
UvInstall["uv pip install\nDockerfile:17"]
CargoInstall["cargo install\nDockerfile:5"]
end
CMD --> BuildDocs
BuildDocs --> EnvVars
EnvVars --> AutoDetect
AutoDetect --> Validation
Validation --> ScraperPy
BuildDocs --> BookToml
BuildDocs --> SummaryMd
BuildDocs --> MdBookBin
MdBookBin --> MermaidBin
ReqTxt --> UvInstall
UvInstall --> ScraperPy
CargoInstall --> MdBookBin
CargoInstall --> MermaidBin
Sources: Dockerfile:1-33 build-docs.sh:8-19 build-docs.sh:85-103 build-docs.sh:113-159
Local Development Workflow
1. Clone and Setup
The repository has a minimal structure focused on the essential build artifacts. The .gitignore:1-2 excludes the output/ directory to prevent committing generated files.
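A typical setup, with the repository URL assumed from the examples elsewhere in this wiki:

```bash
git clone https://github.com/jzombie/deepwiki-to-mdbook.git
cd deepwiki-to-mdbook
```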
2. Make Changes
Key files for common modifications:
| Modification Type | Primary File | Related Files |
|---|---|---|
| Scraping logic | tools/deepwiki-scraper.py | - |
| Build orchestration | build-docs.sh | - |
| Python dependencies | tools/requirements.txt | Dockerfile:16-17 |
| Docker build process | Dockerfile | - |
| Output structure | build-docs.sh | Lines 179-191 |
3. Build Docker Image
After making changes, rebuild the Docker image:
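The rebuild uses the same image tag shown in the workflow diagram above:

```bash
docker build -t deepwiki-scraper .
```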
The multi-stage build process Dockerfile:1-7 first compiles Rust binaries in a rust:latest builder stage, then Dockerfile:8-33 assembles the final python:3.12-slim image with copied binaries and Python dependencies.
4. Test Changes
Test with a real repository:
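For example (the repository name is a placeholder; `MARKDOWN_ONLY=true` is optional but speeds up iteration):

```bash
mkdir -p output
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$PWD/output:/output" \
  deepwiki-scraper
```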
Setting MARKDOWN_ONLY=true build-docs.sh:61-76 bypasses the mdBook build phase, allowing faster iteration when testing scraping logic changes.
5. Validate Output
Inspect the generated files:
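A few quick checks (paths follow the output structure described elsewhere in this wiki):

```bash
ls -R output/markdown | head -20                       # extracted pages and section-N/ directories
ls output/book/index.html output/book.toml             # full-build artifacts, if Phase 3 ran
python3 -m http.server --directory output/book 8000    # browse the generated site locally
```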
Sources: .gitignore:1-2 Dockerfile:1-33 build-docs.sh:61-76 build-docs.sh:179-191
Testing Strategies
Fast Iteration with Markdown-Only Mode
The MARKDOWN_ONLY environment variable enables a fast path for testing scraping changes:
This mode executes Phase 1 (Markdown Extraction) and Phase 2 (Diagram Enhancement) but skips Phase 3 (mdBook Build). See Phase 1: Markdown Extraction for details on what the first phase includes.
The conditional logic build-docs.sh:61-76 checks the MARKDOWN_ONLY variable and exits early after copying markdown files to /output/markdown/.
Testing Auto-Detection
The repository auto-detection logic build-docs.sh:8-19 attempts to extract the GitHub repository from Git remotes if REPO is not explicitly set:
The script checks git config --get remote.origin.url and extracts the owner/repo portion using sed pattern matching build-docs.sh16
Testing Configuration Generation
To test book.toml and SUMMARY.md generation without a full build:
The book.toml template build-docs.sh:85-103 uses shell variable substitution to inject environment variables into the TOML structure.
Sources: build-docs.sh:8-19 build-docs.sh:61-76 build-docs.sh:85-103
Debugging Techniques
Inspecting Intermediate Files
The build process creates temporary files in /workspace inside the container. To inspect them:
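One way to do this is to start an interactive shell instead of the default entrypoint and run the orchestrator manually (the repository name is a placeholder):

```bash
docker run --rm -it --entrypoint /bin/bash -e REPO=owner/repo deepwiki-scraper
# inside the container:
#   /usr/local/bin/build-docs.sh
#   ls /workspace/wiki /workspace/book /workspace/book/src
```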
This allows inspection of:
- Scraped markdown files in `/workspace/wiki/`
- Generated `book.toml` in `/workspace/book/`
- Generated `SUMMARY.md` in `/workspace/book/src/`
Adding Debug Output
Both build-docs.sh:1-206 and deepwiki-scraper.py use echo statements for progress tracking. Add additional debug output:
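For example, a hypothetical debug line for the shell orchestrator (the Python scraper would use `print()` for the same purpose):

```bash
echo "DEBUG: $WIKI_DIR contains $(find "$WIKI_DIR" -name '*.md' | wc -l) markdown files"
```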
Testing Python Script Independently
To test the scraper without Docker:
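A sketch, assuming Python 3.12+ locally and a placeholder repository; the script takes the repository and an output directory as its two arguments (see Modifying Python Scripts below):

```bash
pip install -r tools/requirements.txt
python3 tools/deepwiki-scraper.py owner/repo ./wiki-output
```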
This is useful for rapid iteration on scraping logic without rebuilding the Docker image.
Sources: build-docs.sh:1-206 tools/requirements.txt:1-4
Build Optimization Considerations
Multi-Stage Build Rationale
The Dockerfile:1-7 uses a separate Rust builder stage to:
- Compile `mdbook` and `mdbook-mermaid` with a full Rust toolchain
- Discard the ~1.5 GB builder stage after compilation
- Copy only the compiled binaries Dockerfile:20-21 to the final image
This reduces the final image size from ~1.5 GB to ~300-400 MB while still providing both Python and Rust tools. See Docker Multi-Stage Build for architectural details.
Dependency Management with uv
The Dockerfile13 copies uv from the official Astral image and uses it Dockerfile17 to install Python dependencies with --no-cache flag:
This approach:
- Provides faster dependency resolution than pip
- Reduces layer size with `--no-cache`
- Installs system-wide with the `--system` flag
Image Layer Ordering
The Dockerfile orders operations to maximize layer caching:
- Copy the `uv` binary (rarely changes)
- Install Python dependencies (changes with `requirements.txt`)
- Copy Rust binaries (changes when rebuilding the Rust stage)
- Copy Python scripts (changes frequently during development)
This ordering means modifying deepwiki-scraper.py only invalidates the final layers Dockerfile:24-29 not the entire dependency installation.
Sources: Dockerfile:1-33
Common Development Tasks
Adding a New Environment Variable
To add a new configuration option:
- Define a default in build-docs.sh:21-30 (see the sketch after this list)
- Add it to the configuration display at build-docs.sh:47-53
- Use it in downstream processing as needed
- Document it in Configuration Reference
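A sketch of the first two steps, using a hypothetical `NEW_OPTION` variable:

```bash
# Default value, following the existing "${VAR:-default}" pattern
NEW_OPTION="${NEW_OPTION:-default-value}"

# ...echoed alongside the other settings in the configuration display
echo "  New option:   $NEW_OPTION"
```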
Modifying SUMMARY.md Generation
The table of contents generation logic build-docs.sh:113-159 uses bash loops and file discovery:
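A condensed sketch of that loop (the real logic also writes the introduction entry and section headers):

```bash
for file in "$WIKI_DIR"/*.md; do
  filename=$(basename "$file")
  title=$(head -1 "$file" | sed 's/^# //')
  section_num=$(echo "$filename" | grep -oE '^[0-9]+')
  if [ -d "$WIKI_DIR/section-$section_num" ]; then
    echo "- [$title]($filename)" >> src/SUMMARY.md
    for subfile in "$WIKI_DIR/section-$section_num"/*.md; do
      subtitle=$(head -1 "$subfile" | sed 's/^# //')
      echo "  - [$subtitle](section-$section_num/$(basename "$subfile"))" >> src/SUMMARY.md
    done
  else
    echo "- [$title]($filename)" >> src/SUMMARY.md
  fi
done
```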
To modify the structure:
- Adjust the file pattern matching
- Modify the section detection logic
- Update the markdown output format
- Test with repositories that have different hierarchical structures
Adding New Python Dependencies
- Add the package to tools/requirements.txt:1-4 with a version constraint, e.g. `new-package>=1.0.0`
- Rebuild the Docker image (triggers Dockerfile17)
- Update the Python Dependencies documentation
- Import and use it in `deepwiki-scraper.py`
Sources: build-docs.sh:21-30 build-docs.sh:113-159 tools/requirements.txt:1-4 Dockerfile17
File Modification Guidelines
Modifying build-docs.sh
The orchestrator script uses several idioms:
| Pattern | Purpose | Example |
|---|---|---|
| `set -e` | Exit on error | build-docs.sh2 |
| `"${VAR:-default}"` | Default values | build-docs.sh:22-26 |
| `$(command)` | Command substitution | build-docs.sh12 |
| `echo ""` | Visual spacing | build-docs.sh47 |
| `mkdir -p` | Safe directory creation | build-docs.sh64 |
Maintain these patterns for consistency. The script is designed to be readable and self-documenting with clear step labels build-docs.sh:4-6
Modifying Dockerfile
Key considerations:
- Keep stages separate Dockerfile:1-2 vs Dockerfile8
- Use `COPY --from=builder` Dockerfile:20-21 for cross-stage artifact copying
- Set executable permissions Dockerfile:25-29 for scripts
- Use `WORKDIR` Dockerfile10 to establish a consistent working directory
- Keep `CMD` Dockerfile32 as the default entrypoint
Modifying Python Scripts
When editing tools/deepwiki-scraper.py:
- The script is executed via build-docs.sh58 with two arguments: `REPO` and the output directory
- It must be Python 3.12 compatible Dockerfile8
- It has access to dependencies from tools/requirements.txt:1-4
- It should write output to the specified directory argument
- It should use `print()` for progress output that appears in build logs
Sources: build-docs.sh2 build-docs.sh58 Dockerfile:1-33 tools/requirements.txt:1-4
Integration Testing
End-to-End Test
Validate the complete pipeline:
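A sketch (the repository name is a placeholder):

```bash
docker build -t deepwiki-scraper .
docker run --rm -e REPO=owner/repo -v "$PWD/output:/output" deepwiki-scraper
test -f output/book/index.html && echo "HTML site generated"
ls output/markdown/*.md > /dev/null && echo "markdown extracted"
```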
Testing Configuration Variants
Test different repository configurations:
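For example (values are illustrative):

```bash
docker run --rm -e REPO=owner/small-repo -v "$PWD/out-a:/output" deepwiki-scraper
docker run --rm -e REPO=owner/large-repo -e BOOK_TITLE="Large Repo Docs" \
  -e MARKDOWN_ONLY=true -v "$PWD/out-b:/output" deepwiki-scraper
```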
Sources: build-docs.sh:8-19 build-docs.sh:61-76
Contributing Guidelines
When submitting changes:
- Test locally : Build and run the Docker image with multiple test repositories
- Validate output : Ensure markdown files are properly formatted and the HTML site builds correctly
- Check backwards compatibility : Existing repositories should continue to work
- Update documentation : Modify relevant wiki pages if changing behavior
- Follow existing patterns : Match the coding style in build-docs.sh:1-206
The system is designed to be "fully generic" - it should work with any DeepWiki repository without modification. Test that your changes maintain this property.
Sources: build-docs.sh:1-206
Troubleshooting Development Issues
Build Failures
| Symptom | Likely Cause | Solution |
|---|---|---|
| Rust compilation fails | Network issues, incompatible versions | Check rust:latest image availability |
| Python package install fails | Version conflicts in requirements.txt | Verify package versions are compatible |
| `mdbook` not found | Binary copy failed | Check Dockerfile:20-21 paths |
| Permission denied on scripts | Missing chmod +x | Verify Dockerfile:25-29 |
Runtime Failures
| Symptom | Likely Cause | Solution |
|---|---|---|
| "REPO must be set" error | Auto-detection failed, no REPO env var | Check build-docs.sh:33-36 validation logic |
| Scraper crashes | DeepWiki site structure changed | Debug deepwiki-scraper.py with local testing |
| SUMMARY.md is empty | No markdown files found | Verify scraper output in /workspace/wiki/ |
| mdBook build fails | Invalid markdown syntax | Inspect markdown files for issues |
Output Validation Checklist
After a successful build, verify:
- `output/markdown/` contains `.md` files
- Section directories exist (e.g., `output/markdown/section-4/`)
- `output/book/index.html` exists and opens in a browser
- Navigation menu appears in the generated site
- Search functionality works
- Mermaid diagrams render correctly
- Links between pages work
- "Edit this file" links point to correct GitHub URLs
Sources: build-docs.sh:33-36 Dockerfile:20-21 Dockerfile:25-29
Project File Structure
Relevant source files
This document describes the repository's file organization, detailing the purpose of each file and directory in the codebase. Understanding this structure is essential for developers who want to modify or extend the system.
For information about building the Docker image, see Building the Docker Image. For details about the Python dependencies, see Python Dependencies.
Repository Layout
The repository follows a minimal, flat structure with only the essential files needed for Docker-based documentation generation.
graph TB
Root["Repository Root"]
Root --> GitIgnore[".gitignore\n(Excludes output/)"]
Root --> Dockerfile["Dockerfile\n(Multi-stage build)"]
Root --> BuildScript["build-docs.sh\n(Shell orchestrator)"]
Root --> ToolsDir["tools/\n(Python scripts)"]
Root --> OutputDir["output/\n(Generated, git-ignored)"]
ToolsDir --> Scraper["deepwiki-scraper.py\n(Content extraction)"]
ToolsDir --> Requirements["requirements.txt\n(Python deps)"]
OutputDir --> MarkdownOut["markdown/\n(Scraped .md files)"]
OutputDir --> BookOut["book/\n(HTML site)"]
OutputDir --> ConfigOut["book.toml\n(mdBook config)"]
style Root fill:#f9f9f9,stroke:#333
style ToolsDir fill:#e8f5e9,stroke:#388e3c
style OutputDir fill:#ffe0b2,stroke:#e64a19
style Dockerfile fill:#e1f5ff,stroke:#0288d1
style BuildScript fill:#fff4e1,stroke:#f57c00
Physical File Hierarchy
Sources: .gitignore:1-2 Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 tools/requirements.txt:1-4
Root Directory Files
The repository root contains three essential files that define the system's build and runtime behavior.
| File | Type | Lines | Purpose |
|---|---|---|---|
| `.gitignore` | Config | 2 | Excludes the output/ directory from version control |
| `Dockerfile` | Build | 33 | Multi-stage Docker build specification |
| `build-docs.sh` | Script | 206 | Shell orchestrator that coordinates all phases |
.gitignore
This file contains a single exclusion rule for the output/ directory, which is generated at runtime and should not be committed to version control. The output/ directory can contain hundreds of megabytes of generated documentation.
Sources: .gitignore:1-2
Dockerfile
The Dockerfile implements a two-stage build pattern to optimize image size:
Stage 1: Rust Builder Dockerfile:2-5
- Base: `rust:latest` (~1.5 GB)
- Purpose: Compile `mdbook` and `mdbook-mermaid` binaries
- Command: `cargo install mdbook mdbook-mermaid`
Stage 2: Final Image Dockerfile:8-32
- Base: `python:3.12-slim` (~150 MB)
- Installs: `uv` package manager Dockerfile13
- Copies: Python requirements, compiled Rust binaries, and scripts
- Entry: `/usr/local/bin/build-docs.sh` Dockerfile32
The multi-stage approach discards the Rust toolchain (~1.3 GB) while retaining only the compiled binaries, resulting in a final image of ~300-400 MB.
Sources: Dockerfile:1-33
build-docs.sh
graph LR
AutoDetect["Auto-detect Git repo\nLines 8-19"]
Config["Parse environment vars\nLines 21-53"]
Phase1["Execute scraper\nLines 55-58"]
Phase2["Generate configs\nLines 78-159"]
Phase3["Build with mdBook\nLines 169-176"]
Output["Copy to /output\nLines 178-191"]
AutoDetect --> Config
Config --> Phase1
Phase1 --> Phase2
Phase2 --> Phase3
Phase3 --> Output
MarkdownOnly{"MARKDOWN_ONLY\n==true?"}
Phase1 --> MarkdownOnly
MarkdownOnly -->|Yes| Output
MarkdownOnly -->|No| Phase2
This shell script serves as the orchestrator for the three-phase pipeline. Key sections:
Key Environment Variables:
- `REPO`: Target GitHub repository (owner/repo format)
- `BOOK_TITLE`: Generated book title build-docs.sh23
- `BOOK_AUTHORS`: Author metadata build-docs.sh24
- `GIT_REPO_URL`: Repository URL for edit links build-docs.sh25
- `MARKDOWN_ONLY`: Skip mdBook build for debugging build-docs.sh26
Critical Paths:
- `WORK_DIR=/workspace` build-docs.sh27
- `WIKI_DIR=/workspace/wiki` build-docs.sh28
- `OUTPUT_DIR=/output` build-docs.sh29
- `BOOK_DIR=/workspace/book` build-docs.sh30
Sources: build-docs.sh:1-206
graph TB
ToolsDir["tools/"]
ToolsDir --> Scraper["deepwiki-scraper.py\n920 lines\nMain extraction logic"]
ToolsDir --> Reqs["requirements.txt\n4 lines\nDependency specification"]
Scraper --> Extract["extract_wiki_structure()\nLines 78-125"]
Scraper --> Content["extract_page_content()\nLines 453-594"]
Scraper --> Enhance["extract_and_enhance_diagrams()\nLines 596-789"]
Scraper --> Main["main()\nLines 790-919"]
Reqs --> Requests["requests>=2.31.0"]
Reqs --> BS4["beautifulsoup4>=4.12.0"]
Reqs --> H2T["html2text>=2020.1.16"]
style ToolsDir fill:#e8f5e9,stroke:#388e3c
style Scraper fill:#fff4e1,stroke:#f57c00
Tools Directory
The tools/ directory contains Python-specific components that execute within the Docker container.
Directory Structure
Sources: tools/deepwiki-scraper.py:1-920 tools/requirements.txt:1-4
deepwiki-scraper.py
tools/deepwiki-scraper.py:1-920
This is the core Python module responsible for content extraction and diagram enhancement. It operates in three distinct phases within the temp directory.
Function Breakdown:
| Function | Lines | Purpose |
|---|---|---|
| `sanitize_filename()` | 21-25 | Convert text to safe filename format |
| `fetch_page()` | 27-42 | HTTP fetcher with retry logic |
| `discover_subsections()` | 44-76 | Probe for subsection pages |
| `extract_wiki_structure()` | 78-125 | Build complete page hierarchy |
| `clean_deepwiki_footer()` | 127-173 | Remove UI elements from markdown |
| `convert_html_to_markdown()` | 175-216 | HTML→Markdown via html2text |
| `extract_mermaid_from_nextjs_data()` | 218-331 | Extract diagrams from JS payload |
| `extract_page_content()` | 453-594 | Main content extraction logic |
| `extract_and_enhance_diagrams()` | 596-789 | Fuzzy match and inject diagrams |
| `main()` | 790-919 | Entry point with temp directory management |
Phase Separation:
Temporary Directory Pattern:
The script uses Python's tempfile.TemporaryDirectory() tools/deepwiki-scraper.py808 to create an isolated workspace. All markdown files are first written to this temp directory tools/deepwiki-scraper.py867 then enhanced with diagrams in-place tools/deepwiki-scraper.py:676-788 and finally moved to the output directory tools/deepwiki-scraper.py:897-906 This ensures atomic operations and prevents partial files from appearing in the output.
Sources: tools/deepwiki-scraper.py:1-920
requirements.txt
Specifies three production dependencies:
- `requests>=2.31.0`: HTTP client for fetching wiki pages tools/deepwiki-scraper.py17
- `beautifulsoup4>=4.12.0`: HTML parsing library tools/deepwiki-scraper.py18
- `html2text>=2020.1.16`: HTML-to-Markdown converter tools/deepwiki-scraper.py19
These dependencies are installed via uv pip install during the Docker build Dockerfile17 The uv package manager is used instead of pip for faster, more reliable installations in containerized environments.
Sources: tools/requirements.txt:1-4 Dockerfile:13-17
graph TB
Output["output/\n(Volume mount point)"]
Output --> Markdown["markdown/\n(Enhanced .md files)"]
Output --> Book["book/\n(HTML documentation)"]
Output --> Config["book.toml\n(mdBook configuration)"]
Markdown --> MainPages["*.md\n(Main pages: 1-overview.md, 2-quick-start.md)"]
Markdown --> Sections["section-*/\n(Subsection directories)"]
Sections --> SubPages["*.md\n(Subsection pages: 2-1-docker.md)"]
Book --> Index["index.html"]
Book --> CSS["css/"]
Book --> JS["mermaid.min.js"]
Book --> Search["searchindex.js"]
style Output fill:#ffe0b2,stroke:#e64a19
style Markdown fill:#e8f5e9,stroke:#388e3c
style Book fill:#e1f5ff,stroke:#0288d1
Output Directory (Generated)
The output/ directory is created at runtime and excluded from version control. It contains all generated artifacts.
Output Structure
Sources: build-docs.sh:181-201
Markdown Subdirectory
Contains the enhanced markdown source files organized by hierarchy:
Main Pages (Root Level):
- Format: `{number}-{slug}.md` (e.g., `1-overview.md`)
- Location: `output/markdown/`
- Example: A page numbered "3" with title "Configuration" becomes `3-configuration.md`
Subsection Pages (Nested):
- Format: `section-{main}/` directory containing `{number}-{slug}.md` files
- Location: `output/markdown/section-{N}/`
- Example: Page "3.2" under section 3 becomes `section-3/3-2-environment-variables.md`
This hierarchy is created by tools/deepwiki-scraper.py:849-860 based on the page's level field (0 for main pages, 1 for subsections).
Sources: build-docs.sh:186-188 tools/deepwiki-scraper.py:849-860
Book Subdirectory
Contains the complete HTML documentation site generated by mdBook build-docs.sh176 This is a self-contained static website with:
- Navigation sidebar (from SUMMARY.md)
- Full-text search (searchindex.js)
- Mermaid diagram rendering (via mdbook-mermaid build-docs.sh:170-171)
- Edit-on-GitHub links (from `GIT_REPO_URL`)
- Responsive Rust theme build-docs.sh94
The entire book/ directory can be served by any static file server or uploaded to GitHub Pages, Netlify, or similar hosting platforms.
Sources: build-docs.sh:176-184
book.toml Configuration
build-docs.sh:85-103 build-docs.sh191
The book.toml file is dynamically generated with repository-specific metadata:
This configuration is copied to output/book.toml for reference build-docs.sh191
Sources: build-docs.sh:85-103 build-docs.sh191
graph TB
BuildContext["Docker Build Context"]
BuildContext --> Included["Included in Image"]
BuildContext --> Excluded["Excluded"]
Included --> DockerfileBuild["Dockerfile\n(Build instructions)"]
Included --> ToolsCopy["tools/\n(COPY instruction)"]
Included --> ScriptCopy["build-docs.sh\n(COPY instruction)"]
ToolsCopy --> ReqInstall["requirements.txt\n→ uv pip install"]
ToolsCopy --> ScraperInstall["deepwiki-scraper.py\n→ /usr/local/bin/"]
ScriptCopy --> BuildInstall["build-docs.sh\n→ /usr/local/bin/"]
Excluded --> GitIgnored["output/\n(git-ignored)"]
Excluded --> GitFiles[".git/\n(implicit)"]
Excluded --> Readme["README.md\n(not referenced)"]
style BuildContext fill:#f9f9f9,stroke:#333
style Included fill:#e8f5e9,stroke:#388e3c
style Excluded fill:#ffebee,stroke:#c62828
Docker Build Context
The Docker build process includes only the files needed for container construction. Understanding this context is important for build optimization.
Build Context Inclusion
Copy Operations:
- Dockerfile16 - `COPY tools/requirements.txt /tmp/requirements.txt`
- Dockerfile24 - `COPY tools/deepwiki-scraper.py /usr/local/bin/`
- Dockerfile28 - `COPY build-docs.sh /usr/local/bin/`
Not Copied:
- `.gitignore` - only used by Git
- `output/` - generated at runtime
- `.git/` - version control metadata
- Any documentation files (README, LICENSE)
Sources: Dockerfile:16-28 .gitignore:1-2
graph TB
subgraph BuildTime["Build-Time Dependencies"]
DF["Dockerfile"]
Req["tools/requirements.txt"]
Scraper["tools/deepwiki-scraper.py"]
BuildSh["build-docs.sh"]
DF -->|COPY [Line 16]| Req
DF -->|RUN install [Line 17]| Req
DF -->|COPY [Line 24]| Scraper
DF -->|COPY [Line 28]| BuildSh
DF -->|CMD [Line 32]| BuildSh
end
subgraph Runtime["Run-Time Dependencies"]
BuildShRun["build-docs.sh\n(Entry point)"]
ScraperExec["deepwiki-scraper.py\n(Phase 1-2)"]
MdBook["mdbook\n(Phase 3)"]
MdBookMermaid["mdbook-mermaid\n(Phase 3)"]
BuildShRun -->|python3 [Line 58]| ScraperExec
BuildShRun -->|mdbook-mermaid install [Line 171]| MdBookMermaid
BuildShRun -->|mdbook build [Line 176]| MdBook
ScraperExec -->|import requests| Req
ScraperExec -->|import bs4| Req
ScraperExec -->|import html2text| Req
end
subgraph Generated["Generated Artifacts"]
WikiDir["$WIKI_DIR/\n(Temp markdown)"]
BookToml["book.toml\n(Config)"]
Summary["SUMMARY.md\n(TOC)"]
OutputDir["output/\n(Final artifacts)"]
ScraperExec -->|sys.argv[2]| WikiDir
BuildShRun -->|cat > [Line 85]| BookToml
BuildShRun -->|Lines 113-159| Summary
BuildShRun -->|cp [Lines 184-191]| OutputDir
end
BuildTime --> Runtime
Runtime --> Generated
style DF fill:#e1f5ff,stroke:#0288d1
style BuildShRun fill:#fff4e1,stroke:#f57c00
style ScraperExec fill:#e8f5e9,stroke:#388e3c
style OutputDir fill:#ffe0b2,stroke:#e64a19
File Dependency Graph
This diagram maps the relationships between files and shows which files depend on or reference others.
Sources: Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 tools/requirements.txt:1-4
File Size and Complexity Metrics
Understanding the relative complexity of each component helps developers identify which files require the most attention during modifications.
| File | Lines | Purpose | Complexity |
|---|---|---|---|
| tools/deepwiki-scraper.py | 920 | Content extraction and diagram matching | High |
| build-docs.sh | 206 | Orchestration and configuration | Medium |
| Dockerfile | 33 | Multi-stage build specification | Low |
| tools/requirements.txt | 4 | Dependency list | Minimal |
| .gitignore | 2 | Git exclusion rule | Minimal |
Key Observations:
- 90% of code is in the Python scraper tools/deepwiki-scraper.py:1-920
- Shell script handles high-level orchestration build-docs.sh:1-206
- Dockerfile is minimal due to multi-stage optimization Dockerfile:1-33
- No configuration files in repository root (all generated at runtime)
Sources: tools/deepwiki-scraper.py:1-920 build-docs.sh:1-206 Dockerfile:1-33 tools/requirements.txt:1-4 .gitignore:1-2
Building the Docker Image
Relevant source files
This page provides instructions for building the Docker image locally from source. It covers the build process, multi-stage build architecture, verification steps, and troubleshooting common build issues.
For information about the architectural rationale behind the multi-stage build strategy, see Docker Multi-Stage Build. For information about running the pre-built image, see Quick Start.
Overview
The DeepWiki-to-mdBook converter is packaged as a Docker image that combines Python runtime components with Rust-compiled binaries. Building the image locally requires Docker and typically takes 5-15 minutes depending on network speed and CPU performance. The build process compiles two Rust applications (mdbook and mdbook-mermaid) from source, then creates a minimal Python-based runtime image with these compiled binaries.
Basic Build Command
To build the Docker image from the repository root:
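The command matches the one shown in the build workflow diagram below; the deepwiki-scraper tag is the name used throughout this page:

```bash
# Build the image from the repository root
docker build -t deepwiki-scraper .
```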
This command reads the Dockerfile at the repository root and produces a tagged image named deepwiki-scraper. The build process automatically executes both stages defined in the Dockerfile.
Sources: Dockerfile:1-33
Build Process Architecture
The following diagram shows the complete build workflow, mapping from natural language concepts to the actual Docker commands and files involved:
Sources: Dockerfile:1-33
graph TD
User[Developer] -->|docker build -t deepwiki-scraper .| DockerCLI["Docker CLI"]
DockerCLI -->|Reads| Dockerfile["Dockerfile\n(repository root)"]
Dockerfile -->|Stage 1: FROM rust:latest AS builder| Stage1["Stage 1: Rust Builder\nImage: rust:latest"]
Dockerfile -->|Stage 2: FROM python:3.12-slim| Stage2["Stage 2: Final Assembly\nImage: python:3.12-slim"]
Stage1 -->|RUN cargo install mdbook| CargoBuildMdBook["cargo install mdbook\n→ /usr/local/cargo/bin/mdbook"]
Stage1 -->|RUN cargo install mdbook-mermaid| CargoBuildMermaid["cargo install mdbook-mermaid\n→ /usr/local/cargo/bin/mdbook-mermaid"]
Stage2 -->|COPY --from=ghcr.io/astral-sh/uv:latest| UVCopy["Copy /uv and /uvx\n→ /bin/"]
Stage2 -->|COPY tools/requirements.txt| ReqCopy["Copy requirements.txt\n→ /tmp/requirements.txt"]
Stage2 -->|RUN uv pip install --system| PythonDeps["Install Python packages:\nrequests, beautifulsoup4, html2text"]
CargoBuildMdBook -->|COPY --from=builder| BinaryCopy1["Copy to /usr/local/bin/mdbook"]
CargoBuildMermaid -->|COPY --from=builder| BinaryCopy2["Copy to /usr/local/bin/mdbook-mermaid"]
Stage2 -->|COPY tools/deepwiki-scraper.py| ScraperCopy["Copy to /usr/local/bin/deepwiki-scraper.py"]
Stage2 -->|COPY build-docs.sh| BuildScriptCopy["Copy to /usr/local/bin/build-docs.sh"]
Stage2 -->|RUN chmod +x| MakeExecutable["Set execute permissions"]
BinaryCopy1 --> FinalImage["Final Image:\ndeepwiki-scraper"]
BinaryCopy2 --> FinalImage
PythonDeps --> FinalImage
ScraperCopy --> FinalImage
BuildScriptCopy --> FinalImage
MakeExecutable --> FinalImage
FinalImage -->|CMD| DefaultEntrypoint["/usr/local/bin/build-docs.sh"]
Stage-by-Stage Build Details
Stage 1: Rust Builder
Stage 1 uses the rust:latest base image (approximately 1.5 GB) to compile the Rust applications. This stage is ephemeral and discarded after binary extraction.
graph LR
subgraph "Stage 1 Build Context"
BaseImage["rust:latest\n~1.5 GB"]
CargoEnv["Cargo toolchain\nPre-installed"]
BaseImage --> CargoEnv
CargoEnv -->|cargo install mdbook| BuildMdBook["Compile mdbook\nfrom crates.io"]
CargoEnv -->|cargo install mdbook-mermaid| BuildMermaid["Compile mdbook-mermaid\nfrom crates.io"]
BuildMdBook --> Binary1["/usr/local/cargo/bin/mdbook\n(~20-30 MB)"]
BuildMermaid --> Binary2["/usr/local/cargo/bin/mdbook-mermaid\n(~10-20 MB)"]
end
subgraph "Extracted Artifacts"
Binary1 -.->|Copied to Stage 2| FinalBin1["/usr/local/bin/mdbook"]
Binary2 -.->|Copied to Stage 2| FinalBin2["/usr/local/bin/mdbook-mermaid"]
end
The cargo install commands download source code from crates.io, compile with optimization flags, and place the resulting binaries in /usr/local/cargo/bin/. This compilation typically takes 3-8 minutes depending on CPU performance.
Key Dockerfile directives:
- Line 2: FROM rust:latest AS builder - Establishes the builder stage
- Line 5: RUN cargo install mdbook mdbook-mermaid - Compiles both tools in a single command
Sources: Dockerfile:1-5
Stage 2: Final Image Assembly
Stage 2 creates the production image using python:3.12-slim (approximately 150 MB) as the base and layers in all necessary runtime components:
| Layer | Purpose | Size Impact | Dockerfile Lines |
|---|---|---|---|
| Base image | Python 3.12 runtime | ~150 MB | Line 8 |
| uv package manager | Fast Python dependency installation | ~10 MB | Line 13 |
| Python dependencies | requests, beautifulsoup4, html2text | ~20 MB | Lines 16-17 |
| Rust binaries | mdbook and mdbook-mermaid executables | ~30-50 MB | Lines 20-21 |
| Python scripts | deepwiki-scraper.py | ~10 KB | Lines 24-25 |
| Shell scripts | build-docs.sh orchestrator | ~5 KB | Lines 28-29 |
| Total | Final image size | ~300-400 MB | - |
Key Dockerfile directives:
- Line 8: FROM python:3.12-slim - Establishes the final stage base
- Line 13: COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ - Imports uv from external image
- Lines 20-21: COPY --from=builder - Extracts Rust binaries from Stage 1
- Line 32: CMD ["/usr/local/bin/build-docs.sh"] - Sets default entrypoint
Sources: Dockerfile:8-33
Python Dependency Installation
The image uses uv instead of pip for faster and more reliable dependency installation. The dependencies are defined in tools/requirements.txt:
requests>=2.31.0
beautifulsoup4>=4.12.0
html2text>=2020.1.16
The installation command uses these flags:
- --system: Installs packages system-wide (not in a virtual environment)
- --no-cache: Avoids caching to reduce image size
Sources: Dockerfile:13-17 tools/requirements.txt:1-4
Build Verification
After building the image, verify that all components are correctly installed:
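The checks below are one way to do this; the exact commands are suggestions rather than scripts shipped with the repository:

```bash
# Confirm the Rust binaries and entry scripts are on the PATH
docker run --rm deepwiki-scraper which mdbook mdbook-mermaid build-docs.sh deepwiki-scraper.py

# Confirm the Python dependencies import cleanly
docker run --rm deepwiki-scraper python3 -c 'import requests, bs4, html2text; print("Dependencies OK")'

# Confirm the scripts carry execute permissions
docker run --rm deepwiki-scraper ls -l /usr/local/bin/build-docs.sh /usr/local/bin/deepwiki-scraper.py
```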
Expected outputs:
- which commands should return /usr/local/bin/<binary-name>
- Python import test should print Dependencies OK
- Script permissions should show -rwxr-xr-x (executable)
Sources: Dockerfile:20-29
graph TB
subgraph "Repository Files"
RepoRoot["Repository Root"]
Dockerfile_Src["Dockerfile"]
BuildScript["build-docs.sh"]
ToolsDir["tools/"]
Scraper["tools/deepwiki-scraper.py"]
Reqs["tools/requirements.txt"]
RepoRoot --> Dockerfile_Src
RepoRoot --> BuildScript
RepoRoot --> ToolsDir
ToolsDir --> Scraper
ToolsDir --> Reqs
end
subgraph "Stage 1 Build Products"
CargoOutput["/usr/local/cargo/bin/"]
MdBookBin["mdbook binary"]
MermaidBin["mdbook-mermaid binary"]
CargoOutput --> MdBookBin
CargoOutput --> MermaidBin
end
subgraph "Final Image Filesystem"
UsrBin["/usr/local/bin/"]
BinDir["/bin/"]
TmpDir["/tmp/"]
MdBookFinal["/usr/local/bin/mdbook"]
MermaidFinal["/usr/local/bin/mdbook-mermaid"]
BuildFinal["/usr/local/bin/build-docs.sh"]
ScraperFinal["/usr/local/bin/deepwiki-scraper.py"]
UVFinal["/bin/uv"]
UVXFinal["/bin/uvx"]
ReqsFinal["/tmp/requirements.txt"]
UsrBin --> MdBookFinal
UsrBin --> MermaidFinal
UsrBin --> BuildFinal
UsrBin --> ScraperFinal
BinDir --> UVFinal
BinDir --> UVXFinal
TmpDir --> ReqsFinal
end
MdBookBin -.->|COPY --from=builder| MdBookFinal
MermaidBin -.->|COPY --from=builder| MermaidFinal
BuildScript -.->|COPY| BuildFinal
Scraper -.->|COPY| ScraperFinal
Reqs -.->|COPY| ReqsFinal
File and Binary Locations in Final Image
The following diagram maps the repository structure to the final image filesystem layout:
Sources: Dockerfile:13-28
Common Build Issues and Solutions
Issue: Cargo Installation Timeout
Symptom: Build fails during Stage 1 with network timeout errors:
error: failed to download `mdbook`
Solution: Increase Docker build timeout or retry the build. The crates.io registry occasionally experiences high load.
Issue: Out of Disk Space
Symptom: Build fails with "no space left on device" error.
Solution: The Rust builder stage requires approximately 2-3 GB of temporary space. Clean up Docker resources:
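For example, the standard Docker prune commands reclaim space from unused images, stopped containers, and the build cache (note that they affect all Docker data on the host, not only this project):

```bash
docker system prune -a   # remove unused images, stopped containers, and networks
docker builder prune     # remove dangling build cache
```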
Issue: Platform Mismatch
Symptom: Built image doesn't run on target platform (e.g., building on ARM Mac but running on x86_64 Linux).
Solution: Specify the target platform explicitly:
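For example, to produce an x86_64 image regardless of the host architecture:

```bash
docker build --platform linux/amd64 -t deepwiki-scraper .
```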
Note: Cross-platform builds require QEMU emulation and will be significantly slower.
Issue: Python Dependency Installation Fails
Symptom: Stage 2 fails during uv pip install:
error: Failed to download distribution
Solution: Check network connectivity and retry. If issues persist, build without cache:
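A full rebuild that ignores previously cached layers:

```bash
docker build --no-cache -t deepwiki-scraper .
```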
Sources: Dockerfile:16-17
Build Customization Options
Building with Different Python Version
To use a different Python version, modify line 8 of the Dockerfile:
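For example, assuming Python 3.13 is the desired version (the tag is illustrative; any published python image tag works the same way):

```dockerfile
# Stage 2 base image (line 8), changed from python:3.12-slim
FROM python:3.13-slim
```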
Then rebuild:
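The rebuild uses the same command as the basic build:

```bash
docker build -t deepwiki-scraper .
```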
Building with Specific mdBook Versions
To pin specific versions of the Rust tools, modify line 5 of the Dockerfile:
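A sketch of a pinned variant; the version numbers are placeholders, not versions validated against this project:

```dockerfile
# Pin each tool to an explicit release instead of installing the latest
RUN cargo install mdbook --version 0.4.40 && \
    cargo install mdbook-mermaid --version 0.13.0
```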
Reducing Build Time for Development
During development, you can cache the Rust builder stage by building it separately:
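One approach is to build and tag Stage 1 on its own so its compiled layers stay in the local cache (the deepwiki-scraper-builder tag is an arbitrary local name):

```bash
docker build --target builder -t deepwiki-scraper-builder .   # compile and cache Stage 1 only
docker build -t deepwiki-scraper .                            # full build reuses the cached Rust layers
```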
Sources: Dockerfile:2-8
Image Size Analysis
The following table breaks down the final image size by component:
| Component | Approximate Size | Optimization Notes |
|---|---|---|
| python:3.12-slim base | 150 MB | Minimal Python distribution |
| System libraries (libc, etc.) | 20 MB | Required by Python and binaries |
| Python packages | 15-20 MB | requests, beautifulsoup4, html2text |
| uv package manager | 8-10 MB | Faster than pip |
| mdbook binary | 20-30 MB | Statically linked Rust binary |
| mdbook-mermaid binary | 10-20 MB | Statically linked Rust binary |
| Python scripts | 50-100 KB | deepwiki-scraper.py |
| Shell scripts | 5-10 KB | build-docs.sh |
| Total | ~300-400 MB | Multi-stage build discards ~1.5 GB |
The multi-stage build reduces the image size by approximately 75% compared to a single-stage build that would include the entire Rust toolchain.
Sources: Dockerfile:2-8
Building for Production
For production deployments, consider these additional steps:
- Tag with version numbers
- Scan for vulnerabilities
- Push to registry
- Generate SBOM (Software Bill of Materials)

Example commands for each step follow.
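The commands below sketch one way to carry out each step; the version tag, registry host, and the Trivy/Syft tools are illustrative choices, not requirements of this project:

```bash
# 1. Tag with a version number (1.0.0 is a placeholder)
docker tag deepwiki-scraper deepwiki-scraper:1.0.0

# 2. Scan for vulnerabilities (assumes Trivy is installed)
trivy image deepwiki-scraper:1.0.0

# 3. Push to a registry (host and namespace are placeholders)
docker tag deepwiki-scraper:1.0.0 registry.example.com/docs/deepwiki-scraper:1.0.0
docker push registry.example.com/docs/deepwiki-scraper:1.0.0

# 4. Generate an SBOM (assumes Syft is installed)
syft deepwiki-scraper:1.0.0 -o spdx-json > deepwiki-scraper.sbom.json
```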
Sources: Dockerfile:1-33
Python Dependencies
Relevant source files
This page documents the Python dependencies required by the deepwiki-scraper.py script, including their purposes, version requirements, and how they are used throughout the content extraction and conversion pipeline. For information about the scraper script itself, see deepwiki-scraper.py. For details about how Rust dependencies (mdBook and mdbook-mermaid) are installed, see Docker Multi-Stage Build.
Dependencies Overview
The system requires three core Python libraries for web scraping and HTML-to-Markdown conversion:
| Package | Minimum Version | Primary Purpose |
|---|---|---|
| requests | 2.31.0 | HTTP client for fetching web pages |
| beautifulsoup4 | 4.12.0 | HTML parsing and DOM manipulation |
| html2text | 2020.1.16 | HTML to Markdown conversion |
These dependencies are declared in tools/requirements.txt:1-3 and installed during Docker image build using the uv package manager.
Sources: tools/requirements.txt:1-3 Dockerfile:16-17
Dependency Usage Flow
The following diagram illustrates how each Python dependency is used across the three-phase processing pipeline:
Sources: tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-788
flowchart TD
subgraph "Phase 1: Markdown Extraction"
FetchPage["fetch_page()\n[tools/deepwiki-scraper.py:27-42]"]
ExtractStruct["extract_wiki_structure()\n[tools/deepwiki-scraper.py:78-125]"]
ExtractContent["extract_page_content()\n[tools/deepwiki-scraper.py:453-594]"]
ConvertHTML["convert_html_to_markdown()\n[tools/deepwiki-scraper.py:175-190]"]
end
subgraph "Phase 2: Diagram Enhancement"
ExtractDiagrams["extract_and_enhance_diagrams()\n[tools/deepwiki-scraper.py:596-788]"]
end
subgraph "requests Library"
Session["requests.Session()"]
GetMethod["session.get()"]
HeadMethod["session.head()"]
end
subgraph "BeautifulSoup4 Library"
BS4Parser["BeautifulSoup(html, 'html.parser')"]
FindAll["soup.find_all()"]
Select["soup.select()"]
Decompose["element.decompose()"]
end
subgraph "html2text Library"
H2TClass["html2text.HTML2Text()"]
HandleMethod["h.handle()"]
end
FetchPage --> Session
FetchPage --> GetMethod
ExtractStruct --> GetMethod
ExtractStruct --> BS4Parser
ExtractStruct --> FindAll
ExtractContent --> GetMethod
ExtractContent --> BS4Parser
ExtractContent --> Select
ExtractContent --> Decompose
ExtractContent --> ConvertHTML
ConvertHTML --> H2TClass
ConvertHTML --> HandleMethod
ExtractDiagrams --> GetMethod
requests
The requests library provides HTTP client functionality for fetching web pages from DeepWiki.com. It is imported at tools/deepwiki-scraper.py17 and used throughout the scraper.
Key Usage Patterns
Session Management: A requests.Session() object is created at tools/deepwiki-scraper.py:818-821 to maintain connection pooling and share headers across multiple requests:
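A minimal sketch of that pattern (the header value and URL here are placeholders, not the script's literal values):

```python
import requests

# One shared Session provides connection pooling and common headers for every request
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (placeholder browser-like value)",
})

# fetch_page()-style call: 30-second timeout on a reused connection
response = session.get("https://deepwiki.com/owner/repo/1-overview", timeout=30)
```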
HTTP GET Requests: The fetch_page() function at tools/deepwiki-scraper.py:27-42 uses session.get() with retry logic, browser-like headers, and 30-second timeout to fetch HTML content.
HTTP HEAD Requests: The discover_subsections() function at tools/deepwiki-scraper.py:44-76 uses session.head() to efficiently check for page existence without downloading full content.
Configuration Options
The library is configured with:
- Custom User-Agent: Mimics a real browser to avoid bot detection tools/deepwiki-scraper.py:29-31
- Timeout: 30-second limit on requests tools/deepwiki-scraper.py35
- Retry Logic: Up to 3 attempts with 2-second delays tools/deepwiki-scraper.py:33-42
- Connection Pooling: Automatic via the Session() object
Sources: tools/deepwiki-scraper.py17 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:818-821
BeautifulSoup4
The beautifulsoup4 library (imported as bs4) provides HTML parsing and DOM manipulation capabilities. It is imported at tools/deepwiki-scraper.py18 as from bs4 import BeautifulSoup.
Parser Selection
BeautifulSoup is instantiated with the built-in html.parser backend at multiple locations:
- Structure discovery: tools/deepwiki-scraper.py84
- Content extraction: tools/deepwiki-scraper.py463
This parser choice avoids external dependencies (lxml, html5lib) and provides sufficient functionality for well-formed HTML.
flowchart LR
subgraph "Navigation Methods"
FindAll["soup.find_all()"]
Find["soup.find()"]
Select["soup.select()"]
SelectOne["soup.select_one()"]
end
subgraph "Usage in extract_wiki_structure()"
StructLinks["Find wiki page links\n[line 90]"]
end
subgraph "Usage in extract_page_content()"
RemoveNav["Remove navigation elements\n[line 466]"]
FindContent["Locate main content area\n[line 473-485]"]
RemoveUI["Remove DeepWiki UI elements\n[line 491-511]"]
end
FindAll --> StructLinks
FindAll --> RemoveUI
Select --> RemoveNav
SelectOne --> FindContent
Find --> FindContent
DOM Navigation Methods
The following diagram maps BeautifulSoup methods to their usage contexts in the codebase:
Sources: tools/deepwiki-scraper.py18 tools/deepwiki-scraper.py84 tools/deepwiki-scraper.py90 tools/deepwiki-scraper.py463 tools/deepwiki-scraper.py:466-511
Content Manipulation
Element Removal: The element.decompose() method permanently removes elements from the DOM tree:
- Navigation elements: tools/deepwiki-scraper.py:466-467
- DeepWiki UI components: tools/deepwiki-scraper.py:491-500
- Table of contents lists: tools/deepwiki-scraper.py:504-511
CSS Selectors: BeautifulSoup's select() and select_one() methods support CSS selector syntax for finding content areas:
tools/deepwiki-scraper.py:473-476
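A self-contained sketch of that pattern; the selector strings are illustrative stand-ins for the ones used in the script:

```python
from bs4 import BeautifulSoup

html = "<html><body><nav>menu</nav><article><p>Page body</p></article></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Try candidate CSS selectors until one matches a content area
content = None
for selector in ("article", "main", "div.prose"):  # illustrative selector list
    content = soup.select_one(selector)
    if content is not None:
        break

print(content.get_text(strip=True))  # -> "Page body"
```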
Attribute-Based Selection: The find() method with attrs parameter locates elements by ARIA roles:
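A self-contained sketch (the HTML and role value are illustrative):

```python
from bs4 import BeautifulSoup

html = '<div role="navigation">menu</div><div role="main"><p>Page body</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Locate the main content region by its ARIA role attribute
main_region = soup.find(attrs={"role": "main"})
print(main_region.get_text(strip=True))  # -> "Page body"
```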
Text Extraction
BeautifulSoup's get_text() method extracts plain text from elements:
- With strip=True to remove whitespace: tools/deepwiki-scraper.py:94 tools/deepwiki-scraper.py:492
- Used for DeepWiki UI element detection: tools/deepwiki-scraper.py:492-500
Sources: tools/deepwiki-scraper.py:466-511
html2text
The html2text library converts HTML content to Markdown format. It is imported at tools/deepwiki-scraper.py19 and used exclusively in the convert_html_to_markdown() function.
Configuration
An HTML2Text instance is created with specific configuration at tools/deepwiki-scraper.py:178-180:
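A sketch matching the settings described below (the variable name is illustrative):

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False  # keep hyperlinks as Markdown link syntax
h.body_width = 0        # disable wrapping at 80 characters
```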
Key Settings:
- ignore_links = False: Preserves hyperlinks as Markdown link syntax
- body_width = 0: Disables automatic line wrapping at 80 characters, preserving original formatting
Conversion Process
The handle() method at tools/deepwiki-scraper.py181 performs the actual HTML-to-Markdown conversion:
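In sketch form, reusing the configuration shown above (the HTML fragment is illustrative):

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False
h.body_width = 0

# Convert a cleaned HTML fragment to Markdown text
markdown = h.handle("<h1>Title</h1><p>See the <a href='https://example.com'>docs</a>.</p>")
print(markdown)
```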
This processes the cleaned HTML (after BeautifulSoup removes unwanted elements) and produces Markdown text with:
- Headers converted to # syntax
- Links converted to [text](url) format
- Lists converted to - or 1. format
- Bold/italic formatting preserved
- Code blocks and inline code preserved
Post-Processing
The conversion output undergoes additional cleanup at tools/deepwiki-scraper.py188:
- DeepWiki footer removal via clean_deepwiki_footer(): tools/deepwiki-scraper.py:127-173
- Link rewriting to relative paths: tools/deepwiki-scraper.py:549-592
- Duplicate title removal: tools/deepwiki-scraper.py:525-545
Sources: tools/deepwiki-scraper.py19 tools/deepwiki-scraper.py:175-190
flowchart TD
subgraph "Dockerfile Stage 2"
BaseImage["FROM python:3.12-slim\n[Dockerfile:8]"]
CopyUV["COPY uv from astral-sh image\n[Dockerfile:13]"]
CopyReqs["COPY tools/requirements.txt\n[Dockerfile:16]"]
InstallDeps["RUN uv pip install --system\n[Dockerfile:17]"]
end
subgraph "requirements.txt"
Requests["requests>=2.31.0"]
BS4["beautifulsoup4>=4.12.0"]
HTML2Text["html2text>=2020.1.16"]
end
BaseImage --> CopyUV
CopyUV --> CopyReqs
CopyReqs --> InstallDeps
Requests --> InstallDeps
BS4 --> InstallDeps
HTML2Text --> InstallDeps
Installation Process
The dependencies are installed during Docker image build using the uv package manager, which provides fast, reliable Python package installation.
Multi-Stage Build Integration
Sources: Dockerfile8 Dockerfile13 Dockerfile:16-17 tools/requirements.txt:1-3
Installation Command
The dependencies are installed with a single uv pip install command at Dockerfile17:
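Based on the flags listed below, the directive takes this general form:

```dockerfile
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
```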
Flags:
- --system: Installs into system Python, not a virtual environment
- --no-cache: Avoids caching to reduce Docker image size
- -r /tmp/requirements.txt: Specifies the requirements file path
The uv tool is significantly faster than standard pip and provides deterministic dependency resolution, making builds more reliable and reproducible.
Sources: Dockerfile:16-17
Version Requirements
The minimum version constraints specified in tools/requirements.txt:1-3 ensure compatibility with required features:
requests >= 2.31.0
This version requirement ensures:
- Security fixes : Addresses CVE-2023-32681 (Proxy-Authorization header leakage)
- Session improvements : Enhanced connection pooling and retry mechanisms
- HTTP/2 support : Better performance for multiple requests
The codebase relies on stable Session API behavior introduced in 2.x releases.
beautifulsoup4 >= 4.12.0
This version requirement ensures:
- Python 3.12 compatibility : Required for the base image
python:3.12-slim - Parser stability : Consistent behavior with
html.parserbackend - Security updates : Protection against XML parsing vulnerabilities
The codebase uses standard find/select methods that are stable across 4.x versions.
html2text >= 2020.1.16
This version requirement ensures:
- Python 3 compatibility : Earlier versions targeted Python 2.7
- Markdown formatting fixes : Improved handling of nested lists and code blocks
- Link preservation : Proper conversion of HTML links to Markdown syntax
The codebase uses the body_width=0 configuration which was stabilized in this version.
Sources: tools/requirements.txt:1-3
Import Locations
All three dependencies are imported at the top of deepwiki-scraper.py:
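Based on the import locations cited in the sections above (lines 17-19 of the script), the imports are:

```python
import requests
from bs4 import BeautifulSoup
import html2text
```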
These are the only external dependencies required by the Python layer. The script uses only standard library modules for all other functionality (sys, re, time, pathlib, urllib.parse).
Sources: tools/deepwiki-scraper.py:17-19