This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
Purpose and Scope
This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system’s purpose, core capabilities, and high-level architecture.
For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.
Sources: README.md:1-3
Problem Statement
DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The system addresses the following limitations:
| Problem | Solution |
|---|---|
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook’s built-in search |
| Platform-specific formatting | Conversion to standard Markdown |
Sources: README.md:3-15
Core Capabilities
The system provides the following capabilities through environment variable configuration:
- Generic Repository Support: Works with any GitHub repository indexed by DeepWiki via the `REPO` environment variable
- Auto-Detection: Extracts repository metadata from Git remotes when available
- Hierarchy Preservation: Maintains wiki page numbering and section structure
- Diagram Intelligence: Extracts ~461 total diagrams, matches ~48 with sufficient context using fuzzy matching
- Dual Output Modes: Full mdBook build or markdown-only extraction via the `MARKDOWN_ONLY` flag
- No Authentication: Public HTTP scraping without API keys or credentials
- Containerized Deployment: Single Docker image with all dependencies
Sources: README.md:5-15 README.md:42-51
System Components
The system consists of three primary executable components coordinated by a shell orchestrator. The following diagram maps user interaction to specific code entities:
Component Architecture with Code Entities
```mermaid
graph TB
DockerRun["docker run\nwith env vars"]
subgraph Container["/usr/local/bin/ executables"]
BuildDocs["build-docs.sh"]
Scraper["deepwiki-scraper.py\nmain()\nscrape_wiki()\nextract_mermaid_from_nextjs_data()"]
MdBook["mdbook binary\n(Rust)"]
MermaidPlugin["mdbook-mermaid binary\n(Rust)"]
end
subgraph External["External HTTP Endpoints"]
DeepWikiAPI["deepwiki.com/$REPO"]
GitHubEdit["github.com/$REPO/edit/"]
end
subgraph OutputVol["/output volume mount"]
MarkdownDir["markdown/\nnumbered .md files"]
RawMarkdownDir["raw_markdown/\npre-enhancement .md"]
BookDir["book/\nindex.html + search"]
ConfigFile["book.toml"]
SummaryFile["SUMMARY.md"]
end
DockerRun -->|REPO BOOK_TITLE BOOK_AUTHORS MARKDOWN_ONLY| BuildDocs
BuildDocs -->|python3 deepwiki-scraper.py| Scraper
BuildDocs -->|mdbook init| MdBook
BuildDocs -->|mdbook build| MdBook
BuildDocs -->|generates| ConfigFile
BuildDocs -->|generates| SummaryFile
Scraper -->|requests.get| DeepWikiAPI
Scraper -->|writes| RawMarkdownDir
Scraper -->|writes enhanced| MarkdownDir
MdBook -->|preprocessor chain| MermaidPlugin
MdBook -->|generates| BookDir
BookDir -.->|edit links point to| GitHubEdit
```
Executable Components
| Component | Type | Entry Point | Key Operations |
|---|---|---|---|
build-docs.sh | Shell script | CMD in Dockerfile | Parse $REPO, $BOOK_TITLE, generate book.toml, invoke Python and Rust tools |
deepwiki-scraper.py | Python 3.12 module | main() function | scrape_wiki(), extract_mermaid_from_nextjs_data(), inject_mermaid_diagrams_into_markdown() |
mdbook | Rust binary | CLI invocation | mdbook init, mdbook build with book.toml configuration |
mdbook-mermaid | Rust preprocessor | mdBook plugin chain | Asset injection for Mermaid.js runtime |
Sources: README.md:1-27 README.md:84-88 Diagram 1, Diagram 3
Processing Pipeline
The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable. Each phase invokes specific executables and functions:
Three-Phase Execution Flow with Code Entities
```mermaid
stateDiagram-v2
[*] --> ParseEnv : build-docs.sh reads env
state ParseEnv {
[*] --> ReadREPO : $REPO
ReadREPO --> ReadBOOKTITLE : $BOOK_TITLE
ReadBOOKTITLE --> ReadMARKDOWNONLY : $MARKDOWN_ONLY
ReadMARKDOWNONLY --> [*]
}
ParseEnv --> Phase1 : python3 deepwiki-scraper.py
state Phase1 {
[*] --> scrape_wiki
scrape_wiki --> BeautifulSoup4 : parse HTML
BeautifulSoup4 --> html2text : convert to .md
html2text --> extract_mermaid : extract_mermaid_from_nextjs_data()
extract_mermaid --> normalize_mermaid : 7-step normalization
normalize_mermaid --> inject_diagrams : inject_mermaid_diagrams_into_markdown()
inject_diagrams --> write_files : /output/markdown/*.md
write_files --> [*]
}
Phase1 --> CheckMode
state CheckMode <<choice>>
CheckMode --> Phase2 : if MARKDOWN_ONLY != true
CheckMode --> Exit : if MARKDOWN_ONLY == true
Phase2 --> GenerateBookToml : build-docs.sh writes book.toml
GenerateBookToml --> GenerateSummary : build-docs.sh writes SUMMARY.md
GenerateSummary --> Phase3
state Phase3 {
[*] --> mdbook_init : mdbook init
mdbook_init --> mdbook_mermaid_install : mdbook-mermaid install
mdbook_mermaid_install --> mdbook_build : mdbook build
mdbook_build --> [*] : /output/book/
}
Phase3 --> Exit
Exit --> [*]
```
Phase Execution Details
| Phase | Primary Executable | Key Functions/Commands | Artifacts |
|---|---|---|---|
| 1: Extract | deepwiki-scraper.py | scrape_wiki(), extract_mermaid_from_nextjs_data(), normalize_mermaid_code(), inject_mermaid_diagrams_into_markdown() | /output/markdown/*.md, /output/raw_markdown/*.md |
| 2: Configure | build-docs.sh | Template string generation for book.toml and SUMMARY.md | /output/book.toml, /output/SUMMARY.md |
| 3: Build | mdbook, mdbook-mermaid | mdbook init, mdbook-mermaid install, mdbook build | /output/book/index.html, /output/book/searchindex.json |
Sources: README.md:72-77 Diagram 2, Diagram 4
Input and Output
Input Requirements
| Input | Format | Source | Example |
|---|---|---|---|
REPO | owner/repo | Environment variable | facebook/react |
BOOK_TITLE | String | Environment variable (optional) | React Documentation |
BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
MARKDOWN_ONLY | true/false | Environment variable (optional) | false |
Sources: README.md:42-51
Output Artifacts
Full Build Mode (MARKDOWN_ONLY=false or unset):
```
output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml
```
Markdown-Only Mode (MARKDOWN_ONLY=true):
```
output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...
```
Sources: README.md:89-119
Technical Stack
The system combines multiple technology stacks in a single container using Docker multi-stage builds:
Runtime Dependencies
| Component | Version | Purpose | Installation Method |
|---|---|---|---|
| Python | 3.12-slim | Scraping runtime | Base image |
requests | Latest | HTTP client | uv pip install |
beautifulsoup4 | Latest | HTML parser | uv pip install |
html2text | Latest | HTML to Markdown | uv pip install |
mdbook | Latest | Documentation builder | Compiled from source (Rust) |
mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |
Build Architecture
The Dockerfile uses a two-stage build:
- Stage 1 (`rust:latest`): Compiles `mdbook` and `mdbook-mermaid` binaries (~1.5 GB, discarded)
- Stage 2 (`python:3.12-slim`): Copies the binaries into the Python runtime (~300-400 MB final)
Sources: README.md:146-157 Diagram 3
```mermaid
graph TB
subgraph HostFS["Host Filesystem"]
HostOutput["./output/\n(bind mount)"]
end
subgraph ContainerFS["Container Filesystem"]
BuildScript["/usr/local/bin/build-docs.sh"]
ScraperScript["/usr/local/bin/deepwiki-scraper.py"]
TmpWiki["/tmp/wiki_temp/\n(write buffer)"]
OutputMount["/output/\n(volume mount)"]
WorkspaceTemplates["/workspace/templates/\nheader.html, footer.html"]
end
subgraph WriteOperations["File Write Operations"]
WriteRaw["write_markdown_file()\nraw .md to /tmp"]
WriteEnhanced["write enhanced .md\nafter diagram injection"]
AtomicMove["shutil.move()\nor mv command"]
CopyBook["cp -r mdbook_output/"]
end
subgraph OutputStructure["/output/ final structure"]
OutMarkdown["/output/markdown/"]
OutRaw["/output/raw_markdown/"]
OutBook["/output/book/"]
OutConfig["/output/book.toml"]
end
HostOutput -.->|-v bind mount| OutputMount
ScraperScript -->|Phase 1| WriteRaw
WriteRaw --> TmpWiki
ScraperScript --> WriteEnhanced
WriteEnhanced --> TmpWiki
TmpWiki -->|atomic| AtomicMove
AtomicMove --> OutMarkdown
AtomicMove --> OutRaw
BuildScript -->|Phase 2| OutConfig
BuildScript -->|Phase 3| CopyBook
CopyBook --> OutBook
WorkspaceTemplates -.->|process-template.py reads| BuildScript
OutMarkdown --> OutputMount
OutRaw --> OutputMount
OutBook --> OutputMount
OutConfig --> OutputMount
```
File System Interaction
The system uses a temporary directory pattern to ensure atomic writes to the output volume:
Filesystem Write Pattern
Write Sequence
1. `deepwiki-scraper.py` writes raw markdown to `/tmp/wiki_temp/` using the `write_markdown_file()` function
2. After diagram injection via `inject_mermaid_diagrams_into_markdown()`, enhanced markdown moves to `/output/markdown/`
3. `build-docs.sh` generates `/output/book.toml` from environment variables
4. `mdbook build` writes HTML to an internal directory, which `build-docs.sh` copies to `/output/book/`
This pattern ensures atomicity: partial writes never appear in /output/.
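As a rough illustration, the pattern looks like the following Python sketch (directory names from the write sequence above; the helper name and copy logic are illustrative, not the scraper's actual code):

```python
import shutil
import tempfile
from pathlib import Path

def publish_markdown(tmp_dir: Path, output_dir: Path) -> None:
    """Stage files outside the final path, then move them in with one rename."""
    final = output_dir / "markdown"
    staging = Path(tempfile.mkdtemp(dir=output_dir))  # same filesystem => atomic rename
    for md in tmp_dir.glob("**/*.md"):
        dest = staging / md.relative_to(tmp_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(md, dest)                        # partial writes stay in staging
    if final.exists():
        shutil.rmtree(final)                          # clear stale output first
    shutil.move(str(staging), str(final))             # readers never see a half-written tree
```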
Sources: README.md:19-26 README.md:54-58 Diagram 3
Configuration Philosophy
The system operates on three configuration principles:
- Environment-Driven: All customization via environment variables, no file editing required
- Auto-Detection: Intelligent defaults from Git remotes (repository URL, author name)
- Zero-Configuration: Minimal required inputs (`REPO`, or auto-detect from the current directory)
Minimal Example :
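A representative invocation might look like this (the `deepwiki-to-mdbook` image tag is the one used in the Quick Start; substitute your own repository):

```sh
docker run --rm \
  -e REPO=owner/repo \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```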
This single command triggers the complete extraction, transformation, and build pipeline.
For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.
Sources: README.md:22-51 README.md:220-227
Quick Start
This page provides step-by-step instructions for basic usage of the DeepWiki-to-mdBook converter. It covers Docker image building, container execution with environment variables, output inspection, and local serving. For comprehensive configuration options, see Configuration Reference. For understanding the internal pipeline, see System Architecture.
Prerequisites
Before starting, ensure you have the following installed:
- Docker (version 20.10 or later)
- Git (for repository cloning)
- Python 3 (for local serving)
The system requires no additional dependencies on the host machine; all build tools and Python packages are contained within the Docker image.
Sources: README.md:1-95
Basic Workflow
The typical workflow involves three steps: building the Docker image, running the container to generate documentation, and serving the output locally for preview.
```mermaid
flowchart TD
Start["User initiates workflow"]
Clone["Clone repository\ngit clone"]
Build["Build Docker image\ndocker build -t deepwiki-to-mdbook ."]
Run["Run container with config\ndocker run --rm -e REPO=... -v ..."]
subgraph Container["Docker Container Execution"]
BuildScript["build-docs.sh"]
Scraper["deepwiki-scraper.py"]
Process["process-template.py"]
MdBook["mdbook build"]
end
Output["Output directory\n/output mounted volume"]
Serve["Serve locally\npython3 -m http.server"]
View["View in browser\nhttp://localhost:8000"]
Start --> Clone
Clone --> Build
Build --> Run
Run --> Container
Container --> BuildScript
BuildScript --> Scraper
BuildScript --> Process
BuildScript --> MdBook
MdBook --> Output
Output --> Serve
Serve --> View
```
Workflow Diagram
Sources: README.md:12-29
Step 1: Build the Docker Image
Navigate to the repository root and build the Docker image:
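Assuming the image tag used throughout this guide:

```sh
docker build -t deepwiki-to-mdbook .
```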
This command:
- Reads the multi-stage `Dockerfile` at the repository root
- Compiles `mdbook` and `mdbook-mermaid` from source in the Rust builder stage
- Installs Python 3.12 dependencies in the final stage
- Tags the resulting image as `deepwiki-to-mdbook`
The build process typically takes 5-10 minutes on the first run due to Rust compilation. Subsequent builds use Docker layer caching.
Sources: README.md:14-16
Step 2: Run the Container
Execute the container with required environment variables and a volume mount for output:
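A typical invocation, with `facebook/react` standing in for your target repository:

```sh
docker run --rm \
  -e REPO=facebook/react \
  -e BOOK_TITLE="React Documentation" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```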
Environment Variable Configuration
The container accepts configuration exclusively through environment variables:
| Variable | Required | Description | Default |
|---|---|---|---|
REPO | No* | GitHub repository (owner/repo format) | Auto-detected from git remote |
BOOK_TITLE | No | Documentation title | “Documentation” |
BOOK_AUTHORS | No | Author names | Repository owner |
MARKDOWN_ONLY | No | Set to “true” to skip HTML build | false |
*The REPO variable is auto-detected if the container is run in a Git repository context. For manual execution, it should be explicitly provided.
Volume Mount
The -v "$(pwd)/output:/output" mount maps the host’s ./output directory to the container’s /output directory. All generated artifacts are written here.
Sources: README.md:18-27 README.md:31-51
Step 3: Serve and View Output
After the container completes execution, serve the generated HTML documentation locally:
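One way to do this with Python's standard library (port and directory as described below):

```sh
cd output
python3 -m http.server 8000 --directory book
```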
This command:
- Changes to the `output` directory
- Starts Python's built-in HTTP server
- Serves files from the `book` subdirectory
- Listens on port 8000
Open http://localhost:8000 in a web browser to view the searchable documentation.
Output Directory Structure
The container generates four output artifacts:
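At a high level (see the full tree in the Overview's Output Artifacts section):

```
output/
├── book/           # final HTML site (serve this directory)
├── markdown/       # enhanced markdown sources
├── raw_markdown/   # pre-enhancement markdown (debugging)
└── book.toml       # generated mdBook configuration
```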
Directory Descriptions:
- `book/`: The final HTML output generated by `mdbook build`. This is the directory to serve for viewing documentation.
- `markdown/`: Enhanced markdown files after diagram injection and template processing. Contains the source files used by mdBook.
- `raw_markdown/`: Markdown files immediately after HTML-to-markdown conversion, before diagram enhancement. Useful for debugging the extraction phase.
- `book.toml`: The mdBook configuration file generated from environment variables.
Sources: README.md:26-27 README.md:53-58
```mermaid
flowchart LR
CMD["Container CMD\nbuild-docs.sh"]
subgraph Phase1["Phase 1: Extraction"]
Fetch["fetch_and_convert_to_markdown()\ndeepwiki-scraper.py"]
RawOut["raw_markdown/\ndirectory"]
end
subgraph Phase2["Phase 2: Enhancement"]
Extract["extract_mermaid_from_nextjs_data()\ndeepwiki-scraper.py"]
Normalize["normalize_mermaid_diagram()\ndeepwiki-scraper.py"]
Inject["inject_mermaid_diagrams()\ndeepwiki-scraper.py"]
Templates["process-template.py\n--input templates/header.html"]
MdOut["markdown/\ndirectory"]
end
subgraph Phase3["Phase 3: Build"]
Summary["generate_summary()\nbuild-docs.sh"]
BookInit["mdbook init\nbinary"]
BookBuild["mdbook build\nbinary"]
BookOut["book/\ndirectory"]
end
CMD --> Fetch
Fetch --> RawOut
RawOut --> Extract
Extract --> Normalize
Normalize --> Inject
Inject --> Templates
Templates --> MdOut
MdOut --> Summary
Summary --> BookInit
BookInit --> BookBuild
BookBuild --> BookOut
```
Container Execution Flow
The following diagram maps the container’s internal execution to specific code entities:
This diagram shows the execution path through the three phases of the pipeline, with references to actual functions and binaries. The build-docs.sh script orchestrates all three phases sequentially. For detailed information on each phase, see Three-Phase Pipeline.
Sources: README.md:72-77
Common Usage Patterns
Pattern 1: Custom Book Title and Authors
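For example (values illustrative):

```sh
docker run --rm \
  -e REPO=microsoft/vscode \
  -e BOOK_TITLE="VS Code Internals" \
  -e BOOK_AUTHORS="Microsoft" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```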
Pattern 2: Markdown-Only Mode
Skip the HTML build to inspect or further process the markdown files:
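For example:

```sh
docker run --rm \
  -e REPO=facebook/react \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```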
The markdown/ directory will contain all enhanced markdown files, but the book/ directory will not be created. See Markdown-Only Mode for more details.
Pattern 3: Custom Templates
Provide custom header and footer templates by mounting a templates directory:
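A sketch of the mount, assuming the container reads templates from `/workspace/templates/` as described in System Architecture:

```sh
docker run --rm \
  -e REPO=facebook/react \
  -v "$(pwd)/output:/output" \
  -v "$(pwd)/my-templates:/workspace/templates" \
  deepwiki-to-mdbook
```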
Your my-templates/ directory should contain header.html and/or footer.html. See Template System for template variable syntax and Custom Templates for a comprehensive guide.
Sources: README.md:39-51
Minimal Example
For a minimal working example with auto-detected repository:
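Omitting `REPO` entirely (this assumes the current repository's Git metadata is visible to the container):

```sh
docker run --rm -v "$(pwd)/output:/output" deepwiki-to-mdbook
```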
The REPO variable is auto-detected from git remote get-url origin, and BOOK_AUTHORS defaults to the repository owner.
Sources: README.md:12-29 README.md34
Verification Steps
After running the container, verify successful execution:
- Check container logs: The container should print progress messages for each phase.
- Inspect output directory: Ensure all four artifacts exist (`book/`, `markdown/`, `raw_markdown/`, `book.toml`).
- Verify HTML structure: `book/index.html` should exist and contain the search interface.
- Test local serving: The HTTP server should start without errors.
- Browse documentation: Navigate to `http://localhost:8000` and verify page rendering and search functionality.
Next Steps
After completing the Quick Start:
- Customize configuration: See Configuration Reference for all environment variables and options.
- Understand the pipeline: See System Architecture for how the system transforms DeepWiki content.
- Use in CI/CD: See GitHub Action to integrate into automated workflows.
- Customize templates: See Template System Details to brand your documentation.
Sources: README.md:1-95
Configuration Reference
This document provides a comprehensive reference for all configuration options available in the DeepWiki-to-mdBook Converter system. It covers environment variables, their default values, validation logic, auto-detection features, and how configuration flows through the system components.
For information about running the system with these configurations, see Quick Start. For details on how auto-detection works internally, see Auto-Detection Features.
Configuration System Overview
The DeepWiki-to-mdBook Converter uses environment variables as its sole configuration mechanism. All configuration is processed by the build-docs.sh orchestrator script at runtime, with no configuration files required. The system provides intelligent defaults and auto-detection capabilities to minimize required configuration.
Configuration Flow Diagram
```mermaid
flowchart TD
User["User/CI System"]
Docker["docker run -e VAR=value"]
subgraph "build-docs.sh Configuration Processing"
AutoDetect["Git Auto-Detection\n[build-docs.sh:8-19]"]
ParseEnv["Environment Variable Parsing\n[build-docs.sh:21-26]"]
Defaults["Default Value Assignment\n[build-docs.sh:43-45]"]
Validate["Validation\n[build-docs.sh:32-37]"]
end
subgraph "Configuration Consumers"
Scraper["deepwiki-scraper.py\nREPO parameter"]
BookToml["book.toml Generation\n[build-docs.sh:85-103]"]
SummaryGen["SUMMARY.md Generation\n[build-docs.sh:113-159]"]
end
User -->|Set environment variables| Docker
Docker -->|Container startup| AutoDetect
AutoDetect -->|REPO detection| ParseEnv
ParseEnv -->|Parse all vars| Defaults
Defaults -->|Apply defaults| Validate
Validate -->|REPO validated| Scraper
Validate -->|BOOK_TITLE, BOOK_AUTHORS, GIT_REPO_URL| BookToml
Validate -->|No direct config needed| SummaryGen
```
Sources: build-docs.sh:1-206 README.md:41-51
Environment Variables Reference
The following table lists all environment variables supported by the system:
| Variable | Type | Required | Default | Description |
|---|---|---|---|---|
REPO | String | Conditional | Auto-detected from Git remote | GitHub repository in owner/repo format. Required if not running in a Git repository with a GitHub remote. |
BOOK_TITLE | String | No | "Documentation" | Title displayed in the generated mdBook documentation. Used in book.toml title field. |
BOOK_AUTHORS | String | No | Repository owner (from REPO) | Author name(s) displayed in the documentation. Used in book.toml authors array. |
GIT_REPO_URL | String | No | https://github.com/{REPO} | Full GitHub repository URL. Used for “Edit this page” links in mdBook output. |
MARKDOWN_ONLY | Boolean | No | "false" | When "true", skips Phase 3 (mdBook build) and outputs only extracted Markdown files. Useful for debugging. |
Sources: build-docs.sh:21-26 README.md:44-51
Variable Details and Usage
REPO
Format: owner/repo (e.g., "facebook/react" or "microsoft/vscode")
Purpose: Identifies the GitHub repository to scrape from DeepWiki.com. This is the primary configuration variable that drives the entire system.
```mermaid
flowchart TD
Start["build-docs.sh Startup"]
CheckEnv{"REPO environment\nvariable set?"}
UseEnv["Use provided REPO value\n[build-docs.sh:22]"]
CheckGit{"Git repository\ndetected?"}
GetRemote["Execute: git config --get\nremote.origin.url\n[build-docs.sh:12]"]
ParseURL["Extract owner/repo using regex:\n.*github\\.com[:/]([^/]+/[^/\\.]+)\n[build-docs.sh:16]"]
SetRepo["Set REPO variable\n[build-docs.sh:16]"]
ValidateRepo{"REPO is set?"}
Error["Exit with error\n[build-docs.sh:33-37]"]
Continue["Continue with\nREPO=$REPO_OWNER/$REPO_NAME"]
Start --> CheckEnv
CheckEnv -->|Yes| UseEnv
CheckEnv -->|No| CheckGit
CheckGit -->|Yes| GetRemote
CheckGit -->|No| ValidateRepo
GetRemote --> ParseURL
ParseURL --> SetRepo
UseEnv --> ValidateRepo
SetRepo --> ValidateRepo
ValidateRepo -->|No| Error
ValidateRepo -->|Yes| Continue
```
Auto-Detection Logic:
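A hedged reconstruction of that logic, based on the commands shown in the diagram above:

```sh
if [ -z "$REPO" ]; then
  REMOTE_URL=$(git config --get remote.origin.url 2>/dev/null || true)
  # Works for both HTTPS and SSH GitHub remotes
  REPO=$(echo "$REMOTE_URL" | sed -E 's#.*github\.com[:/]([^/]+/[^/\.]+).*#\1#')
fi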
Sources: build-docs.sh:8-37
Validation: The system exits with an error if REPO is not set and cannot be auto-detected:

```
ERROR: REPO must be set or run from within a Git repository with a GitHub remote
Usage: REPO=owner/repo $0
```
Usage in System:
- Passed as the first argument to `deepwiki-scraper.py` build-docs.sh58
- Used to derive `REPO_OWNER` and `REPO_NAME` build-docs.sh:40-41
- Used to construct the `GIT_REPO_URL` default build-docs.sh45
BOOK_TITLE
Default: "Documentation"
Purpose: Sets the title of the generated mdBook documentation. This appears in the browser tab, navigation header, and book metadata.
Usage: Injected into book.toml configuration file build-docs.sh87:
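Within the generated heredoc, the title line presumably resembles:

```toml
[book]
title = "${BOOK_TITLE}"
```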
Examples:
- `BOOK_TITLE="React Documentation"`
- `BOOK_TITLE="VS Code Internals"`
- `BOOK_TITLE="Apache Arrow DataFusion Developer Guide"`
Sources: build-docs.sh23 build-docs.sh87
BOOK_AUTHORS
Default: Repository owner extracted from REPO
Purpose: Sets the author name(s) in the mdBook documentation metadata.
Default Assignment Logic: build-docs.sh44
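In shell parameter-expansion form (a hedged reconstruction):

```sh
BOOK_AUTHORS="${BOOK_AUTHORS:-$REPO_OWNER}"
```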
This uses shell parameter expansion to set BOOK_AUTHORS to REPO_OWNER only if BOOK_AUTHORS is unset or empty.
Usage: Injected into book.toml as an array build-docs.sh88:
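The corresponding heredoc line is presumably:

```toml
authors = ["${BOOK_AUTHORS}"]
```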
Examples:
- If `REPO="facebook/react"` and `BOOK_AUTHORS` is not set → `BOOK_AUTHORS="facebook"`
- Explicitly set: `BOOK_AUTHORS="Meta Open Source"`
- Multiple authors: `BOOK_AUTHORS="John Doe, Jane Smith"` (rendered as a single string in the array)
Sources: build-docs.sh24 build-docs.sh44 build-docs.sh88
GIT_REPO_URL
Default: https://github.com/{REPO}
Purpose: Provides the full GitHub repository URL used for “Edit this page” links in the generated mdBook documentation. Each page includes a link back to the source repository.
Default Assignment Logic: build-docs.sh45
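In parameter-expansion form (a hedged reconstruction):

```sh
GIT_REPO_URL="${GIT_REPO_URL:-https://github.com/$REPO}"
```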
Usage: Injected into book.toml configuration build-docs.sh95:
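Presumably rendered in `book.toml` as:

```toml
[output.html]
git-repository-url = "${GIT_REPO_URL}"
```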
Notes:
- mdBook automatically appends `/edit/main/` or similar paths based on its heuristics
- The URL must be a valid Git repository URL for the edit links to work correctly
- Can be overridden for non-standard Git hosting scenarios
Sources: build-docs.sh25 build-docs.sh45 build-docs.sh95
MARKDOWN_ONLY
Default: "false"
Type: Boolean string ("true" or "false")
Purpose: Controls whether the system executes the full three-phase pipeline or stops after Phase 2 (Markdown extraction with diagram enhancement). When set to "true", Phase 3 (mdBook build) is skipped.
```mermaid
flowchart TD
Start["build-docs.sh Execution"]
Phase1["Phase 1: Scrape & Extract\n[build-docs.sh:56-58]"]
Phase2["Phase 2: Enhance Diagrams\n(within deepwiki-scraper.py)"]
CheckMode{"MARKDOWN_ONLY\n== 'true'?\n[build-docs.sh:61]"}
CopyMD["Copy markdown to /output/markdown\n[build-docs.sh:64-65]"]
ExitEarly["Exit (skipping mdBook build)\n[build-docs.sh:75]"]
Phase3Init["Phase 3: Initialize mdBook\n[build-docs.sh:79-106]"]
BuildBook["Build HTML documentation\n[build-docs.sh:176]"]
CopyAll["Copy all outputs\n[build-docs.sh:179-191]"]
Start --> Phase1
Phase1 --> Phase2
Phase2 --> CheckMode
CheckMode -->|Yes| CopyMD
CopyMD --> ExitEarly
CheckMode -->|No| Phase3Init
Phase3Init --> BuildBook
BuildBook --> CopyAll
style ExitEarly fill:#ffebee
style CopyAll fill:#e8f5e9
```
Execution Flow with MARKDOWN_ONLY:
Sources: build-docs.sh26 build-docs.sh:61-76
Use Cases:
- Debugging diagram placement: Quickly iterate on diagram matching without waiting for mdBook build
- Markdown-only extraction: When you only need the Markdown source files
- Faster feedback loops: mdBook build adds significant time; skipping it speeds up testing
- Custom processing: Extract Markdown for processing with different documentation tools
Output Differences:
| Mode | Output Directory Structure |
|---|---|
| `MARKDOWN_ONLY="false"` (default) | `/output/book/` (HTML site), `/output/markdown/` (source), `/output/book.toml` (config) |
| `MARKDOWN_ONLY="true"` | `/output/markdown/` (source only) |
Performance Impact: Markdown-only mode is approximately 3-5x faster, as it skips:
- mdBook initialization build-docs.sh:79-106
- SUMMARY.md generation build-docs.sh:109-159
- File copying to book/src build-docs.sh:164-166
- mdbook-mermaid asset installation build-docs.sh:169-171
- mdBook HTML build build-docs.sh:174-176
Sources: build-docs.sh:61-76 README.md:55-76
Internal Configuration Variables
These variables are derived or used internally and are not meant to be configured by users:
| Variable | Source | Purpose |
|---|---|---|
WORK_DIR | Hard-coded: /workspace build-docs.sh27 | Temporary working directory inside container |
WIKI_DIR | Derived: $WORK_DIR/wiki build-docs.sh28 | Directory where deepwiki-scraper.py outputs Markdown |
OUTPUT_DIR | Hard-coded: /output build-docs.sh29 | Container output directory (mounted as volume) |
BOOK_DIR | Derived: $WORK_DIR/book build-docs.sh30 | mdBook project directory |
REPO_OWNER | Extracted from REPO build-docs.sh40 | First component of owner/repo |
REPO_NAME | Extracted from REPO build-docs.sh41 | Second component of owner/repo |
Sources: build-docs.sh:27-30 build-docs.sh:40-41
Configuration Precedence and Inheritance
The system follows this precedence order for configuration values:
Sources: build-docs.sh:8-45
Example Scenarios:

1. User provides all values (as in the Full Custom Configuration example below): all explicit values are used; no auto-detection occurs.
2. User provides only `REPO`:
   - `REPO`: `"facebook/react"` (explicit)
   - `BOOK_TITLE`: `"Documentation"` (default)
   - `BOOK_AUTHORS`: `"facebook"` (derived from `REPO`)
   - `GIT_REPO_URL`: `"https://github.com/facebook/react"` (derived)
   - `MARKDOWN_ONLY`: `"false"` (default)
3. User provides no values in a Git repository:
   - `REPO`: Auto-detected from `git config --get remote.origin.url`
   - All other values derived or defaulted as above
Generated Configuration Files
The system generates configuration files dynamically based on environment variables:
book.toml
Location: Created at `$BOOK_DIR/book.toml` build-docs.sh85 and copied to `/output/book.toml` build-docs.sh191
Template Structure:
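A hedged reconstruction, assembled from the hard-coded values listed below and the substitution mapping table:

```toml
[book]
title = "${BOOK_TITLE}"
authors = ["${BOOK_AUTHORS}"]
language = "en"

[output.html]
default-theme = "rust"
git-repository-url = "${GIT_REPO_URL}"

[output.html.fold]
enable = true
level = 1

[preprocessor.mermaid]
command = "mdbook-mermaid"
```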
Sources: build-docs.sh:85-103
Variable Substitution Mapping:
| Template Variable | Environment Variable | Section |
|---|---|---|
${BOOK_TITLE} | $BOOK_TITLE | [book] |
${BOOK_AUTHORS} | $BOOK_AUTHORS | [book] |
${GIT_REPO_URL} | $GIT_REPO_URL | [output.html] |
Hard-Coded Values:
- `language = "en"` build-docs.sh89
- `default-theme = "rust"` build-docs.sh94
- `[preprocessor.mermaid]` configuration build-docs.sh:97-98
- Sidebar folding enabled at level 1 build-docs.sh:100-102
SUMMARY.md
Location: Created at $BOOK_DIR/src/SUMMARY.md build-docs.sh159
Generation: Automatically generated from file structure in $WIKI_DIR, no direct environment variable input. See SUMMARY.md Generation for details.
Sources: build-docs.sh:109-159
Configuration Examples
Minimal Configuration
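```sh
docker run --rm -e REPO=owner/repo -v "$(pwd)/output:/output" deepwiki-to-mdbook
```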
Results:
- `REPO`: `"owner/repo"`
- `BOOK_TITLE`: `"Documentation"`
- `BOOK_AUTHORS`: `"owner"`
- `GIT_REPO_URL`: `"https://github.com/owner/repo"`
- `MARKDOWN_ONLY`: `"false"`
Full Custom Configuration
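```sh
docker run --rm \
  -e REPO=facebook/react \
  -e BOOK_TITLE="React Documentation" \
  -e BOOK_AUTHORS="Meta Open Source" \
  -e GIT_REPO_URL="https://github.com/facebook/react" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```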
Auto-Detected Configuration
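Run without `REPO`, assuming the repository's Git metadata is visible to the container:

```sh
cd /path/to/your/repo
docker run --rm -v "$(pwd)/output:/output" deepwiki-to-mdbook
```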
Note: This only works if the current directory is a Git repository with a GitHub remote URL configured.
Debugging Configuration
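```sh
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```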
Outputs only Markdown files to /output/markdown/, skipping the mdBook build phase.
Sources: README.md:28-88
Configuration Validation
The system performs validation on the REPO variable build-docs.sh:32-37:
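A hedged reconstruction, matching the error message shown under `REPO` above:

```sh
if [ -z "$REPO" ]; then
  echo "ERROR: REPO must be set or run from within a Git repository with a GitHub remote"
  echo "Usage: REPO=owner/repo $0"
  exit 1
fi
```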
Validation Rules:
- `REPO` must be non-empty after auto-detection
- No format validation is performed on the `REPO` value (e.g., the `owner/repo` pattern)
- Invalid `REPO` values cause failures during the scraping phase, not during validation

Other Variables:
- No validation is performed on `BOOK_TITLE`, `BOOK_AUTHORS`, or `GIT_REPO_URL`
- `MARKDOWN_ONLY` is not validated; any value other than `"true"` is treated as false
Sources: build-docs.sh:32-37
Configuration Debugging
To debug configuration values, check the console output at startup build-docs.sh:47-53:
```
Configuration:
  Repository:   facebook/react
  Book Title:   React Documentation
  Authors:      Meta Open Source
  Git Repo URL: https://github.com/facebook/react
  Markdown Only: false
```
This output shows the final resolved configuration values after auto-detection, derivation, and defaults are applied.
Sources: build-docs.sh:47-53
System Architecture
This page provides a comprehensive overview of the DeepWiki-to-mdBook converter’s architecture, including its component organization, execution model, and data flow patterns. The system is designed as a containerized pipeline that transforms DeepWiki content into searchable mdBook documentation through three distinct processing phases.
For detailed information about the three-phase transformation pipeline, see Three-Phase Pipeline. For Docker-specific implementation details, see Docker Multi-Stage Build.
Architectural Overview
The system follows a pipeline architecture with three sequential phases, orchestrated by a shell script and executed within a Docker container. All components are stateless and communicate through the filesystem, with no external dependencies required at runtime.
```mermaid
graph TB
subgraph Docker["Docker Container (python:3.12-slim)"]
subgraph Executables["/usr/local/bin/"]
BuildScript["build-docs.sh"]
Scraper["deepwiki-scraper.py"]
TemplateProc["process-template.py"]
mdBook["mdbook"]
mdBookMermaid["mdbook-mermaid"]
end
subgraph Workspace["/workspace/"]
Templates["templates/\nheader.html\nfooter.html"]
WorkingDirs["wiki/\nraw_markdown/\nbook/"]
end
subgraph Output["/output/ (Volume Mount)"]
BookHTML["book/"]
MarkdownSrc["markdown/"]
RawMarkdown["raw_markdown/"]
BookConfig["book.toml"]
end
end
subgraph External["External Dependencies"]
DeepWiki["deepwiki.com"]
GitRemote["git remote"]
end
BuildScript -->|executes| Scraper
BuildScript -->|executes| TemplateProc
BuildScript -->|executes| mdBook
BuildScript -->|executes| mdBookMermaid
Scraper -->|writes| WorkingDirs
TemplateProc -->|reads| Templates
TemplateProc -->|outputs HTML| BuildScript
mdBook -->|reads| WorkingDirs
mdBook -->|writes| BookHTML
BuildScript -->|copies to| Output
Scraper -->|HTTP GET| DeepWiki
BuildScript -->|auto-detect| GitRemote
style Executables fill:#f0f0f0
style Workspace fill:#f0f0f0
style Output fill:#e8f5e9
```
System Composition
Diagram: Container Internal Structure and Component Relationships
The container is structured into three main areas: executables in /usr/local/bin/, working files in /workspace/, and outputs in /output/. The build-docs.sh orchestrator coordinates all components, with persistent results written to the mounted volume.
Sources: Dockerfile:1-34 scripts/build-docs.sh:1-310
Core Components
The system consists of five primary components, each with a specific responsibility in the documentation generation pipeline.
| Component | Type | Location | Primary Responsibility |
|---|---|---|---|
build-docs.sh | Shell Script | /usr/local/bin/ | Pipeline orchestration and configuration management |
deepwiki-scraper.py | Python Script | /usr/local/bin/ | Wiki content extraction and markdown conversion |
process-template.py | Python Script | /usr/local/bin/ | Template variable substitution |
mdbook | Rust Binary | /usr/local/bin/ | HTML documentation generation |
mdbook-mermaid | Rust Binary | /usr/local/bin/ | Mermaid diagram rendering |
Component Interaction Map
Diagram: Component Interaction and Data Flow
This diagram shows how build-docs.sh coordinates the three processing components sequentially, with data flowing through working directories before final output to the mounted volume.
Sources: scripts/build-docs.sh:1-310 Dockerfile:20-33
Execution Flow
The system follows a strictly sequential execution model, with each step depending on the output of the previous step. This design simplifies error handling and allows for debugging at intermediate stages.
Build Script Orchestration
The build-docs.sh script orchestrates the entire pipeline through the following sequence of steps:
1. Configuration & Validation scripts/build-docs.sh:8-59
   - Auto-detects `REPO` from the git remote if not provided
   - Sets defaults for `BOOK_TITLE`, `BOOK_AUTHORS`, `GIT_REPO_URL`
   - Validates required configuration
   - Computes derived URLs (`DEEPWIKI_URL`, badge URLs)
2. Wiki Scraping scripts/build-docs.sh:61-65
   - Executes `deepwiki-scraper.py` with the repository identifier
   - Writes to `/workspace/wiki/` and `/workspace/raw_markdown/`
3. Early Exit (Markdown-Only Mode) scripts/build-docs.sh:67-93
   - Optional: skips the HTML build if `MARKDOWN_ONLY=true`
   - Copies markdown directly to the output volume
4. mdBook Initialization scripts/build-docs.sh:95-122
   - Creates the `/workspace/book/` structure
   - Generates the `book.toml` configuration
   - Initializes the `src/` directory
5. SUMMARY.md Generation scripts/build-docs.sh:124-188
   - Scans the wiki directory for `.md` files
   - Sorts numerically by page number prefix
   - Builds a hierarchical table of contents
   - Handles subsections in `section-N/` directories
6. Template Processing & Injection scripts/build-docs.sh:190-261
   - Processes `header.html` and `footer.html` with variable substitution
   - Injects the processed HTML into every markdown file
   - Copies enhanced markdown to `book/src/`
7. mdBook Build scripts/build-docs.sh:263-271
   - Installs mermaid assets via `mdbook-mermaid install`
   - Executes `mdbook build`
   - Generates searchable HTML in `book/`
8. Output Copying scripts/build-docs.sh:273-309
   - Copies all artifacts to the `/output/` volume mount
   - Preserves intermediate outputs for debugging
Diagram: build-docs.sh Sequential Execution Flow
Sources: scripts/build-docs.sh:1-310
Three-Phase Pipeline Architecture
The core transformation happens in three distinct phases, each with specific inputs, processing logic, and outputs. This separation allows for independent testing and debugging of each phase.
Phase Overview
| Phase | Primary Component | Input | Output | Key Operations |
|---|---|---|---|---|
| Phase 1: Extraction | deepwiki-scraper.py | DeepWiki HTML | Markdown files | Structure discovery, HTML→Markdown conversion, raw diagram extraction |
| Phase 2: Enhancement | deepwiki-scraper.py | Raw markdown + diagrams | Enhanced markdown | Diagram normalization, fuzzy matching, template injection |
| Phase 3: Build | mdbook + mdbook-mermaid | Enhanced markdown | Searchable HTML | SUMMARY generation, mermaid rendering, search index |
Diagram: Three-Phase Pipeline with Key Functions
For detailed documentation of each phase, see Three-Phase Pipeline.
Sources: scripts/build-docs.sh:61-271 README.md:72-77
```mermaid
graph TB
subgraph Stage1["Stage 1: Builder (rust:latest)"]
RustToolchain["Rust Toolchain\ncargo, rustc"]
CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
Binaries["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
RustToolchain --> CargoInstall
CargoInstall --> Binaries
end
subgraph Stage2["Stage 2: Runtime (python:3.12-slim)"]
PythonBase["Python 3.12 Runtime"]
PipInstall["pip install requirements"]
CopyBinaries["COPY --from=builder\nmdbook binaries"]
CopyScripts["COPY Python scripts\nCOPY Shell scripts\nCOPY Templates"]
FinalImage["Final Image\n~500MB"]
PythonBase --> PipInstall
PipInstall --> CopyBinaries
CopyBinaries --> CopyScripts
CopyScripts --> FinalImage
end
Binaries -.->|copy only| CopyBinaries
style Stage1 fill:#ffebee
style Stage2 fill:#e8f5e9
style FinalImage fill:#c8e6c9
```
Docker Multi-Stage Build
The Docker architecture uses a multi-stage build pattern to minimize final image size while compiling Rust-based tools from source. This approach separates build-time dependencies from runtime dependencies.
Stage Architecture
Diagram: Multi-Stage Build Process
The builder stage (approximately 2GB) is discarded after compilation, and only the compiled binaries (approximately 50MB) are copied to the final image, resulting in a significantly smaller runtime image.
Container Filesystem Layout
```
/usr/local/bin/
├── mdbook                 # Rust binary from builder stage
├── mdbook-mermaid         # Rust binary from builder stage
├── deepwiki-scraper.py    # Python script (executable)
├── process-template.py    # Python script (executable)
└── build-docs.sh          # Shell script (executable)

/workspace/
├── templates/
│   ├── header.html        # Default header template
│   └── footer.html        # Default footer template
├── wiki/                  # Created at runtime
├── raw_markdown/          # Created at runtime
└── book/                  # Created at runtime

/output/                   # Volume mount point
└── (user-provided volume)
```
For detailed Docker implementation information, see Docker Multi-Stage Build.
Sources: Dockerfile:1-34
Configuration Architecture
The system is configured entirely through environment variables and volume mounts , following the Twelve-Factor App methodology. No configuration files are required; all settings have sensible defaults.
Configuration Layers
1. Auto-Detection scripts/build-docs.sh:8-19
   - Extracts `REPO` from `git remote get-url origin`
   - Supports GitHub URLs in multiple formats (HTTPS, SSH)
2. Environment Variables scripts/build-docs.sh:21-26
   - User-provided overrides
   - Take precedence over auto-detection
3. Computed Defaults scripts/build-docs.sh:40-51
   - Derives `BOOK_AUTHORS` from the `REPO` owner
   - Constructs `GIT_REPO_URL` from `REPO`
   - Generates badge URLs
Diagram: Configuration Resolution Order
Key Configuration Variables
| Variable | Default | Source | Description |
|---|---|---|---|
REPO | (auto-detected) | scripts/build-docs.sh:9-19 | GitHub repository (owner/repo) |
BOOK_TITLE | “Documentation” | scripts/build-docs.sh23 | Title in book.toml |
BOOK_AUTHORS | (derived from REPO) | scripts/build-docs.sh45 | Author metadata |
GIT_REPO_URL | (derived from REPO) | scripts/build-docs.sh46 | Link in generated docs |
MARKDOWN_ONLY | “false” | scripts/build-docs.sh26 | Skip HTML build |
GENERATION_DATE | (current UTC time) | scripts/build-docs.sh200 | Timestamp in templates |
For complete configuration documentation, see Configuration Reference.
Sources: scripts/build-docs.sh:8-59 README.md:31-51
Output Artifacts
The system produces four distinct output artifacts, each serving a specific purpose in the documentation workflow:
| Artifact | Location | Purpose | Generated By |
|---|---|---|---|
book/ | /output/book/ | Searchable HTML documentation | mdbook build |
markdown/ | /output/markdown/ | Enhanced markdown source | deepwiki-scraper.py + templates |
raw_markdown/ | /output/raw_markdown/ | Pre-enhancement markdown (debug) | deepwiki-scraper.py (raw output) |
book.toml | /output/book.toml | mdBook configuration | build-docs.sh generation |
The multi-artifact design allows users to inspect intermediate stages, debug transformation issues, or use the markdown files for alternative processing workflows.
For detailed output structure documentation, see Output Structure.
Sources: scripts/build-docs.sh:273-309 README.md:53-58
Extensibility Points
The architecture provides three primary extension mechanisms:
1. Custom Templates
Users can override default header/footer templates by mounting a custom directory:
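A sketch of the mount (image tag and host directory name are illustrative):

```sh
docker run --rm \
  -e REPO=owner/repo \
  -v "$(pwd)/output:/output" \
  -v "$(pwd)/custom-templates:/workspace/templates" \
  deepwiki-to-mdbook
```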
The process-template.py script performs variable substitution on any mounted templates, supporting custom branding and layout.
Sources: scripts/build-docs.sh:195-234 Dockerfile26
2. Environment Variable Configuration
All behavior can be modified through environment variables without rebuilding the Docker image. This includes metadata, URLs, and operational modes.
Sources: scripts/build-docs.sh:8-59
3. Markdown-Only Mode
Setting MARKDOWN_ONLY=true allows users to skip the HTML build entirely, enabling alternative processing pipelines or custom mdBook configurations.
Sources: scripts/build-docs.sh:67-93
For advanced customization patterns, see Advanced Topics.
Summary
The DeepWiki-to-mdBook converter implements a pipeline architecture with three distinct phases (extraction, enhancement, build), orchestrated by a shell script within a multi-stage Docker container. The system is stateless, configuration-driven, and produces multiple output artifacts for different use cases. All components communicate through the filesystem, with no runtime dependencies beyond the container image.
Key architectural principles:
- Sequential processing: Each phase depends on the previous phase's output
- Stateless execution: No persistent state between runs
- Configuration through environment: No config files required
- Multi-stage build: Minimized runtime image size
- Multiple outputs: Debugging and alternative workflows supported
Sources: Dockerfile:1-34 scripts/build-docs.sh:1-310 README.md:1-95
Three-Phase Pipeline
Purpose and Scope
This document describes the three-phase processing pipeline that transforms DeepWiki HTML pages into searchable mdBook documentation. The pipeline consists of Phase 1: Clean Markdown Extraction , Phase 2: Diagram Enhancement , and Phase 3: mdBook Build. Each phase has distinct responsibilities and uses different technology stacks.
For overall system architecture, see System Architecture. For detailed implementation of individual phases, see Phase 1: Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. For configuration that affects pipeline behavior, see Configuration Reference.
Pipeline Overview
The system processes content through three sequential phases, with an optional bypass mechanism for Phase 3. Each phase is implemented by different components and operates on files in specific directories.
Pipeline Execution Flow
```mermaid
stateDiagram-v2
[*] --> Init["build-docs.sh\nParse env vars"]
Init --> Phase1["Phase 1 : deepwiki-scraper.py"]
state "Phase 1 : Markdown Extraction" as Phase1 {
[*] --> ExtractStruct["extract_wiki_structure()"]
ExtractStruct --> LoopPages["for page in pages"]
LoopPages --> ExtractPage["extract_page_content(url)"]
ExtractPage --> ConvertHTML["convert_html_to_markdown()"]
ConvertHTML --> CleanFooter["clean_deepwiki_footer()"]
CleanFooter --> WriteTemp["/workspace/wiki/*.md"]
WriteTemp --> LoopPages
LoopPages --> RawSnapshot["/workspace/raw_markdown/\n(debug snapshot)"]
}
Phase1 --> Phase2["Phase 2 : deepwiki-scraper.py"]
state "Phase 2 : Diagram Enhancement" as Phase2 {
[*] --> ExtractDiagrams["extract_and_enhance_diagrams()"]
ExtractDiagrams --> FetchJS["Fetch JS payload\nextract_mermaid_from_nextjs_data()"]
FetchJS --> NormalizeDiagrams["normalize_mermaid_diagram()\n7 normalization passes"]
NormalizeDiagrams --> FuzzyMatch["Fuzzy match loop\n300/200/150/100/80 char chunks"]
FuzzyMatch --> InjectFiles["Modify /workspace/wiki/*.md\nInsert fenced mermaid blocks"]
}
Phase2 --> CheckMode{"MARKDOWN_ONLY\nenv var?"}
CheckMode --> CopyMarkdown["build-docs.sh\ncp -r /workspace/wiki /output/markdown"] : true
CheckMode --> Phase3 : false
state "Phase 3 : mdBook Build" as Phase3 {
[*] --> GenToml["Generate book.toml\n[book], [output.html]"]
GenToml --> GenSummary["Generate src/SUMMARY.md\nScan .md files"]
GenSummary --> CopyToSrc["cp -r /workspace/wiki/* src/"]
CopyToSrc --> MermaidInstall["mdbook-mermaid install"]
MermaidInstall --> MdBookBuild["mdbook build"]
MdBookBuild --> OutputBook["/workspace/book/book/"]
}
Phase3 --> CopyAll["cp -r book /output/\ncp -r markdown /output/"]
CopyMarkdown --> Done["/output directory\nready"]
CopyAll --> Done
Done --> [*]
```
Sources: scripts/build-docs.sh:61-93 python/deepwiki-scraper.py:1277-1408 python/deepwiki-scraper.py:880-1276
Phase Coordination
The build-docs.sh orchestrator coordinates all three phases and handles the decision point for markdown-only mode.
Orchestrator Control Flow
```mermaid
flowchart TD
Start["Container entrypoint\nCMD: /usr/local/bin/build-docs.sh"]
Start --> ParseEnv["Parse environment\n$REPO, $BOOK_TITLE, $BOOK_AUTHORS\n$MARKDOWN_ONLY, $GIT_REPO_URL"]
ParseEnv --> CheckRepo{"$REPO\nset?"}
CheckRepo -->|No| GitDetect["git config --get remote.origin.url\nsed -E 's#.*github.com[:/]([^/]+/[^/.]+).*#\1#'"]
CheckRepo -->|Yes| SetVars["Set defaults:\nBOOK_AUTHORS=$REPO_OWNER\nGIT_REPO_URL=https://github.com/$REPO"]
GitDetect --> SetVars
SetVars --> SetPaths["WORK_DIR=/workspace\nWIKI_DIR=/workspace/wiki\nRAW_DIR=/workspace/raw_markdown\nOUTPUT_DIR=/output\nBOOK_DIR=/workspace/book"]
SetPaths --> CallScraper["python3 /usr/local/bin/deepwiki-scraper.py $REPO $WIKI_DIR"]
CallScraper --> ScraperRuns["deepwiki-scraper.py executes:\nPhase 1: extract_wiki_structure()\nPhase 2: extract_and_enhance_diagrams()"]
ScraperRuns --> CheckMode{"$MARKDOWN_ONLY\n== 'true'?"}
CheckMode -->|Yes| QuickCopy["rm -rf $OUTPUT_DIR/markdown\nmkdir -p $OUTPUT_DIR/markdown\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\nexit 0"]
CheckMode -->|No| MkdirBook["mkdir -p $BOOK_DIR\ncd $BOOK_DIR"]
MkdirBook --> GenToml["cat > book.toml <<EOF\n[book] title=$BOOK_TITLE\n[output.html] git-repository-url=$GIT_REPO_URL\n[preprocessor.mermaid]"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> GenSummary["Generate src/SUMMARY.md\nls $WIKI_DIR/*.md | sort -t- -k1 -n\nfor file: head -1 $file | sed 's/^# //'"]
GenSummary --> CopyToSrc["cp -r $WIKI_DIR/* src/"]
CopyToSrc --> ProcessTemplates["python3 process-template.py header.html\npython3 process-template.py footer.html\nInject into src/*.md"]
ProcessTemplates --> InstallMermaid["mdbook-mermaid install $BOOK_DIR"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["mkdir -p $OUTPUT_DIR\ncp -r book $OUTPUT_DIR/\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\ncp book.toml $OUTPUT_DIR/"]
QuickCopy --> Done["Exit 0"]
CopyOutputs --> Done
```
Sources: scripts/build-docs.sh:8-47 scripts/build-docs.sh:61-93 scripts/build-docs.sh:95-309
Phase 1: Clean Markdown Extraction
Phase 1 discovers the wiki structure and converts HTML pages to clean Markdown, writing files to a temporary directory (/workspace/wiki). This phase is implemented entirely in Python within deepwiki-scraper.py.
Phase 1 Data Flow
```mermaid
flowchart LR
DeepWiki["https://deepwiki.com/$REPO"]
DeepWiki -->|session.get base_url| ExtractStruct["extract_wiki_structure(repo, session)"]
ExtractStruct -->|soup.find_all 'a', href=re.compile ...| ParseLinks["Parse sidebar links\nPattern: /$REPO/\d+"]
ParseLinks --> PageList["pages = [\n {'number': '1', 'title': 'Overview',\n 'url': '...', 'href': '...', 'level': 0},\n {'number': '2.1', 'title': 'Sub',\n 'url': '...', 'href': '...', 'level': 1},\n ...\n]\nsorted by page number"]
PageList --> Loop["for page in pages:"]
Loop --> FetchPage["fetch_page(url, session)\nUser-Agent header\n3 retries with timeout=30"]
FetchPage --> ParseHTML["BeautifulSoup(response.text)\nRemove: nav, header, footer, aside\nFind: article or main or body"]
ParseHTML --> ConvertMD["h = html2text.HTML2Text()\nh.body_width = 0\nmarkdown = h.handle(html_content)"]
ConvertMD --> CleanFooter["clean_deepwiki_footer(markdown)\nRegex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
CleanFooter --> FixLinks["fix_wiki_link(match)\nRegex: /owner/repo/(\d+(?:\.\d+)*)-(.+)\nConvert to: section-N/N-M-slug.md"]
FixLinks --> ResolvePath["resolve_output_path(page_number, title)\nnormalized_number_parts()\nsanitize_filename()"]
ResolvePath --> WriteFile["filepath.write_text(markdown)\nMain: /workspace/wiki/N-slug.md\nSub: /workspace/wiki/section-N/N-M-slug.md"]
WriteFile --> Loop
```
Sources: python/deepwiki-scraper.py:116-163 python/deepwiki-scraper.py:751-877 python/deepwiki-scraper.py:213-228 python/deepwiki-scraper.py:165-211 python/deepwiki-scraper.py:22-26 python/deepwiki-scraper.py:28-53
Key Functions and Their Roles
| Function | File Location | Responsibility |
|---|---|---|
extract_wiki_structure() | python/deepwiki-scraper.py:116-163 | Parse main wiki page, discover all pages via sidebar links matching /repo/\d+, return sorted list of page metadata |
extract_page_content() | python/deepwiki-scraper.py:751-877 | Fetch individual page HTML, parse with BeautifulSoup, remove nav/footer elements, convert to Markdown |
convert_html_to_markdown() | python/deepwiki-scraper.py:213-228 | Convert HTML string to Markdown using html2text.HTML2Text() with body_width=0 (no line wrapping) |
clean_deepwiki_footer() | python/deepwiki-scraper.py:165-211 | Scan last 50 lines for DeepWiki UI patterns (Dismiss, Refresh this wiki, etc.) and truncate |
sanitize_filename() | python/deepwiki-scraper.py:22-26 | Strip special chars, replace spaces/hyphens, convert to lowercase: re.sub(r'[^\w\s-]', '', text) |
normalized_number_parts() | python/deepwiki-scraper.py:28-43 | Shift DeepWiki page numbers down by 1 (page 1 becomes unnumbered), split on . into parts |
resolve_output_path() | python/deepwiki-scraper.py:45-53 | Determine filename (N-slug.md) and optional subdirectory (section-N/) based on page numbering |
fix_wiki_link() | python/deepwiki-scraper.py:854-876 | Rewrite internal links from /owner/repo/N-title to relative paths like ../section-N/N-M-title.md |
File Organization Logic
```mermaid
flowchart TD
PageNum["page['number']\n(from DeepWiki)"]
PageNum --> Normalize["normalized_number_parts(page_number)\nSplit on '.', shift main number down by 1\nDeepWiki '1' → []\nDeepWiki '2' → ['1']\nDeepWiki '2.1' → ['1', '1']"]
Normalize --> CheckParts{"len(parts)?"}
CheckParts -->|0 was page 1| RootOverview["Filename: overview.md\nPath: $WIKI_DIR/overview.md\nNo section dir"]
CheckParts -->|1 main page| RootMain["Filename: N-slug.md\nExample: 1-quick-start.md\nPath: $WIKI_DIR/1-quick-start.md\nNo section dir"]
CheckParts -->|2+ subsection| ExtractSection["main_section = parts[0]\nsection_dir = f'section-{main_section}'"]
ExtractSection --> CreateDir["section_path = Path($WIKI_DIR) / section_dir\nsection_path.mkdir(exist_ok=True)"]
CreateDir --> SubFile["Filename: N-M-slug.md\nExample: 1-1-installation.md\nPath: $WIKI_DIR/section-1/1-1-installation.md"]
```
The system organizes files hierarchically based on page numbering. DeepWiki pages are numbered starting from 1, but the system shifts them down by 1 so that the first page becomes unnumbered (the overview).
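A hedged sketch of the numbering shift (the helper name comes from the table above; the exact implementation in python/deepwiki-scraper.py:28-43 may differ):

```python
def normalized_number_parts(page_number: str) -> list[str]:
    parts = page_number.split(".")
    main = int(parts[0]) - 1            # shift: DeepWiki page 1 becomes unnumbered
    if main == 0 and len(parts) == 1:
        return []                       # '1'   -> []           (overview.md)
    return [str(main), *parts[1:]]      # '2'   -> ['1']; '2.1' -> ['1', '1']
```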
Sources: python/deepwiki-scraper.py:28-43 python/deepwiki-scraper.py:45-63 python/deepwiki-scraper.py:1332-1338
Phase 2: Diagram Enhancement
Phase 2 extracts Mermaid diagrams from the JavaScript payload and uses fuzzy matching to intelligently place them in the appropriate Markdown files. This phase operates on files in the temporary directory (/workspace/wiki).
Phase 2 Algorithm Flow
```mermaid
flowchart TD
Start["extract_and_enhance_diagrams(repo, temp_dir, session, diagram_source_url)"]
Start --> FetchJS["response = session.get(diagram_source_url)\nhtml_text = response.text"]
FetchJS --> ExtractRegex["Regex over fenced mermaid blocks\n(escaped or literal newlines)\ndiagram_matches = re.finditer(pattern, html_text, re.DOTALL)"]
ExtractRegex --> CountTotal["print(f'Found {len(diagram_matches)} total diagrams')"]
CountTotal --> ExtractContext["for match in diagram_matches:\ncontext_start = max(0, match.start() - 2000)\ncontext_before = html_text[context_start:match.start()]"]
ExtractContext --> Unescape["Unescape JS string literals:\nnewlines, tabs, quotes,\nu003c, u003e, u0026"]
Unescape --> ParseContext["context_lines = non-empty lines of context\nFind last_heading (line starting with #)\nExtract anchor_text (last 2-3 non-heading lines, max 300 chars)"]
ParseContext --> Normalize["normalize_mermaid_diagram(diagram)\n7 normalization passes:\nnormalize_mermaid_edge_labels()\nnormalize_mermaid_state_descriptions()\nnormalize_flowchart_nodes()\nnormalize_statement_separators()\nnormalize_empty_node_labels()\nnormalize_gantt_diagram()"]
Normalize --> BuildContexts["diagram_contexts.append({\n'last_heading': last_heading,\n'anchor_text': anchor_text[-300:],\n'diagram': normalized_diagram})"]
BuildContexts --> ScanFiles["md_files = list(temp_dir.glob('**/*.md'))\nfor md_file in md_files:"]
ScanFiles --> SkipExisting{"File already contains\na mermaid fence?"}
SkipExisting -->|Yes| ScanFiles
SkipExisting -->|No| NormalizeContent["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeContent --> MatchLoop["for idx, item in enumerate(diagram_contexts):"]
MatchLoop --> TryChunks["for chunk_size in [300, 200, 150, 100, 80]:\ntest_chunk = anchor_normalized[-chunk_size:]\npos = content_normalized.find(test_chunk)"]
TryChunks --> FoundMatch{"pos != -1?"}
FoundMatch -->|Yes| ConvertToLine["Walk lines, accumulating normalized\nchar counts until char_count >= pos:\nbest_match_line = line_num"]
FoundMatch -->|No| TryHeading["for line_num, line in enumerate(lines):\nif heading_normalized in line_normalized:\nbest_match_line = line_num"]
TryHeading --> FoundMatch2{"best_match_line != -1?"}
FoundMatch2 -->|Yes| ConvertToLine
FoundMatch2 -->|No| MatchLoop
ConvertToLine --> CheckScore{"best_match_score >= 80?"}
CheckScore -->|Yes| FindInsert["insert_line = best_match_line + 1\nSkip blank lines, skip paragraph/list"]
CheckScore -->|No| MatchLoop
FindInsert --> QueueInsert["pending_insertions.append(\n(insert_line, diagram, score, idx))\ndiagrams_used.add(idx)"]
QueueInsert --> MatchLoop
MatchLoop --> SortInsert["pending_insertions.sort(key=lambda x: x[0], reverse=True)"]
SortInsert --> InsertLoop["For each pending insertion:\ninsert blank line, mermaid fence,\ndiagram, closing fence, blank line"]
InsertLoop --> WriteFile["md_file.write_text('\\n'.join(lines))"]
WriteFile --> ScanFiles
```
Sources: python/deepwiki-scraper.py:880-1276 python/deepwiki-scraper.py:899-1088 python/deepwiki-scraper.py:1149-1273 python/deepwiki-scraper.py:230-393
Fuzzy Matching Algorithm
The algorithm uses progressively shorter anchor text chunks to find the best match location for each diagram. The score threshold of 80 ensures only high-confidence matches are inserted.
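A hedged Python sketch of the progressive-chunk match (chunk sizes and threshold from the description above; names are illustrative, not the scraper's actual code):

```python
CHUNK_SIZES = [300, 200, 150, 100, 80]

def find_anchor_position(anchor: str, content: str) -> tuple[int, int]:
    """Return (char_pos, score) in normalized content, or (-1, 0) if no chunk matches."""
    normalize = lambda s: " ".join(s.lower().split())   # lowercase + collapse whitespace
    anchor_n, content_n = normalize(anchor), normalize(content)
    for chunk_size in CHUNK_SIZES:
        if len(anchor_n) < chunk_size:
            continue                                    # anchor too short for this chunk
        test_chunk = anchor_n[-chunk_size:]             # tail of the anchor text
        pos = content_n.find(test_chunk)
        if pos != -1:
            return pos, chunk_size                      # score equals matched chunk length
    return -1, 0
```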
Sources: python/deepwiki-scraper.py:1184-1218
flowchart LR
AnchorText["anchor_text\n(last 300 chars from context)"]
AnchorText --> NormalizeA["anchor_normalized = anchor.lower()\nanchor_normalized = ' '.join(anchor_normalized.split())"]
MDFile["markdown file content"]
MDFile --> NormalizeC["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeA --> Loop["for chunk_size in [300, 200, 150, 100, 80]:]
NormalizeC --> Loop
Loop --> Extract[if len(anchor_normalized) >= chunk_size:\n test_chunk = anchor_normalized[-chunk_size:]"]
Extract --> Find["pos = content_normalized.find(test_chunk)"]
Find --> FoundPos{"pos != -1?"}
FoundPos -->|Yes| CharToLine["char_count = 0\nfor line_num, line in enumerate(lines):\n char_count += len(' '.join(line.split())) + 1\n if char_count >= pos:\n best_match_line = line_num\n best_match_score = chunk_size"]
FoundPos -->|No| Loop
CharToLine --> CheckThresh{"best_match_score >= 80?"}
CheckThresh -->|Yes| Accept["Accept match\nQueue for insertion"]
CheckThresh -->|No| HeadingFallback["Try heading_normalized in line_normalized\nbest_match_score = 50"]
HeadingFallback --> CheckThresh2{"best_match_score >= 80?"}
CheckThresh2 -->|Yes| Accept
CheckThresh2 -->|No| Reject["Reject match\nSkip diagram"]
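A condensed Python sketch of this matching loop; the names mirror the diagram above, and the authoritative implementation is at python/deepwiki-scraper.py:1184-1218:

```python
def find_insertion_line(anchor_text, content):
    """Locate the best line to place a diagram after, via progressively shorter chunks."""
    anchor = " ".join(anchor_text.lower().split())      # normalize case and whitespace
    content_norm = " ".join(content.lower().split())
    lines = content.split("\n")

    for chunk_size in [300, 200, 150, 100, 80]:         # progressively shorter anchors
        if len(anchor) < chunk_size:
            continue
        pos = content_norm.find(anchor[-chunk_size:])   # search for the anchor's tail
        if pos == -1:
            continue
        # Convert the character offset in the normalized text back to a line number.
        char_count = 0
        for line_num, line in enumerate(lines):
            char_count += len(" ".join(line.split())) + 1
            if char_count >= pos:
                return line_num, chunk_size             # score == matched chunk size
    return -1, 0    # caller falls back to heading matching (score 50, below threshold)
```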
Diagram Extraction from JavaScript
Diagrams are extracted from the Next.js JavaScript payload embedded in the HTML response. DeepWiki stores all diagrams for all pages in a single JavaScript bundle, which requires fuzzy matching to place each diagram in the correct file.
Extraction Method
The primary extraction pattern captures fenced Mermaid blocks with various newline representations:
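The exact regular expression lives at python/deepwiki-scraper.py:899-901; a simplified stand-in with the same shape (a fenced block whose newlines may appear as literal `\n` escapes inside the JavaScript payload) might look like:

```python
import re

# Simplified stand-in: the real pattern at python/deepwiki-scraper.py:899-901 may differ.
MERMAID_BLOCK = re.compile(
    r'```mermaid(?:\\n|\n)(.*?)(?:\\n|\n)```',  # fence, body, fence; \n may be escaped
    re.DOTALL,
)
```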
Unescape Sequence
Each diagram block undergoes unescaping to convert JavaScript string literals to actual text:
| Escape Sequence | Replacement | Purpose |
|---|---|---|
| `\\n` | `\n` | Newline characters in diagram syntax |
| `\\t` | `\t` | Tab characters for indentation |
| `\\"` | `"` | Quoted strings in node labels |
| `\\\\` | `\` | Literal backslashes |
| `\\u003c` | `<` | HTML less-than entity |
| `\\u003e` | `>` | HTML greater-than entity |
| `\\u0026` | `&` | HTML ampersand entity |
| `<br/>`, `<br>` | (space) | HTML line breaks in labels |
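A minimal sketch of this unescaping step, chaining the table's replacements (the ordering here is illustrative; the actual sequence lives at python/deepwiki-scraper.py:1039-1047):

```python
def unescape_js_string(raw: str) -> str:
    """Convert a JavaScript string literal captured from the payload to plain text."""
    replacements = [
        ('\\n', '\n'),      # escaped newlines
        ('\\t', '\t'),      # escaped tabs
        ('\\"', '"'),       # escaped double quotes
        ('\\\\', '\\'),     # literal backslashes
        ('\\u003c', '<'),   # unicode escapes for <, >, &
        ('\\u003e', '>'),
        ('\\u0026', '&'),
        ('<br/>', ' '),     # HTML line breaks inside labels become spaces
        ('<br>', ' '),
    ]
    for old, new in replacements:
        raw = raw.replace(old, new)
    return raw
```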
Sources: python/deepwiki-scraper.py:899-901 python/deepwiki-scraper.py:1039-1047
Phase 3: mdBook Build
Phase 3 generates mdBook configuration, creates the table of contents, and builds the final HTML documentation. This phase is orchestrated by build-docs.sh and invokes Rust tools (mdbook, mdbook-mermaid).
Phase 3 Component Interactions
flowchart TD
Entry["build-docs.sh line 95\nPhase 3 starts"]
Entry --> MkdirBook["mkdir -p $BOOK_DIR\ncd $BOOK_DIR\n$BOOK_DIR=/workspace/book"]
MkdirBook --> GenToml["cat > book.toml <<EOF\n[book]\ntitle = \"$BOOK_TITLE\"\nauthors = [\"$BOOK_AUTHORS\"]\n[output.html]\ndefault-theme = \"rust\"\ngit-repository-url = \"$GIT_REPO_URL\"\n[preprocessor.mermaid]\ncommand = \"mdbook-mermaid\"\n[output.html.fold]\nenable = true"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> InitSummary["{ echo '# Summary'; echo ''; } > src/SUMMARY.md"]
InitSummary --> FindOverview["main_pages_list=$(ls $WIKI_DIR/*.md)\noverview_file=$(printf '%s\n' $main_pages_list | grep -Ev '^[0-9]' | head -1)\ntitle=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')"]
FindOverview --> WriteOverview["echo \"[$title]($overview_file)\" >> src/SUMMARY.md\necho '' >> src/SUMMARY.md"]
WriteOverview --> SortPages["main_pages=$(printf '%s\n' $main_pages_list | awk -F/ '{print $NF}'\n| grep -E '^[0-9]' | sort -t- -k1 -n)"]
SortPages --> LoopPages["echo \"$main_pages\" | while read -r file; do"]
LoopPages --> ExtractTitle["filename=$(basename \"$file\")\ntitle=$(head -1 \"$file\" | sed 's/^# //')"]
ExtractTitle --> ExtractNum["section_num=$(echo \"$filename\" | grep -oE '^[0-9]+')"]
ExtractNum --> CheckSubdir{"[ -d \"$WIKI_DIR/section-$section_num\" ]?"}
CheckSubdir -->|Yes| WriteMain["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
WriteMain --> LoopSubs["ls $WIKI_DIR/section-$section_num/*.md | awk -F/ '{print $NF}'\n| sort -t- -k1 -n | while read subname; do"]
LoopSubs --> WriteSub["subfile=\"section-$section_num/$subname\"\nsubtitle=$(head -1 \"$subfile\" | sed 's/^# //')\necho \"  - [$subtitle]($subfile)\" >> src/SUMMARY.md"]
WriteSub --> LoopSubs
CheckSubdir -->|No| WriteStandalone["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
LoopSubs --> LoopPages
WriteStandalone --> LoopPages
LoopPages --> CopySrc["cp -r $WIKI_DIR/* src/"]
CopySrc --> ProcessTemplates["python3 /usr/local/bin/process-template.py $HEADER_TEMPLATE\npython3 /usr/local/bin/process-template.py $FOOTER_TEMPLATE\nInject into src/*.md and src/*/*.md"]
ProcessTemplates --> MermaidInstall["mdbook-mermaid install $BOOK_DIR"]
MermaidInstall --> MdBookBuild["mdbook build\nReads book.toml and src/SUMMARY.md\nProcesses src/*.md files\nGenerates book/index.html"]
MdBookBuild --> CopyOut["mkdir -p $OUTPUT_DIR\ncp -r book $OUTPUT_DIR/\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\ncp book.toml $OUTPUT_DIR/"]
Sources: scripts/build-docs.sh:95-309 scripts/build-docs.sh:124-188 scripts/build-docs.sh:190-261 scripts/build-docs.sh:263-271
book.toml Generation
The orchestrator dynamically generates book.toml with runtime configuration from environment variables:
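Reconstructed from the heredoc shown in the Phase 3 diagram above, the generated file has roughly this shape (values are substituted at runtime):

```toml
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]

[output.html]
default-theme = "rust"
git-repository-url = "$GIT_REPO_URL"

[preprocessor.mermaid]
command = "mdbook-mermaid"

[output.html.fold]
enable = true
```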
Sources: scripts/build-docs.sh:102-119
flowchart TD
Start["Generate src/SUMMARY.md\n{ echo '# Summary'; echo ''; } > src/SUMMARY.md"]
Start --> ListFiles["main_pages_list=$(ls $WIKI_DIR/*.md 2>/dev/null || true)"]
ListFiles --> FindOverview["overview_file=$(printf '%s\n' $main_pages_list\n| awk -F/ '{print $NF}' | grep -Ev '^[0-9]'\n| head -1)"]
FindOverview --> WriteOverview["if [ -n \"$overview_file\" ]; then\ntitle=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')\necho \"[${title:-Overview}]($overview_file)\" >> src/SUMMARY.md\necho '' >> src/SUMMARY.md\nfi"]
WriteOverview --> FilterMain["main_pages=$(printf '%s\n' $main_pages_list\n| awk -F/ '{print $NF}' | grep -E '^[0-9]'\n| sort -t- -k1 -n)"]
FilterMain --> LoopMain["echo \"$main_pages\" | while read -r file; do"]
LoopMain --> CheckFile{"[ -f \"$file\" ]?"}
CheckFile -->|Yes| GetFilename["filename=$(basename \"$file\")\ntitle=$(head -1 \"$file\" | sed 's/^# //')"]
CheckFile -->|No| LoopMain
GetFilename --> ExtractNum["section_num=$(echo \"$filename\" | grep -oE '^[0-9]+' || true)\nsection_dir=\"$WIKI_DIR/section-$section_num\""]
ExtractNum --> CheckSubdir{"[ -n \"$section_num\" ] &&\n[ -d \"$section_dir\" ]?"}
CheckSubdir -->|Yes| WriteMainWithSubs["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
WriteMainWithSubs --> ListSubs["ls $section_dir/*.md 2>/dev/null\n| awk -F/ '{print $NF}' | sort -t- -k1 -n"]
ListSubs --> LoopSubs["while read subname; do"]
LoopSubs --> CheckSubFile{"[ -f \"$subfile\" ]?"}
CheckSubFile -->|Yes| WriteSub["subfile=\"$section_dir/$subname\"\nsubfilename=$(basename \"$subfile\")\nsubtitle=$(head -1 \"$subfile\" | sed 's/^# //')\necho \"  - [$subtitle](section-$section_num/$subfilename)\" >> src/SUMMARY.md"]
CheckSubFile -->|No| LoopSubs
WriteSub --> LoopSubs
CheckSubdir -->|No| WriteStandalone["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
LoopSubs --> LoopMain
WriteStandalone --> LoopMain
LoopMain --> CountEntries["echo \"Generated SUMMARY.md with $(grep -c '\\[' src/SUMMARY.md) entries\""]
SUMMARY.md Generation Algorithm
The table of contents is generated by scanning the actual file structure in /workspace/wiki and extracting titles from the first line of each file:
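Using the filenames from the Final Output Structure shown later on this page, the generated file looks roughly like this (titles are illustrative):

```markdown
# Summary

[Overview](overview.md)

- [Quick Start](1-quick-start.md)
  - [Installation](section-1/1-1-installation.md)
```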
Sources: scripts/build-docs.sh:124-188
mdBook and mdbook-mermaid Execution
The build process invokes two Rust binaries installed via the Docker multi-stage build:
| Command | Location | Purpose | Output |
|---|---|---|---|
mdbook-mermaid install $BOOK_DIR | scripts/build-docs.sh265 | Install Mermaid.js assets into book directory and update book.toml | mermaid.min.js, mermaid-init.js in $BOOK_DIR/ |
mdbook build | scripts/build-docs.sh270 | Parse SUMMARY.md, process all Markdown files, generate static HTML site | HTML files in $BOOK_DIR/book/ (subdirectory, not root) |
mdbook Build Process:
The mdbook build command performs the following operations:
- Parse Structure : Read `src/SUMMARY.md` to determine page hierarchy and navigation order
- Process Files : For each `.md` file referenced in SUMMARY.md:
  - Parse Markdown with CommonMark parser
  - Process Mermaid fenced code blocks via `mdbook-mermaid` preprocessor
  - Apply `rust` theme styles (configurable via `default-theme` in book.toml)
  - Generate sidebar navigation
- Generate HTML : Create HTML files with:
  - Responsive navigation sidebar
  - Client-side search functionality (elasticlunr.js)
  - “Edit this page” links using `git-repository-url` from book.toml
  - Syntax highlighting for code blocks
- Copy Assets : Bundle theme assets, fonts, and JavaScript libraries
Sources: scripts/build-docs.sh:263-271 scripts/build-docs.sh:102-119
Data Transformation Summary
Each phase transforms data in specific ways, with temporary directories used for intermediate work:
| Phase | Input | Processing Components | Output |
|---|---|---|---|
| Phase 1 | HTML from https://deepwiki.com/$REPO | extract_wiki_structure(), extract_page_content(), BeautifulSoup, html2text.HTML2Text(), clean_deepwiki_footer() | Clean Markdown in /workspace/wiki/ |
| Phase 2 | Markdown files + JavaScript payload from DeepWiki | extract_and_enhance_diagrams(), normalize_mermaid_diagram(), fuzzy matching with 300/200/150/100/80 char chunks | Enhanced Markdown in /workspace/wiki/ (modified in place) |
| Phase 3 | Markdown files + environment variables ($BOOK_TITLE, $BOOK_AUTHORS, etc.) | Shell script generates book.toml and src/SUMMARY.md, process-template.py, mdbook-mermaid install, mdbook build | HTML site in /workspace/book/book/ |
Working Directories:
| Directory | Purpose | Contents | Lifecycle |
|---|---|---|---|
/workspace/wiki/ | Primary working directory | Markdown files organized by numbering scheme | Created in Phase 1, modified in Phase 2, read in Phase 3 |
/workspace/raw_markdown/ | Debug snapshot | Copy of /workspace/wiki/ before Phase 2 enhancement | Created between Phase 1 and Phase 2, copied to /output/raw_markdown/ |
/workspace/book/ | mdBook project directory | book.toml, src/ subdirectory, final book/ subdirectory | Created in Phase 3 |
/workspace/book/src/ | mdBook source | Copy of /workspace/wiki/ with injected headers/footers, SUMMARY.md | Created in Phase 3 |
/workspace/book/book/ | Final HTML output | Complete static HTML site | Generated by mdbook build |
/output/ | Final container output | book/, markdown/, raw_markdown/, book.toml | Populated at end of Phase 3 (or end of Phase 2 if MARKDOWN_ONLY=true) |
Final Output Structure:
/output/
├── book/ # Static HTML site (from /workspace/book/book/)
│ ├── index.html
│ ├── overview.html # First page (unnumbered)
│ ├── 1-quick-start.html # Main pages
│ ├── section-1/
│ │ ├── 1-1-installation.html # Subsections
│ │ └── ...
│ ├── mermaid.min.js # Installed by mdbook-mermaid
│ ├── mermaid-init.js # Installed by mdbook-mermaid
│ └── ...
├── markdown/ # Enhanced Markdown source (from /workspace/wiki/)
│ ├── overview.md
│ ├── 1-quick-start.md
│ ├── section-1/
│ │ ├── 1-1-installation.md
│ │ └── ...
│ └── ...
├── raw_markdown/ # Pre-enhancement snapshot (from /workspace/raw_markdown/)
│ ├── overview.md # Same structure as markdown/ but without diagrams
│ └── ...
└── book.toml # mdBook configuration (from /workspace/book/book.toml)
Sources: scripts/build-docs.sh:273-294 python/deepwiki-scraper.py:1358-1366
flowchart TD
Phase1["Phase 1: Extraction\n(deepwiki-scraper.py)"]
Phase2["Phase 2: Enhancement\n(deepwiki-scraper.py)"]
Phase1 --> Phase2
Phase2 --> Check{"MARKDOWN_ONLY\n== true?"}
Check -->|Yes| FastPath["cp -r /workspace/wiki/* /output/markdown/\nExit (fast path)"]
Check -->|No| Phase3["Phase 3: mdBook Build\n(build-docs.sh)"]
Phase3 --> FullOutput["Copy book/ and markdown/ to /output/\nExit (full build)"]
FastPath --> End[/"Build complete"/]
FullOutput --> End
Conditional Execution: MARKDOWN_ONLY Mode
The MARKDOWN_ONLY environment variable allows bypassing Phase 3 for faster iteration during development:
When `MARKDOWN_ONLY=true`:
- Execution time: ~30-60 seconds (scraping + diagram matching only)
- Output: `/output/markdown/` only
- Use case: Debugging diagram placement, testing content extraction

When `MARKDOWN_ONLY=false` (default):
- Execution time: ~60-120 seconds (full pipeline)
- Output: `/output/book/`, `/output/markdown/`, `/output/book.toml`
- Use case: Production documentation builds
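A typical invocation of the fast path might look like this (the image name is illustrative, not taken from the source):

```sh
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook   # illustrative image tag
```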
Sources: build-docs.sh:60-76 README.md:55-76
Docker Multi-Stage Build
Relevant source files
Purpose and Scope
This document explains the Docker multi-stage build strategy used to create the deepwiki-scraper container image. It details how the system combines a Rust toolchain for compiling documentation tools with a Python runtime for web scraping, while optimizing the final image size.
For information about how the container orchestrates the build process, see build-docs.sh Orchestrator. For details on the Python scraper implementation, see deepwiki-scraper.py.
Multi-Stage Build Strategy
The Dockerfile implements a two-stage build pattern that separates compilation from runtime. Stage 1 uses a full Rust development environment to compile mdBook binaries from source. Stage 2 creates a minimal Python runtime and extracts only the compiled binaries, discarding the build toolchain.
Build Stages Flow
graph TD
subgraph Stage1["Stage 1: Rust Builder (rust:latest)"]
RustBase["rust:latest base image\n~1.5 GB with toolchain"]
CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
Binaries["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
RustBase --> CargoInstall
CargoInstall --> Binaries
end
subgraph Stage2["Stage 2: Python Runtime (python:3.12-slim)"]
PyBase["python:3.12-slim base\n~150 MB"]
UVInstall["Copy uv from\nghcr.io/astral-sh/uv:latest"]
PipInstall["uv pip install --system\nrequirements.txt"]
CopyBinaries["COPY --from=builder\n/usr/local/cargo/bin/"]
CopyScripts["COPY tools/ and\nbuild-docs.sh"]
PyBase --> UVInstall
UVInstall --> PipInstall
PipInstall --> CopyBinaries
CopyBinaries --> CopyScripts
end
Binaries -.->|Extract only binaries| CopyBinaries
CopyScripts --> FinalImage["Final Image\n~300-400 MB"]
Stage1 -.->|Discarded after build| Discard["Discard"]
style RustBase fill:#f5f5f5
style PyBase fill:#f5f5f5
style FinalImage fill:#e8e8e8
style Discard fill:#fff,stroke-dasharray: 5 5
Sources: Dockerfile:1-33
Stage 1: Rust Builder
Stage 1 uses rust:latest as the base image, providing the complete Rust toolchain including cargo, the Rust package manager and build tool.
Rust Builder Configuration
| Aspect | Details |
|---|---|
| Base Image | rust:latest |
| Size | ~1.5 GB (includes rustc, cargo, stdlib) |
| Build Commands | cargo install mdbook, cargo install mdbook-mermaid |
| Output Location | /usr/local/cargo/bin/ |
| Stage Identifier | builder |
The cargo install commands fetch mdBook and mdbook-mermaid source from crates.io, compile them from source, and install the resulting binaries to /usr/local/cargo/bin/.
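A sketch of this stage, assuming the base image, stage identifier, and commands from the table above (the real Dockerfile:1-5 may differ in detail):

```dockerfile
# Stage 1: compile the documentation tools from crates.io sources.
FROM rust:latest AS builder
RUN cargo install mdbook
RUN cargo install mdbook-mermaid
# Binaries land in /usr/local/cargo/bin/ for extraction by Stage 2.
```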
flowchart LR
subgraph BuilderStage["builder stage"]
CratesIO["crates.io\n(package registry)"]
CargoFetch["cargo fetch\n(download sources)"]
CargoCompile["cargo build --release\n(compile to binary)"]
CargoInstallBin["Install to\n/usr/local/cargo/bin/"]
CratesIO --> CargoFetch
CargoFetch --> CargoCompile
CargoCompile --> CargoInstallBin
end
CargoInstallBin --> MdBookBin["mdbook binary"]
CargoInstallBin --> MermaidBin["mdbook-mermaid binary"]
MdBookBin -.->|Copied to Stage 2| NextStage["NextStage"]
MermaidBin -.->|Copied to Stage 2| NextStage
Sources: Dockerfile:1-5
Stage 2: Python Runtime Assembly
Stage 2 builds the final runtime image starting from python:3.12-slim, a minimal Python base image that omits development headers and unnecessary packages.
Python Runtime Components
graph TB
subgraph PythonBase["python:3.12-slim"]
PyInterpreter["Python 3.12 interpreter"]
PyStdlib["Python standard library"]
BaseUtils["Essential utilities\n(bash, sh, coreutils)"]
end
subgraph InstalledTools["Installed via COPY"]
UV["uv package manager\n/bin/uv, /bin/uvx"]
PyDeps["Python packages\n(requests, beautifulsoup4, html2text)"]
RustBins["Rust binaries\n(mdbook, mdbook-mermaid)"]
Scripts["Application scripts\n(deepwiki-scraper.py, build-docs.sh)"]
end
PythonBase --> UV
UV --> PyDeps
PythonBase --> RustBins
PythonBase --> Scripts
PyDeps --> Runtime["Runtime Environment"]
RustBins --> Runtime
Scripts --> Runtime
The installation sequence follows a specific order:
- Copy uv Dockerfile13 - Multi-stage copy from `ghcr.io/astral-sh/uv:latest`
- Install Python dependencies Dockerfile:16-17 - Uses `uv pip install --system --no-cache`
- Copy Rust binaries Dockerfile:20-21 - Extracts from the builder stage
- Copy application scripts Dockerfile:24-29 - Adds the Python scraper and orchestrator
Sources: Dockerfile:8-29
Binary Extraction and Integration
The critical optimization occurs at Dockerfile:20-21 where the COPY --from=builder directive extracts only the compiled binaries without any build dependencies.
Binary Extraction Pattern
| Source (Stage 1) | Destination (Stage 2) | Purpose |
|---|---|---|
/usr/local/cargo/bin/mdbook | /usr/local/bin/mdbook | Documentation builder executable |
/usr/local/cargo/bin/mdbook-mermaid | /usr/local/bin/mdbook-mermaid | Mermaid preprocessor executable |
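A sketch of the extraction directives, following the source/destination pairs in the table (Dockerfile:20-21 holds the actual lines):

```dockerfile
# Pull only the compiled binaries out of the builder stage; the toolchain is discarded.
COPY --from=builder /usr/local/cargo/bin/mdbook /usr/local/bin/mdbook
COPY --from=builder /usr/local/cargo/bin/mdbook-mermaid /usr/local/bin/mdbook-mermaid
```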
flowchart LR
subgraph BuilderFS["Builder Filesystem"]
CargoDir["/usr/local/cargo/bin/"]
MdBookSrc["mdbook\n(compiled binary)"]
MermaidSrc["mdbook-mermaid\n(compiled binary)"]
CargoDir --> MdBookSrc
CargoDir --> MermaidSrc
end
subgraph RuntimeFS["Runtime Filesystem"]
BinDir["/usr/local/bin/"]
MdBookDst["mdbook\n(extracted)"]
MermaidDst["mdbook-mermaid\n(extracted)"]
BinDir --> MdBookDst
BinDir --> MermaidDst
end
MdBookSrc -.->|COPY --from=builder| MdBookDst
MermaidSrc -.->|COPY --from=builder| MermaidDst
subgraph Discarded["Discarded (not copied)"]
RustToolchain["rustc compiler"]
CargoTool["cargo build tool"]
SourceFiles["mdBook source files"]
BuildCache["cargo build cache"]
end
Both binaries are statically linked or contain all necessary Rust runtime dependencies, allowing them to execute in the Python base image without the Rust toolchain.
Sources: Dockerfile:19-21
Python Dependency Installation
Python dependencies are installed using uv, a fast Python package installer written in Rust. The dependencies are defined in tools/requirements.txt:1-4
Python Dependencies
| Package | Version | Purpose |
|---|---|---|
requests | ≥2.31.0 | HTTP client for scraping DeepWiki |
beautifulsoup4 | ≥4.12.0 | HTML parsing and navigation |
html2text | ≥2020.1.16 | HTML to Markdown conversion |
The installation command Dockerfile17 uses these flags:
- `--system`: Install to system Python (not a virtualenv)
- `--no-cache`: Don't cache downloaded packages (reduces image size)
- `-r /tmp/requirements.txt`: Read dependencies from file
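Given the package table above, tools/requirements.txt:1-4 presumably reads:

```
requests>=2.31.0
beautifulsoup4>=4.12.0
html2text>=2020.1.16
```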
Sources: Dockerfile:16-17 tools/requirements.txt:1-4
graph LR
subgraph Approach1["Single-Stage Approach (Hypothetical)"]
Single["rust:latest + Python\n~2+ GB"]
end
subgraph Approach2["Multi-Stage Approach (Actual)"]
Builder["Stage 1: rust:latest\n~1.5 GB\n(discarded)"]
Runtime["Stage 2: python:3.12-slim\n+ binaries + dependencies\n~300-400 MB"]
Builder -.->|Extract binaries only| Runtime
end
Single -->|Contains unnecessary build toolchain| Waste["Wasted Space"]
Runtime -->|Contains only runtime essentials| Efficient["Efficient"]
style Single fill:#f5f5f5
style Builder fill:#f5f5f5
style Runtime fill:#e8e8e8
style Waste fill:#fff,stroke-dasharray: 5 5
style Efficient fill:#fff,stroke-dasharray: 5 5
Image Size Optimization
The multi-stage strategy achieves significant size reduction by discarding the build environment.
Size Comparison
Size Breakdown of Final Image
| Component | Approximate Size |
|---|---|
| Python 3.12 slim base | ~150 MB |
| Python packages (requests, BeautifulSoup4, html2text) | ~20 MB |
| mdBook binary | ~8 MB |
| mdbook-mermaid binary | ~6 MB |
| uv package manager | ~10 MB |
| Application scripts | <1 MB |
| Total | ~300-400 MB |
Sources: Dockerfile:1-33 README.md156
graph TB
subgraph Filesystem["/usr/local/bin/"]
BuildScript["build-docs.sh\n(orchestrator)"]
ScraperScript["deepwiki-scraper.py\n(Python scraper)"]
MdBookBin["mdbook\n(Rust binary)"]
MermaidBin["mdbook-mermaid\n(Rust binary)"]
UVBin["uv\n(Python installer)"]
end
subgraph SystemPython["/usr/local/lib/python3.12/"]
Requests["requests package"]
BS4["beautifulsoup4 package"]
Html2Text["html2text package"]
end
subgraph Execution["Execution Flow"]
Docker["docker run\n(CMD)"]
Docker --> BuildScript
BuildScript -->|python| ScraperScript
BuildScript -->|subprocess| MdBookBin
MdBookBin -->|preprocessor| MermaidBin
ScraperScript --> Requests
ScraperScript --> BS4
ScraperScript --> Html2Text
end
Runtime Environment Structure
The final image contains a hybrid Python-Rust runtime where Python scripts can execute Rust binaries as subprocesses.
Runtime Component Locations
The entrypoint Dockerfile32 executes /usr/local/bin/build-docs.sh, which orchestrates calls to both Python and Rust components. The script can execute:
- `python /usr/local/bin/deepwiki-scraper.py` for web scraping
- `mdbook init` for initialization
- `mdbook build` for HTML generation
- `mdbook-mermaid install` for asset installation
Sources: Dockerfile:28-32 build-docs.sh
Container Execution Model
When the container runs, Docker executes the CMD Dockerfile32 which invokes build-docs.sh. This shell script has access to all binaries in /usr/local/bin/ (automatically on $PATH).
Process Tree During Execution
graph TD
Docker["docker run\n(container init)"]
Docker --> CMD["CMD: build-docs.sh"]
CMD --> Phase1["Phase 1:\npython deepwiki-scraper.py"]
CMD --> Phase2["Phase 2: mdbook init"]
CMD --> Phase3["Phase 3: mdbook-mermaid install"]
CMD --> Phase4["Phase 4: mdbook build"]
Phase1 --> PyProc["Python 3.12 process"]
PyProc --> ReqLib["requests.get()"]
PyProc --> BS4Lib["BeautifulSoup()"]
PyProc --> H2TLib["html2text.HTML2Text()"]
Phase2 --> MdBookProc1["mdbook binary process"]
Phase3 --> MermaidProc["mdbook-mermaid binary process"]
Phase4 --> MdBookProc2["mdbook binary process"]
MdBookProc2 --> MermaidPreproc["mdbook-mermaid\n(as preprocessor)"]
Sources: Dockerfile32 README.md:122-145
Component Reference
Relevant source files
Purpose and Scope
This page provides a high-level overview of the major components in the DeepWiki-to-mdBook converter system and their responsibilities. Each component is introduced with its primary function, key files, and relationships to other components.
For detailed information about specific components:
- Shell orchestration logic: see build-docs.sh Orchestrator
- Content extraction and diagram processing: see deepwiki-scraper.py
- Header and footer customization: see Template System
- Final HTML generation: see mdBook Integration
System Component Map
The following diagram shows all major components and their organizational relationships:
Sources: scripts/build-docs.sh:1-310 README.md:84-88
graph TB
subgraph "Entry Points"
Dockerfile["Dockerfile\n(multi-stage build)"]
ActionYAML["action.yml\n(GitHub Action)"]
end
subgraph "Orchestration Layer"
BuildScript["build-docs.sh\n(main orchestrator)"]
end
subgraph "Python Components"
Scraper["deepwiki-scraper.py\n(content extraction)"]
TemplateProc["process-template.py\n(template rendering)"]
end
subgraph "Build Tools"
mdBook["mdbook\n(HTML generator)"]
mdBookMermaid["mdbook-mermaid\n(diagram renderer)"]
end
subgraph "Configuration Assets"
HeaderTemplate["templates/header.html"]
FooterTemplate["templates/footer.html"]
BookToml["book.toml\n(generated)"]
SummaryMd["SUMMARY.md\n(generated)"]
end
subgraph "Data Directories"
WikiDir["/workspace/wiki\n(enhanced markdown)"]
RawDir["/workspace/raw_markdown\n(pre-enhancement)"]
BookSrc["/workspace/book/src\n(mdBook input)"]
OutputDir["/output\n(final artifacts)"]
end
Dockerfile -->|builds| BuildScript
Dockerfile -->|installs| Scraper
Dockerfile -->|installs| TemplateProc
Dockerfile -->|compiles| mdBook
Dockerfile -->|compiles| mdBookMermaid
ActionYAML -->|invokes| Dockerfile
BuildScript -->|executes| Scraper
BuildScript -->|executes| TemplateProc
BuildScript -->|executes| mdBook
BuildScript -->|executes| mdBookMermaid
BuildScript -->|generates| BookToml
BuildScript -->|generates| SummaryMd
Scraper -->|writes to| WikiDir
Scraper -->|writes to| RawDir
TemplateProc -->|reads| HeaderTemplate
TemplateProc -->|reads| FooterTemplate
TemplateProc -->|outputs HTML| BuildScript
BuildScript -->|copies| WikiDir
WikiDir -->|to| BookSrc
BuildScript -->|injects templates into| BookSrc
mdBook -->|reads| BookToml
mdBook -->|reads| SummaryMd
mdBook -->|reads| BookSrc
mdBook -->|builds to| OutputDir
mdBookMermaid -->|preprocesses| BookSrc
Core Components
build-docs.sh
Type: Shell script orchestrator
Location: scripts/build-docs.sh:1-310
Entry Point: Docker container CMD instruction
The main orchestration script that coordinates the entire build process. It performs seven sequential steps:
| Step | Line Range | Description |
|---|---|---|
| Configuration | scripts/build-docs.sh:8-60 | Auto-detect repository, set environment defaults |
| Scraping | scripts/build-docs.sh:61-65 | Invoke deepwiki-scraper.py to fetch content |
| Optional Exit | scripts/build-docs.sh:67-93 | If MARKDOWN_ONLY=true, copy outputs and exit |
| mdBook Init | scripts/build-docs.sh:95-122 | Create book.toml and directory structure |
| SUMMARY Generation | scripts/build-docs.sh:124-188 | Discover files and build table of contents |
| Template Processing | scripts/build-docs.sh:190-261 | Process header/footer and inject into markdown |
| Build & Copy | scripts/build-docs.sh:263-309 | Run mdBook build and copy artifacts to /output |
Key Responsibilities:
- Environment variable validation and default assignment
- Git repository auto-detection from remote URLs
- Orchestrating execution order of Python scripts
- Dynamic SUMMARY.md generation with numeric sorting
- Template injection into all markdown files
- Output directory management
Sources: scripts/build-docs.sh:1-310
deepwiki-scraper.py
Type: Python script
Location: python/deepwiki-scraper.py
Invocation: scripts/build-docs.sh65
The content extraction component that scrapes DeepWiki wiki pages and converts them to markdown with embedded diagrams.
Key Responsibilities:
- Fetch wiki HTML from `https://deepwiki.com/{REPO}`
- Parse Next.js data payload to discover wiki structure
- Convert HTML to markdown using the `html2text` library
- Extract Mermaid diagrams from the JavaScript payload
- Normalize diagrams for Mermaid 11 compatibility (7-step pipeline)
- Match diagrams to pages using fuzzy text matching
- Write enhanced markdown to `/workspace/wiki`
- Write pre-enhancement snapshot to `/workspace/raw_markdown`
The scraper is covered in detail in deepwiki-scraper.py.
Sources: scripts/build-docs.sh65 README.md74
process-template.py
Type: Python script
Location: python/process-template.py
Invocation: scripts/build-docs.sh:205-213 scripts/build-docs.sh:222-230
A template rendering utility that processes HTML template files with variable substitution.
Key Responsibilities:
- Read template file from path argument
- Parse variable assignments from command-line arguments (format: `KEY=value`)
- Substitute `{{VARIABLE}}` placeholders with values
- Handle conditional rendering with `{{#if VARIABLE}}...{{/if}}` blocks
- Output processed HTML to stdout

Template Variables Supported:
- `REPO` - Repository identifier (e.g., “owner/repo”)
- `BOOK_TITLE` - Documentation title
- `BOOK_AUTHORS` - Author names
- `GIT_REPO_URL` - Full GitHub repository URL
- `DEEPWIKI_URL` - DeepWiki page URL
- `DEEPWIKI_BADGE_URL` - Badge image URL
- `GITHUB_BADGE_URL` - GitHub badge URL
- `GENERATION_DATE` - Build timestamp
See Template System for comprehensive documentation.
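A minimal sketch of the substitution and conditional logic described above, assuming simple string replacement (the actual implementation in python/process-template.py may differ):

```python
import re
import sys

def render(template: str, variables: dict) -> str:
    """Apply {{#if VAR}}...{{/if}} blocks, then {{VAR}} substitutions."""
    def if_block(match):
        name, body = match.group(1), match.group(2)
        return body if variables.get(name) else ''   # drop block when VAR unset/empty
    out = re.sub(r'\{\{#if (\w+)\}\}(.*?)\{\{/if\}\}', if_block, template, flags=re.DOTALL)
    for name, value in variables.items():
        out = out.replace('{{%s}}' % name, value)
    return out

if __name__ == '__main__':
    template = open(sys.argv[1]).read()
    variables = dict(arg.split('=', 1) for arg in sys.argv[2:])  # KEY=value args
    sys.stdout.write(render(template, variables))                # processed HTML to stdout
```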
Sources: scripts/build-docs.sh:195-234 README.md51
Template Files
Type: HTML configuration files
Location: templates/header.html, templates/footer.html
Default Path: /workspace/templates/
Custom Mount: -v "$(pwd)/my-templates:/workspace/templates"
Static HTML template files that are processed by process-template.py and injected into every markdown file.
File Responsibilities:
| File | Purpose | Injection Point |
|---|---|---|
header.html | Top-of-page content (badges, navigation) | Before markdown content |
footer.html | Bottom-of-page content (metadata, links) | After markdown content |
Injection Logic:
[Header HTML]
<blank line>
[Original Markdown Content]
<blank line>
[Footer HTML]
Templates are injected at scripts/build-docs.sh:240-261 after all markdown files are copied to the book source directory.
Sources: scripts/build-docs.sh:195-234 scripts/build-docs.sh:240-261 README.md:39-51
mdBook and mdbook-mermaid
Type: External build tools (Rust binaries)
Location: /usr/local/bin/mdbook, /usr/local/bin/mdbook-mermaid
Compilation: Dockerfile multi-stage build
Pre-compiled tools that generate the final HTML output.
mdBook Responsibilities:
- Read configuration from `book.toml` scripts/build-docs.sh:102-119
- Parse `SUMMARY.md` to build navigation structure
- Convert markdown files to HTML with search index
- Apply theme (rust theme by default)
- Generate table of contents sidebar
- Create chapter navigation links
mdbook-mermaid Responsibilities:
- Act as mdBook preprocessor scripts/build-docs.sh:113-114
- Detect mermaid code blocks in markdown
- Install JavaScript rendering libraries scripts/build-docs.sh266
- Configure client-side diagram rendering
See mdBook Integration for detailed integration documentation.
Sources: scripts/build-docs.sh:113-114 scripts/build-docs.sh266 scripts/build-docs.sh271
sequenceDiagram
participant Docker
participant BuildScript as "build-docs.sh"
participant Scraper as "deepwiki-scraper.py"
participant TemplateProc as "process-template.py"
participant mdBook as "mdbook"
participant FileSystem as "/output"
Docker->>BuildScript: Execute CMD
BuildScript->>BuildScript: Validate REPO env var
BuildScript->>BuildScript: Auto-detect from git remote
BuildScript->>BuildScript: Set defaults (BOOK_AUTHORS, etc)
BuildScript->>Scraper: Execute with REPO arg
Scraper->>Scraper: Fetch DeepWiki HTML
Scraper->>Scraper: Extract wiki structure
Scraper->>Scraper: Convert HTML to markdown
Scraper->>Scraper: Process Mermaid diagrams
Scraper->>FileSystem: Write /workspace/wiki/*.md
Scraper->>FileSystem: Write /workspace/raw_markdown/*.md
Scraper-->>BuildScript: Exit 0
alt MARKDOWN_ONLY=true
BuildScript->>FileSystem: Copy markdown to /output
BuildScript->>Docker: Exit 0
end
BuildScript->>BuildScript: Create /workspace/book/
BuildScript->>BuildScript: Generate book.toml
BuildScript->>BuildScript: Scan wiki files
BuildScript->>BuildScript: Generate SUMMARY.md
BuildScript->>TemplateProc: Process header.html
TemplateProc-->>BuildScript: Return HTML string
BuildScript->>TemplateProc: Process footer.html
TemplateProc-->>BuildScript: Return HTML string
BuildScript->>BuildScript: Copy wiki/*.md to book/src/
BuildScript->>BuildScript: Inject header into each .md
BuildScript->>BuildScript: Inject footer into each .md
BuildScript->>mdBook: mdbook-mermaid install
BuildScript->>mdBook: mdbook build
mdBook->>mdBook: Parse SUMMARY.md
mdBook->>mdBook: Convert markdown to HTML
mdBook->>mdBook: Build search index
mdBook->>FileSystem: Write /workspace/book/book/
mdBook-->>BuildScript: Exit 0
BuildScript->>FileSystem: Copy book/ to /output/book/
BuildScript->>FileSystem: Copy wiki/ to /output/markdown/
BuildScript->>FileSystem: Copy raw_markdown/ to /output/raw_markdown/
BuildScript->>FileSystem: Copy book.toml to /output/
BuildScript-->>Docker: Exit 0
Component Execution Flow
This diagram shows the runtime execution sequence and data flow between components:
Sources: scripts/build-docs.sh:1-310
File System Organization
The following table maps logical component names to their physical locations in the Docker container and output directory:
| Component | Container Path | Output Path | Description |
|---|---|---|---|
| Main orchestrator | /usr/local/bin/build-docs.sh | - | Shell script entry point |
| Scraper | /usr/local/bin/deepwiki-scraper.py | - | Python extraction script |
| Template processor | /usr/local/bin/process-template.py | - | Python template engine |
| mdBook binary | /usr/local/bin/mdbook | - | Rust-compiled tool |
| mdbook-mermaid binary | /usr/local/bin/mdbook-mermaid | - | Rust-compiled preprocessor |
| Default templates | /workspace/templates/*.html | - | Header/footer HTML files |
| Working wiki dir | /workspace/wiki/ | /output/markdown/ | Enhanced markdown files |
| Raw markdown dir | /workspace/raw_markdown/ | /output/raw_markdown/ | Pre-enhancement snapshot |
| Book workspace | /workspace/book/ | - | Temporary build directory |
| Book source files | /workspace/book/src/ | - | mdBook input directory |
| Generated config | /workspace/book/book.toml | /output/book.toml | mdBook configuration |
| Generated TOC | /workspace/book/src/SUMMARY.md | - | Navigation structure |
| Built HTML | /workspace/book/book/ | /output/book/ | Final documentation site |
Sources: scripts/build-docs.sh:27-31 scripts/build-docs.sh:274-294
graph TD
subgraph "Docker Build Stage 1"
RustBase["rust:latest base image"]
CargoInstall["cargo install"]
RustBase --> CargoInstall
CargoInstall --> mdBookBin["mdbook binary"]
CargoInstall --> mdBookMermaidBin["mdbook-mermaid binary"]
end
subgraph "Docker Build Stage 2"
PythonBase["python:3.12-slim base image"]
PipInstall["pip install"]
PythonBase --> PipInstall
PipInstall --> RequestsLib["requests library"]
PipInstall --> Html2TextLib["html2text library"]
PipInstall --> RapidFuzzLib["rapidfuzz library"]
end
subgraph "Runtime Dependencies"
BuildScript["build-docs.sh"]
ScraperPy["deepwiki-scraper.py"]
TemplatePy["process-template.py"]
BuildScript --> ScraperPy
BuildScript --> TemplatePy
BuildScript --> mdBookBin
BuildScript --> mdBookMermaidBin
ScraperPy --> RequestsLib
ScraperPy --> Html2TextLib
ScraperPy --> RapidFuzzLib
end
subgraph "Environment Inputs"
EnvREPO["REPO env var"]
EnvBOOK_TITLE["BOOK_TITLE env var"]
EnvMARKDOWN_ONLY["MARKDOWN_ONLY env var"]
GitRemote["git remote origin"]
GitRemote -.fallback.-> EnvREPO
EnvREPO --> BuildScript
EnvBOOK_TITLE --> BuildScript
EnvMARKDOWN_ONLY --> BuildScript
end
mdBookBin -.copied from.-> RustBase
mdBookMermaidBin -.copied from.-> RustBase
Component Dependencies
This diagram maps the dependency relationships between components, showing which components require which other components:
Sources: scripts/build-docs.sh:8-19 README.md:14-27
Component Communication Patterns
Inter-Process Communication
All component communication uses standard Unix patterns:
| Pattern | Components | Mechanism |
|---|---|---|
| Parent-child execution | build-docs.sh → Python scripts | python3 /usr/local/bin/script.py args |
| Parent-child execution | build-docs.sh → mdBook tools | mdbook build, mdbook-mermaid install |
| Output capture | build-docs.sh ← process-template.py | Command substitution: VAR=$(python3 ...) |
| Exit status | All → build-docs.sh | Standard exit codes (0 = success) |
| Error propagation | All | set -e in bash (exit on any error) |
Sources: scripts/build-docs.sh2 scripts/build-docs.sh65 scripts/build-docs.sh:205-213
File System Communication
Components communicate via shared file system locations:
Sources: scripts/build-docs.sh:27-31 scripts/build-docs.sh237 scripts/build-docs.sh:274-294
Configuration Communication
Environment variables flow unidirectionally from the container entry point to all components:
| Variable | Set By | Read By | Usage |
|---|---|---|---|
REPO | Docker -e flag | build-docs.sh | DeepWiki URL construction |
BOOK_TITLE | Docker -e flag | build-docs.sh | book.toml generation |
BOOK_AUTHORS | Docker -e flag | build-docs.sh | book.toml generation |
MARKDOWN_ONLY | Docker -e flag | build-docs.sh | Build mode selection |
GENERATION_DATE | build-docs.sh | process-template.py | Template variable |
GIT_REPO_URL | build-docs.sh (derived) | process-template.py | Template variable |
DEEPWIKI_URL | build-docs.sh (derived) | process-template.py | Template variable |
Sources: scripts/build-docs.sh:8-60 scripts/build-docs.sh:200-230
Component Responsibilities Matrix
The following table summarizes what each component is and is not responsible for:
| Component | Responsible For | Not Responsible For |
|---|---|---|
build-docs.sh | Orchestration, environment validation, SUMMARY generation, template injection, output copying | Content extraction, HTML rendering, diagram normalization |
deepwiki-scraper.py | HTTP requests, HTML parsing, markdown conversion, diagram extraction/normalization/matching | File system orchestration, mdBook integration, template processing |
process-template.py | Variable substitution, conditional rendering in templates | File discovery, output management, HTML generation |
mdbook | Markdown to HTML conversion, search index, navigation, theming | Content extraction, diagram processing, template injection |
mdbook-mermaid | Mermaid library installation, diagram rendering configuration | Diagram extraction, diagram normalization, markdown conversion |
Templates (*.html) | Define header/footer structure and variables | Variable substitution, file injection, content generation |
Sources: scripts/build-docs.sh:1-310 README.md:72-77
build-docs.sh Orchestrator
Relevant source files
Purpose and Scope
The build-docs.sh script is the main orchestration layer for the DeepWiki-to-mdBook conversion system. It coordinates all components of the three-phase pipeline, manages configuration, handles environment variable processing, and produces the final output artifacts. This document covers the script’s responsibilities, execution flow, configuration management, and integration with other system components.
For details on the components orchestrated by this script, see deepwiki-scraper.py, Template System, and mdBook Integration. For information on the three-phase architecture, see Three-Phase Pipeline.
Role and Responsibilities
The orchestrator serves as the single entry point for the documentation build process. It is invoked as the Docker container’s default command and coordinates all system components in a sequential, deterministic manner.
Key Responsibilities:
| Responsibility | Implementation |
|---|---|
| Configuration Management | Validates and sets defaults for all environment variables |
| Auto-detection | Discovers repository information from Git remotes |
| Component Coordination | Invokes deepwiki-scraper.py, process-template.py, mdbook, and mdbook-mermaid |
| Error Handling | Uses set -e for fail-fast behavior on any component failure |
| Output Management | Organizes all artifacts into /output directory structure |
| Mode Selection | Supports standard and markdown-only execution modes |
| Template Processing | Coordinates header/footer injection into all markdown files |
Sources: scripts/build-docs.sh:1-310
Architecture Overview
The following diagram maps the orchestrator’s workflow to actual code entities and directory paths used in the script:
Diagram: Orchestrator Component Integration
graph TB
Entry["Entry Point\nbuild-docs.sh"]
subgraph "Configuration Phase"
AutoDetect["Git Auto-detection\nlines 8-19"]
EnvVars["Environment Variables\nREPO, BOOK_TITLE, etc.\nlines 21-26"]
Defaults["Default Generation\nBOOK_AUTHORS, GIT_REPO_URL\nlines 44-51"]
Validate["Validation\nREPO check\nlines 33-38"]
end
subgraph "Execution Phase"
Step1["Step 1: deepwiki-scraper.py\n$REPO → $WIKI_DIR\nlines 61-65"]
Decision{{"MARKDOWN_ONLY?\nline 68"}}
MarkdownExit["Copy to /output/markdown\nlines 69-93"]
Step2["Step 2: mkdir $BOOK_DIR\nCreate book.toml\nlines 95-119"]
Step3["Step 3: Generate SUMMARY.md\nDiscover structure\nlines 124-188"]
Step4["Step 4: process-template.py\nInject headers/footers\nlines 190-261"]
Step5["Step 5: mdbook-mermaid install\nlines 263-266"]
Step6["Step 6: mdbook build\nlines 268-271"]
Step7["Step 7: Copy to /output\nlines 273-295"]
end
Entry --> AutoDetect
AutoDetect --> EnvVars
EnvVars --> Defaults
Defaults --> Validate
Validate --> Step1
Step1 --> Decision
Decision -->|true| MarkdownExit
Decision -->|false| Step2
Step2 --> Step3
Step3 --> Step4
Step4 --> Step5
Step5 --> Step6
Step6 --> Step7
MarkdownExit --> OutputMarkdown["/output/markdown/"]
MarkdownExit --> OutputRaw["/output/raw_markdown/"]
Step7 --> OutputBook["/output/book/"]
Step7 --> OutputMarkdown
Step7 --> OutputRaw
Step7 --> OutputConfig["/output/book.toml"]
Sources: scripts/build-docs.sh:1-310
Configuration Management
The script implements a sophisticated configuration system with automatic detection, environment variable overrides, and sensible defaults.
Auto-Detection Logic
The script attempts to automatically detect the repository from Git metadata if REPO is not explicitly set:
Diagram: Repository Auto-Detection Flow
flowchart TD
Start["Check if REPO\nenvironment variable set"]
Start -->|Not set| CheckGit["Check if .git directory exists\ngit rev-parse --git-dir"]
Start -->|Set| UseProvided["Use provided REPO value"]
CheckGit -->|Yes| GetRemote["Get remote.origin.url\ngit config --get remote.origin.url"]
CheckGit -->|No| RequireManual["Require manual REPO setting"]
GetRemote -->|Found| ExtractOwnerRepo["Extract owner/repo using sed\nPattern: github.com[:/]owner/repo"]
GetRemote -->|Not found| RequireManual
ExtractOwnerRepo --> SetRepo["Set REPO variable"]
UseProvided --> SetRepo
SetRepo --> Validate["Validate REPO is not empty"]
RequireManual --> Validate
Validate -->|Empty| Error["Exit with error\nlines 34-37"]
Validate -->|Valid| Continue["Continue execution"]
The regex pattern at scripts/build-docs.sh16 handles multiple GitHub URL formats:
- `https://github.com/owner/repo.git`
- `git@github.com:owner/repo.git`
- `https://github.com/owner/repo`
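An equivalent extraction, shown here as an illustrative one-liner rather than the exact pattern at scripts/build-docs.sh16:

```sh
# Derive owner/repo from the git remote; handles HTTPS and SSH URL forms above.
REPO=$(git config --get remote.origin.url \
  | sed -E 's#.*github\.com[:/]([^/]+)/([^/.]+)(\.git)?$#\1/\2#')
```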
Sources: scripts/build-docs.sh:8-19 scripts/build-docs.sh:33-38
Configuration Variables
The following table documents all configuration variables managed by the orchestrator:
| Variable | Default | Derivation | Line Reference |
|---|---|---|---|
REPO | Auto-detected | Extracted from git remote.origin.url | scripts/build-docs.sh:9-19 |
BOOK_TITLE | "Documentation" | None | scripts/build-docs.sh23 |
BOOK_AUTHORS | Repository owner | Extracted from $REPO (first segment) | scripts/build-docs.sh45 |
GIT_REPO_URL | GitHub URL | Constructed from $REPO | scripts/build-docs.sh46 |
MARKDOWN_ONLY | "false" | None | scripts/build-docs.sh26 |
WORK_DIR | "/workspace" | Fixed | scripts/build-docs.sh27 |
WIKI_DIR | "/workspace/wiki" | Fixed | scripts/build-docs.sh28 |
RAW_DIR | "/workspace/raw_markdown" | Fixed | scripts/build-docs.sh29 |
OUTPUT_DIR | "/output" | Fixed | scripts/build-docs.sh30 |
BOOK_DIR | "/workspace/book" | Fixed | scripts/build-docs.sh31 |
Computed variables derived from $REPO:
| Variable | Computation | Line Reference |
|---|---|---|
REPO_OWNER | `echo "$REPO" \| cut -d'/' -f1` | -
REPO_NAME | `echo "$REPO" \| cut -d'/' -f2` | -
DEEPWIKI_URL | "https://deepwiki.com/$REPO" | scripts/build-docs.sh48 |
DEEPWIKI_BADGE_URL | "https://deepwiki.com/badge.svg" | scripts/build-docs.sh49 |
REPO_BADGE_LABEL | URL-encoded with dash escaping | scripts/build-docs.sh50 |
GITHUB_BADGE_URL | Shields.io badge URL | scripts/build-docs.sh51 |
Sources: scripts/build-docs.sh:21-51
sequenceDiagram
participant Script as build-docs.sh
participant Scraper as deepwiki-scraper.py
participant FileSystem as File System
participant Templates as process-template.py
participant MDBook as mdbook
participant Mermaid as mdbook-mermaid
Note over Script: Configuration Phase
Script->>Script: Auto-detect REPO
Script->>Script: Set defaults
Script->>Script: Validate configuration
Note over Script: Step 1: Scraping
Script->>FileSystem: rm -rf $RAW_DIR
Script->>Scraper: python3 deepwiki-scraper.py $REPO $WIKI_DIR
Scraper-->>FileSystem: Write markdown to $WIKI_DIR
Scraper-->>FileSystem: Write raw snapshots to $RAW_DIR
alt MARKDOWN_ONLY == true
Note over Script: Markdown-Only Exit Path
Script->>FileSystem: cp $WIKI_DIR to /output/markdown
Script->>FileSystem: cp $RAW_DIR to /output/raw_markdown
Script->>Script: Exit (skip HTML build)
else MARKDOWN_ONLY == false
Note over Script: Step 2: mdBook Initialization
Script->>FileSystem: mkdir -p $BOOK_DIR/src
Script->>FileSystem: Create book.toml
Note over Script: Step 3: SUMMARY.md Generation
Script->>FileSystem: Scan $WIKI_DIR for .md files
Script->>FileSystem: Generate src/SUMMARY.md
Note over Script: Step 4: Template Processing
Script->>Templates: process-template.py $HEADER_TEMPLATE
Templates-->>Script: Processed HEADER_HTML
Script->>Templates: process-template.py $FOOTER_TEMPLATE
Templates-->>Script: Processed FOOTER_HTML
Script->>FileSystem: cp $WIKI_DIR/* to src/
Script->>FileSystem: Inject header/footer into all .md files
Note over Script: Step 5: Mermaid Installation
Script->>Mermaid: mdbook-mermaid install $BOOK_DIR
Mermaid-->>FileSystem: Install mermaid.js assets
Note over Script: Step 6: Build
Script->>MDBook: mdbook build
MDBook-->>FileSystem: Generate book/ directory
Note over Script: Step 7: Output Collection
Script->>FileSystem: cp book to /output/book
Script->>FileSystem: cp $WIKI_DIR to /output/markdown
Script->>FileSystem: cp $RAW_DIR to /output/raw_markdown
Script->>FileSystem: cp book.toml to /output/book.toml
end
Execution Flow
The orchestrator follows a seven-step execution sequence, with conditional branching for markdown-only mode:
Diagram: Step-by-Step Execution Sequence
Sources: scripts/build-docs.sh:61-310
Step Details
Step 1: Wiki Scraping
Lines: scripts/build-docs.sh:61-65
Invokes the Python scraper to fetch and convert DeepWiki content:
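The invocation, as specified in the Interface Specifications later on this page:

```sh
python3 /usr/local/bin/deepwiki-scraper.py "$REPO" "$WIKI_DIR"
```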
The scraper writes output to two locations:
- `$WIKI_DIR` (`/workspace/wiki`): Enhanced markdown with injected diagrams
- `$RAW_DIR` (`/workspace/raw_markdown`): Pre-enhancement markdown snapshots for debugging
For details on the scraper’s operation, see deepwiki-scraper.py.
Sources: scripts/build-docs.sh:61-65
Step 2: mdBook Structure Initialization
Lines: scripts/build-docs.sh:95-119
Skipped if: MARKDOWN_ONLY=true
Creates the mdBook directory structure and generates book.toml configuration:
$BOOK_DIR/
├── book.toml
└── src/
The book.toml file is generated using a heredoc with variable substitution:
| Configuration Section | Variables Used | Purpose |
|---|---|---|
[book] | $BOOK_TITLE, $BOOK_AUTHORS | Book metadata |
[output.html] | $GIT_REPO_URL | Repository link in UI |
[preprocessor.mermaid] | N/A | Enable mermaid diagrams |
[output.html.fold] | N/A | Enable section folding |
Sources: scripts/build-docs.sh:95-119
Step 3: SUMMARY.md Generation
Lines: scripts/build-docs.sh:124-188
flowchart TD
Start["Start SUMMARY.md Generation"]
Start --> WriteHeader["Write '# Summary' header"]
WriteHeader --> FindOverview["Find overview file\ngrep -Ev '^[0-9]'"]
FindOverview -->|Found| WriteOverview["Write overview entry\nExtract title from first line"]
FindOverview -->|Not found| ListMain
WriteOverview --> ListMain["List all main pages\nls $WIKI_DIR/*.md"]
ListMain --> FilterOverview["Filter out overview file"]
FilterOverview --> NumericSort["Sort numerically\nsort -t- -k1 -n"]
NumericSort --> ProcessLoop["For each file"]
ProcessLoop --> ExtractTitle["Extract title\nhead -1 | sed 's/^# //'"]
ExtractTitle --> GetSectionNum["Extract section number\ngrep -oE '^[0-9]+'"]
GetSectionNum --> CheckSubdir{"Subsection directory\nsection-N exists?"}
CheckSubdir -->|Yes| WriteSection["Write section entry\n- [title](filename)"]
WriteSection --> ListSubs["List subsection files\nls section-N/*.md"]
ListSubs --> SortSubs["Sort numerically\nsort -t- -k1 -n"]
SortSubs --> WriteSubLoop["For each subsection:\n- [subtitle](section-N/file)"]
WriteSubLoop --> NextFile
CheckSubdir -->|No| WriteStandalone["Write standalone entry\n- [title](filename)"]
WriteStandalone --> NextFile{"More files?"}
NextFile -->|Yes| ProcessLoop
NextFile -->|No| Complete["Complete src/SUMMARY.md"]
Dynamically generates the table of contents by discovering the file structure in $WIKI_DIR. This step implements numeric sorting and hierarchical organization.
Diagram: SUMMARY.md Generation Algorithm
Key implementation details:
Overview Page Detection: scripts/build-docs.sh:136-144
- Searches for files without a numeric prefix
- Typically matches `Overview.md` or similar

Numeric Sorting: scripts/build-docs.sh:147-155
- Uses `sort -t- -k1 -n` to sort by numeric prefix
- Handles formats like `1-Title.md`, `2.1-Subtopic.md`

Hierarchy Detection: scripts/build-docs.sh:165-180
- Checks for `section-N/` directories for each numeric section
- Creates indented entries for subsections
Sources: scripts/build-docs.sh:124-188
Step 4: Template Processing and File Copying
Lines: scripts/build-docs.sh:190-261
flowchart LR
subgraph "Template Loading"
HeaderPath["$HEADER_TEMPLATE\n/workspace/templates/header.html"]
FooterPath["$FOOTER_TEMPLATE\n/workspace/templates/footer.html"]
GenDate["GENERATION_DATE\ndate -u command"]
end
subgraph "Variable Substitution"
ProcessH["process-template.py\n$HEADER_TEMPLATE"]
ProcessF["process-template.py\n$FOOTER_TEMPLATE"]
Vars["Variables passed:\nDEEPWIKI_URL\nDEEPWIKI_BADGE_URL\nGIT_REPO_URL\nGITHUB_BADGE_URL\nREPO\nBOOK_TITLE\nBOOK_AUTHORS\nGENERATION_DATE"]
end
subgraph "Injection"
CopyFiles["cp $WIKI_DIR/* to src/"]
InjectLoop["For each .md file:\nsrc/*.md src/*/*.md"]
CreateTemp["Create temp file:\nHEADER + content + FOOTER"]
Replace["mv temp to original"]
end
HeaderPath --> ProcessH
FooterPath --> ProcessF
GenDate --> Vars
Vars --> ProcessH
Vars --> ProcessF
ProcessH --> HeaderHTML["HEADER_HTML variable"]
ProcessF --> FooterHTML["FOOTER_HTML variable"]
CopyFiles --> InjectLoop
HeaderHTML --> InjectLoop
FooterHTML --> InjectLoop
InjectLoop --> CreateTemp
CreateTemp --> Replace
Processes header and footer templates and injects them into all markdown files.
Template Processing Flow:
Diagram: Template Processing Pipeline
The template processor is invoked with all configuration variables as arguments: scripts/build-docs.sh:205-213 scripts/build-docs.sh:222-230
File injection pattern: scripts/build-docs.sh:243-257
- Processes all `.md` files in `src/` and `src/*/`
- Creates a temporary file containing header + original content + footer
- Replaces the original with the modified version (a sketch follows below)
For details on the template system and variable substitution, see Template System.
Sources: scripts/build-docs.sh:190-261
Step 5: Mermaid Installation
Lines: scripts/build-docs.sh:263-266
Installs mdbook-mermaid preprocessor assets into the book directory:
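```sh
mdbook-mermaid install "$BOOK_DIR"
```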
This command installs the mermaid.js library and initialization code required for client-side diagram rendering in the final HTML output.
Sources: scripts/build-docs.sh:263-266
Step 6: Book Build
Lines: scripts/build-docs.sh:268-271
Executes the mdBook build process:
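Per the interface specification below, the command runs inside `$BOOK_DIR`:

```sh
cd "$BOOK_DIR"
mdbook build
```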
Build Process:
- Reads `book.toml` configuration
- Processes `src/SUMMARY.md` to determine structure
- Applies the mermaid preprocessor to all markdown files
- Converts markdown to HTML with search indexing
- Outputs to the `$BOOK_DIR/book/` directory
For more information on mdBook integration, see mdBook Integration.
Sources: scripts/build-docs.sh:268-271
Step 7: Output Collection
Lines: scripts/build-docs.sh:273-295
Copies all build artifacts to the /output volume mount for persistence:
| Source | Destination | Description |
|---|---|---|
$BOOK_DIR/book/ | /output/book/ | Built HTML documentation |
$WIKI_DIR/ | /output/markdown/ | Enhanced markdown files |
$RAW_DIR/ | /output/raw_markdown/ | Pre-enhancement markdown (if exists) |
$BOOK_DIR/book.toml | /output/book.toml | Book configuration reference |
The script ensures clean output by removing existing directories before copying: scripts/build-docs.sh:282-290
Sources: scripts/build-docs.sh:273-295
Markdown-Only Mode
When MARKDOWN_ONLY=true, the orchestrator follows a shortened execution path that skips HTML generation:
Execution Path:
- Step 1: Scrape wiki (normal)
- Copy `$WIKI_DIR` to `/output/markdown/`
- Copy `$RAW_DIR` to `/output/raw_markdown/` (if it exists)
- Exit with success
Use Cases:
- Debugging the scraper output without full build overhead
- Extracting markdown for alternative processing pipelines
- CI/CD test workflows that only validate markdown generation
- Custom post-processing before HTML generation
Implementation: scripts/build-docs.sh:68-93
Sources: scripts/build-docs.sh:68-93
Error Handling
The orchestrator implements fail-fast error handling:
Error Handling Mechanisms:
| Mechanism | Implementation | Line Reference |
|---|---|---|
| Exit on any error | set -e | scripts/build-docs.sh2 |
| Configuration validation | Explicit REPO check with error message | scripts/build-docs.sh:33-38 |
| Component failures | Automatic propagation due to set -e | All component invocations |
| Template warnings | Non-fatal warnings if templates not found | scripts/build-docs.sh:215-216 scripts/build-docs.sh:232-233 |
The script does not use explicit error trapping; instead, it relies on Bash’s set -e behavior to immediately exit if any command returns a non-zero status. This ensures that failures in any component (scraper, template processor, mdBook) halt execution and propagate to the Docker container exit code.
Sources: scripts/build-docs.sh2 scripts/build-docs.sh:33-38 scripts/build-docs.sh:215-216 scripts/build-docs.sh:232-233
graph TB
Orchestrator["build-docs.sh"]
subgraph "Python Components"
Scraper["deepwiki-scraper.py\nArgs: REPO, WIKI_DIR"]
Templates["process-template.py\nArgs: template_path, var1=val1, ..."]
end
subgraph "Build Tools"
MDBook["mdbook build\nWorking dir: $BOOK_DIR"]
Mermaid["mdbook-mermaid install\nArgs: $BOOK_DIR"]
end
subgraph "File System"
Input["Input:\n/workspace/templates/"]
Working["Working:\n$WIKI_DIR\n$RAW_DIR\n$BOOK_DIR"]
Output["Output:\n/output/"]
end
subgraph "Environment"
EnvVars["Environment Variables:\nREPO, BOOK_TITLE,\nBOOK_AUTHORS, etc."]
end
EnvVars --> Orchestrator
Input --> Templates
Orchestrator -->|python3| Scraper
Orchestrator -->|python3| Templates
Orchestrator -->|mdbook| MDBook
Orchestrator -->|mdbook-mermaid| Mermaid
Scraper --> Working
Templates --> Orchestrator
Orchestrator --> Working
MDBook --> Working
Mermaid --> Working
Orchestrator --> Output
Integration Points
The orchestrator integrates with multiple system components through well-defined interfaces:
Diagram: Component Integration Interfaces
Interface Specifications:
deepwiki-scraper.py:
- Invocation: `python3 /usr/local/bin/deepwiki-scraper.py $REPO $WIKI_DIR`
- Input: Repository identifier (e.g., `"jzombie/deepwiki-to-mdbook"`)
- Output: Markdown files in `$WIKI_DIR`, raw snapshots in `$RAW_DIR`
- Documentation: deepwiki-scraper.py

process-template.py:
- Invocation: `python3 /usr/local/bin/process-template.py $TEMPLATE_PATH var1=val1 var2=val2 ...`
- Input: Template file path and variable assignments
- Output: Processed HTML string to stdout
- Documentation: Template System

mdbook:
- Invocation: `mdbook build` (in `$BOOK_DIR`)
- Input: `book.toml` and `src/` directory structure
- Output: HTML in `book/` subdirectory
- Documentation: mdBook Integration

mdbook-mermaid:
- Invocation: `mdbook-mermaid install $BOOK_DIR`
- Input: Book directory path
- Output: Mermaid assets installed in book directory
- Documentation: mdBook Integration
Sources: scripts/build-docs.sh65 scripts/build-docs.sh:205-213 scripts/build-docs.sh:222-230 scripts/build-docs.sh266 scripts/build-docs.sh271
Output Artifacts
The orchestrator produces a structured output directory with multiple artifact types:
/output/
├── book/ # Searchable HTML documentation (Step 7)
│ ├── index.html
│ ├── searchindex.js
│ ├── mermaid.min.js
│ └── ...
├── markdown/ # Enhanced markdown files (Step 7 or markdown-only)
│ ├── Overview.md
│ ├── 1-First-Section.md
│ ├── section-1/
│ │ └── 1.1-Subsection.md
│ └── ...
├── raw_markdown/ # Pre-enhancement snapshots (if available)
│ ├── Overview.md
│ ├── 1-First-Section.md
│ └── ...
└── book.toml # Book configuration reference (Step 7)
Artifact Generation Timeline:
| Artifact | Generated By | When | Purpose |
|---|---|---|---|
raw_markdown/ | deepwiki-scraper.py | Step 1 | Debug: pre-enhancement state |
markdown/ | deepwiki-scraper.py | Step 1 | Final markdown with diagrams |
book.toml | build-docs.sh | Step 2 | Book configuration reference |
book/ | mdbook | Step 6 | Final HTML documentation |
Sources: scripts/build-docs.sh:273-295
deepwiki-scraper.py
Relevant source files
Purpose and Scope
The deepwiki-scraper.py script is the primary data extraction and transformation component that converts DeepWiki wiki content into enhanced markdown files. It orchestrates a three-phase pipeline: (1) extracting clean markdown from DeepWiki HTML, (2) enhancing files with normalized Mermaid diagrams using fuzzy matching, and (3) moving completed files to the output directory.
This page documents the script’s architecture, execution model, and key algorithms. For information about how this script is invoked by the build system, see build-docs.sh Orchestrator. For detailed explanations of the extraction and enhancement phases, see Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement.
Sources: python/deepwiki-scraper.py:1-11
Command-Line Interface
The script requires two positional arguments:
- Repository identifier : Format `owner/repo` (e.g., `jzombie/deepwiki-to-mdbook`)
- Output directory : Destination path for generated markdown files
The repository identifier is validated using the regex pattern ^[\w-]+/[\w-]+$ at python/deepwiki-scraper.py:1287-1289 The script exits with status code 1 if validation fails or if the wiki structure cannot be extracted.
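A typical invocation, using the repository from the example above:

```sh
python3 deepwiki-scraper.py jzombie/deepwiki-to-mdbook /workspace/wiki
```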
Sources: python/deepwiki-scraper.py:1-10 python/deepwiki-scraper.py:1277-1289
Three-Phase Execution Model
Figure 1: Three-Phase Execution Pipeline
The main() function at python/deepwiki-scraper.py:1277-1410 implements a three-phase workflow:
| Phase | Function | Primary Responsibility | Output Location |
|---|---|---|---|
| 1 | extract_wiki_structure + extract_page_content | Scrape HTML and convert to markdown | Temporary directory |
| 2 | extract_and_enhance_diagrams | Match and inject Mermaid diagrams | In-place modification of temp directory |
| 3 | File system operations | Move validated files to output | Final output directory |
A temporary directory is created at python/deepwiki-scraper.py:1295-1296 using Python’s tempfile.TemporaryDirectory context manager. This ensures automatic cleanup even if the script fails. A raw markdown snapshot is saved to raw_markdown/ at python/deepwiki-scraper.py:1358-1366 before diagram enhancement for debugging purposes.
Sources: python/deepwiki-scraper.py:1277-1410 python/deepwiki-scraper.py:1298-1371
Wiki Structure Discovery
Figure 2: Structure Discovery Algorithm Using extract_wiki_structure
The extract_wiki_structure function at python/deepwiki-scraper.py:116-163 discovers all wiki pages by parsing the main wiki index page. It uses a compiled regex pattern to find all links matching `^/{repo_pattern}/\d+` at python/deepwiki-scraper.py:128-129.
The page numbering scheme distinguishes main pages from subsections using dot notation:
- Level 0: main pages with no dots (e.g., `1`, `2`, `3`)
- Level 1: subsections with one dot (e.g., `2.1`, `2.2`)
- Level N: deeper subsections with N dots (e.g., `2.1.3`)
The level is calculated at python/deepwiki-scraper.py:145 as `page_num.count('.')`. Pages are sorted using a custom key function at python/deepwiki-scraper.py:157-159 that splits the page number by dots and converts each component to an integer, ensuring proper numerical ordering (e.g., 2.10 comes after 2.9).
Each page dictionary contains:
- `number`: page number string (e.g., `"2.1"`)
- `title`: extracted link text
- `url`: full URL to the page
- `href`: relative path (used for link rewriting)
- `level`: nesting depth based on dot count
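A short illustration of the sort behavior described above, using minimal page dictionaries (the key function mirrors the description of python/deepwiki-scraper.py:157-159; the surrounding names are assumptions):

```python
pages = [
    {"number": "2.10", "title": "Tenth subsection"},
    {"number": "2.9",  "title": "Ninth subsection"},
    {"number": "1",    "title": "Overview"},
]

def sort_key(page):
    # "2.10" -> [2, 10], so it sorts after "2.9" ([2, 9]) numerically
    return [int(part) for part in page["number"].split(".")]

pages.sort(key=sort_key)
print([p["number"] for p in pages])  # ['1', '2.9', '2.10']
```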
Sources: python/deepwiki-scraper.py:116-163 python/deepwiki-scraper.py:145 python/deepwiki-scraper.py:157-161
Path Resolution and Numbering Normalization
Figure 3: Path Resolution Using normalized_number_parts and resolve_output_path
The path resolution system normalizes DeepWiki’s numbering scheme to match mdBook’s conventions. The normalized_number_parts function at python/deepwiki-scraper.py:28-43 shifts page numbers down by one so that DeepWiki’s page 1 becomes unnumbered (the index page), and subsequent pages start at 1.
| DeepWiki Number | normalized_number_parts Output | Final Filename |
|---|---|---|
"1" | [] (empty list) | overview.md (unnumbered) |
"2" | ["1"] | 1-introduction.md |
"3.1" | ["2", "1"] | 2-1-subsection.md |
"3.2" | ["2", "2"] | 2-2-another.md |
The resolve_output_path function at python/deepwiki-scraper.py:45-53 combines normalized numbers with sanitized titles. Subsections (with `len(parts) > 1`) are placed in directories named `section-{main_number}` at python/deepwiki-scraper.py:52. The sanitize_filename function at python/deepwiki-scraper.py:22-26 strips special characters and normalizes whitespace using the regex patterns `r'[^\w\s-]'` and `r'[-\s]+'`.
The build_target_path function at python/deepwiki-scraper.py:55-63 constructs full relative paths for link rewriting, used by the link-fixing logic at python/deepwiki-scraper.py:854-875.
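A minimal reimplementation of the numbering shift and path resolution, written to reproduce the table above (the function bodies are inferred from the described behavior, not copied from the script):

```python
import re

def sanitize_filename(text):
    # Strip special characters, then collapse hyphen/whitespace runs
    text = re.sub(r"[^\w\s-]", "", text).strip().lower()
    return re.sub(r"[-\s]+", "-", text)

def normalized_number_parts(page_number):
    parts = [int(p) for p in page_number.split(".")]
    parts[0] -= 1  # DeepWiki page 1 becomes the unnumbered index page
    if len(parts) == 1 and parts[0] == 0:
        return []
    return [str(p) for p in parts]

def resolve_output_path(page_number, title):
    parts = normalized_number_parts(page_number)
    slug = sanitize_filename(title)
    if not parts:
        return f"{slug}.md"                  # e.g. overview.md
    name = f"{'-'.join(parts)}-{slug}.md"
    if len(parts) > 1:
        return f"section-{parts[0]}/{name}"  # subsections nest in section-N/
    return name

print(resolve_output_path("1", "Overview"))      # overview.md
print(resolve_output_path("2", "Introduction"))  # 1-introduction.md
print(resolve_output_path("3.1", "Subsection"))  # section-2/2-1-subsection.md
```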
Sources: python/deepwiki-scraper.py:28-63 python/deepwiki-scraper.py:22-26 python/deepwiki-scraper.py:854-875
Content Extraction and HTML-to-Markdown Conversion
Figure 4: Content Extraction Pipeline Using extract_page_content
The extract_page_content function at python/deepwiki-scraper.py:751-877 implements a multi-stage HTML cleaning and conversion pipeline. BeautifulSoup selectors at python/deepwiki-scraper.py:761-762 remove navigation elements before content extraction.
The content finder at python/deepwiki-scraper.py:765-779 tries a prioritized list of selectors: article, main, .wiki-content, .content, #content, .markdown-body, and finally falls back to body. DeepWiki-specific UI elements are removed at python/deepwiki-scraper.py:786-795 by searching for text patterns like “Index your code with Devin” and “Edit Wiki”.
Navigation list removal at python/deepwiki-scraper.py:799-806 detects and removes <ul> elements containing more than 5 links where 80%+ are internal wiki links.
The convert_html_to_markdown function at python/deepwiki-scraper.py:213-228 uses the html2text library with the following configuration:
- `ignore_links = False`: preserve all links
- `body_width = 0`: disable line wrapping to prevent formatting issues
A note at python/deepwiki-scraper.py:221-223 explicitly documents that Mermaid diagram processing is disabled during HTML conversion, because diagrams from all pages are mixed together in the JavaScript payload.
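A minimal sketch of the conversion step under the two settings above (html2text is the real library; the wrapper mirrors the documented function name):

```python
import html2text

def convert_html_to_markdown(html_content):
    h = html2text.HTML2Text()
    h.ignore_links = False  # keep all hyperlinks in the output
    h.body_width = 0        # disable hard wrapping, which would break code blocks
    return h.handle(html_content)

print(convert_html_to_markdown("<h1>Title</h1><p>See <a href='/x'>the link</a>.</p>"))
```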
The clean_deepwiki_footer function at python/deepwiki-scraper.py:165-211 removes DeepWiki UI elements using compiled regex patterns for text like “Dismiss”, “Refresh this wiki”, and “On this page”. It scans backwards from the end of the file up to 50 lines to find footer markers at python/deepwiki-scraper.py:187-191
Link rewriting at python/deepwiki-scraper.py:854-875 converts DeepWiki URLs to relative markdown paths, handling both same-section and cross-section references by calculating relative paths based on the source file’s section directory.
Sources: python/deepwiki-scraper.py:751-877 python/deepwiki-scraper.py:213-228 python/deepwiki-scraper.py:165-211 python/deepwiki-scraper.py:395-406
Diagram Extraction from JavaScript Payload
Figure 5: Diagram Extraction Using extract_and_enhance_diagrams
The extract_and_enhance_diagrams function at python/deepwiki-scraper.py:880-1275 extracts all Mermaid diagrams from DeepWiki's Next.js JavaScript payload. The regex pattern at python/deepwiki-scraper.py:899 matches fenced code blocks with various newline formats: `\\r\\n`, `\\n`, or actual newline characters.
Context extraction at python/deepwiki-scraper.py:903-1087 captures up to 2000 characters before each diagram to enable fuzzy matching. For each diagram, the context is parsed to extract:
- Last heading: the most recent line starting with `#` (searched backwards from the diagram position)
- Anchor text: the last 2-3 non-heading lines exceeding 20 characters in length, concatenated and truncated to 300 characters
The context extraction logic at python/deepwiki-scraper.py:1066-1081 searches backwards through context lines to find the last heading, then collects up to 3 substantial non-heading lines as anchor text.
The unescaping phase at python/deepwiki-scraper.py:1039-1046 handles JavaScript string escapes:
| Escaped Sequence | Unescaped Result |
|---|---|
| \\n | Newline character |
| \\t | Tab character |
| \\" | Double quote |
| \\\\ | Single backslash |
| \\u003c | < character |
| \\u003e | > character |
| \\u0026 | & character |
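A sketch of the unescaping pass applying the table's mappings (the replacement order, with `\\n` handled before the double-backslash rule, is an assumption based on the table's ordering):

```python
def unescape_js_string(payload):
    replacements = [
        ("\\n", "\n"),      # escaped newline -> real newline
        ("\\t", "\t"),      # escaped tab
        ('\\"', '"'),       # escaped double quote
        ("\\\\", "\\"),     # double backslash -> single backslash
        ("\\u003c", "<"),
        ("\\u003e", ">"),
        ("\\u0026", "&"),
    ]
    for escaped, plain in replacements:
        payload = payload.replace(escaped, plain)
    return payload

print(unescape_js_string('graph TD\\n  A[\\"Start\\"] --\\u003e B'))
```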
The merge_multiline_labels function at python/deepwiki-scraper.py:907-1009 collapses wrapped Mermaid labels into literal \n sequences. This is crucial because DeepWiki sometimes wraps long labels across multiple lines in the HTML, but Mermaid 11 expects these to be explicitly marked with \n tokens.
Sources: python/deepwiki-scraper.py:880-1087 python/deepwiki-scraper.py:899 python/deepwiki-scraper.py:1039-1046 python/deepwiki-scraper.py:907-1009
Seven-Step Mermaid Normalization Pipeline
Figure 6: Seven-Step Normalization Pipeline Using normalize_mermaid_diagram
The normalize_mermaid_diagram function at python/deepwiki-scraper.py:385-393 applies seven normalization passes to ensure Mermaid 11 compatibility:
Step 1: normalize_mermaid_edge_labels
Function at python/deepwiki-scraper.py:230-251. Applies only to graphs and flowcharts (detected by checking whether the first line starts with `graph` or `flowchart`). Uses the regex `r'\|([^|]*)\|'` to find edge labels and flattens any containing `\n`, `\\n`, `(`, or `)` by:
- replacing `\\n` and `\n` with spaces
- removing parentheses
- collapsing whitespace with `re.sub(r'\s+', ' ', cleaned).strip()`
Step 2: normalize_mermaid_state_descriptions
Function at python/deepwiki-scraper.py:253-277. Applies only to state diagrams. Ensures state descriptions use the syntax `State : Description` by:
- skipping lines with `::` (already valid)
- splitting on a single `:` and cleaning the suffix
- replacing colons in the description with `-`
- rebuilding the line as `{prefix.rstrip()} : {cleaned_suffix}`
Step 3: normalize_flowchart_nodes
Function at python/deepwiki-scraper.py:279-301. Applies to graphs and flowcharts. Uses the regex `r'\["([^"]*)"\]'` to find node labels and:
- replaces pipe characters with forward slashes
- collapses whitespace
- inserts newlines between consecutive statements using the regex at python/deepwiki-scraper.py:298-299
Step 4: normalize_statement_separators
Function at python/deepwiki-scraper.py:313-328. Applies to graphs and flowcharts. The STATEMENT_BREAK_PATTERN at python/deepwiki-scraper.py:309-311 detects consecutive statements on one line and inserts newlines between them while preserving indentation.
Step 5: normalize_empty_node_labels
Function at python/deepwiki-scraper.py:330-341. Uses the regex `r'(\b[A-Za-z0-9_]+)\[""\]'` to find nodes with empty labels and generates a fallback label from the node ID by replacing underscores and hyphens with spaces.
Step 6: normalize_gantt_diagram
Function at python/deepwiki-scraper.py:343-383. Applies only to Gantt diagrams. Detects task lines missing IDs using the pattern `r'^(\s*"[^"]+"\s*):\s*(.+)$'` and inserts synthetic IDs (task1, task2, etc.) when the first token after the colon is neither an ID nor an `after` reference.
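As an illustration of Step 1, a sketch that flattens edge labels under the rules above (the regexes are quoted from the text; the surrounding scaffolding is an assumption):

```python
import re

def normalize_mermaid_edge_labels(diagram_text):
    first_line = diagram_text.lstrip().splitlines()[0]
    if not (first_line.startswith("graph") or first_line.startswith("flowchart")):
        return diagram_text  # only graphs and flowcharts are rewritten

    def flatten(match):
        label = match.group(1)
        if not any(tok in label for tok in ("\\n", "\n", "(", ")")):
            return match.group(0)  # leave clean labels untouched
        cleaned = label.replace("\\n", " ").replace("\n", " ")
        cleaned = cleaned.replace("(", "").replace(")", "")
        return "|" + re.sub(r"\s+", " ", cleaned).strip() + "|"

    return re.sub(r"\|([^|]*)\|", flatten, diagram_text)

print(normalize_mermaid_edge_labels("graph TD\n  A -->|send\\nrequest (HTTP)| B"))
# graph TD
#   A -->|send request HTTP| B
```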
Sources: python/deepwiki-scraper.py:385-393 python/deepwiki-scraper.py:230-251 python/deepwiki-scraper.py:253-277 python/deepwiki-scraper.py:279-301 python/deepwiki-scraper.py:313-328 python/deepwiki-scraper.py:330-341 python/deepwiki-scraper.py:343-383
Fuzzy Matching and Diagram Injection
Figure 7: Fuzzy Matching Algorithm for Diagram Injection
The fuzzy matching algorithm at python/deepwiki-scraper.py:1150-1275 pairs each diagram with its correct markdown file by matching context against file contents. The algorithm uses a progressive chunk-size strategy at python/deepwiki-scraper.py:1188 to find matches:
| Chunk Size | Use Case |
|---|---|
| 300 chars | Highest precision - exact context match |
| 200 chars | Medium precision - paragraph-level match |
| 150 chars | Lower precision - sentence-level match |
| 100 chars | Low precision - phrase-level match |
| 80 chars | Minimum threshold - short phrase match |
The matching loop at python/deepwiki-scraper.py:1170-1238 attempts anchor text matching first. The anchor text (the last 2-3 lines of context before the diagram) is normalized to lowercase with whitespace collapsed at python/deepwiki-scraper.py:1185-1186. For each chunk size, the algorithm searches for the test chunk, taken from the end of the anchor text (`anchor_normalized[-chunk_size:]`), in the normalized file content.
If anchor matching fails (score < 80), the algorithm falls back to heading matching at python/deepwiki-scraper.py:1204-1216. This compares the last_heading from the diagram context against all headings in the file after normalizing both by removing `#` symbols and collapsing whitespace.
Only matches with `best_match_score >= 80` are accepted at python/deepwiki-scraper.py:1218. This threshold balances precision (avoiding false matches) with recall (ensuring most diagrams are placed).
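A condensed sketch of the progressive anchor matching (chunk sizes from the table; using the chunk size itself as the score is an assumption consistent with the >= 80 threshold):

```python
import re

CHUNK_SIZES = [300, 200, 150, 100, 80]

def normalize(text):
    return re.sub(r"\s+", " ", text.lower()).strip()

def anchor_match_score(anchor_text, file_content):
    """Return the largest chunk size whose anchor tail appears in the file."""
    anchor = normalize(anchor_text)
    content = normalize(file_content)
    for chunk_size in CHUNK_SIZES:
        chunk = anchor[-chunk_size:]  # match against the END of the anchor text
        if chunk and chunk in content:
            return chunk_size
    return 0

anchor = "the scraper writes enhanced markdown files to the output directory"
page = "...and then the scraper writes enhanced markdown files to the output directory.\n\nNext section..."
print(anchor_match_score(anchor, page) >= 80)  # True: the placement is accepted
```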
Insertion Point Logic
The insertion point finder at python/deepwiki-scraper.py:1220-1236 behaves differently based on match type:
After a heading match:
- Skip blank lines after the heading
- Skip through the following paragraph
- Insert after the paragraph ends (blank line or next heading)
After a paragraph match:
- Find the end of the current paragraph
- Insert when encountering a blank line or heading
Content Guards
The enforce_content_start function at python/deepwiki-scraper.py:1138-1147 and advance_past_lists function at python/deepwiki-scraper.py:1125-1137 implement content guards to prevent diagram insertion in protected areas:
Protected prefix (detected by protected_prefix_end at python/deepwiki-scraper.py:1101-1115):
- Title line (the first line starting with `#`)
- "Relevant source files" section and its list items
- Blank lines in these sections
List blocks (detected by is_list_line at python/deepwiki-scraper.py:1117-1123):
- Lines starting with `-`, `*`, or `+`
- Lines matching `\d+[.)]\s` (numbered lists)
Diagrams are never inserted inside list blocks. If the insertion point lands in a list, advance_past_lists moves the insertion point to after the list ends.
Dynamic Fence Length
The insertion logic at python/deepwiki-scraper.py:1249-1266 calculates a dynamic fence length to handle diagrams that themselves contain backticks. It scans the diagram text for the longest run of consecutive backticks and sets `fence_len = max(3, max_backticks + 1)`. This ensures the fence markers (e.g., a four-backtick fence such as ````mermaid) always properly delimit the diagram content.
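A sketch of the dynamic fence computation (the scanning logic is reconstructed from the description above):

```python
import re

def fence_for(diagram_text):
    runs = re.findall(r"`+", diagram_text)
    max_backticks = max((len(run) for run in runs), default=0)
    fence_len = max(3, max_backticks + 1)  # always longer than any internal run
    return "`" * fence_len

diagram = 'graph TD\n  A["label quoting ``` three backticks"] --> B'
fence = fence_for(diagram)
print(f"{fence}mermaid\n{diagram}\n{fence}")  # opens with a 4-backtick fence
```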
Sources: python/deepwiki-scraper.py:1150-1275 python/deepwiki-scraper.py:1170-1238 python/deepwiki-scraper.py:1101-1147 python/deepwiki-scraper.py:1249-1266
Error Handling and Retry Logic
Figure 8: Retry Logic in fetch_page Function
The fetch_page function at python/deepwiki-scraper.py:65-80 implements a 3-attempt retry strategy. The retry loop at python/deepwiki-scraper.py:71-80 catches all exceptions using a bare `except Exception as e` clause and retries after a fixed 2-second delay using `time.sleep(2)`.
Browser-like headers are set at python/deepwiki-scraper.py:67-69 to avoid bot detection:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
The timeout is set to 30 seconds at python/deepwiki-scraper.py:73. After a successful fetch, `response.raise_for_status()` validates the HTTP status code.
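A minimal reconstruction of fetch_page from the details above (three attempts, a 2-second delay, a 30-second timeout, and the quoted User-Agent):

```python
import time
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

def fetch_page(url, session, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = session.get(url, headers=HEADERS, timeout=30)
            response.raise_for_status()  # turn HTTP error codes into exceptions
            return response
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate to the per-page handler
            time.sleep(2)
```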
The main extraction loop at python/deepwiki-scraper.py:1328-1353 catches exceptions per page and continues processing the remaining pages, so a single page failure doesn't abort the entire scraping run. The success count is reported at python/deepwiki-scraper.py:1355.
The top-level try-except block at python/deepwiki-scraper.py:1310-1407 catches any unhandled exceptions and exits with status code 1, signaling failure to the calling build script.
Sources: python/deepwiki-scraper.py:65-80 python/deepwiki-scraper.py:1328-1353 python/deepwiki-scraper.py:1310-1407
Session Management and Rate Limiting
The script uses a requests.Session object created at python/deepwiki-scraper.py:1305-1308 with persistent headers.
Session reuse provides connection pooling and persistent cookies across requests. The session is passed to all HTTP functions: extract_wiki_structure, extract_page_content, and extract_and_enhance_diagrams.
Rate limiting is implemented at python/deepwiki-scraper.py:1350 with a 1-second sleep between page extractions. This prevents overwhelming the DeepWiki server and reduces the risk of rate limiting or IP blocking. The comment at python/deepwiki-scraper.py:1349 explicitly states "Be nice to the server".
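A sketch combining the session setup and the politeness delay (the URL is illustrative):

```python
import time
import requests

session = requests.Session()  # connection pooling + persistent cookies
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
})

urls = ["https://deepwiki.com/jzombie/deepwiki-to-mdbook"]
for url in urls:
    response = session.get(url, timeout=30)
    time.sleep(1)  # 1-second pause between pages: "Be nice to the server"
```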
Sources: python/deepwiki-scraper.py:1305-1308 python/deepwiki-scraper.py:1349-1350
Key Function Reference
| Function | Lines | Purpose |
|---|---|---|
| main() | 1277-1410 | Entry point - orchestrates three-phase pipeline |
| extract_wiki_structure(repo, session) | 116-163 | Discover all wiki pages from index |
| extract_page_content(url, session, page_info) | 751-877 | Extract and clean single page content |
| extract_and_enhance_diagrams(repo, temp_dir, session, url) | 880-1275 | Extract diagrams and inject into files |
| convert_html_to_markdown(html_content) | 213-228 | Convert HTML to markdown using html2text |
| clean_deepwiki_footer(markdown) | 165-211 | Remove DeepWiki UI elements from footer |
| normalize_mermaid_diagram(diagram_text) | 385-393 | Apply seven-step normalization pipeline |
| normalize_mermaid_edge_labels(diagram_text) | 230-251 | Flatten multiline edge labels |
| normalize_mermaid_state_descriptions(diagram_text) | 253-277 | Fix state diagram syntax |
| normalize_flowchart_nodes(diagram_text) | 279-301 | Clean flowchart node labels |
| normalize_statement_separators(diagram_text) | 313-328 | Insert newlines between statements |
| normalize_empty_node_labels(diagram_text) | 330-341 | Provide fallback labels |
| normalize_gantt_diagram(diagram_text) | 343-383 | Add synthetic task IDs |
| merge_multiline_labels(diagram_text) | 907-1009 | Collapse wrapped labels |
| strip_wrapping_quotes(diagram_text) | 1011-1022 | Remove extra quotes |
| fetch_page(url, session) | 65-80 | HTTP fetch with retry logic |
| sanitize_filename(text) | 22-26 | Convert text to safe filename |
| normalized_number_parts(page_number) | 28-43 | Shift DeepWiki numbering down by 1 |
| resolve_output_path(page_number, title) | 45-53 | Determine filename and section dir |
| build_target_path(page_number, slug) | 55-63 | Build relative path for links |
| format_source_references(markdown) | 397-406 | Insert colons in source links |
Sources: python/deepwiki-scraper.py:1-1411
Template System
Relevant source files
The template system provides customizable header and footer content that is injected into every markdown file during the build process. This system uses a simple variable substitution syntax with conditional rendering support, allowing users to customize the appearance and metadata of generated documentation without modifying the core build scripts.
For information about how templates are injected during the build process, see Template Injection. For comprehensive documentation on template variables, see Template Variables.
Sources: templates/README.md:1-77
Template Files
The system uses two HTML template files that define content to be injected into each markdown file:
| Template File | Location | Purpose | Injection Point |
|---|---|---|---|
| header.html | /workspace/templates/header.html | Injected at the beginning of each markdown file | Immediately after frontmatter |
| footer.html | /workspace/templates/footer.html | Injected at the end of each markdown file | After all content |
Both templates are processed through the same variable substitution engine before injection.
Sources: templates/README.md:6-8 templates/header.html:1-9 templates/footer.html:1-11
Default Header Template
The default header template displays project badges and attribution information.
The template uses inline conditionals to prevent mdBook from wrapping links in separate paragraph tags, which would break the styling.
Sources: templates/header.html:1-9
Default Footer Template
The default footer template displays generation metadata and repository information.
Sources: templates/footer.html:1-11
Template Syntax
The template system supports three syntactic features: variable substitution, conditional rendering, and HTML comments.
Variable Substitution
Variables use double-brace syntax: {{VARIABLE_NAME}}. The processor replaces these with the corresponding variable value, or an empty string if the variable is not defined.
Variable names must match the pattern \w+ (alphanumeric and underscore characters).
Sources: python/process-template.py:38-45 templates/README.md:12-15
Conditional Rendering
Conditional blocks use {{#if VARIABLE}}...{{/if}} syntax. The content between the tags is included only if the variable exists and is non-empty.
The conditional pattern matches \{\{#if\s+(\w+)\}\}(.*?)\{\{/if\}\} and evaluates whether the variable is truthy.
Sources: python/process-template.py:24-36 templates/README.md:17-22
HTML Comments
HTML comments are automatically stripped from the output during processing.
Sources: python/process-template.py:47-48 templates/README.md:24-28
Template Processing Engine
Diagram: Template Processing Flow
Sources: python/process-template.py:1-82
The process-template.py script implements a two-pass processing algorithm:
- Conditional Processing (first pass): evaluates `{{#if}}` blocks using regular-expression matching and removes or includes content based on variable truthiness (python/process-template.py:24-36)
- Variable Substitution (second pass): replaces `{{VAR}}` placeholders with actual values (python/process-template.py:38-45)
- Comment Removal (cleanup): strips HTML comments from the final output (python/process-template.py:47-48)
Command-Line Interface
The script accepts a template file path and variable assignments in KEY=value format, for example `python3 process-template.py header.html REPO=jzombie/deepwiki-to-mdbook BOOK_TITLE="My Docs"`.
Arguments are parsed at python/process-template.py:66-70 where each KEY=value pair is split and stored in a dictionary for substitution.
Sources: python/process-template.py:53-82
Available Template Variables
The following variables are provided by the build system and available in all templates:
| Variable | Description | Example Value | Source |
|---|---|---|---|
| REPO | Repository in owner/repo format | jzombie/deepwiki-to-mdbook | Environment or Git detection |
| BOOK_TITLE | Documentation title | DeepWiki Documentation | Environment or auto-generated |
| BOOK_AUTHORS | Author names | zenOSmosis | Environment or Git config |
| GENERATION_DATE | ISO 8601 timestamp (UTC) | 2024-01-15T10:30:00Z | Build time |
| DEEPWIKI_URL | DeepWiki documentation URL | https://deepwiki.com/wiki/... | DeepWiki scraper |
| DEEPWIKI_BADGE_URL | DeepWiki badge image URL | https://deepwiki.com/badge/... | Constructed from DEEPWIKI_URL |
| GIT_REPO_URL | Full Git repository URL | https://github.com/... | Constructed from REPO |
| GITHUB_BADGE_URL | GitHub badge image URL | https://img.shields.io/github/... | Constructed from REPO |
All variables are optional. If a variable is undefined, it is replaced with an empty string during substitution.
Sources: templates/README.md:30-39
Customization
Diagram: Template Customization Architecture
Sources: templates/README.md:42-56
Volume Mount Customization
Custom templates can be provided by mounting a local directory or individual files into the Docker container, for example:
- Mount the entire template directory: `-v "$(pwd)/templates:/workspace/templates"`
- Mount individual template files: `-v "$(pwd)/header.html:/workspace/templates/header.html"`
Sources: templates/README.md:45-51
Environment Variable Customization
Template locations can be overridden via environment variables:
| Environment Variable | Default Value | Description |
|---|---|---|
| TEMPLATE_DIR | /workspace/templates | Base directory for templates |
| HEADER_TEMPLATE | $TEMPLATE_DIR/header.html | Full path to header template |
| FOOTER_TEMPLATE | $TEMPLATE_DIR/footer.html | Full path to footer template |
For example, passing `-e HEADER_TEMPLATE=/custom/header.html` to `docker run` overrides the header template path.
Sources: templates/README.md:53-56
Integration with Build Process
Diagram: Template System in Build Pipeline
Sources: templates/README.md:1-77 Diagram 2 from high-level system architecture
The template system is invoked during Phase 3 of the build pipeline, after markdown files have been generated and enhanced with diagrams. The build-docs.sh orchestrator script processes each template file through process-template.py, then injects the processed content into every markdown file before generating SUMMARY.md and running mdbook build.
Template processing occurs in the following sequence:
- Template Resolution: determine paths to the header and footer templates using environment variables or defaults
- Variable Collection: gather all template variables from the environment and runtime context
- Template Processing: invoke process-template.py for each template file with the collected variables
- Content Injection: prepend the processed header and append the processed footer to each markdown file
- Build Continuation: proceed with SUMMARY.md generation and the mdBook build
This design ensures that all documentation pages share consistent branding, navigation, and metadata without requiring manual edits to individual markdown files.
Sources: templates/README.md:1-77
Example Custom Templates
Minimal Header Example
Sources: templates/README.md:60-68
Custom Footer Example
Sources: templates/README.md:70-76
Conditional Badge Example
This example demonstrates conditional rendering of badges based on available repository and documentation URLs. If neither GIT_REPO_URL nor DEEPWIKI_URL is defined, the template produces no output.
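A plausible version of such a template, using only the documented variables (not the shipped file verbatim):

```html
{{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}"><img src="{{GITHUB_BADGE_URL}}" alt="GitHub"></a>{{/if}}
{{#if DEEPWIKI_URL}}<a href="{{DEEPWIKI_URL}}"><img src="{{DEEPWIKI_BADGE_URL}}" alt="DeepWiki"></a>{{/if}}
```

Each conditional stays on a single line so that mdBook does not wrap the links in separate paragraph tags, matching the styling note for the default header.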
Sources: templates/header.html:2-4 templates/README.md:17-22
mdBook Integration
Relevant source files
Purpose and Scope
This page documents how the system integrates with mdBook and mdbook-mermaid to generate the final HTML documentation. It covers the configuration generation process, automatic table of contents creation, mermaid diagram support installation, and the build execution. For information about the overall three-phase pipeline, see Three-Phase Pipeline. For template injection specifics, see Template Injection.
mdBook and mdbook-mermaid Tools
The system uses two Rust-based tools compiled during the Docker build:
| Tool | Purpose | Installation Location |
|---|---|---|
| mdbook | Static site generator for documentation from Markdown | /usr/local/bin/mdbook |
| mdbook-mermaid | mdBook preprocessor for rendering Mermaid diagrams | /usr/local/bin/mdbook-mermaid |
Both tools are compiled from source in the Rust builder stage of the Docker image and copied to the final Python-based runtime container for size optimization.
mdBook Build Pipeline
graph TB
subgraph "Input_Preparation"
WikiDir["/workspace/wiki/\nEnhanced markdown files"]
Templates["Processed templates\nHEADER_HTML, FOOTER_HTML"]
end
subgraph "Configuration_Generation"
BookToml["book.toml generation\n[scripts/build-docs.sh:102-119]"]
SummaryGen["SUMMARY.md generation\n[scripts/build-docs.sh:124-186]"]
EnvVars["Environment variables:\nBOOK_TITLE, BOOK_AUTHORS\nGIT_REPO_URL"]
end
subgraph "Structure_Creation"
BookDir["/workspace/book/\nmdBook project root"]
SrcDir["/workspace/book/src/\nMarkdown source"]
CopyFiles["Copy wiki files to src/\n[scripts/build-docs.sh:237]"]
InjectTemplates["Inject header/footer\n[scripts/build-docs.sh:239-261]"]
end
subgraph "mdBook_Processing"
MermaidInstall["mdbook-mermaid install\n[scripts/build-docs.sh:266]"]
MdBookBuild["mdbook build\n[scripts/build-docs.sh:271]"]
Preprocessor["mdbook-mermaid preprocessor\nConverts ```mermaid blocks"]
end
subgraph "Output"
BookHTML["/workspace/book/book/\nBuilt HTML documentation"]
OutputCopy["Copy to /output/book/\n[scripts/build-docs.sh:279]"]
end
EnvVars --> BookToml
EnvVars --> SummaryGen
WikiDir --> CopyFiles
Templates --> InjectTemplates
BookToml --> BookDir
SummaryGen --> SrcDir
CopyFiles --> SrcDir
InjectTemplates --> SrcDir
BookDir --> MermaidInstall
MermaidInstall --> MdBookBuild
SrcDir --> MdBookBuild
MdBookBuild --> Preprocessor
Preprocessor --> BookHTML
BookHTML --> OutputCopy
Sources: scripts/build-docs.sh:95-295
Configuration Generation (book.toml)
The book.toml configuration file is dynamically generated at scripts/build-docs.sh:102-119 using environment variables.
book.toml Structure
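A representative book.toml assembled from the Configuration Sections table below (the title, authors, and repository URL are illustrative values):

```toml
[book]
title = "DeepWiki Documentation"      # $BOOK_TITLE
authors = ["zenOSmosis"]              # $BOOK_AUTHORS
language = "en"
src = "src"

[output.html]
default-theme = "rust"
git-repository-url = "https://github.com/jzombie/deepwiki-to-mdbook"  # $GIT_REPO_URL

[output.html.fold]
enable = true
level = 1

[preprocessor.mermaid]
command = "mdbook-mermaid"
```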
Sources: scripts/build-docs.sh:102-119
Configuration Sections
| Section | Key | Value Source | Purpose |
|---|---|---|---|
[book] | title | $BOOK_TITLE | Documentation title displayed in UI |
[book] | authors | $BOOK_AUTHORS | Author names shown in metadata |
[book] | language | "en" (hardcoded) | Content language for HTML lang attribute |
[book] | src | "src" (hardcoded) | Markdown source directory relative to book root |
[output.html] | default-theme | "rust" (hardcoded) | Visual theme (rust, light, navy, ayu, coal) |
[output.html] | git-repository-url | $GIT_REPO_URL | Enables “Edit on GitHub” link in top bar |
[preprocessor.mermaid] | command | "mdbook-mermaid" | Specifies mermaid preprocessor executable |
[output.html.fold] | enable | true | Enables collapsible sidebar sections |
[output.html.fold] | level | 1 | Sidebar sections collapsed by default at depth 1 |
The git-repository-url setting automatically adds an “Edit on GitHub” button to the top-right of each page, pointing to the repository specified in $GIT_REPO_URL (defaults to https://github.com/$REPO).
Sources: scripts/build-docs.sh:102-119
SUMMARY.md Generation
The table of contents is automatically generated by analyzing the file structure in /workspace/wiki/. The algorithm at scripts/build-docs.sh:124-186 creates a hierarchical navigation structure.
SUMMARY.md Generation Algorithm
graph TB
subgraph "File_Discovery"
WikiDir["/workspace/wiki/"]
ListFiles["ls *.md\n[scripts/build-docs.sh:135]"]
FilterNumeric["grep -E '^[0-9]'\n[scripts/build-docs.sh:150]"]
NumericSort["sort -t- -k1 -n\n[scripts/build-docs.sh:151]"]
end
subgraph "Overview_Detection"
OverviewFile["Find non-numeric file\n[scripts/build-docs.sh:138]"]
ExtractTitle["head -1 /sed 's/^# //' [scripts/build-docs.sh:140]"]
WriteOverview["Write [Title] filename [scripts/build-docs.sh:141]"]
end
subgraph "Main_Page_Processing"
IteratePages["Iterate main pages [scripts/build-docs.sh:158-185]"]
ExtractPageTitle["Extract '# Title' from file [scripts/build-docs.sh:163]"]
CheckSubsections["Check section-N directory [scripts/build-docs.sh:166-169]"]
end
subgraph "Subsection_Handling"
ListSubsections["ls section-N/*.md [scripts/build-docs.sh:174]"]
SortSubsections["sort -t- -k1 -n [scripts/build-docs.sh:174]"]
ExtractSubTitle["Extract '# Title' from subfile [scripts/build-docs.sh:178]"]
WriteSubsection["Write ' - [Title] section-N/file ' [scripts/build-docs.sh:179]"]
end
subgraph "Output"
SummaryMd["/workspace/book/src/SUMMARY.md"]
end
WikiDir --> ListFiles
ListFiles --> OverviewFile
OverviewFile --> ExtractTitle
ExtractTitle --> WriteOverview
ListFiles --> FilterNumeric
FilterNumeric --> NumericSort
NumericSort --> IteratePages
IteratePages --> ExtractPageTitle
ExtractPageTitle --> CheckSubsections
CheckSubsections -->|Has subsections|ListSubsections
ListSubsections --> SortSubsections
SortSubsections --> ExtractSubTitle
ExtractSubTitle --> WriteSubsection
CheckSubsections -->|No subsections| WriteStandalone["Write '- [Title](file)'"]
WriteOverview --> SummaryMd
WriteSubsection --> SummaryMd
WriteStandalone --> SummaryMd
Sources: scripts/build-docs.sh:124-186
SUMMARY.md Structure
The generated SUMMARY.md follows this format:
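A representative SUMMARY.md matching the generation logic described below (titles and filenames are illustrative):

```markdown
# Summary

[Overview](overview.md)

- [First Section](1-first-section.md)
  - [First Subsection](section-1/1-1-first-subsection.md)
  - [Second Subsection](section-1/1-2-second-subsection.md)
- [Second Section](2-second-section.md)
```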
Generation Logic
- Overview Detection (scripts/build-docs.sh:136-144): finds the first non-numeric markdown file and writes it as the top-level overview link.
- Main Page Sorting (scripts/build-docs.sh:147-155): extracts numeric prefixes (e.g., `1`, `2`, `10`) and sorts them numerically using `sort -t- -k1 -n`, ensuring `10-file.md` comes after `2-file.md`.
- Subsection Detection (scripts/build-docs.sh:166-180): for each main page with numeric prefix N, checks whether a `section-N/` directory exists. If found, lists and sorts the subsection files, writing them indented with two spaces.
- Title Extraction (scripts/build-docs.sh:163-178): reads the first line of each markdown file and removes the `#` prefix using `sed 's/^# //'`.
Sources: scripts/build-docs.sh:124-186
mdbook-mermaid Installation
Before building the book, the mermaid preprocessor assets must be installed. This is performed by `mdbook-mermaid install` at scripts/build-docs.sh:266.
mdbook-mermaid Installation Process
Sources: scripts/build-docs.sh:263-266
The mdbook-mermaid install command:
- downloads the mermaid.js library (version compatible with Mermaid 11)
- creates initialization scripts to configure mermaid rendering
- adds CSS stylesheets for diagram theming
- modifies the HTML templates to include the necessary `<script>` tags
During the subsequent mdbook build, the mermaid preprocessor:
- detects code blocks fenced as mermaid
- wraps them in `<pre class="mermaid">` tags
- leaves the mermaid syntax intact for client-side rendering
Sources: scripts/build-docs.sh:263-271
Build Execution
The actual mdBook build occurs at scripts/build-docs.sh:271 with a simple `mdbook build` command.
mdBook Build Process
Sources: scripts/build-docs.sh:268-271
Build Steps
- Configuration Parsing: mdBook reads `book.toml` to load the title, authors, theme, and preprocessor configuration.
- Chapter Loading: mdBook parses `SUMMARY.md` to determine the navigation structure and file order.
- Markdown Processing: each markdown file is read from the `src/` directory.
- Preprocessor Execution: the `mdbook-mermaid` preprocessor transforms mermaid code blocks into render-ready HTML.
- Theme Application: the `rust` theme provides CSS styling, JavaScript functionality, and HTML templates.
- Search Index: mdBook generates a searchable index from all content, enabling the built-in search feature.
- Output Generation: HTML files are written to the `book/` subdirectory within the project root.
Sources: scripts/build-docs.sh:268-271
Output Structure
After the build completes, the system copies outputs to /output/ at scripts/build-docs.sh:274-295
Output Directory Structure
Sources: scripts/build-docs.sh:274-295
Output Artifacts
| Path | Content | Purpose |
|---|---|---|
| /output/book/ | Complete HTML documentation | Servable website with search, navigation, and rendered diagrams |
| /output/markdown/ | Enhanced markdown files | Source files with injected headers/footers and diagrams |
| /output/raw_markdown/ | Pre-enhancement markdown | Debug artifact showing the initial conversion from HTML |
| /output/book.toml | mdBook configuration | Reference for reproduction or customization |
The book/ directory contains:
- `index.html`: main entry point with navigation sidebar
- `print.html`: single-page version for printing
- `*.html`: individual chapter pages
- `searchindex.json`: search index data
- `searchindex.js`: search functionality
- CSS and JavaScript assets for theming and interactivity
Sources: scripts/build-docs.sh:274-295 README.md:53-58
Local Serving
The generated HTML can be served locally using Python's built-in HTTP server, as documented in README.md:26:
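For example, from the project root (a Python equivalent of the usual `python3 -m http.server` one-liner):

```python
# Same effect as: cd output/book && python3 -m http.server 8000
import functools
import http.server

handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                            directory="output/book")
http.server.HTTPServer(("", 8000), handler).serve_forever()
```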
This serves the documentation at http://localhost:8000, providing:
- Full-text search functionality
- Interactive navigation sidebar
- Rendered Mermaid diagrams (client-side rendering)
- Theme switching (light/dark/rust/coal/navy/ayu)
- Print-friendly single-page view
Sources: README.md:26-29 scripts/build-docs.sh:307-308
Phase 1: Markdown Extraction
Relevant source files
This page documents Phase 1 of the three-phase processing pipeline, which handles the extraction and initial conversion of wiki content from DeepWiki.com into clean Markdown files. Phase 1 produces raw Markdown files in a temporary directory before diagram enhancement (Phase 2, see Phase 2: Diagram Enhancement) and mdBook HTML generation (Phase 3, see mdBook Integration).
For detailed information about specific sub-processes within Phase 1, see Wiki Structure Discovery and HTML to Markdown Conversion.
Scope and Objectives
Phase 1 accomplishes the following:
- Discover all wiki pages and their hierarchical structure from DeepWiki
- Fetch HTML content for each page via HTTP requests
- Parse HTML to extract main content and remove UI elements
- Convert cleaned HTML to Markdown using `html2text`
- Organize output files into a hierarchical directory structure
- Save to a temporary directory for subsequent processing
This phase is implemented entirely in Python within deepwiki-scraper.py and operates independently of Phases 2 and 3.
Sources: README.md:121-128 tools/deepwiki-scraper.py:790-876
Phase 1 Execution Flow
The following diagram shows the complete execution flow of Phase 1, mapping high-level steps to specific functions in the codebase:
Sources: tools/deepwiki-scraper.py:790-876 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594
flowchart TD
Start["main()
Entry Point"]
CreateTemp["Create tempfile.TemporaryDirectory()"]
CreateSession["requests.Session()
with User-Agent"]
DiscoverPhase["Structure Discovery Phase"]
ExtractWiki["extract_wiki_structure(repo, session)"]
ParseLinks["BeautifulSoup: find_all('a', href=pattern)"]
SortPages["sort by page number (handle dots)"]
ExtractionPhase["Content Extraction Phase"]
LoopPages["For each page in pages list"]
FetchContent["extract_page_content(url, session, page_info)"]
FetchHTML["fetch_page(url, session)
with retries"]
ParseHTML["BeautifulSoup(response.text)"]
RemoveNav["Remove nav/header/footer/aside elements"]
FindContent["Find main content: article/main/[role='main']"]
ConvertPhase["Conversion Phase"]
ConvertMD["convert_html_to_markdown(html_content)"]
HTML2Text["html2text.HTML2Text with body_width=0"]
CleanFooter["clean_deepwiki_footer(markdown)"]
FixLinks["Regex replace: wiki links → .md paths"]
SavePhase["File Organization Phase"]
DetermineLevel{"page['level'] == 0?"}
SaveRoot["Save to temp_dir/NUM-title.md"]
CreateSubdir["Create temp_dir/section-N/"]
SaveSubdir["Save to section-N/NUM-title.md"]
NextPage{"More pages?"}
Complete["Phase 1 Complete: temp_dir contains all .md files"]
Start --> CreateTemp
CreateTemp --> CreateSession
CreateSession --> DiscoverPhase
DiscoverPhase --> ExtractWiki
ExtractWiki --> ParseLinks
ParseLinks --> SortPages
SortPages --> ExtractionPhase
ExtractionPhase --> LoopPages
LoopPages --> FetchContent
FetchContent --> FetchHTML
FetchHTML --> ParseHTML
ParseHTML --> RemoveNav
RemoveNav --> FindContent
FindContent --> ConvertPhase
ConvertPhase --> ConvertMD
ConvertMD --> HTML2Text
HTML2Text --> CleanFooter
CleanFooter --> FixLinks
FixLinks --> SavePhase
SavePhase --> DetermineLevel
DetermineLevel -->|Yes: Main Page| SaveRoot
DetermineLevel -->|No: Subsection| CreateSubdir
CreateSubdir --> SaveSubdir
SaveRoot --> NextPage
SaveSubdir --> NextPage
NextPage -->|Yes| LoopPages
NextPage -->|No| Complete
Core Components and Data Flow
Structure Discovery Pipeline
The structure discovery process identifies all wiki pages and builds a hierarchical page list:
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:90-116 tools/deepwiki-scraper.py:118-123
flowchart LR
subgraph Input
BaseURL["Base URL\ndeepwiki.com/owner/repo"]
end
subgraph extract_wiki_structure
FetchMain["fetch_page(base_url)"]
ParseSoup["BeautifulSoup(response.text)"]
FindLinks["soup.find_all('a', href=regex)"]
ExtractInfo["Extract page number & title\nRegex: /(\d+(?:\.\d+)*)-(.+)$"]
CalcLevel["Calculate level from dots\nlevel = page_num.count('.')"]
BuildPages["Build pages list with metadata"]
SortFunc["Sort by sort_key(page)\nparts = [int(x)
for x in num.split('.')]"]
end
subgraph Output
PagesList["List[Dict]\n{'number': '2.1',\n'title': 'Section',\n'url': full_url,\n'href': path,\n'level': 1}"]
end
BaseURL --> FetchMain
FetchMain --> ParseSoup
ParseSoup --> FindLinks
FindLinks --> ExtractInfo
ExtractInfo --> CalcLevel
CalcLevel --> BuildPages
BuildPages --> SortFunc
SortFunc --> PagesList
Content Extraction and Cleaning
Each page undergoes a multi-step cleaning process to remove DeepWiki UI elements:
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-190 tools/deepwiki-scraper.py:127-173
flowchart TD
subgraph fetch_page
MakeRequest["requests.get(url, headers)\nUser-Agent: Mozilla/5.0..."]
RetryLogic["Retry up to 3 times\n2 second delay between attempts"]
CheckStatus["response.raise_for_status()"]
end
subgraph extract_page_content
ParsePage["BeautifulSoup(response.text)"]
RemoveUnwanted["Decompose: nav, header, footer,\naside, .sidebar, script, style"]
FindMain["Try selectors in order:\narticle → main → .wiki-content\n→ [role='main'] → body"]
RemoveUI["Remove DeepWiki UI elements:\n'Edit Wiki', 'Last indexed:',\n'Index your code with Devin'"]
RemoveNavLists["Remove navigation <ul> lists\n(80%+ internal wiki links)"]
end
subgraph convert_html_to_markdown
HTML2TextInit["h = html2text.HTML2Text()\nh.ignore_links = False\nh.body_width = 0"]
HandleContent["markdown = h.handle(html_content)"]
CleanFooterCall["clean_deepwiki_footer(markdown)"]
end
subgraph clean_deepwiki_footer
SplitLines["lines = markdown.split('\\n')"]
ScanBackward["Scan last 50 lines backward\nfor footer patterns"]
MatchPatterns["Regex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
TruncateLines["lines = lines[:footer_start]"]
RemoveEmpty["Remove trailing empty lines"]
end
MakeRequest --> RetryLogic
RetryLogic --> CheckStatus
CheckStatus --> ParsePage
ParsePage --> RemoveUnwanted
RemoveUnwanted --> FindMain
FindMain --> RemoveUI
RemoveUI --> RemoveNavLists
RemoveNavLists --> HTML2TextInit
HTML2TextInit --> HandleContent
HandleContent --> CleanFooterCall
CleanFooterCall --> SplitLines
SplitLines --> ScanBackward
ScanBackward --> MatchPatterns
MatchPatterns --> TruncateLines
TruncateLines --> RemoveEmpty
Link Rewriting Logic
Phase 1 transforms internal DeepWiki links into relative Markdown file paths. The rewriting logic accounts for hierarchical directory structure:
Sources: tools/deepwiki-scraper.py:549-592
flowchart TD
subgraph Input
WikiLink["DeepWiki Link\n[text](/owner/repo/2-1-section)"]
SourcePage["Current Page Info\n{level: 1, number: '2.1'}"]
end
subgraph fix_wiki_link
ExtractPath["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
ParseNumbers["Extract: page_num='2.1', slug='section'"]
ConvertNum["file_num = page_num.replace('.', '-')\nResult: '2-1'"]
CheckTarget{"Is target\nsubsection?\n(has '.')"}
CheckSource{"Is source\nsubsection?\n(level > 0)"}
CheckSame{"Same main\nsection?"}
PathSameSection["Relative path:\nfile_num-slug.md"]
PathDiffSection["Full path:\nsection-N/file_num-slug.md"]
PathToMain["Up one level:\n../file_num-slug.md"]
PathMainToMain["Same level:\nfile_num-slug.md"]
end
subgraph Output
MDLink["Markdown Link\n[text](2-1-section.md)\nor [text](section-2/2-1-section.md)\nor [text](../2-1-section.md)"]
end
WikiLink --> ExtractPath
ExtractPath --> ParseNumbers
ParseNumbers --> ConvertNum
ConvertNum --> CheckTarget
CheckTarget -->|Yes| CheckSource
CheckTarget -->|No: Main Page| CheckSource
CheckSource -->|Target: Sub, Source: Sub| CheckSame
CheckSource -->|Target: Sub, Source: Main| PathDiffSection
CheckSource -->|Target: Main, Source: Sub| PathToMain
CheckSource -->|Target: Main, Source: Main| PathMainToMain
CheckSame -->|Yes| PathSameSection
CheckSame -->|No| PathDiffSection
PathSameSection --> MDLink
PathDiffSection --> MDLink
PathToMain --> MDLink
PathMainToMain --> MDLink
File Organization Strategy
Phase 1 organizes output files into a hierarchical directory structure based on page levels:
Directory Structure Rules
| Page Level | Page Number Format | Directory Location | Filename Pattern | Example |
|---|---|---|---|---|
| 0 (Main) | 1, 2, 3 | temp_dir/ (root) | {num}-{slug}.md | 1-overview.md |
| 1 (Subsection) | 2.1, 3.4 | temp_dir/section-{N}/ | {num}-{slug}.md | section-2/2-1-workspace.md |
File Organization Implementation
Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:845-868
HTTP Session Configuration
Phase 1 uses a persistent requests.Session with browser-like headers and retry logic (see Error Handling and Retry Logic on the deepwiki-scraper.py page).
Session Setup
Retry Strategy
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:817-821
Data Structures
Page Metadata Dictionary
Each page discovered by extract_wiki_structure() is represented as a dictionary:
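A representative page dictionary matching the schema described on the deepwiki-scraper.py page (the values are illustrative):

```python
page = {
    "number": "2.1",                                           # page number string
    "title": "Basic Usage",                                    # extracted link text
    "url": "https://deepwiki.com/owner/repo/2-1-basic-usage",  # full URL
    "href": "/owner/repo/2-1-basic-usage",                     # relative path
    "level": 1,                                                # "2.1".count(".")
}
```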
Sources: tools/deepwiki-scraper.py:109-115
BeautifulSoup Content Selectors
Phase 1 attempts multiple selector strategies to find main content, in priority order:
| Priority | Selector Type | Selector Value | Rationale |
|---|---|---|---|
| 1 | CSS Selector | article | Semantic HTML5 element for main content |
| 2 | CSS Selector | main | HTML5 main landmark element |
| 3 | CSS Selector | .wiki-content | Common class name for wiki content |
| 4 | CSS Selector | .content | Generic content class |
| 5 | CSS Selector | #content | Generic content ID |
| 6 | CSS Selector | .markdown-body | GitHub-style markdown container |
| 7 | Attribute | role="main" | ARIA landmark role |
| 8 | Fallback | body | Last resort: entire body |
Sources: tools/deepwiki-scraper.py:472-484
Error Handling and Robustness
Page Extraction Error Handling
Phase 1 implements graceful degradation for individual page failures.
Sources: tools/deepwiki-scraper.py:841-876
Content Extraction Fallbacks
If primary content selectors fail, Phase 1 applies fallback strategies:
- Content Selector Fallback Chain: try 8 different selectors (see the table above)
- Empty Content Check: raise an exception if no content element is found (tools/deepwiki-scraper.py:486-487)
- HTTP Retry Logic: 3 attempts with a 2-second delay between retries
- Session Persistence: reuse TCP connections for efficiency
Sources: tools/deepwiki-scraper.py:472-487 tools/deepwiki-scraper.py:27-42
Output Format
Temporary Directory Structure
At the end of Phase 1, the temporary directory contains the following structure:
temp_dir/
├── 1-overview.md # Main page (level 0)
├── 2-architecture.md # Main page (level 0)
├── 3-components.md # Main page (level 0)
├── section-2/ # Subsections of page 2
│ ├── 2-1-workspace-and-crates.md # Subsection (level 1)
│ └── 2-2-dependency-graph.md # Subsection (level 1)
└── section-4/ # Subsections of page 4
├── 4-1-logical-planning.md
└── 4-2-physical-planning.md
Markdown File Format
Each generated Markdown file has the following characteristics:
- Title: always starts with a `# {Page Title}` heading
- Content: cleaned HTML converted to Markdown via `html2text`
- Links: internal wiki links rewritten to relative `.md` paths
- No Diagrams: diagrams are added in Phase 2 (see Phase 2: Diagram Enhancement)
- No Footer: DeepWiki UI elements removed via `clean_deepwiki_footer()`
- Encoding: UTF-8
Sources: tools/deepwiki-scraper.py:862-868 tools/deepwiki-scraper.py:127-173
Phase 1 Completion Criteria
Phase 1 is considered complete when:
- All pages discovered by `extract_wiki_structure()` have been processed
- Each page's Markdown file has been written to the temporary directory
- The directory structure (main pages plus `section-N/` subdirectories) has been created
- The success count is reported: `"✓ Successfully extracted N/M pages to temp directory"`
The temporary directory is then passed to Phase 2 for diagram enhancement.
Sources: tools/deepwiki-scraper.py:877 tools/deepwiki-scraper.py:596-788
Wiki Structure Discovery
Relevant source files
Purpose and Scope
This document describes the wiki structure discovery mechanism in Phase 1 of the processing pipeline. The system analyzes the main DeepWiki repository page to identify all available wiki pages and their hierarchical relationships. This discovery phase produces a structured page list that drives subsequent content extraction.
For the HTML-to-Markdown conversion that follows discovery, see HTML to Markdown Conversion. For the overall Phase 1 process, see Phase 1: Markdown Extraction.
Overview
The discovery process fetches the main wiki page for a repository and parses its HTML to extract all wiki page references. The system identifies both main pages (e.g., 1, 2, 3) and subsections (e.g., 2.1, 2.2, 3.1) by analyzing link patterns. The output is a sorted list of page metadata that includes page numbers, titles, URLs, and hierarchical levels.
flowchart TD
Start["main()
entry point"] --> ValidateRepo["Validate repo format\n(owner/repo)"]
ValidateRepo --> CreateSession["Create requests.Session\nwith User-Agent headers"]
CreateSession --> CallExtract["extract_wiki_structure(repo, session)"]
CallExtract --> FetchMain["Fetch https://deepwiki.com/{repo}"]
FetchMain --> ParseHTML["BeautifulSoup(response.text)"]
ParseHTML --> FindLinks["soup.find_all('a', href=regex)"]
FindLinks --> IterateLinks["Iterate over all links"]
IterateLinks --> ExtractPattern["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
ExtractPattern --> BuildPageDict["Build page dict:\n{number, title, url, href, level}"]
BuildPageDict --> CheckDupe{"href in seen_urls?"}
CheckDupe -->|Yes| IterateLinks
CheckDupe -->|No| AddToList["pages.append(page_dict)"]
AddToList --> IterateLinks
IterateLinks -->|Done| SortPages["Sort by numeric parts:\nsort_key([int(x) for x in num.split('.')])"]
SortPages --> ReturnPages["Return pages list"]
ReturnPages --> ProcessPages["Process each page\nin main loop"]
style CallExtract fill:#f9f,stroke:#333,stroke-width:2px
style ExtractPattern fill:#f9f,stroke:#333,stroke-width:2px
style SortPages fill:#f9f,stroke:#333,stroke-width:2px
Discovery Flow Diagram
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831
Main Discovery Function
The extract_wiki_structure function performs the core discovery logic. It accepts a repository identifier (e.g., "jzombie/deepwiki-to-mdbook") and an HTTP session object, then returns a list of page dictionaries.
Function Signature and Entry Point
Sources: tools/deepwiki-scraper.py:78-79
HTTP Request and HTML Parsing
The function constructs the base URL (`https://deepwiki.com/{repo}`) and fetches the main wiki page.
The fetch_page helper includes retry logic (3 attempts) and browser-like headers to handle transient failures.
Sources: tools/deepwiki-scraper.py:80-84 tools/deepwiki-scraper.py:27-42
Link Pattern Matching
Regex-Based Link Discovery
The system uses a compiled regex pattern of the form `^/{repo}/\d+` to find all wiki page links.
This pattern matches URLs like:
- /jzombie/deepwiki-to-mdbook/1-overview
- /jzombie/deepwiki-to-mdbook/2-quick-start
- /jzombie/deepwiki-to-mdbook/2-1-basic-usage
Sources: tools/deepwiki-scraper.py:88-90
Page Information Extraction
For each matched link, the system extracts page metadata using a more detailed regex pattern. The regex `r'/(\d+(?:\.\d+)*)-(.+)$'` captures:
- Group 1: page number with optional dots (e.g., `1`, `2.1`, `3.2.1`)
- Group 2: URL slug (e.g., `overview`, `basic-usage`)
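A sketch tying the patterns together, including the deduplication and numeric sorting described below (the scaffolding is illustrative):

```python
import re

hrefs = [
    "/jzombie/deepwiki-to-mdbook/3-phase-3",
    "/jzombie/deepwiki-to-mdbook/1-overview",
    "/jzombie/deepwiki-to-mdbook/2-quick-start",
    "/jzombie/deepwiki-to-mdbook/2-quick-start",  # duplicate link
]

PAGE_INFO = re.compile(r"/(\d+(?:\.\d+)*)-(.+)$")

pages, seen_urls = [], set()
for href in hrefs:
    match = PAGE_INFO.search(href)
    if not match or href in seen_urls:
        continue                       # skip non-page links and duplicates
    seen_urls.add(href)
    number, slug = match.groups()      # subsection numbers carry dots, e.g. "2.1"
    pages.append({"number": number, "href": href, "level": number.count(".")})

pages.sort(key=lambda p: [int(x) for x in p["number"].split(".")])
print([p["number"] for p in pages])    # ['1', '2', '3']
```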
Sources: tools/deepwiki-scraper.py:98-107
Link Extraction Data Flow
Sources: tools/deepwiki-scraper.py:98-115
Deduplication and Sorting
Deduplication Strategy
The system maintains a seen_urls set to prevent duplicate page entries (see the sketch above).
Sources: tools/deepwiki-scraper.py:92-116
Hierarchical Sorting
Pages are sorted by their numeric components to maintain proper ordering.
This ensures ordering like: 1 → 2 → 2.1 → 2.2 → 3 → 3.1
Sources: tools/deepwiki-scraper.py:118-123
Sorting Example
| Before Sorting (Link Order) | Page Number | After Sorting (Numeric Order) |
|---|---|---|
| /3-phase-3 | 3 | /1-overview |
| /2-1-subsection-one | 2.1 | /2-quick-start |
| /1-overview | 1 | /2-1-subsection-one |
| /2-quick-start | 2 | /2-2-subsection-two |
| /2-2-subsection-two | 2.2 | /3-phase-3 |
Page Data Structure
Page Dictionary Schema
Each discovered page is represented as a dictionary (see the Page Metadata Dictionary example in Phase 1: Markdown Extraction).
Sources: tools/deepwiki-scraper.py:109-115
Level Calculation
The level field indicates hierarchical depth:
| Page Number | Level | Type |
|---|---|---|
| 1 | 0 | Main page |
| 2 | 0 | Main page |
| 2.1 | 1 | Subsection |
| 2.2 | 1 | Subsection |
| 3.1.1 | 2 | Sub-subsection |
Sources: tools/deepwiki-scraper.py:106-114
Discovery Result Processing
Output Statistics
After discovery, the system categorizes pages and reports statistics.
Sources: tools/deepwiki-scraper.py:824-837
Integration with Content Extraction
The discovered page list drives the extraction loop in main().
Sources: tools/deepwiki-scraper.py:841-860
Alternative Discovery Method (Unused)
Subsection Probing Function
The codebase includes a discover_subsections function that uses HTTP HEAD requests to probe for subsections, but this function is not invoked in the current implementation.
This function attempts to discover subsections by making HEAD requests to potential URLs (e.g., /repo/2-1-, /repo/2-2-). However, the actual implementation relies entirely on parsing links from the main wiki page.
Sources: tools/deepwiki-scraper.py:44-76
Discovery Method Comparison
Sources: tools/deepwiki-scraper.py:44-76 tools/deepwiki-scraper.py:78-125
Error Handling
No Pages Found
The system validates that at least one page was discovered and exits with an error otherwise.
Sources: tools/deepwiki-scraper.py:828-830
Network Failures
The fetch_page function includes retry logic (3 attempts with a 2-second delay between attempts).
Sources: tools/deepwiki-scraper.py:33-42
Summary
The wiki structure discovery process provides a robust mechanism for identifying all pages in a DeepWiki repository through a single HTML parse operation. The resulting page list is hierarchically organized and drives all subsequent extraction operations in Phase 1.
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831
HTML to Markdown Conversion
Relevant source files
This page documents the HTML to Markdown conversion process in Phase 1 of the pipeline. After the wiki structure is discovered (see Wiki Structure Discovery), each page’s HTML content is fetched, cleaned, and converted to Markdown format. This conversion prepares the content for diagram enhancement in Phase 2 (see Phase 2: Diagram Enhancement).
Conversion Pipeline
The HTML to Markdown conversion follows a multi-step pipeline that progressively cleans and transforms the content. The process is orchestrated by the extract_page_content function and involves HTML parsing, element removal, conversion, and post-processing.
Conversion Pipeline Flow
graph TB
fetch["fetch_page()\n[65-80]"]
parse["BeautifulSoup()\nHTML Parser"]
remove1["Remove Navigation\nElements [761-762]"]
remove2["Find Main Content\nArea [765-782]"]
remove3["Remove DeepWiki UI\nElements [786-806]"]
convert["convert_html_to_markdown()\n[213-228]"]
clean["clean_deepwiki_footer()\n[165-211]"]
format["format_source_references()\n[397-406]"]
links["Fix Internal Links\n[854-875]"]
output["Cleaned Markdown\nOutput"]
fetch --> parse
parse --> remove1
remove1 --> remove2
remove2 --> remove3
remove3 --> convert
convert --> clean
clean --> format
format --> links
links --> output
Sources: python/deepwiki-scraper.py:751-877
HTML Parsing and Content Extraction
The conversion begins by fetching the HTML page using a requests.Session with browser-like headers to avoid bot detection. BeautifulSoup parses the HTML into a navigable tree structure.
Content Area Detection
The system uses a cascading selector strategy to locate the main content area, trying multiple selectors in order of preference:
| Priority | Selector Type | Example |
|---|---|---|
| 1 | Semantic HTML5 tags | article, main |
| 2 | Class-based selectors | .wiki-content, .content, .markdown-body |
| 3 | ID-based selectors | #content |
| 4 | ARIA role attributes | role="main" |
| 5 | Fallback | body tag |
Content Detection Logic
Sources: python/deepwiki-scraper.py:765-782
DeepWiki UI Element Removal
Before conversion, the system removes DeepWiki-specific navigation and UI elements that would pollute the final documentation. This occurs in two stages: pre-processing element removal and footer cleanup.
Pre-Processing Element Removal
The first stage removes structural elements and UI components using CSS selectors (nav, header, footer, aside, .sidebar, plus script and style tags).
The second stage removes DeepWiki-specific text-based UI elements by scanning for characteristic strings:
| UI Element | Detection String | Max Length |
|---|---|---|
| Code indexing prompt | “Index your code with Devin” | 200 chars |
| Edit controls | “Edit Wiki” | 200 chars |
| Indexing status | “Last indexed:” | 200 chars |
| Search links | “View this search on DeepWiki” | 200 chars |
Sources: python/deepwiki-scraper.py:761-762 python/deepwiki-scraper.py:786-795
Navigation List Removal
DeepWiki pages include navigation lists that link to all wiki pages. The system detects and removes these by identifying unordered lists (<ul>) with the following characteristics:
- Contains more than 5 links
- At least 80% of the links are internal (start with `/`)
graph LR
ul["Find all ul\nElements"]
count["Count Links\nin List"]
check1{"More than\n5 links?"}
check2{"80%+ are\ninternal?"}
remove["Remove ul\nElement"]
keep["Keep ul\nElement"]
ul --> count
count --> check1
check1 -->|Yes| check2
check1 -->|No| keep
check2 -->|Yes| remove
check2 -->|No| keep
Navigation List Detection
Sources: python/deepwiki-scraper.py:799-806
html2text Conversion
After cleaning the HTML, the system uses the html2text library to convert HTML to Markdown. The conversion is configured with specific settings to preserve link structure and prevent line wrapping.
html2text Configuration
The body_width = 0 setting is critical because it prevents the converter from introducing artificial line breaks that would break code blocks and formatted content.
Important: Mermaid diagram extraction is explicitly disabled at this stage. DeepWiki’s Next.js payload contains diagrams from ALL pages mixed together, making per-page extraction unreliable. Diagrams are handled separately in Phase 2 using fuzzy matching (see Fuzzy Matching Algorithm).
Sources: python/deepwiki-scraper.py:213-228
graph TB
scan["Scan Last 50 Lines\nBackwards"]
patterns["Check Against\nFooter Patterns"]
found{"Pattern\nMatch?"}
backward["Scan Backward\n20 More Lines"]
content{"Hit Real\nContent?"}
cut["Cut Lines from\nFooter Start"]
trim["Trim Trailing\nEmpty Lines"]
scan --> patterns
patterns --> found
found -->|Yes| backward
found -->|No| trim
backward --> content
content -->|Yes| cut
content -->|No| backward
cut --> trim
Footer Cleanup
The clean_deepwiki_footer function removes DeepWiki’s footer UI elements that appear at the end of each page. It uses regex patterns to detect footer markers and removes everything from that point onward.
Footer Detection Patterns
The footer patterns are compiled regex expressions:
| Pattern | Purpose | Example Match |
|---|---|---|
| `^\s*Dismiss\s*$` | Close button | "Dismiss" |
| `Refresh this wiki` | Refresh controls | "Refresh this wiki" |
| `This wiki was recently refreshed` | Status message | Various timestamps |
| `###\s*On this page` | Page navigation | "### On this page" |
| `Please wait \d+ days?` | Rate limiting | "Please wait 7 days" |
| `View this search on DeepWiki` | Search link | Exact match |
| `^\s*Edit Wiki\s*$` | Edit button | "Edit Wiki" |
Sources: python/deepwiki-scraper.py:165-211
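A simplified sketch of the footer cleanup using the compiled patterns from the table above (the 20-line backward scan for real content is omitted here):

```python
import re

FOOTER_PATTERNS = [
    re.compile(r"^\s*Dismiss\s*$"),
    re.compile(r"Refresh this wiki"),
    re.compile(r"This wiki was recently refreshed"),
    re.compile(r"###\s*On this page"),
    re.compile(r"Please wait \d+ days?"),
    re.compile(r"View this search on DeepWiki"),
    re.compile(r"^\s*Edit Wiki\s*$"),
]

def clean_deepwiki_footer(markdown):
    lines = markdown.split("\n")
    # Only the last 50 lines are considered footer territory.
    for i in range(len(lines) - 1, max(0, len(lines) - 50) - 1, -1):
        if any(p.search(lines[i]) for p in FOOTER_PATTERNS):
            lines = lines[:i]  # cut from the matched marker onward
    while lines and not lines[-1].strip():
        lines.pop()            # trim trailing empty lines
    return "\n".join(lines)
```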
Post-Processing Steps
After initial conversion, two post-processing steps refine the Markdown output: source reference formatting and internal link rewriting.
Source Reference Formatting
The format_source_references function inserts colons between filenames and line numbers in source code references. This transforms patterns like `[path/to/file10-20]` into `[path/to/file:10-20]`.
Pattern Matching:
- Regex: `\[([A-Za-z0-9._/-]+?)(\d+-\d+)\]`
- Capture Group 1: Filename path
- Capture Group 2: Line number range
- Output: `[filename:linerange]`
Sources: python/deepwiki-scraper.py:395-406
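A minimal sketch using the documented regex; because the character class excludes `:`, already-formatted references are left untouched:

```python
import re

def format_source_references(markdown):
    # [src/scraper.py10-20] -> [src/scraper.py:10-20]
    return re.sub(r"\[([A-Za-z0-9._/-]+?)(\d+-\d+)\]", r"[\1:\2]", markdown)
```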
graph TB
find["Find Link Pattern:\n/owner/repo/page"]
extract["Extract page_num\nand slug"]
normalize["normalized_number_parts()\n[28-43]"]
build["build_target_path()\n[55-63]"]
relative{"Source in\nSubsection?"}
same{"Target in Same\nSection?"}
diff["Prefix: ../\nDifferent section"]
none["Prefix: ../\nTop level"]
local["No prefix\nSame section"]
find --> extract
extract --> normalize
normalize --> build
build --> relative
relative -->|Yes| same
relative -->|No| done["Return\nRelative Path"]
same -->|Yes| local
same -->|No| diff
diff --> done
local --> done
none --> done
Internal Link Rewriting
DeepWiki uses absolute URLs for internal wiki links (e.g., /owner/repo/4-query-planning). The system rewrites these to relative Markdown file paths using the build_target_path function.
Link Rewriting Process
Path Resolution Examples:
| Source File | Target Link | Resolved Path |
|---|---|---|
| `1-overview.md` | `/repo/2-architecture` | `2-architecture.md` |
| `section-2/2-1-pipeline.md` | `/repo/2-2-build` | `2-2-build.md` |
| `section-2/2-1-pipeline.md` | `/repo/3-config` | `../3-config.md` |
| `1-overview.md` | `/repo/2-1-subsection` | `section-2/2-1-subsection.md` |
Sources: python/deepwiki-scraper.py:854-875 python/deepwiki-scraper.py:55-63 python/deepwiki-scraper.py:28-43
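A sketch of the relative-path resolution implied by the table, assuming the page number and slug are already split apart (the real build_target_path and normalized_number_parts differ in detail):

```python
def build_relative_link(source_path, page_num, slug):
    """Resolve /owner/repo/<page_num>-<slug> to a relative .md path (sketch)."""
    target_name = f"{page_num}-{slug}.md"
    # Subsection pages (e.g. "2-1") live under section-N/ directories.
    parts = page_num.split("-")
    target_dir = f"section-{parts[0]}" if len(parts) > 1 else ""
    source_dir = source_path.rsplit("/", 1)[0] if "/" in source_path else ""
    if source_dir == target_dir:
        return target_name                        # same directory
    if source_dir and not target_dir:
        return f"../{target_name}"                # subsection -> top level
    if source_dir and target_dir:
        return f"../{target_dir}/{target_name}"   # subsection -> other section
    return f"{target_dir}/{target_name}"          # top level -> subsection
```

Checked against the table: `build_relative_link("section-2/2-1-pipeline.md", "3", "config")` returns `../3-config.md`.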
Duplicate Content Removal
The final cleanup step removes duplicate titles and stray “Menu” text that may appear in the converted Markdown. The system tracks whether a title has been seen and skips subsequent occurrences if they match the first title exactly.
Cleanup Rules:
- Skip standalone “Menu” lines
- Keep the first `# Title` occurrence
- Skip duplicate titles that match the first title
- Preserve all other content
Sources: python/deepwiki-scraper.py:820-841
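A sketch of these cleanup rules:

```python
def remove_duplicate_titles(markdown):
    first_title = None
    kept = []
    for line in markdown.split("\n"):
        if line.strip() == "Menu":
            continue                        # stray UI text
        if line.startswith("# "):
            if first_title is None:
                first_title = line.strip()  # keep the first title
            elif line.strip() == first_title:
                continue                    # skip exact duplicates
        kept.append(line)
    return "\n".join(kept)
```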
Output Format
The final output is clean Markdown with the following characteristics:
- Title guaranteed to be present (added if missing)
- No DeepWiki UI elements
- No artificial line wrapping
- Relative internal links
- Formatted source references
- Stripped trailing whitespace
The output is written to temporary storage before diagram enhancement in Phase 2. A snapshot of this raw Markdown (without diagrams) is saved to raw_markdown/ for debugging purposes.
Sources: python/deepwiki-scraper.py:1357-1366
Phase 2: Diagram Enhancement
Relevant source files
Purpose and Scope
Phase 2 performs intelligent diagram extraction and placement after Phase 1 has generated clean markdown files. This phase extracts Mermaid diagrams from DeepWiki’s JavaScript payload, matches them to appropriate locations in the markdown content using fuzzy text matching, and inserts them contextually after relevant paragraphs.
For information about the initial markdown extraction that precedes this phase, see Phase 1: Markdown Extraction. For details on the specific fuzzy matching algorithm implementation, see Fuzzy Diagram Matching Algorithm. For information about the extraction patterns used, see Diagram Extraction from Next.js.
Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789
The Client-Side Rendering Problem
DeepWiki renders diagrams client-side using JavaScript, making them invisible to traditional HTML scraping. All Mermaid diagrams are embedded in a JavaScript payload (self.__next_f.push) that contains diagram code for all pages in the wiki, not just the current page. This creates a matching problem: given ~461 diagrams in a single payload and individual markdown files, how do we determine which diagrams belong in which files?
Key challenges:
- Diagrams are escaped JavaScript strings (`\n`, `\t`, `\"`)
- No metadata associates diagrams with specific pages
- html2text conversion changes text formatting from the original JavaScript context
- Must avoid false positives (placing diagrams in wrong locations)
Sources: tools/deepwiki-scraper.py:458-461 README.md:131-136
Architecture Overview
Diagram: Phase 2 Processing Pipeline
Sources: tools/deepwiki-scraper.py:596-789
Diagram Extraction Process
The extraction process reads the JavaScript payload from any DeepWiki page and locates all Mermaid diagram blocks using regex pattern matching.
flowchart TD
Start["extract_and_enhance_diagrams()"]
FetchURL["Fetch https://deepwiki.com/repo/1-overview"]
subgraph "Pattern Matching"
Pattern1["Pattern: r'```mermaid\\\\\n(.*?)```'\n(re.DOTALL)"]
Pattern2["Pattern: r'([^`]{500,}?)```mermaid\\\\ (.*?)```'\n(with context)"]
FindAll["re.findall() → all_diagrams list"]
FindIter["re.finditer() → diagram_contexts with context"]
end
subgraph "Unescaping"
ReplaceNewline["Replace '\\\\\n' → newline"]
ReplaceTab["Replace '\\\\ ' → tab"]
ReplaceQuote["Replace '\\\\\"' → double-quote"]
ReplaceUnicode["Replace Unicode escapes:\n\\\< → '<'\n\\\> → '>'\n\\\& → '&'"]
end
subgraph "Context Processing"
Last500["Extract last 500 chars of context"]
FindHeading["Scan for last heading starting with #"]
ExtractAnchor["Extract last 2-3 non-heading lines\n(min 20 chars each)"]
BuildDict["Build dict: {last_heading, anchor_text, diagram}"]
end
Start --> FetchURL
FetchURL --> Pattern1
FetchURL --> Pattern2
Pattern1 --> FindAll
Pattern2 --> FindIter
FindAll --> ReplaceNewline
FindIter --> ReplaceNewline
ReplaceNewline --> ReplaceTab
ReplaceTab --> ReplaceQuote
ReplaceQuote --> ReplaceUnicode
ReplaceUnicode --> Last500
Last500 --> FindHeading
FindHeading --> ExtractAnchor
ExtractAnchor --> BuildDict
BuildDict --> Output["Returns:\n- all_diagrams count\n- diagram_contexts list"]
Extraction Function Flow
Diagram: Diagram Extraction and Context Building
Sources: tools/deepwiki-scraper.py:604-674
Key Implementation Details
| Component | Implementation | Location |
|---|---|---|
| Regex Pattern | `r'```mermaid\\n(.*?)```'` with `re.DOTALL` flag | tools/deepwiki-scraper.py:615 |
| Context Pattern | `r'([^`]{500,}?)```mermaid\\n(.*?)```'` captures 500+ chars of context | tools/deepwiki-scraper.py:621 |
| Unescape Operations | replace('\\n', '\n'), replace('\\t', '\t'), etc. | tools/deepwiki-scraper.py:628-635 tools/deepwiki-scraper.py:639-645 |
| Heading Detection | line.startswith('#') on reversed context lines | tools/deepwiki-scraper.py:652-656 |
| Anchor Extraction | Last 2-3 lines with len(line) > 20, max 300 chars | tools/deepwiki-scraper.py:658-666 |
| Context Storage | Dict with keys: last_heading, anchor_text, diagram | tools/deepwiki-scraper.py:668-672 |
Sources: tools/deepwiki-scraper.py:614-674
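A condensed sketch of the extraction and unescaping steps from the table (the context-capturing variant is omitted):

```python
import re

MERMAID_PATTERN = re.compile(r"```mermaid\\n(.*?)```", re.DOTALL)

def extract_mermaid_diagrams(payload):
    diagrams = []
    for raw in MERMAID_PATTERN.findall(payload):
        text = (raw.replace("\\n", "\n")
                   .replace("\\t", "\t")
                   .replace('\\"', '"')
                   .replace("\\u003c", "<")
                   .replace("\\u003e", ">")
                   .replace("\\u0026", "&"))
        if len(text) > 10:  # documented quality threshold
            diagrams.append(text.strip())
    return diagrams
```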
Fuzzy Matching Algorithm
The fuzzy matching algorithm determines where each diagram should be inserted by finding the best match between the diagram’s context and the markdown file’s content.
flowchart TD
Start["For each diagram_contexts[idx]"]
CheckUsed["idx in diagrams_used?"]
Skip["Skip to next diagram"]
subgraph "Text Normalization"
NormFile["Normalize file content:\ncontent.lower()\n' '.join(content.split())"]
NormAnchor["Normalize anchor_text:\nanchor.lower()\n' '.join(anchor.split())"]
NormHeading["Normalize heading:\nheading.lower().replace('#', '').strip()"]
end
subgraph "Progressive Chunk Matching"
Try300["Try chunk_size=300"]
Try200["Try chunk_size=200"]
Try150["Try chunk_size=150"]
Try100["Try chunk_size=100"]
Try80["Try chunk_size=80"]
ExtractChunk["test_chunk = anchor_normalized[-chunk_size:]"]
FindPos["pos = content_normalized.find(test_chunk)"]
CheckPos["pos != -1?"]
ConvertLine["Convert char position to line number"]
RecordMatch["Record best_match_line, best_match_score"]
end
subgraph "Heading Fallback"
IterLines["For each line in markdown"]
CheckHeadingLine["line.strip().startswith('#')?"]
NormalizeLine["Normalize line heading"]
CheckContains["heading_normalized in line_normalized?"]
RecordHeadingMatch["best_match_line = line_num\nbest_match_score = 50"]
end
Start --> CheckUsed
CheckUsed -->|Yes| Skip
CheckUsed -->|No| NormFile
NormFile --> NormAnchor
NormAnchor --> Try300
Try300 --> ExtractChunk
ExtractChunk --> FindPos
FindPos --> CheckPos
CheckPos -->|Found| ConvertLine
CheckPos -->|Not found| Try200
ConvertLine --> RecordMatch
Try200 --> Try150
Try150 --> Try100
Try100 --> Try80
Try80 -->|All failed| IterLines
RecordMatch --> Success["Return match with score"]
IterLines --> CheckHeadingLine
CheckHeadingLine -->|Yes| NormalizeLine
NormalizeLine --> CheckContains
CheckContains -->|Yes| RecordHeadingMatch
RecordHeadingMatch --> Success
Matching Strategy
Diagram: Progressive Chunk Matching with Fallback
Sources: tools/deepwiki-scraper.py:708-746
Chunk Size Progression
The algorithm tries progressively smaller chunk sizes to accommodate variations in text formatting between the JavaScript context and the html2text-converted markdown:
| Chunk Size | Use Case | Success Rate |
|---|---|---|
| 300 chars | Perfect or near-perfect matches | Highest precision |
| 200 chars | Minor formatting differences | Good precision |
| 150 chars | Moderate text variations | Acceptable precision |
| 100 chars | Significant reformatting | Lower precision |
| 80 chars | Minimal context available | Lowest precision |
| Heading match | Fallback when text matching fails | Score: 50 |
The algorithm stops at the first successful match, prioritizing larger chunks for higher confidence.
Sources: tools/deepwiki-scraper.py:716-730 README.md:134
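A sketch of the progressive search, assuming both strings were normalized identically (lowercased, whitespace collapsed); the caller converts the returned character position back to a line number:

```python
def find_best_match(anchor_norm, content_norm):
    """Return (char_pos, score) for the largest matching chunk, or (-1, 0)."""
    for chunk_size in (300, 200, 150, 100, 80):
        test_chunk = anchor_norm[-chunk_size:]  # tail of the anchor text
        pos = content_norm.find(test_chunk)
        if pos != -1:
            return pos, chunk_size              # stop at the first (largest) hit
    return -1, 0
```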
flowchart TD
Start["Found best_match_line"]
CheckType["lines[best_match_line].strip().startswith('#')?"]
subgraph "Heading Case"
H1["insert_line = best_match_line + 1"]
H2["Skip blank lines after heading"]
H3["Skip through paragraph content"]
H4["Stop at next blank line or heading"]
end
subgraph "Paragraph Case"
P1["insert_line = best_match_line + 1"]
P2["Find end of current paragraph"]
P3["Stop at next blank line or heading"]
end
subgraph "Insertion Format"
I1["Insert: empty line"]
I2["Insert: ```mermaid"]
I3["Insert: diagram code"]
I4["Insert: ```"]
I5["Insert: empty line"]
end
Start --> CheckType
CheckType -->|Heading| H1
CheckType -->|Paragraph| P1
H1 --> H2
H2 --> H3
H3 --> H4
P1 --> P2
P2 --> P3
H4 --> I1
P3 --> I1
I1 --> I2
I2 --> I3
I3 --> I4
I4 --> I5
I5 --> Append["Append to pending_insertions list:\n(insert_line, diagram, score, idx)"]
Insertion Point Logic
After finding a match, the system determines the precise line number where the diagram should be inserted.
Insertion Algorithm
Diagram: Insertion Point Calculation
Sources: tools/deepwiki-scraper.py:747-768
graph LR
Collect["Collect all\npending_insertions"]
Sort["Sort by insert_line\n(descending)"]
Insert["Insert from bottom to top\npreserves line numbers"]
Write["Write enhanced file\nto temp_dir"]
Collect --> Sort
Sort --> Insert
Insert --> Write
Batch Insertion Strategy
Diagrams are inserted in descending line order to avoid invalidating insertion points:
Diagram: Batch Insertion Order
Implementation:
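A minimal sketch, assuming `pending_insertions` holds `(insert_line, diagram, score, idx)` tuples and `lines` is the file content split on newlines:

```python
# Sort descending so earlier insertions don't shift later line numbers.
pending_insertions.sort(key=lambda entry: entry[0], reverse=True)
for insert_line, diagram, score, idx in pending_insertions:
    block = ["", "```mermaid", *diagram.split("\n"), "```", ""]
    lines[insert_line:insert_line] = block  # splice the fenced block in place
enhanced_content = "\n".join(lines)
```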
Sources: tools/deepwiki-scraper.py:771-783
sequenceDiagram
participant Main as extract_and_enhance_diagrams()
participant Glob as temp_dir.glob('**/*.md')
participant File as Individual .md file
participant Matcher as Fuzzy Matcher
participant Writer as File Writer
Main->>Main: Extract all diagram_contexts
Main->>Glob: Find all markdown files
loop For each md_file
Glob->>File: Open and read content
File->>File: Check if '```mermaid' already present
alt Already has diagrams
File->>Glob: Skip (continue)
else No diagrams
File->>Matcher: Normalize content
loop For each diagram_context
Matcher->>Matcher: Try progressive chunk matching
Matcher->>Matcher: Try heading fallback
Matcher->>Matcher: Record best match
end
Matcher->>File: Return pending_insertions list
File->>File: Sort insertions (descending)
File->>File: Insert diagrams bottom-up
File->>Writer: Write enhanced content
Writer->>Main: Increment enhanced_count
end
end
Main->>Main: Print summary
File Processing Workflow
Phase 2 operates on files in the temporary directory created by Phase 1, enhancing them in-place before they are moved to the final output directory.
Processing Loop
Diagram: File Processing Sequence
Sources: tools/deepwiki-scraper.py:676-788
Performance Characteristics
Extraction Statistics
From a typical wiki with ~10 pages:
| Metric | Value | Location |
|---|---|---|
| Total diagrams in JS payload | ~461 | README.md132 |
| Diagrams with context (500+ chars) | ~48 | README.md133 |
| Context window size | 500 characters | tools/deepwiki-scraper.py621 |
| Anchor text max length | 300 characters | tools/deepwiki-scraper.py666 |
| Typical enhanced files | Varies by content | Printed in output |
Sources: README.md:132-133 tools/deepwiki-scraper.py674 tools/deepwiki-scraper.py788
Matching Performance
The progressive chunk size strategy balances precision and recall:
- High precision matches (300-200 chars) : Strong contextual alignment
- Medium precision matches (150-100 chars) : Acceptable with some risk
- Low precision matches (80 chars) : Risk of false positives
- Heading-only matches (score: 50) : Last resort fallback
The algorithm prefers to skip a diagram rather than place it incorrectly, prioritizing documentation quality over diagram count.
Sources: tools/deepwiki-scraper.py:716-745
Integration with Phases 1 and 3
Input Requirements (from Phase 1)
- Clean markdown files in `temp_dir`
- Files must not already contain `` ```mermaid `` blocks
- Proper heading structure for fallback matching
- Normalized link structure
Sources: tools/deepwiki-scraper.py:810-877
Output Guarantees (for Phase 3)
- Enhanced markdown files in `temp_dir`
- Diagrams inserted with proper `` ```mermaid `` fencing
- Blank lines before and after diagrams for proper rendering
- Original file structure preserved (section-N directories maintained)
- Atomic file operations (write complete file or skip)
Sources: tools/deepwiki-scraper.py:781-786 tools/deepwiki-scraper.py:883-908
Workflow Integration
Diagram: Three-Phase Integration
Sources: README.md:123-145 tools/deepwiki-scraper.py:810-916
Error Handling and Edge Cases
Skipped Files
Files are skipped if they already contain Mermaid diagrams, to avoid duplicate insertion.
Sources: tools/deepwiki-scraper.py:686-687
Failed Matches
When a diagram cannot be matched:
- The diagram is not inserted (conservative approach)
- No error is raised (continues processing other diagrams)
- File is left unmodified if no diagrams match
Sources: tools/deepwiki-scraper.py:699-746
Network Errors
If diagram extraction fails (network error, changed HTML structure):
- Warning is printed but Phase 2 continues
- Phase 1 files remain valid
- System can still proceed to Phase 3 without diagrams
Sources: tools/deepwiki-scraper.py:610-612
Diagram Quality Thresholds
| Threshold | Purpose |
|---|---|
| `len(diagram) > 10` | Filter out trivial/invalid diagram code |
| `len(anchor) > 50` | Ensure sufficient context for matching |
| `len(line) > 20` | Filter out short lines from anchor text |
| `chunk_size >= 80` | Minimum viable match size |
Sources: tools/deepwiki-scraper.py:648 tools/deepwiki-scraper.py:712 tools/deepwiki-scraper.py:661
Summary
Phase 2 implements a sophisticated fuzzy matching system that:
- Extracts all Mermaid diagrams from DeepWiki’s JavaScript payload using regex patterns
- Processes diagram context to extract heading and anchor text metadata
- Matches diagrams to markdown files using progressive chunk size comparison (300→80 chars)
- Inserts diagrams after relevant paragraphs with proper formatting
- Validates through conservative matching to avoid false positives
The phase operates entirely on files in the temporary directory, leaving Phase 1’s output intact while preparing enhanced files for Phase 3’s mdBook build process.
Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789
Mermaid Normalization
Relevant source files
The Mermaid normalization pipeline transforms diagrams extracted from DeepWiki’s JavaScript payload into syntax that is compatible with Mermaid 11. DeepWiki’s diagrams often contain formatting issues, legacy syntax, and multiline constructs that newer Mermaid parsers reject. This seven-step normalization process ensures that all diagrams render correctly in mdBook’s Mermaid renderer.
For information about how diagrams are extracted from the JavaScript payload, see Phase 2: Diagram Enhancement. For information about the fuzzy matching algorithm that places diagrams in the correct markdown files, see Fuzzy Matching Algorithm.
Purpose and Scope
This page documents the seven normalization functions that transform raw Mermaid diagram code into Mermaid 11-compatible syntax. Each normalization step addresses a specific category of syntax errors or incompatibilities. The pipeline is applied to every diagram before it is injected into markdown files.
The normalization pipeline handles:
- Multiline edge labels that span multiple lines
- State diagram description syntax variations
- Flowchart node labels containing reserved characters
- Missing statement separators between consecutive nodes
- Empty node labels that lack fallback text
- Gantt chart tasks missing required task identifiers
- Additional edge case transformations (quote stripping, label merging)
Normalization Pipeline Architecture
The normalization pipeline is orchestrated by the normalize_mermaid_diagram function, which applies seven normalization passes in sequence. Each pass is idempotent and focuses on a specific syntax issue.
Pipeline Flow Diagram
graph TD
Input["Raw Diagram Text\nfrom Next.js Payload"]
Step1["normalize_mermaid_edge_labels()\nFlatten multiline edge labels"]
Step2["normalize_mermaid_state_descriptions()\nFix state syntax"]
Step3["normalize_flowchart_nodes()\nClean node labels"]
Step4["normalize_statement_separators()\nInsert newlines"]
Step5["normalize_empty_node_labels()\nAdd fallback labels"]
Step6["normalize_gantt_diagram()\nAdd synthetic task IDs"]
Output["Normalized Diagram\nMermaid 11 Compatible"]
Input --> Step1
Step1 --> Step2
Step2 --> Step3
Step3 --> Step4
Step4 --> Step5
Step5 --> Step6
Step6 --> Output
Sources: python/deepwiki-scraper.py:385-393 python/deepwiki-scraper.py:230-383
Function Name to Normalization Step Mapping
Sources: python/deepwiki-scraper.py:385-393
Step 1: Edge Label Normalization
The normalize_mermaid_edge_labels function collapses multiline edge labels into single-line labels with escaped newline sequences. Mermaid 11 rejects edge labels that span multiple physical lines.
Function : normalize_mermaid_edge_labels(diagram_text: str) -> str
Pattern Matched : Edge labels enclosed in pipes: |....|
Transformations Applied :
- Replace literal newline characters with spaces
- Replace escaped `\n` sequences with spaces
- Remove parentheses from labels (invalid syntax)
- Collapse multiple spaces into single spaces
| Before | After |
|---|---|
| `A -->\|"Label\nLine 2"\|` | `A -->\|"Label Line 2"\|` |
| `C -->\|Text (note)\|` | `C -->\|Text note\|` |
| `E -->\|First\nSecond\nThird\|` | `E -->\|First Second Third\|` |
Implementation Details :
- Only processes diagrams starting with the `graph` or `flowchart` keywords
- Uses the regex pattern `\|([^|]*)\|` to match edge labels
- Checks for the presence of `\n`, `(`, or `)` before applying cleanup
- Preserves labels that are already properly formatted
Sources: python/deepwiki-scraper.py:230-251
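A sketch following the implementation details above:

```python
import re

def normalize_mermaid_edge_labels(diagram_text):
    first_line = diagram_text.strip().split("\n", 1)[0].lower()
    if not first_line.startswith(("graph", "flowchart")):
        return diagram_text

    def clean(match):
        label = match.group(1)
        if "\n" in label or "\\n" in label or "(" in label or ")" in label:
            label = label.replace("\\n", " ").replace("\n", " ")
            label = label.replace("(", "").replace(")", "")
            label = re.sub(r"\s+", " ", label).strip()
        return f"|{label}|"

    return re.sub(r"\|([^|]*)\|", clean, diagram_text)
```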
Step 2: State Description Normalization
The normalize_mermaid_state_descriptions function ensures state diagram descriptions follow the strict State : Description syntax required by Mermaid 11.
Function : normalize_mermaid_state_descriptions(diagram_text: str) -> str
Pattern Matched : State declarations with colons in state diagrams
Transformations Applied :
- Ensure a single space after the state name before the colon
- Replace newlines in descriptions with spaces
- Replace additional colons in the description with `-`
- Collapse multiple spaces to a single space
| Before | After |
|---|---|
| `Idle:Waiting\nfor input` | `Idle : Waiting for input` |
| `Active:Processing:data` | `Active : Processing - data` |
| `Error  :  Multiple   spaces` | `Error : Multiple spaces` |
Implementation Details :
- Only processes diagrams starting with the `stateDiagram` keyword
- Skips lines containing `::` (double colon, used for class names)
- Splits each line on the first colon occurrence
- Requires both prefix and suffix to be non-empty after stripping
Sources: python/deepwiki-scraper.py:253-277
Step 3: Flowchart Node Normalization
The normalize_flowchart_nodes function removes reserved characters (especially pipe |) from flowchart node labels and adds statement separators.
Function : normalize_flowchart_nodes(diagram_text: str) -> str
Pattern Matched : Node labels in brackets: ["..."]
Transformations Applied :
- Replace pipe characters `|` with forward slash `/`
- Collapse multiple spaces to a single space
- Insert newlines between consecutive statements on the same line
| Before | After |
|---|---|
| `Node["Label \| With Pipes"]` | `Node["Label / With Pipes"]` |
| `A["Text"] B["More"]` | `A["Text"]` and `B["More"]` on separate lines |
| `C["Many    Spaces"]` | `C["Many Spaces"]` |
Implementation Details :
- Only processes diagrams starting with the `graph` or `flowchart` keywords
- Uses the regex `\["([^"]*)"\]` to match quoted node labels
- Inserts newlines after closing brackets/braces/parens using the regex `(\"]|\}|\))\s+(?=[A-Za-z0-9_])`
- Preserves indentation when splitting statements
Sources: python/deepwiki-scraper.py:279-301
Step 4: Statement Separator Normalization
The normalize_statement_separators function inserts newlines between consecutive Mermaid statements that have been flattened onto a single line.
Function : normalize_statement_separators(diagram_text: str) -> str
Connector Tokens Recognized :
--> ==> -.-> --x x-- o--> o-> x-> *--> <--> <-.-> <-- --o
Pattern Matched : Whitespace before a node identifier that precedes a connector
Regex Pattern : STATEMENT_BREAK_PATTERN
| Before | After |
|---|---|
| `A-->B B-->C C-->D` | `A-->B`, `B-->C`, `C-->D` on separate lines |
| `Node1-->Node2 Node3-->Node4` | `Node1-->Node2`, `Node3-->Node4` on separate lines |
Implementation Details :
- Only processes diagrams starting with the `graph` or `flowchart` keywords
- Defines the `FLOW_CONNECTORS` list of all Mermaid connector tokens
- Builds the regex pattern by escaping and joining the connector tokens
- Pattern: `(?<!\n)([ \t]+)(?=[A-Za-z0-9_][\w\-]*(?:\s*\[[^\]]*\])?\s*(?:CONNECTORS)(?:\|[^|]*\|)?\s*)`
- Preserves indentation length when inserting newlines
- Converts tabs to 4 spaces for consistent indentation
Sources: python/deepwiki-scraper.py:303-328 python/deepwiki-scraper.py:309-311
Step 5: Empty Node Label Normalization
The normalize_empty_node_labels function provides fallback text for nodes with empty labels, which Mermaid 11 rejects.
Function : normalize_empty_node_labels(diagram_text: str) -> str
Pattern Matched : Empty quoted labels: NodeId[""]
Transformation Applied :
- Use node ID as fallback label text
- Replace underscores and hyphens with spaces
- Preserve original node ID for connections
| Before | After |
|---|---|
| `Dead[""]` | `Dead["Dead"]` |
| `User_Profile[""]` | `User_Profile["User Profile"]` |
| `API-Gateway[""]` | `API-Gateway["API Gateway"]` |
Implementation Details :
- Regex pattern: `(\b[A-Za-z0-9_]+)\[""\]`
- Converts underscores/hyphens to spaces for a readable label: `re.sub(r'[_\-]+', ' ', node_id)`
- Falls back to the raw `node_id` if the cleaned version is empty
- Applied to all diagram types (not limited to flowcharts)
Sources: python/deepwiki-scraper.py:330-341 python/tests/test_mermaid_normalization.py:19-23
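A sketch of the transformation; the ID character class is widened here to include hyphens so the API-Gateway example above matches as a whole (an assumption on top of the documented pattern):

```python
import re

def normalize_empty_node_labels(diagram_text):
    def repl(match):
        node_id = match.group(1)
        # Underscores/hyphens become spaces for a readable label.
        label = re.sub(r"[_\-]+", " ", node_id).strip() or node_id
        return f'{node_id}["{label}"]'
    return re.sub(r'(\b[A-Za-z0-9_\-]+)\[""\]', repl, diagram_text)
```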
Step 6: Gantt Diagram Normalization
The normalize_gantt_diagram function assigns synthetic task identifiers to gantt chart tasks that are missing them, which is required by Mermaid 11.
Function : normalize_gantt_diagram(diagram_text: str) -> str
Pattern Matched : Task lines in format "Task Name" : start, end[, duration]
Transformation Applied :
- Insert a synthetic task ID (`task1`, `task2`, etc.) after the colon
- Only apply to tasks lacking valid identifiers
- Preserve tasks that already have IDs or use `after` dependencies
| Before | After |
|---|---|
"Design" : 2024-01-01, 2024-01-10 | "Design" : task1, 2024-01-01, 2024-01-10 |
"Code" : myTask, 2024-01-11, 5d | "Code" : myTask, 2024-01-11, 5d (unchanged) |
"Test" : after task1, 3d | "Test" : after task1, 3d (unchanged) |
Implementation Details :
- Only processes diagrams starting with the `gantt` keyword
- Task line regex: `^(\s*"[^"]+"\s*):\s*(.+)$`
- Splits the remainder on commas (max 3 parts)
- Checks whether the first token matches `^[A-Za-z_][\w-]*$` or starts with `after`
- Maintains a counter (`task_counter`) for generating unique IDs
- Reconstructs the line as `"{task_name}" : {task_id}, {start}, {end}[, {duration}]`
Sources: python/deepwiki-scraper.py:343-383
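A sketch assembled from the implementation details above:

```python
import re

TASK_LINE = re.compile(r'^(\s*"[^"]+"\s*):\s*(.+)$')

def normalize_gantt_diagram(diagram_text):
    if not diagram_text.strip().lower().startswith("gantt"):
        return diagram_text
    task_counter = 0
    out = []
    for line in diagram_text.split("\n"):
        m = TASK_LINE.match(line)
        if m:
            name, rest = m.groups()
            first_token = rest.split(",", 1)[0].strip()
            has_id = bool(re.match(r"^[A-Za-z_][\w-]*$", first_token)) \
                or first_token.startswith("after")
            if not has_id:
                task_counter += 1  # synthesize task1, task2, ...
                line = f"{name.strip()} : task{task_counter}, {rest}"
        out.append(line)
    return "\n".join(out)
```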
Step 7: Additional Preprocessing
Before the seven main normalization steps, diagrams undergo additional preprocessing in the extraction phase:
Quote Stripping : strip_wrapping_quotes(diagram_text: str) -> str
- Removes unnecessary quotation marks around edge labels: `|"text"|` → `|text|`
- Removes quotes in state transitions: `: "label"` → `: label`
Label Merging : merge_multiline_labels(diagram_text: str) -> str
- Collapses wrapped labels inside node shapes into `\n` sequences
- Handles multiple shape types: `()`, `[]`, `{}`, `(())`, `[[]]`, `{{}}`
- Skips lines containing structural tokens (arrows, keywords)
- Applied before unescaping, so it works with both real and escaped newlines
Sources: python/deepwiki-scraper.py:907-1023
Main Orchestrator Function
The normalize_mermaid_diagram function orchestrates all normalization passes in the correct order.
Function Signature : normalize_mermaid_diagram(diagram_text: str) -> str
Implementation :
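A sketch of the orchestrator, assuming the six pass functions documented above:

```python
def normalize_mermaid_diagram(diagram_text):
    """Apply the normalization passes in order; each pass is idempotent."""
    for step in (
        normalize_mermaid_edge_labels,
        normalize_mermaid_state_descriptions,
        normalize_flowchart_nodes,
        normalize_statement_separators,
        normalize_empty_node_labels,
        normalize_gantt_diagram,
    ):
        diagram_text = step(diagram_text)
    return diagram_text
```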
Key Characteristics :
- Each pass is idempotent and can be safely applied multiple times
- Passes are independent in what they match but order-dependent in their effects
- Edge label normalization must precede statement separator insertion
- Flowchart node normalization includes its own statement separator logic
- Empty label normalization should occur after other node transformations
Sources: python/deepwiki-scraper.py:385-393
graph TD
Extract["extract_and_enhance_diagrams()"]
Loop["For each diagram match"]
Unescape["Unescape sequences\n(\\\n, \\ , \<, etc)"]
Preprocess["merge_multiline_labels()\nstrip_wrapping_quotes()"]
Normalize["normalize_mermaid_diagram()"]
Context["Extract context\n(heading, anchor text)"]
Pool["Add to diagram_contexts list"]
Extract --> Loop
Loop --> Unescape
Unescape --> Preprocess
Preprocess --> Normalize
Normalize --> Context
Context --> Pool
Normalization Invocation Points
The normalization pipeline is invoked at a single location during diagram processing:
Invocation Context Diagram
Sources: python/deepwiki-scraper.py:880-1089 python/deepwiki-scraper.py:1058-1060
Testing and Validation
The normalization pipeline has dedicated unit tests covering each normalization function:
Test Coverage :
| Function | Test File | Test Cases |
|---|---|---|
| `normalize_statement_separators` | test_mermaid_normalization.py | Newline insertion, indentation preservation |
| `normalize_empty_node_labels` | test_mermaid_normalization.py | Empty label replacement |
| `normalize_flowchart_nodes` | test_mermaid_normalization.py | Pipe character stripping |
| `normalize_mermaid_diagram` | test_mermaid_normalization.py | End-to-end pipeline test |
Example Test Case (Statement Separator Normalization):
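A sketch of what such a test might look like (the repository's exact assertions may differ):

```python
def test_statement_separator_insertion():
    raw = "graph TD\n    A-->B B-->C C-->D"
    result = normalize_statement_separators(raw)
    stripped = [line.strip() for line in result.split("\n")]
    # Flattened statements end up on separate lines.
    assert stripped == ["graph TD", "A-->B", "B-->C", "C-->D"]
```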
End-to-End Test : The end-to-end test validates that multiple normalization steps work together correctly:
- Input: `graph TD\n Stage1[""] --> Stage2["Stage 2"]\n Stage2 --> Stage3 Stage3 --> Stage4`
- Validates empty label replacement: `Stage1["Stage1"]`
- Validates statement separation: `Stage2 --> Stage3` and `Stage3 --> Stage4` on separate lines
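A sketch of the end-to-end check using the documented input and expectations:

```python
def test_normalize_mermaid_diagram_end_to_end():
    raw = 'graph TD\n Stage1[""] --> Stage2["Stage 2"]\n Stage2 --> Stage3 Stage3 --> Stage4'
    result = normalize_mermaid_diagram(raw)
    assert 'Stage1["Stage1"]' in result      # empty label replaced
    stripped = [line.strip() for line in result.split("\n")]
    assert "Stage2 --> Stage3" in stripped   # statements separated
    assert "Stage3 --> Stage4" in stripped
```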
Sources: python/tests/test_mermaid_normalization.py:1-42
Common Edge Cases
The normalization pipeline handles several edge cases that commonly occur in DeepWiki diagrams:
Empty Diagram Handling :
- All normalizers check for empty/whitespace-only input
- Return original text unchanged if stripped content is empty
Diagram Type Detection :
- Each normalizer checks the diagram type via the first-line keyword
- `normalize_mermaid_edge_labels`: only processes `graph` or `flowchart`
- `normalize_mermaid_state_descriptions`: only processes `stateDiagram`
- `normalize_gantt_diagram`: only processes `gantt`
- Other normalizers apply to all diagram types
Indentation Preservation :
- Statement separator normalization preserves original indentation level
- Converts tabs to 4 spaces for consistent formatting
- Inserts newlines with matching indentation
Backtick Escaping in Fence Blocks : When injecting normalized diagrams into markdown, the injection logic dynamically calculates fence length to avoid conflicts with backticks inside diagram code:
- Scans diagram for longest backtick run
- Uses
max(3, max_backticks + 1)as fence length
Sources: python/deepwiki-scraper.py:1249-1255
Fuzzy Matching Algorithm
Relevant source files
Purpose and Scope
The fuzzy matching algorithm places Mermaid diagrams extracted from DeepWiki’s JavaScript payload into the correct locations within Markdown files. It matches diagram context (as it appears in the JavaScript) to content locations in html2text-converted Markdown, accounting for formatting differences between the two representations.
This algorithm is implemented in the extract_and_enhance_diagrams() function and processes all files after the initial markdown extraction phase completes.
Sources: python/deepwiki-scraper.py:880-1275
The Matching Problem
The fuzzy matching algorithm addresses a fundamental mismatch: diagrams are embedded in DeepWiki’s JavaScript payload alongside their surrounding context text, but this context text differs significantly from the final Markdown output produced by html2text. The algorithm must find where each diagram belongs despite these differences.
Format Differences Between Sources
| Aspect | JavaScript Payload | html2text Output |
|---|---|---|
| Whitespace | Escaped \n sequences | Actual newlines |
| Line wrapping | No wrapping (continuous text) | Wrapped at natural boundaries |
| HTML entities | Escaped (\u003c, \u0026) | Decoded (<, &) |
| Formatting | Inline with escaped quotes | Clean Markdown syntax |
| Structure | Linear text stream | Hierarchical headings/paragraphs |
Sources: python/deepwiki-scraper.py:898-903
Context Extraction Strategy
The algorithm extracts two types of context for each diagram to enable matching:
1. Last Heading Before Diagram
Extracting `last_heading` from Context
The algorithm scans backwards through context_lines to find the most recent line starting with #, which provides a coarse-grained location hint.
Sources: python/deepwiki-scraper.py:1066-1071
2. Anchor Text (Last 2-3 Paragraphs)
Extracting `anchor_text` from Context
The anchor_text consists of the last 2-3 substantial non-heading lines before the diagram, truncated to 300 characters. This provides fine-grained matching capability.
Sources: python/deepwiki-scraper.py:1073-1081
Progressive Chunk Size Matching
The core of the fuzzy matching algorithm uses progressively smaller chunk sizes to find matches, prioritizing longer (more specific) matches over shorter ones.
Chunk Size Progression
The algorithm tests chunks in this order:
| Chunk Size | Purpose | Match Quality |
|---|---|---|
| 300 chars | Full anchor text | Highest confidence |
| 200 chars | Most of anchor | High confidence |
| 150 chars | Significant portion | Medium-high confidence |
| 100 chars | Key phrases | Medium confidence |
| 80 chars | Minimum viable match | Low confidence |
Matching Algorithm Flow
Progressive Chunk Matching in Code
Sources: python/deepwiki-scraper.py:1169-1239
Text Normalization
Both the diagram context and the target Markdown content undergo identical normalization to maximize matching success.
This process:
- Converts all text to lowercase
- Collapses all consecutive whitespace (spaces, tabs, newlines) into single spaces
- Removes leading/trailing whitespace
Sources: python/deepwiki-scraper.py:1166-1167 python/deepwiki-scraper.py:1185-1186
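The whole normalization fits in a couple of lines; a minimal sketch:

```python
def normalize_for_matching(text):
    # Lowercase, then collapse every whitespace run into a single space.
    return " ".join(text.lower().split())
```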
Fallback: Heading-Based Matching
If progressive chunk matching fails (best_match_line == -1 and heading exists), the algorithm falls back to heading-based matching:
Heading Fallback Implementation
Heading-based matches receive a fixed best_match_score of 50, lower than any chunk-based match (minimum 80), indicating lower confidence.
Sources: python/deepwiki-scraper.py:1203-1216
Insertion Point Calculation
Once a match is found with best_match_score >= 80, the algorithm calculates the precise insert_line for the diagram:
Insertion After Headings
Calculating `insert_line` After Heading
Sources: python/deepwiki-scraper.py:1222-1230
Insertion After Paragraphs
Calculating `insert_line` After Paragraph
Sources: python/deepwiki-scraper.py:1232-1236
Scoring and Deduplication
The algorithm tracks which diagrams have been used to prevent duplicates.
For each file, the algorithm:
- Attempts to match all diagrams in `diagram_contexts` with the file content
- Stores successful matches with their scores in `pending_insertions` as tuples: `(insert_line, diagram, best_match_score, idx)`
- Marks diagrams as used by adding their index to the `diagrams_used` set
- Sorts `pending_insertions` by line number (descending) to avoid index shifting
- Inserts diagrams from bottom to top
Sources: python/deepwiki-scraper.py:1162-1163 python/deepwiki-scraper.py:1238 python/deepwiki-scraper.py:1242-1243
Diagram Insertion Format
Diagrams are inserted with proper Markdown fencing and spacing, accounting for backticks in the diagram content:
Building `lines_to_insert`
This results in a fenced `` ```mermaid `` block surrounded by blank lines, inserted immediately after the matched paragraph (e.g., after "Next paragraph text.") in the Markdown file.
If the diagram contains triple backticks, the fence length is increased (e.g., to 4 or 5 backticks) to avoid conflicts.
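A sketch of the fence-length calculation and block construction, assuming `diagram` holds the normalized diagram text:

```python
import re

def build_mermaid_block(diagram):
    # Fence must be longer than any backtick run inside the diagram.
    runs = [len(m.group(0)) for m in re.finditer(r"`+", diagram)]
    fence = "`" * max(3, (max(runs) + 1) if runs else 3)
    return ["", f"{fence}mermaid", *diagram.split("\n"), fence, ""]
```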
Sources: python/deepwiki-scraper.py:1249-1266
Complete Matching Pipeline
extract_and_enhance_diagrams() Function Flow
flowchart TD
Start["extract_and_enhance_diagrams(\nrepo, temp_dir,\nsession, diagram_source_url)"] --> Fetch["response = session.get(\ndiagram_source_url)\nhtml_text = response.text"]
Fetch --> ExtractPattern["diagram_pattern =\nr'```mermaid(?:\\r\\n|\\n|\r?\n)\n(.*?)(?:\\r\\n|\\n|\r?\n)```'\ndiagram_matches =\nlist(re.finditer(\npattern, html_text, re.DOTALL))"]
ExtractPattern --> ContextLoop["diagram_contexts = []\nfor match in diagram_matches:\ncontext_start =\nmax(0, match.start() - 2000)"]
ContextLoop --> ParseContext["extract:\n- last_heading\n- anchor_text[-300:]\n- diagram (unescaped)"]
ParseContext --> NormalizeDiag["diagram =\nmerge_multiline_labels(diagram)\ndiagram =\nstrip_wrapping_quotes(diagram)\ndiagram =\nnormalize_mermaid_diagram(diagram)"]
NormalizeDiag --> AppendContext["diagram_contexts.append({\n'last_heading': last_heading,\n'anchor_text': anchor_text,\n'diagram': diagram\n})"]
AppendContext --> FindFiles["md_files =\nlist(temp_dir.glob('**/*.md'))"]
FindFiles --> FileLoop["for md_file in md_files:"]
FileLoop --> ReadFile["content = f.read()"]
ReadFile --> CheckExists["re.search(\nr'^\\s*`{3,}\\s*mermaid\\b',\ncontent, re.IGNORECASE\n| re.MULTILINE)?"]
CheckExists -->|Yes| Skip["continue"]
CheckExists -->|No| SplitLines["lines = content.split('\\n')"]
SplitLines --> InitVars["diagrams_used = set()\npending_insertions = []\ncontent_normalized =\ncontent.lower()"]
InitVars --> DiagLoop["for idx, item in\nenumerate(diagram_contexts):"]
DiagLoop --> TryChunks["for chunk_size in\n[300, 200, 150, 100, 80]:\ntest_chunk =\nanchor_normalized[-chunk_size:]\npos = content_normalized\n.find(test_chunk)"]
TryChunks -->|Found| CalcLine["convert pos to line_num\nbest_match_line = line_num\nbest_match_score = chunk_size"]
TryChunks -->|Not found| TryHeading["heading fallback matching"]
TryHeading -->|Found| SetScore["best_match_score = 50"]
CalcLine --> CheckScore["best_match_score >= 80?"]
SetScore --> CheckScore
CheckScore -->|Yes| CalcInsert["calculate insert_line:\nenforce_content_start()\nadvance_past_lists()"]
CalcInsert --> AppendPending["pending_insertions.append(\ninsert_line, diagram,\nbest_match_score, idx)\ndiagrams_used.add(idx)"]
CheckScore -->|No| NextDiag["next diagram"]
AppendPending --> NextDiag
NextDiag --> DiagLoop
DiagLoop -->|All done| SortPending["pending_insertions.sort(\nkey=lambda x: x[0],\nreverse=True)"]
SortPending --> InsertLoop["for insert_line, diagram,\nscore, idx in\npending_insertions:\ncalculate fence_len\nlines.insert(insert_line,\nlines_to_insert)"]
InsertLoop --> SaveFile["with open(md_file, 'w')\nas f:\nf.write('\\n'.join(lines))"]
SaveFile --> FileLoop
FileLoop -->|All files| PrintStats["print(f'Enhanced\n{enhanced_count} files')"]
Sources: python/deepwiki-scraper.py:880-1275
Key Functions and Variables
| Function/Variable | Location | Purpose |
|---|---|---|
| `extract_and_enhance_diagrams()` | python/deepwiki-scraper.py:880-1275 | Main orchestrator for the diagram enhancement phase |
| `diagram_contexts` | python/deepwiki-scraper.py:903 | List of dicts with last_heading, anchor_text, diagram |
| `first_body_heading_index()` | python/deepwiki-scraper.py:1095-1099 | Finds the first ## heading in the file |
| `protected_prefix_end()` | python/deepwiki-scraper.py:1101-1115 | Determines where the title and source list end |
| `advance_past_lists()` | python/deepwiki-scraper.py:1125-1137 | Skips over list blocks to avoid insertion inside lists |
| `enforce_content_start()` | python/deepwiki-scraper.py:1138-1147 | Ensures insertion happens after protected sections |
| `diagrams_used` | python/deepwiki-scraper.py:1162 | Set tracking which diagram indices are already placed |
| `pending_insertions` | python/deepwiki-scraper.py:1163 | List of tuples: (insert_line, diagram, score, idx) |
| `best_match_line` | python/deepwiki-scraper.py:1175 | Line number where the best match was found |
| `best_match_score` | python/deepwiki-scraper.py:1176 | Score of the best match (chunk_size, or 50 for heading) |
| Progressive chunk loop | python/deepwiki-scraper.py:1187-1201 | Tries chunk sizes [300, 200, 150, 100, 80] |
| Heading fallback | python/deepwiki-scraper.py:1203-1216 | Matches based on heading text when chunks fail |
| Insertion calculation | python/deepwiki-scraper.py:1219-1236 | Determines where to insert diagram after match |
| Diagram insertion | python/deepwiki-scraper.py:1249-1266 | Inserts diagram with dynamic fence length |
Performance Characteristics
The algorithm processes diagrams in a single pass per file with the following complexity:
| Operation | Complexity | Notes |
|---|---|---|
| Content normalization | O(n) | Where n = file size in characters |
| Chunk search | O(n × c × d) | c = 5 chunk sizes, d = diagram count |
| Line number conversion | O(L) | Where L = number of lines in file |
| Insertion sorting | O(k log k) | Where k = matched diagrams |
| Bottom-up insertion | O(k × L) | Avoids index recalculation due to reverse order |
For a typical file with 1000 lines and 48 diagram candidates with context, the algorithm completes in under 100ms per file.
Sources: python/deepwiki-scraper.py:880-1275
Match Quality Statistics
As reported in the console output, the algorithm typically achieves:
- Total diagrams in JavaScript : ~461 diagrams across all pages
- Diagrams with sufficient context : ~48 diagrams (500+ char context)
- Average match rate : 60-80% of diagrams with context are successfully placed
- Typical score distribution :
- 300-char matches: 20-30% (highest confidence)
- 200-char matches: 15-25%
- 150-char matches: 15-20%
- 100-char matches: 10-15%
- 80-char matches: 5-10%
- Heading fallback: 5-10% (lowest confidence)
Sources: README.md:132-136 tools/deepwiki-scraper.py:674
Phase 3: mdBook Build
Relevant source files
Purpose and Scope
Phase 3 is the final transformation stage that converts enhanced markdown files into a searchable HTML documentation website using mdBook. This phase creates the book structure, generates the table of contents, injects templates, installs mermaid rendering support, and executes the mdBook build process.
This page covers the overall Phase 3 workflow and its core components. For detailed information about specific sub-processes, see:
- SUMMARY.md generation algorithm: 8.1
- Template injection mechanics: 8.2
- Configuration system: 3
- Template system details: 11
Phase 3 begins after Phase 2 completes diagram enhancement (see 7) and ends with the production of deployable HTML artifacts.
Sources: scripts/build-docs.sh:95-310
Phase 3 Process Flow
Phase 3 executes six distinct steps, each transforming the workspace toward the final HTML output. The process is orchestrated by build-docs.sh and coordinates multiple tools.
graph TB
Input["Enhanced Markdown Files\n/workspace/wiki/"]
Step2["Step 2: Initialize mdBook Structure\nCreate /workspace/book/\nGenerate book.toml"]
Step3["Step 3: Generate SUMMARY.md\nDiscover file structure\nSort pages numerically"]
Step4["Step 4: Process Markdown Files\nInject header/footer templates\nCopy to book/src/"]
Step5["Step 5: Install Mermaid Assets\nmdbook-mermaid install"]
Step6["Step 6: Build Book\nmdbook build"]
Step7["Step 7: Copy Outputs\nbook/ → /output/book/\nbook.toml → /output/"]
Output["Final Outputs\n/output/book/ (HTML)\n/output/markdown/\n/output/book.toml"]
Input --> Step2
Step2 --> Step3
Step3 --> Step4
Step4 --> Step5
Step5 --> Step6
Step6 --> Step7
Step7 --> Output
High-Level Phase 3 Pipeline
Sources: scripts/build-docs.sh:95-310
mdBook Structure Initialization
The first step creates the mdBook workspace at /workspace/book/ and generates the configuration file book.toml. This establishes the foundation for all subsequent operations.
Directory Structure Creation
The script creates the following directory hierarchy:
/workspace/book/
├── book.toml (generated configuration)
└── src/ (will contain markdown files)
├── SUMMARY.md (generated in Step 3)
├── *.md (copied in Step 4)
└── section-*/ (copied in Step 4)
Sources: scripts/build-docs.sh:96-122
book.toml Configuration
The book.toml file is generated dynamically from environment variables. The configuration structure follows the mdBook specification:
| Configuration Section | Purpose | Source Variables |
|---|---|---|
| `[book]` | Book metadata | `BOOK_TITLE`, `BOOK_AUTHORS` |
| `[output.html]` | HTML output settings | `GIT_REPO_URL` |
| `[preprocessor.mermaid]` | Mermaid diagram support | Static configuration |
| `[output.html.fold]` | Section folding behavior | Static configuration |
The generated configuration:
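A sketch of the generation, with placeholder defaults where environment variables are substituted (the exact keys and defaults in build-docs.sh may differ):

```python
import os

book_toml = f"""\
[book]
title = "{os.environ.get('BOOK_TITLE', 'Documentation')}"
authors = ["{os.environ.get('BOOK_AUTHORS', '')}"]
src = "src"

[output.html]
git-repository-url = "{os.environ.get('GIT_REPO_URL', '')}"

[output.html.fold]
enable = true

[preprocessor.mermaid]
command = "mdbook-mermaid"
"""
```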
The git-repository-url setting enables the GitHub icon in the rendered book’s header, providing navigation back to the source repository.
Sources: scripts/build-docs.sh:101-119
SUMMARY.md Generation Process
Step 3 generates src/SUMMARY.md, which defines the book’s table of contents and navigation structure. This is a critical file that mdBook requires to determine page ordering and hierarchy.
graph TB
WikiDir["/workspace/wiki/\nAll markdown files"]
FindOverview["Find Overview File\ngrep -Ev '^[0-9]'\nFirst non-numbered file"]
FindMain["Find Main Pages\ngrep -E '^[0-9]'\nFiles matching ^[0-9]*.md"]
Sort["Sort Numerically\nsort -t- -k1 -n\nBy leading number"]
CheckSections{"Has Subsections?\nsection-N directory exists?"}
FindSubs["Find Subsections\nls section-N/*.md\nSort numerically"]
ExtractTitle["Extract Title\nhead -1 file.md\nsed 's/^# //'"]
BuildEntry["Build TOC Entry\n- [Title](filename)"]
BuildNested["Build Nested Entry\n- [Title](filename)\n - [Subtitle](section-N/file)"]
Summary["src/SUMMARY.md\nGenerated TOC"]
WikiDir --> FindOverview
WikiDir --> FindMain
FindOverview --> Summary
FindMain --> Sort
Sort --> CheckSections
CheckSections -->|No| ExtractTitle
CheckSections -->|Yes| FindSubs
ExtractTitle --> BuildEntry
FindSubs --> ExtractTitle
ExtractTitle --> BuildNested
BuildEntry --> Summary
BuildNested --> Summary
File Discovery and Sorting
Numeric Sorting Algorithm
The sorting mechanism uses standard shell utilities to extract numeric prefixes and sort appropriately:
- List all `.md` files in `/workspace/wiki/`: `ls "$WIKI_DIR"/*.md`
- Filter by numeric prefix: `grep -E '^[0-9]'`
- Sort using the field delimiter `-` and numeric comparison: `sort -t- -k1 -n`
- For each main page, check for a subsection directory: `section-$section_num`
- If subsections exist, repeat the sort for the subsection files

This ensures pages appear in the correct order: 1-overview.md, 2-architecture.md, 2-1-subsection.md, etc.
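The same ordering expressed in Python for illustration (the script itself uses the shell pipeline above):

```python
import re

def sort_numbered_pages(filenames):
    """Order '2-architecture.md' before '10-deploy.md' by numeric prefix."""
    def prefix(name):
        m = re.match(r"^(\d+)", name)
        return int(m.group(1)) if m else float("inf")
    return sorted(filenames, key=prefix)
```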
Sources: scripts/build-docs.sh:124-188
For detailed documentation of the SUMMARY.md generation algorithm, see 8.1.
Markdown File Processing
Step 4 copies markdown files from /workspace/wiki/ to /workspace/book/src/ and injects HTML templates into each file. This step bridges the gap between raw markdown and mdBook-ready content.
graph TB
HeaderTemplate["templates/header.html\nRaw template with variables"]
FooterTemplate["templates/footer.html\nRaw template with variables"]
ProcessTemplate["process-template.py\nVariable substitution"]
EnvVars["Environment Variables\nREPO, BOOK_TITLE, DEEPWIKI_URL\nGITHUB_BADGE_URL, etc."]
HeaderHTML["HEADER_HTML\nProcessed HTML string"]
FooterHTML["FOOTER_HTML\nProcessed HTML string"]
MarkdownFiles["Markdown Files\nsrc/*.md, src/*/*.md"]
Inject["Injection Loop\nfor mdfile in src/*.md"]
TempFile["$mdfile.tmp\nHeader + Content + Footer"]
Replace["mv $mdfile.tmp $mdfile\nReplace original"]
FinalFiles["Final Book Source\nsrc/*.md with templates"]
HeaderTemplate --> ProcessTemplate
FooterTemplate --> ProcessTemplate
EnvVars --> ProcessTemplate
ProcessTemplate --> HeaderHTML
ProcessTemplate --> FooterHTML
HeaderHTML --> Inject
FooterHTML --> Inject
MarkdownFiles --> Inject
Inject --> TempFile
TempFile --> Replace
Replace --> FinalFiles
Template Processing Workflow
Template Variable Substitution
The process-template.py script is invoked twice, once for the header and once for the footer.
The processed HTML strings are then injected into every markdown file through shell redirection.
Sources: scripts/build-docs.sh:190-261
For detailed documentation of template mechanics and customization, see 8.2 and 11.
Mermaid Asset Installation
Step 5 installs the mdbook-mermaid preprocessor assets into the book directory. This step configures the JavaScript libraries and stylesheets required to render Mermaid diagrams in the final HTML output.
Installation Command
The installation is performed by running `mdbook-mermaid install` in the book directory.
This command:
- Creates theme assets in `book/theme/`
- Installs `mermaid-init.js` for diagram initialization
- Configures the mermaid.js library version
- Sets up diagram rendering hooks for mdBook's preprocessor chain
The preprocessor was configured in book.toml during Step 2 via the `[preprocessor.mermaid]` section.
This configuration tells mdBook to run the mermaid preprocessor before generating HTML, which converts mermaid code blocks into rendered diagrams.
Sources: scripts/build-docs.sh:263-266
graph LR
BookToml["book.toml\nConfiguration"]
SrcDir["src/\nSUMMARY.md\n*.md files\nsection-*/ dirs"]
MdBookBuild["mdbook build\nMain build command"]
Preprocessor["mermaid preprocessor\nConvert mermaid blocks"]
Renderer["HTML renderer\nGenerate pages"]
Assets["Copy static assets\ntheme/, images/"]
Search["Build search index\nsearchindex.js"]
BookOutput["book/\nCompiled HTML"]
BookToml --> MdBookBuild
SrcDir --> MdBookBuild
MdBookBuild --> Preprocessor
Preprocessor --> Renderer
Renderer --> Assets
Renderer --> Search
Assets --> BookOutput
Search --> BookOutput
Book Build Execution
Step 6 executes the core mdBook build process. This step transforms the prepared markdown files and configuration into a complete HTML documentation website.
mdBook Build Pipeline
Build Command
The build is invoked as `mdbook build`, with no arguments, using the current directory's configuration.
mdBook automatically:
- Reads `book.toml` for configuration
- Processes `src/SUMMARY.md` to determine the page structure
- Runs configured preprocessors (mermaid)
- Generates HTML with a search index
- Applies the configured theme (rust)
- Creates navigation elements with the git repository link
The resulting output is written to book/, relative to the current working directory (/workspace/book/).
Sources: scripts/build-docs.sh:268-271
graph TB
BookBuild["book/\nBuilt HTML website"]
WikiDir["wiki/\nEnhanced markdown"]
RawDir["raw_markdown/\nPre-enhancement snapshots"]
BookConfig["book.toml\nBuild configuration"]
OutputBook["/output/book/\nDeployable HTML"]
OutputMD["/output/markdown/\nFinal markdown source"]
OutputRaw["/output/raw_markdown/\nDebug snapshots"]
OutputConfig["/output/book.toml\nReference config"]
BookBuild -->|cp -r| OutputBook
WikiDir -->|cp -r| OutputMD
RawDir -->|cp -r if exists| OutputRaw
BookConfig -->|cp| OutputConfig
Output Collection
Step 7 consolidates all build artifacts into the /output/ directory, which is typically mounted as a volume for access from the host system.
Output Artifacts and Layout
Copy Operations
The script performs four copy operations:
| Source | Destination | Purpose |
|---|---|---|
| `book/` | `/output/book/` | Deployable HTML site |
/workspace/wiki/ | /output/markdown/ | Enhanced markdown (with diagrams) |
/workspace/raw_markdown/ | /output/raw_markdown/ | Pre-enhancement markdown (debugging) |
book.toml | /output/book.toml | Configuration reference |
The /output/book/ directory is immediately servable as a static website:
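For a quick local preview (illustrative only; not part of the build script):

```python
import functools
import http.server
import socketserver

handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory="/output/book"
)
with socketserver.TCPServer(("", 8000), handler) as httpd:
    httpd.serve_forever()  # browse http://localhost:8000
```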
Sources: scripts/build-docs.sh:273-309
graph TB
Phase1["Phase 1:\nMarkdown Extraction"]
Phase2["Phase 2:\nDiagram Enhancement"]
CheckMode{"MARKDOWN_ONLY\n== 'true'?"}
CopyMD["Copy wiki/ to\n/output/markdown/"]
CopyRaw["Copy raw_markdown/ to\n/output/raw_markdown/"]
Exit["Exit with success\nSkip Phase 3"]
Phase3["Phase 3:\nmdBook Build"]
Phase1 --> Phase2
Phase2 --> CheckMode
CheckMode -->|Yes| CopyMD
CopyMD --> CopyRaw
CopyRaw --> Exit
CheckMode -->|No| Phase3
Markdown-Only Mode
The MARKDOWN_ONLY environment variable provides an escape hatch that bypasses Phase 3 entirely. When set to "true", the script exits after Phase 2 (diagram enhancement) without building the HTML book.
Markdown-Only Workflow
This mode is useful for:
- Debugging diagram placement without HTML build overhead
- Consuming markdown in alternative build systems
- Inspecting intermediate transformation results
- CI/CD pipelines that only need markdown output
Sources: scripts/build-docs.sh:67-93
For more details on markdown-only mode, see 12.1.
graph TB
subgraph "Shell Script Orchestration"
BuildScript["build-docs.sh\nMain orchestrator"]
Step2Func["Lines 95-122\nmkdir, cat > book.toml"]
Step3Func["Lines 124-188\nSUMMARY.md generation"]
Step4Func["Lines 190-261\ncp, template injection"]
Step5Func["Lines 263-266\nmdbook-mermaid install"]
Step6Func["Lines 268-271\nmdbook build"]
Step7Func["Lines 273-295\ncp outputs"]
end
subgraph "External Tools"
MdBook["mdbook\nRust binary\n/usr/local/bin/mdbook"]
MdBookMermaid["mdbook-mermaid\nPreprocessor binary\n/usr/local/bin/mdbook-mermaid"]
ProcessTemplate["process-template.py\nPython script\n/usr/local/bin/process-template.py"]
end
subgraph "Key Files"
BookToml["book.toml\nGenerated at line 102"]
SummaryMd["src/SUMMARY.md\nGenerated at line 130-186"]
HeaderHtml["templates/header.html\nInput template"]
FooterHtml["templates/footer.html\nInput template"]
end
subgraph "Directory Structures"
BookDir["/workspace/book/\nBuild workspace"]
SrcDir["/workspace/book/src/\nMarkdown sources"]
OutputDir["/output/\nVolume mount point"]
end
BuildScript --> Step2Func
Step2Func --> Step3Func
Step3Func --> Step4Func
Step4Func --> Step5Func
Step5Func --> Step6Func
Step6Func --> Step7Func
Step2Func --> BookToml
Step2Func --> BookDir
Step3Func --> SummaryMd
Step4Func --> ProcessTemplate
Step4Func --> SrcDir
Step5Func --> MdBookMermaid
Step6Func --> MdBook
Step7Func --> OutputDir
HeaderHtml --> ProcessTemplate
FooterHtml --> ProcessTemplate
Phase 3 Component Map
This diagram maps Phase 3 concepts to their concrete implementations in the codebase:
Sources: scripts/build-docs.sh:95-310
Phase 3 Error Handling
Phase 3 operates under set -e mode, causing immediate script termination on any command failure. Common failure scenarios:
| Failure Point | Cause | Impact |
|---|---|---|
| Step 2 `mkdir` | Permissions issue, disk full | Script exits before book.toml creation |
| Step 3 file discovery | No markdown files in wiki/ | SUMMARY.md empty, mdBook build fails |
| Step 4 template processing | Invalid template syntax | HEADER_HTML or FOOTER_HTML empty, build continues without templates |
| Step 5 mermaid install | mdbook-mermaid binary missing | Script exits, no mermaid support in output |
| Step 6 mdbook build | Invalid markdown syntax, broken SUMMARY.md | Script exits, no HTML output |
| Step 7 copy | Permissions issue on /output | Script exits, no artifacts persisted |
All errors are reported to stderr and result in a non-zero exit code due to set -e.
Sources: scripts/build-docs.sh:2 scripts/build-docs.sh:95-310
Summary
Phase 3 transforms enhanced markdown into a searchable HTML documentation website through six orchestrated steps:
- Structure Initialization : Creates the mdBook workspace and `book.toml` configuration
- SUMMARY.md Generation : Builds the navigation structure with numeric sorting
- Markdown Processing : Injects templates and copies files to the book source
- Mermaid Installation : Configures diagram rendering support
- Book Build : Executes mdBook to generate HTML
- Output Collection : Consolidates artifacts to the `/output/` volume
Phase 3 is entirely orchestrated by build-docs.sh and coordinates shell utilities, Python scripts, and Rust binaries to produce the final documentation website. The process is deterministic and idempotent, generating consistent output from the same input markdown.
Sources: scripts/build-docs.sh:95-310
SUMMARY.md Generation
Relevant source files
Purpose and Scope
This document explains how the SUMMARY.md file is dynamically generated from the scraped markdown content structure. The SUMMARY.md file serves as mdBook’s table of contents, defining the navigation structure and page hierarchy for the generated HTML documentation.
For information about how the markdown files are initially organized during scraping, see Wiki Structure Discovery. For details about the overall mdBook build configuration, see Configuration Generation.
SUMMARY.md in mdBook
The SUMMARY.md file is mdBook’s primary navigation document. It defines:
- The order of pages in the documentation
- The hierarchical structure (chapters and sub-chapters)
- The titles displayed in the navigation sidebar
- Which markdown files map to which sections
mdBook parses SUMMARY.md to construct the entire book structure. Pages not listed in SUMMARY.md will not be included in the generated documentation.
Sources: build-docs.sh:108-161
Generation Process Overview
The SUMMARY.md generation occurs in Step 3 of the build pipeline build-docs.sh:124-188 after markdown extraction is complete but before the mdBook build begins. The generation algorithm automatically discovers the file structure, applies numeric sorting to section files, and constructs a hierarchical table of contents.
SUMMARY.md Generation Pipeline
flowchart TD
Start["Start Step 3:\nbuild-docs.sh:126"]
Init["Write Header:\necho '# Summary'\nLines 129-131"]
ListFiles["List all files:\nmain_pages_list=$(ls $WIKI_DIR/*.md)"]
FindOverview["Find overview_file:\ngrep -Ev '^[0-9]' | head -1\nLines 136-138"]
HasOverview{"overview_file exists?"}
ExtractOvTitle["Extract title: head -1 | sed 's/^# //'\nLine 140"]
WriteOv["Write: [$title]($overview_file)\nLines 141-143"]
RemoveOv["Filter overview_file from list\nLine 143"]
NumericSort["Numeric Sort:\ngrep -E '^[0-9]' | sort -t- -k1 -n\nLines 147-155"]
IterateMain["for file in main_pages\nLine 158"]
ExtractTitle["title=$(head -1 | sed 's/^# //')"]
ExtractSecNum["section_num=$(grep -oE '^[0-9]+')"]
CheckSecDir{"[ -d section-$section_num ]"}
WriteSec["echo '- [$title]($filename)'\nLine 171"]
IterateSub["ls section-$section_num/*.md | sort -t- -k1 -n\nLines 174-180"]
WriteSub["echo '  - [$subtitle](section-$section_num/$subfilename)'"]
WriteStandalone["echo '- [$title]($filename)'\nLine 183"]
Complete["Redirect to: src/SUMMARY.md\nLine 186"]
LogCount["Log entry count: grep -c '\\[' src/SUMMARY.md\nLine 188"]
End["End Step 3"]
Start --> Init
Init --> ListFiles
ListFiles --> FindOverview
FindOverview --> HasOverview
HasOverview -->|Yes|ExtractOvTitle
ExtractOvTitle --> WriteOv
WriteOv --> RemoveOv
RemoveOv --> NumericSort
HasOverview -->|No|NumericSort
NumericSort --> IterateMain
IterateMain --> ExtractTitle
ExtractTitle --> ExtractSecNum
ExtractSecNum --> CheckSecDir
CheckSecDir -->|Yes: Has subsections|WriteSec
WriteSec --> IterateSub
IterateSub --> WriteSub
WriteSub --> IterateMain
CheckSecDir -->|No: Standalone|WriteStandalone
WriteStandalone --> IterateMain
IterateMain -->|Done| Complete
Complete --> LogCount
LogCount --> End
Sources: build-docs.sh:124-188
The algorithm executes three key phases:
| Phase | Lines | Description |
|---|---|---|
| Overview Extraction | 133-145 | Identifies and writes non-numbered introduction page |
| Numeric Sorting | 147-155 | Sorts numbered pages by numeric prefix using sort -t- -k1 -n |
| Hierarchical Writing | 158-185 | Iterates sorted pages, detecting and nesting subsections |
Sources: build-docs.sh:124-188
Algorithm Components
Step 1: Overview File Selection
The algorithm identifies a special overview file by searching for files without numeric prefixes. This file becomes the introduction page, written before the numbered sections.
Overview File Detection Algorithm
flowchart TD
Start["main_pages_list =\nls $WIKI_DIR/*.md"]
Filter["overview_file =\nawk -F/ '{print $NF}' |\ngrep -Ev '^[0-9]' |\nhead -1"]
Check{"overview_file\nnot empty?"}
Verify{"File exists?\n[ -f $WIKI_DIR/$overview_file ]"}
Extract["title=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')"]
Write["echo '[${title:-Overview}]($overview_file)'\necho ''"]
Remove["main_pages_list = grep -v $overview_file"]
Continue["Proceed to numeric sorting"]
Start --> Filter
Filter --> Check
Check -->|Yes|Verify
Check -->|No|Continue
Verify -->|Yes|Extract
Verify -->|No| Continue
Extract --> Write
Write --> Remove
Remove --> Continue
Sources: build-docs.sh:133-145
Detection Logic:
| Step | Command | Purpose | Example |
|---|---|---|---|
| List files | ls "$WIKI_DIR"/*.md | Get all root markdown files | Overview.md 1-intro.md 2-start.md |
| Extract basename | awk -F/ '{print $NF}' | Get filename only | Overview.md |
| Filter non-numeric | grep -Ev '^[0-9]' | Exclude numbered files | Overview.md (matches) |
| Take first | head -1 | Select single overview | Overview.md |
| Extract title | head -1 \| sed 's/^# //' | Get page title | Overview |
The overview file is then excluded from main_pages_list before numeric sorting (build-docs.sh:143).
Sources: build-docs.sh:133-145
Step 2: Numeric Sorting Pipeline
After overview extraction, remaining files are sorted numerically by their leading number prefix. This ensures pages appear in logical order (e.g., 2-start.md before 10-advanced.md).
Numeric Sorting Implementation
flowchart LR
Input["main_pages_list\n(filtered from overview)"]
Basename["awk -F/ '{print $NF}'\nExtract filename"]
GrepNum["grep -E '^[0-9]'\nKeep only numbered files"]
NumSort["sort -t- -k1 -n\nNumeric sort on first field"]
Reconstruct["while read fname; do\n echo $WIKI_DIR/$fname\ndone"]
Output["main_pages\n(sorted file paths)"]
Input --> Basename
Basename --> GrepNum
GrepNum --> NumSort
NumSort --> Reconstruct
Reconstruct --> Output
Sources: build-docs.sh:147-155
Sort Command Breakdown:
| Flag | Purpose | Example Effect |
|---|---|---|
| -t- | Set delimiter to - | Split 10-advanced.md into 10 and advanced.md |
| -k1 | Sort by field 1 | Use 10 as sort key |
| -n | Numeric comparison | 2 sorts before 10 (not lexicographic) |
Example Sorting:
| Unsorted Filenames | Numeric Sort | Output Order |
|---|---|---|
| 10-advanced.md | Extract 10 | 1-overview.md |
| 2-start.md | Extract 2 | 2-start.md |
| 1-overview.md | Extract 1 | 5-components.md |
| 5-components.md | Extract 5 | 10-advanced.md |
Sources: build-docs.sh:147-155
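The effect of these flags can be reproduced directly in a shell; the filenames below are illustrative:

```sh
# Numeric sort on the first '-'-delimited field, as used for main pages.
printf '%s\n' 10-advanced.md 2-start.md 1-overview.md 5-components.md \
  | sort -t- -k1 -n
# Output:
# 1-overview.md
# 2-start.md
# 5-components.md
# 10-advanced.md
```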
Step 3: Subsection Detection and Iteration
For each main page, the algorithm extracts the numeric prefix (section_num) and checks for a corresponding section-N/ directory. If found, all subsection files are written with 2-space indentation.
Subsection Detection and Writing Flow
flowchart TD
LoopStart["for file in main_pages\nLine 158"]
FileCheck["[ -f $file ] || continue"]
GetBasename["filename=$(basename $file)"]
ExtractTitle["title=$(head -1 $file | sed 's/^# //')\nLine 163"]
ExtractNum["section_num=$(echo $filename |\ngrep -oE '^[0-9]+' || true)\nLine 166"]
BuildPath["section_dir=$WIKI_DIR/section-$section_num\nLine 167"]
CheckBoth{"[ -n $section_num ] &&\n[ -d $section_dir ]"}
WriteMain["echo '- [$title]($filename)'\nLine 171"]
ListSubs["ls $section_dir/*.md 2>/dev/null |\nawk -F/ '{print $NF}' |\nsort -t- -k1 -n\nLines 174-175"]
SubLoop["while read subname\nLine 174"]
SubCheck["[ -f $section_dir/$subname ] || continue"]
SubBasename["subfilename=$(basename $subfile)"]
SubTitle["subtitle=$(head -1 $subfile | sed 's/^# //')\nLine 178"]
SubWrite["echo '  - [$subtitle](section-$section_num/$subfilename)'\nLine 179"]
WriteStandalone["echo '- [$title]($filename)'\nLine 183"]
NextFile["Continue loop"]
LoopStart --> FileCheck
FileCheck --> GetBasename
GetBasename --> ExtractTitle
ExtractTitle --> ExtractNum
ExtractNum --> BuildPath
BuildPath --> CheckBoth
CheckBoth -->|Yes: Has subsections|WriteMain
WriteMain --> ListSubs
ListSubs --> SubLoop
SubLoop --> SubCheck
SubCheck --> SubBasename
SubBasename --> SubTitle
SubTitle --> SubWrite
SubWrite --> SubLoop
SubLoop -->|Done|NextFile
CheckBoth -->|No: Standalone| WriteStandalone
WriteStandalone --> NextFile
NextFile --> LoopStart
Sources: build-docs.sh:158-185
Variable Mapping:
| Variable | Type | Purpose | Example Value |
|---|---|---|---|
| filename | String | Main page filename | 5-component-reference.md |
| title | String | Main page title | Component Reference |
| section_num | String | Numeric prefix | 5 |
| section_dir | Path | Subsection directory | /workspace/wiki/section-5 |
| subfilename | String | Subsection filename | 5.1-build-docs.md |
| subtitle | String | Subsection title | build-docs.sh Orchestrator |
Sources: build-docs.sh:158-185
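Assembled from the line references above, a condensed sketch of the loop; exact quoting and error handling in build-docs.sh may differ:

```sh
# For each sorted main page: emit its entry, then any numerically
# sorted subsections from the matching section-N/ directory.
echo "$main_pages" | while read file; do
  [ -f "$file" ] || continue
  filename=$(basename "$file")
  title=$(head -1 "$file" | sed 's/^# //')
  section_num=$(echo "$filename" | grep -oE '^[0-9]+' || true)
  section_dir="$WIKI_DIR/section-$section_num"
  if [ -n "$section_num" ] && [ -d "$section_dir" ]; then
    echo "- [$title]($filename)"
    ls "$section_dir"/*.md 2>/dev/null | awk -F/ '{print $NF}' | sort -t- -k1 -n |
      while read subname; do
        subtitle=$(head -1 "$section_dir/$subname" | sed 's/^# //')
        echo "  - [$subtitle](section-$section_num/$subname)"
      done
  else
    echo "- [$title]($filename)"
  fi
done
```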
Step 4: Subsection Numeric Sorting
Subsection files within section-N/ directories undergo the same numeric sorting as main pages, ensuring proper ordering (e.g., 5.2 before 5.10).
Subsection Sorting Pipeline
flowchart LR
Dir["section_dir/\nsection-5/"]
List["ls $section_dir/*.md 2>/dev/null"]
Awk["awk -F/ '{print $NF}'"]
Sort["sort -t- -k1 -n"]
Loop["while read subname"]
Verify["[ -f $subfile ] || continue"]
Extract["subtitle=$(head -1 | sed 's/^# //')"]
Write["echo '  - [$subtitle](section-N/$subfilename)'"]
Dir --> List
List --> Awk
Awk --> Sort
Sort --> Loop
Loop --> Verify
Verify --> Extract
Extract --> Write
Write --> Loop
Sources: build-docs.sh:174-180
Key Implementation Details:
| Aspect | Implementation | Purpose |
|---|---|---|
| Indentation | echo "  - [$subtitle](...)" | Two spaces for mdBook nesting |
| Path prefix | section-$section_num/$subfilename | Correct relative path for mdBook |
| Numeric sort | sort -t- -k1 -n | Same algorithm as main pages |
| Title extraction | head -1 \| sed 's/^# //' | Same method as main pages |
Sorting Example:
Input files:            After sort:             SUMMARY.md output:
section-5/              section-5/              - [Component Reference](5-component-reference.md)
├── 5.10-tools.md       ├── 5.1-build.md          - [build-docs.sh](section-5/5.1-build.md)
├── 5.2-scraper.md      ├── 5.2-scraper.md        - [Scraper](section-5/5.2-scraper.md)
└── 5.1-build.md        └── 5.10-tools.md         - [Tools](section-5/5.10-tools.md)
Sources: build-docs.sh:174-180
File Structure Conventions
The generation algorithm depends on the file structure created during markdown extraction (see Wiki Structure Discovery):
Diagram: File Structure Conventions for SUMMARY.md Generation
| Pattern | Location | SUMMARY.md Output |
|---|---|---|
| *.md | Root directory | Main pages |
| N-*.md | Root directory | Main section (if section-N/ exists) |
| *.md | section-N/ directory | Subsections (indented under section N) |
Sources: build-docs.sh:126-158
Title Extraction Method
All page titles are extracted using a consistent pattern:
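```sh
# The extraction pattern described below (see the build-docs.sh line references).
title=$(head -1 "$file" | sed 's/^# //')
```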
This assumes that every markdown file begins with a level-1 heading (# Title). The sed command removes the # prefix, leaving only the title text.
Extraction Pipeline:
| Command | Purpose | Example Input | Example Output |
|---|---|---|---|
| head -1 "$file" | Get first line | # Component Reference | # Component Reference |
| sed 's/^# //' | Remove heading syntax | # Component Reference | Component Reference |
Sources: build-docs.sh:120 build-docs.sh:134 build-docs.sh:150
Output Format
The generated SUMMARY.md follows mdBook’s syntax:
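An illustrative example (page and file names are hypothetical):

```markdown
# Summary

[Overview](overview.md)

- [Quick Start](2-quick-start.md)
- [Component Reference](5-component-reference.md)
  - [build-docs.sh Orchestrator](section-5/5.1-build-docs.md)
  - [deepwiki-scraper.py](section-5/5.2-scraper.md)
```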
Format Rules:
| Element | Syntax | Purpose |
|---|---|---|
| Header | # Summary | Required mdBook header |
| Introduction | [Title](file.md) | First page (no bullet) |
| Main Page | - [Title](file.md) | Top-level navigation item |
| Section Header | # Section Name | Visual grouping in sidebar |
| Subsection | - [Title](section-N/file.md) | Nested under main section (2-space indent) |
Sources: build-docs.sh:113-159
Implementation Code Mapping
The SUMMARY.md generation is implemented in a single contiguous block within build-docs.sh. The following diagram maps algorithm phases to specific line ranges:
Code Structure and Execution Flow
flowchart TD
Step3["Step 3 Comment\nLines 124-126"]
Header["Write Header Block\nLines 129-131\n{\n echo '# Summary'\n echo ''\n}"]
Overview["Overview Detection\nLines 133-145\nmain_pages_list=$(ls)\noverview_file=$(grep -Ev '^[0-9]')\nif [ -n $overview_file ]; then\n title=$(head -1)\n echo '[$title]($overview_file)'\n main_pages_list=$(grep -v $overview_file)\nfi"]
NumSort["Numeric Sort Block\nLines 147-155\nmain_pages=$(\n printf '%s' $main_pages_list\n | awk -F/ '{print $NF}' | grep -E '^[0-9]'\n | sort -t- -k1 -n | while read fname; do\n echo $WIKI_DIR/$fname\n done\n)"]
MainLoop["Main Page Loop\nLines 158-185\necho $main_pages | while read file; do\n filename=$(basename $file)\n title=$(head -1 | sed 's/^# //')\n section_num=$(grep -oE '^[0-9]+')\n section_dir=$WIKI_DIR/section-$section_num\n if [ -n $section_num ] && [ -d $section_dir ]; then\n ...\n fi\ndone"]
Subsection["Subsection Block\nLines 174-180\nls $section_dir/*.md\n | awk -F/ '{print $NF}' | sort -t- -k1 -n\n | while read subname; do\n subtitle=$(head -1)\n echo '  - [$subtitle](...)'\n done"]
Redirect["Output Redirection\nLine 186\n} > src/SUMMARY.md"]
Log["Entry Count Log\nLine 188\necho 'Generated SUMMARY.md with\n$(grep -c '\\[' src/SUMMARY.md)\nentries'"]
Step3 --> Header
Header --> Overview
Overview --> NumSort
NumSort --> MainLoop
MainLoop --> Subsection
Subsection --> MainLoop
MainLoop --> Redirect
Redirect --> Log
Sources: build-docs.sh:124-188
Shell Variable Reference:
| Variable | Scope | Type | Initialization | Example Value |
|---|---|---|---|---|
| WIKI_DIR | Global | Path | Line 28 | /workspace/wiki |
| main_pages_list | Local | String | Line 135 | Multi-line list of paths |
| overview_file | Local | String | Line 138 | Overview.md |
| main_pages | Local | String | Line 147 | Sorted, newline-separated paths |
| filename | Loop | String | Line 160 | 5-component-reference.md |
| title | Loop | String | Line 163 | Component Reference |
| section_num | Loop | String | Line 166 | 5 |
| section_dir | Loop | Path | Line 167 | /workspace/wiki/section-5 |
| subfilename | Nested Loop | String | Line 177 | 5.1-build-docs.md |
| subtitle | Nested Loop | String | Line 178 | build-docs.sh Orchestrator |
Sources: build-docs.sh:28 build-docs.sh:124-188
Generation Statistics and Output
After generation completes, the script logs statistical information about the generated SUMMARY.md file:
Entry Counting Logic
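```sh
# Reconstructed from build-docs.sh:188 as referenced below.
echo "Generated SUMMARY.md with $(grep -c '\[' src/SUMMARY.md) entries"
```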
Sources: build-docs.sh:188
The grep -c command counts lines containing the [ character, which appears in every markdown link. This count includes:

| Entry Type | Syntax | Contribution |
|---|---|---|
| Overview | [Title](file.md) | +1 |
| Main pages | - [Title](file.md) | +1 per main page |
| Subsections | - [Title](section-N/file.md) | +1 per subsection |
Example Output:
Generated SUMMARY.md with 23 entries
This indicates the book contains 23 total navigation links (overview + main pages + all subsections combined).
Sources: build-docs.sh:188
Integration with mdBook Build
The generated src/SUMMARY.md file is used directly by mdBook during the build process (Step 6):
- mdBook reads src/SUMMARY.md to determine book structure
- For each entry, mdBook looks up the corresponding markdown file in src/
- Files are processed in the order they appear in SUMMARY.md
- The navigation sidebar reflects the hierarchy defined in SUMMARY.md
The generation happens in build-docs.sh:108-161, markdown files are copied to src/ in build-docs.sh:166, and the mdBook build executes in build-docs.sh:176.
Sources: build-docs.sh:108-176
Template Injection
Relevant source files
Purpose and Scope
Template Injection is the process of inserting processed HTML header and footer content into each markdown file during Phase 3 of the build pipeline. This occurs after SUMMARY.md Generation and before the final mdBook build. The system reads HTML template files, performs variable substitution and conditional rendering, and prepends/appends the resulting HTML to every markdown file in the book structure.
For information about the template system architecture and customization options, see Template System and Template System Details. For the complete Phase 3 pipeline, see Phase 3: mdBook Build.
Template Processing Architecture
The template injection system consists of two stages: template processing (variable substitution and conditional evaluation) and content injection (inserting processed HTML into markdown files).
graph TB
subgraph "Input Sources"
HeaderTemplate["header.html\n/workspace/templates/"]
FooterTemplate["footer.html\n/workspace/templates/"]
EnvVars["Environment Variables\nREPO, BOOK_TITLE,\nGIT_REPO_URL, etc."]
end
subgraph "Template Processing"
ProcessScript["process-template.py"]
ParseVars["Parse Variables\nVAR=value args"]
ReadTemplate["Read Template File"]
ProcessConditionals["Process Conditionals\n{{#if VAR}}...{{/if}}"]
SubstituteVars["Substitute Variables\n{{VAR}}"]
StripComments["Strip HTML Comments"]
end
subgraph "Processed Output"
HeaderHTML["HEADER_HTML\nShell Variable"]
FooterHTML["FOOTER_HTML\nShell Variable"]
end
HeaderTemplate --> ProcessScript
FooterTemplate --> ProcessScript
EnvVars --> ParseVars
ParseVars --> ProcessScript
ProcessScript --> ReadTemplate
ReadTemplate --> ProcessConditionals
ProcessConditionals --> SubstituteVars
SubstituteVars --> StripComments
StripComments --> HeaderHTML
StripComments --> FooterHTML
Template Processing Flow
Sources: scripts/build-docs.sh:195-234 python/process-template.py:11-50
Template File Discovery
The system locates template files using configurable paths with sensible defaults. Template discovery follows a priority order that allows for custom template overrides.
| Configuration Variable | Default Value | Purpose |
|---|---|---|
| TEMPLATE_DIR | /workspace/templates | Base directory for template files |
| HEADER_TEMPLATE | $TEMPLATE_DIR/header.html | Path to header template |
| FOOTER_TEMPLATE | $TEMPLATE_DIR/footer.html | Path to footer template |
If a template file is not found, the system emits a warning and continues without that template component, allowing for header-only or footer-only configurations.
Sources: scripts/build-docs.sh:195-197
Variable Substitution Mechanism
The process_template function in process-template.py performs two types of text replacement: conditional blocks and simple variable substitution.
Simple Variable Substitution
Variables use the {{VARIABLE_NAME}} syntax and are replaced with their corresponding values passed as command-line arguments.
graph LR
Template["Template: {{REPO}}"]
Variables["Variables:\nREPO=owner/repo"]
ProcessTemplate["process_template()\npython/process-template.py:11-50"]
Result["Result: owner/repo"]
Template --> ProcessTemplate
Variables --> ProcessTemplate
ProcessTemplate --> Result
The substitution pattern r'\{\{(\w+)\}\}' matches variable placeholders, and the replace_variable function looks up values in the variables dictionary. If a variable is not found, an empty string is substituted.
Sources: python/process-template.py:38-45
Conditional Rendering
Conditional blocks use {{#if VARIABLE}}...{{/if}} syntax to conditionally include content based on variable presence and non-empty values.
graph TB
ConditionalPattern["Pattern:\n{{#if VAR}}...{{/if}}"]
CheckVar{"Variable exists\nAND non-empty?"}
IncludeContent["Include Content"]
RemoveBlock["Remove Entire Block"]
ConditionalPattern --> CheckVar
CheckVar -->|Yes| IncludeContent
CheckVar -->|No| RemoveBlock
The regex pattern r'\{\{#if\s+(\w+)\}\}(.*?)\{\{/if\}\}' captures both the variable name and the content block. The replace_conditional function evaluates the condition using if var_name in variables and variables[var_name].
Sources: python/process-template.py:24-36
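For illustration, a hypothetical template fragment:

```html
{{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}">View Source</a>{{/if}}
```

With GIT_REPO_URL set to a non-empty value, the anchor is emitted with the URL substituted; with the variable unset or empty, the entire block is removed.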
Available Template Variables
The following variables are passed to template processing and can be used in both header and footer templates:
| Variable | Source | Example Value |
|---|---|---|
| DEEPWIKI_URL | Constructed from REPO | https://deepwiki.com/owner/repo |
| DEEPWIKI_BADGE_URL | Static URL | https://deepwiki.com/badge.svg |
| GIT_REPO_URL | Environment or derived | https://github.com/owner/repo |
| GITHUB_BADGE_URL | Constructed from REPO | https://img.shields.io/badge/... |
| REPO | Environment or Git detection | owner/repo |
| BOOK_TITLE | Environment or default | Documentation |
| BOOK_AUTHORS | Environment or REPO_OWNER | owner |
| GENERATION_DATE | date -u command | January 15, 2024 at 14:30 UTC |
Sources: scripts/build-docs.sh:199-213 scripts/build-docs.sh:221-230
Template Processing Invocation
The shell script invokes process-template.py twice: once for the header and once for the footer. Each invocation passes all variables as command-line arguments in KEY=value format.
Header Processing
The processed HTML is captured in the HEADER_HTML shell variable for later injection.
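A minimal sketch of the invocation, assuming the template path is the first positional argument and showing only a subset of the variables (the date format string is illustrative):

```sh
HEADER_HTML=$(python3 /usr/local/bin/process-template.py "$HEADER_TEMPLATE" \
  REPO="$REPO" \
  BOOK_TITLE="$BOOK_TITLE" \
  GIT_REPO_URL="$GIT_REPO_URL" \
  GENERATION_DATE="$(date -u '+%B %d, %Y at %H:%M UTC')")
```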
Sources: scripts/build-docs.sh:202-213
Footer Processing
Footer processing follows the same pattern but stores the result in FOOTER_HTML.
Sources: scripts/build-docs.sh:219-230
Markdown File Injection
After templates are processed, the system injects the resulting HTML into all markdown files in the book structure. Injection occurs by creating a temporary file with concatenated content and then replacing the original.
Injection Algorithm
graph TB
Start["Start Injection"]
CheckTemplates{"HEADER_HTML or\nFOOTER_HTML set?"}
SkipInjection["Skip injection\nCopy files as-is"]
FindFiles["Find all .md files\nsrc/*.md src/*/*.md"]
ProcessFile["For each file"]
ReadOriginal["Read original content"]
CreateTemp["Create temp file:\nheader + content + footer"]
ReplaceOriginal["mv temp to original"]
CountFiles["Increment file_count"]
ReportCount["Report processed count"]
Start --> CheckTemplates
CheckTemplates -->|No| SkipInjection
CheckTemplates -->|Yes| FindFiles
FindFiles --> ProcessFile
ProcessFile --> ReadOriginal
ReadOriginal --> CreateTemp
CreateTemp --> ReplaceOriginal
ReplaceOriginal --> CountFiles
CountFiles --> ProcessFile
ProcessFile --> ReportCount
Sources: scripts/build-docs.sh:236-261
File Processing Pattern
The injection loop processes files matching the glob patterns src/*.md and src/*/*.md, covering both root-level pages and subsection pages.
The temporary file approach ensures atomic writes and prevents partial content if processing is interrupted.
Sources: scripts/build-docs.sh:243-257
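A condensed sketch of the loop, assuming HEADER_HTML and FOOTER_HTML are already populated; the exact concatenation and temp-file naming in scripts/build-docs.sh:243-257 may differ:

```sh
file_count=0
for file in src/*.md src/*/*.md; do
  [ -f "$file" ] || continue
  {
    printf '%s\n\n' "$HEADER_HTML"   # prepend processed header
    cat "$file"                      # original page content
    printf '\n%s\n' "$FOOTER_HTML"   # append processed footer
  } > "$file.tmp"
  mv "$file.tmp" "$file"             # atomic replace of the original
  file_count=$((file_count + 1))
done
echo "Processed $file_count files with templates"
```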
Template Injection Sequence
Template injection occurs at a specific point in the Phase 3 pipeline, after markdown files are copied but before the mdBook build.
graph LR
CopyFiles["Copy markdown files\nto src/"]
ProcessHeader["Process header.html\n→ HEADER_HTML"]
ProcessFooter["Process footer.html\n→ FOOTER_HTML"]
InjectLoop["Inject into each\n.md file"]
InstallMermaid["Install mdbook-mermaid"]
BuildBook["mdbook build"]
CopyFiles --> ProcessHeader
ProcessHeader --> ProcessFooter
ProcessFooter --> InjectLoop
InjectLoop --> InstallMermaid
InstallMermaid --> BuildBook
This ordering ensures that:
- SUMMARY.md is generated with original titles before injection
- All markdown files exist in their final locations
- Templates are processed once and reused for all files
- mdBook receives fully-decorated markdown files
Sources: scripts/build-docs.sh:190-271
Example Template Structure
The default header template demonstrates the use of variables and conditionals:
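An illustrative fragment, not the verbatim contents of templates/header.html; it uses only variables documented above:

```html
<div style="text-align: right">{{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}"><img src="{{GITHUB_BADGE_URL}}" alt="GitHub"></a>{{/if}} <a href="{{DEEPWIKI_URL}}"><img src="{{DEEPWIKI_BADGE_URL}}" alt="DeepWiki"></a></div>
```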
Key observations:
- Conditional wrapping prevents broken links when GIT_REPO_URL is unset
- Inline conditionals prevent mdBook from wrapping content in <p> tags
- Style attributes provide layout control within markdown constraints
Sources: templates/header.html:1-8
Skipping Template Injection
Template injection can be effectively disabled by:
- Not providing template files at the expected paths
- Setting HEADER_TEMPLATE or FOOTER_TEMPLATE to non-existent paths
- Mounting empty template files via volume mounts
When templates are not found, the system emits warnings but continues:
Warning: Header template not found at /workspace/templates/header.html, skipping...
Warning: Footer template not found at /workspace/templates/footer.html, skipping...
Files are then copied without modification using the fallback path.
Sources: scripts/build-docs.sh:214-217 scripts/build-docs.sh:232-234
Template Processing Error Handling
The process-template.py script performs validation at startup. Error conditions result in non-zero exit codes, which cause the shell script to abort due to the set -e directive on line 2 of build-docs.sh.
Sources: python/process-template.py:53-78 scripts/build-docs.sh:2
Performance Characteristics
Template processing and injection have the following performance characteristics:
| Operation | Complexity | Typical Duration |
|---|---|---|
| Template file read | O(1) | < 1ms per file |
| Variable substitution | O(n × m) where n=template size, m=variables | < 5ms per template |
| Conditional evaluation | O(n × c) where c=conditional count | < 5ms per template |
| Per-file injection | O(f) where f=file size | < 10ms per file |
| Total injection time | O(files × file_size) | ~100-500ms for typical wikis |
The system reports processing time via the file count message: "Processed N files with templates".
Sources: scripts/build-docs.sh:258
Integration with mdBook
The injected HTML content becomes part of the markdown source that mdBook processes. mdBook’s HTML renderer:
- Preserves raw HTML blocks in markdown
- Applies syntax highlighting to code blocks
- Processes Mermaid diagrams via the mdbook-mermaid preprocessor
- Wraps content in the theme's page structure
This allows the templates to provide page-level customization that appears consistently across all pages while leveraging mdBook’s built-in features for navigation, search, and responsive layout.
Sources: scripts/build-docs.sh:264-271
CI/CD Integration
Relevant source files
Purpose and Scope
This page provides an overview of the continuous integration and continuous deployment (CI/CD) infrastructure for the DeepWiki-to-mdBook system. The system includes three distinct integration patterns:
- Automated Build and Deploy Workflow - Weekly scheduled builds with manual trigger support (#9.1)
- Continuous Testing Workflow - Automated Python tests on every push and pull request (#9.2)
- Reusable GitHub Action - Packaged action for use in other repositories (#9.3)
All three patterns share the same underlying Docker infrastructure documented in Docker Multi-Stage Build, ensuring consistency between local development, testing, and production deployments.
For information about the Docker container architecture itself, see Docker Multi-Stage Build. For configuration options, see Configuration Reference.
CI/CD Architecture Overview
The CI/CD system consists of three independently triggered workflows that share a common Docker image but serve different purposes in the development lifecycle.
Diagram: CI/CD Trigger and Execution Patterns
graph TB
subgraph "Trigger Sources"
Schedule["Weekly Schedule\nSundays 00:00 UTC"]
ManualDispatch["Manual Dispatch\nworkflow_dispatch"]
Push["Push to main"]
PullRequest["Pull Request"]
ExternalRepo["External Repository\nuses action"]
end
subgraph "Build and Deploy Workflow"
BuildJob["build job\n.github/workflows/build-and-deploy.yml"]
ResolveStep["Resolve repository\nand book title"]
BuildxSetup["Setup Docker Buildx"]
DockerBuild1["Build deepwiki-to-mdbook\nDocker image"]
RunContainer1["Run docker container\nwith env vars"]
UploadArtifact["Upload Pages artifact\n./output/book"]
DeployJob["deploy job\nneeds: build"]
DeployPages["Deploy to GitHub Pages"]
end
subgraph "Test Workflow"
PytestJob["pytest job\n.github/workflows/tests.yml"]
SetupPython["Setup Python 3.12"]
InstallDeps["Install requirements.txt\nand pytest"]
RunTests["Run pytest python/tests/"]
end
subgraph "Reusable Action"
ActionDef["action.yml\ncomposite action"]
BuildImage["Build Docker image\nwith GITHUB_RUN_ID tag"]
RunContainer2["Run docker container\nwith input parameters"]
end
Schedule --> BuildJob
ManualDispatch --> BuildJob
Push --> PytestJob
PullRequest --> PytestJob
ExternalRepo --> ActionDef
BuildJob --> ResolveStep
ResolveStep --> BuildxSetup
BuildxSetup --> DockerBuild1
DockerBuild1 --> RunContainer1
RunContainer1 --> UploadArtifact
UploadArtifact --> DeployJob
DeployJob --> DeployPages
PytestJob --> SetupPython
SetupPython --> InstallDeps
InstallDeps --> RunTests
ActionDef --> BuildImage
BuildImage --> RunContainer2
Sources: .github/workflows/build-and-deploy.yml:1-90 .github/workflows/tests.yml:1-26 action.yml:1-53
Workflow Triggers and Scheduling
The system employs multiple trigger mechanisms to balance automation, manual control, and external integration:
| Trigger Type | Workflow | Configuration | Purpose |
|---|---|---|---|
| schedule | Build and Deploy | cron: "0 0 * * 0" | Weekly documentation refresh on Sundays at midnight UTC |
| workflow_dispatch | Build and Deploy | Manual inputs supported | On-demand builds with custom parameters |
| push | Tests | Branch: main | Validate changes merged to main branch |
| pull_request | Tests | All PRs | Pre-merge validation of proposed changes |
| uses | GitHub Action | N/A | External repositories invoke the action |
Sources: .github/workflows/build-and-deploy.yml:3-9 .github/workflows/tests.yml:3-8
Shared Docker Infrastructure
All three CI/CD patterns build and execute the same Docker image, ensuring consistent behavior across environments. The image build process follows this pattern:
Diagram: Docker Image Build and Execution Flow
graph LR
subgraph "Source Files"
Dockerfile["Dockerfile\nMulti-stage build"]
PythonScripts["Python scripts\ndeepwiki-scraper.py\nprocess-template.py"]
ShellScripts["Shell scripts\nbuild-docs.sh"]
Templates["templates/\nheader.html, footer.html"]
end
subgraph "Build Stage"
BuildCommand["docker build -t\ndeepwiki-to-mdbook"]
RustStage["Stage 1: rust:latest\nBuild mdBook binaries"]
PythonStage["Stage 2: python:3.12-slim\nInstall Python deps\nCopy executables"]
end
subgraph "Execution"
RunCommand["docker run --rm"]
EnvVars["Environment variables\nREPO, BOOK_TITLE, etc."]
VolumeMount["-v output:/output"]
Entrypoint["CMD: build-docs.sh"]
end
subgraph "Output"
OutputDir["./output/book/\nHTML documentation"]
MarkdownDir["./output/markdown/\nEnhanced markdown"]
end
Dockerfile --> BuildCommand
PythonScripts --> BuildCommand
ShellScripts --> BuildCommand
Templates --> BuildCommand
BuildCommand --> RustStage
RustStage --> PythonStage
PythonStage --> RunCommand
EnvVars --> RunCommand
VolumeMount --> RunCommand
RunCommand --> Entrypoint
Entrypoint --> OutputDir
Entrypoint --> MarkdownDir
Sources: .github/workflows/build-and-deploy.yml:60-72 action.yml:30-52
Permissions and Concurrency Control
The Build and Deploy workflow requires elevated permissions for GitHub Pages deployment, while the Test workflow operates with default read permissions.
Build and Deploy Permissions
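The workflow declares the following permissions (a reconstruction; the same values are tabulated under Build and Deploy Workflow):

```yaml
permissions:
  contents: read
  pages: write
  id-token: write
```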
Concurrency Management
The Build and Deploy workflow enforces single-instance execution to prevent deployment conflicts:
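```yaml
# Reconstructed from the behavior described below.
concurrency:
  group: pages
  cancel-in-progress: false
```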
This configuration ensures that if a build is already running, subsequent triggers will wait rather than canceling the in-progress deployment.
Sources: .github/workflows/build-and-deploy.yml:11-20
Environment Variable Passing
All CI/CD patterns pass configuration to the Docker container via environment variables. The Build and Deploy workflow includes a resolution step that provides defaults when manual inputs are not specified:
Diagram: Environment Variable Resolution Flow
graph TD
subgraph "Input Resolution"
ManualInput["workflow_dispatch inputs\nrepo, book_title"]
CheckRepo{"repo input\nprovided?"}
CheckTitle{"book_title input\nprovided?"}
UseInputRepo["Use manual input"]
UseGitHubRepo["Use github.repository"]
UseInputTitle["Use manual input"]
GenTitle["Generate title from\nrepository name"]
end
subgraph "Docker Execution"
RepoVar["-e REPO"]
TitleVar["-e BOOK_TITLE"]
DockerRun["docker run deepwiki-to-mdbook"]
end
subgraph "build-docs.sh"
ReadEnv["Read environment variables"]
FetchWiki["Fetch from DeepWiki"]
BuildBook["Build mdBook"]
end
ManualInput --> CheckRepo
CheckRepo -->|Yes| UseInputRepo
CheckRepo -->|No| UseGitHubRepo
ManualInput --> CheckTitle
CheckTitle -->|Yes| UseInputTitle
CheckTitle -->|No| GenTitle
UseInputRepo --> RepoVar
UseGitHubRepo --> RepoVar
UseInputTitle --> TitleVar
GenTitle --> TitleVar
RepoVar --> DockerRun
TitleVar --> DockerRun
DockerRun --> ReadEnv
ReadEnv --> FetchWiki
FetchWiki --> BuildBook
Sources: .github/workflows/build-and-deploy.yml:30-58 .github/workflows/build-and-deploy.yml:66-72
Job Dependencies and Artifact Flow
The Build and Deploy workflow uses a two-job structure with explicit dependency ordering:
- Build Job - Executes Docker container, uploads Pages artifact
- Deploy Job - Depends on build completion, deploys to GitHub Pages
The jobs communicate via the GitHub Actions artifact system:
| Job | Step | Artifact Operation | Path |
|---|---|---|---|
| build | Upload artifact | actions/upload-pages-artifact@v3 | ./output/book |
| deploy | Deploy to Pages | actions/deploy-pages@v4 | Artifact from build |
Sources: .github/workflows/build-and-deploy.yml:74-89
Integration Points Summary
The following table summarizes how external systems interact with the CI/CD infrastructure:
| Integration Point | Method | Configuration | Output |
|---|---|---|---|
| DeepWiki API | HTTP scraping during build | REPO environment variable | Markdown content |
| GitHub Pages | Artifact deployment | actions/deploy-pages@v4 | Hosted HTML site |
| External Repositories | GitHub Action | uses: jzombie/deepwiki-to-mdbook@main | Local output directory |
| Git Metadata | Auto-detection in container | Falls back to git remote -v | Repository URL |
Sources: .github/workflows/build-and-deploy.yml:1-90 action.yml:1-53
Related Documentation
- Build and Deploy Workflow - Detailed coverage of the weekly build workflow
- Test Workflow - Python test execution and validation
- GitHub Action - Using the action in external repositories
- Configuration Reference - Complete environment variable documentation
- Docker Multi-Stage Build - Container architecture details
Build and Deploy Workflow
Relevant source files
Purpose and Scope
This page documents the Build and Deploy Workflow (.github/workflows/build-and-deploy.yml), which automates the process of building documentation from DeepWiki content and deploying it to GitHub Pages. The workflow runs on a weekly schedule and supports manual triggers with configurable parameters.
For information about testing the system’s Python components, see Test Workflow. For using this system in other repositories via the reusable action, see GitHub Action.
Workflow Overview
The Build and Deploy Workflow consists of two sequential jobs: build and deploy. The build job constructs the Docker image, generates the documentation artifacts, and uploads them. The deploy job then publishes these artifacts to GitHub Pages.
Sources: .github/workflows/build-and-deploy.yml:1-90
graph TB
subgraph "Trigger Mechanisms"
schedule["schedule\ncron: '0 0 * * 0'\n(Weekly Sunday 00:00 UTC)"]
manual["workflow_dispatch\n(Manual trigger)"]
end
subgraph "Build Job"
checkout["actions/checkout@v4\nClone repository"]
resolve["Resolve repository and book title\nCustom shell script"]
buildx["docker/setup-buildx-action@v3\nSetup Docker Buildx"]
dockerbuild["docker build -t deepwiki-to-mdbook ."]
dockerrun["docker run\nGenerate documentation"]
upload["actions/upload-pages-artifact@v3\nUpload ./output/book"]
end
subgraph "Deploy Job"
deploy["actions/deploy-pages@v4\nPublish to GitHub Pages"]
end
schedule --> checkout
manual --> checkout
checkout --> resolve
resolve --> buildx
buildx --> dockerbuild
dockerbuild --> dockerrun
dockerrun --> upload
upload --> deploy
deploy --> pages["GitHub Pages\ngithub.io site"]
Trigger Configuration
Schedule-Based Execution
The workflow executes automatically on a weekly basis using a cron schedule:
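```yaml
# Reconstructed from the schedule described in this section.
on:
  schedule:
    - cron: "0 0 * * 0"
```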
This runs every Sunday at midnight UTC, ensuring the documentation stays synchronized with the latest DeepWiki content.
Manual Dispatch
The workflow_dispatch trigger allows on-demand execution through the GitHub Actions UI. Unlike the scheduled run, manual triggers can accept custom inputs for repository and book title configuration.
Sources: .github/workflows/build-and-deploy.yml:3-9
Permissions and Concurrency
GitHub Token Permissions
The workflow requires specific permissions for GitHub Pages deployment:
| Permission | Level | Purpose |
|---|---|---|
| contents | read | Access repository code for building |
| pages | write | Deploy artifacts to GitHub Pages |
| id-token | write | OIDC token for secure deployment |
Concurrency Control
The concurrency configuration ensures only one deployment runs at a time using the pages group. Setting cancel-in-progress: false means new deployments wait for running deployments to complete rather than canceling them.
Sources: .github/workflows/build-and-deploy.yml:11-20
Build Job Architecture
Sources: .github/workflows/build-and-deploy.yml:23-78
Step 1: Repository Checkout
The workflow begins by checking out the repository using actions/checkout@v4:
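```yaml
- uses: actions/checkout@v4
```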
This provides access to the Dockerfile and all scripts needed for the build process.
Sources: .github/workflows/build-and-deploy.yml:27-28
Step 2: Repository and Title Resolution
The resolution step implements fallback logic for configuring the documentation build:
Repository Resolution:
- Check if github.event.inputs.repo was provided (manual trigger)
- If not provided, use github.repository (current repository)
- Store result in repo_value output
Title Resolution:
- Check if github.event.inputs.book_title was provided
- If not provided, extract the repository name and append " Documentation"
- If extraction fails, use "Documentation" as fallback
- Store result in title_value output
The resolved values are written to GITHUB_OUTPUT for use in subsequent steps:
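```sh
# Illustrative only; the shell variable names in the actual step may differ.
echo "repo_value=$RESOLVED_REPO" >> "$GITHUB_OUTPUT"
echo "title_value=$RESOLVED_TITLE" >> "$GITHUB_OUTPUT"
```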
Sources: .github/workflows/build-and-deploy.yml:30-58
Step 3: Docker Buildx Setup
The workflow uses Docker Buildx for enhanced build capabilities:
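```yaml
- uses: docker/setup-buildx-action@v3
```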
This provides BuildKit features, including improved caching and multi-platform builds (though this workflow builds for a single platform).
Sources: .github/workflows/build-and-deploy.yml:60-61
Step 4: Docker Image Build
The Docker image is built using the Dockerfile in the repository root:
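```sh
docker build -t deepwiki-to-mdbook .
```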
This creates the deepwiki-to-mdbook image containing:
- mdBook and mdbook-mermaid binaries
- Python 3.12 runtime
- deepwiki-scraper.py and process-template.py scripts
- build-docs.sh orchestrator
- Default templates
For details on the Docker image structure, see Docker Multi-Stage Build.
Sources: .github/workflows/build-and-deploy.yml:63-64
Step 5: Documentation Builder Execution
The Docker container is executed with environment variables and a volume mount:
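A reconstruction based on the container configuration table below:

```sh
docker run --rm \
  -e REPO="${{ steps.resolve.outputs.repo_value }}" \
  -e BOOK_TITLE="${{ steps.resolve.outputs.title_value }}" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```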
Container Configuration:
| Component | Value | Purpose |
|---|---|---|
| --rm | Flag | Remove container after execution |
| -e REPO | Resolved repository | Configure documentation source |
| -e BOOK_TITLE | Resolved title | Set book metadata |
| -v $(pwd)/output:/output | Volume mount | Persist generated artifacts |
| Image | deepwiki-to-mdbook | Previously built image |
The container runs build-docs.sh (default CMD), which orchestrates the three-phase pipeline:
- Phase 1: Extract markdown from DeepWiki (see Markdown Extraction)
- Phase 2: Enhance with diagrams (see Diagram Enhancement)
- Phase 3: Build mdBook artifacts (see mdBook Build)
Output is written to $(pwd)/output, which maps to /output inside the container.
Sources: .github/workflows/build-and-deploy.yml:66-72
Step 6: GitHub Pages Artifact Upload
The generated book directory is uploaded as a GitHub Pages artifact:
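```yaml
- uses: actions/upload-pages-artifact@v3
  with:
    path: ./output/book
```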
The actions/upload-pages-artifact action packages ./output/book into a tarball optimized for Pages deployment. This artifact is then available to the deploy job.
Sources: .github/workflows/build-and-deploy.yml:74-77
Deploy Job Architecture
Sources: .github/workflows/build-and-deploy.yml:79-90
Job Dependencies and Environment
The deploy job has a strict dependency on the build job:
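A sketch of the job declaration; the step id referenced by url is an assumption consistent with the page_url output described below:

```yaml
deploy:
  needs: build
  environment:
    name: github-pages
    url: ${{ steps.deployment.outputs.page_url }}
```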
This ensures the deploy job only runs after the build job successfully completes.
The environment configuration:
- name: github-pages associates the deployment with the GitHub Pages environment
- url sets the environment URL to the deployed site URL from the deployment step
Deployment Step
The deployment uses the official GitHub Pages deployment action:
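```yaml
# The step id "deployment" is an assumption; see the environment url above.
- id: deployment
  uses: actions/deploy-pages@v4
```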
The deploy-pages action:
- Downloads the artifact uploaded by the build job
- Unpacks the tarball
- Publishes content to GitHub Pages
- Returns the deployed site URL in the page_url output
Sources: .github/workflows/build-and-deploy.yml:79-89
Workflow Execution Flow
Sources: .github/workflows/build-and-deploy.yml:1-90
Environment Variables and Outputs
Step Outputs
The resolve step produces two outputs accessible to subsequent steps:
| Output | ID Path | Description |
|---|---|---|
| repo_value | steps.resolve.outputs.repo_value | Resolved repository (input or default) |
| title_value | steps.resolve.outputs.title_value | Resolved book title (input or generated) |
These outputs are consumed by the Docker run command via environment variables.
Docker Container Environment
The container receives configuration through environment variables:
| Variable | Source | Example Value |
|---|---|---|
| REPO | steps.resolve.outputs.repo_value | jzombie/deepwiki-to-mdbook |
| BOOK_TITLE | steps.resolve.outputs.title_value | deepwiki-to-mdbook Documentation |
Additional environment variables supported by the container (not set by this workflow) include:
- BOOK_AUTHORS: Author metadata
- BOOK_LANGUAGE: Language code (default: en)
- BOOK_SRC: Source directory (default: src)
- MARKDOWN_ONLY: Skip HTML build if set
For a complete list, see Configuration Reference.
Sources: .github/workflows/build-and-deploy.yml:30-72
Workflow Status and Monitoring
Build Job Success Criteria
The build job succeeds when:
- Repository checkout completes
- Resolution logic produces valid outputs
- Docker image builds without errors
- Container execution exits with code 0
- Artifact upload completes
If any step fails, the build job terminates and the deploy job does not run.
Deploy Job Success Criteria
The deploy job succeeds when:
- Build job completed successfully
- Artifact download succeeds
- GitHub Pages deployment completes
- Deployment URL is accessible
Monitoring and Debugging
Workflow Execution:
- View workflow runs in the Actions tab: https://github.com/{owner}/{repo}/actions
- Each run shows both job statuses and step-by-step logs
Common Failure Points:
| Failure Location | Possible Cause | Resolution |
|---|---|---|
| Docker build | Dockerfile syntax error | Check Dockerfile changes in failed commit |
| Docker run | Missing dependencies | Review Python requirements and mdBook installation |
| Artifact upload | Output directory empty | Check build-docs.sh execution logs |
| Deploy | Permissions issue | Verify Pages write permission in workflow |
Sources: .github/workflows/build-and-deploy.yml:1-90
Integration with Other Components
Relationship to Test Workflow
The Build and Deploy Workflow does not include test execution. Testing is handled by a separate workflow that runs on push and PR events. See Test Workflow for details.
Relationship to GitHub Action
This workflow uses the same Docker image and build process as the reusable GitHub Action. The key differences:
| Aspect | Build and Deploy Workflow | GitHub Action |
|---|---|---|
| Trigger | Schedule + manual | Called by other workflows |
| Output | Deployed to Pages | Artifact or caller-specified location |
| Configuration | Hardcoded to this repo | Parameterized for any repo |
For using this system in other repositories, see GitHub Action.
Docker Image Reuse
The workflow builds the Docker image fresh on each run. This ensures:
- Latest code changes are included
- Dependencies are up to date
- Reproducible builds from source
The image is not pushed to a registry; it exists only during workflow execution.
Sources: .github/workflows/build-and-deploy.yml:1-90
Deployment Target Configuration
GitHub Pages Environment
GitHub Pages deployment requires repository configuration:
- Pages Source: Set to “GitHub Actions” in repository settings
- Branch: Not applicable (artifact-based deployment)
- Custom Domain: Optional; configure in repository settings
The deployment uses the github-pages environment, which provides:
- Protection rules (if configured)
- Deployment history
- Environment-specific secrets (if needed)
URL Structure
The deployed documentation is accessible at:
https://{owner}.github.io/{repo}/
For custom domains, the URL follows the custom domain configuration.
Sources: .github/workflows/build-and-deploy.yml:80-82
Workflow Customization
Modifying Schedule
To change the execution frequency, edit the cron expression:
Examples:
- Daily: "0 0 * * *"
- Twice weekly: Add another cron entry
- Monthly: "0 0 1 * *"
Adding Manual Inputs
While the current workflow supports manual trigger, it does not expose inputs in the workflow_dispatch section. To add configurable inputs, modify the trigger:
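A sketch of such a trigger; the input descriptions are illustrative:

```yaml
on:
  workflow_dispatch:
    inputs:
      repo:
        description: "Repository to document (owner/repo)"
        required: false
      book_title:
        description: "Title for the generated book"
        required: false
```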
The resolution step already handles these inputs via github.event.inputs.*.
Sources: .github/workflows/build-and-deploy.yml:8-9
Performance Characteristics
Typical Execution Time
| Phase | Duration | Notes |
|---|---|---|
| Docker build | 2-4 minutes | Includes Rust compilation of mdBook |
| Documentation generation | 1-3 minutes | Depends on wiki size |
| Artifact upload | 10-30 seconds | Depends on output size |
| Deployment | 30-60 seconds | GitHub Pages processing |
| Total | 4-8 minutes | End-to-end workflow |
Resource Usage
The workflow runs on ubuntu-latest runners:
- CPU: 2 cores
- RAM: 7 GB
- Disk: 14 GB SSD
Docker build and Python scraping are the most resource-intensive operations.
Optimization Opportunities
- Docker Layer Caching: The workflow could use actions/cache to cache Docker layers between runs
- Build Artifact Reuse: Skip documentation rebuild if wiki content hasn't changed
- Parallel Jobs: Split build and test into parallel jobs (though this workflow has no tests)
Sources: .github/workflows/build-and-deploy.yml:1-90
Test Workflow
Relevant source files
Purpose and Scope
This document describes the continuous integration testing workflow that validates code quality on every push to the main branch and on all pull requests. The workflow executes Python unit tests to ensure that core components function correctly before code is merged.
For information about the production build and deployment workflow, see Build and Deploy Workflow. For details on using the system as a reusable GitHub Action, see GitHub Action. For instructions on running tests locally during development, see Running Tests.
Workflow Configuration
The test workflow is defined in .github/workflows/tests.yml:1-26. It uses GitHub Actions to automatically execute the test suite in a clean Ubuntu environment.
Trigger Events
The workflow activates on two event types:
| Event Type | Trigger Condition | Purpose |
|---|---|---|
| push | Commits to main branch | Validate merged changes |
| pull_request | Any PR created or updated | Block merge of failing code |
Sources: .github/workflows/tests.yml:3-7
Workflow Trigger and Outcome Logic
Sources: .github/workflows/tests.yml:3-7
Job Structure
The workflow contains a single job named pytest that runs on ubuntu-latest. The job executes five sequential steps to set up the environment and run tests.
pytest Job Execution Flow
graph TB
subgraph "pytest Job Steps"
S1["Step 1: Checkout\nactions/checkout@v4"]
S2["Step 2: Setup Python\nactions/setup-python@v5\npython-version: 3.12"]
S3["Step 3: Install Dependencies\npip install -r python/requirements.txt\npip install pytest"]
S4["Step 4: Run pytest\npytest python/tests/ -s"]
end
S1 --> S2
S2 --> S3
S3 --> S4
Sources: .github/workflows/tests.yml:10-25
Step-by-Step Breakdown
Step 1: Repository Checkout
Uses actions/checkout@v4 to clone the repository at the commit being tested. For pull requests, this checks out the merge commit to test the exact code that would be merged.
Sources: .github/workflows/tests.yml:13-14
Step 2: Python Environment Setup
Configures Python 3.12 using actions/setup-python@v5. This matches the Python version used in the Docker container (Dockerfile:26), ensuring consistency between local development, CI testing, and production execution.
Sources: .github/workflows/tests.yml:15-18
Step 3: Dependency Installation
Installs required packages in three phases:
- Upgrades pip to the latest version
- Installs project dependencies from python/requirements.txt
- Installs the pytest testing framework
Sources: .github/workflows/tests.yml:19-23
Step 4: Test Execution
Runs pytest against the python/tests/ directory with the -s flag to show print statements. This executes all test modules discovered by pytest.
Sources: .github/workflows/tests.yml:24-25
Test Coverage
The workflow executes three test suites that validate core system components:
| Test Module | Component Tested | Key Functions Validated |
|---|---|---|
| test_template_processor.py | Template System | process_template, variable substitution, conditional rendering |
| test_mermaid_normalization.py | Diagram Processing | Seven-step normalization pipeline, Mermaid 11 compatibility |
| test_numbering.py | Path Resolution | Page numbering scheme, filename generation, path calculations |
Test Module Coverage and Component Mapping
Sources: .github/workflows/tests.yml:24-25 scripts/run-tests.sh:11-30
Template Processor Tests
Validates the process_template function in python/process-template.py. Tests verify:
- Variable substitution ({{REPO}}, {{BOOK_TITLE}}, etc.)
- Conditional rendering based on variable presence
- Special character handling in template values
- Fallback behavior when variables are undefined
Sources: scripts/run-tests.sh:11
Mermaid Normalization Tests
Tests the seven-step normalization pipeline in python/deepwiki-scraper.py. Validates:
- Edge label flattening for multiline labels
- State description syntax fixes
- Flowchart node cleanup
- Statement separator insertion
- Empty label fallback generation
- Gantt chart task ID synthesis
For detailed information on the normalization pipeline, see Mermaid Normalization.
Sources: scripts/run-tests.sh:22
Numbering Tests
Verifies the page numbering and path resolution logic in python/deepwiki-scraper.py. Tests confirm:
- Correct filename generation from hierarchical numbers (e.g., 9.2 → 9.2_test-workflow.md)
- Number parsing and validation
- Path calculations relative to markdown directory
- Numeric sorting behavior for SUMMARY.md generation
For detailed information on the numbering system, see Numbering and Path Resolution.
Sources: scripts/run-tests.sh:30
Integration with CI/CD Pipeline
The test workflow serves as a quality gate in the development process. Its relationship to other CI/CD components is shown below:
CI/CD Pipeline Integration and Quality Gate
graph LR
subgraph "Development Flow"
Dev["Developer\nCreates PR"]
Review["Code Review\nProcess"]
Merge["Merge to main"]
end
subgraph "Test Workflow"
TestTrigger["tests.yml\non: pull_request"]
PytestJob["pytest job"]
TestResult{{"Test Result"}}
end
subgraph "Build Workflow"
BuildTrigger["build-and-deploy.yml\non: push to main"]
BuildJob["Build & Deploy"]
end
Dev --> TestTrigger
TestTrigger --> PytestJob
PytestJob --> TestResult
TestResult -->|Pass| Review
TestResult -->|Fail| Dev
Review --> Merge
Merge --> BuildTrigger
BuildTrigger --> BuildJob
Sources: .github/workflows/tests.yml:3-7
The workflow prevents code with failing tests from being merged, ensuring that the weekly build-and-deploy workflow (Build and Deploy Workflow) only processes validated code.
Local Test Execution
Developers can run the same tests locally using the scripts/run-tests.sh script. This script provides an alternative execution path that doesn’t require GitHub Actions:
Dual Execution Paths for Test Suite
graph TB
subgraph "GitHub Actions Execution"
GHA["tests.yml workflow"]
GHAEnv["ubuntu-latest\nPython 3.12\nClean environment"]
GHATest["pytest python/tests/ -s"]
end
subgraph "Local Execution"
Local["scripts/run-tests.sh"]
LocalEnv["Developer machine\nAny Python 3.x\nCurrent environment"]
LocalTest["python3 -m pytest\nOR\npython3 test_*.py"]
end
subgraph "Shared Test Suite"
Tests["python/tests/\ntest_template_processor.py\ntest_mermaid_normalization.py\ntest_numbering.py"]
end
GHA --> GHAEnv
GHAEnv --> GHATest
GHATest --> Tests
Local --> LocalEnv
LocalEnv --> LocalTest
LocalTest --> Tests
Sources: .github/workflows/tests.yml:1-26 scripts/run-tests.sh:1-43
The local script scripts/run-tests.sh:1-43 includes fallback logic: if pytest is not available, it runs test_template_processor.py directly using Python’s unittest framework, but skips the pytest-dependent tests. This ensures developers can run at least basic tests without installing pytest.
Workflow Best Practices
Pull Request Requirements
Before merging, pull requests must:
- Pass all pytest tests
- Not introduce new test failures
- Update tests if behavior changes
Test Failure Investigation
When tests fail, the workflow output shows:
- Which test module failed
- The specific test function that failed
- Assertion details and error messages
- Full stdout/stderr due to the
-sflag
The -s flag in .github/workflows/tests.yml:25 ensures that print statements from the code under test are visible in the workflow logs, aiding debugging.
Adding New Tests
To add new tests to the workflow:
- Create a new test module in
python/tests/ - Follow pytest conventions (
test_*.pyfilename,test_*function names) - The workflow automatically discovers and runs the new tests
- No workflow configuration changes required
Sources: .github/workflows/tests.yml:24-25
Comparison with Build Workflow
The following table highlights the differences between the test workflow and the build-and-deploy workflow:
| Aspect | Test Workflow | Build-and-Deploy Workflow |
|---|---|---|
| Trigger | Push to main, pull requests | Weekly schedule, manual dispatch |
| Purpose | Validate code quality | Generate documentation |
| Duration | ~1-2 minutes | ~5-10 minutes |
| Environment | Python only | Full Docker build |
| Artifacts | None | Documentation site |
| Deployment | None | GitHub Pages |
| Blocking | Yes (blocks PR merge) | No |
Sources: .github/workflows/tests.yml:1-26
GitHub Action
Relevant source files
This page documents the reusable GitHub Action that enables external repositories to generate mdBook documentation from DeepWiki content. The action packages the entire Docker-based build system into a single workflow step that can be invoked with YAML configuration. For information about the automated build-and-deploy workflow used in this repository itself, see Build and Deploy Workflow. For information about the test workflow, see Test Workflow.
Overview
The GitHub Action is defined in action.yml:1-53 and implements a composite action pattern. It builds the Docker image on-demand within the calling workflow’s runner environment, executes the documentation generation process, and outputs artifacts to a configurable directory. Unlike pre-built action images, this approach bundles the Dockerfile directly, ensuring that the action always uses the exact code version referenced in the workflow.
graph TB
subgraph "Calling Repository Workflow"
WorkflowYAML["workflow.yml\nuses: jzombie/deepwiki-to-mdbook@main"]
InputParams["Input Parameters\nrepo, book_title, output_dir, etc."]
end
subgraph "action.yml Composite Steps"
Step1["Step 1: Build Docker image\nworking-directory: github.action_path"]
Step2["Step 2: Run documentation builder\ndocker run with mounted volume"]
end
subgraph "Execution Environment"
Dockerfile["Dockerfile\nbundled with action"]
ImageTag["IMAGE_TAG=deepwiki-to-mdbook:GITHUB_RUN_ID"]
Container["Docker Container\nbuild-docs.sh entrypoint"]
end
subgraph "Output Artifacts"
OutputDir["inputs.output_dir\nmounted to /output"]
Book["book/\nHTML documentation"]
Markdown["markdown/\nenhanced markdown"]
Config["book.toml\nmdBook config"]
end
WorkflowYAML --> InputParams
InputParams --> Step1
Step1 --> Dockerfile
Dockerfile --> ImageTag
ImageTag --> Step2
Step2 --> Container
Container --> OutputDir
OutputDir --> Book
OutputDir --> Markdown
OutputDir --> Config
The action provides the same functionality as local Docker execution but wraps it in a GitHub Actions-native interface with declarative input parameters instead of raw environment variables and shell commands.
Diagram: Action Invocation and Execution Flow
The action uses the github.action_path context variable to locate the bundled Dockerfile within the checked-out action repository. The image tag includes GITHUB_RUN_ID to ensure uniqueness across concurrent workflow executions.
Sources: action.yml:1-53 README.md:60-70
Input Parameters
The action exposes six input parameters that map directly to the Docker container’s environment variables. All parameters except repo have sensible defaults.
| Input | Required | Default | Description | Environment Variable |
|---|---|---|---|---|
repo | Yes | N/A | DeepWiki repository in owner/repo format (e.g., jzombie/deepwiki-to-mdbook) | REPO |
book_title | No | "Documentation" | Title displayed in the generated mdBook | BOOK_TITLE |
book_authors | No | "" | Author metadata for the mdBook (empty defaults to repo owner) | BOOK_AUTHORS |
git_repo_url | No | "" | Repository URL for mdBook edit links (empty auto-detects from Git) | GIT_REPO_URL |
markdown_only | No | "false" | Set to "true" to skip HTML build and only extract markdown | MARKDOWN_ONLY |
output_dir | No | "./output" | Output directory on workflow host, mounted to /output in container | (Volume mount target) |
The output_dir parameter is workflow-relative and resolved to an absolute path (action.yml:44). This allows callers to specify paths like ./docs-output or ${{ github.workspace }}/generated-docs.
Sources: action.yml:3-26
Action Implementation
The action uses the composite run type (action.yml:28), meaning it executes shell commands directly in the workflow runner rather than using a pre-built container image. This design choice enables the action to always use the current repository code without requiring separate image publication.
Diagram: Action Step Implementation Details
graph TB
subgraph "Step 1: Build Docker image"
ActionPath["github.action_path\nPoints to checked-out action repo"]
SetWorkDir["working-directory: github.action_path"]
BuildCmd["docker build -t IMAGE_TAG ."]
SetEnv["Export IMAGE_TAG to GITHUB_ENV"]
end
subgraph "Step 2: Run documentation builder"
MkdirOutput["mkdir -p inputs.output_dir"]
ResolveDir["OUT_DIR=$(cd inputs.output_dir && pwd)"]
DockerRun["docker run --rm"]
EnvVars["Environment Variables\nREPO, BOOK_TITLE, BOOK_AUTHORS\nGIT_REPO_URL, MARKDOWN_ONLY"]
VolumeMount["Volume: OUT_DIR:/output"]
end
subgraph "Container Execution"
Entrypoint["CMD: build-docs.sh"]
OutputGeneration["Generate book/, markdown/, etc."]
end
ActionPath --> SetWorkDir
SetWorkDir --> BuildCmd
BuildCmd --> SetEnv
SetEnv --> MkdirOutput
MkdirOutput --> ResolveDir
ResolveDir --> DockerRun
DockerRun --> EnvVars
DockerRun --> VolumeMount
EnvVars --> Entrypoint
VolumeMount --> Entrypoint
Entrypoint --> OutputGeneration
The first step (action.yml:30-37) uses working-directory: ${{ github.action_path }} to ensure docker build executes in the action's repository root, where the Dockerfile is located. The image tag incorporates GITHUB_RUN_ID (action.yml:35) to prevent conflicts when multiple workflows run concurrently. The tag is exported to $GITHUB_ENV (action.yml:36) to make it available in subsequent steps.
The second step (action.yml:39-52) resolves the output_dir input to an absolute path (action.yml:44) using cd and pwd, which is necessary because Docker volume mounts require absolute paths. The docker run command (action.yml:45-52) uses the same environment variable names as local Docker execution, mapping each input parameter to its corresponding environment variable.
Sources: action.yml:28-53
Usage Examples
Basic Usage
The most common usage pattern provides only the required repo parameter and accepts default values for all other inputs:
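A minimal invocation might look like the following sketch (the action reference `jzombie/deepwiki-to-mdbook@main` follows the diagram above):

```yaml
steps:
  - uses: jzombie/deepwiki-to-mdbook@main
    with:
      repo: myorg/myproject
```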
This generates documentation for the myorg/myproject DeepWiki repository with the default title “Documentation” and outputs to ./output.
Sources: README.md:62-70
Full Configuration
A fully configured invocation specifies all optional parameters:
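A sketch with every input set (input names from the table above; values are placeholders):

```yaml
steps:
  - uses: jzombie/deepwiki-to-mdbook@main
    with:
      repo: myorg/myproject
      book_title: "My Project Documentation"
      book_authors: "My Team"
      git_repo_url: "https://github.com/myorg/myproject"
      markdown_only: "false"
      output_dir: "./generated-docs"
```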
This provides custom metadata and changes the output location to ./generated-docs.
Sources: action.yml:3-26
Markdown-Only Mode
To extract markdown without building HTML, useful for custom post-processing:
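For example (a sketch):

```yaml
steps:
  - uses: jzombie/deepwiki-to-mdbook@main
    with:
      repo: myorg/myproject
      markdown_only: "true"
```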
This skips the mdBook build phase and outputs only markdown/ and raw_markdown/ directories. For more information about markdown-only mode, see Markdown-Only Mode.
Sources: action.yml:19-22
GitHub Pages Deployment
The action integrates with GitHub Pages deployment workflows:
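One possible shape for such a workflow, using the standard actions/upload-pages-artifact and actions/deploy-pages actions (a sketch; the exact deployment steps may differ):

```yaml
on:
  schedule:
    - cron: "0 0 * * 0"  # weekly
  workflow_dispatch:

permissions:
  pages: write
  id-token: write

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: jzombie/deepwiki-to-mdbook@main
        with:
          repo: ${{ github.repository }}
      - uses: actions/upload-pages-artifact@v3
        with:
          path: ./output/book
      - uses: actions/deploy-pages@v4
```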
This workflow generates documentation weekly and deploys it to GitHub Pages. The ${{ github.repository }} context variable automatically uses the current repository as the target.
Sources: README.md:60-70
Comparison with Local Docker Usage
The GitHub Action provides the same functionality as direct Docker invocation but with different interfaces:
| Aspect | GitHub Action | Local Docker |
|---|---|---|
| Invocation | YAML with: parameters | Shell environment variables |
| Image Building | Automatic per workflow run | Manual docker build |
| Output Path | Workflow-relative (e.g., ./output) | Absolute host path (e.g., /home/user/output) |
| Path Resolution | Automatic via cd and pwd (action.yml:44) | Manual path specification |
| Use Case | CI/CD automation, scheduled builds | Local development, debugging |
Both methods use the identical Docker image and build-docs.sh entrypoint, ensuring consistent output regardless of invocation method. The action adds automation conveniences like automatic image building and path resolution, while local Docker provides more direct control and faster iteration during development.
Sources: action.yml:28-53 README.md:14-27
Implementation Details
Composite Action Structure
The action uses the composite type (action.yml:28) rather than the docker or javascript types. This design choice has several implications:
- No Pre-built Images : The action does not publish Docker images to registries. Instead, it builds the image on-demand in each workflow run.
- Version Consistency : Using `@main` or a specific commit SHA ensures the action always uses the corresponding Dockerfile version.
- Build Caching : GitHub Actions runners cache Docker layers between runs, reducing build time after the initial execution.
- Shell Portability : The action requires only `bash` and `docker`, making it compatible with all standard GitHub-hosted runners.
Sources: action.yml:28
Environment Variable Mapping
The action translates workflow inputs to Docker environment variables using GitHub Actions expression syntax:
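The relevant portion of the run step looks roughly like this (a sketch of the shape only; see action.yml:45-52 for the exact command):

```bash
docker run --rm \
  -e REPO="${{ inputs.repo }}" \
  -e BOOK_TITLE="${{ inputs.book_title }}" \
  -e BOOK_AUTHORS="${{ inputs.book_authors }}" \
  -e GIT_REPO_URL="${{ inputs.git_repo_url }}" \
  -e MARKDOWN_ONLY="${{ inputs.markdown_only }}" \
  -v "$OUT_DIR:/output" \
  "$IMAGE_TAG"
```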
Each inputs.* reference is resolved by the Actions runtime before shell execution (action.yml:46-51). Empty values are passed as empty strings, which triggers auto-detection behavior in build-docs.sh. For details about auto-detection features, see Auto-Detection Features.
Sources: action.yml:45-52
Working Directory Management
The action uses working-directory: ${{ github.action_path }} (action.yml:31) in the build step to ensure docker build executes in the action repository root. The github.action_path context variable points to the directory where the action was checked out, which is outside the calling workflow's workspace.
In contrast, the second step does not specify a working directory, so it executes in ${{ github.workspace }} by default. This allows the output_dir input to use workflow-relative paths (action.yml:44).
Sources: action.yml:31 action.yml:44
Common Integration Patterns
Multi-Repository Documentation
Organizations can use the action to generate documentation for multiple repositories in a single workflow:
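A matrix strategy is one way to express this (a sketch; repository names are placeholders):

```yaml
jobs:
  docs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        repo: [myorg/project-a, myorg/project-b, myorg/project-c]
    steps:
      - uses: jzombie/deepwiki-to-mdbook@main
        with:
          repo: ${{ matrix.repo }}
          output_dir: ./output/${{ matrix.repo }}
```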
This generates documentation for three repositories in parallel, each in its own output subdirectory.
Sources: action.yml:3-26
Conditional Builds
The action can be conditionally executed based on workflow triggers:
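For example (a sketch):

```yaml
on:
  workflow_dispatch:
    inputs:
      repo:
        description: "Repository to document (owner/repo)"
        required: true

jobs:
  docs:
    if: github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    steps:
      - uses: jzombie/deepwiki-to-mdbook@main
        with:
          repo: ${{ github.event.inputs.repo }}
```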
This only runs the documentation build when manually triggered, using a repository specified in the workflow dispatch inputs.
Sources: action.yml:1-53
Artifact Upload Integration
The action output integrates seamlessly with GitHub Actions artifact upload:
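For example, using the standard actions/upload-artifact action (a sketch):

```yaml
steps:
  - uses: jzombie/deepwiki-to-mdbook@main
    with:
      repo: ${{ github.repository }}
  - uses: actions/upload-artifact@v4
    with:
      name: documentation
      path: |
        output/book
        output/markdown
        output/book.toml
```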
This uploads multiple output directories as a single artifact for later download or deployment.
Sources: README.md:54-58
Output Structure
Relevant source files
This page documents the structure and contents of the /output directory produced by the DeepWiki-to-mdBook converter. The output structure varies depending on whether the system runs in full build mode or markdown-only mode. For information about enabling markdown-only mode, see Markdown-Only Mode.
Output Directory Overview
The system writes all artifacts to the /output directory, which is typically mounted as a Docker volume. The contents of this directory depend on the MARKDOWN_ONLY environment variable:
Output Mode Decision Logic
graph TD
Start["build-docs.sh execution"]
CheckMode{MARKDOWN_ONLY\nenvironment variable}
FullBuild["Full Build Path"]
MarkdownOnly["Markdown-Only Path"]
OutputBook["/output/book/\nHTML documentation"]
OutputMarkdown["/output/markdown/\nSource .md files"]
OutputToml["/output/book.toml\nConfiguration"]
Start --> CheckMode
CheckMode -->|false default| FullBuild
CheckMode -->|true| MarkdownOnly
FullBuild --> OutputBook
FullBuild --> OutputMarkdown
FullBuild --> OutputToml
MarkdownOnly --> OutputMarkdown
Sources: build-docs.sh:26 build-docs.sh:60-76
Full Build Mode Output
When MARKDOWN_ONLY is not set or is false, the system produces three distinct outputs:
graph TD
Output["/output/"]
Book["book/\nComplete HTML site"]
Markdown["markdown/\nSource files"]
BookToml["book.toml\nConfiguration"]
BookIndex["index.html"]
BookCSS["css/"]
BookJS["FontAwesome/"]
BookSearchJS["searchindex.js"]
BookMermaid["mermaid-init.js"]
BookPages["*.html pages"]
MarkdownRoot["*.md files\n(main pages)"]
MarkdownSections["section-N/\n(subsection dirs)"]
Output --> Book
Output --> Markdown
Output --> BookToml
Book --> BookIndex
Book --> BookCSS
Book --> BookJS
Book --> BookSearchJS
Book --> BookMermaid
Book --> BookPages
Markdown --> MarkdownRoot
Markdown --> MarkdownSections
Directory Structure
Full Build Output Structure
Sources: build-docs.sh:178-192 README.md:92-104
/output/book/ Directory
The book/ directory contains the complete HTML documentation site generated by mdBook. This is a self-contained static website that can be hosted on any web server or opened directly in a browser.
| Component | Description | Generated By |
|---|---|---|
index.html | Main entry point for the documentation | mdBook |
*.html | Individual page files corresponding to each .md source | mdBook |
css/ | Styling for the rust theme | mdBook |
FontAwesome/ | Icon font assets | mdBook |
searchindex.js | Search index for site-wide search functionality | mdBook |
mermaid.min.js | Mermaid diagram rendering library | mdbook-mermaid |
mermaid-init.js | Mermaid initialization script | mdbook-mermaid |
The HTML site includes:
- Responsive navigation sidebar with hierarchical structure
- Full-text search functionality
- Syntax highlighting for code blocks
- Working Mermaid diagram rendering
- "Edit this page" links pointing to `GIT_REPO_URL`
- Collapsible sections in the navigation
Sources: build-docs.sh:173-176 build-docs.sh:94-95 README.md:95-99
/output/markdown/ Directory
The markdown/ directory contains the source Markdown files extracted from DeepWiki and enhanced with Mermaid diagrams. These files follow a specific naming convention and organizational structure.
File Naming Convention:
<page-number>-<page-title-slug>.md
Examples from actual output:
- `1-overview.md`
- `2-1-workspace-and-crates.md`
- `3-2-sql-parser.md`
Subsection Organization:
Pages with subsections have their children organized into directories:
section-N/
N-1-first-subsection.md
N-2-second-subsection.md
...
For example, if page 4-architecture.md has subsections, they appear in:
section-4/
4-1-overview.md
4-2-components.md
This organization is reflected in the mdBook SUMMARY.md generation logic at build-docs.sh:125-159
Sources: README.md:100-119 build-docs.sh:163-166 build-docs.sh:186-188
/output/book.toml File
The book.toml file is a copy of the mdBook configuration used to generate the HTML site. It contains:
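A sketch of its shape (field values are examples; the exact template is generated at build-docs.sh:84-103, and all keys shown are standard mdBook options):

```toml
[book]
title = "My Project Documentation"
authors = ["Project Contributors"]
src = "src"

[output.html]
git-repository-url = "https://github.com/owner/repo"

[preprocessor.mermaid]
command = "mdbook-mermaid"
```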
This file can be used to:
- Understand the configuration used for the build
- Regenerate the book with different settings
- Debug mdBook configuration issues
Sources: build-docs.sh:84-103 build-docs.sh:190-191
Markdown-Only Mode Output
When MARKDOWN_ONLY=true, the system produces only the /output/markdown/ directory. This mode skips the mdBook build phase entirely.
Markdown-Only Mode Data Flow
graph LR
Scraper["deepwiki-scraper.py"]
TempDir["/workspace/wiki/\nTemporary directory"]
OutputMarkdown["/output/markdown/\nFinal output"]
Scraper -->|Writes enhanced .md files| TempDir
TempDir -->|cp -r| OutputMarkdown
The output structure is identical to the markdown/ directory in full build mode, but the book/ and book.toml artifacts are not created.
Sources: build-docs.sh:60-76 README.md:106-113
graph TD
subgraph "Phase 1: Scraping"
Scraper["deepwiki-scraper.py"]
WikiDir["/workspace/wiki/"]
end
subgraph "Phase 2: Decision Point"
CheckMode{MARKDOWN_ONLY\ncheck}
end
subgraph "Phase 3: mdBook Build (conditional)"
BookInit["Initialize /workspace/book/"]
GenToml["Generate book.toml"]
GenSummary["Generate SUMMARY.md"]
CopyToSrc["cp wiki/* book/src/"]
MermaidInstall["mdbook-mermaid install"]
MdBookBuild["mdbook build"]
BuildOutput["/workspace/book/book/"]
end
subgraph "Phase 4: Copy to Output"
CopyBook["cp -r book /output/"]
CopyMarkdown["cp -r wiki /output/markdown/"]
CopyToml["cp book.toml /output/"]
end
Scraper -->|Writes to| WikiDir
WikiDir --> CheckMode
CheckMode -->|false| BookInit
CheckMode -->|true| CopyMarkdown
BookInit --> GenToml
GenToml --> GenSummary
GenSummary --> CopyToSrc
CopyToSrc --> MermaidInstall
MermaidInstall --> MdBookBuild
MdBookBuild --> BuildOutput
BuildOutput --> CopyBook
WikiDir --> CopyMarkdown
GenToml --> CopyToml
Output Generation Process
The following diagram shows how each output artifact is generated during the build process:
Complete Output Generation Pipeline
Sources: build-docs.sh:55-205
File Naming Examples
The following table shows actual filename patterns produced by the system:
| Pattern | Example | Description |
|---|---|---|
N-title.md | 1-overview.md | Main page without subsections |
N-M-title.md | 2-1-workspace-and-crates.md | Subsection file in root (legacy format) |
section-N/N-M-title.md | section-4/4-1-logical-planning.md | Subsection file in section directory |
The system automatically detects which pages have subsections by examining the numeric prefix and checking for corresponding section-N/ directories during SUMMARY.md generation.
Sources: build-docs.sh:125-159 README.md:115-119
Volume Mounting
The /output directory is designed to be mounted as a Docker volume. The typical Docker run command specifies:
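For example (assuming the image was built locally with the tag `deepwiki-to-mdbook`):

```bash
docker run --rm \
  -e REPO=owner/repo \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```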
This mounts the host’s ./output directory to the container’s /output directory, making all generated artifacts accessible on the host filesystem after the container exits.
Sources: README.md:34-38 README.md:83-86
Output Size Characteristics
The output directory typically contains:
- Markdown files : 10-500 KB per page depending on content length and diagram count
- HTML book : 5-50 MB total depending on page count and assets
- book.toml : ~500 bytes
For a typical repository with 20-30 documentation pages, expect:
- `markdown/`: 5-15 MB
- `book/`: 10-30 MB (includes all HTML, CSS, JS, and search index)
- `book.toml`: < 1 KB
The HTML book is significantly larger than the markdown source because it includes:
- Complete mdBook framework (CSS, JavaScript)
- Search index (`searchindex.js`)
- Mermaid rendering library (`mermaid.min.js`)
- Font assets (FontAwesome)
- Generated HTML for each page with navigation
Sources: build-docs.sh:178-205
Serving the Output
The HTML documentation in /output/book/ can be served using any static web server:
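For example, with Python's built-in server:

```bash
cd output/book
python3 -m http.server 8000
# then browse to http://localhost:8000
```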
The markdown files in /output/markdown/ can be:
- Committed to a Git repository
- Used as input for other documentation systems
- Edited and re-processed through mdBook manually
- Served directly by markdown-aware platforms like GitHub
Sources: README.md:83-86 build-docs.sh:203-204
Template System Details
Relevant source files
Purpose and Scope
This document provides comprehensive technical documentation of the template system used to customize headers and footers in generated markdown files. The template system supports variable substitution, conditional rendering, and custom template injection through volume mounts.
For information about specific template variables and their usage, see Template Variables. For guidance on providing custom templates, see Custom Templates. For the broader Phase 2 enhancement pipeline where templates are injected, see Phase 2: Diagram Enhancement.
Template Processing Architecture
The template system consists of two core components: the template processor script and the default template files. The processor applies variable substitution and conditional logic, while default templates provide the baseline content structure.
Template System Component Architecture
graph TB
subgraph "Template Processor"
ProcessScript["process-template.py"]
ProcessFunc["process_template()"]
ReplaceConditional["replace_conditional()"]
ReplaceVariable["replace_variable()"]
end
subgraph "Default Templates"
HeaderTemplate["templates/header.html"]
FooterTemplate["templates/footer.html"]
TemplateReadme["templates/README.md"]
end
subgraph "Runtime Configuration"
EnvVars["Environment Variables\nTEMPLATE_DIR\nHEADER_TEMPLATE\nFOOTER_TEMPLATE"]
VariablesDict["variables dict\nREPO, BOOK_TITLE\nGENERATION_DATE, etc."]
end
subgraph "Processing Steps"
ReadTemplate["Read template file"]
ParseVars["Parse CLI arguments\nVAR=value format"]
ProcessCond["Process conditionals\n{{#if}}...{{/if}}"]
ProcessVars["Process variables\n{{VARIABLE}}"]
RemoveComments["Remove HTML comments"]
end
EnvVars --> ProcessScript
HeaderTemplate --> ReadTemplate
FooterTemplate --> ReadTemplate
ReadTemplate --> ProcessFunc
ParseVars --> VariablesDict
VariablesDict --> ProcessFunc
ProcessFunc --> ProcessCond
ProcessCond --> ReplaceConditional
ReplaceConditional --> ProcessVars
ProcessVars --> ReplaceVariable
ReplaceVariable --> RemoveComments
RemoveComments --> Output["Processed HTML output"]
Sources: python/process-template.py:1-83 templates/header.html:1-9 templates/footer.html:1-11
Template Syntax
The template system supports three syntax constructs: variable substitution, conditional rendering, and HTML comments. All processing occurs through regular expression pattern matching in process_template().
Variable Substitution
Variables use double curly brace syntax: {{VARIABLE_NAME}}. The processor replaces these with values from the variables dictionary.
Variable Processing Flow
graph LR
Input["Template: {{BOOK_TITLE}}"]
Pattern["variable_pattern\nr'\\{\\{(\\w+)\\}\\}'"]
Match["re.sub()
match"]
Lookup["variables.get('BOOK_TITLE', '')"]
Output["Processed: My Documentation"]
Input --> Pattern
Pattern --> Match
Match --> Lookup
Lookup --> Output
The replace_variable() function at python/process-template.py:41-43 implements variable lookup. If a variable is not found in the dictionary, it returns an empty string rather than raising an error.
Sources: python/process-template.py:38-45 templates/README.md:12-15
Conditional Rendering
Conditionals control whether content blocks are included in the output based on variable presence and non-empty values. The syntax is {{#if VARIABLE}}...{{/if}}.
Conditional Evaluation Logic
| Condition | Variable Value | Result |
|---|---|---|
| Variable not in dictionary | N/A | Content excluded |
| Variable is empty string | "" | Content excluded |
| Variable is None | None | Content excluded |
| Variable is False | False | Content excluded |
| Variable has non-empty value | Any truthy value | Content included |
The conditional pattern at python/process-template.py:26 uses the re.DOTALL flag to match content across multiple lines. The replace_conditional() function evaluates the condition at python/process-template.py:32-34.
Example from Default Header:
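A representative form of that conditional (the markup here is assumed; the real line is templates/header.html:3):

```html
{{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}"><img src="{{GITHUB_BADGE_URL}}" alt="GitHub"></a>{{/if}}
```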
This conditional at templates/header.html:3 ensures the GitHub badge only appears when GIT_REPO_URL is configured.
Sources: python/process-template.py:24-36 templates/README.md:17-22 templates/header.html:3
HTML Comment Removal
HTML comments are automatically stripped from the processed output using the pattern <!--.*?--> with the re.DOTALL flag at python/process-template.py:48. This allows template authors to include documentation within template files without affecting the output.
Sources: python/process-template.py:47-48 templates/README.md:24-28
Default Template Structure
The system provides two default templates located in the templates/ directory. These templates are embedded in the Docker image and used unless overridden by volume mounts.
Template File Locations and Purpose
graph TB
subgraph "Docker Image: /workspace/templates/"
HeaderFile["header.html\nProject badges\nInitiative description\nGitHub links"]
FooterFile["footer.html\nGeneration timestamp\nRepository link"]
ReadmeFile["README.md\nDocumentation\nNot processed"]
end
subgraph "Environment Variables"
TemplateDir["TEMPLATE_DIR\ndefault: /workspace/templates"]
HeaderPath["HEADER_TEMPLATE\ndefault: $TEMPLATE_DIR/header.html"]
FooterPath["FOOTER_TEMPLATE\ndefault: $TEMPLATE_DIR/footer.html"]
end
subgraph "Runtime Resolution"
BuildScript["build-docs.sh"]
CheckExists["Check file existence"]
UseDefault["Use default templates"]
UseCustom["Use custom templates\n(if mounted)"]
end
TemplateDir --> HeaderPath
TemplateDir --> FooterPath
HeaderPath --> BuildScript
FooterPath --> BuildScript
BuildScript --> CheckExists
CheckExists --> UseDefault
CheckExists --> UseCustom
UseDefault --> HeaderFile
UseDefault --> FooterFile
Sources: templates/header.html:1-9 templates/footer.html:1-11 templates/README.md:1-77
Header Template Structure
The default header at templates/header.html:1-9 contains three main sections:
- Project Badges Section (lines 2-4): Conditionally displays GitHub badge using flexbox layout with gap spacing and bottom border
- Initiative Description (line 7): Displays “Projects with Books” initiative text from zenOSmosis
- Conditional GitHub Link (line 7): Second conditional that provides GitHub source link
The inline conditional comment at templates/header.html:1 explains why conditionals must be inline: to prevent mdBook from wrapping links in separate paragraph tags.
Footer Template Structure
The default footer at templates/footer.html:1-11 provides:
- Visual Separator (line 1): Horizontal rule with custom styling
- Generation Timestamp (line 4): Always displayed, using the `GENERATION_DATE` variable
- Repository Link (lines 5-9): Conditionally displays the repository link using the `GIT_REPO_URL` and `REPO` variables
Sources: templates/header.html:1-9 templates/footer.html:1-11
sequenceDiagram
participant BS as build-docs.sh
participant PT as process-template.py
participant FS as Filesystem
participant MD as Markdown Files
Note over BS: Phase 2: Enhancement
BS->>BS: Set template variables\nREPO, BOOK_TITLE, etc.
BS->>FS: Check HEADER_TEMPLATE exists
BS->>FS: Check FOOTER_TEMPLATE exists
BS->>PT: process-template.py header.html\nVAR1=val1 VAR2=val2...
PT->>FS: Read header.html
PT->>PT: Apply conditionals
PT->>PT: Substitute variables
PT->>PT: Remove comments
PT-->>BS: Return processed header
BS->>PT: process-template.py footer.html\nVAR1=val1 VAR2=val2...
PT->>FS: Read footer.html
PT->>PT: Apply conditionals
PT->>PT: Substitute variables
PT->>PT: Remove comments
PT-->>BS: Return processed footer
BS->>MD: For each .md file in markdown/
loop Each File
BS->>FS: Read original content
BS->>FS: Write: header + content + footer
end
Note over BS,MD: Templates now injected\nReady for mdBook build
Template Processing Pipeline
Templates are processed and injected during Phase 2 of the build pipeline, after markdown extraction but before mdBook build. The build-docs.sh script orchestrates this process.
Template Injection Workflow
Sources: python/process-template.py:53-82
Available Template Variables
The following variables are available in all templates. These are set by build-docs.sh before template processing and passed as command-line arguments to process-template.py.
| Variable | Source | Example Value | Description |
|---|---|---|---|
REPO | REPO env var or Git remote | jzombie/deepwiki-to-mdbook | Repository in owner/repo format |
BOOK_TITLE | BOOK_TITLE env var or default | My Project Documentation | Title displayed in book |
BOOK_AUTHORS | BOOK_AUTHORS env var or default | Project Contributors | Author attribution |
GENERATION_DATE | date -u command | 2024-01-15 14:30:00 UTC | UTC timestamp when docs generated |
DEEPWIKI_URL | Hardcoded | https://deepwiki.com/... | Source wiki URL |
DEEPWIKI_BADGE_URL | Constructed from DEEPWIKI_URL | https://deepwiki.com/.../badge.svg | DeepWiki badge image |
GIT_REPO_URL | Constructed from REPO | https://github.com/owner/repo | Full GitHub repository URL |
GITHUB_BADGE_URL | Constructed from REPO | https://img.shields.io/badge/... | GitHub badge image |
Variable Construction Logic:
- `GIT_REPO_URL`: Only set if `REPO` is non-empty. Format: `https://github.com/${REPO}`
- `GITHUB_BADGE_URL`: Only set if `REPO` is non-empty. Constructed using the shields.io badge service
- `DEEPWIKI_BADGE_URL`: Derived from `DEEPWIKI_URL` by appending `/badge.svg`
Sources: templates/README.md:30-40
Customization Mechanisms
The template system supports customization through three mechanisms: environment variables, volume mounts, and complete template directory replacement.
Environment Variable Configuration
| Environment Variable | Default Value | Purpose |
|---|---|---|
TEMPLATE_DIR | /workspace/templates | Base directory for template files |
HEADER_TEMPLATE | $TEMPLATE_DIR/header.html | Path to header template |
FOOTER_TEMPLATE | $TEMPLATE_DIR/footer.html | Path to footer template |
These variables allow changing template paths without modifying the Docker image.
Sources: templates/README.md:53-56
Volume Mount Strategies
Strategy 1: Replace Individual Templates
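For example (image tag assumed):

```bash
docker run --rm -e REPO=owner/repo \
  -v "$(pwd)/my-header.html:/workspace/templates/header.html" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```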
This overrides only the header template, keeping the default footer.
Strategy 2: Replace Entire Template Directory
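For example:

```bash
docker run --rm -e REPO=owner/repo \
  -v "$(pwd)/my-templates:/workspace/templates" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```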
This replaces all default templates. The custom directory must contain both header.html and footer.html.
Strategy 3: Custom Template Location
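For example:

```bash
docker run --rm -e REPO=owner/repo \
  -e TEMPLATE_DIR=/custom/templates \
  -v "$(pwd)/my-templates:/custom/templates" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```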
This uses environment variables to point to templates in a different location.
Sources: templates/README.md:44-51
graph TB
subgraph "Phase 1: Extraction"
Scraper["deepwiki-scraper.py\nGenerate raw markdown"]
RawMD["raw_markdown/\ndirectory"]
end
subgraph "Phase 2: Enhancement"
DiagramInject["Diagram injection\ninto markdown"]
TemplateProcess["Template processing\nprocess-template.py"]
TemplateInject["Template injection\nPrepend header\nAppend footer"]
EnhancedMD["markdown/\ndirectory"]
end
subgraph "Phase 3: Build"
SummaryGen["SUMMARY.md generation"]
MDBookBuild["mdbook build"]
HTMLOutput["book/\ndirectory"]
end
Scraper --> RawMD
RawMD --> DiagramInject
DiagramInject --> TemplateProcess
TemplateProcess --> TemplateInject
TemplateInject --> EnhancedMD
EnhancedMD --> SummaryGen
SummaryGen --> MDBookBuild
MDBookBuild --> HTMLOutput
Integration with Build Process
The template system integrates into the three-phase build pipeline at a specific injection point during Phase 2.
Build Phase Integration Points
The template injection occurs after diagram enhancement but before SUMMARY.md generation. This ensures:
- All content modifications (diagrams) are complete before templates are added
- Templates appear in every markdown file processed by mdBook
- The same header/footer styling applies consistently across all pages
File Processing Order:
1. `build-docs.sh` processes the header and footer templates once, at the start of Phase 2
2. The processed templates are stored in temporary variables
3. For each `.md` file in the `markdown/` directory:
   - Read the original file content
   - Concatenate: `processed_header + content + processed_footer`
   - Write back to the same file
4. Continue to Phase 3 with the enhanced files
Sources: python/process-template.py:1-83
Template Processing Error Handling
The process-template.py script implements basic error handling for common failure scenarios:
Error Conditions and Responses:
| Condition | Detection | Response |
|---|---|---|
| Missing template file | os.path.isfile() check at line 62 | Print error to stderr, exit code 1 |
| Insufficient arguments | len(sys.argv) < 2 check at line 54 | Print usage message, exit code 1 |
| Missing variable | Dictionary lookup at line 43 | Return empty string (silent) |
| Invalid conditional | Regex fails to match | No replacement (silent) |
| File read error | Exception during file open | Unhandled exception propagates |
The error handling philosophy is “fail early for missing inputs, fail silently for missing variables.” This allows templates to gracefully degrade when optional variables (like GIT_REPO_URL) are not provided, while catching configuration errors before processing begins.
Sources: python/process-template.py:54-63
Template Variables
Relevant source files
Purpose and Scope
This page provides a complete reference for all template variables available in the deepwiki-to-mdbook template system. Template variables are placeholders that get substituted with actual values during the build process, allowing dynamic customization of headers and footers injected into each markdown file.
For information about the broader template system architecture and conditional logic, see Template System. For guidance on providing custom templates, see Custom Templates.
Variable Overview
Template variables are captured during the build process and passed to process-template.py for substitution into HTML template files. These variables provide context about the repository, documentation metadata, and generated links.
Sources: scripts/build-docs.sh:195-234
Complete Variable Reference
The following table documents all available template variables:
| Variable | Description | Source/Derivation | Default Value |
|---|---|---|---|
REPO | Repository identifier in owner/repo format | Environment variable or auto-detected from Git remote | Required (no default) |
BOOK_TITLE | Documentation book title | Environment variable | "Documentation" |
BOOK_AUTHORS | Authors of the documentation | Environment variable | Value of REPO_OWNER (first part of REPO) |
GENERATION_DATE | Timestamp when documentation was generated | Environment variable or auto-generated | Current UTC datetime in format "Month DD, YYYY at HH:MM UTC" |
DEEPWIKI_URL | URL to DeepWiki documentation page | Derived from REPO | https://deepwiki.com/{REPO} |
DEEPWIKI_BADGE_URL | URL to DeepWiki badge image | Static value | https://deepwiki.com/badge.svg |
GIT_REPO_URL | URL to Git repository | Environment variable | https://github.com/{REPO} |
GITHUB_BADGE_URL | URL to GitHub badge image | Generated from REPO with URL encoding | https://img.shields.io/badge/GitHub-{encoded_repo}-181717?logo=github |
Sources: scripts/build-docs.sh:8-51 scripts/build-docs.sh:195-213
Variable Resolution Flow
The following diagram illustrates how template variables are resolved from multiple sources:
Sources: scripts/build-docs.sh:8-51 scripts/build-docs.sh:195-234
graph TB
subgraph "Input Sources"
EnvVars["Environment Variables\n(REPO, BOOK_TITLE, etc.)"]
GitRemote["Git Remote\n(git config remote.origin.url)"]
SystemTime["System Time\n(date -u)"]
end
subgraph "Resolution Logic in build-docs.sh"
AutoDetect["Auto-Detection\n[build-docs.sh:8-19]"]
SetDefaults["Set Defaults\n[build-docs.sh:21-26]"]
DeriveValues["Derive URLs\n[build-docs.sh:40-51]"]
CaptureDate["Capture Timestamp\n[build-docs.sh:200]"]
end
subgraph "Variable Processing"
TemplateProcessor["process-template.py"]
HeaderTemplate["templates/header.html"]
FooterTemplate["templates/footer.html"]
end
subgraph "Output"
ProcessedHeader["Processed Header HTML"]
ProcessedFooter["Processed Footer HTML"]
InjectedMD["Markdown Files\nwith Headers/Footers"]
end
EnvVars -->|REPO not set| AutoDetect
GitRemote --> AutoDetect
AutoDetect -->|extract owner/repo| SetDefaults
EnvVars --> SetDefaults
SetDefaults --> DeriveValues
SystemTime --> CaptureDate
DeriveValues -->|8 variables| TemplateProcessor
CaptureDate --> TemplateProcessor
HeaderTemplate --> TemplateProcessor
FooterTemplate --> TemplateProcessor
TemplateProcessor --> ProcessedHeader
TemplateProcessor --> ProcessedFooter
ProcessedHeader --> InjectedMD
ProcessedFooter --> InjectedMD
Variable Processing Implementation
Capture and Derivation
The variable capture process occurs in several stages within build-docs.sh:
1. Auto-detection (scripts/build-docs.sh:8-19): If `REPO` is not set, the script attempts to extract it from the Git remote URL using `git config --get remote.origin.url`.
2. Default assignment (scripts/build-docs.sh:21-26): Environment variables are assigned default values. The pattern `${VAR:-default}` provides the fallbacks.
3. URL derivation (scripts/build-docs.sh:40-51): Several URLs are constructed from the base variables:
   - `DEEPWIKI_URL` is built from `REPO`
   - `GIT_REPO_URL` defaults to the GitHub URL derived from `REPO`
   - `GITHUB_BADGE_URL` includes the URL-encoded `REPO` with character escaping
4. Timestamp capture (scripts/build-docs.sh:200): The generation date is captured in UTC format using `date -u`.
Sources: scripts/build-docs.sh:8-51 scripts/build-docs.sh:200
Badge URL Generation
The GitHub badge URL generation includes special character handling:
```bash
REPO_BADGE_LABEL=$(printf '%s' "$REPO" | sed 's/-/--/g' | sed 's/\//%2F/g')
GITHUB_BADGE_URL="https://img.shields.io/badge/GitHub-${REPO_BADGE_LABEL}-181717?logo=github"
```
This escapes hyphens (doubled) and URL-encodes slashes for shields.io badge compatibility.
Sources: scripts/build-docs.sh:50-51
Template Invocation
Variables are passed to process-template.py as command-line arguments in KEY=VALUE format:
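A sketch of the call (argument names follow the variable table above; the exact invocation is at scripts/build-docs.sh:205-213):

```bash
python3 process-template.py "$HEADER_TEMPLATE" \
  "REPO=$REPO" \
  "BOOK_TITLE=$BOOK_TITLE" \
  "GENERATION_DATE=$GENERATION_DATE" \
  "GIT_REPO_URL=$GIT_REPO_URL"
```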
The same variables are passed for both header and footer template processing.
Sources: scripts/build-docs.sh:205-213 scripts/build-docs.sh:222-230
graph LR
subgraph "Stage 1: Variable Collection"
CollectEnv["Environment Variables"]
CollectGit["Git Remote Detection"]
CollectTime["Timestamp Generation"]
end
subgraph "Stage 2: build-docs.sh Processing"
ValidateREPO["Validate REPO\n[build-docs.sh:33-38]"]
ExtractParts["Extract REPO_OWNER\nand REPO_NAME\n[build-docs.sh:40-42]"]
ApplyDefaults["Apply Defaults\n[build-docs.sh:44-46]"]
ConstructURLs["Construct Badge URLs\n[build-docs.sh:48-51]"]
end
subgraph "Stage 3: Template Processing"
InvokeProcessor["Invoke process-template.py\n[build-docs.sh:205-230]"]
SubstituteVars["Variable Substitution"]
EvalConditionals["Evaluate Conditionals"]
StripComments["Strip HTML Comments"]
end
subgraph "Stage 4: Output"
HeaderHTML["HEADER_HTML"]
FooterHTML["FOOTER_HTML"]
InjectFiles["Inject into Markdown\n[build-docs.sh:240-261]"]
end
CollectEnv --> ValidateREPO
CollectGit --> ValidateREPO
CollectTime --> InvokeProcessor
ValidateREPO --> ExtractParts
ExtractParts --> ApplyDefaults
ApplyDefaults --> ConstructURLs
ConstructURLs --> InvokeProcessor
InvokeProcessor --> SubstituteVars
SubstituteVars --> EvalConditionals
EvalConditionals --> StripComments
StripComments --> HeaderHTML
StripComments --> FooterHTML
HeaderHTML --> InjectFiles
FooterHTML --> InjectFiles
Variable Processing Pipeline
The following diagram shows how variables flow through the processing pipeline:
Sources: scripts/build-docs.sh:8-51 scripts/build-docs.sh:195-261
Usage Syntax
Variable Substitution
Variables are referenced in templates using double curly braces:
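```html
<h1>{{BOOK_TITLE}}</h1>
```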
At render time, this becomes:
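```html
<h1>Documentation</h1>
```

(assuming the default title "Documentation")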
Sources: templates/README.md:12-15
Conditional Blocks
Variables can be tested for existence using conditional syntax:
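For example (markup illustrative):

```html
{{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}">View source</a>{{/if}}
```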
If GIT_REPO_URL is set, the link is rendered. If not set or empty, the entire block is omitted.
Sources: templates/README.md:17-22
Multiple Variables
Multiple variables can be combined in a single template:
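For example (an illustrative line, not the shipped template):

```html
<p>{{BOOK_TITLE}} by {{BOOK_AUTHORS}}, generated {{GENERATION_DATE}}</p>
```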
Sources: templates/README.md:60-68
Implementation Examples
Badge Link Generation
The default templates use variables to generate badge links. Here’s a typical pattern:
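A representative sketch (the shipped templates may differ in markup):

```html
{{#if DEEPWIKI_URL}}<a href="{{DEEPWIKI_URL}}"><img src="{{DEEPWIKI_BADGE_URL}}" alt="DeepWiki"></a>{{/if}}
{{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}"><img src="{{GITHUB_BADGE_URL}}" alt="GitHub"></a>{{/if}}
```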
This generates clickable badge images linking to both DeepWiki and GitHub, but only if the URLs are configured.
Sources: templates/README.md:30-39
Footer Timestamp
The footer typically includes the generation timestamp:
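For example:

```html
<p>Generated on {{GENERATION_DATE}}</p>
```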
With GENERATION_DATE="December 25, 2024 at 14:30 UTC", this renders as:
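```html
<p>Generated on December 25, 2024 at 14:30 UTC</p>
```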
Sources: templates/README.md:70-76
Environment Variable Override
All template variables can be overridden via environment variables when running the Docker container:
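For example (image tag assumed):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -e BOOK_TITLE="My Custom Title" \
  -e BOOK_AUTHORS="My Team" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```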
This allows full customization without modifying template files.
Sources: scripts/build-docs.sh:21-26
Variable Validation
Only REPO is strictly required. The build will fail if it cannot be determined:
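A sketch of the guard (error text assumed; see scripts/build-docs.sh:33-38):

```bash
if [ -z "$REPO" ]; then
  echo "ERROR: REPO must be set (owner/repo) or detectable from a Git remote" >&2
  exit 1
fi
```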
All other variables have sensible defaults and can be omitted.
Sources: scripts/build-docs.sh:33-38
Variable Storage During Build
After processing, the rendered HTML is stored in shell variables:
- `HEADER_HTML` contains the processed header template
- `FOOTER_HTML` contains the processed footer template
These are then injected into each markdown file during the copy operation scripts/build-docs.sh:240-261
Sources: scripts/build-docs.sh:205-234 scripts/build-docs.sh:240-261
Custom Templates
Relevant source files
This page explains how to provide custom header and footer templates for the DeepWiki-to-mdBook converter through Docker volume mounts. Custom templates allow you to control the HTML content injected at the beginning and end of every generated markdown file, enabling branding, navigation elements, and custom styling.
For information about the variables available for use within templates, see Template Variables. For comprehensive details about the template system architecture and processing logic, see Template System.
Purpose and Scope
The template system supports customization through two mechanisms:
- Volume Mounts : Replace default templates by mounting custom files into the container
- Environment Variables : Override default template file paths
This page documents both mechanisms and provides practical examples for common customization scenarios.
Sources: README.md:39-51 templates/README.md:42-56
Default Template Architecture
The system includes two default templates that are baked into the Docker image:
| Template File | Container Path | Purpose |
|---|---|---|
header.html | /workspace/templates/header.html | Injected at the start of each markdown file |
footer.html | /workspace/templates/footer.html | Injected at the end of each markdown file |
graph LR
subgraph "Container Filesystem"
DefaultTemplates["/workspace/templates/\nheader.html\nfooter.html"]
CustomMount["/workspace/templates/\n(volume mount)"]
end
subgraph "Resolution Logic"
CheckMount{"Custom\ntemplates\nmounted?"}
UseDefault["Use default\ntemplates"]
UseCustom["Use custom\ntemplates"]
end
subgraph "Processing Pipeline"
ProcessTemplate["process-template.py\nVariable substitution\nConditional rendering"]
MarkdownFiles["markdown/*.md"]
InjectedContent["Markdown with\ninjected headers/footers"]
end
DefaultTemplates --> CheckMount
CustomMount --> CheckMount
CheckMount -->|No| UseDefault
CheckMount -->|Yes| UseCustom
UseDefault --> ProcessTemplate
UseCustom --> ProcessTemplate
ProcessTemplate --> MarkdownFiles
MarkdownFiles --> InjectedContent
These defaults provide basic documentation metadata including repository links, DeepWiki badges, and generation timestamps. Custom templates completely replace these defaults when mounted.
Template Processing Flow:
Diagram: Template Resolution and Processing Pipeline
The system checks for mounted templates at container startup. If custom templates are found at the mount point, they replace the defaults entirely. The selected templates are then processed by process-template.py for variable substitution and conditional rendering before being injected into each markdown file.
Sources: templates/README.md:5-8 README.md:39-51
Volume Mount Strategies
Full Directory Mount
The recommended approach is to mount an entire directory containing both custom templates:
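For example (image tag assumed):

```bash
docker run --rm -e REPO=owner/repo \
  -v "$(pwd)/my-templates:/workspace/templates" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```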
Directory Structure:
my-templates/
├── header.html
└── footer.html
This strategy provides clean separation between your custom templates and the system, and allows you to version control both templates together.
Individual File Mounts
For granular control, mount individual template files:
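For example:

```bash
docker run --rm -e REPO=owner/repo \
  -v "$(pwd)/custom-header.html:/workspace/templates/header.html" \
  -v "$(pwd)/custom-footer.html:/workspace/templates/footer.html" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```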
This approach is useful when:
- You only want to customize one template (header or footer)
- Your templates are stored in different locations
- You’re testing template changes without modifying a template directory
Partial Customization
You can mount only one template file, and the system will use the default for the other:
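For example, replacing only the footer:

```bash
docker run --rm -e REPO=owner/repo \
  -v "$(pwd)/custom-footer.html:/workspace/templates/footer.html" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```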
Sources: README.md:44-49 templates/README.md:46-51
Environment Variable Configuration
The template system exposes three environment variables for advanced path customization:
| Environment Variable | Default Value | Purpose |
|---|---|---|
TEMPLATE_DIR | /workspace/templates | Base directory for template files |
HEADER_TEMPLATE | $TEMPLATE_DIR/header.html | Path to header template |
FOOTER_TEMPLATE | $TEMPLATE_DIR/footer.html | Path to footer template |
Custom Template Directory
Override the entire template directory location:
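```bash
docker run --rm -e REPO=owner/repo \
  -e TEMPLATE_DIR=/custom/templates \
  -v "$(pwd)/my-templates:/custom/templates" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```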
Custom Template Paths
Override individual template file paths:
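```bash
docker run --rm -e REPO=owner/repo \
  -e HEADER_TEMPLATE=/custom/header.html \
  -e FOOTER_TEMPLATE=/custom/footer.html \
  -v "$(pwd)/header.html:/custom/header.html" \
  -v "$(pwd)/footer.html:/custom/footer.html" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```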
This advanced configuration is rarely needed but provides maximum flexibility for non-standard deployment scenarios.
Sources: templates/README.md:53-56
Template Processing Implementation
The template injection mechanism is implemented in process-template.py, which is invoked by the main orchestrator during Phase 2 of the build pipeline.
Template Processing Code Flow:
graph TB
subgraph "build-docs.sh Orchestration"
BuildScript["build-docs.sh"]
Phase2["Phase 2: Enhancement"]
end
subgraph "process-template.py"
LoadTemplate["load_template()\nRead header/footer files"]
ParseVars["Variable Substitution\n{{VAR}} → value"]
ParseCond["Conditional Rendering\n{{#if VAR}}...{{/if}}"]
StripComments["strip_html_comments()\nRemove <!-- ... -->"]
ProcessFile["process_file()\nInject into markdown"]
end
subgraph "File System"
TemplateFiles["/workspace/templates/\nheader.html\nfooter.html"]
MarkdownDir["markdown/*.md"]
EnhancedMD["Enhanced markdown\nwith headers/footers"]
end
BuildScript --> Phase2
Phase2 --> LoadTemplate
TemplateFiles --> LoadTemplate
LoadTemplate --> ParseVars
ParseVars --> ParseCond
ParseCond --> StripComments
StripComments --> ProcessFile
MarkdownDir --> ProcessFile
ProcessFile --> EnhancedMD
Diagram: Template Processing Implementation Flow
The process-template.py script is executed by build-docs.sh during Phase 2. It loads the template files from the configured paths, performs variable substitution and conditional rendering, strips HTML comments, and injects the processed content into each markdown file.
Key Functions in process-template.py:
| Function | Responsibility | Implementation Detail |
|---|---|---|
load_template() | Read template file content | Returns raw HTML string from file path |
| Variable substitution | Replace {{VAR}} with values | Regex-based pattern matching |
| Conditional rendering | Process {{#if VAR}}...{{/if}} | Evaluates variable truthiness |
strip_html_comments() | Remove HTML comments | Regex pattern: <!--.*?--> |
process_file() | Inject templates into markdown | Prepends header, appends footer |
Sources: templates/README.md:24-28 README.md:51
Practical Examples
Minimal Custom Header
Replace the default header with a simple title banner:
File: `my-templates/header.html`
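An illustrative header (markup invented for the example):

```html
<div style="text-align: center; padding-bottom: 1em; border-bottom: 1px solid #ccc;">
  <strong>{{BOOK_TITLE}}</strong>
</div>
```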
File: `my-templates/footer.html`
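An illustrative footer:

```html
<hr>
<p><em>Generated on {{GENERATION_DATE}}</em></p>
```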
Usage:
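```bash
docker run --rm -e REPO=owner/repo \
  -v "$(pwd)/my-templates:/workspace/templates" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```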
Branded Documentation
Add company branding and navigation:
File: `custom/header.html`
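An illustrative branded header (logo URL and markup invented):

```html
<div style="display: flex; align-items: center; gap: 1em;">
  <img src="https://example.com/logo.png" alt="Logo" height="32">
  <nav><a href="{{GIT_REPO_URL}}">Source</a> · <a href="{{DEEPWIKI_URL}}">Wiki</a></nav>
</div>
```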
File: `custom/footer.html`
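An illustrative matching footer:

```html
<hr>
<p>Documentation generated {{GENERATION_DATE}} from <a href="{{DEEPWIKI_URL}}">DeepWiki</a>.</p>
```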
Badge-Based Navigation
Create a navigation header using badges:
File: `badges/header.html`
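An illustrative badge row (markup invented; variables from the reference table):

```html
{{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}"><img src="{{GITHUB_BADGE_URL}}" alt="GitHub"></a>{{/if}}
{{#if DEEPWIKI_URL}}<a href="{{DEEPWIKI_URL}}"><img src="{{DEEPWIKI_BADGE_URL}}" alt="DeepWiki"></a>{{/if}}
```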
Sources: templates/README.md:60-76
graph TB
subgraph "Phase 1: Extraction"
Scrape["deepwiki-scraper.py\nExtract wiki content"]
RawMD["raw_markdown/"]
end
subgraph "Phase 2: Enhancement"
DiagramInject["Inject Mermaid diagrams"]
CheckTemplates{"Custom\ntemplates\nmounted?"}
LoadDefaults["Load default templates\n/workspace/templates/"]
LoadCustom["Load custom templates\n(volume mount)"]
ProcessTemplates["process-template.py\nInject headers/footers"]
EnhancedMD["markdown/"]
end
subgraph "Phase 3: Build"
GenSummary["Generate SUMMARY.md"]
MDBookBuild["mdbook build"]
FinalHTML["book/"]
end
Scrape --> RawMD
RawMD --> DiagramInject
DiagramInject --> CheckTemplates
CheckTemplates -->|No| LoadDefaults
CheckTemplates -->|Yes| LoadCustom
LoadDefaults --> ProcessTemplates
LoadCustom --> ProcessTemplates
ProcessTemplates --> EnhancedMD
EnhancedMD --> GenSummary
GenSummary --> MDBookBuild
MDBookBuild --> FinalHTML
Integration with Build Pipeline
Custom templates are integrated at a specific point in the three-phase build pipeline:
Diagram: Custom Templates in Build Pipeline Context
Template customization occurs in Phase 2, after diagram injection but before mdBook structure generation. This ensures that custom headers and footers are present in the markdown files when SUMMARY.md is generated and mdBook performs its final build. The template selection happens once at the beginning of Phase 2, and the same templates are applied consistently to all markdown files.
Sources: README.md:72-76
Advanced Customization Patterns
Conditional Content Based on Environment
Use conditionals to show different content in different contexts:
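For example (illustrative):

```html
{{#if GIT_REPO_URL}}<p>Browse the <a href="{{GIT_REPO_URL}}">source repository</a>.</p>{{/if}}
{{#if BOOK_AUTHORS}}<p>Maintained by {{BOOK_AUTHORS}}.</p>{{/if}}
```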
Testing Template Changes Locally
Workflow for iterating on custom templates:
-
Create template directory:
mkdir -p ./test-templates -
Add custom templates:
vim ./test-templates/header.html -
Run container with mounted templates:
-
View results:
cd output && python3 -m http.server --directory book 8000 -
Iterate by editing templates and re-running step 3
Template Debugging
When templates don’t render as expected:
-
Check file paths : Verify templates are mounted correctly
-
- Inspect raw markdown : Check the `output/markdown/` files to see the injected content
Verify variable availability : Ensure variables used in templates are set
- Check environment variables passed to container
- Review Template Variables for available variables
-
Test HTML syntax : Validate HTML before mounting
Sources: templates/README.md:1-77 README.md:39-51
Template Mount Configuration Matrix
| Configuration Type | Volume Mount | Environment Variables | Use Case |
|---|---|---|---|
| Default templates | None | None | Standard documentation |
| Full custom directory | -v ./templates:/workspace/templates | None | Complete customization |
| Individual files | -v ./header.html:/workspace/templates/header.html | None | Partial customization |
| Custom location | -v ./templates:/custom/path | TEMPLATE_DIR=/custom/path | Non-standard paths |
| Advanced paths | Multiple mounts | `HEADER_TEMPLATE=...` and `FOOTER_TEMPLATE=...` | Complex setups |
Sources: templates/README.md:42-56 README.md:44-49
Advanced Topics
Relevant source files
This page covers advanced usage scenarios, implementation details, and power-user features of the DeepWiki-to-mdBook Converter. It provides deeper insight into optional features, debugging techniques, and the internal mechanisms that enable the system’s flexibility and robustness.
For basic usage and configuration, see Quick Start and Configuration Reference. For architectural overview, see System Architecture. For component-level details, see Component Reference.
When to Use Advanced Features
The system provides several advanced features designed for specific scenarios:
Markdown-Only Mode : Extract markdown without building the HTML documentation. Useful for:
- Debugging diagram placement and content extraction
- Quick iteration during development
- Creating markdown archives for version control
- Feeding extracted content into other tools
Auto-Detection : Automatically determine repository metadata from Git remotes. Useful for:
- CI/CD pipeline integration with minimal configuration
- Running from within a repository checkout
- Reducing configuration boilerplate
Custom Configuration : Override default behaviors through environment variables. Useful for:
- Multi-repository documentation builds
- Custom branding and themes
- Specialized output requirements
Decision Flow for Build Modes
Sources: build-docs.sh:60-76 build-docs.sh:78-206 README.md:55-76
Debugging Strategies
Using Markdown-Only Mode for Fast Iteration
The MARKDOWN_ONLY environment variable bypasses the mdBook build phase, reducing build time from minutes to seconds. This is controlled by a simple conditional check in the orchestration script.
Workflow:
- Set `MARKDOWN_ONLY=true` in the Docker run command
- The script executes build-docs.sh:60-76, which skips Steps 2-6
- Only Phase 1 (scraping) and Phase 2 (diagram enhancement) execute
- Output is written directly to `/output/markdown/`
Typical debugging session:
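For example (image tag assumed):

```bash
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```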
The check at build-docs.sh:61 determines whether to exit early:
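A sketch of that check (shape assumed; see build-docs.sh:60-76):

```bash
if [ "$MARKDOWN_ONLY" = "true" ]; then
  mkdir -p "$OUTPUT_DIR/markdown"
  cp -r "$WIKI_DIR"/* "$OUTPUT_DIR/markdown/"
  echo "Markdown-only mode: skipping mdBook build"
  exit 0
fi
```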
For detailed information about this mode, see Markdown-Only Mode.
Sources: build-docs.sh:60-76 build-docs.sh:26 README.md:55-76
Inspecting Intermediate Outputs
The system uses a temporary directory workflow that can be examined for debugging:
| Stage | Location | Contents |
|---|---|---|
| During Phase 1 | /workspace/wiki/ (temp) | Raw markdown before diagram enhancement |
| During Phase 2 | /workspace/wiki/ (temp) | Markdown with injected diagrams |
| During Phase 3 | /workspace/book/src/ | Markdown copied for mdBook |
| Final Output | /output/markdown/ | Final enhanced markdown files |
The temporary directory pattern is implemented using Python's tempfile.TemporaryDirectory at tools/deepwiki-scraper.py:808:
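A sketch of the pattern (helper names other than scrape_wiki are assumed):

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as temp_dir:
    temp_path = Path(temp_dir)
    scrape_wiki(repo, temp_path)              # Phase 1: write raw pages into the temp dir
    enhance_with_diagrams(temp_path)          # Phase 2: inject diagrams in place (name assumed)
    shutil.copytree(temp_path, output_dir, dirs_exist_ok=True)  # publish only on success
```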
This ensures atomic operations—if the script fails mid-process, partial outputs are automatically cleaned up.
Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:27-30
Diagram Placement Debugging
Diagram injection uses fuzzy matching with progressive chunk sizes. To debug placement:
- Check raw extraction count : Look for console output “Found N total diagrams”
- Check context extraction : Look for “Found N diagrams with context”
- Check matching : Look for “Enhanced X files with diagrams”
The matching algorithm tries progressively smaller chunks at tools/deepwiki-scraper.py:716-730:
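An illustrative sketch of the progressive strategy (chunk sizes and names assumed, not the shipped values):

```python
def find_insert_position(context_lines, markdown_lines):
    for chunk_size in (5, 3, 2, 1):                  # try the largest context first
        chunk = context_lines[-chunk_size:]          # last N lines preceding the diagram
        for i in range(len(markdown_lines) - len(chunk) + 1):
            if markdown_lines[i:i + len(chunk)] == chunk:
                return i + len(chunk)                # insert right after the matched chunk
    return None                                      # no confident placement found
```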
Debugging poor matches:
- If too few diagrams placed: The context from JavaScript may not match converted markdown
- If diagrams in wrong locations: Context text may appear in multiple locations
- If no diagrams: Repository may not contain mermaid diagrams
Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:216-331
Link Rewriting Implementation
The Link Rewriting Problem
DeepWiki uses absolute URLs like /owner/repo/2-1-subsection. The scraper must convert these to relative markdown paths that work in the mdBook file hierarchy:
output/markdown/
├── 1-overview.md
├── 2-main-section.md
├── section-2/
│ ├── 2-1-subsection.md
│ └── 2-2-another.md
└── 3-next-section.md
Links must account for:
- Source page location (main page vs. subsection)
- Target page location (main page vs. subsection)
- Same section vs. cross-section links
Link Rewriting Algorithm
Sources: tools/deepwiki-scraper.py:549-593
Link Rewriting Code Structure
The fix_wiki_link function at tools/deepwiki-scraper.py:549-589 implements this logic: it first parses the target page number from the URL, detects whether the source and target pages live at the root or inside a `section-N/` directory, and then applies the following path generation rules (sketched in code after the table).
Path generation rules:
| Source Location | Target Location | Generated Path | Example |
|---|---|---|---|
| Main page | Main page | file.md | 3-next.md |
| Main page | Subsection | section-N/file.md | section-2/2-1-sub.md |
| Subsection | Main page | ../file.md | ../3-next.md |
| Subsection (same section) | Subsection | file.md | 2-2-another.md |
| Subsection (diff section) | Subsection | section-N/file.md | section-3/3-1-sub.md |
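A sketch of these rules as code (function and parameter names assumed; the real logic is fix_wiki_link at tools/deepwiki-scraper.py:549-589):

```python
def relative_link(source_section, target_section, target_file):
    # source_section / target_section: None for a main page, else the section number
    if source_section is None:
        if target_section is None:
            return target_file                                  # main -> main
        return f"section-{target_section}/{target_file}"        # main -> subsection
    if target_section is None:
        return f"../{target_file}"                              # subsection -> main
    if source_section == target_section:
        return target_file                                      # sibling in the same section
    return f"section-{target_section}/{target_file}"            # cross-section, per the table
```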
The regex replacement at tools/deepwiki-scraper.py:592 applies this transformation to all links.
For detailed explanation, see Link Rewriting Logic.
Sources: tools/deepwiki-scraper.py:549-593
Auto-Detection Mechanisms
flowchart TD
Start["build-docs.sh starts"] --> CheckRepo{"REPO env var\nprovided?"}
CheckRepo -->|Yes| UseEnv["Use provided REPO value"]
CheckRepo -->|No| CheckGit{"Is current directory\na Git repository?\n(git rev-parse --git-dir)"}
CheckGit -->|Yes| GetRemote["Get remote.origin.url:\ngit config --get\nremote.origin.url"]
CheckGit -->|No| SetEmpty["Set REPO=<empty>"]
GetRemote --> HasRemote{"Remote URL\nfound?"}
HasRemote -->|Yes| ParseURL["Parse GitHub URL using sed regex:\nExtract owner/repo"]
HasRemote -->|No| SetEmpty
ParseURL --> ValidateFormat{"Format is\nowner/repo?"}
ValidateFormat -->|Yes| SetRepo["Set REPO variable"]
ValidateFormat -->|No| SetEmpty
SetEmpty --> FinalCheck{"REPO is empty?"}
UseEnv --> Continue["Continue with REPO"]
SetRepo --> Continue
FinalCheck -->|Yes| Error["ERROR: REPO must be set\nExit with code 1"]
FinalCheck -->|No| Continue
Git Remote Auto-Detection
When REPO environment variable is not provided, the system attempts to auto-detect it from the Git repository in the current working directory.
Sources: build-docs.sh:8-37
Implementation Details
The auto-detection logic at build-docs.sh:8-19 handles multiple Git URL formats:
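A sketch of that logic (shape assumed; see build-docs.sh:8-19):

```bash
if [ -z "$REPO" ] && git rev-parse --git-dir >/dev/null 2>&1; then
  REMOTE_URL=$(git config --get remote.origin.url || true)
  REPO=$(printf '%s' "$REMOTE_URL" | sed -E 's#.*github\.com[:/]([^/]+/[^/.]+)(\.git)?.*#\1#')
fi
```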
Supported URL formats:
- HTTPS: `https://github.com/owner/repo.git`
- HTTPS (no .git): `https://github.com/owner/repo`
- SSH: `git@github.com:owner/repo.git`
- SSH (no .git): `git@github.com:owner/repo`
The regex pattern `.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*` captures:
- `[:/]` matches either `:` (SSH) or `/` (HTTPS)
- `([^/]+/[^/\.]+)` captures `owner/repo` (stopping at `/` or `.`)
- `(\.git)?` optionally matches the `.git` suffix
Derived defaults:
After determining REPO, the script derives other configuration at build-docs.sh:39-45:
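A sketch (see build-docs.sh:39-45; exact lines assumed):

```bash
REPO_OWNER="${REPO%%/*}"
BOOK_AUTHORS="${BOOK_AUTHORS:-$REPO_OWNER}"
GIT_REPO_URL="${GIT_REPO_URL:-https://github.com/$REPO}"
```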
This provides sensible defaults:
- `BOOK_AUTHORS` defaults to the repository owner
- `GIT_REPO_URL` defaults to the GitHub URL (used for "Edit this page" links)
For detailed explanation, see Auto-Detection Features.
Sources: build-docs.sh:8-45 README.md:47-53
Performance Considerations
Build Time Breakdown
Typical build times for a medium-sized repository (50-100 pages):
| Phase | Time | Bottleneck |
|---|---|---|
| Phase 1: Scraping | 60-120s | Network requests + 1s delays |
| Phase 2: Diagrams | 5-10s | Regex matching + file I/O |
| Phase 3: mdBook | 10-20s | Rust compilation + mermaid assets |
| Total | 75-150s | Network + computation |
Optimization Strategies
Network optimization:
- The scraper sleeps one second between pages (`time.sleep(1)` at tools/deepwiki-scraper.py:872)
- Retry logic with exponential backoff at tools/deepwiki-scraper.py:33-42
- HTTP session reuse via `requests.Session()` at tools/deepwiki-scraper.py:818-821
Markdown-only mode:
- Skips Phase 3 entirely, reducing build time by ~15-25%
- Useful for content-only iterations
Docker build optimization:
- Multi-stage build discards Rust toolchain (~1.5 GB)
- Final image only contains binaries (~300-400 MB)
- See Docker Multi-Stage Build for details
Caching considerations:
- No internal caching—each run fetches fresh content
- DeepWiki serves dynamic content (no cache headers)
- Docker layer caching helps with repeated image builds
Sources: tools/deepwiki-scraper.py:28-42 tools/deepwiki-scraper.py:817-821 tools/deepwiki-scraper.py:872
Extending the System
Adding New Output Formats
The system’s three-phase architecture makes it easy to add new output formats:
Integration points:
- Before Phase 3: Add code after build-docs.sh:188 to read from `$WIKI_DIR`
- Alternative Phase 3: Replace build-docs.sh:174-176 with a custom builder
- Post-processing: Add steps after build-docs.sh:192 to transform the mdBook output
Example: Adding PDF export:
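A hypothetical post-processing step using pandoc (not part of the project; a TeX engine would also be required in the image):

```bash
mkdir -p /output/pdf
for f in /output/markdown/*.md; do
  pandoc "$f" -o "/output/pdf/$(basename "${f%.md}").pdf"
done
```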
Sources: build-docs.sh:174-206
Customizing Diagram Matching
The fuzzy matching algorithm can be tuned by modifying the chunk sizes at tools/deepwiki-scraper.py:716:
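For instance (values illustrative, not the shipped ones):

```python
CHUNK_SIZES = (8, 5, 3, 2, 1)   # larger leading chunks make matching stricter
```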
Matching strategy customization:
The scoring system at tools/deepwiki-scraper.py:709-745 prioritizes:
- Anchor text matching (weighted by chunk size)
- Heading matching (weight: 50)
You can add additional heuristics by modifying the scoring logic or adding new matching strategies.
Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:716-745
Adding New Content Cleaners
The HTML-to-markdown conversion can be enhanced by adding custom cleaners at tools/deepwiki-scraper.py:489-511:
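A hypothetical additional cleaner (names assumed; the real cleaners live around those lines and operate on a BeautifulSoup tree):

```python
def remove_promo_banners(soup):
    """Drop elements whose class marks them as promotional banners."""
    for node in soup.select('[class*="banner"], [class*="promo"]'):
        node.decompose()
```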
The footer cleaner at tools/deepwiki-scraper.py:127-173 can be extended with additional patterns:
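For example (patterns illustrative):

```python
EXTRA_FOOTER_PATTERNS = [
    r"^Dismiss$",
    r"^Refresh this wiki$",
    r"^Enter email to refresh$",
]
```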
Sources: tools/deepwiki-scraper.py:127-173 tools/deepwiki-scraper.py:466-511
Common Advanced Scenarios
CI/CD Integration
GitHub Actions example:
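A sketch of the core command such a workflow step would run (shown as the shell command a step executes; mount paths are illustrative):

```bash
# Run from the checked-out repository so Git auto-detection can find the remote.
docker run --rm \
  -v "$PWD":/workspace \
  -v "$PWD/output":/output \
  -e BOOK_TITLE="My Project Documentation" \
  deepwiki-scraper
```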
The auto-detection at build-docs.sh:8-19 determines REPO from Git context. The BOOK_TITLE overrides the default.
Sources: build-docs.sh:8-45 README.md:228-232
Multi-Repository Builds
Build documentation for multiple repositories in parallel:
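A sketch using the deepwiki-scraper image tag from the development guide (repository names are illustrative):

```bash
# Launch one container per repository; each writes to its own output directory.
for repo in owner/project-a owner/project-b owner/project-c; do
  out="./output/$(basename "$repo")"
  mkdir -p "$out"
  docker run --rm -e REPO="$repo" -v "$PWD/$out":/output deepwiki-scraper &
done
wait  # block until all parallel builds finish
```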
Each build runs in an isolated container with separate output directories.
Sources: build-docs.sh:21-53 README.md:200-207
Custom Theming
Override the mdBook theme by modifying the generated book.toml template at build-docs.sh:85-103, or inject custom CSS into the generated book:
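A hedged sketch of CSS injection. The additional-css key is standard mdBook configuration, but the file names are hypothetical, and appending to book.toml assumes [output.html] is the last section of the generated file:

```bash
# Illustrative: place a stylesheet next to the book and reference it
# from the generated book.toml (verify where [output.html] ends first).
mkdir -p "$BOOK_DIR/theme" && cp my-theme/custom.css "$BOOK_DIR/theme/custom.css"
cat >> "$BOOK_DIR/book.toml" <<'EOF'
additional-css = ["theme/custom.css"]
EOF
```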
Sources: build-docs.sh:84-103
Markdown-Only Mode
Relevant source files
This document describes the MARKDOWN_ONLY mode, a special execution mode that terminates the build pipeline after markdown extraction, skipping all mdBook-related processing. This mode is primarily used for debugging the markdown extraction process and for workflows that require raw markdown files without HTML output.
For information about the complete three-phase pipeline, see Three-Phase Pipeline. For details on the mdBook build process that gets skipped in this mode, see Phase 3: mdBook Build.
Purpose and Scope
Markdown-only mode provides a lightweight execution path that produces only markdown files, bypassing the mdBook initialization, template injection, and HTML generation stages. This mode is controlled by the MARKDOWN_ONLY environment variable and affects the execution flow in build-docs.sh.
Key characteristics:
- Executes only Phase 1 (markdown extraction) of the pipeline
- Produces markdown/ and raw_markdown/ outputs
- Skips book/ HTML generation and book.toml configuration
- Reduces execution time by ~50-70% (no mdBook build overhead)
- Useful for debugging, alternative workflows, and markdown inspection
Sources: scripts/build-docs.sh:26-93 README.md:37
Execution Path Comparison
The following diagram illustrates how the MARKDOWN_ONLY flag alters the execution flow in build-docs.sh:
Figure 1: Build Pipeline Execution Paths
graph TB
Start["build-docs.sh starts"]
Config["Configuration phase\n[lines 8-59]"]
Step1["Step 1: deepwiki-scraper.py\nScrape wiki & extract markdown\n[lines 61-65]"]
Check{"MARKDOWN_ONLY == true\n[line 68]"}
subgraph "Markdown-Only Path [lines 69-92]"
CopyMD["Step 2: Copy markdown/\nfrom WIKI_DIR to OUTPUT_DIR\n[lines 70-73]"]
CopyRaw["Step 3: Copy raw_markdown/\nfrom RAW_DIR to OUTPUT_DIR\n[lines 75-81]"]
Exit1["Exit with status 0\n[line 92]"]
end
subgraph "Standard Path [lines 95-309]"
Step2["Step 2: Initialize mdBook structure\nCreate book.toml, src/ directory\n[lines 96-122]"]
Step3["Step 3: Generate SUMMARY.md\nDiscover files, build TOC\n[lines 124-188]"]
Step4["Step 4: Process templates\nInject header/footer HTML\n[lines 190-261]"]
Step5["Step 5: Install mdbook-mermaid\n[lines 263-266]"]
Step6["Step 6: mdbook build\n[lines 268-271]"]
Step7["Step 7: Copy all outputs\n[lines 273-294]"]
Exit2["Exit with status 0"]
end
Start --> Config
Config --> Step1
Step1 --> Check
Check -->|true| CopyMD
CopyMD --> CopyRaw
CopyRaw --> Exit1
Check -->|false default| Step2
Step2 --> Step3
Step3 --> Step4
Step4 --> Step5
Step5 --> Step6
Step6 --> Step7
Step7 --> Exit2
Sources: scripts/build-docs.sh:1-310
Configuration
The MARKDOWN_ONLY mode is enabled by setting the environment variable to the string "true". Any other value (including unset) results in standard mode execution.
Environment variable:
- Name: MARKDOWN_ONLY
- Type: String boolean
- Default: "false"
- Valid values: "true" (markdown-only) or any other value (standard mode)
Example usage:
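For example (image tag as used in the development guide):

```bash
# Extract markdown only; the mdBook build phase is skipped entirely.
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$PWD/output":/output \
  deepwiki-scraper
```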
Sources: scripts/build-docs.sh:26 README.md:37
Output Structure
The output directory structure differs significantly between standard and markdown-only modes:
Table 1: Output Comparison
| Output Path | Standard Mode | Markdown-Only Mode | Description |
|---|---|---|---|
| output/book/ | ✓ Generated | ✗ Not created | Searchable HTML documentation |
| output/markdown/ | ✓ Generated | ✓ Generated | Enhanced markdown with diagrams |
| output/raw_markdown/ | ✓ Generated | ✓ Generated | Pre-enhancement markdown snapshots |
| output/book.toml | ✓ Generated | ✗ Not created | mdBook configuration file |
In markdown-only mode, the output/markdown/ directory contains:
- All scraped wiki pages as .md files with numeric prefixes (e.g., 1-Overview.md)
- Section subdirectories (e.g., section-4/, section-5/)
- Enhanced content with Mermaid diagrams inserted via fuzzy matching
- No template injection (no header/footer HTML)
The output/raw_markdown/ directory contains:
- Pre-enhancement markdown files (before diagram injection)
- Useful for comparing before/after states
- Only present if RAW_DIR exists (created by deepwiki-scraper.py)
Sources: scripts/build-docs.sh:70-92 scripts/build-docs.sh:287-294
Technical Implementation
The markdown-only mode implementation is located in the orchestration script and uses a simple conditional to bypass most processing steps.
Figure 2: Implementation Logic in build-docs.sh
graph TD
ConfigVar["MARKDOWN_ONLY variable\nline 26: default='false'"]
Step1Complete["deepwiki-scraper.py complete\nWIKI_DIR and RAW_DIR populated"]
Conditional["if [ MARKDOWN_ONLY = true ]\nline 68"]
subgraph "Early Exit Block [lines 69-92]"
EchoStep2["echo 'Step 2: Copying markdown...'\nline 70"]
RemoveMD["rm -rf OUTPUT_DIR/markdown\nline 71"]
MkdirMD["mkdir -p OUTPUT_DIR/markdown\nline 72"]
CopyMD["cp -r WIKI_DIR/. OUTPUT_DIR/markdown/\nline 73"]
CheckRaw{"if [ -d RAW_DIR ]\nline 75"}
EchoStep3["echo 'Step 3: Copying raw...'\nline 77"]
RemoveRaw["rm -rf OUTPUT_DIR/raw_markdown\nline 78"]
MkdirRaw["mkdir -p OUTPUT_DIR/raw_markdown\nline 79"]
CopyRaw["cp -r RAW_DIR/. OUTPUT_DIR/raw_markdown/\nline 80"]
EchoComplete["echo 'Markdown extraction complete!'\nlines 84-91"]
ExitZero["exit 0\nline 92"]
end
ContinueStandard["Continue to Step 2:\nInitialize mdBook structure\nline 96+"]
ConfigVar --> Step1Complete
Step1Complete --> Conditional
Conditional -->|= true| EchoStep2
EchoStep2 --> RemoveMD
RemoveMD --> MkdirMD
MkdirMD --> CopyMD
CopyMD --> CheckRaw
CheckRaw -->|RAW_DIR exists| EchoStep3
CheckRaw -->|RAW_DIR missing| EchoComplete
EchoStep3 --> RemoveRaw
RemoveRaw --> MkdirRaw
MkdirRaw --> CopyRaw
CopyRaw --> EchoComplete
EchoComplete --> ExitZero
Conditional -->|!= true| ContinueStandard
Key implementation details:
- String comparison: The check uses shell string equality [ "$MARKDOWN_ONLY" = "true" ], not numeric comparison
- Defensive copying: Uses rm -rf before mkdir -p to ensure clean output directories
- Dot notation: cp -r "$WIKI_DIR"/. "$OUTPUT_DIR/markdown/" copies contents, not the directory itself
- Conditional raw copy: Only copies raw_markdown/ if the directory exists (some runs may not generate it)
- Early exit: Uses exit 0 to terminate successfully, preventing any subsequent processing
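Assembled from the figure above, the early-exit block looks roughly like this (a reconstruction from the documented behavior, not the verbatim script):

```bash
if [ "$MARKDOWN_ONLY" = "true" ]; then
  echo "Step 2: Copying markdown files to output (markdown-only mode)..."
  rm -rf "$OUTPUT_DIR/markdown" && mkdir -p "$OUTPUT_DIR/markdown"
  cp -r "$WIKI_DIR"/. "$OUTPUT_DIR/markdown/"
  if [ -d "$RAW_DIR" ]; then
    echo "Step 3: Copying raw markdown snapshots..."
    rm -rf "$OUTPUT_DIR/raw_markdown" && mkdir -p "$OUTPUT_DIR/raw_markdown"
    cp -r "$RAW_DIR"/. "$OUTPUT_DIR/raw_markdown/"
  fi
  echo "✓ Markdown extraction complete!"
  exit 0   # skip all mdBook-related processing
fi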
Sources: scripts/build-docs.sh:68-92
Use Cases
Markdown-only mode supports several workflows beyond standard documentation generation:
Table 2: Common Use Cases
| Use Case | Benefit | Typical Workflow |
|---|---|---|
| Scraper debugging | Inspect raw extraction results without mdBook overhead | Enable mode → examine raw_markdown/ → modify scraper → re-run |
| Diagram placement testing | Verify fuzzy matching results in markdown/ files | Enable mode → search for mermaid blocks → adjust matching logic |
| Alternative build tools | Use Hugo, Jekyll, Docusaurus instead of mdBook | Enable mode → feed markdown/ to different static site generator |
| CI/CD optimization | Faster feedback loop for markdown quality checks | Enable mode in test pipeline → validate markdown syntax → fail fast |
| Content export | Extract DeepWiki content for archival or migration | Enable mode → preserve markdown/ directory → import elsewhere |
| Performance profiling | Isolate Phase 1 performance from mdBook build time | Time execution with/without mode → identify bottlenecks |
Sources: scripts/build-docs.sh:26 (comment), README.md:37
Skipped Processing Steps
When MARKDOWN_ONLY=true, the following operations are completely bypassed:
Figure 3: Skipped Components in Markdown-Only Mode
graph TB
subgraph "Skipped: mdBook Initialization [lines 95-122]"
BookDir["mkdir -p BOOK_DIR/src"]
BookToml["Generate book.toml\nwith title, authors, config"]
end
subgraph "Skipped: SUMMARY.md Generation [lines 124-188]"
ScanFiles["ls WIKI_DIR/*.md\nDiscover file structure"]
SortNumeric["sort -t- -k1 -n\nNumeric sorting"]
ExtractTitles["head -1 file / sed 's/^# //'\nExtract titles"]
BuildTOC["Generate nested TOC\nwith section subdirectories"]
WriteSummary["Write src/SUMMARY.md"]
end
subgraph "Skipped: Template Processing [lines 190-261]"
LoadHeader["Load templates/header.html"]
LoadFooter["Load templates/footer.html"]
ProcessVars["process-template.py\nVariable substitution"]
InjectHTML["Inject header/footer\ninto all .md files"]
end
subgraph "Skipped: mdBook Build [lines 263-271]"
MermaidInstall["mdbook-mermaid install"]
MdBookBuild["mdbook build"]
GenerateHTML["Generate book/\nHTML output"]
end
Note["None of these operations\nexecute when MARKDOWN_ONLY=true"]
BookDir -.-> Note
BookToml -.-> Note
ScanFiles -.-> Note
SortNumeric -.-> Note
ExtractTitles -.-> Note
BuildTOC -.-> Note
WriteSummary -.-> Note
LoadHeader -.-> Note
LoadFooter -.-> Note
ProcessVars -.-> Note
InjectHTML -.-> Note
MermaidInstall -.-> Note
MdBookBuild -.-> Note
GenerateHTML -.-> Note
Impact of skipped steps:
- No template injection: Markdown files contain pure content without HTML header/footer wrappers
- No book.toml: Configuration metadata is not generated (title, authors, etc.)
- No SUMMARY.md: Table of contents is not created (users must navigate by filename)
- No Mermaid rendering: Diagrams remain as code blocks (not rendered to SVG)
- No search index : Full-text search functionality is not available
- Faster execution : Typical time savings of 50-70% depending on content size
Sources: scripts/build-docs.sh:95-309
Console Output Differences
The console output differs between modes, providing clear feedback about the execution path:
Standard mode output:
Step 1: Scraping wiki from DeepWiki...
Step 2: Initializing mdBook structure...
Step 3: Generating SUMMARY.md from scraped content...
Step 4: Copying and processing markdown files to book...
Step 5: Installing mdbook-mermaid assets...
Step 6: Building mdBook...
Step 7: Copying outputs to /output...
✓ Documentation build complete!
Markdown-only mode output:
Step 1: Scraping wiki from DeepWiki...
Step 2: Copying markdown files to output (markdown-only mode)...
Step 3: Copying raw markdown snapshots...
✓ Markdown extraction complete!
The markdown-only output explicitly mentions “(markdown-only mode)” in Step 2 and terminates after Step 3, providing clear confirmation of the execution path.
Sources: scripts/build-docs.sh:69-91 scripts/build-docs.sh:296-309
Integration with CI/CD
Markdown-only mode can be used in GitHub Actions workflows for testing and validation without full HTML builds:
Example: Markdown quality check workflow
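A sketch of the commands such a workflow step could run; the linter invocation is illustrative and not part of this repository:

```bash
# Extract markdown without building HTML, then lint the result.
docker run --rm -e REPO=owner/repo -e MARKDOWN_ONLY=true \
  -v "$PWD/output":/output deepwiki-scraper
# Illustrative validation step; any markdown linter works here.
npx markdownlint-cli "output/markdown/**/*.md"
```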
This workflow extracts markdown, then validates it with linters and Mermaid syntax checkers without the overhead of building HTML documentation.
Sources: README.md:37 scripts/build-docs.sh:26-92
graph LR
subgraph "Iteration Loop"
Enable["Set MARKDOWN_ONLY=true"]
Run["docker run deepwiki-to-mdbook"]
Examine["Examine output/markdown/\nand output/raw_markdown/"]
Issue{"Found issue?"}
ModifyCode["Modify deepwiki-scraper.py\nor diagram processing"]
Rebuild["docker build -t deepwiki-to-mdbook ."]
end
Complete["Issue resolved"]
FullBuild["Remove MARKDOWN_ONLY\nRun full build\nVerify HTML output"]
Enable --> Run
Run --> Examine
Examine --> Issue
Issue -->|Yes| ModifyCode
ModifyCode --> Rebuild
Rebuild --> Run
Issue -->|No| Complete
Complete --> FullBuild
Debugging Workflow
A typical debugging workflow using markdown-only mode:
Figure 4: Iterative Debugging with MARKDOWN_ONLY
Benefits of this approach:
- Fast feedback: Markdown extraction completes in seconds vs. minutes for full builds
- Isolated testing: Changes to scraper logic can be verified without mdBook complications
- Raw comparison: raw_markdown/ shows the pre-enhancement state for before/after analysis
- Incremental development: Iterate rapidly on extraction logic before testing the full pipeline
Sources: scripts/build-docs.sh:68-92
Limitations and Considerations
Markdown-only mode has specific limitations that users should be aware of:
Table 3: Limitations
| Limitation | Impact | Workaround |
|---|---|---|
| No HTML output | Cannot preview documentation in browser | Use markdown preview tool or build separately |
| No template injection | Missing repository links and metadata | Add manually or post-process markdown files |
| No SUMMARY.md | No automated navigation structure | Generate TOC with alternative tool |
| No Mermaid rendering | Diagrams remain as code blocks | Use Mermaid CLI or other rendering tool |
| No validation | Broken links not detected | Run mdBook build occasionally to catch issues |
When NOT to use markdown-only mode:
- Final production builds destined for GitHub Pages
- Workflows requiring search functionality
- When testing template customizations
- Validation of cross-references and internal links
Sources: scripts/build-docs.sh:26 scripts/build-docs.sh:68-92
Numbering and Path Resolution
Relevant source files
This document explains how the system transforms DeepWiki’s page numbering scheme into the file structure used in the generated mdBook documentation. The process involves three key operations: (1) normalizing DeepWiki page numbers by shifting them down by one, (2) resolving normalized numbers to file paths and directory locations, and (3) rewriting internal wiki links to use correct relative paths.
The numbering and path resolution system is foundational to maintaining a consistent file structure and ensuring that cross-references between wiki pages function correctly in the final mdBook output.
For information about the overall markdown extraction process, see page 6. For details about file organization and directory structure, see page 10.
Overview of Operations
The system performs three distinct but related operations:
| Operation | Function | Purpose |
|---|---|---|
| Number Normalization | normalized_number_parts() | Shift DeepWiki numbers down by 1 (page 1 becomes unnumbered) |
| Path Resolution | resolve_output_path() | Generate filename and section directory from page number |
| Link Rewriting | fix_wiki_link() | Convert absolute URLs to relative markdown paths |
Sources: python/deepwiki-scraper.py:28-64
Numbering Scheme Transformation
DeepWiki Numbering Convention
DeepWiki numbers pages starting from 1, with subsections using dot notation (e.g., 1, 2, 2.1, 2.2, 3, 3.1). This numbering includes an “overview” page as page 1, which the system treats specially.
Normalization Algorithm
The normalized_number_parts() function shifts all page numbers down by one, making the overview page unnumbered and adjusting all subsequent numbers:
Diagram: Number Normalization Transformation
graph LR
subgraph "DeepWiki Numbering"
DW1["1 (Overview)"]
DW2["2"]
DW3["3"]
DW4["4.1"]
DW5["4.2"]
end
subgraph "normalized_number_parts()"
Norm["Subtract 1 from\nmain number"]
end
subgraph "Normalized Numbering"
N1["[] (Unnumbered)"]
N2["[1]"]
N3["[2]"]
N4["[3, 1]"]
N5["[3, 2]"]
end
DW1 --> Norm
DW2 --> Norm
DW3 --> Norm
DW4 --> Norm
DW5 --> Norm
Norm --> N1
Norm --> N2
Norm --> N3
Norm --> N4
Norm --> N5
Sources: python/deepwiki-scraper.py:28-43
Implementation Details
The function parses the page number string and applies the following rules:
Diagram: normalized_number_parts() Control Flow
Sources: python/deepwiki-scraper.py:28-43
Numbering Examples
| DeepWiki Number | Input | Normalized Parts | Notes |
|---|---|---|---|
"1" | Overview page | [] | Unnumbered in output |
"2" | Second page | ["1"] | Becomes first numbered page |
"3" | Third page | ["2"] | Becomes second numbered page |
"1.3" | Overview subsection | ["1", "3"] | Special case: kept as page 1 |
"4.2" | Subsection | ["3", "2"] | Main number decremented |
Sources: python/tests/test_numbering.py:1-13
Path Resolution
graph TB
Input["resolve_output_path(page_number, title)"]
Sanitize["sanitize_filename(title)\nConvert title to safe filename slug"]
Normalize["normalized_number_parts(page_number)\nGet normalized parts"]
CheckParts{"Parts valid\nand non-empty?"}
NoNumber["filename = slug + '.md'\nsection_dir = None"]
BuildFilename["filename = parts.join('-') + '-' + slug + '.md'"]
CheckLevel{"len(parts) > 1?"}
WithSection["section_dir = 'section-' + parts[0]"]
NoSection["section_dir = None"]
Return["Return (filename, section_dir)"]
Input --> Sanitize
Input --> Normalize
Sanitize --> CheckParts
Normalize --> CheckParts
CheckParts -->|No| NoNumber
CheckParts -->|Yes| BuildFilename
BuildFilename --> CheckLevel
CheckLevel -->|Yes| WithSection
CheckLevel -->|No| NoSection
NoNumber --> Return
WithSection --> Return
NoSection --> Return
File Path Generation
The resolve_output_path() function converts normalized page numbers into file paths, determining both the filename and the optional section directory.
Diagram: Path Resolution Algorithm
Sources: python/deepwiki-scraper.py:45-53
Directory Structure Mapping
Diagram: File Organization After Path Resolution
Sources: python/deepwiki-scraper.py:45-53 python/tests/test_numbering.py:15-31
Path Resolution Examples
| DeepWiki Number | Title | Filename | Section Directory |
|---|---|---|---|
"1" | “Overview Title” | overview-title.md | None |
"3" | “System Architecture” | 2-system-architecture.md | None |
"5.2" | “HTML to Markdown Conversion” | 4-2-html-to-markdown-conversion.md | "section-4" |
"2.1" | “Components” | 1-1-components.md | "section-1" |
Sources: python/tests/test_numbering.py:15-31
Link Rewriting
Target Path Construction
The build_target_path() function constructs the full relative path for link targets, including section directories when appropriate:
Diagram: Target Path Construction Logic
flowchart TD
Start["build_target_path(page_number, slug)"]
Sanitize["slug = sanitize_filename(slug)"]
Normalize["parts = normalized_number_parts(page_number)"]
CheckParts{"Parts valid?"}
SimpleFile["Return slug + '.md'"]
BuildFile["filename = parts.join('-') + '-' + slug + '.md'"]
CheckSub{"len(parts) > 1?"}
WithDir["Return 'section-' + parts[0] + '/' + filename"]
JustFile["Return filename"]
Start --> Sanitize
Start --> Normalize
Sanitize --> CheckParts
Normalize --> CheckParts
CheckParts -->|No| SimpleFile
CheckParts -->|Yes| BuildFile
BuildFile --> CheckSub
CheckSub -->|Yes| WithDir
CheckSub -->|No| JustFile
Sources: python/deepwiki-scraper.py:55-63
Link Transformation Overview
DeepWiki uses absolute URL paths for internal wiki links in the format /owner/repo/N-page-title or /owner/repo/N.M-subsection-title. These links must be rewritten to relative Markdown file paths that respect the mdBook directory structure where:
- Main pages (e.g., “1-overview”, “2-architecture”) reside in the root markdown directory
- Subsections (e.g., “2.1-subsection”, “2.2-another”) reside in subdirectories named section-N/
- File names use hyphens instead of dots (e.g., 2-1-subsection.md instead of 2.1-subsection.md)
The rewriting logic must compute the correct relative path based on both the source page location and the target page location.
Sources: python/deepwiki-scraper.py:854-875
graph TB
Root["Root Directory\n(output/markdown/)"]
Main1["overview.md\n(Unnumbered)"]
Main2["1-architecture.md\n(Main Page)"]
Main3["2-installation.md\n(Main Page)"]
Section1["section-1/\n(Subsection Directory)"]
Section2["section-2/\n(Subsection Directory)"]
Sub1_1["1-1-components.md\n(Subsection)"]
Sub1_2["1-2-workflows.md\n(Subsection)"]
Sub2_1["2-1-docker-setup.md\n(Subsection)"]
Sub2_2["2-2-manual-setup.md\n(Subsection)"]
Root --> Main1
Root --> Main2
Root --> Main3
Root --> Section1
Root --> Section2
Section1 --> Sub1_1
Section1 --> Sub1_2
Section2 --> Sub2_1
Section2 --> Sub2_2
Relative Path Strategy
The system organizes markdown files into a hierarchical structure that affects link rewriting:
Diagram: File Organization Hierarchy
This structure requires different relative path strategies depending on where the link originates and where it points.
Sources: python/deepwiki-scraper.py:843-851
flowchart TD
Start["Markdown Content"]
Regex["Apply Regex Pattern:\n'\\]\\(/[^/]+/[^/]+/([^)]+)\\)'"]
Extract["Extract Path Component:\ne.g., '4-query-planning'"]
Parse["Parse Page Number and Slug:\nPattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
PageNum["page_num\n(e.g., '2.1' or '4')"]
Slug["slug\n(e.g., 'query-planning')"]
Start --> Regex
Regex --> Extract
Extract --> Parse
Parse --> PageNum
Parse --> Slug
Link Pattern Matching
The algorithm begins by matching markdown links that reference the DeepWiki URL structure using a regular expression pattern.
Diagram: Link Pattern Matching Flow
The regex \]\(/[^/]+/[^/]+/([^)]+)\) captures the path component after the repository identifier. For example, in a link such as ](/owner/repo/4-query-planning), it captures 4-query-planning.
Sources: python/deepwiki-scraper.py:875
flowchart TD
Start["extract_page_content(url, session, current_page_info)"]
CheckInfo{"current_page_info\nprovided?"}
NoInfo["source_section_dir = None\n(Default to root)"]
GetPageNum["page_number = current_page_info['number']\ntitle = current_page_info['title']"]
ResolvePath["resolve_output_path(page_number, title)\nReturns (filename, section_dir)"]
SetSource["source_section_dir = section_dir"]
DefineRewriter["Define fix_wiki_link()\nusing source_section_dir"]
ApplyRewriter["markdown = re.sub(pattern, fix_wiki_link, markdown)"]
Start --> CheckInfo
CheckInfo -->|No| NoInfo
CheckInfo -->|Yes| GetPageNum
GetPageNum --> ResolvePath
ResolvePath --> SetSource
NoInfo --> DefineRewriter
SetSource --> DefineRewriter
DefineRewriter --> ApplyRewriter
Source Location Detection
The system determines the source page’s location from the current_page_info parameter passed to extract_page_content():
Diagram: Source Location Detection in extract_page_content
Sources: python/deepwiki-scraper.py:843-851 python/deepwiki-scraper.py:875
Relative Path Calculation
The relative path is computed based on the combination of source and target locations:
| Source Location | Target Location | Relative Path Strategy | Example |
|---|---|---|---|
| Root | Root | Direct filename | 2-installation.md |
| Root | section-N/ | Section prefix + filename | section-1/1-1-components.md |
| section-N/ | Root | Parent directory prefix | ../2-installation.md |
| section-N/ | Same section-N/ | Direct filename | 1-2-workflows.md |
| section-N/ | Different section-M/ | Parent + section prefix | ../section-2/2-1-setup.md |
Sources: python/deepwiki-scraper.py:854-871
flowchart TD
Start["fix_wiki_link(match)"]
ExtractPath["full_path = match.group(1)\n(e.g., '4-query-planning')"]
ParseLink["link_match = re.search(pattern, full_path)"]
Success{"Match\nsuccessful?"}
NoMatch["Return match.group(0)\n(link unchanged)"]
ExtractParts["page_num = link_match.group(1)\nslug = link_match.group(2)"]
BuildTarget["target_path = build_target_path(page_num, slug)"]
CheckSource{"source_section_dir\nexists?"}
NoSource["Return target_path\n(as-is from root)"]
CheckTargetDir{"target_path starts\nwith 'section-'?"}
NoDir["Add '../' prefix\nReturn '../' + target_path"]
CheckSameSection{"target_path starts with\nsource_section_dir + '/'?"}
SameSection["Strip section directory\nReturn filename only"]
OtherSection["Add '../' prefix\nReturn '../' + target_path"]
Start --> ExtractPath
ExtractPath --> ParseLink
ParseLink --> Success
Success -->|No| NoMatch
Success -->|Yes| ExtractParts
ExtractParts --> BuildTarget
BuildTarget --> CheckSource
CheckSource -->|No| NoSource
CheckSource -->|Yes| CheckTargetDir
CheckTargetDir -->|No| NoDir
CheckTargetDir -->|Yes| CheckSameSection
CheckSameSection -->|Yes| SameSection
CheckSameSection -->|No| OtherSection
The fix_wiki_link Implementation
The core implementation is a nested function fix_wiki_link defined within extract_page_content() that serves as a callback for re.sub:
Diagram: fix_wiki_link Function Control Flow
The function delegates target path construction to build_target_path(), then adjusts the path based on the source location captured in the closure variable source_section_dir.
Sources: python/deepwiki-scraper.py:854-871
Code Entity Mapping
Key functions involved in the path resolution pipeline:
| Function | Location | Purpose |
|---|---|---|
| normalized_number_parts() | python/deepwiki-scraper.py:28-43 | Shift page numbers down by 1 |
| resolve_output_path() | python/deepwiki-scraper.py:45-53 | Convert number + title to filename and section |
| build_target_path() | python/deepwiki-scraper.py:55-63 | Construct full relative path for link targets |
| fix_wiki_link() | python/deepwiki-scraper.py:854-871 | Rewrite individual link (nested function) |
| extract_page_content() | python/deepwiki-scraper.py:751-877 | Main extraction function with link rewriting |
Sources: python/deepwiki-scraper.py:28-63 python/deepwiki-scraper.py:751-877
Link Rewriting Examples
Scenario 1: Root to Root
When a root-level page (e.g., overview.md) links to another root-level page (e.g., 2-installation.md):
- Source: overview.md (source_section_dir = None)
- Target: DeepWiki page 3 → normalized to 2-installation.md
- Input Link: [Installation](/owner/repo/3-installation)
- target_path: 2-installation.md (no section prefix)
- Generated Path: 2-installation.md (no adjustment needed)
- Reason: Both files are in the root directory
Sources: python/deepwiki-scraper.py:854-871
Scenario 2: Root to Subsection
When a root-level page (e.g., 1-architecture.md) links to a subsection (e.g., 2.1-components → 1-1-components.md):
- Source: 1-architecture.md (source_section_dir = None)
- Target: DeepWiki page 2.1 → normalized to section-1/1-1-components.md
- Input Link: [Components](/owner/repo/2.1-components)
- target_path: section-1/1-1-components.md
- Generated Path: section-1/1-1-components.md (no adjustment needed)
- Reason: Target is in a subdirectory, source is in root
Sources: python/deepwiki-scraper.py:854-871
Scenario 3: Subsection to Root
When a subsection (e.g., section-1/1-1-components.md) links to a root-level page:
- Source: section-1/1-1-components.md (source_section_dir = "section-1")
- Target: DeepWiki page 3 → normalized to 2-installation.md
- Input Link: [Installation](/owner/repo/3-installation)
- target_path: 2-installation.md (doesn’t start with “section-”)
- Generated Path: ../2-installation.md (add parent directory)
- Reason: Source is in a subdirectory, target is in the parent directory
Sources: python/deepwiki-scraper.py:868-870
Scenario 4: Subsection to Same Section
When a subsection links to another subsection in the same section:
- Source: section-1/1-1-components.md (source_section_dir = "section-1")
- Target: DeepWiki page 2.2 → normalized to section-1/1-2-workflows.md
- Input Link: [Workflows](/owner/repo/2.2-workflows)
- target_path: section-1/1-2-workflows.md
- Generated Path: 1-2-workflows.md (strip section directory)
- Reason: Both files are in the same section-1/ directory
Sources: python/deepwiki-scraper.py:864-866
Scenario 5: Subsection to Different Section
When a subsection links to a subsection in a different section:
- Source: section-1/1-1-components.md (source_section_dir = "section-1")
- Target: DeepWiki page 3.1 → normalized to section-2/2-1-setup.md
- Input Link: [Docker Setup](/owner/repo/3.1-setup)
- target_path: section-2/2-1-setup.md
- Generated Path: ../section-2/2-1-setup.md (go to parent, then other section)
- Reason: Different section directories require parent navigation
Sources: python/deepwiki-scraper.py:868-870
sequenceDiagram
participant Main as main()
participant EWS as extract_wiki_structure()
participant EPC as extract_page_content()
participant ROP as resolve_output_path()
participant FWL as fix_wiki_link()
Main->>EWS: Discover pages
EWS-->>Main: pages list with DeepWiki numbers
loop For each page
Note over Main: page = {number: 2.1, title: Components, level: 1}
Main->>EPC: extract_page_content(url, session, page)
Note over EPC: Convert HTML to markdown
EPC->>ROP: Get source location
ROP-->>EPC: source_section_dir = "section-1"
Note over EPC: Define fix_wiki_link() with closure over source_section_dir
EPC->>FWL: Apply via re.sub() for each link
FWL->>FWL: Parse target page number
FWL->>FWL: build_target_path()
FWL->>FWL: Adjust for source location
FWL-->>EPC: Rewritten relative path
EPC-->>Main: Markdown with fixed links
Main->>ROP: Determine output path
ROP-->>Main: (filename, section_dir)
Note over Main: Write to section-1/1-1-components.md
end
Integration in Content Extraction Pipeline
The numbering and path resolution components integrate into the main extraction flow:
Diagram: Integration Sequence Across Extraction Pipeline
The link rewriting occurs at line 875 using re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', fix_wiki_link, markdown), which finds all internal wiki links and replaces them with their rewritten versions.
Sources: python/deepwiki-scraper.py:1310-1353 python/deepwiki-scraper.py:843-877
flowchart TD
Input["normalized_number_parts('abc')"]
Split["parts = 'abc'.split('.')"]
TryParse["Try int(parts[0])"]
ValueError["ValueError exception"]
ReturnNone["Return None"]
Input --> Split
Split --> TryParse
TryParse --> ValueError
ValueError --> ReturnNone
Edge Cases and Special Handling
Invalid Page Numbers
If normalized_number_parts() receives an invalid page number (non-numeric main component), it returns None:
Diagram: Invalid Number Handling
This graceful failure allows resolve_output_path() and build_target_path() to fall back to simple slug-based filenames.
Sources: python/deepwiki-scraper.py:28-43
Malformed Links
If a link doesn’t match the expected pattern (\d+(?:\.\d+)*)-(.+)$, fix_wiki_link() returns the original match unchanged:
This ensures that malformed or external links are preserved in their original form.
Sources: python/deepwiki-scraper.py:856-872
Missing Page Context
If current_page_info is not provided to extract_page_content(), the function defaults to treating the source as a root-level page:
This allows the function to work in degraded mode, though links from subsections may not be correctly rewritten.
Sources: python/deepwiki-scraper.py:843-850
Overview Page Special Case
The overview page (DeepWiki page 1) is treated specially:
- normalized_number_parts("1") returns [] (empty list)
- resolve_output_path("1", "Overview") returns ("overview.md", None)
- The file is placed at root level with no numeric prefix
Subsections of the overview (e.g., 1.3) are handled differently:
- normalized_number_parts("1.3") returns ["1", "3"]
- Main number is kept as 1 (not decremented to 0)
- These become section-1/1-3-subsection.md
Sources: python/deepwiki-scraper.py:28-43 python/tests/test_numbering.py:1-13
Performance Considerations
The link rewriting is performed using a single re.sub call with a callback function, which is efficient for typical wiki pages with dozens to hundreds of links. The regex compilation is implicit and cached by Python’s re module.
The algorithm has O(n) complexity where n is the number of internal links in the page, with each link requiring constant-time string operations.
Sources: tools/deepwiki-scraper.py:592
Testing and Validation
The correctness of link rewriting can be validated by:
- Checking that generated links use the .md extension
- Verifying that links from subsections to main pages use ../
- Confirming that links to subsections use the section-N/ prefix when appropriate
- Testing that cross-section subsection links resolve correctly
The mdBook build process will fail if links are incorrectly rewritten, providing a validation mechanism during Phase 3 of the pipeline.
Sources: tools/deepwiki-scraper.py:547-594
Auto-Detection Features
Relevant source files
Purpose and Scope
This document details the automatic configuration detection mechanisms in the DeepWiki-to-mdBook converter. These features enable the system to operate with minimal manual configuration by intelligently inferring settings from the Git environment and repository metadata.
The auto-detection system primarily operates in the build orchestration script and focuses on repository identification and related URL construction. For information about other configuration options that require explicit setting, see Configuration Reference. For the overall build orchestration process, see build-docs.sh Orchestrator.
Overview of Auto-Detection
The system implements two categories of auto-detection:
| Category | Features | Fallback Behavior |
|---|---|---|
| Primary Detection | Repository identification from Git remote | Fails with error if not detected and not provided |
| Derived Configuration | Author names, URLs, badge links | Uses sensible defaults based on detected repository |
The auto-detection executes during the initialization phase of scripts/build-docs.sh:8-46 before any content scraping or processing begins.
Repository Auto-Detection Flow
Detection Algorithm
Detection Algorithm Flow
Sources: scripts/build-docs.sh:8-38
Git Remote Parsing
The repository detection uses a single sed regular expression to handle multiple GitHub URL formats:
Git Remote URL Parsing
graph LR
subgraph "Supported URL Formats"
HTTPS["https://github.com/owner/repo.git"]
HTTPSNOGIT["https://github.com/owner/repo"]
SSH["git@github.com:owner/repo.git"]
SSHNOGIT["git@github.com:owner/repo"]
end
subgraph "Extraction Process"
GitConfig["git config --get\nremote.origin.url"]
SedRegex["sed -E\ngithub\.com[:/]([^/]+/[^/\.]+)"]
RepoVar["REPO variable\nowner/repo"]
end
HTTPS --> GitConfig
HTTPSNOGIT --> GitConfig
SSH --> GitConfig
SSHNOGIT --> GitConfig
GitConfig --> SedRegex
SedRegex --> RepoVar
The parsing logic at scripts/build-docs.sh:16 uses this pattern:
sed -E 's#.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*#\1#'
| Pattern Component | Purpose |
|---|---|
| .*github\.com | Match any characters before github.com |
| [:/] | Match either : (SSH) or / (HTTPS) separator |
| ([^/]+/[^/\.]+) | Capture group: owner/repo (stops at / or .) |
| (\.git)? | Optional .git suffix |
| .* | Match remaining characters |
| #\1# | Replace entire string with capture group 1 |
Sources: scripts/build-docs.sh:8-19
Derived Configuration Values
Once the REPO value is established (either through auto-detection or explicit setting), the system derives several related configuration values automatically.
graph TB
REPO["REPO\n(owner/repo)"]
Split["String split on '/'"]
RepoOwner["REPO_OWNER\n(first segment)"]
RepoName["REPO_NAME\n(second segment)"]
subgraph "Derived URLs"
GitURL["GIT_REPO_URL\nhttps://github.com/owner/repo"]
DeepWikiURL["DEEPWIKI_URL\nhttps://deepwiki.com/owner/repo"]
end
subgraph "Derived Badges"
DeepWikiBadge["DEEPWIKI_BADGE_URL\nhttps://deepwiki.com/badge.svg"]
GitHubBadge["GITHUB_BADGE_URL\nimg.shields.io badge"]
end
subgraph "Default Metadata"
BookAuthors["BOOK_AUTHORS\n(defaults to REPO_OWNER)"]
end
REPO --> Split
Split --> RepoOwner
Split --> RepoName
RepoOwner --> BookAuthors
REPO --> GitURL
REPO --> DeepWikiURL
DeepWikiURL --> DeepWikiBadge
REPO --> GitHubBadge
Derivation Chain
Configuration Value Derivation
Sources: scripts/build-docs.sh:40-51
Default Value Assignment
The script uses shell parameter expansion with default values at scripts/build-docs.sh:44-46:
| Variable | Default Value | Condition |
|---|---|---|
| BOOK_AUTHORS | $REPO_OWNER | If not explicitly set |
| GIT_REPO_URL | https://github.com/$REPO | If not explicitly set |
| DEEPWIKI_URL | https://deepwiki.com/$REPO | Always constructed |
| DEEPWIKI_BADGE_URL | https://deepwiki.com/badge.svg | Always constructed |
| GITHUB_BADGE_URL | https://img.shields.io/badge/GitHub-{label}-181717?logo=github | Always constructed with URL encoding |
Sources: scripts/build-docs.sh:44-51
Badge URL Construction
The GitHub badge URL requires special encoding for the repository label at scripts/build-docs.sh:50-51:
GitHub Badge URL Encoding
The encoding is necessary because the badge service interprets - and / as special characters. The double-dash (--) escapes the hyphen, and %2F is the URL encoding for forward slash.
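A sketch of the encoding, assuming the variable names used above:

```bash
# Encode "owner/repo" for the shields.io badge label (hedged sketch):
# "-" must be doubled, and "/" becomes %2F in badge labels.
BADGE_LABEL=$(echo "$REPO" | sed -e 's/-/--/g' -e 's#/#%2F#g')
GITHUB_BADGE_URL="https://img.shields.io/badge/GitHub-${BADGE_LABEL}-181717?logo=github"
```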
Sources: scripts/build-docs.sh:50-51
Configuration Precedence
The system follows a clear precedence order for all configurable values:
Configuration Precedence Order
| Priority | Source | Example |
|---|---|---|
| 1 (Highest) | Explicit environment variable | docker run -e REPO=owner/repo |
| 2 | Git auto-detection | git config --get remote.origin.url |
| 3 (Lowest) | Hard-coded default | BOOK_TITLE="Documentation" |
Sources: scripts/build-docs.sh:8-46
Error Handling
Repository Detection Failure
If repository detection fails and no explicit REPO value is provided, the script terminates with a descriptive error at scripts/build-docs.sh:34-38:
Repository Validation and Error Flow
The error message provides actionable guidance:
ERROR: REPO must be set or run from within a Git repository with a GitHub remote
Usage: REPO=owner/repo $0
Sources: scripts/build-docs.sh:34-38
graph TB
subgraph "Auto-Detection Phase"
DetectRepo["Detect/Set REPO"]
DeriveVars["Derive configuration\nvariables"]
end
subgraph "Template Processing"
LoadTemplate["Load header.html\nand footer.html"]
InvokeScript["Execute\nprocess-template.py"]
PassVars["Pass variables as\ncommand-line arguments"]
Substitute["Variable substitution\nin templates"]
end
subgraph "Available Variables"
VarRepo["REPO"]
VarTitle["BOOK_TITLE"]
VarAuthors["BOOK_AUTHORS"]
VarGitURL["GIT_REPO_URL"]
VarDeepWiki["DEEPWIKI_URL"]
VarDate["GENERATION_DATE"]
end
DetectRepo --> DeriveVars
DeriveVars --> LoadTemplate
LoadTemplate --> InvokeScript
InvokeScript --> PassVars
DeriveVars -.-> VarRepo
DeriveVars -.-> VarTitle
DeriveVars -.-> VarAuthors
DeriveVars -.-> VarGitURL
DeriveVars -.-> VarDeepWiki
DeriveVars -.-> VarDate
VarRepo --> PassVars
VarTitle --> PassVars
VarAuthors --> PassVars
VarGitURL --> PassVars
VarDeepWiki --> PassVars
VarDate --> PassVars
PassVars --> Substitute
Integration with Template System
Auto-detected values are automatically propagated to the template processing system, where they can be used as variables in header and footer templates.
Template Variable Propagation
Template Variable Propagation Flow
The invocation at scripts/build-docs.sh:205-213 passes all auto-detected and derived values to process-template.py.
Sources: scripts/build-docs.sh:195-234 README.md:34-36
Usage Examples
Auto-Detection in Local Development
When running the Docker container from within a Git repository with a GitHub remote configured:
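For example (mount paths and image tag are illustrative):

```bash
# Run from inside the repository checkout; REPO is auto-detected
# from the origin remote found in the mounted workspace.
docker run --rm \
  -v "$PWD":/workspace \
  -v "$PWD/output":/output \
  deepwiki-scraper
```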
The script automatically:
- Detects REPO from git config --get remote.origin.url
- Derives BOOK_AUTHORS from the repository owner
- Constructs all URLs based on the detected repository
Explicit Override
Users can override auto-detection by explicitly setting environment variables:
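For example:

```bash
# Explicit configuration takes precedence over auto-detection.
docker run --rm \
  -e REPO=owner/repo \
  -e BOOK_TITLE="Custom Title" \
  -e BOOK_AUTHORS="Custom Author" \
  -v "$PWD/output":/output \
  deepwiki-scraper
```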
In this case:
- REPO uses the explicit value (no auto-detection)
- BOOK_AUTHORS uses the explicit value (not derived from REPO)
- BOOK_TITLE uses the explicit value
- URLs are still derived from the explicit REPO value
Sources: README.md:14-27 scripts/build-docs.sh:8-46
Detection Validation Output
The build script outputs configuration information after detection completes at scripts/build-docs.sh:53-59:
Configuration:
Repository: jzombie/deepwiki-to-mdbook
Book Title: Documentation
Authors: jzombie
Git Repo URL: https://github.com/jzombie/deepwiki-to-mdbook
Markdown Only: false
This output serves as verification that auto-detection and default derivation completed successfully before content processing begins.
Sources: scripts/build-docs.sh:53-59
Limitations and Considerations
Git Repository Requirement
Auto-detection only works when the Docker container is run with the workspace mounted and contains a Git repository with a GitHub remote. For non-Git scenarios or non-GitHub repositories, the REPO environment variable must be explicitly provided.
GitHub-Specific Detection
The URL parsing logic at scripts/build-docs.sh:16 specifically looks for github.com in the remote URL. Repositories hosted on other platforms (GitLab, Bitbucket, etc.) will not be auto-detected and require explicit configuration.
Single Remote Assumption
The detection reads from remote.origin.url specifically. If a repository has multiple remotes or uses a different primary remote name, auto-detection will use the origin remote or fail if it doesn’t exist.
Sources: scripts/build-docs.sh:8-19 scripts/build-docs.sh:34-38
Development Guide
Relevant source files
This page provides guidance for developers who want to modify, extend, or contribute to the DeepWiki-to-mdBook Converter system. It covers the development environment setup, local workflow, testing procedures, and key considerations when working with the codebase.
For detailed information about the repository structure, see Project File Structure. For instructions on building the Docker image, see Building the Docker Image. For Python dependency details, see Python Dependencies.
Development Environment Requirements
The system is designed to run entirely within Docker, but local development requires the following tools:
| Tool | Purpose | Version |
|---|---|---|
| Docker | Container runtime | Latest stable |
| Git | Version control | 2.x or later |
| Text editor/IDE | Code editing | Any (VS Code recommended) |
| Python | Local testing (optional) | 3.12+ |
| Rust toolchain | Local testing (optional) | Latest stable |
The Docker image handles all runtime dependencies, so local installation of Python and Rust is optional and only needed for testing individual components outside the container.
Sources: Dockerfile:1-33
Development Workflow Architecture
The following diagram shows the typical development cycle and how different components interact during development:
Development Workflow Diagram : Shows the cycle from editing code to building the Docker image to testing with mounted output volume.
graph TB
subgraph "Development Environment"
Editor["Code Editor"]
GitRepo["Local Git Repository"]
end
subgraph "Docker Build Process"
BuildCmd["docker build -t deepwiki-scraper ."]
Stage1["Rust Builder Stage\nCompiles mdbook binaries"]
Stage2["Python Runtime Stage\nAssembles final image"]
FinalImage["deepwiki-scraper:latest"]
end
subgraph "Testing & Validation"
RunCmd["docker run with test params"]
OutputMount["Volume mount: ./output"]
Validation["Manual inspection of output"]
end
subgraph "Key Development Files"
Dockerfile["Dockerfile"]
BuildScript["build-docs.sh"]
Scraper["tools/deepwiki-scraper.py"]
Requirements["tools/requirements.txt"]
end
Editor -->|Edit| GitRepo
GitRepo --> Dockerfile
GitRepo --> BuildScript
GitRepo --> Scraper
GitRepo --> Requirements
BuildCmd --> Stage1
Stage1 --> Stage2
Stage2 --> FinalImage
FinalImage --> RunCmd
RunCmd --> OutputMount
OutputMount --> Validation
Validation -.->|Iterate| Editor
Sources: Dockerfile:1-33 build-docs.sh:1-206
Component Development Map
This diagram bridges system concepts to actual code entities, showing which files implement which functionality:
Code Entity Mapping Diagram : Maps system functionality to specific code locations, file paths, and binaries.
graph LR
subgraph "Entry Point Layer"
CMD["CMD in Dockerfile:32"]
BuildDocs["build-docs.sh"]
end
subgraph "Configuration Layer"
EnvVars["Environment Variables\nREPO, BOOK_TITLE, etc."]
AutoDetect["Auto-detect logic\nbuild-docs.sh:8-19"]
Validation["Validation\nbuild-docs.sh:33-37"]
end
subgraph "Processing Scripts"
ScraperPy["deepwiki-scraper.py"]
MdBookBin["/usr/local/bin/mdbook"]
MermaidBin["/usr/local/bin/mdbook-mermaid"]
end
subgraph "Configuration Generation"
BookToml["book.toml generation\nbuild-docs.sh:85-103"]
SummaryMd["SUMMARY.md generation\nbuild-docs.sh:113-159"]
end
subgraph "Dependency Management"
ReqTxt["requirements.txt"]
UvInstall["uv pip install\nDockerfile:17"]
CargoInstall["cargo install\nDockerfile:5"]
end
CMD --> BuildDocs
BuildDocs --> EnvVars
EnvVars --> AutoDetect
AutoDetect --> Validation
Validation --> ScraperPy
BuildDocs --> BookToml
BuildDocs --> SummaryMd
BuildDocs --> MdBookBin
MdBookBin --> MermaidBin
ReqTxt --> UvInstall
UvInstall --> ScraperPy
CargoInstall --> MdBookBin
CargoInstall --> MermaidBin
Sources: Dockerfile:1-33 build-docs.sh:8-19 build-docs.sh:85-103 build-docs.sh:113-159
Local Development Workflow
1. Clone and Setup
The repository has a minimal structure focused on the essential build artifacts. The .gitignore:1-2 excludes the output/ directory to prevent committing generated files.
2. Make Changes
Key files for common modifications:
| Modification Type | Primary File | Related Files |
|---|---|---|
| Scraping logic | tools/deepwiki-scraper.py | - |
| Build orchestration | build-docs.sh | - |
| Python dependencies | tools/requirements.txt | Dockerfile:16-17 |
| Docker build process | Dockerfile | - |
| Output structure | build-docs.sh | Lines 179-191 |
3. Build Docker Image
After making changes, rebuild the Docker image:
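```bash
docker build -t deepwiki-scraper .
```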
The multi-stage build process Dockerfile:1-7 first compiles Rust binaries in a rust:latest builder stage, then Dockerfile:8-33 assembles the final python:3.12-slim image with copied binaries and Python dependencies.
4. Test Changes
Test with a real repository:
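For example (this project's own repository works as a test target):

```bash
# Illustrative test run; MARKDOWN_ONLY=true skips the mdBook phase for speed.
docker run --rm \
  -e REPO=jzombie/deepwiki-to-mdbook \
  -e MARKDOWN_ONLY=true \
  -v "$PWD/output":/output \
  deepwiki-scraper
```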
Setting MARKDOWN_ONLY=true build-docs.sh:61-76 bypasses the mdBook build phase, allowing faster iteration when testing scraping logic changes.
5. Validate Output
Inspect the generated files:
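For example:

```bash
ls -R output/markdown/         # numbered .md files and section-N/ directories
head -20 output/markdown/*.md  # spot-check page titles and content
```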
Sources: .gitignore:1-2 Dockerfile:1-33 build-docs.sh:61-76 build-docs.sh:179-191
Testing Strategies
Fast Iteration with Markdown-Only Mode
The MARKDOWN_ONLY environment variable enables a fast path for testing scraping changes:
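```bash
docker run --rm -e REPO=owner/repo -e MARKDOWN_ONLY=true \
  -v "$PWD/output":/output deepwiki-scraper
```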
This mode executes only the scraper phases (markdown extraction and diagram enhancement) and skips Phase 3 (mdBook Build). See Phase 1: Markdown Extraction for details on what the extraction phase includes.
The conditional logic build-docs.sh:61-76 checks the MARKDOWN_ONLY variable and exits early after copying markdown files to /output/markdown/.
Testing Auto-Detection
The repository auto-detection logic build-docs.sh:8-19 attempts to extract the GitHub repository from Git remotes if REPO is not explicitly set:
The script checks git config --get remote.origin.url and extracts the owner/repo portion using sed pattern matching at build-docs.sh:16.
Testing Configuration Generation
To test book.toml and SUMMARY.md generation without a full build:
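One hedged approach is to render the same heredoc logic outside the container; the TOML keys below are standard mdBook configuration, though the template's exact contents live in build-docs.sh:85-103:

```bash
# Approximate the book.toml generation locally (illustrative subset).
export BOOK_TITLE="Test Title" BOOK_AUTHORS="tester"
cat <<EOF
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]
EOF
```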
The book.toml template build-docs.sh:85-103 uses shell variable substitution to inject environment variables into the TOML structure.
Sources: build-docs.sh:8-19 build-docs.sh:61-76 build-docs.sh:85-103
Debugging Techniques
Inspecting Intermediate Files
The build process creates temporary files in /workspace inside the container. To inspect them:
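One way to do this is to start an interactive shell instead of the default CMD, run the build manually, and then browse /workspace (an illustrative session):

```bash
docker run -it --rm -e REPO=owner/repo deepwiki-scraper bash
# inside the container:
#   build-docs.sh
#   ls /workspace/wiki/
```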
This allows inspection of:
- Scraped markdown files in /workspace/wiki/
- Generated book.toml in /workspace/book/
- Generated SUMMARY.md in /workspace/book/src/
Adding Debug Output
Both build-docs.sh:1-206 and deepwiki-scraper.py use echo statements for progress tracking. Add additional debug output:
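For example, in build-docs.sh (variable names as used elsewhere in this guide):

```bash
echo "DEBUG: WIKI_DIR contains $(ls "$WIKI_DIR" | wc -l) files"
```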
Testing Python Script Independently
To test the scraper without Docker:
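A sketch, using the two-argument invocation described in the file modification guidelines below (repository name is illustrative):

```bash
# Install the scraper's dependencies, then run it directly with
# a repository and an output directory, as build-docs.sh does.
pip install -r tools/requirements.txt
python tools/deepwiki-scraper.py owner/repo ./wiki-output
```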
This is useful for rapid iteration on scraping logic without rebuilding the Docker image.
Sources: build-docs.sh:1-206 tools/requirements.txt:1-4
Build Optimization Considerations
Multi-Stage Build Rationale
The Dockerfile:1-7 uses a separate Rust builder stage to:
- Compile mdbook and mdbook-mermaid with a full Rust toolchain
- Discard the ~1.5 GB builder stage after compilation
- Copy only the compiled binaries Dockerfile:20-21 to the final image
This reduces the final image size from ~1.5 GB to ~300-400 MB while still providing both Python and Rust tools. See Docker Multi-Stage Build for architectural details.
Dependency Management with uv
The Dockerfile:13 copies uv from the official Astral image and uses it at Dockerfile:17 to install Python dependencies with the --no-cache flag.
This approach:
- Provides faster dependency resolution than pip
- Reduces layer size with --no-cache
- Installs system-wide with the --system flag
Image Layer Ordering
The Dockerfile orders operations to maximize layer caching:
- Copy the uv binary (rarely changes)
- Install Python dependencies (changes with requirements.txt)
- Copy Rust binaries (changes when rebuilding the Rust stage)
- Copy Python scripts (changes frequently during development)
This ordering means modifying deepwiki-scraper.py only invalidates the final layers (Dockerfile:24-29), not the entire dependency installation.
Sources: Dockerfile:1-33
Common Development Tasks
Adding a New Environment Variable
To add a new configuration option:
- Define a default in build-docs.sh:21-30
- Add to the configuration display at build-docs.sh:47-53
- Use in downstream processing as needed
- Document in Configuration Reference
Modifying SUMMARY.md Generation
The table of contents generation logic build-docs.sh:113-159 uses bash loops and file discovery:
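A hedged sketch of the general shape of that loop, based on the file-discovery and title-extraction steps shown in the markdown-only mode figure (not the verbatim script):

```bash
# Walk the numbered markdown files and emit SUMMARY.md entries (sketch).
for f in "$WIKI_DIR"/*.md; do
  title=$(head -1 "$f" | sed 's/^# //')   # first line holds the page title
  echo "- [$title](./$(basename "$f"))"
done > "$BOOK_DIR/src/SUMMARY.md"
```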
To modify the structure:
- Adjust the file pattern matching
- Modify the section detection logic
- Update the markdown output format
- Test with repositories that have different hierarchical structures
Adding New Python Dependencies
- Add to tools/requirements.txt:1-4 with a version constraint, e.g. new-package>=1.0.0
- Rebuild the Docker image (triggers Dockerfile:17)
- Update the Python Dependencies documentation
- Import and use in deepwiki-scraper.py
Sources: build-docs.sh:21-30 build-docs.sh:113-159 tools/requirements.txt:1-4 Dockerfile:17
File Modification Guidelines
Modifying build-docs.sh
The orchestrator script uses several idioms:
| Pattern | Purpose | Example |
|---|---|---|
| set -e | Exit on error | build-docs.sh:2 |
| "${VAR:-default}" | Default values | build-docs.sh:22-26 |
| $(command) | Command substitution | build-docs.sh:12 |
| echo "" | Visual spacing | build-docs.sh:47 |
| mkdir -p | Safe directory creation | build-docs.sh:64 |
Maintain these patterns for consistency. The script is designed to be readable and self-documenting, with clear step labels (build-docs.sh:4-6).
Modifying Dockerfile
Key considerations:
- Keep stages separate (Dockerfile:1-2 vs. Dockerfile:8)
- Use COPY --from=builder (Dockerfile:20-21) for cross-stage artifact copying
- Set executable permissions (Dockerfile:25-29) for scripts
- Use WORKDIR (Dockerfile:10) to establish a consistent working directory
- Keep CMD (Dockerfile:32) as the default entrypoint
Modifying Python Scripts
When editing tools/deepwiki-scraper.py:
- The script is executed via build-docs.sh:58 with two arguments: REPO and the output directory
- It must be Python 3.12 compatible (Dockerfile:8)
- It has access to dependencies from tools/requirements.txt:1-4
- It should write output to the specified directory argument
- It should use print() for progress output that appears in build logs
Sources: build-docs.sh:2 build-docs.sh:58 Dockerfile:1-33 tools/requirements.txt:1-4
Integration Testing
End-to-End Test
Validate the complete pipeline:
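A sketch of a full run followed by basic output checks (repository name is illustrative):

```bash
# Full pipeline: scrape, enhance, and build HTML, then verify key artifacts.
docker run --rm -e REPO=owner/repo -v "$PWD/output":/output deepwiki-scraper
test -f output/book/index.html && echo "✓ HTML site generated"
test -d output/markdown && echo "✓ Markdown extracted"
```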
Testing Configuration Variants
Test different repository configurations:
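For example (illustrative values), exercising auto-detection, explicit configuration, and markdown-only mode:

```bash
docker run --rm -v "$PWD":/workspace -v "$PWD/out1":/output deepwiki-scraper
docker run --rm -e REPO=owner/repo -e BOOK_TITLE="Variant" \
  -v "$PWD/out2":/output deepwiki-scraper
docker run --rm -e REPO=owner/repo -e MARKDOWN_ONLY=true \
  -v "$PWD/out3":/output deepwiki-scraper
```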
Sources: build-docs.sh:8-19 build-docs.sh:61-76
Contributing Guidelines
When submitting changes:
- Test locally : Build and run the Docker image with multiple test repositories
- Validate output : Ensure markdown files are properly formatted and the HTML site builds correctly
- Check backwards compatibility : Existing repositories should continue to work
- Update documentation : Modify relevant wiki pages if changing behavior
- Follow existing patterns : Match the coding style in build-docs.sh:1-206
The system is designed to be “fully generic” - it should work with any DeepWiki repository without modification. Test that your changes maintain this property.
Sources: build-docs.sh:1-206
Troubleshooting Development Issues
Build Failures
| Symptom | Likely Cause | Solution |
|---|---|---|
| Rust compilation fails | Network issues, incompatible versions | Check rust:latest image availability |
| Python package install fails | Version conflicts in requirements.txt | Verify package versions are compatible |
| mdbook not found | Binary copy failed | Check Dockerfile:20-21 paths |
| Permission denied on scripts | Missing chmod +x | Verify Dockerfile:25-29 |
Runtime Failures
| Symptom | Likely Cause | Solution |
|---|---|---|
| “REPO must be set” error | Auto-detection failed, no REPO env var | Check build-docs.sh:33-36 validation logic |
| Scraper crashes | DeepWiki site structure changed | Debug deepwiki-scraper.py with local testing |
| SUMMARY.md is empty | No markdown files found | Verify scraper output in /workspace/wiki/ |
| mdBook build fails | Invalid markdown syntax | Inspect markdown files for issues |
Output Validation Checklist
After a successful build, verify:
- output/markdown/ contains .md files
- Section directories exist (e.g., output/markdown/section-4/)
- output/book/index.html exists and opens in a browser
- Navigation menu appears in the generated site
- Search functionality works
- Mermaid diagrams render correctly
- Links between pages work
- “Edit this file” links point to correct GitHub URLs
Sources: build-docs.sh:33-36 Dockerfile:20-21 Dockerfile:25-29
Project Structure
Relevant source files
This document describes the repository’s file organization, detailing the purpose of each file and directory in the codebase. Understanding this structure is essential for developers who want to modify or extend the system.
For information about running tests, see page 13.2. For details about the Python dependencies, see page 13.3.
Repository Layout
The repository follows a clean, organized structure that separates Python code, shell scripts, and HTML templates into dedicated directories.
graph TB
Root["Repository Root"]
Root --> GitIgnore[".gitignore"]
Root --> Dockerfile["Dockerfile"]
Root --> README["README.md"]
Root --> PythonDir["python/"]
Root --> ScriptsDir["scripts/"]
Root --> TemplatesDir["templates/"]
Root --> GithubDir[".github/"]
Root --> OutputDir["output/"]
PythonDir --> Scraper["deepwiki-scraper.py"]
PythonDir --> ProcessTemplate["process-template.py"]
PythonDir --> Requirements["requirements.txt"]
PythonDir --> TestsDir["tests/"]
ScriptsDir --> BuildScript["build-docs.sh"]
ScriptsDir --> RunTests["run-tests.sh"]
TemplatesDir --> Header["header.html"]
TemplatesDir --> Footer["footer.html"]
TemplatesDir --> TemplateREADME["README.md"]
GithubDir --> Workflows["workflows/"]
OutputDir --> MarkdownOut["markdown/"]
OutputDir --> RawMarkdownOut["raw_markdown/"]
OutputDir --> BookOut["book/"]
OutputDir --> ConfigOut["book.toml"]
style Root fill:#f9f9f9,stroke:#333
style PythonDir fill:#e8f5e9,stroke:#388e3c
style ScriptsDir fill:#fff4e1,stroke:#f57c00
style TemplatesDir fill:#e1f5ff,stroke:#0288d1
style OutputDir fill:#ffe0b2,stroke:#e64a19
Physical File Hierarchy
Sources: README.md:84-88 .gitignore:1-7
Root Directory Files
The repository root contains the primary configuration and documentation files that define the system’s build behavior.
| File | Type | Purpose |
|---|---|---|
.gitignore | Config | Excludes generated output and temporary files |
Dockerfile | Build | Multi-stage Docker build specification |
README.md | Docs | Quick start guide and configuration reference |
.gitignore
Excludes build artifacts and temporary files from version control:
- output/ - Generated documentation artifacts
- *.pyc and __pycache__/ - Python bytecode
- .env - Local environment variables
- .DS_Store - macOS metadata
- tmp/ - Temporary working directory
Sources: .gitignore:1-7
Dockerfile
Implements a two-stage build pattern to optimize image size. The builder stage compiles Rust binaries (mdbook, mdbook-mermaid), and the final stage creates a Python runtime with only the necessary executables.
Sources: README.md:78
README.md
Primary documentation file containing quick start instructions, configuration reference, and high-level system overview. Serves as the entry point for new users.
Sources: README.md:1-95
graph TB
PythonDir["python/"]
PythonDir --> Scraper["deepwiki-scraper.py"]
PythonDir --> ProcessTemplate["process-template.py"]
PythonDir --> Requirements["requirements.txt"]
PythonDir --> TestsDir["tests/"]
TestsDir --> TemplateTest["test_template_processing.py"]
TestsDir --> MermaidTest["test_mermaid_normalization.py"]
TestsDir --> NumberingTest["test_page_numbering.py"]
Scraper --> ExtractWikiStructure["extract_wiki_structure()"]
Scraper --> ExtractPageContent["extract_page_content()"]
Scraper --> ExtractMermaid["extract_mermaid_from_nextjs_data()"]
Scraper --> NormalizeDiagram["normalize_mermaid_diagram()"]
Scraper --> ExtractAndEnhance["extract_and_enhance_diagrams()"]
ProcessTemplate --> ProcessFile["process_template_file()"]
ProcessTemplate --> SubstituteVars["substitute_variables()"]
Python Directory
The python/ directory contains all Python scripts, their dependencies, and test suites.
Python Directory Structure
Sources: README.md:85
deepwiki-scraper.py
Core Python module for content extraction and diagram processing. Implements the Phase 1 (markdown extraction) and Phase 2 (diagram enhancement) logic of the pipeline.
Key Functions:
| Function | Purpose |
|---|---|
| sanitize_filename() | Convert page titles to filesystem-safe names |
| fetch_page() | HTTP client with retry logic and error handling |
| discover_subsections() | Recursively probe for nested wiki pages |
| extract_wiki_structure() | Build hierarchical page structure from DeepWiki |
| clean_deepwiki_footer() | Remove DeepWiki UI elements from markdown |
| convert_html_to_markdown() | HTML→Markdown conversion via html2text |
| extract_mermaid_from_nextjs_data() | Extract diagrams from Next.js JavaScript payload |
| normalize_mermaid_diagram() | Seven-step normalization for Mermaid 11 compatibility |
| extract_page_content() | Main content extraction and markdown generation |
| extract_and_enhance_diagrams() | Fuzzy matching and diagram injection |
| main() | Entry point with temporary directory management |
The scraper uses a temporary directory pattern to ensure atomic operations. Files are written to tempfile.TemporaryDirectory(), enhanced in-place, then moved to the final output location.
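A minimal sketch of that pattern with illustrative paths; the real main() differs in detail:

```python
import shutil
import tempfile
from pathlib import Path

# Phase 1 writes into a temp dir; Phase 2 enhances files in place; results are
# then moved to the final location so readers never see partial output.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    (work / "1-overview.md").write_text("# Overview\n")  # Phase 1: extraction
    # ... Phase 2: diagram enhancement edits files in place ...
    out = Path("output/markdown")
    for md in work.rglob("*.md"):                        # hand-off to output
        target = out / md.relative_to(work)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(md), str(target))
```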
Sources: README.md:85
process-template.py
Template processing script that performs variable substitution in header and footer HTML files. Supports conditional rendering and automatic variable detection.
Key Functions:
| Function | Purpose |
|---|---|
| process_template_file() | Main template processing entry point |
| substitute_variables() | Replace {{VARIABLE}} placeholders with values |
Template variables include: {{REPO}}, {{BOOK_TITLE}}, {{BOOK_AUTHORS}}, {{GIT_REPO_URL}}, {{DEEPWIKI_URL}}, {{GENERATION_DATE}}.
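A hedged sketch of the substitution step; the real substitute_variables() also handles the conditional rendering mentioned above, which is omitted here:

```python
import re

def substitute_variables(template: str, variables: dict) -> str:
    """Replace {{NAME}} placeholders; unknown names are left untouched."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: variables.get(m.group(1), m.group(0)),
        template,
    )

print(substitute_variables("Built from {{REPO}} on {{GENERATION_DATE}}",
                           {"REPO": "owner/repo"}))
# -> "Built from owner/repo on {{GENERATION_DATE}}"
```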
Sources: README.md:51
requirements.txt
Python dependencies for the scraper and template processor:
- requests>=2.31.0 - HTTP client for fetching wiki pages
- beautifulsoup4>=4.12.0 - HTML parsing library
- html2text>=2020.1.16 - HTML-to-Markdown converter
Installed via uv pip install during Docker build for faster, more reliable installation.
Sources: README.md:85
tests/
Test suite for Python components. Contains unit tests for template processing, Mermaid normalization, and page numbering logic. See page 13.2 for details on running tests.
Sources: README.md:82
Scripts Directory
The scripts/ directory contains shell scripts for orchestration and testing.
Scripts Directory Structure
Sources: README.md:82 README.md:86
build-docs.sh
Main orchestration script that coordinates the three-phase pipeline. Invoked as the Docker container’s entry point.
Execution Flow:
1. Auto-detection - Detect REPO from the git remote if not provided
2. Configuration - Parse environment variables and set defaults
3. Phase 1 - Execute deepwiki-scraper.py to extract markdown
4. Phase 2 - Process templates and generate book.toml and SUMMARY.md
5. Phase 3 - Run mdbook build to generate HTML (unless MARKDOWN_ONLY=true)
6. Cleanup - Copy outputs to the /output volume
Environment Variables:
- REPO - GitHub repository (owner/repo format)
- BOOK_TITLE - Documentation title
- BOOK_AUTHORS - Author metadata
- GIT_REPO_URL - Repository URL for edit links
- DEEPWIKI_URL - DeepWiki page URL
- MARKDOWN_ONLY - Skip HTML build for debugging
Critical Paths:
- WORK_DIR=/workspace - Working directory
- WIKI_DIR=/workspace/wiki - Temporary markdown location
- OUTPUT_DIR=/output - Volume mount for outputs
- BOOK_DIR=/workspace/book - mdBook source directory
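Tying the variables and paths together, a hypothetical invocation (the image tag is an assumption, not taken from the README):

```sh
docker run --rm \
  -e REPO=owner/repo \
  -e BOOK_TITLE="My Project Docs" \
  -e MARKDOWN_ONLY=false \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook:latest   # hypothetical image tag
```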
Sources: README.md:34-37 README.md:86
run-tests.sh
Test execution script that runs pytest on the Python test suite. Provides colored output and detailed test results.
Sources: README.md:82
graph TB
TemplatesDir["templates/"]
TemplatesDir --> Header["header.html"]
TemplatesDir --> Footer["footer.html"]
TemplatesDir --> TemplateREADME["README.md"]
Header --> Variables["Template variables:\n{{REPO}}\n{{BOOK_TITLE}}\n{{GIT_REPO_URL}}\n{{DEEPWIKI_URL}}\n{{GENERATION_DATE}}"]
Footer --> Variables
Templates Directory
The templates/ directory contains HTML template files for header and footer customization.
Templates Directory Structure
Sources: README.md:87
header.html
HTML template injected at the beginning of each markdown file. Supports variable substitution for dynamic content like repository links and generation timestamps.
Sources: README.md:40-51
footer.html
HTML template injected at the end of each markdown file. Supports the same variable substitution as header.html.
Sources: README.md:40-51
README.md
Documentation for the template system, including variable reference and customization examples.
Sources: README.md:51
graph TB
Output["output/"]
Output --> Markdown["markdown/"]
Output --> RawMarkdown["raw_markdown/"]
Output --> Book["book/"]
Output --> Config["book.toml"]
Markdown --> MainPages["Main pages:\n1-overview.md\n2-quick-start.md"]
Markdown --> Sections["Subsection dirs:\nsection-2/\nsection-3/"]
Sections --> SubPages["Subsection pages:\n2-1-docker.md\n3-1-environment.md"]
RawMarkdown --> RawPages["Pre-enhanced\nmarkdown files\n(for debugging)"]
Book --> Index["index.html"]
Book --> CSS["css/"]
Book --> JS["mermaid.min.js"]
Book --> Search["searchindex.js"]
Output Directory (Generated)
The output/ directory is created at runtime and excluded from version control. It contains all generated artifacts produced by the build pipeline.
Output Structure
Sources: README.md:54-59
markdown/
Contains enhanced markdown source files with injected diagrams and processed templates. Files are organized hierarchically with subsections in section-N/ subdirectories.
Main Pages:
- Format: {number}-{slug}.md (e.g., 1-overview.md)
- Location: output/markdown/

Subsection Pages:

- Format: section-{main}/{number}-{slug}.md
- Location: output/markdown/section-{N}/
- Example: section-3/3-2-environment-variables.md
Sources: README.md:56
raw_markdown/
Pre-enhancement markdown files for debugging purposes. Contains the output of Phase 1 before diagram injection and template processing. Useful for troubleshooting diagram matching issues.
Sources: README.md:57
book/
Complete HTML documentation site generated by mdBook. Self-contained static website with:
- Navigation sidebar generated from SUMMARY.md
- Full-text search via searchindex.js
- Rendered Mermaid diagrams via mdbook-mermaid
- Edit-on-GitHub links from GIT_REPO_URL
- Responsive Rust theme
The entire directory can be served by any static file server or deployed to GitHub Pages.
Sources: README.md:55
book.toml
mdBook configuration file with repository-specific metadata. Dynamically generated during Phase 2 of the build pipeline. Contains book title, authors, theme settings, and preprocessor configuration.
Sources: README.md:58
graph TB
BuildContext["Docker Build Context"]
BuildContext --> Included["Included in Image"]
BuildContext --> Excluded["Excluded"]
Included --> DockerfileBuild["Dockerfile\n(Build instructions)"]
Included --> ToolsCopy["tools/\n(COPY instruction)"]
Included --> ScriptCopy["build-docs.sh\n(COPY instruction)"]
ToolsCopy --> ReqInstall["requirements.txt\n→ uv pip install"]
ToolsCopy --> ScraperInstall["deepwiki-scraper.py\n→ /usr/local/bin/"]
ScriptCopy --> BuildInstall["build-docs.sh\n→ /usr/local/bin/"]
Excluded --> GitIgnored["output/\n(git-ignored)"]
Excluded --> GitFiles[".git/\n(implicit)"]
Excluded --> Readme["README.md\n(not referenced)"]
style BuildContext fill:#f9f9f9,stroke:#333
style Included fill:#e8f5e9,stroke:#388e3c
style Excluded fill:#ffebee,stroke:#c62828
Docker Build Context
The Docker build process includes only the files needed for container construction. Understanding this context is important for build optimization.
Build Context Inclusion
Copy Operations:
- Dockerfile:16 - COPY tools/requirements.txt /tmp/requirements.txt
- Dockerfile:24 - COPY tools/deepwiki-scraper.py /usr/local/bin/
- Dockerfile:28 - COPY build-docs.sh /usr/local/bin/
Not Copied:
- .gitignore - only used by Git
- output/ - generated at runtime
- .git/ - version control metadata
- Any documentation files (README, LICENSE)
Sources: Dockerfile:16-28 .gitignore:1-2
graph TB
subgraph BuildTime["Build-Time Dependencies"]
DF["Dockerfile"]
Req["tools/requirements.txt"]
Scraper["tools/deepwiki-scraper.py"]
BuildSh["build-docs.sh"]
DF -->|COPY [Line 16]| Req
DF -->|RUN install [Line 17]| Req
DF -->|COPY [Line 24]| Scraper
DF -->|COPY [Line 28]| BuildSh
DF -->|CMD [Line 32]| BuildSh
end
subgraph Runtime["Run-Time Dependencies"]
BuildShRun["build-docs.sh\n(Entry point)"]
ScraperExec["deepwiki-scraper.py\n(Phase 1-2)"]
MdBook["mdbook\n(Phase 3)"]
MdBookMermaid["mdbook-mermaid\n(Phase 3)"]
BuildShRun -->|python3 [Line 58]| ScraperExec
BuildShRun -->|mdbook-mermaid install [Line 171]| MdBookMermaid
BuildShRun -->|mdbook build [Line 176]| MdBook
ScraperExec -->|import requests| Req
ScraperExec -->|import bs4| Req
ScraperExec -->|import html2text| Req
end
subgraph Generated["Generated Artifacts"]
WikiDir["$WIKI_DIR/\n(Temp markdown)"]
BookToml["book.toml\n(Config)"]
Summary["SUMMARY.md\n(TOC)"]
OutputDir["output/\n(Final artifacts)"]
ScraperExec -->|sys.argv[2]| WikiDir
BuildShRun -->|cat > [Line 85]| BookToml
BuildShRun -->|Lines 113-159| Summary
BuildShRun -->|cp [Lines 184-191]| OutputDir
end
BuildTime --> Runtime
Runtime --> Generated
style DF fill:#e1f5ff,stroke:#0288d1
style BuildShRun fill:#fff4e1,stroke:#f57c00
style ScraperExec fill:#e8f5e9,stroke:#388e3c
style OutputDir fill:#ffe0b2,stroke:#e64a19
File Dependency Graph
This diagram maps the relationships between files and shows which files depend on or reference others.
Sources: Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 tools/requirements.txt:1-4
File Size and Complexity Metrics
Understanding the relative complexity of each component helps developers identify which files require the most attention during modifications.
| File | Lines | Purpose | Complexity |
|---|---|---|---|
| tools/deepwiki-scraper.py | 920 | Content extraction and diagram matching | High |
| build-docs.sh | 206 | Orchestration and configuration | Medium |
| Dockerfile | 33 | Multi-stage build specification | Low |
| tools/requirements.txt | 4 | Dependency list | Minimal |
| .gitignore | 2 | Git exclusion rule | Minimal |
Key Observations:
- 90% of code is in the Python scraper tools/deepwiki-scraper.py:1-920
- Shell script handles high-level orchestration build-docs.sh:1-206
- Dockerfile is minimal due to multi-stage optimization Dockerfile:1-33
- No configuration files in repository root (all generated at runtime)
Sources: tools/deepwiki-scraper.py:1-920 build-docs.sh:1-206 Dockerfile:1-33 tools/requirements.txt:1-4 .gitignore:1-2
Running Tests
Relevant source files
This page provides instructions for running the test suite locally and understanding the test organization within the DeepWiki-to-mdBook converter project. It covers local execution methods, test structure, and integration with the development workflow. For information about the automated CI/CD test workflow, see Test Workflow.
Test Organization
The test suite is located in the python/tests/ directory and consists of multiple test modules that validate different components of the system.
Test Structure
python/
├── tests/
│ ├── conftest.py # pytest fixtures and configuration
│ ├── test_template_processor.py # Template system tests
│ ├── test_mermaid_normalization.py # Mermaid diagram normalization tests
│ └── test_numbering.py # Page numbering and path resolution tests
Sources: scripts/run-tests.sh:1-43 python/tests/conftest.py:1-16
Test Categories
| Test Module | Purpose | Test Framework |
|---|---|---|
| test_template_processor.py | Validates template variable substitution, conditional rendering, and header/footer injection | Standalone Python (no pytest required) |
| test_mermaid_normalization.py | Tests the seven-step Mermaid diagram normalization pipeline | pytest |
| test_numbering.py | Validates page numbering logic and path resolution algorithms | pytest |
Sources: scripts/run-tests.sh:7-30
Running Tests Locally
There are two primary methods for running tests locally: using the convenience shell script or invoking pytest directly.
Method 1: Using the Shell Script
The run-tests.sh script provides a unified interface for running all tests with appropriate error handling and formatted output:
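Assuming the repository root as the working directory:

```sh
./scripts/run-tests.sh
```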
This script:
- Runs template processor tests directly with Python
- Detects if pytest is installed
- Runs pytest-based tests if available
- Provides summary output with pass/fail status
Sources: scripts/run-tests.sh:1-43
Method 2: Using pytest Directly
For pytest-based tests, you can invoke pytest directly for more control:
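For example (the flags are illustrative; CI uses -s as shown later on this page):

```sh
pytest python/tests/ -v
```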
Sources: .github/workflows/tests.yml:24-25
Method 3: Individual Test Execution
The template processor tests can run independently without pytest:
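One plausible invocation, assuming the repository root as the working directory:

```sh
python3 python/tests/test_template_processor.py
```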
Sources: scripts/run-tests.sh:11
Test Execution Flow
Test Execution Flow Diagram
This diagram shows how tests are executed locally. The run-tests.sh script checks for pytest availability and runs tests accordingly, while developers can also invoke pytest or Python directly.
Sources: scripts/run-tests.sh:1-43 python/tests/conftest.py:1-16
Prerequisites
Required Dependencies
Install Python dependencies before running tests:
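Mirroring the CI workflow steps:

```sh
pip install -r python/requirements.txt
pip install pytest   # required only for the mermaid and numbering tests
```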
The requirements.txt file contains all runtime dependencies needed by the scraper and test utilities.
Sources: .github/workflows/tests.yml:19-23
Python Version
Tests are designed for Python 3.12, which is the version used in both the Docker container and CI workflow:
Sources: .github/workflows/tests.yml:17-18
Test Module Details
Test Module Dependencies Diagram
This diagram shows the organization of test modules and their relationship to the conftest.py fixture system. The scraper_module fixture dynamically loads deepwiki-scraper.py for use in pytest-based tests.
Sources: python/tests/conftest.py:1-16 scripts/run-tests.sh:7-30
Template Processor Tests
The test_template_processor.py module tests the template variable substitution system used for header and footer injection. It validates:
- Variable substitution with {{VARIABLE_NAME}} syntax
- Conditional blocks with {{#if CONDITION}}...{{/if}}
- Edge cases like missing variables and nested conditions
This module can run independently without pytest and directly imports the template processing functions.
Sources: scripts/run-tests.sh:7-11
Mermaid Normalization Tests
The test_mermaid_normalization.py module validates the seven-step normalization pipeline that ensures Mermaid 11 compatibility. Each normalization step has dedicated tests:
| Normalization Step | Test Coverage |
|---|---|
| Unescape sequences | \n, \t, \u003c character handling |
| Multiline edge labels | Flattening logic for edge descriptions |
| State descriptions | State : Description syntax fixes |
| Flowchart nodes | Pipe character removal |
| Statement separators | Semicolon insertion |
| Empty labels | Fallback label generation |
| Gantt task IDs | Synthetic ID generation for unnamed tasks |
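As an illustration of the first step only, a hedged sketch; the real normalize_mermaid_diagram() chains all seven steps and may differ in detail:

```python
def unescape_step(diagram: str) -> str:
    """Turn literal escape sequences from the JS payload into real characters."""
    return (diagram
            .replace("\\n", "\n")
            .replace("\\t", "    ")
            .replace("\\u003c", "<"))  # other \uXXXX escapes handled similarly

print(unescape_step("graph TD;\\n  A --> B"))
```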
This module uses the scraper_module fixture from conftest.py to access normalization functions.
Sources: scripts/run-tests.sh:16-22 python/tests/conftest.py:7-16
Numbering Tests
The test_numbering.py module validates page numbering logic and path generation algorithms. It tests:
- Hierarchical numbering schemes (e.g., 1.2.3)
- Numeric sorting that correctly handles multi-digit sections
- Path generation from page numbers
- Link rewriting for internal references
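For instance, the multi-digit case the tests guard against (a hedged sketch, not the scraper's actual function):

```python
def numeric_key(page_number: str):
    """Sort '1.10' after '1.9' instead of lexicographically."""
    return [int(part) for part in page_number.split(".")]

print(sorted(["1.10", "1.2", "1.9"], key=numeric_key))
# -> ['1.2', '1.9', '1.10']  (a plain string sort gives ['1.10', '1.2', '1.9'])
```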
This module also uses the scraper_module fixture to access numbering functions.
Sources: scripts/run-tests.sh:24-30 python/tests/conftest.py:7-16
The conftest.py Fixture System
The conftest.py file provides a session-scoped fixture that loads the deepwiki-scraper.py module dynamically:
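A hedged sketch of what that fixture looks like; the scraper path assumes the python/ layout described on the Project Structure page:

```python
import importlib.util
from pathlib import Path

import pytest

@pytest.fixture(scope="session")
def scraper_module():
    """Load deepwiki-scraper.py by file path (the hyphen blocks a normal import)."""
    script = Path(__file__).resolve().parent.parent / "deepwiki-scraper.py"
    spec = importlib.util.spec_from_file_location("deepwiki_scraper", script)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```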
This approach allows tests to import functions from the scraper without requiring it to be installed as a package. The fixture is shared across all test sessions for efficiency.
Sources: python/tests/conftest.py:7-16
CI Integration
The test suite integrates with GitHub Actions through the tests.yml workflow, which:
- Triggers on push to main and on pull requests
- Sets up Python 3.12
- Installs dependencies from python/requirements.txt
- Installs pytest
- Runs all pytest tests with the -s flag (show output)
For detailed information about the CI test workflow, configuration, and failure handling, see Test Workflow.
Sources: .github/workflows/tests.yml:1-26
Understanding Test Output
Successful Test Run
When all tests pass using run-tests.sh, you will see:
==========================================
Running Template Processor Tests
==========================================
[Template test output...]
==========================================
Running Mermaid Normalization Tests
==========================================
[pytest output with test results...]
==========================================
Running Numbering Tests
==========================================
[pytest output with test results...]
==========================================
✓ All tests passed!
==========================================
Sources: scripts/run-tests.sh:34-42
Pytest Not Available
If pytest is not installed, the script will skip pytest-based tests:
==========================================
Running Template Processor Tests
==========================================
[Template test output...]
==========================================
⚠ Template tests passed (mermaid/numbering tests skipped)
Note: pytest not found, install with: pip install pytest
==========================================
Sources: scripts/run-tests.sh:34-42
Pytest Verbose Output
Using pytest with the -v flag provides detailed test information:
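```sh
pytest python/tests/ -v   # list each test with its pass/fail status
```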
The -s flag shows print statements and output, useful for debugging:
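```sh
pytest python/tests/ -s   # show print() output, as CI does
```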
Sources: .github/workflows/tests.yml:25
Local vs. CI Test Execution
Local vs. CI Test Execution Comparison
This diagram illustrates the difference between local and CI test execution. Local execution allows for flexible Python versions and gracefully handles missing pytest, while CI enforces Python 3.12 and guarantees pytest availability.
Sources: scripts/run-tests.sh:13-31 .github/workflows/tests.yml:1-26
Key Differences
| Aspect | Local Execution | CI Execution |
|---|---|---|
| Python Version | Any 3.x version | Fixed at 3.12 |
| pytest Requirement | Optional (graceful fallback) | Always installed |
| Execution Method | run-tests.sh or manual | pytest python/tests/ -s |
| Output Control | User-configurable verbosity | Fixed -s flag for output |
| Trigger | Manual by developer | Automatic on push/PR |
Sources: scripts/run-tests.sh:13-31 .github/workflows/tests.yml:17-25
Best Practices
Running Tests Before Commits
Always run the test suite before committing changes:
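```sh
./scripts/run-tests.sh
```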
This ensures your changes don’t break existing functionality.
Iterative Testing
When developing new features, run specific test modules for faster feedback:
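For example, to exercise only the numbering logic:

```sh
pytest python/tests/test_numbering.py -v
```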
Adding New Tests
When adding new functionality:
- Create test functions in the appropriate test module
- Use the scraper_module fixture for accessing scraper functions
- Test locally with both methods (script and pytest)
- Verify CI passes on your pull request
Sources: python/tests/conftest.py:7-16 .github/workflows/tests.yml:1-26
Python Dependencies
Relevant source files
This page documents the Python dependencies required by the deepwiki-scraper.py script, including their purposes, version requirements, and how they are used throughout the content extraction and conversion pipeline. For information about the scraper script itself, see deepwiki-scraper.py. For details about how Rust dependencies (mdBook and mdbook-mermaid) are installed, see Docker Multi-Stage Build.
Dependencies Overview
The system requires three core Python libraries for web scraping and HTML-to-Markdown conversion:
| Package | Minimum Version | Primary Purpose |
|---|---|---|
| requests | 2.31.0 | HTTP client for fetching web pages |
| beautifulsoup4 | 4.12.0 | HTML parsing and DOM manipulation |
| html2text | 2020.1.16 | HTML to Markdown conversion |
These dependencies are declared in tools/requirements.txt:1-3 and installed during Docker image build using the uv package manager.
Sources: tools/requirements.txt:1-3 Dockerfile:16-17
Dependency Usage Flow
The following diagram illustrates how each Python dependency is used across the three-phase processing pipeline:
Sources: tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-788
flowchart TD
subgraph "Phase 1: Markdown Extraction"
FetchPage["fetch_page()\n[tools/deepwiki-scraper.py:27-42]"]
ExtractStruct["extract_wiki_structure()\n[tools/deepwiki-scraper.py:78-125]"]
ExtractContent["extract_page_content()\n[tools/deepwiki-scraper.py:453-594]"]
ConvertHTML["convert_html_to_markdown()\n[tools/deepwiki-scraper.py:175-190]"]
end
subgraph "Phase 2: Diagram Enhancement"
ExtractDiagrams["extract_and_enhance_diagrams()\n[tools/deepwiki-scraper.py:596-788]"]
end
subgraph "requests Library"
Session["requests.Session()"]
GetMethod["session.get()"]
HeadMethod["session.head()"]
end
subgraph "BeautifulSoup4 Library"
BS4Parser["BeautifulSoup(html, 'html.parser')"]
FindAll["soup.find_all()"]
Select["soup.select()"]
Decompose["element.decompose()"]
end
subgraph "html2text Library"
H2TClass["html2text.HTML2Text()"]
HandleMethod["h.handle()"]
end
FetchPage --> Session
FetchPage --> GetMethod
ExtractStruct --> GetMethod
ExtractStruct --> BS4Parser
ExtractStruct --> FindAll
ExtractContent --> GetMethod
ExtractContent --> BS4Parser
ExtractContent --> Select
ExtractContent --> Decompose
ExtractContent --> ConvertHTML
ConvertHTML --> H2TClass
ConvertHTML --> HandleMethod
ExtractDiagrams --> GetMethod
requests
The requests library provides HTTP client functionality for fetching web pages from DeepWiki.com. It is imported at tools/deepwiki-scraper.py:17 and used throughout the scraper.
Key Usage Patterns
Session Management: A requests.Session() object is created at tools/deepwiki-scraper.py:818-821 to maintain connection pooling and share headers across multiple requests:
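A hedged sketch combining the session setup with the retry behavior described below; the User-Agent string is illustrative:

```python
import time
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 ...",  # browser-like UA; exact string is illustrative
})

def fetch_page(url: str, retries: int = 3, delay: float = 2.0) -> str:
    """GET with up to 3 attempts, 2-second delays, and a 30-second timeout."""
    for attempt in range(retries):
        try:
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```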
HTTP GET Requests: The fetch_page() function at tools/deepwiki-scraper.py:27-42 uses session.get() with retry logic, browser-like headers, and 30-second timeout to fetch HTML content.
HTTP HEAD Requests: The discover_subsections() function at tools/deepwiki-scraper.py:44-76 uses session.head() to efficiently check for page existence without downloading full content.
Configuration Options
The library is configured with:
- Custom User-Agent: Mimics a real browser to avoid bot detection tools/deepwiki-scraper.py:29-31
- Timeout: 30-second limit on requests tools/deepwiki-scraper.py:35
- Retry Logic: Up to 3 attempts with 2-second delays tools/deepwiki-scraper.py:33-42
- Connection Pooling: Automatic via the Session() object
Sources: tools/deepwiki-scraper.py:17 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:818-821
BeautifulSoup4
The beautifulsoup4 library (imported as bs4) provides HTML parsing and DOM manipulation capabilities. It is imported at tools/deepwiki-scraper.py:18 as from bs4 import BeautifulSoup.
Parser Selection
BeautifulSoup is instantiated with the built-in html.parser backend at multiple locations:
- Structure discovery: tools/deepwiki-scraper.py:84
- Content extraction: tools/deepwiki-scraper.py:463
This parser choice avoids external dependencies (lxml, html5lib) and provides sufficient functionality for well-formed HTML.
flowchart LR
subgraph "Navigation Methods"
FindAll["soup.find_all()"]
Find["soup.find()"]
Select["soup.select()"]
SelectOne["soup.select_one()"]
end
subgraph "Usage in extract_wiki_structure()"
StructLinks["Find wiki page links\n[line 90]"]
end
subgraph "Usage in extract_page_content()"
RemoveNav["Remove navigation elements\n[line 466]"]
FindContent["Locate main content area\n[line 473-485]"]
RemoveUI["Remove DeepWiki UI elements\n[line 491-511]"]
end
FindAll --> StructLinks
FindAll --> RemoveUI
Select --> RemoveNav
SelectOne --> FindContent
Find --> FindContent
DOM Navigation Methods
The following diagram maps BeautifulSoup methods to their usage contexts in the codebase:
Sources: tools/deepwiki-scraper.py:18 tools/deepwiki-scraper.py:84 tools/deepwiki-scraper.py:90 tools/deepwiki-scraper.py:463 tools/deepwiki-scraper.py:466-511
Content Manipulation
Element Removal: The element.decompose() method permanently removes elements from the DOM tree:
- Navigation elements: tools/deepwiki-scraper.py:466-467
- DeepWiki UI components: tools/deepwiki-scraper.py:491-500
- Table of contents lists: tools/deepwiki-scraper.py:504-511
CSS Selectors: BeautifulSoup’s select() and select_one() methods support CSS selector syntax for finding content areas:
tools/deepwiki-scraper.py:473-476
Attribute-Based Selection: The find() method with attrs parameter locates elements by ARIA roles:
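A self-contained illustration; the role value is an assumption about DeepWiki's markup:

```python
from bs4 import BeautifulSoup

html = '<nav>menu</nav><main role="main"><p>Page body</p></main>'
soup = BeautifulSoup(html, "html.parser")
content = soup.find(attrs={"role": "main"})  # attribute-based ARIA lookup
print(content.get_text(strip=True))          # -> "Page body"
```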
Text Extraction
BeautifulSoup’s get_text() method extracts plain text from elements:
- With strip=True to remove whitespace: tools/deepwiki-scraper.py:94 tools/deepwiki-scraper.py:492
- Used for DeepWiki UI element detection: tools/deepwiki-scraper.py:492-500
Sources: tools/deepwiki-scraper.py:466-511
html2text
The html2text library converts HTML content to Markdown format. It is imported at tools/deepwiki-scraper.py:19 and used exclusively in the convert_html_to_markdown() function.
Configuration
An HTML2Text instance is created with specific configuration at tools/deepwiki-scraper.py:178-180:
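A minimal reconstruction of that setup, based on the settings listed below:

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False  # keep hyperlinks as Markdown links
h.body_width = 0        # no hard wrapping at 80 columns
```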
Key Settings:
- ignore_links = False: Preserves hyperlinks as Markdown link syntax
- body_width = 0: Disables automatic line wrapping at 80 characters, preserving original formatting
Conversion Process
The handle() method at tools/deepwiki-scraper.py:181 performs the actual HTML-to-Markdown conversion:
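Continuing the sketch above:

```python
markdown = h.handle("<h2>Setup</h2><p>Run <code>mdbook build</code>.</p>")
print(markdown)  # Markdown with "## Setup" and inline code preserved
```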
This processes the cleaned HTML (after BeautifulSoup removes unwanted elements) and produces Markdown text with:
- Headers converted to # syntax
- Links converted to [text](url) format
- Lists converted to - or 1. format
- Bold/italic formatting preserved
- Code blocks and inline code preserved
Post-Processing
The conversion output undergoes additional cleanup at tools/deepwiki-scraper.py:188:
- DeepWiki footer removal via clean_deepwiki_footer(): tools/deepwiki-scraper.py:127-173
- Link rewriting to relative paths: tools/deepwiki-scraper.py:549-592
- Duplicate title removal: tools/deepwiki-scraper.py:525-545
Sources: tools/deepwiki-scraper.py:19 tools/deepwiki-scraper.py:175-190
flowchart TD
subgraph "Dockerfile Stage 2"
BaseImage["FROM python:3.12-slim\n[Dockerfile:8]"]
CopyUV["COPY uv from astral-sh image\n[Dockerfile:13]"]
CopyReqs["COPY tools/requirements.txt\n[Dockerfile:16]"]
InstallDeps["RUN uv pip install --system\n[Dockerfile:17]"]
end
subgraph "requirements.txt"
Requests["requests>=2.31.0"]
BS4["beautifulsoup4>=4.12.0"]
HTML2Text["html2text>=2020.1.16"]
end
BaseImage --> CopyUV
CopyUV --> CopyReqs
CopyReqs --> InstallDeps
Requests --> InstallDeps
BS4 --> InstallDeps
HTML2Text --> InstallDeps
Installation Process
The dependencies are installed during Docker image build using the uv package manager, which provides fast, reliable Python package installation.
Multi-Stage Build Integration
Sources: Dockerfile:8 Dockerfile:13 Dockerfile:16-17 tools/requirements.txt:1-3
Installation Command
The dependencies are installed with a single uv pip install command at Dockerfile:17:
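Reassembled from the flags listed below (as a shell command; in the Dockerfile it appears inside a RUN instruction):

```sh
uv pip install --system --no-cache -r /tmp/requirements.txt
```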
Flags:
- --system: Installs into system Python, not a virtual environment
- --no-cache: Avoids caching to reduce Docker image size
- -r /tmp/requirements.txt: Specifies the requirements file path
The uv tool is significantly faster than standard pip and provides deterministic dependency resolution, making builds more reliable and reproducible.
Sources: Dockerfile:16-17
Version Requirements
The minimum version constraints specified in tools/requirements.txt:1-3 ensure compatibility with required features:
requests >= 2.31.0
This version requirement ensures:
- Security fixes: Addresses CVE-2023-32681 (Proxy-Authorization header leakage)
- Session improvements: Enhanced connection pooling and retry mechanisms
- urllib3 2.x compatibility: Works with modern urllib3 releases
The codebase relies on stable Session API behavior introduced in 2.x releases.
beautifulsoup4 >= 4.12.0
This version requirement ensures:
- Python 3.12 compatibility: Required for the base image python:3.12-slim
- Parser stability: Consistent behavior with the html.parser backend
- Security updates: Protection against XML parsing vulnerabilities
The codebase uses standard find/select methods that are stable across 4.x versions.
html2text >= 2020.1.16
This version requirement ensures:
- Python 3 compatibility: Earlier versions targeted Python 2.7
- Markdown formatting fixes: Improved handling of nested lists and code blocks
- Link preservation: Proper conversion of HTML links to Markdown syntax
The codebase uses the body_width=0 configuration which was stabilized in this version.
Sources: tools/requirements.txt:1-3
Import Locations
All three dependencies are imported at the top of deepwiki-scraper.py:
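Per the sections above, the imports are:

```python
import requests
from bs4 import BeautifulSoup
import html2text
```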
These are the only external dependencies required by the Python layer. The script uses only standard library modules for all other functionality (sys, re, time, pathlib, urllib.parse).
Sources: tools/deepwiki-scraper.py:17-19