
Overview


Purpose and Scope

This document introduces the DeepWiki-to-mdBook Converter, a containerized system that extracts wiki documentation from DeepWiki.com and transforms it into searchable HTML documentation using mdBook. This page covers the system’s purpose, core capabilities, and high-level architecture.

For detailed usage instructions, see Quick Start. For architecture details, see System Architecture. For configuration options, see Configuration Reference.

Sources: README.md:1-3

Problem Statement

DeepWiki.com provides AI-generated documentation for GitHub repositories as a web-based wiki. The system addresses the following limitations:

| Problem | Solution |
| --- | --- |
| Content locked in web platform | HTTP scraping with requests and BeautifulSoup4 |
| Mermaid diagrams rendered client-side only | JavaScript payload extraction with fuzzy matching |
| No offline access | Self-contained HTML site generation |
| No searchability | mdBook's built-in search |
| Platform-specific formatting | Conversion to standard Markdown |

Sources: README.md:3-15

Core Capabilities

The system provides the following capabilities through environment variable configuration:

  • Generic Repository Support: Works with any GitHub repository indexed by DeepWiki via the REPO environment variable
  • Auto-Detection: Extracts repository metadata from Git remotes when available
  • Hierarchy Preservation: Maintains wiki page numbering and section structure
  • Diagram Intelligence: Extracts ~461 total diagrams, matches ~48 with sufficient context using fuzzy matching
  • Dual Output Modes: Full mdBook build or markdown-only extraction via the MARKDOWN_ONLY flag
  • No Authentication: Public HTTP scraping without API keys or credentials
  • Containerized Deployment: Single Docker image with all dependencies

Sources: README.md:5-15 README.md:42-51

System Components

The system consists of three primary executable components coordinated by a shell orchestrator. The following diagram maps user interaction to specific code entities:

Component Architecture with Code Entities

graph TB
    DockerRun["docker run\nwith env vars"]
subgraph Container["/usr/local/bin/ executables"]
BuildDocs["build-docs.sh"]
Scraper["deepwiki-scraper.py\nmain()\nscrape_wiki()\nextract_mermaid_from_nextjs_data()"]
MdBook["mdbook binary\n(Rust)"]
MermaidPlugin["mdbook-mermaid binary\n(Rust)"]
end
    
    subgraph External["External HTTP Endpoints"]
DeepWikiAPI["deepwiki.com/$REPO"]
GitHubEdit["github.com/$REPO/edit/"]
end
    
    subgraph OutputVol["/output volume mount"]
MarkdownDir["markdown/\nnumbered .md files"]
RawMarkdownDir["raw_markdown/\npre-enhancement .md"]
BookDir["book/\nindex.html + search"]
ConfigFile["book.toml"]
SummaryFile["SUMMARY.md"]
end
    
 
   DockerRun -->|REPO BOOK_TITLE BOOK_AUTHORS MARKDOWN_ONLY| BuildDocs
    
 
   BuildDocs -->|python3 deepwiki-scraper.py| Scraper
 
   BuildDocs -->|mdbook init| MdBook
 
   BuildDocs -->|mdbook build| MdBook
 
   BuildDocs -->|generates| ConfigFile
 
   BuildDocs -->|generates| SummaryFile
    
 
   Scraper -->|requests.get| DeepWikiAPI
 
   Scraper -->|writes| RawMarkdownDir
 
   Scraper -->|writes enhanced| MarkdownDir
    
 
   MdBook -->|preprocessor chain| MermaidPlugin
 
   MdBook -->|generates| BookDir
    
 
   BookDir -.->|edit links point to| GitHubEdit

Executable Components

| Component | Type | Entry Point | Key Operations |
| --- | --- | --- | --- |
| build-docs.sh | Shell script | CMD in Dockerfile | Parse $REPO, $BOOK_TITLE, generate book.toml, invoke Python and Rust tools |
| deepwiki-scraper.py | Python 3.12 module | main() function | scrape_wiki(), extract_mermaid_from_nextjs_data(), inject_mermaid_diagrams_into_markdown() |
| mdbook | Rust binary | CLI invocation | mdbook init, mdbook build with book.toml configuration |
| mdbook-mermaid | Rust preprocessor | mdBook plugin chain | Asset injection for Mermaid.js runtime |

Sources: README.md:1-27 README.md:84-88 Diagram 1, Diagram 3

Processing Pipeline

The system executes a three-phase pipeline with conditional execution based on the MARKDOWN_ONLY environment variable. Each phase invokes specific executables and functions:

Three-Phase Execution Flow with Code Entities

stateDiagram-v2
    [*] --> ParseEnv : build-docs.sh reads env
    
    state ParseEnv {
        [*] --> ReadREPO : $REPO
        ReadREPO --> ReadBOOKTITLE : $BOOK_TITLE
        ReadBOOKTITLE --> ReadMARKDOWNONLY : $MARKDOWN_ONLY
        ReadMARKDOWNONLY --> [*]
    }
    
    ParseEnv --> Phase1 : python3 deepwiki-scraper.py
    
    state Phase1 {
        [*] --> scrape_wiki
        scrape_wiki --> BeautifulSoup4 : parse HTML
        BeautifulSoup4 --> html2text : convert to .md
        html2text --> extract_mermaid : extract_mermaid_from_nextjs_data()
        extract_mermaid --> normalize_mermaid : 7-step normalization
        normalize_mermaid --> inject_diagrams : inject_mermaid_diagrams_into_markdown()
        inject_diagrams --> write_files : /output/markdown/*.md
        write_files --> [*]
    }
    
    Phase1 --> CheckMode
    
    state CheckMode <<choice>>
    CheckMode --> Phase2 : if MARKDOWN_ONLY != true
    CheckMode --> Exit : if MARKDOWN_ONLY == true
    
    Phase2 --> GenerateBookToml : build-docs.sh writes book.toml
    GenerateBookToml --> GenerateSummary : build-docs.sh writes SUMMARY.md
    
    GenerateSummary --> Phase3
    
    state Phase3 {
        [*] --> mdbook_init : mdbook init
        mdbook_init --> mdbook_mermaid_install : mdbook-mermaid install
        mdbook_mermaid_install --> mdbook_build : mdbook build
        mdbook_build --> [*] : /output/book/
    }
    
    Phase3 --> Exit
    Exit --> [*]

Phase Execution Details

| Phase | Primary Executable | Key Functions/Commands | Artifacts |
| --- | --- | --- | --- |
| 1: Extract | deepwiki-scraper.py | scrape_wiki(), extract_mermaid_from_nextjs_data(), normalize_mermaid_code(), inject_mermaid_diagrams_into_markdown() | /output/markdown/*.md, /output/raw_markdown/*.md |
| 2: Configure | build-docs.sh | Template string generation for book.toml and SUMMARY.md | /output/book.toml, /output/SUMMARY.md |
| 3: Build | mdbook, mdbook-mermaid | mdbook init, mdbook-mermaid install, mdbook build | /output/book/index.html, /output/book/searchindex.json |

Sources: README.md:72-77 Diagram 2, Diagram 4

Input and Output

Input Requirements

| Input | Format | Source | Example |
| --- | --- | --- | --- |
| REPO | owner/repo | Environment variable | facebook/react |
| BOOK_TITLE | String | Environment variable (optional) | React Documentation |
| BOOK_AUTHORS | String | Environment variable (optional) | Meta Open Source |
| MARKDOWN_ONLY | true/false | Environment variable (optional) | false |

Sources: README.md:42-51

Output Artifacts

Full Build Mode (MARKDOWN_ONLY=false or unset):

output/
├── markdown/
│   ├── 1-overview.md
│   ├── 2-quick-start.md
│   ├── section-3/
│   │   ├── 3-1-workspace.md
│   │   └── 3-2-parser.md
│   └── ...
├── book/
│   ├── index.html
│   ├── searchindex.json
│   ├── mermaid.min.js
│   └── ...
└── book.toml

Markdown-Only Mode (MARKDOWN_ONLY=true):

output/
└── markdown/
    ├── 1-overview.md
    ├── 2-quick-start.md
    └── ...

Sources: README.md:89-119

Technical Stack

The system combines multiple technology stacks in a single container using Docker multi-stage builds:

Runtime Dependencies

| Component | Version | Purpose | Installation Method |
| --- | --- | --- | --- |
| Python | 3.12-slim | Scraping runtime | Base image |
| requests | Latest | HTTP client | uv pip install |
| beautifulsoup4 | Latest | HTML parser | uv pip install |
| html2text | Latest | HTML to Markdown | uv pip install |
| mdbook | Latest | Documentation builder | Compiled from source (Rust) |
| mdbook-mermaid | Latest | Diagram preprocessor | Compiled from source (Rust) |

Build Architecture

The Dockerfile uses a two-stage build:

  1. Stage 1 (rust:latest): Compiles mdbook and mdbook-mermaid binaries (~1.5 GB, discarded)
  2. Stage 2 (python:3.12-slim): Copies binaries into Python runtime (~300-400 MB final)

Sources: README.md:146-157 Diagram 3

graph TB
    subgraph HostFS["Host Filesystem"]
HostOutput["./output/\n(bind mount)"]
end
    
    subgraph ContainerFS["Container Filesystem"]
BuildScript["/usr/local/bin/build-docs.sh"]
ScraperScript["/usr/local/bin/deepwiki-scraper.py"]
TmpWiki["/tmp/wiki_temp/\n(write buffer)"]
OutputMount["/output/\n(volume mount)"]
WorkspaceTemplates["/workspace/templates/\nheader.html, footer.html"]
end
    
    subgraph WriteOperations["File Write Operations"]
WriteRaw["write_markdown_file()\nraw .md to /tmp"]
WriteEnhanced["write enhanced .md\nafter diagram injection"]
AtomicMove["shutil.move()\nor mv command"]
CopyBook["cp -r mdbook_output/"]
end
    
    subgraph OutputStructure["/output/ final structure"]
OutMarkdown["/output/markdown/"]
OutRaw["/output/raw_markdown/"]
OutBook["/output/book/"]
OutConfig["/output/book.toml"]
end
    
 
   HostOutput -.->|-v bind mount| OutputMount
    
 
   ScraperScript -->|Phase 1| WriteRaw
 
   WriteRaw --> TmpWiki
 
   ScraperScript --> WriteEnhanced
 
   WriteEnhanced --> TmpWiki
    
 
   TmpWiki -->|atomic| AtomicMove
 
   AtomicMove --> OutMarkdown
 
   AtomicMove --> OutRaw
    
 
   BuildScript -->|Phase 2| OutConfig
 
   BuildScript -->|Phase 3| CopyBook
 
   CopyBook --> OutBook
    
 
   WorkspaceTemplates -.->|process-template.py reads| BuildScript
    
 
   OutMarkdown --> OutputMount
 
   OutRaw --> OutputMount
 
   OutBook --> OutputMount
 
   OutConfig --> OutputMount

File System Interaction

The system uses a temporary directory pattern to ensure atomic writes to the output volume:

Filesystem Write Pattern

Write Sequence

  1. deepwiki-scraper.py writes raw markdown to /tmp/wiki_temp/ using write_markdown_file() function
  2. After diagram injection via inject_mermaid_diagrams_into_markdown(), enhanced markdown moves to /output/markdown/
  3. build-docs.sh generates /output/book.toml from environment variables
  4. mdbook build writes HTML to internal directory, which build-docs.sh copies to /output/book/

This pattern ensures atomicity: partial writes never appear in /output/.
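A condensed sketch of the pattern, where generate_markdown is a hypothetical stand-in for the scraper's write phase (the real logic lives in deepwiki-scraper.py and build-docs.sh):

TMP_DIR=/tmp/wiki_temp                     # scratch space outside the output volume
generate_markdown "$TMP_DIR"               # hypothetical: all partial writes land here
mv "$TMP_DIR/markdown" /output/markdown    # publish the completed tree in one move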

Sources: README.md:19-26 README.md:54-58 Diagram 3

Configuration Philosophy

The system operates on three configuration principles:

  1. Environment-Driven: All customization via environment variables, no file editing required
  2. Auto-Detection: Intelligent defaults from Git remotes (repository URL, author name)
  3. Zero-Configuration: Minimal required inputs (REPO or auto-detect from current directory)

Minimal Example:
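A plausible invocation, assuming the image was tagged deepwiki-to-mdbook as in the Quick Start:

docker run --rm -e REPO=owner/repo -v "$(pwd)/output:/output" deepwiki-to-mdbook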

This single command triggers the complete extraction, transformation, and build pipeline.

For complete configuration options, see Configuration Reference. For deployment patterns, see Quick Start.

Sources: README.md:22-51 README.md:220-227


Quick Start


This page provides step-by-step instructions for basic usage of the DeepWiki-to-mdBook converter. It covers Docker image building, container execution with environment variables, output inspection, and local serving. For comprehensive configuration options, see Configuration Reference. For understanding the internal pipeline, see System Architecture.

Prerequisites

Before starting, ensure you have the following installed:

  • Docker (version 20.10 or later)
  • Git (for repository cloning)
  • Python 3 (for local serving)

The system requires no additional dependencies on the host machine; all build tools and Python packages are contained within the Docker image.

Sources: README.md:1-95

Basic Workflow

The typical workflow involves three steps: building the Docker image, running the container to generate documentation, and serving the output locally for preview.

flowchart TD
    Start["User initiates workflow"]
Clone["Clone repository\ngit clone"]
Build["Build Docker image\ndocker build -t deepwiki-to-mdbook ."]
Run["Run container with config\ndocker run --rm -e REPO=... -v ..."]
subgraph Container["Docker Container Execution"]
BuildScript["build-docs.sh"]
Scraper["deepwiki-scraper.py"]
Process["process-template.py"]
MdBook["mdbook build"]
end
    
    Output["Output directory\n/output mounted volume"]
Serve["Serve locally\npython3 -m http.server"]
View["View in browser\nhttp://localhost:8000"]
Start --> Clone
 
   Clone --> Build
 
   Build --> Run
 
   Run --> Container
 
   Container --> BuildScript
 
   BuildScript --> Scraper
 
   BuildScript --> Process
 
   BuildScript --> MdBook
 
   MdBook --> Output
 
   Output --> Serve
 
   Serve --> View

Workflow Diagram

Sources: README.md:12-29

Step 1: Build the Docker Image

Navigate to the repository root and build the Docker image:
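As shown in the workflow diagram above:

docker build -t deepwiki-to-mdbook .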

This command:

  • Reads the multi-stage Dockerfile at the repository root
  • Compiles mdbook and mdbook-mermaid from source in the Rust builder stage
  • Installs Python 3.12 dependencies in the final stage
  • Tags the resulting image as deepwiki-to-mdbook

The build process typically takes 5-10 minutes on the first run due to Rust compilation. Subsequent builds use Docker layer caching.

Sources: README.md:14-16

Step 2: Run the Container

Execute the container with required environment variables and a volume mount for output:
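A representative invocation, combining the variables listed below with the volume mount described under "Volume Mount" (the title value is illustrative):

docker run --rm \
  -e REPO=owner/repo \
  -e BOOK_TITLE="My Documentation" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook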

Environment Variable Configuration

The container accepts configuration exclusively through environment variables:

| Variable | Required | Description | Default |
| --- | --- | --- | --- |
| REPO | No* | GitHub repository (owner/repo format) | Auto-detected from git remote |
| BOOK_TITLE | No | Documentation title | "Documentation" |
| BOOK_AUTHORS | No | Author names | Repository owner |
| MARKDOWN_ONLY | No | Set to "true" to skip HTML build | false |

*The REPO variable is auto-detected if the container is run in a Git repository context. For manual execution, it should be explicitly provided.

Volume Mount

The -v "$(pwd)/output:/output" mount maps the host’s ./output directory to the container’s /output directory. All generated artifacts are written here.

Sources: README.md:18-27 README.md:31-51

Step 3: Serve and View Output

After the container completes execution, serve the generated HTML documentation locally:
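One plausible form of the command, matching the description below (the --directory flag requires Python 3.7+):

cd output
python3 -m http.server 8000 --directory book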

This command:

  • Changes to the output directory
  • Starts Python’s built-in HTTP server
  • Serves files from the book subdirectory
  • Listens on port 8000

Open http://localhost:8000 in a web browser to view the searchable documentation.

Output Directory Structure

The container generates four output artifacts:
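The resulting layout, matching the artifact descriptions below:

output/
├── book/           # final HTML site (serve this directory)
├── markdown/       # enhanced markdown after diagram injection
├── raw_markdown/   # markdown before enhancement (debugging)
└── book.toml       # generated mdBook configuration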

Directory Descriptions:

  • book/: The final HTML output generated by mdbook build. This is the directory to serve for viewing documentation.
  • markdown/: Enhanced markdown files after diagram injection and template processing. Contains the source files used by mdBook.
  • raw_markdown/: Markdown files immediately after HTML-to-markdown conversion, before diagram enhancement. Useful for debugging the extraction phase.
  • book.toml: The mdBook configuration file generated from environment variables.

Sources: README.md:26-27 README.md:53-58

flowchart LR
    CMD["Container CMD\nbuild-docs.sh"]
subgraph Phase1["Phase 1: Extraction"]
Fetch["fetch_and_convert_to_markdown()\ndeepwiki-scraper.py"]
RawOut["raw_markdown/\ndirectory"]
end
    
    subgraph Phase2["Phase 2: Enhancement"]
Extract["extract_mermaid_from_nextjs_data()\ndeepwiki-scraper.py"]
Normalize["normalize_mermaid_diagram()\ndeepwiki-scraper.py"]
Inject["inject_mermaid_diagrams()\ndeepwiki-scraper.py"]
Templates["process-template.py\n--input templates/header.html"]
MdOut["markdown/\ndirectory"]
end
    
    subgraph Phase3["Phase 3: Build"]
Summary["generate_summary()\nbuild-docs.sh"]
BookInit["mdbook init\nbinary"]
BookBuild["mdbook build\nbinary"]
BookOut["book/\ndirectory"]
end
    
 
   CMD --> Fetch
 
   Fetch --> RawOut
 
   RawOut --> Extract
 
   Extract --> Normalize
 
   Normalize --> Inject
 
   Inject --> Templates
 
   Templates --> MdOut
 
   MdOut --> Summary
 
   Summary --> BookInit
 
   BookInit --> BookBuild
 
   BookBuild --> BookOut

Container Execution Flow

The following diagram maps the container’s internal execution to specific code entities:

This diagram shows the execution path through the three phases of the pipeline, with references to actual functions and binaries. The build-docs.sh script orchestrates all three phases sequentially. For detailed information on each phase, see Three-Phase Pipeline.

Sources: README.md:72-77

Common Usage Patterns

Pattern 1: Custom Book Title and Authors
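A sketch of the invocation, assuming the deepwiki-to-mdbook image tag from Step 1:

docker run --rm \
  -e REPO=facebook/react \
  -e BOOK_TITLE="React Documentation" \
  -e BOOK_AUTHORS="Meta Open Source" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook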

Pattern 2: Markdown-Only Mode

Skip the HTML build to inspect or further process the markdown files:
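For example:

docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook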

The markdown/ directory will contain all enhanced markdown files, but the book/ directory will not be created. See Markdown-Only Mode for more details.

Pattern 3: Custom Templates

Provide custom header and footer templates by mounting a templates directory:
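A plausible mount, assuming the container reads templates from /workspace/templates/ as shown in the filesystem layout under System Architecture:

docker run --rm \
  -e REPO=owner/repo \
  -v "$(pwd)/my-templates:/workspace/templates" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook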

Your my-templates/ directory should contain header.html and/or footer.html. See Template System for template variable syntax and Custom Templates for a comprehensive guide.

Sources: README.md:39-51

Minimal Example

For a minimal working example with auto-detected repository:
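A sketch, assuming the current Git repository is mounted and used as the working directory so the container's git auto-detection can read the remote (the /repo mount point is an assumption):

docker run --rm \
  -v "$(pwd):/repo" -w /repo \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook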

The REPO variable is auto-detected from git remote get-url origin, and BOOK_AUTHORS defaults to the repository owner.

Sources: README.md:12-29 README.md:34

Verification Steps

After running the container, verify successful execution:

  1. Check container logs : The container should print progress messages for each phase.
  2. Inspect output directory : Ensure all four artifacts exist (book/, markdown/, raw_markdown/, book.toml).
  3. Verify HTML structure : book/index.html should exist and contain the search interface.
  4. Test local serving : The HTTP server should start without errors.
  5. Browse documentation : Navigate to http://localhost:8000 and verify page rendering and search functionality.

Next Steps

After completing the Quick Start:

Sources: README.md:1-95


Configuration Reference


This document provides a comprehensive reference for all configuration options available in the DeepWiki-to-mdBook Converter system. It covers environment variables, their default values, validation logic, auto-detection features, and how configuration flows through the system components.

For information about running the system with these configurations, see Quick Start. For details on how auto-detection works internally, see Auto-Detection Features.

Configuration System Overview

The DeepWiki-to-mdBook Converter uses environment variables as its sole configuration mechanism. All configuration is processed by the build-docs.sh orchestrator script at runtime, with no configuration files required. The system provides intelligent defaults and auto-detection capabilities to minimize required configuration.

Configuration Flow Diagram

flowchart TD
    User["User/CI System"]
Docker["docker run -e VAR=value"]
subgraph "build-docs.sh Configuration Processing"
        AutoDetect["Git Auto-Detection\n[build-docs.sh:8-19]"]
ParseEnv["Environment Variable Parsing\n[build-docs.sh:21-26]"]
Defaults["Default Value Assignment\n[build-docs.sh:43-45]"]
Validate["Validation\n[build-docs.sh:32-37]"]
end
    
    subgraph "Configuration Consumers"
        Scraper["deepwiki-scraper.py\nREPO parameter"]
BookToml["book.toml Generation\n[build-docs.sh:85-103]"]
SummaryGen["SUMMARY.md Generation\n[build-docs.sh:113-159]"]
end
    
 
   User -->|Set environment variables| Docker
 
   Docker -->|Container startup| AutoDetect
 
   AutoDetect -->|REPO detection| ParseEnv
 
   ParseEnv -->|Parse all vars| Defaults
 
   Defaults -->|Apply defaults| Validate
 
   Validate -->|REPO validated| Scraper
 
   Validate -->|BOOK_TITLE, BOOK_AUTHORS, GIT_REPO_URL| BookToml
 
   Validate -->|No direct config needed| SummaryGen

Sources: build-docs.sh:1-206 README.md:41-51

Environment Variables Reference

The following table lists all environment variables supported by the system:

| Variable | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| REPO | String | Conditional | Auto-detected from Git remote | GitHub repository in owner/repo format. Required if not running in a Git repository with a GitHub remote. |
| BOOK_TITLE | String | No | "Documentation" | Title displayed in the generated mdBook documentation. Used in the book.toml title field. |
| BOOK_AUTHORS | String | No | Repository owner (from REPO) | Author name(s) displayed in the documentation. Used in the book.toml authors array. |
| GIT_REPO_URL | String | No | https://github.com/{REPO} | Full GitHub repository URL. Used for "Edit this page" links in mdBook output. |
| MARKDOWN_ONLY | Boolean | No | "false" | When "true", skips Phase 3 (mdBook build) and outputs only extracted Markdown files. Useful for debugging. |

Sources: build-docs.sh:21-26 README.md:44-51

Variable Details and Usage

REPO

Format: owner/repo (e.g., "facebook/react" or "microsoft/vscode")

Purpose: Identifies the GitHub repository to scrape from DeepWiki.com. This is the primary configuration variable that drives the entire system.

flowchart TD
    Start["build-docs.sh Startup"]
CheckEnv{"REPO environment\nvariable set?"}
UseEnv["Use provided REPO value\n[build-docs.sh:22]"]
CheckGit{"Git repository\ndetected?"}
GetRemote["Execute: git config --get\nremote.origin.url\n[build-docs.sh:12]"]
ParseURL["Extract owner/repo using regex:\n.*github\\.com[:/]([^/]+/[^/\\.]+)\n[build-docs.sh:16]"]
SetRepo["Set REPO variable\n[build-docs.sh:16]"]
ValidateRepo{"REPO is set?"}
Error["Exit with error\n[build-docs.sh:33-37]"]
Continue["Continue with\nREPO=$REPO_OWNER/$REPO_NAME"]
Start --> CheckEnv
 
   CheckEnv -->|Yes| UseEnv
 
   CheckEnv -->|No| CheckGit
 
   CheckGit -->|Yes| GetRemote
 
   CheckGit -->|No| ValidateRepo
 
   GetRemote --> ParseURL
 
   ParseURL --> SetRepo
 
   UseEnv --> ValidateRepo
 
   SetRepo --> ValidateRepo
 
   ValidateRepo -->|No| Error
 
   ValidateRepo -->|Yes| Continue

Auto-Detection Logic:
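Condensed into a single line, the detection shown in the flowchart amounts to something like:

REPO=$(git config --get remote.origin.url | sed -E 's#.*github\.com[:/]([^/]+/[^/\.]+).*#\1#')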

Sources: build-docs.sh:8-37

Validation: The system exits with an error if REPO is not set and cannot be auto-detected:

ERROR: REPO must be set or run from within a Git repository with a GitHub remote
Usage: REPO=owner/repo $0

Usage in System: build-docs.sh passes REPO to deepwiki-scraper.py as its first argument (python3 /usr/local/bin/deepwiki-scraper.py $REPO $WIKI_DIR) and splits it into REPO_OWNER and REPO_NAME for deriving defaults.

BOOK_TITLE

Default: "Documentation"

Purpose: Sets the title of the generated mdBook documentation. This appears in the browser tab, navigation header, and book metadata.

Usage: Injected into the book.toml configuration file (build-docs.sh:87):
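Presumably a line like the following inside the generated book.toml heredoc:

title = "${BOOK_TITLE}"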

Examples:

  • BOOK_TITLE="React Documentation"
  • BOOK_TITLE="VS Code Internals"
  • BOOK_TITLE="Apache Arrow DataFusion Developer Guide"

Sources: build-docs.sh:23 build-docs.sh:87

BOOK_AUTHORS

Default: Repository owner extracted from REPO

Purpose: Sets the author name(s) in the mdBook documentation metadata.

Default Assignment Logic: build-docs.sh:44
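A reconstruction consistent with the description below (the exact line is not shown in this extract):

BOOK_AUTHORS="${BOOK_AUTHORS:-$REPO_OWNER}"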

This uses shell parameter expansion to set BOOK_AUTHORS to REPO_OWNER only if BOOK_AUTHORS is unset or empty.

Usage: Injected into book.toml as an array (build-docs.sh:88):
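Presumably:

authors = ["${BOOK_AUTHORS}"]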

Examples:

  • If REPO="facebook/react" and BOOK_AUTHORS not set → BOOK_AUTHORS="facebook"
  • Explicitly set: BOOK_AUTHORS="Meta Open Source"
  • Multiple authors: BOOK_AUTHORS="John Doe, Jane Smith" (rendered as single string in array)

Sources: build-docs.sh:24 build-docs.sh:44 build-docs.sh:88

GIT_REPO_URL

Default: https://github.com/{REPO}

Purpose: Provides the full GitHub repository URL used for “Edit this page” links in the generated mdBook documentation. Each page includes a link back to the source repository.

Default Assignment Logic: build-docs.sh:45
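A reconstruction of the assignment:

GIT_REPO_URL="${GIT_REPO_URL:-https://github.com/$REPO}"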

Usage: Injected into the book.toml configuration (build-docs.sh:95):
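Within [output.html], presumably:

git-repository-url = "${GIT_REPO_URL}"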

Notes:

  • mdBook automatically appends /edit/main/ or similar paths based on its heuristics
  • The URL must be a valid Git repository URL for the edit links to work correctly
  • Can be overridden for non-standard Git hosting scenarios

Sources: build-docs.sh:25 build-docs.sh:45 build-docs.sh:95

MARKDOWN_ONLY

Default: "false"

Type: Boolean string ("true" or "false")

Purpose: Controls whether the system executes the full three-phase pipeline or stops after Phase 2 (Markdown extraction with diagram enhancement). When set to "true", Phase 3 (mdBook build) is skipped.

flowchart TD
    Start["build-docs.sh Execution"]
Phase1["Phase 1: Scrape & Extract\n[build-docs.sh:56-58]"]
Phase2["Phase 2: Enhance Diagrams\n(within deepwiki-scraper.py)"]
CheckMode{"MARKDOWN_ONLY\n== 'true'?\n[build-docs.sh:61]"}
CopyMD["Copy markdown to /output/markdown\n[build-docs.sh:64-65]"]
ExitEarly["Exit (skipping mdBook build)\n[build-docs.sh:75]"]
Phase3Init["Phase 3: Initialize mdBook\n[build-docs.sh:79-106]"]
BuildBook["Build HTML documentation\n[build-docs.sh:176]"]
CopyAll["Copy all outputs\n[build-docs.sh:179-191]"]
Start --> Phase1
 
   Phase1 --> Phase2
 
   Phase2 --> CheckMode
 
   CheckMode -->|Yes| CopyMD
 
   CopyMD --> ExitEarly
 
   CheckMode -->|No| Phase3Init
 
   Phase3Init --> BuildBook
 
   BuildBook --> CopyAll
    
    style ExitEarly fill:#ffebee
    style CopyAll fill:#e8f5e9

Execution Flow with MARKDOWN_ONLY:

Sources: build-docs.sh:26 build-docs.sh:61-76

Use Cases:

  • Debugging diagram placement: Quickly iterate on diagram matching without waiting for mdBook build
  • Markdown-only extraction: When you only need the Markdown source files
  • Faster feedback loops: mdBook build adds significant time; skipping it speeds up testing
  • Custom processing: Extract Markdown for processing with different documentation tools

Output Differences:

| Mode | Output Directory Structure |
| --- | --- |
| MARKDOWN_ONLY="false" (default) | /output/book/ (HTML site), /output/markdown/ (source), /output/book.toml (config) |
| MARKDOWN_ONLY="true" | /output/markdown/ (source only) |

Performance Impact: Markdown-only mode is approximately 3-5x faster, as it skips mdBook initialization, mdbook-mermaid asset installation, and the mdbook build step.

Sources: build-docs.sh:61-76 README.md:55-76

Internal Configuration Variables

These variables are derived or used internally and are not meant to be configured by users:

| Variable | Source | Purpose |
| --- | --- | --- |
| WORK_DIR | Hard-coded: /workspace (build-docs.sh:27) | Temporary working directory inside container |
| WIKI_DIR | Derived: $WORK_DIR/wiki (build-docs.sh:28) | Directory where deepwiki-scraper.py outputs Markdown |
| OUTPUT_DIR | Hard-coded: /output (build-docs.sh:29) | Container output directory (mounted as volume) |
| BOOK_DIR | Derived: $WORK_DIR/book (build-docs.sh:30) | mdBook project directory |
| REPO_OWNER | Extracted from REPO (build-docs.sh:40) | First component of owner/repo |
| REPO_NAME | Extracted from REPO (build-docs.sh:41) | Second component of owner/repo |

Sources: build-docs.sh:27-30 build-docs.sh:40-41

Configuration Precedence and Inheritance

The system follows this precedence order for configuration values: explicit environment variables first, then Git auto-detection (for REPO), then derived defaults and hard-coded fallbacks.

Sources: build-docs.sh:8-45

Example Scenarios:

  1. User provides all values:

All explicit values used; no auto-detection occurs.

  2. User provides only REPO:

    • REPO: "facebook/react" (explicit)
    • BOOK_TITLE: "Documentation" (default)
    • BOOK_AUTHORS: "facebook" (derived from REPO)
    • GIT_REPO_URL: "https://github.com/facebook/react" (derived)
    • MARKDOWN_ONLY: "false" (default)
  3. User provides no values in Git repo:

    • REPO: Auto-detected from git config --get remote.origin.url
    • All other values derived or defaulted as above

Generated Configuration Files

The system generates configuration files dynamically based on environment variables:

book.toml

Location: Created at $BOOK_DIR/book.toml (build-docs.sh:85), copied to /output/book.toml (build-docs.sh:191)

Template Structure:
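A reconstruction based on the Phase 3 orchestration diagram (exact whitespace and ordering may differ):

cat > book.toml <<EOF
[book]
title = "${BOOK_TITLE}"
authors = ["${BOOK_AUTHORS}"]

[output.html]
default-theme = "rust"
git-repository-url = "${GIT_REPO_URL}"

[output.html.fold]
enable = true

[preprocessor.mermaid]
command = "mdbook-mermaid"
EOF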

Sources: build-docs.sh:85-103

Variable Substitution Mapping:

| Template Variable | Environment Variable | Section |
| --- | --- | --- |
| ${BOOK_TITLE} | $BOOK_TITLE | [book] |
| ${BOOK_AUTHORS} | $BOOK_AUTHORS | [book] |
| ${GIT_REPO_URL} | $GIT_REPO_URL | [output.html] |

Hard-Coded Values: default-theme = "rust", the [preprocessor.mermaid] command (mdbook-mermaid), and [output.html.fold] enable = true are fixed in the template and not configurable via environment variables.

SUMMARY.md

Location: Created at $BOOK_DIR/src/SUMMARY.md (build-docs.sh:159)

Generation: Automatically generated from file structure in $WIKI_DIR, no direct environment variable input. See SUMMARY.md Generation for details.

Sources: build-docs.sh:109-159

Configuration Examples

Minimal Configuration
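Assuming the image tag from the Quick Start:

docker run --rm -e REPO=owner/repo -v "$(pwd)/output:/output" deepwiki-to-mdbook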

Results:

  • REPO: "owner/repo"
  • BOOK_TITLE: "Documentation"
  • BOOK_AUTHORS: "owner"
  • GIT_REPO_URL: "https://github.com/owner/repo"
  • MARKDOWN_ONLY: "false"

Full Custom Configuration
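For example:

docker run --rm \
  -e REPO=facebook/react \
  -e BOOK_TITLE="React Documentation" \
  -e BOOK_AUTHORS="Meta Open Source" \
  -e GIT_REPO_URL="https://github.com/facebook/react" \
  -e MARKDOWN_ONLY=false \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook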

Auto-Detected Configuration
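A sketch, with the repository mount point being an assumption:

docker run --rm -v "$(pwd):/repo" -w /repo -v "$(pwd)/output:/output" deepwiki-to-mdbook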

Note: This only works if the current directory is a Git repository with a GitHub remote URL configured.

Debugging Configuration
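For example:

docker run --rm -e REPO=owner/repo -e MARKDOWN_ONLY=true -v "$(pwd)/output:/output" deepwiki-to-mdbook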

Outputs only Markdown files to /output/markdown/, skipping the mdBook build phase.

Sources: README.md:28-88

Configuration Validation

The system performs validation on the REPO variable build-docs.sh:32-37:
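A reconstruction consistent with the error message shown earlier:

if [ -z "$REPO" ]; then
  echo "ERROR: REPO must be set or run from within a Git repository with a GitHub remote"
  echo "Usage: REPO=owner/repo $0"
  exit 1
fi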

Validation Rules:

  • REPO must be non-empty after auto-detection
  • No format validation is performed on REPO value (e.g., owner/repo pattern)
  • Invalid REPO values will cause failures during scraping phase, not during validation

Other Variables:

  • No validation performed on BOOK_TITLE, BOOK_AUTHORS, or GIT_REPO_URL
  • MARKDOWN_ONLY is not validated; any value other than "true" is treated as false

Sources: build-docs.sh:32-37

Configuration Debugging

To debug configuration values, check the console output at startup build-docs.sh:47-53:

Configuration:
  Repository:    facebook/react
  Book Title:    React Documentation
  Authors:       Meta Open Source
  Git Repo URL:  https://github.com/facebook/react
  Markdown Only: false

This output shows the final resolved configuration values after auto-detection, derivation, and defaults are applied.

Sources: build-docs.sh:47-53


System Architecture


This page provides a comprehensive overview of the DeepWiki-to-mdBook converter’s architecture, including its component organization, execution model, and data flow patterns. The system is designed as a containerized pipeline that transforms DeepWiki content into searchable mdBook documentation through three distinct processing phases.

For detailed information about the three-phase transformation pipeline, see Three-Phase Pipeline. For Docker-specific implementation details, see Docker Multi-Stage Build.

Architectural Overview

The system follows a pipeline architecture with three sequential phases, orchestrated by a shell script and executed within a Docker container. All components are stateless and communicate through the filesystem, with no external dependencies required at runtime.

graph TB
    subgraph Docker["Docker Container (python:3.12-slim)"]
subgraph Executables["/usr/local/bin/"]
BuildScript["build-docs.sh"]
Scraper["deepwiki-scraper.py"]
TemplateProc["process-template.py"]
mdBook["mdbook"]
mdBookMermaid["mdbook-mermaid"]
end
        
        subgraph Workspace["/workspace/"]
Templates["templates/\nheader.html\nfooter.html"]
WorkingDirs["wiki/\nraw_markdown/\nbook/"]
end
        
        subgraph Output["/output/ (Volume Mount)"]
BookHTML["book/"]
MarkdownSrc["markdown/"]
RawMarkdown["raw_markdown/"]
BookConfig["book.toml"]
end
    end
    
    subgraph External["External Dependencies"]
DeepWiki["deepwiki.com"]
GitRemote["git remote"]
end
    
 
   BuildScript -->|executes| Scraper
 
   BuildScript -->|executes| TemplateProc
 
   BuildScript -->|executes| mdBook
 
   BuildScript -->|executes| mdBookMermaid
    
 
   Scraper -->|writes| WorkingDirs
 
   TemplateProc -->|reads| Templates
 
   TemplateProc -->|outputs HTML| BuildScript
 
   mdBook -->|reads| WorkingDirs
 
   mdBook -->|writes| BookHTML
    
 
   BuildScript -->|copies to| Output
    
 
   Scraper -->|HTTP GET| DeepWiki
 
   BuildScript -->|auto-detect| GitRemote
    
    style Executables fill:#f0f0f0
    style Workspace fill:#f0f0f0
    style Output fill:#e8f5e9

System Composition

Diagram: Container Internal Structure and Component Relationships

The container is structured into three main areas: executables in /usr/local/bin/, working files in /workspace/, and outputs in /output/. The build-docs.sh orchestrator coordinates all components, with persistent results written to the mounted volume.

Sources: Dockerfile:1-34 scripts/build-docs.sh:1-310

Core Components

The system consists of five primary components, each with a specific responsibility in the documentation generation pipeline.

| Component | Type | Location | Primary Responsibility |
| --- | --- | --- | --- |
| build-docs.sh | Shell Script | /usr/local/bin/ | Pipeline orchestration and configuration management |
| deepwiki-scraper.py | Python Script | /usr/local/bin/ | Wiki content extraction and markdown conversion |
| process-template.py | Python Script | /usr/local/bin/ | Template variable substitution |
| mdbook | Rust Binary | /usr/local/bin/ | HTML documentation generation |
| mdbook-mermaid | Rust Binary | /usr/local/bin/ | Mermaid diagram rendering |

Component Interaction Map

Diagram: Component Interaction and Data Flow

This diagram shows how build-docs.sh coordinates the three processing components sequentially, with data flowing through working directories before final output to the mounted volume.

Sources: scripts/build-docs.sh:1-310 Dockerfile:20-33

Execution Flow

The system follows a strictly sequential execution model, with each step depending on the output of the previous step. This design simplifies error handling and allows for debugging at intermediate stages.

Build Script Orchestration

The build-docs.sh script orchestrates the entire pipeline through eight distinct steps:

  1. Configuration & Validation scripts/build-docs.sh:8-59

    • Auto-detects REPO from git remote if not provided
    • Sets defaults for BOOK_TITLE, BOOK_AUTHORS, GIT_REPO_URL
    • Validates required configuration
    • Computes derived URLs (DEEPWIKI_URL, badge URLs)
  2. Wiki Scraping scripts/build-docs.sh:61-65

    • Executes deepwiki-scraper.py with repository identifier
    • Writes to /workspace/wiki/ and /workspace/raw_markdown/
  3. Early Exit (Markdown-Only Mode) scripts/build-docs.sh:67-93

    • Optional: skip HTML build if MARKDOWN_ONLY=true
    • Copies markdown directly to output volume
  4. mdBook Initialization scripts/build-docs.sh:95-122

    • Creates /workspace/book/ structure
    • Generates book.toml configuration
    • Initializes src/ directory
  5. SUMMARY.md Generation scripts/build-docs.sh:124-188

    • Scans wiki directory for .md files
    • Sorts numerically by page number prefix
    • Builds hierarchical table of contents
    • Handles subsections in section-N/ directories
  6. Template Processing & Injection scripts/build-docs.sh:190-261

    • Processes header.html and footer.html with variable substitution
    • Injects processed HTML into every markdown file
    • Copies enhanced markdown to book/src/
  7. mdBook Build scripts/build-docs.sh:263-271

    • Installs mermaid assets via mdbook-mermaid install
    • Executes mdbook build
    • Generates searchable HTML in book/
  8. Output Copying scripts/build-docs.sh:273-309

    • Copies all artifacts to /output/ volume mount
    • Preserves intermediate outputs for debugging

Diagram: build-docs.sh Sequential Execution Flow

Sources: scripts/build-docs.sh:1-310

Three-Phase Pipeline Architecture

The core transformation happens in three distinct phases, each with specific inputs, processing logic, and outputs. This separation allows for independent testing and debugging of each phase.

Phase Overview

| Phase | Primary Component | Input | Output | Key Operations |
| --- | --- | --- | --- | --- |
| Phase 1: Extraction | deepwiki-scraper.py | DeepWiki HTML | Markdown files | Structure discovery, HTML→Markdown conversion, raw diagram extraction |
| Phase 2: Enhancement | deepwiki-scraper.py | Raw markdown + diagrams | Enhanced markdown | Diagram normalization, fuzzy matching, template injection |
| Phase 3: Build | mdbook + mdbook-mermaid | Enhanced markdown | Searchable HTML | SUMMARY generation, mermaid rendering, search index |

Diagram: Three-Phase Pipeline with Key Functions

For detailed documentation of each phase, see Three-Phase Pipeline.

Sources: scripts/build-docs.sh:61-271 README.md:72-77

graph TB
    subgraph Stage1["Stage 1: Builder (rust:latest)"]
RustToolchain["Rust Toolchain\ncargo, rustc"]
CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
Binaries["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
RustToolchain --> CargoInstall
 
       CargoInstall --> Binaries
    end
    
    subgraph Stage2["Stage 2: Runtime (python:3.12-slim)"]
PythonBase["Python 3.12 Runtime"]
PipInstall["pip install requirements"]
CopyBinaries["COPY --from=builder\nmdbook binaries"]
CopyScripts["COPY Python scripts\nCOPY Shell scripts\nCOPY Templates"]
FinalImage["Final Image\n~500MB"]
PythonBase --> PipInstall
 
       PipInstall --> CopyBinaries
 
       CopyBinaries --> CopyScripts
 
       CopyScripts --> FinalImage
    end
    
 
   Binaries -.->|copy only| CopyBinaries
    
    style Stage1 fill:#ffebee
    style Stage2 fill:#e8f5e9
    style FinalImage fill:#c8e6c9

Docker Multi-Stage Build

The Docker architecture uses a multi-stage build pattern to minimize final image size while compiling Rust-based tools from source. This approach separates build-time dependencies from runtime dependencies.

Stage Architecture

Diagram: Multi-Stage Build Process

The builder stage (approximately 2GB) is discarded after compilation, and only the compiled binaries (approximately 50MB) are copied to the final image, resulting in a significantly smaller runtime image.

Container Filesystem Layout

/usr/local/bin/
├── mdbook                    # Rust binary from builder stage
├── mdbook-mermaid            # Rust binary from builder stage
├── deepwiki-scraper.py       # Python script (executable)
├── process-template.py       # Python script (executable)
└── build-docs.sh             # Shell script (executable)

/workspace/
├── templates/
│   ├── header.html           # Default header template
│   └── footer.html           # Default footer template
├── wiki/                     # Created at runtime
├── raw_markdown/             # Created at runtime
└── book/                     # Created at runtime

/output/                      # Volume mount point
└── (user-provided volume)

For detailed Docker implementation information, see Docker Multi-Stage Build.

Sources: Dockerfile:1-34

Configuration Architecture

The system is configured entirely through environment variables and volume mounts, following the Twelve-Factor App methodology. No configuration files are required; all settings have sensible defaults.

Configuration Layers

  1. Auto-Detection scripts/build-docs.sh:8-19

    • Extracts REPO from git remote get-url origin
    • Supports GitHub URLs in multiple formats (HTTPS, SSH)
  2. Environment Variables scripts/build-docs.sh:21-26

    • User-provided overrides
    • Takes precedence over auto-detection
  3. Computed Defaults scripts/build-docs.sh:40-51

    • Derives BOOK_AUTHORS from REPO owner
    • Constructs GIT_REPO_URL from REPO
    • Generates badge URLs

Diagram: Configuration Resolution Order

Key Configuration Variables

| Variable | Default | Source | Description |
| --- | --- | --- | --- |
| REPO | (auto-detected) | scripts/build-docs.sh:9-19 | GitHub repository (owner/repo) |
| BOOK_TITLE | "Documentation" | scripts/build-docs.sh:23 | Title in book.toml |
| BOOK_AUTHORS | (derived from REPO) | scripts/build-docs.sh:45 | Author metadata |
| GIT_REPO_URL | (derived from REPO) | scripts/build-docs.sh:46 | Link in generated docs |
| MARKDOWN_ONLY | "false" | scripts/build-docs.sh:26 | Skip HTML build |
| GENERATION_DATE | (current UTC time) | scripts/build-docs.sh:200 | Timestamp in templates |

For complete configuration documentation, see Configuration Reference.

Sources: scripts/build-docs.sh:8-59 README.md:31-51

Output Artifacts

The system produces four distinct output artifacts, each serving a specific purpose in the documentation workflow:

| Artifact | Location | Purpose | Generated By |
| --- | --- | --- | --- |
| book/ | /output/book/ | Searchable HTML documentation | mdbook build |
| markdown/ | /output/markdown/ | Enhanced markdown source | deepwiki-scraper.py + templates |
| raw_markdown/ | /output/raw_markdown/ | Pre-enhancement markdown (debug) | deepwiki-scraper.py (raw output) |
| book.toml | /output/book.toml | mdBook configuration | build-docs.sh generation |

The multi-artifact design allows users to inspect intermediate stages, debug transformation issues, or use the markdown files for alternative processing workflows.

For detailed output structure documentation, see Output Structure.

Sources: scripts/build-docs.sh:273-309 README.md:53-58

Extensibility Points

The architecture provides three primary extension mechanisms:

1. Custom Templates

Users can override default header/footer templates by mounting a custom directory:
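A plausible mount, assuming the /workspace/templates path from the container filesystem layout above:

docker run --rm \
  -e REPO=owner/repo \
  -v "$(pwd)/my-templates:/workspace/templates" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook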

The process-template.py script performs variable substitution on any mounted templates, supporting custom branding and layout.

Sources: scripts/build-docs.sh:195-234 Dockerfile:26

2. Environment Variable Configuration

All behavior can be modified through environment variables without rebuilding the Docker image. This includes metadata, URLs, and operational modes.

Sources: scripts/build-docs.sh:8-59

3. Markdown-Only Mode

Setting MARKDOWN_ONLY=true allows users to skip the HTML build entirely, enabling alternative processing pipelines or custom mdBook configurations.

Sources: scripts/build-docs.sh:67-93

For advanced customization patterns, see Advanced Topics.

Summary

The DeepWiki-to-mdBook converter implements a pipeline architecture with three distinct phases (extraction, enhancement, build), orchestrated by a shell script within a multi-stage Docker container. The system is stateless, configuration-driven, and produces multiple output artifacts for different use cases. All components communicate through the filesystem, with no runtime dependencies beyond the container image.

Key architectural principles:

  • Sequential processing: Each phase depends on the previous phase's output
  • Stateless execution: No persistent state between runs
  • Configuration through environment: No config files required
  • Multi-stage build: Minimized runtime image size
  • Multiple outputs: Debugging and alternative workflows supported

Sources: Dockerfile:1-34 scripts/build-docs.sh:1-310 README.md:1-95


Three-Phase Pipeline


Purpose and Scope

This document describes the three-phase processing pipeline that transforms DeepWiki HTML pages into searchable mdBook documentation. The pipeline consists of Phase 1: Clean Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. Each phase has distinct responsibilities and uses different technology stacks.

For overall system architecture, see System Architecture. For detailed implementation of individual phases, see Phase 1: Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. For configuration that affects pipeline behavior, see Configuration Reference.

Pipeline Overview

The system processes content through three sequential phases, with an optional bypass mechanism for Phase 3. Each phase is implemented by different components and operates on files in specific directories.

Pipeline Execution Flow

stateDiagram-v2
    [*] --> Init["build-docs.sh\nParse env vars"]
    
    Init --> Phase1["Phase 1 : deepwiki-scraper.py"]
    
    state "Phase 1 : Markdown Extraction" as Phase1 {
        [*] --> ExtractStruct["extract_wiki_structure()"]
        ExtractStruct --> LoopPages["for page in pages"]
        LoopPages --> ExtractPage["extract_page_content(url)"]
        ExtractPage --> ConvertHTML["convert_html_to_markdown()"]
        ConvertHTML --> CleanFooter["clean_deepwiki_footer()"]
        CleanFooter --> WriteTemp["/workspace/wiki/*.md"]
        WriteTemp --> LoopPages
        LoopPages --> RawSnapshot["/workspace/raw_markdown/\n(debug snapshot)"]
    }
    
    Phase1 --> Phase2["Phase 2 : deepwiki-scraper.py"]
    
    state "Phase 2 : Diagram Enhancement" as Phase2 {
        [*] --> ExtractDiagrams["extract_and_enhance_diagrams()"]
        ExtractDiagrams --> FetchJS["Fetch JS payload\nextract_mermaid_from_nextjs_data()"]
        FetchJS --> NormalizeDiagrams["normalize_mermaid_diagram()\n7 normalization passes"]
        NormalizeDiagrams --> FuzzyMatch["Fuzzy match loop\n300/200/150/100/80 char chunks"]
        FuzzyMatch --> InjectFiles["Modify /workspace/wiki/*.md\nInsert ```mermaid blocks"]
    }
    
    Phase2 --> CheckMode{"MARKDOWN_ONLY\nenv var?"}
    
    CheckMode --> CopyMarkdown["build-docs.sh\ncp -r /workspace/wiki /output/markdown"] : true
    CheckMode --> Phase3 : false
    
    state "Phase 3 : mdBook Build" as Phase3 {
        [*] --> GenToml["Generate book.toml\n[book], [output.html]"]
        GenToml --> GenSummary["Generate src/SUMMARY.md\nScan .md files"]
        GenSummary --> CopyToSrc["cp -r /workspace/wiki/* src/"]
        CopyToSrc --> MermaidInstall["mdbook-mermaid install"]
        MermaidInstall --> MdBookBuild["mdbook build"]
        MdBookBuild --> OutputBook["/workspace/book/book/"]
    }
    
    Phase3 --> CopyAll["cp -r book /output/\ncp -r markdown /output/"]
    CopyMarkdown --> Done["/output directory\nready"]
    CopyAll --> Done
    Done --> [*]

Sources: scripts/build-docs.sh:61-93 python/deepwiki-scraper.py:1277-1408 python/deepwiki-scraper.py:880-1276

Phase Coordination

The build-docs.sh orchestrator coordinates all three phases and handles the decision point for markdown-only mode.

Orchestrator Control Flow

flowchart TD
    Start["Container entrypoint\nCMD: /usr/local/bin/build-docs.sh"]
Start --> ParseEnv["Parse environment\n$REPO, $BOOK_TITLE, $BOOK_AUTHORS\n$MARKDOWN_ONLY, $GIT_REPO_URL"]
ParseEnv --> CheckRepo{"$REPO\nset?"}
CheckRepo -->|No| GitDetect["git config --get remote.origin.url\nsed -E 's#.*github.com[:/]([^/]+/[^/.]+).*#\1#'"]
CheckRepo -->|Yes| SetVars["Set defaults:\nBOOK_AUTHORS=$REPO_OWNER\nGIT_REPO_URL=https://github.com/$REPO"]
GitDetect --> SetVars
    
 
   SetVars --> SetPaths["WORK_DIR=/workspace\nWIKI_DIR=/workspace/wiki\nRAW_DIR=/workspace/raw_markdown\nOUTPUT_DIR=/output\nBOOK_DIR=/workspace/book"]
SetPaths --> CallScraper["python3 /usr/local/bin/deepwiki-scraper.py $REPO $WIKI_DIR"]
CallScraper --> ScraperRuns["deepwiki-scraper.py executes:\nPhase 1: extract_wiki_structure()\nPhase 2: extract_and_enhance_diagrams()"]
ScraperRuns --> CheckMode{"$MARKDOWN_ONLY\n== 'true'?"}
CheckMode -->|Yes| QuickCopy["rm -rf $OUTPUT_DIR/markdown\nmkdir -p $OUTPUT_DIR/markdown\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\nexit 0"]
CheckMode -->|No| MkdirBook["mkdir -p $BOOK_DIR\ncd $BOOK_DIR"]
MkdirBook --> GenToml["cat > book.toml <<EOF\n[book] title=$BOOK_TITLE\n[output.html] git-repository-url=$GIT_REPO_URL\n[preprocessor.mermaid]"]
GenToml --> MkdirSrc["mkdir -p src"]
GenSummary --> GenSummaryDetail["Generate src/SUMMARY.md\nls $WIKI_DIR/*.md | sort -t- -k1 -n\nfor each file: head -1 $file | sed 's/^# //'"]
GenSummary --> CopyToSrc["cp -r $WIKI_DIR/* src/"]
CopyToSrc --> ProcessTemplates["python3 process-template.py header.html\npython3 process-template.py footer.html\nInject into src/*.md"]
ProcessTemplates --> InstallMermaid["mdbook-mermaid install $BOOK_DIR"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["mkdir -p $OUTPUT_DIR\ncp -r book $OUTPUT_DIR/\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\ncp book.toml $OUTPUT_DIR/"]
QuickCopy --> Done["Exit 0"]
CopyOutputs --> Done

Sources: scripts/build-docs.sh:8-47 scripts/build-docs.sh:61-93 scripts/build-docs.sh:95-309

Phase 1: Clean Markdown Extraction

Phase 1 discovers the wiki structure and converts HTML pages to clean Markdown, writing files to a temporary directory (/workspace/wiki). This phase is implemented entirely in Python within deepwiki-scraper.py.

Phase 1 Data Flow

flowchart LR
    DeepWiki["https://deepwiki.com/$REPO"]
DeepWiki -->|session.get base_url| ExtractStruct["extract_wiki_structure(repo, session)"]
ExtractStruct -->|soup.find_all 'a', href=re.compile ...| ParseLinks["Parse sidebar links\nPattern: /$REPO/\d+"]
ParseLinks --> PageList["pages = [\n {'number': '1', 'title': 'Overview',\n 'url': '...', 'href': '...', 'level': 0},\n {'number': '2.1', 'title': 'Sub',\n 'url': '...', 'href': '...', 'level': 1},\n ...\n]\nsorted by page number"]
    PageList --> Loop["for page in pages:"]
    Loop --> FetchPage["fetch_page(url, session)\nUser-Agent header\n3 retries with timeout=30"]
FetchPage --> ParseHTML["BeautifulSoup(response.text)\nRemove: nav, header, footer, aside\nFind: article or main or body"]
ParseHTML --> ConvertMD["h = html2text.HTML2Text()\nh.body_width = 0\nmarkdown = h.handle(html_content)"]
ConvertMD --> CleanFooter["clean_deepwiki_footer(markdown)\nRegex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
CleanFooter --> FixLinks["fix_wiki_link(match)\nRegex: /owner/repo/(\d+(?:\.\d+)*)-(.+)\nConvert to: section-N/N-M-slug.md"]
FixLinks --> ResolvePath["resolve_output_path(page_number, title)\nnormalized_number_parts()\nsanitize_filename()"]
ResolvePath --> WriteFile["filepath.write_text(markdown)\nMain: /workspace/wiki/N-slug.md\nSub: /workspace/wiki/section-N/N-M-slug.md"]
WriteFile --> Loop

Sources: python/deepwiki-scraper.py:116-163 python/deepwiki-scraper.py:751-877 python/deepwiki-scraper.py:213-228 python/deepwiki-scraper.py:165-211 python/deepwiki-scraper.py:22-26 python/deepwiki-scraper.py:28-53

Key Functions and Their Roles

| Function | File Location | Responsibility |
| --- | --- | --- |
| extract_wiki_structure() | python/deepwiki-scraper.py:116-163 | Parse main wiki page, discover all pages via sidebar links matching /repo/\d+, return sorted list of page metadata |
| extract_page_content() | python/deepwiki-scraper.py:751-877 | Fetch individual page HTML, parse with BeautifulSoup, remove nav/footer elements, convert to Markdown |
| convert_html_to_markdown() | python/deepwiki-scraper.py:213-228 | Convert HTML string to Markdown using html2text.HTML2Text() with body_width=0 (no line wrapping) |
| clean_deepwiki_footer() | python/deepwiki-scraper.py:165-211 | Scan last 50 lines for DeepWiki UI patterns (Dismiss, Refresh this wiki, etc.) and truncate |
| sanitize_filename() | python/deepwiki-scraper.py:22-26 | Strip special chars, replace spaces/hyphens, convert to lowercase: re.sub(r'[^\w\s-]', '', text) |
| normalized_number_parts() | python/deepwiki-scraper.py:28-43 | Shift DeepWiki page numbers down by 1 (page 1 becomes unnumbered), split on . into parts |
| resolve_output_path() | python/deepwiki-scraper.py:45-53 | Determine filename (N-slug.md) and optional subdirectory (section-N/) based on page numbering |
| fix_wiki_link() | python/deepwiki-scraper.py:854-876 | Rewrite internal links from /owner/repo/N-title to relative paths like ../section-N/N-M-title.md |

File Organization Logic

flowchart TD
    PageNum["page['number']\n(from DeepWiki)"]
PageNum --> Normalize["normalized_number_parts(page_number)\nSplit on '.', shift main number down by 1\nDeepWiki '1' → []\nDeepWiki '2' → ['1']\nDeepWiki '2.1' → ['1', '1']"]
Normalize --> CheckParts{"len(parts)?"}
    CheckParts -->|"0 (was page 1)"| RootOverview["Filename: overview.md\nPath: $WIKI_DIR/overview.md\nNo section dir"]
    CheckParts -->|"1 (main page)"| RootMain["Filename: N-slug.md\nExample: 1-quick-start.md\nPath: $WIKI_DIR/1-quick-start.md\nNo section dir"]
    CheckParts -->|"2+ (subsection)"| ExtractSection["main_section = parts[0]\nsection_dir = f'section-{main_section}'"]
ExtractSection --> CreateDir["section_path = Path($WIKI_DIR) / section_dir\nsection_path.mkdir(exist_ok=True)"]
CreateDir --> SubFile["Filename: N-M-slug.md\nExample: 1-1-installation.md\nPath: $WIKI_DIR/section-1/1-1-installation.md"]

The system organizes files hierarchically based on page numbering. DeepWiki pages are numbered starting from 1, but the system shifts them down by 1 so that the first page becomes unnumbered (the overview).

Sources: python/deepwiki-scraper.py:28-43 python/deepwiki-scraper.py:45-63 python/deepwiki-scraper.py:1332-1338

Phase 2: Diagram Enhancement

Phase 2 extracts Mermaid diagrams from the JavaScript payload and uses fuzzy matching to intelligently place them in the appropriate Markdown files. This phase operates on files in the temporary directory (/workspace/wiki).

Phase 2 Algorithm Flow

flowchart TD
    Start["extract_and_enhance_diagrams(repo, temp_dir, session, diagram_source_url)"]
Start --> FetchJS["response = session.get(diagram_source_url)\nhtml_text = response.text"]
    FetchJS --> ExtractRegex["Regex over fenced mermaid blocks\n(tolerates escaped and literal newlines)\nre.finditer(pattern, html_text, re.DOTALL)"]
    ExtractRegex --> CountTotal["print(f'Found {len(diagram_matches)} total diagrams')"]
    CountTotal --> ExtractContext["for match in diagram_matches:\ncontext_start = max(0, match.start() - 2000)\ncontext_before = html_text[context_start:match.start()]"]
    ExtractContext --> Unescape["Unescape escape sequences:\nreplace('\\n', newline)\nreplace('\\t', tab)\nreplace('\\u003c', '<')\nreplace('\\u003e', '>')"]
    Unescape --> ParseContext["context_lines = [l for l in context.split('\\n') if l.strip()]\nFind last_heading (line starting with #)\nExtract anchor_text (last 2-3 non-heading lines, max 300 chars)"]
ParseContext --> Normalize["normalize_mermaid_diagram(diagram)\n7 normalization passes:\nnormalize_mermaid_edge_labels()\nnormalize_mermaid_state_descriptions()\nnormalize_flowchart_nodes()\nnormalize_statement_separators()\nnormalize_empty_node_labels()\nnormalize_gantt_diagram()"]
Normalize --> BuildContexts["diagram_contexts.append({\n 'last_heading': last_heading,\n 'anchor_text': anchor_text[-300:],\n 'diagram': normalized_diagram\n})"]
    BuildContexts --> ScanFiles["md_files = list(temp_dir.glob('**/*.md'))\nfor md_file in md_files:"]
    ScanFiles --> SkipExisting{"re.search(r'^\s*`{3,}\s*mermaid\b', content)?"}
SkipExisting -->|Yes| ScanFiles
 
   SkipExisting -->|No| NormalizeContent["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
    NormalizeContent --> MatchLoop["for idx, item in enumerate(diagram_contexts):"]
    MatchLoop --> TryChunks["for chunk_size in [300, 200, 150, 100, 80]:\ntest_chunk = anchor_normalized[-chunk_size:]\npos = content_normalized.find(test_chunk)"]
TryChunks --> FoundMatch{"pos != -1?"}
FoundMatch -->|Yes| ConvertToLine["char_count = 0\nfor line_num, line in enumerate(lines):\n char_count += len(' '.join(line.split())) + 1\n if char_count >= pos: best_match_line = line_num"]
FoundMatch -->|No| TryHeading["for line_num, line in enumerate(lines):\nif heading_normalized in line_normalized:\n best_match_line = line_num"]
TryHeading --> FoundMatch2{"best_match_line != -1?"}
FoundMatch2 -->|Yes| ConvertToLine
 
   FoundMatch2 -->|No| MatchLoop
    
 
   ConvertToLine --> CheckScore{"best_match_score >= 80?"}
CheckScore -->|Yes| FindInsert["insert_line = best_match_line + 1\nSkip blank lines, skip paragraph/list"]
CheckScore -->|No| MatchLoop
    
 
   FindInsert --> QueueInsert["pending_insertions.append(\n (insert_line, diagram, score, idx)\n)\ndiagrams_used.add(idx)"]
QueueInsert --> MatchLoop
    
 
   MatchLoop --> SortInsert["pending_insertions.sort(key=lambda x: x[0], reverse=True)"]
SortInsert --> InsertLoop["for insert_line, diagram, score, idx in pending_insertions:\nlines.insert(insert_line, '')\nlines.insert(insert_line, f'{fence}mermaid')\nlines.insert(insert_line, diagram)\nlines.insert(insert_line, fence)\nlines.insert(insert_line, '')"]
    InsertLoop --> WriteFile["with open(md_file, 'w') as f:\nf.write('\\n'.join(lines))"]
WriteFile --> ScanFiles

Sources: python/deepwiki-scraper.py:880-1276 python/deepwiki-scraper.py:899-1088 python/deepwiki-scraper.py:1149-1273 python/deepwiki-scraper.py:230-393

Fuzzy Matching Algorithm

The algorithm uses progressively shorter anchor text chunks to find the best match location for each diagram. The score threshold of 80 ensures only high-confidence matches are inserted.

Sources: python/deepwiki-scraper.py:1184-1218

flowchart LR
    AnchorText["anchor_text\n(last 300 chars from context)"]
AnchorText --> NormalizeA["anchor_normalized = anchor.lower()\nanchor_normalized = ' '.join(anchor_normalized.split())"]
MDFile["markdown file content"]
MDFile --> NormalizeC["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
    NormalizeA --> Loop["for chunk_size in [300, 200, 150, 100, 80]:"]
    NormalizeC --> Loop
    Loop --> Extract["if len(anchor_normalized) >= chunk_size:\ntest_chunk = anchor_normalized[-chunk_size:]"]
Extract --> Find["pos = content_normalized.find(test_chunk)"]
Find --> FoundPos{"pos != -1?"}
FoundPos -->|Yes| CharToLine["char_count = 0\nfor line_num, line in enumerate(lines):\n char_count += len(' '.join(line.split())) + 1\n if char_count >= pos:\n best_match_line = line_num\n best_match_score = chunk_size"]
FoundPos -->|No| Loop
    
 
   CharToLine --> CheckThresh{"best_match_score >= 80?"}
CheckThresh -->|Yes| Accept["Accept match\nQueue for insertion"]
CheckThresh -->|No| HeadingFallback["Try heading_normalized in line_normalized\nbest_match_score = 50"]
HeadingFallback --> CheckThresh2{"best_match_score >= 80?"}
CheckThresh2 -->|Yes| Accept
 
   CheckThresh2 -->|No| Reject["Reject match\nSkip diagram"]
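
The two diagrams above compress the same loop. A condensed Python sketch of the matching logic follows; names mirror the flowcharts, but this is an illustrative reconstruction, not the script's exact code:

```python
# Illustrative sketch of the progressive-chunk matcher (see python/deepwiki-scraper.py:1184-1218).
def find_insertion_line(anchor_text: str, content: str) -> tuple[int, int]:
    """Return (best_match_line, score); score is the matched chunk size, or 0 on failure."""
    anchor = " ".join(anchor_text.lower().split())
    normalized = " ".join(content.lower().split())
    lines = content.split("\n")

    for chunk_size in [300, 200, 150, 100, 80]:
        if len(anchor) < chunk_size:
            continue
        pos = normalized.find(anchor[-chunk_size:])
        if pos == -1:
            continue
        # Map the character offset in the normalized text back to a line number.
        char_count = 0
        for line_num, line in enumerate(lines):
            char_count += len(" ".join(line.split())) + 1
            if char_count >= pos:
                return line_num, chunk_size
    return -1, 0
```

Queued insertions are then applied in descending line order (the `reverse=True` sort shown earlier), so inserting one diagram never shifts the line numbers of the insertions still pending.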

Diagram Extraction from JavaScript

Diagrams are extracted from the Next.js JavaScript payload embedded in the HTML response. DeepWiki stores all diagrams for all pages in a single JavaScript bundle, which requires fuzzy matching to place each diagram in the correct file.

Extraction Method

The primary extraction pattern captures fenced Mermaid blocks with various newline representations:
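
The exact pattern lives at python/deepwiki-scraper.py:899-901; a representative pattern (illustrative, not verbatim) that tolerates both literal and escaped newlines looks like:

```python
import re

# Illustrative only: capture the body of a ```mermaid fence whose newlines may
# appear either as real newlines or as the two-character escape sequence \n.
MERMAID_BLOCK = re.compile(r"```mermaid(?:\\n|\n)(.*?)(?:\\n|\n)```", re.DOTALL)
```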

Unescape Sequence

Each diagram block undergoes unescaping to convert JavaScript string literals to actual text:

| Escape Sequence | Replacement | Purpose |
|---|---|---|
| `\\n` | `\n` | Newline characters in diagram syntax |
| `\\t` | `\t` | Tab characters for indentation |
| `\\"` | `"` | Quoted strings in node labels |
| `\\\\` | `\` | Literal backslashes |
| `\\u003c` | `<` | HTML less-than entity |
| `\\u003e` | `>` | HTML greater-than entity |
| `\\u0026` | `&` | HTML ampersand entity |
| `<br/>`, `<br>` | (space) | HTML line breaks in labels |
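
In Python this pass reduces to a sequence of ordered `str.replace` calls; a sketch following the table (illustrative, not the script's exact code):

```python
def unescape_diagram(raw: str) -> str:
    """Sketch of the unescape pass from the table above."""
    # Note: replacement order matters -- the literal-backslash rule must not
    # corrupt newlines and tabs produced by the earlier replacements.
    for old, new in [
        ("\\n", "\n"), ("\\t", "\t"), ('\\"', '"'), ("\\\\", "\\"),
        ("\\u003c", "<"), ("\\u003e", ">"), ("\\u0026", "&"),
        ("<br/>", " "), ("<br>", " "),
    ]:
        raw = raw.replace(old, new)
    return raw
```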

Sources: python/deepwiki-scraper.py:899-901 python/deepwiki-scraper.py:1039-1047

Phase 3: mdBook Build

Phase 3 generates mdBook configuration, creates the table of contents, and builds the final HTML documentation. This phase is orchestrated by build-docs.sh and invokes Rust tools (mdbook, mdbook-mermaid).

Phase 3 Component Interactions

flowchart TD
    Entry["build-docs.sh line 95\nPhase 3 starts"]
Entry --> MkdirBook["mkdir -p $BOOK_DIR\ncd $BOOK_DIR\n$BOOK_DIR=/workspace/book"]
MkdirBook --> GenToml["cat > book.toml <<EOF\n[book]\ntitle = \"$BOOK_TITLE\"\nauthors = [\"$BOOK_AUTHORS\"]\n[output.html]\ndefault-theme = \"rust\"\ngit-repository-url = \"$GIT_REPO_URL\"\n[preprocessor.mermaid]\ncommand = \"mdbook-mermaid\"\n[output.html.fold]\nenable = true"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> InitSummary["{ echo '# Summary'; echo ''; } > src/SUMMARY.md"]
InitSummary --> FindOverview["main_pages_list=$(ls $WIKI_DIR/*.md)\noverview_file=$(printf '%s\n' $main_pages_list | grep -Ev '^[0-9]' | head -1)\ntitle=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')"]
FindOverview --> WriteOverview["echo \"[$title]($overview_file)\" >> src/SUMMARY.md\necho '' >> src/SUMMARY.md"]
WriteOverview --> SortPages["main_pages=$(printf '%s\n' $main_pages_list | awk -F/ '{print $NF}' | grep -E '^[0-9]' | sort -t- -k1 -n)"]
SortPages --> LoopPages["echo \"$main_pages\" | while read -r file; do"]
LoopPages --> ExtractTitle["filename=$(basename \"$file\")\ntitle=$(head -1 \"$file\" | sed 's/^# //')"]
ExtractTitle --> ExtractNum["section_num=$(echo \"$filename\" | grep -oE '^[0-9]+')"]
ExtractNum --> CheckSubdir{"[ -d \"$WIKI_DIR/section-$section_num\" ]?"}
CheckSubdir -->|Yes| WriteMain["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
WriteMain --> LoopSubs["ls $WIKI_DIR/section-$section_num/*.md | awk -F/ '{print $NF}' | sort -t- -k1 -n | while read subname; do"]
LoopSubs --> WriteSub["subfile=\"section-$section_num/$subname\"\nsubtitle=$(head -1 \"$subfile\" | sed 's/^# //')\necho \"  - [$subtitle]($subfile)\" >> src/SUMMARY.md"]
WriteSub --> LoopSubs
CheckSubdir -->|No| WriteStandalone["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
LoopSubs --> LoopPages
WriteStandalone --> LoopPages
    
 
   LoopPages --> CopySrc["cp -r $WIKI_DIR/* src/"]
CopySrc --> ProcessTemplates["python3 /usr/local/bin/process-template.py $HEADER_TEMPLATE\npython3 /usr/local/bin/process-template.py $FOOTER_TEMPLATE\nInject into src/*.md and src/*/*.md"]
ProcessTemplates --> MermaidInstall["mdbook-mermaid install $BOOK_DIR"]
MermaidInstall --> MdBookBuild["mdbook build\nReads book.toml and src/SUMMARY.md\nProcesses src/*.md files\nGenerates book/index.html"]
MdBookBuild --> CopyOut["mkdir -p $OUTPUT_DIR\ncp -r book $OUTPUT_DIR/\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\ncp book.toml $OUTPUT_DIR/"]

Sources: scripts/build-docs.sh:95-309 scripts/build-docs.sh:124-188 scripts/build-docs.sh:190-261 scripts/build-docs.sh:263-271

book.toml Generation

The orchestrator dynamically generates book.toml with runtime configuration from environment variables:
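
Reconstructed from the heredoc at scripts/build-docs.sh:102-119 (also summarized in the Phase 3 diagram above), the generated file has this shape, with the `$`-placeholders replaced by the resolved environment values:

```toml
[book]
title = "$BOOK_TITLE"
authors = ["$BOOK_AUTHORS"]

[output.html]
default-theme = "rust"
git-repository-url = "$GIT_REPO_URL"

[output.html.fold]
enable = true

[preprocessor.mermaid]
command = "mdbook-mermaid"
```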

Sources: scripts/build-docs.sh:102-119

flowchart TD
    Start["Generate src/SUMMARY.md\n{ echo '# Summary'; echo ''; } > src/SUMMARY.md"]
Start --> ListFiles["main_pages_list=$(ls $WIKI_DIR/*.md 2>/dev/null || true)"]
ListFiles --> FindOverview["overview_file=$(printf '%s\n' $main_pages_list | awk -F/ '{print $NF}' | grep -Ev '^[0-9]' | head -1)"]
FindOverview --> WriteOverview["if [ -n \"$overview_file\" ]; then\ntitle=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')\necho \"[${title:-Overview}]($overview_file)\" >> src/SUMMARY.md\necho '' >> src/SUMMARY.md\nfi"]
WriteOverview --> FilterMain["main_pages=$(printf '%s\n' $main_pages_list | awk -F/ '{print $NF}' | grep -E '^[0-9]' | sort -t- -k1 -n)"]
FilterMain --> LoopMain["echo \"$main_pages\" | while read -r file; do"]
LoopMain --> CheckFile{"[ -f \"$file\" ]?"}
CheckFile -->|Yes| GetFilename["filename=$(basename \"$file\")\ntitle=$(head -1 \"$file\" | sed 's/^# //')"]
CheckFile -->|No| LoopMain
GetFilename --> ExtractNum["section_num=$(echo \"$filename\" | grep -oE '^[0-9]+' || true)\nsection_dir=\"$WIKI_DIR/section-$section_num\""]
ExtractNum --> CheckSubdir{"[ -n \"$section_num\" ] &&\n[ -d \"$section_dir\" ]?"}
CheckSubdir -->|Yes| WriteMainWithSubs["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
WriteMainWithSubs --> ListSubs["ls $section_dir/*.md 2>/dev/null | awk -F/ '{print $NF}' | sort -t- -k1 -n"]
ListSubs --> LoopSubs["while read subname; do"]
LoopSubs --> CheckSubFile{"[ -f \"$subfile\" ]?"}
CheckSubFile -->|Yes| WriteSub["subfile=\"$section_dir/$subname\"\nsubfilename=$(basename \"$subfile\")\nsubtitle=$(head -1 \"$subfile\" | sed 's/^# //')\necho \"  - [$subtitle](section-$section_num/$subfilename)\" >> src/SUMMARY.md"]
CheckSubFile -->|No| LoopSubs
WriteSub --> LoopSubs
CheckSubdir -->|No| WriteStandalone["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
LoopSubs --> LoopMain
WriteStandalone --> LoopMain
LoopMain --> CountEntries["echo \"Generated SUMMARY.md with $(grep -c '\\[' src/SUMMARY.md) entries\""]

SUMMARY.md Generation Algorithm

The table of contents is generated by scanning the actual file structure in /workspace/wiki and extracting titles from the first line of each file:

Sources: scripts/build-docs.sh:124-188

mdBook and mdbook-mermaid Execution

The build process invokes two Rust binaries installed via the Docker multi-stage build:

| Command | Location | Purpose | Output |
|---|---|---|---|
| `mdbook-mermaid install $BOOK_DIR` | scripts/build-docs.sh:265 | Install Mermaid.js assets into book directory and update book.toml | `mermaid.min.js`, `mermaid-init.js` in `$BOOK_DIR/` |
| `mdbook build` | scripts/build-docs.sh:270 | Parse SUMMARY.md, process all Markdown files, generate static HTML site | HTML files in `$BOOK_DIR/book/` (subdirectory, not root) |

mdbook Build Process:

The mdbook build command performs the following operations:

  1. Parse Structure : Read src/SUMMARY.md to determine page hierarchy and navigation order
  2. Process Files : For each .md file referenced in SUMMARY.md:
    • Parse Markdown with CommonMark parser
    • Process Mermaid fenced code blocks via mdbook-mermaid preprocessor
    • Apply rust theme styles (configurable via default-theme in book.toml)
    • Generate sidebar navigation
  3. Generate HTML : Create HTML files with:
    • Responsive navigation sidebar
    • Client-side search functionality (elasticlunr.js)
    • “Edit this page” links using git-repository-url from book.toml
    • Syntax highlighting for code blocks
  4. Copy Assets : Bundle theme assets, fonts, and JavaScript libraries

Sources: scripts/build-docs.sh:263-271 scripts/build-docs.sh:102-119

Data Transformation Summary

Each phase transforms data in specific ways, with temporary directories used for intermediate work:

| Phase | Input | Processing Components | Output |
|---|---|---|---|
| Phase 1 | HTML from `https://deepwiki.com/$REPO` | `extract_wiki_structure()`, `extract_page_content()`, BeautifulSoup, `html2text.HTML2Text()`, `clean_deepwiki_footer()` | Clean Markdown in `/workspace/wiki/` |
| Phase 2 | Markdown files + JavaScript payload from DeepWiki | `extract_and_enhance_diagrams()`, `normalize_mermaid_diagram()`, fuzzy matching with 300/200/150/100/80-char chunks | Enhanced Markdown in `/workspace/wiki/` (modified in place) |
| Phase 3 | Markdown files + environment variables (`$BOOK_TITLE`, `$BOOK_AUTHORS`, etc.) | Shell script generates book.toml and src/SUMMARY.md, `process-template.py`, `mdbook-mermaid install`, `mdbook build` | HTML site in `/workspace/book/book/` |

Working Directories:

| Directory | Purpose | Contents | Lifecycle |
|---|---|---|---|
| `/workspace/wiki/` | Primary working directory | Markdown files organized by numbering scheme | Created in Phase 1, modified in Phase 2, read in Phase 3 |
| `/workspace/raw_markdown/` | Debug snapshot | Copy of `/workspace/wiki/` before Phase 2 enhancement | Created between Phase 1 and Phase 2, copied to `/output/raw_markdown/` |
| `/workspace/book/` | mdBook project directory | book.toml, src/ subdirectory, final book/ subdirectory | Created in Phase 3 |
| `/workspace/book/src/` | mdBook source | Copy of `/workspace/wiki/` with injected headers/footers, SUMMARY.md | Created in Phase 3 |
| `/workspace/book/book/` | Final HTML output | Complete static HTML site | Generated by `mdbook build` |
| `/output/` | Final container output | book/, markdown/, raw_markdown/, book.toml | Populated at end of Phase 3 (or end of Phase 2 if MARKDOWN_ONLY=true) |

Final Output Structure:

/output/
├── book/                          # Static HTML site (from /workspace/book/book/)
│   ├── index.html
│   ├── overview.html               # First page (unnumbered)
│   ├── 1-quick-start.html         # Main pages
│   ├── section-1/
│   │   ├── 1-1-installation.html  # Subsections
│   │   └── ...
│   ├── mermaid.min.js             # Installed by mdbook-mermaid
│   ├── mermaid-init.js            # Installed by mdbook-mermaid
│   └── ...
├── markdown/                       # Enhanced Markdown source (from /workspace/wiki/)
│   ├── overview.md
│   ├── 1-quick-start.md
│   ├── section-1/
│   │   ├── 1-1-installation.md
│   │   └── ...
│   └── ...
├── raw_markdown/                   # Pre-enhancement snapshot (from /workspace/raw_markdown/)
│   ├── overview.md                 # Same structure as markdown/ but without diagrams
│   └── ...
└── book.toml                      # mdBook configuration (from /workspace/book/book.toml)

Sources: scripts/build-docs.sh:273-294 python/deepwiki-scraper.py:1358-1366

flowchart TD
    Phase1["Phase 1: Extraction\n(deepwiki-scraper.py)"]
Phase2["Phase 2: Enhancement\n(deepwiki-scraper.py)"]
Phase1 --> Phase2
    
 
   Phase2 --> Check{"MARKDOWN_ONLY\n== true?"}
Check -->|Yes| FastPath["cp -r /workspace/wiki/* /output/markdown/\nExit (fast path)"]
Check -->|No| Phase3["Phase 3: mdBook Build\n(build-docs.sh)"]
Phase3 --> FullOutput["Copy book/ and markdown/ to /output/\nExit (full build)"]
FastPath --> End[/"Build complete"/]
 
   FullOutput --> End

Conditional Execution: MARKDOWN_ONLY Mode

The MARKDOWN_ONLY environment variable allows bypassing Phase 3 for faster iteration during development:

When MARKDOWN_ONLY=true:

  • Execution time: ~30-60 seconds (scraping + diagram matching only)
  • Output: /output/markdown/ only
  • Use case: Debugging diagram placement, testing content extraction

When MARKDOWN_ONLY=false (default):

  • Execution time: ~60-120 seconds (full pipeline)
  • Output: /output/book/, /output/markdown/, /output/book.toml
  • Use case: Production documentation builds
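
Typical invocations look like the following (the image name `deepwiki-scraper` is illustrative; substitute whatever tag you built):

```bash
# Fast path: markdown only
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper

# Full build (default): HTML site + markdown + book.toml
docker run --rm \
  -e REPO=owner/repo \
  -e BOOK_TITLE="My Docs" \
  -v "$(pwd)/output:/output" \
  deepwiki-scraper
```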

Sources: scripts/build-docs.sh:60-76 README.md:55-76

Docker Multi-Stage Build

Purpose and Scope

This document explains the Docker multi-stage build strategy used to create the deepwiki-scraper container image. It details how the system combines a Rust toolchain for compiling documentation tools with a Python runtime for web scraping, while optimizing the final image size.

For information about how the container orchestrates the build process, see build-docs.sh Orchestrator. For details on the Python scraper implementation, see deepwiki-scraper.py.

Multi-Stage Build Strategy

The Dockerfile implements a two-stage build pattern that separates compilation from runtime. Stage 1 uses a full Rust development environment to compile mdBook binaries from source. Stage 2 creates a minimal Python runtime and extracts only the compiled binaries, discarding the build toolchain.
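
A condensed sketch of that Dockerfile (the stage structure and flags follow the sections below; COPY paths are illustrative, see Dockerfile:1-33 for the real file):

```dockerfile
# Stage 1: full Rust toolchain, used only to compile the mdBook tools
FROM rust:latest AS builder
RUN cargo install mdbook && cargo install mdbook-mermaid

# Stage 2: minimal Python runtime; only the compiled binaries are carried over
FROM python:3.12-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
COPY tools/requirements.txt /tmp/requirements.txt
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
COPY --from=builder /usr/local/cargo/bin/mdbook /usr/local/bin/mdbook
COPY --from=builder /usr/local/cargo/bin/mdbook-mermaid /usr/local/bin/mdbook-mermaid
COPY tools/deepwiki-scraper.py scripts/build-docs.sh /usr/local/bin/
CMD ["/usr/local/bin/build-docs.sh"]
```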

Build Stages Flow

graph TD
    subgraph Stage1["Stage 1: Rust Builder (rust:latest)"]
RustBase["rust:latest base image\n~1.5 GB with toolchain"]
CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
Binaries["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
RustBase --> CargoInstall
 
       CargoInstall --> Binaries
    end
    
    subgraph Stage2["Stage 2: Python Runtime (python:3.12-slim)"]
PyBase["python:3.12-slim base\n~150 MB"]
UVInstall["Copy uv from\nghcr.io/astral-sh/uv:latest"]
PipInstall["uv pip install --system\nrequirements.txt"]
CopyBinaries["COPY --from=builder\n/usr/local/cargo/bin/"]
CopyScripts["COPY tools/ and\nbuild-docs.sh"]
PyBase --> UVInstall
 
       UVInstall --> PipInstall
 
       PipInstall --> CopyBinaries
 
       CopyBinaries --> CopyScripts
    end
    
 
   Binaries -.->|Extract only binaries| CopyBinaries
    
 
   CopyScripts --> FinalImage["Final Image\n~300-400 MB"]
Stage1 -.->|Discarded after build| Discard["Discard"]
style RustBase fill:#f5f5f5
    style PyBase fill:#f5f5f5
    style FinalImage fill:#e8e8e8
    style Discard fill:#fff,stroke-dasharray: 5 5

Sources: Dockerfile:1-33

Stage 1: Rust Builder

Stage 1 uses rust:latest as the base image, providing the complete Rust toolchain including cargo, the Rust package manager and build tool.

Rust Builder Configuration

| Aspect | Details |
|---|---|
| Base Image | `rust:latest` |
| Size | ~1.5 GB (includes rustc, cargo, stdlib) |
| Build Commands | `cargo install mdbook`, `cargo install mdbook-mermaid` |
| Output Location | `/usr/local/cargo/bin/` |
| Stage Identifier | `builder` |

The cargo install commands fetch mdBook and mdbook-mermaid source from crates.io, compile them from source, and install the resulting binaries to /usr/local/cargo/bin/.

flowchart LR
    subgraph BuilderStage["builder stage"]
CratesIO["crates.io\n(package registry)"]
CargoFetch["cargo fetch\n(download sources)"]
CargoCompile["cargo build --release\n(compile to binary)"]
CargoInstallBin["Install to\n/usr/local/cargo/bin/"]
CratesIO --> CargoFetch
 
       CargoFetch --> CargoCompile
 
       CargoCompile --> CargoInstallBin
    end
    
 
   CargoInstallBin --> MdBookBin["mdbook binary"]
CargoInstallBin --> MermaidBin["mdbook-mermaid binary"]
MdBookBin -.->|Copied to Stage 2| NextStage["NextStage"]
MermaidBin -.->|Copied to Stage 2| NextStage

Sources: Dockerfile:1-5

Stage 2: Python Runtime Assembly

Stage 2 builds the final runtime image starting from python:3.12-slim, a minimal Python base image that omits development headers and unnecessary packages.

Python Runtime Components

graph TB
    subgraph PythonBase["python:3.12-slim"]
PyInterpreter["Python 3.12 interpreter"]
PyStdlib["Python standard library"]
BaseUtils["Essential utilities\n(bash, sh, coreutils)"]
end
    
    subgraph InstalledTools["Installed via COPY"]
UV["uv package manager\n/bin/uv, /bin/uvx"]
PyDeps["Python packages\n(requests, beautifulsoup4, html2text)"]
RustBins["Rust binaries\n(mdbook, mdbook-mermaid)"]
Scripts["Application scripts\n(deepwiki-scraper.py, build-docs.sh)"]
end
    
 
   PythonBase --> UV
 
   UV --> PyDeps
 
   PythonBase --> RustBins
 
   PythonBase --> Scripts
    
 
   PyDeps --> Runtime["Runtime Environment"]
RustBins --> Runtime
 
   Scripts --> Runtime

The installation sequence follows a specific order:

  1. Copy uv (Dockerfile:13) - Multi-stage copy from ghcr.io/astral-sh/uv:latest
  2. Install Python dependencies (Dockerfile:16-17) - Uses uv pip install --system --no-cache
  3. Copy Rust binaries (Dockerfile:20-21) - Extracts from builder stage
  4. Copy application scripts (Dockerfile:24-29) - Adds Python scraper and orchestrator

Sources: Dockerfile:8-29

Binary Extraction and Integration

The critical optimization occurs at Dockerfile:20-21 where the COPY --from=builder directive extracts only the compiled binaries without any build dependencies.

Binary Extraction Pattern

| Source (Stage 1) | Destination (Stage 2) | Purpose |
|---|---|---|
| `/usr/local/cargo/bin/mdbook` | `/usr/local/bin/mdbook` | Documentation builder executable |
| `/usr/local/cargo/bin/mdbook-mermaid` | `/usr/local/bin/mdbook-mermaid` | Mermaid preprocessor executable |

flowchart LR
    subgraph BuilderFS["Builder Filesystem"]
CargoDir["/usr/local/cargo/bin/"]
MdBookSrc["mdbook\n(compiled binary)"]
MermaidSrc["mdbook-mermaid\n(compiled binary)"]
CargoDir --> MdBookSrc
 
       CargoDir --> MermaidSrc
    end
    
    subgraph RuntimeFS["Runtime Filesystem"]
BinDir["/usr/local/bin/"]
MdBookDst["mdbook\n(extracted)"]
MermaidDst["mdbook-mermaid\n(extracted)"]
BinDir --> MdBookDst
 
       BinDir --> MermaidDst
    end
    
 
   MdBookSrc -.->|COPY --from=builder| MdBookDst
 
   MermaidSrc -.->|COPY --from=builder| MermaidDst
    
    subgraph Discarded["Discarded (not copied)"]
RustToolchain["rustc compiler"]
CargoTool["cargo build tool"]
SourceFiles["mdBook source files"]
BuildCache["cargo build cache"]
end

Both binaries are self-contained Rust executables (statically linked against their Rust dependencies), allowing them to execute in the Python base image without the Rust toolchain.

Sources: Dockerfile:19-21

Python Dependency Installation

Python dependencies are installed using uv, a fast Python package installer written in Rust. The dependencies are defined in tools/requirements.txt:1-4

Python Dependencies

| Package | Version | Purpose |
|---|---|---|
| requests | ≥2.31.0 | HTTP client for scraping DeepWiki |
| beautifulsoup4 | ≥4.12.0 | HTML parsing and navigation |
| html2text | ≥2020.1.16 | HTML to Markdown conversion |

The installation command (Dockerfile:17) uses these flags:

  • --system: Install to system Python (not virtualenv)
  • --no-cache: Don’t cache downloaded packages (reduces image size)
  • -r /tmp/requirements.txt: Read dependencies from file

Sources: Dockerfile:16-17 tools/requirements.txt:1-4

graph LR
    subgraph Approach1["Single-Stage Approach (Hypothetical)"]
Single["rust:latest + Python\n~2+ GB"]
end
    
    subgraph Approach2["Multi-Stage Approach (Actual)"]
Builder["Stage 1: rust:latest\n~1.5 GB\n(discarded)"]
Runtime["Stage 2: python:3.12-slim\n+ binaries + dependencies\n~300-400 MB"]
Builder -.->|Extract binaries only| Runtime
    end
    
 
   Single -->|Contains unnecessary build toolchain| Waste["Wasted Space"]
Runtime -->|Contains only runtime essentials| Efficient["Efficient"]
style Single fill:#f5f5f5
    style Builder fill:#f5f5f5
    style Runtime fill:#e8e8e8
    style Waste fill:#fff,stroke-dasharray: 5 5
    style Efficient fill:#fff,stroke-dasharray: 5 5

Image Size Optimization

The multi-stage strategy achieves significant size reduction by discarding the build environment.

Size Comparison

Size Breakdown of Final Image

| Component | Approximate Size |
|---|---|
| Python 3.12 slim base | ~150 MB |
| Python packages (requests, BeautifulSoup4, html2text) | ~20 MB |
| mdBook binary | ~8 MB |
| mdbook-mermaid binary | ~6 MB |
| uv package manager | ~10 MB |
| Application scripts | <1 MB |
| **Total** | **~300-400 MB** |

Sources: Dockerfile:1-33 README.md:156

graph TB
    subgraph Filesystem["/usr/local/bin/"]
BuildScript["build-docs.sh\n(orchestrator)"]
ScraperScript["deepwiki-scraper.py\n(Python scraper)"]
MdBookBin["mdbook\n(Rust binary)"]
MermaidBin["mdbook-mermaid\n(Rust binary)"]
UVBin["uv\n(Python installer)"]
end
    
    subgraph SystemPython["/usr/local/lib/python3.12/"]
Requests["requests package"]
BS4["beautifulsoup4 package"]
Html2Text["html2text package"]
end
    
    subgraph Execution["Execution Flow"]
Docker["docker run\n(CMD)"]
Docker --> BuildScript
 
       BuildScript -->|python| ScraperScript
 
       BuildScript -->|subprocess| MdBookBin
 
       MdBookBin -->|preprocessor| MermaidBin
 
       ScraperScript --> Requests
 
       ScraperScript --> BS4
 
       ScraperScript --> Html2Text
    end

Runtime Environment Structure

The final image contains a hybrid Python-Rust runtime where Python scripts can execute Rust binaries as subprocesses.

Runtime Component Locations

The entrypoint (Dockerfile:32) executes /usr/local/bin/build-docs.sh, which orchestrates calls to both Python and Rust components. The script can execute:

  • python /usr/local/bin/deepwiki-scraper.py for web scraping
  • mdbook init for initialization
  • mdbook build for HTML generation
  • mdbook-mermaid install for asset installation

Sources: Dockerfile:28-32 build-docs.sh

Container Execution Model

When the container runs, Docker executes the CMD (Dockerfile:32), which invokes build-docs.sh. This shell script has access to all binaries in /usr/local/bin/ (automatically on $PATH).

Process Tree During Execution

graph TD
    Docker["docker run\n(container init)"]
Docker --> CMD["CMD: build-docs.sh"]
CMD --> Phase1["Phase 1:\npython deepwiki-scraper.py"]
CMD --> Phase2["Phase 2: mdbook init"]
CMD --> Phase3["Phase 3: mdbook-mermaid install"]
CMD --> Phase4["Phase 4: mdbook build"]
Phase1 --> PyProc["Python 3.12 process"]
PyProc --> ReqLib["requests.get()"]
PyProc --> BS4Lib["BeautifulSoup()"]
PyProc --> H2TLib["html2text.HTML2Text()"]
Phase2 --> MdBookProc1["mdbook binary process"]
Phase3 --> MermaidProc["mdbook-mermaid binary process"]
Phase4 --> MdBookProc2["mdbook binary process"]
MdBookProc2 --> MermaidPreproc["mdbook-mermaid\n(as preprocessor)"]

Sources: Dockerfile:32 README.md:122-145

Component Reference

Purpose and Scope

This page provides a high-level overview of the major components in the DeepWiki-to-mdBook converter system and their responsibilities. Each component is introduced with its primary function, key files, and relationships to other components.

For detailed information about specific components:

System Component Map

The following diagram shows all major components and their organizational relationships:

Sources: scripts/build-docs.sh:1-310 README.md:84-88

graph TB
    subgraph "Entry Points"
        Dockerfile["Dockerfile\n(multi-stage build)"]
ActionYAML["action.yml\n(GitHub Action)"]
end
    
    subgraph "Orchestration Layer"
        BuildScript["build-docs.sh\n(main orchestrator)"]
end
    
    subgraph "Python Components"
        Scraper["deepwiki-scraper.py\n(content extraction)"]
TemplateProc["process-template.py\n(template rendering)"]
end
    
    subgraph "Build Tools"
        mdBook["mdbook\n(HTML generator)"]
mdBookMermaid["mdbook-mermaid\n(diagram renderer)"]
end
    
    subgraph "Configuration Assets"
        HeaderTemplate["templates/header.html"]
FooterTemplate["templates/footer.html"]
BookToml["book.toml\n(generated)"]
SummaryMd["SUMMARY.md\n(generated)"]
end
    
    subgraph "Data Directories"
        WikiDir["/workspace/wiki\n(enhanced markdown)"]
RawDir["/workspace/raw_markdown\n(pre-enhancement)"]
BookSrc["/workspace/book/src\n(mdBook input)"]
OutputDir["/output\n(final artifacts)"]
end
    
 
   Dockerfile -->|builds| BuildScript
 
   Dockerfile -->|installs| Scraper
 
   Dockerfile -->|installs| TemplateProc
 
   Dockerfile -->|compiles| mdBook
 
   Dockerfile -->|compiles| mdBookMermaid
 
   ActionYAML -->|invokes| Dockerfile
    
 
   BuildScript -->|executes| Scraper
 
   BuildScript -->|executes| TemplateProc
 
   BuildScript -->|executes| mdBook
 
   BuildScript -->|executes| mdBookMermaid
 
   BuildScript -->|generates| BookToml
 
   BuildScript -->|generates| SummaryMd
    
 
   Scraper -->|writes to| WikiDir
 
   Scraper -->|writes to| RawDir
 
   TemplateProc -->|reads| HeaderTemplate
 
   TemplateProc -->|reads| FooterTemplate
 
   TemplateProc -->|outputs HTML| BuildScript
    
 
   BuildScript -->|copies| WikiDir
 
   WikiDir -->|to| BookSrc
 
   BuildScript -->|injects templates into| BookSrc
    
 
   mdBook -->|reads| BookToml
 
   mdBook -->|reads| SummaryMd
 
   mdBook -->|reads| BookSrc
 
   mdBook -->|builds to| OutputDir
 
   mdBookMermaid -->|preprocesses| BookSrc

Core Components

build-docs.sh

Type: Shell script orchestrator
Location: scripts/build-docs.sh:1-310
Entry Point: Docker container CMD instruction

The main orchestration script that coordinates the entire build process. It performs seven sequential steps:

| Step | Line Range | Description |
|---|---|---|
| Configuration | scripts/build-docs.sh:8-60 | Auto-detect repository, set environment defaults |
| Scraping | scripts/build-docs.sh:61-65 | Invoke deepwiki-scraper.py to fetch content |
| Optional Exit | scripts/build-docs.sh:67-93 | If MARKDOWN_ONLY=true, copy outputs and exit |
| mdBook Init | scripts/build-docs.sh:95-122 | Create book.toml and directory structure |
| SUMMARY Generation | scripts/build-docs.sh:124-188 | Discover files and build table of contents |
| Template Processing | scripts/build-docs.sh:190-261 | Process header/footer and inject into markdown |
| Build & Copy | scripts/build-docs.sh:263-309 | Run mdBook build and copy artifacts to /output |

Key Responsibilities:

  • Environment variable validation and default assignment
  • Git repository auto-detection from remote URLs
  • Orchestrating execution order of Python scripts
  • Dynamic SUMMARY.md generation with numeric sorting
  • Template injection into all markdown files
  • Output directory management

Sources: scripts/build-docs.sh:1-310

deepwiki-scraper.py

Type: Python script
Location: python/deepwiki-scraper.py
Invocation: scripts/build-docs.sh:65

The content extraction component that scrapes DeepWiki wiki pages and converts them to markdown with embedded diagrams.

Key Responsibilities:

  • Fetch wiki HTML from https://deepwiki.com/{REPO}
  • Parse Next.js data payload to discover wiki structure
  • Convert HTML to markdown using html2text library
  • Extract Mermaid diagrams from JavaScript payload
  • Normalize diagrams for Mermaid 11 compatibility (7-step pipeline)
  • Match diagrams to pages using fuzzy text matching
  • Write enhanced markdown to /workspace/wiki
  • Write pre-enhancement snapshot to /workspace/raw_markdown

The scraper is covered in detail in deepwiki-scraper.py.

Sources: scripts/build-docs.sh:65 README.md:74

process-template.py

Type: Python script
Location: python/process-template.py
Invocation: scripts/build-docs.sh:205-213 scripts/build-docs.sh:222-230

A template rendering utility that processes HTML template files with variable substitution.

Key Responsibilities:

  • Read template file from path argument
  • Parse variable assignments from command-line arguments (format: KEY=value)
  • Substitute {{VARIABLE}} placeholders with values
  • Handle conditional rendering with {{#if VARIABLE}}...{{/if}} blocks
  • Output processed HTML to stdout

Template Variables Supported:

  • REPO - Repository identifier (e.g., “owner/repo”)
  • BOOK_TITLE - Documentation title
  • BOOK_AUTHORS - Author names
  • GIT_REPO_URL - Full GitHub repository URL
  • DEEPWIKI_URL - DeepWiki page URL
  • DEEPWIKI_BADGE_URL - Badge image URL
  • GITHUB_BADGE_URL - GitHub badge URL
  • GENERATION_DATE - Build timestamp
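
A minimal Python sketch of the substitution and conditional logic described above (illustrative; the real script lives at python/process-template.py):

```python
import re
import sys

def render(template: str, variables: dict[str, str]) -> str:
    # {{#if VAR}}...{{/if}}: keep the body only when VAR has a non-empty value.
    template = re.sub(
        r"\{\{#if (\w+)\}\}(.*?)\{\{/if\}\}",
        lambda m: m.group(2) if variables.get(m.group(1)) else "",
        template,
        flags=re.DOTALL,
    )
    # {{VAR}}: plain placeholder substitution.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: variables.get(m.group(1), ""), template)

if __name__ == "__main__":
    path, *assignments = sys.argv[1:]
    variables = dict(a.split("=", 1) for a in assignments)  # KEY=value arguments
    with open(path) as f:
        sys.stdout.write(render(f.read(), variables))
```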

See Template System for comprehensive documentation.

Sources: scripts/build-docs.sh:195-234 README.md:51

Template Files

Type: HTML configuration files
Location: templates/header.html, templates/footer.html
Default Path: /workspace/templates/
Custom Mount: -v "$(pwd)/my-templates:/workspace/templates"

Static HTML template files that are processed by process-template.py and injected into every markdown file.

File Responsibilities:

| File | Purpose | Injection Point |
|---|---|---|
| header.html | Top-of-page content (badges, navigation) | Before markdown content |
| footer.html | Bottom-of-page content (metadata, links) | After markdown content |

Injection Logic:

[Header HTML]
<blank line>
[Original Markdown Content]
<blank line>
[Footer HTML]

Templates are injected at scripts/build-docs.sh:240-261 after all markdown files are copied to the book source directory.

Sources: scripts/build-docs.sh:195-234 scripts/build-docs.sh:240-261 README.md:39-51

mdBook and mdbook-mermaid

Type: External build tools (Rust binaries)
Location: /usr/local/bin/mdbook, /usr/local/bin/mdbook-mermaid
Compilation: Dockerfile multi-stage build

Pre-compiled tools that generate the final HTML output.

mdBook Responsibilities:

  • Read configuration from book.toml scripts/build-docs.sh:102-119
  • Parse SUMMARY.md to build navigation structure
  • Convert markdown files to HTML with search index
  • Apply theme (rust theme by default)
  • Generate table of contents sidebar
  • Create chapter navigation links

mdbook-mermaid Responsibilities:

  • Install mermaid.min.js and mermaid-init.js assets into the book directory
  • Render Mermaid fenced code blocks client-side as part of the preprocessor chain

See mdBook Integration for detailed integration documentation.

Sources: scripts/build-docs.sh:113-114 scripts/build-docs.sh:266 scripts/build-docs.sh:271

sequenceDiagram
    participant Docker
    participant BuildScript as "build-docs.sh"
    participant Scraper as "deepwiki-scraper.py"
    participant TemplateProc as "process-template.py"
    participant mdBook as "mdbook"
    participant FileSystem as "/output"
    
    Docker->>BuildScript: Execute CMD
    
    BuildScript->>BuildScript: Validate REPO env var
    BuildScript->>BuildScript: Auto-detect from git remote
    BuildScript->>BuildScript: Set defaults (BOOK_AUTHORS, etc)
    
    BuildScript->>Scraper: Execute with REPO arg
    Scraper->>Scraper: Fetch DeepWiki HTML
    Scraper->>Scraper: Extract wiki structure
    Scraper->>Scraper: Convert HTML to markdown
    Scraper->>Scraper: Process Mermaid diagrams
    Scraper->>FileSystem: Write /workspace/wiki/*.md
    Scraper->>FileSystem: Write /workspace/raw_markdown/*.md
    Scraper-->>BuildScript: Exit 0
    
    alt MARKDOWN_ONLY=true
        BuildScript->>FileSystem: Copy markdown to /output
        BuildScript->>Docker: Exit 0
    end
    
    BuildScript->>BuildScript: Create /workspace/book/
    BuildScript->>BuildScript: Generate book.toml
    BuildScript->>BuildScript: Scan wiki files
    BuildScript->>BuildScript: Generate SUMMARY.md
    
    BuildScript->>TemplateProc: Process header.html
    TemplateProc-->>BuildScript: Return HTML string
    BuildScript->>TemplateProc: Process footer.html
    TemplateProc-->>BuildScript: Return HTML string
    
    BuildScript->>BuildScript: Copy wiki/*.md to book/src/
    BuildScript->>BuildScript: Inject header into each .md
    BuildScript->>BuildScript: Inject footer into each .md
    
    BuildScript->>mdBook: mdbook-mermaid install
    BuildScript->>mdBook: mdbook build
    mdBook->>mdBook: Parse SUMMARY.md
    mdBook->>mdBook: Convert markdown to HTML
    mdBook->>mdBook: Build search index
    mdBook->>FileSystem: Write /workspace/book/book/
    mdBook-->>BuildScript: Exit 0
    
    BuildScript->>FileSystem: Copy book/ to /output/book/
    BuildScript->>FileSystem: Copy wiki/ to /output/markdown/
    BuildScript->>FileSystem: Copy raw_markdown/ to /output/raw_markdown/
    BuildScript->>FileSystem: Copy book.toml to /output/
    
    BuildScript-->>Docker: Exit 0

Component Execution Flow

This diagram shows the runtime execution sequence and data flow between components:

Sources: scripts/build-docs.sh:1-310

File System Organization

The following table maps logical component names to their physical locations in the Docker container and output directory:

| Component | Container Path | Output Path | Description |
|---|---|---|---|
| Main orchestrator | /usr/local/bin/build-docs.sh | - | Shell script entry point |
| Scraper | /usr/local/bin/deepwiki-scraper.py | - | Python extraction script |
| Template processor | /usr/local/bin/process-template.py | - | Python template engine |
| mdBook binary | /usr/local/bin/mdbook | - | Rust-compiled tool |
| mdbook-mermaid binary | /usr/local/bin/mdbook-mermaid | - | Rust-compiled preprocessor |
| Default templates | /workspace/templates/*.html | - | Header/footer HTML files |
| Working wiki dir | /workspace/wiki/ | /output/markdown/ | Enhanced markdown files |
| Raw markdown dir | /workspace/raw_markdown/ | /output/raw_markdown/ | Pre-enhancement snapshot |
| Book workspace | /workspace/book/ | - | Temporary build directory |
| Book source files | /workspace/book/src/ | - | mdBook input directory |
| Generated config | /workspace/book/book.toml | /output/book.toml | mdBook configuration |
| Generated TOC | /workspace/book/src/SUMMARY.md | - | Navigation structure |
| Built HTML | /workspace/book/book/ | /output/book/ | Final documentation site |

Sources: scripts/build-docs.sh:27-31 scripts/build-docs.sh:274-294

graph TD
    subgraph "Docker Build Stage 1"
        RustBase["rust:latest base image"]
CargoInstall["cargo install"]
RustBase --> CargoInstall
 
       CargoInstall --> mdBookBin["mdbook binary"]
CargoInstall --> mdBookMermaidBin["mdbook-mermaid binary"]
end
    
    subgraph "Docker Build Stage 2"
        PythonBase["python:3.12-slim base image"]
PipInstall["pip install"]
PythonBase --> PipInstall
 
       PipInstall --> RequestsLib["requests library"]
PipInstall --> Html2TextLib["html2text library"]
PipInstall --> RapidFuzzLib["rapidfuzz library"]
end
    
    subgraph "Runtime Dependencies"
        BuildScript["build-docs.sh"]
ScraperPy["deepwiki-scraper.py"]
TemplatePy["process-template.py"]
BuildScript --> ScraperPy
 
       BuildScript --> TemplatePy
 
       BuildScript --> mdBookBin
 
       BuildScript --> mdBookMermaidBin
        
 
       ScraperPy --> RequestsLib
 
       ScraperPy --> Html2TextLib
 
       ScraperPy --> RapidFuzzLib
    end
    
    subgraph "Environment Inputs"
        EnvREPO["REPO env var"]
EnvBOOK_TITLE["BOOK_TITLE env var"]
EnvMARKDOWN_ONLY["MARKDOWN_ONLY env var"]
GitRemote["git remote origin"]
GitRemote -.fallback.-> EnvREPO
 
       EnvREPO --> BuildScript
 
       EnvBOOK_TITLE --> BuildScript
 
       EnvMARKDOWN_ONLY --> BuildScript
    end
    
    mdBookBin -.copied from.-> RustBase
    mdBookMermaidBin -.copied from.-> RustBase

Component Dependencies

This diagram maps the dependency relationships between components, showing which components require which other components:

Sources: scripts/build-docs.sh:8-19 README.md:14-27

Component Communication Patterns

Inter-Process Communication

All component communication uses standard Unix patterns:

| Pattern | Components | Mechanism |
|---|---|---|
| Parent-child execution | build-docs.sh → Python scripts | `python3 /usr/local/bin/script.py args` |
| Parent-child execution | build-docs.sh → mdBook tools | `mdbook build`, `mdbook-mermaid install` |
| Output capture | build-docs.sh ← process-template.py | Command substitution: `VAR=$(python3 ...)` |
| Exit status | All → build-docs.sh | Standard exit codes (0 = success) |
| Error propagation | All | `set -e` in bash (exit on any error) |

Sources: scripts/build-docs.sh:2 scripts/build-docs.sh:65 scripts/build-docs.sh:205-213

File System Communication

Components communicate via shared file system locations:

Sources: scripts/build-docs.sh:27-31 scripts/build-docs.sh:237 scripts/build-docs.sh:274-294

Configuration Communication

Environment variables flow unidirectionally from the container entry point to all components:

| Variable | Set By | Read By | Usage |
|---|---|---|---|
| REPO | Docker `-e` flag | build-docs.sh | DeepWiki URL construction |
| BOOK_TITLE | Docker `-e` flag | build-docs.sh | book.toml generation |
| BOOK_AUTHORS | Docker `-e` flag | build-docs.sh | book.toml generation |
| MARKDOWN_ONLY | Docker `-e` flag | build-docs.sh | Build mode selection |
| GENERATION_DATE | build-docs.sh | process-template.py | Template variable |
| GIT_REPO_URL | build-docs.sh (derived) | process-template.py | Template variable |
| DEEPWIKI_URL | build-docs.sh (derived) | process-template.py | Template variable |

Sources: scripts/build-docs.sh:8-60 scripts/build-docs.sh:200-230

Component Responsibilities Matrix

The following table summarizes what each component is and is not responsible for:

| Component | Responsible For | Not Responsible For |
|---|---|---|
| build-docs.sh | Orchestration, environment validation, SUMMARY generation, template injection, output copying | Content extraction, HTML rendering, diagram normalization |
| deepwiki-scraper.py | HTTP requests, HTML parsing, markdown conversion, diagram extraction/normalization/matching | File system orchestration, mdBook integration, template processing |
| process-template.py | Variable substitution, conditional rendering in templates | File discovery, output management, HTML generation |
| mdbook | Markdown to HTML conversion, search index, navigation, theming | Content extraction, diagram processing, template injection |
| mdbook-mermaid | Mermaid library installation, diagram rendering configuration | Diagram extraction, diagram normalization, markdown conversion |
| Templates (*.html) | Define header/footer structure and variables | Variable substitution, file injection, content generation |

Sources: scripts/build-docs.sh:1-310 README.md:72-77

build-docs.sh Orchestrator

Purpose and Scope

The build-docs.sh script is the main orchestration layer for the DeepWiki-to-mdBook conversion system. It coordinates all components of the three-phase pipeline, manages configuration, handles environment variable processing, and produces the final output artifacts. This document covers the script’s responsibilities, execution flow, configuration management, and integration with other system components.

For details on the components orchestrated by this script, see deepwiki-scraper.py, Template System, and mdBook Integration. For information on the three-phase architecture, see Three-Phase Pipeline.

Role and Responsibilities

The orchestrator serves as the single entry point for the documentation build process. It is invoked as the Docker container’s default command and coordinates all system components in a sequential, deterministic manner.

Key Responsibilities:

| Responsibility | Implementation |
|---|---|
| Configuration Management | Validates and sets defaults for all environment variables |
| Auto-detection | Discovers repository information from Git remotes |
| Component Coordination | Invokes deepwiki-scraper.py, process-template.py, mdbook, and mdbook-mermaid |
| Error Handling | Uses `set -e` for fail-fast behavior on any component failure |
| Output Management | Organizes all artifacts into /output directory structure |
| Mode Selection | Supports standard and markdown-only execution modes |
| Template Processing | Coordinates header/footer injection into all markdown files |

Sources: scripts/build-docs.sh:1-310

Architecture Overview

The following diagram maps the orchestrator’s workflow to actual code entities and directory paths used in the script:

Diagram: Orchestrator Component Integration

graph TB
    Entry["Entry Point\nbuild-docs.sh"]
subgraph "Configuration Phase"
        AutoDetect["Git Auto-detection\nlines 8-19"]
EnvVars["Environment Variables\nREPO, BOOK_TITLE, etc.\nlines 21-26"]
Defaults["Default Generation\nBOOK_AUTHORS, GIT_REPO_URL\nlines 44-51"]
Validate["Validation\nREPO check\nlines 33-38"]
end
    
    subgraph "Execution Phase"
        Step1["Step 1: deepwiki-scraper.py\n$REPO → $WIKI_DIR\nlines 61-65"]
Decision{{"MARKDOWN_ONLY\n?\nlines 68"}}
MarkdownExit["Copy to /output/markdown\nlines 69-93"]
Step2["Step 2: mkdir $BOOK_DIR\nCreate book.toml\nlines 95-119"]
Step3["Step 3: Generate SUMMARY.md\nDiscover structure\nlines 124-188"]
Step4["Step 4: process-template.py\nInject headers/footers\nlines 190-261"]
Step5["Step 5: mdbook-mermaid install\nlines 263-266"]
Step6["Step 6: mdbook build\nlines 268-271"]
Step7["Step 7: Copy to /output\nlines 273-295"]
end
    
 
   Entry --> AutoDetect
 
   AutoDetect --> EnvVars
 
   EnvVars --> Defaults
 
   Defaults --> Validate
 
   Validate --> Step1
 
   Step1 --> Decision
 
   Decision -->|true| MarkdownExit
 
   Decision -->|false| Step2
 
   Step2 --> Step3
 
   Step3 --> Step4
 
   Step4 --> Step5
 
   Step5 --> Step6
 
   Step6 --> Step7
    
 
   MarkdownExit --> OutputMarkdown["/output/markdown/"]
MarkdownExit --> OutputRaw["/output/raw_markdown/"]
Step7 --> OutputBook["/output/book/"]
Step7 --> OutputMarkdown
 
   Step7 --> OutputRaw
 
   Step7 --> OutputConfig["/output/book.toml"]

Sources: scripts/build-docs.sh:1-310

Configuration Management

The script implements a sophisticated configuration system with automatic detection, environment variable overrides, and sensible defaults.

Auto-Detection Logic

The script attempts to automatically detect the repository from Git metadata if REPO is not explicitly set:

Diagram: Repository Auto-Detection Flow

flowchart TD
    Start["Check if REPO\nenvironment variable set"]
Start -->|Not set| CheckGit["Check if .git directory exists\ngit rev-parse --git-dir"]
Start -->|Set| UseProvided["Use provided REPO value"]
CheckGit -->|Yes| GetRemote["Get remote.origin.url\ngit config --get remote.origin.url"]
CheckGit -->|No| RequireManual["Require manual REPO setting"]
GetRemote -->|Found| ExtractOwnerRepo["Extract owner/repo using sed\nPattern: github.com[:/]owner/repo"]
GetRemote -->|Not found| RequireManual
    
 
   ExtractOwnerRepo --> SetRepo["Set REPO variable"]
UseProvided --> SetRepo
 
   SetRepo --> Validate["Validate REPO is not empty"]
RequireManual --> Validate
    
 
   Validate -->|Empty| Error["Exit with error\nlines 34-37"]
Validate -->|Valid| Continue["Continue execution"]

The regex pattern at scripts/build-docs.sh:16 handles multiple GitHub URL formats:

  • https://github.com/owner/repo.git
  • git@github.com:owner/repo.git
  • https://github.com/owner/repo
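
A sketch of equivalent extraction logic (not the script's exact sed invocation):

```bash
if [ -z "$REPO" ] && git rev-parse --git-dir >/dev/null 2>&1; then
  remote_url=$(git config --get remote.origin.url || true)
  # Strip everything up to github.com[:/], then a trailing .git if present
  REPO=$(echo "$remote_url" | sed -E 's#.*github\.com[:/]##; s#\.git$##')
fi
```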

Sources: scripts/build-docs.sh:8-19 scripts/build-docs.sh:33-38

Configuration Variables

The following table documents all configuration variables managed by the orchestrator:

| Variable | Default | Derivation | Line Reference |
|---|---|---|---|
| REPO | Auto-detected | Extracted from git remote.origin.url | scripts/build-docs.sh:9-19 |
| BOOK_TITLE | `"Documentation"` | None | scripts/build-docs.sh:23 |
| BOOK_AUTHORS | Repository owner | Extracted from `$REPO` (first segment) | scripts/build-docs.sh:45 |
| GIT_REPO_URL | GitHub URL | Constructed from `$REPO` | scripts/build-docs.sh:46 |
| MARKDOWN_ONLY | `"false"` | None | scripts/build-docs.sh:26 |
| WORK_DIR | `"/workspace"` | Fixed | scripts/build-docs.sh:27 |
| WIKI_DIR | `"/workspace/wiki"` | Fixed | scripts/build-docs.sh:28 |
| RAW_DIR | `"/workspace/raw_markdown"` | Fixed | scripts/build-docs.sh:29 |
| OUTPUT_DIR | `"/output"` | Fixed | scripts/build-docs.sh:30 |
| BOOK_DIR | `"/workspace/book"` | Fixed | scripts/build-docs.sh:31 |

Computed variables derived from $REPO:

| Variable | Computation | Line Reference |
|---|---|---|
| REPO_OWNER | `echo "$REPO" \| cut -d'/' -f1` | |
| REPO_NAME | `echo "$REPO" \| cut -d'/' -f2` | |
| DEEPWIKI_URL | `"https://deepwiki.com/$REPO"` | scripts/build-docs.sh:48 |
| DEEPWIKI_BADGE_URL | `"https://deepwiki.com/badge.svg"` | scripts/build-docs.sh:49 |
| REPO_BADGE_LABEL | URL-encoded with dash escaping | scripts/build-docs.sh:50 |
| GITHUB_BADGE_URL | Shields.io badge URL | scripts/build-docs.sh:51 |

Sources: scripts/build-docs.sh:21-51

sequenceDiagram
    participant Script as build-docs.sh
    participant Scraper as deepwiki-scraper.py
    participant FileSystem as File System
    participant Templates as process-template.py
    participant MDBook as mdbook
    participant Mermaid as mdbook-mermaid
    
    Note over Script: Configuration Phase
    Script->>Script: Auto-detect REPO
    Script->>Script: Set defaults
    Script->>Script: Validate configuration
    
    Note over Script: Step 1: Scraping
    Script->>FileSystem: rm -rf $RAW_DIR
    Script->>Scraper: python3 deepwiki-scraper.py $REPO $WIKI_DIR
    Scraper-->>FileSystem: Write markdown to $WIKI_DIR
    Scraper-->>FileSystem: Write raw snapshots to $RAW_DIR
    
    alt MARKDOWN_ONLY == true
        Note over Script: Markdown-Only Exit Path
        Script->>FileSystem: cp $WIKI_DIR to /output/markdown
        Script->>FileSystem: cp $RAW_DIR to /output/raw_markdown
        Script->>Script: Exit (skip HTML build)
    else MARKDOWN_ONLY == false
        Note over Script: Step 2: mdBook Initialization
        Script->>FileSystem: mkdir -p $BOOK_DIR/src
        Script->>FileSystem: Create book.toml
        
        Note over Script: Step 3: SUMMARY.md Generation
        Script->>FileSystem: Scan $WIKI_DIR for .md files
        Script->>FileSystem: Generate src/SUMMARY.md
        
        Note over Script: Step 4: Template Processing
        Script->>Templates: process-template.py $HEADER_TEMPLATE
        Templates-->>Script: Processed HEADER_HTML
        Script->>Templates: process-template.py $FOOTER_TEMPLATE
        Templates-->>Script: Processed FOOTER_HTML
        Script->>FileSystem: cp $WIKI_DIR/* to src/
        Script->>FileSystem: Inject header/footer into all .md files
        
        Note over Script: Step 5: Mermaid Installation
        Script->>Mermaid: mdbook-mermaid install $BOOK_DIR
        Mermaid-->>FileSystem: Install mermaid.js assets
        
        Note over Script: Step 6: Build
        Script->>MDBook: mdbook build
        MDBook-->>FileSystem: Generate book/ directory
        
        Note over Script: Step 7: Output Collection
        Script->>FileSystem: cp book to /output/book
        Script->>FileSystem: cp $WIKI_DIR to /output/markdown
        Script->>FileSystem: cp $RAW_DIR to /output/raw_markdown
        Script->>FileSystem: cp book.toml to /output/book.toml
    end

Execution Flow

The orchestrator follows a seven-step execution sequence, with conditional branching for markdown-only mode:

Diagram: Step-by-Step Execution Sequence

Sources: scripts/build-docs.sh:61-310

Step Details

Step 1: Wiki Scraping

Lines: scripts/build-docs.sh:61-65

Invokes the Python scraper (python3 /usr/local/bin/deepwiki-scraper.py "$REPO" "$WIKI_DIR") to fetch and convert DeepWiki content.

The scraper writes output to two locations:

  • $WIKI_DIR (/workspace/wiki): Enhanced markdown with injected diagrams
  • $RAW_DIR (/workspace/raw_markdown): Pre-enhancement markdown snapshots for debugging

For details on the scraper’s operation, see deepwiki-scraper.py.

Sources: scripts/build-docs.sh:61-65

Step 2: mdBook Structure Initialization

Lines: scripts/build-docs.sh:95-119

Skipped if: MARKDOWN_ONLY=true

Creates the mdBook directory structure and generates book.toml configuration:

$BOOK_DIR/
├── book.toml
└── src/

The book.toml file is generated using a heredoc with variable substitution:

| Configuration Section | Variables Used | Purpose |
|---|---|---|
| `[book]` | `$BOOK_TITLE`, `$BOOK_AUTHORS` | Book metadata |
| `[output.html]` | `$GIT_REPO_URL` | Repository link in UI |
| `[preprocessor.mermaid]` | N/A | Enable mermaid diagrams |
| `[output.html.fold]` | N/A | Enable section folding |

Sources: scripts/build-docs.sh:95-119

Step 3: SUMMARY.md Generation

Lines: scripts/build-docs.sh:124-188

flowchart TD
    Start["Start SUMMARY.md Generation"]
Start --> WriteHeader["Write '# Summary' header"]
WriteHeader --> FindOverview["Find overview file\ngrep -Ev '^[0-9]'"]
FindOverview -->|Found| WriteOverview["Write overview entry\nExtract title from first line"]
FindOverview -->|Not found| ListMain
 
   WriteOverview --> ListMain["List all main pages\nls $WIKI_DIR/*.md"]
ListMain --> FilterOverview["Filter out overview file"]
FilterOverview --> NumericSort["Sort numerically\nsort -t- -k1 -n"]
NumericSort --> ProcessLoop["For each file"]
ProcessLoop --> ExtractTitle["Extract title\nhead -1 /sed 's/^# //'"]
ExtractTitle --> GetSectionNum["Extract section number grep -oE '^[0-9]+'"]
GetSectionNum --> CheckSubdir{"Subsection directory section-N exists?"}
CheckSubdir -->|Yes|WriteSection["Write section entry - [title] filename"]
WriteSection --> ListSubs["List subsection files ls section-N/*.md"]
ListSubs --> SortSubs["Sort numerically sort -t- -k1 -n"]
SortSubs --> WriteSubLoop["For each subsection: - [subtitle] section-N/file"]
WriteSubLoop --> NextFile
 CheckSubdir -->|No|WriteStandalone["Write standalone entry - [title] filename"]
WriteStandalone --> NextFile{"More files?"}
NextFile -->|Yes|ProcessLoop
 NextFile -->|No| Complete["Complete src/SUMMARY.md"]

Dynamically generates the table of contents by discovering the file structure in $WIKI_DIR. This step implements numeric sorting and hierarchical organization.

Diagram: SUMMARY.md Generation Algorithm

Key implementation details:

Overview Page Detection: scripts/build-docs.sh:136-144

  • Searches for files without numeric prefix
  • Typically matches Overview.md or similar

Numeric Sorting: scripts/build-docs.sh:147-155

  • Uses sort -t- -k1 -n to sort by numeric prefix
  • Handles formats like 1-Title.md, 2.1-Subtopic.md

Hierarchy Detection: scripts/build-docs.sh:165-180

  • Checks for section-N/ directories for each numeric section
  • Creates indented entries for subsections
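
Putting those pieces together, a simplified sketch of the generation loop (numeric sorting and the overview entry are elided here; see scripts/build-docs.sh:124-188 for the real logic):

```bash
{ echo '# Summary'; echo ''; } > src/SUMMARY.md

for file in "$WIKI_DIR"/[0-9]*.md; do
  filename=$(basename "$file")
  title=$(head -1 "$file" | sed 's/^# //')
  section_num=$(echo "$filename" | grep -oE '^[0-9]+' || true)
  echo "- [$title]($filename)" >> src/SUMMARY.md
  # Indented subsection entries when a matching section-N/ directory exists
  if [ -n "$section_num" ] && [ -d "$WIKI_DIR/section-$section_num" ]; then
    for sub in "$WIKI_DIR/section-$section_num"/*.md; do
      subtitle=$(head -1 "$sub" | sed 's/^# //')
      echo "  - [$subtitle](section-$section_num/$(basename "$sub"))" >> src/SUMMARY.md
    done
  fi
done
```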

Sources: scripts/build-docs.sh:124-188

Step 4: Template Processing and File Copying

Lines: scripts/build-docs.sh:190-261

flowchart LR
    subgraph "Template Loading"
        HeaderPath["$HEADER_TEMPLATE\n/workspace/templates/header.html"]
FooterPath["$FOOTER_TEMPLATE\n/workspace/templates/footer.html"]
GenDate["GENERATION_DATE\ndate -u command"]
end
    
    subgraph "Variable Substitution"
        ProcessH["process-template.py\n$HEADER_TEMPLATE"]
ProcessF["process-template.py\n$FOOTER_TEMPLATE"]
Vars["Variables passed:\nDEEPWIKI_URL\nDEEPWIKI_BADGE_URL\nGIT_REPO_URL\nGITHUB_BADGE_URL\nREPO\nBOOK_TITLE\nBOOK_AUTHORS\nGENERATION_DATE"]
end
    
    subgraph "Injection"
        CopyFiles["cp $WIKI_DIR/* to src/"]
InjectLoop["For each .md file:\nsrc/*.md src/*/*.md"]
CreateTemp["Create temp file:\nHEADER + content + FOOTER"]
Replace["mv temp to original"]
end
    
 
   HeaderPath --> ProcessH
 
   FooterPath --> ProcessF
 
   GenDate --> Vars
 
   Vars --> ProcessH
 
   Vars --> ProcessF
 
   ProcessH --> HeaderHTML["HEADER_HTML variable"]
ProcessF --> FooterHTML["FOOTER_HTML variable"]
CopyFiles --> InjectLoop
 
   HeaderHTML --> InjectLoop
 
   FooterHTML --> InjectLoop
 
   InjectLoop --> CreateTemp
 
   CreateTemp --> Replace

Processes header and footer templates and injects them into all markdown files.

Template Processing Flow:

Diagram: Template Processing Pipeline

The template processor is invoked with all configuration variables as arguments: scripts/build-docs.sh:205-213 scripts/build-docs.sh:222-230

File injection pattern: scripts/build-docs.sh:243-257

  • Processes all .md files in src/ and src/*/
  • Creates temporary file with header + original content + footer
  • Replaces original with modified version
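
A sketch of that injection loop (simplified from scripts/build-docs.sh:243-257):

```bash
for f in src/*.md src/*/*.md; do
  [ -f "$f" ] || continue
  {
    printf '%s\n\n' "$HEADER_HTML"   # header, then a blank line
    cat "$f"                         # original markdown content
    printf '\n%s\n' "$FOOTER_HTML"   # blank line, then footer
  } > "$f.tmp"
  mv "$f.tmp" "$f"
done
```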

For details on the template system and variable substitution, see Template System.

Sources: scripts/build-docs.sh:190-261

Step 5: Mermaid Installation

Lines: scripts/build-docs.sh:263-266

Installs mdbook-mermaid preprocessor assets into the book directory via mdbook-mermaid install $BOOK_DIR.

This command installs the mermaid.js library and initialization code required for client-side diagram rendering in the final HTML output.

Sources: scripts/build-docs.sh:263-266

Step 6: Book Build

Lines: scripts/build-docs.sh:268-271

Executes mdbook build from within $BOOK_DIR.

Build Process:

  1. Reads book.toml configuration
  2. Processes src/SUMMARY.md to determine structure
  3. Applies mermaid preprocessor to all markdown files
  4. Converts markdown to HTML with search indexing
  5. Outputs to $BOOK_DIR/book/ directory

For more information on mdBook integration, see mdBook Integration.

Sources: scripts/build-docs.sh:268-271

Step 7: Output Collection

Lines: scripts/build-docs.sh:273-295

Copies all build artifacts to the /output volume mount for persistence:

| Source | Destination | Description |
|---|---|---|
| `$BOOK_DIR/book/` | /output/book/ | Built HTML documentation |
| `$WIKI_DIR/` | /output/markdown/ | Enhanced markdown files |
| `$RAW_DIR/` | /output/raw_markdown/ | Pre-enhancement markdown (if exists) |
| `$BOOK_DIR/book.toml` | /output/book.toml | Book configuration reference |

The script ensures clean output by removing existing directories before copying: scripts/build-docs.sh:282-290

Sources: scripts/build-docs.sh:273-295

Markdown-Only Mode

When MARKDOWN_ONLY=true, the orchestrator follows a shortened execution path that skips HTML generation:

Execution Path:

  1. Step 1: Scrape wiki (normal)
  2. Copy $WIKI_DIR to /output/markdown/
  3. Copy $RAW_DIR to /output/raw_markdown/ (if exists)
  4. Exit with success

Use Cases:

  • Debugging the scraper output without full build overhead
  • Extracting markdown for alternative processing pipelines
  • CI/CD test workflows that only validate markdown generation
  • Custom post-processing before HTML generation

Implementation: scripts/build-docs.sh:68-93

Sources: scripts/build-docs.sh:68-93

Error Handling

The orchestrator implements fail-fast error handling:

Error Handling Mechanisms:

| Mechanism | Implementation | Line Reference |
|---|---|---|
| Exit on any error | `set -e` | scripts/build-docs.sh:2 |
| Configuration validation | Explicit REPO check with error message | scripts/build-docs.sh:33-38 |
| Component failures | Automatic propagation due to `set -e` | All component invocations |
| Template warnings | Non-fatal warnings if templates not found | scripts/build-docs.sh:215-216, scripts/build-docs.sh:232-233 |

The script does not use explicit error trapping; instead, it relies on Bash’s set -e behavior to immediately exit if any command returns a non-zero status. This ensures that failures in any component (scraper, template processor, mdBook) halt execution and propagate to the Docker container exit code.

Sources: scripts/build-docs.sh:2 scripts/build-docs.sh:33-38 scripts/build-docs.sh:215-216 scripts/build-docs.sh:232-233

graph TB
    Orchestrator["build-docs.sh"]
subgraph "Python Components"
        Scraper["deepwiki-scraper.py\nArgs: REPO, WIKI_DIR"]
Templates["process-template.py\nArgs: template_path, var1=val1, ..."]
end
    
    subgraph "Build Tools"
        MDBook["mdbook build\nWorking dir: $BOOK_DIR"]
Mermaid["mdbook-mermaid install\nArgs: $BOOK_DIR"]
end
    
    subgraph "File System"
        Input["Input:\n/workspace/templates/"]
Working["Working:\n$WIKI_DIR\n$RAW_DIR\n$BOOK_DIR"]
Output["Output:\n/output/"]
end
    
    subgraph "Environment"
        EnvVars["Environment Variables:\nREPO, BOOK_TITLE,\nBOOK_AUTHORS, etc."]
end
    
 
   EnvVars --> Orchestrator
 
   Input --> Templates
    
 
   Orchestrator -->|python3| Scraper
 
   Orchestrator -->|python3| Templates
 
   Orchestrator -->|mdbook| MDBook
 
   Orchestrator -->|mdbook-mermaid| Mermaid
    
 
   Scraper --> Working
 
   Templates --> Orchestrator
 
   Orchestrator --> Working
 
   MDBook --> Working
 
   Mermaid --> Working
    
 
   Orchestrator --> Output

Integration Points

The orchestrator integrates with multiple system components through well-defined interfaces:

Diagram: Component Integration Interfaces

Interface Specifications:

deepwiki-scraper.py:

  • Invocation: python3 /usr/local/bin/deepwiki-scraper.py $REPO $WIKI_DIR
  • Input: Repository identifier (e.g., "jzombie/deepwiki-to-mdbook")
  • Output: Markdown files in $WIKI_DIR, raw snapshots in $RAW_DIR
  • Documentation: deepwiki-scraper.py

process-template.py:

  • Invocation: python3 /usr/local/bin/process-template.py $TEMPLATE_PATH var1=val1 var2=val2 ...
  • Input: Template file path and variable assignments
  • Output: Processed HTML string to stdout
  • Documentation: Template System

mdbook:

  • Invocation: mdbook build (in $BOOK_DIR)
  • Input: book.toml and src/ directory structure
  • Output: HTML in book/ subdirectory
  • Documentation: mdBook Integration

mdbook-mermaid:

  • Invocation: mdbook-mermaid install $BOOK_DIR
  • Input: Book directory path
  • Output: Mermaid assets installed in book directory
  • Documentation: mdBook Integration

Sources: scripts/build-docs.sh:65 scripts/build-docs.sh:205-213 scripts/build-docs.sh:222-230 scripts/build-docs.sh:266 scripts/build-docs.sh:271

Output Artifacts

The orchestrator produces a structured output directory with multiple artifact types:

/output/
├── book/                    # Searchable HTML documentation (Step 7)
│   ├── index.html
│   ├── searchindex.js
│   ├── mermaid.min.js
│   └── ...
├── markdown/                # Enhanced markdown files (Step 7 or markdown-only)
│   ├── Overview.md
│   ├── 1-First-Section.md
│   ├── section-1/
│   │   └── 1.1-Subsection.md
│   └── ...
├── raw_markdown/            # Pre-enhancement snapshots (if available)
│   ├── Overview.md
│   ├── 1-First-Section.md
│   └── ...
└── book.toml                # Book configuration reference (Step 7)

Artifact Generation Timeline:

| Artifact | Generated By | When | Purpose |
|---|---|---|---|
| raw_markdown/ | deepwiki-scraper.py | Step 1 | Debug: pre-enhancement state |
| markdown/ | deepwiki-scraper.py | Step 1 | Final markdown with diagrams |
| book.toml | build-docs.sh | Step 2 | Book configuration reference |
| book/ | mdbook | Step 6 | Final HTML documentation |

Sources: scripts/build-docs.sh:273-295

deepwiki-scraper.py

Purpose and Scope

The deepwiki-scraper.py script is the primary data extraction and transformation component that converts DeepWiki wiki content into enhanced markdown files. It orchestrates a three-phase pipeline: (1) extracting clean markdown from DeepWiki HTML, (2) enhancing files with normalized Mermaid diagrams using fuzzy matching, and (3) moving completed files to the output directory.

This page documents the script’s architecture, execution model, and key algorithms. For information about how this script is invoked by the build system, see build-docs.sh Orchestrator. For detailed explanations of the extraction and enhancement phases, see Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement.

Sources: python/deepwiki-scraper.py:1-11


Command-Line Interface

The script requires two positional arguments:

  • Repository identifier : Format owner/repo (e.g., jzombie/deepwiki-to-mdbook)
  • Output directory : Destination path for generated markdown files

The repository identifier is validated using the regex pattern ^[\w-]+/[\w-]+$ at python/deepwiki-scraper.py:1287-1289. The script exits with status code 1 if validation fails or if the wiki structure cannot be extracted.

Sources: python/deepwiki-scraper.py:1-10 python/deepwiki-scraper.py:1277-1289


Three-Phase Execution Model

Figure 1: Three-Phase Execution Pipeline

The main() function at python/deepwiki-scraper.py:1277-1410 implements a three-phase workflow:

| Phase | Function | Primary Responsibility | Output Location |
|---|---|---|---|
| 1 | extract_wiki_structure + extract_page_content | Scrape HTML and convert to markdown | Temporary directory |
| 2 | extract_and_enhance_diagrams | Match and inject Mermaid diagrams | In-place modification of temp directory |
| 3 | File system operations | Move validated files to output | Final output directory |

A temporary directory is created at python/deepwiki-scraper.py:1295-1296 using Python’s tempfile.TemporaryDirectory context manager. This ensures automatic cleanup even if the script fails. A raw markdown snapshot is saved to raw_markdown/ at python/deepwiki-scraper.py:1358-1366 before diagram enhancement for debugging purposes.

Sources: python/deepwiki-scraper.py:1277-1410 python/deepwiki-scraper.py:1298-1371


Wiki Structure Discovery

Figure 2: Structure Discovery Algorithm Using extract_wiki_structure

The extract_wiki_structure function at python/deepwiki-scraper.py:116-163 discovers all wiki pages by parsing the main wiki index page. It uses a compiled regex pattern to find all links matching ^/{repo_pattern}/\d+ at python/deepwiki-scraper.py:128-129.

The page numbering scheme distinguishes main pages from subsections using dot notation:

  • Level 0 : Main pages (e.g., 1, 2, 3) - pages with no dots
  • Level 1 : Subsections (e.g., 2.1, 2.2) - pages with one dot
  • Level N : Deeper subsections (e.g., 2.1.3) - pages with N dots

The level is calculated at python/deepwiki-scraper.py:145 as page_num.count('.'). Pages are sorted using a custom key function at python/deepwiki-scraper.py:157-159 that splits the page number by dots and converts each component to an integer, ensuring proper numerical ordering (e.g., 2.10 comes after 2.9).
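A minimal sketch of that sort key (illustrative, not the exact source):

```python
pages = [{"number": "2.10"}, {"number": "2.9"}, {"number": "1"}]

def sort_key(page):
    # "2.10" -> [2, 10], which sorts after "2.9" -> [2, 9]
    return [int(part) for part in page["number"].split(".")]

pages.sort(key=sort_key)
print([p["number"] for p in pages])  # ['1', '2.9', '2.10']
```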

Each page dictionary contains:

  • number: Page number string (e.g., "2.1")
  • title: Extracted link text
  • url: Full URL to the page
  • href: Relative path (used for link rewriting)
  • level: Nesting depth based on dot count

Sources: python/deepwiki-scraper.py:116-163 python/deepwiki-scraper.py:145 python/deepwiki-scraper.py:157-161


Path Resolution and Numbering Normalization

Figure 3: Path Resolution Using normalized_number_parts and resolve_output_path

The path resolution system normalizes DeepWiki’s numbering scheme to match mdBook’s conventions. The normalized_number_parts function at python/deepwiki-scraper.py:28-43 shifts page numbers down by one so that DeepWiki’s page 1 becomes unnumbered (the index page), and subsequent pages start at 1.

| DeepWiki Number | normalized_number_parts Output | Final Filename |
|---|---|---|
| "1" | [] (empty list) | overview.md (unnumbered) |
| "2" | ["1"] | 1-introduction.md |
| "3.1" | ["2", "1"] | 2-1-subsection.md |
| "3.2" | ["2", "2"] | 2-2-another.md |

The resolve_output_path function at python/deepwiki-scraper.py:45-53 combines normalized numbers with sanitized titles. Subsections (with len(parts) > 1) are placed in directories named section-{main_number} at python/deepwiki-scraper.py:52. The sanitize_filename function at python/deepwiki-scraper.py:22-26 strips special characters and normalizes whitespace using regex patterns r'[^\w\s-]' and r'[-\s]+'.
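The described behavior can be summarized in a small sketch (helper bodies are assumptions that match the table above, ignoring edge cases; not the exact source):

```python
import re

def sanitize_filename(text):
    # Strip special characters, then collapse spaces/hyphens (mirrors the described regexes)
    text = re.sub(r"[^\w\s-]", "", text)
    return re.sub(r"[-\s]+", "-", text).strip("-").lower()

def normalized_number_parts(page_number):
    # Shift DeepWiki numbering down by one: "1" -> [], "3.1" -> ["2", "1"]
    parts = [int(p) for p in page_number.split(".")]
    parts[0] -= 1
    return [] if parts[0] == 0 else [str(p) for p in parts]

def resolve_output_path(page_number, title):
    parts = normalized_number_parts(page_number)
    slug = sanitize_filename(title)
    if not parts:                       # DeepWiki page 1 becomes the unnumbered index
        return f"{slug}.md"
    name = f"{'-'.join(parts)}-{slug}.md"
    if len(parts) > 1:                  # subsections live in section-{main_number}/
        return f"section-{parts[0]}/{name}"
    return name

print(resolve_output_path("1", "Overview"))      # overview.md
print(resolve_output_path("3.1", "Subsection"))  # section-2/2-1-subsection.md
```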

The build_target_path function at python/deepwiki-scraper.py:55-63 constructs full relative paths for link rewriting, used by the link fixing logic at python/deepwiki-scraper.py:854-875.

Sources: python/deepwiki-scraper.py:28-63 python/deepwiki-scraper.py:22-26 python/deepwiki-scraper.py:854-875


Content Extraction and HTML-to-Markdown Conversion

Figure 4: Content Extraction Pipeline Using extract_page_content

The extract_page_content function at python/deepwiki-scraper.py:751-877 implements a multi-stage HTML cleaning and conversion pipeline. BeautifulSoup selectors at python/deepwiki-scraper.py:761-762 remove navigation elements before content extraction.

The content finder at python/deepwiki-scraper.py:765-779 tries a prioritized list of selectors: article, main, .wiki-content, .content, #content, .markdown-body, and finally falls back to body. DeepWiki-specific UI elements are removed at python/deepwiki-scraper.py:786-795 by searching for text patterns like “Index your code with Devin” and “Edit Wiki”.

Navigation list removal at python/deepwiki-scraper.py:799-806 detects and removes <ul> elements containing more than 5 links where 80%+ are internal wiki links.

The convert_html_to_markdown function at python/deepwiki-scraper.py:213-228 uses the html2text library with configuration:

  • ignore_links = False - preserve all links
  • body_width = 0 - disable line wrapping to prevent formatting issues
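A minimal sketch of this configuration:

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False  # preserve all links
h.body_width = 0        # disable line wrapping so long lines stay intact

markdown = h.handle("<h1>Title</h1><p>See the <a href='/docs'>docs</a>.</p>")
print(markdown)
```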

A note at python/deepwiki-scraper.py:221-223 explicitly documents that Mermaid diagram processing is disabled during HTML conversion, because diagrams from ALL pages are mixed together in the JavaScript payload.

The clean_deepwiki_footer function at python/deepwiki-scraper.py:165-211 removes DeepWiki UI elements using compiled regex patterns for text like “Dismiss”, “Refresh this wiki”, and “On this page”. It scans backwards from the end of the file up to 50 lines to find footer markers at python/deepwiki-scraper.py:187-191.

Link rewriting at python/deepwiki-scraper.py:854-875 converts DeepWiki URLs to relative markdown paths, handling both same-section and cross-section references by calculating relative paths based on the source file’s section directory.

Sources: python/deepwiki-scraper.py:751-877 python/deepwiki-scraper.py:213-228 python/deepwiki-scraper.py:165-211 python/deepwiki-scraper.py:395-406


Diagram Extraction from JavaScript Payload

Figure 5: Diagram Extraction Using extract_and_enhance_diagrams

The extract_and_enhance_diagrams function at python/deepwiki-scraper.py:880-1275 extracts all Mermaid diagrams from DeepWiki’s Next.js JavaScript payload. The regex pattern at python/deepwiki-scraper.py:899 matches fenced code blocks with various newline formats: \\r\\n, \\n, or actual newline characters.

Context extraction at python/deepwiki-scraper.py:903-1087 captures up to 2000 characters before each diagram to enable fuzzy matching. For each diagram, the context is parsed to extract:

  1. Last heading : The most recent line starting with # (searched backwards from diagram position)
  2. Anchor text : The last 2-3 non-heading lines exceeding 20 characters in length, concatenated and truncated to 300 characters

The context extraction logic at python/deepwiki-scraper.py:1066-1081 searches backwards through context lines to find the last heading, then collects up to 3 substantial non-heading lines as anchor text.

The unescaping phase at python/deepwiki-scraper.py:1039-1046 handles JavaScript string escapes:

| Escaped Sequence | Unescaped Result |
|---|---|
| \\n | Newline character |
| \\t | Tab character |
| \\" | Double quote |
| \\\\ | Single backslash |
| \\u003c | < character |
| \\u003e | > character |
| \\u0026 | & character |
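A sketch of this unescaping pass over the table above (the replacement order is an assumption; the double backslash goes last so it cannot create new escape sequences for the earlier replacements):

```python
REPLACEMENTS = [
    ("\\n", "\n"), ("\\t", "\t"), ('\\"', '"'),
    ("\\u003c", "<"), ("\\u003e", ">"), ("\\u0026", "&"),
    ("\\\\", "\\"),
]

def unescape_js_string(s):
    # Apply the substitutions from the table above, in order
    for old, new in REPLACEMENTS:
        s = s.replace(old, new)
    return s

print(unescape_js_string('graph TD\\n  A[\\"Start\\"] --\\u003e B'))
```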

The merge_multiline_labels function at python/deepwiki-scraper.py:907-1009 collapses wrapped Mermaid labels into literal \n sequences. This is crucial because DeepWiki sometimes wraps long labels across multiple lines in the HTML, but Mermaid 11 expects these to be explicitly marked with \n tokens.

Sources: python/deepwiki-scraper.py:880-1087 python/deepwiki-scraper.py:899 python/deepwiki-scraper.py:1039-1046 python/deepwiki-scraper.py:907-1009


Seven-Step Mermaid Normalization Pipeline

Figure 6: Seven-Step Normalization Pipeline Using normalize_mermaid_diagram

The normalize_mermaid_diagram function at python/deepwiki-scraper.py:385-393 applies seven normalization passes to ensure Mermaid 11 compatibility:

Step 1: normalize_mermaid_edge_labels

Function at python/deepwiki-scraper.py:230-251. Applies only to graphs and flowcharts (detected by checking if the first line starts with graph or flowchart). Uses regex r'\|([^|]*)\|' to find edge labels and flattens any containing \n, \\n, (, or ) by:

  • Replacing \\n and \n with spaces
  • Removing parentheses
  • Collapsing whitespace with re.sub(r'\s+', ' ', cleaned).strip()
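A sketch of that flattening step (illustrative, not the exact source):

```python
import re

def normalize_edge_label(match):
    label = match.group(1)
    if any(tok in label for tok in ("\\n", "\n", "(", ")")):
        cleaned = label.replace("\\n", " ").replace("\n", " ")
        cleaned = cleaned.replace("(", "").replace(")", "")
        label = re.sub(r"\s+", " ", cleaned).strip()
    return f"|{label}|"

line = 'A -->|uses\\n(HTTP)| B'
print(re.sub(r"\|([^|]*)\|", normalize_edge_label, line))
# A -->|uses HTTP| B
```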

Step 2: normalize_mermaid_state_descriptions

Function at python/deepwiki-scraper.py:253-277. Applies only to state diagrams. Ensures state descriptions use the syntax State : Description by:

  • Skipping lines with :: (already valid)
  • Splitting on single : and cleaning suffix
  • Replacing colons in description with -
  • Rebuilding as {prefix.rstrip()} : {cleaned_suffix}

Step 3: normalize_flowchart_nodes

Function at python/deepwiki-scraper.py:279-301. Applies to graphs and flowcharts. Uses regex r'\["([^"]*)"\]' to find quoted node labels and clean their contents.

Step 4: normalize_statement_separators

Function at python/deepwiki-scraper.py:313-328. Applies to graphs and flowcharts. The STATEMENT_BREAK_PATTERN at python/deepwiki-scraper.py:309-311 detects consecutive statements on one line and inserts newlines between them while preserving indentation.

Step 5: normalize_empty_node_labels

Function at python/deepwiki-scraper.py:330-341. Uses regex r'(\b[A-Za-z0-9_]+)\[""\]' to find nodes with empty labels. Generates a fallback label from the node ID by replacing underscores/hyphens with spaces.

Step 6: normalize_gantt_diagram

Function at python/deepwiki-scraper.py:343-383. Applies only to Gantt diagrams. Detects task lines missing IDs using pattern r'^(\s*"[^"]+"\s*):\s*(.+)$' and inserts synthetic IDs (task1, task2, etc.) when the first token after the colon is neither an ID nor an after reference.

Sources: python/deepwiki-scraper.py:385-393 python/deepwiki-scraper.py:230-251 python/deepwiki-scraper.py:253-277 python/deepwiki-scraper.py:279-301 python/deepwiki-scraper.py:313-328 python/deepwiki-scraper.py:330-341 python/deepwiki-scraper.py:343-383


Fuzzy Matching and Diagram Injection

Figure 7: Fuzzy Matching Algorithm for Diagram Injection

The fuzzy matching algorithm at python/deepwiki-scraper.py:1150-1275 pairs each diagram with its correct markdown file by matching context against file contents. The algorithm uses a progressive chunk size strategy at python/deepwiki-scraper.py:1188 to find matches:

| Chunk Size | Use Case |
|---|---|
| 300 chars | Highest precision - exact context match |
| 200 chars | Medium precision - paragraph-level match |
| 150 chars | Lower precision - sentence-level match |
| 100 chars | Low precision - phrase-level match |
| 80 chars | Minimum threshold - short phrase match |

The matching loop at python/deepwiki-scraper.py:1170-1238 attempts anchor text matching first. The anchor text (last 2-3 lines of context before the diagram) is normalized to lowercase with whitespace collapsed at python/deepwiki-scraper.py:1185-1186. For each chunk size, the algorithm searches for the test chunk at the end of the anchor text (anchor_normalized[-chunk_size:]) in the normalized file content.

If anchor matching fails (score < 80), the algorithm falls back to heading matching at python/deepwiki-scraper.py:1204-1216. This compares the last_heading from diagram context against all headings in the file after normalizing both by removing # symbols and collapsing whitespace.

Only matches with best_match_score >= 80 are accepted at python/deepwiki-scraper.py:1218. This threshold balances precision (avoiding false matches) with recall (ensuring most diagrams are placed).
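A sketch of the progressive matching loop (the function shape is an assumption; the chunk size doubles as the score):

```python
def find_anchor_match(anchor_text, file_text):
    anchor = " ".join(anchor_text.lower().split())    # normalize case and whitespace
    content = " ".join(file_text.lower().split())
    for chunk_size in (300, 200, 150, 100, 80):
        chunk = anchor[-chunk_size:]                  # end of the anchor text
        pos = content.find(chunk)
        if chunk and pos != -1:
            return pos, chunk_size                    # offset and "score"
    return None
```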

Insertion Point Logic

The insertion point finder at python/deepwiki-scraper.py:1220-1236 behaves differently based on match type:

After heading match :

  1. Skip blank lines after heading
  2. Skip through the following paragraph
  3. Insert after the paragraph ends (blank line or next heading)

After paragraph match :

  1. Find end of current paragraph
  2. Insert when encountering blank line or heading

Content Guards

The enforce_content_start function at python/deepwiki-scraper.py:1138-1147 and advance_past_lists function at python/deepwiki-scraper.py:1125-1137 implement content guards to prevent diagram insertion in protected areas:

Protected prefix (detected by protected_prefix_end at python/deepwiki-scraper.py:1101-1115):

  • Title line (first line starting with #)
  • “Relevant source files” section and its list items
  • Blank lines in these sections

List blocks (detected by is_list_line at python/deepwiki-scraper.py:1117-1123):

  • Lines starting with -, *, +
  • Lines matching \d+[.)]\s (numbered lists)

Diagrams are never inserted inside list blocks. If the insertion point lands in a list, advance_past_lists moves the insertion point to after the list ends.

Dynamic Fence Length

The insertion logic at python/deepwiki-scraper.py:1249-1266 calculates a dynamic fence length to handle diagrams containing backticks. It scans the diagram text for the longest run of consecutive backticks and sets fence_len = max(3, max_backticks + 1). This ensures the fence markers (e.g., a four-backtick mermaid fence) always properly delimit the diagram content.
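A sketch of the fence-length calculation:

```python
import re

def fence_for(diagram_text):
    # Longest run of backticks inside the diagram decides the fence length
    runs = [len(m.group(0)) for m in re.finditer(r"`+", diagram_text)]
    fence_len = max(3, (max(runs) + 1) if runs else 3)
    return "`" * fence_len

diagram = 'graph TD\n  A["uses ``code`` spans"] --> B'
fence = fence_for(diagram)
block = f"{fence}mermaid\n{diagram}\n{fence}"
```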

Sources: python/deepwiki-scraper.py:1150-1275 python/deepwiki-scraper.py:1170-1238 python/deepwiki-scraper.py:1101-1147 python/deepwiki-scraper.py:1249-1266


Error Handling and Retry Logic

Figure 8: Retry Logic in fetch_page Function

The fetch_page function at python/deepwiki-scraper.py:65-80 implements a 3-attempt retry strategy. The retry loop at python/deepwiki-scraper.py:71-80 catches all exceptions with a broad except Exception as e clause and retries after a fixed 2-second delay using time.sleep(2).

Browser-like headers are set at python/deepwiki-scraper.py:67-69 to avoid bot detection:

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

The timeout is set to 30 seconds at python/deepwiki-scraper.py:73. After a successful fetch, response.raise_for_status() validates the HTTP status code.
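Putting the pieces together, fetch_page behaves roughly like this sketch (not the exact source):

```python
import time
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

def fetch_page(url, session, retries=3):
    for attempt in range(retries):
        try:
            response = session.get(url, headers=HEADERS, timeout=30)
            response.raise_for_status()
            return response
        except Exception:
            if attempt == retries - 1:
                raise                 # final attempt: propagate the failure
            time.sleep(2)             # fixed delay before the next attempt
```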

The main extraction loop at python/deepwiki-scraper.py:1328-1353 catches exceptions per-page and continues processing remaining pages:
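Conceptually (the message format and save helper are assumptions):

```python
for page in pages:
    try:
        markdown = extract_page_content(page["url"], session, page)
        save_markdown(page, markdown)   # hypothetical helper
    except Exception as e:
        print(f"  Failed: {page['title']}: {e}")
        continue                        # one bad page must not abort the run
```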

This ensures that a single page failure doesn’t abort the entire scraping process. The success count is reported at python/deepwiki-scraper.py:1355.

The top-level try-except block at python/deepwiki-scraper.py:1310-1407 catches any unhandled exceptions and exits with status code 1, signaling failure to the calling build script.

Sources: python/deepwiki-scraper.py:65-80 python/deepwiki-scraper.py:1328-1353 python/deepwiki-scraper.py:1310-1407


Session Management and Rate Limiting

The script uses a requests.Session object created at python/deepwiki-scraper.py:1305-1308 with persistent headers:
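A sketch of the session setup:

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
})
```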

Session reuse provides connection pooling and persistent cookies across requests. The session is passed to all HTTP functions: extract_wiki_structure, extract_page_content, and extract_and_enhance_diagrams.

Rate limiting is implemented at python/deepwiki-scraper.py:1350 with a 1-second sleep between page extractions:
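Conceptually (loop body elided):

```python
import time

for page in pages:
    # ... extract and save this page ...
    time.sleep(1)  # Be nice to the server
```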

This prevents overwhelming the DeepWiki server and reduces the risk of rate limiting or IP blocking. The comment at python/deepwiki-scraper.py:1349 explicitly states “Be nice to the server”.

Sources: python/deepwiki-scraper.py:1305-1308 python/deepwiki-scraper.py:1349-1350


Key Function Reference

| Function | Lines | Purpose |
|---|---|---|
| main() | 1277-1410 | Entry point - orchestrates three-phase pipeline |
| extract_wiki_structure(repo, session) | 116-163 | Discover all wiki pages from index |
| extract_page_content(url, session, page_info) | 751-877 | Extract and clean single page content |
| extract_and_enhance_diagrams(repo, temp_dir, session, url) | 880-1275 | Extract diagrams and inject into files |
| convert_html_to_markdown(html_content) | 213-228 | Convert HTML to markdown using html2text |
| clean_deepwiki_footer(markdown) | 165-211 | Remove DeepWiki UI elements from footer |
| normalize_mermaid_diagram(diagram_text) | 385-393 | Apply seven-step normalization pipeline |
| normalize_mermaid_edge_labels(diagram_text) | 230-251 | Flatten multiline edge labels |
| normalize_mermaid_state_descriptions(diagram_text) | 253-277 | Fix state diagram syntax |
| normalize_flowchart_nodes(diagram_text) | 279-301 | Clean flowchart node labels |
| normalize_statement_separators(diagram_text) | 313-328 | Insert newlines between statements |
| normalize_empty_node_labels(diagram_text) | 330-341 | Provide fallback labels |
| normalize_gantt_diagram(diagram_text) | 343-383 | Add synthetic task IDs |
| merge_multiline_labels(diagram_text) | 907-1009 | Collapse wrapped labels |
| strip_wrapping_quotes(diagram_text) | 1011-1022 | Remove extra quotes |
| fetch_page(url, session) | 65-80 | HTTP fetch with retry logic |
| sanitize_filename(text) | 22-26 | Convert text to safe filename |
| normalized_number_parts(page_number) | 28-43 | Shift DeepWiki numbering down by 1 |
| resolve_output_path(page_number, title) | 45-53 | Determine filename and section dir |
| build_target_path(page_number, slug) | 55-63 | Build relative path for links |
| format_source_references(markdown) | 397-406 | Insert colons in source links |

Sources: python/deepwiki-scraper.py:1-1411


Template System

Relevant source files

The template system provides customizable header and footer content that is injected into every markdown file during the build process. This system uses a simple variable substitution syntax with conditional rendering support, allowing users to customize the appearance and metadata of generated documentation without modifying the core build scripts.

For information about how templates are injected during the build process, see Template Injection. For comprehensive documentation on template variables, see Template Variables.

Sources: templates/README.md:1-77

Template Files

The system uses two HTML template files that define content to be injected into each markdown file:

| Template File | Location | Purpose | Injection Point |
|---|---|---|---|
| header.html | /workspace/templates/header.html | Injected at the beginning of each markdown file | Immediately after frontmatter |
| footer.html | /workspace/templates/footer.html | Injected at the end of each markdown file | After all content |

Both templates are processed through the same variable substitution engine before injection.

Sources: templates/README.md:6-8 templates/header.html:1-9 templates/footer.html:1-11

Default Header Template

The default header template displays project badges and attribution information.

The template uses inline conditionals to prevent mdBook from wrapping links in separate paragraph tags, which would break the styling.

Sources: templates/header.html:1-9

Default Footer Template

The default footer template displays generation metadata and repository information.

Sources: templates/footer.html:1-11

Template Syntax

The template system supports three syntactic features: variable substitution, conditional rendering, and HTML comments.

Variable Substitution

Variables use double-brace syntax: {{VARIABLE_NAME}}. The processor replaces these with the corresponding variable value, or an empty string if the variable is not defined.

Variable names must match the pattern \w+ (alphanumeric and underscore characters).
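A sketch of this substitution (the use of re.sub with an empty-string default is an assumption):

```python
import re

def substitute_variables(template, variables):
    # Replace {{VAR}} with its value, or "" when the variable is undefined
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: variables.get(m.group(1), ""),
                  template)

print(substitute_variables("Title: {{BOOK_TITLE}}", {"BOOK_TITLE": "Docs"}))
```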

Sources: python/process-template.py:38-45 templates/README.md:12-15

Conditional Rendering

Conditional blocks use {{#if VARIABLE}}...{{/if}} syntax. The content between the tags is included only if the variable exists and is non-empty.

The conditional pattern matches \{\{#if\s+(\w+)\}\}(.*?)\{\{/if\}\} and evaluates whether the variable is truthy.
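A sketch of the conditional pass (the DOTALL flag is an assumption so blocks can span multiple lines):

```python
import re

COND = re.compile(r"\{\{#if\s+(\w+)\}\}(.*?)\{\{/if\}\}", re.DOTALL)

def process_conditionals(template, variables):
    # Keep the block body only when the variable is set and non-empty
    return COND.sub(
        lambda m: m.group(2) if variables.get(m.group(1)) else "",
        template,
    )

tpl = "{{#if GIT_REPO_URL}}<a href='{{GIT_REPO_URL}}'>Source</a>{{/if}}"
print(process_conditionals(tpl, {"GIT_REPO_URL": "https://github.com/x/y"}))
```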

Sources: python/process-template.py:24-36 templates/README.md:17-22

HTML Comments

HTML comments are automatically stripped from the output during processing.

Sources: python/process-template.py:47-48 templates/README.md:24-28

Template Processing Engine

Diagram: Template Processing Flow

Sources: python/process-template.py:1-82

The process-template.py script implements a two-pass processing algorithm followed by a cleanup step:

  1. Conditional Processing (first pass): Evaluates {{#if}} blocks using regular expression matching and removes or includes content based on variable truthiness python/process-template.py:24-36
  2. Variable Substitution (second pass): Replaces {{VAR}} placeholders with actual values python/process-template.py:38-45
  3. Comment Removal (cleanup): Strips HTML comments from the final output python/process-template.py:47-48

Command-Line Interface

The script accepts a template file path and variable assignments in KEY=value format:
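For example (all values are hypothetical):

python3 /usr/local/bin/process-template.py /workspace/templates/header.html REPO=jzombie/deepwiki-to-mdbook BOOK_TITLE="DeepWiki Documentation"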

Arguments are parsed at python/process-template.py:66-70 where each KEY=value pair is split and stored in a dictionary for substitution.

Sources: python/process-template.py:53-82

Available Template Variables

The following variables are provided by the build system and available in all templates:

| Variable | Description | Example Value | Source |
|---|---|---|---|
| REPO | Repository in owner/repo format | jzombie/deepwiki-to-mdbook | Environment or Git detection |
| BOOK_TITLE | Documentation title | DeepWiki Documentation | Environment or auto-generated |
| BOOK_AUTHORS | Author names | zenOSmosis | Environment or Git config |
| GENERATION_DATE | ISO 8601 timestamp (UTC) | 2024-01-15T10:30:00Z | Build time |
| DEEPWIKI_URL | DeepWiki documentation URL | https://deepwiki.com/wiki/... | DeepWiki scraper |
| DEEPWIKI_BADGE_URL | DeepWiki badge image URL | https://deepwiki.com/badge/... | Constructed from DEEPWIKI_URL |
| GIT_REPO_URL | Full Git repository URL | https://github.com/... | Constructed from REPO |
| GITHUB_BADGE_URL | GitHub badge image URL | https://img.shields.io/github/... | Constructed from REPO |

All variables are optional. If a variable is undefined, it is replaced with an empty string during substitution.

Sources: templates/README.md:30-39

Customization

Diagram: Template Customization Architecture

Sources: templates/README.md:42-56

Volume Mount Customization

Custom templates can be provided by mounting a local directory or individual files into the Docker container:

Mount entire template directory:
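For example (the image name and local paths are placeholders):

docker run -e REPO=owner/repo -v ./my-templates:/workspace/templates -v ./output:/output <image>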

Mount individual template files:
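For example (placeholders again):

docker run -e REPO=owner/repo -v ./my-header.html:/workspace/templates/header.html -v ./output:/output <image>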

Sources: templates/README.md:45-51

Environment Variable Customization

Template locations can be overridden via environment variables:

| Environment Variable | Default Value | Description |
|---|---|---|
| TEMPLATE_DIR | /workspace/templates | Base directory for templates |
| HEADER_TEMPLATE | $TEMPLATE_DIR/header.html | Full path to header template |
| FOOTER_TEMPLATE | $TEMPLATE_DIR/footer.html | Full path to footer template |

Example:
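(Illustrative; the image name and local path are placeholders.)

docker run -e REPO=owner/repo -e TEMPLATE_DIR=/custom/templates -v ./custom:/custom/templates -v ./output:/output <image>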

Sources: templates/README.md:53-56

Integration with Build Process

Diagram: Template System in Build Pipeline

Sources: templates/README.md:1-77 Diagram 2 from high-level system architecture

The template system is invoked during Phase 3 of the build pipeline, after markdown files have been generated and enhanced with diagrams. The build-docs.sh orchestrator script processes each template file through process-template.py, then injects the processed content into every markdown file before generating SUMMARY.md and running mdbook build.

Template processing occurs in the following sequence:

  1. Template Resolution : Determine paths to header and footer templates using environment variables or defaults
  2. Variable Collection : Gather all template variables from environment and runtime context
  3. Template Processing : Invoke process-template.py for each template file with collected variables
  4. Content Injection : Prepend processed header and append processed footer to each markdown file
  5. Build Continuation : Proceed with SUMMARY.md generation and mdBook build

This design ensures that all documentation pages share consistent branding, navigation, and metadata without requiring manual edits to individual markdown files.

Sources: templates/README.md:1-77

Example Custom Templates

Minimal Header Example
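A header this small might look like the following (hypothetical content, not the shipped template):

<div align="center">
  <strong>{{BOOK_TITLE}}</strong>{{#if REPO}} ({{REPO}}){{/if}}
</div>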

Sources: templates/README.md:60-68

Minimal Footer Example

Sources: templates/README.md:70-76

Conditional Badge Example

This example demonstrates conditional rendering of badges based on available repository and documentation URLs. If neither GIT_REPO_URL nor DEEPWIKI_URL is defined, the template produces no output.

Sources: templates/header.html:2-4 templates/README.md:17-22


mdBook Integration

Relevant source files

Purpose and Scope

This page documents how the system integrates with mdBook and mdbook-mermaid to generate the final HTML documentation. It covers the configuration generation process, automatic table of contents creation, mermaid diagram support installation, and the build execution. For information about the overall three-phase pipeline, see Three-Phase Pipeline. For template injection specifics, see Template Injection.

mdBook and mdbook-mermaid Tools

The system uses two Rust-based tools compiled during the Docker build:

| Tool | Purpose | Installation Location |
|---|---|---|
| mdbook | Static site generator for documentation from Markdown | /usr/local/bin/mdbook |
| mdbook-mermaid | mdBook preprocessor for rendering Mermaid diagrams | /usr/local/bin/mdbook-mermaid |

Both tools are compiled from source in the Rust builder stage of the Docker image and copied to the final Python-based runtime container for size optimization.

mdBook Build Pipeline

graph TB
    subgraph "Input_Preparation"
        WikiDir["/workspace/wiki/\nEnhanced markdown files"]
        Templates["Processed templates\nHEADER_HTML, FOOTER_HTML"]
    end
    subgraph "Configuration_Generation"
        BookToml["book.toml generation\n[scripts/build-docs.sh:102-119]"]
        SummaryGen["SUMMARY.md generation\n[scripts/build-docs.sh:124-186]"]
        EnvVars["Environment variables:\nBOOK_TITLE, BOOK_AUTHORS\nGIT_REPO_URL"]
    end
    subgraph "Structure_Creation"
        BookDir["/workspace/book/\nmdBook project root"]
        SrcDir["/workspace/book/src/\nMarkdown source"]
        CopyFiles["Copy wiki files to src/\n[scripts/build-docs.sh:237]"]
        InjectTemplates["Inject header/footer\n[scripts/build-docs.sh:239-261]"]
    end
    subgraph "mdBook_Processing"
        MermaidInstall["mdbook-mermaid install\n[scripts/build-docs.sh:266]"]
        MdBookBuild["mdbook build\n[scripts/build-docs.sh:271]"]
        Preprocessor["mdbook-mermaid preprocessor\nConverts ```mermaid blocks"]
    end
    subgraph "Output"
        BookHTML["/workspace/book/book/\nBuilt HTML documentation"]
        OutputCopy["Copy to /output/book/\n[scripts/build-docs.sh:279]"]
    end
    EnvVars --> BookToml
    EnvVars --> SummaryGen
    WikiDir --> CopyFiles
    Templates --> InjectTemplates
    BookToml --> BookDir
    SummaryGen --> SrcDir
    CopyFiles --> SrcDir
    InjectTemplates --> SrcDir
    BookDir --> MermaidInstall
    MermaidInstall --> MdBookBuild
    SrcDir --> MdBookBuild
    MdBookBuild --> Preprocessor
    Preprocessor --> BookHTML
    BookHTML --> OutputCopy

Sources: scripts/build-docs.sh:95-295

Configuration Generation (book.toml)

The book.toml configuration file is dynamically generated at scripts/build-docs.sh:102-119 using environment variables.

book.toml Structure
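The generated file has roughly this shape (values are illustrative; the exact keys come from the table below):

[book]
title = "DeepWiki Documentation"
authors = ["zenOSmosis"]
language = "en"
src = "src"

[output.html]
default-theme = "rust"
git-repository-url = "https://github.com/owner/repo"

[output.html.fold]
enable = true
level = 1

[preprocessor.mermaid]
command = "mdbook-mermaid"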

Sources: scripts/build-docs.sh:102-119

Configuration Sections

| Section | Key | Value Source | Purpose |
|---|---|---|---|
| [book] | title | $BOOK_TITLE | Documentation title displayed in UI |
| [book] | authors | $BOOK_AUTHORS | Author names shown in metadata |
| [book] | language | "en" (hardcoded) | Content language for HTML lang attribute |
| [book] | src | "src" (hardcoded) | Markdown source directory relative to book root |
| [output.html] | default-theme | "rust" (hardcoded) | Visual theme (rust, light, navy, ayu, coal) |
| [output.html] | git-repository-url | $GIT_REPO_URL | Enables “Edit on GitHub” link in top bar |
| [preprocessor.mermaid] | command | "mdbook-mermaid" | Specifies mermaid preprocessor executable |
| [output.html.fold] | enable | true | Enables collapsible sidebar sections |
| [output.html.fold] | level | 1 | Sidebar sections collapsed by default at depth 1 |

The git-repository-url setting automatically adds an “Edit on GitHub” button to the top-right of each page, pointing to the repository specified in $GIT_REPO_URL (defaults to https://github.com/$REPO).

Sources: scripts/build-docs.sh:102-119

SUMMARY.md Generation

The table of contents is automatically generated by analyzing the file structure in /workspace/wiki/. The algorithm at scripts/build-docs.sh:124-186 creates a hierarchical navigation structure.

SUMMARY.md Generation Algorithm

graph TB
    subgraph "File_Discovery"
        WikiDir["/workspace/wiki/"]
        ListFiles["ls *.md\n[scripts/build-docs.sh:135]"]
        FilterNumeric["grep -E '^[0-9]'\n[scripts/build-docs.sh:150]"]
        NumericSort["sort -t- -k1 -n\n[scripts/build-docs.sh:151]"]
    end
    subgraph "Overview_Detection"
        OverviewFile["Find non-numeric file\n[scripts/build-docs.sh:138]"]
        ExtractTitle["head -1 | sed 's/^# //'\n[scripts/build-docs.sh:140]"]
        WriteOverview["Write [Title](filename)\n[scripts/build-docs.sh:141]"]
    end
    subgraph "Main_Page_Processing"
        IteratePages["Iterate main pages\n[scripts/build-docs.sh:158-185]"]
        ExtractPageTitle["Extract '# Title' from file\n[scripts/build-docs.sh:163]"]
        CheckSubsections["Check section-N directory\n[scripts/build-docs.sh:166-169]"]
    end
    subgraph "Subsection_Handling"
        ListSubsections["ls section-N/*.md\n[scripts/build-docs.sh:174]"]
        SortSubsections["sort -t- -k1 -n\n[scripts/build-docs.sh:174]"]
        ExtractSubTitle["Extract '# Title' from subfile\n[scripts/build-docs.sh:178]"]
        WriteSubsection["Write '  - [Title](section-N/file)'\n[scripts/build-docs.sh:179]"]
    end
    subgraph "Output"
        SummaryMd["/workspace/book/src/SUMMARY.md"]
    end
    WikiDir --> ListFiles
    ListFiles --> OverviewFile
    OverviewFile --> ExtractTitle
    ExtractTitle --> WriteOverview
    ListFiles --> FilterNumeric
    FilterNumeric --> NumericSort
    NumericSort --> IteratePages
    IteratePages --> ExtractPageTitle
    ExtractPageTitle --> CheckSubsections
    CheckSubsections -->|Has subsections| ListSubsections
    ListSubsections --> SortSubsections
    SortSubsections --> ExtractSubTitle
    ExtractSubTitle --> WriteSubsection
    CheckSubsections -->|No subsections| WriteStandalone["Write '- [Title](file)'"]
    WriteOverview --> SummaryMd
    WriteSubsection --> SummaryMd
    WriteStandalone --> SummaryMd

Sources: scripts/build-docs.sh:124-186

SUMMARY.md Structure

The generated SUMMARY.md follows this format:
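An illustrative example (titles and filenames are placeholders):

# Summary

[Overview](overview.md)

- [First Section](1-first-section.md)
  - [Subsection](section-1/1-1-subsection.md)
- [Second Section](2-second-section.md)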

Generation Logic

  1. Overview Detection scripts/build-docs.sh:136-144: Finds the first non-numeric markdown file and writes it as the top-level overview link.

  2. Main Page Sorting scripts/build-docs.sh:147-155: Extracts numeric prefixes (e.g., 1, 2, 10) and sorts numerically using sort -t- -k1 -n, ensuring 10-file.md comes after 2-file.md.

  3. Subsection Detection scripts/build-docs.sh:166-180: For each main page with numeric prefix N, checks if directory section-N/ exists. If found, lists and sorts subsection files, writing them indented with two spaces.

  4. Title Extraction scripts/build-docs.sh:163-178: Reads the first line of each markdown file and removes the # prefix using sed 's/^# //'.

Sources: scripts/build-docs.sh:124-186

mdbook-mermaid Installation

Before building the book, the mermaid preprocessor assets must be installed. This is performed by mdbook-mermaid install at scripts/build-docs.sh:266.

mdbook-mermaid Installation Process

Sources: scripts/build-docs.sh:263-266

The mdbook-mermaid install command:

  1. Downloads the mermaid.js library (version compatible with Mermaid 11)
  2. Creates initialization scripts to configure mermaid rendering
  3. Adds CSS stylesheets for diagram theming
  4. Modifies the HTML templates to include the necessary <script> tags

During the subsequent mdbook build, the mermaid preprocessor:

  1. Detects code blocks with a ```mermaid fence
  2. Wraps them in <pre class="mermaid"> tags
  3. Leaves the mermaid syntax intact for client-side rendering

Sources: scripts/build-docs.sh:263-271

Build Execution

The actual mdBook build occurs at scripts/build-docs.sh:271 with a simple mdbook build command.

mdBook Build Process

Sources: scripts/build-docs.sh:268-271

Build Steps

  1. Configuration Parsing : mdBook reads book.toml to load title, authors, theme, and preprocessor configuration.

  2. Chapter Loading : mdBook parses SUMMARY.md to determine navigation structure and file order.

  3. Markdown Processing : Each markdown file is read from the src/ directory.

  4. Preprocessor Execution : The mdbook-mermaid preprocessor transforms mermaid code blocks into render-ready HTML.

  5. Theme Application : The rust theme provides CSS styling, JavaScript functionality, and HTML templates.

  6. Search Index : mdBook generates a searchable index from all content, enabling the built-in search feature.

  7. Output Generation : HTML files are written to the book/ subdirectory within the project root.

Sources: scripts/build-docs.sh:268-271

Output Structure

After the build completes, the system copies outputs to /output/ at scripts/build-docs.sh:274-295

Output Directory Structure

Sources: scripts/build-docs.sh:274-295

Output Artifacts

PathContentPurpose
/output/book/Complete HTML documentationServable website with search, navigation, and rendered diagrams
/output/markdown/Enhanced markdown filesSource files with injected headers/footers and diagrams
/output/raw_markdown/Pre-enhancement markdownDebug artifact showing initial conversion from HTML
/output/book.tomlmdBook configurationReference for reproduction or customization

The book/ directory contains:

  • index.html - Main entry point with navigation sidebar
  • print.html - Single-page version for printing
  • *.html - Individual chapter pages
  • searchindex.json - Search index data
  • searchindex.js - Search functionality
  • CSS and JavaScript assets for theming and interactivity

Sources: scripts/build-docs.sh:274-295 README.md:53-58

Local Serving

The generated HTML can be served locally using Python’s built-in HTTP server, as documented in README.md:26:
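A typical invocation (the README’s exact command may differ):

cd output/book
python3 -m http.server 8000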

This serves the documentation at http://localhost:8000, providing:

  • Full-text search functionality
  • Interactive navigation sidebar
  • Rendered Mermaid diagrams (client-side rendering)
  • Theme switching (light/dark/rust/coal/navy/ayu)
  • Print-friendly single-page view

Sources: README.md:26-29 scripts/build-docs.sh:307-308


Phase 1: Markdown Extraction

Relevant source files

This page documents Phase 1 of the three-phase processing pipeline, which handles the extraction and initial conversion of wiki content from DeepWiki.com into clean Markdown files. Phase 1 produces raw Markdown files in a temporary directory before diagram enhancement (Phase 2, see #7) and mdBook HTML generation (Phase 3, see #8).

For detailed information about specific sub-processes within Phase 1, see:

  • Wiki structure discovery algorithm: #6.1
  • HTML parsing and Markdown conversion: #6.2

Scope and Objectives

Phase 1 accomplishes the following:

  1. Discover all wiki pages and their hierarchical structure from DeepWiki
  2. Fetch HTML content for each page via HTTP requests
  3. Parse HTML to extract main content and remove UI elements
  4. Convert cleaned HTML to Markdown using html2text
  5. Organize output files into a hierarchical directory structure
  6. Save to a temporary directory for subsequent processing

This phase is implemented entirely in Python within deepwiki-scraper.py and operates independently of Phases 2 and 3.

Sources: README.md:121-128 tools/deepwiki-scraper.py:790-876

Phase 1 Execution Flow

The following diagram shows the complete execution flow of Phase 1, mapping high-level steps to specific functions in the codebase:

Sources: tools/deepwiki-scraper.py:790-876 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594

flowchart TD
    Start["main() Entry Point"]
    CreateTemp["Create tempfile.TemporaryDirectory()"]
    CreateSession["requests.Session() with User-Agent"]
    DiscoverPhase["Structure Discovery Phase"]
    ExtractWiki["extract_wiki_structure(repo, session)"]
    ParseLinks["BeautifulSoup: find_all('a', href=pattern)"]
    SortPages["sort by page number (handle dots)"]
    ExtractionPhase["Content Extraction Phase"]
    LoopPages["For each page in pages list"]
    FetchContent["extract_page_content(url, session, page_info)"]
    FetchHTML["fetch_page(url, session) with retries"]
    ParseHTML["BeautifulSoup(response.text)"]
    RemoveNav["Remove nav/header/footer/aside elements"]
    FindContent["Find main content: article/main/[role='main']"]
    ConvertPhase["Conversion Phase"]
    ConvertMD["convert_html_to_markdown(html_content)"]
    HTML2Text["html2text.HTML2Text with body_width=0"]
    CleanFooter["clean_deepwiki_footer(markdown)"]
    FixLinks["Regex replace: wiki links → .md paths"]
    SavePhase["File Organization Phase"]
    DetermineLevel{"page['level'] == 0?"}
    SaveRoot["Save to temp_dir/NUM-title.md"]
    CreateSubdir["Create temp_dir/section-N/"]
    SaveSubdir["Save to section-N/NUM-title.md"]
    NextPage{"More pages?"}
    Complete["Phase 1 Complete: temp_dir contains all .md files"]
    Start --> CreateTemp
    CreateTemp --> CreateSession
    CreateSession --> DiscoverPhase
    DiscoverPhase --> ExtractWiki
    ExtractWiki --> ParseLinks
    ParseLinks --> SortPages
    SortPages --> ExtractionPhase
    ExtractionPhase --> LoopPages
    LoopPages --> FetchContent
    FetchContent --> FetchHTML
    FetchHTML --> ParseHTML
    ParseHTML --> RemoveNav
    RemoveNav --> FindContent
    FindContent --> ConvertPhase
    ConvertPhase --> ConvertMD
    ConvertMD --> HTML2Text
    HTML2Text --> CleanFooter
    CleanFooter --> FixLinks
    FixLinks --> SavePhase
    SavePhase --> DetermineLevel
    DetermineLevel -->|Yes: Main Page| SaveRoot
    DetermineLevel -->|No: Subsection| CreateSubdir
    CreateSubdir --> SaveSubdir
    SaveRoot --> NextPage
    SaveSubdir --> NextPage
    NextPage -->|Yes| LoopPages
    NextPage -->|No| Complete

Core Components and Data Flow

Structure Discovery Pipeline

The structure discovery process identifies all wiki pages and builds a hierarchical page list:

Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:90-116 tools/deepwiki-scraper.py:118-123

flowchart LR
    subgraph Input
        BaseURL["Base URL\ndeepwiki.com/owner/repo"]
    end
    subgraph extract_wiki_structure
        FetchMain["fetch_page(base_url)"]
        ParseSoup["BeautifulSoup(response.text)"]
        FindLinks["soup.find_all('a', href=regex)"]
        ExtractInfo["Extract page number & title\nRegex: /(\d+(?:\.\d+)*)-(.+)$"]
        CalcLevel["Calculate level from dots\nlevel = page_num.count('.')"]
        BuildPages["Build pages list with metadata"]
        SortFunc["Sort by sort_key(page)\nparts = [int(x) for x in num.split('.')]"]
    end
    subgraph Output
        PagesList["List[Dict]\n{'number': '2.1',\n'title': 'Section',\n'url': full_url,\n'href': path,\n'level': 1}"]
    end
    BaseURL --> FetchMain
    FetchMain --> ParseSoup
    ParseSoup --> FindLinks
    FindLinks --> ExtractInfo
    ExtractInfo --> CalcLevel
    CalcLevel --> BuildPages
    BuildPages --> SortFunc
    SortFunc --> PagesList

Content Extraction and Cleaning

Each page undergoes a multi-step cleaning process to remove DeepWiki UI elements:

Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-190 tools/deepwiki-scraper.py:127-173

flowchart TD
    subgraph fetch_page
        MakeRequest["requests.get(url, headers)\nUser-Agent: Mozilla/5.0..."]
        RetryLogic["Retry up to 3 times\n2 second delay between attempts"]
        CheckStatus["response.raise_for_status()"]
    end
    subgraph extract_page_content
        ParsePage["BeautifulSoup(response.text)"]
        RemoveUnwanted["Decompose: nav, header, footer,\naside, .sidebar, script, style"]
        FindMain["Try selectors in order:\narticle → main → .wiki-content\n→ [role='main'] → body"]
        RemoveUI["Remove DeepWiki UI elements:\n'Edit Wiki', 'Last indexed:',\n'Index your code with Devin'"]
        RemoveNavLists["Remove navigation <ul> lists\n(80%+ internal wiki links)"]
    end
    subgraph convert_html_to_markdown
        HTML2TextInit["h = html2text.HTML2Text()\nh.ignore_links = False\nh.body_width = 0"]
        HandleContent["markdown = h.handle(html_content)"]
        CleanFooterCall["clean_deepwiki_footer(markdown)"]
    end
    subgraph clean_deepwiki_footer
        SplitLines["lines = markdown.split('\\n')"]
        ScanBackward["Scan last 50 lines backward\nfor footer patterns"]
        MatchPatterns["Regex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
        TruncateLines["lines = lines[:footer_start]"]
        RemoveEmpty["Remove trailing empty lines"]
    end
    MakeRequest --> RetryLogic
    RetryLogic --> CheckStatus
    CheckStatus --> ParsePage
    ParsePage --> RemoveUnwanted
    RemoveUnwanted --> FindMain
    FindMain --> RemoveUI
    RemoveUI --> RemoveNavLists
    RemoveNavLists --> HTML2TextInit
    HTML2TextInit --> HandleContent
    HandleContent --> CleanFooterCall
    CleanFooterCall --> SplitLines
    SplitLines --> ScanBackward
    ScanBackward --> MatchPatterns
    MatchPatterns --> TruncateLines
    TruncateLines --> RemoveEmpty

Link Rewriting

Phase 1 transforms internal DeepWiki links into relative Markdown file paths. The rewriting logic accounts for the hierarchical directory structure:

Sources: tools/deepwiki-scraper.py:549-592

flowchart TD
    subgraph Input
        WikiLink["DeepWiki Link\n[text](/owner/repo/2-1-section)"]
        SourcePage["Current Page Info\n{level: 1, number: '2.1'}"]
    end
    subgraph fix_wiki_link
        ExtractPath["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
        ParseNumbers["Extract: page_num='2.1', slug='section'"]
        ConvertNum["file_num = page_num.replace('.', '-')\nResult: '2-1'"]
        CheckTarget{"Is target\nsubsection?\n(has '.')"}
        CheckSource{"Is source\nsubsection?\n(level > 0)"}
        CheckSame{"Same main\nsection?"}
        PathSameSection["Relative path:\nfile_num-slug.md"]
        PathDiffSection["Full path:\nsection-N/file_num-slug.md"]
        PathToMain["Up one level:\n../file_num-slug.md"]
        PathMainToMain["Same level:\nfile_num-slug.md"]
    end
    subgraph Output
        MDLink["Markdown Link\n[text](2-1-section.md)\nor [text](section-2/2-1-section.md)\nor [text](../2-1-section.md)"]
    end
    WikiLink --> ExtractPath
    ExtractPath --> ParseNumbers
    ParseNumbers --> ConvertNum
    ConvertNum --> CheckTarget
    CheckTarget -->|Yes| CheckSource
    CheckTarget -->|No: Main Page| CheckSource
    CheckSource -->|Target: Sub, Source: Sub| CheckSame
    CheckSource -->|Target: Sub, Source: Main| PathDiffSection
    CheckSource -->|Target: Main, Source: Sub| PathToMain
    CheckSource -->|Target: Main, Source: Main| PathMainToMain
    CheckSame -->|Yes| PathSameSection
    CheckSame -->|No| PathDiffSection
    PathSameSection --> MDLink
    PathDiffSection --> MDLink
    PathToMain --> MDLink
    PathMainToMain --> MDLink

File Organization Strategy

Phase 1 organizes output files into a hierarchical directory structure based on page levels:

Directory Structure Rules

| Page Level | Page Number Format | Directory Location | Filename Pattern | Example |
|---|---|---|---|---|
| 0 (Main) | 1, 2, 3 | temp_dir/ (root) | {num}-{slug}.md | 1-overview.md |
| 1 (Subsection) | 2.1, 3.4 | temp_dir/section-{N}/ | {num}-{slug}.md | section-2/2-1-workspace.md |

File Organization Implementation

Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:845-868

HTTP Session Configuration

Phase 1 uses a persistent requests.Session with browser-like headers and retry logic:

Session Setup

Retry Strategy

Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:817-821

Data Structures

Page Metadata Dictionary

Each page discovered by extract_wiki_structure() is represented as a dictionary:
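Its shape is illustrated below (field values are examples):

```python
page = {
    "number": "2.1",                  # page number string
    "title": "Section Title",         # link text
    "url": "https://deepwiki.com/owner/repo/2-1-section",
    "href": "/owner/repo/2-1-section",
    "level": 1,                       # "2.1".count(".") == 1
}
```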

Sources: tools/deepwiki-scraper.py:109-115

BeautifulSoup Content Selectors

Phase 1 attempts multiple selector strategies to find main content, in priority order:

| Priority | Selector Type | Selector Value | Rationale |
|---|---|---|---|
| 1 | CSS Selector | article | Semantic HTML5 element for main content |
| 2 | CSS Selector | main | HTML5 main landmark element |
| 3 | CSS Selector | .wiki-content | Common class name for wiki content |
| 4 | CSS Selector | .content | Generic content class |
| 5 | CSS Selector | #content | Generic content ID |
| 6 | CSS Selector | .markdown-body | GitHub-style markdown container |
| 7 | Attribute | role="main" | ARIA landmark role |
| 8 | Fallback | body | Last resort: entire body |

Sources: tools/deepwiki-scraper.py:472-484

Error Handling and Robustness

Page Extraction Error Handling

Phase 1 implements graceful degradation for individual page failures.

Sources: tools/deepwiki-scraper.py:841-876

Content Extraction Fallbacks

If primary content selectors fail, Phase 1 applies fallback strategies:

  1. Content Selector Fallback Chain : Try 8 different selectors (see table above)
  2. Empty Content Check : Raises exception if no content element found tools/deepwiki-scraper.py:486-487
  3. HTTP Retry Logic : 3 attempts with a 2-second delay between retries
  4. Session Persistence : Reuses TCP connections for efficiency

Sources: tools/deepwiki-scraper.py:472-487 tools/deepwiki-scraper.py:27-42

Output Format

Temporary Directory Structure

At the end of Phase 1, the temporary directory contains the following structure:

temp_dir/
├── 1-overview.md                    # Main page (level 0)
├── 2-architecture.md                # Main page (level 0)
├── 3-components.md                  # Main page (level 0)
├── section-2/                       # Subsections of page 2
│   ├── 2-1-workspace-and-crates.md  # Subsection (level 1)
│   └── 2-2-dependency-graph.md      # Subsection (level 1)
└── section-4/                       # Subsections of page 4
    ├── 4-1-logical-planning.md
    └── 4-2-physical-planning.md

Markdown File Format

Each generated Markdown file has the following characteristics:

  • Title : Always starts with # {Page Title} heading
  • Content : Cleaned HTML converted to Markdown via html2text
  • Links : Internal wiki links rewritten to relative .md paths
  • No Diagrams : Diagrams are added in Phase 2 (see #7)
  • No Footer : DeepWiki UI elements removed via clean_deepwiki_footer()
  • Encoding : UTF-8

Sources: tools/deepwiki-scraper.py:862-868 tools/deepwiki-scraper.py:127-173

Phase 1 Completion Criteria

Phase 1 is considered complete when:

  1. All pages discovered by extract_wiki_structure() have been processed
  2. Each page’s Markdown file has been written to the temporary directory
  3. Directory structure (main pages + section-N/ subdirectories) has been created
  4. Success count is reported: "✓ Successfully extracted N/M pages to temp directory"

The temporary directory is then passed to Phase 2 for diagram enhancement.

Sources: tools/deepwiki-scraper.py:877 tools/deepwiki-scraper.py:596-788


Wiki Structure Discovery

Relevant source files

Purpose and Scope

This document describes the wiki structure discovery mechanism in Phase 1 of the processing pipeline. The system analyzes the main DeepWiki repository page to identify all available wiki pages and their hierarchical relationships. This discovery phase produces a structured page list that drives subsequent content extraction.

For the HTML-to-Markdown conversion that follows discovery, see HTML to Markdown Conversion. For the overall Phase 1 process, see Phase 1: Markdown Extraction.

Overview

The discovery process fetches the main wiki page for a repository and parses its HTML to extract all wiki page references. The system identifies both main pages (e.g., 1, 2, 3) and subsections (e.g., 2.1, 2.2, 3.1) by analyzing link patterns. The output is a sorted list of page metadata that includes page numbers, titles, URLs, and hierarchical levels.

flowchart TD
    Start["main() entry point"] --> ValidateRepo["Validate repo format\n(owner/repo)"]
    ValidateRepo --> CreateSession["Create requests.Session\nwith User-Agent headers"]
    CreateSession --> CallExtract["extract_wiki_structure(repo, session)"]
    CallExtract --> FetchMain["Fetch https://deepwiki.com/{repo}"]
    FetchMain --> ParseHTML["BeautifulSoup(response.text)"]
    ParseHTML --> FindLinks["soup.find_all('a', href=regex)"]
    FindLinks --> IterateLinks["Iterate over all links"]
    IterateLinks --> ExtractPattern["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
    ExtractPattern --> BuildPageDict["Build page dict:\n{number, title, url, href, level}"]
    BuildPageDict --> CheckDupe{"href in seen_urls?"}
    CheckDupe -->|Yes| IterateLinks
    CheckDupe -->|No| AddToList["pages.append(page_dict)"]
    AddToList --> IterateLinks
    IterateLinks -->|Done| SortPages["Sort by numeric parts:\nsort_key([int(x) for x in num.split('.')])"]
    SortPages --> ReturnPages["Return pages list"]
    ReturnPages --> ProcessPages["Process each page\nin main loop"]
    style CallExtract fill:#f9f,stroke:#333,stroke-width:2px
    style ExtractPattern fill:#f9f,stroke:#333,stroke-width:2px
    style SortPages fill:#f9f,stroke:#333,stroke-width:2px

Discovery Flow Diagram

Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831

Main Discovery Function

The extract_wiki_structure function performs the core discovery logic. It accepts a repository identifier (e.g., "jzombie/deepwiki-to-mdbook") and an HTTP session object, then returns a list of page dictionaries.

Function Signature and Entry Point

Sources: tools/deepwiki-scraper.py:78-79

HTTP Request and HTML Parsing

The function constructs the base URL and fetches the main wiki page:
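Conceptually (fetch_page and session are described elsewhere on this page):

```python
from bs4 import BeautifulSoup

repo = "jzombie/deepwiki-to-mdbook"
base_url = f"https://deepwiki.com/{repo}"
response = fetch_page(base_url, session)          # retrying HTTP fetch helper
soup = BeautifulSoup(response.text, "html.parser")
```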

The fetch_page helper includes retry logic (3 attempts) and browser-like headers to handle transient failures.

Sources: tools/deepwiki-scraper.py:80-84 tools/deepwiki-scraper.py:27-42

The system uses a compiled regex pattern to find all wiki page links:
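A sketch consistent with the described ^/{repo_pattern}/\d+ pattern:

```python
import re
from bs4 import BeautifulSoup

repo = "jzombie/deepwiki-to-mdbook"
html = '<a href="/jzombie/deepwiki-to-mdbook/1-overview">Overview</a>'
soup = BeautifulSoup(html, "html.parser")

# Match hrefs that start with /{repo}/<page-number>
link_pattern = re.compile(rf"^/{re.escape(repo)}/\d+")
links = soup.find_all("a", href=link_pattern)
print([a["href"] for a in links])
```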

This pattern matches URLs like:

  • /jzombie/deepwiki-to-mdbook/1-overview
  • /jzombie/deepwiki-to-mdbook/2-quick-start
  • /jzombie/deepwiki-to-mdbook/2-1-basic-usage

Sources: tools/deepwiki-scraper.py:88-90

Page Information Extraction

For each matched link, the system extracts page metadata using a detailed regex pattern:

The regex r'/(\d+(?:\.\d+)*)-(.+)$' captures:

  • Group 1: Page number with optional dots (e.g., 1, 2.1, 3.2.1)
  • Group 2: URL slug (e.g., overview, basic-usage)

Sources: tools/deepwiki-scraper.py:98-107

Sources: tools/deepwiki-scraper.py:98-115

Deduplication and Sorting

Deduplication Strategy

The system maintains a seen_urls set to prevent duplicate page entries:
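Conceptually (page-dict construction elided):

```python
seen_urls = set()
pages = []
for link in links:
    href = link.get("href", "")
    if href in seen_urls:
        continue
    seen_urls.add(href)
    pages.append(href)  # the real code appends the full page dict
```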

Sources: tools/deepwiki-scraper.py:92-116

Hierarchical Sorting

Pages are sorted by their numeric components to maintain proper ordering. This ensures ordering like 1 → 2 → 2.1 → 2.2 → 3 → 3.1.

Sources: tools/deepwiki-scraper.py:118-123

Sorting Example

| Before Sorting (Link Order) | Page Number | After Sorting (Numeric Order) |
|---|---|---|
| /3-phase-3 | 3 | /1-overview |
| /2-1-subsection-one | 2.1 | /2-quick-start |
| /1-overview | 1 | /2-1-subsection-one |
| /2-quick-start | 2 | /2-2-subsection-two |
| /2-2-subsection-two | 2.2 | /3-phase-3 |

Page Data Structure

Page Dictionary Schema

Each discovered page is represented as a dictionary.

Sources: tools/deepwiki-scraper.py:109-115

Level Calculation

The level field indicates hierarchical depth:

| Page Number | Level | Type |
|---|---|---|
| 1 | 0 | Main page |
| 2 | 0 | Main page |
| 2.1 | 1 | Subsection |
| 2.2 | 1 | Subsection |
| 3.1.1 | 2 | Sub-subsection |

Sources: tools/deepwiki-scraper.py:106-114

Discovery Result Processing

Output Statistics

After discovery, the system categorizes pages and reports statistics.

Sources: tools/deepwiki-scraper.py:824-837

Integration with Content Extraction

The discovered page list drives the extraction loop in main().

Sources: tools/deepwiki-scraper.py:841-860

Alternative Discovery Method (Unused)

Subsection Probing Function

The codebase includes a discover_subsections function that uses HTTP HEAD requests to probe for subsections, but this function is not invoked in the current implementation.

This function attempts to discover subsections by making HEAD requests to potential URLs (e.g., /repo/2-1-, /repo/2-2-). However, the actual implementation relies entirely on parsing links from the main wiki page.

Sources: tools/deepwiki-scraper.py:44-76

Discovery Method Comparison

Sources: tools/deepwiki-scraper.py:44-76 tools/deepwiki-scraper.py:78-125

Error Handling

No Pages Found

The system validates that at least one page was discovered:
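Conceptually (the error message text is assumed):

```python
import sys

pages = extract_wiki_structure(repo, session)
if not pages:
    print("Error: no wiki pages found")
    sys.exit(1)
```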

Sources: tools/deepwiki-scraper.py:828-830

Network Failures

The fetch_page function includes retry logic.

Sources: tools/deepwiki-scraper.py:33-42

Summary

The wiki structure discovery process provides a robust mechanism for identifying all pages in a DeepWiki repository through a single HTML parse operation. The resulting page list is hierarchically organized and drives all subsequent extraction operations in Phase 1.

Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831


HTML to Markdown Conversion

Relevant source files

This page documents the HTML to Markdown conversion process in Phase 1 of the pipeline. After the wiki structure is discovered (see Wiki Structure Discovery), each page’s HTML content is fetched, cleaned, and converted to Markdown format. This conversion prepares the content for diagram enhancement in Phase 2 (see Phase 2: Diagram Enhancement).

Conversion Pipeline

The HTML to Markdown conversion follows a multi-step pipeline that progressively cleans and transforms the content. The process is orchestrated by the extract_page_content function and involves HTML parsing, element removal, conversion, and post-processing.

Conversion Pipeline Flow

graph TB
    fetch["fetch_page()\n[65-80]"]
    parse["BeautifulSoup()\nHTML Parser"]
    remove1["Remove Navigation\nElements [761-762]"]
    remove2["Find Main Content\nArea [765-782]"]
    remove3["Remove DeepWiki UI\nElements [786-806]"]
    convert["convert_html_to_markdown()\n[213-228]"]
    clean["clean_deepwiki_footer()\n[165-211]"]
    format["format_source_references()\n[397-406]"]
    links["Fix Internal Links\n[854-875]"]
    output["Cleaned Markdown\nOutput"]
    fetch --> parse
    parse --> remove1
    remove1 --> remove2
    remove2 --> remove3
    remove3 --> convert
    convert --> clean
    clean --> format
    format --> links
    links --> output

Sources: python/deepwiki-scraper.py:751-877

HTML Parsing and Content Extraction

The conversion begins by fetching the HTML page using a requests.Session with browser-like headers to avoid bot detection. BeautifulSoup parses the HTML into a navigable tree structure.

Content Area Detection

The system uses a cascading selector strategy to locate the main content area, trying multiple selectors in order of preference:

| Priority | Selector Type | Example |
|---|---|---|
| 1 | Semantic HTML5 tags | article, main |
| 2 | Class-based selectors | .wiki-content, .content, .markdown-body |
| 3 | ID-based selectors | #content |
| 4 | ARIA role attributes | role="main" |
| 5 | Fallback | body tag |

Content Detection Logic

Sources: python/deepwiki-scraper.py:765-782

DeepWiki UI Element Removal

Before conversion, the system removes DeepWiki-specific navigation and UI elements that would pollute the final documentation. This occurs in two stages: pre-processing element removal and footer cleanup.

Pre-Processing Element Removal

The first stage removes structural elements and UI components using CSS selectors.

The second stage removes DeepWiki-specific text-based UI elements by scanning for characteristic strings:

| UI Element | Detection String | Max Length |
|---|---|---|
| Code indexing prompt | “Index your code with Devin” | 200 chars |
| Edit controls | “Edit Wiki” | 200 chars |
| Indexing status | “Last indexed:” | 200 chars |
| Search links | “View this search on DeepWiki” | 200 chars |

Sources: python/deepwiki-scraper.py:761-762 python/deepwiki-scraper.py:786-795

DeepWiki pages include navigation lists that link to all wiki pages. The system detects and removes these by identifying unordered lists (<ul>) with the following characteristics:

  1. Contains more than 5 links
  2. At least 80% of links are internal (start with /)

graph LR
    ul["Find all ul\nElements"]
count["Count Links\nin List"]
check1{"More than\n5 links?"}
check2{"80%+ are\ninternal?"}
remove["Remove ul\nElement"]
keep["Keep ul\nElement"]
ul --> count
 
   count --> check1
 
   check1 -->|Yes| check2
 
   check1 -->|No| keep
 
   check2 -->|Yes| remove
 
   check2 -->|No| keep

Navigation List Detection

Sources: python/deepwiki-scraper.py:799-806
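A sketch of this list-removal heuristic, using the thresholds documented above:

```python
def remove_navigation_lists(soup) -> None:
    """Drop <ul> elements that look like wiki-wide navigation."""
    for ul in soup.find_all('ul'):
        links = ul.find_all('a')
        if len(links) <= 5:
            continue  # rule 1: must contain more than 5 links
        internal = [a for a in links if (a.get('href') or '').startswith('/')]
        if len(internal) / len(links) >= 0.8:  # rule 2: 80%+ internal
            ul.decompose()
```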

html2text Conversion

After cleaning the HTML, the system uses the html2text library to convert HTML to Markdown. The conversion is configured with specific settings to preserve link structure and prevent line wrapping.

html2text Configuration

The body_width = 0 setting is critical because it prevents the converter from introducing artificial line breaks that would break code blocks and formatted content.
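A minimal configuration sketch; body_width = 0 is documented above, while the other settings shown are assumptions:

```python
import html2text

converter = html2text.HTML2Text()
converter.body_width = 0        # disable line wrapping, as explained above
converter.ignore_links = False  # keep inline links (assumed setting)

markdown = converter.handle("<p>Hello <a href='/wiki'>wiki</a></p>")
```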

Important: Mermaid diagram extraction is explicitly disabled at this stage. DeepWiki’s Next.js payload contains diagrams from ALL pages mixed together, making per-page extraction unreliable. Diagrams are handled separately in Phase 2 using fuzzy matching (see Fuzzy Matching Algorithm).

Sources: python/deepwiki-scraper.py:213-228

Footer Cleanup

graph TB
    scan["Scan Last 50 Lines\nBackwards"]
patterns["Check Against\nFooter Patterns"]
found{"Pattern\nMatch?"}
backward["Scan Backward\n20 More Lines"]
content{"Hit Real\nContent?"}
cut["Cut Lines from\nFooter Start"]
trim["Trim Trailing\nEmpty Lines"]
scan --> patterns
 
   patterns --> found
 
   found -->|Yes| backward
 
   found -->|No| trim
 
   backward --> content
 
   content -->|Yes| cut
 
   content -->|No| backward
 
   cut --> trim

The clean_deepwiki_footer function removes DeepWiki’s footer UI elements that appear at the end of each page. It uses regex patterns to detect footer markers and removes everything from that point onward.

The footer patterns are compiled regex expressions:

| Pattern | Purpose | Example Match |
| --- | --- | --- |
| ^\s*Dismiss\s*$ | Close button | "Dismiss" |
| Refresh this wiki | Refresh controls | "Refresh this wiki" |
| This wiki was recently refreshed | Status message | Various timestamps |
| ###\s*On this page | Page navigation | "### On this page" |
| Please wait \d+ days? | Rate limiting | "Please wait 7 days" |
| View this search on DeepWiki | Search link | Exact match |
| ^\s*Edit Wiki\s*$ | Edit button | "Edit Wiki" |

Sources: python/deepwiki-scraper.py:165-211
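A condensed sketch of the scan-and-cut behavior, using a subset of the patterns above; the real function also scans backward to find where the footer actually starts:

```python
import re

FOOTER_PATTERNS = [
    re.compile(r'^\s*Dismiss\s*$'),
    re.compile(r'Refresh this wiki'),
    re.compile(r'###\s*On this page'),
    re.compile(r'^\s*Edit Wiki\s*$'),
]

def clean_deepwiki_footer(markdown: str) -> str:
    """Cut everything from the first footer marker near the end of the page."""
    lines = markdown.split('\n')
    for i in range(max(0, len(lines) - 50), len(lines)):
        if any(p.search(lines[i]) for p in FOOTER_PATTERNS):
            lines = lines[:i]
            break
    # Trim trailing empty lines.
    while lines and not lines[-1].strip():
        lines.pop()
    return '\n'.join(lines)
```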

Post-Processing Steps

After initial conversion, two post-processing steps refine the Markdown output: source reference formatting and internal link rewriting.

Source Reference Formatting

The format_source_references function inserts colons between filenames and line numbers in source code references. This transforms patterns like [path/to/file10-20] into [path/to/file:10-20].

Pattern Matching:

  • Regex: \[([A-Za-z0-9._/-]+?)(\d+-\d+)\]
  • Capture Group 1: Filename path
  • Capture Group 2: Line number range
  • Output: [filename:linerange]
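A sketch built directly from the documented regex:

```python
import re

SOURCE_REF = re.compile(r'\[([A-Za-z0-9._/-]+?)(\d+-\d+)\]')

def format_source_references(markdown: str) -> str:
    """Insert a colon between the file path and the line-number range."""
    return SOURCE_REF.sub(r'[\1:\2]', markdown)

# format_source_references('[tools/scraper.py120-140]')
# -> '[tools/scraper.py:120-140]'
```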

Sources: python/deepwiki-scraper.py:395-406

Internal Link Rewriting

graph TB
    find["Find Link Pattern:\n/owner/repo/page"]
extract["Extract page_num\nand slug"]
normalize["normalized_number_parts()\n[28-43]"]
build["build_target_path()\n[55-63]"]
relative{"Source in\nSubsection?"}
same{"Target in Same\nSection?"}
diff["Prefix: ../\nDifferent section"]
none["Prefix: ../\nTop level"]
local["No prefix\nSame section"]
find --> extract
 
   extract --> normalize
 
   normalize --> build
 
   build --> relative
 
   relative -->|Yes| same
 
   relative -->|No| done["Return\nRelative Path"]
same -->|Yes| local
 
   same -->|No| diff
 
   diff --> done
 
   local --> done
 
   none --> done

DeepWiki uses absolute URLs for internal wiki links (e.g., /owner/repo/4-query-planning). The system rewrites these to relative Markdown file paths using the build_target_path function.

Link Rewriting Process

Path Resolution Examples:

| Source File | Target Link | Resolved Path |
| --- | --- | --- |
| 1-overview.md | /repo/2-architecture | 2-architecture.md |
| section-2/2-1-pipeline.md | /repo/2-2-build | 2-2-build.md |
| section-2/2-1-pipeline.md | /repo/3-config | ../3-config.md |
| 1-overview.md | /repo/2-1-subsection | section-2/2-1-subsection.md |

Sources: python/deepwiki-scraper.py:854-875 python/deepwiki-scraper.py:55-63 python/deepwiki-scraper.py:28-43

Duplicate Content Removal

The final cleanup step removes duplicate titles and stray “Menu” text that may appear in the converted Markdown. The system tracks whether a title has been seen and skips subsequent occurrences if they match the first title exactly.

Cleanup Rules:

  1. Skip standalone “Menu” lines
  2. Keep first # Title occurrence
  3. Skip duplicate titles that match the first title
  4. Preserve all other content
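A sketch of these rules; the function name is hypothetical, not the scraper's actual identifier:

```python
def remove_duplicate_titles(lines):
    """Apply the four cleanup rules listed above."""
    cleaned, first_title = [], None
    for line in lines:
        stripped = line.strip()
        if stripped == 'Menu':
            continue  # rule 1: drop standalone "Menu" lines
        if stripped.startswith('# '):
            if first_title is None:
                first_title = stripped  # rule 2: keep the first title
            elif stripped == first_title:
                continue  # rule 3: skip exact duplicates of the first title
        cleaned.append(line)  # rule 4: preserve everything else
    return cleaned
```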

Sources: python/deepwiki-scraper.py:820-841

Output Format

The final output is clean Markdown with the following characteristics:

  • Title guaranteed to be present (added if missing)
  • No DeepWiki UI elements
  • No artificial line wrapping
  • Relative internal links
  • Formatted source references
  • Stripped trailing whitespace

The output is written to temporary storage before diagram enhancement in Phase 2. A snapshot of this raw Markdown (without diagrams) is saved to raw_markdown/ for debugging purposes.

Sources: python/deepwiki-scraper.py:1357-1366

Phase 2: Diagram Enhancement

Relevant source files

Purpose and Scope

Phase 2 performs intelligent diagram extraction and placement after Phase 1 has generated clean markdown files. This phase extracts Mermaid diagrams from DeepWiki’s JavaScript payload, matches them to appropriate locations in the markdown content using fuzzy text matching, and inserts them contextually after relevant paragraphs.

For information about the initial markdown extraction that precedes this phase, see Phase 1: Markdown Extraction. For details on the specific fuzzy matching algorithm implementation, see Fuzzy Diagram Matching Algorithm. For information about the extraction patterns used, see Diagram Extraction from Next.js.

Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789

The Client-Side Rendering Problem

DeepWiki renders diagrams client-side using JavaScript, making them invisible to traditional HTML scraping. All Mermaid diagrams are embedded in a JavaScript payload (self.__next_f.push) that contains diagram code for all pages in the wiki, not just the current page. This creates a matching problem: given ~461 diagrams in a single payload and individual markdown files, how does the system determine which diagrams belong in which files?

Key challenges:

  • Diagrams are escaped JavaScript strings (\n, \t, \")
  • No metadata associates diagrams with specific pages
  • html2text conversion changes text formatting from the original JavaScript context
  • Must avoid false positives (placing diagrams in wrong locations)

Sources: tools/deepwiki-scraper.py:458-461 README.md:131-136

Architecture Overview

Diagram: Phase 2 Processing Pipeline

Sources: tools/deepwiki-scraper.py:596-789

Diagram Extraction Process

The extraction process reads the JavaScript payload from any DeepWiki page and locates all Mermaid diagram blocks using regex pattern matching.

flowchart TD
    Start["extract_and_enhance_diagrams()"]
FetchURL["Fetch https://deepwiki.com/repo/1-overview"]
subgraph "Pattern Matching"
        Pattern1["Pattern: r'```mermaid\\\\\n(.*?)```'\n(re.DOTALL)"]
Pattern2["Pattern: r'([^`]{500,}?)```mermaid\\\\ (.*?)```'\n(with context)"]
FindAll["re.findall() → all_diagrams list"]
FindIter["re.finditer() → diagram_contexts with context"]
end
    
    subgraph "Unescaping"
        ReplaceNewline["Replace '\\\\\n' → newline"]
ReplaceTab["Replace '\\\\ ' → tab"]
ReplaceQuote["Replace '\\\\\"' → double-quote"]
ReplaceUnicode["Replace Unicode escapes:\n\\\< → '<'\n\\\> → '>'\n\\\& → '&'"]
end
    
    subgraph "Context Processing"
        Last500["Extract last 500 chars of context"]
FindHeading["Scan for last heading starting with #"]
ExtractAnchor["Extract last 2-3 non-heading lines\n(min 20 chars each)"]
BuildDict["Build dict: {last_heading, anchor_text, diagram}"]
end
    
 
   Start --> FetchURL
 
   FetchURL --> Pattern1
 
   FetchURL --> Pattern2
 
   Pattern1 --> FindAll
 
   Pattern2 --> FindIter
    
 
   FindAll --> ReplaceNewline
 
   FindIter --> ReplaceNewline
 
   ReplaceNewline --> ReplaceTab
 
   ReplaceTab --> ReplaceQuote
 
   ReplaceQuote --> ReplaceUnicode
    
 
   ReplaceUnicode --> Last500
 
   Last500 --> FindHeading
 
   FindHeading --> ExtractAnchor
 
   ExtractAnchor --> BuildDict
    
 
   BuildDict --> Output["Returns:\n- all_diagrams count\n- diagram_contexts list"]

Extraction Function Flow

Diagram: Diagram Extraction and Context Building

Sources: tools/deepwiki-scraper.py:604-674

Key Implementation Details

| Component | Implementation | Location |
| --- | --- | --- |
| Regex Pattern | r'```mermaid\\n(.*?)```' with re.DOTALL flag | tools/deepwiki-scraper.py:615 |
| Context Pattern | r'([^`]{500,}?)```mermaid\\n(.*?)```' captures 500+ chars of context | tools/deepwiki-scraper.py:621 |
| Unescape Operations | replace('\\n', '\n'), replace('\\t', '\t'), etc. | tools/deepwiki-scraper.py:628-635 tools/deepwiki-scraper.py:639-645 |
| Heading Detection | line.startswith('#') on reversed context lines | tools/deepwiki-scraper.py:652-656 |
| Anchor Extraction | Last 2-3 lines with len(line) > 20, max 300 chars | tools/deepwiki-scraper.py:658-666 |
| Context Storage | Dict with keys: last_heading, anchor_text, diagram | tools/deepwiki-scraper.py:668-672 |

Sources: tools/deepwiki-scraper.py:614-674

Fuzzy Matching Algorithm

The fuzzy matching algorithm determines where each diagram should be inserted by finding the best match between the diagram’s context and the markdown file’s content.

flowchart TD
    Start["For each diagram_contexts[idx]"]
CheckUsed["idx in diagrams_used?"]
Skip["Skip to next diagram"]
subgraph "Text Normalization"
        NormFile["Normalize file content:\ncontent.lower()\n' '.join(content.split())"]
NormAnchor["Normalize anchor_text:\nanchor.lower()\n' '.join(anchor.split())"]
NormHeading["Normalize heading:\nheading.lower().replace('#', '').strip()"]
end
    
    subgraph "Progressive Chunk Matching"
        Try300["Try chunk_size=300"]
Try200["Try chunk_size=200"]
Try150["Try chunk_size=150"]
Try100["Try chunk_size=100"]
Try80["Try chunk_size=80"]
ExtractChunk["test_chunk = anchor_normalized[-chunk_size:]"]
FindPos["pos = content_normalized.find(test_chunk)"]
CheckPos["pos != -1?"]
ConvertLine["Convert char position to line number"]
RecordMatch["Record best_match_line, best_match_score"]
end
    
    subgraph "Heading Fallback"
        IterLines["For each line in markdown"]
CheckHeadingLine["line.strip().startswith('#')?"]
NormalizeLine["Normalize line heading"]
CheckContains["heading_normalized in line_normalized?"]
RecordHeadingMatch["best_match_line = line_num\nbest_match_score = 50"]
end
    
 
   Start --> CheckUsed
 
   CheckUsed -->|Yes| Skip
 
   CheckUsed -->|No| NormFile
    
 
   NormFile --> NormAnchor
 
   NormAnchor --> Try300
 
   Try300 --> ExtractChunk
 
   ExtractChunk --> FindPos
 
   FindPos --> CheckPos
 
   CheckPos -->|Found| ConvertLine
 
   CheckPos -->|Not found| Try200
 
   ConvertLine --> RecordMatch
    
 
   Try200 --> Try150
 
   Try150 --> Try100
 
   Try100 --> Try80
 
   Try80 -->|All failed| IterLines
    
 
   RecordMatch --> Success["Return match with score"]
IterLines --> CheckHeadingLine
 
   CheckHeadingLine -->|Yes| NormalizeLine
 
   NormalizeLine --> CheckContains
 
   CheckContains -->|Yes| RecordHeadingMatch
 
   RecordHeadingMatch --> Success

Matching Strategy

Diagram: Progressive Chunk Matching with Fallback

Sources: tools/deepwiki-scraper.py:708-746

Chunk Size Progression

The algorithm tries progressively smaller chunk sizes to accommodate variations in text formatting between the JavaScript context and the html2text-converted markdown:

| Chunk Size | Use Case | Success Rate |
| --- | --- | --- |
| 300 chars | Perfect or near-perfect matches | Highest precision |
| 200 chars | Minor formatting differences | Good precision |
| 150 chars | Moderate text variations | Acceptable precision |
| 100 chars | Significant reformatting | Lower precision |
| 80 chars | Minimal context available | Lowest precision |
| Heading match | Fallback when text matching fails | Score: 50 |

The algorithm stops at the first successful match, prioritizing larger chunks for higher confidence.

Sources: tools/deepwiki-scraper.py:716-730 README.md:134

flowchart TD
    Start["Found best_match_line"]
CheckType["lines[best_match_line].strip().startswith('#')?"]
subgraph "Heading Case"
        H1["insert_line = best_match_line + 1"]
H2["Skip blank lines after heading"]
H3["Skip through paragraph content"]
H4["Stop at next blank line or heading"]
end
    
    subgraph "Paragraph Case"
        P1["insert_line = best_match_line + 1"]
P2["Find end of current paragraph"]
P3["Stop at next blank line or heading"]
end
    
    subgraph "Insertion Format"
        I1["Insert: empty line"]
I2["Insert: ```mermaid"]
I3["Insert: diagram code"]
I4["Insert: ```"]
I5["Insert: empty line"]
end
    
 
   Start --> CheckType
 
   CheckType -->|Heading| H1
 
   CheckType -->|Paragraph| P1
    
 
   H1 --> H2
 
   H2 --> H3
 
   H3 --> H4
    
 
   P1 --> P2
 
   P2 --> P3
    
 
   H4 --> I1
 
   P3 --> I1
    
 
   I1 --> I2
 
   I2 --> I3
 
   I3 --> I4
 
   I4 --> I5
    
 
   I5 --> Append["Append to pending_insertions list:\n(insert_line, diagram, score, idx)"]

Insertion Point Logic

After finding a match, the system determines the precise line number where the diagram should be inserted.

Insertion Algorithm

Diagram: Insertion Point Calculation

Sources: tools/deepwiki-scraper.py:747-768

graph LR
    Collect["Collect all\npending_insertions"]
Sort["Sort by insert_line\n(descending)"]
Insert["Insert from bottom to top\npreserves line numbers"]
Write["Write enhanced file\nto temp_dir"]
Collect --> Sort
 
   Sort --> Insert
 
   Insert --> Write

Batch Insertion Strategy

Diagrams are inserted in descending line order to avoid invalidating insertion points:

Diagram: Batch Insertion Order

Implementation:
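A condensed sketch of the bottom-up insertion; the real code also computes a dynamic fence length (see Mermaid Normalization):

```python
def insert_diagrams(lines, pending_insertions):
    """Insert (insert_line, diagram, score, idx) tuples bottom-up."""
    # Descending order: inserting lower in the file never shifts
    # the positions of insertions that come above it.
    pending_insertions.sort(key=lambda entry: entry[0], reverse=True)
    for insert_line, diagram, _score, _idx in pending_insertions:
        lines[insert_line:insert_line] = ['', '```mermaid', diagram, '```', '']
    return lines
```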

Sources: tools/deepwiki-scraper.py:771-783

sequenceDiagram
    participant Main as extract_and_enhance_diagrams()
    participant Glob as temp_dir.glob('**/*.md')
    participant File as Individual .md file
    participant Matcher as Fuzzy Matcher
    participant Writer as File Writer
    
    Main->>Main: Extract all diagram_contexts
    Main->>Glob: Find all markdown files
    
    loop For each md_file
        Glob->>File: Open and read content
        File->>File: Check if '```mermaid' already present
        
        alt Already has diagrams
            File->>Glob: Skip (continue)
        else No diagrams
            File->>Matcher: Normalize content
            
            loop For each diagram_context
                Matcher->>Matcher: Try progressive chunk matching
                Matcher->>Matcher: Try heading fallback
                Matcher->>Matcher: Record best match
            end
            
            Matcher->>File: Return pending_insertions list
            File->>File: Sort insertions (descending)
            File->>File: Insert diagrams bottom-up
            File->>Writer: Write enhanced content
            Writer->>Main: Increment enhanced_count
        end
    end
    
    Main->>Main: Print summary

File Processing Workflow

Phase 2 operates on files in the temporary directory created by Phase 1, enhancing them in-place before they are moved to the final output directory.

Processing Loop

Diagram: File Processing Sequence

Sources: tools/deepwiki-scraper.py:676-788

Performance Characteristics

Extraction Statistics

From a typical wiki with ~10 pages:

| Metric | Value | Location |
| --- | --- | --- |
| Total diagrams in JS payload | ~461 | README.md:132 |
| Diagrams with context (500+ chars) | ~48 | README.md:133 |
| Context window size | 500 characters | tools/deepwiki-scraper.py:621 |
| Anchor text max length | 300 characters | tools/deepwiki-scraper.py:666 |
| Typical enhanced files | Varies by content | Printed in output |

Sources: README.md:132-133 tools/deepwiki-scraper.py:674 tools/deepwiki-scraper.py:788

Matching Performance

The progressive chunk size strategy balances precision and recall:

  • High precision matches (300-200 chars) : Strong contextual alignment
  • Medium precision matches (150-100 chars) : Acceptable with some risk
  • Low precision matches (80 chars) : Risk of false positives
  • Heading-only matches (score: 50) : Last resort fallback

The algorithm prefers to skip a diagram rather than place it incorrectly, prioritizing documentation quality over diagram count.

Sources: tools/deepwiki-scraper.py:716-745

Integration with Phases 1 and 3

Input Requirements (from Phase 1)

  • Clean markdown files in temp_dir
  • Files must not already contain ```mermaid blocks
  • Proper heading structure for fallback matching
  • Normalized link structure

Sources: tools/deepwiki-scraper.py:810-877

Output Guarantees (for Phase 3)

  • Enhanced markdown files in temp_dir
  • Diagrams inserted with proper fencing (```mermaid … ```)
  • Blank lines before and after diagrams for proper rendering
  • Original file structure preserved (section-N directories maintained)
  • Atomic file operations (write complete file or skip)

Sources: tools/deepwiki-scraper.py:781-786 tools/deepwiki-scraper.py:883-908

Workflow Integration

Diagram: Three-Phase Integration

Sources: README.md:123-145 tools/deepwiki-scraper.py:810-916

Error Handling and Edge Cases

Skipped Files

Files are skipped if they already contain Mermaid diagrams to avoid duplicate insertion:
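A sketch of the skip check inside the per-file loop; the glob path is illustrative:

```python
from pathlib import Path

for md_file in Path('wiki').glob('**/*.md'):
    content = md_file.read_text(encoding='utf-8')
    if '```mermaid' in content:
        continue  # already enhanced; avoid duplicate insertion
    # ... fuzzy matching and insertion would happen here ...
```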

Sources: tools/deepwiki-scraper.py:686-687

Failed Matches

When a diagram cannot be matched:

  • The diagram is not inserted (conservative approach)
  • No error is raised (continues processing other diagrams)
  • File is left unmodified if no diagrams match

Sources: tools/deepwiki-scraper.py:699-746

Network Errors

If diagram extraction fails (network error, changed HTML structure):

  • Warning is printed but Phase 2 continues
  • Phase 1 files remain valid
  • System can still proceed to Phase 3 without diagrams

Sources: tools/deepwiki-scraper.py:610-612

Diagram Quality Thresholds

| Threshold | Purpose |
| --- | --- |
| len(diagram) > 10 | Filter out trivial/invalid diagram code |
| len(anchor) > 50 | Ensure sufficient context for matching |
| len(line) > 20 | Filter out short lines from anchor text |
| chunk_size >= 80 | Minimum viable match size |

Sources: tools/deepwiki-scraper.py:648 tools/deepwiki-scraper.py:712 tools/deepwiki-scraper.py:661

Summary

Phase 2 implements a sophisticated fuzzy matching system that:

  1. Extracts all Mermaid diagrams from DeepWiki’s JavaScript payload using regex patterns
  2. Processes diagram context to extract heading and anchor text metadata
  3. Matches diagrams to markdown files using progressive chunk size comparison (300→80 chars)
  4. Inserts diagrams after relevant paragraphs with proper formatting
  5. Validates through conservative matching to avoid false positives

The phase operates entirely on files in the temporary directory, leaving Phase 1’s output intact while preparing enhanced files for Phase 3’s mdBook build process.

Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789

Mermaid Normalization

Relevant source files

The Mermaid normalization pipeline transforms diagrams extracted from DeepWiki’s JavaScript payload into syntax that is compatible with Mermaid 11. DeepWiki’s diagrams often contain formatting issues, legacy syntax, and multiline constructs that newer Mermaid parsers reject. This seven-step normalization process ensures that all diagrams render correctly in mdBook’s Mermaid renderer.

For information about how diagrams are extracted from the JavaScript payload, see Phase 2: Diagram Enhancement. For information about the fuzzy matching algorithm that places diagrams in the correct markdown files, see Fuzzy Matching Algorithm.

Purpose and Scope

This page documents the seven normalization functions that transform raw Mermaid diagram code into Mermaid 11-compatible syntax. Each normalization step addresses a specific category of syntax errors or incompatibilities. The pipeline is applied to every diagram before it is injected into markdown files.

The normalization pipeline handles:

  • Multiline edge labels that span multiple lines
  • State diagram description syntax variations
  • Flowchart node labels containing reserved characters
  • Missing statement separators between consecutive nodes
  • Empty node labels that lack fallback text
  • Gantt chart tasks missing required task identifiers
  • Additional edge case transformations (quote stripping, label merging)

Normalization Pipeline Architecture

The normalization pipeline is orchestrated by the normalize_mermaid_diagram function, which applies seven normalization passes in sequence. Each pass is idempotent and focuses on a specific syntax issue.

Pipeline Flow Diagram

graph TD
    Input["Raw Diagram Text\nfrom Next.js Payload"]
Step1["normalize_mermaid_edge_labels()\nFlatten multiline edge labels"]
Step2["normalize_mermaid_state_descriptions()\nFix state syntax"]
Step3["normalize_flowchart_nodes()\nClean node labels"]
Step4["normalize_statement_separators()\nInsert newlines"]
Step5["normalize_empty_node_labels()\nAdd fallback labels"]
Step6["normalize_gantt_diagram()\nAdd synthetic task IDs"]
Output["Normalized Diagram\nMermaid 11 Compatible"]
Input --> Step1
 
   Step1 --> Step2
 
   Step2 --> Step3
 
   Step3 --> Step4
 
   Step4 --> Step5
 
   Step5 --> Step6
 
   Step6 --> Output

Sources: python/deepwiki-scraper.py:385-393 python/deepwiki-scraper.py:230-383

Function Name to Normalization Step Mapping

Sources: python/deepwiki-scraper.py:385-393

Step 1: Edge Label Normalization

The normalize_mermaid_edge_labels function collapses multiline edge labels into single-line labels with escaped newline sequences. Mermaid 11 rejects edge labels that span multiple physical lines.

Function : normalize_mermaid_edge_labels(diagram_text: str) -> str

Pattern Matched : Edge labels enclosed in pipes: |....|

Transformations Applied :

  • Replace literal newline characters with spaces
  • Replace escaped \n sequences with spaces
  • Remove parentheses from labels (invalid syntax)
  • Collapse multiple spaces into single spaces

| Before | After |
| --- | --- |
| `A -->\|"Label\nLine 2"\| B` | `A -->\|"Label Line 2"\| B` |
| `C -->\|Text (note)\| D` | `C -->\|Text note\| D` |
| `E -->\|First\nSecond\nThird\| F` | `E -->\|First Second Third\| F` |

Implementation Details :

  • Only processes diagrams starting with graph or flowchart keywords
  • Uses regex pattern \|([^|]*)\| to match edge labels
  • Checks for presence of \n, (, or ) before applying cleanup
  • Preserves labels that are already properly formatted

Sources: python/deepwiki-scraper.py:230-251

Step 2: State Description Normalization

The normalize_mermaid_state_descriptions function ensures state diagram descriptions follow the strict State : Description syntax required by Mermaid 11.

Function : normalize_mermaid_state_descriptions(diagram_text: str) -> str

Pattern Matched : State declarations with colons in state diagrams

Transformations Applied :

  • Ensure single space after state name before colon
  • Replace newlines in descriptions with spaces
  • Replace additional colons in description with -
  • Collapse multiple spaces to single space

| Before | After |
| --- | --- |
| `Idle:Waiting\nfor input` | `Idle : Waiting for input` |
| `Active:Processing:data` | `Active : Processing - data` |
| `Error :   Multiple   spaces` | `Error : Multiple spaces` |

Implementation Details :

  • Only processes diagrams starting with statediagram keyword
  • Skips lines containing :: (double colon, used for class names)
  • Splits each line on first colon occurrence
  • Requires both prefix and suffix to be non-empty after stripping

Sources: python/deepwiki-scraper.py:253-277

Step 3: Flowchart Node Normalization

The normalize_flowchart_nodes function removes reserved characters (especially pipe |) from flowchart node labels and adds statement separators.

Function : normalize_flowchart_nodes(diagram_text: str) -> str

Pattern Matched : Node labels in brackets: ["..."]

Transformations Applied :

  • Replace pipe characters | with forward slash /
  • Collapse multiple spaces to single space
  • Insert newlines between consecutive statements on same line

| Before | After |
| --- | --- |
| `Node["Label\|With Pipes"]` | `Node["Label/With Pipes"]` |
| `A["Text"] B["More"]` | `A["Text"]`<br>`B["More"]` |
| `C["Many   Spaces"]` | `C["Many Spaces"]` |

Implementation Details :

  • Only processes diagrams starting with graph or flowchart keywords
  • Uses regex \["([^"]*)"\] to match quoted node labels
  • Inserts newlines after closing brackets/braces/parens using regex: (\"]|\}|\))\s+(?=[A-Za-z0-9_])
  • Preserves indentation when splitting statements

Sources: python/deepwiki-scraper.py:279-301

Step 4: Statement Separator Normalization

The normalize_statement_separators function inserts newlines between consecutive Mermaid statements that have been flattened onto a single line.

Function : normalize_statement_separators(diagram_text: str) -> str

Connector Tokens Recognized :

--> ==> -.-> --x x-- o--> o-> x-> *--> <--> <-.-> <-- --o

Pattern Matched : Whitespace before a node identifier that precedes a connector

Regex Pattern : STATEMENT_BREAK_PATTERN

| Before | After |
| --- | --- |
| `A-->B B-->C C-->D` | `A-->B`<br>`B-->C`<br>`C-->D` |
| `Node1-->Node2 Node3-->Node4` | `Node1-->Node2`<br>`Node3-->Node4` |

Implementation Details :

  • Only processes diagrams starting with graph or flowchart keywords
  • Defines FLOW_CONNECTORS list of all Mermaid connector tokens
  • Builds regex pattern by escaping and joining connector tokens
  • Pattern: (?<!\n)([ \t]+)(?=[A-Za-z0-9_][\w\-]*(?:\s*\[[^\]]*\])?\s*(?:CONNECTORS)(?:\|[^|]*\|)?\s*)
  • Preserves indentation length when inserting newlines
  • Converts tabs to 4 spaces for consistent indentation

Sources: python/deepwiki-scraper.py:303-328 python/deepwiki-scraper.py:309-311

Step 5: Empty Node Label Normalization

The normalize_empty_node_labels function provides fallback text for nodes with empty labels, which Mermaid 11 rejects.

Function : normalize_empty_node_labels(diagram_text: str) -> str

Pattern Matched : Empty quoted labels: NodeId[""]

Transformation Applied :

  • Use node ID as fallback label text
  • Replace underscores and hyphens with spaces
  • Preserve original node ID for connections

| Before | After |
| --- | --- |
| `Dead[""]` | `Dead["Dead"]` |
| `User_Profile[""]` | `User_Profile["User Profile"]` |
| `API-Gateway[""]` | `API-Gateway["API Gateway"]` |

Implementation Details :

  • Regex pattern: (\b[A-Za-z0-9_]+)\[""\]
  • Converts underscores/hyphens to spaces for readable label: re.sub(r'[_\-]+', ' ', node_id)
  • Falls back to raw node_id if cleaned version is empty
  • Applied to all diagram types (not limited to flowcharts)
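A sketch following the documented substitution; the character class is widened with "-" here so the hyphenated API-Gateway example above also matches, which is an assumption beyond the documented pattern:

```python
import re

EMPTY_LABEL = re.compile(r'(\b[A-Za-z0-9_\-]+)\[""\]')

def normalize_empty_node_labels(diagram_text: str) -> str:
    def repl(match):
        node_id = match.group(1)
        # Underscores/hyphens become spaces for a readable label.
        label = re.sub(r'[_\-]+', ' ', node_id).strip() or node_id
        return f'{node_id}["{label}"]'
    return EMPTY_LABEL.sub(repl, diagram_text)
```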

Sources: python/deepwiki-scraper.py:330-341 python/tests/test_mermaid_normalization.py:19-23

Step 6: Gantt Diagram Normalization

The normalize_gantt_diagram function assigns synthetic task identifiers to gantt chart tasks that are missing them, which is required by Mermaid 11.

Function : normalize_gantt_diagram(diagram_text: str) -> str

Pattern Matched : Task lines in format "Task Name" : start, end[, duration]

Transformation Applied :

  • Insert synthetic task ID (task1, task2, etc.) after colon
  • Only apply to tasks lacking valid identifiers
  • Preserve tasks that already have IDs or use after dependencies

| Before | After |
| --- | --- |
| `"Design" : 2024-01-01, 2024-01-10` | `"Design" : task1, 2024-01-01, 2024-01-10` |
| `"Code" : myTask, 2024-01-11, 5d` | `"Code" : myTask, 2024-01-11, 5d` (unchanged) |
| `"Test" : after task1, 3d` | `"Test" : after task1, 3d` (unchanged) |

Implementation Details :

  • Only processes diagrams starting with gantt keyword
  • Task line regex: ^(\s*"[^"]+"\s*):\s*(.+)$
  • Splits remainder on commas (max 3 parts)
  • Checks if first token matches ^[A-Za-z_][\w-]*$ or starts with after
  • Maintains counter (task_counter) for generating unique IDs
  • Reconstructs line: "{task_name}" : {task_id}, {start}, {end}[, {duration}]
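A sketch of these rules under the documented patterns; details such as error handling are omitted:

```python
import re

TASK_LINE = re.compile(r'^(\s*"[^"]+"\s*):\s*(.+)$')

def normalize_gantt_diagram(diagram_text: str) -> str:
    """Add synthetic IDs to gantt tasks that lack one."""
    if not diagram_text.lstrip().startswith('gantt'):
        return diagram_text
    out, task_counter = [], 0
    for line in diagram_text.split('\n'):
        match = TASK_LINE.match(line)
        if match:
            first = match.group(2).split(',', 1)[0].strip()
            # Keep tasks that already have an ID or an "after" dependency.
            if not (re.match(r'^[A-Za-z_][\w-]*$', first)
                    or first.startswith('after')):
                task_counter += 1
                line = f'{match.group(1)}: task{task_counter}, {match.group(2)}'
        out.append(line)
    return '\n'.join(out)
```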

Sources: python/deepwiki-scraper.py:343-383

Step 7: Additional Preprocessing

Before the seven main normalization steps, diagrams undergo additional preprocessing in the extraction phase:

Quote Stripping : strip_wrapping_quotes(diagram_text: str) -> str

  • Removes unnecessary quotation marks around edge labels: |"text"| → |text|
  • Removes quotes in state transitions: : "label" → : label

Label Merging : merge_multiline_labels(diagram_text: str) -> str

  • Collapses wrapped labels inside node shapes into \n sequences
  • Handles multiple shape types: (), [], {}, (()), [[]], {{}}
  • Skips lines containing structural tokens (arrows, keywords)
  • Applied before unescaping, so works with both real and escaped newlines

Sources: python/deepwiki-scraper.py:907-1023

Main Orchestrator Function

The normalize_mermaid_diagram function orchestrates all normalization passes in the correct order.

Function Signature : normalize_mermaid_diagram(diagram_text: str) -> str

Implementation :
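A sketch consistent with the pipeline diagram above; the real function lives at python/deepwiki-scraper.py:385-393:

```python
def normalize_mermaid_diagram(diagram_text: str) -> str:
    """Apply the normalization passes in the order shown above."""
    for normalize in (
        normalize_mermaid_edge_labels,
        normalize_mermaid_state_descriptions,
        normalize_flowchart_nodes,
        normalize_statement_separators,
        normalize_empty_node_labels,
        normalize_gantt_diagram,
    ):
        diagram_text = normalize(diagram_text)
    return diagram_text
```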

Key Characteristics :

  • Each pass is idempotent and can be safely applied multiple times
  • Passes are self-contained, but the order in which they run matters
  • Edge label normalization must precede statement separator insertion
  • Flowchart node normalization includes its own statement separator logic
  • Empty label normalization should occur after other node transformations

Sources: python/deepwiki-scraper.py:385-393

graph TD
    Extract["extract_and_enhance_diagrams()"]
Loop["For each diagram match"]
Unescape["Unescape sequences\n(\\\n, \\ , \<, etc)"]
Preprocess["merge_multiline_labels()\nstrip_wrapping_quotes()"]
Normalize["normalize_mermaid_diagram()"]
Context["Extract context\n(heading, anchor text)"]
Pool["Add to diagram_contexts list"]
Extract --> Loop
 
   Loop --> Unescape
 
   Unescape --> Preprocess
 
   Preprocess --> Normalize
 
   Normalize --> Context
 
   Context --> Pool

Normalization Invocation Points

The normalization pipeline is invoked at a single location during diagram processing:

Invocation Context Diagram

Sources: python/deepwiki-scraper.py:880-1089 python/deepwiki-scraper.py:1058-1060

Testing and Validation

The normalization pipeline has dedicated unit tests covering each normalization function:

Test Coverage :

| Function | Test File | Test Cases |
| --- | --- | --- |
| normalize_statement_separators | test_mermaid_normalization.py | Newline insertion, indentation preservation |
| normalize_empty_node_labels | test_mermaid_normalization.py | Empty label replacement |
| normalize_flowchart_nodes | test_mermaid_normalization.py | Pipe character stripping |
| normalize_mermaid_diagram | test_mermaid_normalization.py | End-to-end pipeline test |

Example Test Case (Statement Separator Normalization):
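An illustrative assertion in the spirit of the real test; the exact fixture strings are assumptions:

```python
def test_statement_separator_insertion():
    # Two statements flattened onto one line should be split,
    # preserving the original indentation.
    result = normalize_statement_separators("graph TD\n    A-->B B-->C")
    assert result == "graph TD\n    A-->B\n    B-->C"
```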

End-to-End Test : The end-to-end test validates that multiple normalization steps work together correctly:

  • Input: graph TD\n Stage1[""] --> Stage2["Stage 2"]\n Stage2 --> Stage3 Stage3 --> Stage4
  • Validates empty label replacement: Stage1["Stage1"]
  • Validates statement separation: Stage2 --> Stage3 and Stage3 --> Stage4 on separate lines

Sources: python/tests/test_mermaid_normalization.py:1-42

Common Edge Cases

The normalization pipeline handles several edge cases that commonly occur in DeepWiki diagrams:

Empty Diagram Handling :

  • All normalizers check for empty/whitespace-only input
  • Return original text unchanged if stripped content is empty

Diagram Type Detection :

  • Each normalizer checks diagram type via first line keyword
  • normalize_mermaid_edge_labels: only processes graph or flowchart
  • normalize_mermaid_state_descriptions: only processes statediagram
  • normalize_gantt_diagram: only processes gantt
  • Other normalizers apply to all diagram types

Indentation Preservation :

  • Statement separator normalization preserves original indentation level
  • Converts tabs to 4 spaces for consistent formatting
  • Inserts newlines with matching indentation

Backtick Escaping in Fence Blocks : When injecting normalized diagrams into markdown, the injection logic dynamically calculates fence length to avoid conflicts with backticks inside diagram code:

  • Scans diagram for longest backtick run
  • Uses max(3, max_backticks + 1) as fence length
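A sketch of that calculation:

```python
import re

def fence_for(diagram: str) -> str:
    """Pick a fence longer than any backtick run inside the diagram."""
    runs = [len(m.group(0)) for m in re.finditer(r'`+', diagram)]
    fence_len = max(3, max(runs) + 1) if runs else 3
    return '`' * fence_len
```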

Sources: python/deepwiki-scraper.py:1249-1255

Fuzzy Matching Algorithm

Relevant source files

Purpose and Scope

The fuzzy matching algorithm places Mermaid diagrams extracted from DeepWiki’s JavaScript payload into the correct locations within Markdown files. It matches diagram context (as it appears in the JavaScript) to content locations in html2text-converted Markdown, accounting for formatting differences between the two representations.

This algorithm is implemented in the extract_and_enhance_diagrams() function and processes all files after the initial markdown extraction phase completes.

Sources: python/deepwiki-scraper.py:880-1275

The Matching Problem

The fuzzy matching algorithm addresses a fundamental mismatch: diagrams are embedded in DeepWiki’s JavaScript payload alongside their surrounding context text, but this context text differs significantly from the final Markdown output produced by html2text. The algorithm must find where each diagram belongs despite these differences.

Format Differences Between Sources

| Aspect | JavaScript Payload | html2text Output |
| --- | --- | --- |
| Whitespace | Escaped \n sequences | Actual newlines |
| Line wrapping | No wrapping (continuous text) | Wrapped at natural boundaries |
| HTML entities | Escaped (\u003c, \u0026) | Decoded (<, &) |
| Formatting | Inline with escaped quotes | Clean Markdown syntax |
| Structure | Linear text stream | Hierarchical headings/paragraphs |

Sources: python/deepwiki-scraper.py:898-903

Context Extraction Strategy

The algorithm extracts two types of context for each diagram to enable matching:

1. Last Heading Before Diagram

Extracting last_heading from Context

The algorithm scans backwards through context_lines to find the most recent line starting with #, which provides a coarse-grained location hint.

Sources: python/deepwiki-scraper.py:1066-1071

2. Anchor Text (Last 2-3 Paragraphs)

Extracting anchor_text from Context

The anchor_text consists of the last 2-3 substantial non-heading lines before the diagram, truncated to 300 characters. This provides fine-grained matching capability.

Sources: python/deepwiki-scraper.py:1073-1081

Progressive Chunk Size Matching

The core of the fuzzy matching algorithm uses progressively smaller chunk sizes to find matches, prioritizing longer (more specific) matches over shorter ones.

Chunk Size Progression

The algorithm tests chunks in this order:

| Chunk Size | Purpose | Match Quality |
| --- | --- | --- |
| 300 chars | Full anchor text | Highest confidence |
| 200 chars | Most of anchor | High confidence |
| 150 chars | Significant portion | Medium-high confidence |
| 100 chars | Key phrases | Medium confidence |
| 80 chars | Minimum viable match | Low confidence |

Matching Algorithm Flow

Progressive Chunk Matching in Code
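A sketch of the loop; the chunk sizes and find-based search follow the progression above, while converting the character position to a line number is omitted:

```python
def find_anchor_match(content_normalized: str, anchor_normalized: str):
    """Try progressively smaller tail chunks of the anchor text."""
    for chunk_size in (300, 200, 150, 100, 80):
        test_chunk = anchor_normalized[-chunk_size:]
        pos = content_normalized.find(test_chunk)
        if pos != -1:
            # The score is simply the chunk size that matched,
            # so longer (more specific) matches always win.
            return pos, chunk_size
    return -1, 0
```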

Sources: python/deepwiki-scraper.py:1169-1239

Text Normalization

Both the diagram context and the target Markdown content undergo identical normalization to maximize matching success:

This process:

  • Converts all text to lowercase
  • Collapses all consecutive whitespace (spaces, tabs, newlines) into single spaces
  • Removes leading/trailing whitespace
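Both sides are reduced with the same transformation, equivalent to:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse every whitespace run to a single space."""
    return ' '.join(text.lower().split())
```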

Sources: python/deepwiki-scraper.py:1166-1167 python/deepwiki-scraper.py:1185-1186

Fallback: Heading-Based Matching

If progressive chunk matching fails (best_match_line == -1 and heading exists), the algorithm falls back to heading-based matching:

Heading Fallback Implementation

Heading-based matches receive a fixed best_match_score of 50, lower than any chunk-based match (minimum 80), indicating lower confidence.
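A sketch of the fallback, mirroring the normalization shown in the flowchart above:

```python
def heading_fallback(lines, heading: str):
    """Match on heading text when chunk matching fails (fixed score 50)."""
    heading_normalized = heading.lower().replace('#', '').strip()
    for line_num, line in enumerate(lines):
        if line.strip().startswith('#'):
            line_normalized = line.lower().replace('#', '').strip()
            if heading_normalized in line_normalized:
                return line_num, 50
    return -1, 0
```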

Sources: python/deepwiki-scraper.py:1203-1216

Insertion Point Calculation

Once a match is found with best_match_score >= 80, the algorithm calculates the precise insert_line for the diagram:

Insertion After Headings

Calculating insert_line After Heading

Sources: python/deepwiki-scraper.py:1222-1230

Insertion After Paragraphs

Calculating insert_line After Paragraph

Sources: python/deepwiki-scraper.py:1232-1236

Scoring and Deduplication

The algorithm tracks which diagrams have been used to prevent duplicates:

For each file, the algorithm:

  1. Attempts to match all diagrams in diagram_contexts with file content
  2. Stores successful matches with their scores in pending_insertions as tuples: (insert_line, diagram, best_match_score, idx)
  3. Marks diagrams as used by adding their index to diagrams_used set
  4. Sorts pending_insertions by line number (descending) to avoid index shifting
  5. Inserts diagrams from bottom to top

Sources: python/deepwiki-scraper.py:1162-1163 python/deepwiki-scraper.py:1238 python/deepwiki-scraper.py:1242-1243

Diagram Insertion Format

Diagrams are inserted with proper Markdown fencing and spacing, accounting for backticks in the diagram content:

Building lines_to_insert

This results in an empty line, an opening ```mermaid fence, the diagram code, a closing fence, and another empty line being inserted immediately before the next paragraph of the original text.

If the diagram contains triple backticks, the fence length is increased (e.g., to 4 or 5 backticks) to avoid conflicts.

Sources: python/deepwiki-scraper.py:1249-1266

Complete Matching Pipeline

extract_and_enhance_diagrams() Function Flow

flowchart TD
    Start["extract_and_enhance_diagrams(\nrepo, temp_dir,\nsession, diagram_source_url)"] --> Fetch["response = session.get(\ndiagram_source_url)\nhtml_text = response.text"]
    Fetch --> ExtractPattern["diagram_pattern =\nr'```mermaid(?:\\r\\n|\\n|\r?\n)\n(.*?)(?:\\r\\n|\\n|\r?\n)```'\ndiagram_matches =\nlist(re.finditer(\npattern, html_text, re.DOTALL))"]
    ExtractPattern --> ContextLoop["diagram_contexts = []\nfor match in diagram_matches:\ncontext_start =\nmax(0, match.start() - 2000)"]
    ContextLoop --> ParseContext["extract:\n- last_heading\n- anchor_text[-300:]\n- diagram (unescaped)"]
    ParseContext --> NormalizeDiag["diagram =\nmerge_multiline_labels(diagram)\ndiagram =\nstrip_wrapping_quotes(diagram)\ndiagram =\nnormalize_mermaid_diagram(diagram)"]
    NormalizeDiag --> AppendContext["diagram_contexts.append({\n'last_heading': last_heading,\n'anchor_text': anchor_text,\n'diagram': diagram\n})"]
    AppendContext --> FindFiles["md_files =\nlist(temp_dir.glob('**/*.md'))"]
    
    FindFiles --> FileLoop["for md_file in md_files:"]
    FileLoop --> ReadFile["content = f.read()"]
    ReadFile --> CheckExists["re.search(\nr'^\\s*`{3,}\\s*mermaid\\b',\ncontent, re.IGNORECASE\n| re.MULTILINE)?"]
    CheckExists -->|Yes| Skip["continue"]
    CheckExists -->|No| SplitLines["lines = content.split('\\n')"]
    SplitLines --> InitVars["diagrams_used = set()\npending_insertions = []\ncontent_normalized =\ncontent.lower()"]
    InitVars --> DiagLoop["for idx, item in\nenumerate(diagram_contexts):"]
    
    DiagLoop --> TryChunks["for chunk_size in\n[300, 200, 150, 100, 80]:\ntest_chunk =\nanchor_normalized[-chunk_size:]\npos = content_normalized\n.find(test_chunk)"]
    TryChunks -->|Found| CalcLine["convert pos to line_num\nbest_match_line = line_num\nbest_match_score = chunk_size"]
    TryChunks -->|Not found| TryHeading["heading fallback matching"]
    TryHeading -->|Found| SetScore["best_match_score = 50"]
    CalcLine --> CheckScore["best_match_score >= 80?"]
    SetScore --> CheckScore
    CheckScore -->|Yes| CalcInsert["calculate insert_line:\nenforce_content_start()\nadvance_past_lists()"]
    CalcInsert --> AppendPending["pending_insertions.append(\ninsert_line, diagram,\nbest_match_score, idx)\ndiagrams_used.add(idx)"]
    CheckScore -->|No| NextDiag["next diagram"]
    
    AppendPending --> NextDiag
    NextDiag --> DiagLoop
    DiagLoop -->|All done| SortPending["pending_insertions.sort(\nkey=lambda x: x[0],\nreverse=True)"]
    SortPending --> InsertLoop["for insert_line, diagram,\nscore, idx in\npending_insertions:\ncalculate fence_len\nlines.insert(insert_line,\nlines_to_insert)"]
    InsertLoop --> SaveFile["with open(md_file, 'w')\nas f:\nf.write('\\n'.join(lines))"]
    SaveFile --> FileLoop
    FileLoop -->|All files| PrintStats["print(f'Enhanced\n{enhanced_count} files')"]

Sources: python/deepwiki-scraper.py:880-1275

Key Functions and Variables

| Function/Variable | Location | Purpose |
| --- | --- | --- |
| extract_and_enhance_diagrams() | python/deepwiki-scraper.py:880-1275 | Main orchestrator for diagram enhancement phase |
| diagram_contexts | python/deepwiki-scraper.py:903 | List of dicts with last_heading, anchor_text, diagram |
| first_body_heading_index() | python/deepwiki-scraper.py:1095-1099 | Finds first ## heading in file |
| protected_prefix_end() | python/deepwiki-scraper.py:1101-1115 | Determines where title and source list end |
| advance_past_lists() | python/deepwiki-scraper.py:1125-1137 | Skips over list blocks to avoid insertion in lists |
| enforce_content_start() | python/deepwiki-scraper.py:1138-1147 | Ensures insertion is after protected sections |
| diagrams_used | python/deepwiki-scraper.py:1162 | Set tracking which diagram indices are already placed |
| pending_insertions | python/deepwiki-scraper.py:1163 | List of tuples: (insert_line, diagram, score, idx) |
| best_match_line | python/deepwiki-scraper.py:1175 | Line number where best match was found |
| best_match_score | python/deepwiki-scraper.py:1176 | Score of best match (chunk_size or 50 for heading) |
| Progressive chunk loop | python/deepwiki-scraper.py:1187-1201 | Tries chunk sizes [300, 200, 150, 100, 80] |
| Heading fallback | python/deepwiki-scraper.py:1203-1216 | Matches based on heading text when chunks fail |
| Insertion calculation | python/deepwiki-scraper.py:1219-1236 | Determines where to insert diagram after match |
| Diagram insertion | python/deepwiki-scraper.py:1249-1266 | Inserts diagram with dynamic fence length |

Performance Characteristics

The algorithm processes diagrams in a single pass per file with the following complexity:

| Operation | Complexity | Notes |
| --- | --- | --- |
| Content normalization | O(n) | n = file size in characters |
| Chunk search | O(n × c × d) | c = 5 chunk sizes, d = diagram count |
| Line number conversion | O(L) | L = number of lines in file |
| Insertion sorting | O(k log k) | k = matched diagrams |
| Bottom-up insertion | O(k × L) | Reverse order avoids index recalculation |

For a typical file with 1000 lines and 48 diagram candidates with context, the algorithm completes in under 100ms per file.

Sources: python/deepwiki-scraper.py:880-1275

Match Quality Statistics

As reported in the console output, the algorithm typically achieves:

  • Total diagrams in JavaScript : ~461 diagrams across all pages
  • Diagrams with sufficient context : ~48 diagrams (500+ char context)
  • Average match rate : 60-80% of diagrams with context are successfully placed
  • Typical score distribution :
    • 300-char matches: 20-30% (highest confidence)
    • 200-char matches: 15-25%
    • 150-char matches: 15-20%
    • 100-char matches: 10-15%
    • 80-char matches: 5-10%
    • Heading fallback: 5-10% (lowest confidence)

Sources: README.md:132-136 tools/deepwiki-scraper.py:674

Phase 3: mdBook Build

Relevant source files

Purpose and Scope

Phase 3 is the final transformation stage that converts enhanced markdown files into a searchable HTML documentation website using mdBook. This phase creates the book structure, generates the table of contents, injects templates, installs mermaid rendering support, and executes the mdBook build process.

This page covers the overall Phase 3 workflow and its core components. For detailed information about specific sub-processes, see:

  • SUMMARY.md generation algorithm: 8.1
  • Template injection mechanics: 8.2
  • Configuration system: 3
  • Template system details: 11

Phase 3 begins after Phase 2 completes diagram enhancement (see 7) and ends with the production of deployable HTML artifacts.

Sources: scripts/build-docs.sh:95-310


Phase 3 Process Flow

Phase 3 executes six distinct steps, each transforming the workspace toward the final HTML output. The process is orchestrated by build-docs.sh and coordinates multiple tools.

graph TB
    Input["Enhanced Markdown Files\n/workspace/wiki/"]
Step2["Step 2: Initialize mdBook Structure\nCreate /workspace/book/\nGenerate book.toml"]
Step3["Step 3: Generate SUMMARY.md\nDiscover file structure\nSort pages numerically"]
Step4["Step 4: Process Markdown Files\nInject header/footer templates\nCopy to book/src/"]
Step5["Step 5: Install Mermaid Assets\nmdbook-mermaid install"]
Step6["Step 6: Build Book\nmdbook build"]
Step7["Step 7: Copy Outputs\nbook/ → /output/book/\nbook.toml → /output/"]
Output["Final Outputs\n/output/book/ (HTML)\n/output/markdown/\n/output/book.toml"]
Input --> Step2
 
   Step2 --> Step3
 
   Step3 --> Step4
 
   Step4 --> Step5
 
   Step5 --> Step6
 
   Step6 --> Step7
 
   Step7 --> Output

High-Level Phase 3 Pipeline

Sources: scripts/build-docs.sh:95-310


mdBook Structure Initialization

The first step creates the mdBook workspace at /workspace/book/ and generates the configuration file book.toml. This establishes the foundation for all subsequent operations.

Directory Structure Creation

The script creates the following directory hierarchy:

/workspace/book/
├── book.toml          (generated configuration)
└── src/               (will contain markdown files)
    ├── SUMMARY.md     (generated in Step 3)
    ├── *.md           (copied in Step 4)
    └── section-*/     (copied in Step 4)

Sources: scripts/build-docs.sh:96-122

book.toml Configuration

The book.toml file is generated dynamically from environment variables. The configuration structure follows the mdBook specification:

| Configuration Section | Purpose | Source Variables |
| --- | --- | --- |
| [book] | Book metadata | BOOK_TITLE, BOOK_AUTHORS |
| [output.html] | HTML output settings | GIT_REPO_URL |
| [preprocessor.mermaid] | Mermaid diagram support | Static configuration |
| [output.html.fold] | Section folding behavior | Static configuration |

In the generated configuration, the git-repository-url setting enables the GitHub icon in the rendered book’s header, providing navigation back to the source repository.

Sources: scripts/build-docs.sh:101-119


SUMMARY.md Generation Process

Step 3 generates src/SUMMARY.md, which defines the book’s table of contents and navigation structure. This is a critical file that mdBook requires to determine page ordering and hierarchy.

graph TB
    WikiDir["/workspace/wiki/\nAll markdown files"]
FindOverview["Find Overview File\ngrep -Ev '^[0-9]'\nFirst non-numbered file"]
FindMain["Find Main Pages\ngrep -E '^[0-9]'\nFiles matching ^[0-9]*.md"]
Sort["Sort Numerically\nsort -t- -k1 -n\nBy leading number"]
CheckSections{"Has Subsections?\nsection-N directory exists?"}
FindSubs["Find Subsections\nls section-N/*.md\nSort numerically"]
ExtractTitle["Extract Title\nhead -1 file.md\nsed 's/^# //'"]
BuildEntry["Build TOC Entry\n- [Title](filename)"]
BuildNested["Build Nested Entry\n- [Title](filename)\n - [Subtitle](section-N/file)"]
Summary["src/SUMMARY.md\nGenerated TOC"]
WikiDir --> FindOverview
 
   WikiDir --> FindMain
 
   FindOverview --> Summary
 
   FindMain --> Sort
 
   Sort --> CheckSections
    
 
   CheckSections -->|No| ExtractTitle
 
   CheckSections -->|Yes| FindSubs
    
 
   ExtractTitle --> BuildEntry
 
   FindSubs --> ExtractTitle
 
   ExtractTitle --> BuildNested
    
 
   BuildEntry --> Summary
 
   BuildNested --> Summary

File Discovery and Sorting

Numeric Sorting Algorithm

The sorting mechanism uses shell built-ins to extract numeric prefixes and sort appropriately:

  1. List all .md files in /workspace/wiki/: ls "$WIKI_DIR"/*.md
  2. Filter by numeric prefix: grep -E '^[0-9]'
  3. Sort using field delimiter - and numeric comparison: sort -t- -k1 -n
  4. For each main page, check for subsection directory: section-$section_num
  5. If subsections exist, repeat sort for subsection files

This ensures pages appear in the correct order: 1-overview.md, 2-architecture.md, 2-1-subsection.md, etc.

Sources: scripts/build-docs.sh:124-188

For detailed documentation of the SUMMARY.md generation algorithm, see 8.1.


Markdown File Processing

Step 4 copies markdown files from /workspace/wiki/ to /workspace/book/src/ and injects HTML templates into each file. This step bridges the gap between raw markdown and mdBook-ready content.

graph TB
    HeaderTemplate["templates/header.html\nRaw template with variables"]
FooterTemplate["templates/footer.html\nRaw template with variables"]
ProcessTemplate["process-template.py\nVariable substitution"]
EnvVars["Environment Variables\nREPO, BOOK_TITLE, DEEPWIKI_URL\nGITHUB_BADGE_URL, etc."]
HeaderHTML["HEADER_HTML\nProcessed HTML string"]
FooterHTML["FOOTER_HTML\nProcessed HTML string"]
MarkdownFiles["Markdown Files\nsrc/*.md, src/*/*.md"]
Inject["Injection Loop\nfor mdfile in src/*.md"]
TempFile["$mdfile.tmp\nHeader + Content + Footer"]
Replace["mv $mdfile.tmp $mdfile\nReplace original"]
FinalFiles["Final Book Source\nsrc/*.md with templates"]
HeaderTemplate --> ProcessTemplate
 
   FooterTemplate --> ProcessTemplate
 
   EnvVars --> ProcessTemplate
    
 
   ProcessTemplate --> HeaderHTML
 
   ProcessTemplate --> FooterHTML
    
 
   HeaderHTML --> Inject
 
   FooterHTML --> Inject
 
   MarkdownFiles --> Inject
    
 
   Inject --> TempFile
 
   TempFile --> Replace
 
   Replace --> FinalFiles

Template Processing Workflow

Template Variable Substitution

The process-template.py script is invoked twice, once for the header and once for the footer.

The processed HTML strings are then injected into every markdown file through shell redirection.

Sources: scripts/build-docs.sh:190-261

For detailed documentation of template mechanics and customization, see 8.2 and 11.


Mermaid Asset Installation

Step 5 installs the mdbook-mermaid preprocessor assets into the book directory. This step configures the JavaScript libraries and stylesheets required to render Mermaid diagrams in the final HTML output.

Installation Command

The installation is performed by running the mdbook-mermaid binary’s install command from the book directory.

This command:

  1. Creates theme assets in book/theme/
  2. Installs mermaid-init.js for diagram initialization
  3. Configures mermaid.js library version
  4. Sets up diagram rendering hooks for mdBook’s preprocessor chain

The preprocessor was configured in book.toml during Step 2.

This configuration tells mdBook to run the mermaid preprocessor before generating HTML, which converts mermaid code blocks into rendered diagrams.

Sources: scripts/build-docs.sh:263-266


graph LR
    BookToml["book.toml\nConfiguration"]
SrcDir["src/\nSUMMARY.md\n*.md files\nsection-*/ dirs"]
MdBookBuild["mdbook build\nMain build command"]
Preprocessor["mermaid preprocessor\nConvert mermaid blocks"]
Renderer["HTML renderer\nGenerate pages"]
Assets["Copy static assets\ntheme/, images/"]
Search["Build search index\nsearchindex.js"]
BookOutput["book/\nCompiled HTML"]
BookToml --> MdBookBuild
 
   SrcDir --> MdBookBuild
    
 
   MdBookBuild --> Preprocessor
 
   Preprocessor --> Renderer
 
   Renderer --> Assets
 
   Renderer --> Search
    
 
   Assets --> BookOutput
 
   Search --> BookOutput

Book Build Execution

Step 6 executes the core mdBook build process. This step transforms the prepared markdown files and configuration into a complete HTML documentation website.

mdBook Build Pipeline

Build Command

The build is invoked with no arguments, using the current directory’s configuration.

mdBook automatically:

  • Reads book.toml for configuration
  • Processes src/SUMMARY.md to determine page structure
  • Runs configured preprocessors (mermaid)
  • Generates HTML with search index
  • Applies the configured theme (rust)
  • Creates navigation elements with git repository link

The resulting output is written to book/, relative to the current working directory (/workspace/book/).

Sources: scripts/build-docs.sh:268-271


graph TB
    BookBuild["book/\nBuilt HTML website"]
WikiDir["wiki/\nEnhanced markdown"]
RawDir["raw_markdown/\nPre-enhancement snapshots"]
BookConfig["book.toml\nBuild configuration"]
OutputBook["/output/book/\nDeployable HTML"]
OutputMD["/output/markdown/\nFinal markdown source"]
OutputRaw["/output/raw_markdown/\nDebug snapshots"]
OutputConfig["/output/book.toml\nReference config"]
BookBuild -->|cp -r| OutputBook
 
   WikiDir -->|cp -r| OutputMD
 
   RawDir -->|cp -r if exists| OutputRaw
 
   BookConfig -->|cp| OutputConfig

Output Collection

Step 7 consolidates all build artifacts into the /output/ directory, which is typically mounted as a volume for access from the host system.

Output Artifacts and Layout

Copy Operations

The script performs four copy operations:

| Source | Destination | Purpose |
| --- | --- | --- |
| book/ | /output/book/ | Deployable HTML site |
| /workspace/wiki/ | /output/markdown/ | Enhanced markdown (with diagrams) |
| /workspace/raw_markdown/ | /output/raw_markdown/ | Pre-enhancement markdown (debugging) |
| book.toml | /output/book.toml | Configuration reference |

The /output/book/ directory is immediately servable as a static website:
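For example, using Python’s built-in static file server (any web server that can serve a directory works; the port is illustrative):

```
python3 -m http.server --directory /output/book 8000
```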

Sources: scripts/build-docs.sh:273-309


graph TB
    Phase1["Phase 1:\nMarkdown Extraction"]
Phase2["Phase 2:\nDiagram Enhancement"]
CheckMode{"MARKDOWN_ONLY\n== 'true'?"}
CopyMD["Copy wiki/ to\n/output/markdown/"]
CopyRaw["Copy raw_markdown/ to\n/output/raw_markdown/"]
Exit["Exit with success\nSkip Phase 3"]
Phase3["Phase 3:\nmdBook Build"]
Phase1 --> Phase2
 
   Phase2 --> CheckMode
    
 
   CheckMode -->|Yes| CopyMD
 
   CopyMD --> CopyRaw
 
   CopyRaw --> Exit
    
 
   CheckMode -->|No| Phase3

Markdown-Only Mode

The MARKDOWN_ONLY environment variable provides an escape hatch that bypasses Phase 3 entirely. When set to "true", the script exits after Phase 2 (diagram enhancement) without building the HTML book.

Markdown-Only Workflow

This mode is useful for:

  • Debugging diagram placement without HTML build overhead
  • Consuming markdown in alternative build systems
  • Inspecting intermediate transformation results
  • CI/CD pipelines that only need markdown output

Sources: scripts/build-docs.sh:67-93

For more details on markdown-only mode, see 12.1.


graph TB
    subgraph "Shell Script Orchestration"
        BuildScript["build-docs.sh\nMain orchestrator"]
Step2Func["Lines 95-122\nmkdir, cat > book.toml"]
Step3Func["Lines 124-188\nSUMMARY.md generation"]
Step4Func["Lines 190-261\ncp, template injection"]
Step5Func["Lines 263-266\nmdbook-mermaid install"]
Step6Func["Lines 268-271\nmdbook build"]
Step7Func["Lines 273-295\ncp outputs"]
end
    
    subgraph "External Tools"
        MdBook["mdbook\nRust binary\n/usr/local/bin/mdbook"]
MdBookMermaid["mdbook-mermaid\nPreprocessor binary\n/usr/local/bin/mdbook-mermaid"]
ProcessTemplate["process-template.py\nPython script\n/usr/local/bin/process-template.py"]
end
    
    subgraph "Key Files"
        BookToml["book.toml\nGenerated at line 102"]
SummaryMd["src/SUMMARY.md\nGenerated at line 130-186"]
HeaderHtml["templates/header.html\nInput template"]
FooterHtml["templates/footer.html\nInput template"]
end
    
    subgraph "Directory Structures"
        BookDir["/workspace/book/\nBuild workspace"]
SrcDir["/workspace/book/src/\nMarkdown sources"]
OutputDir["/output/\nVolume mount point"]
end
    
 
   BuildScript --> Step2Func
 
   Step2Func --> Step3Func
 
   Step3Func --> Step4Func
 
   Step4Func --> Step5Func
 
   Step5Func --> Step6Func
 
   Step6Func --> Step7Func
    
 
   Step2Func --> BookToml
 
   Step2Func --> BookDir
 
   Step3Func --> SummaryMd
 
   Step4Func --> ProcessTemplate
 
   Step4Func --> SrcDir
 
   Step5Func --> MdBookMermaid
 
   Step6Func --> MdBook
 
   Step7Func --> OutputDir
    
 
   HeaderHtml --> ProcessTemplate
 
   FooterHtml --> ProcessTemplate

Phase 3 Component Map

This diagram maps Phase 3 concepts to their concrete implementations in the codebase:

Sources: scripts/build-docs.sh:95-310


Phase 3 Error Handling

Phase 3 operates under set -e mode, causing immediate script termination on any command failure. Common failure scenarios:

| Failure Point | Cause | Impact |
| --- | --- | --- |
| Step 2 mkdir | Permissions issue, disk full | Script exits before book.toml creation |
| Step 3 file discovery | No markdown files in wiki/ | SUMMARY.md empty, mdBook build fails |
| Step 4 template processing | Invalid template syntax | HEADER_HTML or FOOTER_HTML empty, build continues without templates |
| Step 5 mermaid install | mdbook-mermaid binary missing | Script exits, no mermaid support in output |
| Step 6 mdbook build | Invalid markdown syntax, broken SUMMARY.md | Script exits, no HTML output |
| Step 7 copy | Permissions issue on /output | Script exits, no artifacts persisted |

All errors are reported to stderr and result in a non-zero exit code due to set -e.

Sources: scripts/build-docs.sh:2 scripts/build-docs.sh:95-310


Summary

Phase 3 transforms enhanced markdown into a searchable HTML documentation website through six orchestrated steps:

  1. Structure Initialization : Creates mdBook workspace and book.toml configuration
  2. SUMMARY.md Generation : Builds navigation structure with numeric sorting
  3. Markdown Processing : Injects templates and copies files to book source
  4. Mermaid Installation : Configures diagram rendering support
  5. Book Build : Executes mdBook to generate HTML
  6. Output Collection : Consolidates artifacts to /output/ volume

Phase 3 is entirely orchestrated by build-docs.sh and coordinates shell utilities, Python scripts, and Rust binaries to produce the final documentation website. The process is deterministic and idempotent, generating consistent output from the same input markdown.

Sources: scripts/build-docs.sh:95-310

SUMMARY.md Generation

Purpose and Scope

This document explains how the SUMMARY.md file is dynamically generated from the scraped markdown content structure. The SUMMARY.md file serves as mdBook’s table of contents, defining the navigation structure and page hierarchy for the generated HTML documentation.

For information about how the markdown files are initially organized during scraping, see Wiki Structure Discovery. For details about the overall mdBook build configuration, see Configuration Generation.

SUMMARY.md in mdBook

The SUMMARY.md file is mdBook’s primary navigation document. It defines:

  • The order of pages in the documentation
  • The hierarchical structure (chapters and sub-chapters)
  • The titles displayed in the navigation sidebar
  • Which markdown files map to which sections

mdBook parses SUMMARY.md to construct the entire book structure. Pages not listed in SUMMARY.md will not be included in the generated documentation.

Sources: build-docs.sh:108-161

Generation Process Overview

The SUMMARY.md generation occurs in Step 3 of the build pipeline (build-docs.sh:124-188), after markdown extraction is complete but before the mdBook build begins. The generation algorithm automatically discovers the file structure, applies numeric sorting to section files, and constructs a hierarchical table of contents.

SUMMARY.md Generation Pipeline

flowchart TD
    Start["Start Step 3:\nbuild-docs.sh:126"]
    Init["Write header:\necho '# Summary'\nLines 129-131"]
    ListFiles["List all files:\nmain_pages_list=$(ls $WIKI_DIR/*.md)"]
    FindOverview["Find overview_file:\ngrep -Ev '^[0-9]' | head -1\nLines 136-138"]
    HasOverview{"overview_file exists?"}
    ExtractOvTitle["Extract title:\nhead -1 | sed 's/^# //'\nLine 140"]
    WriteOv["Write: [$title]($overview_file)\nLines 141-143"]
    RemoveOv["Filter overview_file from list\nLine 143"]
    NumericSort["Numeric sort:\ngrep -E '^[0-9]' | sort -t- -k1 -n\nLines 147-155"]
    IterateMain["for file in main_pages\nLine 158"]
    ExtractTitle["title=$(head -1 | sed 's/^# //')"]
    ExtractSecNum["section_num=$(grep -oE '^[0-9]+')"]
    CheckSecDir{"[ -d section-$section_num ]?"}
    WriteSec["echo '- [$title]($filename)'\nLine 171"]
    IterateSub["ls section-$section_num/*.md | sort -t- -k1 -n\nLines 174-180"]
    WriteSub["echo '  - [$subtitle](section-$section_num/$subfilename)'"]
    WriteStandalone["echo '- [$title]($filename)'\nLine 183"]
    Complete["Redirect to src/SUMMARY.md\nLine 186"]
    LogCount["Log entry count:\ngrep -c '\\[' src/SUMMARY.md\nLine 188"]
    End["End Step 3"]
Start --> Init
 Init --> ListFiles
 ListFiles --> FindOverview
 FindOverview --> HasOverview
 HasOverview -->|Yes|ExtractOvTitle
 ExtractOvTitle --> WriteOv
 WriteOv --> RemoveOv
 RemoveOv --> NumericSort
 HasOverview -->|No|NumericSort
 NumericSort --> IterateMain
 IterateMain --> ExtractTitle
 ExtractTitle --> ExtractSecNum
 ExtractSecNum --> CheckSecDir
 CheckSecDir -->|Yes: Has subsections|WriteSec
 WriteSec --> IterateSub
 IterateSub --> WriteSub
 WriteSub --> IterateMain
 CheckSecDir -->|No: Standalone|WriteStandalone
 WriteStandalone --> IterateMain
 IterateMain -->|Done| Complete
 
   Complete --> LogCount
 
   LogCount --> End

Sources: build-docs.sh:124-188

The algorithm executes three key phases:

| Phase | Lines | Description |
|---|---|---|
| Overview Extraction | 133-145 | Identifies and writes the non-numbered introduction page |
| Numeric Sorting | 147-155 | Sorts numbered pages by numeric prefix using sort -t- -k1 -n |
| Hierarchical Writing | 158-185 | Iterates sorted pages, detecting and nesting subsections |

Sources: build-docs.sh:124-188

Algorithm Components

Step 1: Overview File Selection

The algorithm identifies a special overview file by searching for files without numeric prefixes. This file becomes the introduction page, written before the numbered sections.

Overview File Detection Algorithm

flowchart TD
    Start["main_pages_list =\nls $WIKI_DIR/*.md"]
    Filter["overview_file =\nawk -F/ '{print $NF}' | grep -Ev '^[0-9]' | head -1"]
    Check{"overview_file\nnot empty?"}
    Verify{"File exists?\n[ -f $WIKI_DIR/$overview_file ]"}
    Extract["title=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')"]
    Write["echo '[${title:-Overview}]($overview_file)'\necho ''"]
    Remove["main_pages_list = grep -v $overview_file"]
    Continue["Proceed to numeric sorting"]
Start --> Filter
 Filter --> Check
 Check -->|Yes|Verify
 Check -->|No|Continue
 Verify -->|Yes|Extract
 Verify -->|No| Continue
 
   Extract --> Write
 
   Write --> Remove
 
   Remove --> Continue

Sources: build-docs.sh:133-145

Detection Logic:

| Step | Command | Purpose | Example |
|---|---|---|---|
| List files | ls "$WIKI_DIR"/*.md | Get all root markdown files | Overview.md 1-intro.md 2-start.md |
| Extract basename | awk -F/ '{print $NF}' | Get filename only | Overview.md |
| Filter non-numeric | grep -Ev '^[0-9]' | Exclude numbered files | Overview.md (matches) |
| Take first | head -1 | Select single overview | Overview.md |
| Extract title | head -1 \| sed 's/^# //' | Get page title | |

The overview file is then excluded from main_pages_list before numeric sorting (build-docs.sh:143).

Sources: build-docs.sh:133-145

Step 2: Numeric Sorting Pipeline

After overview extraction, remaining files are sorted numerically by their leading number prefix. This ensures pages appear in logical order (e.g., 2-start.md before 10-advanced.md).

Numeric Sorting Implementation

flowchart LR
    Input["main_pages_list\n(filtered from overview)"]
Basename["awk -F/ '{print $NF}'\nExtract filename"]
GrepNum["grep -E '^[0-9]'\nKeep only numbered files"]
NumSort["sort -t- -k1 -n\nNumeric sort on first field"]
Reconstruct["while read fname; do\n echo $WIKI_DIR/$fname\ndone"]
Output["main_pages\n(sorted file paths)"]
Input --> Basename
 
   Basename --> GrepNum
 
   GrepNum --> NumSort
 
   NumSort --> Reconstruct
 
   Reconstruct --> Output

Sources: build-docs.sh:147-155

Sort Command Breakdown:

| Flag | Purpose | Example Effect |
|---|---|---|
| -t- | Set delimiter to - | Split 10-advanced.md into 10 and advanced.md |
| -k1 | Sort by field 1 | Use 10 as the sort key |
| -n | Numeric comparison | 2 sorts before 10 (not lexicographic) |

Example Sorting:

| Unsorted Filenames | Numeric Sort | Output Order |
|---|---|---|
| 10-advanced.md | Extract 10 | 1-overview.md |
| 2-start.md | Extract 2 | 2-start.md |
| 1-overview.md | Extract 1 | 5-components.md |
| 5-components.md | Extract 5 | 10-advanced.md |
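The same ordering can be reproduced in a shell (a minimal sketch using the flags above):

printf '%s\n' 10-advanced.md 2-start.md 1-overview.md 5-components.md | sort -t- -k1 -n
# 1-overview.md
# 2-start.md
# 5-components.md
# 10-advanced.md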

Sources: build-docs.sh:147-155

Step 3: Subsection Detection and Iteration

For each main page, the algorithm extracts the numeric prefix (section_num) and checks for a corresponding section-N/ directory. If found, all subsection files are written with 2-space indentation.

Subsection Detection and Writing Flow

flowchart TD
    LoopStart["for file in main_pages\nLine 158"]
    FileCheck["[ -f $file ] || continue"]
    GetBasename["filename=$(basename $file)"]
    ExtractTitle["title=$(head -1 $file | sed 's/^# //')\nLine 163"]
    ExtractNum["section_num=$(echo $filename |\ngrep -oE '^[0-9]+' || true)\nLine 166"]
    BuildPath["section_dir=$WIKI_DIR/section-$section_num\nLine 167"]
    CheckBoth{"[ -n $section_num ] &&\n[ -d $section_dir ]?"}
    WriteMain["echo '- [$title]($filename)'\nLine 171"]
    ListSubs["ls $section_dir/*.md 2>/dev/null |\nawk -F/ '{print $NF}' |\nsort -t- -k1 -n\nLines 174-175"]
    SubLoop["while read subname\nLine 174"]
    SubCheck["[ -f $section_dir/$subname ] || continue"]
    SubBasename["subfilename=$(basename $subfile)"]
    SubTitle["subtitle=$(head -1 $subfile | sed 's/^# //')\nLine 178"]
    SubWrite["echo '  - [$subtitle](section-$section_num/$subfilename)'\nLine 179"]
    WriteStandalone["echo '- [$title]($filename)'\nLine 183"]
    NextFile["Continue loop"]
LoopStart --> FileCheck
 FileCheck --> GetBasename
 GetBasename --> ExtractTitle
 ExtractTitle --> ExtractNum
 ExtractNum --> BuildPath
 BuildPath --> CheckBoth
 CheckBoth -->|Yes: Has subsections|WriteMain
 WriteMain --> ListSubs
 ListSubs --> SubLoop
 SubLoop --> SubCheck
 SubCheck --> SubBasename
 SubBasename --> SubTitle
 SubTitle --> SubWrite
 SubWrite --> SubLoop
 SubLoop -->|Done|NextFile
 CheckBoth -->|No: Standalone| WriteStandalone
 
   WriteStandalone --> NextFile
    
 
   NextFile --> LoopStart

Sources: build-docs.sh:158-185

Variable Mapping:

| Variable | Type | Purpose | Example Value |
|---|---|---|---|
| filename | String | Main page filename | 5-component-reference.md |
| title | String | Main page title | Component Reference |
| section_num | String | Numeric prefix | 5 |
| section_dir | Path | Subsection directory | /workspace/wiki/section-5 |
| subfilename | String | Subsection filename | 5.1-build-docs.md |
| subtitle | String | Subsection title | build-docs.sh Orchestrator |

Sources: build-docs.sh:158-185

Step 4: Subsection Numeric Sorting

Subsection files within section-N/ directories undergo the same numeric sorting as main pages, ensuring proper ordering (e.g., 5.2 before 5.10).

Subsection Sorting Pipeline

flowchart LR
    Dir["section_dir/\nsection-5/"]
    List["ls $section_dir/*.md 2>/dev/null"]
    Awk["awk -F/ '{print $NF}'"]
    Sort["sort -t- -k1 -n"]
    Loop["while read subname"]
    Verify["[ -f $subfile ] || continue"]
    Extract["subtitle=$(head -1 | sed 's/^# //')"]
    Write["echo '  - [$subtitle](section-N/$subfilename)'"]
Dir --> List
 
   List --> Awk
 
   Awk --> Sort
 
   Sort --> Loop
 
   Loop --> Verify
 
   Verify --> Extract
 
   Extract --> Write
 
   Write --> Loop

Sources: build-docs.sh:174-180

Key Implementation Details:

| Aspect | Implementation | Purpose |
|---|---|---|
| Indentation | echo '  - [$subtitle](...)' | Two spaces for mdBook nesting |
| Path prefix | section-$section_num/$subfilename | Correct relative path for mdBook |
| Numeric sort | sort -t- -k1 -n | Same algorithm as main pages |
| Title extraction | head -1 \| sed 's/^# //' | Same pattern as main pages |

Sorting Example:

Input files:        After sort:         SUMMARY.md output:
section-5/          section-5/          - [Component Reference](5-component-reference.md)
├── 5.10-tools.md   ├── 5.1-build.md      - [build-docs.sh](section-5/5.1-build.md)
├── 5.2-scraper.md  ├── 5.2-scraper.md    - [Scraper](section-5/5.2-scraper.md)
└── 5.1-build.md    └── 5.10-tools.md     - [Tools](section-5/5.10-tools.md)

Sources: build-docs.sh:174-180

File Structure Conventions

The generation algorithm depends on the file structure created during markdown extraction (see Wiki Structure Discovery):

Diagram: File Structure Conventions for SUMMARY.md Generation

| Pattern | Location | SUMMARY.md Output |
|---|---|---|
| *.md | Root directory | Main pages |
| N-*.md | Root directory | Main section (if section-N/ exists) |
| *.md | section-N/ directory | Subsections (indented under section N) |

Sources: build-docs.sh:126-158

Title Extraction Method

All page titles are extracted using a consistent pattern:
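title=$(head -1 "$file" | sed 's/^# //')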

This assumes that every markdown file begins with a level-1 heading (# Title). The sed command removes the # prefix, leaving only the title text.

Extraction Pipeline:

| Command | Purpose | Example Input | Example Output |
|---|---|---|---|
| head -1 "$file" | Get first line | # Component Reference | # Component Reference |
| sed 's/^# //' | Remove heading syntax | # Component Reference | Component Reference |

Sources: build-docs.sh:120 build-docs.sh:134 build-docs.sh:150

Output Format

The generated SUMMARY.md follows mdBook’s syntax:
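An illustrative example assembled from the format rules below (titles and filenames are hypothetical):

    # Summary

    [Overview](overview.md)

    - [Getting Started](1-getting-started.md)
    - [Component Reference](5-component-reference.md)
      - [build-docs.sh Orchestrator](section-5/5.1-build-docs.md)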

Format Rules:

| Element | Syntax | Purpose |
|---|---|---|
| Header | # Summary | Required mdBook header |
| Introduction | [Title](file.md) | First page (no bullet) |
| Main Page | - [Title](file.md) | Top-level navigation item |
| Section Header | # Section Name | Visual grouping in sidebar |
| Subsection | - [Title](section-N/file.md), 2-space indent | Nested under main section |

Sources: build-docs.sh:113-159

flowchart TD
    Step3["Step 3 Comment\nLines 124-126"]
    Header["Write Header Block\nLines 129-131\necho '# Summary'; echo ''"]
    Overview["Overview Detection\nLines 133-145\nmain_pages_list=$(ls)\noverview_file=$(grep -Ev '^[0-9]')\nif [ -n $overview_file ]; then\n  title=$(head -1)\n  echo '[$title]($overview_file)'\n  main_pages_list=$(grep -v $overview_file)\nfi"]
    NumSort["Numeric Sort Block\nLines 147-155\nmain_pages=$(printf '%s' $main_pages_list |\nawk -F/ '{print $NF}' | grep -E '^[0-9]' |\nsort -t- -k1 -n | while read fname; do\n  echo $WIKI_DIR/$fname\ndone)"]
    MainLoop["Main Page Loop\nLines 158-185\necho $main_pages | while read file; do\n  filename=$(basename $file)\n  title=$(head -1 | sed 's/^# //')\n  section_num=$(grep -oE '^[0-9]+')\n  section_dir=$WIKI_DIR/section-$section_num\n  if [ -n $section_num ] && [ -d $section_dir ]; then ... fi\ndone"]
    Subsection["Subsection Block\nLines 174-180\nls $section_dir/*.md |\nawk -F/ '{print $NF}' | sort -t- -k1 -n |\nwhile read subname; do\n  subtitle=$(head -1)\n  echo '  - [$subtitle](...)'\ndone"]
    Redirect["Output Redirection\nLine 186\n} > src/SUMMARY.md"]
    Log["Entry Count Log\nLine 188\necho 'Generated SUMMARY.md with\n$(grep -c '\\[' src/SUMMARY.md) entries'"]
Step3 --> Header
 
   Header --> Overview
 
   Overview --> NumSort
 
   NumSort --> MainLoop
 
   MainLoop --> Subsection
 
   Subsection --> MainLoop
 
   MainLoop --> Redirect
 
   Redirect --> Log

Implementation Code Mapping

The SUMMARY.md generation is implemented in a single contiguous block within build-docs.sh. The following diagram maps algorithm phases to specific line ranges:

Code Structure and Execution Flow

Sources: build-docs.sh:124-188

Shell Variable Reference:

| Variable | Scope | Type | Initialization | Example Value |
|---|---|---|---|---|
| WIKI_DIR | Global | Path | Line 28 | /workspace/wiki |
| main_pages_list | Local | String | Line 135 | Multi-line list of paths |
| overview_file | Local | String | Line 138 | Overview.md |
| main_pages | Local | String | Line 147 | Sorted, newline-separated paths |
| filename | Loop | String | Line 160 | 5-component-reference.md |
| title | Loop | String | Line 163 | Component Reference |
| section_num | Loop | String | Line 166 | 5 |
| section_dir | Loop | Path | Line 167 | /workspace/wiki/section-5 |
| subfilename | Nested Loop | String | Line 177 | 5.1-build-docs.md |
| subtitle | Nested Loop | String | Line 178 | build-docs.sh Orchestrator |

Sources: build-docs.sh:28 build-docs.sh:124-188

Generation Statistics and Output

After generation completes, the script logs statistical information about the generated SUMMARY.md file:

Entry Counting Logic
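A sketch of the logging command (matching the grep -c description below):

echo "Generated SUMMARY.md with $(grep -c '\[' src/SUMMARY.md) entries"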

Sources: build-docs.sh:188

The grep -c command counts lines containing the [ character, which appears in every markdown link. This count includes:

| Entry Type | Syntax | Count Contribution |
|---|---|---|
| Overview | [Title](file.md) | +1 |
| Main pages | - [Title](file.md) | +1 per main page |
| Subsections | - [Title](section-N/file.md), indented | +1 per subsection |

Example Output:

Generated SUMMARY.md with 23 entries

This indicates the book contains 23 total navigation links (overview + main pages + all subsections combined).

Sources: build-docs.sh:188

Integration with mdBook Build

The generated src/SUMMARY.md file is used directly by mdBook during the build process (Step 6):

  1. mdBook reads src/SUMMARY.md to determine book structure
  2. For each entry, mdBook looks up the corresponding markdown file in src/
  3. Files are processed in the order they appear in SUMMARY.md
  4. The navigation sidebar reflects the hierarchy defined in SUMMARY.md

The generation happens in build-docs.sh:108-161, markdown files are copied to src/ in build-docs.sh:166, and the mdBook build executes in build-docs.sh:176.

Sources: build-docs.sh:108-176

Template Injection

Purpose and Scope

Template Injection is the process of inserting processed HTML header and footer content into each markdown file during Phase 3 of the build pipeline. This occurs after SUMMARY.md Generation and before the final mdBook build. The system reads HTML template files, performs variable substitution and conditional rendering, and prepends/appends the resulting HTML to every markdown file in the book structure.

For information about the template system architecture and customization options, see Template System and Template System Details. For the complete Phase 3 pipeline, see Phase 3: mdBook Build.


Template Processing Architecture

The template injection system consists of two stages: template processing (variable substitution and conditional evaluation) and content injection (inserting processed HTML into markdown files).

graph TB
    subgraph "Input Sources"
        HeaderTemplate["header.html\n/workspace/templates/"]
FooterTemplate["footer.html\n/workspace/templates/"]
EnvVars["Environment Variables\nREPO, BOOK_TITLE,\nGIT_REPO_URL, etc."]
end
    
    subgraph "Template Processing"
        ProcessScript["process-template.py"]
ParseVars["Parse Variables\nVAR=value args"]
ReadTemplate["Read Template File"]
ProcessConditionals["Process Conditionals\n{{#if VAR}}...{{/if}}"]
SubstituteVars["Substitute Variables\n{{VAR}}"]
StripComments["Strip HTML Comments"]
end
    
    subgraph "Processed Output"
        HeaderHTML["HEADER_HTML\nShell Variable"]
FooterHTML["FOOTER_HTML\nShell Variable"]
end
    
 
   HeaderTemplate --> ProcessScript
 
   FooterTemplate --> ProcessScript
 
   EnvVars --> ParseVars
 
   ParseVars --> ProcessScript
 
   ProcessScript --> ReadTemplate
 
   ReadTemplate --> ProcessConditionals
 
   ProcessConditionals --> SubstituteVars
 
   SubstituteVars --> StripComments
 
   StripComments --> HeaderHTML
 
   StripComments --> FooterHTML

Template Processing Flow

Sources: scripts/build-docs.sh:195-234 python/process-template.py:11-50


Template File Discovery

The system locates template files using configurable paths with sensible defaults. Template discovery follows a priority order that allows for custom template overrides.

| Configuration Variable | Default Value | Purpose |
|---|---|---|
| TEMPLATE_DIR | /workspace/templates | Base directory for template files |
| HEADER_TEMPLATE | $TEMPLATE_DIR/header.html | Path to header template |
| FOOTER_TEMPLATE | $TEMPLATE_DIR/footer.html | Path to footer template |

If a template file is not found, the system emits a warning and continues without that template component, allowing for header-only or footer-only configurations.

Sources: scripts/build-docs.sh:195-197


Variable Substitution Mechanism

The process_template function in process-template.py performs two types of text replacement: conditional blocks and simple variable substitution.

Simple Variable Substitution

Variables use the {{VARIABLE_NAME}} syntax and are replaced with their corresponding values passed as command-line arguments.

graph LR
    Template["Template: {{REPO}}"]
Variables["Variables:\nREPO=owner/repo"]
ProcessTemplate["process_template()\npython/process-template.py:11-50"]
Result["Result: owner/repo"]
Template --> ProcessTemplate
 
   Variables --> ProcessTemplate
 
   ProcessTemplate --> Result

The substitution pattern r'\{\{(\w+)\}\}' matches variable placeholders, and the replace_variable function looks up values in the variables dictionary. If a variable is not found, an empty string is substituted.

Sources: python/process-template.py:38-45

Conditional Rendering

Conditional blocks use {{#if VARIABLE}}...{{/if}} syntax to conditionally include content based on variable presence and non-empty values.

graph TB
    ConditionalPattern["Pattern:\n{{#if VAR}}...{{/if}}"]
CheckVar{"Variable exists\nAND non-empty?"}
IncludeContent["Include Content"]
RemoveBlock["Remove Entire Block"]
ConditionalPattern --> CheckVar
 
   CheckVar -->|Yes| IncludeContent
 
   CheckVar -->|No| RemoveBlock

The regex pattern r'\{\{#if\s+(\w+)\}\}(.*?)\{\{/if\}\}' captures both the variable name and the content block. The replace_conditional function evaluates the condition using if var_name in variables and variables[var_name].

Sources: python/process-template.py:24-36


Available Template Variables

The following variables are passed to template processing and can be used in both header and footer templates:

| Variable | Source | Example Value |
|---|---|---|
| DEEPWIKI_URL | Constructed from REPO | https://deepwiki.com/owner/repo |
| DEEPWIKI_BADGE_URL | Static URL | https://deepwiki.com/badge.svg |
| GIT_REPO_URL | Environment or derived | https://github.com/owner/repo |
| GITHUB_BADGE_URL | Constructed from REPO | https://img.shields.io/badge/... |
| REPO | Environment or Git detection | owner/repo |
| BOOK_TITLE | Environment or default | Documentation |
| BOOK_AUTHORS | Environment or REPO_OWNER | owner |
| GENERATION_DATE | date -u command | January 15, 2024 at 14:30 UTC |

Sources: scripts/build-docs.sh:199-213 scripts/build-docs.sh:221-230


Template Processing Invocation

The shell script invokes process-template.py twice: once for the header and once for the footer. Each invocation passes all variables as command-line arguments in KEY=value format.

Header Processing

The processed HTML is captured in the HEADER_HTML shell variable for later injection.
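A sketch of that invocation, assuming the template path is passed first (the actual argument list covers every variable in the table above):

HEADER_HTML=$(python3 /usr/local/bin/process-template.py "$HEADER_TEMPLATE" \
  REPO="$REPO" \
  BOOK_TITLE="$BOOK_TITLE" \
  GIT_REPO_URL="$GIT_REPO_URL" \
  GENERATION_DATE="$GENERATION_DATE")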

Sources: scripts/build-docs.sh:202-213

Footer processing follows the same pattern but stores the result in FOOTER_HTML.

Sources: scripts/build-docs.sh:219-230


graph TB
    Start["Start Injection"]
CheckTemplates{"HEADER_HTML or\nFOOTER_HTML set?"}
SkipInjection["Skip injection\nCopy files as-is"]
FindFiles["Find all .md files\nsrc/*.md src/*/*.md"]
ProcessFile["For each file"]
ReadOriginal["Read original content"]
CreateTemp["Create temp file:\nheader + content + footer"]
ReplaceOriginal["mv temp to original"]
CountFiles["Increment file_count"]
ReportCount["Report processed count"]
Start --> CheckTemplates
 
   CheckTemplates -->|No| SkipInjection
 
   CheckTemplates -->|Yes| FindFiles
 
   FindFiles --> ProcessFile
 
   ProcessFile --> ReadOriginal
 
   ReadOriginal --> CreateTemp
 
   CreateTemp --> ReplaceOriginal
 
   ReplaceOriginal --> CountFiles
 
   CountFiles --> ProcessFile
 
   ProcessFile --> ReportCount

Markdown File Injection

After templates are processed, the system injects the resulting HTML into all markdown files in the book structure. Injection occurs by creating a temporary file with concatenated content and then replacing the original.

Injection Algorithm

Sources: scripts/build-docs.sh:236-261

File Processing Pattern

The injection loop processes files matching the glob patterns src/*.md and src/*/*.md, covering both root-level pages and subsection pages.

The temporary file approach ensures atomic writes and prevents partial content if processing is interrupted.
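A sketch of the loop under the glob and temp-file behavior just described (HEADER_HTML, FOOTER_HTML, and the file-count message follow the surrounding text; the exact separators are assumptions):

file_count=0
for file in src/*.md src/*/*.md; do
  [ -f "$file" ] || continue
  # header + original content + footer, then atomic replace
  {
    printf '%s\n\n' "$HEADER_HTML"
    cat "$file"
    printf '\n\n%s\n' "$FOOTER_HTML"
  } > "$file.tmp"
  mv "$file.tmp" "$file"
  file_count=$((file_count + 1))
done
echo "Processed $file_count files with templates"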

Sources: scripts/build-docs.sh:243-257


graph LR
    CopyFiles["Copy markdown files\nto src/"]
ProcessHeader["Process header.html\n→ HEADER_HTML"]
ProcessFooter["Process footer.html\n→ FOOTER_HTML"]
InjectLoop["Inject into each\n.md file"]
InstallMermaid["Install mdbook-mermaid"]
BuildBook["mdbook build"]
CopyFiles --> ProcessHeader
 
   ProcessHeader --> ProcessFooter
 
   ProcessFooter --> InjectLoop
 
   InjectLoop --> InstallMermaid
 
   InstallMermaid --> BuildBook

Template Injection Sequence

Template injection occurs at a specific point in the Phase 3 pipeline, after markdown files are copied but before the mdBook build.

This ordering ensures that:

  1. SUMMARY.md is generated with original titles before injection
  2. All markdown files exist in their final locations
  3. Templates are processed once and reused for all files
  4. mdBook receives fully-decorated markdown files

Sources: scripts/build-docs.sh:190-271


Example Template Structure

The default header template demonstrates the use of variables and conditionals:
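An illustrative sketch (the shipped templates/header.html differs in detail; the variables are those listed earlier):

{{#if GIT_REPO_URL}}<div style="text-align: right"><a href="{{DEEPWIKI_URL}}"><img src="{{DEEPWIKI_BADGE_URL}}" alt="DeepWiki"></a> <a href="{{GIT_REPO_URL}}"><img src="{{GITHUB_BADGE_URL}}" alt="GitHub"></a></div>{{/if}}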

Key observations:

  • Conditional wrapping prevents broken links when GIT_REPO_URL is unset
  • Inline conditionals prevent mdBook from wrapping content in <p> tags
  • Style attributes provide layout control within markdown constraints

Sources: templates/header.html:1-8


Skipping Template Injection

Template injection can be effectively disabled by:

  1. Not providing template files at the expected paths
  2. Setting HEADER_TEMPLATE or FOOTER_TEMPLATE to non-existent paths
  3. Mounting empty template files via volume mounts

When templates are not found, the system emits warnings but continues:

Warning: Header template not found at /workspace/templates/header.html, skipping...
Warning: Footer template not found at /workspace/templates/footer.html, skipping...

Files are then copied without modification using the fallback path.

Sources: scripts/build-docs.sh:214-217 scripts/build-docs.sh:232-234


Template Processing Error Handling

The process-template.py script performs validation at startup:

Error conditions result in non-zero exit codes, which would cause the shell script to abort due to set -e on line 2.

Sources: python/process-template.py:53-78 scripts/build-docs.sh:2


Performance Characteristics

Template processing and injection have the following performance characteristics:

| Operation | Complexity | Typical Duration |
|---|---|---|
| Template file read | O(1) | < 1 ms per file |
| Variable substitution | O(n × m), n = template size, m = variable count | < 5 ms per template |
| Conditional evaluation | O(n × c), c = conditional count | < 5 ms per template |
| Per-file injection | O(f), f = file size | < 10 ms per file |
| Total injection time | O(files × file_size) | ~100-500 ms for typical wikis |

The system reports progress via the file count message: "Processed N files with templates".

Sources: scripts/build-docs.sh:258


Integration with mdBook

The injected HTML content becomes part of the markdown source that mdBook processes. mdBook’s HTML renderer:

  1. Preserves raw HTML blocks in markdown
  2. Applies syntax highlighting to code blocks
  3. Processes Mermaid diagrams via mdbook-mermaid preprocessor
  4. Wraps content in the theme’s page structure

This allows the templates to provide page-level customization that appears consistently across all pages while leveraging mdBook’s built-in features for navigation, search, and responsive layout.

Sources: scripts/build-docs.sh:264-271

CI/CD Integration

Purpose and Scope

This page provides an overview of the continuous integration and continuous deployment (CI/CD) infrastructure for the DeepWiki-to-mdBook system. The system includes three distinct integration patterns:

  1. Automated Build and Deploy Workflow - Weekly scheduled builds with manual trigger support (#9.1)
  2. Continuous Testing Workflow - Automated Python tests on every push and pull request (#9.2)
  3. Reusable GitHub Action - Packaged action for use in other repositories (#9.3)

All three patterns share the same underlying Docker infrastructure documented in Docker Multi-Stage Build, ensuring consistency between local development, testing, and production deployments.

For information about the Docker container architecture itself, see Docker Multi-Stage Build. For configuration options, see Configuration Reference.

CI/CD Architecture Overview

The CI/CD system consists of three independently triggered workflows that share a common Docker image but serve different purposes in the development lifecycle.

Diagram: CI/CD Trigger and Execution Patterns

graph TB
    subgraph "Trigger Sources"
        Schedule["Weekly Schedule\nSundays 00:00 UTC"]
ManualDispatch["Manual Dispatch\nworkflow_dispatch"]
Push["Push to main"]
PullRequest["Pull Request"]
ExternalRepo["External Repository\nuses action"]
end
    
    subgraph "Build and Deploy Workflow"
        BuildJob["build job\n.github/workflows/build-and-deploy.yml"]
ResolveStep["Resolve repository\nand book title"]
BuildxSetup["Setup Docker Buildx"]
DockerBuild1["Build deepwiki-to-mdbook\nDocker image"]
RunContainer1["Run docker container\nwith env vars"]
UploadArtifact["Upload Pages artifact\n./output/book"]
DeployJob["deploy job\nneeds: build"]
DeployPages["Deploy to GitHub Pages"]
end
    
    subgraph "Test Workflow"
        PytestJob["pytest job\n.github/workflows/tests.yml"]
SetupPython["Setup Python 3.12"]
InstallDeps["Install requirements.txt\nand pytest"]
RunTests["Run pytest python/tests/"]
end
    
    subgraph "Reusable Action"
        ActionDef["action.yml\ncomposite action"]
BuildImage["Build Docker image\nwith GITHUB_RUN_ID tag"]
RunContainer2["Run docker container\nwith input parameters"]
end
    
 
   Schedule --> BuildJob
 
   ManualDispatch --> BuildJob
 
   Push --> PytestJob
 
   PullRequest --> PytestJob
 
   ExternalRepo --> ActionDef
    
 
   BuildJob --> ResolveStep
 
   ResolveStep --> BuildxSetup
 
   BuildxSetup --> DockerBuild1
 
   DockerBuild1 --> RunContainer1
 
   RunContainer1 --> UploadArtifact
 
   UploadArtifact --> DeployJob
 
   DeployJob --> DeployPages
    
 
   PytestJob --> SetupPython
 
   SetupPython --> InstallDeps
 
   InstallDeps --> RunTests
    
 
   ActionDef --> BuildImage
 
   BuildImage --> RunContainer2

Sources: .github/workflows/build-and-deploy.yml:1-90 .github/workflows/tests.yml:1-26 action.yml:1-53

Workflow Triggers and Scheduling

The system employs multiple trigger mechanisms to balance automation, manual control, and external integration:

| Trigger Type | Workflow | Configuration | Purpose |
|---|---|---|---|
| schedule | Build and Deploy | cron: "0 0 * * 0" | Weekly documentation refresh on Sundays at midnight UTC |
| workflow_dispatch | Build and Deploy | Manual inputs supported | On-demand builds with custom parameters |
| push | Tests | Branch: main | Validate changes merged to main branch |
| pull_request | Tests | All PRs | Pre-merge validation of proposed changes |
| uses | GitHub Action | N/A | External repositories invoke the action |

Sources: .github/workflows/build-and-deploy.yml:3-9 .github/workflows/tests.yml:3-8

Shared Docker Infrastructure

All three CI/CD patterns build and execute the same Docker image, ensuring consistent behavior across environments. The image build process follows this pattern:

Diagram: Docker Image Build and Execution Flow

graph LR
    subgraph "Source Files"
        Dockerfile["Dockerfile\nMulti-stage build"]
PythonScripts["Python scripts\ndeepwiki-scraper.py\nprocess-template.py"]
ShellScripts["Shell scripts\nbuild-docs.sh"]
Templates["templates/\nheader.html, footer.html"]
end
    
    subgraph "Build Stage"
        BuildCommand["docker build -t\ndeepwiki-to-mdbook"]
RustStage["Stage 1: rust:latest\nBuild mdBook binaries"]
PythonStage["Stage 2: python:3.12-slim\nInstall Python deps\nCopy executables"]
end
    
    subgraph "Execution"
        RunCommand["docker run --rm"]
EnvVars["Environment variables\nREPO, BOOK_TITLE, etc."]
VolumeMount["-v output:/output"]
Entrypoint["CMD: build-docs.sh"]
end
    
    subgraph "Output"
        OutputDir["./output/book/\nHTML documentation"]
MarkdownDir["./output/markdown/\nEnhanced markdown"]
end
    
 
   Dockerfile --> BuildCommand
 
   PythonScripts --> BuildCommand
 
   ShellScripts --> BuildCommand
 
   Templates --> BuildCommand
    
 
   BuildCommand --> RustStage
 
   RustStage --> PythonStage
 
   PythonStage --> RunCommand
    
 
   EnvVars --> RunCommand
 
   VolumeMount --> RunCommand
 
   RunCommand --> Entrypoint
    
 
   Entrypoint --> OutputDir
 
   Entrypoint --> MarkdownDir

Sources: .github/workflows/build-and-deploy.yml:60-72 action.yml:30-52

Permissions and Concurrency Control

The Build and Deploy workflow requires elevated permissions for GitHub Pages deployment, while the Test workflow operates with default read permissions.

Build and Deploy Permissions
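The corresponding block, reconstructed from the permissions table in Build and Deploy Workflow:

permissions:
  contents: read
  pages: write
  id-token: write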

Concurrency Management

The Build and Deploy workflow enforces single-instance execution to prevent deployment conflicts:
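A sketch of the block (group name per the description in Build and Deploy Workflow):

concurrency:
  group: "pages"
  cancel-in-progress: false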

This configuration ensures that if a build is already running, subsequent triggers will wait rather than canceling the in-progress deployment.

Sources: .github/workflows/build-and-deploy.yml:11-20

Environment Variable Passing

All CI/CD patterns pass configuration to the Docker container via environment variables. The Build and Deploy workflow includes a resolution step that provides defaults when manual inputs are not specified:

Diagram: Environment Variable Resolution Flow

graph TD
    subgraph "Input Resolution"
        ManualInput["workflow_dispatch inputs\nrepo, book_title"]
CheckRepo{"repo input\nprovided?"}
CheckTitle{"book_title input\nprovided?"}
UseInputRepo["Use manual input"]
UseGitHubRepo["Use github.repository"]
UseInputTitle["Use manual input"]
GenTitle["Generate title from\nrepository name"]
end
    
    subgraph "Docker Execution"
        RepoVar["-e REPO"]
TitleVar["-e BOOK_TITLE"]
DockerRun["docker run deepwiki-to-mdbook"]
end
    
    subgraph "build-docs.sh"
        ReadEnv["Read environment variables"]
FetchWiki["Fetch from DeepWiki"]
BuildBook["Build mdBook"]
end
    
 
   ManualInput --> CheckRepo
 
   CheckRepo -->|Yes| UseInputRepo
 
   CheckRepo -->|No| UseGitHubRepo
    
 
   ManualInput --> CheckTitle
 
   CheckTitle -->|Yes| UseInputTitle
 
   CheckTitle -->|No| GenTitle
    
 
   UseInputRepo --> RepoVar
 
   UseGitHubRepo --> RepoVar
 
   UseInputTitle --> TitleVar
 
   GenTitle --> TitleVar
    
 
   RepoVar --> DockerRun
 
   TitleVar --> DockerRun
 
   DockerRun --> ReadEnv
 
   ReadEnv --> FetchWiki
 
   FetchWiki --> BuildBook

Sources: .github/workflows/build-and-deploy.yml:30-58 .github/workflows/build-and-deploy.yml:66-72

Job Dependencies and Artifact Flow

The Build and Deploy workflow uses a two-job structure with explicit dependency ordering:

  1. Build Job - Executes Docker container, uploads Pages artifact
  2. Deploy Job - Depends on build completion, deploys to GitHub Pages

The jobs communicate via the GitHub Actions artifact system:

| Job | Step | Artifact Operation | Path |
|---|---|---|---|
| build | Upload artifact | actions/upload-pages-artifact@v3 | ./output/book |
| deploy | Deploy to Pages | actions/deploy-pages@v4 | Artifact from build |

Sources: .github/workflows/build-and-deploy.yml:74-89

Integration Points Summary

The following table summarizes how external systems interact with the CI/CD infrastructure:

| Integration Point | Method | Configuration | Output |
|---|---|---|---|
| DeepWiki API | HTTP scraping during build | REPO environment variable | Markdown content |
| GitHub Pages | Artifact deployment | actions/deploy-pages@v4 | Hosted HTML site |
| External Repositories | GitHub Action | uses: jzombie/deepwiki-to-mdbook@main | Local output directory |
| Git Metadata | Auto-detection in container | Falls back to git remote -v | Repository URL |

Sources: .github/workflows/build-and-deploy.yml:1-90 action.yml:1-53

Build and Deploy Workflow

Purpose and Scope

This page documents the Build and Deploy Workflow (.github/workflows/build-and-deploy.yml), which automates the process of building documentation from DeepWiki content and deploying it to GitHub Pages. The workflow runs on a weekly schedule and supports manual triggers with configurable parameters.

For information about testing the system’s Python components, see Test Workflow. For using this system in other repositories via the reusable action, see GitHub Action.

Workflow Overview

The Build and Deploy Workflow consists of two sequential jobs: build and deploy. The build job constructs the Docker image, generates the documentation artifacts, and uploads them. The deploy job then publishes these artifacts to GitHub Pages.

Sources: .github/workflows/build-and-deploy.yml:1-90

graph TB
    subgraph "Trigger Mechanisms"
        schedule["schedule\ncron: '0 0 * * 0'\n(Weekly Sunday 00:00 UTC)"]
manual["workflow_dispatch\n(Manual trigger)"]
end
    
    subgraph "Build Job"
        checkout["actions/checkout@v4\nClone repository"]
resolve["Resolve repository and book title\nCustom shell script"]
buildx["docker/setup-buildx-action@v3\nSetup Docker Buildx"]
dockerbuild["docker build -t deepwiki-to-mdbook ."]
dockerrun["docker run\nGenerate documentation"]
upload["actions/upload-pages-artifact@v3\nUpload ./output/book"]
end
    
    subgraph "Deploy Job"
        deploy["actions/deploy-pages@v4\nPublish to GitHub Pages"]
end
    
 
   schedule --> checkout
 
   manual --> checkout
 
   checkout --> resolve
 
   resolve --> buildx
 
   buildx --> dockerbuild
 
   dockerbuild --> dockerrun
 
   dockerrun --> upload
 
   upload --> deploy
    
 
   deploy --> pages["GitHub Pages\ngithub.io site"]

Trigger Configuration

Schedule-Based Execution

The workflow executes automatically on a weekly basis using a cron schedule:
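on:
  schedule:
    - cron: "0 0 * * 0"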

This runs every Sunday at midnight UTC, ensuring the documentation stays synchronized with the latest DeepWiki content.

Manual Dispatch

The workflow_dispatch trigger allows on-demand execution through the GitHub Actions UI. Unlike the scheduled run, manual triggers can accept custom inputs for repository and book title configuration.

Sources: .github/workflows/build-and-deploy.yml:3-9

Permissions and Concurrency

GitHub Token Permissions

The workflow requires specific permissions for GitHub Pages deployment:

| Permission | Level | Purpose |
|---|---|---|
| contents | read | Access repository code for building |
| pages | write | Deploy artifacts to GitHub Pages |
| id-token | write | OIDC token for secure deployment |

Concurrency Control

The concurrency configuration ensures only one deployment runs at a time using the pages group. Setting cancel-in-progress: false means new deployments wait for running deployments to complete rather than canceling them.

Sources: .github/workflows/build-and-deploy.yml:11-20

Build Job Architecture

Sources: .github/workflows/build-and-deploy.yml:23-78

Step 1: Repository Checkout

The workflow begins by checking out the repository using actions/checkout@v4:
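- uses: actions/checkout@v4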

This provides access to the Dockerfile and all scripts needed for the build process.

Sources: .github/workflows/build-and-deploy.yml:27-28

Step 2: Repository and Title Resolution

The resolution step implements fallback logic for configuring the documentation build:

Repository Resolution:

  1. Check if github.event.inputs.repo was provided (manual trigger)
  2. If not provided, use github.repository (current repository)
  3. Store result in repo_value output

Title Resolution:

  1. Check if github.event.inputs.book_title was provided
  2. If not provided, extract the repository name and append " Documentation"
  3. If extraction fails, use “Documentation” as fallback
  4. Store result in title_value output

The resolved values are written to GITHUB_OUTPUT for use in subsequent steps:
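A sketch of the pattern (the shell variable names are illustrative):

echo "repo_value=$REPO" >> "$GITHUB_OUTPUT"
echo "title_value=$BOOK_TITLE" >> "$GITHUB_OUTPUT"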

Sources: .github/workflows/build-and-deploy.yml:30-58

Step 3: Docker Buildx Setup

The workflow uses Docker Buildx for enhanced build capabilities:
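- uses: docker/setup-buildx-action@v3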

This provides BuildKit features, including improved caching and multi-platform builds (though this workflow builds for a single platform).

Sources: .github/workflows/build-and-deploy.yml:60-61

Step 4: Docker Image Build

The Docker image is built using the Dockerfile in the repository root:
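docker build -t deepwiki-to-mdbook .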

This creates the deepwiki-to-mdbook image containing:

  • mdBook and mdbook-mermaid binaries
  • Python 3.12 runtime
  • deepwiki-scraper.py and process-template.py scripts
  • build-docs.sh orchestrator
  • Default templates

For details on the Docker image structure, see Docker Multi-Stage Build.

Sources: .github/workflows/build-and-deploy.yml:63-64

Step 5: Documentation Builder Execution

The Docker container is executed with environment variables and a volume mount:
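A sketch assembled from the configuration table below:

docker run --rm \
  -e REPO="${{ steps.resolve.outputs.repo_value }}" \
  -e BOOK_TITLE="${{ steps.resolve.outputs.title_value }}" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook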

Container Configuration:

| Component | Value | Purpose |
|---|---|---|
| --rm | Flag | Remove container after execution |
| -e REPO | Resolved repository | Configure documentation source |
| -e BOOK_TITLE | Resolved title | Set book metadata |
| -v $(pwd)/output:/output | Volume mount | Persist generated artifacts |
| Image | deepwiki-to-mdbook | Previously built image |

The container runs build-docs.sh (default CMD), which orchestrates the three-phase pipeline:

  1. Phase 1: Extract markdown from DeepWiki (see Markdown Extraction)
  2. Phase 2: Enhance with diagrams (see Diagram Enhancement)
  3. Phase 3: Build mdBook artifacts (see mdBook Build)

Output is written to $(pwd)/output, which maps to /output inside the container.

Sources: .github/workflows/build-and-deploy.yml:66-72

Step 6: GitHub Pages Artifact Upload

The generated book directory is uploaded as a GitHub Pages artifact:
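- uses: actions/upload-pages-artifact@v3
  with:
    path: ./output/book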

The actions/upload-pages-artifact action packages ./output/book into a tarball optimized for Pages deployment. This artifact is then available to the deploy job.

Sources: .github/workflows/build-and-deploy.yml:74-77

Deploy Job Architecture

Sources: .github/workflows/build-and-deploy.yml:79-90

Job Dependencies and Environment

The deploy job has a strict dependency on the build job:
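A sketch of the job header (the deployment step id is assumed to be deployment):

deploy:
  needs: build
  environment:
    name: github-pages
    url: ${{ steps.deployment.outputs.page_url }}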

This ensures the deploy job only runs after the build job successfully completes.

The environment configuration:

  • name: github-pages - Associates deployment with the GitHub Pages environment
  • url - Sets the environment URL to the deployed site URL from the deployment step

Deployment Step

The deployment uses the official GitHub Pages deployment action:
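A sketch (the step id is assumed; it supplies the page_url output referenced above):

- id: deployment
  uses: actions/deploy-pages@v4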

The deploy-pages action:

  1. Downloads the artifact uploaded by the build job
  2. Unpacks the tarball
  3. Publishes content to GitHub Pages
  4. Returns the deployed site URL in page_url output

Sources: .github/workflows/build-and-deploy.yml:79-89

Workflow Execution Flow

Sources: .github/workflows/build-and-deploy.yml:1-90

Environment Variables and Outputs

Step Outputs

The resolve step produces two outputs accessible to subsequent steps:

| Output | ID Path | Description |
|---|---|---|
| repo_value | steps.resolve.outputs.repo_value | Resolved repository (input or default) |
| title_value | steps.resolve.outputs.title_value | Resolved book title (input or generated) |

These outputs are consumed by the Docker run command via environment variables.

Docker Container Environment

The container receives configuration through environment variables:

| Variable | Source | Example Value |
|---|---|---|
| REPO | steps.resolve.outputs.repo_value | jzombie/deepwiki-to-mdbook |
| BOOK_TITLE | steps.resolve.outputs.title_value | deepwiki-to-mdbook Documentation |

Additional environment variables supported by the container (not set by this workflow) include:

  • BOOK_AUTHORS - Author metadata
  • BOOK_LANGUAGE - Language code (default: en)
  • BOOK_SRC - Source directory (default: src)
  • MARKDOWN_ONLY - Skip HTML build if set

For a complete list, see Configuration Reference.

Sources: .github/workflows/build-and-deploy.yml:30-72

Workflow Status and Monitoring

Build Job Success Criteria

The build job succeeds when:

  1. Repository checkout completes
  2. Resolution logic produces valid outputs
  3. Docker image builds without errors
  4. Container execution exits with code 0
  5. Artifact upload completes

If any step fails, the build job terminates and the deploy job does not run.

Deploy Job Success Criteria

The deploy job succeeds when:

  1. Build job completed successfully
  2. Artifact download succeeds
  3. GitHub Pages deployment completes
  4. Deployment URL is accessible

Monitoring and Debugging

Workflow Execution:

  • View workflow runs in the Actions tab: https://github.com/{owner}/{repo}/actions
  • Each run shows both job statuses and step-by-step logs

Common Failure Points:

| Failure Location | Possible Cause | Resolution |
|---|---|---|
| Docker build | Dockerfile syntax error | Check Dockerfile changes in the failing commit |
| Docker run | Missing dependencies | Review Python requirements and mdBook installation |
| Artifact upload | Output directory empty | Check build-docs.sh execution logs |
| Deploy | Permissions issue | Verify Pages write permission in workflow |

Sources: .github/workflows/build-and-deploy.yml:1-90

Integration with Other Components

Relationship to Test Workflow

The Build and Deploy Workflow does not include test execution. Testing is handled by a separate workflow that runs on push and PR events. See Test Workflow for details.

Relationship to GitHub Action

This workflow uses the same Docker image and build process as the reusable GitHub Action. The key differences:

| Aspect | Build and Deploy Workflow | GitHub Action |
|---|---|---|
| Trigger | Schedule + manual | Called by other workflows |
| Output | Deployed to Pages | Artifact or caller-specified location |
| Configuration | Hardcoded to this repo | Parameterized for any repo |

For using this system in other repositories, see GitHub Action.

Docker Image Reuse

The workflow builds the Docker image fresh on each run. This ensures:

  • Latest code changes are included
  • Dependencies are up to date
  • Reproducible builds from source

The image is not pushed to a registry; it exists only during workflow execution.

Sources: .github/workflows/build-and-deploy.yml:1-90

Deployment Target Configuration

GitHub Pages Environment

GitHub Pages deployment requires repository configuration:

  1. Pages Source: Set to “GitHub Actions” in repository settings
  2. Branch: Not applicable (artifact-based deployment)
  3. Custom Domain: Optional; configure in repository settings

The deployment uses the github-pages environment, which provides:

  • Protection rules (if configured)
  • Deployment history
  • Environment-specific secrets (if needed)

URL Structure

The deployed documentation is accessible at:

https://{owner}.github.io/{repo}/

For custom domains, the URL follows the custom domain configuration.

Sources: .github/workflows/build-and-deploy.yml:80-82

Workflow Customization

Modifying Schedule

To change the execution frequency, edit the cron expression:
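on:
  schedule:
    - cron: "0 0 * * 0"   # current setting: weekly, Sundays 00:00 UTC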

Examples:

  • Daily: "0 0 * * *"
  • Twice weekly: Add another cron entry
  • Monthly: "0 0 1 * *"

Adding Manual Inputs

While the current workflow supports manual trigger, it does not expose inputs in the workflow_dispatch section. To add configurable inputs, modify the trigger:
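A sketch (descriptions are illustrative; the input names match github.event.inputs.repo and github.event.inputs.book_title):

on:
  workflow_dispatch:
    inputs:
      repo:
        description: "DeepWiki repository in owner/repo format"
        required: false
      book_title:
        description: "Title for the generated book"
        required: false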

The resolution step already handles these inputs via github.event.inputs.*.

Sources: .github/workflows/build-and-deploy.yml:8-9

Performance Characteristics

Typical Execution Time

| Phase | Duration | Notes |
|---|---|---|
| Docker build | 2-4 minutes | Includes Rust compilation of mdBook |
| Documentation generation | 1-3 minutes | Depends on wiki size |
| Artifact upload | 10-30 seconds | Depends on output size |
| Deployment | 30-60 seconds | GitHub Pages processing |
| Total | 4-8 minutes | End-to-end workflow |

Resource Usage

The workflow runs on ubuntu-latest runners:

  • CPU: 2 cores
  • RAM: 7 GB
  • Disk: 14 GB SSD

Docker build and Python scraping are the most resource-intensive operations.

Optimization Opportunities

  1. Docker Layer Caching: The workflow could use actions/cache to cache Docker layers between runs
  2. Build Artifact Reuse: Skip documentation rebuild if wiki content hasn’t changed
  3. Parallel Jobs: Split build and test into parallel jobs (though this workflow has no tests)

Sources: .github/workflows/build-and-deploy.yml:1-90

Test Workflow

Purpose and Scope

This document describes the continuous integration testing workflow that validates code quality on every push to the main branch and on all pull requests. The workflow executes Python unit tests to ensure that core components function correctly before code is merged.

For information about the production build and deployment workflow, see Build and Deploy Workflow. For details on using the system as a reusable GitHub Action, see GitHub Action. For instructions on running tests locally during development, see Running Tests.

Workflow Configuration

The test workflow is defined in .github/workflows/tests.yml:1-26. It uses GitHub Actions to automatically execute the test suite in a clean Ubuntu environment.

Trigger Events

The workflow activates on two event types:

| Event Type | Trigger Condition | Purpose |
|---|---|---|
| push | Commits to main branch | Validate merged changes |
| pull_request | Any PR created or updated | Block merge of failing code |

Sources: .github/workflows/tests.yml:3-7

Workflow Trigger and Outcome Logic

Sources: .github/workflows/tests.yml:3-7

Job Structure

The workflow contains a single job named pytest that runs on ubuntu-latest. The job executes four sequential steps to set up the environment and run tests.

pytest Job Execution Flow

graph TB
    subgraph "pytest Job Steps"
        S1["Step 1: Checkout\nactions/checkout@v4"]
S2["Step 2: Setup Python\nactions/setup-python@v5\npython-version: 3.12"]
S3["Step 3: Install Dependencies\npip install -r python/requirements.txt\npip install pytest"]
S4["Step 4: Run pytest\npytest python/tests/ -s"]
end
    
 
   S1 --> S2
 
   S2 --> S3
 
   S3 --> S4

Sources: .github/workflows/tests.yml:10-25

Step-by-Step Breakdown

Step 1: Repository Checkout

Uses actions/checkout@v4 to clone the repository at the commit being tested. For pull requests, this checks out the merge commit to test the exact code that would be merged.

Sources: .github/workflows/tests.yml:13-14

Step 2: Python Environment Setup

Configures Python 3.12 using actions/setup-python@v5. This matches the Python version used in the Docker container (Dockerfile:26), ensuring consistency between local development, CI testing, and production execution.

Sources: .github/workflows/tests.yml:15-18

Step 3: Dependency Installation

Installs required packages in three phases:

  1. Upgrades pip to the latest version
  2. Installs project dependencies from python/requirements.txt
  3. Installs pytest testing framework
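The equivalent commands, as shown in the workflow step:

python -m pip install --upgrade pip
pip install -r python/requirements.txt
pip install pytest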

Sources: .github/workflows/tests.yml:19-23

Step 4: Test Execution

Runs pytest against the python/tests/ directory with the -s flag to show print statements. This executes all test modules discovered by pytest.
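The invocation:

pytest python/tests/ -s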

Sources: .github/workflows/tests.yml:24-25

Test Coverage

The workflow executes three test suites that validate core system components:

| Test Module | Component Tested | Key Functions Validated |
|---|---|---|
| test_template_processor.py | Template System | process_template, variable substitution, conditional rendering |
| test_mermaid_normalization.py | Diagram Processing | Seven-step normalization pipeline, Mermaid 11 compatibility |
| test_numbering.py | Path Resolution | Page numbering scheme, filename generation, path calculations |

Test Module Coverage and Component Mapping

Sources: .github/workflows/tests.yml:24-25 scripts/run-tests.sh:11-30

Template Processor Tests

Validates the process_template function in python/process-template.py. Tests verify:

  • Variable substitution ({{REPO}}, {{BOOK_TITLE}}, etc.)
  • Conditional rendering based on variable presence
  • Special character handling in template values
  • Fallback behavior when variables are undefined

Sources: scripts/run-tests.sh:11

Mermaid Normalization Tests

Tests the seven-step normalization pipeline in python/deepwiki-scraper.py. Validates:

  • Edge label flattening for multiline labels
  • State description syntax fixes
  • Flowchart node cleanup
  • Statement separator insertion
  • Empty label fallback generation
  • Gantt chart task ID synthesis

For detailed information on the normalization pipeline, see Mermaid Normalization.

Sources: scripts/run-tests.sh:22

Numbering Tests

Verifies the page numbering and path resolution logic in python/deepwiki-scraper.py. Tests confirm:

  • Correct filename generation from hierarchical numbers (e.g., 9.2 → 9.2_test-workflow.md)
  • Number parsing and validation
  • Path calculations relative to markdown directory
  • Numeric sorting behavior for SUMMARY.md generation

For detailed information on the numbering system, see Numbering and Path Resolution.

Sources: scripts/run-tests.sh:30

graph LR
    subgraph "Development Flow"
        Dev["Developer\nCreates PR"]
Review["Code Review\nProcess"]
Merge["Merge to main"]
end
    
    subgraph "Test Workflow"
        TestTrigger["tests.yml\non: pull_request"]
PytestJob["pytest job"]
TestResult{{"Test Result"}}
end
    
    subgraph "Build Workflow"
        BuildTrigger["build-and-deploy.yml\non: push to main"]
BuildJob["Build & Deploy"]
end
    
 
   Dev --> TestTrigger
 
   TestTrigger --> PytestJob
 
   PytestJob --> TestResult
 
   TestResult -->|Pass| Review
 
   TestResult -->|Fail| Dev
 
   Review --> Merge
 
   Merge --> BuildTrigger
 
   BuildTrigger --> BuildJob

Integration with CI/CD Pipeline

The test workflow serves as a quality gate in the development process. Its relationship to other CI/CD components is shown below:

CI/CD Pipeline Integration and Quality Gate

Sources: .github/workflows/tests.yml:3-7

The workflow prevents code with failing tests from being merged, ensuring that the weekly build-and-deploy workflow (Build and Deploy Workflow) only processes validated code.

graph TB
    subgraph "GitHub Actions Execution"
        GHA["tests.yml workflow"]
GHAEnv["ubuntu-latest\nPython 3.12\nClean environment"]
GHATest["pytest python/tests/ -s"]
end
    
    subgraph "Local Execution"
        Local["scripts/run-tests.sh"]
LocalEnv["Developer machine\nAny Python 3.x\nCurrent environment"]
LocalTest["python3 -m pytest\nOR\npython3 test_*.py"]
end
    
    subgraph "Shared Test Suite"
        Tests["python/tests/\ntest_template_processor.py\ntest_mermaid_normalization.py\ntest_numbering.py"]
end
    
 
   GHA --> GHAEnv
 
   GHAEnv --> GHATest
 
   GHATest --> Tests
    
 
   Local --> LocalEnv
 
   LocalEnv --> LocalTest
 
   LocalTest --> Tests

Local Test Execution

Developers can run the same tests locally using the scripts/run-tests.sh script. This script provides an alternative execution path that doesn’t require GitHub Actions:
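For example:

./scripts/run-tests.sh
# or, mirroring the CI invocation:
pytest python/tests/ -s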

Dual Execution Paths for Test Suite

Sources: .github/workflows/tests.yml:1-26 scripts/run-tests.sh:1-43

The local script scripts/run-tests.sh:1-43 includes fallback logic: if pytest is not available, it runs test_template_processor.py directly using Python’s unittest framework, but skips the pytest-dependent tests. This ensures developers can run at least basic tests without installing pytest.

Workflow Best Practices

Pull Request Requirements

Before merging, pull requests must:

  1. Pass all pytest tests
  2. Not introduce new test failures
  3. Update tests if behavior changes

Test Failure Investigation

When tests fail, the workflow output shows:

  • Which test module failed
  • The specific test function that failed
  • Assertion details and error messages
  • Full stdout/stderr due to the -s flag

The -s flag in .github/workflows/tests.yml:25 ensures that print statements from the code under test are visible in the workflow logs, aiding debugging.

Adding New Tests

To add new tests to the workflow:

  1. Create a new test module in python/tests/
  2. Follow pytest conventions (test_*.py filename, test_* function names)
  3. The workflow automatically discovers and runs the new tests
  4. No workflow configuration changes required

Sources: .github/workflows/tests.yml:24-25

Comparison with Build Workflow

The following table highlights the differences between the test workflow and the build-and-deploy workflow:

| Aspect | Test Workflow | Build-and-Deploy Workflow |
|---|---|---|
| Trigger | Push to main, pull requests | Weekly schedule, manual dispatch |
| Purpose | Validate code quality | Generate documentation |
| Duration | ~1-2 minutes | ~5-10 minutes |
| Environment | Python only | Full Docker build |
| Artifacts | None | Documentation site |
| Deployment | None | GitHub Pages |
| Blocking | Yes (blocks PR merge) | No |

Sources: .github/workflows/tests.yml:1-26


GitHub Action

Relevant source files

This page documents the reusable GitHub Action that enables external repositories to generate mdBook documentation from DeepWiki content. The action packages the entire Docker-based build system into a single workflow step that can be invoked with YAML configuration. For information about the automated build-and-deploy workflow used in this repository itself, see Build and Deploy Workflow. For information about the test workflow, see Test Workflow.

Overview

The GitHub Action is defined in action.yml:1-53 and implements a composite action pattern. It builds the Docker image on-demand within the calling workflow’s runner environment, executes the documentation generation process, and outputs artifacts to a configurable directory. Unlike pre-built action images, this approach bundles the Dockerfile directly, ensuring that the action always uses the exact code version referenced in the workflow.

```mermaid
graph TB
    subgraph "Calling Repository Workflow"
        WorkflowYAML["workflow.yml\nuses: jzombie/deepwiki-to-mdbook@main"]
        InputParams["Input Parameters\nrepo, book_title, output_dir, etc."]
    end

    subgraph "action.yml Composite Steps"
        Step1["Step 1: Build Docker image\nworking-directory: github.action_path"]
        Step2["Step 2: Run documentation builder\ndocker run with mounted volume"]
    end

    subgraph "Execution Environment"
        Dockerfile["Dockerfile\nbundled with action"]
        ImageTag["IMAGE_TAG=deepwiki-to-mdbook:GITHUB_RUN_ID"]
        Container["Docker Container\nbuild-docs.sh entrypoint"]
    end

    subgraph "Output Artifacts"
        OutputDir["inputs.output_dir\nmounted to /output"]
        Book["book/\nHTML documentation"]
        Markdown["markdown/\nenhanced markdown"]
        Config["book.toml\nmdBook config"]
    end

    WorkflowYAML --> InputParams
    InputParams --> Step1
    Step1 --> Dockerfile
    Dockerfile --> ImageTag
    ImageTag --> Step2
    Step2 --> Container
    Container --> OutputDir
    OutputDir --> Book
    OutputDir --> Markdown
    OutputDir --> Config
```

The action provides the same functionality as local Docker execution but wraps it in a GitHub Actions-native interface with declarative input parameters instead of raw environment variables and shell commands.

Diagram: Action Invocation and Execution Flow

The action uses the github.action_path context variable to locate the bundled Dockerfile within the checked-out action repository. The image tag includes GITHUB_RUN_ID to ensure uniqueness across concurrent workflow executions.

Sources: action.yml:1-53 README.md:60-70

Input Parameters

The action exposes six input parameters that map directly to the Docker container’s environment variables. All parameters except repo have sensible defaults.

| Input | Required | Default | Description | Environment Variable |
|---|---|---|---|---|
| repo | Yes | N/A | DeepWiki repository in owner/repo format (e.g., jzombie/deepwiki-to-mdbook) | REPO |
| book_title | No | "Documentation" | Title displayed in the generated mdBook | BOOK_TITLE |
| book_authors | No | "" | Author metadata for the mdBook (empty defaults to repo owner) | BOOK_AUTHORS |
| git_repo_url | No | "" | Repository URL for mdBook edit links (empty auto-detects from Git) | GIT_REPO_URL |
| markdown_only | No | "false" | Set to "true" to skip HTML build and only extract markdown | MARKDOWN_ONLY |
| output_dir | No | "./output" | Output directory on workflow host, mounted to /output in container | (Volume mount target) |

The output_dir parameter is workflow-relative and resolved to an absolute path before mounting (action.yml:44). This allows callers to specify paths like ./docs-output or ${{ github.workspace }}/generated-docs.

Sources: action.yml:3-26

Action Implementation

The action uses the composite run type (action.yml:28), meaning it executes shell commands directly in the workflow runner rather than using a pre-built container image. This design choice enables the action to always use the current repository code without requiring separate image publication.

Diagram: Action Step Implementation Details

```mermaid
graph TB
    subgraph "Step 1: Build Docker image"
        ActionPath["github.action_path\nPoints to checked-out action repo"]
        SetWorkDir["working-directory: github.action_path"]
        BuildCmd["docker build -t IMAGE_TAG ."]
        SetEnv["Export IMAGE_TAG to GITHUB_ENV"]
    end

    subgraph "Step 2: Run documentation builder"
        MkdirOutput["mkdir -p inputs.output_dir"]
        ResolveDir["OUT_DIR=$(cd inputs.output_dir && pwd)"]
        DockerRun["docker run --rm"]
        EnvVars["Environment Variables\nREPO, BOOK_TITLE, BOOK_AUTHORS\nGIT_REPO_URL, MARKDOWN_ONLY"]
        VolumeMount["Volume: OUT_DIR:/output"]
    end

    subgraph "Container Execution"
        Entrypoint["CMD: build-docs.sh"]
        OutputGeneration["Generate book/, markdown/, etc."]
    end

    ActionPath --> SetWorkDir
    SetWorkDir --> BuildCmd
    BuildCmd --> SetEnv
    SetEnv --> MkdirOutput
    MkdirOutput --> ResolveDir
    ResolveDir --> DockerRun
    DockerRun --> EnvVars
    DockerRun --> VolumeMount
    EnvVars --> Entrypoint
    VolumeMount --> Entrypoint
    Entrypoint --> OutputGeneration
```

The first step (action.yml:30-37) uses working-directory: ${{ github.action_path }} to ensure docker build executes in the action’s repository root where the Dockerfile is located. The image tag incorporates GITHUB_RUN_ID (action.yml:35) to prevent conflicts when multiple workflows run concurrently. The tag is exported to $GITHUB_ENV (action.yml:36) to make it available in subsequent steps.

The second step (action.yml:39-52) resolves the output_dir input to an absolute path (action.yml:44) using cd and pwd, which is necessary because Docker volume mounts require absolute paths. The docker run command (action.yml:45-52) uses the same environment variable names as local Docker execution, mapping each input parameter to its corresponding environment variable.

Sources: action.yml:28-53

Usage Examples

Basic Usage

The most common usage pattern provides only the required repo parameter and accepts default values for all other inputs:
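A minimal invocation might look like the following sketch (the with: keys correspond to the input table above):

```yaml
steps:
  - name: Generate documentation
    uses: jzombie/deepwiki-to-mdbook@main
    with:
      repo: myorg/myproject
```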

This generates documentation for the myorg/myproject DeepWiki repository with the default title “Documentation” and outputs to ./output.

Sources: README.md:62-70

Full Configuration

A fully configured invocation specifies all optional parameters:
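A sketch with every input spelled out (all values here are placeholders):

```yaml
steps:
  - name: Generate documentation
    uses: jzombie/deepwiki-to-mdbook@main
    with:
      repo: myorg/myproject
      book_title: "My Project Documentation"
      book_authors: "Project Contributors"
      git_repo_url: "https://github.com/myorg/myproject"
      markdown_only: "false"
      output_dir: "./generated-docs"
```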

This provides custom metadata and changes the output location to ./generated-docs.

Sources: action.yml:3-26

Markdown-Only Mode

To extract markdown without building HTML, useful for custom post-processing:
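A sketch of the markdown-only invocation:

```yaml
steps:
  - name: Extract markdown only
    uses: jzombie/deepwiki-to-mdbook@main
    with:
      repo: myorg/myproject
      markdown_only: "true"
```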

This skips the mdBook build phase and outputs only markdown/ and raw_markdown/ directories. For more information about markdown-only mode, see Markdown-Only Mode.

Sources: action.yml:19-22

GitHub Pages Deployment

The action integrates with GitHub Pages deployment workflows:
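A sketch of such a workflow, assuming the standard actions/upload-pages-artifact and actions/deploy-pages actions:

```yaml
name: docs
on:
  schedule:
    - cron: "0 0 * * 0"   # weekly
  workflow_dispatch:

permissions:
  pages: write
  id-token: write

jobs:
  docs:
    runs-on: ubuntu-latest
    environment:
      name: github-pages
    steps:
      - name: Generate documentation
        uses: jzombie/deepwiki-to-mdbook@main
        with:
          repo: ${{ github.repository }}
      - uses: actions/upload-pages-artifact@v3
        with:
          path: output/book
      - uses: actions/deploy-pages@v4
```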

This workflow generates documentation weekly and deploys it to GitHub Pages. The ${{ github.repository }} context variable automatically uses the current repository as the target.

Sources: README.md:60-70

Comparison with Local Docker Usage

The GitHub Action provides the same functionality as direct Docker invocation but with different interfaces:

| Aspect | GitHub Action | Local Docker |
|---|---|---|
| Invocation | YAML with: parameters | Shell environment variables |
| Image Building | Automatic per workflow run | Manual docker build |
| Output Path | Workflow-relative (e.g., ./output) | Absolute host path (e.g., /home/user/output) |
| Path Resolution | Automatic via cd and pwd (action.yml:44) | Manual path specification |
| Use Case | CI/CD automation, scheduled builds | Local development, debugging |

Both methods use the identical Docker image and build-docs.sh entrypoint, ensuring consistent output regardless of invocation method. The action adds automation conveniences like automatic image building and path resolution, while local Docker provides more direct control and faster iteration during development.

Sources: action.yml:28-53 README.md:14-27

Implementation Details

Composite Action Structure

The action uses the composite type (action.yml:28) rather than docker or javascript types. This design choice has several implications:

  1. No Pre-built Images : The action does not publish Docker images to registries. Instead, it builds the image on-demand in each workflow run.
  2. Version Consistency : Using @main or a specific commit SHA ensures the action always uses the corresponding Dockerfile version.
  3. Build Caching : GitHub Actions runners cache Docker layers between runs, reducing build time after the initial execution.
  4. Shell Portability : The action requires only bash and docker, making it compatible with all standard GitHub-hosted runners.

Sources: action.yml:28

Environment Variable Mapping

The action translates workflow inputs to Docker environment variables using GitHub Actions expression syntax:
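A sketch of that mapping (the exact docker run arguments live at action.yml:45-52):

```bash
docker run --rm \
  -e REPO="${{ inputs.repo }}" \
  -e BOOK_TITLE="${{ inputs.book_title }}" \
  -e BOOK_AUTHORS="${{ inputs.book_authors }}" \
  -e GIT_REPO_URL="${{ inputs.git_repo_url }}" \
  -e MARKDOWN_ONLY="${{ inputs.markdown_only }}" \
  -v "$OUT_DIR:/output" \
  "$IMAGE_TAG"
```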

Each inputs.* reference is resolved by the Actions runtime before shell execution (action.yml:46-51). Empty values are passed as empty strings, which triggers auto-detection behavior in build-docs.sh. For details about auto-detection features, see Auto-Detection Features.

Sources: action.yml:45-52

Working Directory Management

The action uses working-directory: ${{ github.action_path }} (action.yml:31) in the build step to ensure docker build executes in the action repository root. The github.action_path context variable points to the directory where the action was checked out, which is outside the calling workflow’s workspace.

In contrast, the second step does not specify a working directory, so it executes in ${{ github.workspace }} by default. This allows the output_dir input to use workflow-relative paths (action.yml:44).

Sources: action.yml:31 action.yml:44

Common Integration Patterns

Multi-Repository Documentation

Organizations can use the action to generate documentation for multiple repositories in a single workflow:
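One way to express this is a matrix job (sketch; repository names are placeholders):

```yaml
jobs:
  docs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        repo: [myorg/project-a, myorg/project-b, myorg/project-c]
    steps:
      - uses: jzombie/deepwiki-to-mdbook@main
        with:
          repo: ${{ matrix.repo }}
          output_dir: ./output/${{ matrix.repo }}
```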

This generates documentation for three repositories in parallel, each in its own output subdirectory.

Sources: action.yml:3-26

Conditional Builds

The action can be conditionally executed based on workflow triggers:
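For example (sketch, assuming a workflow_dispatch input named repo):

```yaml
on:
  workflow_dispatch:
    inputs:
      repo:
        description: "DeepWiki repository (owner/repo)"
        required: true

jobs:
  docs:
    if: github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    steps:
      - uses: jzombie/deepwiki-to-mdbook@main
        with:
          repo: ${{ inputs.repo }}
```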

This only runs the documentation build when manually triggered, using a repository specified in the workflow dispatch inputs.

Sources: action.yml:1-53

Artifact Upload Integration

The action output integrates seamlessly with GitHub Actions artifact upload:
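A sketch using the standard actions/upload-artifact action:

```yaml
steps:
  - uses: jzombie/deepwiki-to-mdbook@main
    with:
      repo: myorg/myproject
  - uses: actions/upload-artifact@v4
    with:
      name: documentation
      path: |
        output/book
        output/markdown
```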

This uploads multiple output directories as a single artifact for later download or deployment.

Sources: README.md:54-58


Output Structure

Relevant source files

This page documents the structure and contents of the /output directory produced by the DeepWiki-to-mdBook converter. The output structure varies depending on whether the system runs in full build mode or markdown-only mode. For information about enabling markdown-only mode, see Markdown-Only Mode.

Output Directory Overview

The system writes all artifacts to the /output directory, which is typically mounted as a Docker volume. The contents of this directory depend on the MARKDOWN_ONLY environment variable:

Output Mode Decision Logic

```mermaid
graph TD
    Start["build-docs.sh execution"]
    CheckMode{MARKDOWN_ONLY\nenvironment variable}
    FullBuild["Full Build Path"]
    MarkdownOnly["Markdown-Only Path"]
    OutputBook["/output/book/\nHTML documentation"]
    OutputMarkdown["/output/markdown/\nSource .md files"]
    OutputToml["/output/book.toml\nConfiguration"]

    Start --> CheckMode
    CheckMode -->|false default| FullBuild
    CheckMode -->|true| MarkdownOnly

    FullBuild --> OutputBook
    FullBuild --> OutputMarkdown
    FullBuild --> OutputToml

    MarkdownOnly --> OutputMarkdown
```

Sources: build-docs.sh:26 build-docs.sh:60-76

Full Build Mode Output

When MARKDOWN_ONLY is not set or is false, the system produces three distinct outputs:

```mermaid
graph TD
    Output["/output/"]
    Book["book/\nComplete HTML site"]
    Markdown["markdown/\nSource files"]
    BookToml["book.toml\nConfiguration"]
    BookIndex["index.html"]
    BookCSS["css/"]
    BookJS["FontAwesome/"]
    BookSearchJS["searchindex.js"]
    BookMermaid["mermaid-init.js"]
    BookPages["*.html pages"]
    MarkdownRoot["*.md files\n(main pages)"]
    MarkdownSections["section-N/\n(subsection dirs)"]

    Output --> Book
    Output --> Markdown
    Output --> BookToml

    Book --> BookIndex
    Book --> BookCSS
    Book --> BookJS
    Book --> BookSearchJS
    Book --> BookMermaid
    Book --> BookPages

    Markdown --> MarkdownRoot
    Markdown --> MarkdownSections
```

Directory Structure

Full Build Output Structure

Sources: build-docs.sh:178-192 README.md:92-104

/output/book/ Directory

The book/ directory contains the complete HTML documentation site generated by mdBook. This is a self-contained static website that can be hosted on any web server or opened directly in a browser.

| Component | Description | Generated By |
|---|---|---|
| index.html | Main entry point for the documentation | mdBook |
| *.html | Individual page files corresponding to each .md source | mdBook |
| css/ | Styling for the rust theme | mdBook |
| FontAwesome/ | Icon font assets | mdBook |
| searchindex.js | Search index for site-wide search functionality | mdBook |
| mermaid.min.js | Mermaid diagram rendering library | mdbook-mermaid |
| mermaid-init.js | Mermaid initialization script | mdbook-mermaid |

The HTML site includes:

  • Responsive navigation sidebar with hierarchical structure
  • Full-text search functionality
  • Syntax highlighting for code blocks
  • Working Mermaid diagram rendering
  • “Edit this page” links pointing to GIT_REPO_URL
  • Collapsible sections in the navigation

Sources: build-docs.sh:173-176 build-docs.sh:94-95 README.md:95-99

/output/markdown/ Directory

The markdown/ directory contains the source Markdown files extracted from DeepWiki and enhanced with Mermaid diagrams. These files follow a specific naming convention and organizational structure.

File Naming Convention:

<page-number>-<page-title-slug>.md

Examples from actual output:

  • 1-overview.md
  • 2-1-workspace-and-crates.md
  • 3-2-sql-parser.md

Subsection Organization:

Pages with subsections have their children organized into directories:

section-N/
  N-1-first-subsection.md
  N-2-second-subsection.md
  ...

For example, if page 4-architecture.md has subsections, they appear in:

section-4/
  4-1-overview.md
  4-2-components.md

This organization is reflected in the mdBook SUMMARY.md generation logic at build-docs.sh:125-159

Sources: README.md:100-119 build-docs.sh:163-166 build-docs.sh:186-188

/output/book.toml File

The book.toml file is a copy of the mdBook configuration used to generate the HTML site. It contains:
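A representative example (sketch; the exact fields are generated by build-docs.sh from the environment variables, so keys and values here are illustrative):

```toml
[book]
title = "My Project Documentation"
authors = ["Project Contributors"]
src = "src"

[output.html]
git-repository-url = "https://github.com/myorg/myproject"

[output.html.fold]
enable = true

[preprocessor.mermaid]
command = "mdbook-mermaid"
```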

This file can be used to:

  • Understand the configuration used for the build
  • Regenerate the book with different settings
  • Debug mdBook configuration issues

Sources: build-docs.sh:84-103 build-docs.sh:190-191

Markdown-Only Mode Output

When MARKDOWN_ONLY=true, the system produces only the /output/markdown/ directory. This mode skips the mdBook build phase entirely.

Markdown-Only Mode Data Flow

```mermaid
graph LR
    Scraper["deepwiki-scraper.py"]
    TempDir["/workspace/wiki/\nTemporary directory"]
    OutputMarkdown["/output/markdown/\nFinal output"]

    Scraper -->|Writes enhanced .md files| TempDir
    TempDir -->|cp -r| OutputMarkdown
```

The output structure is identical to the markdown/ directory in full build mode, but the book/ and book.toml artifacts are not created.

Sources: build-docs.sh:60-76 README.md:106-113

```mermaid
graph TD
    subgraph "Phase 1: Scraping"
        Scraper["deepwiki-scraper.py"]
        WikiDir["/workspace/wiki/"]
    end

    subgraph "Phase 2: Decision Point"
        CheckMode{MARKDOWN_ONLY\ncheck}
    end

    subgraph "Phase 3: mdBook Build (conditional)"
        BookInit["Initialize /workspace/book/"]
        GenToml["Generate book.toml"]
        GenSummary["Generate SUMMARY.md"]
        CopyToSrc["cp wiki/* book/src/"]
        MermaidInstall["mdbook-mermaid install"]
        MdBookBuild["mdbook build"]
        BuildOutput["/workspace/book/book/"]
    end

    subgraph "Phase 4: Copy to Output"
        CopyBook["cp -r book /output/"]
        CopyMarkdown["cp -r wiki /output/markdown/"]
        CopyToml["cp book.toml /output/"]
    end

    Scraper -->|Writes to| WikiDir
    WikiDir --> CheckMode

    CheckMode -->|false| BookInit
    CheckMode -->|true| CopyMarkdown

    BookInit --> GenToml
    GenToml --> GenSummary
    GenSummary --> CopyToSrc
    CopyToSrc --> MermaidInstall
    MermaidInstall --> MdBookBuild
    MdBookBuild --> BuildOutput

    BuildOutput --> CopyBook
    WikiDir --> CopyMarkdown
    GenToml --> CopyToml
```

Output Generation Process

The diagram above shows how each output artifact is generated during the build process.

Complete Output Generation Pipeline

Sources: build-docs.sh:55-205

File Naming Examples

The following table shows actual filename patterns produced by the system:

| Pattern | Example | Description |
|---|---|---|
| N-title.md | 1-overview.md | Main page without subsections |
| N-M-title.md | 2-1-workspace-and-crates.md | Subsection file in root (legacy format) |
| section-N/N-M-title.md | section-4/4-1-logical-planning.md | Subsection file in section directory |

The system automatically detects which pages have subsections by examining the numeric prefix and checking for corresponding section-N/ directories during SUMMARY.md generation.

Sources: build-docs.sh:125-159 README.md:115-119

Volume Mounting

The /output directory is designed to be mounted as a Docker volume. The typical Docker run command specifies:
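For example (sketch; the image name is whatever tag you used at docker build time):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```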

This mounts the host’s ./output directory to the container’s /output directory, making all generated artifacts accessible on the host filesystem after the container exits.

Sources: README.md:34-38 README.md:83-86

Output Size Characteristics

The output directory typically contains:

  • Markdown files : 10-500 KB per page depending on content length and diagram count
  • HTML book : 5-50 MB total depending on page count and assets
  • book.toml : ~500 bytes

For a typical repository with 20-30 documentation pages, expect:

  • markdown/: 5-15 MB
  • book/: 10-30 MB (includes all HTML, CSS, JS, and search index)
  • book.toml: < 1 KB

The HTML book is significantly larger than the markdown source because it includes:

  • Complete mdBook framework (CSS, JavaScript)
  • Search index (searchindex.js)
  • Mermaid rendering library (mermaid.min.js)
  • Font assets (FontAwesome)
  • Generated HTML for each page with navigation

Sources: build-docs.sh:178-205

Serving the Output

The HTML documentation in /output/book/ can be served using any static web server:
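For example, with Python’s built-in server (the same command used in the template-debugging workflow later in this document):

```bash
python3 -m http.server --directory output/book 8000
# then open http://localhost:8000
```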

The markdown files in /output/markdown/ can be:

  • Committed to a Git repository
  • Used as input for other documentation systems
  • Edited and re-processed through mdBook manually
  • Served directly by markdown-aware platforms like GitHub

Sources: README.md:83-86 build-docs.sh:203-204


Template System Details

Relevant source files

Purpose and Scope

This document provides comprehensive technical documentation of the template system used to customize headers and footers in generated markdown files. The template system supports variable substitution, conditional rendering, and custom template injection through volume mounts.

For information about specific template variables and their usage, see Template Variables. For guidance on providing custom templates, see Custom Templates. For the broader Phase 2 enhancement pipeline where templates are injected, see Phase 2: Diagram Enhancement.

Template Processing Architecture

The template system consists of two core components: the template processor script and the default template files. The processor applies variable substitution and conditional logic, while default templates provide the baseline content structure.

Template System Component Architecture

```mermaid
graph TB
    subgraph "Template Processor"
        ProcessScript["process-template.py"]
        ProcessFunc["process_template()"]
        ReplaceConditional["replace_conditional()"]
        ReplaceVariable["replace_variable()"]
    end

    subgraph "Default Templates"
        HeaderTemplate["templates/header.html"]
        FooterTemplate["templates/footer.html"]
        TemplateReadme["templates/README.md"]
    end

    subgraph "Runtime Configuration"
        EnvVars["Environment Variables\nTEMPLATE_DIR\nHEADER_TEMPLATE\nFOOTER_TEMPLATE"]
        VariablesDict["variables dict\nREPO, BOOK_TITLE\nGENERATION_DATE, etc."]
    end

    subgraph "Processing Steps"
        ReadTemplate["Read template file"]
        ParseVars["Parse CLI arguments\nVAR=value format"]
        ProcessCond["Process conditionals\n{{#if}}...{{/if}}"]
        ProcessVars["Process variables\n{{VARIABLE}}"]
        RemoveComments["Remove HTML comments"]
    end

    EnvVars --> ProcessScript
    HeaderTemplate --> ReadTemplate
    FooterTemplate --> ReadTemplate
    ReadTemplate --> ProcessFunc
    ParseVars --> VariablesDict
    VariablesDict --> ProcessFunc

    ProcessFunc --> ProcessCond
    ProcessCond --> ReplaceConditional
    ReplaceConditional --> ProcessVars
    ProcessVars --> ReplaceVariable
    ReplaceVariable --> RemoveComments
    RemoveComments --> Output["Processed HTML output"]
```

Sources: python/process-template.py:1-83 templates/header.html:1-9 templates/footer.html:1-11

Template Syntax

The template system supports three syntax constructs: variable substitution, conditional rendering, and HTML comments. All processing occurs through regular expression pattern matching in process_template().

Variable Substitution

Variables use double curly brace syntax: {{VARIABLE_NAME}}. The processor replaces these with values from the variables dictionary.

Variable Processing Flow

```mermaid
graph LR
    Input["Template: {{BOOK_TITLE}}"]
    Pattern["variable_pattern\nr'\\{\\{(\\w+)\\}\\}'"]
    Match["re.sub()\nmatch"]
    Lookup["variables.get('BOOK_TITLE', '')"]
    Output["Processed: My Documentation"]

    Input --> Pattern
    Pattern --> Match
    Match --> Lookup
    Lookup --> Output
```

The replace_variable() function at python/process-template.py:41-43 implements variable lookup. If a variable is not found in the dictionary, it returns an empty string rather than raising an error.

Sources: python/process-template.py:38-45 templates/README.md:12-15

Conditional Rendering

Conditionals control whether content blocks are included in the output based on variable presence and non-empty values. The syntax is {{#if VARIABLE}}...{{/if}}.

Conditional Evaluation Logic

| Condition | Variable Value | Result |
|---|---|---|
| Variable not in dictionary | N/A | Content excluded |
| Variable is empty string | "" | Content excluded |
| Variable is None | None | Content excluded |
| Variable is False | False | Content excluded |
| Variable has non-empty value | Any truthy value | Content included |

The conditional pattern at python/process-template.py:26 uses the re.DOTALL flag to match content across multiple lines. The replace_conditional() function evaluates the condition at python/process-template.py:32-34.

Example from Default Header:
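The construct looks like this (sketch; the exact markup lives in templates/header.html):

```html
{{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}"><img src="{{GITHUB_BADGE_URL}}" alt="GitHub"></a>{{/if}}
```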

This conditional at templates/header.html:3 ensures the GitHub badge only appears when GIT_REPO_URL is configured.

Sources: python/process-template.py:24-36 templates/README.md:17-22 templates/header.html:3

HTML Comment Removal

HTML comments are automatically stripped from the processed output using the pattern <!--.*?--> with the re.DOTALL flag at python/process-template.py:48. This allows template authors to include documentation within template files without affecting the output.

Sources: python/process-template.py:47-48 templates/README.md:24-28

Default Template Structure

The system provides two default templates located in the templates/ directory. These templates are embedded in the Docker image and used unless overridden by volume mounts.

Template File Locations and Purpose

```mermaid
graph TB
    subgraph "Docker Image: /workspace/templates/"
        HeaderFile["header.html\nProject badges\nInitiative description\nGitHub links"]
        FooterFile["footer.html\nGeneration timestamp\nRepository link"]
        ReadmeFile["README.md\nDocumentation\nNot processed"]
    end

    subgraph "Environment Variables"
        TemplateDir["TEMPLATE_DIR\ndefault: /workspace/templates"]
        HeaderPath["HEADER_TEMPLATE\ndefault: $TEMPLATE_DIR/header.html"]
        FooterPath["FOOTER_TEMPLATE\ndefault: $TEMPLATE_DIR/footer.html"]
    end

    subgraph "Runtime Resolution"
        BuildScript["build-docs.sh"]
        CheckExists["Check file existence"]
        UseDefault["Use default templates"]
        UseCustom["Use custom templates\n(if mounted)"]
    end

    TemplateDir --> HeaderPath
    TemplateDir --> FooterPath
    HeaderPath --> BuildScript
    FooterPath --> BuildScript
    BuildScript --> CheckExists
    CheckExists --> UseDefault
    CheckExists --> UseCustom
    UseDefault --> HeaderFile
    UseDefault --> FooterFile
```

Sources: templates/header.html:1-9 templates/footer.html:1-11 templates/README.md:1-77

Header Template Structure

The default header at templates/header.html:1-9 contains three main sections:

  1. Project Badges Section (lines 2-4): Conditionally displays GitHub badge using flexbox layout with gap spacing and bottom border
  2. Initiative Description (line 7): Displays “Projects with Books” initiative text from zenOSmosis
  3. Conditional GitHub Link (line 7): Second conditional that provides GitHub source link

The inline conditional comment at templates/header.html:1 explains why conditionals must be inline: to prevent mdBook from wrapping links in separate paragraph tags.

Footer Template Structure

The default footer at templates/footer.html:1-11 provides:

  1. Visual Separator (line 1): Horizontal rule with custom styling
  2. Generation Timestamp (line 4): Always displayed using GENERATION_DATE variable
  3. Repository Link (lines 5-9): Conditionally displays repository link using GIT_REPO_URL and REPO variables

Sources: templates/header.html:1-9 templates/footer.html:1-11

```mermaid
sequenceDiagram
    participant BS as build-docs.sh
    participant PT as process-template.py
    participant FS as Filesystem
    participant MD as Markdown Files

    Note over BS: Phase 2: Enhancement

    BS->>BS: Set template variables\nREPO, BOOK_TITLE, etc.
    BS->>FS: Check HEADER_TEMPLATE exists
    BS->>FS: Check FOOTER_TEMPLATE exists

    BS->>PT: process-template.py header.html\nVAR1=val1 VAR2=val2...
    PT->>FS: Read header.html
    PT->>PT: Apply conditionals
    PT->>PT: Substitute variables
    PT->>PT: Remove comments
    PT-->>BS: Return processed header

    BS->>PT: process-template.py footer.html\nVAR1=val1 VAR2=val2...
    PT->>FS: Read footer.html
    PT->>PT: Apply conditionals
    PT->>PT: Substitute variables
    PT->>PT: Remove comments
    PT-->>BS: Return processed footer

    BS->>MD: For each .md file in markdown/
    loop Each File
        BS->>FS: Read original content
        BS->>FS: Write: header + content + footer
    end

    Note over BS,MD: Templates now injected\nReady for mdBook build
```

Template Processing Pipeline

Templates are processed and injected during Phase 2 of the build pipeline, after markdown extraction but before mdBook build. The build-docs.sh script orchestrates this process.

Template Injection Workflow

Sources: python/process-template.py:53-82

Available Template Variables

The following variables are available in all templates. These are set by build-docs.sh before template processing and passed as command-line arguments to process-template.py.

| Variable | Source | Example Value | Description |
|---|---|---|---|
| REPO | REPO env var or Git remote | jzombie/deepwiki-to-mdbook | Repository in owner/repo format |
| BOOK_TITLE | BOOK_TITLE env var or default | My Project Documentation | Title displayed in book |
| BOOK_AUTHORS | BOOK_AUTHORS env var or default | Project Contributors | Author attribution |
| GENERATION_DATE | date -u command | 2024-01-15 14:30:00 UTC | UTC timestamp when docs generated |
| DEEPWIKI_URL | Hardcoded | https://deepwiki.com/... | Source wiki URL |
| DEEPWIKI_BADGE_URL | Constructed from DEEPWIKI_URL | https://deepwiki.com/.../badge.svg | DeepWiki badge image |
| GIT_REPO_URL | Constructed from REPO | https://github.com/owner/repo | Full GitHub repository URL |
| GITHUB_BADGE_URL | Constructed from REPO | https://img.shields.io/badge/... | GitHub badge image |

Variable Construction Logic:

  • GIT_REPO_URL: Only set if REPO is non-empty. Format: https://github.com/${REPO}
  • GITHUB_BADGE_URL: Only set if REPO is non-empty. Constructed using shields.io badge service
  • DEEPWIKI_BADGE_URL: Derived from DEEPWIKI_URL by appending /badge.svg

Sources: templates/README.md:30-40

Customization Mechanisms

The template system supports customization through three mechanisms: environment variables, volume mounts, and complete template directory replacement.

Environment Variable Configuration

| Environment Variable | Default Value | Purpose |
|---|---|---|
| TEMPLATE_DIR | /workspace/templates | Base directory for template files |
| HEADER_TEMPLATE | $TEMPLATE_DIR/header.html | Path to header template |
| FOOTER_TEMPLATE | $TEMPLATE_DIR/footer.html | Path to footer template |

These variables allow changing template paths without modifying the Docker image.

Sources: templates/README.md:53-56

Volume Mount Strategies

Strategy 1: Replace Individual Templates
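Sketch (host paths and image name are placeholders):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -v "$(pwd)/my-header.html:/workspace/templates/header.html" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```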

This overrides only the header template, keeping the default footer.

Strategy 2: Replace Entire Template Directory
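Sketch (host paths and image name are placeholders):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -v "$(pwd)/my-templates:/workspace/templates" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```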

This replaces all default templates. The custom directory must contain both header.html and footer.html.

Strategy 3: Custom Template Location
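Sketch (host paths and image name are placeholders):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -e TEMPLATE_DIR=/custom/templates \
  -v "$(pwd)/my-templates:/custom/templates" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```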

This uses environment variables to point to templates in a different location.

Sources: templates/README.md:44-51

```mermaid
graph TB
    subgraph "Phase 1: Extraction"
        Scraper["deepwiki-scraper.py\nGenerate raw markdown"]
        RawMD["raw_markdown/\ndirectory"]
    end

    subgraph "Phase 2: Enhancement"
        DiagramInject["Diagram injection\ninto markdown"]
        TemplateProcess["Template processing\nprocess-template.py"]
        TemplateInject["Template injection\nPrepend header\nAppend footer"]
        EnhancedMD["markdown/\ndirectory"]
    end

    subgraph "Phase 3: Build"
        SummaryGen["SUMMARY.md generation"]
        MDBookBuild["mdbook build"]
        HTMLOutput["book/\ndirectory"]
    end

    Scraper --> RawMD
    RawMD --> DiagramInject
    DiagramInject --> TemplateProcess
    TemplateProcess --> TemplateInject
    TemplateInject --> EnhancedMD
    EnhancedMD --> SummaryGen
    SummaryGen --> MDBookBuild
    MDBookBuild --> HTMLOutput
```

Integration with Build Process

The template system integrates into the three-phase build pipeline at a specific injection point during Phase 2.

Build Phase Integration Points

The template injection occurs after diagram enhancement but before SUMMARY.md generation. This ensures:

  1. All content modifications (diagrams) are complete before templates are added
  2. Templates appear in every markdown file processed by mdBook
  3. The same header/footer styling applies consistently across all pages

File Processing Order:

  1. build-docs.sh processes header and footer templates once at the start of Phase 2
  2. The processed templates are stored in temporary variables
  3. For each .md file in the markdown/ directory:
    • Read original file content
    • Concatenate: processed_header + content + processed_footer
    • Write back to the same file
  4. Continue to Phase 3 with enhanced files
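A sketch of that injection loop, assuming the processed templates are held in the HEADER_HTML and FOOTER_HTML shell variables (see Template Variables for where those are defined):

```bash
find "$WIKI_DIR" -name '*.md' | while read -r f; do
    content=$(cat "$f")
    printf '%s\n\n%s\n\n%s\n' "$HEADER_HTML" "$content" "$FOOTER_HTML" > "$f"
done
```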

Sources: python/process-template.py:1-83

Template Processing Error Handling

The process-template.py script implements basic error handling for common failure scenarios:

Error Conditions and Responses:

| Condition | Detection | Response |
|---|---|---|
| Missing template file | os.path.isfile() check at line 62 | Print error to stderr, exit code 1 |
| Insufficient arguments | len(sys.argv) < 2 check at line 54 | Print usage message, exit code 1 |
| Missing variable | Dictionary lookup at line 43 | Return empty string (silent) |
| Invalid conditional | Regex fails to match | No replacement (silent) |
| File read error | Exception during file open | Unhandled exception propagates |

The error handling philosophy is “fail early for missing inputs, fail silently for missing variables.” This allows templates to gracefully degrade when optional variables (like GIT_REPO_URL) are not provided, while catching configuration errors before processing begins.

Sources: python/process-template.py:54-63


Template Variables

Relevant source files

Purpose and Scope

This page provides a complete reference for all template variables available in the deepwiki-to-mdbook template system. Template variables are placeholders that get substituted with actual values during the build process, allowing dynamic customization of headers and footers injected into each markdown file.

For information about the broader template system architecture and conditional logic, see Template System. For guidance on providing custom templates, see Custom Templates.

Variable Overview

Template variables are captured during the build process and passed to process-template.py for substitution into HTML template files. These variables provide context about the repository, documentation metadata, and generated links.

Sources: scripts/build-docs.sh:195-234

Complete Variable Reference

The following table documents all available template variables:

| Variable | Description | Source/Derivation | Default Value |
|---|---|---|---|
| REPO | Repository identifier in owner/repo format | Environment variable or auto-detected from Git remote | Required (no default) |
| BOOK_TITLE | Documentation book title | Environment variable | "Documentation" |
| BOOK_AUTHORS | Authors of the documentation | Environment variable | Value of REPO_OWNER (first part of REPO) |
| GENERATION_DATE | Timestamp when documentation was generated | Environment variable or auto-generated | Current UTC datetime in format "Month DD, YYYY at HH:MM UTC" |
| DEEPWIKI_URL | URL to DeepWiki documentation page | Derived from REPO | https://deepwiki.com/{REPO} |
| DEEPWIKI_BADGE_URL | URL to DeepWiki badge image | Static value | https://deepwiki.com/badge.svg |
| GIT_REPO_URL | URL to Git repository | Environment variable | https://github.com/{REPO} |
| GITHUB_BADGE_URL | URL to GitHub badge image | Generated from REPO with URL encoding | https://img.shields.io/badge/GitHub-{encoded_repo}-181717?logo=github |

Sources: scripts/build-docs.sh:8-51 scripts/build-docs.sh:195-213

Variable Resolution Flow

The following diagram illustrates how template variables are resolved from multiple sources:

Sources: scripts/build-docs.sh:8-51 scripts/build-docs.sh:195-234

```mermaid
graph TB
    subgraph "Input Sources"
        EnvVars["Environment Variables\n(REPO, BOOK_TITLE, etc.)"]
        GitRemote["Git Remote\n(git config remote.origin.url)"]
        SystemTime["System Time\n(date -u)"]
    end

    subgraph "Resolution Logic in build-docs.sh"
        AutoDetect["Auto-Detection\n[build-docs.sh:8-19]"]
        SetDefaults["Set Defaults\n[build-docs.sh:21-26]"]
        DeriveValues["Derive URLs\n[build-docs.sh:40-51]"]
        CaptureDate["Capture Timestamp\n[build-docs.sh:200]"]
    end

    subgraph "Variable Processing"
        TemplateProcessor["process-template.py"]
        HeaderTemplate["templates/header.html"]
        FooterTemplate["templates/footer.html"]
    end

    subgraph "Output"
        ProcessedHeader["Processed Header HTML"]
        ProcessedFooter["Processed Footer HTML"]
        InjectedMD["Markdown Files\nwith Headers/Footers"]
    end

    EnvVars -->|REPO not set| AutoDetect
    GitRemote --> AutoDetect
    AutoDetect -->|extract owner/repo| SetDefaults

    EnvVars --> SetDefaults
    SetDefaults --> DeriveValues

    SystemTime --> CaptureDate

    DeriveValues -->|8 variables| TemplateProcessor
    CaptureDate --> TemplateProcessor

    HeaderTemplate --> TemplateProcessor
    FooterTemplate --> TemplateProcessor

    TemplateProcessor --> ProcessedHeader
    TemplateProcessor --> ProcessedFooter

    ProcessedHeader --> InjectedMD
    ProcessedFooter --> InjectedMD
```

Variable Processing Implementation

Capture and Derivation

The variable capture process occurs in several stages within build-docs.sh:

  1. Auto-detection scripts/build-docs.sh:8-19: If REPO is not set, the script attempts to extract it from the Git remote URL using git config --get remote.origin.url.

  2. Default assignment scripts/build-docs.sh:21-26: Environment variables are assigned default values. The pattern ${VAR:-default} provides fallback values.

  3. URL derivation scripts/build-docs.sh:40-51: Several URLs are constructed from the base variables:

    • DEEPWIKI_URL is built from REPO
    • GIT_REPO_URL defaults to GitHub URL from REPO
    • GITHUB_BADGE_URL includes URL-encoded REPO with character escaping
  4. Timestamp capture scripts/build-docs.sh:200: The generation date is captured in UTC format using date -u.

Sources: scripts/build-docs.sh:8-51 scripts/build-docs.sh:200

Badge URL Generation

The GitHub badge URL generation includes special character handling:

```bash
REPO_BADGE_LABEL=$(printf '%s' "$REPO" | sed 's/-/--/g' | sed 's/\//%2F/g')
GITHUB_BADGE_URL="https://img.shields.io/badge/GitHub-${REPO_BADGE_LABEL}-181717?logo=github"
```

This escapes hyphens (doubled) and URL-encodes slashes for shields.io badge compatibility.

Sources: scripts/build-docs.sh:50-51

Template Invocation

Variables are passed to process-template.py as command-line arguments in KEY=VALUE format:
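A sketch of the invocation (variable names from the reference table above; the real call sites are scripts/build-docs.sh:205-213 and scripts/build-docs.sh:222-230):

```bash
python3 process-template.py "$HEADER_TEMPLATE" \
    "REPO=$REPO" \
    "BOOK_TITLE=$BOOK_TITLE" \
    "BOOK_AUTHORS=$BOOK_AUTHORS" \
    "GENERATION_DATE=$GENERATION_DATE" \
    "DEEPWIKI_URL=$DEEPWIKI_URL" \
    "DEEPWIKI_BADGE_URL=$DEEPWIKI_BADGE_URL" \
    "GIT_REPO_URL=$GIT_REPO_URL" \
    "GITHUB_BADGE_URL=$GITHUB_BADGE_URL"
```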

The same variables are passed for both header and footer template processing.

Sources: scripts/build-docs.sh:205-213 scripts/build-docs.sh:222-230

```mermaid
graph LR
    subgraph "Stage 1: Variable Collection"
        CollectEnv["Environment Variables"]
        CollectGit["Git Remote Detection"]
        CollectTime["Timestamp Generation"]
    end

    subgraph "Stage 2: build-docs.sh Processing"
        ValidateREPO["Validate REPO\n[build-docs.sh:33-38]"]
        ExtractParts["Extract REPO_OWNER\nand REPO_NAME\n[build-docs.sh:40-42]"]
        ApplyDefaults["Apply Defaults\n[build-docs.sh:44-46]"]
        ConstructURLs["Construct Badge URLs\n[build-docs.sh:48-51]"]
    end

    subgraph "Stage 3: Template Processing"
        InvokeProcessor["Invoke process-template.py\n[build-docs.sh:205-230]"]
        SubstituteVars["Variable Substitution"]
        EvalConditionals["Evaluate Conditionals"]
        StripComments["Strip HTML Comments"]
    end

    subgraph "Stage 4: Output"
        HeaderHTML["HEADER_HTML"]
        FooterHTML["FOOTER_HTML"]
        InjectFiles["Inject into Markdown\n[build-docs.sh:240-261]"]
    end

    CollectEnv --> ValidateREPO
    CollectGit --> ValidateREPO
    CollectTime --> InvokeProcessor

    ValidateREPO --> ExtractParts
    ExtractParts --> ApplyDefaults
    ApplyDefaults --> ConstructURLs
    ConstructURLs --> InvokeProcessor

    InvokeProcessor --> SubstituteVars
    SubstituteVars --> EvalConditionals
    EvalConditionals --> StripComments

    StripComments --> HeaderHTML
    StripComments --> FooterHTML

    HeaderHTML --> InjectFiles
    FooterHTML --> InjectFiles
```

Variable Processing Pipeline

The diagram above shows how variables flow through the processing pipeline.

Sources: scripts/build-docs.sh:8-51 scripts/build-docs.sh:195-261

Usage Syntax

Variable Substitution

Variables are referenced in templates using double curly braces:
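For example, a hypothetical template fragment:

```html
<h1>{{BOOK_TITLE}}</h1>
```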

At render time, this becomes:
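With BOOK_TITLE set to the example value from the table above:

```html
<h1>My Project Documentation</h1>
```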

Sources: templates/README.md:12-15

Conditional Blocks

Variables can be tested for existence using conditional syntax:
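For example (hypothetical fragment):

```html
{{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}">View source on GitHub</a>{{/if}}
```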

If GIT_REPO_URL is set, the link is rendered. If not set or empty, the entire block is omitted.

Sources: templates/README.md:17-22

Multiple Variables

Multiple variables can be combined in a single template:
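For example (hypothetical fragment):

```html
<p>{{BOOK_TITLE}} by {{BOOK_AUTHORS}}, generated on {{GENERATION_DATE}}</p>
```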

Sources: templates/README.md:60-68

Implementation Examples

The default templates use variables to generate badge links. Here’s a typical pattern:
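A sketch of that pattern (the shipped markup lives in templates/header.html):

```html
{{#if DEEPWIKI_BADGE_URL}}<a href="{{DEEPWIKI_URL}}"><img src="{{DEEPWIKI_BADGE_URL}}" alt="Ask DeepWiki"></a>{{/if}}
{{#if GITHUB_BADGE_URL}}<a href="{{GIT_REPO_URL}}"><img src="{{GITHUB_BADGE_URL}}" alt="GitHub"></a>{{/if}}
```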

This generates clickable badge images linking to both DeepWiki and GitHub, but only if the URLs are configured.

Sources: templates/README.md:30-39

The footer typically includes the generation timestamp:
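A sketch (the shipped markup lives in templates/footer.html):

```html
<p><em>Generated on {{GENERATION_DATE}}</em></p>
```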

With GENERATION_DATE="December 25, 2024 at 14:30 UTC", this renders as:
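```html
<p><em>Generated on December 25, 2024 at 14:30 UTC</em></p>
```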

Sources: templates/README.md:70-76

Environment Variable Override

All template variables can be overridden via environment variables when running the Docker container:
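For example (sketch; the image name is a placeholder):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -e BOOK_TITLE="My Project Documentation" \
  -e BOOK_AUTHORS="Project Contributors" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```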

This allows full customization without modifying template files.

Sources: scripts/build-docs.sh:21-26

Variable Validation

Only REPO is strictly required. The build will fail if it cannot be determined:
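A sketch of the guard (the real check is at scripts/build-docs.sh:33-38):

```bash
if [ -z "$REPO" ]; then
    echo "ERROR: REPO must be set (owner/repo) or detectable from a Git remote" >&2
    exit 1
fi
```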

All other variables have sensible defaults and can be omitted.

Sources: scripts/build-docs.sh:33-38

Variable Storage During Build

After processing, the rendered HTML is stored in shell variables:

  • HEADER_HTML contains the processed header template
  • FOOTER_HTML contains the processed footer template

These are then injected into each markdown file during the copy operation scripts/build-docs.sh:240-261

Sources: scripts/build-docs.sh:205-234 scripts/build-docs.sh:240-261


Custom Templates

Relevant source files

This page explains how to provide custom header and footer templates for the DeepWiki-to-mdBook converter through Docker volume mounts. Custom templates allow you to control the HTML content injected at the beginning and end of every generated markdown file, enabling branding, navigation elements, and custom styling.

For information about the variables available for use within templates, see Template Variables. For comprehensive details about the template system architecture and processing logic, see Template System.

Purpose and Scope

The template system supports customization through two mechanisms:

  1. Volume Mounts : Replace default templates by mounting custom files into the container
  2. Environment Variables : Override default template file paths

This page documents both mechanisms and provides practical examples for common customization scenarios.

Sources: README.md:39-51 templates/README.md:42-56

Default Template Architecture

The system includes two default templates that are baked into the Docker image:

| Template File | Container Path | Purpose |
|---|---|---|
| header.html | /workspace/templates/header.html | Injected at the start of each markdown file |
| footer.html | /workspace/templates/footer.html | Injected at the end of each markdown file |

```mermaid
graph LR
    subgraph "Container Filesystem"
        DefaultTemplates["/workspace/templates/\nheader.html\nfooter.html"]
        CustomMount["/workspace/templates/\n(volume mount)"]
    end

    subgraph "Resolution Logic"
        CheckMount{"Custom\ntemplates\nmounted?"}
        UseDefault["Use default\ntemplates"]
        UseCustom["Use custom\ntemplates"]
    end

    subgraph "Processing Pipeline"
        ProcessTemplate["process-template.py\nVariable substitution\nConditional rendering"]
        MarkdownFiles["markdown/*.md"]
        InjectedContent["Markdown with\ninjected headers/footers"]
    end

    DefaultTemplates --> CheckMount
    CustomMount --> CheckMount
    CheckMount -->|No| UseDefault
    CheckMount -->|Yes| UseCustom
    UseDefault --> ProcessTemplate
    UseCustom --> ProcessTemplate
    ProcessTemplate --> MarkdownFiles
    MarkdownFiles --> InjectedContent
```

These defaults provide basic documentation metadata including repository links, DeepWiki badges, and generation timestamps. Custom templates completely replace these defaults when mounted.

Template Processing Flow:

Diagram: Template Resolution and Processing Pipeline

The system checks for mounted templates at container startup. If custom templates are found at the mount point, they replace the defaults entirely. The selected templates are then processed by process-template.py for variable substitution and conditional rendering before being injected into each markdown file.

Sources: templates/README.md:5-8 README.md:39-51

Volume Mount Strategies

Full Directory Mount

The recommended approach is to mount an entire directory containing both custom templates:
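Sketch (host paths and image name are placeholders):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -v "$(pwd)/my-templates:/workspace/templates" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```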

Directory Structure:

my-templates/
├── header.html
└── footer.html

This strategy provides clean separation between your custom templates and the system, and allows you to version control both templates together.

Individual File Mounts

For granular control, mount individual template files:
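A sketch of the corresponding docker run flags:

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -v "$(pwd)/custom/header.html:/workspace/templates/header.html" \
  -v "$(pwd)/custom/footer.html:/workspace/templates/footer.html" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```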

This approach is useful when:

  • You only want to customize one template (header or footer)
  • Your templates are stored in different locations
  • You’re testing template changes without modifying a template directory

Partial Customization

You can mount only one template file, and the system will use the default for the other; the individual-file example above applies, with just a single -v flag for the template you want to replace.

Sources: README.md:44-49 templates/README.md:46-51

Environment Variable Configuration

The template system exposes three environment variables for advanced path customization:

| Environment Variable | Default Value | Purpose |
|---|---|---|
| TEMPLATE_DIR | /workspace/templates | Base directory for template files |
| HEADER_TEMPLATE | $TEMPLATE_DIR/header.html | Path to header template |
| FOOTER_TEMPLATE | $TEMPLATE_DIR/footer.html | Path to footer template |

Custom Template Directory

Override the entire template directory location:
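Sketch:

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -e TEMPLATE_DIR=/custom/path \
  -v "$(pwd)/templates:/custom/path" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```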

Custom Template Paths

Override individual template file paths:
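Sketch (the /mnt/branding paths are placeholders):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -e HEADER_TEMPLATE=/mnt/branding/header.html \
  -e FOOTER_TEMPLATE=/mnt/branding/footer.html \
  -v "$(pwd)/branding:/mnt/branding" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```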

This advanced configuration is rarely needed but provides maximum flexibility for non-standard deployment scenarios.

Sources: templates/README.md:53-56

Template Processing Implementation

The template injection mechanism is implemented in process-template.py, which is invoked by the main orchestrator during Phase 2 of the build pipeline.

Template Processing Code Flow:

```mermaid
graph TB
    subgraph "build-docs.sh Orchestration"
        BuildScript["build-docs.sh"]
        Phase2["Phase 2: Enhancement"]
    end

    subgraph "process-template.py"
        LoadTemplate["load_template()\nRead header/footer files"]
        ParseVars["Variable Substitution\n{{VAR}} → value"]
        ParseCond["Conditional Rendering\n{{#if VAR}}...{{/if}}"]
        StripComments["strip_html_comments()\nRemove <!-- ... -->"]
        ProcessFile["process_file()\nInject into markdown"]
    end

    subgraph "File System"
        TemplateFiles["/workspace/templates/\nheader.html\nfooter.html"]
        MarkdownDir["markdown/*.md"]
        EnhancedMD["Enhanced markdown\nwith headers/footers"]
    end

    BuildScript --> Phase2
    Phase2 --> LoadTemplate
    TemplateFiles --> LoadTemplate
    LoadTemplate --> ParseVars
    ParseVars --> ParseCond
    ParseCond --> StripComments
    StripComments --> ProcessFile
    MarkdownDir --> ProcessFile
    ProcessFile --> EnhancedMD
```

Diagram: Template Processing Implementation Flow

The process-template.py script is executed by build-docs.sh during Phase 2. It loads the template files from the configured paths, performs variable substitution and conditional rendering, strips HTML comments, and injects the processed content into each markdown file.

Key Functions in process-template.py:

| Function | Responsibility | Implementation Detail |
|---|---|---|
| load_template() | Read template file content | Returns raw HTML string from file path |
| Variable substitution | Replace {{VAR}} with values | Regex-based pattern matching |
| Conditional rendering | Process {{#if VAR}}...{{/if}} | Evaluates variable truthiness |
| strip_html_comments() | Remove HTML comments | Regex pattern: <!--.*?--> |
| process_file() | Inject templates into markdown | Prepends header, appends footer |

Sources: templates/README.md:24-28 README.md:51

Practical Examples

Minimal Custom Header

Replace the default header with a simple title banner:

File: my-templates/header.html
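A hypothetical minimal header:

```html
<!-- my-templates/header.html (hypothetical) -->
<div style="text-align: center; border-bottom: 1px solid #ccc; padding-bottom: 8px;">
  <strong>{{BOOK_TITLE}}</strong>
</div>
```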

File: my-templates/footer.html
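A matching minimal footer:

```html
<!-- my-templates/footer.html (hypothetical) -->
<hr>
<p><em>Generated on {{GENERATION_DATE}}</em></p>
```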

Usage:
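```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -v "$(pwd)/my-templates:/workspace/templates" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```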

Branded Documentation

Add company branding and navigation:

File: custom/header.html
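A hypothetical branded header with navigation:

```html
<!-- custom/header.html (hypothetical) -->
<p>
  {{#if GIT_REPO_URL}}<a href="{{GIT_REPO_URL}}">Source</a> |{{/if}}
  <a href="{{DEEPWIKI_URL}}">DeepWiki</a>
</p>
<h2>{{BOOK_TITLE}}</h2>
```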

File: custom/footer.html
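And a matching footer:

```html
<!-- custom/footer.html (hypothetical) -->
<hr>
<p>&copy; {{BOOK_AUTHORS}} &middot; generated {{GENERATION_DATE}}</p>
```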

Badge-Based Navigation

Create a navigation header using badges:

File: badges/header.html
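A hypothetical badge-row header:

```html
<!-- badges/header.html (hypothetical) -->
{{#if DEEPWIKI_BADGE_URL}}<a href="{{DEEPWIKI_URL}}"><img src="{{DEEPWIKI_BADGE_URL}}" alt="Ask DeepWiki"></a>{{/if}}
{{#if GITHUB_BADGE_URL}}<a href="{{GIT_REPO_URL}}"><img src="{{GITHUB_BADGE_URL}}" alt="GitHub"></a>{{/if}}
```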

Sources: templates/README.md:60-76

```mermaid
graph TB
    subgraph "Phase 1: Extraction"
        Scrape["deepwiki-scraper.py\nExtract wiki content"]
        RawMD["raw_markdown/"]
    end

    subgraph "Phase 2: Enhancement"
        DiagramInject["Inject Mermaid diagrams"]
        CheckTemplates{"Custom\ntemplates\nmounted?"}
        LoadDefaults["Load default templates\n/workspace/templates/"]
        LoadCustom["Load custom templates\n(volume mount)"]
        ProcessTemplates["process-template.py\nInject headers/footers"]
        EnhancedMD["markdown/"]
    end

    subgraph "Phase 3: Build"
        GenSummary["Generate SUMMARY.md"]
        MDBookBuild["mdbook build"]
        FinalHTML["book/"]
    end

    Scrape --> RawMD
    RawMD --> DiagramInject
    DiagramInject --> CheckTemplates
    CheckTemplates -->|No| LoadDefaults
    CheckTemplates -->|Yes| LoadCustom
    LoadDefaults --> ProcessTemplates
    LoadCustom --> ProcessTemplates
    ProcessTemplates --> EnhancedMD
    EnhancedMD --> GenSummary
    GenSummary --> MDBookBuild
    MDBookBuild --> FinalHTML
```

Integration with Build Pipeline

Custom templates are integrated at a specific point in the three-phase build pipeline, as shown in the diagram above.

Diagram: Custom Templates in Build Pipeline Context

Template customization occurs in Phase 2, after diagram injection but before mdBook structure generation. This ensures that custom headers and footers are present in the markdown files when SUMMARY.md is generated and mdBook performs its final build. The template selection happens once at the beginning of Phase 2, and the same templates are applied consistently to all markdown files.

Sources: README.md:72-76

Advanced Customization Patterns

Conditional Content Based on Environment

Use conditionals to show different content in different contexts:
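For example (hypothetical fragment):

```html
{{#if GIT_REPO_URL}}<p>Source: <a href="{{GIT_REPO_URL}}">{{REPO}}</a></p>{{/if}}
```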

Testing Template Changes Locally

Workflow for iterating on custom templates:

  1. Create template directory: mkdir -p ./test-templates

  2. Add custom templates: vim ./test-templates/header.html

  3. Run container with mounted templates (see the sketch after this list)

  4. View results: cd output && python3 -m http.server --directory book 8000

  5. Iterate by editing templates and re-running step 3
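The step-3 sketch referenced above (image name is a placeholder):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -v "$(pwd)/test-templates:/workspace/templates" \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook
```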

Template Debugging

When templates don’t render as expected:

  1. Check file paths : Verify templates are mounted correctly

  2. Inspect raw markdown : Check output/markdown/ files to see injected content

  3. Verify variable availability : Ensure variables used in templates are set

    • Check environment variables passed to container
    • Review Template Variables for available variables
  4. Test HTML syntax : Validate HTML before mounting

Sources: templates/README.md:1-77 README.md:39-51

Template Mount Configuration Matrix

| Configuration Type | Volume Mount | Environment Variables | Use Case |
|---|---|---|---|
| Default templates | None | None | Standard documentation |
| Full custom directory | -v ./templates:/workspace/templates | None | Complete customization |
| Individual files | -v ./header.html:/workspace/templates/header.html | None | Partial customization |
| Custom location | -v ./templates:/custom/path | TEMPLATE_DIR=/custom/path | Non-standard paths |
| Advanced paths | Multiple mounts | HEADER_TEMPLATE=..., FOOTER_TEMPLATE=... | Complex setups |

Sources: templates/README.md:42-56 README.md:44-49


Advanced Topics

Relevant source files

This page covers advanced usage scenarios, implementation details, and power-user features of the DeepWiki-to-mdBook Converter. It provides deeper insight into optional features, debugging techniques, and the internal mechanisms that enable the system’s flexibility and robustness.

For basic usage and configuration, see Quick Start and Configuration Reference. For architectural overview, see System Architecture. For component-level details, see Component Reference.

When to Use Advanced Features

The system provides several advanced features designed for specific scenarios:

Markdown-Only Mode : Extract markdown without building the HTML documentation. Useful for:

  • Debugging diagram placement and content extraction
  • Quick iteration during development
  • Creating markdown archives for version control
  • Feeding extracted content into other tools

Auto-Detection : Automatically determine repository metadata from Git remotes. Useful for:

  • CI/CD pipeline integration with minimal configuration
  • Running from within a repository checkout
  • Reducing configuration boilerplate

Custom Configuration : Override default behaviors through environment variables. Useful for:

  • Multi-repository documentation builds
  • Custom branding and themes
  • Specialized output requirements

Decision Flow for Build Modes

Sources: build-docs.sh:60-76 build-docs.sh:78-206 README.md:55-76

Debugging Strategies

Using Markdown-Only Mode for Fast Iteration

The MARKDOWN_ONLY environment variable bypasses the mdBook build phase, reducing build time from minutes to seconds. This is controlled by a simple conditional check in the orchestration script.

Workflow:

  1. Set MARKDOWN_ONLY=true in Docker run command
  2. Script executes build-docs.sh:60-76 which skips Steps 2-6
  3. Only Phase 1 (scraping) and Phase 2 (diagram enhancement) execute
  4. Output written directly to /output/markdown/

Typical debugging session:
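For example (image name is a placeholder):

```bash
docker run --rm \
  -e REPO=myorg/myproject \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook

less output/markdown/1-overview.md   # inspect diagram placement directly
```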

The check at build-docs.sh:61 determines whether to exit early:
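A sketch of that check:

```bash
if [ "$MARKDOWN_ONLY" = "true" ]; then
    cp -r "$WIKI_DIR" /output/markdown
    echo "Markdown-only mode: skipping mdBook build"
    exit 0
fi
```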

For detailed information about this mode, see Markdown-Only Mode.

Sources: build-docs.sh:60-76 build-docs.sh:26 README.md:55-76

Inspecting Intermediate Outputs

The system uses a temporary directory workflow that can be examined for debugging:

| Stage | Location | Contents |
|---|---|---|
| During Phase 1 | /workspace/wiki/ (temp) | Raw markdown before diagram enhancement |
| During Phase 2 | /workspace/wiki/ (temp) | Markdown with injected diagrams |
| During Phase 3 | /workspace/book/src/ | Markdown copied for mdBook |
| Final Output | /output/markdown/ | Final enhanced markdown files |

The temporary directory pattern is implemented using Python’s tempfile.TemporaryDirectory at tools/deepwiki-scraper.py:808:
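A sketch of the pattern:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as temp_dir:
    wiki_dir = Path(temp_dir) / "wiki"
    wiki_dir.mkdir()
    # ... scrape pages and inject diagrams under wiki_dir ...
    # On success the results are copied to the final output directory;
    # on any failure the context manager deletes temp_dir automatically.
```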

This ensures atomic operations—if the script fails mid-process, partial outputs are automatically cleaned up.

Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:27-30

Diagram Placement Debugging

Diagram injection uses fuzzy matching with progressive chunk sizes. To debug placement:

  1. Check raw extraction count : Look for console output “Found N total diagrams”
  2. Check context extraction : Look for “Found N diagrams with context”
  3. Check matching : Look for “Enhanced X files with diagrams”

The matching algorithm tries progressively smaller chunks at tools/deepwiki-scraper.py:716-730:
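A hypothetical sketch of that strategy (the chunk sizes here are illustrative, not the actual values):

```python
def find_anchor(context: str, markdown: str) -> int:
    """Try progressively shorter trailing chunks of the diagram's context."""
    for size in (500, 300, 200, 100, 50):  # illustrative sizes
        chunk = context[-size:]
        pos = markdown.find(chunk)
        if pos != -1:
            return pos
    return -1  # no confident placement; the diagram is skipped
```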

Debugging poor matches:

  • If too few diagrams placed: The context from JavaScript may not match converted markdown
  • If diagrams in wrong locations: Context text may appear in multiple locations
  • If no diagrams: Repository may not contain mermaid diagrams

Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:216-331

Cross-Page Link Rewriting

DeepWiki uses absolute URLs like /owner/repo/2-1-subsection. The scraper must convert these to relative markdown paths that work in the mdBook file hierarchy:

output/markdown/
├── 1-overview.md
├── 2-main-section.md
├── section-2/
│   ├── 2-1-subsection.md
│   └── 2-2-another.md
└── 3-next-section.md

Links must account for:

  • Source page location (main page vs. subsection)
  • Target page location (main page vs. subsection)
  • Same section vs. cross-section links

Sources: tools/deepwiki-scraper.py:549-593

The fix_wiki_link function at tools/deepwiki-scraper.py:549-589 implements this logic:

It first parses the target page number and slug from the wiki URL, then determines whether the source and target pages live at the markdown root or inside a section-N/ directory.

Path generation rules:

| Source Location | Target Location | Generated Path | Example |
|---|---|---|---|
| Main page | Main page | file.md | 3-next.md |
| Main page | Subsection | section-N/file.md | section-2/2-1-sub.md |
| Subsection | Main page | ../file.md | ../3-next.md |
| Subsection (same section) | Subsection | file.md | 2-2-another.md |
| Subsection (diff section) | Subsection | section-N/file.md | section-3/3-1-sub.md |
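A hypothetical sketch encoding the table’s rules (names and signature are illustrative, not the actual code):

```python
def relative_link(source_section: str | None, target_num: str, slug: str) -> str:
    """source_section: e.g. 'section-2' if the source page is a subsection, else None."""
    is_sub = "-" in target_num                                # '2-1' vs '2'
    target_section = f"section-{target_num.split('-')[0]}" if is_sub else None
    filename = f"{target_num}-{slug}.md"

    if source_section is None:                                # main-page source
        return f"{target_section}/{filename}" if target_section else filename
    if target_section is None:                                # subsection -> main page
        return f"../{filename}"
    if target_section == source_section:                      # same section
        return filename
    return f"{target_section}/{filename}"                     # different section
```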

The regex replacement at tools/deepwiki-scraper.py:592 applies this transformation to all links.

For detailed explanation, see Link Rewriting Logic.

Sources: tools/deepwiki-scraper.py:549-593

Auto-Detection Mechanisms

```mermaid
flowchart TD
    Start["build-docs.sh starts"] --> CheckRepo{"REPO env var\nprovided?"}
    CheckRepo -->|Yes| UseEnv["Use provided REPO value"]
    CheckRepo -->|No| CheckGit{"Is current directory\na Git repository?\n(git rev-parse --git-dir)"}
    CheckGit -->|Yes| GetRemote["Get remote.origin.url:\ngit config --get\nremote.origin.url"]
    CheckGit -->|No| SetEmpty["Set REPO=<empty>"]
    GetRemote --> HasRemote{"Remote URL\nfound?"}
    HasRemote -->|Yes| ParseURL["Parse GitHub URL using sed regex:\nExtract owner/repo"]
    HasRemote -->|No| SetEmpty

    ParseURL --> ValidateFormat{"Format is\nowner/repo?"}
    ValidateFormat -->|Yes| SetRepo["Set REPO variable"]
    ValidateFormat -->|No| SetEmpty

    SetEmpty --> FinalCheck{"REPO is empty?"}
    UseEnv --> Continue["Continue with REPO"]
    SetRepo --> Continue

    FinalCheck -->|Yes| Error["ERROR: REPO must be set\nExit with code 1"]
    FinalCheck -->|No| Continue
```

Git Remote Auto-Detection

When REPO environment variable is not provided, the system attempts to auto-detect it from the Git repository in the current working directory.

Sources: build-docs.sh:8-37

Implementation Details

The auto-detection logic at build-docs.sh:8-19 handles multiple Git URL formats:

Supported URL formats:

  • HTTPS: https://github.com/owner/repo.git
  • HTTPS (no .git): https://github.com/owner/repo
  • SSH: git@github.com:owner/repo.git
  • SSH (no .git): git@github.com:owner/repo

The regex pattern .*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.* captures:

  • [:/] - Matches either : (SSH) or / (HTTPS)
  • ([^/]+/[^/\.]+) - Captures owner/repo (stops at / or .)
  • (\.git)? - Optionally matches .git suffix
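A sketch of the whole detection step, assembled from the flow above (the structure is inferred, not copied from the script):

```bash
if [ -z "$REPO" ] && git rev-parse --git-dir >/dev/null 2>&1; then
    REMOTE_URL=$(git config --get remote.origin.url || true)
    if [ -n "$REMOTE_URL" ]; then
        REPO=$(echo "$REMOTE_URL" | sed -E 's#.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*#\1#')
    fi
fi
```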

Derived defaults:

After determining REPO, the script derives other configuration at build-docs.sh:39-45:
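A sketch of the derivation (variable names match those documented in Auto-Detection Features):

```bash
REPO_OWNER="${REPO%%/*}"                              # owner/repo -> owner
BOOK_AUTHORS="${BOOK_AUTHORS:-$REPO_OWNER}"
GIT_REPO_URL="${GIT_REPO_URL:-https://github.com/$REPO}"
```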

This provides sensible defaults:

  • BOOK_AUTHORS defaults to repository owner
  • GIT_REPO_URL defaults to GitHub URL (for “Edit this page” links)

For detailed explanation, see Auto-Detection Features.

Sources: build-docs.sh:8-45 README.md:47-53

Performance Considerations

Build Time Breakdown

Typical build times for a medium-sized repository (50-100 pages):

| Phase | Time | Bottleneck |
|---|---|---|
| Phase 1: Scraping | 60-120s | Network requests + 1s delays |
| Phase 2: Diagrams | 5-10s | Regex matching + file I/O |
| Phase 3: mdBook | 10-20s | mdBook build + mermaid asset installation |
| Total | 75-150s | Network + computation |

Optimization Strategies

Network optimization: Phase 1 dominates total build time and is bounded by network latency plus the fixed 1-second delay between requests, so reducing the number of pages fetched is the main lever.

Markdown-only mode:

  • Skips Phase 3 entirely, reducing build time by ~15-25%
  • Useful for content-only iterations

Docker build optimization:

  • Multi-stage build discards Rust toolchain (~1.5 GB)
  • Final image only contains binaries (~300-400 MB)
  • See Docker Multi-Stage Build for details

Caching considerations:

  • No internal caching—each run fetches fresh content
  • DeepWiki serves dynamic content (no cache headers)
  • Docker layer caching helps with repeated image builds

Sources: tools/deepwiki-scraper.py:28-42 tools/deepwiki-scraper.py:817-821 tools/deepwiki-scraper.py:872

Extending the System

Adding New Output Formats

The system’s three-phase architecture makes it easy to add new output formats:

Integration points:

  1. Before Phase 3: Add code after build-docs.sh:188 to read from $WIKI_DIR
  2. Alternative Phase 3: Replace build-docs.sh:174-176 with custom builder
  3. Post-processing: Add steps after build-docs.sh:192 to transform mdBook output

Example: Adding PDF export:
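A hypothetical sketch using pandoc as the converter (any markdown-to-PDF tool would slot in the same way; pandoc needs a PDF engine such as LaTeX installed, and the glob below ignores section subdirectories for brevity):

```bash
# After Phase 2, $WIKI_DIR holds the enhanced markdown
pandoc "$WIKI_DIR"/*.md -o "$OUTPUT_DIR/documentation.pdf"
```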

Sources: build-docs.sh:174-206

Customizing Diagram Matching

The fuzzy matching algorithm can be tuned by modifying the chunk sizes at tools/deepwiki-scraper.py:716:

Matching strategy customization:

The scoring system at tools/deepwiki-scraper.py:709-745 prioritizes:

  1. Anchor text matching (weighted by chunk size)
  2. Heading matching (weight: 50)
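A sketch of how such a scoring pass could be tuned (the weights follow the description above; the function shape is an assumption):

```python
HEADING_WEIGHT = 50  # documented weight for heading matches

def score(chunk_size: int, anchor_hit: bool, heading_hit: bool) -> int:
    total = 0
    if anchor_hit:
        total += chunk_size      # larger matched chunks score higher
    if heading_hit:
        total += HEADING_WEIGHT
    return total
```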

You can add additional heuristics by modifying the scoring logic or adding new matching strategies.

Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:716-745

Adding New Content Cleaners

The HTML-to-markdown conversion can be enhanced by adding custom cleaners at tools/deepwiki-scraper.py:489-511:

The footer cleaner at tools/deepwiki-scraper.py:127-173 can be extended with additional patterns:
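A hypothetical extension (the pattern list and helper shape are illustrative, not the script's actual structure):

```python
import re

EXTRA_FOOTER_PATTERNS = [
    re.compile(r'^Dismiss\s*$', re.MULTILINE),
    re.compile(r'^Refresh this wiki\s*$', re.MULTILINE),
]

def strip_extra_footers(markdown: str) -> str:
    for pattern in EXTRA_FOOTER_PATTERNS:
        markdown = pattern.sub('', markdown)
    return markdown
```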

Sources: tools/deepwiki-scraper.py:127-173 tools/deepwiki-scraper.py:466-511

Common Advanced Scenarios

CI/CD Integration

GitHub Actions example:
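The workflow itself is YAML, but the essential step boils down to commands like these (image name and mount points are placeholders):

```bash
docker build -t deepwiki-scraper .
docker run -v "$GITHUB_WORKSPACE:/workspace" \
  -v "$GITHUB_WORKSPACE/output:/output" \
  -e BOOK_TITLE="My Project Docs" \
  deepwiki-scraper
```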

The auto-detection at build-docs.sh:8-19 determines REPO from Git context. The BOOK_TITLE overrides the default.

Sources: build-docs.sh:8-45 README.md:228-232

Multi-Repository Builds

Build documentation for multiple repositories in parallel:
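A sketch using one container per repository (the image name is a placeholder):

```bash
for repo in owner/repo-a owner/repo-b; do
  out="$(pwd)/output/${repo##*/}"
  mkdir -p "$out"
  docker run -v "$out:/output" -e REPO="$repo" deepwiki-scraper &
done
wait
```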

Each build runs in an isolated container with separate output directories.

Sources: build-docs.sh:21-53 README.md:200-207

Custom Theming

Override mdBook theme by modifying the generated book.toml template at build-docs.sh:85-103:

Or inject custom CSS:
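One hedged approach, assuming the generated template does not already define an [output.html] table (additional-css is a standard mdBook option; paths here are illustrative):

```bash
cat >> "$BOOK_DIR/book.toml" <<'EOF'

[output.html]
additional-css = ["custom.css"]
EOF
cp my-custom.css "$BOOK_DIR/custom.css"
```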

Sources: build-docs.sh:84-103


Markdown-Only Mode

Relevant source files

This document describes the MARKDOWN_ONLY mode, a special execution mode that terminates the build pipeline after markdown extraction, skipping all mdBook-related processing. This mode is primarily used for debugging the markdown extraction process and for workflows that require raw markdown files without HTML output.

For information about the complete three-phase pipeline, see Three-Phase Pipeline. For details on the mdBook build process that gets skipped in this mode, see Phase 3: mdBook Build.

Purpose and Scope

Markdown-only mode provides a lightweight execution path that produces only markdown files, bypassing the mdBook initialization, template injection, and HTML generation stages. This mode is controlled by the MARKDOWN_ONLY environment variable and affects the execution flow in build-docs.sh.

Key characteristics:

  • Executes only Phase 1 (markdown extraction) of the pipeline
  • Produces markdown/ and raw_markdown/ outputs
  • Skips book/ HTML generation and book.toml configuration
  • Reduces execution time by ~50-70% (no mdBook build overhead)
  • Useful for debugging, alternative workflows, and markdown inspection

Sources: scripts/build-docs.sh:26-93 README.md:37

Execution Path Comparison

The following diagram illustrates how the MARKDOWN_ONLY flag alters the execution flow in build-docs.sh:

Figure 1: Build Pipeline Execution Paths

graph TB
    Start["build-docs.sh starts"]
Config["Configuration phase\n[lines 8-59]"]
Step1["Step 1: deepwiki-scraper.py\nScrape wiki & extract markdown\n[lines 61-65]"]
Check{"MARKDOWN_ONLY == true\n[line 68]"}
subgraph "Markdown-Only Path [lines 69-92]"
        CopyMD["Step 2: Copy markdown/\nfrom WIKI_DIR to OUTPUT_DIR\n[lines 70-73]"]
CopyRaw["Step 3: Copy raw_markdown/\nfrom RAW_DIR to OUTPUT_DIR\n[lines 75-81]"]
Exit1["Exit with status 0\n[line 92]"]
end
    
    subgraph "Standard Path [lines 95-309]"
        Step2["Step 2: Initialize mdBook structure\nCreate book.toml, src/ directory\n[lines 96-122]"]
Step3["Step 3: Generate SUMMARY.md\nDiscover files, build TOC\n[lines 124-188]"]
Step4["Step 4: Process templates\nInject header/footer HTML\n[lines 190-261]"]
Step5["Step 5: Install mdbook-mermaid\n[lines 263-266]"]
Step6["Step 6: mdbook build\n[lines 268-271]"]
Step7["Step 7: Copy all outputs\n[lines 273-294]"]
Exit2["Exit with status 0"]
end
    
 
   Start --> Config
 
   Config --> Step1
 
   Step1 --> Check
 
   Check -->|true| CopyMD
 
   CopyMD --> CopyRaw
 
   CopyRaw --> Exit1
    
 
   Check -->|false default| Step2
 
   Step2 --> Step3
 
   Step3 --> Step4
 
   Step4 --> Step5
 
   Step5 --> Step6
 
   Step6 --> Step7
 
   Step7 --> Exit2

Sources: scripts/build-docs.sh:1-310

Configuration

The MARKDOWN_ONLY mode is enabled by setting the environment variable to the string "true". Any other value (including unset) results in standard mode execution.

Environment variable:

  • Name: MARKDOWN_ONLY
  • Type: String boolean
  • Default: "false"
  • Valid values: "true" (markdown-only) or any other value (standard mode)

Example usage:
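For example (the image name is a placeholder for whatever tag you built):

```bash
docker run -v "$(pwd)/output:/output" \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  deepwiki-scraper
```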

Sources: scripts/build-docs.sh:26 README.md:37

Output Structure

The output directory structure differs significantly between standard and markdown-only modes:

Table 1: Output Comparison

| Output Path | Standard Mode | Markdown-Only Mode | Description |
|---|---|---|---|
| output/book/ | ✓ Generated | ✗ Not created | Searchable HTML documentation |
| output/markdown/ | ✓ Generated | ✓ Generated | Enhanced markdown with diagrams |
| output/raw_markdown/ | ✓ Generated | ✓ Generated | Pre-enhancement markdown snapshots |
| output/book.toml | ✓ Generated | ✗ Not created | mdBook configuration file |

In markdown-only mode, the output/markdown/ directory contains:

  • All scraped wiki pages as .md files with numeric prefixes (e.g., 1-Overview.md)
  • Section subdirectories (e.g., section-4/, section-5/)
  • Enhanced content with Mermaid diagrams inserted via fuzzy matching
  • No template injection (no header/footer HTML)

The output/raw_markdown/ directory contains:

  • Pre-enhancement markdown files (before diagram injection)
  • Useful for comparing before/after states
  • Only present if RAW_DIR exists (created by deepwiki-scraper.py)

Sources: scripts/build-docs.sh:70-92 scripts/build-docs.sh:287-294

Technical Implementation

The markdown-only mode implementation is located in the orchestration script and uses a simple conditional to bypass most processing steps.

Figure 2: Implementation Logic in build-docs.sh

graph TD
    ConfigVar["MARKDOWN_ONLY variable\nline 26: default='false'"]
Step1Complete["deepwiki-scraper.py complete\nWIKI_DIR and RAW_DIR populated"]
Conditional["if [ MARKDOWN_ONLY = true ]\nline 68"]
subgraph "Early Exit Block [lines 69-92]"
        EchoStep2["echo 'Step 2: Copying markdown...'\nline 70"]
RemoveMD["rm -rf OUTPUT_DIR/markdown\nline 71"]
MkdirMD["mkdir -p OUTPUT_DIR/markdown\nline 72"]
CopyMD["cp -r WIKI_DIR/. OUTPUT_DIR/markdown/\nline 73"]
CheckRaw{"if [ -d RAW_DIR ]\nline 75"}
EchoStep3["echo 'Step 3: Copying raw...'\nline 77"]
RemoveRaw["rm -rf OUTPUT_DIR/raw_markdown\nline 78"]
MkdirRaw["mkdir -p OUTPUT_DIR/raw_markdown\nline 79"]
CopyRaw["cp -r RAW_DIR/. OUTPUT_DIR/raw_markdown/\nline 80"]
EchoComplete["echo 'Markdown extraction complete!'\nlines 84-91"]
ExitZero["exit 0\nline 92"]
end
    
    ContinueStandard["Continue to Step 2:\nInitialize mdBook structure\nline 96+"]
ConfigVar --> Step1Complete
 
   Step1Complete --> Conditional
 
   Conditional -->|= true| EchoStep2
 
   EchoStep2 --> RemoveMD
 
   RemoveMD --> MkdirMD
 
   MkdirMD --> CopyMD
 
   CopyMD --> CheckRaw
 
   CheckRaw -->|RAW_DIR exists| EchoStep3
 
   CheckRaw -->|RAW_DIR missing| EchoComplete
 
   EchoStep3 --> RemoveRaw
 
   RemoveRaw --> MkdirRaw
 
   MkdirRaw --> CopyRaw
 
   CopyRaw --> EchoComplete
 
   EchoComplete --> ExitZero
    
 
   Conditional -->|!= true| ContinueStandard

Key implementation details:

  1. String comparison : The check uses shell string equality [ "$MARKDOWN_ONLY" = "true" ], not numeric comparison
  2. Defensive copying : Uses rm -rf before mkdir -p to ensure clean output directories
  3. Dot notation : cp -r "$WIKI_DIR"/. "$OUTPUT_DIR/markdown/" copies contents, not the directory itself
  4. Conditional raw copy : Only copies raw_markdown/ if the directory exists (some runs may not generate it)
  5. Early exit : Uses exit 0 to terminate successfully, preventing any subsequent processing

Sources: scripts/build-docs.sh:68-92

Use Cases

Markdown-only mode supports several workflows beyond standard documentation generation:

Table 2: Common Use Cases

| Use Case | Benefit | Typical Workflow |
|---|---|---|
| Scraper debugging | Inspect raw extraction results without mdBook overhead | Enable mode → examine raw_markdown/ → modify scraper → re-run |
| Diagram placement testing | Verify fuzzy matching results in markdown/ files | Enable mode → search for mermaid blocks → adjust matching logic |
| Alternative build tools | Use Hugo, Jekyll, Docusaurus instead of mdBook | Enable mode → feed markdown/ to different static site generator |
| CI/CD optimization | Faster feedback loop for markdown quality checks | Enable mode in test pipeline → validate markdown syntax → fail fast |
| Content export | Extract DeepWiki content for archival or migration | Enable mode → preserve markdown/ directory → import elsewhere |
| Performance profiling | Isolate Phase 1 performance from mdBook build time | Time execution with/without mode → identify bottlenecks |

Sources: scripts/build-docs.sh:26 (comment), README.md:37

Skipped Processing Steps

When MARKDOWN_ONLY=true, the following operations are completely bypassed:

Figure 3: Skipped Components in Markdown-Only Mode

graph TB
    subgraph "Skipped: mdBook Initialization [lines 95-122]"
        BookDir["mkdir -p BOOK_DIR/src"]
BookToml["Generate book.toml\nwith title, authors, config"]
end
    
    subgraph "Skipped: SUMMARY.md Generation [lines 124-188]"
        ScanFiles["ls WIKI_DIR/*.md\nDiscover file structure"]
SortNumeric["sort -t- -k1 -n\nNumeric sorting"]
ExtractTitles["head -1 file / sed 's/^# //'\nExtract titles"]
BuildTOC["Generate nested TOC\nwith section subdirectories"]
WriteSummary["Write src/SUMMARY.md"]
end
    
    subgraph "Skipped: Template Processing [lines 190-261]"
        LoadHeader["Load templates/header.html"]
LoadFooter["Load templates/footer.html"]
ProcessVars["process-template.py\nVariable substitution"]
InjectHTML["Inject header/footer\ninto all .md files"]
end
    
    subgraph "Skipped: mdBook Build [lines 263-271]"
        MermaidInstall["mdbook-mermaid install"]
MdBookBuild["mdbook build"]
GenerateHTML["Generate book/\nHTML output"]
end
    
    Note["None of these operations\nexecute when MARKDOWN_ONLY=true"]
BookDir -.-> Note
 
   BookToml -.-> Note
 
   ScanFiles -.-> Note
 
   SortNumeric -.-> Note
 
   ExtractTitles -.-> Note
 
   BuildTOC -.-> Note
 
   WriteSummary -.-> Note
 
   LoadHeader -.-> Note
 
   LoadFooter -.-> Note
 
   ProcessVars -.-> Note
 
   InjectHTML -.-> Note
 
   MermaidInstall -.-> Note
 
   MdBookBuild -.-> Note
 
   GenerateHTML -.-> Note

Impact of skipped steps:

  • No template injection : Markdown files contain pure content without HTML header/footer wrappers
  • No book.toml : Configuration metadata is not generated (title, authors, etc.)
  • No SUMMARY.md : Table of contents is not created (users must navigate by filename)
  • No Mermaid rendering : Diagrams remain as code blocks (not rendered to SVG)
  • No search index : Full-text search functionality is not available
  • Faster execution : Typical time savings of 50-70% depending on content size

Sources: scripts/build-docs.sh:95-309

Console Output Differences

The console output differs between modes, providing clear feedback about the execution path:

Standard mode output:

Step 1: Scraping wiki from DeepWiki...
Step 2: Initializing mdBook structure...
Step 3: Generating SUMMARY.md from scraped content...
Step 4: Copying and processing markdown files to book...
Step 5: Installing mdbook-mermaid assets...
Step 6: Building mdBook...
Step 7: Copying outputs to /output...
✓ Documentation build complete!

Markdown-only mode output:

Step 1: Scraping wiki from DeepWiki...
Step 2: Copying markdown files to output (markdown-only mode)...
Step 3: Copying raw markdown snapshots...
✓ Markdown extraction complete!

The markdown-only output explicitly mentions “(markdown-only mode)” in Step 2 and terminates after Step 3, providing clear confirmation of the execution path.

Sources: scripts/build-docs.sh:69-91 scripts/build-docs.sh:296-309

Integration with CI/CD

Markdown-only mode can be used in GitHub Actions workflows for testing and validation without full HTML builds:

Example: Markdown quality check workflow
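The core commands such a workflow might run (the linter choice and image name are placeholders):

```bash
docker run -v "$PWD/output:/output" \
  -e REPO=owner/repo -e MARKDOWN_ONLY=true deepwiki-scraper
npx markdownlint-cli2 "output/markdown/**/*.md"
```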

This workflow extracts markdown, then validates it with linters and Mermaid syntax checkers without the overhead of building HTML documentation.

Sources: README.md:37 scripts/build-docs.sh:26-92

graph LR
    subgraph "Iteration Loop"
        Enable["Set MARKDOWN_ONLY=true"]
Run["docker run deepwiki-to-mdbook"]
Examine["Examine output/markdown/\nand output/raw_markdown/"]
Issue{"Found issue?"}
ModifyCode["Modify deepwiki-scraper.py\nor diagram processing"]
Rebuild["docker build -t deepwiki-to-mdbook ."]
end
    
    Complete["Issue resolved"]
FullBuild["Remove MARKDOWN_ONLY\nRun full build\nVerify HTML output"]
Enable --> Run
 
   Run --> Examine
 
   Examine --> Issue
 
   Issue -->|Yes| ModifyCode
 
   ModifyCode --> Rebuild
 
   Rebuild --> Run
 
   Issue -->|No| Complete
 
   Complete --> FullBuild

Debugging Workflow

A typical debugging workflow using markdown-only mode:

Figure 4: Iterative Debugging with MARKDOWN_ONLY

Benefits of this approach:

  • Fast feedback : Markdown extraction completes in seconds vs. minutes for full builds
  • Isolated testing : Changes to scraper logic can be verified without mdBook complications
  • Raw comparison : raw_markdown/ shows pre-enhancement state for before/after analysis
  • Incremental development : Iterate rapidly on extraction logic before testing full pipeline

Sources: scripts/build-docs.sh:68-92

Limitations and Considerations

Markdown-only mode has specific limitations that users should be aware of:

Table 3: Limitations

| Limitation | Impact | Workaround |
|---|---|---|
| No HTML output | Cannot preview documentation in browser | Use markdown preview tool or build separately |
| No template injection | Missing repository links and metadata | Add manually or post-process markdown files |
| No SUMMARY.md | No automated navigation structure | Generate TOC with alternative tool |
| No Mermaid rendering | Diagrams remain as code blocks | Use Mermaid CLI or other rendering tool |
| No validation | Broken links not detected | Run mdBook build occasionally to catch issues |

When NOT to use markdown-only mode:

  • Final production builds destined for GitHub Pages
  • Workflows requiring search functionality
  • When testing template customizations
  • Validation of cross-references and internal links

Sources: scripts/build-docs.sh:26 scripts/build-docs.sh:68-92


Numbering and Path Resolution

Relevant source files

This document explains how the system transforms DeepWiki’s page numbering scheme into the file structure used in the generated mdBook documentation. The process involves three key operations: (1) normalizing DeepWiki page numbers by shifting them down by one, (2) resolving normalized numbers to file paths and directory locations, and (3) rewriting internal wiki links to use correct relative paths.

The numbering and path resolution system is foundational to maintaining a consistent file structure and ensuring that cross-references between wiki pages function correctly in the final mdBook output.

For information about the overall markdown extraction process, see page 6. For details about file organization and directory structure, see page 10.

Overview of Operations

The system performs three distinct but related operations:

| Operation | Function | Purpose |
|---|---|---|
| Number Normalization | normalized_number_parts() | Shift DeepWiki numbers down by 1 (page 1 becomes unnumbered) |
| Path Resolution | resolve_output_path() | Generate filename and section directory from page number |
| Link Rewriting | fix_wiki_link() | Convert absolute URLs to relative markdown paths |

Sources: python/deepwiki-scraper.py:28-64

Numbering Scheme Transformation

DeepWiki Numbering Convention

DeepWiki numbers pages starting from 1, with subsections using dot notation (e.g., 1, 2, 2.1, 2.2, 3, 3.1). This numbering includes an “overview” page as page 1, which the system treats specially.

Normalization Algorithm

The normalized_number_parts() function shifts all page numbers down by one, making the overview page unnumbered and adjusting all subsequent numbers:

Diagram: Number Normalization Transformation

graph LR
    subgraph "DeepWiki Numbering"
        DW1["1 (Overview)"]
DW2["2"]
DW3["3"]
DW4["4.1"]
DW5["4.2"]
end
    
    subgraph "normalized_number_parts()"
        Norm["Subtract 1 from\nmain number"]
end
    
    subgraph "Normalized Numbering"
        N1["[] (Unnumbered)"]
N2["[1]"]
N3["[2]"]
N4["[3, 1]"]
N5["[3, 2]"]
end
    
 
   DW1 --> Norm
 
   DW2 --> Norm
 
   DW3 --> Norm
 
   DW4 --> Norm
 
   DW5 --> Norm
    
 
   Norm --> N1
 
   Norm --> N2
 
   Norm --> N3
 
   Norm --> N4
 
   Norm --> N5

Sources: python/deepwiki-scraper.py:28-43

Implementation Details

The function parses the page number string and applies the following rules:

Diagram: normalized_number_parts() Control Flow

Sources: python/deepwiki-scraper.py:28-43
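A sketch reconstructed from the documented rules and the examples below (the real code is at python/deepwiki-scraper.py:28-43):

```python
def normalized_number_parts(page_number):
    parts = page_number.split('.')
    try:
        main = int(parts[0])
    except ValueError:
        return None                        # non-numeric: fail gracefully
    if main == 1:
        if len(parts) == 1:
            return []                      # "1" -> overview, unnumbered
        return ['1'] + parts[1:]           # "1.3" -> ["1", "3"] (kept as 1)
    return [str(main - 1)] + parts[1:]     # "4.2" -> ["3", "2"]
```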

Numbering Examples

| DeepWiki Number | Input | Normalized Parts | Notes |
|---|---|---|---|
| "1" | Overview page | [] | Unnumbered in output |
| "2" | Second page | ["1"] | Becomes first numbered page |
| "3" | Third page | ["2"] | Becomes second numbered page |
| "1.3" | Overview subsection | ["1", "3"] | Special case: kept as page 1 |
| "4.2" | Subsection | ["3", "2"] | Main number decremented |

Sources: python/tests/test_numbering.py:1-13

Path Resolution

graph TB
    Input["resolve_output_path(page_number, title)"]
Sanitize["sanitize_filename(title)\nConvert title to safe filename slug"]
Normalize["normalized_number_parts(page_number)\nGet normalized parts"]
CheckParts{"Parts valid\nand non-empty?"}
NoNumber["filename = slug + '.md'\nsection_dir = None"]
BuildFilename["filename = parts.join('-') + '-' + slug + '.md'"]
CheckLevel{"len(parts) > 1?"}
WithSection["section_dir = 'section-' + parts[0]"]
NoSection["section_dir = None"]
Return["Return (filename, section_dir)"]
Input --> Sanitize
 
   Input --> Normalize
    
 
   Sanitize --> CheckParts
 
   Normalize --> CheckParts
    
 
   CheckParts -->|No| NoNumber
 
   CheckParts -->|Yes| BuildFilename
    
 
   BuildFilename --> CheckLevel
 
   CheckLevel -->|Yes| WithSection
 
   CheckLevel -->|No| NoSection
    
 
   NoNumber --> Return
 
   WithSection --> Return
 
   NoSection --> Return

File Path Generation

The resolve_output_path() function converts normalized page numbers into file paths, determining both the filename and the optional section directory.

Diagram: Path Resolution Algorithm

Sources: python/deepwiki-scraper.py:45-53
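A sketch matching the algorithm diagram (python/deepwiki-scraper.py:45-53; sanitize_filename is assumed to slugify the title):

```python
def resolve_output_path(page_number, title):
    slug = sanitize_filename(title)
    parts = normalized_number_parts(page_number)
    if not parts:
        return f"{slug}.md", None                         # root, unnumbered
    filename = "-".join(parts) + f"-{slug}.md"
    section_dir = f"section-{parts[0]}" if len(parts) > 1 else None
    return filename, section_dir
```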

Directory Structure Mapping

Diagram: File Organization After Path Resolution

Sources: python/deepwiki-scraper.py:45-53 python/tests/test_numbering.py:15-31

Path Resolution Examples

| DeepWiki Number | Title | Filename | Section Directory |
|---|---|---|---|
| "1" | “Overview Title” | overview-title.md | None |
| "3" | “System Architecture” | 2-system-architecture.md | None |
| "5.2" | “HTML to Markdown Conversion” | 4-2-html-to-markdown-conversion.md | "section-4" |
| "2.1" | “Components” | 1-1-components.md | "section-1" |

Sources: python/tests/test_numbering.py:15-31

Target Path Construction

The build_target_path() function constructs the full relative path for link targets, including section directories when appropriate:

Diagram: Target Path Construction Logic

flowchart TD
    Start["build_target_path(page_number, slug)"]
Sanitize["slug = sanitize_filename(slug)"]
Normalize["parts = normalized_number_parts(page_number)"]
CheckParts{"Parts valid?"}
SimpleFile["Return slug + '.md'"]
BuildFile["filename = parts.join('-') + '-' + slug + '.md'"]
CheckSub{"len(parts) > 1?"}
WithDir["Return 'section-' + parts[0] + '/' + filename"]
JustFile["Return filename"]
Start --> Sanitize
 
   Start --> Normalize
    
 
   Sanitize --> CheckParts
 
   Normalize --> CheckParts
    
 
   CheckParts -->|No| SimpleFile
 
   CheckParts -->|Yes| BuildFile
    
 
   BuildFile --> CheckSub
 
   CheckSub -->|Yes| WithDir
 
   CheckSub -->|No| JustFile

Sources: python/deepwiki-scraper.py:55-63
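A sketch following the flow above (python/deepwiki-scraper.py:55-63):

```python
def build_target_path(page_number, slug):
    slug = sanitize_filename(slug)
    parts = normalized_number_parts(page_number)
    if not parts:
        return f"{slug}.md"
    filename = "-".join(parts) + f"-{slug}.md"
    if len(parts) > 1:
        return f"section-{parts[0]}/{filename}"           # subsection dir
    return filename
```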

Link Rewriting

DeepWiki uses absolute URL paths for internal wiki links in the format /owner/repo/N-page-title or /owner/repo/N.M-subsection-title. These links must be rewritten to relative Markdown file paths that respect the mdBook directory structure where:

  • Main pages (e.g., “1-overview”, “2-architecture”) reside in the root markdown directory
  • Subsections (e.g., “2.1-subsection”, “2.2-another”) reside in subdirectories named section-N/
  • File names use hyphens instead of dots (e.g., 2-1-subsection.md instead of 2.1-subsection.md)

The rewriting logic must compute the correct relative path based on both the source page location and the target page location.

Sources: python/deepwiki-scraper.py:854-875

graph TB
    Root["Root Directory\n(output/markdown/)"]
Main1["overview.md\n(Unnumbered)"]
Main2["1-architecture.md\n(Main Page)"]
Main3["2-installation.md\n(Main Page)"]
Section1["section-1/\n(Subsection Directory)"]
Section2["section-2/\n(Subsection Directory)"]
Sub1_1["1-1-components.md\n(Subsection)"]
Sub1_2["1-2-workflows.md\n(Subsection)"]
Sub2_1["2-1-docker-setup.md\n(Subsection)"]
Sub2_2["2-2-manual-setup.md\n(Subsection)"]
Root --> Main1
 
   Root --> Main2
 
   Root --> Main3
 
   Root --> Section1
 
   Root --> Section2
    
 
   Section1 --> Sub1_1
 
   Section1 --> Sub1_2
    
 
   Section2 --> Sub2_1
 
   Section2 --> Sub2_2

Relative Path Strategy

The system organizes markdown files into a hierarchical structure that affects link rewriting:

Diagram: File Organization Hierarchy

This structure requires different relative path strategies depending on where the link originates and where it points.

Sources: python/deepwiki-scraper.py:843-851

flowchart TD
    Start["Markdown Content"]
Regex["Apply Regex Pattern:\n'\\]\\(/[^/]+/[^/]+/([^)]+)\\)'"]
Extract["Extract Path Component:\ne.g., '4-query-planning'"]
Parse["Parse Page Number and Slug:\nPattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
PageNum["page_num\n(e.g., '2.1' or '4')"]
Slug["slug\n(e.g., 'query-planning')"]
Start --> Regex
 
   Regex --> Extract
 
   Extract --> Parse
 
   Parse --> PageNum
 
   Parse --> Slug

The algorithm begins by matching markdown links that reference the DeepWiki URL structure using a regular expression pattern.

Diagram: Link Pattern Matching Flow

The regex \]\(/[^/]+/[^/]+/([^)]+)\) captures the path component after the repository identifier. For example, in ](/owner/repo/4-query-planning), it captures 4-query-planning.

Sources: python/deepwiki-scraper.py:875

flowchart TD
    Start["extract_page_content(url, session, current_page_info)"]
CheckInfo{"current_page_info\nprovided?"}
NoInfo["source_section_dir = None\n(Default to root)"]
GetPageNum["page_number = current_page_info['number']\ntitle = current_page_info['title']"]
ResolvePath["resolve_output_path(page_number, title)\nReturns (filename, section_dir)"]
SetSource["source_section_dir = section_dir"]
DefineRewriter["Define fix_wiki_link()\nusing source_section_dir"]
ApplyRewriter["markdown = re.sub(pattern, fix_wiki_link, markdown)"]
Start --> CheckInfo
 
   CheckInfo -->|No| NoInfo
 
   CheckInfo -->|Yes| GetPageNum
    
 
   GetPageNum --> ResolvePath
 
   ResolvePath --> SetSource
    
 
   NoInfo --> DefineRewriter
 
   SetSource --> DefineRewriter
    
 
   DefineRewriter --> ApplyRewriter

Source Location Detection

The system determines the source page’s location from the current_page_info parameter passed to extract_page_content():

Diagram: Source Location Detection in extract_page_content

Sources: python/deepwiki-scraper.py:843-851 python/deepwiki-scraper.py:875

Relative Path Calculation

The relative path is computed based on the combination of source and target locations:

| Source Location | Target Location | Relative Path Strategy | Example |
|---|---|---|---|
| Root | Root | Direct filename | 2-installation.md |
| Root | section-N/ | Section prefix + filename | section-1/1-1-components.md |
| section-N/ | Root | Parent directory prefix | ../2-installation.md |
| section-N/ | Same section-N/ | Direct filename | 1-2-workflows.md |
| section-N/ | Different section-M/ | Parent + section prefix | ../section-2/2-1-setup.md |

Sources: python/deepwiki-scraper.py:854-871

flowchart TD
    Start["fix_wiki_link(match)"]
ExtractPath["full_path = match.group(1)\n(e.g., '4-query-planning')"]
ParseLink["link_match = re.search(pattern, full_path)"]
Success{"Match\nsuccessful?"}
NoMatch["Return match.group(0)\n(link unchanged)"]
ExtractParts["page_num = link_match.group(1)\nslug = link_match.group(2)"]
BuildTarget["target_path = build_target_path(page_num, slug)"]
CheckSource{"source_section_dir\nexists?"}
NoSource["Return target_path\n(as-is from root)"]
CheckTargetDir{"target_path starts\nwith 'section-'?"}
NoDir["Add '../' prefix\nReturn '../' + target_path"]
CheckSameSection{"target_path starts with\nsource_section_dir + '/'?"}
SameSection["Strip section directory\nReturn filename only"]
OtherSection["Add '../' prefix\nReturn '../' + target_path"]
Start --> ExtractPath
 
   ExtractPath --> ParseLink
 
   ParseLink --> Success
    
 
   Success -->|No| NoMatch
 
   Success -->|Yes| ExtractParts
    
 
   ExtractParts --> BuildTarget
 
   BuildTarget --> CheckSource
    
 
   CheckSource -->|No| NoSource
 
   CheckSource -->|Yes| CheckTargetDir
    
 
   CheckTargetDir -->|No| NoDir
 
   CheckTargetDir -->|Yes| CheckSameSection
    
 
   CheckSameSection -->|Yes| SameSection
 
   CheckSameSection -->|No| OtherSection

The core implementation is a nested function fix_wiki_link defined within extract_page_content() that serves as a callback for re.sub:

Diagram: fix_wiki_link Function Control Flow

The function delegates target path construction to build_target_path(), then adjusts the path based on the source location captured in the closure variable source_section_dir.

Sources: python/deepwiki-scraper.py:854-871
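A sketch of the callback following the control flow above (python/deepwiki-scraper.py:854-871; source_section_dir is the closure variable):

```python
import re

def fix_wiki_link(match):
    # match spans the whole "](/owner/repo/...)" fragment, so each branch
    # rebuilds the "](...)" wrapper around the rewritten path
    full_path = match.group(1)                        # e.g. "4-query-planning"
    link_match = re.search(r'(\d+(?:\.\d+)*)-(.+)$', full_path)
    if not link_match:
        return match.group(0)                         # malformed: keep as-is
    target_path = build_target_path(link_match.group(1), link_match.group(2))
    if source_section_dir is None:                    # source in root
        return f']({target_path})'
    if not target_path.startswith('section-'):        # target in root
        return f'](../{target_path})'
    prefix = source_section_dir + '/'
    if target_path.startswith(prefix):                # same section
        return f']({target_path[len(prefix):]})'
    return f'](../{target_path})'                     # different section
```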

Code Entity Mapping

Key functions involved in the path resolution pipeline:

| Function | Location | Purpose |
|---|---|---|
| normalized_number_parts() | python/deepwiki-scraper.py:28-43 | Shift page numbers down by 1 |
| resolve_output_path() | python/deepwiki-scraper.py:45-53 | Convert number + title to filename and section |
| build_target_path() | python/deepwiki-scraper.py:55-63 | Construct full relative path for link targets |
| fix_wiki_link() | python/deepwiki-scraper.py:854-871 | Rewrite individual link (nested function) |
| extract_page_content() | python/deepwiki-scraper.py:751-877 | Main extraction function with link rewriting |

Sources: python/deepwiki-scraper.py:28-63 python/deepwiki-scraper.py:751-877

Scenario 1: Root to Root

When a root-level page (e.g., overview.md) links to another root-level page (e.g., 2-installation.md):

  • Source: overview.md (source_section_dir = None)
  • Target: DeepWiki page 3 → normalized to 2-installation.md
  • Input Link: [Installation](/owner/repo/3-installation)
  • target_path: 2-installation.md (no section prefix)
  • Generated Path: 2-installation.md (no adjustment needed)
  • Reason: Both files are in root directory

Sources: python/deepwiki-scraper.py:854-871

Scenario 2: Root to Subsection

When a root-level page (e.g., 1-architecture.md) links to a subsection (e.g., DeepWiki page 2.1-components, which becomes 1-1-components.md):

  • Source: 1-architecture.md (source_section_dir = None)
  • Target: DeepWiki page 2.1 → normalized to section-1/1-1-components.md
  • Input Link: [Components](/owner/repo/2.1-components)
  • target_path: section-1/1-1-components.md
  • Generated Path: section-1/1-1-components.md (no adjustment needed)
  • Reason: Target is in subdirectory, source is in root

Sources: python/deepwiki-scraper.py:854-871

Scenario 3: Subsection to Root

When a subsection (e.g., section-1/1-1-components.md) links to a root-level page:

  • Source: section-1/1-1-components.md (source_section_dir = "section-1")
  • Target: DeepWiki page 3 → normalized to 2-installation.md
  • Input Link: [Installation](/owner/repo/3-installation)
  • target_path: 2-installation.md (doesn’t start with “section-”)
  • Generated Path: ../2-installation.md (add parent directory)
  • Reason: Source is in subdirectory, target is in parent directory

Sources: python/deepwiki-scraper.py:868-870

Scenario 4: Subsection to Same Section

When a subsection links to another subsection in the same section:

  • Source: section-1/1-1-components.md (source_section_dir = "section-1")
  • Target: DeepWiki page 2.2 → normalized to section-1/1-2-workflows.md
  • Input Link: [Workflows](/owner/repo/2.2-workflows)
  • target_path: section-1/1-2-workflows.md
  • Generated Path: 1-2-workflows.md (strip section directory)
  • Reason: Both files are in same section-1/ directory

Sources: python/deepwiki-scraper.py:864-866

Scenario 5: Subsection to Different Section

When a subsection links to a subsection in a different section:

  • Source: section-1/1-1-components.md (source_section_dir = "section-1")
  • Target: DeepWiki page 3.1 → normalized to section-2/2-1-setup.md
  • Input Link: [Docker Setup](/owner/repo/3.1-docker-setup)
  • target_path: section-2/2-1-setup.md
  • Generated Path: ../section-2/2-1-setup.md (go to parent, then other section)
  • Reason: Different section directories require parent navigation

Sources: python/deepwiki-scraper.py:868-870

sequenceDiagram
    participant Main as main()
    participant EWS as extract_wiki_structure()
    participant EPC as extract_page_content()
    participant ROP as resolve_output_path()
    participant FWL as fix_wiki_link()
    
    Main->>EWS: Discover pages
    EWS-->>Main: pages list with DeepWiki numbers
    
    loop For each page
        Note over Main: page = {number: 2.1, title: Components, level: 1}
        
        Main->>EPC: extract_page_content(url, session, page)
        Note over EPC: Convert HTML to markdown
        
        EPC->>ROP: Get source location
        ROP-->>EPC: source_section_dir = "section-1"
        
        Note over EPC: Define fix_wiki_link() with closure over source_section_dir
        
        EPC->>FWL: Apply via re.sub() for each link
        FWL->>FWL: Parse target page number
        FWL->>FWL: build_target_path()
        FWL->>FWL: Adjust for source location
        FWL-->>EPC: Rewritten relative path
        
        EPC-->>Main: Markdown with fixed links
        
        Main->>ROP: Determine output path
        ROP-->>Main: (filename, section_dir)
        
        Note over Main: Write to section-1/1-1-components.md
    end

Integration in Content Extraction Pipeline

The numbering and path resolution components integrate into the main extraction flow:

Diagram: Integration Sequence Across Extraction Pipeline

The link rewriting occurs at line 875 using re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', fix_wiki_link, markdown), which finds all internal wiki links and replaces them with their rewritten versions.

Sources: python/deepwiki-scraper.py:1310-1353 python/deepwiki-scraper.py:843-877

flowchart TD
    Input["normalized_number_parts('abc')"]
Split["parts = 'abc'.split('.')"]
TryParse["Try int(parts[0])"]
ValueError["ValueError exception"]
ReturnNone["Return None"]
Input --> Split
 
   Split --> TryParse
 
   TryParse --> ValueError
 
   ValueError --> ReturnNone

Edge Cases and Special Handling

Invalid Page Numbers

If normalized_number_parts() receives an invalid page number (non-numeric main component), it returns None:

Diagram: Invalid Number Handling

This graceful failure allows resolve_output_path() and build_target_path() to fall back to simple slug-based filenames.

Sources: python/deepwiki-scraper.py:28-43

If a link doesn’t match the expected pattern (\d+(?:\.\d+)*)-(.+)$, fix_wiki_link() returns the original match unchanged:

This ensures that malformed or external links are preserved in their original form.

Sources: python/deepwiki-scraper.py:856-872

Missing Page Context

If current_page_info is not provided to extract_page_content(), the function defaults to treating the source as a root-level page:

This allows the function to work in degraded mode, though links from subsections may not be correctly rewritten.

Sources: python/deepwiki-scraper.py:843-850

Overview Page Special Case

The overview page (DeepWiki page 1) is treated specially:

  • normalized_number_parts("1") returns [] (empty list)
  • resolve_output_path("1", "Overview") returns ("overview.md", None)
  • The file is placed at root level with no numeric prefix

Subsections of the overview (e.g., 1.3) are handled differently:

  • normalized_number_parts("1.3") returns ["1", "3"]
  • Main number is kept as 1 (not decremented to 0)
  • These become section-1/1-3-subsection.md

Sources: python/deepwiki-scraper.py:28-43 python/tests/test_numbering.py:1-13

Performance Considerations

The link rewriting is performed using a single re.sub call with a callback function, which is efficient for typical wiki pages with dozens to hundreds of links. The regex compilation is implicit and cached by Python’s re module.

The algorithm has O(n) complexity where n is the number of internal links in the page, with each link requiring constant-time string operations.

Sources: tools/deepwiki-scraper.py:592

Testing and Validation

The correctness of link rewriting can be validated by:

  1. Checking that generated links use .md extension
  2. Verifying that links from subsections to main pages use ../
  3. Confirming that links to subsections use the section-N/ prefix when appropriate
  4. Testing cross-section subsection links resolve correctly

The mdBook build process will fail if links are incorrectly rewritten, providing a validation mechanism during Phase 3 of the pipeline.

Sources: tools/deepwiki-scraper.py:547-594


Auto-Detection Features

Relevant source files

Purpose and Scope

This document details the automatic configuration detection mechanisms in the DeepWiki-to-mdBook converter. These features enable the system to operate with minimal manual configuration by intelligently inferring settings from the Git environment and repository metadata.

The auto-detection system primarily operates in the build orchestration script and focuses on repository identification and related URL construction. For information about other configuration options that require explicit setting, see Configuration Reference. For the overall build orchestration process, see build-docs.sh Orchestrator.

Overview of Auto-Detection

The system implements two categories of auto-detection:

| Category | Features | Fallback Behavior |
|---|---|---|
| Primary Detection | Repository identification from Git remote | Fails with error if not detected and not provided |
| Derived Configuration | Author names, URLs, badge links | Uses sensible defaults based on detected repository |

The auto-detection executes during the initialization phase of scripts/build-docs.sh:8-46 before any content scraping or processing begins.

Repository Auto-Detection Flow

Detection Algorithm

Detection Algorithm Flow

Sources: scripts/build-docs.sh:8-38

Git Remote Parsing

The repository detection uses a single sed regular expression to handle multiple GitHub URL formats:

Git Remote URL Parsing

graph LR
    subgraph "Supported URL Formats"
        HTTPS["https://github.com/owner/repo.git"]
HTTPSNOGIT["https://github.com/owner/repo"]
SSH["git@github.com:owner/repo.git"]
SSHNOGIT["git@github.com:owner/repo"]
end
    
    subgraph "Extraction Process"
        GitConfig["git config --get\nremote.origin.url"]
SedRegex["sed -E\ngithub\.com[:/]([^/]+/[^/\.]+)"]
RepoVar["REPO variable\nowner/repo"]
end
    
 
   HTTPS --> GitConfig
 
   HTTPSNOGIT --> GitConfig
 
   SSH --> GitConfig
 
   SSHNOGIT --> GitConfig
 
   GitConfig --> SedRegex
 
   SedRegex --> RepoVar

The parsing logic at scripts/build-docs.sh:16 uses this pattern:

sed -E 's#.*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*#\1#'
| Pattern Component | Purpose |
|---|---|
| .*github\.com | Match any characters before github.com |
| [:/] | Match either : (SSH) or / (HTTPS) separator |
| ([^/]+/[^/\.]+) | Capture group: owner/repo (stops at / or .) |
| (\.git)? | Optional .git suffix |
| .* | Match remaining characters |
| #\1# | Replace entire string with capture group 1 |

Sources: scripts/build-docs.sh:8-19

Derived Configuration Values

Once the REPO value is established (either through auto-detection or explicit setting), the system derives several related configuration values automatically.

graph TB
    REPO["REPO\n(owner/repo)"]
Split["String split on '/'"]
RepoOwner["REPO_OWNER\n(first segment)"]
RepoName["REPO_NAME\n(second segment)"]
subgraph "Derived URLs"
        GitURL["GIT_REPO_URL\nhttps://github.com/owner/repo"]
DeepWikiURL["DEEPWIKI_URL\nhttps://deepwiki.com/owner/repo"]
end
    
    subgraph "Derived Badges"
        DeepWikiBadge["DEEPWIKI_BADGE_URL\nhttps://deepwiki.com/badge.svg"]
GitHubBadge["GITHUB_BADGE_URL\nimg.shields.io badge"]
end
    
    subgraph "Default Metadata"
        BookAuthors["BOOK_AUTHORS\n(defaults to REPO_OWNER)"]
end
    
 
   REPO --> Split
 
   Split --> RepoOwner
 
   Split --> RepoName
    
 
   RepoOwner --> BookAuthors
 
   REPO --> GitURL
 
   REPO --> DeepWikiURL
 
   DeepWikiURL --> DeepWikiBadge
 
   REPO --> GitHubBadge

Derivation Chain

Configuration Value Derivation

Sources: scripts/build-docs.sh:40-51

Default Value Assignment

The script uses shell parameter expansion with default values at scripts/build-docs.sh:44-46:

| Variable | Default Value | Condition |
|---|---|---|
| BOOK_AUTHORS | $REPO_OWNER | If not explicitly set |
| GIT_REPO_URL | https://github.com/$REPO | If not explicitly set |
| DEEPWIKI_URL | https://deepwiki.com/$REPO | Always constructed |
| DEEPWIKI_BADGE_URL | https://deepwiki.com/badge.svg | Always constructed |
| GITHUB_BADGE_URL | https://img.shields.io/badge/GitHub-{label}-181717?logo=github | Always constructed with URL encoding |

Sources: scripts/build-docs.sh:44-51

Badge URL Construction

The GitHub badge URL requires special encoding for the repository label at scripts/build-docs.sh:50-51:

GitHub Badge URL Encoding

The encoding is necessary because the badge service interprets - and / as special characters. The double-dash (--) escapes the hyphen, and %2F is the URL encoding for forward slash.
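A sketch of the encoding step (scripts/build-docs.sh:50-51; the sed invocation is an assumption):

```bash
BADGE_LABEL=$(echo "$REPO" | sed -e 's/-/--/g' -e 's#/#%2F#g')
GITHUB_BADGE_URL="https://img.shields.io/badge/GitHub-${BADGE_LABEL}-181717?logo=github"
```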

Sources: scripts/build-docs.sh:50-51

Configuration Precedence

The system follows a clear precedence order for all configurable values:

Configuration Precedence Order

| Priority | Source | Example |
|---|---|---|
| 1 (Highest) | Explicit environment variable | docker run -e REPO=owner/repo |
| 2 | Git auto-detection | git config --get remote.origin.url |
| 3 (Lowest) | Hard-coded default | BOOK_TITLE="Documentation" |

Sources: scripts/build-docs.sh:8-46

Error Handling

Repository Detection Failure

If repository detection fails and no explicit REPO value is provided, the script terminates with a descriptive error at scripts/build-docs.sh:34-38:

Repository Validation and Error Flow

The error message provides actionable guidance:

ERROR: REPO must be set or run from within a Git repository with a GitHub remote
Usage: REPO=owner/repo $0

Sources: scripts/build-docs.sh:34-38

graph TB
    subgraph "Auto-Detection Phase"
        DetectRepo["Detect/Set REPO"]
DeriveVars["Derive configuration\nvariables"]
end
    
    subgraph "Template Processing"
        LoadTemplate["Load header.html\nand footer.html"]
InvokeScript["Execute\nprocess-template.py"]
PassVars["Pass variables as\ncommand-line arguments"]
Substitute["Variable substitution\nin templates"]
end
    
    subgraph "Available Variables"
        VarRepo["REPO"]
VarTitle["BOOK_TITLE"]
VarAuthors["BOOK_AUTHORS"]
VarGitURL["GIT_REPO_URL"]
VarDeepWiki["DEEPWIKI_URL"]
VarDate["GENERATION_DATE"]
end
    
 
   DetectRepo --> DeriveVars
 
   DeriveVars --> LoadTemplate
 
   LoadTemplate --> InvokeScript
 
   InvokeScript --> PassVars
    
 
   DeriveVars -.-> VarRepo
 
   DeriveVars -.-> VarTitle
 
   DeriveVars -.-> VarAuthors
 
   DeriveVars -.-> VarGitURL
 
   DeriveVars -.-> VarDeepWiki
 
   DeriveVars -.-> VarDate
    
 
   VarRepo --> PassVars
 
   VarTitle --> PassVars
 
   VarAuthors --> PassVars
 
   VarGitURL --> PassVars
 
   VarDeepWiki --> PassVars
 
   VarDate --> PassVars
    
 
   PassVars --> Substitute

Integration with Template System

Auto-detected values are automatically propagated to the template processing system, where they can be used as variables in header and footer templates.

Template Variable Propagation

Template Variable Propagation Flow

The invocation at scripts/build-docs.sh:205-213 passes all auto-detected and derived values to process-template.py:

Sources: scripts/build-docs.sh:195-234 README.md:34-36

Usage Examples

Auto-Detection in Local Development

When running the Docker container from within a Git repository with a GitHub remote configured:
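For example (the image name and mount points are placeholders; the repository must be mounted so the script can read its Git config):

```bash
cd /path/to/your-clone
docker run -v "$(pwd):/workspace" -v "$(pwd)/output:/output" deepwiki-scraper
```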

The script automatically:

  1. Detects REPO from git config --get remote.origin.url
  2. Derives BOOK_AUTHORS from the repository owner
  3. Constructs all URLs based on the detected repository

Explicit Override

Users can override auto-detection by explicitly setting environment variables:
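For example (all values are placeholders):

```bash
docker run -v "$(pwd)/output:/output" \
  -e REPO=owner/repo \
  -e BOOK_AUTHORS="Jane Doe" \
  -e BOOK_TITLE="My Docs" \
  deepwiki-scraper
```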

In this case:

  • REPO uses the explicit value (no auto-detection)
  • BOOK_AUTHORS uses the explicit value (not derived from REPO)
  • BOOK_TITLE uses the explicit value
  • URLs are still derived from the explicit REPO value

Sources: README.md:14-27 scripts/build-docs.sh:8-46

Detection Validation Output

The build script outputs configuration information after detection completes at scripts/build-docs.sh:53-59:

Configuration:
  Repository:    jzombie/deepwiki-to-mdbook
  Book Title:    Documentation
  Authors:       jzombie
  Git Repo URL:  https://github.com/jzombie/deepwiki-to-mdbook
  Markdown Only: false

This output serves as verification that auto-detection and default derivation completed successfully before content processing begins.

Sources: scripts/build-docs.sh:53-59

Limitations and Considerations

Git Repository Requirement

Auto-detection only works when the Docker container is run with the workspace mounted and contains a Git repository with a GitHub remote. For non-Git scenarios or non-GitHub repositories, the REPO environment variable must be explicitly provided.

GitHub-Specific Detection

The URL parsing logic at scripts/build-docs.sh:16 specifically looks for github.com in the remote URL. Repositories hosted on other platforms (GitLab, Bitbucket, etc.) will not be auto-detected and require explicit configuration.

Single Remote Assumption

The detection reads from remote.origin.url specifically. If a repository has multiple remotes or uses a different primary remote name, auto-detection will use the origin remote or fail if it doesn’t exist.

Sources: scripts/build-docs.sh:8-19 scripts/build-docs.sh:34-38


Development Guide

Relevant source files

This page provides guidance for developers who want to modify, extend, or contribute to the DeepWiki-to-mdBook Converter system. It covers the development environment setup, local workflow, testing procedures, and key considerations when working with the codebase.

For detailed information about the repository structure, see Project File Structure. For instructions on building the Docker image, see Building the Docker Image. For Python dependency details, see Python Dependencies.

Development Environment Requirements

The system is designed to run entirely within Docker, but local development requires the following tools:

| Tool | Purpose | Version |
|---|---|---|
| Docker | Container runtime | Latest stable |
| Git | Version control | 2.x or later |
| Text editor/IDE | Code editing | Any (VS Code recommended) |
| Python | Local testing (optional) | 3.12+ |
| Rust toolchain | Local testing (optional) | Latest stable |

The Docker image handles all runtime dependencies, so local installation of Python and Rust is optional and only needed for testing individual components outside the container.

Sources: Dockerfile:1-33

Development Workflow Architecture

The following diagram shows the typical development cycle and how different components interact during development:

Development Workflow Diagram : Shows the cycle from editing code to building the Docker image to testing with mounted output volume.

graph TB
    subgraph "Development Environment"
        Editor["Code Editor"]
GitRepo["Local Git Repository"]
end
    
    subgraph "Docker Build Process"
        BuildCmd["docker build -t deepwiki-scraper ."]
Stage1["Rust Builder Stage\nCompiles mdbook binaries"]
Stage2["Python Runtime Stage\nAssembles final image"]
FinalImage["deepwiki-scraper:latest"]
end
    
    subgraph "Testing & Validation"
        RunCmd["docker run with test params"]
OutputMount["Volume mount: ./output"]
Validation["Manual inspection of output"]
end
    
    subgraph "Key Development Files"
        Dockerfile["Dockerfile"]
BuildScript["build-docs.sh"]
Scraper["tools/deepwiki-scraper.py"]
Requirements["tools/requirements.txt"]
end
    
 
   Editor -->|Edit| GitRepo
 
   GitRepo --> Dockerfile
 
   GitRepo --> BuildScript
 
   GitRepo --> Scraper
 
   GitRepo --> Requirements
    
 
   BuildCmd --> Stage1
 
   Stage1 --> Stage2
 
   Stage2 --> FinalImage
    
 
   FinalImage --> RunCmd
 
   RunCmd --> OutputMount
 
   OutputMount --> Validation
    
 
   Validation -.->|Iterate| Editor

Sources: Dockerfile:1-33 build-docs.sh:1-206

Component Development Map

This diagram bridges system concepts to actual code entities, showing which files implement which functionality:

Code Entity Mapping Diagram : Maps system functionality to specific code locations, file paths, and binaries.

graph LR
    subgraph "Entry Point Layer"
        CMD["CMD in Dockerfile:32"]
BuildDocs["build-docs.sh"]
end
    
    subgraph "Configuration Layer"
        EnvVars["Environment Variables\nREPO, BOOK_TITLE, etc."]
AutoDetect["Auto-detect logic\nbuild-docs.sh:8-19"]
Validation["Validation\nbuild-docs.sh:33-37"]
end
    
    subgraph "Processing Scripts"
        ScraperPy["deepwiki-scraper.py"]
MdBookBin["/usr/local/bin/mdbook"]
MermaidBin["/usr/local/bin/mdbook-mermaid"]
end
    
    subgraph "Configuration Generation"
        BookToml["book.toml generation\nbuild-docs.sh:85-103"]
SummaryMd["SUMMARY.md generation\nbuild-docs.sh:113-159"]
end
    
    subgraph "Dependency Management"
        ReqTxt["requirements.txt"]
UvInstall["uv pip install\nDockerfile:17"]
CargoInstall["cargo install\nDockerfile:5"]
end
    
 
   CMD --> BuildDocs
 
   BuildDocs --> EnvVars
 
   EnvVars --> AutoDetect
 
   AutoDetect --> Validation
    
 
   Validation --> ScraperPy
 
   BuildDocs --> BookToml
 
   BuildDocs --> SummaryMd
    
 
   BuildDocs --> MdBookBin
 
   MdBookBin --> MermaidBin
    
 
   ReqTxt --> UvInstall
 
   UvInstall --> ScraperPy
 
   CargoInstall --> MdBookBin
 
   CargoInstall --> MermaidBin

Sources: Dockerfile:1-33 build-docs.sh:8-19 build-docs.sh:85-103 build-docs.sh:113-159

Local Development Workflow

1. Clone and Setup

The repository has a minimal structure focused on the essential build artifacts. The .gitignore:1-2 excludes the output/ directory to prevent committing generated files.

2. Make Changes

Key files for common modifications:

| Modification Type | Primary File | Related Files |
|---|---|---|
| Scraping logic | tools/deepwiki-scraper.py | - |
| Build orchestration | build-docs.sh | - |
| Python dependencies | tools/requirements.txt | Dockerfile:16-17 |
| Docker build process | Dockerfile | - |
| Output structure | build-docs.sh | Lines 179-191 |

3. Build Docker Image

After making changes, rebuild the Docker image:
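This matches the command in the workflow diagram above:

```bash
docker build -t deepwiki-scraper .
```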

The multi-stage build process Dockerfile:1-7 first compiles Rust binaries in a rust:latest builder stage, then Dockerfile:8-33 assembles the final python:3.12-slim image with copied binaries and Python dependencies.

4. Test Changes

Test with a real repository:
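For example (the repository is a placeholder):

```bash
docker run -v "$(pwd)/output:/output" \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  deepwiki-scraper
```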

Setting MARKDOWN_ONLY=true build-docs.sh:61-76 bypasses the mdBook build phase, allowing faster iteration when testing scraping logic changes.

5. Validate Output

Inspect the generated files:
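For example:

```bash
ls -R output/markdown/            # markdown files and section-N/ dirs
xdg-open output/book/index.html   # book/ exists only after a full build
```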

Sources: .gitignore:1-2 Dockerfile:1-33 build-docs.sh:61-76 build-docs.sh:179-191

Testing Strategies

Fast Iteration with Markdown-Only Mode

The MARKDOWN_ONLY environment variable enables a fast path for testing scraping changes, using the same docker run invocation shown in the testing step above.

This mode executes Phase 1 (Markdown Extraction) and Phase 2 (Diagram Enhancement) but skips Phase 3 (mdBook Build). See Phase 1: Markdown Extraction for details on what this phase includes.

The conditional logic build-docs.sh:61-76 checks the MARKDOWN_ONLY variable and exits early after copying markdown files to /output/markdown/.

Testing Auto-Detection

The repository auto-detection logic build-docs.sh:8-19 attempts to extract the GitHub repository from Git remotes if REPO is not explicitly set:

The script checks git config --get remote.origin.url and extracts the owner/repo portion using sed pattern matching at build-docs.sh:16.

Testing Configuration Generation

To test book.toml and SUMMARY.md generation without a full build, run the orchestrator interactively and inspect /workspace/book/ after the configuration steps complete (see Inspecting Intermediate Files below).

The book.toml template build-docs.sh:85-103 uses shell variable substitution to inject environment variables into the TOML structure.

Sources: build-docs.sh:8-19 build-docs.sh:61-76 build-docs.sh:85-103

Debugging Techniques

Inspecting Intermediate Files

The build process creates temporary files in /workspace inside the container. To inspect them:
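One hypothetical approach is to open a shell in the image and run the orchestrator manually:

```bash
docker run -it --entrypoint /bin/bash \
  -v "$(pwd)/output:/output" -e REPO=owner/repo deepwiki-scraper
# inside the container:
build-docs.sh   # then inspect /workspace between or after steps
```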

This allows inspection of:

  • Scraped markdown files in /workspace/wiki/
  • Generated book.toml in /workspace/book/
  • Generated SUMMARY.md in /workspace/book/src/

Adding Debug Output

Both build-docs.sh:1-206 and deepwiki-scraper.py use echo statements for progress tracking. Add additional debug output:
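For example, in build-docs.sh (the variable name assumes the script's documented conventions):

```bash
echo "DEBUG: wiki dir contains $(ls "$WIKI_DIR" | wc -l) files"
```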

Testing Python Script Independently

To test the scraper without Docker:
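A minimal local run, assuming Python 3.12+ (the scraper takes the repository and an output directory as its two arguments):

```bash
pip install -r tools/requirements.txt
python3 tools/deepwiki-scraper.py owner/repo ./wiki-output
```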

This is useful for rapid iteration on scraping logic without rebuilding the Docker image.

Sources: build-docs.sh:1-206 tools/requirements.txt:1-4

Build Optimization Considerations

Multi-Stage Build Rationale

The Dockerfile:1-7 uses a separate Rust builder stage to:

  1. Compile mdbook and mdbook-mermaid with a full Rust toolchain
  2. Discard the ~1.5 GB builder stage after compilation
  3. Copy only the compiled binaries Dockerfile:20-21 to the final image

This reduces the final image size from ~1.5 GB to ~300-400 MB while still providing both Python and Rust tools. See Docker Multi-Stage Build for architectural details.

Dependency Management with uv

The Dockerfile copies uv from the official Astral image at Dockerfile:13 and uses it at Dockerfile:17 to install Python dependencies with the --no-cache flag:
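A sketch of the instruction (the requirements path is an assumption):

```dockerfile
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
```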

This approach:

  • Provides faster dependency resolution than pip
  • Reduces layer size with --no-cache
  • Installs system-wide with --system flag

Image Layer Ordering

The Dockerfile orders operations to maximize layer caching:

  1. Copy uv binary (rarely changes)
  2. Install Python dependencies (changes with requirements.txt)
  3. Copy Rust binaries (changes when rebuilding Rust stage)
  4. Copy Python scripts (changes frequently during development)

This ordering means modifying deepwiki-scraper.py only invalidates the final layers Dockerfile:24-29 not the entire dependency installation.

Sources: Dockerfile:1-33

Common Development Tasks

Adding a New Environment Variable

To add a new configuration option:

  1. Define a default in build-docs.sh:21-30 (see the sketch after this list)

  2. Add it to the configuration display at build-docs.sh:47-53 (also in the sketch below)

  3. Use in downstream processing as needed

  4. Document in Configuration Reference
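A sketch for steps 1 and 2 (NEW_OPTION is a hypothetical variable name):

```bash
NEW_OPTION="${NEW_OPTION:-default-value}"     # step 1: default assignment
echo "  New Option:    $NEW_OPTION"           # step 2: configuration display
```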

Modifying SUMMARY.md Generation

The table of contents generation logic build-docs.sh:113-159 uses bash loops and file discovery:

To modify the structure:

  1. Adjust the file pattern matching
  2. Modify the section detection logic
  3. Update the markdown output format
  4. Test with repositories that have different hierarchical structures

Adding New Python Dependencies

  1. Add to tools/requirements.txt:1-4 with a version constraint, e.g. new-package>=1.0.0

  2. Rebuild the Docker image (triggers Dockerfile:17)

  3. Update the Python Dependencies documentation

  4. Import and use in deepwiki-scraper.py

Sources: build-docs.sh:21-30 build-docs.sh:113-159 tools/requirements.txt:1-4 Dockerfile:17

File Modification Guidelines

Modifying build-docs.sh

The orchestrator script uses several idioms:

| Pattern | Purpose | Example |
|---|---|---|
| set -e | Exit on error | build-docs.sh:2 |
| "${VAR:-default}" | Default values | build-docs.sh:22-26 |
| $(command) | Command substitution | build-docs.sh:12 |
| echo "" | Visual spacing | build-docs.sh:47 |
| mkdir -p | Safe directory creation | build-docs.sh:64 |

Maintain these patterns for consistency. The script is designed to be readable and self-documenting with clear step labels build-docs.sh:4-6

Modifying Dockerfile

Key considerations mirror the Build Optimization Considerations above: preserve the multi-stage structure, the binary copy paths, and the layer ordering that keeps dependency installation cached.

Modifying Python Scripts

When editing tools/deepwiki-scraper.py:

  • The script is executed via build-docs.sh:58 with two arguments: REPO and output directory
  • It must be Python 3.12 compatible (Dockerfile:8)
  • It has access to dependencies from tools/requirements.txt:1-4
  • It should write output to the specified directory argument
  • It should use print() for progress output that appears in build logs

Sources: build-docs.sh:2 build-docs.sh:58 Dockerfile:1-33 tools/requirements.txt:1-4

Integration Testing

End-to-End Test

Validate the complete pipeline:
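For example (the image tag is a placeholder):

```bash
docker build -t deepwiki-to-mdbook .
docker run --rm \
  -e REPO=owner/repo \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook

# Spot-check the result
test -f output/book/index.html && echo "HTML build OK"
```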

Testing Configuration Variants

Test different repository configurations:
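```bash
# Markdown-only mode (skips the mdBook HTML build)
docker run --rm -e REPO=owner/repo -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook

# Custom metadata
docker run --rm -e REPO=owner/repo \
  -e BOOK_TITLE="My Docs" -e BOOK_AUTHORS="Jane Doe" \
  -v "$(pwd)/output:/output" deepwiki-to-mdbook
```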

Sources: build-docs.sh:8-19 build-docs.sh:61-76

Contributing Guidelines

When submitting changes:

  1. Test locally : Build and run the Docker image with multiple test repositories
  2. Validate output : Ensure markdown files are properly formatted and the HTML site builds correctly
  3. Check backwards compatibility : Existing repositories should continue to work
  4. Update documentation : Modify relevant wiki pages if changing behavior
  5. Follow existing patterns : Match the coding style in build-docs.sh:1-206

The system is designed to be “fully generic”: it should work with any DeepWiki repository without modification. Test that your changes maintain this property.

Sources: build-docs.sh:1-206

Troubleshooting Development Issues

Build Failures

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| Rust compilation fails | Network issues, incompatible versions | Check rust:latest image availability |
| Python package install fails | Version conflicts in requirements.txt | Verify package versions are compatible |
| mdbook not found | Binary copy failed | Check Dockerfile:20-21 paths |
| Permission denied on scripts | Missing chmod +x | Verify Dockerfile:25-29 |

Runtime Failures

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| “REPO must be set” error | Auto-detection failed, no REPO env var | Check build-docs.sh:33-36 validation logic |
| Scraper crashes | DeepWiki site structure changed | Debug deepwiki-scraper.py with local testing |
| SUMMARY.md is empty | No markdown files found | Verify scraper output in /workspace/wiki/ |
| mdBook build fails | Invalid markdown syntax | Inspect markdown files for issues |

Output Validation Checklist

After a successful build, verify:

  • output/markdown/ contains .md files
  • Section directories exist (e.g., output/markdown/section-4/)
  • output/book/index.html exists and opens in browser
  • Navigation menu appears in generated site
  • Search functionality works
  • Mermaid diagrams render correctly
  • Links between pages work
  • “Edit this file” links point to correct GitHub URLs

Sources: build-docs.sh:33-36 Dockerfile:20-21 Dockerfile:25-29


Project Structure

Relevant source files

This document describes the repository’s file organization, detailing the purpose of each file and directory in the codebase. Understanding this structure is essential for developers who want to modify or extend the system.

For information about running tests, see Running Tests. For details about the Python dependencies, see Python Dependencies.

Repository Layout

The repository follows a clean, organized structure that separates Python code, shell scripts, and HTML templates into dedicated directories.

graph TB
    Root["Repository Root"]
Root --> GitIgnore[".gitignore"]
Root --> Dockerfile["Dockerfile"]
Root --> README["README.md"]
Root --> PythonDir["python/"]
Root --> ScriptsDir["scripts/"]
Root --> TemplatesDir["templates/"]
Root --> GithubDir[".github/"]
Root --> OutputDir["output/"]
PythonDir --> Scraper["deepwiki-scraper.py"]
PythonDir --> ProcessTemplate["process-template.py"]
PythonDir --> Requirements["requirements.txt"]
PythonDir --> TestsDir["tests/"]
ScriptsDir --> BuildScript["build-docs.sh"]
ScriptsDir --> RunTests["run-tests.sh"]
TemplatesDir --> Header["header.html"]
TemplatesDir --> Footer["footer.html"]
TemplatesDir --> TemplateREADME["README.md"]
GithubDir --> Workflows["workflows/"]
OutputDir --> MarkdownOut["markdown/"]
OutputDir --> RawMarkdownOut["raw_markdown/"]
OutputDir --> BookOut["book/"]
OutputDir --> ConfigOut["book.toml"]
style Root fill:#f9f9f9,stroke:#333
    style PythonDir fill:#e8f5e9,stroke:#388e3c
    style ScriptsDir fill:#fff4e1,stroke:#f57c00
    style TemplatesDir fill:#e1f5ff,stroke:#0288d1
    style OutputDir fill:#ffe0b2,stroke:#e64a19

Physical File Hierarchy

Sources: README.md:84-88 .gitignore:1-7

Root Directory Files

The repository root contains the primary configuration and documentation files that define the system’s build behavior.

| File | Type | Purpose |
|------|------|---------|
| .gitignore | Config | Excludes generated output and temporary files |
| Dockerfile | Build | Multi-stage Docker build specification |
| README.md | Docs | Quick start guide and configuration reference |

.gitignore

Excludes build artifacts and temporary files from version control:

  • output/ - Generated documentation artifacts
  • *.pyc and __pycache__/ - Python bytecode
  • .env - Local environment variables
  • .DS_Store - macOS metadata
  • tmp/ - Temporary working directory

Sources: .gitignore:1-7

Dockerfile

Implements a two-stage build pattern to optimize image size. The builder stage compiles Rust binaries (mdbook, mdbook-mermaid), and the final stage creates a Python runtime with only the necessary executables.

Sources: README.md:78

README.md

Primary documentation file containing quick start instructions, configuration reference, and high-level system overview. Serves as the entry point for new users.

Sources: README.md:1-95

graph TB
    PythonDir["python/"]
PythonDir --> Scraper["deepwiki-scraper.py"]
PythonDir --> ProcessTemplate["process-template.py"]
PythonDir --> Requirements["requirements.txt"]
PythonDir --> TestsDir["tests/"]
TestsDir --> TemplateTest["test_template_processing.py"]
TestsDir --> MermaidTest["test_mermaid_normalization.py"]
TestsDir --> NumberingTest["test_page_numbering.py"]
Scraper --> ExtractWikiStructure["extract_wiki_structure()"]
Scraper --> ExtractPageContent["extract_page_content()"]
Scraper --> ExtractMermaid["extract_mermaid_from_nextjs_data()"]
Scraper --> NormalizeDiagram["normalize_mermaid_diagram()"]
Scraper --> ExtractAndEnhance["extract_and_enhance_diagrams()"]
ProcessTemplate --> ProcessFile["process_template_file()"]
ProcessTemplate --> SubstituteVars["substitute_variables()"]

Python Directory

The python/ directory contains all Python scripts, their dependencies, and test suites.

Python Directory Structure

Sources: README.md:85

deepwiki-scraper.py

Core Python module for content extraction and diagram processing. Implements the Phase 1 (markdown extraction) and Phase 2 (diagram enhancement) logic of the pipeline.

Key Functions:

FunctionPurpose
sanitize_filename()Convert page titles to filesystem-safe names
fetch_page()HTTP client with retry logic and error handling
discover_subsections()Recursively probe for nested wiki pages
extract_wiki_structure()Build hierarchical page structure from DeepWiki
clean_deepwiki_footer()Remove DeepWiki UI elements from markdown
convert_html_to_markdown()HTML→Markdown conversion via html2text
extract_mermaid_from_nextjs_data()Extract diagrams from Next.js JavaScript payload
normalize_mermaid_diagram()Seven-step normalization for Mermaid 11 compatibility
extract_page_content()Main content extraction and markdown generation
extract_and_enhance_diagrams()Fuzzy matching and diagram injection
main()Entry point with temporary directory management

The scraper uses a temporary directory pattern to ensure atomic operations. Files are written to tempfile.TemporaryDirectory(), enhanced in-place, then moved to the final output location.

Sources: README.md:85

process-template.py

Template processing script that performs variable substitution in header and footer HTML files. Supports conditional rendering and automatic variable detection.

Key Functions:

FunctionPurpose
process_template_file()Main template processing entry point
substitute_variables()Replace {{VARIABLE}} placeholders with values

Template variables include: {{REPO}}, {{BOOK_TITLE}}, {{BOOK_AUTHORS}}, {{GIT_REPO_URL}}, {{DEEPWIKI_URL}}, {{GENERATION_DATE}}.

Sources: README.md:51

requirements.txt

Python dependencies for the scraper and template processor:

  • requests>=2.31.0 - HTTP client for fetching wiki pages
  • beautifulsoup4>=4.12.0 - HTML parsing library
  • html2text>=2020.1.16 - HTML-to-Markdown converter

Installed via uv pip install during Docker build for faster, more reliable installation.

Sources: README.md:85

tests/

Test suite for Python components. Contains unit tests for template processing, Mermaid normalization, and page numbering logic. See page 13.2 for details on running tests.

Sources: README.md:82

Scripts Directory

The scripts/ directory contains shell scripts for orchestration and testing.

Scripts Directory Structure

Sources: README.md:82 README.md:86

build-docs.sh

Main orchestration script that coordinates the three-phase pipeline. Invoked as the Docker container’s entry point.

Execution Flow:

  1. Auto-detection - Detect REPO from git remote if not provided
  2. Configuration - Parse environment variables and set defaults
  3. Phase 1 - Execute deepwiki-scraper.py to extract markdown
  4. Phase 2 - Process templates and generate book.toml, SUMMARY.md
  5. Phase 3 - Run mdbook build to generate HTML (unless MARKDOWN_ONLY=true)
  6. Cleanup - Copy outputs to /output volume

Environment Variables:

  • REPO - GitHub repository (owner/repo format)
  • BOOK_TITLE - Documentation title
  • BOOK_AUTHORS - Author metadata
  • GIT_REPO_URL - Repository URL for edit links
  • DEEPWIKI_URL - DeepWiki page URL
  • MARKDOWN_ONLY - Skip HTML build for debugging

Critical Paths:

  • WORK_DIR=/workspace - Working directory
  • WIKI_DIR=/workspace/wiki - Temporary markdown location
  • OUTPUT_DIR=/output - Volume mount for outputs
  • BOOK_DIR=/workspace/book - mdBook source directory

Sources: README.md:34-37 README.md:86

run-tests.sh

Test execution script that runs pytest on the Python test suite. Provides colored output and detailed test results.

Sources: README.md:82

graph TB
    TemplatesDir["templates/"]
TemplatesDir --> Header["header.html"]
TemplatesDir --> Footer["footer.html"]
TemplatesDir --> TemplateREADME["README.md"]
Header --> Variables["Template variables:\n{{REPO}}\n{{BOOK_TITLE}}\n{{GIT_REPO_URL}}\n{{DEEPWIKI_URL}}\n{{GENERATION_DATE}}"]
Footer --> Variables

Templates Directory

The templates/ directory contains HTML template files for header and footer customization.

Templates Directory Structure

Sources: README.md:87

header.html

HTML template injected at the beginning of each markdown file. Supports variable substitution for dynamic content like repository links and generation timestamps.

Sources: README.md:40-51

footer.html

HTML template injected at the end of each markdown file. Supports the same variable substitution as header.html.

Sources: README.md:40-51

README.md

Documentation for the template system, including variable reference and customization examples.

Sources: README.md:51

graph TB
    Output["output/"]
Output --> Markdown["markdown/"]
Output --> RawMarkdown["raw_markdown/"]
Output --> Book["book/"]
Output --> Config["book.toml"]
Markdown --> MainPages["Main pages:\n1-overview.md\n2-quick-start.md"]
Markdown --> Sections["Subsection dirs:\nsection-2/\nsection-3/"]
Sections --> SubPages["Subsection pages:\n2-1-docker.md\n3-1-environment.md"]
RawMarkdown --> RawPages["Pre-enhanced\nmarkdown files\n(for debugging)"]
Book --> Index["index.html"]
Book --> CSS["css/"]
Book --> JS["mermaid.min.js"]
Book --> Search["searchindex.js"]

Output Directory (Generated)

The output/ directory is created at runtime and excluded from version control. It contains all generated artifacts produced by the build pipeline.

Output Structure

Sources: README.md:54-59

markdown/

Contains enhanced markdown source files with injected diagrams and processed templates. Files are organized hierarchically with subsections in section-N/ subdirectories.

Main Pages:

  • Format: {number}-{slug}.md (e.g., 1-overview.md)
  • Location: output/markdown/

Subsection Pages:

  • Format: section-{main}/{number}-{slug}.md
  • Location: output/markdown/section-{N}/
  • Example: section-3/3-2-environment-variables.md

Sources: README.md:56

raw_markdown/

Pre-enhancement markdown files for debugging purposes. Contains the output of Phase 1 before diagram injection and template processing. Useful for troubleshooting diagram matching issues.

Sources: README.md:57

book/

Complete HTML documentation site generated by mdBook. Self-contained static website with:

  • Navigation sidebar generated from SUMMARY.md
  • Full-text search via searchindex.js
  • Rendered Mermaid diagrams via mdbook-mermaid
  • Edit-on-GitHub links from GIT_REPO_URL
  • Responsive Rust theme

The entire directory can be served by any static file server or deployed to GitHub Pages.

Sources: README.md:55

book.toml

mdBook configuration file with repository-specific metadata. Dynamically generated during Phase 2 of the build pipeline. Contains book title, authors, theme settings, and preprocessor configuration.

Sources: README.md:58

graph TB
    BuildContext["Docker Build Context"]
BuildContext --> Included["Included in Image"]
BuildContext --> Excluded["Excluded"]
Included --> DockerfileBuild["Dockerfile\n(Build instructions)"]
Included --> ToolsCopy["tools/\n(COPY instruction)"]
Included --> ScriptCopy["build-docs.sh\n(COPY instruction)"]
ToolsCopy --> ReqInstall["requirements.txt\n→ uv pip install"]
ToolsCopy --> ScraperInstall["deepwiki-scraper.py\n→ /usr/local/bin/"]
ScriptCopy --> BuildInstall["build-docs.sh\n→ /usr/local/bin/"]
Excluded --> GitIgnored["output/\n(git-ignored)"]
Excluded --> GitFiles[".git/\n(implicit)"]
Excluded --> Readme["README.md\n(not referenced)"]
style BuildContext fill:#f9f9f9,stroke:#333
    style Included fill:#e8f5e9,stroke:#388e3c
    style Excluded fill:#ffebee,stroke:#c62828

Docker Build Context

The Docker build process includes only the files needed for container construction. Understanding this context is important for build optimization.

Build Context Inclusion

Copy Operations:

  1. Dockerfile:16 - COPY tools/requirements.txt /tmp/requirements.txt
  2. Dockerfile:24 - COPY tools/deepwiki-scraper.py /usr/local/bin/
  3. Dockerfile:28 - COPY build-docs.sh /usr/local/bin/

Not Copied:

  • .gitignore - only used by Git
  • output/ - generated at runtime
  • .git/ - version control metadata
  • Any documentation files (README, LICENSE)

Sources: Dockerfile:16-28 .gitignore:1-2

graph TB
    subgraph BuildTime["Build-Time Dependencies"]
DF["Dockerfile"]
Req["tools/requirements.txt"]
Scraper["tools/deepwiki-scraper.py"]
BuildSh["build-docs.sh"]
DF -->|COPY [Line 16]| Req
 
       DF -->|RUN install [Line 17]| Req
 
       DF -->|COPY [Line 24]| Scraper
 
       DF -->|COPY [Line 28]| BuildSh
 
       DF -->|CMD [Line 32]| BuildSh
    end
    
    subgraph Runtime["Run-Time Dependencies"]
BuildShRun["build-docs.sh\n(Entry point)"]
ScraperExec["deepwiki-scraper.py\n(Phase 1-2)"]
MdBook["mdbook\n(Phase 3)"]
MdBookMermaid["mdbook-mermaid\n(Phase 3)"]
BuildShRun -->|python3 [Line 58]| ScraperExec
 
       BuildShRun -->|mdbook-mermaid install [Line 171]| MdBookMermaid
 
       BuildShRun -->|mdbook build [Line 176]| MdBook
        
 
       ScraperExec -->|import requests| Req
 
       ScraperExec -->|import bs4| Req
 
       ScraperExec -->|import html2text| Req
    end
    
    subgraph Generated["Generated Artifacts"]
WikiDir["$WIKI_DIR/\n(Temp markdown)"]
BookToml["book.toml\n(Config)"]
Summary["SUMMARY.md\n(TOC)"]
OutputDir["output/\n(Final artifacts)"]
ScraperExec -->|sys.argv[2]| WikiDir
 
       BuildShRun -->|cat > [Line 85]| BookToml
 
       BuildShRun -->|Lines 113-159| Summary
 
       BuildShRun -->|cp [Lines 184-191]| OutputDir
    end
    
 
   BuildTime --> Runtime
 
   Runtime --> Generated
    
    style DF fill:#e1f5ff,stroke:#0288d1
    style BuildShRun fill:#fff4e1,stroke:#f57c00
    style ScraperExec fill:#e8f5e9,stroke:#388e3c
    style OutputDir fill:#ffe0b2,stroke:#e64a19

File Dependency Graph

This diagram maps the relationships between files and shows which files depend on or reference others.

Sources: Dockerfile:1-33 build-docs.sh:1-206 tools/deepwiki-scraper.py:1-920 tools/requirements.txt:1-4

File Size and Complexity Metrics

Understanding the relative complexity of each component helps developers identify which files require the most attention during modifications.

| File | Lines | Purpose | Complexity |
|------|-------|---------|------------|
| tools/deepwiki-scraper.py | 920 | Content extraction and diagram matching | High |
| build-docs.sh | 206 | Orchestration and configuration | Medium |
| Dockerfile | 33 | Multi-stage build specification | Low |
| tools/requirements.txt | 4 | Dependency list | Minimal |
| .gitignore | 2 | Git exclusion rules | Minimal |

Key Observations:

  • tools/deepwiki-scraper.py holds the bulk of the logic (~920 of roughly 1,165 total lines) and is where most changes land
  • The shell and Docker layers are thin orchestration around the Python core

Sources: tools/deepwiki-scraper.py:1-920 build-docs.sh:1-206 Dockerfile:1-33 tools/requirements.txt:1-4 .gitignore:1-2


Running Tests

Relevant source files

This page provides instructions for running the test suite locally and understanding the test organization within the DeepWiki-to-mdBook converter project. It covers local execution methods, test structure, and integration with the development workflow. For information about the automated CI/CD test workflow, see Test Workflow.

Test Organization

The test suite is located in the python/tests/ directory and consists of multiple test modules that validate different components of the system.

Test Structure

python/
├── tests/
│   ├── conftest.py              # pytest fixtures and configuration
│   ├── test_template_processor.py   # Template system tests
│   ├── test_mermaid_normalization.py  # Mermaid diagram normalization tests
│   └── test_numbering.py        # Page numbering and path resolution tests

Sources: scripts/run-tests.sh:1-43 python/tests/conftest.py:1-16

Test Categories

| Test Module | Purpose | Test Framework |
|-------------|---------|----------------|
| test_template_processor.py | Validates template variable substitution, conditional rendering, and header/footer injection | Standalone Python (no pytest required) |
| test_mermaid_normalization.py | Tests the seven-step Mermaid diagram normalization pipeline | pytest |
| test_numbering.py | Validates page numbering logic and path resolution algorithms | pytest |

Sources: scripts/run-tests.sh:7-30

Running Tests Locally

There are two primary methods for running tests locally: using the convenience shell script or invoking pytest directly.

Method 1: Using the Shell Script

The run-tests.sh script provides a unified interface for running all tests with appropriate error handling and formatted output:
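```bash
# From the repository root
./scripts/run-tests.sh
```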

This script:

  1. Runs template processor tests directly with Python
  2. Detects if pytest is installed
  3. Runs pytest-based tests if available
  4. Provides summary output with pass/fail status

Sources: scripts/run-tests.sh:1-43

Method 2: Using pytest Directly

For pytest-based tests, you can invoke pytest directly for more control:
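```bash
pytest python/tests/ -v
pytest python/tests/test_mermaid_normalization.py -v
```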

Sources: .github/workflows/tests.yml:24-25

Method 3: Individual Test Execution

The template processor tests can run independently without pytest:
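```bash
# Runs standalone; pytest is not required
python3 python/tests/test_template_processor.py
```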

Sources: scripts/run-tests.sh:11

Test Execution Flow

Test Execution Flow Diagram

This diagram shows how tests are executed locally. The run-tests.sh script checks for pytest availability and runs tests accordingly, while developers can also invoke pytest or Python directly.

Sources: scripts/run-tests.sh:1-43 python/tests/conftest.py:1-16

Prerequisites

Required Dependencies

Install Python dependencies before running tests:
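```bash
pip install -r python/requirements.txt
pip install pytest   # needed for the mermaid and numbering test modules
```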

The requirements.txt file contains all runtime dependencies needed by the scraper and test utilities.

Sources: .github/workflows/tests.yml:19-23

Python Version

Tests are designed for Python 3.12, which is the version used in both the Docker container and CI workflow:
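```bash
python3 --version   # expect Python 3.12.x to match the container and CI
```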

Sources: .github/workflows/tests.yml:17-18

Test Module Details

Test Module Dependencies Diagram

This diagram shows the organization of test modules and their relationship to the conftest.py fixture system. The scraper_module fixture dynamically loads deepwiki-scraper.py for use in pytest-based tests.

Sources: python/tests/conftest.py:1-16 scripts/run-tests.sh:7-30

Template Processor Tests

The test_template_processor.py module tests the template variable substitution system used for header and footer injection. It validates:

  • Variable substitution with {{VARIABLE_NAME}} syntax
  • Conditional blocks with {{#if CONDITION}}...{{/if}}
  • Edge cases like missing variables and nested conditions

This module can run independently without pytest and directly imports the template processing functions.

Sources: scripts/run-tests.sh:7-11

Mermaid Normalization Tests

The test_mermaid_normalization.py module validates the seven-step normalization pipeline that ensures Mermaid 11 compatibility. Each normalization step has dedicated tests:

| Normalization Step | Test Coverage |
|--------------------|---------------|
| Unescape sequences | \n, \t, \u003c character handling |
| Multiline edge labels | Flattening logic for edge descriptions |
| State descriptions | State : Description syntax fixes |
| Flowchart nodes | Pipe character removal |
| Statement separators | Semicolon insertion |
| Empty labels | Fallback label generation |
| Gantt task IDs | Synthetic ID generation for unnamed tasks |

This module uses the scraper_module fixture from conftest.py to access normalization functions.

Sources: scripts/run-tests.sh:16-22 python/tests/conftest.py:7-16

Numbering Tests

The test_numbering.py module validates page numbering logic and path generation algorithms. It tests:

  • Hierarchical numbering schemes (e.g., 1.2.3)
  • Numeric sorting that correctly handles multi-digit sections
  • Path generation from page numbers
  • Link rewriting for internal references

This module also uses the scraper_module fixture to access numbering functions.

Sources: scripts/run-tests.sh:24-30 python/tests/conftest.py:7-16

The conftest.py Fixture System

The conftest.py file provides a session-scoped fixture that loads the deepwiki-scraper.py module dynamically:
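A sketch of the fixture, assuming importlib-based dynamic loading (exact details may differ):

```python
import importlib.util
from pathlib import Path

import pytest

@pytest.fixture(scope="session")
def scraper_module():
    # Load deepwiki-scraper.py even though its filename is not a
    # valid Python module name.
    path = Path(__file__).resolve().parents[1] / "deepwiki-scraper.py"
    spec = importlib.util.spec_from_file_location("deepwiki_scraper", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```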

This approach allows tests to import functions from the scraper without requiring it to be installed as a package. The fixture is shared across all test sessions for efficiency.

Sources: python/tests/conftest.py:7-16

CI Integration

The test suite integrates with GitHub Actions through the tests.yml workflow, which:

  1. Triggers on push to main and on pull requests
  2. Sets up Python 3.12
  3. Installs dependencies from python/requirements.txt
  4. Installs pytest
  5. Runs all pytest tests with the -s flag (show output)

For detailed information about the CI test workflow, configuration, and failure handling, see Test Workflow.

Sources: .github/workflows/tests.yml:1-26

Understanding Test Output

Successful Test Run

When all tests pass using run-tests.sh, you will see:

==========================================
Running Template Processor Tests
==========================================

[Template test output...]

==========================================
Running Mermaid Normalization Tests
==========================================

[pytest output with test results...]

==========================================
Running Numbering Tests
==========================================

[pytest output with test results...]

==========================================
✓ All tests passed!
==========================================

Sources: scripts/run-tests.sh:34-42

Pytest Not Available

If pytest is not installed, the script will skip pytest-based tests:

==========================================
Running Template Processor Tests
==========================================

[Template test output...]

==========================================
⚠ Template tests passed (mermaid/numbering tests skipped)

Note: pytest not found, install with: pip install pytest
==========================================

Sources: scripts/run-tests.sh:34-42

Pytest Verbose Output

Using pytest with the -v flag provides detailed test information:
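```bash
pytest python/tests/ -v
```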

The -s flag shows print statements and output, useful for debugging:
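```bash
pytest python/tests/ -s
```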

Sources: .github/workflows/tests.yml:25

Local vs. CI Test Execution

Local vs. CI Test Execution Comparison

This diagram illustrates the difference between local and CI test execution. Local execution allows for flexible Python versions and gracefully handles missing pytest, while CI enforces Python 3.12 and guarantees pytest availability.

Sources: scripts/run-tests.sh:13-31 .github/workflows/tests.yml:1-26

Key Differences

| Aspect | Local Execution | CI Execution |
|--------|-----------------|--------------|
| Python Version | Any 3.x version | Fixed at 3.12 |
| pytest Requirement | Optional (graceful fallback) | Always installed |
| Execution Method | run-tests.sh or manual | pytest python/tests/ -s |
| Output Control | User-configurable verbosity | Fixed -s flag for output |
| Trigger | Manual by developer | Automatic on push/PR |

Sources: scripts/run-tests.sh:13-31 .github/workflows/tests.yml:17-25

Best Practices

Running Tests Before Commits

Always run the test suite before committing changes:
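```bash
./scripts/run-tests.sh
```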

This ensures your changes don’t break existing functionality.

Iterative Testing

When developing new features, run specific test modules for faster feedback:
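```bash
pytest python/tests/test_numbering.py -v
```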

Adding New Tests

When adding new functionality:

  1. Create test functions in the appropriate test module
  2. Use the scraper_module fixture for accessing scraper functions
  3. Test locally with both methods (script and pytest)
  4. Verify CI passes on your pull request

Sources: python/tests/conftest.py:7-16 .github/workflows/tests.yml:1-26


Python Dependencies

Relevant source files

This page documents the Python dependencies required by the deepwiki-scraper.py script, including their purposes, version requirements, and how they are used throughout the content extraction and conversion pipeline. For information about the scraper script itself, see deepwiki-scraper.py. For details about how Rust dependencies (mdBook and mdbook-mermaid) are installed, see Docker Multi-Stage Build.

Dependencies Overview

The system requires three core Python libraries for web scraping and HTML-to-Markdown conversion:

| Package | Minimum Version | Primary Purpose |
|---------|-----------------|-----------------|
| requests | 2.31.0 | HTTP client for fetching web pages |
| beautifulsoup4 | 4.12.0 | HTML parsing and DOM manipulation |
| html2text | 2020.1.16 | HTML to Markdown conversion |

These dependencies are declared in tools/requirements.txt:1-3 and installed during Docker image build using the uv package manager.

Sources: tools/requirements.txt:1-3 Dockerfile:16-17

Dependency Usage Flow

The following diagram illustrates how each Python dependency is used across the three-phase processing pipeline:

Sources: tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-788

flowchart TD
    subgraph "Phase 1: Markdown Extraction"
        FetchPage["fetch_page()\n[tools/deepwiki-scraper.py:27-42]"]
ExtractStruct["extract_wiki_structure()\n[tools/deepwiki-scraper.py:78-125]"]
ExtractContent["extract_page_content()\n[tools/deepwiki-scraper.py:453-594]"]
ConvertHTML["convert_html_to_markdown()\n[tools/deepwiki-scraper.py:175-190]"]
end
    
    subgraph "Phase 2: Diagram Enhancement"
        ExtractDiagrams["extract_and_enhance_diagrams()\n[tools/deepwiki-scraper.py:596-788]"]
end
    
    subgraph "requests Library"
        Session["requests.Session()"]
GetMethod["session.get()"]
HeadMethod["session.head()"]
end
    
    subgraph "BeautifulSoup4 Library"
        BS4Parser["BeautifulSoup(html, 'html.parser')"]
FindAll["soup.find_all()"]
Select["soup.select()"]
Decompose["element.decompose()"]
end
    
    subgraph "html2text Library"
        H2TClass["html2text.HTML2Text()"]
HandleMethod["h.handle()"]
end
    
 
   FetchPage --> Session
 
   FetchPage --> GetMethod
 
   ExtractStruct --> GetMethod
 
   ExtractStruct --> BS4Parser
 
   ExtractStruct --> FindAll
    
 
   ExtractContent --> GetMethod
 
   ExtractContent --> BS4Parser
 
   ExtractContent --> Select
 
   ExtractContent --> Decompose
 
   ExtractContent --> ConvertHTML
    
 
   ConvertHTML --> H2TClass
 
   ConvertHTML --> HandleMethod
    
 
   ExtractDiagrams --> GetMethod

requests

The requests library provides HTTP client functionality for fetching web pages from DeepWiki.com. It is imported at tools/deepwiki-scraper.py:17 and used throughout the scraper.

Key Usage Patterns

Session Management: A requests.Session() object is created at tools/deepwiki-scraper.py:818-821 to maintain connection pooling and share headers across multiple requests:
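A sketch of the pattern (the exact header set is illustrative):

```python
import requests

session = requests.Session()          # connection pooling + shared headers
session.headers.update({
    "User-Agent": "Mozilla/5.0 ...",  # browser-like header
})
```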

HTTP GET Requests: The fetch_page() function at tools/deepwiki-scraper.py:27-42 uses session.get() with retry logic, browser-like headers, and 30-second timeout to fetch HTML content.

HTTP HEAD Requests: The discover_subsections() function at tools/deepwiki-scraper.py:44-76 uses session.head() to efficiently check for page existence without downloading full content.

Configuration Options

The library is configured with:

  • Browser-like request headers to avoid appearing as a bot
  • A 30-second timeout per request
  • Retry logic in fetch_page() for transient failures

Sources: tools/deepwiki-scraper.py:17 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:818-821

BeautifulSoup4

The beautifulsoup4 library (imported as bs4) provides HTML parsing and DOM manipulation capabilities. It is imported at tools/deepwiki-scraper.py:18 as from bs4 import BeautifulSoup.

Parser Selection

BeautifulSoup is instantiated with the built-in html.parser backend at multiple locations:
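```python
from bs4 import BeautifulSoup

# "html" is the raw page text returned by fetch_page()
soup = BeautifulSoup(html, "html.parser")
```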

This parser choice avoids external dependencies (lxml, html5lib) and provides sufficient functionality for well-formed HTML.

flowchart LR
    subgraph "Navigation Methods"
        FindAll["soup.find_all()"]
Find["soup.find()"]
Select["soup.select()"]
SelectOne["soup.select_one()"]
end
    
    subgraph "Usage in extract_wiki_structure()"
        StructLinks["Find wiki page links\n[line 90]"]
end
    
    subgraph "Usage in extract_page_content()"
        RemoveNav["Remove navigation elements\n[line 466]"]
FindContent["Locate main content area\n[line 473-485]"]
RemoveUI["Remove DeepWiki UI elements\n[line 491-511]"]
end
    
 
   FindAll --> StructLinks
 
   FindAll --> RemoveUI
    
 
   Select --> RemoveNav
 
   SelectOne --> FindContent
    
 
   Find --> FindContent

DOM Navigation Methods

The following diagram maps BeautifulSoup methods to their usage contexts in the codebase:

Sources: tools/deepwiki-scraper.py:18 tools/deepwiki-scraper.py:84 tools/deepwiki-scraper.py:90 tools/deepwiki-scraper.py:463 tools/deepwiki-scraper.py:466-511

Content Manipulation

Element Removal: The element.decompose() method permanently removes elements from the DOM tree:
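For example (the nav selector is illustrative):

```python
for nav in soup.find_all("nav"):
    nav.decompose()   # permanently removed from the tree
```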

CSS Selectors: BeautifulSoup’s select() and select_one() methods support CSS selector syntax for finding content areas:

tools/deepwiki-scraper.py:473-476
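For example (hypothetical selectors; the real ones live at the cited lines):

```python
main = soup.select_one("main article") or soup.select_one("div.content")
```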

Attribute-Based Selection: The find() method with attrs parameter locates elements by ARIA roles:

tools/deepwiki-scraper.py:480
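```python
main = soup.find(attrs={"role": "main"})
```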

Text Extraction

BeautifulSoup’s get_text() method extracts plain text from elements:
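```python
text = element.get_text(strip=True)   # visible text, outer whitespace stripped
```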

Sources: tools/deepwiki-scraper.py:466-511

html2text

The html2text library converts HTML content to Markdown format. It is imported at tools/deepwiki-scraper.py:19 and used exclusively in the convert_html_to_markdown() function.

Configuration

An HTML2Text instance is created with specific configuration at tools/deepwiki-scraper.py:178-180:
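```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False   # keep hyperlinks as Markdown links
h.body_width = 0         # disable hard wrapping at 80 columns
```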

Key Settings:

  • ignore_links = False : Preserves hyperlinks as Markdown [text](url) syntax
  • body_width = 0 : Disables automatic line wrapping at 80 characters, preserving original formatting

Conversion Process

The handle() method at tools/deepwiki-scraper.py:181 performs the actual HTML-to-Markdown conversion:
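```python
markdown = h.handle(cleaned_html)   # cleaned_html: HTML after BeautifulSoup cleanup
```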

This processes the cleaned HTML (after BeautifulSoup removes unwanted elements) and produces Markdown text with:

  • Headers converted to # syntax
  • Links converted to [text](url) format
  • Lists converted to - or 1. format
  • Bold/italic formatting preserved
  • Code blocks and inline code preserved

Post-Processing

The conversion output undergoes additional cleanup at tools/deepwiki-scraper.py:188.

Sources: tools/deepwiki-scraper.py:19 tools/deepwiki-scraper.py:175-190

flowchart TD
    subgraph "Dockerfile Stage 2"
        BaseImage["FROM python:3.12-slim\n[Dockerfile:8]"]
CopyUV["COPY uv from astral-sh image\n[Dockerfile:13]"]
CopyReqs["COPY tools/requirements.txt\n[Dockerfile:16]"]
InstallDeps["RUN uv pip install --system\n[Dockerfile:17]"]
end
    
    subgraph "requirements.txt"
        Requests["requests>=2.31.0"]
BS4["beautifulsoup4>=4.12.0"]
HTML2Text["html2text>=2020.1.16"]
end
    
 
   BaseImage --> CopyUV
 
   CopyUV --> CopyReqs
 
   CopyReqs --> InstallDeps
    
 
   Requests --> InstallDeps
 
   BS4 --> InstallDeps
 
   HTML2Text --> InstallDeps

Installation Process

The dependencies are installed during Docker image build using the uv package manager, which provides fast, reliable Python package installation.

Multi-Stage Build Integration

Sources: Dockerfile:8 Dockerfile:13 Dockerfile:16-17 tools/requirements.txt:1-3

Installation Command

The dependencies are installed with a single uv pip install command at Dockerfile:17:
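```dockerfile
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
```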

Flags:

  • --system : Installs into system Python, not a virtual environment
  • --no-cache : Avoids caching to reduce Docker image size
  • -r /tmp/requirements.txt : Specifies requirements file path

The uv tool is significantly faster than standard pip and provides deterministic dependency resolution, making builds more reliable and reproducible.

Sources: Dockerfile:16-17

Version Requirements

The minimum version constraints specified in tools/requirements.txt:1-3 ensure compatibility with required features:

requests >= 2.31.0

This version requirement ensures:

  • Security fixes : Addresses CVE-2023-32681 (Proxy-Authorization header leakage)
  • Session improvements : Enhanced connection pooling and retry mechanisms
  • HTTP/2 support : Better performance for multiple requests

The codebase relies on stable Session API behavior introduced in 2.x releases.

beautifulsoup4 >= 4.12.0

This version requirement ensures:

  • Python 3.12 compatibility : Required for the base image python:3.12-slim
  • Parser stability : Consistent behavior with html.parser backend
  • Security updates : Protection against XML parsing vulnerabilities

The codebase uses standard find/select methods that are stable across 4.x versions.

html2text >= 2020.1.16

This version requirement ensures:

  • Python 3 compatibility : Earlier versions targeted Python 2.7
  • Markdown formatting fixes : Improved handling of nested lists and code blocks
  • Link preservation : Proper conversion of HTML links to Markdown syntax

The codebase uses the body_width=0 configuration which was stabilized in this version.

Sources: tools/requirements.txt:1-3

Import Locations

All three dependencies are imported at the top of deepwiki-scraper.py:
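```python
import requests
from bs4 import BeautifulSoup
import html2text
```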

These are the only external dependencies required by the Python layer. The script uses only standard library modules for all other functionality (sys, re, time, pathlib, urllib.parse).

Sources: tools/deepwiki-scraper.py:17-19
