Advanced Topics

This page covers advanced usage scenarios, implementation details, and power-user features of the DeepWiki-to-mdBook Converter. It provides deeper insight into optional features, debugging techniques, and the internal mechanisms that enable the system's flexibility and robustness.

For basic usage and configuration, see Quick Start and Configuration Reference. For architectural overview, see System Architecture. For component-level details, see Component Reference.

When to Use Advanced Features

The system provides several advanced features designed for specific scenarios:

Markdown-Only Mode: Extract markdown without building the HTML documentation. Useful for:

Debugging diagram placement and content extraction
Quick iteration during development
Creating markdown archives for version control
Feeding extracted content into other tools

Auto-Detection: Automatically determine repository metadata from Git remotes. Useful for:

CI/CD pipeline integration with minimal configuration
Running from within a repository checkout
Reducing configuration boilerplate

Custom Configuration: Override default behaviors through environment variables. Useful for:

Multi-repository documentation builds
Custom branding and themes
Specialized output requirements

Decision Flow for Build Modes

Sources: build-docs.sh:60-76 build-docs.sh:78-206 README.md:55-76

Debugging Strategies

Using Markdown-Only Mode for Fast Iteration

The MARKDOWN_ONLY environment variable bypasses the mdBook build phase, so a run ends as soon as the enhanced markdown is written and the HTML build time is saved on every iteration. This is controlled by a simple conditional check in the orchestration script.

Workflow:

Set MARKDOWN_ONLY=true in the Docker run command
The script takes the early-exit branch at build-docs.sh:60-76, which skips Steps 2-6
Only Phase 1 (scraping) and Phase 2 (diagram enhancement) execute
Output is written directly to /output/markdown/

Typical debugging session:

The check at build-docs.sh:61 determines whether to exit early:

For detailed information about this mode, see Markdown-Only Mode.

Sources: build-docs.sh:60-76 build-docs.sh:26 README.md:55-76

Inspecting Intermediate Outputs

The system uses a temporary directory workflow that can be examined for debugging:

| Stage | Location | Contents |
| --- | --- | --- |
| During Phase 1 | `/workspace/wiki/` (temp) | Raw markdown before diagram enhancement |
| During Phase 2 | `/workspace/wiki/` (temp) | Markdown with injected diagrams |
| During Phase 3 | `/workspace/book/src/` | Markdown copied for mdBook |
| Final Output | `/output/markdown/` | Final enhanced markdown files |

The temporary directory pattern is implemented using Python's tempfile.TemporaryDirectory at tools/deepwiki-scraper.py:808:

This keeps failed runs from leaving partial output behind: if the script fails mid-process, the temporary directory and everything in it are cleaned up automatically.
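As a minimal sketch of this pattern (not the scraper's actual code), the following shows how tempfile.TemporaryDirectory confines intermediate files to a scratch directory and copies them out only after every step has succeeded. The helper name `build_in_temp`, the `fail` switch, and the file name are hypothetical and exist only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def build_in_temp(output_dir: Path, fail: bool = False) -> Path:
    """Write files in a scratch directory, then copy them to output_dir.

    Hypothetical helper; `fail` simulates an error in the middle of a run.
    Returns the scratch path so callers can confirm it was removed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp)
        (scratch / "1-overview.md").write_text("# Overview\n", encoding="utf-8")
        if fail:
            raise RuntimeError("simulated failure before copy-out")
        # Copy only after all intermediate work has succeeded.
        shutil.copytree(scratch, output_dir, dirs_exist_ok=True)
    # Leaving the `with` block deletes the scratch directory, on success or error.
    return scratch
```

Because the copy is the last step inside the `with` block, a failure earlier in the run leaves the output directory untouched.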

Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:27-30

Diagram Placement Debugging

Diagram injection uses fuzzy matching with progressive chunk sizes. To debug placement:

Check raw extraction count: Look for console output "Found N total diagrams"
Check context extraction: Look for "Found N diagrams with context"
Check matching: Look for "Enhanced X files with diagrams"

The matching algorithm tries progressively smaller chunks at tools/deepwiki-scraper.py:716-730:
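The idea of progressively smaller chunks can be sketched as follows. This is an illustration under stated assumptions rather than the scraper's code: the function names, the whitespace normalization, the choice to use the tail of the context, and the chunk sizes are all assumed.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_anchor(context: str, markdown: str, chunk_sizes=(300, 200, 150, 100, 80)):
    """Return (end_position, chunk_size) for the first chunk found, or None.

    The chunk sizes are assumed values. Positions refer to the normalized
    markdown, not the original string.
    """
    ctx = normalize(context)
    doc = normalize(markdown)
    if not ctx:
        return None
    for size in chunk_sizes:
        chunk = ctx[-size:]  # the text closest to the diagram
        pos = doc.find(chunk)
        if pos != -1:
            return pos + len(chunk), size
    return None
```

Larger chunks are tried first because a long match is less likely to be ambiguous; the smaller sizes act as a fallback when only the text nearest the diagram survived conversion unchanged.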

Debugging poor matches:

If too few diagrams are placed: The context extracted from JavaScript may not match the converted markdown
If diagrams land in the wrong locations: The context text may appear in more than one place
If no diagrams are placed: The repository may not contain mermaid diagrams
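When working through the checks above, it can help to count the diagrams that actually landed in each output file. The helper below is a hypothetical debugging aid, not part of the tool, and it assumes injected diagrams appear as fenced blocks whose opening fence is three backticks followed by `mermaid`.

```python
from pathlib import Path

# Built from parts so this example does not contain a literal fence line.
MERMAID_FENCE = "`" * 3 + "mermaid"


def count_mermaid_blocks(markdown_dir: Path) -> dict:
    """Map each markdown file (path relative to markdown_dir) to its mermaid fence count."""
    counts = {}
    for path in sorted(markdown_dir.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        opening_fences = sum(
            1 for line in text.splitlines() if line.strip().startswith(MERMAID_FENCE)
        )
        counts[path.relative_to(markdown_dir).as_posix()] = opening_fences
    return counts
```

Running it against the markdown output directory after a markdown-only build and comparing the total with the "Found N total diagrams" console message shows how many diagrams were extracted but never placed.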

Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:216-331

Link Rewriting Implementation

The Link Rewriting Problem

DeepWiki uses absolute URLs like /owner/repo/2-1-subsection. The scraper must convert these to relative markdown paths that work in the mdBook file hierarchy:

output/markdown/
├── 1-overview.md
├── 2-main-section.md
├── section-2/
│   ├── 2-1-subsection.md
│   └── 2-2-another.md
└── 3-next-section.md

Links must account for:

Source page location (main page vs. subsection)
Target page location (main page vs. subsection)
Same section vs. cross-section links

Link Rewriting Algorithm

Sources: tools/deepwiki-scraper.py:549-593

Link Rewriting Code Structure

The fix_wiki_link function at tools/deepwiki-scraper.py:549-589 implements this logic:

Input parsing:

Location detection:

Path generation rules:

| Source Location | Target Location | Generated Path | Example |
| --- | --- | --- | --- |
| Main page | Main page | `file.md` | `3-next.md` |
| Main page | Subsection | `section-N/file.md` | `section-2/2-1-sub.md` |
| Subsection | Main page | `../file.md` | `../3-next.md` |
| Subsection (same section) | Subsection | `file.md` | `2-2-another.md` |
| Subsection (diff section) | Subsection | `section-N/file.md` | `section-3/3-1-sub.md` |

The regex replacement at tools/deepwiki-scraper.py:592 applies this transformation to all links:
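The path generation rules and the substitution step can be sketched together as below. This is a simplified illustration that follows the table row by row; it is not a copy of fix_wiki_link. The names `rewrite_target` and `rewrite_links`, the `source_section` parameter, and the link regex are assumptions.

```python
import re
from typing import Optional

# Assumed link shape: "](/owner/repo/<number>-<slug>)". Links with a #fragment
# do not match and are left unchanged.
WIKI_LINK = re.compile(r"\]\(/[^/)\s]+/[^/)\s]+/(\d+(?:-\d+)?)-([^)\s#]+)\)")


def rewrite_target(number: str, slug: str, source_section: Optional[str]) -> str:
    """Return the relative path for a link target.

    source_section is the main section number when the source page is a
    subsection (for example "2"), or None when the source is a main page.
    """
    filename = f"{number}-{slug}.md"
    parts = number.split("-")
    if len(parts) > 1:  # target is a subsection such as 2-1
        target_section = parts[0]
        if source_section == target_section:
            return filename  # same section directory
        return f"section-{target_section}/{filename}"
    if source_section is not None:
        return f"../{filename}"  # subsection linking up to a main page
    return filename  # main page to main page


def rewrite_links(markdown: str, source_section: Optional[str] = None) -> str:
    """Rewrite every matching wiki link in one pass using a callback."""

    def replace(match):
        target = rewrite_target(match.group(1), match.group(2), source_section)
        return "](" + target + ")"

    return WIKI_LINK.sub(replace, markdown)
```

Passing a callback to `sub` keeps the location logic in one function while the regex only has to recognize the link shape.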

For detailed explanation, see Link Rewriting Logic.

Sources: tools/deepwiki-scraper.py:549-593

Auto-Detection Mechanisms

flowchart TD
    Start["build-docs.sh starts"] --> CheckRepo{"REPO env var\nprovided?"}
    CheckRepo -->|Yes| UseEnv["Use provided REPO value"]
    CheckRepo -->|No| CheckGit{"Is current directory\na Git repository?\n(git rev-parse --git-dir)"}
    CheckGit -->|Yes| GetRemote["Get remote.origin.url:\ngit config --get\nremote.origin.url"]
    CheckGit -->|No| SetEmpty["Set REPO=<empty>"]
    GetRemote --> HasRemote{"Remote URL\nfound?"}
    HasRemote -->|Yes| ParseURL["Parse GitHub URL using sed regex:\nExtract owner/repo"]
    HasRemote -->|No| SetEmpty
    ParseURL --> ValidateFormat{"Format is\nowner/repo?"}
    ValidateFormat -->|Yes| SetRepo["Set REPO variable"]
    ValidateFormat -->|No| SetEmpty
    SetEmpty --> FinalCheck{"REPO is empty?"}
    UseEnv --> Continue["Continue with REPO"]
    SetRepo --> Continue
    FinalCheck -->|Yes| Error["ERROR: REPO must be set\nExit with code 1"]
    FinalCheck -->|No| Continue

Git Remote Auto-Detection

When the REPO environment variable is not provided, the system attempts to auto-detect it from the Git repository in the current working directory.

Sources: build-docs.sh:8-37

Implementation Details

The auto-detection logic at build-docs.sh:8-19 handles multiple Git URL formats:

Supported URL formats:

HTTPS: https://github.com/owner/repo.git
HTTPS (no .git): https://github.com/owner/repo
SSH: git@github.com:owner/repo.git
SSH (no .git): git@github.com:owner/repo

The regex pattern .*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.* captures:

[:/] - Matches either : (SSH) or / (HTTPS)
([^/]+/[^/\.]+) - Captures owner/repo (stops at / or .)
(\.git)? - Optionally matches .git suffix
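To see how these pieces behave on each supported URL format, the pattern can be transcribed into Python's `re` module. The script itself uses sed, so this is an equivalent sketch for experimentation; the function name `repo_from_remote` is hypothetical.

```python
import re
from typing import Optional

# Transcription of the documented pieces: [:/], ([^/]+/[^/\.]+), (\.git)?
GITHUB_REMOTE = re.compile(r".*github\.com[:/]([^/]+/[^/.]+)(\.git)?.*")


def repo_from_remote(url: str) -> Optional[str]:
    """Return "owner/repo" for a GitHub remote URL, or None if it does not match."""
    match = GITHUB_REMOTE.match(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    for remote in (
        "https://github.com/owner/repo.git",
        "https://github.com/owner/repo",
        "git@github.com:owner/repo.git",
        "git@github.com:owner/repo",
    ):
        print(remote, "->", repo_from_remote(remote))
```

Because the capture group stops at the first `.` in the repository part, the `.git` suffix never ends up in the extracted value.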

Derived defaults:

After determining REPO, the script derives other configuration at build-docs.sh:39-45:

This provides sensible defaults:

BOOK_AUTHORS defaults to repository owner
GIT_REPO_URL defaults to GitHub URL (for "Edit this page" links)
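The defaulting behavior described above can be modeled in a few lines. This is a sketch of the logic only, written in Python rather than shell; the function name `derive_defaults` is hypothetical, and treating an empty value the same as an unset one is an assumption about how the script applies its defaults.

```python
from typing import Mapping


def derive_defaults(repo: str, env: Mapping[str, str]) -> dict:
    """Fill BOOK_AUTHORS and GIT_REPO_URL from REPO unless the caller set them."""
    owner = repo.split("/", 1)[0]
    return {
        "REPO": repo,
        # An empty string is treated as "not set" (assumption).
        "BOOK_AUTHORS": env.get("BOOK_AUTHORS") or owner,
        "GIT_REPO_URL": env.get("GIT_REPO_URL") or f"https://github.com/{repo}",
    }
```

For example, `derive_defaults("owner/repo", {})` yields `owner` as the author and `https://github.com/owner/repo` as the repository URL, while any value supplied in `env` takes precedence.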

For detailed explanation, see Auto-Detection Features.

Sources: build-docs.sh:8-45 README.md:47-53

Performance Considerations

Build Time Breakdown

Typical build times for a medium-sized repository (50-100 pages):

| Phase | Time | Bottleneck |
| --- | --- | --- |
| Phase 1: Scraping | 60-120s | Network requests + 1s delays |
| Phase 2: Diagrams | 5-10s | Regex matching + file I/O |
| Phase 3: mdBook | 10-20s | Rust compilation + mermaid assets |
| Total | 75-150s | Network + computation |

Optimization Strategies

Network optimization:

The scraper includes time.sleep(1) at tools/deepwiki-scraper.py:872 between pages
Retry logic with exponential backoff at tools/deepwiki-scraper.py:33-42
HTTP session reuse via requests.Session() at tools/deepwiki-scraper.py:818-821
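Retry with exponential backoff can be sketched generically as below. This is not the scraper's implementation: the function name `fetch_with_retry`, the default retry count, and the base delay are assumptions, and the `sleep` parameter exists only so the delays can be observed without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `retries` times, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default
    raise ValueError("retries must be at least 1")
```

With the requests library, `fetch` could be a small function that calls `get` on one shared `requests.Session()`, so connection reuse and retries combine naturally.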

Markdown-only mode:

Skips Phase 3 entirely, reducing build time by ~15-25%
Useful for content-only iterations

Docker build optimization:

Multi-stage build discards Rust toolchain (~1.5 GB)
Final image only contains binaries (~300-400 MB)
See Docker Multi-Stage Build for details

Caching considerations:

No internal caching: each run fetches fresh content
DeepWiki serves dynamic content (no cache headers)
Docker layer caching helps with repeated image builds

Sources: tools/deepwiki-scraper.py:28-42 tools/deepwiki-scraper.py:817-821 tools/deepwiki-scraper.py:872

Extending the System

Adding New Output Formats

The system's three-phase architecture makes it easy to add new output formats:

Integration points:

Before Phase 3: Add code after build-docs.sh:188 to read from $WIKI_DIR
Alternative Phase 3: Replace build-docs.sh:174-176 with a custom builder
Post-processing: Add steps after build-docs.sh:192 to transform mdBook output

Example: Adding PDF export:

Sources: build-docs.sh:174-206

Customizing Diagram Matching

The fuzzy matching algorithm can be tuned by modifying the chunk sizes at tools/deepwiki-scraper.py:716:

Matching strategy customization:

The scoring system at tools/deepwiki-scraper.py:709-745 prioritizes:

Anchor text matching (weighted by chunk size)
Heading matching (weight: 50)

You can add additional heuristics by modifying the scoring logic or adding new matching strategies.

Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:716-745

Adding New Content Cleaners

The HTML-to-markdown conversion can be enhanced by adding custom cleaners at tools/deepwiki-scraper.py:489-511:

The footer cleaner at tools/deepwiki-scraper.py:127-173 can be extended with additional patterns:
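One way to structure such an extension is a list of compiled patterns that a single function applies to trailing lines. The sketch below is illustrative only: the pattern list and the name `strip_footer` are hypothetical and do not reproduce the scraper's actual footer patterns.

```python
import re

# Hypothetical footer patterns; replace with the ones your pages actually need.
FOOTER_PATTERNS = [
    re.compile(r"^\s*Keyboard shortcuts\s*$", re.IGNORECASE),
    re.compile(r"^\s*Dismiss\s*$", re.IGNORECASE),
]


def strip_footer(markdown: str, patterns=FOOTER_PATTERNS) -> str:
    """Drop trailing lines that are blank or match one of the footer patterns."""
    lines = markdown.splitlines()
    while lines and (
        not lines[-1].strip() or any(p.match(lines[-1]) for p in patterns)
    ):
        lines.pop()
    return "\n".join(lines) + ("\n" if lines else "")
```

Adding a cleaner then amounts to appending one compiled pattern to the list, and because only trailing lines are examined, matching text in the body of a page is left alone.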

Sources: tools/deepwiki-scraper.py:127-173 tools/deepwiki-scraper.py:466-511

Common Advanced Scenarios

CI/CD Integration

GitHub Actions example:

The auto-detection at build-docs.sh:8-19 determines REPO from the Git context. Setting BOOK_TITLE overrides the default title.

Sources: build-docs.sh:8-45 README.md:228-232

Multi-Repository Builds

Build documentation for multiple repositories in parallel:

Each build runs in an isolated container with separate output directories.
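A parallel driver for such builds might look like the sketch below. It assumes the image is tagged `deepwiki-to-mdbook` and that the container writes to `/output`; the helper names and the per-repository directory naming are hypothetical. The `run` parameter defaults to `subprocess.run` and can be swapped for a stub when testing without Docker.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

IMAGE = "deepwiki-to-mdbook"  # assumed image tag; use whatever tag you built


def docker_command(repo: str, output_root: Path) -> list:
    """Build the docker argv for one repository with its own output directory."""
    out_dir = (output_root / repo.replace("/", "-")).resolve()
    return [
        "docker", "run", "--rm",
        "-e", f"REPO={repo}",
        "-v", f"{out_dir}:/output",
        IMAGE,
    ]


def build_all(repos, output_root: Path, run=subprocess.run, max_workers: int = 4) -> dict:
    """Run one container per repository in parallel and return exit codes by repo."""
    commands = {repo: docker_command(repo, output_root) for repo in repos}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            repo: pool.submit(run, cmd, check=False) for repo, cmd in commands.items()
        }
        return {repo: future.result().returncode for repo, future in futures.items()}
```

Threads are sufficient here because each worker only waits on an external process, and a non-zero exit code in the result identifies which repository failed.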

Sources: build-docs.sh:21-53 README.md:200-207

Custom Theming

Override the mdBook theme by modifying the generated book.toml template at build-docs.sh:85-103:

Or inject custom CSS:

Sources: build-docs.sh:84-103
