Advanced Topics
This page covers advanced usage scenarios, implementation details, and power-user features of the DeepWiki-to-mdBook Converter. It provides deeper insight into optional features, debugging techniques, and the internal mechanisms that enable the system's flexibility and robustness.
For basic usage and configuration, see Quick Start and Configuration Reference. For an architectural overview, see System Architecture. For component-level details, see Component Reference.
When to Use Advanced Features
The system provides several advanced features designed for specific scenarios:
Markdown-Only Mode : Extract markdown without building the HTML documentation. Useful for:
- Debugging diagram placement and content extraction
- Quick iteration during development
- Creating markdown archives for version control
- Feeding extracted content into other tools
Auto-Detection : Automatically determine repository metadata from Git remotes. Useful for:
- CI/CD pipeline integration with minimal configuration
- Running from within a repository checkout
- Reducing configuration boilerplate
Custom Configuration : Override default behaviors through environment variables. Useful for:
- Multi-repository documentation builds
- Custom branding and themes
- Specialized output requirements
Decision Flow for Build Modes
Sources: build-docs.sh:60-76 build-docs.sh:78-206 README.md:55-76
Debugging Strategies
Using Markdown-Only Mode for Fast Iteration
The MARKDOWN_ONLY environment variable bypasses the mdBook build phase, so a run finishes as soon as scraping and diagram enhancement are done and the markdown can be inspected directly. This is controlled by a simple conditional check in the orchestration script.
Workflow:
- Set MARKDOWN_ONLY=true in the Docker run command
- Script executes build-docs.sh:60-76, which skips Steps 2-6
- Only Phase 1 (scraping) and Phase 2 (diagram enhancement) execute
- Output written directly to /output/markdown/
Typical debugging session:
The check at build-docs.sh:61 determines whether to exit early:
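A minimal sketch of that early-exit decision, written in Python for illustration rather than in the script's own shell. The function names and the default of false are assumptions; only the MARKDOWN_ONLY variable, the value true, and the phase order come from the workflow above.

```python
import os


def markdown_only_enabled(env=None):
    """Return True when the build should stop after the markdown phases.

    Hypothetical helper: assumes the script compares MARKDOWN_ONLY
    against the literal string "true" and treats anything else as off.
    """
    env = os.environ if env is None else env
    return env.get("MARKDOWN_ONLY", "false") == "true"


def planned_phases(env=None):
    """List the phases a run would execute for the given environment."""
    phases = ["scrape", "enhance-diagrams"]
    if not markdown_only_enabled(env):
        # Full build: the mdBook phase runs after the markdown is ready.
        phases.append("mdbook-build")
    return phases
```

Calling planned_phases({"MARKDOWN_ONLY": "true"}) leaves out the mdBook phase, which mirrors the early exit described above.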
For detailed information about this mode, see Markdown-Only Mode.
Sources: build-docs.sh:60-76 build-docs.sh:26 README.md:55-76
Inspecting Intermediate Outputs
The system uses a temporary directory workflow that can be examined for debugging:
| Stage | Location | Contents |
|---|---|---|
| During Phase 1 | /workspace/wiki/ (temp) | Raw markdown before diagram enhancement |
| During Phase 2 | /workspace/wiki/ (temp) | Markdown with injected diagrams |
| During Phase 3 | /workspace/book/src/ | Markdown copied for mdBook |
| Final Output | /output/markdown/ | Final enhanced markdown files |
The temporary directory pattern is implemented using Python's tempfile.TemporaryDirectory at tools/deepwiki-scraper.py:808:
This keeps partial results out of the output directory: if the script fails mid-process, the temporary directory and everything in it are cleaned up automatically.
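The pattern can be sketched as follows. This is an illustration under stated assumptions, not the scraper's code: build_markdown, its pages argument, and the fail flag are hypothetical names used only to show how work done inside tempfile.TemporaryDirectory reaches the output directory only after every step has succeeded.

```python
import shutil
import tempfile
from pathlib import Path


def build_markdown(output_dir, pages, fail=False):
    """Write pages into a temporary directory, then copy them to output_dir.

    `pages` maps file names to markdown text. `fail` simulates an error
    in the middle of the run. Both are hypothetical, for this sketch only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        for name, text in pages.items():
            (work / name).write_text(text, encoding="utf-8")
        if fail:
            raise RuntimeError("simulated failure before publishing")
        # Only reached on success: publish the finished files.
        shutil.copytree(work, output_dir, dirs_exist_ok=True)
    # The temporary directory is removed here, on success or on error.
```

If the simulated failure is raised, output_dir is never created, so a later run starts from a clean state.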
Sources: tools/deepwiki-scraper.py:804-916 build-docs.sh:27-30
Diagram Placement Debugging
Diagram injection uses fuzzy matching with progressive chunk sizes. To debug placement:
- Check raw extraction count : Look for console output "Found N total diagrams"
- Check context extraction : Look for "Found N diagrams with context"
- Check matching : Look for "Enhanced X files with diagrams"
The matching algorithm tries progressively smaller chunks at tools/deepwiki-scraper.py:716-730:
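A rough sketch of progressive chunk matching, to make the idea concrete. The chunk sizes, the function names, and the choice to match on the tail of the context text are assumptions made for illustration; the scraper's actual values and details may differ.

```python
def normalize(text):
    """Lowercase and collapse whitespace so minor formatting differences match."""
    return " ".join(text.lower().split())


def find_insert_line(anchor_text, lines, chunk_sizes=(300, 200, 150, 100, 80)):
    """Return the line index after which a diagram would be inserted, or None.

    Tries the longest tail of anchor_text first, then shorter tails.
    """
    anchor = normalize(anchor_text)
    # Build one normalized string and remember where each line ends in it.
    doc = ""
    line_ends = []
    for line in lines:
        piece = normalize(line)
        if piece:
            doc = f"{doc} {piece}" if doc else piece
        line_ends.append(len(doc))
    for size in chunk_sizes:
        chunk = anchor[-size:]  # the tail sits closest to the diagram
        pos = doc.find(chunk) if chunk else -1
        if pos == -1:
            continue  # no hit at this size: fall back to a smaller chunk
        end = pos + len(chunk)
        for index, line_end in enumerate(line_ends):
            if line_end >= end:
                return index + 1
    return None
```

Large chunks rarely match by accident but fail when the converted markdown differs slightly from the JavaScript context, while small chunks match more often at the cost of more ambiguous placement.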
Debugging poor matches:
- If too few diagrams placed: The context from JavaScript may not match converted markdown
- If diagrams in wrong locations: Context text may appear in multiple locations
- If no diagrams: Repository may not contain mermaid diagrams
Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:216-331
Link Rewriting Implementation
The Link Rewriting Problem
DeepWiki uses absolute URLs like /owner/repo/2-1-subsection. The scraper must convert these to relative markdown paths that work in the mdBook file hierarchy:
output/markdown/
├── 1-overview.md
├── 2-main-section.md
├── section-2/
│ ├── 2-1-subsection.md
│ └── 2-2-another.md
└── 3-next-section.md
Links must account for:
- Source page location (main page vs. subsection)
- Target page location (main page vs. subsection)
- Same section vs. cross-section links
Link Rewriting Algorithm
Sources: tools/deepwiki-scraper.py:549-593
Link Rewriting Code Structure
The fix_wiki_link function at tools/deepwiki-scraper.py:549-589 implements this logic:
Input parsing:
Location detection:
Path generation rules:
| Source Location | Target Location | Generated Path | Example |
|---|---|---|---|
| Main page | Main page | file.md | 3-next.md |
| Main page | Subsection | section-N/file.md | section-2/2-1-sub.md |
| Subsection | Main page | ../file.md | ../3-next.md |
| Subsection (same section) | Subsection | file.md | 2-2-another.md |
| Subsection (diff section) | Subsection | section-N/file.md | section-3/3-1-sub.md |
The regex replacement at tools/deepwiki-scraper.py:592 applies this transformation to all links:
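A hedged sketch of the whole rewrite step. Instead of branching on each source and target combination, it computes the path with posixpath.relpath from the standard library, which is a swapped-in technique and not necessarily how fix_wiki_link works. The names page_path and rewrite_links, the link regex, and the rule that a slug starting with two numbers is a subsection are assumptions.

```python
import posixpath
import re

# Matches markdown link targets such as "](/owner/repo/2-1-subsection)".
WIKI_LINK = re.compile(r"\]\(/[^/\s)]+/[^/\s)]+/(\d+-[^/\s)#]+)\)")


def page_path(slug):
    """Map a slug like '2-1-subsection' to its path under output/markdown/."""
    match = re.match(r"^(\d+)-\d+-", slug)
    if match:
        # Two leading numbers: assumed to be a subsection of section N.
        return f"section-{match.group(1)}/{slug}.md"
    return f"{slug}.md"


def rewrite_links(markdown, source_path):
    """Rewrite absolute wiki links relative to the file at source_path."""
    source_dir = posixpath.dirname(source_path) or "."

    def fix(match):
        target = page_path(match.group(1))
        return "](" + posixpath.relpath(target, start=source_dir) + ")"

    return WIKI_LINK.sub(fix, markdown)
```

Because the path is resolved from the source file's directory, a link from one section directory into another comes out with a ../section-N/ prefix in this sketch.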
For detailed explanation, see Link Rewriting Logic.
Sources: tools/deepwiki-scraper.py:549-593
Auto-Detection Mechanisms
flowchart TD
Start["build-docs.sh starts"] --> CheckRepo{"REPO env var\nprovided?"}
CheckRepo -->|Yes| UseEnv["Use provided REPO value"]
CheckRepo -->|No| CheckGit{"Is current directory\na Git repository?\n(git rev-parse --git-dir)"}
CheckGit -->|Yes| GetRemote["Get remote.origin.url:\ngit config --get\nremote.origin.url"]
CheckGit -->|No| SetEmpty["Set REPO=<empty>"]
GetRemote --> HasRemote{"Remote URL\nfound?"}
HasRemote -->|Yes| ParseURL["Parse GitHub URL using sed regex:\nExtract owner/repo"]
HasRemote -->|No| SetEmpty
ParseURL --> ValidateFormat{"Format is\nowner/repo?"}
ValidateFormat -->|Yes| SetRepo["Set REPO variable"]
ValidateFormat -->|No| SetEmpty
SetEmpty --> FinalCheck{"REPO is empty?"}
UseEnv --> Continue["Continue with REPO"]
SetRepo --> Continue
FinalCheck -->|Yes| Error["ERROR: REPO must be set\nExit with code 1"]
FinalCheck -->|No| Continue
Git Remote Auto-Detection
When the REPO environment variable is not provided, the system attempts to auto-detect it from the Git repository in the current working directory.
Sources: build-docs.sh:8-37
Implementation Details
The auto-detection logic at build-docs.sh:8-19 handles multiple Git URL formats:
Supported URL formats:
- HTTPS: https://github.com/owner/repo.git
- HTTPS (no .git): https://github.com/owner/repo
- SSH: git@github.com:owner/repo.git
- SSH (no .git): git@github.com:owner/repo
The regex pattern .*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.* captures:
- [:/] - Matches either : (SSH) or / (HTTPS)
- ([^/]+/[^/\.]+) - Captures owner/repo (stops at / or .)
- (\.git)? - Optionally matches .git suffix
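The pattern can be exercised outside the script. A minimal Python sketch, assuming the sed expression behaves like the regular expression assembled from the three components listed above; repo_from_remote is a hypothetical helper name.

```python
import re

# Assembled from the components described above: separator, owner/repo, optional .git
GITHUB_REMOTE = re.compile(r".*github\.com[:/]([^/]+/[^/\.]+)(\.git)?.*")


def repo_from_remote(url):
    """Return 'owner/repo' for a GitHub remote URL, or '' when nothing matches."""
    match = GITHUB_REMOTE.fullmatch(url.strip())
    return match.group(1) if match else ""


if __name__ == "__main__":
    for remote in (
        "https://github.com/owner/repo.git",
        "https://github.com/owner/repo",
        "git@github.com:owner/repo.git",
        "git@github.com:owner/repo",
    ):
        print(remote, "->", repo_from_remote(remote))
```

Since the capture stops at the first dot, a repository whose name contains a dot (for example owner/my.repo) would be truncated to owner/my under this pattern. Setting REPO explicitly avoids that case.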
Derived defaults:
After determining REPO, the script derives other configuration at build-docs.sh:39-45:
This provides sensible defaults:
- BOOK_AUTHORS defaults to repository owner
- GIT_REPO_URL defaults to GitHub URL (for "Edit this page" links)
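The derivation can be illustrated in a few lines of Python. This is a sketch, not the shell code in build-docs.sh: derive_defaults is a hypothetical name, and the exact URL form https://github.com/owner/repo is an assumption about what "GitHub URL" means here.

```python
def derive_defaults(repo, env=None):
    """Fill in BOOK_AUTHORS and GIT_REPO_URL from REPO unless already set."""
    env = env or {}
    owner = repo.split("/", 1)[0]
    return {
        # Explicit values win; otherwise fall back to values derived from REPO.
        "BOOK_AUTHORS": env.get("BOOK_AUTHORS") or owner,
        "GIT_REPO_URL": env.get("GIT_REPO_URL") or f"https://github.com/{repo}",
    }
```

An explicitly provided value always takes precedence, so the defaults only apply when a variable is unset or empty.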
For detailed explanation, see Auto-Detection Features.
Sources: build-docs.sh:8-45 README.md:47-53
Performance Considerations
Build Time Breakdown
Typical build times for a medium-sized repository (50-100 pages):
| Phase | Time | Bottleneck |
|---|---|---|
| Phase 1: Scraping | 60-120s | Network requests + 1s delays |
| Phase 2: Diagrams | 5-10s | Regex matching + file I/O |
| Phase 3: mdBook | 10-20s | mdBook rendering + mermaid assets |
| Total | 75-150s | Network + computation |
Optimization Strategies
Network optimization:
- The scraper includes time.sleep(1) at tools/deepwiki-scraper.py:872 between pages
- Retry logic with exponential backoff at tools/deepwiki-scraper.py:33-42
- HTTP session reuse via requests.Session() at tools/deepwiki-scraper.py:818-821
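Retry with exponential backoff can be sketched independently of any HTTP library. The retry count, the base delay, and the function name below are assumptions rather than the scraper's actual settings; the fetch callable and the injectable sleep exist so the behavior can be checked without network access.

```python
import time


def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failed attempts with doubling delays."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

With the requests library, fetch could be the get method of a shared requests.Session() so that connections are reused across pages.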
Markdown-only mode:
- Skips Phase 3 entirely, reducing build time by ~15-25%
- Useful for content-only iterations
Docker build optimization:
- Multi-stage build discards Rust toolchain (~1.5 GB)
- Final image only contains binaries (~300-400 MB)
- See Docker Multi-Stage Build for details
Caching considerations:
- No internal caching: each run fetches fresh content
- DeepWiki serves dynamic content (no cache headers)
- Docker layer caching helps with repeated image builds
Sources: tools/deepwiki-scraper.py:28-42 tools/deepwiki-scraper.py:817-821 tools/deepwiki-scraper.py:872
Extending the System
Adding New Output Formats
The system's three-phase architecture makes it easy to add new output formats:
Integration points:
- Before Phase 3: Add code after build-docs.sh:188 to read from $WIKI_DIR
- Alternative Phase 3: Replace build-docs.sh:174-176 with custom builder
- Post-processing: Add steps after build-docs.sh:192 to transform mdBook output
Example: Adding PDF export:
Sources: build-docs.sh:174-206
Customizing Diagram Matching
The fuzzy matching algorithm can be tuned by modifying the chunk sizes at tools/deepwiki-scraper.py:716:
Matching strategy customization:
The scoring system at tools/deepwiki-scraper.py:709-745 prioritizes:
- Anchor text matching (weighted by chunk size)
- Heading matching (weight: 50)
You can add additional heuristics by modifying the scoring logic or adding new matching strategies.
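One way to picture such a scoring scheme is the sketch below. Only the heading weight of 50 comes from the text above. Treating the matched chunk's length as its weight, and the names score_candidate and best_line, are assumptions made for illustration.

```python
HEADING_WEIGHT = 50  # weight quoted above for heading matches


def score_candidate(line, anchor_chunk, heading):
    """Score one markdown line as a possible insertion point for a diagram."""
    score = 0
    text = line.lower()
    if anchor_chunk and anchor_chunk.lower() in text:
        # Assumed weighting: a longer matched chunk counts for more.
        score += len(anchor_chunk)
    if heading and line.lstrip().startswith("#"):
        if text.lstrip("# ").strip() == heading.lower().strip():
            score += HEADING_WEIGHT
    return score


def best_line(lines, anchor_chunk, heading):
    """Return the index of the highest-scoring line, or None if nothing matches."""
    if not lines:
        return None
    # Ties go to the earliest line because of the negated index.
    scored = [(score_candidate(line, anchor_chunk, heading), -i)
              for i, line in enumerate(lines)]
    top_score, neg_index = max(scored)
    return -neg_index if top_score > 0 else None
```

A new heuristic would be another conditional inside score_candidate with its own weight, which keeps the selection logic in best_line unchanged.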
Sources: tools/deepwiki-scraper.py:596-788 tools/deepwiki-scraper.py:716-745
Adding New Content Cleaners
The HTML-to-markdown conversion can be enhanced by adding custom cleaners at tools/deepwiki-scraper.py:489-511:
The footer cleaner at tools/deepwiki-scraper.py:127-173 can be extended with additional patterns:
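As an illustration of pattern-based footer cleaning, the sketch below strips trailing lines that match any of a list of regular expressions. The default patterns and the clean_footer signature are hypothetical placeholders, not the scraper's real list.

```python
import re

# Hypothetical examples of footer residue; replace with the patterns you observe.
DEFAULT_FOOTER_PATTERNS = [
    r"^\s*Dismiss\s*$",
    r"Refresh this wiki",
]


def clean_footer(markdown, extra_patterns=()):
    """Remove trailing blank lines and lines matching a footer pattern."""
    patterns = [re.compile(p, re.IGNORECASE)
                for p in (*DEFAULT_FOOTER_PATTERNS, *extra_patterns)]
    lines = markdown.rstrip().split("\n")
    # Walk up from the end and stop at the first line of real content.
    while lines and (not lines[-1].strip()
                     or any(p.search(lines[-1]) for p in patterns)):
        lines.pop()
    return "\n".join(lines) + "\n"
```

Extending the cleaner then means passing extra_patterns, for example clean_footer(text, extra_patterns=[r"^Powered by"]).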
Sources: tools/deepwiki-scraper.py:127-173 tools/deepwiki-scraper.py:466-511
Common Advanced Scenarios
CI/CD Integration
GitHub Actions example:
The auto-detection at build-docs.sh:8-19 determines REPO from the Git context. Setting BOOK_TITLE overrides the default title.
Sources: build-docs.sh:8-45 README.md:228-232
Multi-Repository Builds
Build documentation for multiple repositories in parallel:
Each build runs in an isolated container with separate output directories.
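A sketch of a parallel driver in Python, under several assumptions: the image tag deepwiki-to-mdbook is hypothetical, the container is assumed to write to /output, and each repository gets a host directory named after it. Only standard docker run flags (--rm, -e, -v) are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

IMAGE = "deepwiki-to-mdbook"  # hypothetical image tag; use your own build


def docker_command(repo, output_root):
    """Build the docker run argument list for one repository."""
    out_dir = Path(output_root).resolve() / repo.replace("/", "-")
    return [
        "docker", "run", "--rm",
        "-e", f"REPO={repo}",
        "-v", f"{out_dir}:/output",  # assumed container output path
        IMAGE,
    ]


def build_all(repos, output_root="output", max_workers=2):
    """Run one container per repository and return exit codes by repo."""
    repos = list(repos)

    def run(repo):
        return subprocess.run(docker_command(repo, output_root)).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(repos, pool.map(run, repos)))
```

Keeping max_workers low is a reasonable default because every build scrapes the same site.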
Sources: build-docs.sh:21-53 README.md:200-207
Custom Theming
Override the mdBook theme by modifying the generated book.toml template at build-docs.sh:85-103:
Or inject custom CSS:
Sources: build-docs.sh:84-103