Three-Phase Pipeline
Purpose and Scope
This document describes the three-phase processing pipeline that transforms DeepWiki HTML pages into searchable mdBook documentation. The pipeline consists of Phase 1: Clean Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. Each phase has distinct responsibilities and uses a different technology stack.
For overall system architecture, see System Architecture. For detailed implementation of individual phases, see Phase 1: Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. For configuration that affects pipeline behavior, see Configuration Reference.
Pipeline Overview
The system processes content through three sequential phases, with an optional bypass mechanism for Phase 3.
Pipeline Execution Flow
stateDiagram-v2
[*] --> Initialize
Initialize --> Phase1 : Start build-docs.sh
state "Phase 1 : Markdown Extraction" as Phase1 {
[*] --> extract_wiki_structure
extract_wiki_structure --> extract_page_content : For each page
extract_page_content --> convert_html_to_markdown
convert_html_to_markdown --> WriteTemp : Write to /workspace/wiki
WriteTemp --> [*]
}
Phase1 --> CheckMode : deepwiki-scraper.py complete
state CheckMode <<choice>>
CheckMode --> Phase2 : MARKDOWN_ONLY=false
CheckMode --> CopyOutput : MARKDOWN_ONLY=true
state "Phase 2 : Diagram Enhancement" as Phase2 {
[*] --> extract_and_enhance_diagrams
extract_and_enhance_diagrams --> ExtractJS : Fetch JS payload
ExtractJS --> FuzzyMatch : ~461 diagrams found
FuzzyMatch --> InjectDiagrams : ~48 placed
InjectDiagrams --> [*] : Update temp files
}
Phase2 --> Phase3 : Enhancement complete
state "Phase 3 : mdBook Build" as Phase3 {
[*] --> CreateBookToml : build-docs.sh
CreateBookToml --> GenerateSummary : book.toml created
GenerateSummary --> CopyToSrc : SUMMARY.md generated
CopyToSrc --> MdbookMermaidInstall : Copy to /workspace/book/src
MdbookMermaidInstall --> MdbookBuild : Install assets
MdbookBuild --> [*] : HTML in /workspace/book/book
}
Phase3 --> CopyOutput
CopyOutput --> [*] : Copy to /output
Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:790-919
Phase Coordination
The build-docs.sh orchestrator coordinates all three phases and handles the decision point for markdown-only mode.
Orchestrator Control Flow
flowchart TD
Start[/"docker run with env vars"/]
Start --> ParseEnv["Parse environment variables\nREPO, BOOK_TITLE, MARKDOWN_ONLY"]
ParseEnv --> ValidateRepo{"REPO set?"}
ValidateRepo -->|No| AutoDetect["git config --get remote.origin.url\nExtract owner/repo"]
ValidateRepo -->|Yes| CallScraper
AutoDetect --> CallScraper
CallScraper["python3 /usr/local/bin/deepwiki-scraper.py\nArgs: REPO, /workspace/wiki"]
CallScraper --> ScraperPhase1["Phase 1: extract_wiki_structure()\nextract_page_content()\nWrite to temp directory"]
ScraperPhase1 --> ScraperPhase2["Phase 2: extract_and_enhance_diagrams()\nFuzzy match and inject\nUpdate temp files"]
ScraperPhase2 --> CheckMarkdownOnly{"MARKDOWN_ONLY\n== true?"}
CheckMarkdownOnly -->|Yes| CopyMdOnly["cp -r /workspace/wiki/* /output/markdown/\nExit"]
CheckMarkdownOnly -->|No| InitMdBook
InitMdBook["mkdir -p /workspace/book\nGenerate book.toml"]
InitMdBook --> GenSummary["Generate src/SUMMARY.md\nScan /workspace/wiki/*.md\nBuild table of contents"]
GenSummary --> CopyToSrc["cp -r /workspace/wiki/* src/"]
CopyToSrc --> InstallMermaid["mdbook-mermaid install /workspace/book"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["cp -r book /output/\ncp -r /workspace/wiki/* /output/markdown/\ncp book.toml /output/"]
CopyMdOnly --> End[/"Exit with outputs in /output"/]
CopyOutputs --> End
Sources: build-docs.sh:8-76 build-docs.sh:78-206
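The auto-detection branch shown above can be expressed as a short Python sketch (the orchestrator itself does the equivalent with git config and shell string manipulation; the regex here is illustrative):

```python
import re
import subprocess

def detect_repo() -> str:
    """Sketch of REPO auto-detection: derive owner/repo from the git remote URL."""
    url = subprocess.run(
        ["git", "config", "--get", "remote.origin.url"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Handles both git@github.com:owner/repo.git and https://github.com/owner/repo(.git)
    match = re.search(r'[:/]([^/:]+/[^/]+?)(\.git)?$', url)
    if not match:
        raise ValueError(f"Cannot determine owner/repo from {url}")
    return match.group(1)
```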
Phase 1: Clean Markdown Extraction
Phase 1 discovers the wiki structure and converts HTML pages to clean Markdown, writing files to a temporary directory (/workspace/wiki). This phase is implemented entirely in Python within deepwiki-scraper.py.
Phase 1 Data Flow
flowchart LR
DeepWiki["https://deepwiki.com/\nowner/repo"]
DeepWiki -->|HTTP GET| extract_wiki_structure
extract_wiki_structure["extract_wiki_structure()\nParse sidebar links\nBuild page list"]
extract_wiki_structure --> PageList["pages = [\n {number, title, url, href, level},\n ...\n]"]
PageList --> Loop["For each page"]
Loop --> extract_page_content["extract_page_content(url, session)\nFetch HTML\nRemove nav/footer elements"]
extract_page_content --> BeautifulSoup["BeautifulSoup(response.text)\nFind article/main/body\nRemove DeepWiki UI"]
BeautifulSoup --> convert_html_to_markdown["convert_html_to_markdown(html)\nhtml2text.HTML2Text()\nbody_width=0"]
convert_html_to_markdown --> clean_deepwiki_footer["clean_deepwiki_footer(markdown)\nRemove footer patterns"]
clean_deepwiki_footer --> FixLinks["Fix internal links\nRegex: /owner/repo/N-title\nConvert to relative .md paths"]
FixLinks --> WriteTempFile["Write to /workspace/wiki/\nMain: N-title.md\nSubsection: section-N/N-M-title.md"]
WriteTempFile --> Loop
style extract_wiki_structure fill:#f9f9f9
style extract_page_content fill:#f9f9f9
style convert_html_to_markdown fill:#f9f9f9
style clean_deepwiki_footer fill:#f9f9f9
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-216 tools/deepwiki-scraper.py:127-173
Key Functions and Their Roles
| Function | File Location | Responsibility |
|---|---|---|
| extract_wiki_structure() | tools/deepwiki-scraper.py:78-125 | Discover all pages by parsing sidebar links with pattern /repo/\d+ |
| extract_page_content() | tools/deepwiki-scraper.py:453-594 | Fetch individual page, parse HTML, remove navigation elements |
| convert_html_to_markdown() | tools/deepwiki-scraper.py:175-216 | Convert HTML to Markdown using html2text with body_width=0 |
| clean_deepwiki_footer() | tools/deepwiki-scraper.py:127-173 | Remove DeepWiki UI elements using regex pattern matching |
| sanitize_filename() | tools/deepwiki-scraper.py:21-25 | Convert page titles to safe filenames |
| fix_wiki_link() | tools/deepwiki-scraper.py:549-589 | Rewrite internal links to relative .md paths |
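As a concrete illustration of the conversion step, here is a minimal sketch of the html2text configuration used by convert_html_to_markdown() (body_width=0 disables hard wrapping); the real function performs additional cleanup before and after this call:

```python
import html2text

def convert_html_to_markdown(html: str) -> str:
    """Minimal sketch: convert a cleaned HTML fragment to Markdown."""
    h = html2text.HTML2Text()
    h.body_width = 0  # 0 disables hard line wrapping, keeping paragraphs on single lines
    return h.handle(html)
```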
File Organization Logic
flowchart TD
PageNum["page['number']"]
PageNum --> CheckLevel{"page['level']\n== 0?"}
CheckLevel -->|Yes main page| RootFile["Filename: N-title.md\nPath: /workspace/wiki/N-title.md\nExample: 2-quick-start.md"]
CheckLevel -->|No subsection| ExtractMain["Extract main section\nmain_section = number.split('.')[0]"]
ExtractMain --> SubDir["Create directory\nsection-{main_section}/"]
SubDir --> SubFile["Filename: N-M-title.md\nPath: section-N/N-M-title.md\nExample: section-2/2-1-installation.md"]
The system organizes files hierarchically based on page numbering, as shown in the diagram above.
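A minimal Python sketch of this path selection, using a hypothetical sanitize_filename stand-in for the real helper:

```python
from pathlib import Path
import re

def sanitize_filename(title: str) -> str:
    # hypothetical stand-in: lowercase, keep alphanumerics, join with hyphens
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')

def output_path(temp_dir: Path, number: str, title: str, level: int) -> Path:
    slug = sanitize_filename(title)
    name = f"{number.replace('.', '-')}-{slug}.md"
    if level == 0:
        # Main page: N-title.md at the root of the temp directory
        return temp_dir / name
    # Subsection: section-N/N-M-title.md, grouped by main section number
    main_section = number.split('.')[0]
    return temp_dir / f"section-{main_section}" / name

# Example: output_path(Path("/workspace/wiki"), "2.1", "Installation", 1)
# -> /workspace/wiki/section-2/2-1-installation.md
```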
Sources: tools/deepwiki-scraper.py:849-860
Phase 2: Diagram Enhancement
Phase 2 extracts Mermaid diagrams from the JavaScript payload and uses fuzzy matching to intelligently place them in the appropriate Markdown files. This phase operates on files in the temporary directory (/workspace/wiki).
Phase 2 Algorithm Flow
flowchart TD
Start["extract_and_enhance_diagrams(repo, temp_dir, session)"]
Start --> FetchJS["GET https://deepwiki.com/owner/repo/1-overview\nExtract response.text"]
FetchJS --> ExtractAll["Regex: ```mermaid\\n(.*?)```\nFind all diagram blocks"]
ExtractAll --> CountTotal["all_diagrams list\n(~461 total)"]
CountTotal --> ExtractContext["Regex: ([^`]{500,}?)```mermaid\\n(.*?)```\nExtract 500-char context before each"]
ExtractContext --> Unescape["For each diagram:\nUnescape newlines, tabs, quotes\nUnescape HTML entities"]
Unescape --> BuildContext["diagram_contexts = [\n {\n last_heading: str,\n anchor_text: str (last 300 chars),\n diagram: str\n },\n ...\n]\n(~48 with context)"]
BuildContext --> ScanFiles["For each .md file in temp_dir.glob('**/*.md')"]
ScanFiles --> SkipExisting{"File contains\n'```mermaid'?"}
SkipExisting -->|Yes| ScanFiles
SkipExisting -->|No| NormalizeContent
NormalizeContent["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeContent --> MatchLoop["For each diagram in diagram_contexts"]
MatchLoop --> TryChunks["Try chunk sizes: [300, 200, 150, 100, 80]\ntest_chunk = anchor_normalized[-chunk_size:]\npos = content_normalized.find(test_chunk)"]
TryChunks --> FoundMatch{"Match found?"}
FoundMatch -->|Yes| ConvertToLine["Convert char position to line number\nScan through lines counting chars"]
FoundMatch -->|No| TryHeading["Try heading match\nCompare normalized heading text"]
TryHeading --> FoundMatch2{"Match found?"}
FoundMatch2 -->|Yes| ConvertToLine
FoundMatch2 -->|No| MatchLoop
ConvertToLine --> FindInsertPoint["Find insertion point:\nIf heading: skip blank lines, skip paragraph\nIf paragraph: find end of paragraph"]
FindInsertPoint --> QueueInsert["pending_insertions.append(\n (insert_line, diagram, score, idx)\n)"]
QueueInsert --> MatchLoop
MatchLoop --> InsertDiagrams["Sort by line number (reverse)\nInsert from bottom up:\nlines.insert(pos, '')\nlines.insert(pos, '```')\nlines.insert(pos, diagram)\nlines.insert(pos, '```mermaid')\nlines.insert(pos, '')"]
InsertDiagrams --> WriteFile["Write enhanced file back to disk"]
WriteFile --> ScanFiles
ScanFiles --> Complete["Return to orchestrator"]
Sources: tools/deepwiki-scraper.py:596-788
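The queue-and-insert step at the end of the diagram can be sketched as follows (simplified: the real code also carries a match score and diagram index per insertion):

```python
def insert_diagrams(lines, pending_insertions):
    """Insert queued mermaid blocks bottom-up so earlier insertion points stay valid."""
    # pending_insertions: list of (insert_line, diagram_text) pairs
    for insert_line, diagram in sorted(pending_insertions, key=lambda p: p[0], reverse=True):
        block = ['', '```mermaid', *diagram.splitlines(), '```', '']
        lines[insert_line:insert_line] = block
    return lines
```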
Fuzzy Matching Algorithm
The algorithm uses progressive chunk sizes to find the best match location for each diagram:
Sources: tools/deepwiki-scraper.py:716-730 tools/deepwiki-scraper.py:732-745
flowchart LR
Anchor["anchor_text\n(300 chars from JS context)"]
Anchor --> Normalize1["Normalize:\nlowercase\ncollapse whitespace"]
Content["markdown file content"]
Content --> Normalize2["Normalize:\nlowercase\ncollapse whitespace"]
Normalize1 --> Try300["Try 300-char chunk\ntest_chunk = anchor[-300:]"]
Normalize2 --> Try300
Try300 --> Found300{"Found?"}
Found300 -->|Yes| Match300["best_match_score = 300"]
Found300 -->|No| Try200["Try 200-char chunk"]
Try200 --> Found200{"Found?"}
Found200 -->|Yes| Match200["best_match_score = 200"]
Found200 -->|No| Try150["Try 150-char chunk"]
Try150 --> Found150{"Found?"}
Found150 -->|Yes| Match150["best_match_score = 150"]
Found150 -->|No| Try100["Try 100-char chunk"]
Try100 --> Found100{"Found?"}
Found100 -->|Yes| Match100["best_match_score = 100"]
Found100 -->|No| Try80["Try 80-char chunk"]
Try80 --> Found80{"Found?"}
Found80 -->|Yes| Match80["best_match_score = 80"]
Found80 -->|No| TryHeading["Fallback: heading match"]
TryHeading --> FoundH{"Found?"}
FoundH -->|Yes| Match50["best_match_score = 50"]
FoundH -->|No| NoMatch["No match\nSkip this diagram"]
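A minimal Python sketch of this progressive search over normalized text (a hypothetical helper, not the exact code in deepwiki-scraper.py):

```python
def find_anchor(anchor_text, content):
    """Search progressively smaller suffixes of the anchor text in the content.

    Returns (char_position, match_score), or None if the caller should fall
    back to heading matching (score 50) or skip the diagram.
    """
    def normalize(text):
        return ' '.join(text.lower().split())

    anchor = normalize(anchor_text)
    body = normalize(content)
    for chunk_size in (300, 200, 150, 100, 80):
        test_chunk = anchor[-chunk_size:]
        pos = body.find(test_chunk)
        if pos != -1:
            return pos, chunk_size  # best_match_score equals the matched chunk size
    return None
```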
Diagram Extraction from JavaScript
Diagrams are extracted from the Next.js JavaScript payload using two strategies:
Extraction Strategies
| Strategy | Pattern | Description |
|---|---|---|
| Fenced blocks | ```mermaid\\n(.*?)``` | Primary strategy: extract code blocks with escaped newlines |
| JavaScript strings | "graph TD..." | Fallback: find Mermaid start keywords in quoted strings |
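A sketch of the primary strategy, assuming the Next.js payload has already been fetched as a string; the pattern targets fenced blocks whose newlines appear as literal \n escapes:

```python
import re

def extract_fenced_mermaid(payload_text):
    """Primary strategy: fenced mermaid blocks whose newlines are escaped in the JS payload."""
    return re.findall(r'```mermaid\\n(.*?)```', payload_text, re.DOTALL)
```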
The function extract_mermaid_from_nextjs_data() at tools/deepwiki-scraper.py:218-331 handles unescaping:
block = block.replace('\\n', '\n')
block = block.replace('\\t', '\t')
block = block.replace('\\"', '"')
block = block.replace('\\\\', '\\')
block = block.replace('\\u003c', '<')
block = block.replace('\\u003e', '>')
block = block.replace('\\u0026', '&')
Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:615-646
Phase 3: mdBook Build
Phase 3 generates mdBook configuration, creates the table of contents, and builds the final HTML documentation. This phase is orchestrated by build-docs.sh and invokes Rust tools (mdbook, mdbook-mermaid).
Phase 3 Component Interactions
flowchart TD
Start["Phase 3 entry point\n(build-docs.sh:78)"]
Start --> MkdirBook["mkdir -p /workspace/book\ncd /workspace/book"]
MkdirBook --> GenToml["Generate book.toml:\n[book]\ntitle, authors, language\n[output.html]\ndefault-theme=rust\ngit-repository-url\n[preprocessor.mermaid]"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> GenSummary["Generate src/SUMMARY.md"]
GenSummary --> ScanRoot["Scan /workspace/wiki/*.md\nFind first page for intro"]
ScanRoot --> ProcessMain["For each main page:\nExtract title from first line\nCheck for section-N/ subdirectory"]
ProcessMain --> HasSubs{"Has\nsubsections?"}
HasSubs -->|Yes| WriteSection["Write to SUMMARY.md:\n# Title\n- [Title](N-title.md)\n - [Subtitle](section-N/N-M-title.md)"]
HasSubs -->|No| WriteStandalone["Write to SUMMARY.md:\n- [Title](N-title.md)"]
WriteSection --> ProcessMain
WriteStandalone --> ProcessMain
ProcessMain --> CopySrc["cp -r /workspace/wiki/* src/"]
CopySrc --> InstallMermaid["mdbook-mermaid install /workspace/book\nInstalls mermaid.min.js\nInstalls mermaid-init.js\nUpdates book.toml"]
InstallMermaid --> MdbookBuild["mdbook build\nReads src/SUMMARY.md\nProcesses all .md files\nApplies rust theme\nGenerates book/index.html\nGenerates book/*/index.html"]
MdbookBuild --> CopyOut["Copy outputs:\ncp -r book /output/\ncp -r /workspace/wiki/* /output/markdown/\ncp book.toml /output/"]
Sources: build-docs.sh:78-206
book.toml Generation
The orchestrator dynamically generates book.toml with runtime configuration:
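A Python sketch of the configuration the orchestrator emits (the real file is written by a shell heredoc in build-docs.sh; the GitHub URL derivation and the default values shown here are assumptions):

```python
import os
from pathlib import Path

def write_book_toml(book_dir: Path) -> None:
    """Sketch of the book.toml generated at runtime (field set per the diagram above)."""
    title = os.environ.get("BOOK_TITLE", "Documentation")  # assumed fallback
    repo = os.environ.get("REPO", "owner/repo")
    book_toml = f"""[book]
title = "{title}"
authors = ["DeepWiki"]  # placeholder; the real value comes from the environment
language = "en"

[output.html]
default-theme = "rust"
git-repository-url = "https://github.com/{repo}"  # assumption: GitHub URL derived from REPO

[preprocessor.mermaid]
command = "mdbook-mermaid"
"""
    (book_dir / "book.toml").write_text(book_toml)
```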
Sources: build-docs.sh:84-103
SUMMARY.md Generation Algorithm
The table of contents is generated by scanning the actual file structure in /workspace/wiki:
flowchart TD
Start["Generate SUMMARY.md"]
Start --> FindFirst["first_page = ls /workspace/wiki/*.md | head -1\nExtract title from first line\nWrite: [Title](filename)"]
FindFirst --> LoopMain["For each /workspace/wiki/*.md excluding first_page"]
LoopMain --> ExtractNum["section_num = filename.match /^[0-9]+/"]
ExtractNum --> CheckDir{"section-{num}/ exists?"}
CheckDir -->|Yes| WriteSectionHeader["Write: # {title}\n- [{title}]({filename})"]
WriteSectionHeader --> LoopSubs["For each section-{num}/*.md"]
LoopSubs --> WriteSubitem["Write: - [{subtitle}](section-{num}/{subfilename})"]
WriteSubitem --> LoopSubs
LoopSubs --> LoopMain
CheckDir -->|No| WriteStandalone["Write:\n- [{title}]({filename})"]
WriteStandalone --> LoopMain
LoopMain --> Complete["SUMMARY.md complete\ngrep -c '\\[' to count entries"]
Sources: build-docs.sh:108-162
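An equivalent Python sketch of this scan (the real implementation is shell in build-docs.sh; first_line_title is a hypothetical helper that reads each page's title from its first line):

```python
from pathlib import Path

def first_line_title(md_file: Path) -> str:
    # Title is the first line of the file, with any leading '#' markers stripped
    return md_file.read_text().splitlines()[0].lstrip('#').strip()

def generate_summary(wiki_dir: Path) -> str:
    lines = ["# Summary", ""]
    pages = sorted(wiki_dir.glob("*.md"))
    first_page, rest = pages[0], pages[1:]
    lines.append(f"[{first_line_title(first_page)}]({first_page.name})")
    for page in rest:
        title = first_line_title(page)
        section_num = page.name.split('-')[0]  # leading page number, e.g. "2"
        subdir = wiki_dir / f"section-{section_num}"
        if subdir.is_dir():
            # Section with subsections: heading plus nested entries
            lines += ["", f"# {title}", f"- [{title}]({page.name})"]
            for sub in sorted(subdir.glob("*.md")):
                lines.append(f"  - [{first_line_title(sub)}](section-{section_num}/{sub.name})")
        else:
            lines.append(f"- [{title}]({page.name})")
    return "\n".join(lines) + "\n"
```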
mdBook and mdbook-mermaid Execution
The build process invokes two Rust binaries:
| Command | Purpose | Output |
|---|---|---|
| mdbook-mermaid install $BOOK_DIR | Install Mermaid.js assets and update book.toml | mermaid.min.js, mermaid-init.js in book/ |
| mdbook build | Parse SUMMARY.md, process Markdown, generate HTML | HTML files in /workspace/book/book/ |
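Expressed as Python subprocess calls (the orchestrator invokes these binaries directly from the shell):

```python
import subprocess

BOOK_DIR = "/workspace/book"

# Install mermaid.min.js / mermaid-init.js and register the preprocessor in book.toml
subprocess.run(["mdbook-mermaid", "install", BOOK_DIR], check=True)

# Read src/SUMMARY.md and render the HTML site into /workspace/book/book/
subprocess.run(["mdbook", "build"], cwd=BOOK_DIR, check=True)
```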
The mdbook binary:
- Reads src/SUMMARY.md to determine structure
- Processes each Markdown file referenced in SUMMARY.md
- Applies the rust theme specified in book.toml
- Generates the navigation sidebar
- Adds search functionality
- Creates "Edit this page" links using git-repository-url
Sources: build-docs.sh:169-176
Data Transformation Summary
Each phase transforms data in specific ways:
| Phase | Input Format | Processing | Output Format |
|---|---|---|---|
| Phase 1 | HTML pages from DeepWiki | BeautifulSoup parsing, html2text conversion, link rewriting | Clean Markdown in /workspace/wiki/ |
| Phase 2 | Markdown files + JavaScript payload | Regex extraction, fuzzy matching, diagram injection | Enhanced Markdown in /workspace/wiki/ (modified in place) |
| Phase 3 | Markdown files + environment variables | book.toml generation, SUMMARY.md generation, mdbook build | HTML site in /workspace/book/book/ |
Final Output Structure:
/output/
├── book/ # HTML documentation site
│ ├── index.html
│ ├── 1-overview.html
│ ├── section-2/
│ │ └── 2-1-subsection.html
│ ├── mermaid.min.js
│ ├── mermaid-init.js
│ └── ...
├── markdown/ # Source Markdown files
│ ├── 1-overview.md
│ ├── section-2/
│ │ └── 2-1-subsection.md
│ └── ...
└── book.toml # mdBook configuration
Sources: build-docs.sh:178-205 README.md:89-119
Conditional Execution: MARKDOWN_ONLY Mode
The MARKDOWN_ONLY environment variable allows bypassing Phase 3 for faster iteration during development:
flowchart TD
Phase1["Phase 1: Extraction\n(deepwiki-scraper.py)"]
Phase2["Phase 2: Enhancement\n(deepwiki-scraper.py)"]
Phase1 --> Phase2
Phase2 --> Check{"MARKDOWN_ONLY\n== true?"}
Check -->|Yes| FastPath["cp -r /workspace/wiki/* /output/markdown/\nExit (fast path)"]
Check -->|No| Phase3["Phase 3: mdBook Build\n(build-docs.sh)"]
Phase3 --> FullOutput["Copy book/ and markdown/ to /output/\nExit (full build)"]
FastPath --> End[/"Build complete"/]
FullOutput --> End
When MARKDOWN_ONLY=true:
- Execution time: ~30-60 seconds (scraping + diagram matching only)
- Output: /output/markdown/ only
- Use case: Debugging diagram placement, testing content extraction
When MARKDOWN_ONLY=false (default):
- Execution time: ~60-120 seconds (full pipeline)
- Output: /output/book/, /output/markdown/, /output/book.toml
- Use case: Production documentation builds
Sources: build-docs.sh:60-76 README.md:55-76