Three-Phase Pipeline
Purpose and Scope
This document describes the three-phase processing pipeline that transforms DeepWiki HTML pages into searchable mdBook documentation. The pipeline consists of Phase 1: Clean Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. Each phase has distinct responsibilities and uses a different technology stack.
For overall system architecture, see System Architecture. For detailed implementation of individual phases, see Phase 1: Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. For configuration that affects pipeline behavior, see Configuration Reference.
Pipeline Overview
The system processes content through three sequential phases, with an optional bypass mechanism for Phase 3.
Pipeline Execution Flow
stateDiagram-v2
[*] --> Initialize
Initialize --> Phase1 : Start build-docs.sh
state "Phase 1 : Markdown Extraction" as Phase1 {
[*] --> extract_wiki_structure
extract_wiki_structure --> extract_page_content : For each page
extract_page_content --> convert_html_to_markdown
convert_html_to_markdown --> WriteTemp : Write to /workspace/wiki
WriteTemp --> [*]
}
Phase1 --> CheckMode : deepwiki-scraper.py complete
state CheckMode <<choice>>
CheckMode --> Phase2 : MARKDOWN_ONLY=false
CheckMode --> CopyOutput : MARKDOWN_ONLY=true
state "Phase 2 : Diagram Enhancement" as Phase2 {
[*] --> extract_and_enhance_diagrams
extract_and_enhance_diagrams --> ExtractJS : Fetch JS payload
ExtractJS --> FuzzyMatch : ~461 diagrams found
FuzzyMatch --> InjectDiagrams : ~48 placed
InjectDiagrams --> [*] : Update temp files
}
Phase2 --> Phase3 : Enhancement complete
state "Phase 3 : mdBook Build" as Phase3 {
[*] --> CreateBookToml : build-docs.sh
CreateBookToml --> GenerateSummary : book.toml created
GenerateSummary --> CopyToSrc : SUMMARY.md generated
CopyToSrc --> MdbookMermaidInstall : Copy to /workspace/book/src
MdbookMermaidInstall --> MdbookBuild : Install assets
MdbookBuild --> [*] : HTML in /workspace/book/book
}
Phase3 --> CopyOutput
CopyOutput --> [*] : Copy to /output
Sources: build-docs.sh:1-206 tools/deepwiki-scraper.py:790-919
Phase Coordination
The build-docs.sh orchestrator coordinates all three phases and handles the decision point for markdown-only mode.
Orchestrator Control Flow
flowchart TD
Start[/"docker run with env vars"/]
Start --> ParseEnv["Parse environment variables\nREPO, BOOK_TITLE, MARKDOWN_ONLY"]
ParseEnv --> ValidateRepo{"REPO set?"}
ValidateRepo -->|No| AutoDetect["git config --get remote.origin.url\nExtract owner/repo"]
ValidateRepo -->|Yes| CallScraper
AutoDetect --> CallScraper
CallScraper["python3 /usr/local/bin/deepwiki-scraper.py\nArgs: REPO, /workspace/wiki"]
CallScraper --> ScraperPhase1["Phase 1: extract_wiki_structure()\nextract_page_content()\nWrite to temp directory"]
ScraperPhase1 --> ScraperPhase2["Phase 2: extract_and_enhance_diagrams()\nFuzzy match and inject\nUpdate temp files"]
ScraperPhase2 --> CheckMarkdownOnly{"MARKDOWN_ONLY\n== true?"}
CheckMarkdownOnly -->|Yes| CopyMdOnly["cp -r /workspace/wiki/* /output/markdown/\nExit"]
CheckMarkdownOnly -->|No| InitMdBook
InitMdBook["mkdir -p /workspace/book\nGenerate book.toml"]
InitMdBook --> GenSummary["Generate src/SUMMARY.md\nScan /workspace/wiki/*.md\nBuild table of contents"]
GenSummary --> CopyToSrc["cp -r /workspace/wiki/* src/"]
CopyToSrc --> InstallMermaid["mdbook-mermaid install /workspace/book"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["cp -r book /output/\ncp -r /workspace/wiki/* /output/markdown/\ncp book.toml /output/"]
CopyMdOnly --> End[/"Exit with outputs in /output"/]
CopyOutputs --> End
Sources: build-docs.sh:8-76 build-docs.sh:78-206
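The auto-detection branch shown above can be expressed as a short Python sketch (the orchestrator itself does the equivalent with git config and shell string manipulation; the regex here is illustrative):

```python
import re
import subprocess

def detect_repo() -> str:
    """Sketch of REPO auto-detection: derive owner/repo from the git remote URL."""
    url = subprocess.run(
        ["git", "config", "--get", "remote.origin.url"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Handles both git@github.com:owner/repo.git and https://github.com/owner/repo(.git)
    match = re.search(r'[:/]([^/:]+/[^/]+?)(\.git)?$', url)
    if not match:
        raise ValueError(f"Cannot determine owner/repo from {url}")
    return match.group(1)
```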
Phase 1: Clean Markdown Extraction
Phase 1 discovers the wiki structure and converts HTML pages to clean Markdown, writing files to a temporary directory (/workspace/wiki). This phase is implemented entirely in Python within deepwiki-scraper.py.
Phase 1 Data Flow
flowchart LR
DeepWiki["https://deepwiki.com/\nowner/repo"]
DeepWiki -->|HTTP GET| extract_wiki_structure
extract_wiki_structure["extract_wiki_structure()\nParse sidebar links\nBuild page list"]
extract_wiki_structure --> PageList["pages = [\n {number, title, url, href, level},\n ...\n]"]
PageList --> Loop["For each page"]
Loop --> extract_page_content["extract_page_content(url, session)\nFetch HTML\nRemove nav/footer elements"]
extract_page_content --> BeautifulSoup["BeautifulSoup(response.text)\nFind article/main/body\nRemove DeepWiki UI"]
BeautifulSoup --> convert_html_to_markdown["convert_html_to_markdown(html)\nhtml2text.HTML2Text()\nbody_width=0"]
convert_html_to_markdown --> clean_deepwiki_footer["clean_deepwiki_footer(markdown)\nRemove footer patterns"]
clean_deepwiki_footer --> FixLinks["Fix internal links\nRegex: /owner/repo/N-title\nConvert to relative .md paths"]
FixLinks --> WriteTempFile["Write to /workspace/wiki/\nMain: N-title.md\nSubsection: section-N/N-M-title.md"]
WriteTempFile --> Loop
style extract_wiki_structure fill:#f9f9f9
style extract_page_content fill:#f9f9f9
style convert_html_to_markdown fill:#f9f9f9
style clean_deepwiki_footer fill:#f9f9f9
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-216 tools/deepwiki-scraper.py:127-173
Key Functions and Their Roles
| Function | File Location | Responsibility |
|---|---|---|
| extract_wiki_structure() | tools/deepwiki-scraper.py:78-125 | Discover all pages by parsing sidebar links with pattern /repo/\d+ |
| extract_page_content() | tools/deepwiki-scraper.py:453-594 | Fetch individual page, parse HTML, remove navigation elements |
| convert_html_to_markdown() | tools/deepwiki-scraper.py:175-216 | Convert HTML to Markdown using html2text with body_width=0 |
| clean_deepwiki_footer() | tools/deepwiki-scraper.py:127-173 | Remove DeepWiki UI elements using regex pattern matching |
| sanitize_filename() | tools/deepwiki-scraper.py:21-25 | Convert page titles to safe filenames |
| fix_wiki_link() | tools/deepwiki-scraper.py:549-589 | Rewrite internal links to relative .md paths |
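As a concrete illustration of the conversion step, here is a minimal sketch of the html2text configuration used by convert_html_to_markdown() (body_width=0 disables hard wrapping); the real function performs additional cleanup before and after this call:

```python
import html2text

def convert_html_to_markdown(html: str) -> str:
    """Minimal sketch: convert a cleaned HTML fragment to Markdown."""
    h = html2text.HTML2Text()
    h.body_width = 0  # 0 disables hard line wrapping, keeping paragraphs on single lines
    return h.handle(html)
```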
File Organization Logic
flowchart TD
PageNum["page['number']"]
PageNum --> CheckLevel{"page['level']\n== 0?"}
CheckLevel -->|Yes main page| RootFile["Filename: N-title.md\nPath: /workspace/wiki/N-title.md\nExample: 2-quick-start.md"]
CheckLevel -->|No subsection| ExtractMain["Extract main section\nmain_section = number.split('.')[0]"]
ExtractMain --> SubDir["Create directory\nsection-{main_section}/"]
SubDir --> SubFile["Filename: N-M-title.md\nPath: section-N/N-M-title.md\nExample: section-2/2-1-installation.md"]
The system organizes files hierarchically based on page numbering, as shown in the diagram above.
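A minimal Python sketch of this path selection, using a hypothetical sanitize_filename stand-in for the real helper:

```python
from pathlib import Path
import re

def sanitize_filename(title: str) -> str:
    # hypothetical stand-in: lowercase, keep alphanumerics, join with hyphens
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')

def output_path(temp_dir: Path, number: str, title: str, level: int) -> Path:
    slug = sanitize_filename(title)
    name = f"{number.replace('.', '-')}-{slug}.md"
    if level == 0:
        # Main page: N-title.md at the root of the temp directory
        return temp_dir / name
    # Subsection: section-N/N-M-title.md, grouped by main section number
    main_section = number.split('.')[0]
    return temp_dir / f"section-{main_section}" / name

# Example: output_path(Path("/workspace/wiki"), "2.1", "Installation", 1)
# -> /workspace/wiki/section-2/2-1-installation.md
```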
Sources: tools/deepwiki-scraper.py:849-860
Phase 2: Diagram Enhancement
Phase 2 extracts Mermaid diagrams from the JavaScript payload and uses fuzzy matching to intelligently place them in the appropriate Markdown files. This phase operates on files in the temporary directory (/workspace/wiki).
Phase 2 Algorithm Flow
flowchart TD
Start["extract_and_enhance_diagrams(repo, temp_dir, session)"]
Start --> FetchJS["GET https://deepwiki.com/owner/repo/1-overview\nExtract response.text"]
FetchJS --> ExtractAll["Regex: ```mermaid\\n(.*?)```\nFind all diagram blocks"]
ExtractAll --> CountTotal["all_diagrams list\n(~461 total)"]
CountTotal --> ExtractContext["Regex: ([^`]{500,}?)```mermaid\\n(.*?)```\nExtract 500-char context before each"]
ExtractContext --> Unescape["For each diagram:\nUnescape newlines, tabs, quotes\nUnescape HTML entities"]
Unescape --> BuildContext["diagram_contexts = [\n {\n last_heading: str,\n anchor_text: str (last 300 chars),\n diagram: str\n },\n ...\n]\n(~48 with context)"]
BuildContext --> ScanFiles["For each .md file in temp_dir.glob('**/*.md')"]
ScanFiles --> SkipExisting{"File contains\n'```mermaid'?"}
SkipExisting -->|Yes| ScanFiles
SkipExisting -->|No| NormalizeContent
NormalizeContent["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeContent --> MatchLoop["For each diagram in diagram_contexts"]
MatchLoop --> TryChunks["Try chunk sizes: [300, 200, 150, 100, 80]\ntest_chunk = anchor_normalized[-chunk_size:]\npos = content_normalized.find(test_chunk)"]
TryChunks --> FoundMatch{"Match found?"}
FoundMatch -->|Yes| ConvertToLine["Convert char position to line number\nScan through lines counting chars"]
FoundMatch -->|No| TryHeading["Try heading match\nCompare normalized heading text"]
TryHeading --> FoundMatch2{"Match found?"}
FoundMatch2 -->|Yes| ConvertToLine
FoundMatch2 -->|No| MatchLoop
ConvertToLine --> FindInsertPoint["Find insertion point:\nIf heading: skip blank lines, skip paragraph\nIf paragraph: find end of paragraph"]
FindInsertPoint --> QueueInsert["pending_insertions.append(\n (insert_line, diagram, score, idx)\n)"]
QueueInsert --> MatchLoop
MatchLoop --> InsertDiagrams["Sort by line number (reverse)\nInsert from bottom up:\nlines.insert(pos, '')\nlines.insert(pos, '```')\nlines.insert(pos, diagram)\nlines.insert(pos, '```mermaid')\nlines.insert(pos, '')"]
InsertDiagrams --> WriteFile["Write enhanced file back to disk"]
WriteFile --> ScanFiles
ScanFiles --> Complete["Return to orchestrator"]
Sources: tools/deepwiki-scraper.py:596-788
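The queue-and-insert step at the end of the diagram can be sketched as follows (simplified: the real code also carries a match score and diagram index per insertion):

```python
def insert_diagrams(lines, pending_insertions):
    """Insert queued mermaid blocks bottom-up so earlier insertion points stay valid."""
    # pending_insertions: list of (insert_line, diagram_text) pairs
    for insert_line, diagram in sorted(pending_insertions, key=lambda p: p[0], reverse=True):
        block = ['', '```mermaid', *diagram.splitlines(), '```', '']
        lines[insert_line:insert_line] = block
    return lines
```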
Fuzzy Matching Algorithm
The algorithm uses progressive chunk sizes to find the best match location for each diagram:
Sources: tools/deepwiki-scraper.py:716-730 tools/deepwiki-scraper.py:732-745
flowchart LR
Anchor["anchor_text\n(300 chars from JS context)"]
Anchor --> Normalize1["Normalize:\nlowercase\ncollapse whitespace"]
Content["markdown file content"]
Content --> Normalize2["Normalize:\nlowercase\ncollapse whitespace"]
Normalize1 --> Try300["Try 300-char chunk\ntest_chunk = anchor[-300:]"]
Normalize2 --> Try300
Try300 --> Found300{"Found?"}
Found300 -->|Yes| Match300["best_match_score = 300"]
Found300 -->|No| Try200["Try 200-char chunk"]
Try200 --> Found200{"Found?"}
Found200 -->|Yes| Match200["best_match_score = 200"]
Found200 -->|No| Try150["Try 150-char chunk"]
Try150 --> Found150{"Found?"}
Found150 -->|Yes| Match150["best_match_score = 150"]
Found150 -->|No| Try100["Try 100-char chunk"]
Try100 --> Found100{"Found?"}
Found100 -->|Yes| Match100["best_match_score = 100"]
Found100 -->|No| Try80["Try 80-char chunk"]
Try80 --> Found80{"Found?"}
Found80 -->|Yes| Match80["best_match_score = 80"]
Found80 -->|No| TryHeading["Fallback: heading match"]
TryHeading --> FoundH{"Found?"}
FoundH -->|Yes| Match50["best_match_score = 50"]
FoundH -->|No| NoMatch["No match\nSkip this diagram"]
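A minimal Python sketch of this progressive search over normalized text (a hypothetical helper, not the exact code in deepwiki-scraper.py):

```python
def find_anchor(anchor_text, content):
    """Search progressively smaller suffixes of the anchor text in the content.

    Returns (char_position, match_score), or None if the caller should fall
    back to heading matching (score 50) or skip the diagram.
    """
    def normalize(text):
        return ' '.join(text.lower().split())

    anchor = normalize(anchor_text)
    body = normalize(content)
    for chunk_size in (300, 200, 150, 100, 80):
        test_chunk = anchor[-chunk_size:]
        pos = body.find(test_chunk)
        if pos != -1:
            return pos, chunk_size  # best_match_score equals the matched chunk size
    return None
```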
Diagram Extraction from JavaScript
Diagrams are extracted from the Next.js JavaScript payload using two strategies:
Extraction Strategies
| Strategy | Pattern | Description |
|---|---|---|
| Fenced blocks | ```mermaid\\n(.*?)``` | Primary strategy: extract code blocks with escaped newlines |
| JavaScript strings | "graph TD..." | Fallback: find Mermaid start keywords in quoted strings |
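A sketch of the primary strategy, assuming the Next.js payload has already been fetched as a string; the pattern targets fenced blocks whose newlines appear as literal \n escapes:

```python
import re

def extract_fenced_mermaid(payload_text):
    """Primary strategy: fenced mermaid blocks whose newlines are escaped in the JS payload."""
    return re.findall(r'```mermaid\\n(.*?)```', payload_text, re.DOTALL)
```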
The function extract_mermaid_from_nextjs_data() at tools/deepwiki-scraper.py:218-331 handles unescaping:
block = block.replace('\\n', '\n')
block = block.replace('\\t', '\t')
block = block.replace('\\"', '"')
block = block.replace('\\\\', '\\')
block = block.replace('\\u003c', '<')
block = block.replace('\\u003e', '>')
block = block.replace('\\u0026', '&')
Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:615-646
Phase 3: mdBook Build
Phase 3 generates mdBook configuration, creates the table of contents, and builds the final HTML documentation. This phase is orchestrated by build-docs.sh and invokes Rust tools (mdbook, mdbook-mermaid).
Phase 3 Component Interactions
flowchart TD
Start["Phase 3 entry point\n(build-docs.sh:78)"]
Start --> MkdirBook["mkdir -p /workspace/book\ncd /workspace/book"]
MkdirBook --> GenToml["Generate book.toml:\n[book]\ntitle, authors, language\n[output.html]\ndefault-theme=rust\ngit-repository-url\n[preprocessor.mermaid]"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> GenSummary["Generate src/SUMMARY.md"]
GenSummary --> ScanRoot["Scan /workspace/wiki/*.md\nFind first page for intro"]
ScanRoot --> ProcessMain["For each main page:\nExtract title from first line\nCheck for section-N/ subdirectory"]
ProcessMain --> HasSubs{"Has\nsubsections?"}
HasSubs -->|Yes| WriteSection["Write to SUMMARY.md:\n# Title\n- [Title](N-title.md)\n - [Subtitle](section-N/N-M-title.md)"]
HasSubs -->|No| WriteStandalone["Write to SUMMARY.md:\n- [Title](N-title.md)"]
WriteSection --> ProcessMain
WriteStandalone --> ProcessMain
ProcessMain --> CopySrc["cp -r /workspace/wiki/* src/"]
CopySrc --> InstallMermaid["mdbook-mermaid install /workspace/book\nInstalls mermaid.min.js\nInstalls mermaid-init.js\nUpdates book.toml"]
InstallMermaid --> MdbookBuild["mdbook build\nReads src/SUMMARY.md\nProcesses all .md files\nApplies rust theme\nGenerates book/index.html\nGenerates book/*/index.html"]
MdbookBuild --> CopyOut["Copy outputs:\ncp -r book /output/\ncp -r /workspace/wiki/* /output/markdown/\ncp book.toml /output/"]
Sources: build-docs.sh:78-206
book.toml Generation
The orchestrator dynamically generates book.toml with runtime configuration:
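A Python sketch of the configuration the orchestrator emits (the real file is written by a shell heredoc in build-docs.sh; the GitHub URL derivation and the default values shown here are assumptions):

```python
import os
from pathlib import Path

def write_book_toml(book_dir: Path) -> None:
    """Sketch of the book.toml generated at runtime (field set per the diagram above)."""
    title = os.environ.get("BOOK_TITLE", "Documentation")  # assumed fallback
    repo = os.environ.get("REPO", "owner/repo")
    book_toml = f"""[book]
title = "{title}"
authors = ["DeepWiki"]  # placeholder; the real value comes from the environment
language = "en"

[output.html]
default-theme = "rust"
git-repository-url = "https://github.com/{repo}"  # assumption: GitHub URL derived from REPO

[preprocessor.mermaid]
command = "mdbook-mermaid"
"""
    (book_dir / "book.toml").write_text(book_toml)
```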
Sources: build-docs.sh:84-103
SUMMARY.md Generation Algorithm
The table of contents is generated by scanning the actual file structure in /workspace/wiki:
flowchart TD
Start["Generate SUMMARY.md"]
Start --> FindFirst["first_page = ls /workspace/wiki/*.md | head -1\nExtract title from first line\nWrite: [Title](filename)"]
FindFirst --> LoopMain["For each /workspace/wiki/*.md excluding first_page"]
LoopMain --> ExtractNum["section_num = filename.match /^[0-9]+/"]
ExtractNum --> CheckDir{"section-{num}/ exists?"}
CheckDir -->|Yes| WriteSectionHeader["Write: # {title}\n- [{title}]({filename})"]
WriteSectionHeader --> LoopSubs["For each section-{num}/*.md"]
LoopSubs --> WriteSubitem["Write: - [{subtitle}](section-{num}/{subfilename})"]
WriteSubitem --> LoopSubs
LoopSubs --> LoopMain
CheckDir -->|No| WriteStandalone["Write:\n- [{title}]({filename})"]
WriteStandalone --> LoopMain
LoopMain --> Complete["SUMMARY.md complete\ngrep -c '\\[' to count entries"]
Sources: build-docs.sh:108-162
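An equivalent Python sketch of this scan (the real implementation is shell in build-docs.sh; first_line_title is a hypothetical helper that reads each page's title from its first line):

```python
from pathlib import Path

def first_line_title(md_file: Path) -> str:
    # Title is the first line of the file, with any leading '#' markers stripped
    return md_file.read_text().splitlines()[0].lstrip('#').strip()

def generate_summary(wiki_dir: Path) -> str:
    lines = ["# Summary", ""]
    pages = sorted(wiki_dir.glob("*.md"))
    first_page, rest = pages[0], pages[1:]
    lines.append(f"[{first_line_title(first_page)}]({first_page.name})")
    for page in rest:
        title = first_line_title(page)
        section_num = page.name.split('-')[0]  # leading page number, e.g. "2"
        subdir = wiki_dir / f"section-{section_num}"
        if subdir.is_dir():
            # Section with subsections: heading plus nested entries
            lines += ["", f"# {title}", f"- [{title}]({page.name})"]
            for sub in sorted(subdir.glob("*.md")):
                lines.append(f"  - [{first_line_title(sub)}](section-{section_num}/{sub.name})")
        else:
            lines.append(f"- [{title}]({page.name})")
    return "\n".join(lines) + "\n"
```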
mdBook and mdbook-mermaid Execution
The build process invokes two Rust binaries:
| Command | Purpose | Output |
|---|---|---|
| mdbook-mermaid install $BOOK_DIR | Install Mermaid.js assets and update book.toml | mermaid.min.js, mermaid-init.js in book/ |
| mdbook build | Parse SUMMARY.md, process Markdown, generate HTML | HTML files in /workspace/book/book/ |
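Expressed as Python subprocess calls (the orchestrator invokes these binaries directly from the shell):

```python
import subprocess

BOOK_DIR = "/workspace/book"

# Install mermaid.min.js / mermaid-init.js and register the preprocessor in book.toml
subprocess.run(["mdbook-mermaid", "install", BOOK_DIR], check=True)

# Read src/SUMMARY.md and render the HTML site into /workspace/book/book/
subprocess.run(["mdbook", "build"], cwd=BOOK_DIR, check=True)
```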
The mdbook binary:
- Reads src/SUMMARY.md to determine structure
- Processes each Markdown file referenced in SUMMARY.md
- Applies the rust theme specified in book.toml
- Generates the navigation sidebar
- Adds search functionality
- Creates "Edit this page" links using git-repository-url
Sources: build-docs.sh:169-176
Data Transformation Summary
Each phase transforms data in specific ways:
| Phase | Input Format | Processing | Output Format |
|---|---|---|---|
| Phase 1 | HTML pages from DeepWiki | BeautifulSoup parsing, html2text conversion, link rewriting | Clean Markdown in /workspace/wiki/ |
| Phase 2 | Markdown files + JavaScript payload | Regex extraction, fuzzy matching, diagram injection | Enhanced Markdown in /workspace/wiki/ (modified in place) |
| Phase 3 | Markdown files + environment variables | book.toml generation, SUMMARY.md generation, mdbook build | HTML site in /workspace/book/book/ |
Final Output Structure:
/output/
├── book/ # HTML documentation site
│ ├── index.html
│ ├── 1-overview.html
│ ├── section-2/
│ │ └── 2-1-subsection.html
│ ├── mermaid.min.js
│ ├── mermaid-init.js
│ └── ...
├── markdown/ # Source Markdown files
│ ├── 1-overview.md
│ ├── section-2/
│ │ └── 2-1-subsection.md
│ └── ...
└── book.toml # mdBook configuration
Sources: build-docs.sh:178-205 README.md:89-119
Conditional Execution: MARKDOWN_ONLY Mode
The MARKDOWN_ONLY environment variable allows bypassing Phase 3 for faster iteration during development:
flowchart TD
Phase1["Phase 1: Extraction\n(deepwiki-scraper.py)"]
Phase2["Phase 2: Enhancement\n(deepwiki-scraper.py)"]
Phase1 --> Phase2
Phase2 --> Check{"MARKDOWN_ONLY\n== true?"}
Check -->|Yes| FastPath["cp -r /workspace/wiki/* /output/markdown/\nExit (fast path)"]
Check -->|No| Phase3["Phase 3: mdBook Build\n(build-docs.sh)"]
Phase3 --> FullOutput["Copy book/ and markdown/ to /output/\nExit (full build)"]
FastPath --> End[/"Build complete"/]
FullOutput --> End
When MARKDOWN_ONLY=true:
- Execution time: ~30-60 seconds (scraping + diagram matching only)
- Output: /output/markdown/ only
- Use case: Debugging diagram placement, testing content extraction
When MARKDOWN_ONLY=false (default):
- Execution time: ~60-120 seconds (full pipeline)
- Output: /output/book/, /output/markdown/, /output/book.toml
- Use case: Production documentation builds
Sources: build-docs.sh:60-76 README.md:55-76