Three-Phase Pipeline


Purpose and Scope

This document describes the three-phase processing pipeline that transforms DeepWiki HTML pages into searchable mdBook documentation. The pipeline consists of Phase 1: Clean Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. Each phase has distinct responsibilities and uses different technology stacks.

For overall system architecture, see System Architecture. For detailed implementation of individual phases, see Phase 1: Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. For configuration that affects pipeline behavior, see Configuration Reference.

Pipeline Overview

The system processes content through three sequential phases, with an optional bypass mechanism for Phase 3. Each phase is implemented by different components and operates on files in specific directories.

Pipeline Execution Flow

stateDiagram-v2
    [*] --> Init["build-docs.sh\nParse env vars"]
    
    Init --> Phase1["Phase 1 : deepwiki-scraper.py"]
    
    state "Phase 1 : Markdown Extraction" as Phase1 {
        [*] --> ExtractStruct["extract_wiki_structure()"]
        ExtractStruct --> LoopPages["for page in pages"]
        LoopPages --> ExtractPage["extract_page_content(url)"]
        ExtractPage --> ConvertHTML["convert_html_to_markdown()"]
        ConvertHTML --> CleanFooter["clean_deepwiki_footer()"]
        CleanFooter --> WriteTemp["/workspace/wiki/*.md"]
        WriteTemp --> LoopPages
        LoopPages --> RawSnapshot["/workspace/raw_markdown/\n(debug snapshot)"]
    }
    
    Phase1 --> Phase2["Phase 2 : deepwiki-scraper.py"]
    
    state "Phase 2 : Diagram Enhancement" as Phase2 {
        [*] --> ExtractDiagrams["extract_and_enhance_diagrams()"]
        ExtractDiagrams --> FetchJS["Fetch JS payload\nextract_mermaid_from_nextjs_data()"]
        FetchJS --> NormalizeDiagrams["normalize_mermaid_diagram()\n7 normalization passes"]
        NormalizeDiagrams --> FuzzyMatch["Fuzzy match loop\n300/200/150/100/80 char chunks"]
        FuzzyMatch --> InjectFiles["Modify /workspace/wiki/*.md\nInsert ```mermaid blocks"]
    }
    
    Phase2 --> CheckMode{"MARKDOWN_ONLY\nenv var?"}
    
    CheckMode --> CopyMarkdown["build-docs.sh\ncp -r /workspace/wiki /output/markdown"] : true
    CheckMode --> Phase3["Phase 3 : build-docs.sh"] : false
    
    state "Phase 3 : mdBook Build" as Phase3 {
        [*] --> GenToml["Generate book.toml\n[book], [output.html]"]
        GenToml --> GenSummary["Generate src/SUMMARY.md\nScan .md files"]
        GenSummary --> CopyToSrc["cp -r /workspace/wiki/* src/"]
        CopyToSrc --> MermaidInstall["mdbook-mermaid install"]
        MermaidInstall --> MdBookBuild["mdbook build"]
        MdBookBuild --> OutputBook["/workspace/book/book/"]
    }
    
    Phase3 --> CopyAll["cp -r book /output/\ncp -r markdown /output/"]
    CopyMarkdown --> Done["/output directory\nready"]
    CopyAll --> Done
    Done --> [*]

Sources: scripts/build-docs.sh:61-93 python/deepwiki-scraper.py:1277-1408 python/deepwiki-scraper.py:880-1276

Phase Coordination

The build-docs.sh orchestrator coordinates all three phases and handles the decision point for markdown-only mode.

Orchestrator Control Flow

flowchart TD
    Start["Container entrypoint\nCMD: /usr/local/bin/build-docs.sh"]
Start --> ParseEnv["Parse environment\n$REPO, $BOOK_TITLE, $BOOK_AUTHORS\n$MARKDOWN_ONLY, $GIT_REPO_URL"]
ParseEnv --> CheckRepo{"$REPO\nset?"}
CheckRepo -->|No| GitDetect["git config --get remote.origin.url\nsed -E 's#.*github.com[:/]([^/]+/[^/.]+).*#\1#'"]
CheckRepo -->|Yes| SetVars["Set defaults:\nBOOK_AUTHORS=$REPO_OWNER\nGIT_REPO_URL=https://github.com/$REPO"]
GitDetect --> SetVars
    
 
   SetVars --> SetPaths["WORK_DIR=/workspace\nWIKI_DIR=/workspace/wiki\nRAW_DIR=/workspace/raw_markdown\nOUTPUT_DIR=/output\nBOOK_DIR=/workspace/book"]
SetPaths --> CallScraper["python3 /usr/local/bin/deepwiki-scraper.py $REPO $WIKI_DIR"]
CallScraper --> ScraperRuns["deepwiki-scraper.py executes:\nPhase 1: extract_wiki_structure()\nPhase 2: extract_and_enhance_diagrams()"]
ScraperRuns --> CheckMode{"$MARKDOWN_ONLY\n== 'true'?"}
CheckMode -->|Yes| QuickCopy["rm -rf $OUTPUT_DIR/markdown\nmkdir -p $OUTPUT_DIR/markdown\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\nexit 0"]
CheckMode -->|No| MkdirBook["mkdir -p $BOOK_DIR\ncd $BOOK_DIR"]
MkdirBook --> GenToml["cat > book.toml <<EOF\n[book] title=$BOOK_TITLE\n[output.html] git-repository-url=$GIT_REPO_URL\n[preprocessor.mermaid]"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> GenSummary["Generate src/SUMMARY.md\nls $WIKI_DIR/*.md | sort -t- -k1 -n\nfor file: head -1 $file | sed 's/^# //'"]
GenSummary --> CopyToSrc["cp -r $WIKI_DIR/* src/"]
CopyToSrc --> ProcessTemplates["python3 process-template.py header.html\npython3 process-template.py footer.html\nInject into src/*.md"]
ProcessTemplates --> InstallMermaid["mdbook-mermaid install $BOOK_DIR"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["mkdir -p $OUTPUT_DIR\ncp -r book $OUTPUT_DIR/\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\ncp book.toml $OUTPUT_DIR/"]
QuickCopy --> Done["Exit 0"]
CopyOutputs --> Done

Sources: scripts/build-docs.sh:8-47 scripts/build-docs.sh:61-93 scripts/build-docs.sh:95-309

Phase 1: Clean Markdown Extraction

Phase 1 discovers the wiki structure and converts HTML pages to clean Markdown, writing files to a temporary directory (/workspace/wiki). This phase is implemented entirely in Python within deepwiki-scraper.py.
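
The core of this phase is a small HTML-to-Markdown conversion step. The sketch below illustrates it using the calls named in the diagram and table that follow (html2text.HTML2Text with body_width=0, BeautifulSoup tag removal); the actual function boundaries in deepwiki-scraper.py differ.

```python
# Illustrative sketch of the Phase 1 conversion core, based on the calls
# shown below; not the exact code from deepwiki-scraper.py.
import html2text
from bs4 import BeautifulSoup

def html_to_clean_markdown(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove navigation chrome before conversion
    for tag in soup.find_all(["nav", "header", "footer", "aside"]):
        tag.decompose()
    content = soup.find("article") or soup.find("main") or soup.body
    h = html2text.HTML2Text()
    h.body_width = 0  # disable hard line wrapping
    return h.handle(str(content))
```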

Phase 1 Data Flow

flowchart LR
    DeepWiki["https://deepwiki.com/$REPO"]
DeepWiki -->|session.get base_url| ExtractStruct["extract_wiki_structure(repo, session)"]
ExtractStruct -->|soup.find_all 'a', href=re.compile ...| ParseLinks["Parse sidebar links\nPattern: /$REPO/\d+"]
ParseLinks --> PageList["pages = [\n {'number': '1', 'title': 'Overview',\n 'url': '...', 'href': '...', 'level': 0},\n {'number': '2.1', 'title': 'Sub',\n 'url': '...', 'href': '...', 'level': 1},\n ...\n]\nsorted by page number"]
PageList --> Loop["for page in pages:"]
Loop --> FetchPage["fetch_page(url, session)\nUser-Agent header\n3 retries with timeout=30"]
FetchPage --> ParseHTML["BeautifulSoup(response.text)\nRemove: nav, header, footer, aside\nFind: article or main or body"]
ParseHTML --> ConvertMD["h = html2text.HTML2Text()\nh.body_width = 0\nmarkdown = h.handle(html_content)"]
ConvertMD --> CleanFooter["clean_deepwiki_footer(markdown)\nRegex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
CleanFooter --> FixLinks["fix_wiki_link(match)\nRegex: /owner/repo/(\d+(?:\.\d+)*)-(.+)\nConvert to: section-N/N-M-slug.md"]
FixLinks --> ResolvePath["resolve_output_path(page_number, title)\nnormalized_number_parts()\nsanitize_filename()"]
ResolvePath --> WriteFile["filepath.write_text(markdown)\nMain: /workspace/wiki/N-slug.md\nSub: /workspace/wiki/section-N/N-M-slug.md"]
WriteFile --> Loop

Sources: python/deepwiki-scraper.py:116-163 python/deepwiki-scraper.py:751-877 python/deepwiki-scraper.py:213-228 python/deepwiki-scraper.py:165-211 python/deepwiki-scraper.py:22-26 python/deepwiki-scraper.py:28-53

Key Functions and Their Roles

| Function | File Location | Responsibility |
|----------|---------------|----------------|
| extract_wiki_structure() | python/deepwiki-scraper.py:116-163 | Parse main wiki page, discover all pages via sidebar links matching /repo/\d+, return sorted list of page metadata |
| extract_page_content() | python/deepwiki-scraper.py:751-877 | Fetch individual page HTML, parse with BeautifulSoup, remove nav/footer elements, convert to Markdown |
| convert_html_to_markdown() | python/deepwiki-scraper.py:213-228 | Convert HTML string to Markdown using html2text.HTML2Text() with body_width=0 (no line wrapping) |
| clean_deepwiki_footer() | python/deepwiki-scraper.py:165-211 | Scan last 50 lines for DeepWiki UI patterns (Dismiss, Refresh this wiki, etc.) and truncate |
| sanitize_filename() | python/deepwiki-scraper.py:22-26 | Strip special chars, replace spaces/hyphens, convert to lowercase: re.sub(r'[^\w\s-]', '', text) |
| normalized_number_parts() | python/deepwiki-scraper.py:28-43 | Shift DeepWiki page numbers down by 1 (page 1 becomes unnumbered), split on . into parts |
| resolve_output_path() | python/deepwiki-scraper.py:45-53 | Determine filename (N-slug.md) and optional subdirectory (section-N/) based on page numbering |
| fix_wiki_link() | python/deepwiki-scraper.py:854-876 | Rewrite internal links from /owner/repo/N-title to relative paths like ../section-N/N-M-title.md |

File Organization Logic

flowchart TD
    PageNum["page['number']\n(from DeepWiki)"]
PageNum --> Normalize["normalized_number_parts(page_number)\nSplit on '.', shift main number down by 1\nDeepWiki '1' → []\nDeepWiki '2' → ['1']\nDeepWiki '2.1' → ['1', '1']"]
Normalize --> CheckParts{"len(parts)?"}
CheckParts -->|0 was page 1| RootOverview["Filename: overview.md\nPath: $WIKI_DIR/overview.md\nNo section dir"]
CheckParts -->|1 main page| RootMain["Filename: N-slug.md\nExample: 1-quick-start.md\nPath: $WIKI_DIR/1-quick-start.md\nNo section dir"]
CheckParts -->|2+ subsection| ExtractSection["main_section = parts[0]\nsection_dir = f'section-{main_section}'"]
ExtractSection --> CreateDir["section_path = Path($WIKI_DIR) / section_dir\nsection_path.mkdir(exist_ok=True)"]
CreateDir --> SubFile["Filename: N-M-slug.md\nExample: 1-1-installation.md\nPath: $WIKI_DIR/section-1/1-1-installation.md"]

The system organizes files hierarchically based on page numbering. DeepWiki pages are numbered starting from 1, but the system shifts them down by 1 so that the first page becomes unnumbered (the overview).
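
A minimal sketch of this shift, matching the examples in the diagram above (illustrative only; the real normalized_number_parts() and resolve_output_path() handle more cases):

```python
# Illustrative numbering shift: DeepWiki "1" -> overview, "2" -> main page 1,
# "2.1" -> subsection 1-1 placed in section-1/. Not the exact source code.
def normalized_number_parts(page_number: str) -> list[str]:
    parts = page_number.split(".")
    main = int(parts[0]) - 1  # shift the main number down by one
    if main == 0 and len(parts) == 1:
        return []             # DeepWiki page 1 becomes the unnumbered overview
    return [str(main)] + parts[1:]

assert normalized_number_parts("1") == []
assert normalized_number_parts("2") == ["1"]
assert normalized_number_parts("2.1") == ["1", "1"]
```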

Sources: python/deepwiki-scraper.py:28-43 python/deepwiki-scraper.py:45-63 python/deepwiki-scraper.py:1332-1338

Phase 2: Diagram Enhancement

Phase 2 extracts Mermaid diagrams from the JavaScript payload and uses fuzzy matching to intelligently place them in the appropriate Markdown files. This phase operates on files in the temporary directory (/workspace/wiki).

Phase 2 Algorithm Flow

flowchart TD
    Start["extract_and_enhance_diagrams(repo, temp_dir, session, diagram_source_url)"]
Start --> FetchJS["response = session.get(diagram_source_url)\nhtml_text = response.text"]
FetchJS --> ExtractRegex["pattern = r'```mermaid(?:\\r\\n|\\n|\r?\n)(.*?)(?:\\r\\n|\\n|\r?\n)```'\ndiagram_matches = re.finditer(pattern, html_text, re.DOTALL)"]
ExtractRegex --> CountTotal["print(f'Found {len(diagram_matches)} total diagrams')"]
CountTotal --> ExtractContext["for match in diagram_matches:\ncontext_start = max(0, match.start() - 2000)\ncontext_before = html_text[context_start:match.start()]"]
ExtractContext --> Unescape["Unescape escape sequences:\nreplace('\\n', newline)\nreplace('\\t', tab)\nreplace('\\\"', '\"')\nreplace('\\u003c', '<')\nreplace('\\u003e', '>')"]
Unescape --> ParseContext["context_lines = [l for l in context.split('\\n') if l.strip()]\nFind last_heading (line starting with #)\nExtract anchor_text (last 2-3 non-heading lines, max 300 chars)"]
ParseContext --> Normalize["normalize_mermaid_diagram(diagram)\n7 normalization passes:\nnormalize_mermaid_edge_labels()\nnormalize_mermaid_state_descriptions()\nnormalize_flowchart_nodes()\nnormalize_statement_separators()\nnormalize_empty_node_labels()\nnormalize_gantt_diagram()"]
Normalize --> BuildContexts["diagram_contexts.append({\n 'last_heading': last_heading,\n 'anchor_text': anchor_text[-300:],\n 'diagram': normalized_diagram\n})"]
BuildContexts --> ScanFiles["md_files = list(temp_dir.glob('**/*.md'))\nfor md_file in md_files:"]
    
 
ScanFiles --> SkipExisting{"re.search(r'^\s*`{3,}\s*mermaid\b',\ncontent)?"}
SkipExisting -->|Yes| ScanFiles
 
   SkipExisting -->|No| NormalizeContent["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeContent --> MatchLoop["for idx, item in enumerate(diagram_contexts):"]
MatchLoop --> TryChunks["for chunk_size in [300, 200, 150, 100, 80]:\ntest_chunk = anchor_normalized[-chunk_size:]\npos = content_normalized.find(test_chunk)"]
TryChunks --> FoundMatch{"pos != -1?"}
FoundMatch -->|Yes| ConvertToLine["char_count = 0\nfor line_num, line in enumerate(lines):\n char_count += len(' '.join(line.split())) + 1\n if char_count >= pos: best_match_line = line_num"]
FoundMatch -->|No| TryHeading["for line_num, line in enumerate(lines):\nif heading_normalized in line_normalized:\n best_match_line = line_num"]
TryHeading --> FoundMatch2{"best_match_line != -1?"}
FoundMatch2 -->|Yes| ConvertToLine
 
   FoundMatch2 -->|No| MatchLoop
    
 
   ConvertToLine --> CheckScore{"best_match_score >= 80?"}
CheckScore -->|Yes| FindInsert["insert_line = best_match_line + 1\nSkip blank lines, skip paragraph/list"]
CheckScore -->|No| MatchLoop
    
 
   FindInsert --> QueueInsert["pending_insertions.append(\n (insert_line, diagram, score, idx)\n)\ndiagrams_used.add(idx)"]
QueueInsert --> MatchLoop
    
 
   MatchLoop --> SortInsert["pending_insertions.sort(key=lambda x: x[0], reverse=True)"]
SortInsert --> InsertLoop["for insert_line, diagram, score, idx in pending_insertions:\nlines.insert(insert_line, '')\nlines.insert(insert_line, f'{fence}mermaid')\nlines.insert(insert_line, diagram)\nlines.insert(insert_line, fence)\nlines.insert(insert_line, '')"]
InsertLoop --> WriteFile["with open(md_file, 'w') as f:\n f.write('\\n'.join(lines))"]
WriteFile --> ScanFiles

Sources: python/deepwiki-scraper.py:880-1276 python/deepwiki-scraper.py:899-1088 python/deepwiki-scraper.py:1149-1273 python/deepwiki-scraper.py:230-393

Fuzzy Matching Algorithm

The algorithm uses progressively shorter anchor text chunks to find the best match location for each diagram. The score threshold of 80 ensures only high-confidence matches are inserted.
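
A compact sketch of this loop, assuming the normalization and scoring shown in the flowchart below (the real implementation also queues insertions and applies the heading fallback):

```python
# Illustrative progressive-chunk matcher; the score equals the matched chunk size.
def find_insertion_line(anchor_text: str, content: str) -> tuple[int, int]:
    anchor = " ".join(anchor_text.lower().split())
    normalized = " ".join(content.lower().split())
    lines = content.split("\n")
    for chunk_size in (300, 200, 150, 100, 80):
        if len(anchor) < chunk_size:
            continue
        pos = normalized.find(anchor[-chunk_size:])
        if pos == -1:
            continue
        # Map the character offset in the normalized text back to a line number
        char_count = 0
        for line_num, line in enumerate(lines):
            char_count += len(" ".join(line.split())) + 1
            if char_count >= pos:
                return line_num, chunk_size
    return -1, 0  # no chunk matched; caller falls back to the heading search
```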

Sources: python/deepwiki-scraper.py:1184-1218

flowchart LR
    AnchorText["anchor_text\n(last 300 chars from context)"]
AnchorText --> NormalizeA["anchor_normalized = anchor.lower()\nanchor_normalized = ' '.join(anchor_normalized.split())"]
MDFile["markdown file content"]
MDFile --> NormalizeC["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeA --> Loop["for chunk_size in [300, 200, 150, 100, 80]:"]
NormalizeC --> Loop
Loop --> Extract["if len(anchor_normalized) >= chunk_size:\n test_chunk = anchor_normalized[-chunk_size:]"]
Extract --> Find["pos = content_normalized.find(test_chunk)"]
Find --> FoundPos{"pos != -1?"}
FoundPos -->|Yes| CharToLine["char_count = 0\nfor line_num, line in enumerate(lines):\n char_count += len(' '.join(line.split())) + 1\n if char_count >= pos:\n best_match_line = line_num\n best_match_score = chunk_size"]
FoundPos -->|No| Loop
    
 
   CharToLine --> CheckThresh{"best_match_score >= 80?"}
CheckThresh -->|Yes| Accept["Accept match\nQueue for insertion"]
CheckThresh -->|No| HeadingFallback["Try heading_normalized in line_normalized\nbest_match_score = 50"]
HeadingFallback --> CheckThresh2{"best_match_score >= 80?"}
CheckThresh2 -->|Yes| Accept
 
   CheckThresh2 -->|No| Reject["Reject match\nSkip diagram"]

Diagram Extraction from JavaScript

Diagrams are extracted from the Next.js JavaScript payload embedded in the HTML response. DeepWiki stores all diagrams for all pages in a single JavaScript bundle, which requires fuzzy matching to place each diagram in the correct file.

Extraction Method

The primary extraction pattern captures fenced Mermaid blocks with various newline representations:
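
A sketch of that pattern, reconstructed from the description above (the exact regex in deepwiki-scraper.py may differ):

```python
import re

# Reconstructed pattern: fenced ```mermaid blocks whose newlines appear either
# as escaped \n / \r\n sequences or as real newlines in the JavaScript payload.
MERMAID_PATTERN = r'```mermaid(?:\\r\\n|\\n|\r?\n)(.*?)(?:\\r\\n|\\n|\r?\n)```'

def extract_raw_diagrams(html_text: str) -> list[str]:
    return [m.group(1) for m in re.finditer(MERMAID_PATTERN, html_text, re.DOTALL)]
```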

Unescape Sequence

Each diagram block undergoes unescaping to convert JavaScript string literals to actual text:

| Escape Sequence | Replacement | Purpose |
|-----------------|-------------|---------|
| \\n | \n | Newline characters in diagram syntax |
| \\t | \t | Tab characters for indentation |
| \\" | " | Quoted strings in node labels |
| \\\\ | \ | Literal backslashes |
| \\u003c | < | HTML less-than entity |
| \\u003e | > | HTML greater-than entity |
| \\u0026 | & | HTML ampersand entity |
| <br/>, <br> | (space) | HTML line breaks in labels |

Sources: python/deepwiki-scraper.py:899-901 python/deepwiki-scraper.py:1039-1047
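
A sketch of this unescape chain (the replacements follow the table above; the exact chain in deepwiki-scraper.py may differ):

```python
# Applies the replacements listed in the table above to one raw diagram block.
def unescape_js_literal(diagram: str) -> str:
    return (diagram
            .replace('\\n', '\n')
            .replace('\\t', '\t')
            .replace('\\"', '"')
            .replace('\\\\', '\\')
            .replace('\\u003c', '<')
            .replace('\\u003e', '>')
            .replace('\\u0026', '&')
            .replace('<br/>', ' ')
            .replace('<br>', ' '))
```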

Phase 3: mdBook Build

Phase 3 generates mdBook configuration, creates the table of contents, and builds the final HTML documentation. This phase is orchestrated by build-docs.sh and invokes Rust tools (mdbook, mdbook-mermaid).

Phase 3 Component Interactions

flowchart TD
    Entry["build-docs.sh line 95\nPhase 3 starts"]
Entry --> MkdirBook["mkdir -p $BOOK_DIR\ncd $BOOK_DIR\n$BOOK_DIR=/workspace/book"]
MkdirBook --> GenToml["cat > book.toml <<EOF\n[book]\ntitle = \"$BOOK_TITLE\"\nauthors = [\"$BOOK_AUTHORS\"]\n[output.html]\ndefault-theme = \"rust\"\ngit-repository-url = \"$GIT_REPO_URL\"\n[preprocessor.mermaid]\ncommand = \"mdbook-mermaid\"\n[output.html.fold]\nenable = true"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> InitSummary["{ echo '# Summary'; echo ''; } > src/SUMMARY.md"]
InitSummary --> FindOverview["main_pages_list=$(ls $WIKI_DIR/*.md)\noverview_file=$(printf '%s\\n' $main_pages_list | grep -Ev '^[0-9]' | head -1)\ntitle=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')"]
FindOverview --> WriteOverview["echo \"[$title]($overview_file)\" >> src/SUMMARY.md\necho '' >> src/SUMMARY.md"]
WriteOverview --> SortPages["main_pages=$(printf '%s\\n' $main_pages_list | awk -F/ '{print $NF}'\n| grep -E '^[0-9]' | sort -t- -k1 -n)"]
SortPages --> LoopPages["echo \"$main_pages\" | while read -r file; do"]
LoopPages --> ExtractTitle["filename=$(basename \"$file\")\ntitle=$(head -1 \"$file\" | sed 's/^# //')"]
ExtractTitle --> ExtractNum["section_num=$(echo \"$filename\" | grep -oE '^[0-9]+')"]
ExtractNum --> CheckSubdir{"[ -d \"$WIKI_DIR/section-$section_num\" ]?"}
CheckSubdir -->|Yes| WriteMain["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
WriteMain --> LoopSubs["ls $WIKI_DIR/section-$section_num/*.md | awk -F/ '{print $NF}'\n| sort -t- -k1 -n | while read subname; do"]
LoopSubs --> WriteSub["subfile=\"section-$section_num/$subname\"\nsubtitle=$(head -1 \"$subfile\" | sed 's/^# //')\necho \"  - [$subtitle]($subfile)\" >> src/SUMMARY.md"]
WriteSub --> LoopSubs
 CheckSubdir -->|No| WriteStandalone["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
LoopSubs --> LoopPages
 
   WriteStandalone --> LoopPages
    
 
   LoopPages --> CopySrc["cp -r $WIKI_DIR/* src/"]
CopySrc --> ProcessTemplates["python3 /usr/local/bin/process-template.py $HEADER_TEMPLATE\npython3 /usr/local/bin/process-template.py $FOOTER_TEMPLATE\nInject into src/*.md and src/*/*.md"]
ProcessTemplates --> MermaidInstall["mdbook-mermaid install $BOOK_DIR"]
MermaidInstall --> MdBookBuild["mdbook build\nReads book.toml and src/SUMMARY.md\nProcesses src/*.md files\nGenerates book/index.html"]
MdBookBuild --> CopyOut["mkdir -p $OUTPUT_DIR\ncp -r book $OUTPUT_DIR/\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\ncp book.toml $OUTPUT_DIR/"]

Sources: scripts/build-docs.sh:95-309 scripts/build-docs.sh:124-188 scripts/build-docs.sh:190-261 scripts/build-docs.sh:263-271

book.toml Generation

The orchestrator dynamically generates book.toml with runtime configuration from environment variables:

Sources: scripts/build-docs.sh:102-119
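
A representative book.toml, reconstructed from the fields shown in the Phase 3 flowchart above (placeholder values stand in for the environment variables; the generated file may contain additional keys):

```toml
[book]
title = "Example Project"          # from $BOOK_TITLE
authors = ["example-owner"]        # from $BOOK_AUTHORS

[output.html]
default-theme = "rust"
git-repository-url = "https://github.com/example-owner/example-repo"  # from $GIT_REPO_URL

[output.html.fold]
enable = true

[preprocessor.mermaid]
command = "mdbook-mermaid"
```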

SUMMARY.md Generation Algorithm

The table of contents is generated by scanning the actual file structure in /workspace/wiki and extracting titles from the first line of each file:

Sources: scripts/build-docs.sh:124-188

flowchart TD
    Start["Generate src/SUMMARY.md\n{ echo '# Summary'; echo ''; } > src/SUMMARY.md"]
Start --> ListFiles["main_pages_list=$(ls $WIKI_DIR/*.md 2>/dev/null || true)"]
ListFiles --> FindOverview["overview_file=$(printf '%s\\n' $main_pages_list\n| awk -F/ '{print $NF}' | grep -Ev '^[0-9]'\n| head -1)"]
FindOverview --> WriteOverview["if [ -n \"$overview_file\" ]; then\ntitle=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')\necho \"[${title:-Overview}]($overview_file)\" >> src/SUMMARY.md\necho '' >> src/SUMMARY.md\nfi"]
WriteOverview --> FilterMain["main_pages=$(printf '%s\\n' $main_pages_list\n| awk -F/ '{print $NF}' | grep -E '^[0-9]'\n| sort -t- -k1 -n)"]
FilterMain --> LoopMain["echo \"$main_pages\" | while read -r file; do"]
LoopMain --> CheckFile{"[ -f \"$file\" ]?"}
CheckFile -->|Yes| GetFilename["filename=$(basename \"$file\")\ntitle=$(head -1 \"$file\" | sed 's/^# //')"]
CheckFile -->|No| LoopMain
GetFilename --> ExtractNum["section_num=$(echo \"$filename\" | grep -oE '^[0-9]+' || true)\nsection_dir=\"$WIKI_DIR/section-$section_num\""]
ExtractNum --> CheckSubdir{"[ -n \"$section_num\" ] &&\n[ -d \"$section_dir\" ]?"}
CheckSubdir -->|Yes| WriteMainWithSubs["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
WriteMainWithSubs --> ListSubs["ls $section_dir/*.md 2>/dev/null\n| awk -F/ '{print $NF}' | sort -t- -k1 -n"]
ListSubs --> LoopSubs["while read subname; do"]
LoopSubs --> CheckSubFile{"[ -f \"$subfile\" ]?"}
CheckSubFile -->|Yes| WriteSub["subfile=\"$section_dir/$subname\"\nsubfilename=$(basename \"$subfile\")\nsubtitle=$(head -1 \"$subfile\" | sed 's/^# //')\necho \"  - [$subtitle](section-$section_num/$subfilename)\" >> src/SUMMARY.md"]
CheckSubFile -->|No| LoopSubs
WriteSub --> LoopSubs
CheckSubdir -->|No| WriteStandalone["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
LoopSubs --> LoopMain
WriteStandalone --> LoopMain
LoopMain --> CountEntries["echo \"Generated SUMMARY.md with $(grep -c '\\[' src/SUMMARY.md) entries\""]

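For a wiki with an overview page, one numbered main page, and one subsection, the generated src/SUMMARY.md looks roughly like this (illustrative titles; file names follow the output structure shown later):

```markdown
# Summary

[Overview](overview.md)

- [Quick Start](1-quick-start.md)
  - [Installation](section-1/1-1-installation.md)
```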

mdBook and mdbook-mermaid Execution

The build process invokes two Rust binaries installed via the Docker multi-stage build:

| Command | Location | Purpose | Output |
|---------|----------|---------|--------|
| mdbook-mermaid install $BOOK_DIR | scripts/build-docs.sh:265 | Install Mermaid.js assets into book directory and update book.toml | mermaid.min.js, mermaid-init.js in $BOOK_DIR/ |
| mdbook build | scripts/build-docs.sh:270 | Parse SUMMARY.md, process all Markdown files, generate static HTML site | HTML files in $BOOK_DIR/book/ (subdirectory, not root) |

mdbook Build Process:

The mdbook build command performs the following operations:

  1. Parse Structure: Read src/SUMMARY.md to determine page hierarchy and navigation order
  2. Process Files: For each .md file referenced in SUMMARY.md:
    • Parse Markdown with CommonMark parser
    • Process Mermaid fenced code blocks via mdbook-mermaid preprocessor
    • Apply rust theme styles (configurable via default-theme in book.toml)
    • Generate sidebar navigation
  3. Generate HTML: Create HTML files with:
    • Responsive navigation sidebar
    • Client-side search functionality (elasticlunr.js)
    • “Edit this page” links using git-repository-url from book.toml
    • Syntax highlighting for code blocks
  4. Copy Assets: Bundle theme assets, fonts, and JavaScript libraries

Sources: scripts/build-docs.sh:263-271 scripts/build-docs.sh:102-119

Data Transformation Summary

Each phase transforms data in specific ways, with temporary directories used for intermediate work:

| Phase | Input | Processing Components | Output |
|-------|-------|-----------------------|--------|
| Phase 1 | HTML from https://deepwiki.com/$REPO | extract_wiki_structure(), extract_page_content(), BeautifulSoup, html2text.HTML2Text(), clean_deepwiki_footer() | Clean Markdown in /workspace/wiki/ |
| Phase 2 | Markdown files + JavaScript payload from DeepWiki | extract_and_enhance_diagrams(), normalize_mermaid_diagram(), fuzzy matching with 300/200/150/100/80 char chunks | Enhanced Markdown in /workspace/wiki/ (modified in place) |
| Phase 3 | Markdown files + environment variables ($BOOK_TITLE, $BOOK_AUTHORS, etc.) | Shell script generates book.toml and src/SUMMARY.md, process-template.py, mdbook-mermaid install, mdbook build | HTML site in /workspace/book/book/ |

Working Directories:

| Directory | Purpose | Contents | Lifecycle |
|-----------|---------|----------|-----------|
| /workspace/wiki/ | Primary working directory | Markdown files organized by numbering scheme | Created in Phase 1, modified in Phase 2, read in Phase 3 |
| /workspace/raw_markdown/ | Debug snapshot | Copy of /workspace/wiki/ before Phase 2 enhancement | Created between Phase 1 and Phase 2, copied to /output/raw_markdown/ |
| /workspace/book/ | mdBook project directory | book.toml, src/ subdirectory, final book/ subdirectory | Created in Phase 3 |
| /workspace/book/src/ | mdBook source | Copy of /workspace/wiki/ with injected headers/footers, SUMMARY.md | Created in Phase 3 |
| /workspace/book/book/ | Final HTML output | Complete static HTML site | Generated by mdbook build |
| /output/ | Final container output | book/, markdown/, raw_markdown/, book.toml | Populated at end of Phase 3 (or end of Phase 2 if MARKDOWN_ONLY=true) |

Final Output Structure:

/output/
├── book/                          # Static HTML site (from /workspace/book/book/)
│   ├── index.html
│   ├── overview.html               # First page (unnumbered)
│   ├── 1-quick-start.html         # Main pages
│   ├── section-1/
│   │   ├── 1-1-installation.html  # Subsections
│   │   └── ...
│   ├── mermaid.min.js             # Installed by mdbook-mermaid
│   ├── mermaid-init.js            # Installed by mdbook-mermaid
│   └── ...
├── markdown/                       # Enhanced Markdown source (from /workspace/wiki/)
│   ├── overview.md
│   ├── 1-quick-start.md
│   ├── section-1/
│   │   ├── 1-1-installation.md
│   │   └── ...
│   └── ...
├── raw_markdown/                   # Pre-enhancement snapshot (from /workspace/raw_markdown/)
│   ├── overview.md                 # Same structure as markdown/ but without diagrams
│   └── ...
└── book.toml                      # mdBook configuration (from /workspace/book/book.toml)

Sources: scripts/build-docs.sh:273-294 python/deepwiki-scraper.py:1358-1366

flowchart TD
    Phase1["Phase 1: Extraction\n(deepwiki-scraper.py)"]
Phase2["Phase 2: Enhancement\n(deepwiki-scraper.py)"]
Phase1 --> Phase2
    
 
   Phase2 --> Check{"MARKDOWN_ONLY\n== true?"}
Check -->|Yes| FastPath["cp -r /workspace/wiki/* /output/markdown/\nExit (fast path)"]
Check -->|No| Phase3["Phase 3: mdBook Build\n(build-docs.sh)"]
Phase3 --> FullOutput["Copy book/ and markdown/ to /output/\nExit (full build)"]
FastPath --> End[/"Build complete"/]
 
   FullOutput --> End

Conditional Execution: MARKDOWN_ONLY Mode

The MARKDOWN_ONLY environment variable allows bypassing Phase 3 for faster iteration during development:

When MARKDOWN_ONLY=true:

  • Execution time: ~30-60 seconds (scraping + diagram matching only)
  • Output: /output/markdown/ only
  • Use case: Debugging diagram placement, testing content extraction

When MARKDOWN_ONLY=false (default):

  • Execution time: ~60-120 seconds (full pipeline)
  • Output: /output/book/, /output/markdown/, /output/book.toml
  • Use case: Production documentation builds
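
As a usage sketch, a markdown-only run could look like the following; the image name deepwiki-to-mdbook is hypothetical (see the README for the actual image and tag), while the environment variables are the ones documented above:

```sh
# Hypothetical image name; the -e/-v flags map the documented env vars and output dir.
docker run --rm \
  -e REPO=owner/repo \
  -e MARKDOWN_ONLY=true \
  -v "$(pwd)/output:/output" \
  deepwiki-to-mdbook:latest
```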

Sources: build-docs.sh:60-76 README.md:55-76
