This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Three-Phase Pipeline
Purpose and Scope
This document describes the three-phase processing pipeline that transforms DeepWiki HTML pages into searchable mdBook documentation. The pipeline consists of Phase 1: Clean Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. Each phase has distinct responsibilities and uses a different technology stack.
For overall system architecture, see System Architecture. For detailed implementation of individual phases, see Phase 1: Markdown Extraction, Phase 2: Diagram Enhancement, and Phase 3: mdBook Build. For configuration that affects pipeline behavior, see Configuration Reference.
Pipeline Overview
The system processes content through three sequential phases, with an optional bypass mechanism for Phase 3. Each phase is implemented by different components and operates on files in specific directories.
Pipeline Execution Flow
stateDiagram-v2
[*] --> Init["build-docs.sh\nParse env vars"]
Init --> Phase1["Phase 1 : deepwiki-scraper.py"]
state "Phase 1 : Markdown Extraction" as Phase1 {
[*] --> ExtractStruct["extract_wiki_structure()"]
ExtractStruct --> LoopPages["for page in pages"]
LoopPages --> ExtractPage["extract_page_content(url)"]
ExtractPage --> ConvertHTML["convert_html_to_markdown()"]
ConvertHTML --> CleanFooter["clean_deepwiki_footer()"]
CleanFooter --> WriteTemp["/workspace/wiki/*.md"]
WriteTemp --> LoopPages
LoopPages --> RawSnapshot["/workspace/raw_markdown/\n(debug snapshot)"]
}
Phase1 --> Phase2["Phase 2 : deepwiki-scraper.py"]
state "Phase 2 : Diagram Enhancement" as Phase2 {
[*] --> ExtractDiagrams["extract_and_enhance_diagrams()"]
ExtractDiagrams --> FetchJS["Fetch JS payload\nextract_mermaid_from_nextjs_data()"]
FetchJS --> NormalizeDiagrams["normalize_mermaid_diagram()\n7 normalization passes"]
NormalizeDiagrams --> FuzzyMatch["Fuzzy match loop\n300/200/150/100/80 char chunks"]
FuzzyMatch --> InjectFiles["Modify /workspace/wiki/*.md\nInsert ```mermaid blocks"]
}
Phase2 --> CheckMode{"MARKDOWN_ONLY\nenv var?"}
CheckMode --> CopyMarkdown["build-docs.sh\ncp -r /workspace/wiki /output/markdown"] : true
CheckMode --> Phase3["Phase 3 : build-docs.sh"] : false
state "Phase 3 : mdBook Build" as Phase3 {
[*] --> GenToml["Generate book.toml\n[book], [output.html]"]
GenToml --> GenSummary["Generate src/SUMMARY.md\nScan .md files"]
GenSummary --> CopyToSrc["cp -r /workspace/wiki/* src/"]
CopyToSrc --> MermaidInstall["mdbook-mermaid install"]
MermaidInstall --> MdBookBuild["mdbook build"]
MdBookBuild --> OutputBook["/workspace/book/book/"]
}
Phase3 --> CopyAll["cp -r book /output/\ncp -r markdown /output/"]
CopyMarkdown --> Done["/output directory\nready"]
CopyAll --> Done
Done --> [*]
Sources: scripts/build-docs.sh:61-93 python/deepwiki-scraper.py:1277-1408 python/deepwiki-scraper.py:880-1276
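The control flow above can be compressed into a few lines of Python for orientation. This is an illustrative sketch only; the real orchestrator is the build-docs.sh shell script, and the run_pipeline helper below is hypothetical (book.toml and SUMMARY.md generation are omitted here).

```python
import os
import shutil
import subprocess

def run_pipeline():
    """Illustrative control flow; the actual orchestrator is build-docs.sh."""
    repo = os.environ["REPO"]          # e.g. "owner/name"
    wiki_dir = "/workspace/wiki"
    output_dir = "/output"

    # Phases 1 and 2 both live in deepwiki-scraper.py and run in one invocation.
    subprocess.run(
        ["python3", "/usr/local/bin/deepwiki-scraper.py", repo, wiki_dir],
        check=True,
    )

    if os.environ.get("MARKDOWN_ONLY", "false") == "true":
        # Fast path: skip Phase 3 and ship the enhanced Markdown as-is.
        shutil.copytree(wiki_dir, os.path.join(output_dir, "markdown"), dirs_exist_ok=True)
        return

    # Phase 3 (book.toml / SUMMARY.md generation not shown): build with mdBook.
    subprocess.run(["mdbook-mermaid", "install", "/workspace/book"], check=True)
    subprocess.run(["mdbook", "build"], cwd="/workspace/book", check=True)
    shutil.copytree("/workspace/book/book", os.path.join(output_dir, "book"), dirs_exist_ok=True)
    shutil.copytree(wiki_dir, os.path.join(output_dir, "markdown"), dirs_exist_ok=True)
```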
Phase Coordination
The build-docs.sh orchestrator coordinates all three phases and handles the decision point for markdown-only mode.
Orchestrator Control Flow
flowchart TD
Start["Container entrypoint\nCMD: /usr/local/bin/build-docs.sh"]
Start --> ParseEnv["Parse environment\n$REPO, $BOOK_TITLE, $BOOK_AUTHORS\n$MARKDOWN_ONLY, $GIT_REPO_URL"]
ParseEnv --> CheckRepo{"$REPO\nset?"}
CheckRepo -->|No| GitDetect["git config --get remote.origin.url\nsed -E 's#.*github.com[:/]([^/]+/[^/.]+).*#\1#'"]
CheckRepo -->|Yes| SetVars["Set defaults:\nBOOK_AUTHORS=$REPO_OWNER\nGIT_REPO_URL=https://github.com/$REPO"]
GitDetect --> SetVars
SetVars --> SetPaths["WORK_DIR=/workspace\nWIKI_DIR=/workspace/wiki\nRAW_DIR=/workspace/raw_markdown\nOUTPUT_DIR=/output\nBOOK_DIR=/workspace/book"]
SetPaths --> CallScraper["python3 /usr/local/bin/deepwiki-scraper.py $REPO $WIKI_DIR"]
CallScraper --> ScraperRuns["deepwiki-scraper.py executes:\nPhase 1: extract_wiki_structure()\nPhase 2: extract_and_enhance_diagrams()"]
ScraperRuns --> CheckMode{"$MARKDOWN_ONLY\n== 'true'?"}
CheckMode -->|Yes| QuickCopy["rm -rf $OUTPUT_DIR/markdown\nmkdir -p $OUTPUT_DIR/markdown\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\nexit 0"]
CheckMode -->|No| MkdirBook["mkdir -p $BOOK_DIR\ncd $BOOK_DIR"]
MkdirBook --> GenToml["cat > book.toml <<EOF\n[book] title=$BOOK_TITLE\n[output.html] git-repository-url=$GIT_REPO_URL\n[preprocessor.mermaid]"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> GenSummary["Generate src/SUMMARY.md\nls $WIKI_DIR/*.md | sort -t- -k1 -n\nfor file: head -1 $file | sed 's/^# //'"]
GenSummary --> CopyToSrc["cp -r $WIKI_DIR/* src/"]
CopyToSrc --> ProcessTemplates["python3 process-template.py header.html\npython3 process-template.py footer.html\nInject into src/*.md"]
ProcessTemplates --> InstallMermaid["mdbook-mermaid install $BOOK_DIR"]
InstallMermaid --> BuildBook["mdbook build"]
BuildBook --> CopyOutputs["mkdir -p $OUTPUT_DIR\ncp -r book $OUTPUT_DIR/\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\ncp book.toml $OUTPUT_DIR/"]
QuickCopy --> Done["Exit 0"]
CopyOutputs --> Done
Sources: scripts/build-docs.sh:8-47 scripts/build-docs.sh:61-93 scripts/build-docs.sh:95-309
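The $REPO auto-detection step shown above reduces to extracting an owner/repo slug from the git remote URL. A rough Python equivalent of that shell pipeline (illustrative only; detect_repo_slug is a hypothetical name, the real code uses git config and sed as in the diagram):

```python
import re
import subprocess

def detect_repo_slug() -> str:
    """Sketch of the orchestrator's fallback when $REPO is unset."""
    url = subprocess.run(
        ["git", "config", "--get", "remote.origin.url"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Same idea as the sed expression in the diagram:
    # 's#.*github.com[:/]([^/]+/[^/.]+).*#\1#'
    match = re.search(r"github\.com[:/]([^/]+/[^/.]+)", url)
    if not match:
        raise ValueError(f"Cannot derive owner/repo from remote URL: {url}")
    return match.group(1)

# e.g. "git@github.com:owner/repo.git" -> "owner/repo"
```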
Phase 1: Clean Markdown Extraction
Phase 1 discovers the wiki structure and converts HTML pages to clean Markdown, writing files to a temporary directory (/workspace/wiki). This phase is implemented entirely in Python within deepwiki-scraper.py.
Phase 1 Data Flow
flowchart LR
DeepWiki["https://deepwiki.com/$REPO"]
DeepWiki -->|session.get base_url| ExtractStruct["extract_wiki_structure(repo, session)"]
ExtractStruct -->|soup.find_all 'a', href=re.compile ...| ParseLinks["Parse sidebar links\nPattern: /$REPO/\d+"]
ParseLinks --> PageList["pages = [\n {'number': '1', 'title': 'Overview',\n 'url': '...', 'href': '...', 'level': 0},\n {'number': '2.1', 'title': 'Sub',\n 'url': '...', 'href': '...', 'level': 1},\n ...\n]\nsorted by page number"]
PageList --> Loop["for page in pages:"]
Loop --> FetchPage["fetch_page(url, session)\nUser-Agent header\n3 retries with timeout=30"]
FetchPage --> ParseHTML["BeautifulSoup(response.text)\nRemove: nav, header, footer, aside\nFind: article or main or body"]
ParseHTML --> ConvertMD["h = html2text.HTML2Text()\nh.body_width = 0\nmarkdown = h.handle(html_content)"]
ConvertMD --> CleanFooter["clean_deepwiki_footer(markdown)\nRegex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
CleanFooter --> FixLinks["fix_wiki_link(match)\nRegex: /owner/repo/(\d+(?:\.\d+)*)-(.+)\nConvert to: section-N/N-M-slug.md"]
FixLinks --> ResolvePath["resolve_output_path(page_number, title)\nnormalized_number_parts()\nsanitize_filename()"]
ResolvePath --> WriteFile["filepath.write_text(markdown)\nMain: /workspace/wiki/N-slug.md\nSub: /workspace/wiki/section-N/N-M-slug.md"]
WriteFile --> Loop
Sources: python/deepwiki-scraper.py:116-163 python/deepwiki-scraper.py:751-877 python/deepwiki-scraper.py:213-228 python/deepwiki-scraper.py:165-211 python/deepwiki-scraper.py:22-26 python/deepwiki-scraper.py:28-53
Key Functions and Their Roles
| Function | File Location | Responsibility |
|---|---|---|
| extract_wiki_structure() | python/deepwiki-scraper.py:116-163 | Parse main wiki page, discover all pages via sidebar links matching /repo/\d+, return sorted list of page metadata |
| extract_page_content() | python/deepwiki-scraper.py:751-877 | Fetch individual page HTML, parse with BeautifulSoup, remove nav/footer elements, convert to Markdown |
| convert_html_to_markdown() | python/deepwiki-scraper.py:213-228 | Convert HTML string to Markdown using html2text.HTML2Text() with body_width=0 (no line wrapping) |
| clean_deepwiki_footer() | python/deepwiki-scraper.py:165-211 | Scan last 50 lines for DeepWiki UI patterns (Dismiss, Refresh this wiki, etc.) and truncate |
| sanitize_filename() | python/deepwiki-scraper.py:22-26 | Strip special chars, replace spaces/hyphens, convert to lowercase: re.sub(r'[^\w\s-]', '', text) |
| normalized_number_parts() | python/deepwiki-scraper.py:28-43 | Shift DeepWiki page numbers down by 1 (page 1 becomes unnumbered), split on . into parts |
| resolve_output_path() | python/deepwiki-scraper.py:45-53 | Determine filename (N-slug.md) and optional subdirectory (section-N/) based on page numbering |
| fix_wiki_link() | python/deepwiki-scraper.py:854-876 | Rewrite internal links from /owner/repo/N-title to relative paths like ../section-N/N-M-title.md |
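For illustration, a minimal Python sketch of the two conversion helpers described above. The function names and the html2text usage follow the table and diagram; the exact signatures and footer-pattern list are assumptions, not copies of the source:

```python
import html2text

def convert_html_to_markdown(html_content: str) -> str:
    # body_width=0 disables hard line wrapping, as noted in the table above.
    h = html2text.HTML2Text()
    h.body_width = 0
    return h.handle(html_content)

# Assumed subset of the DeepWiki UI patterns scanned for.
FOOTER_PATTERNS = ("Dismiss", "Refresh this wiki", "On this page", "Edit Wiki")

def clean_deepwiki_footer(markdown: str) -> str:
    # Scan only the last 50 lines for DeepWiki UI text and truncate there.
    lines = markdown.splitlines()
    start = max(0, len(lines) - 50)
    for i in range(start, len(lines)):
        if any(p.lower() in lines[i].lower() for p in FOOTER_PATTERNS):
            return "\n".join(lines[:i]).rstrip() + "\n"
    return markdown
```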
File Organization Logic
flowchart TD
PageNum["page['number']\n(from DeepWiki)"]
PageNum --> Normalize["normalized_number_parts(page_number)\nSplit on '.', shift main number down by 1\nDeepWiki '1' → []\nDeepWiki '2' → ['1']\nDeepWiki '2.1' → ['1', '1']"]
Normalize --> CheckParts{"len(parts)?"}
CheckParts -->|"0 (was page 1)"| RootOverview["Filename: overview.md\nPath: $WIKI_DIR/overview.md\nNo section dir"]
CheckParts -->|"1 (main page)"| RootMain["Filename: N-slug.md\nExample: 1-quick-start.md\nPath: $WIKI_DIR/1-quick-start.md\nNo section dir"]
CheckParts -->|"2+ (subsection)"| ExtractSection["main_section = parts[0]\nsection_dir = f'section-{main_section}'"]
ExtractSection --> CreateDir["section_path = Path($WIKI_DIR) / section_dir\nsection_path.mkdir(exist_ok=True)"]
CreateDir --> SubFile["Filename: N-M-slug.md\nExample: 1-1-installation.md\nPath: $WIKI_DIR/section-1/1-1-installation.md"]
The system organizes files hierarchically based on page numbering. DeepWiki pages are numbered starting from 1, but the system shifts them down by 1 so that the first page becomes unnumbered (the overview).
Sources: python/deepwiki-scraper.py:28-43 python/deepwiki-scraper.py:45-63 python/deepwiki-scraper.py:1332-1338
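A small sketch of the numbering and path logic, reconstructed from the diagram and table above. The function names match the source, but the bodies and signatures here are illustrative assumptions:

```python
import re
from pathlib import Path

def sanitize_filename(text: str) -> str:
    # Strip special characters, collapse spaces/underscores/hyphens, lowercase.
    text = re.sub(r"[^\w\s-]", "", text).strip().lower()
    return re.sub(r"[\s_-]+", "-", text)

def normalized_number_parts(page_number: str) -> list[str]:
    # DeepWiki numbers pages from 1; shift the main number down by 1 so that
    # page "1" becomes the unnumbered overview.
    parts = page_number.split(".")
    main = int(parts[0]) - 1
    if main == 0 and len(parts) == 1:
        return []
    return [str(main)] + parts[1:]

def resolve_output_path(wiki_dir: Path, page_number: str, title: str) -> Path:
    parts = normalized_number_parts(page_number)
    slug = sanitize_filename(title)
    if not parts:                      # DeepWiki page "1" -> overview.md
        return wiki_dir / "overview.md"
    if len(parts) == 1:                # main page -> e.g. 1-quick-start.md
        return wiki_dir / f"{parts[0]}-{slug}.md"
    section_dir = wiki_dir / f"section-{parts[0]}"   # subsection -> section-1/1-1-slug.md
    section_dir.mkdir(parents=True, exist_ok=True)
    return section_dir / f"{'-'.join(parts)}-{slug}.md"
```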
Phase 2: Diagram Enhancement
Phase 2 extracts Mermaid diagrams from the JavaScript payload and uses fuzzy matching to intelligently place them in the appropriate Markdown files. This phase operates on files in the temporary directory (/workspace/wiki).
Phase 2 Algorithm Flow
flowchart TD
Start["extract_and_enhance_diagrams(repo, temp_dir, session, diagram_source_url)"]
Start --> FetchJS["response = session.get(diagram_source_url)\nhtml_text = response.text"]
FetchJS --> ExtractRegex["pattern = r'```mermaid(?:\\r\\n|\\n|\r?\n)(.*?)(?:\\r\\n|\\n|\r?\n)```'\ndiagram_matches = re.finditer(pattern, html_text, re.DOTALL)"]
ExtractRegex --> CountTotal["print(f'Found {len(diagram_matches)} total diagrams')"]
CountTotal --> ExtractContext["for match in diagram_matches:\ncontext_start = max(0, match.start() - 2000)\ncontext_before = html_text[context_start:match.start()]"]
ExtractContext --> Unescape["Unescape escape sequences:\nreplace('\\n', '\n')\nreplace('\\t', '\t')\nreplace('\\\"', '\"')\nreplace('\\u003c', '<')\nreplace('\\u003e', '>')"]
Unescape --> ParseContext["context_lines = [l for l in context.split('\\n') if l.strip()]\nFind last_heading (line starting with #)\nExtract anchor_text (last 2-3 non-heading lines, max 300 chars)"]
ParseContext --> Normalize["normalize_mermaid_diagram(diagram)\n7 normalization passes:\nnormalize_mermaid_edge_labels()\nnormalize_mermaid_state_descriptions()\nnormalize_flowchart_nodes()\nnormalize_statement_separators()\nnormalize_empty_node_labels()\nnormalize_gantt_diagram()"]
Normalize --> BuildContexts["diagram_contexts.append({\n 'last_heading': last_heading,\n 'anchor_text': anchor_text[-300:],\n 'diagram': normalized_diagram\n})"]
BuildContexts --> ScanFiles["md_files = list(temp_dir.glob('**/*.md'))\nfor md_file in md_files:"]
ScanFiles --> SkipExisting{"re.search(r'^\s*`{3,}\s*mermaid\b',\ncontent)?"}
SkipExisting -->|Yes| ScanFiles
SkipExisting -->|No| NormalizeContent["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeContent --> MatchLoop["for idx, item in enumerate(diagram_contexts):"]
MatchLoop --> TryChunks["for chunk_size in [300, 200, 150, 100, 80]:\ntest_chunk = anchor_normalized[-chunk_size:]\npos = content_normalized.find(test_chunk)"]
TryChunks --> FoundMatch{"pos != -1?"}
FoundMatch -->|Yes| ConvertToLine["char_count = 0\nfor line_num, line in enumerate(lines):\n char_count += len(' '.join(line.split())) + 1\n if char_count >= pos: best_match_line = line_num"]
FoundMatch -->|No| TryHeading["for line_num, line in enumerate(lines):\nif heading_normalized in line_normalized:\n best_match_line = line_num"]
TryHeading --> FoundMatch2{"best_match_line != -1?"}
FoundMatch2 -->|Yes| ConvertToLine
FoundMatch2 -->|No| MatchLoop
ConvertToLine --> CheckScore{"best_match_score >= 80?"}
CheckScore -->|Yes| FindInsert["insert_line = best_match_line + 1\nSkip blank lines, skip paragraph/list"]
CheckScore -->|No| MatchLoop
FindInsert --> QueueInsert["pending_insertions.append(\n (insert_line, diagram, score, idx)\n)\ndiagrams_used.add(idx)"]
QueueInsert --> MatchLoop
MatchLoop --> SortInsert["pending_insertions.sort(key=lambda x: x[0], reverse=True)"]
SortInsert --> InsertLoop["for insert_line, diagram, score, idx in pending_insertions:\nlines.insert(insert_line, '')\nlines.insert(insert_line, f'{fence}mermaid')\nlines.insert(insert_line, diagram)\nlines.insert(insert_line, fence)\nlines.insert(insert_line, '')"]
InsertLoop --> WriteFile["with open(md_file, 'w') as f:\n f.write('\\n'.join(lines))"]
WriteFile --> ScanFiles
Sources: python/deepwiki-scraper.py:880-1276 python/deepwiki-scraper.py:899-1088 python/deepwiki-scraper.py:1149-1273 python/deepwiki-scraper.py:230-393
Fuzzy Matching Algorithm
The algorithm uses progressively shorter anchor text chunks to find the best match location for each diagram. The score threshold of 80 ensures only high-confidence matches are inserted.
Sources: python/deepwiki-scraper.py:1184-1218
flowchart LR
AnchorText["anchor_text\n(last 300 chars from context)"]
AnchorText --> NormalizeA["anchor_normalized = anchor.lower()\nanchor_normalized = ' '.join(anchor_normalized.split())"]
MDFile["markdown file content"]
MDFile --> NormalizeC["content_normalized = content.lower()\ncontent_normalized = ' '.join(content_normalized.split())"]
NormalizeA --> Loop["for chunk_size in [300, 200, 150, 100, 80]:"]
NormalizeC --> Loop
Loop --> Extract["if len(anchor_normalized) >= chunk_size:\n test_chunk = anchor_normalized[-chunk_size:]"]
Extract --> Find["pos = content_normalized.find(test_chunk)"]
Find --> FoundPos{"pos != -1?"}
FoundPos -->|Yes| CharToLine["char_count = 0\nfor line_num, line in enumerate(lines):\n char_count += len(' '.join(line.split())) + 1\n if char_count >= pos:\n best_match_line = line_num\n best_match_score = chunk_size"]
FoundPos -->|No| Loop
CharToLine --> CheckThresh{"best_match_score >= 80?"}
CheckThresh -->|Yes| Accept["Accept match\nQueue for insertion"]
CheckThresh -->|No| HeadingFallback["Try heading_normalized in line_normalized\nbest_match_score = 50"]
HeadingFallback --> CheckThresh2{"best_match_score >= 80?"}
CheckThresh2 -->|Yes| Accept
CheckThresh2 -->|No| Reject["Reject match\nSkip diagram"]
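To make the chunk search concrete, here is a minimal Python sketch of the matching loop shown above. It is illustrative only; variable names follow the diagram, find_insertion_line is a hypothetical helper, and the real implementation in deepwiki-scraper.py may differ:

```python
def find_insertion_line(anchor_text: str, content: str) -> tuple[int, int]:
    """Return (line_number, score), or (-1, 0) if no chunk matches."""
    anchor = " ".join(anchor_text.lower().split())
    normalized = " ".join(content.lower().split())
    lines = content.split("\n")

    for chunk_size in (300, 200, 150, 100, 80):
        if len(anchor) < chunk_size:
            continue
        pos = normalized.find(anchor[-chunk_size:])
        if pos == -1:
            continue
        # Map the character offset in the normalized text back to a line number.
        char_count = 0
        for line_num, line in enumerate(lines):
            char_count += len(" ".join(line.split())) + 1
            if char_count >= pos:
                return line_num, chunk_size   # score == chunk size that matched
    return -1, 0

# Only matches scoring >= 80 are queued for insertion; per the diagram, the
# heading-only fallback scores 50 and is therefore rejected on its own.
```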
Diagram Extraction from JavaScript
Diagrams are extracted from the Next.js JavaScript payload embedded in the HTML response. DeepWiki stores all diagrams for all pages in a single JavaScript bundle, which requires fuzzy matching to place each diagram in the correct file.
Extraction Method
The primary extraction pattern captures fenced Mermaid blocks with various newline representations:
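A sketch of what such a pattern can look like, assuming the fence newlines may appear either literally or as escaped \n / \r\n sequences in the payload string (the exact pattern in deepwiki-scraper.py may differ; extract_diagrams is a hypothetical helper):

```python
import re

FENCE = "`" * 3  # the literal ``` fence in the JS payload
# Assumed shape of the extraction pattern, reconstructed from the diagram above.
MERMAID_PATTERN = re.compile(
    FENCE + r"mermaid(?:\\r\\n|\\n|\r?\n)(.*?)(?:\\r\\n|\\n|\r?\n)" + FENCE,
    re.DOTALL,
)

def extract_diagrams(html_text: str) -> list[str]:
    return [m.group(1) for m in MERMAID_PATTERN.finditer(html_text)]
```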
Unescape Sequence
Each diagram block undergoes unescaping to convert JavaScript string literals to actual text:
| Escape Sequence | Replacement | Purpose |
|---|---|---|
| \\n | \n | Newline characters in diagram syntax |
| \\t | \t | Tab characters for indentation |
| \\" | " | Quoted strings in node labels |
| \\\\ | \ | Literal backslashes |
| \\u003c | < | HTML less-than entity |
| \\u003e | > | HTML greater-than entity |
| \\u0026 | & | HTML ampersand entity |
| <br/>, <br> | (space) | HTML line breaks in labels |
Sources: python/deepwiki-scraper.py:899-901 python/deepwiki-scraper.py:1039-1047
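Applied in order, the table above translates to a simple chain of string replacements. A hedged sketch follows; unescape_diagram is a hypothetical helper, and the real code may differ in ordering and edge-case handling:

```python
UNESCAPE_STEPS = [
    ("\\n", "\n"),        # escaped newlines -> real newlines
    ("\\t", "\t"),        # escaped tabs
    ('\\"', '"'),         # escaped quotes in node labels
    ("\\\\", "\\"),       # literal backslashes
    ("\\u003c", "<"),     # escaped less-than
    ("\\u003e", ">"),     # escaped greater-than
    ("\\u0026", "&"),     # escaped ampersand
]

def unescape_diagram(raw: str) -> str:
    for old, new in UNESCAPE_STEPS:
        raw = raw.replace(old, new)
    # <br/> line breaks inside labels are flattened to spaces.
    return raw.replace("<br/>", " ").replace("<br>", " ")
```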
Phase 3: mdBook Build
Phase 3 generates mdBook configuration, creates the table of contents, and builds the final HTML documentation. This phase is orchestrated by build-docs.sh and invokes Rust tools (mdbook, mdbook-mermaid).
Phase 3 Component Interactions
flowchart TD
Entry["build-docs.sh line 95\nPhase 3 starts"]
Entry --> MkdirBook["mkdir -p $BOOK_DIR\ncd $BOOK_DIR\n$BOOK_DIR=/workspace/book"]
MkdirBook --> GenToml["cat > book.toml <<EOF\n[book]\ntitle = \"$BOOK_TITLE\"\nauthors = [\"$BOOK_AUTHORS\"]\n[output.html]\ndefault-theme = \"rust\"\ngit-repository-url = \"$GIT_REPO_URL\"\n[preprocessor.mermaid]\ncommand = \"mdbook-mermaid\"\n[output.html.fold]\nenable = true"]
GenToml --> MkdirSrc["mkdir -p src"]
MkdirSrc --> InitSummary["{ echo '# Summary'; echo ''; } > src/SUMMARY.md"]
InitSummary --> FindOverview["main_pages_list=$(ls $WIKI_DIR/*.md)\noverview_file=$(printf '%s\n' $main_pages_list | grep -Ev '^[0-9]' | head -1)\ntitle=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')"]
FindOverview --> WriteOverview["echo \"[$title]($overview_file)\" >> src/SUMMARY.md\necho '' >> src/SUMMARY.md"]
WriteOverview --> SortPages["main_pages=$(printf '%s\n' $main_pages_list | awk -F/ '{print $NF}'\n | grep -E '^[0-9]' | sort -t- -k1 -n)"]
SortPages --> LoopPages["echo \"$main_pages\" | while read -r file; do"]
LoopPages --> ExtractTitle["filename=$(basename \"$file\")\ntitle=$(head -1 \"$file\" | sed 's/^# //')"]
ExtractTitle --> ExtractNum["section_num=$(echo \"$filename\" | grep -oE '^[0-9]+')"]
ExtractNum --> CheckSubdir{"[ -d \"$WIKI_DIR/section-$section_num\" ]?"}
CheckSubdir -->|Yes| WriteMain["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
WriteMain --> LoopSubs["ls $WIKI_DIR/section-$section_num/*.md | awk -F/ '{print $NF}'\n | sort -t- -k1 -n | while read subname; do"]
LoopSubs --> WriteSub["subfile=\"section-$section_num/$subname\"\nsubtitle=$(head -1 \"$subfile\" | sed 's/^# //')\necho \"  - [$subtitle]($subfile)\" >> src/SUMMARY.md"]
WriteSub --> LoopSubs
CheckSubdir -->|No| WriteStandalone["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
LoopSubs --> LoopPages
WriteStandalone --> LoopPages
LoopPages --> CopySrc["cp -r $WIKI_DIR/* src/"]
CopySrc --> ProcessTemplates["python3 /usr/local/bin/process-template.py $HEADER_TEMPLATE\npython3 /usr/local/bin/process-template.py $FOOTER_TEMPLATE\nInject into src/*.md and src/*/*.md"]
ProcessTemplates --> MermaidInstall["mdbook-mermaid install $BOOK_DIR"]
MermaidInstall --> MdBookBuild["mdbook build\nReads book.toml and src/SUMMARY.md\nProcesses src/*.md files\nGenerates book/index.html"]
MdBookBuild --> CopyOut["mkdir -p $OUTPUT_DIR\ncp -r book $OUTPUT_DIR/\ncp -r $WIKI_DIR/. $OUTPUT_DIR/markdown/\ncp book.toml $OUTPUT_DIR/"]
Sources: scripts/build-docs.sh:95-309 scripts/build-docs.sh:124-188 scripts/build-docs.sh:190-261 scripts/build-docs.sh:263-271
book.toml Generation
The orchestrator dynamically generates book.toml with runtime configuration from environment variables:
Sources: scripts/build-docs.sh:102-119
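As a sketch of what that heredoc produces, the following Python function renders the same fields shown in the Phase 3 diagram above. The render_book_toml helper is hypothetical; only the field names and values reflect the documented configuration:

```python
import os

def render_book_toml() -> str:
    """Sketch of the book.toml that build-docs.sh writes from environment variables."""
    title = os.environ.get("BOOK_TITLE", "Documentation")
    authors = os.environ.get("BOOK_AUTHORS", "")
    repo_url = os.environ.get("GIT_REPO_URL", "")
    return f"""\
[book]
title = "{title}"
authors = ["{authors}"]

[output.html]
default-theme = "rust"
git-repository-url = "{repo_url}"

[output.html.fold]
enable = true

[preprocessor.mermaid]
command = "mdbook-mermaid"
"""
```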
SUMMARY.md Generation Algorithm
The table of contents is generated by scanning the actual file structure in /workspace/wiki and extracting titles from the first line of each file:
Sources: scripts/build-docs.sh:124-188
flowchart TD
Start["Generate src/SUMMARY.md\n{ echo '# Summary'; echo ''; } > src/SUMMARY.md"]
Start --> ListFiles["main_pages_list=$(ls $WIKI_DIR/*.md 2>/dev/null || true)"]
ListFiles --> FindOverview["overview_file=$(printf '%s\n' $main_pages_list\n | awk -F/ '{print $NF}' | grep -Ev '^[0-9]'\n | head -1)"]
FindOverview --> WriteOverview["if [ -n \"$overview_file\" ]; then\n title=$(head -1 $WIKI_DIR/$overview_file | sed 's/^# //')\n echo \"[${title:-Overview}]($overview_file)\" >> src/SUMMARY.md\n echo '' >> src/SUMMARY.md\nfi"]
WriteOverview --> FilterMain["main_pages=$(printf '%s\n' $main_pages_list\n | awk -F/ '{print $NF}' | grep -E '^[0-9]'\n | sort -t- -k1 -n)"]
FilterMain --> LoopMain["echo \"$main_pages\"| while read -r file; do"]
LoopMain --> CheckFile{"[ -f \"$file\" ]?"}
CheckFile -->|Yes| GetFilename["filename=$(basename \"$file\")\ntitle=$(head -1 \"$file\" | sed 's/^# //')"]
CheckFile -->|No| LoopMain
GetFilename --> ExtractNum["section_num=$(echo \"$filename\" | grep -oE '^[0-9]+' || true)\nsection_dir=\"$WIKI_DIR/section-$section_num\""]
ExtractNum --> CheckSubdir{"[ -n \"$section_num\" ] &&\n[ -d \"$section_dir\" ]?"}
CheckSubdir -->|Yes| WriteMainWithSubs["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
WriteMainWithSubs --> ListSubs["ls $section_dir/*.md 2>/dev/null\n | awk -F/ '{print $NF}' | sort -t- -k1 -n"]
ListSubs --> LoopSubs["while read subname; do"]
LoopSubs --> CheckSubFile{"[ -f \"$subfile\" ]?"}
CheckSubFile -->|Yes| WriteSub["subfile=\"$section_dir/$subname\"\nsubfilename=$(basename \"$subfile\")\nsubtitle=$(head -1 \"$subfile\" | sed 's/^# //')\necho \"  - [$subtitle](section-$section_num/$subfilename)\" >> src/SUMMARY.md"]
CheckSubFile -->|No|LoopSubs
WriteSub --> LoopSubs
CheckSubdir -->|No| WriteStandalone["echo \"- [$title]($filename)\" >> src/SUMMARY.md"]
LoopSubs --> LoopMain
WriteStandalone --> LoopMain
LoopMain --> CountEntries["echo \"Generated SUMMARY.md with $(grep -c '\\[' src/SUMMARY.md) entries\""]
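A Python rendering of the same scan-and-emit logic may help clarify it. The real implementation is the shell loop shown above; generate_summary is a hypothetical helper, and the sorting and title extraction are simplified assumptions:

```python
from pathlib import Path

def generate_summary(wiki_dir: Path) -> str:
    """Scan files as the shell loop does; each file's first line is its title."""
    def title_of(path: Path) -> str:
        return path.read_text().splitlines()[0].lstrip("# ").strip()

    lines = ["# Summary", ""]
    overview = next((p for p in wiki_dir.glob("*.md") if not p.name[0].isdigit()), None)
    if overview:
        lines += [f"[{title_of(overview)}]({overview.name})", ""]

    main_pages = sorted(
        (p for p in wiki_dir.glob("*.md") if p.name[0].isdigit()),
        key=lambda p: int(p.name.split("-", 1)[0]),      # mirrors sort -t- -k1 -n
    )
    for page in main_pages:
        lines.append(f"- [{title_of(page)}]({page.name})")
        section_num = page.name.split("-", 1)[0]
        section_dir = wiki_dir / f"section-{section_num}"
        if section_dir.is_dir():
            subs = sorted(section_dir.glob("*.md"),
                          key=lambda p: [int(n) for n in p.name.split("-")[:2]])
            for sub in subs:
                lines.append(f"  - [{title_of(sub)}](section-{section_num}/{sub.name})")
    return "\n".join(lines) + "\n"
```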
mdBook and mdbook-mermaid Execution
The build process invokes two Rust binaries installed via the Docker multi-stage build:
| Command | Location | Purpose | Output |
|---|---|---|---|
| mdbook-mermaid install $BOOK_DIR | scripts/build-docs.sh:265 | Install Mermaid.js assets into book directory and update book.toml | mermaid.min.js, mermaid-init.js in $BOOK_DIR/ |
| mdbook build | scripts/build-docs.sh:270 | Parse SUMMARY.md, process all Markdown files, generate static HTML site | HTML files in $BOOK_DIR/book/ (subdirectory, not root) |
mdbook Build Process:
The mdbook build command performs the following operations:
- Parse Structure: Read src/SUMMARY.md to determine page hierarchy and navigation order
- Process Files: For each .md file referenced in SUMMARY.md:
  - Parse Markdown with CommonMark parser
  - Process Mermaid fenced code blocks via mdbook-mermaid preprocessor
  - Apply rust theme styles (configurable via default-theme in book.toml)
  - Generate sidebar navigation
- Generate HTML: Create HTML files with:
  - Responsive navigation sidebar
  - Client-side search functionality (elasticlunr.js)
  - “Edit this page” links using git-repository-url from book.toml
  - Syntax highlighting for code blocks
- Copy Assets: Bundle theme assets, fonts, and JavaScript libraries
Sources: scripts/build-docs.sh:263-271 scripts/build-docs.sh:102-119
Data Transformation Summary
Each phase transforms data in specific ways, with temporary directories used for intermediate work:
| Phase | Input | Processing Components | Output |
|---|---|---|---|
| Phase 1 | HTML from https://deepwiki.com/$REPO | extract_wiki_structure(), extract_page_content(), BeautifulSoup, html2text.HTML2Text(), clean_deepwiki_footer() | Clean Markdown in /workspace/wiki/ |
| Phase 2 | Markdown files + JavaScript payload from DeepWiki | extract_and_enhance_diagrams(), normalize_mermaid_diagram(), fuzzy matching with 300/200/150/100/80 char chunks | Enhanced Markdown in /workspace/wiki/ (modified in place) |
| Phase 3 | Markdown files + environment variables ($BOOK_TITLE, $BOOK_AUTHORS, etc.) | Shell script generates book.toml and src/SUMMARY.md, process-template.py, mdbook-mermaid install, mdbook build | HTML site in /workspace/book/book/ |
Working Directories:
| Directory | Purpose | Contents | Lifecycle |
|---|---|---|---|
| /workspace/wiki/ | Primary working directory | Markdown files organized by numbering scheme | Created in Phase 1, modified in Phase 2, read in Phase 3 |
| /workspace/raw_markdown/ | Debug snapshot | Copy of /workspace/wiki/ before Phase 2 enhancement | Created between Phase 1 and Phase 2, copied to /output/raw_markdown/ |
| /workspace/book/ | mdBook project directory | book.toml, src/ subdirectory, final book/ subdirectory | Created in Phase 3 |
| /workspace/book/src/ | mdBook source | Copy of /workspace/wiki/ with injected headers/footers, SUMMARY.md | Created in Phase 3 |
| /workspace/book/book/ | Final HTML output | Complete static HTML site | Generated by mdbook build |
| /output/ | Final container output | book/, markdown/, raw_markdown/, book.toml | Populated at end of Phase 3 (or end of Phase 2 if MARKDOWN_ONLY=true) |
Final Output Structure:
/output/
├── book/ # Static HTML site (from /workspace/book/book/)
│ ├── index.html
│ ├── overview.html # First page (unnumbered)
│ ├── 1-quick-start.html # Main pages
│ ├── section-1/
│ │ ├── 1-1-installation.html # Subsections
│ │ └── ...
│ ├── mermaid.min.js # Installed by mdbook-mermaid
│ ├── mermaid-init.js # Installed by mdbook-mermaid
│ └── ...
├── markdown/ # Enhanced Markdown source (from /workspace/wiki/)
│ ├── overview.md
│ ├── 1-quick-start.md
│ ├── section-1/
│ │ ├── 1-1-installation.md
│ │ └── ...
│ └── ...
├── raw_markdown/ # Pre-enhancement snapshot (from /workspace/raw_markdown/)
│ ├── overview.md # Same structure as markdown/ but without diagrams
│ └── ...
└── book.toml # mdBook configuration (from /workspace/book/book.toml)
Sources: scripts/build-docs.sh:273-294 python/deepwiki-scraper.py:1358-1366
flowchart TD
Phase1["Phase 1: Extraction\n(deepwiki-scraper.py)"]
Phase2["Phase 2: Enhancement\n(deepwiki-scraper.py)"]
Phase1 --> Phase2
Phase2 --> Check{"MARKDOWN_ONLY\n== true?"}
Check -->|Yes| FastPath["cp -r /workspace/wiki/* /output/markdown/\nExit (fast path)"]
Check -->|No| Phase3["Phase 3: mdBook Build\n(build-docs.sh)"]
Phase3 --> FullOutput["Copy book/ and markdown/ to /output/\nExit (full build)"]
FastPath --> End[/"Build complete"/]
FullOutput --> End
Conditional Execution: MARKDOWN_ONLY Mode
The MARKDOWN_ONLY environment variable allows bypassing Phase 3 for faster iteration during development:
When MARKDOWN_ONLY=true:
- Execution time: ~30-60 seconds (scraping + diagram matching only)
- Output: /output/markdown/ only
- Use case: Debugging diagram placement, testing content extraction
When MARKDOWN_ONLY=false (default):
- Execution time: ~60-120 seconds (full pipeline)
- Output: /output/book/, /output/markdown/, /output/book.toml
- Use case: Production documentation builds
Sources: build-docs.sh:60-76 README.md:55-76