This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Numbering and Path Resolution
This document explains how the system transforms DeepWiki’s page numbering scheme into the file structure used in the generated mdBook documentation. The process involves three key operations: (1) normalizing DeepWiki page numbers by shifting them down by one, (2) resolving normalized numbers to file paths and directory locations, and (3) rewriting internal wiki links to use correct relative paths.
The numbering and path resolution system is foundational to maintaining a consistent file structure and ensuring that cross-references between wiki pages function correctly in the final mdBook output.
For information about the overall markdown extraction process, see page 6. For details about file organization and directory structure, see page 10.
Overview of Operations
The system performs three distinct but related operations:
| Operation | Function | Purpose |
|---|---|---|
| Number Normalization | normalized_number_parts() | Shift DeepWiki numbers down by 1 (page 1 becomes unnumbered) |
| Path Resolution | resolve_output_path() | Generate filename and section directory from page number |
| Link Rewriting | fix_wiki_link() | Convert absolute URLs to relative markdown paths |
Sources: python/deepwiki-scraper.py:28-64
Numbering Scheme Transformation
DeepWiki Numbering Convention
DeepWiki numbers pages starting from 1, with subsections using dot notation (e.g., 1, 2, 2.1, 2.2, 3, 3.1). This numbering includes an “overview” page as page 1, which the system treats specially.
Normalization Algorithm
The normalized_number_parts() function shifts all page numbers down by one, making the overview page unnumbered and adjusting all subsequent numbers:
Diagram: Number Normalization Transformation
```mermaid
graph LR
    subgraph "DeepWiki Numbering"
        DW1["1 (Overview)"]
        DW2["2"]
        DW3["3"]
        DW4["4.1"]
        DW5["4.2"]
    end
    subgraph "normalized_number_parts()"
        Norm["Subtract 1 from\nmain number"]
    end
    subgraph "Normalized Numbering"
        N1["[] (Unnumbered)"]
        N2["[1]"]
        N3["[2]"]
        N4["[3, 1]"]
        N5["[3, 2]"]
    end
    DW1 --> Norm
    DW2 --> Norm
    DW3 --> Norm
    DW4 --> Norm
    DW5 --> Norm
    Norm --> N1
    Norm --> N2
    Norm --> N3
    Norm --> N4
    Norm --> N5
```
Sources: python/deepwiki-scraper.py:28-43
Implementation Details
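The function parses the page number string and applies the following rules:
- A non-numeric main component (e.g., "abc") makes the function return None.
- The overview page ("1") maps to an empty list and becomes unnumbered.
- For a main number greater than 1, the main number is decremented by one and any subsection parts are kept unchanged (e.g., "4.2" becomes ["3", "2"]).
- Subsections of the overview (e.g., "1.3") keep main number 1 rather than being decremented to 0.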
Diagram: normalized_number_parts() Control Flow
Sources: python/deepwiki-scraper.py:28-43
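The rules are easiest to check in code. The following is a minimal sketch reconstructed from the behavior documented on this page, not copied from python/deepwiki-scraper.py. The handling of empty input is an assumption.

```python
def normalized_number_parts(page_number):
    """Return normalized number parts as strings, or None if the input is invalid."""
    if not page_number:
        return None  # assumption: empty input is treated as invalid
    parts = str(page_number).split(".")
    try:
        main = int(parts[0])
    except ValueError:
        return None  # non-numeric main component, e.g. "abc"
    if main == 1 and len(parts) == 1:
        return []  # the overview page becomes unnumbered
    if main > 1:
        parts[0] = str(main - 1)  # shift the main number down by one
    return parts  # overview subsections such as "1.3" keep main number 1


for number in ["1", "2", "1.3", "4.2", "abc"]:
    print(number, "->", normalized_number_parts(number))
```

Returning an empty list for the overview and None for invalid input lets callers treat both with a single falsy check when deciding whether to add a numeric prefix.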
Numbering Examples
| DeepWiki Number | Description | Normalized Parts | Notes |
|---|---|---|---|
| "1" | Overview page | [] | Unnumbered in output |
| "2" | Second page | ["1"] | Becomes first numbered page |
| "3" | Third page | ["2"] | Becomes second numbered page |
| "1.3" | Overview subsection | ["1", "3"] | Special case: kept as page 1 |
| "4.2" | Subsection | ["3", "2"] | Main number decremented |
Sources: python/tests/test_numbering.py:1-13
Path Resolution
File Path Generation
The resolve_output_path() function converts normalized page numbers into file paths, determining both the filename and the optional section directory.
Diagram: Path Resolution Algorithm
```mermaid
graph TB
    Input["resolve_output_path(page_number, title)"]
    Sanitize["sanitize_filename(title)\nConvert title to safe filename slug"]
    Normalize["normalized_number_parts(page_number)\nGet normalized parts"]
    CheckParts{"Parts valid\nand non-empty?"}
    NoNumber["filename = slug + '.md'\nsection_dir = None"]
    BuildFilename["filename = parts.join('-') + '-' + slug + '.md'"]
    CheckLevel{"len(parts) > 1?"}
    WithSection["section_dir = 'section-' + parts[0]"]
    NoSection["section_dir = None"]
    Return["Return (filename, section_dir)"]
    Input --> Sanitize
    Input --> Normalize
    Sanitize --> CheckParts
    Normalize --> CheckParts
    CheckParts -->|No| NoNumber
    CheckParts -->|Yes| BuildFilename
    BuildFilename --> CheckLevel
    CheckLevel -->|Yes| WithSection
    CheckLevel -->|No| NoSection
    NoNumber --> Return
    WithSection --> Return
    NoSection --> Return
```
Sources: python/deepwiki-scraper.py:45-53
Directory Structure Mapping
Diagram: File Organization After Path Resolution
Sources: python/deepwiki-scraper.py:45-53 python/tests/test_numbering.py:15-31
Path Resolution Examples
| DeepWiki Number | Title | Filename | Section Directory |
|---|---|---|---|
| "1" | “Overview Title” | overview-title.md | None |
| "3" | “System Architecture” | 2-system-architecture.md | None |
| "5.2" | “HTML to Markdown Conversion” | 4-2-html-to-markdown-conversion.md | "section-4" |
| "2.1" | “Components” | 1-1-components.md | "section-1" |
Sources: python/tests/test_numbering.py:15-31
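The examples in the table can be reproduced with a short sketch of the path resolution step. This is an illustration under stated assumptions, not the project's code: the body of sanitize_filename() is hypothetical (lowercase, with runs of non-alphanumeric characters replaced by hyphens), and normalized_number_parts() is re-declared here only to keep the example self-contained.

```python
import re


def sanitize_filename(title):
    # Hypothetical slug helper (assumption): lowercase, hyphenate non-alphanumerics.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def normalized_number_parts(page_number):
    # Normalization rules as documented on this page.
    parts = str(page_number).split(".")
    try:
        main = int(parts[0])
    except ValueError:
        return None
    if main == 1 and len(parts) == 1:
        return []
    if main > 1:
        parts[0] = str(main - 1)
    return parts


def resolve_output_path(page_number, title):
    """Return (filename, section_dir) for a DeepWiki page number and title."""
    slug = sanitize_filename(title)
    parts = normalized_number_parts(page_number)
    if not parts:  # None (invalid) or [] (overview): no numeric prefix
        return f"{slug}.md", None
    filename = "-".join(parts) + f"-{slug}.md"
    section_dir = f"section-{parts[0]}" if len(parts) > 1 else None
    return filename, section_dir


print(resolve_output_path("5.2", "HTML to Markdown Conversion"))
```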
Link Rewriting
Target Path Construction
The build_target_path() function constructs the full relative path for link targets, including section directories when appropriate:
Diagram: Target Path Construction Logic
flowchart TD
Start["build_target_path(page_number, slug)"]
Sanitize["slug = sanitize_filename(slug)"]
Normalize["parts = normalized_number_parts(page_number)"]
CheckParts{"Parts valid?"}
SimpleFile["Return slug + '.md'"]
BuildFile["filename = parts.join('-') + '-' + slug + '.md'"]
CheckSub{"len(parts) > 1?"}
WithDir["Return 'section-' + parts[0] + '/' + filename"]
JustFile["Return filename"]
Start --> Sanitize
Start --> Normalize
Sanitize --> CheckParts
Normalize --> CheckParts
CheckParts -->|No| SimpleFile
CheckParts -->|Yes| BuildFile
BuildFile --> CheckSub
CheckSub -->|Yes| WithDir
CheckSub -->|No| JustFile
Sources: python/deepwiki-scraper.py:55-63
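The flow in the diagram can be sketched as follows. This is a hedged reconstruction of the documented logic rather than the code in python/deepwiki-scraper.py. The sanitize_filename() body is an assumption, and normalized_number_parts() is repeated so the block runs on its own.

```python
import re


def sanitize_filename(text):
    # Hypothetical slug helper (assumption): lowercase and hyphenate.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def normalized_number_parts(page_number):
    # Normalization rules as documented on this page.
    parts = str(page_number).split(".")
    try:
        main = int(parts[0])
    except ValueError:
        return None
    if main == 1 and len(parts) == 1:
        return []
    if main > 1:
        parts[0] = str(main - 1)
    return parts


def build_target_path(page_number, slug):
    """Return the link target relative to the markdown root directory."""
    slug = sanitize_filename(slug)
    parts = normalized_number_parts(page_number)
    if not parts:
        return f"{slug}.md"  # unnumbered or invalid: plain slug
    filename = "-".join(parts) + f"-{slug}.md"
    if len(parts) > 1:
        return f"section-{parts[0]}/{filename}"  # subsections live in section-N/
    return filename


print(build_target_path("2.1", "components"))
print(build_target_path("3", "installation"))
```

Unlike resolve_output_path(), which returns the filename and directory separately, this helper joins them into one path, which is the form a link target needs.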
Link Transformation Overview
DeepWiki uses absolute URL paths for internal wiki links in the format /owner/repo/N-page-title or /owner/repo/N.M-subsection-title. These links must be rewritten to relative Markdown file paths that respect the mdBook directory structure where:
- Main pages (e.g., “1-overview”, “2-architecture”) reside in the root markdown directory
- Subsections (e.g., “2.1-subsection”, “2.2-another”) reside in subdirectories named section-N/
- File names use hyphens instead of dots (e.g., 2-1-subsection.md instead of 2.1-subsection.md)
The rewriting logic must compute the correct relative path based on both the source page location and the target page location.
Sources: python/deepwiki-scraper.py:854-875
Relative Path Strategy
The system organizes markdown files into a hierarchical structure that affects link rewriting:
Diagram: File Organization Hierarchy
```mermaid
graph TB
    Root["Root Directory\n(output/markdown/)"]
    Main1["overview.md\n(Unnumbered)"]
    Main2["1-architecture.md\n(Main Page)"]
    Main3["2-installation.md\n(Main Page)"]
    Section1["section-1/\n(Subsection Directory)"]
    Section2["section-2/\n(Subsection Directory)"]
    Sub1_1["1-1-components.md\n(Subsection)"]
    Sub1_2["1-2-workflows.md\n(Subsection)"]
    Sub2_1["2-1-docker-setup.md\n(Subsection)"]
    Sub2_2["2-2-manual-setup.md\n(Subsection)"]
    Root --> Main1
    Root --> Main2
    Root --> Main3
    Root --> Section1
    Root --> Section2
    Section1 --> Sub1_1
    Section1 --> Sub1_2
    Section2 --> Sub2_1
    Section2 --> Sub2_2
```
This structure requires different relative path strategies depending on where the link originates and where it points.
Sources: python/deepwiki-scraper.py:843-851
Link Pattern Matching
The algorithm begins by matching markdown links that reference the DeepWiki URL structure using a regular expression pattern.
Diagram: Link Pattern Matching Flow
```mermaid
flowchart TD
    Start["Markdown Content"]
    Regex["Apply Regex Pattern:\n'\\]\\(/[^/]+/[^/]+/([^)]+)\\)'"]
    Extract["Extract Path Component:\ne.g., '4-query-planning'"]
    Parse["Parse Page Number and Slug:\nPattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
    PageNum["page_num\n(e.g., '2.1' or '4')"]
    Slug["slug\n(e.g., 'query-planning')"]
    Start --> Regex
    Regex --> Extract
    Extract --> Parse
    Parse --> PageNum
    Parse --> Slug
```
The regex \]\(/[^/]+/[^/]+/([^)]+)\) captures the path component after the repository identifier. For example, in `[text](/owner/repo/4-query-planning)`, it captures 4-query-planning.
Sources: python/deepwiki-scraper.py:875
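To make the two-stage match concrete, the sketch below applies both patterns quoted in this section to sample links. The helper name parse_wiki_link() and the /owner/repo/ prefix are illustrative assumptions; only the two regular expressions come from this page.

```python
import re

# Patterns as quoted in this section.
WIKI_LINK_RE = re.compile(r"\]\(/[^/]+/[^/]+/([^)]+)\)")
PAGE_RE = re.compile(r"(\d+(?:\.\d+)*)-(.+)$")


def parse_wiki_link(markdown_link):
    """Return (page_num, slug) for an internal wiki link, or None if it does not match."""
    link = WIKI_LINK_RE.search(markdown_link)
    if link is None:
        return None  # not an /owner/repo/... style link
    page = PAGE_RE.search(link.group(1))
    if page is None:
        return None  # path has no leading page number
    return page.group(1), page.group(2)


print(parse_wiki_link("[text](/owner/repo/4-query-planning)"))
print(parse_wiki_link("[Components](/owner/repo/2.1-components)"))
print(parse_wiki_link("[external](https://example.com/a/b)"))
```

The first pattern only matches targets that begin with a slash, so absolute https:// links never reach the second stage.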
Source Location Detection
The system determines the source page’s location from the current_page_info parameter passed to extract_page_content():
Diagram: Source Location Detection in extract_page_content
```mermaid
flowchart TD
    Start["extract_page_content(url, session, current_page_info)"]
    CheckInfo{"current_page_info\nprovided?"}
    NoInfo["source_section_dir = None\n(Default to root)"]
    GetPageNum["page_number = current_page_info['number']\ntitle = current_page_info['title']"]
    ResolvePath["resolve_output_path(page_number, title)\nReturns (filename, section_dir)"]
    SetSource["source_section_dir = section_dir"]
    DefineRewriter["Define fix_wiki_link()\nusing source_section_dir"]
    ApplyRewriter["markdown = re.sub(pattern, fix_wiki_link, markdown)"]
    Start --> CheckInfo
    CheckInfo -->|No| NoInfo
    CheckInfo -->|Yes| GetPageNum
    GetPageNum --> ResolvePath
    ResolvePath --> SetSource
    NoInfo --> DefineRewriter
    SetSource --> DefineRewriter
    DefineRewriter --> ApplyRewriter
```
Sources: python/deepwiki-scraper.py:843-851 python/deepwiki-scraper.py:875
Relative Path Calculation
The relative path is computed based on the combination of source and target locations:
| Source Location | Target Location | Relative Path Strategy | Example |
|---|---|---|---|
| Root | Root | Direct filename | 2-installation.md |
| Root | section-N/ | Section prefix + filename | section-1/1-1-components.md |
| section-N/ | Root | Parent directory prefix | ../2-installation.md |
| section-N/ | Same section-N/ | Direct filename | 1-2-workflows.md |
| section-N/ | Different section-M/ | Parent + section prefix | ../section-2/2-1-setup.md |
Sources: python/deepwiki-scraper.py:854-871
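The table reduces to three checks on the target path. The following sketch isolates that adjustment as a standalone function. The name relative_link() is hypothetical, since in the scraper this logic is described as part of fix_wiki_link().

```python
def relative_link(target_path, source_section_dir):
    """Adjust a root-relative target path for the directory of the source page."""
    if not source_section_dir:
        return target_path  # source is in the root: use the path as-is
    if not target_path.startswith("section-"):
        return "../" + target_path  # subsection linking up to a root page
    prefix = source_section_dir + "/"
    if target_path.startswith(prefix):
        return target_path[len(prefix):]  # same section: filename only
    return "../" + target_path  # different section: up, then down


for source, target in [
    (None, "section-1/1-1-components.md"),
    ("section-1", "2-installation.md"),
    ("section-1", "section-1/1-2-workflows.md"),
    ("section-1", "section-2/2-1-setup.md"),
]:
    print(source, "->", relative_link(target, source))
```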
The fix_wiki_link Implementation
The core implementation is a nested function fix_wiki_link defined within extract_page_content() that serves as a callback for re.sub:
Diagram: fix_wiki_link Function Control Flow
```mermaid
flowchart TD
    Start["fix_wiki_link(match)"]
    ExtractPath["full_path = match.group(1)\n(e.g., '4-query-planning')"]
    ParseLink["link_match = re.search(pattern, full_path)"]
    Success{"Match\nsuccessful?"}
    NoMatch["Return match.group(0)\n(link unchanged)"]
    ExtractParts["page_num = link_match.group(1)\nslug = link_match.group(2)"]
    BuildTarget["target_path = build_target_path(page_num, slug)"]
    CheckSource{"source_section_dir\nexists?"}
    NoSource["Return target_path\n(as-is from root)"]
    CheckTargetDir{"target_path starts\nwith 'section-'?"}
    NoDir["Add '../' prefix\nReturn '../' + target_path"]
    CheckSameSection{"target_path starts with\nsource_section_dir + '/'?"}
    SameSection["Strip section directory\nReturn filename only"]
    OtherSection["Add '../' prefix\nReturn '../' + target_path"]
    Start --> ExtractPath
    ExtractPath --> ParseLink
    ParseLink --> Success
    Success -->|No| NoMatch
    Success -->|Yes| ExtractParts
    ExtractParts --> BuildTarget
    BuildTarget --> CheckSource
    CheckSource -->|No| NoSource
    CheckSource -->|Yes| CheckTargetDir
    CheckTargetDir -->|No| NoDir
    CheckTargetDir -->|Yes| CheckSameSection
    CheckSameSection -->|Yes| SameSection
    CheckSameSection -->|No| OtherSection
```
The function delegates target path construction to build_target_path(), then adjusts the path based on the source location captured in the closure variable source_section_dir.
Sources: python/deepwiki-scraper.py:854-871
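The closure-based callback can be sketched end to end as below. This is an approximation under assumptions, not the scraper's code: make_link_rewriter() and rewrite_links() are hypothetical wrappers standing in for extract_page_content(), build_target_path() is a compact stand-in that skips slug sanitization, and the callback is assumed to return the full "](path)" replacement because the matched text includes the surrounding brackets.

```python
import re


def build_target_path(page_number, slug):
    # Compact stand-in for the documented target path construction.
    parts = page_number.split(".")
    main = int(parts[0])
    if main == 1 and len(parts) == 1:
        return f"{slug}.md"
    if main > 1:
        parts[0] = str(main - 1)
    filename = "-".join(parts) + f"-{slug}.md"
    return f"section-{parts[0]}/{filename}" if len(parts) > 1 else filename


def make_link_rewriter(source_section_dir):
    """Return a re.sub callback bound to the source page's section directory."""

    def fix_wiki_link(match):
        link_match = re.search(r"(\d+(?:\.\d+)*)-(.+)$", match.group(1))
        if not link_match:
            return match.group(0)  # malformed link: leave unchanged
        target = build_target_path(link_match.group(1), link_match.group(2))
        if source_section_dir:
            prefix = source_section_dir + "/"
            if target.startswith(prefix):
                target = target[len(prefix):]  # same section: filename only
            else:
                target = "../" + target  # root page or another section
        return f"]({target})"

    return fix_wiki_link


def rewrite_links(markdown, source_section_dir=None):
    pattern = r"\]\(/[^/]+/[^/]+/([^)]+)\)"
    return re.sub(pattern, make_link_rewriter(source_section_dir), markdown)


print(rewrite_links("See [Workflows](/owner/repo/2.2-workflows).", "section-1"))
```

Capturing source_section_dir in a closure keeps the callback signature compatible with re.sub, which passes only the match object.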
Code Entity Mapping
Key functions involved in the path resolution pipeline:
| Function | Location | Purpose |
|---|---|---|
| normalized_number_parts() | python/deepwiki-scraper.py:28-43 | Shift page numbers down by 1 |
| resolve_output_path() | python/deepwiki-scraper.py:45-53 | Convert number + title to filename and section |
| build_target_path() | python/deepwiki-scraper.py:55-63 | Construct full relative path for link targets |
| fix_wiki_link() | python/deepwiki-scraper.py:854-871 | Rewrite individual link (nested function) |
| extract_page_content() | python/deepwiki-scraper.py:751-877 | Main extraction function with link rewriting |
Sources: python/deepwiki-scraper.py:28-63 python/deepwiki-scraper.py:751-877
Link Rewriting Examples
Scenario 1: Root to Root
When a root-level page (e.g., overview.md) links to another root-level page (e.g., 2-installation.md):
- Source: overview.md (source_section_dir = None)
- Target: DeepWiki page 3 → normalized to 2-installation.md
- Input Link: `[Installation](/owner/repo/3-installation)`
- target_path: 2-installation.md (no section prefix)
- Generated Path: 2-installation.md (no adjustment needed)
- Reason: Both files are in the root directory
Sources: python/deepwiki-scraper.py:854-871
Scenario 2: Root to Subsection
When a root-level page (e.g., 1-architecture.md) links to a subsection (e.g., 2.1-components → 1-1-components.md):
- Source: 1-architecture.md (source_section_dir = None)
- Target: DeepWiki page 2.1 → normalized to section-1/1-1-components.md
- Input Link: `[Components](/owner/repo/2.1-components)`
- target_path: section-1/1-1-components.md
- Generated Path: section-1/1-1-components.md (no adjustment needed)
- Reason: Target is in a subdirectory, source is in the root
Sources: python/deepwiki-scraper.py:854-871
Scenario 3: Subsection to Root
When a subsection (e.g., section-1/1-1-components.md) links to a root-level page:
- Source: section-1/1-1-components.md (source_section_dir = "section-1")
- Target: DeepWiki page 3 → normalized to 2-installation.md
- Input Link: `[Installation](/owner/repo/3-installation)`
- target_path: 2-installation.md (doesn’t start with “section-”)
- Generated Path: ../2-installation.md (add parent directory)
- Reason: Source is in a subdirectory, target is in the parent directory
Sources: python/deepwiki-scraper.py:868-870
Scenario 4: Subsection to Same Section
When a subsection links to another subsection in the same section:
- Source: section-1/1-1-components.md (source_section_dir = "section-1")
- Target: DeepWiki page 2.2 → normalized to section-1/1-2-workflows.md
- Input Link: `[Workflows](/owner/repo/2.2-workflows)`
- target_path: section-1/1-2-workflows.md
- Generated Path: 1-2-workflows.md (strip section directory)
- Reason: Both files are in the same section-1/ directory
Sources: python/deepwiki-scraper.py:864-866
Scenario 5: Subsection to Different Section
When a subsection links to a subsection in a different section:
- Source: section-1/1-1-components.md (source_section_dir = "section-1")
- Target: DeepWiki page 3.1 → normalized to section-2/2-1-setup.md
- Input Link: `[Docker Setup](/owner/repo/3.1-setup)`
- target_path: section-2/2-1-setup.md
- Generated Path: ../section-2/2-1-setup.md (go to parent, then other section)
- Reason: Different section directories require parent navigation
Sources: python/deepwiki-scraper.py:868-870
Integration in Content Extraction Pipeline
The numbering and path resolution components integrate into the main extraction flow:
Diagram: Integration Sequence Across Extraction Pipeline
```mermaid
sequenceDiagram
    participant Main as main()
    participant EWS as extract_wiki_structure()
    participant EPC as extract_page_content()
    participant ROP as resolve_output_path()
    participant FWL as fix_wiki_link()
    Main->>EWS: Discover pages
    EWS-->>Main: pages list with DeepWiki numbers
    loop For each page
        Note over Main: page = {number: 2.1, title: Components, level: 1}
        Main->>EPC: extract_page_content(url, session, page)
        Note over EPC: Convert HTML to markdown
        EPC->>ROP: Get source location
        ROP-->>EPC: source_section_dir = "section-1"
        Note over EPC: Define fix_wiki_link() with closure over source_section_dir
        EPC->>FWL: Apply via re.sub() for each link
        FWL->>FWL: Parse target page number
        FWL->>FWL: build_target_path()
        FWL->>FWL: Adjust for source location
        FWL-->>EPC: Rewritten relative path
        EPC-->>Main: Markdown with fixed links
        Main->>ROP: Determine output path
        ROP-->>Main: (filename, section_dir)
        Note over Main: Write to section-1/1-1-components.md
    end
```
The link rewriting occurs at line 875 using re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', fix_wiki_link, markdown), which finds all internal wiki links and replaces them with their rewritten versions.
Sources: python/deepwiki-scraper.py:1310-1353 python/deepwiki-scraper.py:843-877
Edge Cases and Special Handling
Invalid Page Numbers
If normalized_number_parts() receives an invalid page number (non-numeric main component), it returns None:
Diagram: Invalid Number Handling
```mermaid
flowchart TD
    Input["normalized_number_parts('abc')"]
    Split["parts = 'abc'.split('.')"]
    TryParse["Try int(parts[0])"]
    ValueError["ValueError exception"]
    ReturnNone["Return None"]
    Input --> Split
    Split --> TryParse
    TryParse --> ValueError
    ValueError --> ReturnNone
```
This graceful failure allows resolve_output_path() and build_target_path() to fall back to simple slug-based filenames.
Sources: python/deepwiki-scraper.py:28-43
Malformed Links
If a link doesn’t match the expected pattern (\d+(?:\.\d+)*)-(.+)$, fix_wiki_link() returns the original match (match.group(0)) unchanged.
This ensures that malformed or external links are preserved in their original form.
Sources: python/deepwiki-scraper.py:856-872
Missing Page Context
If current_page_info is not provided to extract_page_content(), the function defaults to treating the source as a root-level page (source_section_dir = None).
This allows the function to work in degraded mode, though links from subsections may not be correctly rewritten.
Sources: python/deepwiki-scraper.py:843-850
Overview Page Special Case
The overview page (DeepWiki page 1) is treated specially:
- normalized_number_parts("1") returns [] (empty list)
- resolve_output_path("1", "Overview") returns ("overview.md", None)
- The file is placed at root level with no numeric prefix
Subsections of the overview (e.g., 1.3) are handled differently:
- normalized_number_parts("1.3") returns ["1", "3"]
- Main number is kept as 1 (not decremented to 0)
- These become section-1/1-3-subsection.md
Sources: python/deepwiki-scraper.py:28-43 python/tests/test_numbering.py:1-13
Performance Considerations
The link rewriting is performed using a single re.sub call with a callback function, which is efficient for typical wiki pages with dozens to hundreds of links. The regex compilation is implicit and cached by Python’s re module.
The rewriting pass scales linearly: re.sub scans the page text once, and each matched internal link requires only a few short string operations.
Sources: tools/deepwiki-scraper.py:592
Testing and Validation
The correctness of link rewriting can be validated by:
- Checking that generated links use the .md extension
- Verifying that links from subsections to main pages use ../
- Confirming that links to subsections use the section-N/ prefix when appropriate
- Testing that cross-section subsection links resolve correctly
The mdBook build process will fail if links are incorrectly rewritten, providing a validation mechanism during Phase 3 of the pipeline.
Sources: tools/deepwiki-scraper.py:547-594
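Some of these checks can be automated before the mdBook build runs. The sketch below is a hypothetical helper, not part of the project: find_broken_links() walks a generated markdown directory and reports relative .md links whose targets do not exist on disk.

```python
import re
import tempfile
from pathlib import Path

# Matches markdown link targets ending in .md, ignoring an optional #anchor.
LINK_RE = re.compile(r"\]\(([^)#]+\.md)(?:#[^)]*)?\)")


def find_broken_links(markdown_root):
    """Return (source_file, link_target) pairs for relative links that do not resolve."""
    root = Path(markdown_root)
    broken = []
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if "://" in target:
                continue  # skip absolute URLs
            if not (md_file.parent / target).resolve().is_file():
                broken.append((md_file.relative_to(root).as_posix(), target))
    return broken


# Usage example on a throwaway directory that mirrors the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "section-1").mkdir()
    (demo / "2-installation.md").write_text("Install", encoding="utf-8")
    (demo / "section-1" / "1-1-components.md").write_text(
        "[ok](../2-installation.md) [bad](2-installation.md)", encoding="utf-8"
    )
    print(find_broken_links(demo))
```

In the usage example, the second link omits the ../ prefix, so it is reported as broken for the page inside section-1/.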