This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Numbering and Path Resolution

This document explains how the system transforms DeepWiki’s page numbering scheme into the file structure used in the generated mdBook documentation. The process involves three key operations: (1) normalizing DeepWiki page numbers by shifting them down by one, (2) resolving normalized numbers to file paths and directory locations, and (3) rewriting internal wiki links to use correct relative paths.

The numbering and path resolution system is foundational to maintaining a consistent file structure and ensuring that cross-references between wiki pages function correctly in the final mdBook output.

For information about the overall markdown extraction process, see page 6. For details about file organization and directory structure, see page 10.

Overview of Operations

The system performs three distinct but related operations:

| Operation | Function | Purpose |
|---|---|---|
| Number Normalization | normalized_number_parts() | Shift DeepWiki numbers down by 1 (page 1 becomes unnumbered) |
| Path Resolution | resolve_output_path() | Generate filename and section directory from page number |
| Link Rewriting | fix_wiki_link() | Convert absolute URLs to relative markdown paths |

Sources: python/deepwiki-scraper.py:28-64

Numbering Scheme Transformation

DeepWiki Numbering Convention

DeepWiki numbers pages starting from 1, with subsections using dot notation (e.g., 1, 2, 2.1, 2.2, 3, 3.1). This numbering includes an “overview” page as page 1, which the system treats specially.

Normalization Algorithm

The normalized_number_parts() function shifts all page numbers down by one, making the overview page unnumbered and adjusting all subsequent numbers:

Diagram: Number Normalization Transformation

graph LR
    subgraph "DeepWiki Numbering"
        DW1["1 (Overview)"]
DW2["2"]
DW3["3"]
DW4["4.1"]
DW5["4.2"]
end
    
    subgraph "normalized_number_parts()"
        Norm["Subtract 1 from\nmain number"]
end
    
    subgraph "Normalized Numbering"
        N1["[] (Unnumbered)"]
N2["[1]"]
N3["[2]"]
N4["[3, 1]"]
N5["[3, 2]"]
end
    
 
   DW1 --> Norm
 
   DW2 --> Norm
 
   DW3 --> Norm
 
   DW4 --> Norm
 
   DW5 --> Norm
    
 
   Norm --> N1
 
   Norm --> N2
 
   Norm --> N3
 
   Norm --> N4
 
   Norm --> N5

Sources: python/deepwiki-scraper.py:28-43

Implementation Details

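The function parses the page number string and applies the following rules: it splits the string on dots and parses the main (first) component as an integer; a non-numeric main component yields None; page 1 on its own yields an empty list; subsections of page 1 keep 1 as their main number; for every other page the main number is decremented by one and the subsection parts are left unchanged.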

Diagram: normalized_number_parts() Control Flow

Sources: python/deepwiki-scraper.py:28-43

Numbering Examples

| DeepWiki Number | Input | Normalized Parts | Notes |
|---|---|---|---|
| "1" | Overview page | [] | Unnumbered in output |
| "2" | Second page | ["1"] | Becomes first numbered page |
| "3" | Third page | ["2"] | Becomes second numbered page |
| "1.3" | Overview subsection | ["1", "3"] | Special case: kept as page 1 |
| "4.2" | Subsection | ["3", "2"] | Main number decremented |

Sources: python/tests/test_numbering.py:1-13
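
The rules and examples above can be sketched as follows. This is a minimal reconstruction from the table, not the project's actual source; in particular, how the real function treats inputs such as "0" or an empty string is an assumption here.

```python
from typing import List, Optional


def normalized_number_parts(page_number: str) -> Optional[List[str]]:
    """Sketch of the normalization rules (reconstructed, not the real source)."""
    parts = page_number.split(".")
    try:
        main = int(parts[0])
    except ValueError:
        return None  # non-numeric main component, e.g. "abc" or ""
    if main == 1 and len(parts) == 1:
        return []  # the overview page becomes unnumbered
    if main > 1:
        main -= 1  # every page after the overview shifts down by one
    # Overview subsections such as "1.3" keep 1 as their main number.
    # Assumption: a main number below 1 is passed through unchanged.
    return [str(main)] + parts[1:]
```

Running the table's inputs through this sketch gives the listed outputs, for example `normalized_number_parts("4.2")` yields `["3", "2"]`.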

Path Resolution

graph TB
    Input["resolve_output_path(page_number, title)"]
Sanitize["sanitize_filename(title)\nConvert title to safe filename slug"]
Normalize["normalized_number_parts(page_number)\nGet normalized parts"]
CheckParts{"Parts valid\nand non-empty?"}
NoNumber["filename = slug + '.md'\nsection_dir = None"]
BuildFilename["filename = '-'.join(parts) + '-' + slug + '.md'"]
CheckLevel{"len(parts) > 1?"}
WithSection["section_dir = 'section-' + parts[0]"]
NoSection["section_dir = None"]
Return["Return (filename, section_dir)"]
Input --> Sanitize
 
   Input --> Normalize
    
 
   Sanitize --> CheckParts
 
   Normalize --> CheckParts
    
 
   CheckParts -->|No| NoNumber
 
   CheckParts -->|Yes| BuildFilename
    
 
   BuildFilename --> CheckLevel
 
   CheckLevel -->|Yes| WithSection
 
   CheckLevel -->|No| NoSection
    
 
   NoNumber --> Return
 
   WithSection --> Return
 
   NoSection --> Return

File Path Generation

The resolve_output_path() function converts normalized page numbers into file paths, determining both the filename and the optional section directory.

Diagram: Path Resolution Algorithm

Sources: python/deepwiki-scraper.py:45-53

Directory Structure Mapping

Diagram: File Organization After Path Resolution

Sources: python/deepwiki-scraper.py:45-53 python/tests/test_numbering.py:15-31

Path Resolution Examples

| DeepWiki Number | Title | Filename | Section Directory |
|---|---|---|---|
| "1" | “Overview Title” | overview-title.md | None |
| "3" | “System Architecture” | 2-system-architecture.md | None |
| "5.2" | “HTML to Markdown Conversion” | 4-2-html-to-markdown-conversion.md | "section-4" |
| "2.1" | “Components” | 1-1-components.md | "section-1" |

Sources: python/tests/test_numbering.py:15-31
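
A rough sketch of the resolution step that reproduces the examples above. The bodies of `sanitize_filename()` and `normalized_number_parts()` are stand-ins written for this illustration (the real slug rules may differ); only the function names and the input/output pairs come from this page.

```python
import re
from typing import List, Optional, Tuple


def sanitize_filename(title: str) -> str:
    # Hypothetical slug rule: lowercase, runs of other characters become "-".
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def normalized_number_parts(page_number: str) -> Optional[List[str]]:
    # Compact stand-in for the normalization described earlier.
    parts = page_number.split(".")
    try:
        main = int(parts[0])
    except ValueError:
        return None
    if main == 1 and len(parts) == 1:
        return []
    return [str(main - 1 if main > 1 else main)] + parts[1:]


def resolve_output_path(page_number: str, title: str) -> Tuple[str, Optional[str]]:
    slug = sanitize_filename(title)
    parts = normalized_number_parts(page_number)
    if not parts:  # None (invalid number) or [] (overview page)
        return f"{slug}.md", None
    filename = "-".join(parts) + f"-{slug}.md"
    # Only subsections (more than one part) live in a section directory.
    section_dir = f"section-{parts[0]}" if len(parts) > 1 else None
    return filename, section_dir
```

With these stand-ins, `resolve_output_path("5.2", "HTML to Markdown Conversion")` returns `("4-2-html-to-markdown-conversion.md", "section-4")`.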

Target Path Construction

The build_target_path() function constructs the full relative path for link targets, including section directories when appropriate:

Diagram: Target Path Construction Logic

flowchart TD
    Start["build_target_path(page_number, slug)"]
Sanitize["slug = sanitize_filename(slug)"]
Normalize["parts = normalized_number_parts(page_number)"]
CheckParts{"Parts valid?"}
SimpleFile["Return slug + '.md'"]
BuildFile["filename = '-'.join(parts) + '-' + slug + '.md'"]
CheckSub{"len(parts) > 1?"}
WithDir["Return 'section-' + parts[0] + '/' + filename"]
JustFile["Return filename"]
Start --> Sanitize
 
   Start --> Normalize
    
 
   Sanitize --> CheckParts
 
   Normalize --> CheckParts
    
 
   CheckParts -->|No| SimpleFile
 
   CheckParts -->|Yes| BuildFile
    
 
   BuildFile --> CheckSub
 
   CheckSub -->|Yes| WithDir
 
   CheckSub -->|No| JustFile

Sources: python/deepwiki-scraper.py:55-63
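
The flowchart above translates roughly into the sketch below. The helper bodies are assumptions made so the example runs on its own; the branch structure follows the diagram.

```python
import re
from typing import List, Optional


def sanitize_filename(text: str) -> str:
    # Hypothetical slug rule, assumed for this example only.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def normalized_number_parts(page_number: str) -> Optional[List[str]]:
    # Compact stand-in for the normalization described earlier.
    parts = page_number.split(".")
    try:
        main = int(parts[0])
    except ValueError:
        return None
    if main == 1 and len(parts) == 1:
        return []
    return [str(main - 1 if main > 1 else main)] + parts[1:]


def build_target_path(page_number: str, slug: str) -> str:
    slug = sanitize_filename(slug)
    parts = normalized_number_parts(page_number)
    if not parts:
        return f"{slug}.md"  # overview page or unparseable number
    filename = "-".join(parts) + f"-{slug}.md"
    if len(parts) > 1:
        return f"section-{parts[0]}/{filename}"  # subsection
    return filename  # main page at root level
```

The returned path is always relative to the markdown root; adjusting it for the page that contains the link is a separate step.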

Link Rewriting

DeepWiki uses absolute URL paths for internal wiki links in the format /owner/repo/N-page-title or /owner/repo/N.M-subsection-title. These links must be rewritten to relative Markdown file paths that respect the mdBook directory structure where:

  • Main pages (e.g., “1-overview”, “2-architecture”) reside in the root markdown directory
  • Subsections (e.g., “2.1-subsection”, “2.2-another”) reside in subdirectories named section-N/
  • File names use hyphens instead of dots (e.g., 2-1-subsection.md instead of 2.1-subsection.md)

The rewriting logic must compute the correct relative path based on both the source page location and the target page location.

Sources: python/deepwiki-scraper.py:854-875

graph TB
    Root["Root Directory\n(output/markdown/)"]
Main1["overview.md\n(Unnumbered)"]
Main2["1-architecture.md\n(Main Page)"]
Main3["2-installation.md\n(Main Page)"]
Section1["section-1/\n(Subsection Directory)"]
Section2["section-2/\n(Subsection Directory)"]
Sub1_1["1-1-components.md\n(Subsection)"]
Sub1_2["1-2-workflows.md\n(Subsection)"]
Sub2_1["2-1-docker-setup.md\n(Subsection)"]
Sub2_2["2-2-manual-setup.md\n(Subsection)"]
Root --> Main1
 
   Root --> Main2
 
   Root --> Main3
 
   Root --> Section1
 
   Root --> Section2
    
 
   Section1 --> Sub1_1
 
   Section1 --> Sub1_2
    
 
   Section2 --> Sub2_1
 
   Section2 --> Sub2_2

Relative Path Strategy

The system organizes markdown files into a hierarchical structure that affects link rewriting:

Diagram: File Organization Hierarchy

This structure requires different relative path strategies depending on where the link originates and where it points.

Sources: python/deepwiki-scraper.py:843-851

flowchart TD
    Start["Markdown Content"]
Regex["Apply Regex Pattern:\n'\\]\\(/[^/]+/[^/]+/([^)]+)\\)'"]
Extract["Extract Path Component:\ne.g., '4-query-planning'"]
Parse["Parse Page Number and Slug:\nPattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
PageNum["page_num\n(e.g., '2.1' or '4')"]
Slug["slug\n(e.g., 'query-planning')"]
Start --> Regex
 
   Regex --> Extract
 
   Extract --> Parse
 
   Parse --> PageNum
 
   Parse --> Slug

The algorithm begins by matching markdown links that reference the DeepWiki URL structure using a regular expression pattern.

Diagram: Link Pattern Matching Flow

The regex \]\(/[^/]+/[^/]+/([^)]+)\) captures the path component after the repository identifier. For example, in a link whose target is /owner/repo/4-query-planning, it captures 4-query-planning.

Sources: python/deepwiki-scraper.py:875
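
To see the two patterns from the diagram in action, the following sketch applies them to a small markdown string. The function name `parse_wiki_links` and the sample links are made up for illustration; the regular expressions are the ones quoted on this page.

```python
import re
from typing import List, Optional, Tuple

# Matches "](/owner/repo/<path>)" and captures <path>.
LINK_RE = re.compile(r"\]\(/[^/]+/[^/]+/([^)]+)\)")
# Splits "<page number>-<slug>", e.g. "2.1-components".
PAGE_RE = re.compile(r"(\d+(?:\.\d+)*)-(.+)$")


def parse_wiki_links(markdown: str) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (path, page_num, slug) for each internal wiki link found."""
    results = []
    for match in LINK_RE.finditer(markdown):
        path = match.group(1)
        page = PAGE_RE.search(path)
        if page:
            results.append((path, page.group(1), page.group(2)))
        else:
            results.append((path, None, None))  # no page number in the path
    return results
```

Links that do not start with a leading slash, such as external https URLs, never match the first pattern and are therefore skipped.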

flowchart TD
    Start["extract_page_content(url, session, current_page_info)"]
CheckInfo{"current_page_info\nprovided?"}
NoInfo["source_section_dir = None\n(Default to root)"]
GetPageNum["page_number = current_page_info['number']\ntitle = current_page_info['title']"]
ResolvePath["resolve_output_path(page_number, title)\nReturns (filename, section_dir)"]
SetSource["source_section_dir = section_dir"]
DefineRewriter["Define fix_wiki_link()\nusing source_section_dir"]
ApplyRewriter["markdown = re.sub(pattern, fix_wiki_link, markdown)"]
Start --> CheckInfo
 
   CheckInfo -->|No| NoInfo
 
   CheckInfo -->|Yes| GetPageNum
    
 
   GetPageNum --> ResolvePath
 
   ResolvePath --> SetSource
    
 
   NoInfo --> DefineRewriter
 
   SetSource --> DefineRewriter
    
 
   DefineRewriter --> ApplyRewriter

Source Location Detection

The system determines the source page’s location from the current_page_info parameter passed to extract_page_content():

Diagram: Source Location Detection in extract_page_content

Sources: python/deepwiki-scraper.py:843-851 python/deepwiki-scraper.py:875

Relative Path Calculation

The relative path is computed based on the combination of source and target locations:

| Source Location | Target Location | Relative Path Strategy | Example |
|---|---|---|---|
| Root | Root | Direct filename | 2-installation.md |
| Root | section-N/ | Section prefix + filename | section-1/1-1-components.md |
| section-N/ | Root | Parent directory prefix | ../2-installation.md |
| section-N/ | Same section-N/ | Direct filename | 1-2-workflows.md |
| section-N/ | Different section-M/ | Parent + section prefix | ../section-2/2-1-setup.md |

Sources: python/deepwiki-scraper.py:854-871
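
As a cross-check, the table can be reproduced with the standard library's `posixpath.relpath`. This is a different technique from the prefix checks that fix_wiki_link() is described as using, shown here only to confirm the expected outputs; the function name `relative_link` is hypothetical.

```python
import posixpath
from typing import Optional


def relative_link(target_path: str, source_section_dir: Optional[str]) -> str:
    """Compute the link from a page in source_section_dir to target_path.

    Both arguments are relative to the markdown root; None means the
    source page sits at root level.
    """
    start = source_section_dir or "."
    return posixpath.relpath(target_path, start=start)
```

For example, `relative_link("section-2/2-1-setup.md", "section-1")` gives `"../section-2/2-1-setup.md"`, matching the last row of the table.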

flowchart TD
    Start["fix_wiki_link(match)"]
ExtractPath["full_path = match.group(1)\n(e.g., '4-query-planning')"]
ParseLink["link_match = re.search(pattern, full_path)"]
Success{"Match\nsuccessful?"}
NoMatch["Return match.group(0)\n(link unchanged)"]
ExtractParts["page_num = link_match.group(1)\nslug = link_match.group(2)"]
BuildTarget["target_path = build_target_path(page_num, slug)"]
CheckSource{"source_section_dir\nexists?"}
NoSource["Return target_path\n(as-is from root)"]
CheckTargetDir{"target_path starts\nwith 'section-'?"}
NoDir["Add '../' prefix\nReturn '../' + target_path"]
CheckSameSection{"target_path starts with\nsource_section_dir + '/'?"}
SameSection["Strip section directory\nReturn filename only"]
OtherSection["Add '../' prefix\nReturn '../' + target_path"]
Start --> ExtractPath
 
   ExtractPath --> ParseLink
 
   ParseLink --> Success
    
 
   Success -->|No| NoMatch
 
   Success -->|Yes| ExtractParts
    
 
   ExtractParts --> BuildTarget
 
   BuildTarget --> CheckSource
    
 
   CheckSource -->|No| NoSource
 
   CheckSource -->|Yes| CheckTargetDir
    
 
   CheckTargetDir -->|No| NoDir
 
   CheckTargetDir -->|Yes| CheckSameSection
    
 
   CheckSameSection -->|Yes| SameSection
 
   CheckSameSection -->|No| OtherSection

The core implementation is a nested function fix_wiki_link defined within extract_page_content() that serves as a callback for re.sub:

Diagram: fix_wiki_link Function Control Flow

The function delegates target path construction to build_target_path(), then adjusts the path based on the source location captured in the closure variable source_section_dir.

Sources: python/deepwiki-scraper.py:854-871
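
The closure pattern described above might look roughly like this. The wrapper `rewrite_links` and the lookup-table `demo_target` are hypothetical stand-ins so the example runs without the rest of the scraper, and wrapping the result as `](...)` is an assumption about how the replacement string is rebuilt.

```python
import re
from typing import Callable, Optional

LINK_RE = re.compile(r"\]\(/[^/]+/[^/]+/([^)]+)\)")
PAGE_RE = re.compile(r"(\d+(?:\.\d+)*)-(.+)$")


def rewrite_links(markdown: str,
                  source_section_dir: Optional[str],
                  build_target_path: Callable[[str, str], str]) -> str:
    def fix_wiki_link(match: "re.Match[str]") -> str:
        link_match = PAGE_RE.search(match.group(1))
        if not link_match:
            return match.group(0)  # unrecognized link: leave unchanged
        target = build_target_path(link_match.group(1), link_match.group(2))
        if source_section_dir:
            prefix = source_section_dir + "/"
            if target.startswith(prefix):
                target = target[len(prefix):]  # same section: filename only
            else:
                target = "../" + target  # root page or another section
        return f"]({target})"

    # fix_wiki_link closes over source_section_dir and runs once per link.
    return LINK_RE.sub(fix_wiki_link, markdown)


def demo_target(page_num: str, slug: str) -> str:
    # Hypothetical stand-in for build_target_path(): a fixed lookup table.
    table = {
        "3": "2-installation.md",
        "2.2": "section-1/1-2-workflows.md",
        "3.1": "section-2/2-1-setup.md",
    }
    return table[page_num]
```

The two `'../'` branches of the flowchart collapse into one `else` here because both produce the same prefix; the results are the same as following the diagram step by step.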

Code Entity Mapping

Key functions involved in the path resolution pipeline:

| Function | Location | Purpose |
|---|---|---|
| normalized_number_parts() | python/deepwiki-scraper.py:28-43 | Shift page numbers down by 1 |
| resolve_output_path() | python/deepwiki-scraper.py:45-53 | Convert number + title to filename and section |
| build_target_path() | python/deepwiki-scraper.py:55-63 | Construct full relative path for link targets |
| fix_wiki_link() | python/deepwiki-scraper.py:854-871 | Rewrite individual link (nested function) |
| extract_page_content() | python/deepwiki-scraper.py:751-877 | Main extraction function with link rewriting |

Sources: python/deepwiki-scraper.py:28-63 python/deepwiki-scraper.py:751-877

Scenario 1: Root to Root

When a root-level page (e.g., overview.md) links to another root-level page (e.g., 2-installation.md):

  • Source: overview.md (source_section_dir = None)
  • Target: DeepWiki page 3 → normalized to 2-installation.md
  • Input Link: [Installation](/owner/repo/3-installation)
  • target_path: 2-installation.md (no section prefix)
  • Generated Path: 2-installation.md (no adjustment needed)
  • Reason: Both files are in root directory

Sources: python/deepwiki-scraper.py:854-871

Scenario 2: Root to Subsection

When a root-level page (e.g., 1-architecture.md) links to a subsection (e.g., DeepWiki page 2.1-components, which becomes section-1/1-1-components.md):

  • Source: 1-architecture.md (source_section_dir = None)
  • Target: DeepWiki page 2.1 → normalized to section-1/1-1-components.md
  • Input Link: [Components](/owner/repo/2.1-components)
  • target_path: section-1/1-1-components.md
  • Generated Path: section-1/1-1-components.md (no adjustment needed)
  • Reason: Target is in subdirectory, source is in root

Sources: python/deepwiki-scraper.py:854-871

Scenario 3: Subsection to Root

When a subsection (e.g., section-1/1-1-components.md) links to a root-level page:

  • Source: section-1/1-1-components.md (source_section_dir = "section-1")
  • Target: DeepWiki page 3 → normalized to 2-installation.md
  • Input Link: [Installation](/owner/repo/3-installation)
  • target_path: 2-installation.md (doesn’t start with “section-”)
  • Generated Path: ../2-installation.md (add parent directory)
  • Reason: Source is in subdirectory, target is in parent directory

Sources: python/deepwiki-scraper.py:868-870

Scenario 4: Subsection to Same Section

When a subsection links to another subsection in the same section:

  • Source: section-1/1-1-components.md (source_section_dir = "section-1")
  • Target: DeepWiki page 2.2 → normalized to section-1/1-2-workflows.md
  • Input Link: [Workflows](/owner/repo/2.2-workflows)
  • target_path: section-1/1-2-workflows.md
  • Generated Path: 1-2-workflows.md (strip section directory)
  • Reason: Both files are in same section-1/ directory

Sources: python/deepwiki-scraper.py:864-866

Scenario 5: Subsection to Different Section

When a subsection links to a subsection in a different section:

  • Source: section-1/1-1-components.md (source_section_dir = "section-1")
  • Target: DeepWiki page 3.1 → normalized to section-2/2-1-setup.md
  • Input Link: [Docker Setup](/owner/repo/3.1-setup)
  • target_path: section-2/2-1-setup.md
  • Generated Path: ../section-2/2-1-setup.md (go to parent, then other section)
  • Reason: Different section directories require parent navigation

Sources: python/deepwiki-scraper.py:868-870

sequenceDiagram
    participant Main as main()
    participant EWS as extract_wiki_structure()
    participant EPC as extract_page_content()
    participant ROP as resolve_output_path()
    participant FWL as fix_wiki_link()
    
    Main->>EWS: Discover pages
    EWS-->>Main: pages list with DeepWiki numbers
    
    loop For each page
        Note over Main: page = {number: 2.1, title: Components, level: 1}
        
        Main->>EPC: extract_page_content(url, session, page)
        Note over EPC: Convert HTML to markdown
        
        EPC->>ROP: Get source location
        ROP-->>EPC: source_section_dir = "section-1"
        
        Note over EPC: Define fix_wiki_link() with closure over source_section_dir
        
        EPC->>FWL: Apply via re.sub() for each link
        FWL->>FWL: Parse target page number
        FWL->>FWL: build_target_path()
        FWL->>FWL: Adjust for source location
        FWL-->>EPC: Rewritten relative path
        
        EPC-->>Main: Markdown with fixed links
        
        Main->>ROP: Determine output path
        ROP-->>Main: (filename, section_dir)
        
        Note over Main: Write to section-1/1-1-components.md
    end

Integration in Content Extraction Pipeline

The numbering and path resolution components integrate into the main extraction flow:

Diagram: Integration Sequence Across Extraction Pipeline

The link rewriting occurs at line 875 using re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', fix_wiki_link, markdown), which finds all internal wiki links and replaces them with their rewritten versions.

Sources: python/deepwiki-scraper.py:1310-1353 python/deepwiki-scraper.py:843-877

flowchart TD
    Input["normalized_number_parts('abc')"]
Split["parts = 'abc'.split('.')"]
TryParse["Try int(parts[0])"]
ValueError["ValueError exception"]
ReturnNone["Return None"]
Input --> Split
 
   Split --> TryParse
 
   TryParse --> ValueError
 
   ValueError --> ReturnNone

Edge Cases and Special Handling

Invalid Page Numbers

If normalized_number_parts() receives an invalid page number (non-numeric main component), it returns None:

Diagram: Invalid Number Handling

This graceful failure allows resolve_output_path() and build_target_path() to fall back to simple slug-based filenames.

Sources: python/deepwiki-scraper.py:28-43

If a link doesn’t match the expected pattern (\d+(?:\.\d+)*)-(.+)$, fix_wiki_link() returns the original match unchanged.

This ensures that malformed or external links are preserved in their original form.

Sources: python/deepwiki-scraper.py:856-872

Missing Page Context

If current_page_info is not provided to extract_page_content(), the function defaults to treating the source as a root-level page (source_section_dir = None).

This allows the function to work in degraded mode, though links from subsections may not be correctly rewritten.

Sources: python/deepwiki-scraper.py:843-850

Overview Page Special Case

The overview page (DeepWiki page 1) is treated specially:

  • normalized_number_parts("1") returns [] (empty list)
  • resolve_output_path("1", "Overview") returns ("overview.md", None)
  • The file is placed at root level with no numeric prefix

Subsections of the overview (e.g., 1.3) are handled differently:

  • normalized_number_parts("1.3") returns ["1", "3"]
  • Main number is kept as 1 (not decremented to 0)
  • These become section-1/1-3-subsection.md

Sources: python/deepwiki-scraper.py:28-43 python/tests/test_numbering.py:1-13

Performance Considerations

The link rewriting is performed using a single re.sub call with a callback function, which is efficient for typical wiki pages with dozens to hundreds of links. The regex compilation is implicit and cached by Python’s re module.

The fix_wiki_link callback runs once per matched link and performs only constant-time string operations, so the rewriting work is O(n) in the number of internal links n, in addition to the single regex scan over the page's markdown.

Sources: tools/deepwiki-scraper.py:592

Testing and Validation

The correctness of link rewriting can be validated by:

  1. Checking that generated links use .md extension
  2. Verifying that links from subsections to main pages use ../
  3. Confirming that links to subsections use the section-N/ prefix when appropriate
  4. Testing cross-section subsection links resolve correctly

The mdBook build process will fail if links are incorrectly rewritten, providing a validation mechanism during Phase 3 of the pipeline.

Sources: tools/deepwiki-scraper.py:547-594
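
The checks listed above could also be automated before the mdBook build. The sketch below is one possible approach, not part of the project: `find_broken_links` is a hypothetical helper that takes an in-memory mapping of output paths to markdown content and reports relative .md links whose targets are missing.

```python
import posixpath
import re
from typing import Dict, List, Tuple

# Captures the target of links ending in ".md", ignoring any "#anchor".
MD_LINK_RE = re.compile(r"\]\(([^)#\s]+\.md)(?:#[^)]*)?\)")


def find_broken_links(pages: Dict[str, str]) -> List[Tuple[str, str]]:
    """pages maps paths relative to the markdown root to page content.

    Returns (source_path, link) pairs whose target is not in pages.
    """
    broken = []
    for source, content in pages.items():
        base = posixpath.dirname(source)  # "" for root-level pages
        for match in MD_LINK_RE.finditer(content):
            link = match.group(1)
            if "://" in link or link.startswith("/"):
                continue  # external or absolute link: out of scope here
            resolved = posixpath.normpath(posixpath.join(base, link))
            if resolved not in pages:
                broken.append((source, link))
    return broken
```

A subsection page that links to `2-installation.md` without the `../` prefix would be reported, since that path resolves inside its own section directory.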
