Link Rewriting Logic

This document details the algorithm for converting internal DeepWiki URL links into relative Markdown file paths during the content extraction process. The link rewriting system ensures that cross-references between wiki pages function correctly in the final mdBook output by transforming absolute web URLs into appropriate relative file paths based on the hierarchical structure of the documentation.

For information about the overall markdown extraction process, see Phase 1: Markdown Extraction. For details about file organization and directory structure, see Output Structure.

Overview

DeepWiki uses absolute URL paths for internal wiki links in the format /owner/repo/N-page-title or /owner/repo/N.M-subsection-title. These links must be rewritten to relative Markdown file paths that respect the mdBook directory structure where:

  • Main pages (e.g., "1-overview", "2-architecture") reside in the root markdown directory
  • Subsections (e.g., "2.1-subsection", "2.2-another") reside in subdirectories named section-N/
  • File names use hyphens instead of dots (e.g., 2-1-subsection.md instead of 2.1-subsection.md)

The rewriting logic must compute the correct relative path based on both the source page location and the target page location.

Sources: tools/deepwiki-scraper.py:547-594

Directory Structure Context

The system organizes markdown files into a hierarchical structure that affects link rewriting:

Diagram: File Organization Hierarchy

graph TB
    Root["Root Directory\n(output/markdown/)"]
    Main1["1-overview.md\n(Main Page)"]
    Main2["2-architecture.md\n(Main Page)"]
    Main3["3-installation.md\n(Main Page)"]
    Section2["section-2/\n(Subsection Directory)"]
    Section3["section-3/\n(Subsection Directory)"]
    Sub2_1["2-1-components.md\n(Subsection)"]
    Sub2_2["2-2-workflows.md\n(Subsection)"]
    Sub3_1["3-1-docker-setup.md\n(Subsection)"]
    Sub3_2["3-2-manual-setup.md\n(Subsection)"]
    Root --> Main1
    Root --> Main2
    Root --> Main3
    Root --> Section2
    Root --> Section3
    Section2 --> Sub2_1
    Section2 --> Sub2_2
    Section3 --> Sub3_1
    Section3 --> Sub3_2

This structure requires different relative path strategies depending on where the link originates and where it points.
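
The layout above can be sketched as a small mapping from a DeepWiki page identifier to its location under output/markdown/. This is an illustrative sketch only: the function name output_path is hypothetical and is not taken from deepwiki-scraper.py.

```python
def output_path(page_id: str) -> str:
    """Map a page id such as '2.1-components' to its path under the markdown root.

    Hypothetical helper for illustration; it mirrors the layout described above.
    """
    # Split '2.1-components' into the page number and the slug.
    number, _, slug = page_id.partition("-")
    # File names use hyphens instead of dots.
    file_name = f"{number.replace('.', '-')}-{slug}.md"
    if "." in number:
        # Subsections live in a directory named after their main section.
        main_section = number.split(".")[0]
        return f"section-{main_section}/{file_name}"
    # Main pages live directly in the root markdown directory.
    return file_name


for page_id in ("1-overview", "2.1-components", "3.1-docker-setup"):
    print(page_id, "->", output_path(page_id))
```

Knowing where both the source and the target file live is what makes it possible to pick a relative path between them.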

Sources: tools/deepwiki-scraper.py:848-860

Input Format Detection

The algorithm begins by matching markdown links that reference the DeepWiki URL structure using a regular expression pattern.

Diagram: Link Pattern Matching Flow

flowchart TD
    Start["Markdown Content"]
    Regex["Apply Regex Pattern:\n'\\]\\(/[^/]+/[^/]+/([^)]+)\\)'"]
    Extract["Extract Path Component:\ne.g., '4-query-planning'"]
    Parse["Parse Page Number and Slug:\nPattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
    PageNum["page_num\n(e.g., '2.1' or '4')"]
    Slug["slug\n(e.g., 'query-planning')"]
    Start --> Regex
    Regex --> Extract
    Extract --> Parse
    Parse --> PageNum
    Parse --> Slug

The regex \]\(/[^/]+/[^/]+/([^)]+)\) captures the path component after the repository identifier. For example, in [text](/owner/repo/4-query-planning), it captures 4-query-planning.
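
The two patterns can be exercised together as follows. This is a minimal sketch, not the scraper's code: the helper name parse_internal_links and the sample text are made up, and anchoring the second pattern at the start of the path with re.match is an assumption.

```python
import re

# Outer pattern: the "](/owner/repo/<path>)" tail of a Markdown link.
LINK_PATTERN = re.compile(r'\]\(/[^/]+/[^/]+/([^)]+)\)')
# Inner pattern: "<page number>-<slug>" within the captured path.
PAGE_PATTERN = re.compile(r'(\d+(?:\.\d+)*)-(.+)$')


def parse_internal_links(markdown: str) -> list[tuple[str, str]]:
    """Return (page_num, slug) for every internal wiki link in the text."""
    results = []
    for link in LINK_PATTERN.finditer(markdown):
        path = link.group(1)
        # Assumption: the page number starts the captured path.
        page = PAGE_PATTERN.match(path)
        if page:
            results.append((page.group(1), page.group(2)))
    return results


sample = (
    "See [Query Planning](/owner/repo/4-query-planning) and "
    "[Components](/owner/repo/2.1-components), or an "
    "[external site](https://example.com/)."
)
print(parse_internal_links(sample))
# → [('4', 'query-planning'), ('2.1', 'components')]
```

The external link is skipped because the outer pattern requires the URL to begin with a slash immediately after the opening parenthesis.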

Sources: tools/deepwiki-scraper.py:592

Page Classification Logic

Each page (source and target) is classified based on whether it contains a dot in its page number, indicating a subsection.

Diagram: Page Type Classification

graph TB
    subgraph "Target Classification"
        TargetNum["Target Page Number"]
        CheckDot["Contains '.' ?"]
        IsTargetSub["is_target_subsection = True\ntarget_main_section = N"]
        IsTargetMain["is_target_subsection = False\ntarget_main_section = None"]
        TargetNum --> CheckDot
        CheckDot -->|Yes "2.1"| IsTargetSub
        CheckDot -->|No "2"| IsTargetMain
    end

    subgraph "Source Classification"
        SourceInfo["current_page_info"]
        SourceLevel["Check 'level' field"]
        IsSourceSub["is_source_subsection = True\nsource_main_section = N"]
        IsSourceMain["is_source_subsection = False\nsource_main_section = None"]
        SourceInfo --> SourceLevel
        SourceLevel -->|> 0| IsSourceSub
        SourceLevel -->|= 0| IsSourceMain
    end

The level field in current_page_info is set during wiki structure discovery and indicates the depth in the hierarchy (0 for main pages, 1+ for subsections).
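
Both classifications can be sketched as two small functions. The function names are hypothetical. The text only confirms the level field, so the number key used here to derive the source's main section is an assumption about the shape of current_page_info.

```python
from typing import Optional


def classify_target(page_num: str) -> tuple[bool, Optional[str]]:
    """Return (is_target_subsection, target_main_section) for a page number."""
    if "." in page_num:
        # '2.1' belongs to main section '2'.
        return True, page_num.split(".")[0]
    return False, None


def classify_source(current_page_info: dict) -> tuple[bool, Optional[str]]:
    """Return (is_source_subsection, source_main_section) for the current page.

    Assumption: besides 'level', the dict carries the page number under a
    'number' key (hypothetical name).
    """
    level = current_page_info.get("level", 0)
    if level > 0:
        return True, str(current_page_info["number"]).split(".")[0]
    return False, None


print(classify_target("2.1"))                           # → (True, '2')
print(classify_target("4"))                             # → (False, None)
print(classify_source({"level": 1, "number": "2.1"}))   # → (True, '2')
print(classify_source({"level": 0, "number": "2"}))     # → (False, None)
```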

Sources: tools/deepwiki-scraper.py:554-570

Path Generation Decision Matrix

The relative path is computed based on the combination of source and target types:

| Source Type | Target Type | Relative Path | Example |
|---|---|---|---|
| Main Page | Main Page | {file_num}-{slug}.md | 3-installation.md |
| Main Page | Subsection | section-{N}/{file_num}-{slug}.md | section-2/2-1-components.md |
| Subsection | Main Page | ../{file_num}-{slug}.md | ../3-installation.md |
| Subsection (same section) | Subsection (same section) | {file_num}-{slug}.md | 2-2-workflows.md |
| Subsection (section A) | Subsection (section B) | section-{N}/{file_num}-{slug}.md | section-3/3-1-docker-setup.md |
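
The matrix can also be written as data, with one template per source/target combination. This is a sketch under stated assumptions: TEMPLATES and relative_path are hypothetical names, and the cross-section entry follows the path reported in the Scenario 5 example (section-3/3-1-docker-setup.md).

```python
from typing import Optional

# One template per (source kind, target kind) combination.
TEMPLATES = {
    ("main", "main"): "{file_num}-{slug}.md",
    ("main", "sub"): "section-{section}/{file_num}-{slug}.md",
    ("sub", "main"): "../{file_num}-{slug}.md",
    ("sub", "sub-same"): "{file_num}-{slug}.md",
    ("sub", "sub-other"): "section-{section}/{file_num}-{slug}.md",
}


def relative_path(source_section: Optional[str], page_num: str, slug: str) -> str:
    """Pick a template from the matrix and fill it in.

    source_section is None when the source is a main page.
    """
    file_num = page_num.replace(".", "-")
    target_section = page_num.split(".")[0] if "." in page_num else None
    source_kind = "sub" if source_section else "main"
    if target_section is None:
        target_kind = "main"
    elif source_section == target_section:
        target_kind = "sub-same"
    else:
        target_kind = "sub-other" if source_kind == "sub" else "sub"
    template = TEMPLATES[(source_kind, target_kind)]
    return template.format(section=target_section, file_num=file_num, slug=slug)


print(relative_path(None, "2.1", "components"))   # → section-2/2-1-components.md
print(relative_path("2", "3", "installation"))    # → ../3-installation.md
```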

Sources: tools/deepwiki-scraper.py:573-588

Implementation Details

The core implementation is a nested function fix_wiki_link that serves as a callback for re.sub.

Diagram: fix_wiki_link Function Control Flow

flowchart TD
    Start["fix_wiki_link(match)"]
    ExtractPath["full_path = match.group(1)"]
    ParseLink["Match pattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
    Success{"Match\nsuccessful?"}
    NoMatch["Return match.group(0)\n(unchanged)"]
    ExtractParts["page_num = match.group(1)\nslug = match.group(2)"]
    ConvertNum["file_num = page_num.replace('.', '-')"]
    ClassifyTarget["Classify target:\nis_target_subsection\ntarget_main_section"]
    ClassifySource["Classify source:\nis_source_subsection\nsource_main_section"]
    Decision{"Target is\nsubsection?"}
    DecisionYes{"Source in same\nsection?"}
    DecisionNo{"Source is\nsubsection?"}
    Path1["Return '{file_num}-{slug}.md'"]
    Path2["Return 'section-{N}/{file_num}-{slug}.md'"]
    Path3["Return '../{file_num}-{slug}.md'"]
    Start --> ExtractPath
    ExtractPath --> ParseLink
    ParseLink --> Success
    Success -->|No| NoMatch
    Success -->|Yes| ExtractParts
    ExtractParts --> ConvertNum
    ConvertNum --> ClassifyTarget
    ClassifyTarget --> ClassifySource
    ClassifySource --> Decision
    Decision -->|Yes| DecisionYes
    DecisionYes -->|Yes| Path1
    DecisionYes -->|No| Path2
    Decision -->|No| DecisionNo
    DecisionNo -->|Yes| Path3
    DecisionNo -->|No| Path1

The function handles all path generation cases through a series of conditional checks, using information from both the link match and the current_page_info parameter.
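
The control flow can be approximated in runnable form as below. This is a reconstruction from the flowchart, not a copy of tools/deepwiki-scraper.py. The wrapper name rewrite_links, the number key in current_page_info, and the use of re.match for the inner pattern are all assumptions.

```python
import re
from typing import Optional

LINK_PATTERN = r'\]\(/[^/]+/[^/]+/([^)]+)\)'


def rewrite_links(markdown: str, current_page_info: Optional[dict] = None) -> str:
    """Rewrite internal wiki links in `markdown` to relative .md paths."""
    info = current_page_info or {}
    is_source_subsection = info.get("level", 0) > 0
    # Assumption: the source page number is stored under a 'number' key.
    source_main_section = (
        str(info.get("number", "")).split(".")[0] if is_source_subsection else None
    )

    def fix_wiki_link(match: re.Match) -> str:
        full_path = match.group(1)
        parsed = re.match(r'(\d+(?:\.\d+)*)-(.+)$', full_path)
        if not parsed:
            return match.group(0)  # leave unrecognized links unchanged
        page_num, slug = parsed.group(1), parsed.group(2)
        file_num = page_num.replace('.', '-')
        if '.' in page_num:  # target is a subsection
            target_main_section = page_num.split('.')[0]
            if is_source_subsection and source_main_section == target_main_section:
                return f']({file_num}-{slug}.md)'
            return f'](section-{target_main_section}/{file_num}-{slug}.md)'
        if is_source_subsection:  # main-page target, subsection source
            return f'](../{file_num}-{slug}.md)'
        return f']({file_num}-{slug}.md)'

    return re.sub(LINK_PATTERN, fix_wiki_link, markdown)


page = "[Install](/owner/repo/3-installation), [Flows](/owner/repo/2.2-workflows)"
print(rewrite_links(page, {"level": 1, "number": "2.1"}))
# → [Install](../3-installation.md), [Flows](2-2-workflows.md)
```

Because the outer pattern starts at the closing bracket of the link text, the callback returns the "](...)" wrapper along with the new path and the link text itself is never touched.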

Sources: tools/deepwiki-scraper.py:549-589

Page Number Transformation

The transformation from page numbers with dots to file names with hyphens is critical for matching the file system structure:

Diagram: Page Number Format Conversion

graph LR
    subgraph "DeepWiki Format"
        DW1["Page: '2.1'"]
        DW2["URL: '/repo/2.1-title'"]
    end

    subgraph "Transformation"
        Trans["Replace '.' with '-'"]
    end

    subgraph "File System Format"
        FS1["File Number: '2-1'"]
        FS2["Path: 'section-2/2-1-title.md'"]
    end

    DW1 --> Trans
    DW2 --> Trans
    Trans --> FS1
    Trans --> FS2

This conversion is performed by the line file_num = page_num.replace('.', '-'), which ensures that subsection identifiers match the actual file names created during extraction.
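
As a quick illustration, the conversion behaves as follows for a main page, a subsection, and a deeper page number (the parsing pattern allows more than one dot). The wrapper function to_file_num is a hypothetical name used only for this sketch.

```python
def to_file_num(page_num: str) -> str:
    """Convert a DeepWiki page number to its file-name form."""
    return page_num.replace(".", "-")


for page_num in ("4", "2.1", "2.1.3"):
    print(page_num, "->", to_file_num(page_num))
# → 4 -> 4
# → 2.1 -> 2-1
# → 2.1.3 -> 2-1-3
```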

Sources: tools/deepwiki-scraper.py:558

Detailed Example Scenarios

Scenario 1: Main Page to Main Page

When a main page (e.g., 1-overview.md) links to another main page (e.g., 4-features.md):

  • Source: 1-overview.md (level = 0, in root directory)
  • Target: 4-features (no dot, is main page)
  • Input Link: [Features](/owner/repo/4-features)
  • Generated Path: 4-features.md
  • Reason: Both files are in the same root directory, so only the filename is needed

Sources: tools/deepwiki-scraper.py:586-588

Scenario 2: Main Page to Subsection

When a main page (e.g., 2-architecture.md) links to a subsection (e.g., 2.1-components):

  • Source: 2-architecture.md (level = 0, in root directory)
  • Target: 2.1-components (contains dot, is subsection in section-2/)
  • Input Link: [Components](/owner/repo/2.1-components)
  • Generated Path: section-2/2-1-components.md
  • Reason: Target is in subdirectory section-2/, source is in root, so full relative path is needed

Sources: tools/deepwiki-scraper.py:579-580

Scenario 3: Subsection to Main Page

When a subsection (e.g., 2-1-components.md in section-2/) links to a main page (e.g., 3-installation.md):

  • Source: 2-1-components.md (level = 1, in section-2/ directory)
  • Target: 3-installation (no dot, is main page)
  • Input Link: [Installation](/owner/repo/3-installation)
  • Generated Path: ../3-installation.md
  • Reason: Source is in subdirectory, target is in parent directory, so ../ is needed to go up one level

Sources: tools/deepwiki-scraper.py:583-585

Scenario 4: Subsection to Subsection (Same Section)

When a subsection (e.g., 2-1-components.md) links to another subsection in the same section (e.g., 2.2-workflows):

  • Source: 2-1-components.md (level = 1, in section-2/)
  • Source Main Section: 2
  • Target: 2.2-workflows (contains dot, in section-2/)
  • Target Main Section: 2
  • Input Link: [Workflows](/owner/repo/2.2-workflows)
  • Generated Path: 2-2-workflows.md
  • Reason: Both files are in the same section-2/ directory, so only the filename is needed

Sources: tools/deepwiki-scraper.py:575-577

Scenario 5: Subsection to Subsection (Different Section)

When a subsection (e.g., 2-1-components.md in section-2/) links to a subsection in a different section (e.g., 3.1-docker-setup in section-3/):

  • Source: 2-1-components.md (level = 1, in section-2/)
  • Source Main Section: 2
  • Target: 3.1-docker-setup (contains dot, in section-3/)
  • Target Main Section: 3
  • Input Link: [Docker Setup](/owner/repo/3.1-docker-setup)
  • Generated Path: section-3/3-1-docker-setup.md
  • Reason: Sections don't match, so the function returns the same section-prefixed path it returns when the source is a main page. No ../ segment is added, so a consumer that resolves the link strictly relative to section-2/ would look for section-2/section-3/3-1-docker-setup.md

Sources: tools/deepwiki-scraper.py:579-580

Integration with Content Extraction

The link rewriting is integrated into the extract_page_content function and applied after HTML-to-Markdown conversion:

Diagram: Link Rewriting Integration Sequence

sequenceDiagram
    participant EPC as extract_page_content
    participant CTM as convert_html_to_markdown
    participant FWL as fix_wiki_link
    participant RE as re.sub

    EPC->>CTM: Convert HTML to Markdown
    CTM-->>EPC: Raw Markdown (with DeepWiki URLs)

    Note over EPC: Clean up content

    EPC->>RE: Apply link rewriting regex
    loop For each matched link
        RE->>FWL: Call with match object
        FWL->>FWL: Parse page number and slug
        FWL->>FWL: Classify source and target
        FWL->>FWL: Compute relative path
        FWL-->>RE: Return rewritten link
    end
    RE-->>EPC: Markdown with relative paths

    EPC-->>EPC: Return final markdown

The rewriting occurs at line 592 using re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', fix_wiki_link, markdown), which finds all internal wiki links and replaces them with their rewritten versions.

Sources: tools/deepwiki-scraper.py:547-594

Edge Cases and Error Handling

If the path captured from a link doesn't match the expected pattern (\d+(?:\.\d+)*)-(.+)$, the function returns match.group(0), the original matched text, unchanged.

This ensures that links which match the outer pattern but do not point to a numbered wiki page are preserved in their original form.
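
The fallback can be demonstrated in isolation with a stand-in callback. This is a sketch only: keep_or_mark and the REWRITTEN placeholder are hypothetical and exist just to make the two outcomes visible, and anchoring the inner pattern with re.match is an assumption.

```python
import re


def keep_or_mark(match: re.Match) -> str:
    """Return the original text for unrecognized paths, a marker otherwise."""
    if re.match(r'(\d+(?:\.\d+)*)-(.+)$', match.group(1)) is None:
        return match.group(0)  # unchanged: not a numbered wiki page
    return "](REWRITTEN)"      # placeholder for the real path computation


text = "[Blob](/owner/repo/blob/main/README.md) and [Overview](/owner/repo/1-overview)"
out = re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', keep_or_mark, text)
print(out)
# → [Blob](/owner/repo/blob/main/README.md) and [Overview](REWRITTEN)
```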

Sources: tools/deepwiki-scraper.py:551-589

Missing current_page_info

If current_page_info is not provided (e.g., during development or testing), the function defaults to treating the source as a main page.

This allows the function to work in degraded mode, though links from subsections may not be correctly rewritten.
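
The effect of the default can be seen on a link to a main page. The helper below is a hypothetical, simplified sketch that covers only main-page targets. It is not the scraper's code.

```python
from typing import Optional


def link_to_main_page(target: str, current_page_info: Optional[dict] = None) -> str:
    """Build the relative path to a main page such as '3-installation'."""
    # Without page info, fall back to level 0 (source treated as a main page).
    level = current_page_info.get("level", 0) if current_page_info else 0
    prefix = "../" if level > 0 else ""
    return f"{prefix}{target}.md"


print(link_to_main_page("3-installation", {"level": 1}))  # → ../3-installation.md
print(link_to_main_page("3-installation"))                # → 3-installation.md
```

The second call produces a path that is only valid from the root directory, which is why links written from subsection pages can come out wrong in this mode.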

Sources: tools/deepwiki-scraper.py:565-570

Performance Considerations

The link rewriting is performed using a single re.sub call with a callback function, which is efficient for typical wiki pages with dozens to hundreds of links. The regex compilation is implicit and cached by Python's re module.

The callback runs once per internal link, so its cost is O(n) in the number of internal links on the page, on top of the regex engine's pass over the page text. Each invocation performs only short string operations on the matched path.

Sources: tools/deepwiki-scraper.py:592

Testing and Validation

The correctness of link rewriting can be validated by:

  1. Checking that generated links use .md extension
  2. Verifying that links from subsections to main pages use ../
  3. Confirming that links to subsections use the section-N/ prefix when appropriate
  4. Testing cross-section subsection links resolve correctly

The mdBook build process will fail if links are incorrectly rewritten, providing a validation mechanism during Phase 3 of the pipeline.
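
These checks can also be run directly on the generated files before the build. The sketch below resolves each .md link against the directory of the file that contains it and reports links that do not land on a known file. The function name broken_links, the link pattern, and the sample file set are assumptions made for illustration.

```python
import posixpath
import re

# Relative links ending in .md, with an optional #fragment.
MD_LINK = re.compile(r'\]\(([^)#]+\.md)(?:#[^)]*)?\)')


def broken_links(source_file: str, markdown: str, known_files: set[str]) -> list[str]:
    """Return the .md links in `markdown` that do not resolve to a known file."""
    source_dir = posixpath.dirname(source_file)
    broken = []
    for link in MD_LINK.findall(markdown):
        if "://" in link:
            continue  # skip external URLs
        # Resolve the link relative to the directory of the source file.
        resolved = posixpath.normpath(posixpath.join(source_dir, link))
        if resolved not in known_files:
            broken.append(link)
    return broken


known = {
    "1-overview.md",
    "3-installation.md",
    "section-2/2-1-components.md",
    "section-2/2-2-workflows.md",
    "section-3/3-1-docker-setup.md",
}
page = (
    "[a](../3-installation.md) [b](2-2-workflows.md) "
    "[c](section-3/3-1-docker-setup.md) [d](../section-3/3-1-docker-setup.md)"
)
print(broken_links("section-2/2-1-components.md", page, known))
# → ['section-3/3-1-docker-setup.md']
```

In this run the section-prefixed link is reported because, relative to section-2/, it resolves to section-2/section-3/3-1-docker-setup.md, while the variant with a leading ../ resolves to an existing file.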

Sources: tools/deepwiki-scraper.py:547-594