Link Rewriting Logic
This document details the algorithm for converting internal DeepWiki URL links into relative Markdown file paths during the content extraction process. The link rewriting system ensures that cross-references between wiki pages function correctly in the final mdBook output by transforming absolute web URLs into appropriate relative file paths based on the hierarchical structure of the documentation.
For information about the overall markdown extraction process, see Phase 1: Markdown Extraction. For details about file organization and directory structure, see Output Structure.
Overview
DeepWiki uses absolute URL paths for internal wiki links in the format /owner/repo/N-page-title or /owner/repo/N.M-subsection-title. These links must be rewritten to relative Markdown file paths that respect the mdBook directory structure where:
- Main pages (e.g., "1-overview", "2-architecture") reside in the root markdown directory
- Subsections (e.g., "2.1-subsection", "2.2-another") reside in subdirectories named section-N/
- File names use hyphens instead of dots (e.g., 2-1-subsection.md instead of 2.1-subsection.md)
The rewriting logic must compute the correct relative path based on both the source page location and the target page location.
Sources: tools/deepwiki-scraper.py:547-594
Directory Structure Context
The system organizes markdown files into a hierarchical structure that affects link rewriting:
Diagram: File Organization Hierarchy
graph TB
Root["Root Directory\n(output/markdown/)"]
Main1["1-overview.md\n(Main Page)"]
Main2["2-architecture.md\n(Main Page)"]
Main3["3-installation.md\n(Main Page)"]
Section2["section-2/\n(Subsection Directory)"]
Section3["section-3/\n(Subsection Directory)"]
Sub2_1["2-1-components.md\n(Subsection)"]
Sub2_2["2-2-workflows.md\n(Subsection)"]
Sub3_1["3-1-docker-setup.md\n(Subsection)"]
Sub3_2["3-2-manual-setup.md\n(Subsection)"]
Root --> Main1
Root --> Main2
Root --> Main3
Root --> Section2
Root --> Section3
Section2 --> Sub2_1
Section2 --> Sub2_2
Section3 --> Sub3_1
Section3 --> Sub3_2
This structure requires different relative path strategies depending on where the link originates and where it points.
Sources: tools/deepwiki-scraper.py:848-860
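The layout rules above can be sketched as a small mapping from a DeepWiki page identifier to its location under the markdown root. This is an illustrative sketch only: the helper name output_path is hypothetical and is not taken from the scraper.

```python
def output_path(page_id: str) -> str:
    """Map a page id such as '2.1-components' to a path under the markdown root.

    Hypothetical helper for illustration; it only encodes the layout rules
    described in this section.
    """
    # Split "2.1-components" into the page number "2.1" and the slug "components".
    number, _, slug = page_id.partition("-")
    file_num = number.replace(".", "-")  # dots become hyphens in file names
    if "." in number:
        # Subsections live in section-N/, where N is the main section number.
        main_section = number.split(".")[0]
        return f"section-{main_section}/{file_num}-{slug}.md"
    # Main pages live directly in the root markdown directory.
    return f"{file_num}-{slug}.md"


for page_id in ["1-overview", "2-architecture", "2.1-components", "3.2-manual-setup"]:
    print(f"{page_id:18} -> {output_path(page_id)}")
```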
Link Transformation Algorithm
Input Format Detection
The algorithm begins by matching markdown links that reference the DeepWiki URL structure using a regular expression pattern.
Diagram: Link Pattern Matching Flow
flowchart TD
Start["Markdown Content"]
Regex["Apply Regex Pattern:\n'\\]\\(/[^/]+/[^/]+/([^)]+)\\)'"]
Extract["Extract Path Component:\ne.g., '4-query-planning'"]
Parse["Parse Page Number and Slug:\nPattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
PageNum["page_num\n(e.g., '2.1' or '4')"]
Slug["slug\n(e.g., 'query-planning')"]
Start --> Regex
Regex --> Extract
Extract --> Parse
Parse --> PageNum
Parse --> Slug
The regex \]\(/[^/]+/[^/]+/([^)]+)\) captures the path component after the repository identifier. For example, in [text](/owner/repo/4-query-planning), it captures 4-query-planning.
Sources: tools/deepwiki-scraper.py:592
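As a quick illustration of the two patterns, the sketch below applies them to a sample string. The sample text and the names WIKI_LINK and PAGE_PATH are made up for this example, and anchoring the second pattern with match (rather than search) is an assumption.

```python
import re

# Outer pattern: finds "](/owner/repo/<path>)" and captures <path>.
WIKI_LINK = re.compile(r'\]\(/[^/]+/[^/]+/([^)]+)\)')
# Inner pattern: splits "<page number>-<slug>".
PAGE_PATH = re.compile(r'(\d+(?:\.\d+)*)-(.+)$')

sample = (
    "See [Query Planning](/owner/repo/4-query-planning) and "
    "[Components](/owner/repo/2.1-components), or visit "
    "[the site](https://example.com/docs/page)."
)

# The external link is not captured because its target does not start with "/".
captured = WIKI_LINK.findall(sample)
print(captured)  # ['4-query-planning', '2.1-components']

parsed = [PAGE_PATH.match(path).groups() for path in captured]
print(parsed)  # [('4', 'query-planning'), ('2.1', 'components')]
```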
Page Classification Logic
Each page (source and target) is classified based on whether it contains a dot in its page number, indicating a subsection.
Diagram: Page Type Classification
graph TB
subgraph "Target Classification"
TargetNum["Target Page Number"]
CheckDot["Contains '.' ?"]
IsTargetSub["is_target_subsection = True\ntarget_main_section = N"]
IsTargetMain["is_target_subsection = False\ntarget_main_section = None"]
TargetNum --> CheckDot
CheckDot -->|Yes "2.1"| IsTargetSub
CheckDot -->|No "2"| IsTargetMain
end
subgraph "Source Classification"
SourceInfo["current_page_info"]
SourceLevel["Check 'level' field"]
IsSourceSub["is_source_subsection = True\nsource_main_section = N"]
IsSourceMain["is_source_subsection = False\nsource_main_section = None"]
SourceInfo --> SourceLevel
SourceLevel -->|> 0| IsSourceSub
SourceLevel -->|= 0| IsSourceMain
end
The level field in current_page_info is set during wiki structure discovery and indicates the depth in the hierarchy (0 for main pages, 1+ for subsections).
Sources: tools/deepwiki-scraper.py:554-570
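The two classifications might look like the following sketch. The function names are hypothetical, and the 'number' key on current_page_info is an assumption; this section only confirms the 'level' field.

```python
from typing import Optional, Tuple


def classify_target(page_num: str) -> Tuple[bool, Optional[str]]:
    """Return (is_target_subsection, target_main_section) for a page number."""
    is_subsection = "." in page_num
    main_section = page_num.split(".")[0] if is_subsection else None
    return is_subsection, main_section


def classify_source(current_page_info: dict) -> Tuple[bool, Optional[str]]:
    """Return (is_source_subsection, source_main_section) for the current page."""
    is_subsection = current_page_info["level"] > 0
    # Assumption: the page number (e.g. "2.1") is stored under a 'number' key.
    main_section = current_page_info["number"].split(".")[0] if is_subsection else None
    return is_subsection, main_section


print(classify_target("2.1"))                            # (True, '2')
print(classify_target("2"))                              # (False, None)
print(classify_source({"level": 1, "number": "2.1"}))    # (True, '2')
print(classify_source({"level": 0, "number": "3"}))      # (False, None)
```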
Path Generation Decision Matrix
The relative path is computed based on the combination of source and target types:
| Source Type | Target Type | Relative Path | Example |
|---|---|---|---|
| Main Page | Main Page | {file_num}-{slug}.md | 3-installation.md |
| Main Page | Subsection | section-{N}/{file_num}-{slug}.md | section-2/2-1-components.md |
| Subsection | Main Page | ../{file_num}-{slug}.md | ../3-installation.md |
| Subsection (same section) | Subsection (same section) | {file_num}-{slug}.md | 2-2-workflows.md |
| Subsection (section A) | Subsection (section B) | ../section-{N}/{file_num}-{slug}.md | ../section-3/3-1-setup.md |
Sources: tools/deepwiki-scraper.py:573-588
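One way to cross-check the matrix is to derive each row from the root-relative locations of the two files using the standard library's posixpath.relpath. This is not how the scraper computes paths (it uses explicit conditionals); the helper name relative_link is hypothetical and serves only as an independent check of the table.

```python
import posixpath


def relative_link(source_file: str, target_file: str) -> str:
    """Relative path from the directory of source_file to target_file.

    Both arguments are paths relative to the markdown root directory.
    """
    source_dir = posixpath.dirname(source_file) or "."
    return posixpath.relpath(target_file, start=source_dir)


cases = [
    ("1-overview.md", "3-installation.md"),                             # main -> main
    ("2-architecture.md", "section-2/2-1-components.md"),               # main -> subsection
    ("section-2/2-1-components.md", "3-installation.md"),               # subsection -> main
    ("section-2/2-1-components.md", "section-2/2-2-workflows.md"),      # same section
    ("section-2/2-1-components.md", "section-3/3-1-setup.md"),          # different section
]
for source, target in cases:
    print(f"{source:30} -> {relative_link(source, target)}")
```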
Implementation Details
The fix_wiki_link Function
The core implementation is a nested function fix_wiki_link that serves as a callback for re.sub.
Diagram: fix_wiki_link Function Control Flow
flowchart TD
Start["fix_wiki_link(match)"]
ExtractPath["full_path = match.group(1)"]
ParseLink["Match pattern: '(\\d+(?:\\.\\d+)*)-(.+)$'"]
Success{"Match\nsuccessful?"}
NoMatch["Return match.group(0)\n(unchanged)"]
ExtractParts["page_num = match.group(1)\nslug = match.group(2)"]
ConvertNum["file_num = page_num.replace('.', '-')"]
ClassifyTarget["Classify target:\nis_target_subsection\ntarget_main_section"]
ClassifySource["Classify source:\nis_source_subsection\nsource_main_section"]
Decision{"Target is\nsubsection?"}
DecisionYes{"Source in same\nsection?"}
DecisionNo{"Source is\nsubsection?"}
Path1["Return '{file_num}-{slug}.md'"]
Path2["Return 'section-{N}/{file_num}-{slug}.md'"]
Path3["Return '../{file_num}-{slug}.md'"]
Start --> ExtractPath
ExtractPath --> ParseLink
ParseLink --> Success
Success -->|No| NoMatch
Success -->|Yes| ExtractParts
ExtractParts --> ConvertNum
ConvertNum --> ClassifyTarget
ClassifyTarget --> ClassifySource
ClassifySource --> Decision
Decision -->|Yes| DecisionYes
DecisionYes -->|Yes| Path1
DecisionYes -->|No| Path2
Decision -->|No| DecisionNo
DecisionNo -->|Yes| Path3
DecisionNo -->|No| Path1
The function handles all path generation cases through a series of conditional checks, using information from both the link match and the current_page_info parameter.
Sources: tools/deepwiki-scraper.py:549-589
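A minimal sketch of such a callback, following the control flow described for fix_wiki_link, is shown below. It is not the scraper's actual code. The factory make_fix_wiki_link, the 'number' key on current_page_info, and the use of match anchored at the start of the path are assumptions. The sketch returns section-{N}/... whenever the source is not in the target's section, as in the control-flow diagram; the decision matrix lists ../section-{N}/... for cross-section links, and the sketch does not attempt to reconcile the two.

```python
import re
from typing import Callable, Optional

WIKI_LINK = re.compile(r'\]\(/[^/]+/[^/]+/([^)]+)\)')
PAGE_PATH = re.compile(r'(\d+(?:\.\d+)*)-(.+)$')


def make_fix_wiki_link(current_page_info: Optional[dict]) -> Callable[[re.Match], str]:
    """Build a re.sub callback bound to the page being processed (hypothetical factory)."""

    def fix_wiki_link(match: re.Match) -> str:
        full_path = match.group(1)
        parsed = PAGE_PATH.match(full_path)
        if not parsed:
            return match.group(0)  # not a wiki page link: leave unchanged

        page_num, slug = parsed.groups()
        file_num = page_num.replace(".", "-")

        is_target_subsection = "." in page_num
        target_main_section = page_num.split(".")[0] if is_target_subsection else None

        is_source_subsection = bool(current_page_info) and current_page_info["level"] > 0
        # Assumption: the source page number is stored under a 'number' key.
        source_main_section = (
            current_page_info["number"].split(".")[0] if is_source_subsection else None
        )

        # The outer pattern consumed "](" and ")", so every return value re-adds them.
        if is_target_subsection:
            if is_source_subsection and source_main_section == target_main_section:
                return f"]({file_num}-{slug}.md)"
            return f"](section-{target_main_section}/{file_num}-{slug}.md)"
        if is_source_subsection:
            return f"](../{file_num}-{slug}.md)"
        return f"]({file_num}-{slug}.md)"

    return fix_wiki_link


page = {"level": 1, "number": "2.1"}
markdown = "See [Workflows](/owner/repo/2.2-workflows) and [Installation](/owner/repo/3-installation)."
rewritten = WIKI_LINK.sub(make_fix_wiki_link(page), markdown)
print(rewritten)
# See [Workflows](2-2-workflows.md) and [Installation](../3-installation.md).
```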
Page Number Transformation
The transformation from page numbers with dots to file names with hyphens is critical for matching the file system structure:
Diagram: Page Number Format Conversion
graph LR
subgraph "DeepWiki Format"
DW1["Page: '2.1'"]
DW2["URL: '/repo/2.1-title'"]
end
subgraph "Transformation"
Trans["Replace '.' with '-'"]
end
subgraph "File System Format"
FS1["File Number: '2-1'"]
FS2["Path: 'section-2/2-1-title.md'"]
end
DW1 --> Trans
DW2 --> Trans
Trans --> FS1
Trans --> FS2
This conversion is performed by the line file_num = page_num.replace('.', '-'), which ensures that subsection identifiers match the actual file names created during extraction.
Sources: tools/deepwiki-scraper.py:558
Detailed Example Scenarios
Scenario 1: Main Page to Main Page Link
When a main page (e.g., 1-overview.md) links to another main page (e.g., 4-features.md):
- Source: 1-overview.md (level = 0, in root directory)
- Target: 4-features (no dot, is main page)
- Input Link: [Features](/owner/repo/4-features)
- Generated Path: 4-features.md
- Reason: Both files are in the same root directory, so only the filename is needed
Sources: tools/deepwiki-scraper.py:586-588
Scenario 2: Main Page to Subsection Link
When a main page (e.g., 2-architecture.md) links to a subsection (e.g., 2.1-components):
- Source: 2-architecture.md (level = 0, in root directory)
- Target: 2.1-components (contains dot, is subsection in section-2/)
- Input Link: [Components](/owner/repo/2.1-components)
- Generated Path: section-2/2-1-components.md
- Reason: Target is in subdirectory section-2/, source is in root, so full relative path is needed
Sources: tools/deepwiki-scraper.py:579-580
Scenario 3: Subsection to Main Page Link
When a subsection (e.g., 2-1-components.md in section-2/) links to a main page (e.g., 3-installation.md):
- Source: 2-1-components.md (level = 1, in section-2/ directory)
- Target: 3-installation (no dot, is main page)
- Input Link: [Installation](/owner/repo/3-installation)
- Generated Path: ../3-installation.md
- Reason: Source is in subdirectory, target is in parent directory, so ../ is needed to go up one level
Sources: tools/deepwiki-scraper.py:583-585
Scenario 4: Subsection to Subsection (Same Section)
When a subsection (e.g., 2-1-components.md) links to another subsection in the same section (e.g., 2-2-workflows.md):
- Source: 2-1-components.md (level = 1, in section-2/)
- Source Main Section: 2
- Target: 2.2-workflows (contains dot, in section-2/)
- Target Main Section: 2
- Input Link: [Workflows](/owner/repo/2.2-workflows)
- Generated Path: 2-2-workflows.md
- Reason: Both files are in the same section-2/ directory, so only the filename is needed
Sources: tools/deepwiki-scraper.py:575-577
Scenario 5: Subsection to Subsection (Different Section)
When a subsection (e.g., 2-1-components.md in section-2/) links to a subsection in a different section (e.g., 3-1-docker-setup.md in section-3/):
- Source: 2-1-components.md (level = 1, in section-2/)
- Source Main Section: 2
- Target: 3.1-docker-setup (contains dot, in section-3/)
- Target Main Section: 3
- Input Link: [Docker Setup](/owner/repo/3.1-docker-setup)
- Generated Path: section-3/3-1-docker-setup.md
- Reason: The sections don't match, so the full path from the root perspective is used (implicitly going up and into a different section directory)
Sources: tools/deepwiki-scraper.py:579-580
Integration with Content Extraction
The link rewriting is integrated into the extract_page_content function and applied after HTML-to-Markdown conversion:
Diagram: Link Rewriting Integration Sequence
sequenceDiagram
participant EPC as extract_page_content
participant CTM as convert_html_to_markdown
participant FWL as fix_wiki_link
participant RE as re.sub
EPC->>CTM: Convert HTML to Markdown
CTM-->>EPC: Raw Markdown (with DeepWiki URLs)
Note over EPC: Clean up content
EPC->>RE: Apply link rewriting regex
loop For each matched link
RE->>FWL: Call with match object
FWL->>FWL: Parse page number and slug
FWL->>FWL: Classify source and target
FWL->>FWL: Compute relative path
FWL-->>RE: Return rewritten link
end
RE-->>EPC: Markdown with relative paths
EPC-->>EPC: Return final markdown
The rewriting occurs at line 592 using re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', fix_wiki_link, markdown), which finds all internal wiki links and replaces them with their rewritten versions.
Sources: tools/deepwiki-scraper.py:547-594
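The re.sub wiring can be illustrated with a stand-in callback. The names rewrite_internal_links and stand_in are hypothetical, and the stand-in deliberately skips the path logic; the point is that the match spans only the "](" ... ")" portion of a link, so the callback must return that wrapper and the link text is never touched.

```python
import re


def rewrite_internal_links(markdown: str, fix_wiki_link) -> str:
    """Apply a link-rewriting callback to every internal wiki link."""
    return re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', fix_wiki_link, markdown)


def stand_in(match: re.Match) -> str:
    # Simplified placeholder for fix_wiki_link: keep the captured path, add ".md".
    return f"]({match.group(1)}.md)"


markdown = "Intro. [Overview](/owner/repo/1-overview) and [docs](https://example.com/a/b)."
result = rewrite_internal_links(markdown, stand_in)
print(result)
# Intro. [Overview](1-overview.md) and [docs](https://example.com/a/b).
```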
Edge Cases and Error Handling
Invalid Link Format
If a link doesn't match the expected pattern (\d+(?:\.\d+)*)-(.+)$, the function returns the original match (match.group(0)) unchanged.
This ensures that malformed or external links are preserved in their original form.
Sources: tools/deepwiki-scraper.py:551-589
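The fallback can be sketched as follows. The callback name rewrite_or_keep and the sample links are made up, the path logic is reduced to the main-page case, and matching the pattern from the start of the captured path is an assumption.

```python
import re

PAGE_PATH = re.compile(r'(\d+(?:\.\d+)*)-(.+)$')


def rewrite_or_keep(match: re.Match) -> str:
    parsed = PAGE_PATH.match(match.group(1))
    if not parsed:
        # Not of the form "<number>-<slug>": return the matched text untouched.
        return match.group(0)
    page_num, slug = parsed.groups()
    return f"]({page_num.replace('.', '-')}-{slug}.md)"


text = "[Source](/owner/repo/blob/main/app.py) and [Overview](/owner/repo/1-overview)"
result = re.sub(r'\]\(/[^/]+/[^/]+/([^)]+)\)', rewrite_or_keep, text)
print(result)
# [Source](/owner/repo/blob/main/app.py) and [Overview](1-overview.md)
```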
Missing current_page_info
If current_page_info is not provided (e.g., during development or testing), the function defaults to treating the source as a main page.
This allows the function to work in degraded mode, though links from subsections may not be correctly rewritten.
Sources: tools/deepwiki-scraper.py:565-570
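The effect of the default can be seen in a reduced sketch that handles only links to main pages. The helper link_to_main_page is hypothetical and exists only to show how a missing current_page_info changes the result.

```python
from typing import Optional


def link_to_main_page(target_file: str, current_page_info: Optional[dict] = None) -> str:
    """Relative link to a main page, given what is known about the source page."""
    # With no page info, the source is treated as a main page in the root directory.
    is_source_subsection = bool(current_page_info) and current_page_info["level"] > 0
    return f"../{target_file}" if is_source_subsection else target_file


print(link_to_main_page("3-installation.md", {"level": 1}))  # ../3-installation.md
print(link_to_main_page("3-installation.md"))                # 3-installation.md
```

If the second call were made for a page that actually lives in section-2/, the resulting link would point at section-2/3-installation.md, which is the degraded behavior described above.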
Performance Considerations
The link rewriting is performed using a single re.sub call with a callback function, which is efficient for typical wiki pages with dozens to hundreds of links. The regex compilation is implicit and cached by Python's re module.
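The fix_wiki_link callback runs once per matched link and performs only a few short string operations (one regex match, a replace, and a split), so the rewriting work grows as O(n) in the number of internal links n, in addition to the single regex scan over the page content.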
Sources: tools/deepwiki-scraper.py:592
Testing and Validation
The correctness of link rewriting can be validated by:
- Checking that generated links use the .md extension
- Verifying that links from subsections to main pages use ../
- Confirming that links to subsections use the section-N/ prefix when appropriate
- Testing that cross-section subsection links resolve correctly
The mdBook build process will fail if links are incorrectly rewritten, providing a validation mechanism during Phase 3 of the pipeline.
Sources: tools/deepwiki-scraper.py:547-594
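These checks could also be automated before the mdBook build. The sketch below resolves each rewritten .md link against the directory of the page that contains it and reports targets missing from a set of known output files. The function broken_links, the MD_LINK pattern, and the sample data are assumptions made for illustration.

```python
import posixpath
import re

# Matches "](something.md)" with an optional "#anchor"; captures the file part.
MD_LINK = re.compile(r'\]\(([^)#\s]+\.md)(?:#[^)]*)?\)')


def broken_links(source_file: str, markdown: str, known_files: set) -> list:
    """Return (link, resolved_path) pairs whose target is not a known output file."""
    source_dir = posixpath.dirname(source_file)
    broken = []
    for target in MD_LINK.findall(markdown):
        if "://" in target:
            continue  # external URL that happens to end in .md
        # Relative links resolve against the directory of the page containing them.
        resolved = posixpath.normpath(posixpath.join(source_dir, target))
        if resolved not in known_files:
            broken.append((target, resolved))
    return broken


known = {
    "1-overview.md",
    "3-installation.md",
    "section-2/2-1-components.md",
    "section-2/2-2-workflows.md",
    "section-3/3-1-docker-setup.md",
}
page = "[a](../3-installation.md) [b](2-2-workflows.md) [c](section-3/3-1-docker-setup.md)"
report = broken_links("section-2/2-1-components.md", page, known)
print(report)
# [('section-3/3-1-docker-setup.md', 'section-2/section-3/3-1-docker-setup.md')]
```

In this sample, a link of the form section-3/... written on a page inside section-2/ resolves under section-2/, which is why cross-section links are worth testing explicitly.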