Wiki Structure Discovery
Purpose and Scope
This document describes the wiki structure discovery mechanism in Phase 1 of the processing pipeline. The system analyzes the main DeepWiki repository page to identify all available wiki pages and their hierarchical relationships. This discovery phase produces a structured page list that drives subsequent content extraction.
For the HTML-to-Markdown conversion that follows discovery, see HTML to Markdown Conversion. For the overall Phase 1 process, see Phase 1: Markdown Extraction.
Overview
The discovery process fetches the main wiki page for a repository and parses its HTML to extract all wiki page references. The system identifies both main pages (e.g., 1, 2, 3) and subsections (e.g., 2.1, 2.2, 3.1) by analyzing link patterns. The output is a sorted list of page metadata that includes page numbers, titles, URLs, and hierarchical levels.
```mermaid
flowchart TD
    Start["main() entry point"] --> ValidateRepo["Validate repo format (owner/repo)"]
    ValidateRepo --> CreateSession["Create requests.Session with User-Agent headers"]
    CreateSession --> CallExtract["extract_wiki_structure(repo, session)"]
    CallExtract --> FetchMain["Fetch https://deepwiki.com/{repo}"]
    FetchMain --> ParseHTML["BeautifulSoup(response.text)"]
    ParseHTML --> FindLinks["soup.find_all('a', href=regex)"]
    FindLinks --> IterateLinks["Iterate over all links"]
    IterateLinks --> ExtractPattern["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
    ExtractPattern --> BuildPageDict["Build page dict: {number, title, url, href, level}"]
    BuildPageDict --> CheckDupe{"href in seen_urls?"}
    CheckDupe -->|Yes| IterateLinks
    CheckDupe -->|No| AddToList["pages.append(page_dict)"]
    AddToList --> IterateLinks
    IterateLinks -->|Done| SortPages["Sort by numeric parts: [int(x) for x in num.split('.')]"]
    SortPages --> ReturnPages["Return pages list"]
    ReturnPages --> ProcessPages["Process each page in main loop"]
    style CallExtract fill:#f9f,stroke:#333,stroke-width:2px
    style ExtractPattern fill:#f9f,stroke:#333,stroke-width:2px
    style SortPages fill:#f9f,stroke:#333,stroke-width:2px
```
Discovery Flow Diagram
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831
Main Discovery Function
The extract_wiki_structure function performs the core discovery logic. It accepts a repository identifier (e.g., "jzombie/deepwiki-to-mdbook") and an HTTP session object, then returns a list of page dictionaries.
Function Signature and Entry Point
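The snippet below is a hedged reconstruction of the signature, following the call shape shown in the flow diagram (extract_wiki_structure(repo, session)); the docstring wording is illustrative:

```python
def extract_wiki_structure(repo, session):
    """Discover all wiki pages for a repository, e.g. repo = "jzombie/deepwiki-to-mdbook".

    Returns a list of page dicts sorted by their numeric page identifiers.
    """
```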
Sources: tools/deepwiki-scraper.py:78-79
HTTP Request and HTML Parsing
The function constructs the base URL and fetches the main wiki page:
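A minimal sketch of this step, assuming fetch_page is the module's retrying GET helper described under Error Handling; the parser argument passed to BeautifulSoup is an assumption:

```python
from bs4 import BeautifulSoup

base_url = f"https://deepwiki.com/{repo}"
response = fetch_page(base_url, session)            # retrying GET helper (see Error Handling)
soup = BeautifulSoup(response.text, 'html.parser')  # parser choice is an assumption
```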
The fetch_page helper includes retry logic (3 attempts) and browser-like headers to handle transient failures.
Sources: tools/deepwiki-scraper.py:80-84 tools/deepwiki-scraper.py:27-42
Link Pattern Matching
Regex-Based Link Discovery
The system uses a compiled regex pattern to find all wiki page links:
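The exact pattern is not shown in this document; a plausible sketch, assuming any repository-relative href that begins with a numbered segment qualifies:

```python
import re

# Hypothetical pattern: repository-relative hrefs whose last segment starts with a number.
link_pattern = re.compile(rf'^/{re.escape(repo)}/\d+')
links = soup.find_all('a', href=link_pattern)
```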
This pattern matches URLs like:
- /jzombie/deepwiki-to-mdbook/1-overview
- /jzombie/deepwiki-to-mdbook/2-quick-start
- /jzombie/deepwiki-to-mdbook/2-1-basic-usage
Sources: tools/deepwiki-scraper.py:88-90
Page Information Extraction
For each matched link, the system extracts page metadata using a detailed regex pattern:
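A sketch applying the documented pattern to an href; the title derivation is an assumption, and since live URLs use hyphens throughout (e.g. 2-1-basic-usage), the real code presumably normalizes the numeric prefix into dotted form around this match, a step not shown here:

```python
m = re.search(r'/(\d+(?:\.\d+)*)-(.+)$', href)
if m:
    number = m.group(1)                      # e.g. "1" or "2.1"
    slug = m.group(2)                        # e.g. "overview" or "basic-usage"
    title = slug.replace('-', ' ').title()   # assumed title derivation
```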
The regex r'/(\d+(?:\.\d+)*)-(.+)$' captures:
- Group 1: Page number with optional dots (e.g., 1, 2.1, 3.2.1)
- Group 2: URL slug (e.g., overview, basic-usage)
Sources: tools/deepwiki-scraper.py:98-107
Link Extraction Data Flow
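Continuing from the snippets above, the per-link flow can be sketched as follows. This is an assumption-laden reconstruction, not the actual implementation: the hyphen-to-dot normalization of the page number and the title derivation are hypothetical, and the level calculation is explained in a later section.

```python
pages = []
seen_urls = set()

for link in links:
    href = link['href']
    if href in seen_urls:
        continue                                  # skip duplicate links to the same page
    # Hypothetical normalization: DeepWiki hrefs use hyphens throughout
    # ("/repo/2-1-basic-usage"), so split the leading numeric run from the
    # slug and convert it to the dotted form used elsewhere in the pipeline.
    m = re.search(r'/(\d+(?:-\d+)*)-(\D.*)$', href)
    if not m:
        continue                                  # unnumbered link; not a wiki page
    seen_urls.add(href)
    number = m.group(1).replace('-', '.')         # "2-1" -> "2.1" (assumed)
    slug = m.group(2)
    pages.append({
        'number': number,
        'title': slug.replace('-', ' ').title(),  # assumed title derivation
        'url': f"https://deepwiki.com{href}",
        'href': href,
        'level': number.count('.'),               # 0 = main page, 1 = subsection, ...
    })
```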
Sources: tools/deepwiki-scraper.py:98-115
Deduplication and Sorting
Deduplication Strategy
The system maintains a seen_urls set to prevent duplicate page entries; as in the sketch above, any href already present in the set is skipped before a page dictionary is appended.
Sources: tools/deepwiki-scraper.py:92-116
Hierarchical Sorting
Pages are sorted by their numeric components to maintain proper ordering:
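The sort key splits the dotted page number into integer components, matching the expression shown in the flow diagram:

```python
pages.sort(key=lambda p: [int(x) for x in p['number'].split('.')])
```

Comparing lists of integers rather than raw strings avoids lexicographic misordering such as 10 sorting before 2.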
This ensures ordering like: 1 → 2 → 2.1 → 2.2 → 3 → 3.1
Sources: tools/deepwiki-scraper.py:118-123
Sorting Example
| Before Sorting (Link Order) | Page Number | After Sorting (Numeric Order) |
|---|---|---|
| /3-phase-3 | 3 | /1-overview |
| /2-1-subsection-one | 2.1 | /2-quick-start |
| /1-overview | 1 | /2-1-subsection-one |
| /2-quick-start | 2 | /2-2-subsection-two |
| /2-2-subsection-two | 2.2 | /3-phase-3 |
Page Data Structure
Page Dictionary Schema
Each discovered page is represented as a dictionary:
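A representative entry, assuming the fields named in the flow diagram; the values here are illustrative:

```python
page = {
    'number': '2.1',                                            # dotted page identifier
    'title': 'Basic Usage',                                     # human-readable title
    'url': 'https://deepwiki.com/jzombie/deepwiki-to-mdbook/2-1-basic-usage',
    'href': '/jzombie/deepwiki-to-mdbook/2-1-basic-usage',      # site-relative path
    'level': 1,                                                 # hierarchical depth
}
```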
Sources: tools/deepwiki-scraper.py:109-115
Level Calculation
The level field indicates hierarchical depth (see the sketch after the table):
| Page Number | Level | Type |
|---|---|---|
| 1 | 0 | Main page |
| 2 | 0 | Main page |
| 2.1 | 1 | Subsection |
| 2.2 | 1 | Subsection |
| 3.1.1 | 2 | Sub-subsection |
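Under this convention, the level is simply the count of dots in the page number, a minimal sketch:

```python
level = number.count('.')  # "1" -> 0, "2.1" -> 1, "3.1.1" -> 2
```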
Sources: tools/deepwiki-scraper.py:106-114
Discovery Result Processing
Output Statistics
After discovery, the system categorizes pages and reports statistics:
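A hedged sketch of the reporting step; the exact message wording is an assumption:

```python
main_pages = [p for p in pages if p['level'] == 0]
subsections = [p for p in pages if p['level'] > 0]
print(f"Found {len(pages)} pages: {len(main_pages)} main, {len(subsections)} subsections")
```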
Sources: tools/deepwiki-scraper.py:824-837
Integration with Content Extraction
The discovered page list drives the extraction loop in main():
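In outline, the main loop consumes the discovered list. This is a sketch: convert_page_to_markdown is a hypothetical stand-in for the HTML-to-Markdown conversion documented separately.

```python
pages = extract_wiki_structure(repo, session)
for page in pages:
    response = fetch_page(page['url'], session)
    # Hypothetical stand-in for the Phase 1 HTML-to-Markdown conversion step.
    convert_page_to_markdown(page, response.text)
```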
Sources: tools/deepwiki-scraper.py:841-860
Alternative Discovery Method (Unused)
Subsection Probing Function
The codebase includes a discover_subsections function that uses HTTP HEAD requests to probe for subsections, but this function is not invoked in the current implementation:
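Its behavior can be sketched as follows; the signature, probing order, and stopping condition are assumptions beyond what the surrounding text states:

```python
def discover_subsections(repo, main_num, session):
    """Probe for subsection pages via HEAD requests (unused in the current pipeline)."""
    found = []
    sub = 1
    while True:
        candidate = f"https://deepwiki.com/{repo}/{main_num}-{sub}-"
        resp = session.head(candidate, allow_redirects=True, timeout=10)
        if resp.status_code != 200:
            break                        # no such subsection; stop probing
        found.append(f"{main_num}.{sub}")
        sub += 1
    return found
```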
This function attempts to discover subsections by making HEAD requests to potential URLs (e.g., /repo/2-1-, /repo/2-2-). However, the actual implementation relies entirely on parsing links from the main wiki page.
Sources: tools/deepwiki-scraper.py:44-76
Discovery Method Comparison
| Method | HTTP Requests | Status |
|---|---|---|
| Link parsing (extract_wiki_structure) | Single GET of the main wiki page | Used |
| Subsection probing (discover_subsections) | One HEAD request per candidate URL | Present but unused |
Sources: tools/deepwiki-scraper.py:44-76 tools/deepwiki-scraper.py:78-125
Error Handling
No Pages Found
The system validates that at least one page was discovered:
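A minimal sketch of the guard; the exact message and exit mechanism are assumptions:

```python
import sys

if not pages:
    print(f"Error: no wiki pages found for {repo}", file=sys.stderr)
    sys.exit(1)
```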
Sources: tools/deepwiki-scraper.py:828-830
Network Failures
The fetch_page function includes retry logic:
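A hedged reconstruction consistent with the three-attempt retry and browser-like headers mentioned earlier; the timeout, backoff delay, and header contents are assumptions:

```python
import time
import requests

def fetch_page(url, session, retries=3):
    """GET a URL, retrying transient failures up to `retries` times."""
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise          # out of attempts; propagate the error
            time.sleep(2)      # assumed fixed backoff between attempts
```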
Sources: tools/deepwiki-scraper.py:33-42
Summary
The wiki structure discovery process provides a robust mechanism for identifying all pages in a DeepWiki repository through a single HTML parse operation. The resulting page list is hierarchically organized and drives all subsequent extraction operations in Phase 1.
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831