Wiki Structure Discovery
Purpose and Scope
This document describes the wiki structure discovery mechanism in Phase 1 of the processing pipeline. The system analyzes the main DeepWiki repository page to identify all available wiki pages and their hierarchical relationships. This discovery phase produces a structured page list that drives subsequent content extraction.
For the HTML-to-Markdown conversion that follows discovery, see HTML to Markdown Conversion. For the overall Phase 1 process, see Phase 1: Markdown Extraction.
Overview
The discovery process fetches the main wiki page for a repository and parses its HTML to extract all wiki page references. The system identifies both main pages (e.g., 1, 2, 3) and subsections (e.g., 2.1, 2.2, 3.1) by analyzing link patterns. The output is a sorted list of page metadata that includes page numbers, titles, URLs, and hierarchical levels.
```mermaid
flowchart TD
    Start["main() entry point"] --> ValidateRepo["Validate repo format (owner/repo)"]
    ValidateRepo --> CreateSession["Create requests.Session with User-Agent headers"]
    CreateSession --> CallExtract["extract_wiki_structure(repo, session)"]
    CallExtract --> FetchMain["Fetch https://deepwiki.com/{repo}"]
    FetchMain --> ParseHTML["BeautifulSoup(response.text)"]
    ParseHTML --> FindLinks["soup.find_all('a', href=regex)"]
    FindLinks --> IterateLinks["Iterate over all links"]
    IterateLinks --> ExtractPattern["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
    ExtractPattern --> BuildPageDict["Build page dict: {number, title, url, href, level}"]
    BuildPageDict --> CheckDupe{"href in seen_urls?"}
    CheckDupe -->|Yes| IterateLinks
    CheckDupe -->|No| AddToList["pages.append(page_dict)"]
    AddToList --> IterateLinks
    IterateLinks -->|Done| SortPages["Sort by numeric parts: [int(x) for x in num.split('.')]"]
    SortPages --> ReturnPages["Return pages list"]
    ReturnPages --> ProcessPages["Process each page in main loop"]
    style CallExtract fill:#f9f,stroke:#333,stroke-width:2px
    style ExtractPattern fill:#f9f,stroke:#333,stroke-width:2px
    style SortPages fill:#f9f,stroke:#333,stroke-width:2px
```
Discovery Flow Diagram
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831
Main Discovery Function
The extract_wiki_structure function performs the core discovery logic. It accepts a repository identifier (e.g., "jzombie/deepwiki-to-mdbook") and an HTTP session object, then returns a list of page dictionaries.
Function Signature and Entry Point
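The snippet below is a hedged reconstruction of the signature, following the call shape shown in the flow diagram (extract_wiki_structure(repo, session)); the docstring wording is illustrative:

```python
def extract_wiki_structure(repo, session):
    """Discover all wiki pages for a repository, e.g. repo = "jzombie/deepwiki-to-mdbook".

    Returns a list of page dicts sorted by their numeric page identifiers.
    """
```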
Sources: tools/deepwiki-scraper.py:78-79
HTTP Request and HTML Parsing
The function constructs the base URL and fetches the main wiki page:
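A minimal sketch of this step, assuming fetch_page is the module's retrying GET helper described under Error Handling; the parser argument passed to BeautifulSoup is an assumption:

```python
from bs4 import BeautifulSoup

base_url = f"https://deepwiki.com/{repo}"
response = fetch_page(base_url, session)            # retrying GET helper (see Error Handling)
soup = BeautifulSoup(response.text, 'html.parser')  # parser choice is an assumption
```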
The fetch_page helper includes retry logic (3 attempts) and browser-like headers to handle transient failures.
Sources: tools/deepwiki-scraper.py:80-84 tools/deepwiki-scraper.py:27-42
Link Pattern Matching
Regex-Based Link Discovery
The system uses a compiled regex pattern to find all wiki page links:
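The exact pattern is not shown in this document; a plausible sketch, assuming any repository-relative href that begins with a numbered segment qualifies:

```python
import re

# Hypothetical pattern: repository-relative hrefs whose last segment starts with a number.
link_pattern = re.compile(rf'^/{re.escape(repo)}/\d+')
links = soup.find_all('a', href=link_pattern)
```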
This pattern matches URLs like:
- /jzombie/deepwiki-to-mdbook/1-overview
- /jzombie/deepwiki-to-mdbook/2-quick-start
- /jzombie/deepwiki-to-mdbook/2-1-basic-usage
Sources: tools/deepwiki-scraper.py:88-90
Page Information Extraction
For each matched link, the system extracts page metadata using a detailed regex pattern:
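A sketch applying the documented pattern to an href; the title derivation is an assumption, and since live URLs use hyphens throughout (e.g. 2-1-basic-usage), the real code presumably normalizes the numeric prefix into dotted form around this match, a step not shown here:

```python
m = re.search(r'/(\d+(?:\.\d+)*)-(.+)$', href)
if m:
    number = m.group(1)                      # e.g. "1" or "2.1"
    slug = m.group(2)                        # e.g. "overview" or "basic-usage"
    title = slug.replace('-', ' ').title()   # assumed title derivation
```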
The regex r'/(\d+(?:\.\d+)*)-(.+)$' captures:
- Group 1: Page number with optional dots (e.g., 1, 2.1, 3.2.1)
- Group 2: URL slug (e.g., overview, basic-usage)
Sources: tools/deepwiki-scraper.py:98-107
Link Extraction Data Flow
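Continuing from the snippets above, the per-link flow can be sketched as follows. This is an assumption-laden reconstruction, not the actual implementation: the hyphen-to-dot normalization of the page number and the title derivation are hypothetical, and the level calculation is explained in a later section.

```python
pages = []
seen_urls = set()

for link in links:
    href = link['href']
    if href in seen_urls:
        continue                                  # skip duplicate links to the same page
    # Hypothetical normalization: DeepWiki hrefs use hyphens throughout
    # ("/repo/2-1-basic-usage"), so split the leading numeric run from the
    # slug and convert it to the dotted form used elsewhere in the pipeline.
    m = re.search(r'/(\d+(?:-\d+)*)-(\D.*)$', href)
    if not m:
        continue                                  # unnumbered link; not a wiki page
    seen_urls.add(href)
    number = m.group(1).replace('-', '.')         # "2-1" -> "2.1" (assumed)
    slug = m.group(2)
    pages.append({
        'number': number,
        'title': slug.replace('-', ' ').title(),  # assumed title derivation
        'url': f"https://deepwiki.com{href}",
        'href': href,
        'level': number.count('.'),               # 0 = main page, 1 = subsection, ...
    })
```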
Sources: tools/deepwiki-scraper.py:98-115
Deduplication and Sorting
Deduplication Strategy
The system maintains a seen_urls set to prevent duplicate page entries; as in the sketch above, any href already present in the set is skipped before a page dictionary is appended.
Sources: tools/deepwiki-scraper.py:92-116
Hierarchical Sorting
Pages are sorted by their numeric components to maintain proper ordering:
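The sort key splits the dotted page number into integer components, matching the expression shown in the flow diagram:

```python
pages.sort(key=lambda p: [int(x) for x in p['number'].split('.')])
```

Comparing lists of integers rather than raw strings avoids lexicographic misordering such as 10 sorting before 2.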
This ensures ordering like: 1 → 2 → 2.1 → 2.2 → 3 → 3.1
Sources: tools/deepwiki-scraper.py:118-123
Sorting Example
| Before Sorting (Link Order) | Page Number | After Sorting (Numeric Order) |
|---|---|---|
| /3-phase-3 | 3 | /1-overview |
| /2-1-subsection-one | 2.1 | /2-quick-start |
| /1-overview | 1 | /2-1-subsection-one |
| /2-quick-start | 2 | /2-2-subsection-two |
| /2-2-subsection-two | 2.2 | /3-phase-3 |
Page Data Structure
Page Dictionary Schema
Each discovered page is represented as a dictionary:
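A representative entry, assuming the fields named in the flow diagram; the values here are illustrative:

```python
page = {
    'number': '2.1',                                            # dotted page identifier
    'title': 'Basic Usage',                                     # human-readable title
    'url': 'https://deepwiki.com/jzombie/deepwiki-to-mdbook/2-1-basic-usage',
    'href': '/jzombie/deepwiki-to-mdbook/2-1-basic-usage',      # site-relative path
    'level': 1,                                                 # hierarchical depth
}
```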
Sources: tools/deepwiki-scraper.py:109-115
Level Calculation
The level field indicates hierarchical depth (see the sketch after the table):
| Page Number | Level | Type |
|---|---|---|
| 1 | 0 | Main page |
| 2 | 0 | Main page |
| 2.1 | 1 | Subsection |
| 2.2 | 1 | Subsection |
| 3.1.1 | 2 | Sub-subsection |
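Under this convention, the level is simply the count of dots in the page number, a minimal sketch:

```python
level = number.count('.')  # "1" -> 0, "2.1" -> 1, "3.1.1" -> 2
```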
Sources: tools/deepwiki-scraper.py:106-114
Discovery Result Processing
Output Statistics
After discovery, the system categorizes pages and reports statistics:
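A hedged sketch of the reporting step; the exact message wording is an assumption:

```python
main_pages = [p for p in pages if p['level'] == 0]
subsections = [p for p in pages if p['level'] > 0]
print(f"Found {len(pages)} pages: {len(main_pages)} main, {len(subsections)} subsections")
```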
Sources: tools/deepwiki-scraper.py:824-837
Integration with Content Extraction
The discovered page list drives the extraction loop in main():
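In outline, the main loop consumes the discovered list. This is a sketch: convert_page_to_markdown is a hypothetical stand-in for the HTML-to-Markdown conversion documented separately.

```python
pages = extract_wiki_structure(repo, session)
for page in pages:
    response = fetch_page(page['url'], session)
    # Hypothetical stand-in for the Phase 1 HTML-to-Markdown conversion step.
    convert_page_to_markdown(page, response.text)
```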
Sources: tools/deepwiki-scraper.py:841-860
Alternative Discovery Method (Unused)
Subsection Probing Function
The codebase includes a discover_subsections function that uses HTTP HEAD requests to probe for subsections, but this function is not invoked in the current implementation:
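Its behavior can be sketched as follows; the signature, probing order, and stopping condition are assumptions beyond what the surrounding text states:

```python
def discover_subsections(repo, main_num, session):
    """Probe for subsection pages via HEAD requests (unused in the current pipeline)."""
    found = []
    sub = 1
    while True:
        candidate = f"https://deepwiki.com/{repo}/{main_num}-{sub}-"
        resp = session.head(candidate, allow_redirects=True, timeout=10)
        if resp.status_code != 200:
            break                        # no such subsection; stop probing
        found.append(f"{main_num}.{sub}")
        sub += 1
    return found
```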
This function attempts to discover subsections by making HEAD requests to potential URLs (e.g., /repo/2-1-, /repo/2-2-). However, the actual implementation relies entirely on parsing links from the main wiki page.
Sources: tools/deepwiki-scraper.py:44-76
Discovery Method Comparison
| Method | HTTP Requests | Status |
|---|---|---|
| Link parsing (extract_wiki_structure) | Single GET of the main wiki page | Used |
| Subsection probing (discover_subsections) | One HEAD request per candidate URL | Present but unused |
Sources: tools/deepwiki-scraper.py:44-76 tools/deepwiki-scraper.py:78-125
Error Handling
No Pages Found
The system validates that at least one page was discovered:
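A minimal sketch of the guard; the exact message and exit mechanism are assumptions:

```python
import sys

if not pages:
    print(f"Error: no wiki pages found for {repo}", file=sys.stderr)
    sys.exit(1)
```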
Sources: tools/deepwiki-scraper.py:828-830
Network Failures
The fetch_page function includes retry logic:
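A hedged reconstruction consistent with the three-attempt retry and browser-like headers mentioned earlier; the timeout, backoff delay, and header contents are assumptions:

```python
import time
import requests

def fetch_page(url, session, retries=3):
    """GET a URL, retrying transient failures up to `retries` times."""
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise          # out of attempts; propagate the error
            time.sleep(2)      # assumed fixed backoff between attempts
```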
Sources: tools/deepwiki-scraper.py:33-42
Summary
The wiki structure discovery process provides a robust mechanism for identifying all pages in a DeepWiki repository through a single HTML parse operation. The resulting page list is hierarchically organized and drives all subsequent extraction operations in Phase 1.
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:790-831