GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

deepwiki-scraper.py

Purpose and Scope

The deepwiki-scraper.py script is the primary data extraction and transformation component that converts DeepWiki wiki content into enhanced markdown files. It orchestrates a three-phase pipeline: (1) extracting clean markdown from DeepWiki HTML, (2) enhancing files with normalized Mermaid diagrams using fuzzy matching, and (3) moving completed files to the output directory.

This page documents the script’s architecture, execution model, and key algorithms. For information about how this script is invoked by the build system, see build-docs.sh Orchestrator. For detailed explanations of the extraction and enhancement phases, see Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement.

Sources: python/deepwiki-scraper.py:1-11


Command-Line Interface

The script requires two positional arguments:

  • Repository identifier : Format owner/repo (e.g., jzombie/deepwiki-to-mdbook)
  • Output directory : Destination path for generated markdown files

The repository identifier is validated using the regex pattern ^[\w-]+/[\w-]+$ at python/deepwiki-scraper.py:1287-1289. The script exits with status code 1 if validation fails or if the wiki structure cannot be extracted.
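A minimal sketch of this argument handling is shown below; only the validation regex and the exit code come from the script, while the function name and error messages are illustrative assumptions:

```python
import re
import sys

def parse_args(argv):
    """Validate the two positional arguments (illustrative; messages are not the script's)."""
    if len(argv) != 3:
        print("Usage: deepwiki-scraper.py <owner/repo> <output-dir>", file=sys.stderr)
        sys.exit(1)
    repo, output_dir = argv[1], argv[2]
    # Same validation pattern as cited at python/deepwiki-scraper.py:1287-1289
    if not re.match(r'^[\w-]+/[\w-]+$', repo):
        print(f"Error: repository must be in owner/repo format, got {repo!r}", file=sys.stderr)
        sys.exit(1)
    return repo, output_dir

repo, output_dir = parse_args(sys.argv)
```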

Sources: python/deepwiki-scraper.py:1-10 python/deepwiki-scraper.py:1277-1289


Three-Phase Execution Model

Figure 1: Three-Phase Execution Pipeline

The main() function at python/deepwiki-scraper.py:1277-1410 implements a three-phase workflow:

| Phase | Function | Primary Responsibility | Output Location |
|-------|----------|------------------------|-----------------|
| 1 | extract_wiki_structure + extract_page_content | Scrape HTML and convert to markdown | Temporary directory |
| 2 | extract_and_enhance_diagrams | Match and inject Mermaid diagrams | In-place modification of temp directory |
| 3 | File system operations | Move validated files to output | Final output directory |

A temporary directory is created at python/deepwiki-scraper.py:1295-1296 using Python’s tempfile.TemporaryDirectory context manager. This ensures automatic cleanup even if the script fails. A raw markdown snapshot is saved to raw_markdown/ at python/deepwiki-scraper.py:1358-1366 before diagram enhancement for debugging purposes.
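A simplified skeleton of this workflow, reusing the function names and signatures documented in the reference table at the end of this page. It assumes extract_page_content returns the page markdown; the DeepWiki base URL and the raw_markdown destination are assumptions, not taken from the script:

```python
import shutil
import tempfile
from pathlib import Path

def run_pipeline(repo, output_dir, session):
    """Simplified skeleton of main(); the real function adds logging and error handling."""
    base_url = f"https://deepwiki.com/{repo}"   # assumed base URL
    out_root = Path(output_dir)

    with tempfile.TemporaryDirectory() as temp_dir:   # cleaned up automatically, even on failure
        temp_path = Path(temp_dir)

        # Phase 1: scrape each page and write markdown into the temp directory
        pages = extract_wiki_structure(repo, session)
        for page in pages:
            markdown = extract_page_content(page['url'], session, page)
            out_file = temp_path / resolve_output_path(page['number'], page['title'])
            out_file.parent.mkdir(parents=True, exist_ok=True)
            out_file.write_text(markdown)

        # Debugging snapshot of the pre-enhancement markdown
        shutil.copytree(temp_path, out_root / 'raw_markdown', dirs_exist_ok=True)

        # Phase 2: inject Mermaid diagrams in place
        extract_and_enhance_diagrams(repo, temp_path, session, base_url)

        # Phase 3: move the finished files to the output directory
        for md_file in temp_path.rglob('*.md'):
            target = out_root / md_file.relative_to(temp_path)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(md_file), target)
```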

Sources: python/deepwiki-scraper.py:1277-1410 python/deepwiki-scraper.py:1298-1371


Wiki Structure Discovery

Figure 2: Structure Discovery Algorithm Using extract_wiki_structure

The extract_wiki_structure function at python/deepwiki-scraper.py:116-163 discovers all wiki pages by parsing the main wiki index page. It uses a compiled regex pattern to find all links matching ^/{repo_pattern}/\d+ at python/deepwiki-scraper.py:128-129.

The page numbering scheme distinguishes main pages from subsections using dot notation:

  • Level 0 : Main pages (e.g., 1, 2, 3) - pages with no dots
  • Level 1 : Subsections (e.g., 2.1, 2.2) - pages with one dot
  • Level N : Deeper subsections (e.g., 2.1.3) - pages with N dots

The level is calculated at python/deepwiki-scraper.py:145 as page_num.count('.'). Pages are sorted using a custom key function at python/deepwiki-scraper.py:157-159 that splits the page number by dots and converts each component to an integer, ensuring proper numerical ordering (e.g., 2.10 comes after 2.9).
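An equivalent sort key, shown here for illustration:

```python
def page_sort_key(page_num):
    """Compare numeric components so '2.10' sorts after '2.9' (plain string sorting would not)."""
    return [int(part) for part in page_num.split('.')]

pages = ["1", "2.10", "2.2", "2.9", "3"]
print(sorted(pages, key=page_sort_key))   # ['1', '2.2', '2.9', '2.10', '3']
print("2.1.3".count('.'))                 # 2 dots -> a level-2 subsection
```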

Each page dictionary contains:

  • number: Page number string (e.g., "2.1")
  • title: Extracted link text
  • url: Full URL to the page
  • href: Relative path (used for link rewriting)
  • level: Nesting depth based on dot count

Sources: python/deepwiki-scraper.py:116-163 python/deepwiki-scraper.py:145 python/deepwiki-scraper.py:157-161


Path Resolution and Numbering Normalization

Figure 3: Path Resolution Using normalized_number_parts and resolve_output_path

The path resolution system normalizes DeepWiki’s numbering scheme to match mdBook’s conventions. The normalized_number_parts function at python/deepwiki-scraper.py:28-43 shifts page numbers down by one so that DeepWiki’s page 1 becomes unnumbered (the index page), and subsequent pages start at 1.

| DeepWiki Number | normalized_number_parts Output | Final Filename |
|-----------------|--------------------------------|----------------|
| "1" | [] (empty list) | overview.md (unnumbered) |
| "2" | ["1"] | 1-introduction.md |
| "3.1" | ["2", "1"] | 2-1-subsection.md |
| "3.2" | ["2", "2"] | 2-2-another.md |

The resolve_output_path function at python/deepwiki-scraper.py:45-53 combines normalized numbers with sanitized titles. Subsections (with len(parts) > 1) are placed in directories named section-{main_number} at python/deepwiki-scraper.py:52. The sanitize_filename function at python/deepwiki-scraper.py:22-26 strips special characters and normalizes whitespace using regex patterns r'[^\w\s-]' and r'[-\s]+'.
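A behavioral sketch that reproduces the table above; the real functions may handle casing and edge cases differently:

```python
import re

def sanitize_filename(text):
    """Strip special characters and collapse whitespace/hyphens (same patterns as cited)."""
    text = re.sub(r'[^\w\s-]', '', text).strip()
    return re.sub(r'[-\s]+', '-', text).lower()

def normalized_number_parts(page_number):
    """Shift DeepWiki numbers down by one: '1' -> [], '2' -> ['1'], '3.1' -> ['2', '1']."""
    parts = page_number.split('.')
    main = int(parts[0]) - 1
    if main == 0 and len(parts) == 1:
        return []
    return [str(main)] + parts[1:]

def resolve_output_path(page_number, title):
    """Place subsections in section-<main> directories, e.g. section-2/2-1-subsection.md."""
    parts = normalized_number_parts(page_number)
    slug = sanitize_filename(title)
    if not parts:
        return f"{slug}.md"
    filename = f"{'-'.join(parts)}-{slug}.md"
    return f"section-{parts[0]}/{filename}" if len(parts) > 1 else filename

print(resolve_output_path("1", "Overview"))       # overview.md
print(resolve_output_path("3.1", "Subsection"))   # section-2/2-1-subsection.md
```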

The build_target_path function at python/deepwiki-scraper.py:55-63 constructs full relative paths for link rewriting, used by the link fixing logic at python/deepwiki-scraper.py:854-875.

Sources: python/deepwiki-scraper.py:28-63 python/deepwiki-scraper.py:22-26 python/deepwiki-scraper.py:854-875


Content Extraction and HTML-to-Markdown Conversion

Figure 4: Content Extraction Pipeline Using extract_page_content

The extract_page_content function at python/deepwiki-scraper.py:751-877 implements a multi-stage HTML cleaning and conversion pipeline. BeautifulSoup selectors at python/deepwiki-scraper.py:761-762 remove navigation elements before content extraction.

The content finder at python/deepwiki-scraper.py:765-779 tries a prioritized list of selectors: article, main, .wiki-content, .content, #content, .markdown-body, and finally falls back to body. DeepWiki-specific UI elements are removed at python/deepwiki-scraper.py:786-795 by searching for text patterns like “Index your code with Devin” and “Edit Wiki”.

Navigation list removal at python/deepwiki-scraper.py:799-806 detects and removes <ul> elements containing more than 5 links where 80%+ are internal wiki links.

The convert_html_to_markdown function at python/deepwiki-scraper.py:213-228 uses the html2text library with configuration:

  • ignore_links = False - preserve all links
  • body_width = 0 - disable line wrapping to prevent formatting issues
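A minimal configuration sketch of this conversion step, using the standard html2text API with the two settings listed above:

```python
import html2text

def convert_html_to_markdown(html_content):
    """Approximate configuration of the html2text converter described above."""
    h = html2text.HTML2Text()
    h.ignore_links = False   # keep all links so they can be rewritten later
    h.body_width = 0         # disable hard wrapping, which would break long lines
    return h.handle(html_content)
```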

A comment at python/deepwiki-scraper.py:221-223 explicitly documents that Mermaid diagram processing is disabled during HTML conversion because diagrams from ALL pages are mixed together in the JavaScript payload.

The clean_deepwiki_footer function at python/deepwiki-scraper.py:165-211 removes DeepWiki UI elements using compiled regex patterns for text like “Dismiss”, “Refresh this wiki”, and “On this page”. It scans backwards from the end of the file up to 50 lines to find footer markers at python/deepwiki-scraper.py:187-191.

Link rewriting at python/deepwiki-scraper.py:854-875 converts DeepWiki URLs to relative markdown paths, handling both same-section and cross-section references by calculating relative paths based on the source file’s section directory.

Sources: python/deepwiki-scraper.py:751-877 python/deepwiki-scraper.py:213-228 python/deepwiki-scraper.py:165-211 python/deepwiki-scraper.py:395-406


Diagram Extraction from JavaScript Payload

Figure 5: Diagram Extraction Using extract_and_enhance_diagrams

The extract_and_enhance_diagrams function at python/deepwiki-scraper.py:880-1275 extracts all Mermaid diagrams from DeepWiki’s Next.js JavaScript payload. The regex pattern at python/deepwiki-scraper.py:899 matches fenced code blocks with various newline formats: \\r\\n, \\n, or actual newline characters.

Context extraction at python/deepwiki-scraper.py:903-1087 captures up to 2000 characters before each diagram to enable fuzzy matching. For each diagram, the context is parsed to extract:

  1. Last heading : The most recent line starting with # (searched backwards from diagram position)
  2. Anchor text : The last 2-3 non-heading lines exceeding 20 characters in length, concatenated and truncated to 300 characters

The context extraction logic at python/deepwiki-scraper.py:1066-1081 searches backwards through context lines to find the last heading, then collects up to 3 substantial non-heading lines as anchor text.
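A simplified version of this backward scan; it illustrates the approach rather than reproducing the script's exact code:

```python
def extract_diagram_context(context_text, max_anchor_chars=300):
    """Find the last heading and up to 3 substantial non-heading lines before a diagram."""
    lines = [line.strip() for line in context_text.splitlines() if line.strip()]
    last_heading = None
    anchor_lines = []
    for line in reversed(lines):                     # walk backwards from the diagram position
        if line.startswith('#'):
            if last_heading is None:
                last_heading = line
            continue
        if len(line) > 20 and len(anchor_lines) < 3:  # "substantial" lines become anchor text
            anchor_lines.append(line)
    anchor_text = ' '.join(reversed(anchor_lines))[:max_anchor_chars]
    return last_heading, anchor_text
```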

The unescaping phase at python/deepwiki-scraper.py:1039-1046 handles JavaScript string escapes:

| Escaped Sequence | Unescaped Result |
|------------------|------------------|
| \\n | Newline character |
| \\t | Tab character |
| \\" | Double quote |
| \\\\ | Single backslash |
| \\u003c | < character |
| \\u003e | > character |
| \\u0026 | & character |
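A sketch of the unescaping step as a sequence of string replacements; the script's exact replacement order may differ:

```python
def unescape_js_string(text):
    """Apply the replacements from the table above (order matters for backslash handling)."""
    replacements = [
        ('\\u003c', '<'),
        ('\\u003e', '>'),
        ('\\u0026', '&'),
        ('\\n', '\n'),
        ('\\t', '\t'),
        ('\\"', '"'),
        ('\\\\', '\\'),
    ]
    for escaped, unescaped in replacements:
        text = text.replace(escaped, unescaped)
    return text

print(unescape_js_string('graph TD\\n  A[\\"Start\\"] --\\u003e B'))
# graph TD
#   A["Start"] --> B
```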

The merge_multiline_labels function at python/deepwiki-scraper.py:907-1009 collapses wrapped Mermaid labels into literal \n sequences. This is crucial because DeepWiki sometimes wraps long labels across multiple lines in the HTML, but Mermaid 11 expects these to be explicitly marked with \n tokens.

Sources: python/deepwiki-scraper.py:880-1087 python/deepwiki-scraper.py:899 python/deepwiki-scraper.py:1039-1046 python/deepwiki-scraper.py:907-1009


Seven-Step Mermaid Normalization Pipeline

Figure 6: Seven-Step Normalization Pipeline Using normalize_mermaid_diagram

The normalize_mermaid_diagram function at python/deepwiki-scraper.py:385-393 applies seven normalization passes to ensure Mermaid 11 compatibility:

Step 1: normalize_mermaid_edge_labels

Function at python/deepwiki-scraper.py:230-251. Applies only to graphs and flowcharts (detected by checking if the first line starts with graph or flowchart). Uses regex r'\|([^|]*)\|' to find edge labels and flattens any containing \n, \\n, (, or ) by:

  • Replacing \\n and \n with spaces
  • Removing parentheses
  • Collapsing whitespace with re.sub(r'\s+', ' ', cleaned).strip()

Step 2: normalize_mermaid_state_descriptions

Function at python/deepwiki-scraper.py:253-277. Applies only to state diagrams. Ensures state descriptions use the syntax State : Description by:

  • Skipping lines with :: (already valid)
  • Splitting on single : and cleaning suffix
  • Replacing colons in description with -
  • Rebuilding as {prefix.rstrip()} : {cleaned_suffix}

Step 3: normalize_flowchart_nodes

Function at python/deepwiki-scraper.py:279-301. Applies to graphs and flowcharts. Uses regex r'\["([^"]*)"\]' to find and clean quoted node labels.

Step 4: normalize_statement_separators

Function at python/deepwiki-scraper.py:313-328. Applies to graphs and flowcharts. The STATEMENT_BREAK_PATTERN at python/deepwiki-scraper.py:309-311 detects consecutive statements on one line and inserts newlines between them while preserving indentation.

Step 5: normalize_empty_node_labels

Function at python/deepwiki-scraper.py:330-341. Uses regex r'(\b[A-Za-z0-9_]+)\[""\]' to find nodes with empty labels. Generates a fallback label from the node ID by replacing underscores/hyphens with spaces.
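A sketch of this pass, using the regex and fallback rule described above (illustrative, not the script's exact code):

```python
import re

def normalize_empty_node_labels(diagram_text):
    """Replace build_docs[""] with build_docs["build docs"], deriving the label from the node ID."""
    def fallback(match):
        node_id = match.group(1)
        label = re.sub(r'[_-]+', ' ', node_id).strip()
        return f'{node_id}["{label}"]'
    return re.sub(r'(\b[A-Za-z0-9_]+)\[""\]', fallback, diagram_text)

print(normalize_empty_node_labels('graph TD\n  build_docs[""] --> deploy[""]'))
# graph TD
#   build_docs["build docs"] --> deploy["deploy"]
```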

Step 6: normalize_gantt_diagram

Function at python/deepwiki-scraper.py:343-383. Applies only to Gantt diagrams. Detects task lines missing IDs using pattern r'^(\s*"[^"]+"\s*):\s*(.+)$' and inserts synthetic IDs (task1, task2, etc.) when the first token after the colon is not an ID or an after reference.

Sources: python/deepwiki-scraper.py:385-393 python/deepwiki-scraper.py:230-251 python/deepwiki-scraper.py:253-277 python/deepwiki-scraper.py:279-301 python/deepwiki-scraper.py:313-328 python/deepwiki-scraper.py:330-341 python/deepwiki-scraper.py:343-383


Fuzzy Matching and Diagram Injection

Figure 7: Fuzzy Matching Algorithm for Diagram Injection

The fuzzy matching algorithm at python/deepwiki-scraper.py:1150-1275 pairs each diagram with its correct markdown file by matching context against file contents. The algorithm uses a progressive chunk size strategy at python/deepwiki-scraper.py:1188 to find matches:

| Chunk Size | Use Case |
|------------|----------|
| 300 chars | Highest precision - exact context match |
| 200 chars | Medium precision - paragraph-level match |
| 150 chars | Lower precision - sentence-level match |
| 100 chars | Low precision - phrase-level match |
| 80 chars | Minimum threshold - short phrase match |

The matching loop at python/deepwiki-scraper.py:1170-1238 attempts anchor text matching first. The anchor text (last 2-3 lines of context before the diagram) is normalized to lowercase with whitespace collapsed at python/deepwiki-scraper.py:1185-1186. For each chunk size, the algorithm searches for the test chunk at the end of the anchor text (anchor_normalized[-chunk_size:]) in the normalized file content.

If anchor matching fails (score < 80), the algorithm falls back to heading matching at python/deepwiki-scraper.py:1204-1216. This compares the last_heading from diagram context against all headings in the file after normalizing both by removing # symbols and collapsing whitespace.

Only matches with best_match_score >= 80 are accepted at python/deepwiki-scraper.py:1218. This threshold balances precision (avoiding false matches) with recall (ensuring most diagrams are placed).
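A simplified sketch of the progressive matching loop, assuming the match score equals the matched chunk size; the script's exact scoring and heading fallback are omitted:

```python
def find_best_match(anchor_text, file_contents, chunk_sizes=(300, 200, 150, 100, 80)):
    """Return (filename, score) for the best anchor-text match, or (None, 0) below the threshold."""
    anchor_normalized = ' '.join(anchor_text.lower().split())
    best_file, best_score = None, 0
    for filename, content in file_contents.items():
        content_normalized = ' '.join(content.lower().split())
        for chunk_size in chunk_sizes:               # largest (most precise) chunks first
            chunk = anchor_normalized[-chunk_size:]  # tail of the anchor text
            if chunk and chunk in content_normalized:
                if chunk_size > best_score:
                    best_file, best_score = filename, chunk_size
                break                                # stop at the largest match for this file
    return (best_file, best_score) if best_score >= 80 else (None, 0)
```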

Insertion Point Logic

The insertion point finder at python/deepwiki-scraper.py:1220-1236 behaves differently based on match type:

After heading match :

  1. Skip blank lines after heading
  2. Skip through the following paragraph
  3. Insert after the paragraph ends (blank line or next heading)

After paragraph match :

  1. Find end of current paragraph
  2. Insert when encountering blank line or heading

Content Guards

The enforce_content_start function at python/deepwiki-scraper.py:1138-1147 and advance_past_lists function at python/deepwiki-scraper.py:1125-1137 implement content guards to prevent diagram insertion in protected areas:

Protected prefix (detected by protected_prefix_end at python/deepwiki-scraper.py:1101-1115):

  • Title line (first line starting with #)
  • “Relevant source files” section and its list items
  • Blank lines in these sections

List blocks (detected by is_list_line at python/deepwiki-scraper.py:1117-1123):

  • Lines starting with -, *, +
  • Lines matching \d+[.)]\s (numbered lists)

Diagrams are never inserted inside list blocks. If the insertion point lands in a list, advance_past_lists moves the insertion point to after the list ends.

Dynamic Fence Length

The insertion logic at python/deepwiki-scraper.py:1249-1266 calculates a dynamic fence length to handle diagrams that themselves contain backticks. It scans the diagram text for the longest run of consecutive backticks and sets fence_len = max(3, max_backticks + 1), so the opening and closing fences (e.g., a four-backtick ````mermaid fence) always use more backticks than any run inside the diagram content.
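A sketch of the fence calculation:

```python
import re

def fence_for(diagram_text):
    """Wrap a diagram in a fence one backtick longer than its longest internal backtick run."""
    runs = re.findall(r'`+', diagram_text)
    max_backticks = max((len(run) for run in runs), default=0)
    fence = '`' * max(3, max_backticks + 1)
    return f"{fence}mermaid\n{diagram_text}\n{fence}"

print(fence_for('graph TD\n  A --> B'))
```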

Sources: python/deepwiki-scraper.py:1150-1275 python/deepwiki-scraper.py:1170-1238 python/deepwiki-scraper.py:1101-1147 python/deepwiki-scraper.py:1249-1266


Error Handling and Retry Logic

Figure 8: Retry Logic in fetch_page Function

The fetch_page function at python/deepwiki-scraper.py:65-80 implements a 3-attempt retry strategy. The retry loop at python/deepwiki-scraper.py:71-80 catches all exceptions using a broad except Exception as e clause and retries after a fixed 2-second delay using time.sleep(2).

Browser-like headers are set at python/deepwiki-scraper.py:67-69 to avoid bot detection:

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

The timeout is set to 30 seconds at python/deepwiki-scraper.py:73. After a successful fetch, response.raise_for_status() validates the HTTP status code.
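A sketch of this retry behavior; the structure follows the description above but is not the script verbatim:

```python
import time
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
}

def fetch_page(url, session, attempts=3):
    """Fetch a URL with browser-like headers, a 30-second timeout, and up to 3 attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            response = session.get(url, headers=HEADERS, timeout=30)
            response.raise_for_status()
            return response
        except Exception as e:
            last_error = e
            if attempt < attempts - 1:
                time.sleep(2)   # fixed 2-second delay before the next attempt
    raise last_error
```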

The main extraction loop at python/deepwiki-scraper.py:1328-1353 catches exceptions per-page and continues processing remaining pages.
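The loop body is not reproduced on this page; a simplified sketch of the pattern, where write_page is a hypothetical stand-in for the file-writing step:

```python
success_count = 0
for page in pages:
    try:
        markdown = extract_page_content(page['url'], session, page)
        write_page(temp_dir, page, markdown)   # hypothetical helper, not in the script
        success_count += 1
    except Exception as e:
        # A failed page is reported and skipped rather than aborting the run
        print(f"  Warning: failed to extract page {page['number']} ({page['title']}): {e}")
```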

This ensures that a single page failure doesn’t abort the entire scraping process. The success count is reported at python/deepwiki-scraper.py:1355.

The top-level try-except block at python/deepwiki-scraper.py:1310-1407 catches any unhandled exceptions and exits with status code 1, signaling failure to the calling build script.

Sources: python/deepwiki-scraper.py:65-80 python/deepwiki-scraper.py:1328-1353 python/deepwiki-scraper.py:1310-1407


Session Management and Rate Limiting

The script uses a requests.Session object created at python/deepwiki-scraper.py:1305-1308 with persistent headers.

Session reuse provides connection pooling and persistent cookies across requests. The session is passed to all HTTP functions: extract_wiki_structure, extract_page_content, and extract_and_enhance_diagrams.

Rate limiting is implemented at python/deepwiki-scraper.py:1350 with a 1-second sleep between page extractions.

This prevents overwhelming the DeepWiki server and reduces the risk of rate limiting or IP blocking. The comment at python/deepwiki-scraper.py:1349 explicitly states “Be nice to the server”.
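A sketch of the session setup and per-page delay described above; the surrounding loop is simplified and the abbreviated User-Agent stands in for the full string shown earlier:

```python
import time
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 ...'})   # persistent browser-like headers

for page in pages:                       # pages discovered by extract_wiki_structure
    extract_page_content(page['url'], session, page)
    time.sleep(1)                        # 1-second pause between page extractions
```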

Sources: python/deepwiki-scraper.py:1305-1308 python/deepwiki-scraper.py:1349-1350


Key Function Reference

| Function | Lines | Purpose |
|----------|-------|---------|
| main() | 1277-1410 | Entry point - orchestrates three-phase pipeline |
| extract_wiki_structure(repo, session) | 116-163 | Discover all wiki pages from index |
| extract_page_content(url, session, page_info) | 751-877 | Extract and clean single page content |
| extract_and_enhance_diagrams(repo, temp_dir, session, url) | 880-1275 | Extract diagrams and inject into files |
| convert_html_to_markdown(html_content) | 213-228 | Convert HTML to markdown using html2text |
| clean_deepwiki_footer(markdown) | 165-211 | Remove DeepWiki UI elements from footer |
| normalize_mermaid_diagram(diagram_text) | 385-393 | Apply seven-step normalization pipeline |
| normalize_mermaid_edge_labels(diagram_text) | 230-251 | Flatten multiline edge labels |
| normalize_mermaid_state_descriptions(diagram_text) | 253-277 | Fix state diagram syntax |
| normalize_flowchart_nodes(diagram_text) | 279-301 | Clean flowchart node labels |
| normalize_statement_separators(diagram_text) | 313-328 | Insert newlines between statements |
| normalize_empty_node_labels(diagram_text) | 330-341 | Provide fallback labels |
| normalize_gantt_diagram(diagram_text) | 343-383 | Add synthetic task IDs |
| merge_multiline_labels(diagram_text) | 907-1009 | Collapse wrapped labels |
| strip_wrapping_quotes(diagram_text) | 1011-1022 | Remove extra quotes |
| fetch_page(url, session) | 65-80 | HTTP fetch with retry logic |
| sanitize_filename(text) | 22-26 | Convert text to safe filename |
| normalized_number_parts(page_number) | 28-43 | Shift DeepWiki numbering down by 1 |
| resolve_output_path(page_number, title) | 45-53 | Determine filename and section dir |
| build_target_path(page_number, slug) | 55-63 | Build relative path for links |
| format_source_references(markdown) | 397-406 | Insert colons in source links |

Sources: python/deepwiki-scraper.py:1-1411
