deepwiki-scraper.py
Purpose and Scope
The deepwiki-scraper.py script is the core content extraction engine that scrapes wiki pages from DeepWiki.com and converts them into clean Markdown files with intelligently placed Mermaid diagrams. This page documents the script's internal architecture, algorithms, and data transformations.
For information about how this script is orchestrated within the larger build system, see 5.1: build-docs.sh Orchestrator. For detailed explanations of the extraction and enhancement phases, see 6: Phase 1: Markdown Extraction and 7: Phase 2: Diagram Enhancement.
Sources: tools/deepwiki-scraper.py:1-11
Command-Line Interface
The script accepts exactly two arguments and is designed to be called programmatically:
| Parameter | Description | Example |
|---|---|---|
| owner/repo | GitHub repository identifier in the format owner/repo | facebook/react |
| output-dir | Directory where Markdown files will be written | ./output/markdown |
The script validates the repository format using regex ^[\w-]+/[\w-]+$ and exits with an error if the format is invalid.
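A minimal sketch of the argument handling described above, assuming illustrative variable names rather than the script's exact code:

```python
import re
import sys

# Minimal sketch of the CLI contract described above (names are illustrative).
if len(sys.argv) != 3:
    print("Usage: deepwiki-scraper.py <owner/repo> <output-dir>")
    sys.exit(1)

repo, output_dir = sys.argv[1], sys.argv[2]
if not re.match(r'^[\w-]+/[\w-]+$', repo):
    print(f"Error: invalid repository format: {repo}")
    sys.exit(1)
```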
Sources: tools/deepwiki-scraper.py:790-802
Main Execution Flow
The main() function orchestrates all operations using a temporary directory workflow to ensure atomic file operations:
Atomic Workflow Design: All scraping and enhancement operations occur in a temporary directory. Files are only moved to the final output directory after all processing completes successfully. If the script crashes or is interrupted, the output directory remains untouched.
```mermaid
graph TB
    Start["main()"] --> Validate["Validate Arguments\nRegex: ^[\w-]+/[\w-]+$"]
    Validate --> TempDir["Create Temporary Directory\ntempfile.TemporaryDirectory()"]
    TempDir --> Session["Create requests.Session()\nwith User-Agent headers"]
    Session --> Phase1["PHASE 1: Clean Markdown\nextract_wiki_structure()\nextract_page_content()"]
    Phase1 --> WriteTemp["Write files to temp_dir\nOrganized by hierarchy"]
    WriteTemp --> Phase2["PHASE 2: Diagram Enhancement\nextract_and_enhance_diagrams()"]
    Phase2 --> EnhanceTemp["Enhance files in temp_dir\nInsert diagrams via fuzzy matching"]
    EnhanceTemp --> Phase3["PHASE 3: Atomic Move\nshutil.copytree()\nshutil.copy2()"]
    Phase3 --> CleanOutput["Clear output_dir\nMove temp files to output"]
    CleanOutput --> Complete["Complete\ntemp_dir auto-deleted"]

    style Phase1 fill:#e8f5e9
    style Phase2 fill:#f3e5f5
    style Phase3 fill:#fff4e1
```
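A sketch of the temporary-directory pattern, assuming the standard-library calls named in the diagram; the phase internals are omitted and the output path is illustrative:

```python
import shutil
import tempfile
from pathlib import Path

# Sketch of the atomic workflow; phase logic is elided.
with tempfile.TemporaryDirectory() as temp_dir:
    temp_path = Path(temp_dir)
    # Phase 1 and Phase 2 write and enhance Markdown files under temp_path here.

    # Phase 3: touch the real output directory only after everything succeeded.
    output_path = Path("./output/markdown")  # illustrative path
    if output_path.exists():
        shutil.rmtree(output_path)
    shutil.copytree(temp_path, output_path)
# The temporary directory is deleted automatically, even on exceptions.
```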
Sources: tools/deepwiki-scraper.py:790-919
Dependencies and HTTP Session
The script imports three primary libraries for web scraping and conversion:
| Dependency | Purpose | Key Usage |
|---|---|---|
| requests | HTTP client with session support | tools/deepwiki-scraper.py:17 |
| beautifulsoup4 | HTML parsing and DOM traversal | tools/deepwiki-scraper.py:18 |
| html2text | HTML to Markdown conversion | tools/deepwiki-scraper.py:19 |
The HTTP session is configured with browser-like headers to avoid being blocked:
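A sketch of the session setup; the exact header strings used by the script may differ:

```python
import requests

# Browser-like headers (illustrative values).
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
})
```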
Sources: tools/deepwiki-scraper.py:817-821 tools/requirements.txt:1-4
Core Function Reference
Structure Discovery Functions
extract_wiki_structure(repo, session) tools/deepwiki-scraper.py:78-125

- Fetches the repository's main wiki page
- Extracts all links matching the pattern /owner/repo/\d+
- Parses page numbers (e.g., 1, 2.1, 3.2.1) and titles
- Determines the hierarchy level by counting dots in the page number
- Returns a sorted list of page dictionaries with keys: number, title, url, href, level
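An illustrative sketch of this discovery step; the selectors and exact href pattern are assumptions, not the script's verbatim implementation:

```python
import re
from bs4 import BeautifulSoup

# Illustrative sketch of structure discovery (assumed href pattern).
def extract_wiki_structure(repo, session):
    html = session.get(f"https://deepwiki.com/{repo}").text
    soup = BeautifulSoup(html, "html.parser")
    pages = []
    for a in soup.find_all("a", href=True):
        match = re.match(rf"^/{re.escape(repo)}/(\d+(?:-\d+)*)-(.*)$", a["href"])
        if not match:
            continue
        number = match.group(1).replace("-", ".")
        pages.append({
            "number": number,
            "title": a.get_text(strip=True),
            "href": a["href"],
            "url": f"https://deepwiki.com{a['href']}",
            "level": number.count("."),
        })
    return sorted(pages, key=lambda p: [int(n) for n in p["number"].split(".")])
```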
discover_subsections(repo, main_page_num, session) tools/deepwiki-scraper.py:44-76

- Attempts to discover subsections by testing URL patterns
- Tests up to 10 subsections per main page (e.g., /repo/2-1-, /repo/2-2-)
- Uses HEAD requests for efficiency
- Returns a list of discovered subsection metadata
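A sketch of the probing approach, assuming the URL pattern shown above; the trailing-slug handling is simplified:

```python
# Sketch of subsection probing via HEAD requests (simplified; slugs omitted).
def discover_subsections(repo, main_page_num, session):
    found = []
    for i in range(1, 11):  # test up to 10 subsections per main page
        url = f"https://deepwiki.com/{repo}/{main_page_num}-{i}-"
        resp = session.head(url, allow_redirects=True)
        if resp.status_code == 200:
            found.append({"number": f"{main_page_num}.{i}", "url": url})
    return found
```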
Sources: tools/deepwiki-scraper.py:44-125
Content Extraction Functions
extract_page_content(url, session, current_page_info) tools/deepwiki-scraper.py:453-594

- Main content extraction function, called for each wiki page
- Removes navigation and UI elements before conversion
- Converts HTML to Markdown using the html2text library
- Rewrites internal DeepWiki links to relative Markdown file paths
- Returns a clean Markdown string
fetch_page(url, session) tools/deepwiki-scraper.py:27-42

- Implements retry logic: up to 3 attempts with a 2-second delay between attempts
- Raises an exception on final failure
- Returns a requests.Response object
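A sketch consistent with the retry behavior described above (the timeout value is an assumption):

```python
import time
import requests

# Sketch of the retry loop: up to 3 attempts, 2-second pause between them.
def fetch_page(url, session, retries=3, delay=2):
    for attempt in range(1, retries + 1):
        try:
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == retries:
                raise          # give up after the final attempt
            time.sleep(delay)
```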
convert_html_to_markdown(html_content) tools/deepwiki-scraper.py:175-216

- Configures html2text.HTML2Text() with body_width=0 (no line wrapping)
- Sets ignore_links=False to preserve link structure
- Calls clean_deepwiki_footer() to remove UI elements
- Diagrams are not extracted here (handled in Phase 2)
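A sketch of the converter configuration described above:

```python
import html2text

# Converter configuration per the description above.
def convert_html_to_markdown(html_content):
    h = html2text.HTML2Text()
    h.body_width = 0          # no line wrapping
    h.ignore_links = False    # preserve link structure
    markdown = h.handle(html_content)
    # The real script also runs clean_deepwiki_footer() on the result here.
    return markdown
```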
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:175-216 tools/deepwiki-scraper.py:453-594
Link Rewriting Logic

The script converts DeepWiki's absolute URLs to relative Markdown file paths, handling hierarchical section directories:

```mermaid
graph TB
    Input["Input: /owner/repo/4-2-query-planning"] --> Extract["Extract via regex:\n/(\d+(?:\.\d+)*)-(.+)$"]
    Extract --> ParseNum["page_num = '4.2'\nslug = 'query-planning'"]
    ParseNum --> ConvertNum["file_num = page_num.replace('.', '-')\nResult: '4-2'"]
    ConvertNum --> CheckTarget{"Target is\nsubsection?\n(has dot)"}
    CheckTarget -->|Yes| CheckSource{"Source is\nsubsection?\n(level > 0)"}
    CheckTarget -->|No| CheckSource2{"Source is\nsubsection?"}
    CheckSource -->|Yes, same section| SameSec["Return: '4-2-query-planning.md'"]
    CheckSource -->|No or different| DiffSec["Return: 'section-4/4-2-query-planning.md'"]
    CheckSource2 -->|Yes| UpLevel["Return: '../4-2-query-planning.md'"]
    CheckSource2 -->|No| SameLevel["Return: '4-2-query-planning.md'"]

    style CheckTarget fill:#e8f5e9
    style CheckSource fill:#fff4e1
```

Algorithm Implementation: tools/deepwiki-scraper.py:549-592

The fix_wiki_link() nested function handles four scenarios, sketched in code below:

- Both main pages: use the filename only (e.g., 2-overview.md)
- Source subsection → target main page: use a ../ prefix (e.g., ../2-overview.md)
- Both in the same section directory: use the filename only (e.g., 4-2-sql-parser.md)
- Different sections: use the full path (e.g., section-4/4-2-sql-parser.md)
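A condensed sketch of those four cases; the parameter names are illustrative and edge handling is simplified, so this is not the script's exact signature:

```python
# Condensed sketch of the four link-rewriting cases (names are illustrative).
def fix_wiki_link(target_num, target_slug, source_num, source_is_subsection):
    filename = f"{target_num.replace('.', '-')}-{target_slug}.md"

    if "." not in target_num:                      # target is a main page
        return f"../{filename}" if source_is_subsection else filename

    target_section = target_num.split(".")[0]
    if source_is_subsection and source_num.split(".")[0] == target_section:
        return filename                            # same section-N/ directory
    return f"section-{target_section}/{filename}"  # cross-section path
```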
Sources: tools/deepwiki-scraper.py:549-592
Diagram Enhancement Architecture
Diagram Extraction from Next.js Payload
DeepWiki embeds all Mermaid diagrams in a JavaScript payload within the HTML. The extract_and_enhance_diagrams() function extracts diagrams with contextual information:
Key Data Structures:
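A sketch of what one diagram_contexts entry holds; the field values are invented for illustration, and the keys match those shown in the extraction flow below:

```python
# Illustrative entry in diagram_contexts (values are made up for the example).
diagram_contexts = [
    {
        "last_heading": "## Query Planning",            # nearest heading before the diagram
        "anchor_text": "the planner builds a tree...",  # tail of the 500-char context window
        "diagram": "graph TB\n    Parser --> Planner",  # unescaped Mermaid source
    },
]
```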
````mermaid
graph TB
    Start["extract_and_enhance_diagrams(repo, temp_dir, session)"] --> FetchJS["Fetch https://deepwiki.com/{repo}/1-overview\nAny page contains all diagrams"]
    FetchJS --> Pattern1["Regex: ```mermaid\\n(.*?)```\nFind all diagram blocks"]
    Pattern1 --> Count["Print: Found {N} total diagrams"]
    Count --> Pattern2["Regex with context:\n([^`]{500,}?)```mermaid\\n(.*?)```"]
    Pattern2 --> Extract["For each match:\n- Extract 500-char context before\n- Extract diagram code"]
    Extract --> Unescape["Unescape sequences:\n\\n → newline\n\\t → tab\n\\\" → quote\n\\u003c → <"]
    Unescape --> Parse["Parse context:\n- Find last heading\n- Extract last 2-3 non-heading lines\n- Create anchor_text (last 300 chars)"]
    Parse --> Store["Store diagram_contexts[]\nKeys: last_heading, anchor_text, diagram"]
    Store --> Enhance["Enhance all .md files in temp_dir"]

    style Pattern1 fill:#e8f5e9
    style Pattern2 fill:#fff4e1
    style Parse fill:#f3e5f5
````
Sources: tools/deepwiki-scraper.py:596-674
Fuzzy Matching Algorithm

The script uses progressive chunk matching to find where diagrams belong in the Markdown content:

```mermaid
graph TB
    Start["For each markdown file"] --> Normalize["Normalize content:\n- Convert to lowercase\n- Collapse whitespace\n- content_normalized = ' '.join(content.split())"]
    Normalize --> Loop["For each diagram in diagram_contexts"]
    Loop --> GetAnchors["Get anchor_text and last_heading\nfrom diagram context"]
    GetAnchors --> TryChunks{"Try chunk sizes:\n300, 200, 150, 100, 80"}
    TryChunks --> ExtractChunk["Extract last N chars of anchor_text\ntest_chunk = anchor[-chunk_size:]"]
    ExtractChunk --> FindPos["pos = content_normalized.find(test_chunk)"]
    FindPos --> Found{"pos != -1?"}
    Found -->|Yes| ConvertLine["Convert char position to line number\nby counting chars in each line"]
    Found -->|No| TrySmaller{"Try smaller\nchunk?"}
    TrySmaller -->|Yes| ExtractChunk
    TrySmaller -->|No| Fallback["Fallback: Match heading text\nheading_normalized in line_normalized"]
    ConvertLine --> FindInsert["Find insertion point:\n- After heading: skip blanks, skip paragraph\n- After paragraph: find blank line"]
    Fallback --> FindInsert
    FindInsert --> Queue["Add to pending_insertions[]\n(line_num, diagram, score, idx)"]
    Queue --> InsertAll["Sort by line_num (reverse)\nInsert diagrams bottom-up"]
    InsertAll --> Save["Write enhanced file\nto same path in temp_dir"]

    style TryChunks fill:#e8f5e9
    style Found fill:#fff4e1
    style FindInsert fill:#f3e5f5
```

Progressive Chunk Sizes: The algorithm tries matching increasingly smaller chunks (300 → 200 → 150 → 100 → 80 characters) until it finds a match. This handles variations in text formatting between the JavaScript payload and the html2text output.

Scoring: Each match is scored based on the chunk size used; larger chunks indicate more confident matches.

Bottom-Up Insertion: Diagrams are inserted from the bottom of the file upward to preserve line numbers during insertion.
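A sketch of the progressive chunk search; the real function also maps the matched character position back to a line number and chooses an insertion point as shown in the diagram above:

```python
# Sketch: try progressively smaller tail chunks of the anchor text.
def find_anchor_position(content, anchor_text):
    content_normalized = " ".join(content.lower().split())
    anchor_normalized = " ".join(anchor_text.lower().split())
    for chunk_size in (300, 200, 150, 100, 80):
        test_chunk = anchor_normalized[-chunk_size:]
        pos = content_normalized.find(test_chunk)
        if pos != -1:
            return pos, chunk_size  # larger chunk size = higher confidence
    return -1, 0                    # caller falls back to heading matching
```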
Sources: tools/deepwiki-scraper.py:676-788
Helper Functions
Filename Sanitization
sanitize_filename(text) tools/deepwiki-scraper.py:21-25

- Removes non-alphanumeric characters except hyphens and spaces
- Collapses multiple hyphens/spaces into single hyphens
- Converts to lowercase
- Example: "Query Planning & Optimization" → "query-planning-optimization"
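A minimal sketch consistent with this description (the exact regular expressions are assumptions):

```python
import re

# Minimal sketch matching the described behavior.
def sanitize_filename(text):
    text = re.sub(r"[^\w\s-]", "", text)   # keep word chars, spaces, hyphens
    text = re.sub(r"[\s_-]+", "-", text)   # collapse runs into single hyphens
    return text.strip("-").lower()

sanitize_filename("Query Planning & Optimization")  # -> 'query-planning-optimization'
```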
"Query Planning & Optimization"→"query-planning-optimization"
Footer Cleaning
clean_deepwiki_footer(markdown) tools/deepwiki-scraper.py:127-173

- Removes DeepWiki UI elements from the Markdown using regex patterns
- Patterns include: "Dismiss", "Refresh this wiki", "Edit Wiki", "On this page"
- Scans the last 50 lines backwards to find the footer start
- Removes all content from the footer start to the end of the file
- Also removes trailing empty lines
Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:127-173
File Organization and Output
The script organizes output files based on the hierarchical page structure:
File Naming Convention: {number}-{title-slug}.md

- Number with dots replaced by hyphens (e.g., 2.1 → 2-1)
- Title sanitized to a safe filename format
- Examples: 1-overview.md, 2-1-workspace.md, 4-3-2-optimizer.md
Sources: tools/deepwiki-scraper.py:842-877 tools/deepwiki-scraper.py:897-908
Error Handling and Resilience
The script implements multiple layers of error handling:
Retry Logic
HTTP Request Retries: tools/deepwiki-scraper.py:33-42
- Each HTTP request attempts up to 3 times
- 2-second delay between attempts
- Only raises exception on final failure
Graceful Degradation
| Scenario | Behavior |
|---|---|
| No pages found | Exit with error message and status code 1 |
| Page extraction fails | Print error, continue with remaining pages |
| Diagram extraction fails | Print warning, continue without diagrams |
| Content selector not found | Fall back to the `<body>` tag as a last resort |
Temporary Directory Cleanup
The script uses Python's tempfile.TemporaryDirectory() context manager, which automatically deletes the temporary directory even if the script crashes or is interrupted. This prevents accumulation of partial work files.
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:808-916
Performance Characteristics
Rate Limiting
The script includes a 1-second sleep between page fetches to be respectful to the DeepWiki server:
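A sketch of the pacing, placed inside the Phase 1 page loop in the actual script:

```python
import time

# Inside the per-page loop: pause one second between page fetches.
time.sleep(1)
```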
Sources: tools/deepwiki-scraper.py:872
Memory Efficiency
- Uses streaming HTTP responses where possible
- Processes one page at a time rather than loading all pages into memory
- Temporary directory is cleared automatically after completion
Typical Execution Times
For a repository with approximately 20 pages and 50 diagrams:
- Phase 1 (Extraction): ~30-40 seconds (with 1-second delays between requests)
- Phase 2 (Enhancement): ~5-10 seconds (local processing)
- Phase 3 (Move): <1 second (file operations)
Total: Approximately 40-50 seconds for a medium-sized wiki.
Sources: tools/deepwiki-scraper.py:790-919
Data Flow Summary
Sources: tools/deepwiki-scraper.py:1-919