This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
deepwiki-scraper.py
Purpose and Scope
The deepwiki-scraper.py script is the primary data extraction and transformation component that converts DeepWiki wiki content into enhanced markdown files. It orchestrates a three-phase pipeline: (1) extracting clean markdown from DeepWiki HTML, (2) enhancing files with normalized Mermaid diagrams using fuzzy matching, and (3) moving completed files to the output directory.
This page documents the script’s architecture, execution model, and key algorithms. For information about how this script is invoked by the build system, see build-docs.sh Orchestrator. For detailed explanations of the extraction and enhancement phases, see Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement.
Sources: python/deepwiki-scraper.py:1-11
Command-Line Interface
The script requires two positional arguments:
- **Repository identifier**: format `owner/repo` (e.g., `jzombie/deepwiki-to-mdbook`)
- **Output directory**: destination path for generated markdown files

The repository identifier is validated using the regex pattern `^[\w-]+/[\w-]+$` at python/deepwiki-scraper.py:1287-1289. The script exits with status code 1 if validation fails or if the wiki structure cannot be extracted.
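A minimal sketch of that validation, assuming plain `sys.argv` handling (the usage and error messages are illustrative):

```python
import re
import sys

# Sketch of the CLI contract described above; exact wording is illustrative.
if len(sys.argv) != 3:
    print("Usage: deepwiki-scraper.py <owner/repo> <output-dir>")
    sys.exit(1)

repo, output_dir = sys.argv[1], sys.argv[2]
if not re.match(r'^[\w-]+/[\w-]+$', repo):
    print(f"Error: invalid repository identifier: {repo}")
    sys.exit(1)
```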
Sources: python/deepwiki-scraper.py:1-10 python/deepwiki-scraper.py:1277-1289
Three-Phase Execution Model
Figure 1: Three-Phase Execution Pipeline
The `main()` function at python/deepwiki-scraper.py:1277-1410 implements a three-phase workflow:
| Phase | Function | Primary Responsibility | Output Location |
|---|---|---|---|
| 1 | `extract_wiki_structure` + `extract_page_content` | Scrape HTML and convert to markdown | Temporary directory |
| 2 | `extract_and_enhance_diagrams` | Match and inject Mermaid diagrams | In-place modification of temp directory |
| 3 | File system operations | Move validated files to output | Final output directory |
A temporary directory is created at python/deepwiki-scraper.py:1295-1296 using Python’s tempfile.TemporaryDirectory context manager. This ensures automatic cleanup even if the script fails. A raw markdown snapshot is saved to raw_markdown/ at python/deepwiki-scraper.py:1358-1366 before diagram enhancement for debugging purposes.
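A minimal sketch of that lifecycle (phase bodies elided):

```python
import tempfile
from pathlib import Path

# Sketch: the TemporaryDirectory context guarantees cleanup on success
# or failure; the phase details are elided.
with tempfile.TemporaryDirectory() as temp_dir:
    temp_path = Path(temp_dir)
    # Phase 1: extract markdown into temp_path
    # Phase 2: enhance diagrams in place (after snapshotting raw_markdown/)
    # Phase 3: move validated files to the final output directory
# temp_path and its contents no longer exist here.
```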
Sources: python/deepwiki-scraper.py:1277-1410 python/deepwiki-scraper.py:1298-1371
Wiki Structure Discovery
Figure 2: Structure Discovery Algorithm Using extract_wiki_structure
The extract_wiki_structure function at python/deepwiki-scraper.py:116-163 discovers all wiki pages by parsing the main wiki index page. It uses a compiled regex pattern to find all links matching `^/{repo_pattern}/\d+` at python/deepwiki-scraper.py:128-129.
The page numbering scheme distinguishes main pages from subsections using dot notation:
- **Level 0**: main pages (e.g., `1`, `2`, `3`), with no dots
- **Level 1**: subsections (e.g., `2.1`, `2.2`), with one dot
- **Level N**: deeper subsections (e.g., `2.1.3`), with N dots
The level is calculated at python/deepwiki-scraper.py:145 as `page_num.count('.')`. Pages are sorted using a custom key function at python/deepwiki-scraper.py:157-159 that splits the page number by dots and converts each component to an integer, ensuring proper numerical ordering (e.g., 2.10 comes after 2.9).
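A minimal illustration of that sort key:

```python
# Numeric sort of dotted page numbers, so "2.10" sorts after "2.9"
# (a plain string sort would put "2.10" before "2.9").
page_numbers = ["1", "2.10", "2.9", "2", "3"]
ordered = sorted(page_numbers, key=lambda n: [int(part) for part in n.split('.')])
print(ordered)  # ['1', '2', '2.9', '2.10', '3']
```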
Each page dictionary contains:
- `number`: page number string (e.g., `"2.1"`)
- `title`: extracted link text
- `url`: full URL to the page
- `href`: relative path (used for link rewriting)
- `level`: nesting depth based on dot count
Sources: python/deepwiki-scraper.py:116-163 python/deepwiki-scraper.py:145 python/deepwiki-scraper.py:157-161
Path Resolution and Numbering Normalization
Figure 3: Path Resolution Using normalized_number_parts and resolve_output_path
The path resolution system normalizes DeepWiki’s numbering scheme to match mdBook’s conventions. The normalized_number_parts function at python/deepwiki-scraper.py:28-43 shifts page numbers down by one so that DeepWiki’s page 1 becomes unnumbered (the index page), and subsequent pages start at 1.
| DeepWiki Number | `normalized_number_parts` Output | Final Filename |
|---|---|---|
| `"1"` | `[]` (empty list) | `overview.md` (unnumbered) |
| `"2"` | `["1"]` | `1-introduction.md` |
| `"3.1"` | `["2", "1"]` | `2-1-subsection.md` |
| `"3.2"` | `["2", "2"]` | `2-2-another.md` |
The resolve_output_path function at python/deepwiki-scraper.py:45-53 combines normalized numbers with sanitized titles. Subsections (with `len(parts) > 1`) are placed in directories named `section-{main_number}` at python/deepwiki-scraper.py:52. The sanitize_filename function at python/deepwiki-scraper.py:22-26 strips special characters and normalizes whitespace using the regex patterns `r'[^\w\s-]'` and `r'[-\s]+'`.
The build_target_path function at python/deepwiki-scraper.py:55-63 constructs full relative paths for link rewriting, used by the link fixing logic at python/deepwiki-scraper.py:854-875.
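A minimal sketch of the two helpers described above, reconstructed from the stated regex patterns and the normalization table (edge-case behavior is an assumption):

```python
import re

def sanitize_filename(text: str) -> str:
    # Strip special characters, then collapse whitespace/hyphen runs.
    text = re.sub(r'[^\w\s-]', '', text).strip()
    return re.sub(r'[-\s]+', '-', text).lower()

def normalized_number_parts(page_number: str) -> list[str]:
    # Shift numbering down by one: DeepWiki page "1" becomes the
    # unnumbered index, "3.1" becomes ["2", "1"].
    parts = page_number.split('.')
    main = int(parts[0]) - 1
    if main == 0 and len(parts) == 1:
        return []
    return [str(main), *parts[1:]]

print(normalized_number_parts("1"))    # []
print(normalized_number_parts("3.1"))  # ['2', '1']
```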
Sources: python/deepwiki-scraper.py:28-63 python/deepwiki-scraper.py:22-26 python/deepwiki-scraper.py:854-875
Content Extraction and HTML-to-Markdown Conversion
Figure 4: Content Extraction Pipeline Using extract_page_content
The extract_page_content function at python/deepwiki-scraper.py:751-877 implements a multi-stage HTML cleaning and conversion pipeline. BeautifulSoup selectors at python/deepwiki-scraper.py:761-762 remove navigation elements before content extraction.
The content finder at python/deepwiki-scraper.py:765-779 tries a prioritized list of selectors: `article`, `main`, `.wiki-content`, `.content`, `#content`, `.markdown-body`, and finally falls back to `body`. DeepWiki-specific UI elements are removed at python/deepwiki-scraper.py:786-795 by searching for text patterns like “Index your code with Devin” and “Edit Wiki”.
Navigation list removal at python/deepwiki-scraper.py:799-806 detects and removes <ul> elements containing more than 5 links where 80%+ are internal wiki links.
The convert_html_to_markdown function at python/deepwiki-scraper.py:213-228 uses the html2text library with configuration:
- `ignore_links = False`: preserve all links
- `body_width = 0`: disable line wrapping to prevent formatting issues
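A minimal sketch of that conversion using the html2text package, with the two configuration values from above:

```python
import html2text

def convert_html_to_markdown(html_content: str) -> str:
    # Sketch: html2text configured as described above.
    h = html2text.HTML2Text()
    h.ignore_links = False  # preserve all links
    h.body_width = 0        # disable line wrapping
    return h.handle(html_content)
```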
A comment at python/deepwiki-scraper.py:221-223 explicitly documents that Mermaid diagram processing is disabled during HTML conversion because diagrams from ALL pages are mixed together in the JavaScript payload.
The clean_deepwiki_footer function at python/deepwiki-scraper.py:165-211 removes DeepWiki UI elements using compiled regex patterns for text like “Dismiss”, “Refresh this wiki”, and “On this page”. It scans backwards from the end of the file up to 50 lines to find footer markers at python/deepwiki-scraper.py:187-191.
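A minimal sketch of that backwards scan, assuming footer markers drawn from the examples above (the real script compiles a longer list):

```python
import re

# Illustrative footer markers only.
FOOTER_PATTERNS = [
    re.compile(r'^dismiss$', re.IGNORECASE),
    re.compile(r'^refresh this wiki', re.IGNORECASE),
    re.compile(r'^on this page', re.IGNORECASE),
]

def clean_deepwiki_footer(markdown: str) -> str:
    # Scan backwards over (at most) the last 50 lines; truncate at the
    # earliest line that matches a footer marker.
    lines = markdown.split('\n')
    cut = len(lines)
    for i in range(len(lines) - 1, max(-1, len(lines) - 51), -1):
        if any(p.match(lines[i].strip()) for p in FOOTER_PATTERNS):
            cut = i
    return '\n'.join(lines[:cut]).rstrip() + '\n'
```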
Link rewriting at python/deepwiki-scraper.py:854-875 converts DeepWiki URLs to relative markdown paths, handling both same-section and cross-section references by calculating relative paths based on the source file’s section directory.
Sources: python/deepwiki-scraper.py:751-877 python/deepwiki-scraper.py:213-228 python/deepwiki-scraper.py:165-211 python/deepwiki-scraper.py:395-406
Diagram Extraction from JavaScript Payload
Figure 5: Diagram Extraction Using extract_and_enhance_diagrams
The extract_and_enhance_diagrams function at python/deepwiki-scraper.py:880-1275 extracts all Mermaid diagrams from DeepWiki's Next.js JavaScript payload. The regex pattern at python/deepwiki-scraper.py:899 matches fenced code blocks with various newline formats: `\\r\\n`, `\\n`, or actual newline characters.
Context extraction at python/deepwiki-scraper.py:903-1087 captures up to 2000 characters before each diagram to enable fuzzy matching. For each diagram, the context is parsed to extract:
- **Last heading**: the most recent line starting with `#`, searched backwards from the diagram position
- **Anchor text**: the last 2-3 non-heading lines exceeding 20 characters in length, concatenated and truncated to 300 characters
The context extraction logic at python/deepwiki-scraper.py:1066-1081 searches backwards through context lines to find the last heading, then collects up to 3 substantial non-heading lines as anchor text.
The unescaping phase at python/deepwiki-scraper.py:1039-1046 handles JavaScript string escapes (see the sketch after the table):
| Escaped Sequence | Unescaped Result |
|---|---|
| `\\n` | Newline character |
| `\\t` | Tab character |
| `\\"` | Double quote |
| `\\\\` | Single backslash |
| `\\u003c` | `<` character |
| `\\u003e` | `>` character |
| `\\u0026` | `&` character |
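A minimal sketch of these replacements (a hypothetical helper; the real script's replacement order may differ, and ordering matters for overlapping escapes):

```python
def unescape_js_payload(text: str) -> str:
    # Naive sequential substitution of the escapes from the table above.
    for escaped, result in (
        ('\\u003c', '<'),
        ('\\u003e', '>'),
        ('\\u0026', '&'),
        ('\\n', '\n'),
        ('\\t', '\t'),
        ('\\"', '"'),
        ('\\\\', '\\'),
    ):
        text = text.replace(escaped, result)
    return text
```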
The merge_multiline_labels function at python/deepwiki-scraper.py:907-1009 collapses wrapped Mermaid labels into literal `\n` sequences. This is crucial because DeepWiki sometimes wraps long labels across multiple lines in the HTML, but Mermaid 11 expects these to be explicitly marked with `\n` tokens.
Sources: python/deepwiki-scraper.py:880-1087 python/deepwiki-scraper.py:899 python/deepwiki-scraper.py:1039-1046 python/deepwiki-scraper.py:907-1009
Seven-Step Mermaid Normalization Pipeline
Figure 6: Seven-Step Normalization Pipeline Using normalize_mermaid_diagram
The normalize_mermaid_diagram function at python/deepwiki-scraper.py:385-393 applies seven normalization passes to ensure Mermaid 11 compatibility:
Step 1: normalize_mermaid_edge_labels
Function at python/deepwiki-scraper.py:230-251. Applies only to graphs and flowcharts (detected by checking whether the first line starts with `graph` or `flowchart`). Uses the regex `r'\|([^|]*)\|'` to find edge labels and flattens any containing `\n`, `\\n`, `(`, or `)` (see the sketch after this list) by:
- replacing `\\n` and `\n` with spaces
- removing parentheses
- collapsing whitespace with `re.sub(r'\s+', ' ', cleaned).strip()`
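A minimal sketch, assuming the detection and cleanup rules above:

```python
import re

def normalize_edge_labels(diagram_text: str) -> str:
    # Only graphs and flowcharts have |...| edge labels to flatten.
    first_line = diagram_text.lstrip().split('\n', 1)[0]
    if not first_line.startswith(('graph', 'flowchart')):
        return diagram_text

    def flatten(match: re.Match) -> str:
        label = match.group(1)
        label = label.replace('\\n', ' ').replace('\n', ' ')
        label = label.replace('(', '').replace(')', '')
        label = re.sub(r'\s+', ' ', label).strip()
        return f'|{label}|'

    return re.sub(r'\|([^|]*)\|', flatten, diagram_text)
```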
Step 2: normalize_mermaid_state_descriptions
Function at python/deepwiki-scraper.py:253-277. Applies only to state diagrams. Ensures state descriptions use the syntax `State : Description` (see the sketch after this list) by:
- skipping lines that contain `::` (already valid)
- splitting on the first single `:` and cleaning the suffix
- replacing any remaining colons in the description with `-`
- rebuilding the line as `{prefix.rstrip()} : {cleaned_suffix}`
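A minimal per-line sketch of that transformation:

```python
def normalize_state_line(line: str) -> str:
    # Skip already-valid `::` lines and lines without a colon.
    if '::' in line or ':' not in line:
        return line
    prefix, suffix = line.split(':', 1)
    cleaned_suffix = suffix.strip().replace(':', '-')
    return f'{prefix.rstrip()} : {cleaned_suffix}'

print(normalize_state_line('Idle: waiting: for input'))
# -> 'Idle : waiting- for input'
```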
Step 3: normalize_flowchart_nodes
Function at python/deepwiki-scraper.py:279-301. Applies to graphs and flowcharts. Uses the regex `r'\["([^"]*)"\]'` to find node labels and:
- Replaces pipe characters with forward slashes
- Collapses whitespace
- Inserts newlines between consecutive statements using regex at python/deepwiki-scraper.py:298-299
Step 4: normalize_statement_separators
Function at python/deepwiki-scraper.py:313-328. Applies to graphs and flowcharts. The `STATEMENT_BREAK_PATTERN` at python/deepwiki-scraper.py:309-311 detects consecutive statements on one line and inserts newlines between them while preserving indentation.
Step 5: normalize_empty_node_labels
Function at python/deepwiki-scraper.py:330-341. Uses the regex `r'(\b[A-Za-z0-9_]+)\[""\]'` to find nodes with empty labels and generates a fallback label from the node ID by replacing underscores and hyphens with spaces (sketched below).
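A minimal sketch using the regex above:

```python
import re

def normalize_empty_node_labels(diagram_text: str) -> str:
    # Give nodes like `My_Node[""]` a label derived from the node ID.
    def fallback(match: re.Match) -> str:
        node_id = match.group(1)
        label = re.sub(r'[_-]+', ' ', node_id).strip()
        return f'{node_id}["{label}"]'

    return re.sub(r'(\b[A-Za-z0-9_]+)\[""\]', fallback, diagram_text)

print(normalize_empty_node_labels('My_Node[""] --> B'))
# -> 'My_Node["My Node"] --> B'
```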
Step 6: normalize_gantt_diagram
Function at python/deepwiki-scraper.py:343-383. Applies only to Gantt diagrams. Detects task lines missing IDs using the pattern `r'^(\s*"[^"]+"\s*):\s*(.+)$'` and inserts synthetic IDs (`task1`, `task2`, etc.) when the first token after the colon is neither an ID nor an `after` reference (sketched below).
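A minimal sketch; the date/duration heuristic is an assumption about how "not an ID" is detected:

```python
import re

TASK_LINE = re.compile(r'^(\s*"[^"]+"\s*):\s*(.+)$')
# Assumed heuristic: a date or duration in first position means no ID.
DATE_OR_DURATION = re.compile(r'^(\d{4}-\d{2}-\d{2}|\d+[dwh])$')

def add_gantt_task_ids(diagram_text: str) -> str:
    # Insert synthetic IDs (task1, task2, ...) into Gantt task lines
    # whose first field is neither an ID nor an `after` reference.
    out, counter = [], 0
    for line in diagram_text.split('\n'):
        match = TASK_LINE.match(line)
        if match:
            first_field = match.group(2).split(',')[0].strip()
            if DATE_OR_DURATION.match(first_field):
                counter += 1
                line = f'{match.group(1)}: task{counter}, {match.group(2)}'
        out.append(line)
    return '\n'.join(out)
```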
Sources: python/deepwiki-scraper.py:385-393 python/deepwiki-scraper.py:230-251 python/deepwiki-scraper.py:253-277 python/deepwiki-scraper.py:279-301 python/deepwiki-scraper.py:313-328 python/deepwiki-scraper.py:330-341 python/deepwiki-scraper.py:343-383
Fuzzy Matching and Diagram Injection
Figure 7: Fuzzy Matching Algorithm for Diagram Injection
The fuzzy matching algorithm at python/deepwiki-scraper.py:1150-1275 pairs each diagram with its correct markdown file by matching context against file contents. The algorithm uses a progressive chunk size strategy at python/deepwiki-scraper.py:1188 to find matches:
| Chunk Size | Use Case |
|---|---|
| 300 chars | Highest precision - exact context match |
| 200 chars | Medium precision - paragraph-level match |
| 150 chars | Lower precision - sentence-level match |
| 100 chars | Low precision - phrase-level match |
| 80 chars | Minimum threshold - short phrase match |
The matching loop at python/deepwiki-scraper.py:1170-1238 attempts anchor text matching first. The anchor text (the last 2-3 lines of context before the diagram) is normalized to lowercase with whitespace collapsed at python/deepwiki-scraper.py:1185-1186. For each chunk size, the algorithm searches for the test chunk, taken from the end of the anchor text (`anchor_normalized[-chunk_size:]`), in the normalized file content.
If anchor matching fails (score < 80), the algorithm falls back to heading matching at python/deepwiki-scraper.py:1204-1216. This compares the `last_heading` from the diagram context against all headings in the file after normalizing both by removing `#` symbols and collapsing whitespace.
Only matches with `best_match_score >= 80` are accepted at python/deepwiki-scraper.py:1218. This threshold balances precision (avoiding false matches) with recall (ensuring most diagrams are placed).
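A minimal sketch of the progressive chunk strategy (helper names are hypothetical):

```python
import re

CHUNK_SIZES = (300, 200, 150, 100, 80)

def _normalize(text: str) -> str:
    return re.sub(r'\s+', ' ', text).strip().lower()

def anchor_match_score(anchor_text: str, file_content: str) -> int:
    # Try progressively smaller tails of the anchor text; the score is
    # the size of the largest chunk found in the file. A result below 80
    # means no chunk matched, so the caller falls back to heading matching.
    anchor = _normalize(anchor_text)
    content = _normalize(file_content)
    for size in CHUNK_SIZES:
        chunk = anchor[-size:]
        if chunk and chunk in content:
            return size
    return 0
```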
Insertion Point Logic
The insertion point finder at python/deepwiki-scraper.py:1220-1236 behaves differently based on match type:
After a heading match:
- Skip blank lines after heading
- Skip through the following paragraph
- Insert after the paragraph ends (blank line or next heading)
After a paragraph match:
- Find end of current paragraph
- Insert when encountering blank line or heading
Content Guards
The enforce_content_start function at python/deepwiki-scraper.py:1138-1147 and advance_past_lists function at python/deepwiki-scraper.py:1125-1137 implement content guards to prevent diagram insertion in protected areas:
Protected prefix (detected by protected_prefix_end at python/deepwiki-scraper.py:1101-1115):
- the title line (the first line starting with `#`)
- the "Relevant source files" section and its list items
- blank lines within these sections
List blocks (detected by is_list_line at python/deepwiki-scraper.py:1117-1123):
- lines starting with `-`, `*`, or `+`
- lines matching `\d+[.)]\s` (numbered lists)
Diagrams are never inserted inside list blocks. If the insertion point lands in a list, advance_past_lists moves the insertion point to after the list ends.
Dynamic Fence Length
The insertion logic at python/deepwiki-scraper.py:1249-1266 calculates a dynamic fence length to handle diagrams containing backticks. It scans the diagram text for the longest run of consecutive backticks and sets `fence_len = max(3, max_backticks + 1)`. This ensures the opening and closing fences always contain more backticks than any run inside the diagram, so they properly delimit the content.
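A minimal sketch of the fence calculation:

```python
import re

def build_fence(diagram_text: str) -> str:
    # The fence must be longer than any backtick run inside the diagram.
    runs = [len(m.group(0)) for m in re.finditer(r'`+', diagram_text)]
    fence_len = max(3, (max(runs) if runs else 0) + 1)
    return '`' * fence_len

fence = build_fence('a ``` b')
print(f'{fence}mermaid')  # four-backtick fence: ````mermaid
```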
Sources: python/deepwiki-scraper.py:1150-1275 python/deepwiki-scraper.py:1170-1238 python/deepwiki-scraper.py:1101-1147 python/deepwiki-scraper.py:1249-1266
Error Handling and Retry Logic
Figure 8: Retry Logic in fetch_page Function
The fetch_page function at python/deepwiki-scraper.py:65-80 implements a 3-attempt retry strategy with a fixed 2-second delay between attempts. The retry loop at python/deepwiki-scraper.py:71-80 catches all exceptions with a broad `except Exception as e` clause and sleeps via `time.sleep(2)` before retrying.
Browser-like headers are set at python/deepwiki-scraper.py:67-69 to avoid bot detection:
`User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36`
The timeout is set to 30 seconds at python/deepwiki-scraper.py:73. After a successful fetch, `response.raise_for_status()` validates the HTTP status code.
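A minimal sketch of fetch_page, assuming the headers described above are already set on the session:

```python
import time
import requests

def fetch_page(url: str, session: requests.Session, retries: int = 3) -> requests.Response:
    # Sketch: three attempts, fixed 2-second delay, 30-second timeout.
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except Exception as e:
            if attempt == retries - 1:
                raise  # exhausted all attempts; propagate the last error
            print(f"  Retry {attempt + 1}/{retries} after error: {e}")
            time.sleep(2)
```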
The main extraction loop at python/deepwiki-scraper.py:1328-1353 catches exceptions per-page and continues processing remaining pages:
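A minimal sketch of that per-page guard (`save_markdown` is a hypothetical helper):

```python
def extract_all_pages(pages, session, temp_dir):
    # One failed page is logged and skipped; the loop keeps going.
    success_count = 0
    for page in pages:
        try:
            markdown = extract_page_content(page['url'], session, page)
            save_markdown(temp_dir, page, markdown)  # hypothetical helper
            success_count += 1
        except Exception as e:
            print(f"  Failed to extract page {page['number']}: {e}")
    return success_count
```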
This ensures that a single page failure doesn't abort the entire scraping process. The success count is reported at python/deepwiki-scraper.py:1355.
The top-level try-except block at python/deepwiki-scraper.py:1310-1407 catches any unhandled exceptions and exits with status code 1, signaling failure to the calling build script.
Sources: python/deepwiki-scraper.py:65-80 python/deepwiki-scraper.py:1328-1353 python/deepwiki-scraper.py:1310-1407
Session Management and Rate Limiting
The script uses a requests.Session object created at python/deepwiki-scraper.py:1305-1308 with persistent headers:
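A minimal sketch, with the header value as described above:

```python
import requests

# One shared session: connection pooling, persistent cookies and headers.
session = requests.Session()
session.headers.update({
    'User-Agent': (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/120.0.0.0 Safari/537.36'
    ),
})
```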
Session reuse provides connection pooling and persistent cookies across requests. The session is passed to all HTTP functions: extract_wiki_structure, extract_page_content, and extract_and_enhance_diagrams.
Rate limiting is implemented at python/deepwiki-scraper.py:1350 with a 1-second sleep between page extractions:
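```python
import time

time.sleep(1)  # Be nice to the server
```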
This prevents overwhelming the DeepWiki server and reduces the risk of rate limiting or IP blocking. The comment at python/deepwiki-scraper.py:1349 explicitly states “Be nice to the server”.
Sources: python/deepwiki-scraper.py:1305-1308 python/deepwiki-scraper.py:1349-1350
Key Function Reference
| Function | Lines | Purpose |
|---|---|---|
| `main()` | 1277-1410 | Entry point; orchestrates the three-phase pipeline |
| `extract_wiki_structure(repo, session)` | 116-163 | Discover all wiki pages from the index |
| `extract_page_content(url, session, page_info)` | 751-877 | Extract and clean single page content |
| `extract_and_enhance_diagrams(repo, temp_dir, session, url)` | 880-1275 | Extract diagrams and inject into files |
| `convert_html_to_markdown(html_content)` | 213-228 | Convert HTML to markdown using html2text |
| `clean_deepwiki_footer(markdown)` | 165-211 | Remove DeepWiki UI elements from the footer |
| `normalize_mermaid_diagram(diagram_text)` | 385-393 | Apply seven-step normalization pipeline |
| `normalize_mermaid_edge_labels(diagram_text)` | 230-251 | Flatten multiline edge labels |
| `normalize_mermaid_state_descriptions(diagram_text)` | 253-277 | Fix state diagram syntax |
| `normalize_flowchart_nodes(diagram_text)` | 279-301 | Clean flowchart node labels |
| `normalize_statement_separators(diagram_text)` | 313-328 | Insert newlines between statements |
| `normalize_empty_node_labels(diagram_text)` | 330-341 | Provide fallback labels |
| `normalize_gantt_diagram(diagram_text)` | 343-383 | Add synthetic task IDs |
| `merge_multiline_labels(diagram_text)` | 907-1009 | Collapse wrapped labels |
| `strip_wrapping_quotes(diagram_text)` | 1011-1022 | Remove extra quotes |
| `fetch_page(url, session)` | 65-80 | HTTP fetch with retry logic |
| `sanitize_filename(text)` | 22-26 | Convert text to a safe filename |
| `normalized_number_parts(page_number)` | 28-43 | Shift DeepWiki numbering down by 1 |
| `resolve_output_path(page_number, title)` | 45-53 | Determine filename and section directory |
| `build_target_path(page_number, slug)` | 55-63 | Build relative path for links |
| `format_source_references(markdown)` | 397-406 | Insert colons in source links |
Sources: python/deepwiki-scraper.py:1-1411