GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

deepwiki-scraper.py

Purpose and Scope

The deepwiki-scraper.py script is the primary data extraction and transformation component that converts DeepWiki wiki content into enhanced markdown files. It orchestrates a three-phase pipeline: (1) extracting clean markdown from DeepWiki HTML, (2) enhancing files with normalized Mermaid diagrams using fuzzy matching, and (3) moving completed files to the output directory.

This page documents the script’s architecture, execution model, and key algorithms. For information about how this script is invoked by the build system, see build-docs.sh Orchestrator. For detailed explanations of the extraction and enhancement phases, see Phase 1: Markdown Extraction and Phase 2: Diagram Enhancement.

Sources: python/deepwiki-scraper.py:1-11


Command-Line Interface

The script requires two positional arguments:

  • Repository identifier : Format owner/repo (e.g., jzombie/deepwiki-to-mdbook)
  • Output directory : Destination path for generated markdown files

The repository identifier is validated using the regex pattern ^[\w-]+/[\w-]+$ at python/deepwiki-scraper.py:1287-1289. The script exits with status code 1 if validation fails or if the wiki structure cannot be extracted.
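A minimal sketch of this argument handling is shown below; only the validation regex and the exit code come from the script, while the function name and error messages are illustrative assumptions:

```python
import re
import sys

def parse_args(argv):
    """Validate the two positional arguments (illustrative; messages are not the script's)."""
    if len(argv) != 3:
        print("Usage: deepwiki-scraper.py <owner/repo> <output-dir>", file=sys.stderr)
        sys.exit(1)
    repo, output_dir = argv[1], argv[2]
    # Same validation pattern as cited at python/deepwiki-scraper.py:1287-1289
    if not re.match(r'^[\w-]+/[\w-]+$', repo):
        print(f"Error: repository must be in owner/repo format, got {repo!r}", file=sys.stderr)
        sys.exit(1)
    return repo, output_dir

repo, output_dir = parse_args(sys.argv)
```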

Sources: python/deepwiki-scraper.py:1-10 python/deepwiki-scraper.py:1277-1289


Three-Phase Execution Model

Figure 1: Three-Phase Execution Pipeline

The main() function at python/deepwiki-scraper.py:1277-1410 implements a three-phase workflow:

| Phase | Function | Primary Responsibility | Output Location |
|-------|----------|------------------------|-----------------|
| 1 | extract_wiki_structure + extract_page_content | Scrape HTML and convert to markdown | Temporary directory |
| 2 | extract_and_enhance_diagrams | Match and inject Mermaid diagrams | In-place modification of temp directory |
| 3 | File system operations | Move validated files to output | Final output directory |

A temporary directory is created at python/deepwiki-scraper.py:1295-1296 using Python’s tempfile.TemporaryDirectory context manager. This ensures automatic cleanup even if the script fails. A raw markdown snapshot is saved to raw_markdown/ at python/deepwiki-scraper.py:1358-1366 before diagram enhancement for debugging purposes.
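A simplified skeleton of this workflow, reusing the function names and signatures documented in the reference table at the end of this page. It assumes extract_page_content returns the page markdown; the DeepWiki base URL and the raw_markdown destination are assumptions, not taken from the script:

```python
import shutil
import tempfile
from pathlib import Path

def run_pipeline(repo, output_dir, session):
    """Simplified skeleton of main(); the real function adds logging and error handling."""
    base_url = f"https://deepwiki.com/{repo}"   # assumed base URL
    out_root = Path(output_dir)

    with tempfile.TemporaryDirectory() as temp_dir:   # cleaned up automatically, even on failure
        temp_path = Path(temp_dir)

        # Phase 1: scrape each page and write markdown into the temp directory
        pages = extract_wiki_structure(repo, session)
        for page in pages:
            markdown = extract_page_content(page['url'], session, page)
            out_file = temp_path / resolve_output_path(page['number'], page['title'])
            out_file.parent.mkdir(parents=True, exist_ok=True)
            out_file.write_text(markdown)

        # Debugging snapshot of the pre-enhancement markdown
        shutil.copytree(temp_path, out_root / 'raw_markdown', dirs_exist_ok=True)

        # Phase 2: inject Mermaid diagrams in place
        extract_and_enhance_diagrams(repo, temp_path, session, base_url)

        # Phase 3: move the finished files to the output directory
        for md_file in temp_path.rglob('*.md'):
            target = out_root / md_file.relative_to(temp_path)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(md_file), target)
```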

Sources: python/deepwiki-scraper.py:1277-1410 python/deepwiki-scraper.py:1298-1371


Wiki Structure Discovery

Figure 2: Structure Discovery Algorithm Using extract_wiki_structure

The extract_wiki_structure function at python/deepwiki-scraper.py:116-163 discovers all wiki pages by parsing the main wiki index page. It uses a compiled regex pattern to find all links matching ^/{repo_pattern}/\d+ at python/deepwiki-scraper.py:128-129.

The page numbering scheme distinguishes main pages from subsections using dot notation:

  • Level 0 : Main pages (e.g., 1, 2, 3) - pages with no dots
  • Level 1 : Subsections (e.g., 2.1, 2.2) - pages with one dot
  • Level N : Deeper subsections (e.g., 2.1.3) - pages with N dots

The level is calculated at python/deepwiki-scraper.py:145 as page_num.count('.'). Pages are sorted using a custom key function at python/deepwiki-scraper.py:157-159 that splits the page number by dots and converts each component to an integer, ensuring proper numerical ordering (e.g., 2.10 comes after 2.9).
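An equivalent sort key, shown here for illustration:

```python
def page_sort_key(page_num):
    """Compare numeric components so '2.10' sorts after '2.9' (plain string sorting would not)."""
    return [int(part) for part in page_num.split('.')]

pages = ["1", "2.10", "2.2", "2.9", "3"]
print(sorted(pages, key=page_sort_key))   # ['1', '2.2', '2.9', '2.10', '3']
print("2.1.3".count('.'))                 # 2 dots -> a level-2 subsection
```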

Each page dictionary contains:

  • number: Page number string (e.g., "2.1")
  • title: Extracted link text
  • url: Full URL to the page
  • href: Relative path (used for link rewriting)
  • level: Nesting depth based on dot count

Sources: python/deepwiki-scraper.py:116-163 python/deepwiki-scraper.py:145 python/deepwiki-scraper.py:157-161


Path Resolution and Numbering Normalization

Figure 3: Path Resolution Using normalized_number_parts and resolve_output_path

The path resolution system normalizes DeepWiki’s numbering scheme to match mdBook’s conventions. The normalized_number_parts function at python/deepwiki-scraper.py:28-43 shifts page numbers down by one so that DeepWiki’s page 1 becomes unnumbered (the index page), and subsequent pages start at 1.

| DeepWiki Number | normalized_number_parts Output | Final Filename |
|-----------------|--------------------------------|----------------|
| "1" | [] (empty list) | overview.md (unnumbered) |
| "2" | ["1"] | 1-introduction.md |
| "3.1" | ["2", "1"] | 2-1-subsection.md |
| "3.2" | ["2", "2"] | 2-2-another.md |

The resolve_output_path function at python/deepwiki-scraper.py:45-53 combines normalized numbers with sanitized titles. Subsections (with len(parts) > 1) are placed in directories named section-{main_number} at python/deepwiki-scraper.py:52. The sanitize_filename function at python/deepwiki-scraper.py:22-26 strips special characters and normalizes whitespace using regex patterns r'[^\w\s-]' and r'[-\s]+'.
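A behavioral sketch that reproduces the table above; the real functions may handle casing and edge cases differently:

```python
import re

def sanitize_filename(text):
    """Strip special characters and collapse whitespace/hyphens (same patterns as cited)."""
    text = re.sub(r'[^\w\s-]', '', text).strip()
    return re.sub(r'[-\s]+', '-', text).lower()

def normalized_number_parts(page_number):
    """Shift DeepWiki numbers down by one: '1' -> [], '2' -> ['1'], '3.1' -> ['2', '1']."""
    parts = page_number.split('.')
    main = int(parts[0]) - 1
    if main == 0 and len(parts) == 1:
        return []
    return [str(main)] + parts[1:]

def resolve_output_path(page_number, title):
    """Place subsections in section-<main> directories, e.g. section-2/2-1-subsection.md."""
    parts = normalized_number_parts(page_number)
    slug = sanitize_filename(title)
    if not parts:
        return f"{slug}.md"
    filename = f"{'-'.join(parts)}-{slug}.md"
    return f"section-{parts[0]}/{filename}" if len(parts) > 1 else filename

print(resolve_output_path("1", "Overview"))       # overview.md
print(resolve_output_path("3.1", "Subsection"))   # section-2/2-1-subsection.md
```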

The build_target_path function at python/deepwiki-scraper.py:55-63 constructs full relative paths for link rewriting, used by the link fixing logic at python/deepwiki-scraper.py:854-875.

Sources: python/deepwiki-scraper.py:28-63 python/deepwiki-scraper.py:22-26 python/deepwiki-scraper.py:854-875


Content Extraction and HTML-to-Markdown Conversion

Figure 4: Content Extraction Pipeline Using extract_page_content

The extract_page_content function at python/deepwiki-scraper.py:751-877 implements a multi-stage HTML cleaning and conversion pipeline. BeautifulSoup selectors at python/deepwiki-scraper.py:761-762 remove navigation elements before content extraction.

The content finder at python/deepwiki-scraper.py:765-779 tries a prioritized list of selectors: article, main, .wiki-content, .content, #content, .markdown-body, and finally falls back to body. DeepWiki-specific UI elements are removed at python/deepwiki-scraper.py:786-795 by searching for text patterns like “Index your code with Devin” and “Edit Wiki”.

Navigation list removal at python/deepwiki-scraper.py:799-806 detects and removes <ul> elements containing more than 5 links where 80%+ are internal wiki links.

The convert_html_to_markdown function at python/deepwiki-scraper.py:213-228 uses the html2text library with configuration:

  • ignore_links = False - preserve all links
  • body_width = 0 - disable line wrapping to prevent formatting issues
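A minimal configuration sketch of this conversion step, using the standard html2text API with the two settings listed above:

```python
import html2text

def convert_html_to_markdown(html_content):
    """Approximate configuration of the html2text converter described above."""
    h = html2text.HTML2Text()
    h.ignore_links = False   # keep all links so they can be rewritten later
    h.body_width = 0         # disable hard wrapping, which would break long lines
    return h.handle(html_content)
```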

A comment at python/deepwiki-scraper.py:221-223 explicitly documents that Mermaid diagram processing is disabled during HTML conversion because diagrams from ALL pages are mixed together in the JavaScript payload.

The clean_deepwiki_footer function at python/deepwiki-scraper.py:165-211 removes DeepWiki UI elements using compiled regex patterns for text like “Dismiss”, “Refresh this wiki”, and “On this page”. It scans backwards from the end of the file up to 50 lines to find footer markers at python/deepwiki-scraper.py:187-191.

Link rewriting at python/deepwiki-scraper.py:854-875 converts DeepWiki URLs to relative markdown paths, handling both same-section and cross-section references by calculating relative paths based on the source file’s section directory.

Sources: python/deepwiki-scraper.py:751-877 python/deepwiki-scraper.py:213-228 python/deepwiki-scraper.py:165-211 python/deepwiki-scraper.py:395-406


Diagram Extraction from JavaScript Payload

Figure 5: Diagram Extraction Using extract_and_enhance_diagrams

The extract_and_enhance_diagrams function at python/deepwiki-scraper.py:880-1275 extracts all Mermaid diagrams from DeepWiki’s Next.js JavaScript payload. The regex pattern at python/deepwiki-scraper.py:899 matches fenced code blocks with various newline formats: \\r\\n, \\n, or actual newline characters.

Context extraction at python/deepwiki-scraper.py:903-1087 captures up to 2000 characters before each diagram to enable fuzzy matching. For each diagram, the context is parsed to extract:

  1. Last heading : The most recent line starting with # (searched backwards from diagram position)
  2. Anchor text : The last 2-3 non-heading lines exceeding 20 characters in length, concatenated and truncated to 300 characters

The context extraction logic at python/deepwiki-scraper.py:1066-1081 searches backwards through context lines to find the last heading, then collects up to 3 substantial non-heading lines as anchor text.
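A simplified version of this backward scan; it illustrates the approach rather than reproducing the script's exact code:

```python
def extract_diagram_context(context_text, max_anchor_chars=300):
    """Find the last heading and up to 3 substantial non-heading lines before a diagram."""
    lines = [line.strip() for line in context_text.splitlines() if line.strip()]
    last_heading = None
    anchor_lines = []
    for line in reversed(lines):                     # walk backwards from the diagram position
        if line.startswith('#'):
            if last_heading is None:
                last_heading = line
            continue
        if len(line) > 20 and len(anchor_lines) < 3:  # "substantial" lines become anchor text
            anchor_lines.append(line)
    anchor_text = ' '.join(reversed(anchor_lines))[:max_anchor_chars]
    return last_heading, anchor_text
```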

The unescaping phase at python/deepwiki-scraper.py:1039-1046 handles JavaScript string escapes:

| Escaped Sequence | Unescaped Result |
|------------------|------------------|
| \\n | Newline character |
| \\t | Tab character |
| \\" | Double quote |
| \\\\ | Single backslash |
| \\u003c | < character |
| \\u003e | > character |
| \\u0026 | & character |
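A sketch of the unescaping step as a sequence of string replacements; the script's exact replacement order may differ:

```python
def unescape_js_string(text):
    """Apply the replacements from the table above (order matters for backslash handling)."""
    replacements = [
        ('\\u003c', '<'),
        ('\\u003e', '>'),
        ('\\u0026', '&'),
        ('\\n', '\n'),
        ('\\t', '\t'),
        ('\\"', '"'),
        ('\\\\', '\\'),
    ]
    for escaped, unescaped in replacements:
        text = text.replace(escaped, unescaped)
    return text

print(unescape_js_string('graph TD\\n  A[\\"Start\\"] --\\u003e B'))
# graph TD
#   A["Start"] --> B
```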

The merge_multiline_labels function at python/deepwiki-scraper.py:907-1009 collapses wrapped Mermaid labels into literal \n sequences. This is crucial because DeepWiki sometimes wraps long labels across multiple lines in the HTML, but Mermaid 11 expects these to be explicitly marked with \n tokens.

Sources: python/deepwiki-scraper.py:880-1087 python/deepwiki-scraper.py:899 python/deepwiki-scraper.py:1039-1046 python/deepwiki-scraper.py:907-1009


Seven-Step Mermaid Normalization Pipeline

Figure 6: Seven-Step Normalization Pipeline Using normalize_mermaid_diagram

The normalize_mermaid_diagram function at python/deepwiki-scraper.py:385-393 applies seven normalization passes to ensure Mermaid 11 compatibility:

Step 1: normalize_mermaid_edge_labels

Function at python/deepwiki-scraper.py:230-251. Applies only to graphs and flowcharts (detected by checking if the first line starts with graph or flowchart). Uses regex r'\|([^|]*)\|' to find edge labels and flattens any containing \n, \\n, (, or ) by:

  • Replacing \\n and \n with spaces
  • Removing parentheses
  • Collapsing whitespace with re.sub(r'\s+', ' ', cleaned).strip()

Step 2: normalize_mermaid_state_descriptions

Function at python/deepwiki-scraper.py:253-277. Applies only to state diagrams. Ensures state descriptions use the syntax State : Description by:

  • Skipping lines with :: (already valid)
  • Splitting on single : and cleaning suffix
  • Replacing colons in description with -
  • Rebuilding as {prefix.rstrip()} : {cleaned_suffix}

Step 3: normalize_flowchart_nodes

Function at python/deepwiki-scraper.py:279-301. Applies to graphs and flowcharts. Uses regex r'\["([^"]*)"\]' to find and clean quoted node labels.

Step 4: normalize_statement_separators

Function at python/deepwiki-scraper.py:313-328. Applies to graphs and flowcharts. The STATEMENT_BREAK_PATTERN at python/deepwiki-scraper.py:309-311 detects consecutive statements on one line and inserts newlines between them while preserving indentation.

Step 5: normalize_empty_node_labels

Function at python/deepwiki-scraper.py:330-341. Uses regex r'(\b[A-Za-z0-9_]+)\[""\]' to find nodes with empty labels. Generates a fallback label from the node ID by replacing underscores/hyphens with spaces.
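A sketch of this pass, using the regex and fallback rule described above (illustrative, not the script's exact code):

```python
import re

def normalize_empty_node_labels(diagram_text):
    """Replace build_docs[""] with build_docs["build docs"], deriving the label from the node ID."""
    def fallback(match):
        node_id = match.group(1)
        label = re.sub(r'[_-]+', ' ', node_id).strip()
        return f'{node_id}["{label}"]'
    return re.sub(r'(\b[A-Za-z0-9_]+)\[""\]', fallback, diagram_text)

print(normalize_empty_node_labels('graph TD\n  build_docs[""] --> deploy[""]'))
# graph TD
#   build_docs["build docs"] --> deploy["deploy"]
```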

Step 6: normalize_gantt_diagram

Function at python/deepwiki-scraper.py:343-383. Applies only to Gantt diagrams. Detects task lines missing IDs using pattern r'^(\s*"[^"]+"\s*):\s*(.+)$' and inserts synthetic IDs (task1, task2, etc.) when the first token after the colon is not an ID or an after reference.

Sources: python/deepwiki-scraper.py:385-393 python/deepwiki-scraper.py:230-251 python/deepwiki-scraper.py:253-277 python/deepwiki-scraper.py:279-301 python/deepwiki-scraper.py:313-328 python/deepwiki-scraper.py:330-341 python/deepwiki-scraper.py:343-383


Fuzzy Matching and Diagram Injection

Figure 7: Fuzzy Matching Algorithm for Diagram Injection

The fuzzy matching algorithm at python/deepwiki-scraper.py:1150-1275 pairs each diagram with its correct markdown file by matching context against file contents. The algorithm uses a progressive chunk size strategy at python/deepwiki-scraper.py:1188 to find matches:

| Chunk Size | Use Case |
|------------|----------|
| 300 chars | Highest precision - exact context match |
| 200 chars | Medium precision - paragraph-level match |
| 150 chars | Lower precision - sentence-level match |
| 100 chars | Low precision - phrase-level match |
| 80 chars | Minimum threshold - short phrase match |

The matching loop at python/deepwiki-scraper.py:1170-1238 attempts anchor text matching first. The anchor text (last 2-3 lines of context before the diagram) is normalized to lowercase with whitespace collapsed at python/deepwiki-scraper.py:1185-1186. For each chunk size, the algorithm searches for the test chunk at the end of the anchor text (anchor_normalized[-chunk_size:]) in the normalized file content.

If anchor matching fails (score < 80), the algorithm falls back to heading matching at python/deepwiki-scraper.py:1204-1216. This compares the last_heading from diagram context against all headings in the file after normalizing both by removing # symbols and collapsing whitespace.

Only matches with best_match_score >= 80 are accepted at python/deepwiki-scraper.py:1218. This threshold balances precision (avoiding false matches) with recall (ensuring most diagrams are placed).
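A simplified sketch of the progressive matching loop, assuming the match score equals the matched chunk size; the script's exact scoring and heading fallback are omitted:

```python
def find_best_match(anchor_text, file_contents, chunk_sizes=(300, 200, 150, 100, 80)):
    """Return (filename, score) for the best anchor-text match, or (None, 0) below the threshold."""
    anchor_normalized = ' '.join(anchor_text.lower().split())
    best_file, best_score = None, 0
    for filename, content in file_contents.items():
        content_normalized = ' '.join(content.lower().split())
        for chunk_size in chunk_sizes:               # largest (most precise) chunks first
            chunk = anchor_normalized[-chunk_size:]  # tail of the anchor text
            if chunk and chunk in content_normalized:
                if chunk_size > best_score:
                    best_file, best_score = filename, chunk_size
                break                                # stop at the largest match for this file
    return (best_file, best_score) if best_score >= 80 else (None, 0)
```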

Insertion Point Logic

The insertion point finder at python/deepwiki-scraper.py:1220-1236 behaves differently based on match type:

After heading match :

  1. Skip blank lines after heading
  2. Skip through the following paragraph
  3. Insert after the paragraph ends (blank line or next heading)

After paragraph match :

  1. Find end of current paragraph
  2. Insert when encountering blank line or heading

Content Guards

The enforce_content_start function at python/deepwiki-scraper.py:1138-1147 and advance_past_lists function at python/deepwiki-scraper.py:1125-1137 implement content guards to prevent diagram insertion in protected areas:

Protected prefix (detected by protected_prefix_end at python/deepwiki-scraper.py:1101-1115):

  • Title line (first line starting with #)
  • “Relevant source files” section and its list items
  • Blank lines in these sections

List blocks (detected by is_list_line at python/deepwiki-scraper.py:1117-1123):

  • Lines starting with -, *, +
  • Lines matching \d+[.)]\s (numbered lists)

Diagrams are never inserted inside list blocks. If the insertion point lands in a list, advance_past_lists moves the insertion point to after the list ends.

Dynamic Fence Length

The insertion logic at python/deepwiki-scraper.py:1249-1266 calculates a dynamic fence length to handle diagrams that themselves contain backticks. It scans the diagram text for the longest run of consecutive backticks and sets fence_len = max(3, max_backticks + 1), so the opening and closing fences (e.g., a four-backtick ````mermaid fence) always use more backticks than any run inside the diagram content.
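A sketch of the fence calculation:

```python
import re

def fence_for(diagram_text):
    """Wrap a diagram in a fence one backtick longer than its longest internal backtick run."""
    runs = re.findall(r'`+', diagram_text)
    max_backticks = max((len(run) for run in runs), default=0)
    fence = '`' * max(3, max_backticks + 1)
    return f"{fence}mermaid\n{diagram_text}\n{fence}"

print(fence_for('graph TD\n  A --> B'))
```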

Sources: python/deepwiki-scraper.py:1150-1275 python/deepwiki-scraper.py:1170-1238 python/deepwiki-scraper.py:1101-1147 python/deepwiki-scraper.py:1249-1266


Error Handling and Retry Logic

Figure 8: Retry Logic in fetch_page Function

The fetch_page function at python/deepwiki-scraper.py:65-80 implements a 3-attempt retry strategy. The retry loop at python/deepwiki-scraper.py:71-80 catches all exceptions using a broad except Exception as e clause and retries after a fixed 2-second delay using time.sleep(2).

Browser-like headers are set at python/deepwiki-scraper.py:67-69 to avoid bot detection:

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

The timeout is set to 30 seconds at python/deepwiki-scraper.py:73. After a successful fetch, response.raise_for_status() validates the HTTP status code.
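A sketch of this retry behavior; the structure follows the description above but is not the script verbatim:

```python
import time
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
}

def fetch_page(url, session, attempts=3):
    """Fetch a URL with browser-like headers, a 30-second timeout, and up to 3 attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            response = session.get(url, headers=HEADERS, timeout=30)
            response.raise_for_status()
            return response
        except Exception as e:
            last_error = e
            if attempt < attempts - 1:
                time.sleep(2)   # fixed 2-second delay before the next attempt
    raise last_error
```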

The main extraction loop at python/deepwiki-scraper.py:1328-1353 catches exceptions per-page and continues processing remaining pages.
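The loop body is not reproduced on this page; a simplified sketch of the pattern, where write_page is a hypothetical stand-in for the file-writing step:

```python
success_count = 0
for page in pages:
    try:
        markdown = extract_page_content(page['url'], session, page)
        write_page(temp_dir, page, markdown)   # hypothetical helper, not in the script
        success_count += 1
    except Exception as e:
        # A failed page is reported and skipped rather than aborting the run
        print(f"  Warning: failed to extract page {page['number']} ({page['title']}): {e}")
```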

This ensures that a single page failure doesn’t abort the entire scraping process. The success count is reported at python/deepwiki-scraper.py:1355.

The top-level try-except block at python/deepwiki-scraper.py:1310-1407 catches any unhandled exceptions and exits with status code 1, signaling failure to the calling build script.

Sources: python/deepwiki-scraper.py:65-80 python/deepwiki-scraper.py:1328-1353 python/deepwiki-scraper.py:1310-1407


Session Management and Rate Limiting

The script uses a requests.Session object created at python/deepwiki-scraper.py:1305-1308 with persistent headers.

Session reuse provides connection pooling and persistent cookies across requests. The session is passed to all HTTP functions: extract_wiki_structure, extract_page_content, and extract_and_enhance_diagrams.

Rate limiting is implemented at python/deepwiki-scraper.py:1350 with a 1-second sleep between page extractions.

This prevents overwhelming the DeepWiki server and reduces the risk of rate limiting or IP blocking. The comment at python/deepwiki-scraper.py:1349 explicitly states “Be nice to the server”.
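A sketch of the session setup and per-page delay described above; the surrounding loop is simplified and the abbreviated User-Agent stands in for the full string shown earlier:

```python
import time
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 ...'})   # persistent browser-like headers

for page in pages:                       # pages discovered by extract_wiki_structure
    extract_page_content(page['url'], session, page)
    time.sleep(1)                        # 1-second pause between page extractions
```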

Sources: python/deepwiki-scraper.py:1305-1308 python/deepwiki-scraper.py:1349-1350


Key Function Reference

| Function | Lines | Purpose |
|----------|-------|---------|
| main() | 1277-1410 | Entry point - orchestrates three-phase pipeline |
| extract_wiki_structure(repo, session) | 116-163 | Discover all wiki pages from index |
| extract_page_content(url, session, page_info) | 751-877 | Extract and clean single page content |
| extract_and_enhance_diagrams(repo, temp_dir, session, url) | 880-1275 | Extract diagrams and inject into files |
| convert_html_to_markdown(html_content) | 213-228 | Convert HTML to markdown using html2text |
| clean_deepwiki_footer(markdown) | 165-211 | Remove DeepWiki UI elements from footer |
| normalize_mermaid_diagram(diagram_text) | 385-393 | Apply seven-step normalization pipeline |
| normalize_mermaid_edge_labels(diagram_text) | 230-251 | Flatten multiline edge labels |
| normalize_mermaid_state_descriptions(diagram_text) | 253-277 | Fix state diagram syntax |
| normalize_flowchart_nodes(diagram_text) | 279-301 | Clean flowchart node labels |
| normalize_statement_separators(diagram_text) | 313-328 | Insert newlines between statements |
| normalize_empty_node_labels(diagram_text) | 330-341 | Provide fallback labels |
| normalize_gantt_diagram(diagram_text) | 343-383 | Add synthetic task IDs |
| merge_multiline_labels(diagram_text) | 907-1009 | Collapse wrapped labels |
| strip_wrapping_quotes(diagram_text) | 1011-1022 | Remove extra quotes |
| fetch_page(url, session) | 65-80 | HTTP fetch with retry logic |
| sanitize_filename(text) | 22-26 | Convert text to safe filename |
| normalized_number_parts(page_number) | 28-43 | Shift DeepWiki numbering down by 1 |
| resolve_output_path(page_number, title) | 45-53 | Determine filename and section dir |
| build_target_path(page_number, slug) | 55-63 | Build relative path for links |
| format_source_references(markdown) | 397-406 | Insert colons in source links |

Sources: python/deepwiki-scraper.py:1-1411
