Phase 2: Diagram Enhancement

Purpose and Scope

Phase 2 performs intelligent diagram extraction and placement after Phase 1 has generated clean markdown files. This phase extracts Mermaid diagrams from DeepWiki's JavaScript payload, matches them to appropriate locations in the markdown content using fuzzy text matching, and inserts them contextually after relevant paragraphs.

For information about the initial markdown extraction that precedes this phase, see Phase 1: Markdown Extraction. For details on the specific fuzzy matching algorithm implementation, see Fuzzy Diagram Matching Algorithm. For information about the extraction patterns used, see Diagram Extraction from Next.js.

Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789

The Client-Side Rendering Problem

DeepWiki renders diagrams client-side using JavaScript, making them invisible to traditional HTML scraping. All Mermaid diagrams are embedded in a JavaScript payload (`self.__next_f.push`) that contains diagram code for all pages in the wiki, not just the current page. This creates a matching problem: given ~461 diagrams in a single payload and individual markdown files, how do we determine which diagrams belong in which files?

Key challenges:

  • Diagrams are escaped JavaScript strings (`\n`, `\t`, `\"`)
  • No metadata associates diagrams with specific pages
  • html2text conversion changes text formatting from the original JavaScript context
  • Must avoid false positives (placing diagrams in wrong locations)

Sources: tools/deepwiki-scraper.py:458-461 README.md:131-136
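
To make the first challenge listed above concrete, the escaping can be undone with plain string replacements. The sketch below is illustrative only; the helper name is hypothetical, not the scraper's actual function.

```python
def unescape_diagram(raw: str) -> str:
    """Hypothetical helper: turn an escaped JS string fragment into Mermaid source."""
    return (
        raw.replace('\\n', '\n')   # escaped newlines -> real newlines
           .replace('\\t', '\t')   # escaped tabs -> real tabs
           .replace('\\"', '"')    # escaped quotes -> plain quotes
    )

escaped = 'flowchart TD\\n    A[\\"Start\\"] --> B[\\"Done\\"]'
print(unescape_diagram(escaped))
# flowchart TD
#     A["Start"] --> B["Done"]
```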

Architecture Overview

Diagram: Phase 2 Processing Pipeline

Sources: tools/deepwiki-scraper.py:596-789

Diagram Extraction Process

The extraction process reads the JavaScript payload from any DeepWiki page and locates all Mermaid diagram blocks using regex pattern matching.
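
As a rough sketch of the first step (the URL and variable names here are placeholders, not taken from the scraper), the payload can be fetched from any wiki page and scanned with the simple pattern that counts every diagram:

```python
import re
import requests

# Illustrative sketch: fetch one DeepWiki page and count every escaped Mermaid
# block embedded in its JavaScript payload.
html = requests.get("https://deepwiki.com/owner/repo/1-overview", timeout=30).text
all_diagrams = re.findall(r'```mermaid\\n(.*?)```', html, re.DOTALL)
print(f"Found {len(all_diagrams)} escaped Mermaid blocks in the payload")
```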

Extraction Function Flow

```mermaid
flowchart TD
    Start["extract_and_enhance_diagrams()"]
    FetchURL["Fetch https://deepwiki.com/repo/1-overview"]

    subgraph "Pattern Matching"
        Pattern1["Pattern: r'```mermaid\\n(.*?)```'\n(re.DOTALL)"]
        Pattern2["Pattern: r'([^`]{500,}?)```mermaid\\n(.*?)```'\n(with context)"]
        FindAll["re.findall() → all_diagrams list"]
        FindIter["re.finditer() → diagram_contexts with context"]
    end

    subgraph "Unescaping"
        ReplaceNewline["Replace '\\n' → newline"]
        ReplaceTab["Replace '\\t' → tab"]
        ReplaceQuote["Replace '\\\"' → double-quote"]
        ReplaceUnicode["Replace Unicode escapes\nfor '<', '>', '&'"]
    end

    subgraph "Context Processing"
        Last500["Extract last 500 chars of context"]
        FindHeading["Scan for last heading starting with #"]
        ExtractAnchor["Extract last 2-3 non-heading lines\n(min 20 chars each)"]
        BuildDict["Build dict: {last_heading, anchor_text, diagram}"]
    end

    Start --> FetchURL
    FetchURL --> Pattern1
    FetchURL --> Pattern2
    Pattern1 --> FindAll
    Pattern2 --> FindIter

    FindAll --> ReplaceNewline
    FindIter --> ReplaceNewline
    ReplaceNewline --> ReplaceTab
    ReplaceTab --> ReplaceQuote
    ReplaceQuote --> ReplaceUnicode

    ReplaceUnicode --> Last500
    Last500 --> FindHeading
    FindHeading --> ExtractAnchor
    ExtractAnchor --> BuildDict

    BuildDict --> Output["Returns:\n- all_diagrams count\n- diagram_contexts list"]
```

Diagram: Diagram Extraction and Context Building

Sources: tools/deepwiki-scraper.py:604-674

Key Implementation Details

| Component | Implementation | Location |
|---|---|---|
| Regex Pattern | ````r'```mermaid\\n(.*?)```'```` with `re.DOTALL` flag | tools/deepwiki-scraper.py:615 |
| Context Pattern | ````r'([^`]{500,}?)```mermaid\\n(.*?)```'```` captures 500+ chars | tools/deepwiki-scraper.py:621 |
| Unescape Operations | `replace('\\n', '\n')`, `replace('\\t', '\t')`, etc. | tools/deepwiki-scraper.py:628-635, tools/deepwiki-scraper.py:639-645 |
| Heading Detection | `line.startswith('#')` on reversed context lines | tools/deepwiki-scraper.py:652-656 |
| Anchor Extraction | Last 2-3 lines with `len(line) > 20`, max 300 chars | tools/deepwiki-scraper.py:658-666 |
| Context Storage | Dict with keys: `last_heading`, `anchor_text`, `diagram` | tools/deepwiki-scraper.py:668-672 |

Sources: tools/deepwiki-scraper.py:614-674
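
A condensed sketch of the extraction and context-building step summarized in the table above. Names are illustrative and the logic is simplified relative to the cited implementation.

```python
import re

def extract_diagram_contexts(page_text):
    """Simplified sketch: find escaped Mermaid blocks plus >=500 chars of
    preceding context, then derive a heading and anchor text for matching."""
    pattern = re.compile(r'([^`]{500,}?)```mermaid\\n(.*?)```', re.DOTALL)
    contexts = []
    for match in pattern.finditer(page_text):
        context, diagram = match.group(1), match.group(2)
        diagram = diagram.replace('\\n', '\n').replace('\\t', '\t').replace('\\"', '"')
        if len(diagram) <= 10:          # quality gate: skip trivial diagram code
            continue

        # Only the last 500 characters of context are considered.
        tail = context[-500:].replace('\\n', '\n').splitlines()

        # Last heading seen before the diagram (used for fallback matching).
        last_heading = next(
            (ln.strip() for ln in reversed(tail) if ln.strip().startswith('#')), '')

        # Anchor text: the last few substantial non-heading lines, capped at 300 chars.
        body = [ln.strip() for ln in tail
                if len(ln.strip()) > 20 and not ln.strip().startswith('#')]
        anchor_text = ' '.join(body[-3:])[-300:]

        contexts.append({'last_heading': last_heading,
                         'anchor_text': anchor_text,
                         'diagram': diagram})
    return contexts
```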

Fuzzy Matching Algorithm

The fuzzy matching algorithm determines where each diagram should be inserted by finding the best match between the diagram's context and the markdown file's content.

Matching Strategy

```mermaid
flowchart TD
    Start["For each diagram_contexts[idx]"]
    CheckUsed["idx in diagrams_used?"]
    Skip["Skip to next diagram"]

    subgraph "Text Normalization"
        NormFile["Normalize file content:\ncontent.lower()\n' '.join(content.split())"]
        NormAnchor["Normalize anchor_text:\nanchor.lower()\n' '.join(anchor.split())"]
        NormHeading["Normalize heading:\nheading.lower().replace('#', '').strip()"]
    end

    subgraph "Progressive Chunk Matching"
        Try300["Try chunk_size=300"]
        Try200["Try chunk_size=200"]
        Try150["Try chunk_size=150"]
        Try100["Try chunk_size=100"]
        Try80["Try chunk_size=80"]
        ExtractChunk["test_chunk = anchor_normalized[-chunk_size:]"]
        FindPos["pos = content_normalized.find(test_chunk)"]
        CheckPos["pos != -1?"]
        ConvertLine["Convert char position to line number"]
        RecordMatch["Record best_match_line, best_match_score"]
    end

    subgraph "Heading Fallback"
        IterLines["For each line in markdown"]
        CheckHeadingLine["line.strip().startswith('#')?"]
        NormalizeLine["Normalize line heading"]
        CheckContains["heading_normalized in line_normalized?"]
        RecordHeadingMatch["best_match_line = line_num\nbest_match_score = 50"]
    end

    Start --> CheckUsed
    CheckUsed -->|Yes| Skip
    CheckUsed -->|No| NormFile

    NormFile --> NormAnchor
    NormAnchor --> Try300
    Try300 --> ExtractChunk
    ExtractChunk --> FindPos
    FindPos --> CheckPos
    CheckPos -->|Found| ConvertLine
    CheckPos -->|Not found| Try200
    ConvertLine --> RecordMatch

    Try200 --> Try150
    Try150 --> Try100
    Try100 --> Try80
    Try80 -->|All failed| IterLines

    RecordMatch --> Success["Return match with score"]
    IterLines --> CheckHeadingLine
    CheckHeadingLine -->|Yes| NormalizeLine
    NormalizeLine --> CheckContains
    CheckContains -->|Yes| RecordHeadingMatch
    RecordHeadingMatch --> Success
```

Diagram: Progressive Chunk Matching with Fallback

Sources: tools/deepwiki-scraper.py:708-746

Chunk Size Progression

The algorithm tries progressively smaller chunk sizes to accommodate variations in text formatting between the JavaScript context and the html2text-converted markdown:

| Chunk Size | Use Case | Success Rate |
|---|---|---|
| 300 chars | Perfect or near-perfect matches | Highest precision |
| 200 chars | Minor formatting differences | Good precision |
| 150 chars | Moderate text variations | Acceptable precision |
| 100 chars | Significant reformatting | Lower precision |
| 80 chars | Minimal context available | Lowest precision |
| Heading match | Fallback when text matching fails | Score: 50 |

The algorithm stops at the first successful match, prioritizing larger chunks for higher confidence.

Sources: tools/deepwiki-scraper.py:716-730 README.md:134
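
A simplified sketch of this progression with the heading fallback. The helper names are illustrative, and the char-to-line conversion is a naive stand-in for the real logic.

```python
def char_pos_to_line(lines, pos):
    """Naive stand-in: map a position in the whitespace-collapsed text back to a line."""
    running = ''
    for i, line in enumerate(lines):
        running = ' '.join((running + ' ' + line.lower()).split())
        if len(running) > pos:
            return i
    return len(lines) - 1

def find_best_match(content, lines, anchor_text, heading):
    """Simplified sketch of progressive chunk matching with heading fallback."""
    content_norm = ' '.join(content.lower().split())
    anchor_norm = ' '.join(anchor_text.lower().split())
    heading_norm = heading.lower().replace('#', '').strip()

    # Try progressively smaller tails of the anchor text; a larger chunk that
    # still matches means higher confidence, so stop at the first hit.
    for chunk_size in (300, 200, 150, 100, 80):
        test_chunk = anchor_norm[-chunk_size:]
        pos = content_norm.find(test_chunk)
        if test_chunk and pos != -1:
            return char_pos_to_line(lines, pos), chunk_size   # score ~ chunk size

    # Fallback: match the diagram's last heading against the file's headings.
    for line_num, line in enumerate(lines):
        if line.strip().startswith('#'):
            line_norm = line.lower().replace('#', '').strip()
            if heading_norm and heading_norm in line_norm:
                return line_num, 50

    return None, 0   # conservative: better to skip than to misplace a diagram
```

Because the loop returns at the first hit, a longer matching chunk always wins, mirroring the precision ordering in the table above.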

Insertion Point Logic

After finding a match, the system determines the precise line number where the diagram should be inserted.

Insertion Algorithm

```mermaid
flowchart TD
    Start["Found best_match_line"]
    CheckType["lines[best_match_line].strip().startswith('#')?"]

    subgraph "Heading Case"
        H1["insert_line = best_match_line + 1"]
        H2["Skip blank lines after heading"]
        H3["Skip through paragraph content"]
        H4["Stop at next blank line or heading"]
    end

    subgraph "Paragraph Case"
        P1["insert_line = best_match_line + 1"]
        P2["Find end of current paragraph"]
        P3["Stop at next blank line or heading"]
    end

    subgraph "Insertion Format"
        I1["Insert: empty line"]
        I2["Insert: ```mermaid"]
        I3["Insert: diagram code"]
        I4["Insert: ```"]
        I5["Insert: empty line"]
    end

    Start --> CheckType
    CheckType -->|Heading| H1
    CheckType -->|Paragraph| P1

    H1 --> H2
    H2 --> H3
    H3 --> H4

    P1 --> P2
    P2 --> P3

    H4 --> I1
    P3 --> I1

    I1 --> I2
    I2 --> I3
    I3 --> I4
    I4 --> I5

    I5 --> Append["Append to pending_insertions list:\n(insert_line, diagram, score, idx)"]
```

Diagram: Insertion Point Calculation

Sources: tools/deepwiki-scraper.py:747-768
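
A minimal sketch of this calculation, assuming `lines` is the file split into lines and `best_match_line` comes from the matching step; names are illustrative.

```python
def compute_insert_line(lines, best_match_line):
    """Simplified sketch: place the diagram after the paragraph that follows a
    matched heading, or after the matched paragraph itself."""
    insert_line = best_match_line + 1
    if lines[best_match_line].strip().startswith('#'):
        # Heading case: skip blank lines directly after the heading...
        while insert_line < len(lines) and not lines[insert_line].strip():
            insert_line += 1
    # ...then advance to the end of the current paragraph, stopping at the
    # next blank line or heading (this also covers the paragraph case).
    while (insert_line < len(lines)
           and lines[insert_line].strip()
           and not lines[insert_line].strip().startswith('#')):
        insert_line += 1
    return insert_line
```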

Batch Insertion Strategy

Diagrams are inserted in descending line order to avoid invalidating insertion points:

```mermaid
graph LR
    Collect["Collect all\npending_insertions"]
    Sort["Sort by insert_line\n(descending)"]
    Insert["Insert from bottom to top\npreserves line numbers"]
    Write["Write enhanced file\nto temp_dir"]

    Collect --> Sort
    Sort --> Insert
    Insert --> Write
```

Diagram: Batch Insertion Order

Sources: tools/deepwiki-scraper.py:771-783
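
A minimal sketch of the bottom-up insertion; the tuple layout follows the pending_insertions format shown above, and the function name is illustrative.

```python
def apply_insertions(lines, pending_insertions):
    """Simplified sketch: splice diagrams in from the bottom of the file upward
    so earlier insertions never shift the line numbers of later ones."""
    # pending_insertions: (insert_line, diagram, score, idx) tuples.
    for insert_line, diagram, score, idx in sorted(pending_insertions, reverse=True):
        block = ['', '```mermaid'] + diagram.split('\n') + ['```', '']
        lines[insert_line:insert_line] = block   # surrounding blank lines keep rendering clean
    return lines
```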

File Processing Workflow

Phase 2 operates on files in the temporary directory created by Phase 1, enhancing them in-place before they are moved to the final output directory.

Processing Loop

```mermaid
sequenceDiagram
    participant Main as extract_and_enhance_diagrams()
    participant Glob as temp_dir.glob('**/*.md')
    participant File as Individual .md file
    participant Matcher as Fuzzy Matcher
    participant Writer as File Writer

    Main->>Main: Extract all diagram_contexts
    Main->>Glob: Find all markdown files

    loop For each md_file
        Glob->>File: Open and read content
        File->>File: Check if '```mermaid' already present

        alt Already has diagrams
            File->>Glob: Skip (continue)
        else No diagrams
            File->>Matcher: Normalize content

            loop For each diagram_context
                Matcher->>Matcher: Try progressive chunk matching
                Matcher->>Matcher: Try heading fallback
                Matcher->>Matcher: Record best match
            end

            Matcher->>File: Return pending_insertions list
            File->>File: Sort insertions (descending)
            File->>File: Insert diagrams bottom-up
            File->>Writer: Write enhanced content
            Writer->>Main: Increment enhanced_count
        end
    end

    Main->>Main: Print summary
```

Diagram: File Processing Sequence

Sources: tools/deepwiki-scraper.py:676-788
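
Putting the pieces together, a simplified sketch of the per-file loop, reusing the illustrative helpers sketched in the previous sections (not the scraper's exact code):

```python
from pathlib import Path

def enhance_files(temp_dir, diagram_contexts):
    """Simplified sketch of the Phase 2 file loop. Note: the real implementation
    also tracks diagrams already placed (diagrams_used) so a diagram is not
    inserted into multiple files; that bookkeeping is omitted here."""
    enhanced_count = 0
    for md_file in Path(temp_dir).glob('**/*.md'):
        content = md_file.read_text(encoding='utf-8')
        if '```mermaid' in content:
            continue                      # already has diagrams: skip to avoid duplicates

        lines = content.split('\n')
        pending_insertions = []
        for idx, ctx in enumerate(diagram_contexts):
            match_line, score = find_best_match(content, lines,
                                                ctx['anchor_text'], ctx['last_heading'])
            if match_line is not None:
                insert_at = compute_insert_line(lines, match_line)
                pending_insertions.append((insert_at, ctx['diagram'], score, idx))

        if pending_insertions:
            lines = apply_insertions(lines, pending_insertions)
            md_file.write_text('\n'.join(lines), encoding='utf-8')
            enhanced_count += 1
    return enhanced_count
```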

Performance Characteristics

Extraction Statistics

From a typical wiki with ~10 pages:

| Metric | Value | Location |
|---|---|---|
| Total diagrams in JS payload | ~461 | README.md:132 |
| Diagrams with context (500+ chars) | ~48 | README.md:133 |
| Context window size | 500 characters | tools/deepwiki-scraper.py:621 |
| Anchor text max length | 300 characters | tools/deepwiki-scraper.py:666 |
| Typical enhanced files | Varies by content | Printed in output |

Sources: README.md:132-133 tools/deepwiki-scraper.py:674 tools/deepwiki-scraper.py:788

Matching Performance

The progressive chunk size strategy balances precision and recall:

  • High precision matches (300-200 chars): Strong contextual alignment
  • Medium precision matches (150-100 chars): Acceptable with some risk
  • Low precision matches (80 chars): Risk of false positives
  • Heading-only matches (score: 50): Last resort fallback

The algorithm prefers to skip a diagram rather than place it incorrectly, prioritizing documentation quality over diagram count.

Sources: tools/deepwiki-scraper.py:716-745

Integration with Phases 1 and 3

Input Requirements (from Phase 1)

  • Clean markdown files in temp_dir
  • Files must not already contain ```` ```mermaid ```` blocks
  • Proper heading structure for fallback matching
  • Normalized link structure

Sources: tools/deepwiki-scraper.py:810-877

Output Guarantees (for Phase 3)

  • Enhanced markdown files in temp_dir
  • Diagrams inserted with proper fencing: ```` ```mermaid ... ``` ````
  • Blank lines before and after diagrams for proper rendering
  • Original file structure preserved (section-N directories maintained)
  • Atomic file operations (write complete file or skip)

Sources: tools/deepwiki-scraper.py:781-786 tools/deepwiki-scraper.py:883-908

Workflow Integration

Diagram: Three-Phase Integration

Sources: README.md:123-145 tools/deepwiki-scraper.py:810-916

Error Handling and Edge Cases

Skipped Files

Files are skipped if they already contain Mermaid diagrams, to avoid duplicate insertion.

Sources: tools/deepwiki-scraper.py:686-687

Failed Matches

When a diagram cannot be matched:

  • The diagram is not inserted (conservative approach)
  • No error is raised (continues processing other diagrams)
  • File is left unmodified if no diagrams match

Sources: tools/deepwiki-scraper.py:699-746

Network Errors

If diagram extraction fails (network error, changed HTML structure):

  • Warning is printed but Phase 2 continues
  • Phase 1 files remain valid
  • System can still proceed to Phase 3 without diagrams

Sources: tools/deepwiki-scraper.py:610-612
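
A sketch of this failure mode, assuming the payload is fetched with requests; the function name is illustrative.

```python
import requests

def fetch_payload(url):
    """Illustrative sketch: on failure, warn and return None so Phase 2 can be
    skipped while the Phase 1 markdown files remain valid."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        print(f"Warning: could not fetch diagram payload ({exc}); skipping diagram enhancement")
        return None
```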

Diagram Quality Thresholds

| Threshold | Purpose |
|---|---|
| `len(diagram) > 10` | Filter out trivial/invalid diagram code |
| `len(anchor) > 50` | Ensure sufficient context for matching |
| `len(line) > 20` | Filter out short lines from anchor text |
| `chunk_size >= 80` | Minimum viable match size |

Sources: tools/deepwiki-scraper.py:648 tools/deepwiki-scraper.py:712 tools/deepwiki-scraper.py:661

Summary

Phase 2 implements a sophisticated fuzzy matching system that:

  1. Extracts all Mermaid diagrams from DeepWiki's JavaScript payload using regex patterns
  2. Processes diagram context to extract heading and anchor text metadata
  3. Matches diagrams to markdown files using progressive chunk size comparison (300→80 chars)
  4. Inserts diagrams after relevant paragraphs with proper formatting
  5. Validates through conservative matching to avoid false positives

The phase operates entirely on files in the temporary directory, leaving Phase 1's output intact while preparing enhanced files for Phase 3's mdBook build process.

Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789