Phase 2: Diagram Enhancement
Purpose and Scope
Phase 2 performs intelligent diagram extraction and placement after Phase 1 has generated clean markdown files. This phase extracts Mermaid diagrams from DeepWiki's JavaScript payload, matches them to appropriate locations in the markdown content using fuzzy text matching, and inserts them contextually after relevant paragraphs.
For information about the initial markdown extraction that precedes this phase, see Phase 1: Markdown Extraction. For details on the specific fuzzy matching algorithm implementation, see Fuzzy Diagram Matching Algorithm. For information about the extraction patterns used, see Diagram Extraction from Next.js.
Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789
The Client-Side Rendering Problem
DeepWiki renders diagrams client-side using JavaScript, making them invisible to traditional HTML scraping. All Mermaid diagrams are embedded in a JavaScript payload (`self.__next_f.push`) that contains diagram code for all pages in the wiki, not just the current page. This creates a matching problem: given ~461 diagrams in a single payload and individual markdown files, how do we determine which diagrams belong in which files?
Key challenges:
- Diagrams are escaped JavaScript strings (`\n`, `\t`, `\"`)
- No metadata associates diagrams with specific pages
- html2text conversion changes text formatting from the original JavaScript context
- Must avoid false positives (placing diagrams in wrong locations)
Sources: tools/deepwiki-scraper.py:458-461 README.md:131-136
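The scale of the problem is easy to see in the raw page source. Below is a minimal sketch, assuming the `requests` library and a placeholder repository slug, that fetches one page and counts the escaped Mermaid blocks embedded in its JavaScript payload:

```python
import re
import requests

# Placeholder slug; any single DeepWiki page carries the full payload.
url = "https://deepwiki.com/owner/repo/1-overview"
html = requests.get(url, timeout=30).text

# Diagrams live inside self.__next_f.push(...) as escaped strings, so they
# appear in the raw source as ```mermaid\n ... ``` with a literal backslash-n.
escaped_blocks = re.findall(r'```mermaid\\n(.*?)```', html, re.DOTALL)
print(f"{len(escaped_blocks)} escaped Mermaid blocks found in one page's payload")
```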
Architecture Overview
Diagram: Phase 2 Processing Pipeline
Sources: tools/deepwiki-scraper.py:596-789
Diagram Extraction Process
The extraction process reads the JavaScript payload from any DeepWiki page and locates all Mermaid diagram blocks using regex pattern matching.
Extraction Function Flow
```mermaid
flowchart TD
Start["extract_and_enhance_diagrams()"]
FetchURL["Fetch https://deepwiki.com/repo/1-overview"]
subgraph "Pattern Matching"
Pattern1["Pattern: r'```mermaid\\n(.*?)```'\n(re.DOTALL)"]
Pattern2["Pattern: r'([^`]{500,}?)```mermaid\\n(.*?)```'\n(with context)"]
FindAll["re.findall() → all_diagrams list"]
FindIter["re.finditer() → diagram_contexts with context"]
end
subgraph "Unescaping"
ReplaceNewline["Replace '\\n' → newline"]
ReplaceTab["Replace '\\t' → tab"]
ReplaceQuote["Replace '\\\"' → double-quote"]
ReplaceUnicode["Replace Unicode escapes\nfor '<', '>', '&'"]
end
subgraph "Context Processing"
Last500["Extract last 500 chars of context"]
FindHeading["Scan for last heading starting with #"]
ExtractAnchor["Extract last 2-3 non-heading lines\n(min 20 chars each)"]
BuildDict["Build dict: {last_heading, anchor_text, diagram}"]
end
Start --> FetchURL
FetchURL --> Pattern1
FetchURL --> Pattern2
Pattern1 --> FindAll
Pattern2 --> FindIter
FindAll --> ReplaceNewline
FindIter --> ReplaceNewline
ReplaceNewline --> ReplaceTab
ReplaceTab --> ReplaceQuote
ReplaceQuote --> ReplaceUnicode
ReplaceUnicode --> Last500
Last500 --> FindHeading
FindHeading --> ExtractAnchor
ExtractAnchor --> BuildDict
BuildDict --> Output["Returns:\n- all_diagrams count\n- diagram_contexts list"]
```
Diagram: Diagram Extraction and Context Building
Sources: tools/deepwiki-scraper.py:604-674
Key Implementation Details
| Component | Implementation | Location |
|---|---|---|
| Regex Pattern | ````r'```mermaid\\n(.*?)```'```` with `re.DOTALL` flag | tools/deepwiki-scraper.py:615 |
| Context Pattern | ````r'([^`]{500,}?)```mermaid\\n(.*?)```'```` captures 500+ chars of preceding context | tools/deepwiki-scraper.py:621 |
| Unescape Operations | `replace('\\n', '\n')`, `replace('\\t', '\t')`, etc. | tools/deepwiki-scraper.py:628-635 tools/deepwiki-scraper.py:639-645 |
| Heading Detection | `line.startswith('#')` on reversed context lines | tools/deepwiki-scraper.py:652-656 |
| Anchor Extraction | Last 2-3 lines with `len(line) > 20`, max 300 chars | tools/deepwiki-scraper.py:658-666 |
| Context Storage | Dict with keys: `last_heading`, `anchor_text`, `diagram` | tools/deepwiki-scraper.py:668-672 |
Sources: tools/deepwiki-scraper.py:614-674
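The following is a condensed sketch of the extraction and context-building steps summarized above. It is not the verbatim source; the Unicode-escape replacements and the exact anchor construction are assumptions consistent with the table:

```python
import re

def extract_diagram_contexts(html):
    """Sketch: pull escaped Mermaid blocks plus matching metadata from page source."""
    contexts = []
    # Capture 500+ characters of preceding context plus the escaped diagram body.
    for match in re.finditer(r'([^`]{500,}?)```mermaid\\n(.*?)```', html, re.DOTALL):
        context, diagram = match.group(1), match.group(2)

        # Unescape the JavaScript string back into real Mermaid source.
        for escaped, real in (('\\n', '\n'), ('\\t', '\t'), ('\\"', '"'),
                              ('\\u003c', '<'), ('\\u003e', '>'), ('\\u0026', '&')):
            diagram = diagram.replace(escaped, real)
            context = context.replace(escaped, real)

        if len(diagram.strip()) <= 10:          # drop trivial/invalid diagram code
            continue

        # Build matching metadata from the last 500 characters of context.
        tail_lines = [l.strip() for l in context[-500:].split('\n') if l.strip()]
        last_heading = next((l for l in reversed(tail_lines) if l.startswith('#')), '')
        anchor_lines = [l for l in tail_lines if not l.startswith('#') and len(l) > 20]
        anchor_text = ' '.join(anchor_lines[-3:])[-300:]

        contexts.append({'last_heading': last_heading,
                         'anchor_text': anchor_text,
                         'diagram': diagram.strip()})
    return contexts
```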
Fuzzy Matching Algorithm
The fuzzy matching algorithm determines where each diagram should be inserted by finding the best match between the diagram's context and the markdown file's content.
Matching Strategy
```mermaid
flowchart TD
Start["For each diagram_contexts[idx]"]
CheckUsed["idx in diagrams_used?"]
Skip["Skip to next diagram"]
subgraph "Text Normalization"
NormFile["Normalize file content:\ncontent.lower()\n' '.join(content.split())"]
NormAnchor["Normalize anchor_text:\nanchor.lower()\n' '.join(anchor.split())"]
NormHeading["Normalize heading:\nheading.lower().replace('#', '').strip()"]
end
subgraph "Progressive Chunk Matching"
Try300["Try chunk_size=300"]
Try200["Try chunk_size=200"]
Try150["Try chunk_size=150"]
Try100["Try chunk_size=100"]
Try80["Try chunk_size=80"]
ExtractChunk["test_chunk = anchor_normalized[-chunk_size:]"]
FindPos["pos = content_normalized.find(test_chunk)"]
CheckPos["pos != -1?"]
ConvertLine["Convert char position to line number"]
RecordMatch["Record best_match_line, best_match_score"]
end
subgraph "Heading Fallback"
IterLines["For each line in markdown"]
CheckHeadingLine["line.strip().startswith('#')?"]
NormalizeLine["Normalize line heading"]
CheckContains["heading_normalized in line_normalized?"]
RecordHeadingMatch["best_match_line = line_num\nbest_match_score = 50"]
end
Start --> CheckUsed
CheckUsed -->|Yes| Skip
CheckUsed -->|No| NormFile
NormFile --> NormAnchor
NormAnchor --> Try300
Try300 --> ExtractChunk
ExtractChunk --> FindPos
FindPos --> CheckPos
CheckPos -->|Found| ConvertLine
CheckPos -->|Not found| Try200
ConvertLine --> RecordMatch
Try200 --> Try150
Try150 --> Try100
Try100 --> Try80
Try80 -->|All failed| IterLines
RecordMatch --> Success["Return match with score"]
IterLines --> CheckHeadingLine
CheckHeadingLine -->|Yes| NormalizeLine
NormalizeLine --> CheckContains
CheckContains -->|Yes| RecordHeadingMatch
RecordHeadingMatch --> Success
```
Diagram: Progressive Chunk Matching with Fallback
Sources: tools/deepwiki-scraper.py:708-746
Chunk Size Progression
The algorithm tries progressively smaller chunk sizes to accommodate variations in text formatting between the JavaScript context and the html2text-converted markdown:
| Chunk Size | Use Case | Success Rate |
|---|---|---|
| 300 chars | Perfect or near-perfect matches | Highest precision |
| 200 chars | Minor formatting differences | Good precision |
| 150 chars | Moderate text variations | Acceptable precision |
| 100 chars | Significant reformatting | Lower precision |
| 80 chars | Minimal context available | Lowest precision |
| Heading match | Fallback when text matching fails | Score: 50 |
The algorithm stops at the first successful match, prioritizing larger chunks for higher confidence.
Sources: tools/deepwiki-scraper.py:716-730 README.md:134
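A sketch of the progressive matching loop under the assumptions above (the helper name and the offset-to-line conversion are illustrative, not the actual source):

```python
def find_match_line(lines, anchor_text, heading):
    """Sketch: try shrinking tails of the anchor, then fall back to the heading."""
    content_norm = ' '.join(' '.join(lines).lower().split())
    anchor_norm = ' '.join(anchor_text.lower().split())
    heading_norm = heading.lower().replace('#', '').strip()

    for chunk_size in (300, 200, 150, 100, 80):
        chunk = anchor_norm[-chunk_size:]
        pos = content_norm.find(chunk) if chunk else -1
        if pos == -1:
            continue
        # Map the offset in the normalized text back to a source line number.
        consumed = 0
        for line_num, line in enumerate(lines):
            norm = ' '.join(line.lower().split())
            if not norm:
                continue
            if pos < consumed + len(norm):
                return line_num, chunk_size     # score reflects the chunk size matched
            consumed += len(norm) + 1           # +1 for the joining space

    # Fallback: match the diagram's nearest heading with a low fixed score.
    if heading_norm:
        for line_num, line in enumerate(lines):
            if line.strip().startswith('#') and heading_norm in line.lower():
                return line_num, 50
    return None, 0
```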
Insertion Point Logic
After finding a match, the system determines the precise line number where the diagram should be inserted.
Insertion Algorithm
```mermaid
flowchart TD
Start["Found best_match_line"]
CheckType["lines[best_match_line].strip().startswith('#')?"]
subgraph "Heading Case"
H1["insert_line = best_match_line + 1"]
H2["Skip blank lines after heading"]
H3["Skip through paragraph content"]
H4["Stop at next blank line or heading"]
end
subgraph "Paragraph Case"
P1["insert_line = best_match_line + 1"]
P2["Find end of current paragraph"]
P3["Stop at next blank line or heading"]
end
subgraph "Insertion Format"
I1["Insert: empty line"]
I2["Insert: ```mermaid"]
I3["Insert: diagram code"]
I4["Insert: ```"]
I5["Insert: empty line"]
end
Start --> CheckType
CheckType -->|Heading| H1
CheckType -->|Paragraph| P1
H1 --> H2
H2 --> H3
H3 --> H4
P1 --> P2
P2 --> P3
H4 --> I1
P3 --> I1
I1 --> I2
I2 --> I3
I3 --> I4
I4 --> I5
I5 --> Append["Append to pending_insertions list:\n(insert_line, diagram, score, idx)"]
```
Diagram: Insertion Point Calculation
Sources: tools/deepwiki-scraper.py:747-768
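A sketch of this insertion-point rule, again with assumed names; the real implementation may differ in how it walks past the paragraph:

```python
def compute_insert_line(lines, best_match_line):
    """Sketch: place the diagram after the matched heading's first paragraph,
    or after the end of the matched paragraph."""
    insert_line = best_match_line + 1
    if lines[best_match_line].strip().startswith('#'):
        # Heading case: skip blank lines between the heading and its paragraph.
        while insert_line < len(lines) and not lines[insert_line].strip():
            insert_line += 1
    # Advance to the end of the current paragraph, stopping at a blank line or heading.
    while (insert_line < len(lines)
           and lines[insert_line].strip()
           and not lines[insert_line].strip().startswith('#')):
        insert_line += 1
    return insert_line


def diagram_block(diagram):
    """Sketch of the insertion format: a blank-line-padded, fenced Mermaid block."""
    return ['', '```mermaid', *diagram.split('\n'), '```', '']
```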
Batch Insertion Strategy
Diagrams are inserted in descending line order to avoid invalidating insertion points:
```mermaid
graph LR
Collect["Collect all\npending_insertions"]
Sort["Sort by insert_line\n(descending)"]
Insert["Insert from bottom to top\npreserves line numbers"]
Write["Write enhanced file\nto temp_dir"]
Collect --> Sort
Sort --> Insert
Insert --> Write
```
Diagram: Batch Insertion Order
Implementation: a minimal sketch of the bottom-up insertion step (the tuple layout follows the diagram above; names are illustrative, not the verbatim source):
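```python
def apply_insertions(lines, pending_insertions):
    """Sketch: apply (insert_line, diagram, score, idx) tuples bottom-up so earlier
    insertions never shift the line numbers of the ones still pending."""
    for insert_line, diagram, _score, _idx in sorted(pending_insertions,
                                                     key=lambda p: p[0],
                                                     reverse=True):
        lines[insert_line:insert_line] = diagram_block(diagram)
    return lines
```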
Sources: tools/deepwiki-scraper.py:771-783
File Processing Workflow
Phase 2 operates on files in the temporary directory created by Phase 1, enhancing them in-place before they are moved to the final output directory.
Processing Loop
```mermaid
sequenceDiagram
participant Main as extract_and_enhance_diagrams()
participant Glob as temp_dir.glob('**/*.md')
participant File as Individual .md file
participant Matcher as Fuzzy Matcher
participant Writer as File Writer
Main->>Main: Extract all diagram_contexts
Main->>Glob: Find all markdown files
loop For each md_file
Glob->>File: Open and read content
File->>File: Check if '```mermaid' already present
alt Already has diagrams
File->>Glob: Skip (continue)
else No diagrams
File->>Matcher: Normalize content
loop For each diagram_context
Matcher->>Matcher: Try progressive chunk matching
Matcher->>Matcher: Try heading fallback
Matcher->>Matcher: Record best match
end
Matcher->>File: Return pending_insertions list
File->>File: Sort insertions (descending)
File->>File: Insert diagrams bottom-up
File->>Writer: Write enhanced content
Writer->>Main: Increment enhanced_count
end
end
Main->>Main: Print summary
```
Diagram: File Processing Sequence
Sources: tools/deepwiki-scraper.py:676-788
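Tying the pieces together, a sketch of the per-file loop, reusing the illustrative helpers from the earlier sketches (the real function tracks additional state and reporting):

```python
from pathlib import Path

def enhance_markdown_files(temp_dir, diagram_contexts):
    """Sketch of the Phase 2 file loop over Phase 1's temporary output."""
    enhanced_count = 0
    for md_file in Path(temp_dir).glob('**/*.md'):
        content = md_file.read_text(encoding='utf-8')
        if '```mermaid' in content:
            continue                            # already enhanced; avoid duplicates

        lines = content.split('\n')
        pending, used = [], set()
        for idx, ctx in enumerate(diagram_contexts):
            if idx in used:
                continue
            match_line, score = find_match_line(lines, ctx['anchor_text'],
                                                ctx['last_heading'])
            if match_line is not None:
                insert_line = compute_insert_line(lines, match_line)
                pending.append((insert_line, ctx['diagram'], score, idx))
                used.add(idx)

        if pending:
            md_file.write_text('\n'.join(apply_insertions(lines, pending)),
                               encoding='utf-8')
            enhanced_count += 1
    return enhanced_count
```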
Performance Characteristics
Extraction Statistics
From a typical wiki with ~10 pages:
| Metric | Value | Location |
|---|---|---|
| Total diagrams in JS payload | ~461 | README.md:132 |
| Diagrams with context (500+ chars) | ~48 | README.md:133 |
| Context window size | 500 characters | tools/deepwiki-scraper.py:621 |
| Anchor text max length | 300 characters | tools/deepwiki-scraper.py:666 |
| Typical enhanced files | Varies by content | Printed in output |
Sources: README.md:132-133 tools/deepwiki-scraper.py:674 tools/deepwiki-scraper.py:788
Matching Performance
The progressive chunk size strategy balances precision and recall:
- High precision matches (300-200 chars): Strong contextual alignment
- Medium precision matches (150-100 chars): Acceptable with some risk
- Low precision matches (80 chars): Risk of false positives
- Heading-only matches (score: 50): Last resort fallback
The algorithm prefers to skip a diagram rather than place it incorrectly, prioritizing documentation quality over diagram count.
Sources: tools/deepwiki-scraper.py:716-745
Integration with Phases 1 and 3
Input Requirements (from Phase 1)
- Clean markdown files in `temp_dir`
- Files must not already contain ```` ```mermaid ```` blocks
- Proper heading structure for fallback matching
- Normalized link structure
Sources: tools/deepwiki-scraper.py:810-877
Output Guarantees (for Phase 3)
- Enhanced markdown files in `temp_dir`
- Diagrams inserted with proper fencing: ```` ```mermaid ... ``` ````
- Blank lines before and after diagrams for proper rendering
- Original file structure preserved (section-N directories maintained)
- Atomic file operations (write complete file or skip)
Sources: tools/deepwiki-scraper.py:781-786 tools/deepwiki-scraper.py:883-908
Workflow Integration
Diagram: Three-Phase Integration
Sources: README.md:123-145 tools/deepwiki-scraper.py:810-916
Error Handling and Edge Cases
Skipped Files
Files are skipped if they already contain Mermaid diagrams, to avoid duplicate insertion; this is the same guard shown in the processing-loop sketch above.
Sources: tools/deepwiki-scraper.py:686-687
Failed Matches
When a diagram cannot be matched:
- The diagram is not inserted (conservative approach)
- No error is raised (continues processing other diagrams)
- File is left unmodified if no diagrams match
Sources: tools/deepwiki-scraper.py:699-746
Network Errors
If diagram extraction fails (network error, changed HTML structure):
- Warning is printed but Phase 2 continues
- Phase 1 files remain valid
- System can still proceed to Phase 3 without diagrams
Sources: tools/deepwiki-scraper.py:610-612
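A sketch of that non-fatal handling, assuming the illustrative `extract_diagram_contexts` helper from earlier:

```python
import requests

try:
    html = requests.get("https://deepwiki.com/owner/repo/1-overview", timeout=30).text
    diagram_contexts = extract_diagram_contexts(html)
except Exception as exc:  # network error, changed page structure, etc.
    print(f"  Warning: diagram extraction failed ({exc}); continuing without diagrams")
    diagram_contexts = []  # Phase 1 output stays valid; Phase 3 can still run
```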
Diagram Quality Thresholds
| Threshold | Purpose |
|---|---|
| `len(diagram) > 10` | Filter out trivial/invalid diagram code |
| `len(anchor) > 50` | Ensure sufficient context for matching |
| `len(line) > 20` | Filter out short lines from anchor text |
| `chunk_size >= 80` | Minimum viable match size |
Sources: tools/deepwiki-scraper.py:648 tools/deepwiki-scraper.py:712 tools/deepwiki-scraper.py:661
Summary
Phase 2 implements a sophisticated fuzzy matching system that:
- Extracts all Mermaid diagrams from DeepWiki's JavaScript payload using regex patterns
- Processes diagram context to extract heading and anchor text metadata
- Matches diagrams to markdown files using progressive chunk size comparison (300→80 chars)
- Inserts diagrams after relevant paragraphs with proper formatting
- Validates through conservative matching to avoid false positives
The phase operates entirely on files in the temporary directory, leaving Phase 1's output intact while preparing enhanced files for Phase 3's mdBook build process.
Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789