Phase 2: Diagram Enhancement
Purpose and Scope
Phase 2 performs intelligent diagram extraction and placement after Phase 1 has generated clean markdown files. This phase extracts Mermaid diagrams from DeepWiki's JavaScript payload, matches them to appropriate locations in the markdown content using fuzzy text matching, and inserts them contextually after relevant paragraphs.
For information about the initial markdown extraction that precedes this phase, see Phase 1: Markdown Extraction. For details on the specific fuzzy matching algorithm implementation, see Fuzzy Diagram Matching Algorithm. For information about the extraction patterns used, see Diagram Extraction from Next.js.
Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789
The Client-Side Rendering Problem
DeepWiki renders diagrams client-side using JavaScript, making them invisible to traditional HTML scraping. All Mermaid diagrams are embedded in a JavaScript payload (`self.__next_f.push`) that contains diagram code for all pages in the wiki, not just the current page. This creates a matching problem: given ~461 diagrams in a single payload and individual markdown files, how do we determine which diagrams belong in which files?
Key challenges:
- Diagrams are escaped JavaScript strings (`\n`, `\t`, `\"`)
- No metadata associates diagrams with specific pages
- html2text conversion changes text formatting from the original JavaScript context
- Must avoid false positives (placing diagrams in wrong locations)
Sources: tools/deepwiki-scraper.py:458-461 README.md:131-136
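The scale of the problem is easy to see in the raw page source. Below is a minimal sketch, assuming the `requests` library and a placeholder repository slug, that fetches one page and counts the escaped Mermaid blocks embedded in its JavaScript payload:

```python
import re
import requests

# Placeholder slug; any single DeepWiki page carries the full payload.
url = "https://deepwiki.com/owner/repo/1-overview"
html = requests.get(url, timeout=30).text

# Diagrams live inside self.__next_f.push(...) as escaped strings, so they
# appear in the raw source as ```mermaid\n ... ``` with a literal backslash-n.
escaped_blocks = re.findall(r'```mermaid\\n(.*?)```', html, re.DOTALL)
print(f"{len(escaped_blocks)} escaped Mermaid blocks found in one page's payload")
```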
Architecture Overview
Diagram: Phase 2 Processing Pipeline
Sources: tools/deepwiki-scraper.py:596-789
Diagram Extraction Process
The extraction process reads the JavaScript payload from any DeepWiki page and locates all Mermaid diagram blocks using regex pattern matching.
Extraction Function Flow
```mermaid
flowchart TD
Start["extract_and_enhance_diagrams()"]
FetchURL["Fetch https://deepwiki.com/repo/1-overview"]
subgraph "Pattern Matching"
Pattern1["Pattern: r'```mermaid\\n(.*?)```'\n(re.DOTALL)"]
Pattern2["Pattern: r'([^`]{500,}?)```mermaid\\n(.*?)```'\n(with context)"]
FindAll["re.findall() → all_diagrams list"]
FindIter["re.finditer() → diagram_contexts with context"]
end
subgraph "Unescaping"
ReplaceNewline["Replace '\\n' → newline"]
ReplaceTab["Replace '\\t' → tab"]
ReplaceQuote["Replace '\\\"' → double-quote"]
ReplaceUnicode["Replace Unicode escapes\nfor '<', '>', '&'"]
end
subgraph "Context Processing"
Last500["Extract last 500 chars of context"]
FindHeading["Scan for last heading starting with #"]
ExtractAnchor["Extract last 2-3 non-heading lines\n(min 20 chars each)"]
BuildDict["Build dict: {last_heading, anchor_text, diagram}"]
end
Start --> FetchURL
FetchURL --> Pattern1
FetchURL --> Pattern2
Pattern1 --> FindAll
Pattern2 --> FindIter
FindAll --> ReplaceNewline
FindIter --> ReplaceNewline
ReplaceNewline --> ReplaceTab
ReplaceTab --> ReplaceQuote
ReplaceQuote --> ReplaceUnicode
ReplaceUnicode --> Last500
Last500 --> FindHeading
FindHeading --> ExtractAnchor
ExtractAnchor --> BuildDict
BuildDict --> Output["Returns:\n- all_diagrams count\n- diagram_contexts list"]
```
Diagram: Diagram Extraction and Context Building
Sources: tools/deepwiki-scraper.py:604-674
Key Implementation Details
| Component | Implementation | Location |
|---|---|---|
| Regex Pattern | ````r'```mermaid\\n(.*?)```'```` with `re.DOTALL` flag | tools/deepwiki-scraper.py:615 |
| Context Pattern | ````r'([^`]{500,}?)```mermaid\\n(.*?)```'```` captures 500+ chars of preceding context | tools/deepwiki-scraper.py:621 |
| Unescape Operations | `replace('\\n', '\n')`, `replace('\\t', '\t')`, etc. | tools/deepwiki-scraper.py:628-635 tools/deepwiki-scraper.py:639-645 |
| Heading Detection | `line.startswith('#')` on reversed context lines | tools/deepwiki-scraper.py:652-656 |
| Anchor Extraction | Last 2-3 lines with `len(line) > 20`, max 300 chars | tools/deepwiki-scraper.py:658-666 |
| Context Storage | Dict with keys: `last_heading`, `anchor_text`, `diagram` | tools/deepwiki-scraper.py:668-672 |
Sources: tools/deepwiki-scraper.py:614-674
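The following is a condensed sketch of the extraction and context-building steps summarized above. It is not the verbatim source; the Unicode-escape replacements and the exact anchor construction are assumptions consistent with the table:

```python
import re

def extract_diagram_contexts(html):
    """Sketch: pull escaped Mermaid blocks plus matching metadata from page source."""
    contexts = []
    # Capture 500+ characters of preceding context plus the escaped diagram body.
    for match in re.finditer(r'([^`]{500,}?)```mermaid\\n(.*?)```', html, re.DOTALL):
        context, diagram = match.group(1), match.group(2)

        # Unescape the JavaScript string back into real Mermaid source.
        for escaped, real in (('\\n', '\n'), ('\\t', '\t'), ('\\"', '"'),
                              ('\\u003c', '<'), ('\\u003e', '>'), ('\\u0026', '&')):
            diagram = diagram.replace(escaped, real)
            context = context.replace(escaped, real)

        if len(diagram.strip()) <= 10:          # drop trivial/invalid diagram code
            continue

        # Build matching metadata from the last 500 characters of context.
        tail_lines = [l.strip() for l in context[-500:].split('\n') if l.strip()]
        last_heading = next((l for l in reversed(tail_lines) if l.startswith('#')), '')
        anchor_lines = [l for l in tail_lines if not l.startswith('#') and len(l) > 20]
        anchor_text = ' '.join(anchor_lines[-3:])[-300:]

        contexts.append({'last_heading': last_heading,
                         'anchor_text': anchor_text,
                         'diagram': diagram.strip()})
    return contexts
```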
Fuzzy Matching Algorithm
The fuzzy matching algorithm determines where each diagram should be inserted by finding the best match between the diagram's context and the markdown file's content.
Matching Strategy
```mermaid
flowchart TD
Start["For each diagram_contexts[idx]"]
CheckUsed["idx in diagrams_used?"]
Skip["Skip to next diagram"]
subgraph "Text Normalization"
NormFile["Normalize file content:\ncontent.lower()\n' '.join(content.split())"]
NormAnchor["Normalize anchor_text:\nanchor.lower()\n' '.join(anchor.split())"]
NormHeading["Normalize heading:\nheading.lower().replace('#', '').strip()"]
end
subgraph "Progressive Chunk Matching"
Try300["Try chunk_size=300"]
Try200["Try chunk_size=200"]
Try150["Try chunk_size=150"]
Try100["Try chunk_size=100"]
Try80["Try chunk_size=80"]
ExtractChunk["test_chunk = anchor_normalized[-chunk_size:]"]
FindPos["pos = content_normalized.find(test_chunk)"]
CheckPos["pos != -1?"]
ConvertLine["Convert char position to line number"]
RecordMatch["Record best_match_line, best_match_score"]
end
subgraph "Heading Fallback"
IterLines["For each line in markdown"]
CheckHeadingLine["line.strip().startswith('#')?"]
NormalizeLine["Normalize line heading"]
CheckContains["heading_normalized in line_normalized?"]
RecordHeadingMatch["best_match_line = line_num\nbest_match_score = 50"]
end
Start --> CheckUsed
CheckUsed -->|Yes| Skip
CheckUsed -->|No| NormFile
NormFile --> NormAnchor
NormAnchor --> Try300
Try300 --> ExtractChunk
ExtractChunk --> FindPos
FindPos --> CheckPos
CheckPos -->|Found| ConvertLine
CheckPos -->|Not found| Try200
ConvertLine --> RecordMatch
Try200 --> Try150
Try150 --> Try100
Try100 --> Try80
Try80 -->|All failed| IterLines
RecordMatch --> Success["Return match with score"]
IterLines --> CheckHeadingLine
CheckHeadingLine -->|Yes| NormalizeLine
NormalizeLine --> CheckContains
CheckContains -->|Yes| RecordHeadingMatch
RecordHeadingMatch --> Success
```
Diagram: Progressive Chunk Matching with Fallback
Sources: tools/deepwiki-scraper.py:708-746
Chunk Size Progression
The algorithm tries progressively smaller chunk sizes to accommodate variations in text formatting between the JavaScript context and the html2text-converted markdown:
| Chunk Size | Use Case | Success Rate |
|---|---|---|
| 300 chars | Perfect or near-perfect matches | Highest precision |
| 200 chars | Minor formatting differences | Good precision |
| 150 chars | Moderate text variations | Acceptable precision |
| 100 chars | Significant reformatting | Lower precision |
| 80 chars | Minimal context available | Lowest precision |
| Heading match | Fallback when text matching fails | Score: 50 |
The algorithm stops at the first successful match, prioritizing larger chunks for higher confidence.
Sources: tools/deepwiki-scraper.py:716-730 README.md:134
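A sketch of the progressive matching loop under the assumptions above (the helper name and the offset-to-line conversion are illustrative, not the actual source):

```python
def find_match_line(lines, anchor_text, heading):
    """Sketch: try shrinking tails of the anchor, then fall back to the heading."""
    content_norm = ' '.join(' '.join(lines).lower().split())
    anchor_norm = ' '.join(anchor_text.lower().split())
    heading_norm = heading.lower().replace('#', '').strip()

    for chunk_size in (300, 200, 150, 100, 80):
        chunk = anchor_norm[-chunk_size:]
        pos = content_norm.find(chunk) if chunk else -1
        if pos == -1:
            continue
        # Map the offset in the normalized text back to a source line number.
        consumed = 0
        for line_num, line in enumerate(lines):
            norm = ' '.join(line.lower().split())
            if not norm:
                continue
            if pos < consumed + len(norm):
                return line_num, chunk_size     # score reflects the chunk size matched
            consumed += len(norm) + 1           # +1 for the joining space

    # Fallback: match the diagram's nearest heading with a low fixed score.
    if heading_norm:
        for line_num, line in enumerate(lines):
            if line.strip().startswith('#') and heading_norm in line.lower():
                return line_num, 50
    return None, 0
```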
Insertion Point Logic
After finding a match, the system determines the precise line number where the diagram should be inserted.
Insertion Algorithm
```mermaid
flowchart TD
Start["Found best_match_line"]
CheckType["lines[best_match_line].strip().startswith('#')?"]
subgraph "Heading Case"
H1["insert_line = best_match_line + 1"]
H2["Skip blank lines after heading"]
H3["Skip through paragraph content"]
H4["Stop at next blank line or heading"]
end
subgraph "Paragraph Case"
P1["insert_line = best_match_line + 1"]
P2["Find end of current paragraph"]
P3["Stop at next blank line or heading"]
end
subgraph "Insertion Format"
I1["Insert: empty line"]
I2["Insert: ```mermaid"]
I3["Insert: diagram code"]
I4["Insert: ```"]
I5["Insert: empty line"]
end
Start --> CheckType
CheckType -->|Heading| H1
CheckType -->|Paragraph| P1
H1 --> H2
H2 --> H3
H3 --> H4
P1 --> P2
P2 --> P3
H4 --> I1
P3 --> I1
I1 --> I2
I2 --> I3
I3 --> I4
I4 --> I5
I5 --> Append["Append to pending_insertions list:\n(insert_line, diagram, score, idx)"]
```
Diagram: Insertion Point Calculation
Sources: tools/deepwiki-scraper.py:747-768
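A sketch of this insertion-point rule, again with assumed names; the real implementation may differ in how it walks past the paragraph:

```python
def compute_insert_line(lines, best_match_line):
    """Sketch: place the diagram after the matched heading's first paragraph,
    or after the end of the matched paragraph."""
    insert_line = best_match_line + 1
    if lines[best_match_line].strip().startswith('#'):
        # Heading case: skip blank lines between the heading and its paragraph.
        while insert_line < len(lines) and not lines[insert_line].strip():
            insert_line += 1
    # Advance to the end of the current paragraph, stopping at a blank line or heading.
    while (insert_line < len(lines)
           and lines[insert_line].strip()
           and not lines[insert_line].strip().startswith('#')):
        insert_line += 1
    return insert_line


def diagram_block(diagram):
    """Sketch of the insertion format: a blank-line-padded, fenced Mermaid block."""
    return ['', '```mermaid', *diagram.split('\n'), '```', '']
```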
Batch Insertion Strategy
Diagrams are inserted in descending line order to avoid invalidating insertion points:
```mermaid
graph LR
Collect["Collect all\npending_insertions"]
Sort["Sort by insert_line\n(descending)"]
Insert["Insert from bottom to top\npreserves line numbers"]
Write["Write enhanced file\nto temp_dir"]
Collect --> Sort
Sort --> Insert
Insert --> Write
```
Diagram: Batch Insertion Order
Implementation: a minimal sketch of the bottom-up insertion step (the tuple layout follows the diagram above; names are illustrative, not the verbatim source):
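```python
def apply_insertions(lines, pending_insertions):
    """Sketch: apply (insert_line, diagram, score, idx) tuples bottom-up so earlier
    insertions never shift the line numbers of the ones still pending."""
    for insert_line, diagram, _score, _idx in sorted(pending_insertions,
                                                     key=lambda p: p[0],
                                                     reverse=True):
        lines[insert_line:insert_line] = diagram_block(diagram)
    return lines
```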
Sources: tools/deepwiki-scraper.py:771-783
File Processing Workflow
Phase 2 operates on files in the temporary directory created by Phase 1, enhancing them in-place before they are moved to the final output directory.
Processing Loop
```mermaid
sequenceDiagram
participant Main as extract_and_enhance_diagrams()
participant Glob as temp_dir.glob('**/*.md')
participant File as Individual .md file
participant Matcher as Fuzzy Matcher
participant Writer as File Writer
Main->>Main: Extract all diagram_contexts
Main->>Glob: Find all markdown files
loop For each md_file
Glob->>File: Open and read content
File->>File: Check if '```mermaid' already present
alt Already has diagrams
File->>Glob: Skip (continue)
else No diagrams
File->>Matcher: Normalize content
loop For each diagram_context
Matcher->>Matcher: Try progressive chunk matching
Matcher->>Matcher: Try heading fallback
Matcher->>Matcher: Record best match
end
Matcher->>File: Return pending_insertions list
File->>File: Sort insertions (descending)
File->>File: Insert diagrams bottom-up
File->>Writer: Write enhanced content
Writer->>Main: Increment enhanced_count
end
end
Main->>Main: Print summary
```
Diagram: File Processing Sequence
Sources: tools/deepwiki-scraper.py:676-788
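Tying the pieces together, a sketch of the per-file loop, reusing the illustrative helpers from the earlier sketches (the real function tracks additional state and reporting):

```python
from pathlib import Path

def enhance_markdown_files(temp_dir, diagram_contexts):
    """Sketch of the Phase 2 file loop over Phase 1's temporary output."""
    enhanced_count = 0
    for md_file in Path(temp_dir).glob('**/*.md'):
        content = md_file.read_text(encoding='utf-8')
        if '```mermaid' in content:
            continue                            # already enhanced; avoid duplicates

        lines = content.split('\n')
        pending, used = [], set()
        for idx, ctx in enumerate(diagram_contexts):
            if idx in used:
                continue
            match_line, score = find_match_line(lines, ctx['anchor_text'],
                                                ctx['last_heading'])
            if match_line is not None:
                insert_line = compute_insert_line(lines, match_line)
                pending.append((insert_line, ctx['diagram'], score, idx))
                used.add(idx)

        if pending:
            md_file.write_text('\n'.join(apply_insertions(lines, pending)),
                               encoding='utf-8')
            enhanced_count += 1
    return enhanced_count
```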
Performance Characteristics
Extraction Statistics
From a typical wiki with ~10 pages:
| Metric | Value | Location |
|---|---|---|
| Total diagrams in JS payload | ~461 | README.md:132 |
| Diagrams with context (500+ chars) | ~48 | README.md:133 |
| Context window size | 500 characters | tools/deepwiki-scraper.py:621 |
| Anchor text max length | 300 characters | tools/deepwiki-scraper.py:666 |
| Typical enhanced files | Varies by content | Printed in output |
Sources: README.md:132-133 tools/deepwiki-scraper.py:674 tools/deepwiki-scraper.py:788
Matching Performance
The progressive chunk size strategy balances precision and recall:
- High precision matches (300-200 chars): Strong contextual alignment
- Medium precision matches (150-100 chars): Acceptable with some risk
- Low precision matches (80 chars): Risk of false positives
- Heading-only matches (score: 50): Last resort fallback
The algorithm prefers to skip a diagram rather than place it incorrectly, prioritizing documentation quality over diagram count.
Sources: tools/deepwiki-scraper.py:716-745
Integration with Phases 1 and 3
Input Requirements (from Phase 1)
- Clean markdown files in `temp_dir`
- Files must not already contain ```` ```mermaid ```` blocks
- Proper heading structure for fallback matching
- Normalized link structure
Sources: tools/deepwiki-scraper.py:810-877
Output Guarantees (for Phase 3)
- Enhanced markdown files in `temp_dir`
- Diagrams inserted with proper fencing: ```` ```mermaid ... ``` ````
- Blank lines before and after diagrams for proper rendering
- Original file structure preserved (section-N directories maintained)
- Atomic file operations (write complete file or skip)
Sources: tools/deepwiki-scraper.py:781-786 tools/deepwiki-scraper.py:883-908
Workflow Integration
Diagram: Three-Phase Integration
Sources: README.md:123-145 tools/deepwiki-scraper.py:810-916
Error Handling and Edge Cases
Skipped Files
Files are skipped if they already contain Mermaid diagrams, to avoid duplicate insertion; this is the same guard shown in the processing-loop sketch above.
Sources: tools/deepwiki-scraper.py:686-687
Failed Matches
When a diagram cannot be matched:
- The diagram is not inserted (conservative approach)
- No error is raised (continues processing other diagrams)
- File is left unmodified if no diagrams match
Sources: tools/deepwiki-scraper.py:699-746
Network Errors
If diagram extraction fails (network error, changed HTML structure):
- Warning is printed but Phase 2 continues
- Phase 1 files remain valid
- System can still proceed to Phase 3 without diagrams
Sources: tools/deepwiki-scraper.py:610-612
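A sketch of that non-fatal handling, assuming the illustrative `extract_diagram_contexts` helper from earlier:

```python
import requests

try:
    html = requests.get("https://deepwiki.com/owner/repo/1-overview", timeout=30).text
    diagram_contexts = extract_diagram_contexts(html)
except Exception as exc:  # network error, changed page structure, etc.
    print(f"  Warning: diagram extraction failed ({exc}); continuing without diagrams")
    diagram_contexts = []  # Phase 1 output stays valid; Phase 3 can still run
```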
Diagram Quality Thresholds
| Threshold | Purpose |
|---|---|
| `len(diagram) > 10` | Filter out trivial/invalid diagram code |
| `len(anchor) > 50` | Ensure sufficient context for matching |
| `len(line) > 20` | Filter out short lines from anchor text |
| `chunk_size >= 80` | Minimum viable match size |
Sources: tools/deepwiki-scraper.py:648 tools/deepwiki-scraper.py:712 tools/deepwiki-scraper.py:661
Summary
Phase 2 implements a sophisticated fuzzy matching system that:
- Extracts all Mermaid diagrams from DeepWiki's JavaScript payload using regex patterns
- Processes diagram context to extract heading and anchor text metadata
- Matches diagrams to markdown files using progressive chunk size comparison (300→80 chars)
- Inserts diagrams after relevant paragraphs with proper formatting
- Validates through conservative matching to avoid false positives
The phase operates entirely on files in the temporary directory, leaving Phase 1's output intact while preparing enhanced files for Phase 3's mdBook build process.
Sources: README.md:130-136 tools/deepwiki-scraper.py:596-789