Diagram Extraction from Next.js
Purpose and Scope
This document details how Mermaid diagrams are extracted from DeepWiki's Next.js JavaScript payload. DeepWiki uses client-side rendering for diagrams, embedding them as escaped strings within the HTML's JavaScript data structures. This page covers the extraction algorithms, regex patterns, unescaping logic, and deduplication mechanisms used to recover these diagrams.
For information about how extracted diagrams are matched to content and injected into Markdown files, see Fuzzy Diagram Matching Algorithm. For the overall diagram enhancement workflow, see Phase 2: Diagram Enhancement.
The Next.js Data Payload Problem
DeepWiki's architecture presents a unique challenge for diagram extraction. The application uses Next.js with client-side rendering, where Mermaid diagrams are embedded in the JavaScript payload rather than being present in the static HTML. Furthermore, the JavaScript payload contains diagrams from all pages in the wiki, not just the currently viewed page, making per-page extraction impossible without additional context matching.
Diagram: Next.js Payload Structure
graph TB
subgraph "Browser View"
HTML["HTML Response\nfrom deepwiki.com"]
end
subgraph "Embedded JavaScript"
JSPayload["Next.js Data Payload\nMixed content from all pages"]
DiagramData["Mermaid Diagrams\nAs escaped strings"]
end
subgraph "String Format"
EscapedFormat["```mermaid\\ \ngraph TD\\ \nA --> B\\ \n```"]
UnescapedFormat["```mermaid\ngraph TD\nA --> B\n```"]
end
HTML --> JSPayload
JSPayload --> DiagramData
DiagramData --> EscapedFormat
EscapedFormat -.->|extract_mermaid_from_nextjs_data| UnescapedFormat
note1["Problem: Diagrams from\nALL wiki pages mixed together"]
JSPayload -.-> note1
note2["Problem: Escape sequences\n\\\n, \\ , \\\", etc."]
EscapedFormat -.-> note2
Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674
The key characteristics of this data format:
| Characteristic | Description | Impact |
|---|---|---|
| Escaped newlines | Literal \\n instead of newline characters | Requires unescaping before use |
| Mixed content | All pages' diagrams in one payload | Requires context matching (Phase 2) |
| Unicode escapes | Sequences like \\u003c for < | Requires comprehensive unescape logic |
| String wrapping | Diagrams wrapped in JavaScript strings | Requires careful quote handling |
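To make the format concrete, here is a hypothetical payload fragment exhibiting these characteristics (illustrative only, not actual DeepWiki output):

```python
# Hypothetical Next.js payload fragment (illustrative; not actual DeepWiki output).
# The diagram is data inside a JavaScript string: newlines are the two-character
# sequence backslash-n, and '>' appears as the unicode escape \u003e.
payload_fragment = (
    '{"content":"Intro text\\n\\n'
    '```mermaid\\ngraph TD\\nA --\\u003e B\\n```\\n"}'
)

# No real newline characters are present, only escaped ones:
assert "\n" not in payload_fragment
assert "\\n" in payload_fragment and "\\u003e" in payload_fragment
```

Until these escapes are reversed, the string is not renderable Mermaid code.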
Extraction Entry Point
The extract_mermaid_from_nextjs_data() function serves as the primary extraction mechanism. It is called during Phase 2 of the pipeline when processing the HTML response from any DeepWiki page.
Diagram: Extraction Function Flow
flowchart TD
Start["extract_mermaid_from_nextjs_data(html_text)"]
Strategy1["Strategy 1:\nFenced Block Pattern\n```mermaid\\\n(.*?)```"]
Check1{{"Blocks found?"}}
Strategy2["Strategy 2:\nJavaScript String Scan\nSearch for diagram keywords"]
Unescape["Unescape all blocks:\n\\\n→ newline\n\\ → tab\n\< → <"]
Dedup["Deduplicate by fingerprint\nFirst 100 chars"]
Return["Return unique_blocks"]
Start --> Strategy1
Strategy1 --> Check1
Check1 -->|Yes| Unescape
Check1 -->|No| Strategy2
Strategy2 --> Unescape
Unescape --> Dedup
Dedup --> Return
Sources: tools/deepwiki-scraper.py:218-331
Strategy 1: Fenced Mermaid Block Pattern
The primary extraction strategy uses a regex pattern to locate fenced Mermaid code blocks within the JavaScript payload. These blocks follow the Markdown convention but with escaped newlines.
Regex Pattern: r'```mermaid\\n(.*?)```'
This pattern specifically targets:
- Opening fence: ```mermaid
- Escaped newline: \\n (a literal backslash-n in the string)
- Diagram content: (.*?) (non-greedy capture)
- Closing fence: ```
Diagram: Fenced Block Extraction Process
Diagram: Fenced Block Extraction Process
Sources: tools/deepwiki-scraper.py:223-244
Code Implementation:
The extraction loop at tools/deepwiki-scraper.py:223-244 implements this strategy:
- Pattern matching: uses re.finditer() with the re.DOTALL flag to handle multi-line diagrams
- Content extraction: captures the diagram code via match.group(1)
- Unescaping: applies comprehensive escape sequence replacement
- Validation: filters blocks with len(block) > 10 to exclude empty matches
- Logging: prints the first 50 characters and line count for diagnostics
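A self-contained sketch of Strategy 1, following the steps above (function name hypothetical):

```python
import re

def extract_fenced_blocks(html_text: str) -> list[str]:
    """Strategy 1 sketch: find fenced mermaid blocks with escaped newlines."""
    blocks = []
    # The payload contains a literal backslash-n after the opening fence,
    # so the raw-string pattern escapes the backslash: \\n
    for match in re.finditer(r'```mermaid\\n(.*?)```', html_text, re.DOTALL):
        block = match.group(1)
        if len(block) > 10:  # filter out empty/near-empty matches
            blocks.append(block)
    return blocks

html = 'data: "```mermaid\\ngraph TD\\nA --> B\\n```"'
print(extract_fenced_blocks(html))  # one block, still escaped: 'graph TD\\nA --> B\\n'
```

Note that the captured content still contains escaped newlines; unescaping happens in a later step.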
Strategy 2: JavaScript String Scanning
When Strategy 1 fails to find fenced blocks, the function falls back to scanning for raw diagram strings embedded in JavaScript. This handles cases where diagrams are stored as plain strings without Markdown fencing.
Diagram: JavaScript String Scan Algorithm
flowchart TD
Start["For each diagram keyword"]
Keywords["Keywords:\ngraph TD, graph TB,\nflowchart TD, sequenceDiagram,\nclassDiagram"]
FindKW["pos = html_text.find(keyword, pos)"]
CheckFound{{"Keyword found?"}}
BackwardScan["Scan backwards 20 chars\nFind opening quote"]
QuoteFound{{"Quote found?"}}
ForwardScan["Scan forward up to 10000 chars\nFind closing quote\nSkip escaped quotes"]
Extract["Extract string_start:string_end"]
UnescapeValidate["Unescape and validate\nMust have 3+ lines"]
Append["Append to mermaid_blocks"]
NextPos["pos += 1, continue search"]
Start --> Keywords
Keywords --> FindKW
FindKW --> CheckFound
CheckFound -->|Yes| BackwardScan
CheckFound -->|No, break| End["Move to next keyword"]
BackwardScan --> QuoteFound
QuoteFound -->|Yes| ForwardScan
QuoteFound -->|No| NextPos
ForwardScan --> Extract
Extract --> UnescapeValidate
UnescapeValidate --> Append
Append --> NextPos
NextPos --> FindKW
Sources: tools/deepwiki-scraper.py:246-302
Keyword List:
The algorithm searches for Mermaid diagram type indicators such as graph TD, graph TB, flowchart TD, sequenceDiagram, and classDiagram.
Quote Handling Logic:
The forward scan at tools/deepwiki-scraper.py:273-285 implements careful quote detection:
- Scans up to 10,000 characters forward (safety limit)
- Checks whether the previous character is \ to identify escaped quotes
- Breaks on the first unescaped " character
- Returns to search position + 1 if no closing quote is found
Unescape Processing
All extracted diagram blocks undergo comprehensive unescaping to convert JavaScript string representations into valid Mermaid code. The unescaping process handles multiple escape sequence types.
Escape Sequence Mapping :
| Escaped Form | Unescaped Result | Purpose |
|---|---|---|
| \\n | \n (newline) | Line breaks in diagram code |
| \\t | \t (tab) | Indentation |
| \\" | " (quote) | String literals in labels |
| \\\\ | \ (backslash) | Literal backslashes |
| \\u003c | < | Less-than symbol |
| \\u003e | > | Greater-than symbol |
| \\u0026 | & | Ampersand |
Diagram: Unescape Transformation Pipeline
Sources: tools/deepwiki-scraper.py:231-238 tools/deepwiki-scraper.py:289-295
Implementation Details:
The unescaping sequence at tools/deepwiki-scraper.py:231-238 executes in a specific order to prevent double-processing:
- Newlines first: \\n → \n (most common)
- Tabs: \\t → \t (whitespace)
- Quotes: \\" → " (before backslash handling, to avoid conflicts)
- Backslashes: \\\\ → \ (last, to avoid interfering with other escapes)
- Unicode: \\u003c, \\u003e, \\u0026 → <, >, &
The order matters: processing backslashes before quotes would incorrectly unescape \\" sequences.
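A minimal sketch of the replacement chain in that order (function name hypothetical; replacements taken from the mapping table above):

```python
def unescape_block(block: str) -> str:
    """Unescape a diagram block in the order described above.

    Quotes are handled before backslashes so \\" collapses correctly.
    """
    block = block.replace('\\n', '\n')    # newlines first (most common)
    block = block.replace('\\t', '\t')    # tabs
    block = block.replace('\\"', '"')     # quotes, before backslash handling
    block = block.replace('\\\\', '\\')   # literal backslashes last
    block = block.replace('\\u003c', '<')
    block = block.replace('\\u003e', '>')
    block = block.replace('\\u0026', '&')
    return block

print(unescape_block('graph TD\\nA[\\"start\\"] --\\u003e B'))
```

Running this on the escaped input yields valid Mermaid code with a real newline, unescaped quotes, and a > arrow head.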
Deduplication Mechanism
Since multiple extraction strategies may find the same diagram (once as a fenced block, once as a JavaScript string), the function implements fingerprint-based deduplication.
Diagram: Deduplication Algorithm
flowchart TD
Start["Input: mermaid_blocks[]\n(may contain duplicates)"]
Init["Initialize:\nunique_blocks = []\nseen = set()"]
Loop["For each block in mermaid_blocks"]
Fingerprint["fingerprint = block[:100]\n(first 100 chars)"]
CheckSeen{{"fingerprint in seen?"}}
Skip["Skip duplicate"]
Add["Add to seen set\nAppend to unique_blocks"]
Return["Return unique_blocks"]
Start --> Init
Init --> Loop
Loop --> Fingerprint
Fingerprint --> CheckSeen
CheckSeen -->|Yes| Skip
CheckSeen -->|No| Add
Skip --> Loop
Add --> Loop
Loop -->|Done| Return
Sources: tools/deepwiki-scraper.py:304-311
Fingerprint Strategy:
The deduplication at tools/deepwiki-scraper.py:304-311 uses the first 100 characters as a unique identifier. This approach:
- Avoids exact string comparison: saves memory and time for large diagrams
- Tolerates minor variations: differences past the first 100 characters don't create duplicates
- Preserves order: the first occurrence wins (FIFO)
- Works across strategies: catches duplicates from both extraction methods
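The deduplication loop can be sketched as follows (a simplified rendering of the described algorithm):

```python
def dedupe(mermaid_blocks):
    """Fingerprint-based deduplication: first 100 chars identify a diagram."""
    unique_blocks = []
    seen = set()
    for block in mermaid_blocks:
        fingerprint = block[:100]  # cheap identity, even for very large diagrams
        if fingerprint in seen:
            continue  # skip duplicate; first occurrence wins
        seen.add(fingerprint)
        unique_blocks.append(block)
    return unique_blocks

blocks = ["graph TD\nA --> B", "graph TD\nA --> B", "sequenceDiagram\nA->>B: hi"]
print(len(dedupe(blocks)))  # 2
```

Using a set of fingerprints keeps the membership test O(1) per block while preserving the input order of first occurrences.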
Integration with Enhancement Pipeline
The extract_mermaid_from_nextjs_data() function is called from extract_and_enhance_diagrams() during Phase 2 processing. The integration pattern extracts diagrams globally, then distributes them to individual pages through context matching.
Diagram: Phase 2 Integration Flow
sequenceDiagram
participant Main as "extract_and_enhance_diagrams()"
participant HTTP as "requests.Session"
participant Extract as "extract_mermaid_from_nextjs_data()"
participant Context as "Context Extraction"
participant Files as "Markdown Files"
Main->>HTTP: GET https://deepwiki.com/repo/1-overview
HTTP-->>Main: HTML response (all diagrams)
Main->>Extract: extract_mermaid_from_nextjs_data(html_text)
Extract->>Extract: Strategy 1: Fenced blocks
Extract->>Extract: Strategy 2: JS strings
Extract->>Extract: Unescape all
Extract->>Extract: Deduplicate
Extract-->>Main: all_diagrams[] (~461 diagrams)
Main->>Context: Extract with 500-char context
Context-->>Main: diagram_contexts[] (~48 with context)
Main->>Files: Fuzzy match and inject into *.md
Files-->>Main: enhanced_count files modified
Sources: tools/deepwiki-scraper.py:596-674 tools/deepwiki-scraper.py:604-612
Call Site:
The extraction is invoked at tools/deepwiki-scraper.py:604-612 within extract_and_enhance_diagrams():
- Fetch any page: typically uses /1-overview, since all diagrams are in every page's payload
- Extract globally: calls extract_mermaid_from_nextjs_data() on the full HTML response
- Count total: logs the total diagram count (~461 in typical repositories)
- Extract context: a secondary regex pass captures surrounding text (see Fuzzy Diagram Matching Algorithm)
Alternative Pattern Search:
Phase 2 also performs a second extraction pass at tools/deepwiki-scraper.py:615-646 with context:
- Pattern: r'([^`]{500,}?)```mermaid\\n(.*?)```'
- Purpose: captures 500+ characters before each diagram for context matching
- Result: diagram_contexts[] with last_heading, anchor_text, and diagram fields
- Filtering: only diagrams with meaningful context are used for fuzzy matching
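The second pass can be sketched as follows; the regex here assumes 500+ non-backtick characters before the fence, and the context fields are simplified (last_heading extraction omitted):

```python
import re

# Context-capturing pass (sketch). Excluding backticks from the context group
# prevents the captured text from running through an earlier fenced block.
CONTEXT_RE = re.compile(r'([^`]{500,}?)```mermaid\\n(.*?)```', re.DOTALL)

def extract_with_context(html_text):
    contexts = []
    for m in CONTEXT_RE.finditer(html_text):
        context, diagram = m.group(1), m.group(2)
        contexts.append({
            "anchor_text": context[-300:],  # text immediately before the diagram
            "diagram": diagram,             # still escaped at this point
        })
    return contexts

html = "x" * 600 + "```mermaid\\ngraph TD\\nA --> B```"
print(len(extract_with_context(html)))  # 1
```

Diagrams with fewer than 500 characters of preceding context simply don't match this pattern, which is one reason the context pass yields far fewer diagrams (~48) than the global pass (~461).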
Error Handling and Diagnostics
The extraction function includes comprehensive error handling and diagnostic output to aid in debugging and monitoring extraction quality.
Error Handling Strategy :
Sources: tools/deepwiki-scraper.py:327-331
Diagnostic Output :
The function provides detailed logging at tools/deepwiki-scraper.py:244, tools/deepwiki-scraper.py:248, tools/deepwiki-scraper.py:300, and tools/deepwiki-scraper.py:314-316:
| Output Message | Condition | Purpose |
|---|---|---|
"Found mermaid diagram: {first_50}... ({lines} lines)" | Each successful extraction | Verify diagram content |
"No fenced mermaid blocks found, trying JavaScript extraction..." | Strategy 1 fails | Indicate fallback |
"Found JS mermaid diagram: {first_50}... ({lines} lines)" | Strategy 2 success | Show fallback results |
"Extracted {count} unique mermaid diagram(s)" | Deduplication complete | Report final count |
"Warning: No valid mermaid diagrams extracted" | Zero diagrams found | Alert to potential issues |
"Warning: Failed to extract mermaid from page data: {e}" | Exception caught | Debug extraction failures |
Performance Characteristics
The extraction algorithm exhibits specific performance characteristics relevant to large wiki repositories.
Complexity Analysis :
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Strategy 1 regex | O(n) | O(m) | n = HTML length, m = diagram count |
| Strategy 2 scan | O(n × k) | O(m) | k = keyword count (10) |
| Unescaping | O(m × d) | O(m × d) | d = avg diagram length |
| Deduplication | O(m) | O(m) | Uses 100-char fingerprint |
| Total | O(n × k) | O(m × d) | Dominated by Strategy 2 |
Typical Performance :
Based on the diagnostic output patterns at tools/deepwiki-scraper.py:314:
- Input size : ~2-5 MB HTML response
- Extraction time : ~200-500ms (dominated by regex operations)
- Diagrams found : ~461 total diagrams
- Diagrams with context : ~48 after filtering
- Memory usage : ~1-2 MB for diagram storage (ephemeral)
Optimization Opportunities :
The current implementation prioritizes correctness over performance. Potential optimizations:
- Early termination : Stop Strategy 2 after finding sufficient diagrams
- Compiled patterns : Pre-compile regex patterns (currently done inline)
- Streaming extraction : Process HTML in chunks rather than loading entirely
- Fingerprint cache : Persist fingerprints across runs to avoid re-extraction
However, given typical execution times (<1 second), these optimizations are not currently necessary.
Summary
The Next.js diagram extraction mechanism solves the challenge of recovering client-side rendered Mermaid diagrams from DeepWiki's JavaScript payload. The implementation uses a two-strategy approach (fenced blocks and JavaScript string scanning), comprehensive unescaping logic, and fingerprint-based deduplication to reliably extract hundreds of diagrams from a single HTML response. The extracted diagrams are then passed to the fuzzy matching algorithm (see Fuzzy Diagram Matching Algorithm) for intelligent placement in the appropriate Markdown files.
Key Functions and Components :
| Component | Location | Purpose |
|---|---|---|
| extract_mermaid_from_nextjs_data() | tools/deepwiki-scraper.py:218-331 | Main extraction function |
| Strategy 1 regex | tools/deepwiki-scraper.py:223-244 | Fenced block pattern matching |
| Strategy 2 scanner | tools/deepwiki-scraper.py:246-302 | JavaScript string scanning |
| Deduplication | tools/deepwiki-scraper.py:304-311 | Fingerprint-based uniqueness |
| Phase 2 integration | tools/deepwiki-scraper.py:604-612 | Call site in enhancement pipeline |
Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674