Diagram Extraction from Next.js
Purpose and Scope
This document details how Mermaid diagrams are extracted from DeepWiki's Next.js JavaScript payload. DeepWiki uses client-side rendering for diagrams, embedding them as escaped strings within the HTML's JavaScript data structures. This page covers the extraction algorithms, regex patterns, unescaping logic, and deduplication mechanisms used to recover these diagrams.
For information about how extracted diagrams are matched to content and injected into Markdown files, see Fuzzy Diagram Matching Algorithm. For the overall diagram enhancement workflow, see Phase 2: Diagram Enhancement.
The Next.js Data Payload Problem
DeepWiki's architecture presents a unique challenge for diagram extraction. The application uses Next.js with client-side rendering, where Mermaid diagrams are embedded in the JavaScript payload rather than being present in the static HTML. Furthermore, the JavaScript payload contains diagrams from all pages in the wiki, not just the currently viewed page, making per-page extraction impossible without additional context matching.
Diagram: Next.js Payload Structure
graph TB
subgraph "Browser View"
HTML["HTML Response\nfrom deepwiki.com"]
end
subgraph "Embedded JavaScript"
JSPayload["Next.js Data Payload\nMixed content from all pages"]
DiagramData["Mermaid Diagrams\nAs escaped strings"]
end
subgraph "String Format"
EscapedFormat["```mermaid\\ \ngraph TD\\ \nA --> B\\ \n```"]
UnescapedFormat["```mermaid\ngraph TD\nA --> B\n```"]
end
HTML --> JSPayload
JSPayload --> DiagramData
DiagramData --> EscapedFormat
EscapedFormat -.->|extract_mermaid_from_nextjs_data| UnescapedFormat
note1["Problem: Diagrams from\nALL wiki pages mixed together"]
JSPayload -.-> note1
note2["Problem: Escape sequences\n\\\n, \\ , \\\", etc."]
EscapedFormat -.-> note2
Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674
The key characteristics of this data format:
| Characteristic | Description | Impact |
|---|---|---|
| Escaped newlines | Literal \\n instead of newline characters | Requires unescaping before use |
| Mixed content | All pages' diagrams in one payload | Requires context matching (Phase 2) |
| Unicode escapes | Sequences like \\u003c for < | Requires comprehensive unescape logic |
| String wrapping | Diagrams wrapped in JavaScript strings | Requires careful quote handling |
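To make the format concrete, here is a hypothetical payload fragment exhibiting these characteristics (illustrative only, not actual DeepWiki output):

```python
# Hypothetical Next.js payload fragment (illustrative; not actual DeepWiki output).
# The diagram is data inside a JavaScript string: newlines are the two-character
# sequence backslash-n, and '>' appears as the unicode escape \u003e.
payload_fragment = (
    '{"content":"Intro text\\n\\n'
    '```mermaid\\ngraph TD\\nA --\\u003e B\\n```\\n"}'
)

# No real newline characters are present, only escaped ones:
assert "\n" not in payload_fragment
assert "\\n" in payload_fragment and "\\u003e" in payload_fragment
```

Until these escapes are reversed, the string is not renderable Mermaid code.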
Extraction Entry Point
The extract_mermaid_from_nextjs_data() function serves as the primary extraction mechanism. It is called during Phase 2 of the pipeline when processing the HTML response from any DeepWiki page.
Diagram: Extraction Function Flow
flowchart TD
Start["extract_mermaid_from_nextjs_data(html_text)"]
Strategy1["Strategy 1:\nFenced Block Pattern\n```mermaid\\\n(.*?)```"]
Check1{{"Blocks found?"}}
Strategy2["Strategy 2:\nJavaScript String Scan\nSearch for diagram keywords"]
Unescape["Unescape all blocks:\n\\\n→ newline\n\\ → tab\n\< → <"]
Dedup["Deduplicate by fingerprint\nFirst 100 chars"]
Return["Return unique_blocks"]
Start --> Strategy1
Strategy1 --> Check1
Check1 -->|Yes| Unescape
Check1 -->|No| Strategy2
Strategy2 --> Unescape
Unescape --> Dedup
Dedup --> Return
Sources: tools/deepwiki-scraper.py:218-331
Strategy 1: Fenced Mermaid Block Pattern
The primary extraction strategy uses a regex pattern to locate fenced Mermaid code blocks within the JavaScript payload. These blocks follow the Markdown convention but with escaped newlines.
Regex Pattern: r'```mermaid\\n(.*?)```'
This pattern specifically targets:
- Opening fence: ```mermaid
- Escaped newline: \\n (a literal backslash-n in the string)
- Diagram content: (.*?) (non-greedy capture)
- Closing fence: ```
Diagram: Fenced Block Extraction Process
Diagram: Fenced Block Extraction Process
Sources: tools/deepwiki-scraper.py:223-244
Code Implementation:
The extraction loop at tools/deepwiki-scraper.py:223-244 implements this strategy:
- Pattern matching: uses re.finditer() with the re.DOTALL flag to handle multi-line diagrams
- Content extraction: captures the diagram code via match.group(1)
- Unescaping: applies comprehensive escape sequence replacement
- Validation: filters blocks with len(block) > 10 to exclude empty matches
- Logging: prints the first 50 characters and line count for diagnostics
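A self-contained sketch of Strategy 1, following the steps above (function name hypothetical):

```python
import re

def extract_fenced_blocks(html_text: str) -> list[str]:
    """Strategy 1 sketch: find fenced mermaid blocks with escaped newlines."""
    blocks = []
    # The payload contains a literal backslash-n after the opening fence,
    # so the raw-string pattern escapes the backslash: \\n
    for match in re.finditer(r'```mermaid\\n(.*?)```', html_text, re.DOTALL):
        block = match.group(1)
        if len(block) > 10:  # filter out empty/near-empty matches
            blocks.append(block)
    return blocks

html = 'data: "```mermaid\\ngraph TD\\nA --> B\\n```"'
print(extract_fenced_blocks(html))  # one block, still escaped: 'graph TD\\nA --> B\\n'
```

Note that the captured content still contains escaped newlines; unescaping happens in a later step.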
Strategy 2: JavaScript String Scanning
When Strategy 1 fails to find fenced blocks, the function falls back to scanning for raw diagram strings embedded in JavaScript. This handles cases where diagrams are stored as plain strings without Markdown fencing.
Diagram: JavaScript String Scan Algorithm
flowchart TD
Start["For each diagram keyword"]
Keywords["Keywords:\ngraph TD, graph TB,\nflowchart TD, sequenceDiagram,\nclassDiagram"]
FindKW["pos = html_text.find(keyword, pos)"]
CheckFound{{"Keyword found?"}}
BackwardScan["Scan backwards 20 chars\nFind opening quote"]
QuoteFound{{"Quote found?"}}
ForwardScan["Scan forward up to 10000 chars\nFind closing quote\nSkip escaped quotes"]
Extract["Extract string_start:string_end"]
UnescapeValidate["Unescape and validate\nMust have 3+ lines"]
Append["Append to mermaid_blocks"]
NextPos["pos += 1, continue search"]
Start --> Keywords
Keywords --> FindKW
FindKW --> CheckFound
CheckFound -->|Yes| BackwardScan
CheckFound -->|No, break| End["Move to next keyword"]
BackwardScan --> QuoteFound
QuoteFound -->|Yes| ForwardScan
QuoteFound -->|No| NextPos
ForwardScan --> Extract
Extract --> UnescapeValidate
UnescapeValidate --> Append
Append --> NextPos
NextPos --> FindKW
Sources: tools/deepwiki-scraper.py:246-302
Keyword List:
The algorithm searches for Mermaid diagram type indicators such as graph TD, graph TB, flowchart TD, sequenceDiagram, and classDiagram.
Quote Handling Logic:
The forward scan at tools/deepwiki-scraper.py:273-285 implements careful quote detection:
- Scans up to 10,000 characters forward (safety limit)
- Checks whether the previous character is \ to identify escaped quotes
- Breaks on the first unescaped " character
- Returns to search position + 1 if no closing quote is found
Unescape Processing
All extracted diagram blocks undergo comprehensive unescaping to convert JavaScript string representations into valid Mermaid code. The unescaping process handles multiple escape sequence types.
Escape Sequence Mapping :
| Escaped Form | Unescaped Result | Purpose |
|---|---|---|
| \\n | \n (newline) | Line breaks in diagram code |
| \\t | \t (tab) | Indentation |
| \\" | " (quote) | String literals in labels |
| \\\\ | \ (backslash) | Literal backslashes |
| \\u003c | < | Less-than symbol |
| \\u003e | > | Greater-than symbol |
| \\u0026 | & | Ampersand |
Diagram: Unescape Transformation Pipeline
Sources: tools/deepwiki-scraper.py:231-238 tools/deepwiki-scraper.py:289-295
Implementation Details:
The unescaping sequence at tools/deepwiki-scraper.py:231-238 executes in a specific order to prevent double-processing:
- Newlines first: \\n → \n (most common)
- Tabs: \\t → \t (whitespace)
- Quotes: \\" → " (before backslash handling, to avoid conflicts)
- Backslashes: \\\\ → \ (last, to avoid interfering with other escapes)
- Unicode: \\u003c, \\u003e, \\u0026 → <, >, &
The order matters: processing backslashes before quotes would incorrectly unescape \\" sequences.
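A minimal sketch of the replacement chain in that order (function name hypothetical; replacements taken from the mapping table above):

```python
def unescape_block(block: str) -> str:
    """Unescape a diagram block in the order described above.

    Quotes are handled before backslashes so \\" collapses correctly.
    """
    block = block.replace('\\n', '\n')    # newlines first (most common)
    block = block.replace('\\t', '\t')    # tabs
    block = block.replace('\\"', '"')     # quotes, before backslash handling
    block = block.replace('\\\\', '\\')   # literal backslashes last
    block = block.replace('\\u003c', '<')
    block = block.replace('\\u003e', '>')
    block = block.replace('\\u0026', '&')
    return block

print(unescape_block('graph TD\\nA[\\"start\\"] --\\u003e B'))
```

Running this on the escaped input yields valid Mermaid code with a real newline, unescaped quotes, and a > arrow head.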
Deduplication Mechanism
Since multiple extraction strategies may find the same diagram (once as a fenced block, once as a JavaScript string), the function implements fingerprint-based deduplication.
Diagram: Deduplication Algorithm
flowchart TD
Start["Input: mermaid_blocks[]\n(may contain duplicates)"]
Init["Initialize:\nunique_blocks = []\nseen = set()"]
Loop["For each block in mermaid_blocks"]
Fingerprint["fingerprint = block[:100]\n(first 100 chars)"]
CheckSeen{{"fingerprint in seen?"}}
Skip["Skip duplicate"]
Add["Add to seen set\nAppend to unique_blocks"]
Return["Return unique_blocks"]
Start --> Init
Init --> Loop
Loop --> Fingerprint
Fingerprint --> CheckSeen
CheckSeen -->|Yes| Skip
CheckSeen -->|No| Add
Skip --> Loop
Add --> Loop
Loop -->|Done| Return
Sources: tools/deepwiki-scraper.py:304-311
Fingerprint Strategy:
The deduplication at tools/deepwiki-scraper.py:304-311 uses the first 100 characters as a unique identifier. This approach:
- Avoids exact string comparison: saves memory and time for large diagrams
- Tolerates minor variations: differences past the first 100 characters don't create duplicates
- Preserves order: the first occurrence wins (FIFO)
- Works across strategies: catches duplicates from both extraction methods
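The deduplication loop can be sketched as follows (a simplified rendering of the described algorithm):

```python
def dedupe(mermaid_blocks):
    """Fingerprint-based deduplication: first 100 chars identify a diagram."""
    unique_blocks = []
    seen = set()
    for block in mermaid_blocks:
        fingerprint = block[:100]  # cheap identity, even for very large diagrams
        if fingerprint in seen:
            continue  # skip duplicate; first occurrence wins
        seen.add(fingerprint)
        unique_blocks.append(block)
    return unique_blocks

blocks = ["graph TD\nA --> B", "graph TD\nA --> B", "sequenceDiagram\nA->>B: hi"]
print(len(dedupe(blocks)))  # 2
```

Using a set of fingerprints keeps the membership test O(1) per block while preserving the input order of first occurrences.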
Integration with Enhancement Pipeline
The extract_mermaid_from_nextjs_data() function is called from extract_and_enhance_diagrams() during Phase 2 processing. The integration pattern extracts diagrams globally, then distributes them to individual pages through context matching.
Diagram: Phase 2 Integration Flow
sequenceDiagram
participant Main as "extract_and_enhance_diagrams()"
participant HTTP as "requests.Session"
participant Extract as "extract_mermaid_from_nextjs_data()"
participant Context as "Context Extraction"
participant Files as "Markdown Files"
Main->>HTTP: GET https://deepwiki.com/repo/1-overview
HTTP-->>Main: HTML response (all diagrams)
Main->>Extract: extract_mermaid_from_nextjs_data(html_text)
Extract->>Extract: Strategy 1: Fenced blocks
Extract->>Extract: Strategy 2: JS strings
Extract->>Extract: Unescape all
Extract->>Extract: Deduplicate
Extract-->>Main: all_diagrams[] (~461 diagrams)
Main->>Context: Extract with 500-char context
Context-->>Main: diagram_contexts[] (~48 with context)
Main->>Files: Fuzzy match and inject into *.md
Files-->>Main: enhanced_count files modified
Sources: tools/deepwiki-scraper.py:596-674 tools/deepwiki-scraper.py:604-612
Call Site:
The extraction is invoked at tools/deepwiki-scraper.py:604-612 within extract_and_enhance_diagrams():
- Fetch any page: typically uses /1-overview, since all diagrams are in every page's payload
- Extract globally: calls extract_mermaid_from_nextjs_data() on the full HTML response
- Count total: logs the total diagram count (~461 in typical repositories)
- Extract context: a secondary regex pass captures surrounding text (see Fuzzy Diagram Matching Algorithm)
Alternative Pattern Search:
Phase 2 also performs a second extraction pass at tools/deepwiki-scraper.py:615-646 with context:
- Pattern: r'([^`]{500,}?)```mermaid\\n(.*?)```'
- Purpose: captures 500+ characters before each diagram for context matching
- Result: diagram_contexts[] with last_heading, anchor_text, and diagram fields
- Filtering: only diagrams with meaningful context are used for fuzzy matching
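The second pass can be sketched as follows; the regex here assumes 500+ non-backtick characters before the fence, and the context fields are simplified (last_heading extraction omitted):

```python
import re

# Context-capturing pass (sketch). Excluding backticks from the context group
# prevents the captured text from running through an earlier fenced block.
CONTEXT_RE = re.compile(r'([^`]{500,}?)```mermaid\\n(.*?)```', re.DOTALL)

def extract_with_context(html_text):
    contexts = []
    for m in CONTEXT_RE.finditer(html_text):
        context, diagram = m.group(1), m.group(2)
        contexts.append({
            "anchor_text": context[-300:],  # text immediately before the diagram
            "diagram": diagram,             # still escaped at this point
        })
    return contexts

html = "x" * 600 + "```mermaid\\ngraph TD\\nA --> B```"
print(len(extract_with_context(html)))  # 1
```

Diagrams with fewer than 500 characters of preceding context simply don't match this pattern, which is one reason the context pass yields far fewer diagrams (~48) than the global pass (~461).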
Error Handling and Diagnostics
The extraction function includes comprehensive error handling and diagnostic output to aid in debugging and monitoring extraction quality.
Error Handling Strategy :
Sources: tools/deepwiki-scraper.py:327-331
Diagnostic Output :
The function provides detailed logging at tools/deepwiki-scraper.py:244, tools/deepwiki-scraper.py:248, tools/deepwiki-scraper.py:300, and tools/deepwiki-scraper.py:314-316:
| Output Message | Condition | Purpose |
|---|---|---|
"Found mermaid diagram: {first_50}... ({lines} lines)" | Each successful extraction | Verify diagram content |
"No fenced mermaid blocks found, trying JavaScript extraction..." | Strategy 1 fails | Indicate fallback |
"Found JS mermaid diagram: {first_50}... ({lines} lines)" | Strategy 2 success | Show fallback results |
"Extracted {count} unique mermaid diagram(s)" | Deduplication complete | Report final count |
"Warning: No valid mermaid diagrams extracted" | Zero diagrams found | Alert to potential issues |
"Warning: Failed to extract mermaid from page data: {e}" | Exception caught | Debug extraction failures |
Performance Characteristics
The extraction algorithm exhibits specific performance characteristics relevant to large wiki repositories.
Complexity Analysis :
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Strategy 1 regex | O(n) | O(m) | n = HTML length, m = diagram count |
| Strategy 2 scan | O(n × k) | O(m) | k = keyword count (10) |
| Unescaping | O(m × d) | O(m × d) | d = avg diagram length |
| Deduplication | O(m) | O(m) | Uses 100-char fingerprint |
| Total | O(n × k) | O(m × d) | Dominated by Strategy 2 |
Typical Performance :
Based on the diagnostic output patterns at tools/deepwiki-scraper.py:314:
- Input size : ~2-5 MB HTML response
- Extraction time : ~200-500ms (dominated by regex operations)
- Diagrams found : ~461 total diagrams
- Diagrams with context : ~48 after filtering
- Memory usage : ~1-2 MB for diagram storage (ephemeral)
Optimization Opportunities :
The current implementation prioritizes correctness over performance. Potential optimizations:
- Early termination : Stop Strategy 2 after finding sufficient diagrams
- Compiled patterns : Pre-compile regex patterns (currently done inline)
- Streaming extraction : Process HTML in chunks rather than loading entirely
- Fingerprint cache : Persist fingerprints across runs to avoid re-extraction
However, given typical execution times (<1 second), these optimizations are not currently necessary.
Summary
The Next.js diagram extraction mechanism solves the challenge of recovering client-side rendered Mermaid diagrams from DeepWiki's JavaScript payload. The implementation uses a two-strategy approach (fenced blocks and JavaScript string scanning), comprehensive unescaping logic, and fingerprint-based deduplication to reliably extract hundreds of diagrams from a single HTML response. The extracted diagrams are then passed to the fuzzy matching algorithm (see Fuzzy Diagram Matching Algorithm) for intelligent placement in the appropriate Markdown files.
Key Functions and Components :
| Component | Location | Purpose |
|---|---|---|
| extract_mermaid_from_nextjs_data() | tools/deepwiki-scraper.py:218-331 | Main extraction function |
| Strategy 1 regex | tools/deepwiki-scraper.py:223-244 | Fenced block pattern matching |
| Strategy 2 scanner | tools/deepwiki-scraper.py:246-302 | JavaScript string scanning |
| Deduplication | tools/deepwiki-scraper.py:304-311 | Fingerprint-based uniqueness |
| Phase 2 integration | tools/deepwiki-scraper.py:604-612 | Call site in enhancement pipeline |
Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674