Diagram Extraction from Next.js

Purpose and Scope

This document details how Mermaid diagrams are extracted from DeepWiki's Next.js JavaScript payload. DeepWiki uses client-side rendering for diagrams, embedding them as escaped strings within the HTML's JavaScript data structures. This page covers the extraction algorithms, regex patterns, unescaping logic, and deduplication mechanisms used to recover these diagrams.

For information about how extracted diagrams are matched to content and injected into Markdown files, see Fuzzy Diagram Matching Algorithm. For the overall diagram enhancement workflow, see Phase 2: Diagram Enhancement.


The Next.js Data Payload Problem

DeepWiki's architecture presents a unique challenge for diagram extraction. The application uses Next.js with client-side rendering, where Mermaid diagrams are embedded in the JavaScript payload rather than being present in the static HTML. Furthermore, the JavaScript payload contains diagrams from all pages in the wiki, not just the currently viewed page, making per-page extraction impossible without additional context matching.

Diagram: Next.js Payload Structure

```mermaid
graph TB
    subgraph "Browser View"
        HTML["HTML Response\nfrom deepwiki.com"]
    end

    subgraph "Embedded JavaScript"
        JSPayload["Next.js Data Payload\nMixed content from all pages"]
        DiagramData["Mermaid Diagrams\nAs escaped strings"]
    end

    subgraph "String Format"
        EscapedFormat["```mermaid\\ngraph TD\\nA --> B\\n```"]
        UnescapedFormat["```mermaid\ngraph TD\nA --> B\n```"]
    end

    HTML --> JSPayload
    JSPayload --> DiagramData
    DiagramData --> EscapedFormat
    EscapedFormat -.->|extract_mermaid_from_nextjs_data| UnescapedFormat

    note1["Problem: Diagrams from\nALL wiki pages mixed together"]
    JSPayload -.-> note1

    note2["Problem: Escape sequences\n\\n, \\t, \\\" etc."]
    EscapedFormat -.-> note2
```

Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674

The key characteristics of this data format:

| Characteristic | Description | Impact |
|---|---|---|
| Escaped newlines | Literal `\\n` instead of newline characters | Requires unescaping before use |
| Mixed content | All pages' diagrams in one payload | Requires context matching (Phase 2) |
| Unicode escapes | Sequences like `\\u003c` for `<` | Requires comprehensive unescape logic |
| String wrapping | Diagrams wrapped in JavaScript strings | Requires careful quote handling |

Extraction Entry Point

The extract_mermaid_from_nextjs_data() function serves as the primary extraction mechanism. It is called during Phase 2 of the pipeline when processing the HTML response from any DeepWiki page.

Diagram: Extraction Function Flow

```mermaid
flowchart TD
    Start["extract_mermaid_from_nextjs_data(html_text)"]
    Strategy1["Strategy 1:\nFenced Block Pattern\n```mermaid\\n(.*?)```"]
    Check1{{"Blocks found?"}}
    Strategy2["Strategy 2:\nJavaScript String Scan\nSearch for diagram keywords"]
    Unescape["Unescape all blocks:\n\\n → newline\n\\t → tab\n\\u003c → <"]
    Dedup["Deduplicate by fingerprint\nFirst 100 chars"]
    Return["Return unique_blocks"]

    Start --> Strategy1
    Strategy1 --> Check1
    Check1 -->|Yes| Unescape
    Check1 -->|No| Strategy2
    Strategy2 --> Unescape
    Unescape --> Dedup
    Dedup --> Return
```

Sources: tools/deepwiki-scraper.py:218-331


Strategy 1: Fenced Mermaid Block Pattern

The primary extraction strategy uses a regex pattern to locate fenced Mermaid code blocks within the JavaScript payload. These blocks follow the Markdown convention but with escaped newlines.

Regex Pattern: ````r'```mermaid\\n(.*?)```'````

This pattern specifically targets:

  • Opening fence: ```` ```mermaid ````
  • Escaped newline: `\\n` (a literal backslash followed by `n`)
  • Diagram content: `(.*?)` (non-greedy capture)
  • Closing fence: ```` ``` ````

Diagram: Fenced Block Extraction Process

Sources: tools/deepwiki-scraper.py:223-244

Code Implementation:

The extraction loop at tools/deepwiki-scraper.py:223-244 implements this strategy (a sketch follows the list):

  1. Pattern matching: Uses `re.finditer()` with the `re.DOTALL` flag to handle multi-line diagrams
  2. Content extraction: Captures the diagram code via `match.group(1)`
  3. Unescaping: Applies comprehensive escape sequence replacement
  4. Validation: Filters blocks with `len(block) > 10` to exclude empty matches
  5. Logging: Prints the first 50 characters and line count for diagnostics
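A minimal sketch of this strategy, applying the unescape chain in the documented order; the function name `extract_fenced_mermaid` and the exact structure are assumptions based on this page, not a verbatim copy of the scraper:

```python
import re

def extract_fenced_mermaid(html_text):
    """Sketch of Strategy 1: fenced Mermaid blocks with escaped newlines."""
    blocks = []
    # re.DOTALL lets (.*?) span the multi-line (escaped) diagram body
    for match in re.finditer(r'```mermaid\\n(.*?)```', html_text, re.DOTALL):
        block = match.group(1)
        # Unescape in the documented order: newlines, tabs, quotes,
        # backslashes, then unicode escapes
        for old, new in (('\\n', '\n'), ('\\t', '\t'), ('\\"', '"'),
                         ('\\\\', '\\'), ('\\u003c', '<'),
                         ('\\u003e', '>'), ('\\u0026', '&')):
            block = block.replace(old, new)
        if len(block) > 10:  # drop empty/trivial matches
            blocks.append(block.strip())
    return blocks
```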

Strategy 2: JavaScript String Scanning

When Strategy 1 fails to find fenced blocks, the function falls back to scanning for raw diagram strings embedded in JavaScript. This handles cases where diagrams are stored as plain strings without Markdown fencing.

Diagram: JavaScript String Scan Algorithm

```mermaid
flowchart TD
    Start["For each diagram keyword"]
    Keywords["Keywords:\ngraph TD, graph TB,\nflowchart TD, sequenceDiagram,\nclassDiagram"]
    FindKW["pos = html_text.find(keyword, pos)"]
    CheckFound{{"Keyword found?"}}
    BackwardScan["Scan backwards 20 chars\nFind opening quote"]
    QuoteFound{{"Quote found?"}}
    ForwardScan["Scan forward up to 10000 chars\nFind closing quote\nSkip escaped quotes"]
    Extract["Extract string_start:string_end"]
    UnescapeValidate["Unescape and validate\nMust have 3+ lines"]
    Append["Append to mermaid_blocks"]
    NextPos["pos += 1, continue search"]
    End["Move to next keyword"]

    Start --> Keywords
    Keywords --> FindKW
    FindKW --> CheckFound
    CheckFound -->|Yes| BackwardScan
    CheckFound -->|"No, break"| End
    BackwardScan --> QuoteFound
    QuoteFound -->|Yes| ForwardScan
    QuoteFound -->|No| NextPos
    ForwardScan --> Extract
    Extract --> UnescapeValidate
    UnescapeValidate --> Append
    Append --> NextPos
    NextPos --> FindKW
```

Sources: tools/deepwiki-scraper.py:246-302

Keyword List:

The algorithm searches for Mermaid diagram type indicators such as `graph TD`, `graph TB`, `flowchart TD`, `sequenceDiagram`, and `classDiagram`; the full set of ten keywords lives in the scanner at tools/deepwiki-scraper.py:246-302.

Quote Handling Logic:

The forward scan at tools/deepwiki-scraper.py:273-285 implements careful quote detection (sketched below):

  • Scans up to 10,000 characters forward (safety limit)
  • Checks whether the previous character is `\` to identify escaped quotes
  • Breaks on the first unescaped `"` character
  • Returns to the search position + 1 if no closing quote is found
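The following sketch combines the keyword loop and the quote scan described above; the names are assumptions, and the unescape chain is abbreviated to `\\n` for brevity:

```python
MERMAID_KEYWORDS = ('graph TD', 'graph TB', 'flowchart TD',
                    'sequenceDiagram', 'classDiagram')  # subset of the full list

def scan_js_strings(html_text):
    """Sketch of Strategy 2: pull raw diagram strings out of JavaScript."""
    found = []
    for keyword in MERMAID_KEYWORDS:
        pos = 0
        while True:
            pos = html_text.find(keyword, pos)
            if pos == -1:
                break  # keyword exhausted; move to the next one
            # Backward scan: up to 20 chars for the opening quote
            string_start = None
            for i in range(pos, max(pos - 20, -1), -1):
                if html_text[i] == '"':
                    string_start = i + 1
                    break
            if string_start is not None:
                # Forward scan: up to 10,000 chars for an unescaped closing quote
                limit = min(string_start + 10000, len(html_text))
                for j in range(string_start, limit):
                    if html_text[j] == '"' and html_text[j - 1] != '\\':
                        candidate = html_text[string_start:j].replace('\\n', '\n')
                        if candidate.count('\n') >= 2:  # validate: 3+ lines
                            found.append(candidate)
                        break
            pos += 1  # continue searching past this hit
    return found
```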

Unescape Processing

All extracted diagram blocks undergo comprehensive unescaping to convert JavaScript string representations into valid Mermaid code. The unescaping process handles multiple escape sequence types.

Escape Sequence Mapping:

| Escaped Form | Unescaped Result | Purpose |
|---|---|---|
| `\\n` | `\n` (newline) | Line breaks in diagram code |
| `\\t` | `\t` (tab) | Indentation |
| `\\"` | `"` (quote) | String literals in labels |
| `\\\\` | `\` (backslash) | Literal backslashes |
| `\\u003c` | `<` | Less-than symbol |
| `\\u003e` | `>` | Greater-than symbol |
| `\\u0026` | `&` | Ampersand |

Diagram: Unescape Transformation Pipeline

Sources: tools/deepwiki-scraper.py:231-238 tools/deepwiki-scraper.py:289-295

Implementation Details:

The unescaping sequence at tools/deepwiki-scraper.py:231-238 executes in a specific order to prevent double-processing:

  1. Newlines first: `\\n` → `\n` (most common)
  2. Tabs: `\\t` → `\t` (whitespace)
  3. Quotes: `\\"` → `"` (before backslash handling to avoid conflicts)
  4. Backslashes: `\\\\` → `\` (last, to avoid interfering with other escapes)
  5. Unicode: `\\u003c`, `\\u003e`, `\\u0026` → `<`, `>`, `&`

The order matters: processing backslashes before quotes would incorrectly unescape `\\\\"` sequences.
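A tiny demonstration of that ordering constraint, using Python string literals for the payload bytes:

```python
# Payload fragment: an escaped backslash followed by a quote character
raw = '\\\\"'  # the three characters \ \ "

# Documented order (quotes, then backslashes) preserves the backslash
good = raw.replace('\\"', '"').replace('\\\\', '\\')
assert good == '\\"'  # the two characters \ "

# Reversed order consumes the escaped backslash first, then strips the
# leftover \" pair, silently losing a character
bad = raw.replace('\\\\', '\\').replace('\\"', '"')
assert bad == '"'
```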


Deduplication Mechanism

Since multiple extraction strategies may find the same diagram (once as a fenced block, once as a JavaScript string), the function implements fingerprint-based deduplication.

Diagram: Deduplication Algorithm

```mermaid
flowchart TD
    Start["Input: mermaid_blocks[]\n(may contain duplicates)"]
    Init["Initialize:\nunique_blocks = []\nseen = set()"]
    Loop["For each block in mermaid_blocks"]
    Fingerprint["fingerprint = block[:100]\n(first 100 chars)"]
    CheckSeen{{"fingerprint in seen?"}}
    Skip["Skip duplicate"]
    Add["Add to seen set\nAppend to unique_blocks"]
    Return["Return unique_blocks"]

    Start --> Init
    Init --> Loop
    Loop --> Fingerprint
    Fingerprint --> CheckSeen
    CheckSeen -->|Yes| Skip
    CheckSeen -->|No| Add
    Skip --> Loop
    Add --> Loop
    Loop -->|Done| Return
```

Sources: tools/deepwiki-scraper.py:304-311

Fingerprint Strategy:

The deduplication at tools/deepwiki-scraper.py:304-311 uses the first 100 characters as a unique identifier (see the sketch below). This approach:

  • Avoids exact string comparison: saves memory and time on large diagrams
  • Handles minor variations: trailing whitespace differences don't affect matching
  • Preserves order: first occurrence wins (FIFO)
  • Works across strategies: catches duplicates from both extraction methods
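A minimal sketch of the mechanism (the function name is assumed):

```python
def deduplicate_blocks(mermaid_blocks):
    """Keep the first occurrence of each diagram, keyed by a short fingerprint."""
    unique_blocks = []
    seen = set()
    for block in mermaid_blocks:
        fingerprint = block[:100]  # first 100 chars stand in for the whole diagram
        if fingerprint in seen:
            continue  # duplicate found by another strategy; skip it
        seen.add(fingerprint)
        unique_blocks.append(block)
    return unique_blocks
```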

Integration with Enhancement Pipeline

The extract_mermaid_from_nextjs_data() function is called from extract_and_enhance_diagrams() during Phase 2 processing. The integration pattern extracts diagrams globally, then distributes them to individual pages through context matching.

Diagram: Phase 2 Integration Flow

```mermaid
sequenceDiagram
    participant Main as "extract_and_enhance_diagrams()"
    participant HTTP as "requests.Session"
    participant Extract as "extract_mermaid_from_nextjs_data()"
    participant Context as "Context Extraction"
    participant Files as "Markdown Files"

    Main->>HTTP: GET https://deepwiki.com/repo/1-overview
    HTTP-->>Main: HTML response (all diagrams)

    Main->>Extract: extract_mermaid_from_nextjs_data(html_text)
    Extract->>Extract: Strategy 1: Fenced blocks
    Extract->>Extract: Strategy 2: JS strings
    Extract->>Extract: Unescape all
    Extract->>Extract: Deduplicate
    Extract-->>Main: all_diagrams[] (~461 diagrams)

    Main->>Context: Extract with 500-char context
    Context-->>Main: diagram_contexts[] (~48 with context)

    Main->>Files: Fuzzy match and inject into *.md
    Files-->>Main: enhanced_count files modified
```

Sources: tools/deepwiki-scraper.py:596-674 tools/deepwiki-scraper.py:604-612

Call Site:

The extraction is invoked at tools/deepwiki-scraper.py:604-612 within `extract_and_enhance_diagrams()` (a sketch of the call pattern follows the list):

  1. Fetch any page: Typically uses /1-overview, since every page's payload contains all diagrams
  2. Extract globally: Calls `extract_mermaid_from_nextjs_data()` on the full HTML response
  3. Count total: Logs the total diagram count (~461 in typical repositories)
  4. Extract context: A secondary regex pass captures surrounding text (see Fuzzy Diagram Matching Algorithm)
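A hedged sketch of that call pattern; `repo_path` and the session setup are assumed names, not the scraper's exact code:

```python
import requests

session = requests.Session()
repo_path = "owner/repo"  # hypothetical repository slug

# Any page works, since the payload embeds diagrams from the whole wiki
resp = session.get(f"https://deepwiki.com/{repo_path}/1-overview")
all_diagrams = extract_mermaid_from_nextjs_data(resp.text)
print(f"Extracted {len(all_diagrams)} unique mermaid diagram(s)")
```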

Alternative Pattern Search:

Phase 2 also performs a second extraction pass at tools/deepwiki-scraper.py:615-646 with context:

  • Pattern: ````r'([^`]{500,}?)```mermaid\\n(.*?)```'````
  • Purpose: Captures 500+ characters before each diagram for context matching
  • Result: diagram_contexts[] with last_heading, anchor_text, and diagram fields
  • Filtering: Only diagrams with meaningful context are used for fuzzy matching
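A sketch of this context pass; the regex matches the pattern above, but the derivation of `last_heading` and `anchor_text` here is illustrative, not a verbatim copy of the scraper's logic:

```python
import re

def extract_diagram_contexts(html_text):
    """Sketch: pair each fenced diagram with 500+ chars of preceding context."""
    pattern = re.compile(r'([^`]{500,}?)```mermaid\\n(.*?)```', re.DOTALL)
    diagram_contexts = []
    for m in pattern.finditer(html_text):
        context, raw_diagram = m.group(1), m.group(2)
        # Last escaped markdown heading in the preceding context, if any
        headings = re.findall(r'#{1,6} ([^\\]+)', context)
        diagram_contexts.append({
            'last_heading': headings[-1].strip() if headings else None,
            'anchor_text': context[-200:],  # text nearest the diagram
            'diagram': raw_diagram.replace('\\n', '\n'),  # unescape abbreviated
        })
    return diagram_contexts
```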

Error Handling and Diagnostics

The extraction function includes comprehensive error handling and diagnostic output to aid in debugging and monitoring extraction quality.

Error Handling Strategy:

Sources: tools/deepwiki-scraper.py:327-331

Diagnostic Output:

The function provides detailed logging at tools/deepwiki-scraper.py:244, tools/deepwiki-scraper.py:248, tools/deepwiki-scraper.py:300, and tools/deepwiki-scraper.py:314-316:

| Output Message | Condition | Purpose |
|---|---|---|
| "Found mermaid diagram: {first_50}... ({lines} lines)" | Each successful extraction | Verify diagram content |
| "No fenced mermaid blocks found, trying JavaScript extraction..." | Strategy 1 fails | Indicate fallback |
| "Found JS mermaid diagram: {first_50}... ({lines} lines)" | Strategy 2 success | Show fallback results |
| "Extracted {count} unique mermaid diagram(s)" | Deduplication complete | Report final count |
| "Warning: No valid mermaid diagrams extracted" | Zero diagrams found | Alert to potential issues |
| "Warning: Failed to extract mermaid from page data: {e}" | Exception caught | Debug extraction failures |

Performance Characteristics

The extraction algorithm exhibits specific performance characteristics relevant to large wiki repositories.

Complexity Analysis:

| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Strategy 1 regex | O(n) | O(m) | n = HTML length, m = diagram count |
| Strategy 2 scan | O(n × k) | O(m) | k = keyword count (10) |
| Unescaping | O(m × d) | O(m × d) | d = avg diagram length |
| Deduplication | O(m) | O(m) | Uses 100-char fingerprint |
| Total | O(n × k) | O(m × d) | Dominated by Strategy 2 |

Typical Performance:

Based on the diagnostic output patterns at tools/deepwiki-scraper.py:314:

  • Input size: ~2-5 MB HTML response
  • Extraction time: ~200-500 ms (dominated by regex operations)
  • Diagrams found: ~461 total diagrams
  • Diagrams with context: ~48 after filtering
  • Memory usage: ~1-2 MB for diagram storage (ephemeral)

Optimization Opportunities:

The current implementation prioritizes correctness over performance. Potential optimizations:

  1. Early termination: Stop Strategy 2 after finding sufficient diagrams
  2. Compiled patterns: Pre-compile regex patterns (currently done inline)
  3. Streaming extraction: Process the HTML in chunks rather than loading it entirely
  4. Fingerprint cache: Persist fingerprints across runs to avoid re-extraction

However, given typical execution times (<1 second), these optimizations are not currently necessary.


Summary

The Next.js diagram extraction mechanism solves the challenge of recovering client-side rendered Mermaid diagrams from DeepWiki's JavaScript payload. The implementation uses a two-strategy approach (fenced blocks and JavaScript string scanning), comprehensive unescaping logic, and fingerprint-based deduplication to reliably extract hundreds of diagrams from a single HTML response. The extracted diagrams are then passed to the fuzzy matching algorithm (see Fuzzy Diagram Matching Algorithm) for intelligent placement in the appropriate Markdown files.

Key Functions and Components:

| Component | Location | Purpose |
|---|---|---|
| extract_mermaid_from_nextjs_data() | tools/deepwiki-scraper.py:218-331 | Main extraction function |
| Strategy 1 regex | tools/deepwiki-scraper.py:223-244 | Fenced block pattern matching |
| Strategy 2 scanner | tools/deepwiki-scraper.py:246-302 | JavaScript string scanning |
| Deduplication | tools/deepwiki-scraper.py:304-311 | Fingerprint-based uniqueness |
| Phase 2 integration | tools/deepwiki-scraper.py:604-612 | Call site in enhancement pipeline |

Sources: tools/deepwiki-scraper.py:218-331 tools/deepwiki-scraper.py:596-674