# Fuzzy Matching Algorithm

## Purpose and Scope

The fuzzy matching algorithm places Mermaid diagrams extracted from DeepWiki’s JavaScript payload into the correct locations within Markdown files. It matches diagram context (as it appears in the JavaScript) to content locations in html2text-converted Markdown, accounting for formatting differences between the two representations.

This algorithm is implemented in the `extract_and_enhance_diagrams()` function and processes all files after the initial Markdown extraction phase completes.

Sources: python/deepwiki-scraper.py:880-1275

## The Matching Problem

The fuzzy matching algorithm addresses a fundamental mismatch: diagrams are embedded in DeepWiki’s JavaScript payload alongside their surrounding context text, but this context text differs significantly from the final Markdown output produced by html2text. The algorithm must find where each diagram belongs despite these differences.

### Format Differences Between Sources

| Aspect | JavaScript Payload | html2text Output |
|---|---|---|
| Whitespace | Escaped `\n` sequences | Actual newlines |
| Line wrapping | No wrapping (continuous text) | Wrapped at natural boundaries |
| HTML entities | Escaped (`\u003c`, `\u0026`) | Decoded (`<`, `&`) |
| Formatting | Inline with escaped quotes | Clean Markdown syntax |
| Structure | Linear text stream | Hierarchical headings/paragraphs |

Sources: python/deepwiki-scraper.py:898-903

## Context Extraction Strategy

The algorithm extracts two types of context for each diagram to enable matching:

### 1. Last Heading Before Diagram

**Extracting `last_heading` from Context**

The algorithm scans backwards through `context_lines` to find the most recent line starting with `#`, which provides a coarse-grained location hint.
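
A minimal sketch of this backward scan; the sample `context_lines` and the exact trimming of `#` markers are assumptions, not the literal source:

```python
# Hypothetical sample of the ~2000-character window before a diagram.
context_lines = [
    "## Architecture Overview",
    "The scraper runs in three phases.",
    "Each phase writes into a temp directory.",
]

last_heading = ""
for line in reversed(context_lines):
    stripped = line.strip()
    if stripped.startswith("#"):
        # Keep the heading text without its leading '#' markers.
        last_heading = stripped.lstrip("#").strip()
        break
```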

Sources: python/deepwiki-scraper.py:1066-1071

### 2. Anchor Text (Last 2-3 Paragraphs)

**Extracting `anchor_text` from Context**

The `anchor_text` consists of the last 2-3 substantial non-heading lines before the diagram, truncated to 300 characters. This provides fine-grained matching capability.
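
A minimal sketch of this step, under the same assumptions as the previous sketch:

```python
# Hypothetical context window, as in the previous sketch.
context_lines = [
    "## Architecture Overview",
    "The scraper runs in three phases.",
    "Each phase writes into a temp directory.",
    "Diagrams are re-inserted afterwards.",
]

# Keep substantial, non-heading lines only.
substantial = [
    line.strip()
    for line in context_lines
    if line.strip() and not line.strip().startswith("#")
]
# Join the last few lines and keep at most the trailing 300 characters.
anchor_text = " ".join(substantial[-3:])[-300:]
```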

Sources: python/deepwiki-scraper.py:1073-1081

## Progressive Chunk Size Matching

The core of the fuzzy matching algorithm uses progressively smaller chunk sizes to find matches, prioritizing longer (more specific) matches over shorter ones.

### Chunk Size Progression

The algorithm tests chunks in this order:

| Chunk Size | Purpose | Match Quality |
|---|---|---|
| 300 chars | Full anchor text | Highest confidence |
| 200 chars | Most of anchor | High confidence |
| 150 chars | Significant portion | Medium-high confidence |
| 100 chars | Key phrases | Medium confidence |
| 80 chars | Minimum viable match | Low confidence |

### Matching Algorithm Flow

**Progressive Chunk Matching in Code**
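
A minimal sketch of the progressive search, assuming a `normalize()` helper as described under Text Normalization below; the `find_best_match()` wrapper is hypothetical and the offset-to-line mapping is approximate:

```python
import re

def normalize(text: str) -> str:
    # Lowercase and collapse all whitespace runs to single spaces.
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_best_match(anchor_text: str, lines: list[str]) -> tuple[int, int]:
    """Return (best_match_line, best_match_score), or (-1, 0) on failure."""
    anchor_normalized = normalize(anchor_text)
    content_normalized = normalize("\n".join(lines))
    for chunk_size in (300, 200, 150, 100, 80):
        test_chunk = anchor_normalized[-chunk_size:]
        pos = content_normalized.find(test_chunk)
        if pos == -1:
            continue  # try a shorter, less specific chunk
        # Map the character offset back to a line number by walking the
        # file and accumulating normalized line lengths (approximate:
        # runs of blank lines can skew the offset slightly).
        consumed = 0
        for line_num, line in enumerate(lines):
            consumed += len(normalize(line)) + 1  # +1 for the joining space
            if consumed > pos:
                return line_num, chunk_size  # longest chunk wins
        return len(lines) - 1, chunk_size
    return -1, 0
```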

Sources: python/deepwiki-scraper.py:1169-1239

## Text Normalization

Both the diagram context and the target Markdown content undergo identical normalization to maximize matching success:
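
A minimal sketch of the normalization, using a hypothetical `normalize()` helper:

```python
import re

def normalize(text: str) -> str:
    # Lowercase, collapse all whitespace runs, and trim the ends.
    return re.sub(r"\s+", " ", text.lower()).strip()

# Both representations collapse to the same comparable string:
assert normalize("Fuzzy  Matching\nAlgorithm ") == "fuzzy matching algorithm"
```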

This process:

- Converts all text to lowercase
- Collapses all consecutive whitespace (spaces, tabs, newlines) into single spaces
- Removes leading/trailing whitespace

Sources: python/deepwiki-scraper.py:1166-1167 python/deepwiki-scraper.py:1185-1186

## Fallback: Heading-Based Matching

If progressive chunk matching fails (`best_match_line == -1`) and a heading exists, the algorithm falls back to heading-based matching:

**Heading Fallback Implementation**
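
A minimal sketch of the fallback, with a hypothetical `heading_fallback()` helper:

```python
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()

def heading_fallback(last_heading: str, lines: list[str]) -> tuple[int, int]:
    """Return (line_num, 50) for the first heading matching last_heading."""
    target = normalize(last_heading)
    for line_num, line in enumerate(lines):
        stripped = line.lstrip()
        if stripped.startswith("#") and normalize(stripped.lstrip("#")) == target:
            return line_num, 50  # fixed score, below any chunk match
    return -1, 0
```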

Heading-based matches receive a fixed `best_match_score` of 50, lower than any chunk-based match (minimum 80), indicating lower confidence.

Sources: python/deepwiki-scraper.py:1203-1216

## Insertion Point Calculation

Once a match is found with `best_match_score >= 80`, the algorithm calculates the precise `insert_line` for the diagram:

### Insertion After Headings

**Calculating `insert_line` After a Heading**
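
A minimal sketch with a hypothetical helper; per the pipeline flowchart below, the actual code additionally routes the result through `enforce_content_start()` and `advance_past_lists()`:

```python
def insert_after_heading(lines: list[str], heading_line: int) -> int:
    # Start just past the matched heading, then skip blank lines so the
    # diagram is inserted at the start of the section body.
    insert_line = heading_line + 1
    while insert_line < len(lines) and not lines[insert_line].strip():
        insert_line += 1
    return insert_line
```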

Sources: python/deepwiki-scraper.py:1222-1230

### Insertion After Paragraphs

**Calculating `insert_line` After a Paragraph**
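
A minimal sketch of the paragraph case, again as a hypothetical helper:

```python
def insert_after_paragraph(lines: list[str], match_line: int) -> int:
    # Advance to the blank line that ends the matched paragraph and
    # insert the diagram immediately after it.
    insert_line = match_line
    while insert_line < len(lines) and lines[insert_line].strip():
        insert_line += 1
    return insert_line
```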

Sources: python/deepwiki-scraper.py:1232-1236

## Scoring and Deduplication

The algorithm tracks which diagrams have been used to prevent duplicates:

For each file, the algorithm:

1. Attempts to match every diagram in `diagram_contexts` against the file content
2. Stores successful matches with their scores in `pending_insertions` as tuples: `(insert_line, diagram, best_match_score, idx)`
3. Marks diagrams as used by adding their index to the `diagrams_used` set
4. Sorts `pending_insertions` by line number (descending) to avoid index shifting
5. Inserts diagrams from bottom to top (see the sketch below)
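
A minimal sketch of steps 2-5, where the `matches` list stands in for the output of the matching loop:

```python
# Hypothetical matcher output: (insert_line, diagram_text, score, index).
matches = [
    (42, "graph TD\n  A --> B", 300, 0),
    (10, "graph LR\n  C --> D", 150, 1),
]

diagrams_used = set()
pending_insertions = []
for insert_line, diagram, score, idx in matches:
    if idx not in diagrams_used:
        pending_insertions.append((insert_line, diagram, score, idx))
        diagrams_used.add(idx)  # each diagram is placed at most once

# Sort descending by line so bottom-up insertion never shifts the
# line numbers of matches that have not been applied yet.
pending_insertions.sort(key=lambda x: x[0], reverse=True)
```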

Sources: python/deepwiki-scraper.py:1162-1163 python/deepwiki-scraper.py:1238 python/deepwiki-scraper.py:1242-1243

## Diagram Insertion Format

Diagrams are inserted with proper Markdown fencing and spacing, accounting for backticks in the diagram content:

**Building `lines_to_insert`**
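
A minimal sketch of the fence-length logic, with a hypothetical `build_lines_to_insert()` helper. The empty strings at each end keep the fenced block separated from surrounding paragraphs:

```python
import re

def build_lines_to_insert(diagram: str) -> list[str]:
    # Use a fence one backtick longer than the longest backtick run in
    # the diagram body (minimum 3), so nested fences cannot break out.
    runs = re.findall(r"`+", diagram)
    fence_len = max(3, max((len(r) for r in runs), default=0) + 1)
    fence = "`" * fence_len
    return ["", fence + "mermaid", *diagram.split("\n"), fence, ""]
```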

This results in the following structure in the Markdown file:

````
Matched paragraph text.

```mermaid
flowchart TD
    ...
```

Next paragraph text.
````

If the diagram contains triple backticks, the fence length is increased (e.g., to 4 or 5 backticks) to avoid conflicts.

Sources: python/deepwiki-scraper.py:1249-1266

## Complete Matching Pipeline

**`extract_and_enhance_diagrams()` Function Flow**

```mermaid
flowchart TD
    Start["extract_and_enhance_diagrams(\nrepo, temp_dir,\nsession, diagram_source_url)"] --> Fetch["response = session.get(\ndiagram_source_url)\nhtml_text = response.text"]
    Fetch --> ExtractPattern["diagram_pattern =\nr'```mermaid(?:\\r\\n|\\n|\r?\n)\n(.*?)(?:\\r\\n|\\n|\r?\n)```'\ndiagram_matches =\nlist(re.finditer(\npattern, html_text, re.DOTALL))"]
    ExtractPattern --> ContextLoop["diagram_contexts = []\nfor match in diagram_matches:\ncontext_start =\nmax(0, match.start() - 2000)"]
    ContextLoop --> ParseContext["extract:\n- last_heading\n- anchor_text[-300:]\n- diagram (unescaped)"]
    ParseContext --> NormalizeDiag["diagram =\nmerge_multiline_labels(diagram)\ndiagram =\nstrip_wrapping_quotes(diagram)\ndiagram =\nnormalize_mermaid_diagram(diagram)"]
    NormalizeDiag --> AppendContext["diagram_contexts.append({\n'last_heading': last_heading,\n'anchor_text': anchor_text,\n'diagram': diagram\n})"]
    AppendContext --> FindFiles["md_files =\nlist(temp_dir.glob('**/*.md'))"]
    
    FindFiles --> FileLoop["for md_file in md_files:"]
    FileLoop --> ReadFile["content = f.read()"]
    ReadFile --> CheckExists["re.search(\nr'^\\s*`{3,}\\s*mermaid\\b',\ncontent, re.IGNORECASE\n| re.MULTILINE)?"]
    CheckExists -->|Yes| Skip["continue"]
    CheckExists -->|No| SplitLines["lines = content.split('\\n')"]
    SplitLines --> InitVars["diagrams_used = set()\npending_insertions = []\ncontent_normalized =\ncontent.lower()"]
    InitVars --> DiagLoop["for idx, item in\nenumerate(diagram_contexts):"]
    
    DiagLoop --> TryChunks["for chunk_size in\n[300, 200, 150, 100, 80]:\ntest_chunk =\nanchor_normalized[-chunk_size:]\npos = content_normalized\n.find(test_chunk)"]
    TryChunks -->|Found| CalcLine["convert pos to line_num\nbest_match_line = line_num\nbest_match_score = chunk_size"]
    TryChunks -->|Not found| TryHeading["heading fallback matching"]
    TryHeading -->|Found| SetScore["best_match_score = 50"]
    CalcLine --> CheckScore["best_match_score >= 80?"]
    SetScore --> CheckScore
    CheckScore -->|Yes| CalcInsert["calculate insert_line:\nenforce_content_start()\nadvance_past_lists()"]
    CalcInsert --> AppendPending["pending_insertions.append(\ninsert_line, diagram,\nbest_match_score, idx)\ndiagrams_used.add(idx)"]
    CheckScore -->|No| NextDiag["next diagram"]
    
    AppendPending --> NextDiag
    NextDiag --> DiagLoop
    DiagLoop -->|All done| SortPending["pending_insertions.sort(\nkey=lambda x: x[0],\nreverse=True)"]
    SortPending --> InsertLoop["for insert_line, diagram,\nscore, idx in\npending_insertions:\ncalculate fence_len\nlines.insert(insert_line,\nlines_to_insert)"]
    InsertLoop --> SaveFile["with open(md_file, 'w')\nas f:\nf.write('\\n'.join(lines))"]
    SaveFile --> FileLoop
    FileLoop -->|All files| PrintStats["print(f'Enhanced\n{enhanced_count} files')"]
```

Sources: python/deepwiki-scraper.py:880-1275

## Key Functions and Variables

| Function/Variable | Location | Purpose |
|---|---|---|
| `extract_and_enhance_diagrams()` | python/deepwiki-scraper.py:880-1275 | Main orchestrator for the diagram enhancement phase |
| `diagram_contexts` | python/deepwiki-scraper.py:903 | List of dicts with `last_heading`, `anchor_text`, `diagram` |
| `first_body_heading_index()` | python/deepwiki-scraper.py:1095-1099 | Finds the first `##` heading in a file |
| `protected_prefix_end()` | python/deepwiki-scraper.py:1101-1115 | Determines where the title and source list end |
| `advance_past_lists()` | python/deepwiki-scraper.py:1125-1137 | Skips over list blocks to avoid insertion inside lists |
| `enforce_content_start()` | python/deepwiki-scraper.py:1138-1147 | Ensures insertion falls after protected sections |
| `diagrams_used` | python/deepwiki-scraper.py:1162 | Set tracking which diagram indices are already placed |
| `pending_insertions` | python/deepwiki-scraper.py:1163 | List of tuples: `(insert_line, diagram, score, idx)` |
| `best_match_line` | python/deepwiki-scraper.py:1175 | Line number where the best match was found |
| `best_match_score` | python/deepwiki-scraper.py:1176 | Score of the best match (`chunk_size`, or 50 for heading fallback) |
| Progressive chunk loop | python/deepwiki-scraper.py:1187-1201 | Tries chunk sizes `[300, 200, 150, 100, 80]` |
| Heading fallback | python/deepwiki-scraper.py:1203-1216 | Matches on heading text when chunk matching fails |
| Insertion calculation | python/deepwiki-scraper.py:1219-1236 | Determines where to insert the diagram after a match |
| Diagram insertion | python/deepwiki-scraper.py:1249-1266 | Inserts the diagram with a dynamic fence length |

## Performance Characteristics

The algorithm processes diagrams in a single pass per file with the following complexity:

| Operation | Complexity | Notes |
|---|---|---|
| Content normalization | O(n) | n = file size in characters |
| Chunk search | O(n × c × d) | c = 5 chunk sizes, d = diagram count |
| Line number conversion | O(L) | L = number of lines in the file |
| Insertion sorting | O(k log k) | k = matched diagrams |
| Bottom-up insertion | O(k × L) | Reverse order avoids index recalculation |

For a typical file with 1000 lines and 48 diagram candidates with context, the algorithm completes in under 100ms per file.

Sources: python/deepwiki-scraper.py:880-1275

## Match Quality Statistics

As reported in the console output, the algorithm typically achieves:

- Total diagrams in JavaScript: ~461 diagrams across all pages
- Diagrams with sufficient context: ~48 diagrams (500+ character context)
- Average match rate: 60-80% of diagrams with context are successfully placed
- Typical score distribution:
  - 300-char matches: 20-30% (highest confidence)
  - 200-char matches: 15-25%
  - 150-char matches: 15-20%
  - 100-char matches: 10-15%
  - 80-char matches: 5-10%
  - Heading fallback: 5-10% (lowest confidence)

Sources: README.md:132-136 tools/deepwiki-scraper.py:674
