# Fuzzy Matching Algorithm

## Purpose and Scope

The fuzzy matching algorithm places Mermaid diagrams extracted from DeepWiki’s JavaScript payload into the correct locations within Markdown files. It matches diagram context (as it appears in the JavaScript) to content locations in html2text-converted Markdown, accounting for formatting differences between the two representations.

This algorithm is implemented in the `extract_and_enhance_diagrams()` function and processes all files after the initial Markdown extraction phase completes.

Sources: python/deepwiki-scraper.py:880-1275

## The Matching Problem

The fuzzy matching algorithm addresses a fundamental mismatch: diagrams are embedded in DeepWiki’s JavaScript payload alongside their surrounding context text, but this context text differs significantly from the final Markdown output produced by html2text. The algorithm must find where each diagram belongs despite these differences.

### Format Differences Between Sources

| Aspect | JavaScript Payload | html2text Output |
|---|---|---|
| Whitespace | Escaped `\n` sequences | Actual newlines |
| Line wrapping | No wrapping (continuous text) | Wrapped at natural boundaries |
| HTML entities | Escaped (`\u003c`, `\u0026`) | Decoded (`<`, `&`) |
| Formatting | Inline with escaped quotes | Clean Markdown syntax |
| Structure | Linear text stream | Hierarchical headings/paragraphs |

Sources: python/deepwiki-scraper.py:898-903

## Context Extraction Strategy

The algorithm extracts two types of context for each diagram to enable matching:

### 1. Last Heading Before Diagram

**Extracting `last_heading` from Context**

The algorithm scans backwards through `context_lines` to find the most recent line starting with `#`, which provides a coarse-grained location hint.
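
A minimal sketch of this backward scan; the sample `context_lines` and the exact trimming of `#` markers are assumptions, not the literal source:

```python
# Hypothetical sample of the ~2000-character window before a diagram.
context_lines = [
    "## Architecture Overview",
    "The scraper runs in three phases.",
    "Each phase writes into a temp directory.",
]

last_heading = ""
for line in reversed(context_lines):
    stripped = line.strip()
    if stripped.startswith("#"):
        # Keep the heading text without its leading '#' markers.
        last_heading = stripped.lstrip("#").strip()
        break
```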

Sources: python/deepwiki-scraper.py:1066-1071

### 2. Anchor Text (Last 2-3 Paragraphs)

**Extracting `anchor_text` from Context**

The `anchor_text` consists of the last 2-3 substantial non-heading lines before the diagram, truncated to 300 characters. This provides fine-grained matching capability.
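
A minimal sketch of this step, under the same assumptions as the previous sketch:

```python
# Hypothetical context window, as in the previous sketch.
context_lines = [
    "## Architecture Overview",
    "The scraper runs in three phases.",
    "Each phase writes into a temp directory.",
    "Diagrams are re-inserted afterwards.",
]

# Keep substantial, non-heading lines only.
substantial = [
    line.strip()
    for line in context_lines
    if line.strip() and not line.strip().startswith("#")
]
# Join the last few lines and keep at most the trailing 300 characters.
anchor_text = " ".join(substantial[-3:])[-300:]
```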

Sources: python/deepwiki-scraper.py:1073-1081

## Progressive Chunk Size Matching

The core of the fuzzy matching algorithm uses progressively smaller chunk sizes to find matches, prioritizing longer (more specific) matches over shorter ones.

### Chunk Size Progression

The algorithm tests chunks in this order:

| Chunk Size | Purpose | Match Quality |
|---|---|---|
| 300 chars | Full anchor text | Highest confidence |
| 200 chars | Most of anchor | High confidence |
| 150 chars | Significant portion | Medium-high confidence |
| 100 chars | Key phrases | Medium confidence |
| 80 chars | Minimum viable match | Low confidence |

### Matching Algorithm Flow

**Progressive Chunk Matching in Code**
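
A minimal sketch of the progressive search, assuming a `normalize()` helper as described under Text Normalization below; the `find_best_match()` wrapper is hypothetical and the offset-to-line mapping is approximate:

```python
import re

def normalize(text: str) -> str:
    # Lowercase and collapse all whitespace runs to single spaces.
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_best_match(anchor_text: str, lines: list[str]) -> tuple[int, int]:
    """Return (best_match_line, best_match_score), or (-1, 0) on failure."""
    anchor_normalized = normalize(anchor_text)
    content_normalized = normalize("\n".join(lines))
    for chunk_size in (300, 200, 150, 100, 80):
        test_chunk = anchor_normalized[-chunk_size:]
        pos = content_normalized.find(test_chunk)
        if pos == -1:
            continue  # try a shorter, less specific chunk
        # Map the character offset back to a line number by walking the
        # file and accumulating normalized line lengths (approximate:
        # runs of blank lines can skew the offset slightly).
        consumed = 0
        for line_num, line in enumerate(lines):
            consumed += len(normalize(line)) + 1  # +1 for the joining space
            if consumed > pos:
                return line_num, chunk_size  # longest chunk wins
        return len(lines) - 1, chunk_size
    return -1, 0
```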

Sources: python/deepwiki-scraper.py:1169-1239

## Text Normalization

Both the diagram context and the target Markdown content undergo identical normalization to maximize matching success:
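
A minimal sketch of the normalization, using a hypothetical `normalize()` helper:

```python
import re

def normalize(text: str) -> str:
    # Lowercase, collapse all whitespace runs, and trim the ends.
    return re.sub(r"\s+", " ", text.lower()).strip()

# Both representations collapse to the same comparable string:
assert normalize("Fuzzy  Matching\nAlgorithm ") == "fuzzy matching algorithm"
```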

This process:

- Converts all text to lowercase
- Collapses all consecutive whitespace (spaces, tabs, newlines) into single spaces
- Removes leading/trailing whitespace

Sources: python/deepwiki-scraper.py:1166-1167 python/deepwiki-scraper.py:1185-1186

## Fallback: Heading-Based Matching

If progressive chunk matching fails (`best_match_line == -1`) and a heading exists, the algorithm falls back to heading-based matching:

**Heading Fallback Implementation**
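
A minimal sketch of the fallback, with a hypothetical `heading_fallback()` helper:

```python
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()

def heading_fallback(last_heading: str, lines: list[str]) -> tuple[int, int]:
    """Return (line_num, 50) for the first heading matching last_heading."""
    target = normalize(last_heading)
    for line_num, line in enumerate(lines):
        stripped = line.lstrip()
        if stripped.startswith("#") and normalize(stripped.lstrip("#")) == target:
            return line_num, 50  # fixed score, below any chunk match
    return -1, 0
```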

Heading-based matches receive a fixed `best_match_score` of 50, lower than any chunk-based match (minimum 80), indicating lower confidence.

Sources: python/deepwiki-scraper.py:1203-1216

## Insertion Point Calculation

Once a match is found with `best_match_score >= 80`, the algorithm calculates the precise `insert_line` for the diagram:

### Insertion After Headings

**Calculating `insert_line` After a Heading**
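
A minimal sketch with a hypothetical helper; per the pipeline flowchart below, the actual code additionally routes the result through `enforce_content_start()` and `advance_past_lists()`:

```python
def insert_after_heading(lines: list[str], heading_line: int) -> int:
    # Start just past the matched heading, then skip blank lines so the
    # diagram is inserted at the start of the section body.
    insert_line = heading_line + 1
    while insert_line < len(lines) and not lines[insert_line].strip():
        insert_line += 1
    return insert_line
```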

Sources: python/deepwiki-scraper.py:1222-1230

### Insertion After Paragraphs

**Calculating `insert_line` After a Paragraph**
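
A minimal sketch of the paragraph case, again as a hypothetical helper:

```python
def insert_after_paragraph(lines: list[str], match_line: int) -> int:
    # Advance to the blank line that ends the matched paragraph and
    # insert the diagram immediately after it.
    insert_line = match_line
    while insert_line < len(lines) and lines[insert_line].strip():
        insert_line += 1
    return insert_line
```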

Sources: python/deepwiki-scraper.py:1232-1236

## Scoring and Deduplication

The algorithm tracks which diagrams have been used to prevent duplicates:

For each file, the algorithm:

1. Attempts to match every diagram in `diagram_contexts` against the file content
2. Stores successful matches with their scores in `pending_insertions` as tuples: `(insert_line, diagram, best_match_score, idx)`
3. Marks diagrams as used by adding their index to the `diagrams_used` set
4. Sorts `pending_insertions` by line number (descending) to avoid index shifting
5. Inserts diagrams from bottom to top (see the sketch below)
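
A minimal sketch of steps 2-5, where the `matches` list stands in for the output of the matching loop:

```python
# Hypothetical matcher output: (insert_line, diagram_text, score, index).
matches = [
    (42, "graph TD\n  A --> B", 300, 0),
    (10, "graph LR\n  C --> D", 150, 1),
]

diagrams_used = set()
pending_insertions = []
for insert_line, diagram, score, idx in matches:
    if idx not in diagrams_used:
        pending_insertions.append((insert_line, diagram, score, idx))
        diagrams_used.add(idx)  # each diagram is placed at most once

# Sort descending by line so bottom-up insertion never shifts the
# line numbers of matches that have not been applied yet.
pending_insertions.sort(key=lambda x: x[0], reverse=True)
```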

Sources: python/deepwiki-scraper.py:1162-1163 python/deepwiki-scraper.py:1238 python/deepwiki-scraper.py:1242-1243

## Diagram Insertion Format

Diagrams are inserted with proper Markdown fencing and spacing, accounting for backticks in the diagram content:

**Building `lines_to_insert`**
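
A minimal sketch of the fence-length logic, with a hypothetical `build_lines_to_insert()` helper. The empty strings at each end keep the fenced block separated from surrounding paragraphs:

```python
import re

def build_lines_to_insert(diagram: str) -> list[str]:
    # Use a fence one backtick longer than the longest backtick run in
    # the diagram body (minimum 3), so nested fences cannot break out.
    runs = re.findall(r"`+", diagram)
    fence_len = max(3, max((len(r) for r in runs), default=0) + 1)
    fence = "`" * fence_len
    return ["", fence + "mermaid", *diagram.split("\n"), fence, ""]
```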

This results in the following structure in the Markdown file:

````
Matched paragraph text.

```mermaid
flowchart TD
    ...
```

Next paragraph text.
````

If the diagram contains triple backticks, the fence length is increased (e.g., to 4 or 5 backticks) to avoid conflicts.

Sources: python/deepwiki-scraper.py:1249-1266

## Complete Matching Pipeline

**`extract_and_enhance_diagrams()` Function Flow**

```mermaid
flowchart TD
    Start["extract_and_enhance_diagrams(\nrepo, temp_dir,\nsession, diagram_source_url)"] --> Fetch["response = session.get(\ndiagram_source_url)\nhtml_text = response.text"]
    Fetch --> ExtractPattern["diagram_pattern =\nr'```mermaid(?:\\r\\n|\\n|\r?\n)\n(.*?)(?:\\r\\n|\\n|\r?\n)```'\ndiagram_matches =\nlist(re.finditer(\npattern, html_text, re.DOTALL))"]
    ExtractPattern --> ContextLoop["diagram_contexts = []\nfor match in diagram_matches:\ncontext_start =\nmax(0, match.start() - 2000)"]
    ContextLoop --> ParseContext["extract:\n- last_heading\n- anchor_text[-300:]\n- diagram (unescaped)"]
    ParseContext --> NormalizeDiag["diagram =\nmerge_multiline_labels(diagram)\ndiagram =\nstrip_wrapping_quotes(diagram)\ndiagram =\nnormalize_mermaid_diagram(diagram)"]
    NormalizeDiag --> AppendContext["diagram_contexts.append({\n'last_heading': last_heading,\n'anchor_text': anchor_text,\n'diagram': diagram\n})"]
    AppendContext --> FindFiles["md_files =\nlist(temp_dir.glob('**/*.md'))"]
    
    FindFiles --> FileLoop["for md_file in md_files:"]
    FileLoop --> ReadFile["content = f.read()"]
    ReadFile --> CheckExists["re.search(\nr'^\\s*`{3,}\\s*mermaid\\b',\ncontent, re.IGNORECASE\n| re.MULTILINE)?"]
    CheckExists -->|Yes| Skip["continue"]
    CheckExists -->|No| SplitLines["lines = content.split('\\n')"]
    SplitLines --> InitVars["diagrams_used = set()\npending_insertions = []\ncontent_normalized =\ncontent.lower()"]
    InitVars --> DiagLoop["for idx, item in\nenumerate(diagram_contexts):"]
    
    DiagLoop --> TryChunks["for chunk_size in\n[300, 200, 150, 100, 80]:\ntest_chunk =\nanchor_normalized[-chunk_size:]\npos = content_normalized\n.find(test_chunk)"]
    TryChunks -->|Found| CalcLine["convert pos to line_num\nbest_match_line = line_num\nbest_match_score = chunk_size"]
    TryChunks -->|Not found| TryHeading["heading fallback matching"]
    TryHeading -->|Found| SetScore["best_match_score = 50"]
    CalcLine --> CheckScore["best_match_score >= 80?"]
    SetScore --> CheckScore
    CheckScore -->|Yes| CalcInsert["calculate insert_line:\nenforce_content_start()\nadvance_past_lists()"]
    CalcInsert --> AppendPending["pending_insertions.append(\ninsert_line, diagram,\nbest_match_score, idx)\ndiagrams_used.add(idx)"]
    CheckScore -->|No| NextDiag["next diagram"]
    
    AppendPending --> NextDiag
    NextDiag --> DiagLoop
    DiagLoop -->|All done| SortPending["pending_insertions.sort(\nkey=lambda x: x[0],\nreverse=True)"]
    SortPending --> InsertLoop["for insert_line, diagram,\nscore, idx in\npending_insertions:\ncalculate fence_len\nlines.insert(insert_line,\nlines_to_insert)"]
    InsertLoop --> SaveFile["with open(md_file, 'w')\nas f:\nf.write('\\n'.join(lines))"]
    SaveFile --> FileLoop
    FileLoop -->|All files| PrintStats["print(f'Enhanced\n{enhanced_count} files')"]
```

Sources: python/deepwiki-scraper.py:880-1275

## Key Functions and Variables

| Function/Variable | Location | Purpose |
|---|---|---|
| `extract_and_enhance_diagrams()` | python/deepwiki-scraper.py:880-1275 | Main orchestrator for the diagram enhancement phase |
| `diagram_contexts` | python/deepwiki-scraper.py:903 | List of dicts with `last_heading`, `anchor_text`, `diagram` |
| `first_body_heading_index()` | python/deepwiki-scraper.py:1095-1099 | Finds the first `##` heading in a file |
| `protected_prefix_end()` | python/deepwiki-scraper.py:1101-1115 | Determines where the title and source list end |
| `advance_past_lists()` | python/deepwiki-scraper.py:1125-1137 | Skips over list blocks to avoid insertion inside lists |
| `enforce_content_start()` | python/deepwiki-scraper.py:1138-1147 | Ensures insertion falls after protected sections |
| `diagrams_used` | python/deepwiki-scraper.py:1162 | Set tracking which diagram indices are already placed |
| `pending_insertions` | python/deepwiki-scraper.py:1163 | List of tuples: `(insert_line, diagram, score, idx)` |
| `best_match_line` | python/deepwiki-scraper.py:1175 | Line number where the best match was found |
| `best_match_score` | python/deepwiki-scraper.py:1176 | Score of the best match (`chunk_size`, or 50 for heading fallback) |
| Progressive chunk loop | python/deepwiki-scraper.py:1187-1201 | Tries chunk sizes `[300, 200, 150, 100, 80]` |
| Heading fallback | python/deepwiki-scraper.py:1203-1216 | Matches on heading text when chunk matching fails |
| Insertion calculation | python/deepwiki-scraper.py:1219-1236 | Determines where to insert the diagram after a match |
| Diagram insertion | python/deepwiki-scraper.py:1249-1266 | Inserts the diagram with a dynamic fence length |

## Performance Characteristics

The algorithm processes diagrams in a single pass per file with the following complexity:

| Operation | Complexity | Notes |
|---|---|---|
| Content normalization | O(n) | n = file size in characters |
| Chunk search | O(n × c × d) | c = 5 chunk sizes, d = diagram count |
| Line number conversion | O(L) | L = number of lines in the file |
| Insertion sorting | O(k log k) | k = matched diagrams |
| Bottom-up insertion | O(k × L) | Reverse order avoids index recalculation |

For a typical file with 1000 lines and 48 diagram candidates with context, the algorithm completes in under 100ms per file.

Sources: python/deepwiki-scraper.py:880-1275

## Match Quality Statistics

As reported in the console output, the algorithm typically achieves:

- Total diagrams in JavaScript: ~461 diagrams across all pages
- Diagrams with sufficient context: ~48 diagrams (500+ character context)
- Average match rate: 60-80% of diagrams with context are successfully placed
- Typical score distribution:
  - 300-char matches: 20-30% (highest confidence)
  - 200-char matches: 15-25%
  - 150-char matches: 15-20%
  - 100-char matches: 10-15%
  - 80-char matches: 5-10%
  - Heading fallback: 5-10% (lowest confidence)

Sources: README.md:132-136 tools/deepwiki-scraper.py:674
