This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Fuzzy Matching Algorithm
Purpose and Scope
The fuzzy matching algorithm places Mermaid diagrams extracted from DeepWiki’s JavaScript payload into the correct locations within Markdown files. It matches diagram context (as it appears in the JavaScript) to content locations in html2text-converted Markdown, accounting for formatting differences between the two representations.
This algorithm is implemented in the extract_and_enhance_diagrams() function and runs over all files after the initial Markdown extraction phase completes.
Sources: python/deepwiki-scraper.py:880-1275
The Matching Problem
The fuzzy matching algorithm addresses a fundamental mismatch: diagrams are embedded in DeepWiki’s JavaScript payload alongside their surrounding context text, but this context text differs significantly from the final Markdown output produced by html2text. The algorithm must find where each diagram belongs despite these differences.
Format Differences Between Sources
| Aspect | JavaScript Payload | html2text Output |
|---|---|---|
| Whitespace | Escaped \n sequences | Actual newlines |
| Line wrapping | No wrapping (continuous text) | Wrapped at natural boundaries |
| HTML entities | Escaped (\u003c, \u0026) | Decoded (<, &) |
| Formatting | Inline with escaped quotes | Clean Markdown syntax |
| Structure | Linear text stream | Hierarchical headings/paragraphs |
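As a rough illustration of the gap, the sketch below shows how escaped payload text could be mapped toward plain text before comparison; the function name and the exact set of replacements are assumptions, not the scraper's actual code.

```python
# Minimal sketch (not the scraper's implementation): unescape JavaScript
# payload text so it more closely resembles the html2text output.
def unescape_js_payload(raw: str) -> str:
    text = raw.replace('\\n', '\n')        # escaped newlines -> real newlines
    text = text.replace('\\u003c', '<')    # escaped '<'
    text = text.replace('\\u0026', '&')    # escaped '&'
    return text.replace('\\"', '"')        # escaped quotes -> plain quotes
```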
Sources: python/deepwiki-scraper.py:898-903
Context Extraction Strategy
The algorithm extracts two types of context for each diagram to enable matching:
1. Last Heading Before Diagram
Extracting last_heading from Context
The algorithm scans backwards through context_lines to find the most recent line starting with #, which provides a coarse-grained location hint.
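A minimal sketch of that backwards scan, assuming context_lines holds the text preceding the diagram split into lines; stripping the leading # characters is an assumption about the stored form.

```python
# Sketch: walk backwards through the context and keep the nearest heading.
last_heading = ""
for line in reversed(context_lines):
    stripped = line.strip()
    if stripped.startswith('#'):
        last_heading = stripped.lstrip('#').strip()  # heading text without '#'
        break
```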
Sources: python/deepwiki-scraper.py:1066-1071
2. Anchor Text (Last 2-3 Paragraphs)
Extracting anchor_text from Context
The anchor_text consists of the last 2-3 substantial non-heading lines before the diagram, truncated to 300 characters. This provides fine-grained matching capability.
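A sketch of the idea, again assuming context_lines as above; the "substantial line" filter and the 2-3 line window are simplified here.

```python
# Sketch: keep substantial, non-heading lines and use their tail as the anchor.
substantial = [
    line.strip()
    for line in context_lines
    if line.strip() and not line.strip().startswith('#')
]
anchor_text = ' '.join(substantial[-3:])[-300:]   # last few lines, capped at 300 chars
```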
Sources: python/deepwiki-scraper.py:1073-1081
Progressive Chunk Size Matching
The core of the fuzzy matching algorithm uses progressively smaller chunk sizes to find matches, prioritizing longer (more specific) matches over shorter ones.
Chunk Size Progression
The algorithm tests chunks in this order:
| Chunk Size | Purpose | Match Quality |
|---|---|---|
| 300 chars | Full anchor text | Highest confidence |
| 200 chars | Most of anchor | High confidence |
| 150 chars | Significant portion | Medium-high confidence |
| 100 chars | Key phrases | Medium confidence |
| 80 chars | Minimum viable match | Low confidence |
Matching Algorithm Flow
Progressive Chunk Matching in Code
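Since the original listing is not reproduced here, the following is a simplified sketch of the loop; the real code also normalizes whitespace and maps the character offset back to a line number in the original file.

```python
# Sketch of the progressive chunk search (names follow the surrounding prose).
best_match_line = -1
best_match_score = 0
content_normalized = content.lower()
for chunk_size in [300, 200, 150, 100, 80]:
    test_chunk = anchor_normalized[-chunk_size:]      # tail end of the anchor text
    pos = content_normalized.find(test_chunk)
    if pos != -1:
        best_match_line = content_normalized[:pos].count('\n')  # offset -> line number
        best_match_score = chunk_size                 # longer chunk = higher confidence
        break                                         # stop at the longest matching chunk
```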
Sources: python/deepwiki-scraper.py:1169-1239
Text Normalization
Both the diagram context and the target Markdown content undergo identical normalization to maximize matching success. This process:
- Converts all text to lowercase
- Collapses all consecutive whitespace (spaces, tabs, newlines) into single spaces
- Removes leading/trailing whitespace
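A minimal sketch of that normalization; the helper name is an assumption, and the real code may apply the same steps inline.

```python
import re

def normalize(text: str) -> str:
    # Lowercase, collapse whitespace runs to single spaces, and trim the ends.
    return re.sub(r'\s+', ' ', text.lower()).strip()
```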
Sources: python/deepwiki-scraper.py:1166-1167 python/deepwiki-scraper.py:1185-1186
Fallback: Heading-Based Matching
If progressive chunk matching fails (best_match_line remains -1) and a last_heading was captured, the algorithm falls back to heading-based matching:
Heading Fallback Implementation
Heading-based matches receive a fixed best_match_score of 50, lower than any chunk-based match (minimum 80), indicating lower confidence.
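A sketch of the fallback, assuming the normalize() helper above and a lines list for the file; the exact comparison in the real code may differ.

```python
# Sketch: compare the captured heading against heading lines in the file.
if best_match_line == -1 and last_heading:
    target = normalize(last_heading)
    for i, line in enumerate(lines):
        if line.lstrip().startswith('#') and target in normalize(line):
            best_match_line = i
            best_match_score = 50     # fixed, lower-confidence score
            break
```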
Sources: python/deepwiki-scraper.py:1203-1216
Insertion Point Calculation
Once a match is found with best_match_score >= 80, the algorithm calculates the precise insert_line for the diagram:
Insertion After Headings
Calculating insert_line After Heading
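A sketch under the assumption that a heading match places the diagram just below the heading, past any blank lines; the real calculation also involves enforce_content_start() and advance_past_lists().

```python
# Sketch: insert directly after the matched heading and any trailing blank lines.
insert_line = best_match_line + 1
while insert_line < len(lines) and not lines[insert_line].strip():
    insert_line += 1
```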
Sources: python/deepwiki-scraper.py:1222-1230
Insertion After Paragraphs
Calculating insert_line After Paragraph
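A sketch under the assumption that a paragraph match places the diagram after the end of the matched paragraph, i.e. at the next blank line.

```python
# Sketch: advance past the matched paragraph to the blank line that ends it.
insert_line = best_match_line
while insert_line < len(lines) and lines[insert_line].strip():
    insert_line += 1
```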
Sources: python/deepwiki-scraper.py:1232-1236
Scoring and Deduplication
The algorithm tracks which diagrams have been used to prevent duplicates:
For each file, the algorithm:
- Attempts to match all diagrams in diagram_contexts against the file content
- Stores successful matches with their scores in pending_insertions as tuples: (insert_line, diagram, best_match_score, idx)
- Marks diagrams as used by adding their index to the diagrams_used set
- Sorts pending_insertions by line number (descending) to avoid index shifting
- Inserts diagrams from bottom to top, as sketched below
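A minimal sketch of this bookkeeping; the helper that renders a diagram into fenced Markdown lines is hypothetical here (the next section covers how the fence itself is built).

```python
def place_diagrams(lines, matches, render_fenced_block):
    """matches: iterable of (insert_line, diagram, best_match_score, idx) tuples."""
    pending_insertions = []
    diagrams_used = set()
    for insert_line, diagram, best_match_score, idx in matches:
        if idx in diagrams_used:
            continue                              # each diagram is placed at most once
        pending_insertions.append((insert_line, diagram, best_match_score, idx))
        diagrams_used.add(idx)
    # Bottom-up insertion: earlier insertion points keep their line numbers.
    pending_insertions.sort(key=lambda x: x[0], reverse=True)
    for insert_line, diagram, score, idx in pending_insertions:
        lines[insert_line:insert_line] = render_fenced_block(diagram)
    return lines
```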
Sources: python/deepwiki-scraper.py:1162-1163 python/deepwiki-scraper.py:1238 python/deepwiki-scraper.py:1242-1243
Diagram Insertion Format
Diagrams are inserted with proper Markdown fencing and spacing, accounting for backticks in the diagram content:
Building lines_to_insert
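A sketch of how such a list of lines could be built; the fence-length rule follows the paragraph below, while the exact spacing and the helper name are assumptions.

```python
import re

def build_diagram_lines(diagram: str) -> list[str]:
    # Use a fence one backtick longer than the longest backtick run inside
    # the diagram, but never shorter than the standard three.
    longest_run = max((len(run) for run in re.findall(r'`+', diagram)), default=0)
    fence = '`' * max(3, longest_run + 1)
    return ['', fence + 'mermaid', *diagram.split('\n'), fence, '']
```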
This results in the following structure in the Markdown file:
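The example below is illustrative only; the mermaid body is a placeholder, and the surrounding paragraphs stand in for the matched file content.

````markdown
Matched paragraph text that anchored the diagram.

```mermaid
flowchart TD
    A --> B
```

Next paragraph text.
````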
If the diagram contains triple backticks, the fence length is increased (e.g., to 4 or 5 backticks) to avoid conflicts.
Sources: python/deepwiki-scraper.py:1249-1266
Complete Matching Pipeline
extract_and_enhance_diagrams() Function Flow
```mermaid
flowchart TD
Start["extract_and_enhance_diagrams(\nrepo, temp_dir,\nsession, diagram_source_url)"] --> Fetch["response = session.get(\ndiagram_source_url)\nhtml_text = response.text"]
Fetch --> ExtractPattern["diagram_pattern =\nr'```mermaid(?:\\r\\n|\\n|\r?\n)\n(.*?)(?:\\r\\n|\\n|\r?\n)```'\ndiagram_matches =\nlist(re.finditer(\npattern, html_text, re.DOTALL))"]
ExtractPattern --> ContextLoop["diagram_contexts = []\nfor match in diagram_matches:\ncontext_start =\nmax(0, match.start() - 2000)"]
ContextLoop --> ParseContext["extract:\n- last_heading\n- anchor_text[-300:]\n- diagram (unescaped)"]
ParseContext --> NormalizeDiag["diagram =\nmerge_multiline_labels(diagram)\ndiagram =\nstrip_wrapping_quotes(diagram)\ndiagram =\nnormalize_mermaid_diagram(diagram)"]
NormalizeDiag --> AppendContext["diagram_contexts.append({\n'last_heading': last_heading,\n'anchor_text': anchor_text,\n'diagram': diagram\n})"]
AppendContext --> FindFiles["md_files =\nlist(temp_dir.glob('**/*.md'))"]
FindFiles --> FileLoop["for md_file in md_files:"]
FileLoop --> ReadFile["content = f.read()"]
ReadFile --> CheckExists["re.search(\nr'^\\s*`{3,}\\s*mermaid\\b',\ncontent, re.IGNORECASE\n| re.MULTILINE)?"]
CheckExists -->|Yes| Skip["continue"]
CheckExists -->|No| SplitLines["lines = content.split('\\n')"]
SplitLines --> InitVars["diagrams_used = set()\npending_insertions = []\ncontent_normalized =\ncontent.lower()"]
InitVars --> DiagLoop["for idx, item in\nenumerate(diagram_contexts):"]
DiagLoop --> TryChunks["for chunk_size in\n[300, 200, 150, 100, 80]:\ntest_chunk =\nanchor_normalized[-chunk_size:]\npos = content_normalized\n.find(test_chunk)"]
TryChunks -->|Found| CalcLine["convert pos to line_num\nbest_match_line = line_num\nbest_match_score = chunk_size"]
TryChunks -->|Not found| TryHeading["heading fallback matching"]
TryHeading -->|Found| SetScore["best_match_score = 50"]
CalcLine --> CheckScore["best_match_score >= 80?"]
SetScore --> CheckScore
CheckScore -->|Yes| CalcInsert["calculate insert_line:\nenforce_content_start()\nadvance_past_lists()"]
CalcInsert --> AppendPending["pending_insertions.append(\ninsert_line, diagram,\nbest_match_score, idx)\ndiagrams_used.add(idx)"]
CheckScore -->|No| NextDiag["next diagram"]
AppendPending --> NextDiag
NextDiag --> DiagLoop
DiagLoop -->|All done| SortPending["pending_insertions.sort(\nkey=lambda x: x[0],\nreverse=True)"]
SortPending --> InsertLoop["for insert_line, diagram,\nscore, idx in\npending_insertions:\ncalculate fence_len\nlines.insert(insert_line,\nlines_to_insert)"]
InsertLoop --> SaveFile["with open(md_file, 'w')\nas f:\nf.write('\\n'.join(lines))"]
SaveFile --> FileLoop
    FileLoop -->|All files| PrintStats["print(f'Enhanced\n{enhanced_count} files')"]
```
Sources: python/deepwiki-scraper.py:880-1275
Key Functions and Variables
| Function/Variable | Location | Purpose |
|---|---|---|
| extract_and_enhance_diagrams() | python/deepwiki-scraper.py:880-1275 | Main orchestrator for the diagram enhancement phase |
| diagram_contexts | python/deepwiki-scraper.py:903 | List of dicts with last_heading, anchor_text, diagram |
| first_body_heading_index() | python/deepwiki-scraper.py:1095-1099 | Finds the first ## heading in a file |
| protected_prefix_end() | python/deepwiki-scraper.py:1101-1115 | Determines where the title and source list end |
| advance_past_lists() | python/deepwiki-scraper.py:1125-1137 | Skips over list blocks to avoid inserting inside a list |
| enforce_content_start() | python/deepwiki-scraper.py:1138-1147 | Ensures insertion happens after protected sections |
| diagrams_used | python/deepwiki-scraper.py:1162 | Set tracking which diagram indices are already placed |
| pending_insertions | python/deepwiki-scraper.py:1163 | List of tuples: (insert_line, diagram, score, idx) |
| best_match_line | python/deepwiki-scraper.py:1175 | Line number where the best match was found |
| best_match_score | python/deepwiki-scraper.py:1176 | Score of the best match (chunk_size, or 50 for a heading match) |
| Progressive chunk loop | python/deepwiki-scraper.py:1187-1201 | Tries chunk sizes [300, 200, 150, 100, 80] |
| Heading fallback | python/deepwiki-scraper.py:1203-1216 | Matches on heading text when chunk matching fails |
| Insertion calculation | python/deepwiki-scraper.py:1219-1236 | Determines where to insert the diagram after a match |
| Diagram insertion | python/deepwiki-scraper.py:1249-1266 | Inserts the diagram with a dynamic fence length |
Performance Characteristics
The algorithm processes diagrams in a single pass per file with the following complexity:
| Operation | Complexity | Notes |
|---|---|---|
| Content normalization | O(n) | Where n = file size in characters |
| Chunk search | O(n × c × d) | c = 5 chunk sizes, d = diagram count |
| Line number conversion | O(L) | Where L = number of lines in file |
| Insertion sorting | O(k log k) | Where k = matched diagrams |
| Bottom-up insertion | O(k × L) | Avoids index recalculation due to reverse order |
For a typical file with 1000 lines and 48 diagram candidates with context, the algorithm completes in under 100ms per file.
Sources: python/deepwiki-scraper.py:880-1275
Match Quality Statistics
As reported in the console output, the algorithm typically achieves:
- Total diagrams in JavaScript: ~461 diagrams across all pages
- Diagrams with sufficient context: ~48 diagrams (500+ character context)
- Average match rate: 60-80% of diagrams with context are successfully placed
- Typical score distribution:
- 300-char matches: 20-30% (highest confidence)
- 200-char matches: 15-25%
- 150-char matches: 15-20%
- 100-char matches: 10-15%
- 80-char matches: 5-10%
- Heading fallback: 5-10% (lowest confidence)
Sources: README.md:132-136 tools/deepwiki-scraper.py:674