Phase 1: Markdown Extraction
This page documents Phase 1 of the three-phase processing pipeline, which handles the extraction and initial conversion of wiki content from DeepWiki.com into clean Markdown files. Phase 1 produces raw Markdown files in a temporary directory before diagram enhancement (Phase 2, see #7) and mdBook HTML generation (Phase 3, see #8).
For detailed information about specific sub-processes within Phase 1, see the dedicated sub-pages.
Scope and Objectives
Phase 1 accomplishes the following:
- Discover all wiki pages and their hierarchical structure from DeepWiki
- Fetch HTML content for each page via HTTP requests
- Parse HTML to extract main content and remove UI elements
- Convert cleaned HTML to Markdown using html2text
- Organize output files into a hierarchical directory structure
- Save to a temporary directory for subsequent processing
This phase is implemented entirely in Python within deepwiki-scraper.py and operates independently of Phases 2 and 3.
Sources: README.md:121-128 tools/deepwiki-scraper.py:790-876
Phase 1 Execution Flow
The following diagram shows the complete execution flow of Phase 1, mapping high-level steps to specific functions in the codebase:
Sources: tools/deepwiki-scraper.py:790-876 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594
```mermaid
flowchart TD
Start["main()\nEntry Point"]
CreateTemp["Create tempfile.TemporaryDirectory()"]
CreateSession["requests.Session()\nwith User-Agent"]
DiscoverPhase["Structure Discovery Phase"]
ExtractWiki["extract_wiki_structure(repo, session)"]
ParseLinks["BeautifulSoup: find_all('a', href=pattern)"]
SortPages["sort by page number (handle dots)"]
ExtractionPhase["Content Extraction Phase"]
LoopPages["For each page in pages list"]
FetchContent["extract_page_content(url, session, page_info)"]
FetchHTML["fetch_page(url, session)
with retries"]
ParseHTML["BeautifulSoup(response.text)"]
RemoveNav["Remove nav/header/footer/aside elements"]
FindContent["Find main content: article/main/[role='main']"]
ConvertPhase["Conversion Phase"]
ConvertMD["convert_html_to_markdown(html_content)"]
HTML2Text["html2text.HTML2Text with body_width=0"]
CleanFooter["clean_deepwiki_footer(markdown)"]
FixLinks["Regex replace: wiki links → .md paths"]
SavePhase["File Organization Phase"]
DetermineLevel{"page['level'] == 0?"}
SaveRoot["Save to temp_dir/NUM-title.md"]
CreateSubdir["Create temp_dir/section-N/"]
SaveSubdir["Save to section-N/NUM-title.md"]
NextPage{"More pages?"}
Complete["Phase 1 Complete: temp_dir contains all .md files"]
Start --> CreateTemp
CreateTemp --> CreateSession
CreateSession --> DiscoverPhase
DiscoverPhase --> ExtractWiki
ExtractWiki --> ParseLinks
ParseLinks --> SortPages
SortPages --> ExtractionPhase
ExtractionPhase --> LoopPages
LoopPages --> FetchContent
FetchContent --> FetchHTML
FetchHTML --> ParseHTML
ParseHTML --> RemoveNav
RemoveNav --> FindContent
FindContent --> ConvertPhase
ConvertPhase --> ConvertMD
ConvertMD --> HTML2Text
HTML2Text --> CleanFooter
CleanFooter --> FixLinks
FixLinks --> SavePhase
SavePhase --> DetermineLevel
DetermineLevel -->|Yes: Main Page| SaveRoot
DetermineLevel -->|No: Subsection| CreateSubdir
CreateSubdir --> SaveSubdir
SaveRoot --> NextPage
SaveSubdir --> NextPage
NextPage -->|Yes| LoopPages
NextPage -->|No| Complete
```
Core Components and Data Flow
Structure Discovery Pipeline
The structure discovery process identifies all wiki pages and builds a hierarchical page list:
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:90-116 tools/deepwiki-scraper.py:118-123
```mermaid
flowchart LR
subgraph Input
BaseURL["Base URL\ndeepwiki.com/owner/repo"]
end
subgraph extract_wiki_structure
FetchMain["fetch_page(base_url)"]
ParseSoup["BeautifulSoup(response.text)"]
FindLinks["soup.find_all('a', href=regex)"]
ExtractInfo["Extract page number & title\nRegex: /(\d+(?:\.\d+)*)-(.+)$"]
CalcLevel["Calculate level from dots\nlevel = page_num.count('.')"]
BuildPages["Build pages list with metadata"]
SortFunc["Sort by sort_key(page)\nparts = [int(x)
for x in num.split('.')]"]
end
subgraph Output
PagesList["List[Dict]\n{'number': '2.1',\n'title': 'Section',\n'url': full_url,\n'href': path,\n'level': 1}"]
end
BaseURL --> FetchMain
FetchMain --> ParseSoup
ParseSoup --> FindLinks
FindLinks --> ExtractInfo
ExtractInfo --> CalcLevel
CalcLevel --> BuildPages
BuildPages --> SortFunc
SortFunc --> PagesList
```
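The discovery logic can be sketched in a few lines. This is a reconstruction from the diagram, not the actual extract_wiki_structure() body: the href shape (/owner/repo/2-1-title, dash-separated numbers normalized to the dotted form) and the deduplication set are assumptions.

```python
import re
import requests
from bs4 import BeautifulSoup

def extract_wiki_structure(repo: str, session: requests.Session) -> list[dict]:
    """Sketch: discover wiki pages and return an ordered list of metadata dicts."""
    base_url = f"https://deepwiki.com/{repo}"
    soup = BeautifulSoup(session.get(base_url, timeout=30).text, "html.parser")

    pages = []
    seen = set()
    for a in soup.find_all("a", href=re.compile(rf"^/{re.escape(repo)}/\d+")):
        href = a["href"]
        # Assumed href shape: /owner/repo/2-1-some-title -> number '2.1', slug 'some-title'
        match = re.search(r"/(\d+(?:-\d+)*)-(.+)$", href)
        if not match or href in seen:
            continue
        seen.add(href)
        page_num = match.group(1).replace("-", ".")
        pages.append({
            "number": page_num,
            "title": a.get_text(strip=True),
            "href": href,
            "url": f"https://deepwiki.com{href}",
            "level": page_num.count("."),   # '2.1' -> level 1, '2' -> level 0
        })

    # Numeric sort so '2.10' follows '2.9' rather than '2.1'
    pages.sort(key=lambda p: [int(x) for x in p["number"].split(".")])
    return pages
```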
Content Extraction and Cleaning
Each page undergoes a multi-step cleaning process to remove DeepWiki UI elements:
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-190 tools/deepwiki-scraper.py:127-173
```mermaid
flowchart TD
subgraph fetch_page
MakeRequest["requests.get(url, headers)\nUser-Agent: Mozilla/5.0..."]
RetryLogic["Retry up to 3 times\n2 second delay between attempts"]
CheckStatus["response.raise_for_status()"]
end
subgraph extract_page_content
ParsePage["BeautifulSoup(response.text)"]
RemoveUnwanted["Decompose: nav, header, footer,\naside, .sidebar, script, style"]
FindMain["Try selectors in order:\narticle → main → .wiki-content\n→ [role='main'] → body"]
RemoveUI["Remove DeepWiki UI elements:\n'Edit Wiki', 'Last indexed:',\n'Index your code with Devin'"]
RemoveNavLists["Remove navigation <ul> lists\n(80%+ internal wiki links)"]
end
subgraph convert_html_to_markdown
HTML2TextInit["h = html2text.HTML2Text()\nh.ignore_links = False\nh.body_width = 0"]
HandleContent["markdown = h.handle(html_content)"]
CleanFooterCall["clean_deepwiki_footer(markdown)"]
end
subgraph clean_deepwiki_footer
SplitLines["lines = markdown.split('\\n')"]
ScanBackward["Scan last 50 lines backward\nfor footer patterns"]
MatchPatterns["Regex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
TruncateLines["lines = lines[:footer_start]"]
RemoveEmpty["Remove trailing empty lines"]
end
MakeRequest --> RetryLogic
RetryLogic --> CheckStatus
CheckStatus --> ParsePage
ParsePage --> RemoveUnwanted
RemoveUnwanted --> FindMain
FindMain --> RemoveUI
RemoveUI --> RemoveNavLists
RemoveNavLists --> HTML2TextInit
HTML2TextInit --> HandleContent
HandleContent --> CleanFooterCall
CleanFooterCall --> SplitLines
SplitLines --> ScanBackward
ScanBackward --> MatchPatterns
MatchPatterns --> TruncateLines
TruncateLines --> RemoveEmpty
```
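A compact sketch of the conversion and footer-cleanup steps follows. It mirrors the settings and patterns named in the diagram, but the FOOTER_PATTERNS list and the exact scan logic are illustrative approximations of clean_deepwiki_footer(), not the real implementation.

```python
import re
import html2text

# Illustrative footer markers; the real pattern list may be longer
FOOTER_PATTERNS = [
    r"^Dismiss$",
    r"Refresh this wiki",
    r"^On this page",
    r"Edit Wiki",
]

def clean_deepwiki_footer(markdown: str) -> str:
    """Sketch: drop DeepWiki footer/UI lines from the tail of the converted Markdown."""
    lines = markdown.split("\n")
    footer_start = None
    # Only scan the last 50 lines; the footer always sits at the bottom of the page.
    for i in range(len(lines) - 1, max(len(lines) - 50, -1), -1):
        if any(re.search(p, lines[i]) for p in FOOTER_PATTERNS):
            footer_start = i
    if footer_start is not None:
        lines = lines[:footer_start]
    while lines and not lines[-1].strip():
        lines.pop()                      # trim trailing blank lines
    return "\n".join(lines)

def convert_html_to_markdown(html_content: str) -> str:
    """Sketch: HTML -> Markdown with links kept and no hard wrapping."""
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.body_width = 0                     # do not wrap long lines
    return clean_deepwiki_footer(h.handle(html_content))
```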
Link Rewriting Logic
Phase 1 transforms internal DeepWiki links into relative Markdown file paths. The rewriting logic accounts for the hierarchical directory structure:
Sources: tools/deepwiki-scraper.py:549-592
```mermaid
flowchart TD
subgraph Input
WikiLink["DeepWiki Link\n[text](/owner/repo/2-1-section)"]
SourcePage["Current Page Info\n{level: 1, number: '2.1'}"]
end
subgraph fix_wiki_link
ExtractPath["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
ParseNumbers["Extract: page_num='2.1', slug='section'"]
ConvertNum["file_num = page_num.replace('.', '-')\nResult: '2-1'"]
CheckTarget{"Is target\nsubsection?\n(has '.')"}
CheckSource{"Is source\nsubsection?\n(level > 0)"}
CheckSame{"Same main\nsection?"}
PathSameSection["Relative path:\nfile_num-slug.md"]
PathDiffSection["Full path:\nsection-N/file_num-slug.md"]
PathToMain["Up one level:\n../file_num-slug.md"]
PathMainToMain["Same level:\nfile_num-slug.md"]
end
subgraph Output
MDLink["Markdown Link\n[text](2-1-section.md)\nor [text](section-2/2-1-section.md)\nor [text](../2-1-section.md)"]
end
WikiLink --> ExtractPath
ExtractPath --> ParseNumbers
ParseNumbers --> ConvertNum
ConvertNum --> CheckTarget
CheckTarget -->|Yes| CheckSource
CheckTarget -->|No: Main Page| CheckSource
CheckSource -->|Target: Sub, Source: Sub| CheckSame
CheckSource -->|Target: Sub, Source: Main| PathDiffSection
CheckSource -->|Target: Main, Source: Sub| PathToMain
CheckSource -->|Target: Main, Source: Main| PathMainToMain
CheckSame -->|Yes| PathSameSection
CheckSame -->|No| PathDiffSection
PathSameSection --> MDLink
PathDiffSection --> MDLink
PathToMain --> MDLink
PathMainToMain --> MDLink
```
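The decision tree translates into a small path-resolution function. The sketch below is illustrative rather than the actual fix_wiki_link logic: it assumes the href already carries the dash form (2-1-section) and, unlike the simplified diagram, prepends ../ when a subsection links into a different section's directory.

```python
import re

def fix_wiki_link(href: str, source_page: dict) -> str | None:
    """Sketch: turn a DeepWiki href into a relative .md path, given the current page's info."""
    match = re.search(r"/(\d+(?:-\d+)*)-(.+)$", href)
    if not match:
        return None
    file_num, slug = match.groups()          # e.g. '2-1', 'section'
    page_num = file_num.replace("-", ".")    # '2.1'

    target_is_sub = "." in page_num
    source_is_sub = source_page["level"] > 0
    target_main = page_num.split(".")[0]     # main-section number, e.g. '2'
    source_main = source_page["number"].split(".")[0]

    if target_is_sub and source_is_sub and target_main == source_main:
        return f"{file_num}-{slug}.md"                      # same section-N/ directory
    if target_is_sub:
        prefix = "../" if source_is_sub else ""
        return f"{prefix}section-{target_main}/{file_num}-{slug}.md"
    if source_is_sub:
        return f"../{file_num}-{slug}.md"                   # subsection -> main page
    return f"{file_num}-{slug}.md"                          # main page -> main page

# Example: linking from subsection 3.2 to subsection 2.1 in a different section
# fix_wiki_link("/owner/repo/2-1-workspace", {"number": "3.2", "level": 1})
# -> '../section-2/2-1-workspace.md'
```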
File Organization Strategy
Phase 1 organizes output files into a hierarchical directory structure based on page levels:
Directory Structure Rules
| Page Level | Page Number Format | Directory Location | Filename Pattern | Example |
|---|---|---|---|---|
| 0 (Main) | 1, 2, 3 | temp_dir/ (root) | {num}-{slug}.md | 1-overview.md |
| 1 (Subsection) | 2.1, 3.4 | temp_dir/section-{N}/ | {num}-{slug}.md | section-2/2-1-workspace.md |
File Organization Implementation
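The following is a minimal sketch of the save step implied by the rules above. The save_page and slugify helpers are hypothetical names; only the placement rules (level 0 at the temp root, level 1 under section-N/) come from the table.

```python
from pathlib import Path

def slugify(title: str) -> str:
    """Hypothetical helper: lower-case the title and join words with dashes."""
    return "-".join("".join(c if c.isalnum() else " " for c in title.lower()).split())

def save_page(temp_dir: Path, page: dict, markdown: str) -> Path:
    """Sketch: write a page's Markdown to the location dictated by its level."""
    file_num = page["number"].replace(".", "-")          # '2.1' -> '2-1'
    filename = f"{file_num}-{slugify(page['title'])}.md"

    if page["level"] == 0:
        target = temp_dir / filename                     # main page at the temp root
    else:
        main_section = page["number"].split(".")[0]
        subdir = temp_dir / f"section-{main_section}"    # e.g. section-2/
        subdir.mkdir(parents=True, exist_ok=True)
        target = subdir / filename

    target.write_text(markdown, encoding="utf-8")
    return target
```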
Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:845-868
HTTP Session Configuration
Phase 1 uses a persistent requests.Session with browser-like headers and retry logic:
Session Setup
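A minimal sketch of the session setup, assuming a single shared requests.Session whose User-Agent string (illustrative here) makes requests look like an ordinary browser:

```python
import requests

session = requests.Session()
session.headers.update({
    # Browser-like User-Agent so DeepWiki serves the normal HTML pages
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
})
```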
Retry Strategy
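And a sketch of the retry behaviour described for fetch_page(): up to three attempts with a short fixed pause between them (the real script may time or back off differently):

```python
import time
import requests

def fetch_page(url: str, session: requests.Session, retries: int = 3) -> requests.Response:
    """Sketch: GET a page, retrying a couple of times before giving up."""
    last_error: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(2)            # fixed pause between attempts
    raise last_error                     # all attempts failed
```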
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:817-821
Data Structures
Page Metadata Dictionary
Each page discovered by extract_wiki_structure() is represented as a dictionary:
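A representative entry, with the fields shown in the discovery diagram (the repository path and title are illustrative):

```python
page = {
    "number": "2.1",                     # hierarchical page number
    "title": "Workspace and Crates",     # link text from the wiki index
    "href": "/owner/repo/2-1-workspace-and-crates",
    "url": "https://deepwiki.com/owner/repo/2-1-workspace-and-crates",
    "level": 1,                          # number of dots in "number"
}
```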
Sources: tools/deepwiki-scraper.py:109-115
BeautifulSoup Content Selectors
Phase 1 attempts multiple selector strategies to find main content, in priority order:
| Priority | Selector Type | Selector Value | Rationale |
|---|---|---|---|
| 1 | CSS Selector | article | Semantic HTML5 element for main content |
| 2 | CSS Selector | main | HTML5 main landmark element |
| 3 | CSS Selector | .wiki-content | Common class name for wiki content |
| 4 | CSS Selector | .content | Generic content class |
| 5 | CSS Selector | #content | Generic content ID |
| 6 | CSS Selector | .markdown-body | GitHub-style markdown container |
| 7 | Attribute | role="main" | ARIA landmark role |
| 8 | Fallback | body | Last resort: entire body |
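A sketch of this fallback chain as a plain loop, using BeautifulSoup's select_one() for the CSS selectors and find() for the ARIA role; find_main_content is a hypothetical name for the inline logic:

```python
from bs4 import BeautifulSoup

CONTENT_SELECTORS = ["article", "main", ".wiki-content", ".content", "#content", ".markdown-body"]

def find_main_content(soup: BeautifulSoup):
    """Sketch: return the first element matching the selector priority list."""
    for selector in CONTENT_SELECTORS:
        element = soup.select_one(selector)
        if element:
            return element
    element = soup.find(attrs={"role": "main"})     # ARIA landmark fallback
    if element:
        return element
    return soup.body                                # last resort: the whole body
```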
Sources: tools/deepwiki-scraper.py:472-484
Error Handling and Robustness
Page Extraction Error Handling
Phase 1 implements graceful degradation for individual page failures:
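A sketch of the graceful-degradation loop: one failing page is reported and skipped rather than aborting the run. The extract and save callables stand in for the extraction and file-organization steps; the real error handling and messages may differ.

```python
from pathlib import Path
from typing import Callable

def extract_all_pages(pages: list[dict], extract: Callable[[dict], str],
                      save: Callable[[dict, str], Path]) -> int:
    """Sketch: process every page, skipping (not aborting on) individual failures."""
    success = 0
    for page in pages:
        try:
            markdown = extract(page)
            save(page, markdown)
            success += 1
        except Exception as exc:                     # one bad page must not kill the run
            print(f"  ✗ Failed to extract {page['number']}: {exc}")
    print(f"✓ Successfully extracted {success}/{len(pages)} pages to temp directory")
    return success
```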
Sources: tools/deepwiki-scraper.py:841-876
Content Extraction Fallbacks
If primary content selectors fail, Phase 1 applies fallback strategies:
- Content Selector Fallback Chain: Try 8 different selectors (see table above)
- Empty Content Check: Raises exception if no content element found tools/deepwiki-scraper.py:486-487
- HTTP Retry Logic: 3 attempts with exponential backoff
- Session Persistence: Reuses TCP connections for efficiency
Sources: tools/deepwiki-scraper.py:472-487 tools/deepwiki-scraper.py:27-42
Output Format
Temporary Directory Structure
At the end of Phase 1, the temporary directory contains the following structure:
```
temp_dir/
├── 1-overview.md                      # Main page (level 0)
├── 2-architecture.md                  # Main page (level 0)
├── 3-components.md                    # Main page (level 0)
├── section-2/                         # Subsections of page 2
│   ├── 2-1-workspace-and-crates.md    # Subsection (level 1)
│   └── 2-2-dependency-graph.md        # Subsection (level 1)
└── section-4/                         # Subsections of page 4
    ├── 4-1-logical-planning.md
    └── 4-2-physical-planning.md
```
Markdown File Format
Each generated Markdown file has the following characteristics:
- Title: Always starts with a # {Page Title} heading
- Content: Cleaned HTML converted to Markdown via html2text
- Links: Internal wiki links rewritten to relative .md paths
- No Diagrams: Diagrams are added in Phase 2 (see #7)
- No Footer: DeepWiki UI elements removed via clean_deepwiki_footer()
- Encoding: UTF-8
Sources: tools/deepwiki-scraper.py:862-868 tools/deepwiki-scraper.py:127-173
Phase 1 Completion Criteria
Phase 1 is considered complete when:
- All pages discovered by extract_wiki_structure() have been processed
- Each page's Markdown file has been written to the temporary directory
- Directory structure (main pages + section-N/ subdirectories) has been created
- Success count is reported: "✓ Successfully extracted N/M pages to temp directory"
The temporary directory is then passed to Phase 2 for diagram enhancement.
Sources: tools/deepwiki-scraper.py:877 tools/deepwiki-scraper.py:596-788