
Python Dependencies


This page documents the Python dependencies required by the deepwiki-scraper.py script, including their purposes, version requirements, and how they are used throughout the content extraction and conversion pipeline. For information about the scraper script itself, see deepwiki-scraper.py. For details about how Rust dependencies (mdBook and mdbook-mermaid) are installed, see Docker Multi-Stage Build.

Dependencies Overview

The system requires three core Python libraries for web scraping and HTML-to-Markdown conversion:

| Package | Minimum Version | Primary Purpose |
| --- | --- | --- |
| requests | 2.31.0 | HTTP client for fetching web pages |
| beautifulsoup4 | 4.12.0 | HTML parsing and DOM manipulation |
| html2text | 2020.1.16 | HTML to Markdown conversion |

These dependencies are declared in tools/requirements.txt:1-3 and installed during Docker image build using the uv package manager.
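Given the version constraints above, tools/requirements.txt is expected to contain just these three lines:

```text
requests>=2.31.0
beautifulsoup4>=4.12.0
html2text>=2020.1.16
```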

Sources: tools/requirements.txt:1-3 Dockerfile:16-17

Dependency Usage Flow

The following diagram illustrates how each Python dependency is used across the three-phase processing pipeline:

Sources: tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-788

```mermaid
flowchart TD
    subgraph "Phase 1: Markdown Extraction"
        FetchPage["fetch_page()\n[tools/deepwiki-scraper.py:27-42]"]
        ExtractStruct["extract_wiki_structure()\n[tools/deepwiki-scraper.py:78-125]"]
        ExtractContent["extract_page_content()\n[tools/deepwiki-scraper.py:453-594]"]
        ConvertHTML["convert_html_to_markdown()\n[tools/deepwiki-scraper.py:175-190]"]
    end

    subgraph "Phase 2: Diagram Enhancement"
        ExtractDiagrams["extract_and_enhance_diagrams()\n[tools/deepwiki-scraper.py:596-788]"]
    end

    subgraph "requests Library"
        Session["requests.Session()"]
        GetMethod["session.get()"]
        HeadMethod["session.head()"]
    end

    subgraph "BeautifulSoup4 Library"
        BS4Parser["BeautifulSoup(html, 'html.parser')"]
        FindAll["soup.find_all()"]
        Select["soup.select()"]
        Decompose["element.decompose()"]
    end

    subgraph "html2text Library"
        H2TClass["html2text.HTML2Text()"]
        HandleMethod["h.handle()"]
    end

    FetchPage --> Session
    FetchPage --> GetMethod
    ExtractStruct --> GetMethod
    ExtractStruct --> BS4Parser
    ExtractStruct --> FindAll
    ExtractContent --> GetMethod
    ExtractContent --> BS4Parser
    ExtractContent --> Select
    ExtractContent --> Decompose
    ExtractContent --> ConvertHTML
    ConvertHTML --> H2TClass
    ConvertHTML --> HandleMethod
    ExtractDiagrams --> GetMethod
```

requests

The requests library provides HTTP client functionality for fetching web pages from DeepWiki.com. It is imported at tools/deepwiki-scraper.py:17 and used throughout the scraper.

Key Usage Patterns

Session Management: A requests.Session() object is created at tools/deepwiki-scraper.py:818-821 to maintain connection pooling and share headers across multiple requests.
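A minimal sketch of that pattern, assuming a placeholder browser-like User-Agent string (the real header values are set in the script):

```python
import requests

# A single shared session provides connection pooling and common headers
# for every request made by the scraper.
session = requests.Session()
session.headers.update({
    # Placeholder value; the actual browser-like User-Agent lives in deepwiki-scraper.py.
    "User-Agent": "Mozilla/5.0 (compatible; deepwiki-scraper)",
})
```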

HTTP GET Requests: The fetch_page() function at tools/deepwiki-scraper.py:27-42 uses session.get() with retry logic, browser-like headers, and a 30-second timeout to fetch HTML content.
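The exact signature and retry policy are defined in the script; a simplified sketch of a fetch_page()-style helper with a 30-second timeout and basic retries might look like:

```python
import time
import requests

def fetch_page(session: requests.Session, url: str, retries: int = 3) -> str:
    """Fetch a URL, retrying a few times before giving up (retry count is an assumption)."""
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2)  # brief pause before the next attempt
```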

HTTP HEAD Requests: The discover_subsections() function at tools/deepwiki-scraper.py:44-76 uses session.head() to efficiently check for page existence without downloading full content.
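A sketch of that existence check, assuming any successful (2xx) response means the page exists:

```python
import requests

def page_exists(session: requests.Session, url: str) -> bool:
    # HEAD returns only the status line and headers, so no page body is downloaded.
    try:
        response = session.head(url, timeout=30, allow_redirects=True)
        return response.ok
    except requests.RequestException:
        return False
```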

Configuration Options

The library is configured with browser-like request headers, a 30-second timeout per request, and retry logic around failed fetches, as shown in the fetch_page() sketch above.

Sources: tools/deepwiki-scraper.py:17 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:818-821

BeautifulSoup4

The beautifulsoup4 library (imported as bs4) provides HTML parsing and DOM manipulation capabilities. It is imported at tools/deepwiki-scraper.py:18 as from bs4 import BeautifulSoup.

Parser Selection

BeautifulSoup is instantiated with the built-in html.parser backend at multiple locations.
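A minimal sketch of that instantiation:

```python
from bs4 import BeautifulSoup

html = "<html><body><article><h1>Title</h1></article></body></html>"
soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml or html5lib required
```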

This parser choice avoids external dependencies (lxml, html5lib) and provides sufficient functionality for well-formed HTML.

DOM Navigation Methods

The following diagram maps BeautifulSoup methods to their usage contexts in the codebase:

```mermaid
flowchart LR
    subgraph "Navigation Methods"
        FindAll["soup.find_all()"]
        Find["soup.find()"]
        Select["soup.select()"]
        SelectOne["soup.select_one()"]
    end

    subgraph "Usage in extract_wiki_structure()"
        StructLinks["Find wiki page links\n[line 90]"]
    end

    subgraph "Usage in extract_page_content()"
        RemoveNav["Remove navigation elements\n[line 466]"]
        FindContent["Locate main content area\n[lines 473-485]"]
        RemoveUI["Remove DeepWiki UI elements\n[lines 491-511]"]
    end

    FindAll --> StructLinks
    FindAll --> RemoveUI
    Select --> RemoveNav
    SelectOne --> FindContent
    Find --> FindContent
```

Sources: tools/deepwiki-scraper.py:18 tools/deepwiki-scraper.py:84 tools/deepwiki-scraper.py:90 tools/deepwiki-scraper.py:463 tools/deepwiki-scraper.py:466-511

Content Manipulation

Element Removal: The element.decompose() method permanently removes elements from the DOM tree.
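A sketch of the removal pattern, using an illustrative selector rather than the script's real ones:

```python
from bs4 import BeautifulSoup

html = "<div><nav>site menu</nav><article>Actual content</article></div>"
soup = BeautifulSoup(html, "html.parser")

# decompose() detaches the element and its children from the tree entirely.
for nav in soup.find_all("nav"):
    nav.decompose()

print(soup.get_text())  # -> "Actual content"
```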

CSS Selectors: BeautifulSoup's select() and select_one() methods support CSS selector syntax for finding content areas at tools/deepwiki-scraper.py:473-476.
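An illustrative sketch of selector-based lookup (the selectors here are placeholders, not the ones used in the script):

```python
from bs4 import BeautifulSoup

html = "<main><article class='prose'><p>Body text</p></article></main>"
soup = BeautifulSoup(html, "html.parser")

content = soup.select_one("article.prose")   # first matching element, or None
paragraphs = soup.select("article.prose p")  # list of every match
```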

Attribute-Based Selection: The find() method with the attrs parameter locates elements by ARIA roles at tools/deepwiki-scraper.py:480.
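A sketch of role-based lookup; the role value shown is an assumption:

```python
from bs4 import BeautifulSoup

html = "<div role='main'><p>Primary content</p></div>"
soup = BeautifulSoup(html, "html.parser")

main_area = soup.find("div", attrs={"role": "main"})
```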

Text Extraction

BeautifulSoup's get_text() method extracts plain text from elements.
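A minimal example of text extraction:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Python <em>Dependencies</em></h1>", "html.parser")
title = soup.get_text()  # -> "Python Dependencies"
```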

Sources: tools/deepwiki-scraper.py:466-511

html2text

The html2text library converts HTML content to Markdown format. It is imported at tools/deepwiki-scraper.py:19 and used exclusively in the convert_html_to_markdown() function.

Configuration

An HTML2Text instance is created with specific configuration at tools/deepwiki-scraper.py:178-180.
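Based on the settings listed below, the configuration is likely close to:

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False  # keep hyperlinks as Markdown links instead of dropping them
h.body_width = 0        # disable hard wrapping at 80 characters
```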

Key Settings:

  • ignore_links = False : Preserves hyperlinks as `[text](url)` Markdown link syntax
  • body_width = 0 : Disables automatic line wrapping at 80 characters, preserving original formatting

Conversion Process

The handle() method at tools/deepwiki-scraper.py:181 performs the actual HTML-to-Markdown conversion.
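The conversion itself is a single call; a sketch continuing from the configuration above:

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False
h.body_width = 0

markdown = h.handle("<h2>Overview</h2><p>Uses <a href='https://example.com'>requests</a>.</p>")
print(markdown)
```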

This processes the cleaned HTML (after BeautifulSoup removes unwanted elements) and produces Markdown text with:

  • Headers converted to # syntax
  • Links converted to `[text](url)` format
  • Lists converted to - or 1. format
  • Bold/italic formatting preserved
  • Code blocks and inline code preserved

Post-Processing

The conversion output undergoes additional cleanup at tools/deepwiki-scraper.py:188.

Sources: tools/deepwiki-scraper.py:19 tools/deepwiki-scraper.py:175-190

Installation Process

The dependencies are installed during the Docker image build using the uv package manager, which provides fast, reliable Python package installation.

Multi-Stage Build Integration

```mermaid
flowchart TD
    subgraph "Dockerfile Stage 2"
        BaseImage["FROM python:3.12-slim\n[Dockerfile:8]"]
        CopyUV["COPY uv from astral-sh image\n[Dockerfile:13]"]
        CopyReqs["COPY tools/requirements.txt\n[Dockerfile:16]"]
        InstallDeps["RUN uv pip install --system\n[Dockerfile:17]"]
    end

    subgraph "requirements.txt"
        Requests["requests>=2.31.0"]
        BS4["beautifulsoup4>=4.12.0"]
        HTML2Text["html2text>=2020.1.16"]
    end

    BaseImage --> CopyUV
    CopyUV --> CopyReqs
    CopyReqs --> InstallDeps
    Requests --> InstallDeps
    BS4 --> InstallDeps
    HTML2Text --> InstallDeps
```

Sources: Dockerfile:8 Dockerfile:13 Dockerfile:16-17 tools/requirements.txt:1-3

Installation Command

The dependencies are installed with a single uv pip install command at Dockerfile:17.
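Assembling the flags listed below, the line at Dockerfile:17 presumably reads approximately as follows:

```dockerfile
# Install the Python dependencies into the system interpreter (sketch of Dockerfile:17).
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
```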

Flags:

  • --system : Installs into system Python, not a virtual environment
  • --no-cache : Avoids caching to reduce Docker image size
  • -r /tmp/requirements.txt : Specifies requirements file path

The uv tool is significantly faster than standard pip and provides deterministic dependency resolution, making builds more reliable and reproducible.

Sources: Dockerfile:16-17

Version Requirements

The minimum version constraints specified in tools/requirements.txt:1-3 ensure compatibility with required features:

requests >= 2.31.0

This version requirement ensures:

  • Security fixes : Addresses CVE-2023-32681 (Proxy-Authorization header leakage)
  • Session improvements : Enhanced connection pooling and retry mechanisms
  • urllib3 2.x compatibility : Works with modern urllib3 releases for connection handling

The codebase relies on stable Session API behavior introduced in 2.x releases.

beautifulsoup4 >= 4.12.0

This version requirement ensures:

  • Python 3.12 compatibility : Required for the base image python:3.12-slim
  • Parser stability : Consistent behavior with html.parser backend
  • Security updates : Protection against XML parsing vulnerabilities

The codebase uses standard find/select methods that are stable across 4.x versions.

html2text >= 2020.1.16

This version requirement ensures:

  • Python 3 compatibility : Earlier versions targeted Python 2.7
  • Markdown formatting fixes : Improved handling of nested lists and code blocks
  • Link preservation : Proper conversion of HTML links to Markdown syntax

The codebase uses the body_width=0 configuration which was stabilized in this version.

Sources: tools/requirements.txt:1-3

Import Locations

All three dependencies are imported at the top of deepwiki-scraper.py.
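Reconstructed from the references earlier on this page (requests at line 17, BeautifulSoup at line 18, html2text at line 19), the imports are:

```python
import requests
from bs4 import BeautifulSoup
import html2text
```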

These are the only external dependencies required by the Python layer. The script uses only standard library modules for all other functionality (sys, re, time, pathlib, urllib.parse).

Sources: tools/deepwiki-scraper.py:17-19