Python Dependencies
This page documents the Python dependencies required by the deepwiki-scraper.py script, including their purposes, version requirements, and how they are used throughout the content extraction and conversion pipeline. For information about the scraper script itself, see deepwiki-scraper.py. For details about how Rust dependencies (mdBook and mdbook-mermaid) are installed, see Docker Multi-Stage Build.
Dependencies Overview
The system requires three core Python libraries for web scraping and HTML-to-Markdown conversion:
| Package | Minimum Version | Primary Purpose |
|---|---|---|
| requests | 2.31.0 | HTTP client for fetching web pages |
| beautifulsoup4 | 4.12.0 | HTML parsing and DOM manipulation |
| html2text | 2020.1.16 | HTML to Markdown conversion |
These dependencies are declared in tools/requirements.txt:1-3 and installed during Docker image build using the uv package manager.
Sources: tools/requirements.txt:1-3 Dockerfile:16-17
Dependency Usage Flow
The following diagram illustrates how each Python dependency is used across the three-phase processing pipeline:
```mermaid
flowchart TD
subgraph "Phase 1: Markdown Extraction"
FetchPage["fetch_page()\n[tools/deepwiki-scraper.py:27-42]"]
ExtractStruct["extract_wiki_structure()\n[tools/deepwiki-scraper.py:78-125]"]
ExtractContent["extract_page_content()\n[tools/deepwiki-scraper.py:453-594]"]
ConvertHTML["convert_html_to_markdown()\n[tools/deepwiki-scraper.py:175-190]"]
end
subgraph "Phase 2: Diagram Enhancement"
ExtractDiagrams["extract_and_enhance_diagrams()\n[tools/deepwiki-scraper.py:596-788]"]
end
subgraph "requests Library"
Session["requests.Session()"]
GetMethod["session.get()"]
HeadMethod["session.head()"]
end
subgraph "BeautifulSoup4 Library"
BS4Parser["BeautifulSoup(html, 'html.parser')"]
FindAll["soup.find_all()"]
Select["soup.select()"]
Decompose["element.decompose()"]
end
subgraph "html2text Library"
H2TClass["html2text.HTML2Text()"]
HandleMethod["h.handle()"]
end
FetchPage --> Session
FetchPage --> GetMethod
ExtractStruct --> GetMethod
ExtractStruct --> BS4Parser
ExtractStruct --> FindAll
ExtractContent --> GetMethod
ExtractContent --> BS4Parser
ExtractContent --> Select
ExtractContent --> Decompose
ExtractContent --> ConvertHTML
ConvertHTML --> H2TClass
ConvertHTML --> HandleMethod
ExtractDiagrams --> GetMethod
```
Sources: tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-788
requests
The requests library provides HTTP client functionality for fetching web pages from DeepWiki.com. It is imported at tools/deepwiki-scraper.py:17 and used throughout the scraper.
Key Usage Patterns
Session Management: A requests.Session() object is created at tools/deepwiki-scraper.py:818-821 to maintain connection pooling and share headers across multiple requests:
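A minimal sketch of that setup, assuming browser-style header values (the actual strings are defined at tools/deepwiki-scraper.py:818-821):

```python
import requests

# One Session shared across all requests: connection pooling plus common headers
session = requests.Session()
session.headers.update({
    # Hypothetical browser-like User-Agent; the real string lives in the script
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
})
```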
HTTP GET Requests: The fetch_page() function at tools/deepwiki-scraper.py:27-42 uses session.get() with retry logic, browser-like headers, and a 30-second timeout to fetch HTML content.
HTTP HEAD Requests: The discover_subsections() function at tools/deepwiki-scraper.py:44-76 uses session.head() to efficiently check for page existence without downloading full content.
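The sketch below illustrates both patterns under the configuration described in the list that follows; the function bodies are illustrative rather than copies of the script, and the page_exists() helper name is hypothetical (the real caller is discover_subsections()):

```python
import time
import requests

def fetch_page(url, session, max_retries=3):
    """GET a page with a 30-second timeout, retrying up to three times."""
    for attempt in range(max_retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2)  # 2-second delay between attempts

def page_exists(url, session):
    """HEAD request: check that a page exists without downloading its body."""
    response = session.head(url, timeout=30, allow_redirects=True)
    return response.status_code == 200
```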
Configuration Options
The library is configured with:
- Custom User-Agent: Mimics a real browser to avoid bot detection tools/deepwiki-scraper.py:29-31
- Timeout: 30-second limit on requests tools/deepwiki-scraper.py:35
- Retry Logic: Up to 3 attempts with 2-second delays tools/deepwiki-scraper.py:33-42
- Connection Pooling: Automatic via the Session() object
Sources: tools/deepwiki-scraper.py:17 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:818-821
BeautifulSoup4
The beautifulsoup4 library (imported as bs4) provides HTML parsing and DOM manipulation capabilities. It is imported at tools/deepwiki-scraper.py:18 as from bs4 import BeautifulSoup.
Parser Selection
BeautifulSoup is instantiated with the built-in html.parser backend at multiple locations:
- Structure discovery: tools/deepwiki-scraper.py:84
- Content extraction: tools/deepwiki-scraper.py:463
This parser choice avoids external dependencies (lxml, html5lib) and provides sufficient functionality for well-formed HTML.
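For illustration, instantiation with the built-in parser looks like this (the html string is a stand-in for a fetched page body):

```python
from bs4 import BeautifulSoup

html = "<html><body><article><h1>Title</h1><p>Content</p></article></body></html>"
soup = BeautifulSoup(html, "html.parser")  # stdlib parser; no lxml or html5lib required
```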
DOM Navigation Methods
The following diagram maps BeautifulSoup methods to their usage contexts in the codebase:
```mermaid
flowchart LR
subgraph "Navigation Methods"
FindAll["soup.find_all()"]
Find["soup.find()"]
Select["soup.select()"]
SelectOne["soup.select_one()"]
end
subgraph "Usage in extract_wiki_structure()"
StructLinks["Find wiki page links\n[line 90]"]
end
subgraph "Usage in extract_page_content()"
RemoveNav["Remove navigation elements\n[line 466]"]
FindContent["Locate main content area\n[line 473-485]"]
RemoveUI["Remove DeepWiki UI elements\n[line 491-511]"]
end
FindAll --> StructLinks
FindAll --> RemoveUI
Select --> RemoveNav
SelectOne --> FindContent
Find --> FindContent
```
Sources: tools/deepwiki-scraper.py:18 tools/deepwiki-scraper.py:84 tools/deepwiki-scraper.py:90 tools/deepwiki-scraper.py:463 tools/deepwiki-scraper.py:466-511
Content Manipulation
Element Removal: The element.decompose() method permanently removes elements from the DOM tree:
- Navigation elements: tools/deepwiki-scraper.py:466-467
- DeepWiki UI components: tools/deepwiki-scraper.py:491-500
- Table of contents lists: tools/deepwiki-scraper.py:504-511
CSS Selectors: BeautifulSoup's select() and select_one() methods support CSS selector syntax for finding content areas (tools/deepwiki-scraper.py:473-476).
Attribute-Based Selection: The find() method with the attrs parameter locates elements by ARIA roles:
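The sketch below combines the three operations described in this subsection; the selectors, markup, and ARIA role are illustrative examples rather than the exact values used at tools/deepwiki-scraper.py:466-511:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Site navigation</nav>
  <main role="main"><article><h1>Page Title</h1><p>Body text.</p></article></main>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Element removal: permanently drop navigation chrome from the tree
for nav in soup.select("nav"):
    nav.decompose()

# CSS selectors: locate the main content area
content = soup.select_one("article") or soup.select_one("main")

# Attribute-based selection: fall back to an ARIA role lookup
if content is None:
    content = soup.find(attrs={"role": "main"})

print(content)
```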
Text Extraction
BeautifulSoup's get_text() method extracts plain text from elements:
- With strip=True to remove whitespace: tools/deepwiki-scraper.py:94 tools/deepwiki-scraper.py:492
- Used for DeepWiki UI element detection, as in the sketch below: tools/deepwiki-scraper.py:492-500
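A minimal sketch of that pattern; the UI strings checked here are hypothetical placeholders for the script's real detection list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div> Edit on GitHub </div><h1> Overview </h1>", "html.parser")

for element in soup.find_all(["div", "h1"]):
    text = element.get_text(strip=True)        # whitespace-trimmed plain text
    if text in {"Edit on GitHub", "Share"}:    # hypothetical UI strings
        element.decompose()                    # remove DeepWiki chrome
```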
Sources: tools/deepwiki-scraper.py:466-511
html2text
The html2text library converts HTML content to Markdown format. It is imported at tools/deepwiki-scraper.py:19 and used exclusively in the convert_html_to_markdown() function.
Configuration
An HTML2Text instance is created with specific configuration at tools/deepwiki-scraper.py:178-180:
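A sketch of that configuration, mirroring the two settings listed below:

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False  # keep hyperlinks as Markdown links
h.body_width = 0        # disable wrapping at 80 columns
```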
Key Settings:
- ignore_links = False: Preserves hyperlinks as Markdown link syntax
- body_width = 0: Disables automatic line wrapping at 80 characters, preserving original formatting
Conversion Process
The handle() method at tools/deepwiki-scraper.py:181 performs the actual HTML-to-Markdown conversion:
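A self-contained illustration of the call, with a toy HTML fragment standing in for the cleaned page content (the output comment is approximate):

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False
h.body_width = 0

html_str = '<h2>Usage</h2><p>See the <a href="https://example.com">docs</a> for <strong>details</strong>.</p>'
print(h.handle(html_str))
# Approximate output:
# ## Usage
#
# See the [docs](https://example.com) for **details**.
```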
This processes the cleaned HTML (after BeautifulSoup removes unwanted elements) and produces Markdown text with:
- Headers converted to # syntax
- Links converted to [text](url) format
- Lists converted to - or 1. format
- Bold/italic formatting preserved
- Code blocks and inline code preserved
Post-Processing
The conversion output undergoes additional cleanup at tools/deepwiki-scraper.py:188:
- DeepWiki footer removal via clean_deepwiki_footer(): tools/deepwiki-scraper.py:127-173
- Link rewriting to relative paths: tools/deepwiki-scraper.py:549-592
- Duplicate title removal: tools/deepwiki-scraper.py:525-545
Sources: tools/deepwiki-scraper.py:19 tools/deepwiki-scraper.py:175-190
Installation Process
The dependencies are installed during Docker image build using the uv package manager, which provides fast, reliable Python package installation.
Multi-Stage Build Integration
```mermaid
flowchart TD
subgraph "Dockerfile Stage 2"
BaseImage["FROM python:3.12-slim\n[Dockerfile:8]"]
CopyUV["COPY uv from astral-sh image\n[Dockerfile:13]"]
CopyReqs["COPY tools/requirements.txt\n[Dockerfile:16]"]
InstallDeps["RUN uv pip install --system\n[Dockerfile:17]"]
end
subgraph "requirements.txt"
Requests["requests>=2.31.0"]
BS4["beautifulsoup4>=4.12.0"]
HTML2Text["html2text>=2020.1.16"]
end
BaseImage --> CopyUV
CopyUV --> CopyReqs
CopyReqs --> InstallDeps
Requests --> InstallDeps
BS4 --> InstallDeps
HTML2Text --> InstallDeps
```
Sources: Dockerfile:8 Dockerfile:13 Dockerfile:16-17 tools/requirements.txt:1-3
Installation Command
The dependencies are installed with a single uv pip install command at Dockerfile:17:
Flags:
- --system: Installs into system Python, not a virtual environment
- --no-cache: Avoids caching to reduce Docker image size
- -r /tmp/requirements.txt: Specifies the requirements file path
The uv tool is significantly faster than standard pip and provides deterministic dependency resolution, making builds more reliable and reproducible.
Sources: Dockerfile:16-17
Version Requirements
The minimum version constraints specified in tools/requirements.txt:1-3 ensure compatibility with required features:
requests >= 2.31.0
This version requirement ensures:
- Security fixes: Addresses CVE-2023-32681 (Proxy-Authorization header leakage)
- Session improvements: Enhanced connection pooling and retry behavior
- urllib3 2.x compatibility: Works with current urllib3 releases
The codebase relies on stable Session API behavior introduced in 2.x releases.
beautifulsoup4 >= 4.12.0
This version requirement ensures:
- Python 3.12 compatibility: Required for the python:3.12-slim base image
- Parser stability: Consistent behavior with the html.parser backend
- Security updates: Protection against parsing-related vulnerabilities
The codebase uses standard find/select methods that are stable across 4.x versions.
html2text >= 2020.1.16
This version requirement ensures:
- Python 3 compatibility: Earlier versions targeted Python 2.7
- Markdown formatting fixes: Improved handling of nested lists and code blocks
- Link preservation: Proper conversion of HTML links to Markdown syntax
The codebase relies on the body_width = 0 configuration to disable line wrapping.
Sources: tools/requirements.txt:1-3
Import Locations
All three dependencies are imported at the top of deepwiki-scraper.py:
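Based on the line references cited earlier in this page, the external imports correspond to:

```python
import requests                  # HTTP client (tools/deepwiki-scraper.py:17)
from bs4 import BeautifulSoup    # HTML parsing (tools/deepwiki-scraper.py:18)
import html2text                 # HTML-to-Markdown conversion (tools/deepwiki-scraper.py:19)
```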
These are the only external dependencies required by the Python layer. The script uses only standard library modules for all other functionality (sys, re, time, pathlib, urllib.parse).
Sources: tools/deepwiki-scraper.py:17-19