This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Python Dependencies
This page documents the Python dependencies required by the deepwiki-scraper.py script, including their purposes, version requirements, and how they are used throughout the content extraction and conversion pipeline. For information about the scraper script itself, see deepwiki-scraper.py. For details about how Rust dependencies (mdBook and mdbook-mermaid) are installed, see Docker Multi-Stage Build.
Dependencies Overview
The system requires three core Python libraries for web scraping and HTML-to-Markdown conversion:
| Package | Minimum Version | Primary Purpose |
|---|---|---|
| requests | 2.31.0 | HTTP client for fetching web pages |
| beautifulsoup4 | 4.12.0 | HTML parsing and DOM manipulation |
| html2text | 2020.1.16 | HTML to Markdown conversion |
These dependencies are declared in tools/requirements.txt:1-3 and installed during Docker image build using the uv package manager.
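Per the constraints in the table above (and echoed in the build diagram later on this page), the requirements file amounts to:

```text
requests>=2.31.0
beautifulsoup4>=4.12.0
html2text>=2020.1.16
```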
Sources: tools/requirements.txt:1-3 Dockerfile:16-17
Dependency Usage Flow
The following diagram illustrates how each Python dependency is used across the three-phase processing pipeline:
```mermaid
flowchart TD
    subgraph "Phase 1: Markdown Extraction"
        FetchPage["fetch_page()\n[tools/deepwiki-scraper.py:27-42]"]
        ExtractStruct["extract_wiki_structure()\n[tools/deepwiki-scraper.py:78-125]"]
        ExtractContent["extract_page_content()\n[tools/deepwiki-scraper.py:453-594]"]
        ConvertHTML["convert_html_to_markdown()\n[tools/deepwiki-scraper.py:175-190]"]
    end
    subgraph "Phase 2: Diagram Enhancement"
        ExtractDiagrams["extract_and_enhance_diagrams()\n[tools/deepwiki-scraper.py:596-788]"]
    end
    subgraph "requests Library"
        Session["requests.Session()"]
        GetMethod["session.get()"]
        HeadMethod["session.head()"]
    end
    subgraph "BeautifulSoup4 Library"
        BS4Parser["BeautifulSoup(html, 'html.parser')"]
        FindAll["soup.find_all()"]
        Select["soup.select()"]
        Decompose["element.decompose()"]
    end
    subgraph "html2text Library"
        H2TClass["html2text.HTML2Text()"]
        HandleMethod["h.handle()"]
    end
    FetchPage --> Session
    FetchPage --> GetMethod
    ExtractStruct --> GetMethod
    ExtractStruct --> BS4Parser
    ExtractStruct --> FindAll
    ExtractContent --> GetMethod
    ExtractContent --> BS4Parser
    ExtractContent --> Select
    ExtractContent --> Decompose
    ExtractContent --> ConvertHTML
    ConvertHTML --> H2TClass
    ConvertHTML --> HandleMethod
    ExtractDiagrams --> GetMethod
```

Sources: tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-788
requests
The requests library provides HTTP client functionality for fetching web pages from DeepWiki.com. It is imported at tools/deepwiki-scraper.py:17 and used throughout the scraper.
Key Usage Patterns
Session Management: A requests.Session() object is created at tools/deepwiki-scraper.py:818-821 to maintain connection pooling and share headers across multiple requests:
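A minimal sketch of that pattern, assuming an illustrative User-Agent value rather than the script's exact header set:

```python
import requests

# One Session is reused for every request: it pools connections and
# applies the browser-like headers to each call automatically.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # illustrative value only
})
```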
HTTP GET Requests: The fetch_page() function at tools/deepwiki-scraper.py:27-42 uses session.get() with retry logic, browser-like headers, and 30-second timeout to fetch HTML content.
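A simplified sketch of that retry behavior (the helper name matches the script, but the body here is an approximation):

```python
import time
import requests

def fetch_page(session, url, retries=3, delay=2):
    """Fetch a URL, retrying on transient failures (approximation)."""
    for attempt in range(1, retries + 1):
        try:
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == retries:
                raise          # give up after the final attempt
            time.sleep(delay)  # brief pause before retrying
```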
HTTP HEAD Requests: The discover_subsections() function at tools/deepwiki-scraper.py:44-76 uses session.head() to efficiently check for page existence without downloading full content.
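The existence check can be sketched as follows (the helper name is hypothetical; only the session.head() pattern is taken from the script):

```python
def page_exists(session, url):
    """Cheap existence probe: HEAD returns headers only, no body."""
    resp = session.head(url, timeout=30, allow_redirects=True)
    return resp.status_code == 200
```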
Configuration Options
The library is configured with:
- Custom User-Agent: Mimics a real browser to avoid bot detection tools/deepwiki-scraper.py:29-31
- Timeout: 30-second limit on requests tools/deepwiki-scraper.py:35
- Retry Logic: Up to 3 attempts with 2-second delays tools/deepwiki-scraper.py:33-42
- Connection Pooling: Automatic via the `Session()` object
Sources: tools/deepwiki-scraper.py:17 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:818-821
BeautifulSoup4
The beautifulsoup4 library (imported as bs4) provides HTML parsing and DOM manipulation capabilities. It is imported at tools/deepwiki-scraper.py:18 as `from bs4 import BeautifulSoup`.
Parser Selection
BeautifulSoup is instantiated with the built-in html.parser backend at multiple locations:
- Structure discovery: tools/deepwiki-scraper.py:84
- Content extraction: tools/deepwiki-scraper.py:463
This parser choice avoids external dependencies (lxml, html5lib) and provides sufficient functionality for well-formed HTML.
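A minimal instantiation with the built-in parser:

```python
from bs4 import BeautifulSoup

html = "<main><h1>Title</h1><p>Body text</p></main>"
soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml/html5lib required
print(soup.find("h1").get_text(strip=True))  # -> Title
```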
DOM Navigation Methods
The following diagram maps BeautifulSoup methods to their usage contexts in the codebase:

```mermaid
flowchart LR
    subgraph "Navigation Methods"
        FindAll["soup.find_all()"]
        Find["soup.find()"]
        Select["soup.select()"]
        SelectOne["soup.select_one()"]
    end
    subgraph "Usage in extract_wiki_structure()"
        StructLinks["Find wiki page links\n[line 90]"]
    end
    subgraph "Usage in extract_page_content()"
        RemoveNav["Remove navigation elements\n[line 466]"]
        FindContent["Locate main content area\n[line 473-485]"]
        RemoveUI["Remove DeepWiki UI elements\n[line 491-511]"]
    end
    FindAll --> StructLinks
    FindAll --> RemoveUI
    Select --> RemoveNav
    SelectOne --> FindContent
    Find --> FindContent
```

Sources: tools/deepwiki-scraper.py:18 tools/deepwiki-scraper.py:84 tools/deepwiki-scraper.py:90 tools/deepwiki-scraper.py:463 tools/deepwiki-scraper.py:466-511
Content Manipulation
Element Removal: The element.decompose() method permanently removes elements from the DOM tree:
- Navigation elements: tools/deepwiki-scraper.py:466-467
- DeepWiki UI components: tools/deepwiki-scraper.py:491-500
- Table of contents lists: tools/deepwiki-scraper.py:504-511
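A small illustration of the pattern (the selector here is a generic example, not the script's exact list):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<nav>menu</nav><article>content</article>", "html.parser")
for nav in soup.select("nav"):
    nav.decompose()  # removes the element and its children from the tree
print(soup)  # -> <article>content</article>
```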
CSS Selectors: BeautifulSoup’s select() and select_one() methods support CSS selector syntax for finding content areas:
tools/deepwiki-scraper.py:473-476
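In sketch form, with illustrative fallback selectors rather than the script's actual list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<main><article><p>Hello</p></article></main>", "html.parser")
# Try progressively more generic selectors until one matches.
content = (
    soup.select_one("article")
    or soup.select_one("main")
    or soup.select_one("div.content")  # illustrative fallback
)
```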
Attribute-Based Selection: The find() method with attrs parameter locates elements by ARIA roles:
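For example (the specific role value is an assumption for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div role="main"><p>Body</p></div>', "html.parser")
main_area = soup.find(attrs={"role": "main"})  # match by ARIA role attribute
```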
Text Extraction
BeautifulSoup’s get_text() method extracts plain text from elements:
- With `strip=True` to remove whitespace: tools/deepwiki-scraper.py:94 tools/deepwiki-scraper.py:492
- Used for DeepWiki UI element detection: tools/deepwiki-scraper.py:492-500
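For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<li>  Item one  </li>", "html.parser")
text = soup.find("li").get_text(strip=True)  # -> "Item one"
```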
Sources: tools/deepwiki-scraper.py:466-511
html2text
The html2text library converts HTML content to Markdown format. It is imported at tools/deepwiki-scraper.py:19 and used exclusively in the convert_html_to_markdown() function.
Configuration
An HTML2Text instance is created with specific configuration at tools/deepwiki-scraper.py:178-180:
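Given the settings described below, the setup amounts to roughly:

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False  # keep hyperlinks as Markdown links
h.body_width = 0        # disable hard wrapping at 80 characters
```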
Key Settings:
- `ignore_links = False`: Preserves hyperlinks as Markdown link syntax
- `body_width = 0`: Disables automatic line wrapping at 80 characters, preserving original formatting
Conversion Process
The handle() method at tools/deepwiki-scraper.py:181 performs the actual HTML-to-Markdown conversion:
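A self-contained sketch of the conversion step (the sample HTML is illustrative):

```python
import html2text
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2>Usage</h2><p>See the <a href='x.html'>docs</a>.</p>", "html.parser")

h = html2text.HTML2Text()
h.ignore_links = False
h.body_width = 0

markdown = h.handle(str(soup))  # roughly: "## Usage\n\nSee the [docs](x.html).\n"
```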
This processes the cleaned HTML (after BeautifulSoup removes unwanted elements) and produces Markdown text with:
- Headers converted to `#` syntax
- Links converted to Markdown `[text](url)` format
- Lists converted to `-` or `1.` format
- Bold/italic formatting preserved
- Code blocks and inline code preserved
Post-Processing
The conversion output undergoes additional cleanup at tools/deepwiki-scraper.py:188:
- DeepWiki footer removal via `clean_deepwiki_footer()`: tools/deepwiki-scraper.py:127-173
- Link rewriting to relative paths: tools/deepwiki-scraper.py:549-592
- Duplicate title removal: tools/deepwiki-scraper.py:525-545
Sources: tools/deepwiki-scraper.py:19 tools/deepwiki-scraper.py:175-190
Installation Process
The dependencies are installed during Docker image build using the uv package manager, which provides fast, reliable Python package installation.
Multi-Stage Build Integration

```mermaid
flowchart TD
    subgraph "Dockerfile Stage 2"
        BaseImage["FROM python:3.12-slim\n[Dockerfile:8]"]
        CopyUV["COPY uv from astral-sh image\n[Dockerfile:13]"]
        CopyReqs["COPY tools/requirements.txt\n[Dockerfile:16]"]
        InstallDeps["RUN uv pip install --system\n[Dockerfile:17]"]
    end
    subgraph "requirements.txt"
        Requests["requests>=2.31.0"]
        BS4["beautifulsoup4>=4.12.0"]
        HTML2Text["html2text>=2020.1.16"]
    end
    BaseImage --> CopyUV
    CopyUV --> CopyReqs
    CopyReqs --> InstallDeps
    Requests --> InstallDeps
    BS4 --> InstallDeps
    HTML2Text --> InstallDeps
```

Sources: Dockerfile:8 Dockerfile:13 Dockerfile:16-17 tools/requirements.txt:1-3
Installation Command
The dependencies are installed with a single uv pip install command at Dockerfile:17:
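Assembling the flags listed below, the command reads approximately:

```dockerfile
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
```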
Flags:
- `--system`: Installs into the system Python, not a virtual environment
- `--no-cache`: Avoids caching to reduce Docker image size
- `-r /tmp/requirements.txt`: Specifies the requirements file path
The uv tool is significantly faster than standard pip and provides deterministic dependency resolution, making builds more reliable and reproducible.
Sources: Dockerfile:16-17
Version Requirements
The minimum version constraints specified in tools/requirements.txt:1-3 ensure compatibility with required features:
requests >= 2.31.0
This version requirement ensures:
- Security fixes : Addresses CVE-2023-32681 (Proxy-Authorization header leakage)
- Session improvements : Enhanced connection pooling and retry mechanisms
- urllib3 2.x compatibility : Improved connection handling across multiple requests
The codebase relies on stable Session API behavior introduced in 2.x releases.
beautifulsoup4 >= 4.12.0
This version requirement ensures:
- Python 3.12 compatibility : Required for the `python:3.12-slim` base image
- Parser stability : Consistent behavior with the `html.parser` backend
- Security updates : Protection against XML parsing vulnerabilities
The codebase uses standard find/select methods that are stable across 4.x versions.
html2text >= 2020.1.16
This version requirement ensures:
- Python 3 compatibility : Earlier versions targeted Python 2.7
- Markdown formatting fixes : Improved handling of nested lists and code blocks
- Link preservation : Proper conversion of HTML links to Markdown syntax
The codebase uses the body_width=0 configuration which was stabilized in this version.
Sources: tools/requirements.txt:1-3
Import Locations
All three dependencies are imported at the top of deepwiki-scraper.py:
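Based on the usage described above (tools/deepwiki-scraper.py:17-19), the imports are:

```python
import requests
from bs4 import BeautifulSoup
import html2text
```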
These are the only external dependencies required by the Python layer. The script uses only standard library modules for all other functionality (sys, re, time, pathlib, urllib.parse).
Sources: tools/deepwiki-scraper.py:17-19