MediaCrawler - Multi-Platform Social Media Crawler
Project Overview
MediaCrawler is a multi-platform social media crawler developed and maintained by NanmiCoder. It is built on Playwright and crawls public information, including posts and comments, from several mainstream social media platforms.
Technical Architecture
Core Technologies
- Playwright: A browser automation framework, used here to keep the logged-in browser context available.
- Python: The primary development language, requiring version 3.9.6+.
- JavaScript Execution: Obtains encrypted parameters by executing JS expressions.
- Node.js: Requires version 16+.
Working Principle
The project uses Playwright as a bridge: after a successful login it keeps the browser context alive and obtains the required encrypted parameters by executing JavaScript expressions inside that context. This avoids having to reproduce the platforms' core encryption JS code and greatly reduces the reverse-engineering effort.
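The idea above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `window.signRequest`, the `browser_data` directory, the `/api/search` path, and the placeholder URL are all assumptions made for the example. The stand-in signing function is injected only so that the sketch runs; on a real platform the site's own scripts would define it.

```python
import json


def build_sign_expression(uri: str, payload: dict) -> str:
    """Build a JS expression that calls a page-defined signing function.

    `window.signRequest` is a hypothetical name used for illustration.
    """
    return f"window.signRequest({json.dumps(uri)}, {json.dumps(payload)})"


def get_signed_params(page, uri: str, payload: dict) -> dict:
    """Ask the logged-in page to compute the encrypted parameters."""
    # page.evaluate runs the expression in the browser and returns the
    # JSON-serializable result to Python.
    return page.evaluate(build_sign_expression(uri, payload))


def main() -> None:
    # Imported here so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # A persistent context stores cookies and local storage on disk,
        # so a login performed once is reused on later runs.
        context = p.chromium.launch_persistent_context(
            user_data_dir="browser_data",  # assumed directory name
            headless=False,
        )
        page = context.pages[0] if context.pages else context.new_page()
        page.goto("https://example.com")  # placeholder for a platform URL

        # Stand-in for the site's own signing code (assumption for the demo).
        page.evaluate(
            "() => { window.signRequest = (uri, data) => "
            "({ sign: uri + ':' + JSON.stringify(data) }); }"
        )

        print(get_signed_params(page, "/api/search", {"keyword": "python"}))
        context.close()
```

Calling `main()` opens a browser window and prints the value returned by the stand-in function. The Python side never needs to know how the signature is computed, which is the point of the approach.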
Environment Requirements
System Requirements
- Python 3.9.6+
- Node.js 16+
Dependency Management
The project uses uv to manage its dependencies. You can use uv instead of pip to install them, which is more convenient and faster.
Installation and Deployment
Basic Installation Steps
# Enter the project root directory
cd MediaCrawler
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
# macOS & Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install the Playwright browsers
playwright install
Database Initialization (Optional)
# Execute database initialization (only for the first time)
python db.py
Usage
Basic Commands
# Keyword search crawling
python main.py --platform xhs --lt qrcode --type search
# Crawl by specifying post ID
python main.py --platform xhs --lt qrcode --type detail
# View help information
python main.py --help
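For readers who want to see how flags like these fit together, here is a minimal `argparse` sketch. It is an assumption about how such a command line could be parsed, not the project's actual `main.py`; only the flag names and the example values (`xhs`, `qrcode`, `search`, `detail`) come from the commands above, and the reading of `--lt` as a login type is a guess.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Return a parser that accepts the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Crawler CLI sketch")
    parser.add_argument("--platform", default="xhs",
                        help="platform identifier, e.g. xhs")
    parser.add_argument("--lt", default="qrcode",
                        help="presumably the login type, e.g. qrcode")
    # Stored as crawler_type to avoid shadowing Python's built-in `type`.
    parser.add_argument("--type", dest="crawler_type", default="search",
                        help="crawl mode, e.g. search or detail")
    return parser


args = build_parser().parse_args(
    ["--platform", "xhs", "--lt", "qrcode", "--type", "detail"]
)
print(args.platform, args.lt, args.crawler_type)
```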
Configuration Instructions
- Comment crawling is disabled by default.
- To crawl comments, change the ENABLE_GET_COMMENTS variable in config/base_config.py.
- Other options are also defined in config/base_config.py, each documented with a Chinese comment.
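As an illustration, the change might look like the excerpt below. The variable name and file come from the notes above; treating the value as a boolean is an assumption, so check the existing value in the file before editing.

```python
# config/base_config.py (excerpt; the boolean type is an assumption)
ENABLE_GET_COMMENTS = True  # turn on comment crawling, which is off by default
```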
Data Storage
Supported Storage Methods
- MySQL Database: Relational database storage (the database must be created in advance).
- CSV File: Saved as CSV files in the data/ directory.
- JSON File: Saved as JSON files in the data/ directory.
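The two file-based options can be sketched with the standard library alone. This is a hedged example, not the project's storage code: the function `save_records`, the file names, and the record fields `note_id` and `title` are hypothetical; only the `data/` directory and the CSV and JSON formats come from the list above.

```python
import csv
import json
from pathlib import Path


def save_records(records: list[dict], name: str,
                 data_dir: str = "data") -> tuple[Path, Path]:
    """Write the same records to <data_dir>/<name>.csv and <name>.json."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{name}.csv"
    json_path = out_dir / f"{name}.json"

    # Use the union of all keys as the CSV header so no field is dropped.
    fieldnames = sorted({key for record in records for key in record})
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
    with json_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    return csv_path, json_path
```

For example, `save_records([{"note_id": "1", "title": "hello"}], "notes")` would create `data/notes.csv` and `data/notes.json`.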
Pro Version Advantages
The project also offers a MediaCrawlerPro version, which has the following advantages over the open-source version:
- Multi-account and IP proxy support (key feature)
- No Playwright dependency, making it easier to use
- Linux support
- Refactored code that is easier to read and maintain
- Decoupled JS signature logic for higher code quality
- An architecture designed to be easier to extend
- A Social Media Video Downloader desktop application
- Multi-platform homepage feed recommendations (HomeFeed)
Legal Disclaimer
Disclaimer
- This project is for learning and research purposes only; commercial use is prohibited.
- It is strictly forbidden to use it for illegal purposes or to infringe upon the legitimate rights and interests of others.
- Users must abide by relevant laws and regulations and bear their own legal responsibilities.
- The developer assumes no legal liability arising from the use of this project.
Project Value
MediaCrawler is not just a crawler tool, but also an excellent learning project:
- Architecture Design Learning: The project has a mature architectural design, which is worth learning from.
- Technical Practice: Covers the comprehensive application of various technology stacks.
- Engineering Thinking: Complete engineering practice from code organization to deployment.
- Anti-Crawling Technology: Learn about solutions to modern anti-crawling technologies.