Building an Automated Video Course Archiver: Fetch, Download & Upload to Telegram
Catalogue → Fetch → Download → Upload → Repeat. A fully automated, fault-tolerant pipeline for archiving Classplus course videos to Telegram, complete with a live web dashboard.
Table of Contents
- Overview
- System Architecture
- Prerequisites
- File Structure
- Script 0a: JSON Cataloguer (All-in-One_JSON-based.py)
- Script 0b: Directory Cataloguer (directory-based.py)
- Script 1: Fetching Video Links (fetch_videos_v2.py)
- Script 2: Downloading Videos (download_videos_v2.py)
- Script 3: Uploading to Telegram (upv3.py)
- Script 4: The Dashboard (main_v3.py)
- Cataloguer Comparison: JSON vs Directory
- Data Flow & State Files
- Configuration Reference
- Running the Pipeline
- Fault Tolerance & Resumability
- Download Source Files
Overview
This project is a six-script automation system that:
- Catalogues (Option A) a Classplus course into a single structured JSON file, useful for a quick audit of all content types
- Catalogues (Option B) the same course into a mirrored directory tree on disk: one JSON file per item, browsable like a real filesystem
- Fetches playable .m3u8 stream URLs for every video with full resumability and retry logic
- Downloads those videos to local disk using FFmpeg with multi-threaded concurrency
- Uploads the downloaded files sequentially to a Telegram channel
- Orchestrates steps 3–5 from a single Gradio web dashboard with scheduling, live logs, and status tracking
Every script is designed with resumability at its core: if a run is interrupted at any point, the next run picks up exactly where it left off. No work is ever duplicated.
System Architecture
┌───────────────────────────────────────────────────────────────┐
│                    main_v3.py (Dashboard)                     │
│            Gradio UI · Scheduler · Process Manager            │
└─────────────────────┬─────────────────────────────────────────┘
                      │ orchestrates (Steps 3–5)
                      │
┌─────────────────────┼─────────────────────────────────────────┐
│        [Optional Step 0 – choose one]                         │
│                                                               │
│  0a: All-in-One_JSON-based.py ──► course_content.json         │
│      (Single JSON, all content types)                         │
│                                                               │
│  0b: directory-based.py ──► course_content/                   │
│      (Mirrored folder tree, one .json file per item)          │
└─────────────────────┬─────────────────────────────────────────┘
                      │
        ┌─────────────┼─────────────────────┐
        ▼             │                     │
┌───────────────┐     │                     │
│ fetch_videos  │──► course_data_with_      │
│    _v2.py     │    videos.json            │
└───────────────┘     │                     │
        │             ▼                     │
        │   ┌───────────────────┐           │
        │   │ download_videos   │──► course_downloads/
        │   │    _v2.py         │    *.mp4  │
        │   └───────────────────┘           │
        │             │                     │
        │             ▼                     │
        │   ┌───────────────────┐           │
        │   │     upv3.py       │──► Telegram Channel
        │   │ (Telegram Upload) │    @channel
        │   └───────────────────┘           │
        └───────────────────────────────────┘
Each script produces and consumes JSON state files that act as the shared memory between stages. This decoupled design means any script can be run independently or restarted at any time.
Prerequisites
System Dependencies
# FFmpeg (required for downloading HLS streams)
sudo apt install ffmpeg # Ubuntu/Debian
brew install ffmpeg # macOS
# Python 3.8+
python3 --version
Python Packages
pip install requests python-dotenv gradio colorlog tqdm
Optional: Local Telegram Bot API Server
For uploading files larger than 50 MB (the cloud API limit), the uploader uses a local Bot API server on 127.0.0.1:8081. You can run the official server from telegram-bot-api or use Docker:
docker run -d -p 8081:8081 aiogram/telegram-bot-api \
--api-id=YOUR_API_ID \
--api-hash=YOUR_API_HASH
Environment File
Create a .env file in your working directory:
ACCESS_TOKEN=your_classplus_jwt_token_here
COURSE_SAVE_PATH=./course_output
File Structure
project/
├── All-in-One_JSON-based.py        # Step 0a (optional): JSON catalogue of full course
├── directory-based.py              # Step 0b (optional): Directory-mirrored catalogue
├── fetch_videos_v2.py              # Step 1: Fetch stream URLs from Classplus
├── download_videos_v2.py           # Step 2: Download videos via FFmpeg
├── upv3.py                         # Step 3: Upload to Telegram
├── main_v3.py                      # Dashboard: Orchestrates Steps 1–3
├── .env                            # Your credentials (never commit this)
├── course_output/                  # Output from Steps 1–3
│   ├── course_content.json             # Step 0a output
│   ├── course_data_with_videos.json    # Step 1 state
│   ├── download_manifest.json          # Step 2 state
│   ├── upload_state.json               # Step 3 state
│   ├── download_log.jsonl              # Step 0a/0b JSONL log
│   ├── download_links_log.jsonl        # Step 1 JSONL log
│   ├── video_downloader_log.jsonl      # Step 2 JSONL log
│   ├── telegram_uploader_log.jsonl     # Step 3 JSONL log
│   └── course_downloads/               # Downloaded .mp4 files
│       ├── Folder Name/
│       │   ├── Video Title.mp4
│       │   └── ...
│       └── ...
└── course_content/                 # Output from Step 0b (directory-based)
    └── root/
        ├── index.json                  # Lists all items in this folder
        ├── Lecture 01 - Overview.json  # Individual item metadata
        ├── Reference Notes.pdf.json
        └── Chapter 1 - Basics/         # Subfolder mirroring course structure
            ├── index.json
            ├── Video 01.json
            └── ...
Script 0a: JSON Cataloguer
File: All-in-One_JSON-based.py
This is the optional first step: a standalone audit tool that scans an entire Classplus course and dumps everything (videos, documents, tests, and the full folder hierarchy) into a single clean JSON file. Use it to explore a course's structure before committing to a full download pipeline.
What Makes It Different
Unlike fetch_videos_v2.py, this script does not resolve playable video URLs. Its goal is purely to catalogue: quickly building a complete map of all content types in one shot. It's the right tool when you want to answer questions like:
- How many videos/docs/tests does this course have?
- What's the folder structure?
- Which content is locked?
Content Types Captured
| contentType | Type | Fields Captured |
|---|---|---|
| 1 | Folder | description, resources, counts (videos/docs/tests), children |
| 2 | Video | vidKey, thumbnailUrl, videoType, duration, contentHashId |
| 3 | Document | format, url, isContentLocked |
| 4 | Test | testId, maxMarks, URL |
How It Works
recursively_build_structure(folder_id=None)
│
├── fetch_folder_content(course_id, folder_id)
│     └── Handles pagination via offset loop until no items remain
│
├── extract_relevant(item)
│     └── Strips raw API response to a clean minimal dict per type
│
└── For each folder found → recurse into children
      └── Attaches result as nested 'children' list
The entire tree is built in memory and saved in one atomic write to course_content.json at the end, making this a fast, single-pass scan.
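A safe atomic write can be done entirely with the standard library: write to a temporary file in the same directory, then swap it over the target with os.replace(), so a crash mid-write never leaves a half-written file behind. This is an illustrative sketch, not the script's exact code, and atomic_write_json is a hypothetical helper name:

```python
import json
import os
import tempfile

def atomic_write_json(data, target_path):
    """Write JSON to target_path atomically: temp file + rename."""
    dir_name = os.path.dirname(os.path.abspath(target_path))
    # Write to a temp file in the SAME directory, because os.replace()
    # is only atomic when source and target are on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        os.replace(tmp_path, target_path)  # atomic swap
    except BaseException:
        os.unlink(tmp_path)  # never leave stray temp files behind
        raise

atomic_write_json([{"contentType": 1, "name": "Unit 1"}], "course_content.json")
```

Readers of course_content.json therefore only ever see the previous complete version or the new complete version, never a partial file.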
Pagination Support
This script handles courses with many items per folder using an offset-based loop:
while True:
resp = session.get(BASE_URL, params=params)
items = resp.json()['data']['courseContent']
if not items:
break
all_items.extend(items)
params['offset'] += len(items)
time.sleep(0.2)
This is more thorough than fetch_videos_v2.py, which does not implement explicit pagination.
Authentication Headers
This script uses a fuller set of mobile-app headers to closely mimic the Classplus Android app, which can be helpful for courses with stricter API access controls:
HEADERS = {
'user-agent': 'Mobile-Android',
'mobile-agent': 'Mobile-Android',
'api-version': '52',
'is-apk': '1',
'region': 'IN',
'x-requested-with': 'co.jones.stibl',
'x-access-token': ACCESS_TOKEN,
...
}
Sample Output (course_content.json)
[
{
"contentType": 1,
"id": 98765,
"name": "Unit 1 - Fundamentals",
"description": "Core concepts",
"counts": { "documents": 3, "videos": 12, "tests": 2 },
"children": [
{
"contentType": 2,
"id": 11111,
"name": "Intro Lecture",
"duration": 2847,
"contentHashId": "xyz987abc",
"videoType": 1
},
{
"contentType": 3,
"id": 22222,
"name": "Reference Notes.pdf",
"format": "pdf",
"url": "https://...",
"isContentLocked": false
},
{
"contentType": 4,
"id": 33333,
"name": "Unit 1 Quiz",
"testId": 44444,
"maxMarks": 20
}
]
}
]
Key Configuration
| Variable | Default | Description |
|---|---|---|
| ACCESS_TOKEN | hardcoded | Classplus JWT token |
| COURSE_ID | 733740 | Target course ID |
| OUTPUT_FILENAME | course_content.json | Output file name |
| REQUEST_TIMEOUT | 30s | Per-request timeout |
| MAX_RETRIES | 5 | Retry attempts on server errors |
| RETRY_BACKOFF_FACTOR | 0.5 | Exponential backoff multiplier |
Tip: After running this script, inspect course_content.json to confirm the course ID and structure before running fetch_videos_v2.py.
Script 0b: Directory Cataloguer
File: directory-based.py
This is an alternative cataloguing approach that saves each piece of course content as its own individual .json file, mirroring the course's folder hierarchy directly onto the filesystem. Instead of one big JSON blob, you get a browsable directory tree you can explore in any file manager.
How It Works
recursively_save(folder_id=None, path=BASE_SAVE_PATH)
│
├── fetch_folder_content(course_id, folder_id)
│     └── Pagination loop (limit=50, offset increments)
│
├── For each item → extract_relevant() → save_content_file()
│     └── Writes individual <item_name>.json to current directory
│
├── For each subfolder found → recurse with subfolder's ID
│
└── Write index.json listing all filenames in the current folder
Filesystem Output
The script creates a directory tree that exactly mirrors the course structure:
course_content/
└── root/
    ├── index.json                   ← lists all items in this folder
    ├── Lecture 01 - Overview.json   ← video metadata
    ├── Reference Notes.pdf.json     ← document metadata
    ├── Unit 1 Quiz.json             ← test metadata
    └── Chapter 1 - Fundamentals/    ← subfolder becomes a real directory
        ├── index.json
        ├── Intro Video.json
        └── ...
Each index.json is a simple list of filenames present in that directory, acting as a table of contents:
[
"Lecture 01 - Overview.json",
"Reference Notes.pdf.json",
"Unit 1 Quiz.json",
"Chapter 1 - Fundamentals"
]
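Producing such an index is essentially a directory listing dumped to JSON. A runnable sketch (the demo paths below are hypothetical, and write_index is an illustrative helper name, not necessarily the script's):

```python
import json
from pathlib import Path

def write_index(folder, index_filename="index.json"):
    """List every item in the folder (except the index itself) as a ToC."""
    entries = sorted(p.name for p in folder.iterdir()
                     if p.name != index_filename)
    (folder / index_filename).write_text(json.dumps(entries, indent=2))
    return entries

# Build a tiny demo tree and index it
demo = Path("course_content_demo/root")
demo.mkdir(parents=True, exist_ok=True)
(demo / "Lecture 01 - Overview.json").write_text("{}")
(demo / "Chapter 1 - Fundamentals").mkdir(exist_ok=True)
print(write_index(demo))
```

Note that subfolders appear in the index as bare names (no .json suffix), which is how the sample above distinguishes "Chapter 1 - Fundamentals" from the item files.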
Filename Sanitization (Two-Level)
The script attempts to save each file using the raw content name. If the OS rejects it (special characters, reserved names), it falls back to a sanitized version with an MD5 hash suffix to avoid collisions:
# Level 1: raw name (may fail on some OSes)
raw_filepath = save_dir / f"{original_name}.json"
# Level 2: sanitized fallback with MD5 hash
sanitized_name = f"{name[:72]}_{hashlib.md5(name.encode()).hexdigest()[:6]}.json"
Both attempts are logged to download_log.jsonl so you can audit which files needed fallback names.
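The fallback naming can be sketched as follows. sanitize_fallback is a hypothetical helper name and the exact set of stripped characters is an assumption, but the 72-character truncation and 6-character MD5 suffix match the snippet above:

```python
import hashlib
import re

def sanitize_fallback(name):
    """Level-2 fallback: strip unsafe chars, truncate, add MD5 suffix."""
    safe = re.sub(r'[<>:"/\\|?*]', "_", name)  # drop OS-hostile characters
    # Hash the ORIGINAL name so two names that sanitize identically
    # still get distinct filenames.
    digest = hashlib.md5(name.encode()).hexdigest()[:6]
    return f"{safe[:72]}_{digest}.json"

print(sanitize_fallback('Lecture 3: "What is DNA?" / Part 1'))
```

Because the hash is derived from the original (unsanitized) name, the fallback is deterministic across runs, which keeps resumed scans pointing at the same files.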
Partial Scan with START_FOLDER_ID
A unique feature of this script: you can scan just one subtree instead of the whole course.
# Scan entire course
START_FOLDER_ID = None
# Scan only a specific chapter/folder
START_FOLDER_ID = 38232234
This is useful when you want to re-scan or inspect a single section without waiting for the entire course to be traversed again.
Sample Item Output (Intro Video.json)
{
"contentType": 2,
"id": 11111,
"name": "Intro Video",
"description": "Course introduction",
"vidKey": "abc123",
"thumbnailUrl": "https://cdn.example.com/thumb.jpg",
"videoType": 1,
"duration": 2847,
"contentHashId": "xyz987abc"
}
Key Configuration
| Variable | Default | Description |
|---|---|---|
| ACCESS_TOKEN | hardcoded | Classplus JWT token |
| COURSE_ID | 516707 | Target course ID |
| START_FOLDER_ID | None | None = full course; set an int to scan a subtree |
| BASE_SAVE_PATH | ./course_content | Root directory for all output |
| INDEX_FILENAME | index.json | Name of the per-folder index file |
Script 1: Fetching Video Links
File: fetch_videos_v2.py
This script is the entry point of the pipeline. It authenticates with the Classplus API, recursively scans the course folder tree, and resolves the final playable .m3u8 URL for each video.
How It Works
On first run:
- Calls GET /v2/course/content/get recursively to build a complete folder/video tree
- For each video, calls the JW Player signed URL endpoint to get the master .m3u8 playlist
- Parses the playlist to find the highest-quality variant stream URL
- Saves the full tree (with URLs) to course_data_with_videos.json
On subsequent runs:
- Loads the existing state file
- Runs a stale hash check: compares contentHashId values against fresh API data. If no changes are found in the first 10 items, it skips the full refresh (performance optimization)
- Processes only videos with status pending or failed
Resumability & Retry Logic
MAX_VIDEO_ATTEMPTS = 500 # Each video gets up to 500 tries across all runs
CONSECUTIVE_ERROR_LIMIT = 10 # Exit early if token appears expired
Video statuses cycle through: pending → completed / failed → permanently_failed
State is saved after every single video, so even a crash mid-run loses at most one video's work.
Token Expiry Detection
The script tracks consecutive 403/404 errors. If 10 errors occur back-to-back, it raises a CriticalErrorExit exception, saves progress, logs a clear message, and exits cleanly rather than hammering the API with a dead token.
class CriticalErrorExit(Exception):
pass
# Triggered inside get_final_video_url()
if consecutive_error_counter >= CONSECUTIVE_ERROR_LIMIT:
raise CriticalErrorExit("Please check your ACCESS_TOKEN.")
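The surrounding bookkeeping amounts to a small circuit breaker. This is a simplified, self-contained sketch (the real script threads the counter through its URL-resolution loop; ErrorTracker is an illustrative name):

```python
CONSECUTIVE_ERROR_LIMIT = 10

class CriticalErrorExit(Exception):
    pass

class ErrorTracker:
    """Counts back-to-back auth failures; any success resets the streak."""
    def __init__(self, limit=CONSECUTIVE_ERROR_LIMIT):
        self.limit = limit
        self.consecutive = 0

    def record(self, http_status):
        if http_status in (403, 404):
            self.consecutive += 1
            if self.consecutive >= self.limit:
                raise CriticalErrorExit("Please check your ACCESS_TOKEN.")
        else:
            self.consecutive = 0  # one good response clears the streak

tracker = ErrorTracker(limit=3)
for status in (403, 200, 403, 403):
    tracker.record(status)
try:
    tracker.record(403)  # third 403 in a row -> trips the breaker
except CriticalErrorExit as e:
    print("aborted:", e)
```

The key detail is the reset on success: isolated 403s (a flaky item, a transient CDN hiccup) never trip the breaker, only an unbroken streak that looks like a dead token does.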
Key Configuration
| Variable | Default | Description |
|---|---|---|
| COURSE_ID | 516707 | Classplus course ID |
| BASE_URL | Classplus v2 API | Content listing endpoint |
| MAX_VIDEO_ATTEMPTS | 500 | Retries per video across all runs |
| CONSECUTIVE_ERROR_LIMIT | 10 | 403/404s before forced exit |
| API_DELAY | 0.5s | Delay between API calls |
Script 2: Downloading Videos
File: download_videos_v2.py
This script reads the JSON state from the fetcher and downloads all videos to local disk using FFmpeg, with up to 10 concurrent downloads running simultaneously.
Architecture: Worker Pool + State Manager
The script uses two distinct concurrency layers:
Main Thread
│
├── ThreadPoolExecutor (10 workers)
│     ├── Worker 1:  ffmpeg -i <url> video1.mp4
│     ├── Worker 2:  ffmpeg -i <url> video2.mp4
│     ├── ...
│     └── Worker 10: ffmpeg -i <url> video10.mp4
│
└── StateManager Thread
      └── Reads Queue → Updates manifest → Saves to disk every 5s
This design ensures:
- No I/O contention: only one thread ever writes to disk
- Maximum throughput: a new download starts the instant any worker finishes
- Data integrity: state is saved periodically even if the main process crashes
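A toy version of this two-layer design, with a short sleep standing in for each FFmpeg download (the names follow the description above, but the details are illustrative, not the script's actual code):

```python
import json
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

results = queue.Queue()   # workers -> state manager
manifest = {}             # single-writer state, owned by the manager only
stop = threading.Event()

def worker(video_id):
    time.sleep(0.01)      # stand-in for the ffmpeg download
    results.put((video_id, "completed"))

def state_manager(save_path="download_manifest.json", save_interval=0.05):
    """Drains the queue and is the ONLY thread that touches the disk."""
    last_save = 0.0
    while not stop.is_set() or not results.empty():
        try:
            vid, status = results.get(timeout=0.05)
            manifest[vid] = status
        except queue.Empty:
            pass
        if time.monotonic() - last_save >= save_interval:
            with open(save_path, "w") as f:
                json.dump(manifest, f)
            last_save = time.monotonic()
    with open(save_path, "w") as f:   # final flush on shutdown
        json.dump(manifest, f)

mgr = threading.Thread(target=state_manager)
mgr.start()
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(worker, [f"video{i}" for i in range(20)]))
stop.set()
mgr.join()
print(len(manifest), "downloads recorded")
```

Workers never write to the manifest or the disk directly; they only put results on the queue, which is what eliminates I/O contention while keeping all ten download slots busy.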
FFmpeg Integration
Each download runs FFmpeg to convert the .m3u8 HLS stream directly to .mp4:
ffmpeg -y -i <m3u8_url> -c copy -bsf:a aac_adtstoasc output.mp4
The -c copy flag means no re-encoding: the video is simply remuxed, making downloads as fast as your connection allows.
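In Python, that command is typically assembled as an argument list for subprocess (a sketch; the script's actual wrapper may differ, and build_ffmpeg_cmd is a hypothetical name):

```python
import subprocess

def build_ffmpeg_cmd(m3u8_url, output_path):
    return [
        "ffmpeg",
        "-y",                       # overwrite partial files from a crashed run
        "-i", m3u8_url,             # the HLS playlist URL
        "-c", "copy",               # remux only -- no re-encoding
        "-bsf:a", "aac_adtstoasc",  # repackage AAC audio for the MP4 container
        output_path,
    ]

cmd = build_ffmpeg_cmd("https://cdn.example.com/index.m3u8", "Lecture 01.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually run the download
print(" ".join(cmd))
```

Passing the command as a list (rather than a shell string) avoids quoting problems with video titles that contain spaces or special characters.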
Progress Bars
When tqdm is available, each worker shows a live progress bar tied to the video's duration (via ffprobe):
Overall Progress        |███████░░░░░░░░░| 43/100 [02:15<03:01]
Lecture 01 - Intro.mp4  |████████████████| 1234.0s/1234.0s
Lecture 02 - Basics.mp4 |████████░░░░░░░░| 621.0s/1234.0s
State Sync on Resume
When resuming, the script calls _refresh_manifest_from_source() to:
- Update video names and URLs from the latest fetcher output
- Reset any failed downloads to pending so they get retried with fresh URLs
Key Configuration
| Variable | Default | Description |
|---|---|---|
| MAX_CONCURRENT_DOWNLOADS | 10 | Parallel download threads |
| SAVE_INTERVAL_SECONDS | 5.0 | How often state is saved to disk |
| INPUT_DIR | ./course_output | Where to read the links file from |
| DOWNLOAD_ROOT_DIR | ./course_output/course_downloads | Where to save .mp4 files |
Script 3: Uploading to Telegram
File: upv3.py
This script reads download_manifest.json, synchronizes with any existing upload state, and uploads all completed .mp4 files to a Telegram channel, strictly one at a time, in order.
Why Sequential?
Upload order matters for Telegram channels β viewers expect lectures in the correct sequence. A concurrent uploader would cause messages to appear out of order due to network variability. Sequential upload guarantees perfect ordering.
State Synchronization
On each run, synchronize_states() merges the download manifest with the existing upload state:
- Finds all videos with downloadStatus: completed that aren't yet in the upload state and adds them as pending
- Resets any uploading status to pending (these were interrupted mid-upload last time)
- Preserves all previously completed upload records
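The merge can be expressed as a pure function over the two state mappings. An illustrative sketch using the field names from the state-file samples later in this post, keyed by video ID for brevity:

```python
def synchronize_states(download_manifest, upload_state):
    """Merge downloader output into upload state without losing progress."""
    synced = dict(upload_state)
    for vid, rec in download_manifest.items():
        if rec.get("downloadStatus") != "completed":
            continue  # not downloaded yet -> nothing to upload
        prev = synced.get(vid, {}).get("uploadStatus")
        if prev is None:
            synced[vid] = {"uploadStatus": "pending"}   # newly downloaded
        elif prev == "uploading":
            synced[vid] = {"uploadStatus": "pending"}   # interrupted last run
        # prev == "completed" or "pending" is preserved untouched
    return synced

state = synchronize_states(
    {"v1": {"downloadStatus": "completed"},
     "v2": {"downloadStatus": "completed"},
     "v3": {"downloadStatus": "pending"}},
    {"v1": {"uploadStatus": "completed"},
     "v2": {"uploadStatus": "uploading"}},
)
print(state)
```

Because the function never downgrades a completed record, re-running the uploader after any crash is always safe.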
Message Link Tracking
After each successful upload, the script constructs and saves a direct t.me/... link:
# Public channel (e.g., @mychannel)
tele_message_link = f"https://t.me/{chat_username}/{message_id}"
# Private supergroup (e.g., -10012345678)
chat_id_short = str(CHAT_ID).replace("-100", "")
tele_message_link = f"https://t.me/c/{chat_id_short}/{message_id}"
This link is stored in upload_state.json alongside the telegramFileId and widget_id for each video.
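Both cases fit in one helper; build_message_link is a hypothetical name wrapping the logic above:

```python
def build_message_link(chat, message_id, chat_username=None):
    """Return a t.me link for a public channel or a private supergroup."""
    if chat_username:  # public channel, e.g. @mychannel
        return f"https://t.me/{chat_username.lstrip('@')}/{message_id}"
    # Private supergroup: strip the -100 prefix from the numeric chat ID
    chat_id_short = str(chat).replace("-100", "", 1)
    return f"https://t.me/c/{chat_id_short}/{message_id}"

print(build_message_link("@mychannel", 42, chat_username="@mychannel"))
# -> https://t.me/mychannel/42
print(build_message_link(-10012345678, 42))
# -> https://t.me/c/12345678/42
```

Note that t.me/c/... links only resolve for users who are already members of the private chat; the public form works for anyone.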
Caption Format
Each uploaded video gets a structured caption showing its folder path and title:
📁 Chapter 1 / Week 3 / Organic Chemistry
- 🎬 Lecture 14: Reaction Mechanisms
Rate Limit & Retry Handling
MAX_RETRIES = 5
RETRY_DELAY_SECONDS = 8
# On HTTP 429 (Too Many Requests):
retry_after = int(e.response.json()['parameters']['retry_after'])
time.sleep(retry_after) # Respects Telegram's own backoff instruction
Graceful Interruption
If Ctrl+C is pressed mid-upload, the current video is marked pending (not failed) so the next run re-uploads it cleanly from the beginning. No partial uploads are ever left in a completed state.
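This guarantee rests on a try/except KeyboardInterrupt around the upload loop that rewrites the status before exiting. A simplified sketch with the Telegram call stubbed out (upload_all and upload_one are illustrative names, not necessarily the script's):

```python
import json

def upload_all(state, upload_one, state_path="upload_state.json"):
    """Upload pending videos in order; on Ctrl+C, reset the current one."""
    try:
        for vid, rec in state.items():
            if rec["uploadStatus"] != "pending":
                continue
            rec["uploadStatus"] = "uploading"
            upload_one(vid)                  # may raise KeyboardInterrupt
            rec["uploadStatus"] = "completed"
    except KeyboardInterrupt:
        # Partial upload: mark it pending (NOT failed) so the next run
        # re-uploads this video cleanly from the beginning.
        for rec in state.values():
            if rec["uploadStatus"] == "uploading":
                rec["uploadStatus"] = "pending"
        raise SystemExit("Interrupted; progress saved.")
    finally:
        with open(state_path, "w") as f:     # state survives every exit path
            json.dump(state, f)
```

The finally block is what makes the interruption safe: whether the loop finishes, crashes, or is cancelled, the state file on disk always reflects reality.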
Key Configuration
| Variable | Default | Description |
|---|---|---|
| BOT_TOKEN | Your bot token | Telegram Bot API token |
| CHAT_ID | @nifobry | Target channel (public @name or private -100...) |
| BASE_URL_TEMPLATE | http://127.0.0.1:8081 | Local Bot API server for large files |
| MAX_RETRIES | 5 | Upload retry attempts per file |
| STARTING_WIDGET_ID | 1 | Sequential ID counter for uploads |
Script 4: The Dashboard
File: main_v3.py
The dashboard is a Gradio web application that ties all three scripts together into a manageable control panel. It runs each script as a subprocess, streams their output to log files in real time, and exposes controls through a browser-based UI.
UI Layout
┌───────────────────────────────────────────────────────────────┐
│  🎥 Video Processing Pipeline Dashboard                       │
│  [ Compact log view ]  [ Dark mode ]                          │
├────────────────────────┬──────────────────────────────────────┤
│  ⚡ Controls           │  📋 Logs                             │
│  [▶ Start]  [⏹ Stop]   │                                      │
│  [🔄 Run Now] [⏭ Skip] │  Application Logs (Dashboard)        │
│                        │  ┌─────────────────────────────────┐ │
│  Schedule: [─────] 14h │  │ 2025-07-15 10:23:01 [INFO]...   │ │
│  [Apply Interval]      │  └─────────────────────────────────┘ │
│                        │                                      │
│  📊 Pipeline Status    │  Subprocess Logs (Scripts)           │
│  🟠 Running            │  ┌─────────────────────────────────┐ │
│  Step 2/3: download... │  │ [10:23:05] ✅ SUCCESS: URL...   │ │
│  Last Run: 10:23:01    │  └─────────────────────────────────┘ │
│  Next Run: 00:23:01    │                                      │
│  ✔️3 ✖1 ℹ️0 ⚠️0        │  [🔄 Refresh All]                    │
│                        │                                      │
│  🔑 Token Management   │                                      │
│  [••••••••••••] Update │                                      │
└────────────────────────┴──────────────────────────────────────┘
Key Features
Scheduling: Start a repeating schedule that runs immediately, then every N hours (1–168h range, user-configurable via slider). The next scheduled time is always shown in the status panel.
Manual Control:
- ▶ Start Schedule: begins the periodic runner
- ⏹ Stop All: halts the scheduler and sends SIGINT to any running subprocess, escalating to SIGTERM and SIGKILL if needed
- 🔄 Run Now: one-shot manual trigger (works alongside or without a schedule)
- ⏭ Skip Step: terminates the current script and advances the pipeline to the next one
Subprocess Management: Each script runs as a child process with its stdout/stderr piped to subprocess.log via a background streaming thread. The main app also writes its own structured log to app.log. Both logs are tailed and shown in the UI.
Status Indicators: Color-coded status dot:
- 🟢 limegreen: Completed / Idle
- 🟡 goldenrod: Waiting for next scheduled run
- 🟠 orangered: Currently running
- 🔴 red: Failed / Stopped
Token Management: The ACCESS_TOKEN can be updated directly from the UI without restarting. The new value is written to .env and immediately applied to os.environ.
Auto-Refresh: A gr.Timer(5) component polls master_refresh_cb() every 5 seconds to update status, logs, and button interactivity without any page reload.
Process Lifecycle
# Graceful shutdown sequence (GRACE_SHUTDOWN_SECONDS = 5)
1. Send SIGINT (Ctrl+C equivalent)
2. Wait up to 5 seconds
3. If still alive → SIGTERM
4. Wait 2 seconds
5. If still alive → SIGKILL
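Using the standard library, the escalation looks roughly like this (POSIX signals assumed; the dashboard's real implementation may differ in detail, and stop_process is an illustrative name):

```python
import signal
import subprocess
import sys
import time

GRACE_SHUTDOWN_SECONDS = 5

def stop_process(proc, grace=GRACE_SHUTDOWN_SECONDS):
    """SIGINT -> SIGTERM -> SIGKILL, waiting between each step."""
    if proc.poll() is not None:
        return proc.returncode
    proc.send_signal(signal.SIGINT)      # 1. polite Ctrl+C equivalent
    try:
        return proc.wait(timeout=grace)  # 2. give it time to save state
    except subprocess.TimeoutExpired:
        pass
    proc.terminate()                     # 3. SIGTERM
    try:
        return proc.wait(timeout=2)      # 4. short second grace period
    except subprocess.TimeoutExpired:
        proc.kill()                      # 5. SIGKILL, last resort
        return proc.wait()

# Demo: stop a child that would otherwise sleep for a minute
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
time.sleep(0.2)  # let the child start
stop_process(child)
print("child exited with", child.returncode)
```

Starting with SIGINT matters here: the pipeline scripts treat it like Ctrl+C, so they get a chance to run their save-state-and-exit paths before anything harsher is sent.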
Log Viewing Modes
| Mode | App Log Lines | Subprocess Log Lines |
|---|---|---|
| Expanded (default) | 1500 | 1500 |
| Compact | 400 | 400 |
Cataloguer Comparison: JSON vs Directory
All three cataloguing scripts talk to the same Classplus API but serve different purposes. Here's a full side-by-side comparison:
| Feature | All-in-One_JSON-based.py | directory-based.py | fetch_videos_v2.py |
|---|---|---|---|
| Primary purpose | Single-file course audit | Filesystem-mirrored audit | Resolve playable video URLs |
| Output format | One course_content.json | One .json per item in OS folders | course_data_with_videos.json |
| Browsable output | ✅ Open in editor | ✅ Open in file manager | ✅ Open in editor |
| Per-folder index | ❌ No | ✅ index.json in each dir | ❌ No |
| Partial subtree scan | ❌ Always from root | ✅ START_FOLDER_ID | ❌ Always from root |
| Content types | Videos + Docs + Tests | Videos + Docs + Tests | Videos only |
| Video URL resolution | ❌ contentHashId only | ❌ contentHashId only | ✅ Final .m3u8 URL |
| Resumability | ❌ Full rescan every run | ❌ Full rescan every run | ✅ Per-video state tracking |
| Per-video retry tracking | ❌ No | ❌ No | ✅ Up to 500 attempts |
| Token expiry detection | ❌ No | ❌ No | ✅ Exits after 10× 403s |
| Pagination | ✅ Offset loop | ✅ Offset loop (limit=50) | ❌ Single request per folder |
| Filename sanitization | Basic | ✅ Two-level with MD5 fallback | N/A |
| Auth headers | Full mobile headers | Full mobile headers | Minimal |
| Used in dashboard | ❌ Standalone | ❌ Standalone | ✅ Step 1 of pipeline |
| Course ID (default) | 733740 | 516707 | 516707 |
When to use which:
- All-in-One_JSON-based.py: Quick audit of a new course. Want one file you can search with Ctrl+F or jq.
- directory-based.py: Want to browse/inspect course content visually. Need to scan only a specific chapter. Course has many items with special characters in names.
- fetch_videos_v2.py: Ready to actually download videos. Need resumable, production-grade URL resolution that handles token expiry gracefully.
Recommended workflow:
- Run either 0a or 0b to audit the course and confirm structure
- Run fetch_videos_v2.py to resolve all playable URLs
- Run the full pipeline via the dashboard to download and upload
Data Flow & State Files
Understanding the JSON state files is key to understanding how the pipeline is resumable.
course_data_with_videos.json (Fetcher Output)
[
{
"contentType": 1,
"id": 12345,
"name": "Chapter 1 - Introduction",
"children": [
{
"contentType": 2,
"id": 67890,
"name": "Lecture 01 - Overview",
"contentHashId": "abc123xyz",
"duration": 3612,
"downloadStatus": "completed",
"finalUrl": "https://cdn.jwplayer.com/.../index.m3u8",
"retryCount": 1
}
]
}
]
download_manifest.json (Downloader Output)
Extends the fetcher JSON with download-specific fields:
{
"localPath": "Chapter 1 - Introduction/Lecture 01 - Overview.mp4",
"downloadStatus": "completed",
"downloadError": null
}
upload_state.json (Uploader Output)
Extends the download manifest with upload-specific fields:
{
"uploadStatus": "completed",
"telegramFileId": "BQACAgIAAxkBAAI...",
"tele_message_link": "https://t.me/nifobry/42",
"widget_id": 42,
"uploadError": null
}
Configuration Reference
All key settings are defined at the top of each script as module-level constants, making them easy to find and adjust.
| Setting | File | Default | Description |
|---|---|---|---|
| ACCESS_TOKEN | All-in-One_JSON-based.py | hardcoded | Classplus JWT token |
| COURSE_ID | All-in-One_JSON-based.py | 733740 | Target course ID for cataloguing |
| OUTPUT_FILENAME | All-in-One_JSON-based.py | course_content.json | Catalogue output file |
| ACCESS_TOKEN | directory-based.py | hardcoded | Classplus JWT token |
| COURSE_ID | directory-based.py | 516707 | Target course ID |
| START_FOLDER_ID | directory-based.py | None | None = full course; int = subtree only |
| BASE_SAVE_PATH | directory-based.py | ./course_content | Root output directory |
| ACCESS_TOKEN | fetch_videos_v2.py | .env | Classplus JWT token |
| COURSE_ID | fetch_videos_v2.py | 516707 | Target course ID |
| MAX_VIDEO_ATTEMPTS | fetch_videos_v2.py | 500 | Max retries per video |
| CONSECUTIVE_ERROR_LIMIT | fetch_videos_v2.py | 10 | 403/404s before abort |
| MAX_CONCURRENT_DOWNLOADS | download_videos_v2.py | 10 | Parallel ffmpeg workers |
| SAVE_INTERVAL_SECONDS | download_videos_v2.py | 5.0 | State save frequency |
| BOT_TOKEN | upv3.py | hardcoded | Telegram Bot token |
| CHAT_ID | upv3.py | @nifobry | Target Telegram channel |
| MAX_RETRIES | upv3.py | 5 | Upload retry attempts |
| DEFAULT_INTERVAL_HOURS | main_v3.py | 14 | Default pipeline schedule |
| GRACE_SHUTDOWN_SECONDS | main_v3.py | 5 | Time before SIGTERM escalation |
Running the Pipeline
Option A: Dashboard (Recommended for Steps 1–3)
# 1. Set up environment
echo "ACCESS_TOKEN=your_token_here" > .env
# 2. Install dependencies
pip install requests python-dotenv gradio colorlog tqdm
# 3. Launch the dashboard
python main_v3.py
# Open browser at http://127.0.0.1:7860
# 4. Click "▶ Start Schedule" or "🔄 Run Now"
Option B: Run Scripts Individually
# Step 0a (optional): Catalogue into a single JSON file
python All-in-One_JSON-based.py
# → Produces: course_output/course_content.json

# Step 0b (optional): Catalogue into a mirrored directory tree
python directory-based.py
# → Produces: course_content/root/<folders and .json files>

# Step 0b (partial scan – specific folder only):
# Edit START_FOLDER_ID = 38232234 in the script, then:
python directory-based.py

# Step 1: Fetch video URLs
python fetch_videos_v2.py
# → Produces: course_output/course_data_with_videos.json

# Step 2: Download videos (requires ffmpeg)
python download_videos_v2.py
# → Produces: course_output/course_downloads/*.mp4

# Step 3: Upload to Telegram (requires local Bot API server)
python upv3.py
# → Uploads to Telegram, produces: course_output/upload_state.json
Option C: Automate with Cron
# Run the full pipeline every day at 2 AM
0 2 * * * cd /path/to/project && python fetch_videos_v2.py && python download_videos_v2.py && python upv3.py
Fault Tolerance & Resumability
This pipeline was designed with one guiding principle: any script can be interrupted at any time and safely restarted.
| Scenario | Behavior |
|---|---|
| Fetcher interrupted mid-scan | Resumes from last saved video on next run |
| Token expires during fetch | Detects 10 consecutive 403s, then saves & exits cleanly |
| Downloader crashes | State Manager has already saved progress; resumes on restart |
| Upload interrupted (Ctrl+C) | In-progress video reset to pending; not marked failed |
| Script process killed by OS | State saved every 5 seconds; at most 5s of work lost |
| New videos added to course | Stale hash check detects changes and refreshes URLs |
| Previously failed downloads | Auto-reset to pending on next downloader run |
Download Source Files
All six Python scripts are available as a single zip archive:
Contents:
| File | Description |
|---|---|
| All-in-One_JSON-based.py | Full course cataloguer – single JSON file |
| directory-based.py | Full course cataloguer – mirrored directory tree |
| fetch_videos_v2.py | Classplus API scraper & .m3u8 URL resolver |
| download_videos_v2.py | Multi-threaded HLS downloader via FFmpeg |
| upv3.py | Sequential Telegram uploader with message link tracking |
| main_v3.py | Gradio dashboard & scheduler |
Notes & Disclaimer
- This tool is intended for personal archiving of content you have legitimate access to.
- The Classplus ACCESS_TOKEN is a JWT that expires periodically. When it does, update it via the dashboard's token management panel or directly in .env.
- The local Telegram Bot API server (127.0.0.1:8081) is required for files larger than 50 MB. For smaller files, you can switch BASE_URL_TEMPLATE to https://api.telegram.org/bot{}.
- Always keep your .env file out of version control. Add it to .gitignore.
Built with Python · Gradio · FFmpeg · Telegram Bot API