A few days ago, I snagged a free .news domain from Namecheap and registered hy2.news. Since I didn’t have much use for it, I decided to set up a personal news feed page. Because the page needs to be dynamic, I brought out WordPress and used an RSS plugin to make it happen. The page is live at: Hyruo News
The page is up, but sourcing RSS feeds for it became an issue, so I started by trying to scrape the NodeSeek forum, where I’ve been active recently. After hours of effort, I still failed.
Handcrafting an RSSHub Route
The official RSSHub development tutorial jumps around a bit, but the main process is as follows.
Preliminary Work
- Clone the RSSHub repository (might take over half an hour if your computer is slow)
git clone https://github.com/DIYgod/RSSHub.git
- Install the latest version of Node.js (the version must be greater than 22): Node.js Official Site
- Install dependencies
pnpm i
- Run
pnpm run dev
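If everything is installed correctly, the dev server should come up at http://localhost:1200 (RSSHub’s default port), and any route you add later can be previewed there.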
Developing the Route
Developing the route is relatively simple. Open the RSSHub\lib\routes directory, create a new folder under it, such as nodeseek, and then add two files, namespace.ts and custom.ts, in that folder.
- namespace.ts File
This file can be copied from the official tutorial, or just copied from another folder in the lib\routes directory and modified. Example:
```typescript
import type { Namespace } from '@/types';

export const namespace: Namespace = {
    name: 'nodeseek',
    url: 'nodeseek.com',
    lang: 'zh-CN',
};
```
- custom.ts File
This is the main file for the route. The filename can follow the structure of the target website; just look at the other folders to get the idea. The difficulty lies in the specific content. Example:
```typescript
import { Route } from '@/types';
import { load } from 'cheerio';
import { parseDate } from '@/utils/parse-date';
import logger from '@/utils/logger';
import puppeteer from '@/utils/puppeteer';
import cache from '@/utils/cache';

export const route: Route = {
    path: '/user/:userId',
    categories: ['bbs'],
    example: '/nodeseek/user/1',
    parameters: { userId: 'User ID, e.g., 1' },
    features: {
        requireConfig: false,
        requirePuppeteer: true, // Enable Puppeteer
        antiCrawler: true, // Enable anti-crawler
        supportBT: false,
        supportPodcast: false,
        supportScihub: false,
    },
    radar: [
        {
            source: ['nodeseek.com/space/:userId'],
            target: '/user/:userId',
        },
    ],
    name: 'NodeSeek User Topics',
    maintainers: ['Your Name'],
    handler: async (ctx) => {
        const userId = ctx.req.param('userId');
        const baseUrl = 'https://www.nodeseek.com';
        const userUrl = `${baseUrl}/space/${userId}#/discussions`;

        // Import Puppeteer utility class and initialize browser instance
        const browser = await puppeteer();
        // Open a new tab
        const page = await browser.newPage();
        // Set request headers
        await page.setExtraHTTPHeaders({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
            Referer: baseUrl,
            Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        });

        // Visit the target link
        logger.http(`Requesting ${userUrl}`);
        await page.goto(userUrl, {
            waitUntil: 'networkidle2', // Wait for the page to fully load
        });

        // Simulate scrolling the page (if needed)
        await page.evaluate(() => {
            window.scrollBy(0, window.innerHeight);
        });

        // Wait for the post list to load
        await page.waitForSelector('a[href^="/post-"]', { timeout: 7000 });

        // Get the HTML content of the page
        const response = await page.content();
        const $ = load(response);

        // Extract the post list
        let items = $('a[href^="/post-"]')
            .toArray()
            .map((item) => {
                const $item = $(item);
                const title = $item.find('span').text().trim();
                const link = `${baseUrl}${$item.attr('href')}`;
                return {
                    title,
                    link,
                };
            });

        // Exclude two fixed links in the footer
        const excludedLinks = ['/post-6797-1', '/post-6800-1'];
        items = items.filter((item) => !excludedLinks.includes(new URL(item.link).pathname));

        // Extract up to 15 posts
        items = items.slice(0, 15);

        // Print the extracted post list
        console.log('Extracted post list:', items); // Debug info

        // If the post list is empty, the dynamic content might not have loaded
        if (items.length === 0) {
            throw new Error('Failed to retrieve post list, please check the page structure');
        }

        // Get the content of each post
        items = await Promise.all(
            items.map((item) =>
                cache.tryGet(item.link, async () => {
                    // Open a new tab
                    const postPage = await browser.newPage();
                    // Set request headers
                    await postPage.setExtraHTTPHeaders({
                        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
                        Referer: baseUrl,
                    });
                    // Visit the post link
                    logger.http(`Requesting ${item.link}`);
                    await postPage.goto(item.link, {
                        waitUntil: 'networkidle2', // Wait for the page to fully load
                    });
                    // Get the post content
                    const postHtml = await postPage.content();
                    const $post = load(postHtml);
                    item.description = $post('article.post-content').html();
                    item.pubDate = parseDate($post('time').attr('datetime'));
                    // Close the post page
                    await postPage.close();
                    return item;
                })
            )
        );

        // Close the browser instance
        await browser.close();

        // Return RSS data
        return {
            title: `NodeSeek User ${userId} Topics`,
            link: userUrl,
            item: items,
        };
    },
};
```
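With the dev server running, this route should be reachable at http://localhost:1200/nodeseek/user/1 (the path given in the example field), and the console.log line above prints the extracted post list to the terminal, so you can see whether the selectors matched anything.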
Summary of Main Issues
In short, the main reason handcrafting the NodeSeek route failed was that I could not bypass NodeSeek’s anti-crawling measures and Cloudflare’s protection.
The most drastic measure RSSHub offers is simulating real browser behavior with Puppeteer to get around anti-crawling checks. However, machine-simulated behavior is easily detected by platforms like Cloudflare.
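As one concrete example of how that detection works: automated Chrome exposes navigator.webdriver as true, which is one of the easiest signals for a site to check. Below is a minimal sketch, assuming RSSHub’s puppeteer utility from the route above, of the usual attempt to hide that flag; it masks only this single fingerprint, and Cloudflare looks at many more, so tricks like this tend not to be enough.

```typescript
import puppeteer from '@/utils/puppeteer';

// Minimal sketch, assuming RSSHub's puppeteer utility from the route above.
// Overriding navigator.webdriver hides the most obvious automation signal,
// but Cloudflare fingerprints far more than this, so it is not a reliable bypass.
const browser = await puppeteer();
const page = await browser.newPage();

// Run before any of the target page's own scripts execute
await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
});

await page.goto('https://www.nodeseek.com', { waitUntil: 'networkidle2' });
console.log(await page.evaluate(() => navigator.webdriver)); // expect false now
await browser.close();
```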
During local testing, I had about a 50% success rate, and that was with the latest version of Puppeteer. If I used the Puppeteer version from RSSHub’s official dependencies, the success rate was less than 10%. Considering that submitting to RSSHub requires double review, the success rate is just too low.
For now, I have to reluctantly give up.
When It Rains, It Pours
This morning, I woke up to find that the sky had fallen—6 of my free *.US.KG domains were down because the parent domain stopped resolving.
Then, at noon, the sky fell again. The .news domain I had just set up was suspended by the registrar. They sent me an email asking for an explanation of why my personal information changed during the registration process, and then forcibly redirected the NS to an IP address no one recognizes.
Well, that’s my own fault; I messed up during registration. When I used the browser’s auto-fill directly, I accidentally filled in all my real information. Then, when I tried to change it back, an error occurred as soon as I made the changes.
PS: In the afternoon, *.us.kg returned to normal. But I feel like I won’t love it anymore.