Earlier this year, after obtaining the source code for the Soomal.com website, I uploaded it to my VPS. However, the original site's architecture is outdated, inconvenient to manage, and not mobile-friendly, so I recently undertook a complete overhaul and converted and migrated the entire site to Hugo.
Migration Plan Design
I had long considered revamping Soomal.cc. I had previously run some preliminary tests but encountered numerous issues, which led me to shelve the project temporarily.
Challenges and Difficulties
Large Volume of Articles
Soomal contains 9,630 articles, with the earliest dating back to 2003, totaling 19 million words.
The site also hosts 326,000 JPG images across more than 4,700 folders. Most images come in three sizes, though some are missing, resulting in a total size of nearly 70 GB.
Complexity in Article Conversion
The Soomal source code only includes HTML files for article pages. While these files might have been generated by the same program, preliminary tests revealed that the page structure had undergone multiple changes over time, with different tags used in various periods, making information extraction from the HTML files highly challenging.
- Encoding Issues: The original HTML files use GB2312 encoding and were likely hosted on a Windows server, so character encoding and HTML escape sequences need special handling during conversion (see the sketch after this list).
- Image Issues: The site contains a vast number of images, which are the essence of Soomal. However, these images use diverse tags and styles, making it difficult to extract every link and description without omissions.
- Tags and Categories: The site has nearly 12,000 article tags and over 20 categories, yet the article HTML files carry no category information; it only appears in the 2,000+ category listing pages. Tags also present problems: some contain spaces or special characters, and some are duplicated within the same article.
- Article Content: The HTML files nest the main text, related articles, and tags under a `div` with class `Doc`. I initially overlooked that related articles sit in lowercase `doc` blocks, which caused extraction errors during testing. It was only after noticing this discrepancy while browsing the site that I restarted the conversion project.
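The encoding point deserves a concrete illustration. Below is a minimal sketch of that kind of pre-processing (not the actual code used later, which simply reads files with `encoding='gb2312', errors='ignore'`), assuming only the Python standard library:

```python
import html

def read_legacy_html(path):
    """Decode a GB2312-era HTML file to Unicode text (illustrative sketch only)."""
    raw = open(path, 'rb').read()
    for enc in ('gb2312', 'gb18030'):  # GB18030 is a superset of GB2312/GBK
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: drop undecodable bytes instead of failing
    return raw.decode('gb2312', errors='ignore')

# Named/numeric escape sequences that survive into extracted strings can be
# resolved afterwards; '&nbsp;' becomes U+00A0 and '&amp;' becomes '&':
print(html.unescape('iPhone&nbsp;6 &amp; iPad'))
```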
Storage Solution Dilemma
I initially hosted Soomal.cc on a VPS. Over a few months, despite low traffic, data transfer soared to nearly 1.5 TB. Although the VPS offers unlimited bandwidth, this was concerning. After migrating to Hugo, I found that most free hosting services impose restrictions: GitHub recommends repositories under 1 GB, Cloudflare Pages limits a site to 20,000 files, Cloudflare R2 caps free storage at 10 GB, and Vercel and Netlify both limit traffic to 100 GB.
Conversion Methodology
Given the potential challenges in converting Soomal to Hugo, I devised a five-step migration plan.
Step 1: Convert HTML Files to Markdown
Define Conversion Requirements
- Extract Titles: Retrieve the article title from the `<title>` tag. For example, extract 谈谈手机产业链和手机厂商的相互影响 from `<title>刘延作品 - 谈谈手机产业链和手机厂商的相互影响 [Soomal]</title>`.
- Extract Tags: Use keyword filtering to locate the tag block in the HTML, extract the tag names, and enclose each in quotes to handle tag names containing spaces.
- Extract Main Text: Retrieve the article body from the `Doc` block and truncate the content after the nested `doc` block.
- Extract Metadata: Gather publication dates, author information, and header images from the HTML.
- Extract Images: Identify and extract all image references (e.g. `smallpic`, `bigpic`, `smallpic2`, `wrappic`).
- Extract Special Content: Include subheadings, download links, tables, etc.
File Conversion
Given the clear requirements, I used Python scripts for the conversion.
View conversion script example
```python
import os
import re
from bs4 import BeautifulSoup, Tag, NavigableString
from datetime import datetime
def convert_html_to_md(html_path, output_dir):
try:
# Read HTML files with GB2312 encoding
with open(html_path, 'r', encoding='gb2312', errors='ignore') as f:
html_content = f.read()
soup = BeautifulSoup(html_content, 'html.parser')
# 1. Extract title
title = extract_title(soup)
# 2. Extract bookmark tags
bookmarks = extract_bookmarks(soup)
# 3. Extract title image and info
title_img, info_content = extract_title_info(soup)
# 4. Extract main content
body_content = extract_body_content(soup)
# Generate YAML frontmatter
frontmatter = f"""---
title: "{title}"
date: {datetime.now().strftime('%Y-%m-%dT%H:%M:%S+08:00')}
tags: {bookmarks}
title_img: "{title_img}"
info: "{info_content}"
---\n\n"""
# Generate Markdown content
markdown_content = frontmatter + body_content
# Save Markdown file
output_path = os.path.join(output_dir, os.path.basename(html_path).replace('.htm', '.md'))
with open(output_path, 'w', encoding='utf-8') as f:
f.write(markdown_content)
return f"Conversion successful: {os.path.basename(html_path)}"
except Exception as e:
return f"Conversion failed {os.path.basename(html_path)}: {str(e)}"
def extract_title(soup):
"""Extract title"""
if soup.title:
return soup.title.string.strip()
return ""
def extract_bookmarks(soup):
"""Extract bookmark tags, each enclosed in quotes"""
bookmarks = []
bookmark_element = soup.find(string=re.compile(r'本文的相关书签:'))
if bookmark_element:
parent = bookmark_element.find_parent(['ul', 'li'])
if parent:
# Extract text from all <a> tags
for a_tag in parent.find_all('a'):
text = a_tag.get_text().strip()
if text:
# Enclose each tag in quotes
bookmarks.append(f'"{text}"')
return f"[{', '.join(bookmarks)}]" if bookmarks else "[]"
def extract_title_info(soup):
"""Extract title image and info content"""
title_img = ""
info_content = ""
titlebox = soup.find('div', class_='titlebox')
if titlebox:
# Extract title image
title_img_div = titlebox.find('div', class_='titleimg')
if title_img_div and title_img_div.img:
title_img = title_img_div.img['src']
# Extract info content
info_div = titlebox.find('div', class_='info')
if info_div:
# Remove all HTML tags, keeping only text
info_content = info_div.get_text().strip()
return title_img, info_content
def extract_body_content(soup):
"""Extract main content and process images"""
body_content = ""
doc_div = soup.find('div', class_='Doc') # Note uppercase 'D'
if doc_div:
# Remove all nested div class="doc" (lowercase)
for nested_doc in doc_div.find_all('div', class_='doc'):
nested_doc.decompose()
# Process images
process_images(doc_div)
# Iterate through all child elements to build Markdown content
for element in doc_div.children:
if isinstance(element, Tag):
if element.name == 'div' and 'subpagetitle' in element.get('class', []):
# Convert to subheading
body_content += f"## {element.get_text().strip()}\n\n"
else:
# Preserve other content
body_content += element.get_text().strip() + "\n\n"
elif isinstance(element, NavigableString):
body_content += element.strip() + "\n\n"
return body_content.strip()
def process_images(container):
"""Process image content (Rules A/B/C)"""
# A: Handle <li data-src> tags
for li in container.find_all('li', attrs={'data-src': True}):
img_url = li['data-src'].replace('..', 'https://soomal.cc', 1)
caption_div = li.find('div', class_='caption')
content_div = li.find('div', class_='content')
alt_text = caption_div.get_text().strip() if caption_div else ""
meta_text = content_div.get_text().strip() if content_div else ""
# Create Markdown image syntax
img_md = f"\n\n{meta_text}\n\n"
li.replace_with(img_md)
# B: Process <span class="smallpic"> tags
for span in container.find_all('span', class_='smallpic'):
img = span.find('img')
if img and 'src' in img.attrs:
img_url = img['src'].replace('..', 'https://soomal.cc', 1)
caption_div = span.find('div', class_='caption')
content_div = span.find('div', class_='content')
alt_text = caption_div.get_text().strip() if caption_div else ""
meta_text = content_div.get_text().strip() if content_div else ""
# Create Markdown image syntax
img_md = f"\n\n{meta_text}\n\n"
span.replace_with(img_md)
# C: Process <div class="bigpic"> tags
for div in container.find_all('div', class_='bigpic'):
img = div.find('img')
if img and 'src' in img.attrs:
img_url = img['src'].replace('..', 'https://soomal.cc', 1)
caption_div = div.find('div', class_='caption')
content_div = div.find('div', class_='content')
alt_text = caption_div.get_text().strip() if caption_div else ""
meta_text = content_div.get_text().strip() if content_div else ""
# Create Markdown image syntax
img_md = f"\n\n{meta_text}\n\n"
div.replace_with(img_md)
if __name__ == "__main__":
input_dir = 'doc'
output_dir = 'markdown_output'
# Create output directory
os.makedirs(output_dir, exist_ok=True)
# Process all HTML files
for filename in os.listdir(input_dir):
if filename.endswith('.htm'):
html_path = os.path.join(input_dir, filename)
result = convert_html_to_md(html_path, output_dir)
print(result)
```
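Note that the requirements above also mention `smallpic2` and `wrappic` image blocks, which the script shown here does not handle explicitly. Assuming they share the `smallpic` structure (an `img` plus optional `caption`/`content` divs), the missing rules would mirror rule B; a hypothetical sketch:

```python
def process_extra_images(container):
    """Hypothetical rules D/E for smallpic2 and wrappic blocks.

    Assumes the same inner structure as smallpic: an <img> with optional
    caption/content divs. Adjust the selectors if the real markup differs.
    """
    for cls in ('smallpic2', 'wrappic'):
        for node in container.find_all(['span', 'div'], class_=cls):
            img = node.find('img')
            if not (img and 'src' in img.attrs):
                continue
            img_url = img['src'].replace('..', 'https://soomal.cc', 1)
            caption_div = node.find('div', class_='caption')
            content_div = node.find('div', class_='content')
            alt_text = caption_div.get_text().strip() if caption_div else ""
            meta_text = content_div.get_text().strip() if content_div else ""
            node.replace_with(f"\n\n![{alt_text}]({img_url})\n\n{meta_text}\n\n")
```

If the real markup differs, the selectors would need adjusting; the function would be called next to `process_images` inside `extract_body_content`.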
Step 2: Processing Categories and Abstracts
Because the original HTML files contain no category information, the category directory pages had to be processed separately. Article abstracts were handled at the same time.
Extracting Category and Abstract Information
A Python script extracts and formats the category and abstract information from the 2,000+ category pages.
View the extraction code
```python
import os
import re
from bs4 import BeautifulSoup
import codecs
from collections import defaultdict
def extract_category_info(folder_path):
# Use defaultdict to automatically initialize nested dictionaries
article_categories = defaultdict(set) # Stores article ID to category mapping
article_summaries = {} # Stores article ID to abstract mapping
# Iterate through all HTM files in the folder
for filename in os.listdir(folder_path):
if not filename.endswith('.htm'):
continue
file_path = os.path.join(folder_path, filename)
try:
# Read file with GB2312 encoding and convert to UTF-8
with codecs.open(file_path, 'r', encoding='gb2312', errors='replace') as f:
content = f.read()
soup = BeautifulSoup(content, 'html.parser')
# Extract category name
title_tag = soup.title
if title_tag:
title_text = title_tag.get_text().strip()
# Extract content before the first hyphen
category_match = re.search(r'^([^-]+)', title_text)
if category_match:
category_name = category_match.group(1).strip()
# Add quotes if category name contains spaces
if ' ' in category_name:
category_name = f'"{category_name}"'
else:
category_name = "Unknown_Category"
else:
category_name = "Unknown_Category"
# Extract article information
for item in soup.find_all('div', class_='item'):
# Extract article ID
article_link = item.find('a', href=True)
if article_link:
href = article_link['href']
article_id = re.search(r'../doc/(\d+)\.htm', href)
if article_id:
article_id = article_id.group(1)
else:
continue
else:
continue
# Extract article abstract
synopsis_div = item.find('div', class_='synopsis')
synopsis = synopsis_div.get_text().strip() if synopsis_div else ""
# Store category information
article_categories[article_id].add(category_name)
# Store abstract (only once to avoid overwriting)
if article_id not in article_summaries:
article_summaries[article_id] = synopsis
except UnicodeDecodeError:
# Attempt using GBK encoding as fallback
try:
with codecs.open(file_path, 'r', encoding='gbk', errors='replace') as f:
content = f.read()
# Reprocess content...
# Note: Repeated processing code omitted here; should be extracted as a function
# For code completeness, we include the repeated logic
soup = BeautifulSoup(content, 'html.parser')
title_tag = soup.title
if title_tag:
title_text = title_tag.get_text().strip()
category_match = re.search(r'^([^-]+)', title_text)
if category_match:
category_name = category_match.group(1).strip()
if ' ' in category_name:
category_name = f'"{category_name}"'
else:
category_name = "Unknown_Category"
else:
category_name = "Unknown_Category"
for item in soup.find_all('div', class_='item'):
article_link = item.find('a', href=True)
if article_link:
href = article_link['href']
article_id = re.search(r'../doc/(\d+)\.htm', href)
if article_id:
article_id = article_id.group(1)
else:
continue
else:
continue
synopsis_div = item.find('div', class_='synopsis')
synopsis = synopsis_div.get_text().strip() if synopsis_div else ""
article_categories[article_id].add(category_name)
if article_id not in article_summaries:
article_summaries[article_id] = synopsis
except Exception as e:
print(f"Error processing file {filename} (after trying GBK): {str(e)}")
continue
except Exception as e:
print(f"Error processing file {filename}: {str(e)}")
continue
return article_categories, article_summaries
def save_to_markdown(article_categories, article_summaries, output_path):
with open(output_path, 'w', encoding='utf-8') as md_file:
# Write Markdown header
md_file.write("# Article Categories and Summaries\n\n")
md_file.write("> This file contains IDs, categories and summaries of all articles\n\n")
# Sort by article ID
sorted_article_ids = sorted(article_categories.keys(), key=lambda x: int(x))
for article_id in sorted_article_ids:
# Get sorted category list
categories = sorted(article_categories[article_id])
# Format as list string
categories_str = ", ".join(categories)
# Get summary
summary = article_summaries.get(article_id, "No summary available")
# Write Markdown content
md_file.write(f"## Filename: {article_id}\n")
md_file.write(f"**Categories**: {categories_str}\n")
md_file.write(f"**Summary**: {summary}\n\n")
md_file.write("---\n\n")
if __name__ == "__main__":
# Configure input and output paths
input_folder = 'Categories' # Replace with your HTM folder path
output_md = 'articles_categories.md'
# Execute extraction
article_categories, article_summaries = extract_category_info(input_folder)
# Save results to Markdown file
save_to_markdown(article_categories, article_summaries, output_md)
# Print statistics
print(f"Successfully processed data for {len(article_categories)} articles")
print(f"Saved to {output_md}")
print(f"Found {len(article_summaries)} articles with summaries")
```
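As the comment in the fallback branch admits, the GB2312 and GBK paths duplicate the same parsing logic (and since the first read already uses `errors='replace'`, the `UnicodeDecodeError` branch is in practice rarely reached). A sketch of the obvious refactor, decoding first and parsing once:

```python
import codecs

def read_with_fallback(file_path):
    """Decode a category page, preferring GB2312 and falling back to GBK (sketch)."""
    for enc in ('gb2312', 'gbk'):
        try:
            with codecs.open(file_path, 'r', encoding=enc, errors='strict') as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes rather than fail
    with codecs.open(file_path, 'r', encoding='gbk', errors='replace') as f:
        return f.read()

# extract_category_info() would then parse each file exactly once:
# soup = BeautifulSoup(read_with_fallback(file_path), 'html.parser')
```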
Writing Category and Summary Information to the Markdown Files
This step is relatively simple: write the extracted category and summary data into the previously converted Markdown files one by one.
View the writing script
```python
import os
import re
import ruamel.yaml
from collections import defaultdict
def parse_articles_categories(md_file_path):
"""
Parse articles_categories.md file to extract article IDs, categories and summaries
"""
article_info = defaultdict(dict)
current_id = None
try:
with open(md_file_path, 'r', encoding='utf-8') as f:
for line in f:
# Match filename
filename_match = re.match(r'^## Filename: (\d+)$', line.strip())
if filename_match:
current_id = filename_match.group(1)
continue
# Match category information
categories_match = re.match(r'^\*\*Categories\*\*: (.+)$', line.strip())
if categories_match and current_id:
categories_str = categories_match.group(1)
# Clean category string, remove extra spaces and quotes
categories = [cat.strip().strip('"') for cat in categories_str.split(',')]
article_info[current_id]['categories'] = categories
continue
# Match summary information
summary_match = re.match(r'^\*\*Summary\*\*: (.+)$', line.strip())
if summary_match and current_id:
summary = summary_match.group(1)
article_info[current_id]['summary'] = summary
continue
# Reset current ID when encountering separator
if line.startswith('---'):
current_id = None
except Exception as e:
print(f"Error parsing articles_categories.md file: {str(e)}")
return article_info
def update_markdown_files(article_info, md_folder):
"""
Update Markdown files by adding category and summary information to frontmatter
"""
updated_count = 0
skipped_count = 0
# Initialize YAML parser
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.width = 1000 # Prevent long summaries from line breaking
for filename in os.listdir(md_folder):
if not filename.endswith('.md'):
continue
article_id = filename[:-3] # Remove .md extension
file_path = os.path.join(md_folder, filename)
# Check if information exists for this article
if article_id not in article_info:
skipped_count += 1
continue
try:
with open(file_path, 'r', encoding='utf-8') as f:
content = f.read()
# Parse frontmatter
frontmatter_match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
if not frontmatter_match:
print(f"No frontmatter found in file {filename}, skipping")
skipped_count += 1
continue
frontmatter_content = frontmatter_match.group(1)
# Convert frontmatter to dictionary
data = yaml.load(frontmatter_content)
if data is None:
data = {}
# Add category and summary information
info = article_info[article_id]
# Add categories
if 'categories' in info:
# If categories already exist, merge them (deduplicate)
existing_categories = set(data.get('categories', []))
new_categories = set(info['categories'])
combined_categories = sorted(existing_categories.union(new_categories))
data['categories'] = combined_categories
# Add summary (if summary exists and is not empty)
if 'summary' in info and info['summary']:
# Only update if summary doesn't exist or new summary is not empty
if 'summary' not in data or info['summary']:
data['summary'] = info['summary']
# Regenerate frontmatter
new_frontmatter = '---\n'
with ruamel.yaml.compat.StringIO() as stream:
    yaml.dump(data, stream)
    new_frontmatter += stream.getvalue().strip()
new_frontmatter += '\n---'
# Replace original frontmatter
new_content = content.replace(frontmatter_match.group(0), new_frontmatter)
# Write to file
with open(file_path, 'w', encoding='utf-8') as f:
f.write(new_content)
updated_count += 1
except Exception as e:
print(f"Error updating file {filename}: {str(e)}")
skipped_count += 1
return updated_count, skipped_count
if __name__ == "__main__":
# Configure paths
articles_md = 'articles_categories.md' # Markdown file containing category and summary information
md_folder = 'markdown_output' # Folder containing Markdown articles
# Parse articles_categories.md file
print("Parsing articles_categories.md file...")
article_info = parse_articles_categories(articles_md)
print(f"Successfully parsed information for {len(article_info)} articles")
# Update Markdown files
print(f"\nUpdating category and summary information for {len(article_info)} articles...")
updated, skipped = update_markdown_files(article_info, md_folder)
# Print statistics
print(f"\nProcessing complete!")
print(f"Successfully updated: {updated} files")
print(f"Skipped: {skipped} files")
print(f"Articles with found information: {len(article_info)}")
```
Step 3: Convert Article Frontmatter Information
This step primarily involves correcting the frontmatter section of the output Markdown files to meet Hugo theme requirements.
- Revise article header information according to frontmatter specifications
Mainly handles special characters, date formats, authors, featured images, tags, categories, etc.
View conversion code
```python
import os
import re
import frontmatter
import yaml
from datetime import datetime
def escape_special_characters(text):
"""Escape special characters in YAML"""
# Escape backslashes while preserving already escaped characters
return re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'\\\\', text)
def process_md_files(folder_path):
for filename in os.listdir(folder_path):
if filename.endswith(".md"):
file_path = os.path.join(folder_path, filename)
try:
# Read file content
with open(file_path, 'r', encoding='utf-8') as f:
content = f.read()
# Manually split frontmatter and content
if content.startswith('---\n'):
parts = content.split('---\n', 2)
if len(parts) >= 3:
fm_text = parts[1]
body_content = parts[2] if len(parts) > 2 else ""
# Escape special characters
fm_text = escape_special_characters(fm_text)
# Recombine content
new_content = f"---\n{fm_text}---\n{body_content}"
# Parse frontmatter using safe loading
post = frontmatter.loads(new_content)
# Process info field
if 'info' in post.metadata:
info = post.metadata['info']
# Extract date
date_match = re.search(r'On (\d{4}\.\d{1,2}\.\d{1,2} \d{1,2}:\d{2}:\d{2})', info)
if date_match:
date_str = date_match.group(1)
try:
dt = datetime.strptime(date_str, "%Y.%m.%d %H:%M:%S")
post.metadata['date'] = dt.strftime("%Y-%m-%dT%H:%M:%S+08:00")
except ValueError:
# Keep original date as fallback
pass
# Extract author
author_match = re.match(r'^(.+?)作品', info)
if author_match:
authors = author_match.group(1).strip()
# Split multiple authors
author_list = [a.strip() for a in re.split(r'\s+', authors) if a.strip()]
post.metadata['author'] = author_list
# Create description
desc_parts = info.split('|', 1)
if len(desc_parts) > 1:
post.metadata['description'] = desc_parts[1].strip()
# Remove original info
del post.metadata['info']
# Process title_img
if 'title_img' in post.metadata:
img_url = post.metadata['title_img'].replace("../", "https://soomal.cc/")
# Handle potential double slashes
img_url = re.sub(r'(?<!:)/{2,}', '/', img_url)
post.metadata['cover'] = {
'image': img_url,
'caption': "",
'alt': "",
'relative': False
}
del post.metadata['title_img']
# Modify title
if 'title' in post.metadata:
title = post.metadata['title']
# Remove content before "-"
if '-' in title:
new_title = title.split('-', 1)[1].strip()
post.metadata['title'] = new_title
# Save modified file
with open(file_path, 'w', encoding='utf-8') as f_out:
f_out.write(frontmatter.dumps(post))
except Exception as e:
print(f"Error processing file {filename}: {str(e)}")
# Log error files for later review
with open("processing_errors.log", "a", encoding="utf-8") as log:
log.write(f"Error in {filename}: {str(e)}\n")
if __name__ == "__main__":
folder_path = "markdown_output" # Replace with your actual path
process_md_files(folder_path)
print("Frontmatter processing completed for all Markdown files!")
```
- Streamlining Tags and Categories
Soomal.com originally had over 20 article categories, some of which were meaningless (e.g., the “All Articles” category). Additionally, there was significant overlap between article categories and tags. To ensure uniqueness between categories and tags, further simplification was implemented. Another goal was to minimize the number of files generated during the final website build.
View the code for streamlining tags and categories
```python
import os
import yaml
import frontmatter
def clean_hugo_tags_categories(folder_path):
"""
Clean up tags and categories in Hugo articles:
1. Remove "All Articles" from categories
2. Remove tags that duplicate categories
"""
# Valid categories list ("All Articles" removed)
valid_categories = [
"Digital Devices", "Audio", "Music", "Mobile Digital", "Reviews", "Introductions",
"Evaluation Reports", "Galleries", "Smartphones", "Android", "Headphones",
"Musicians", "Imaging", "Digital Terminals", "Speakers", "iOS", "Cameras",
"Sound Cards", "Album Reviews", "Tablets", "Technology", "Applications",
"Portable Players", "Windows", "Digital Accessories", "Essays", "DACs",
"Audio Systems", "Lenses", "Musical Instruments", "Audio Codecs"
]
# Process all Markdown files in the folder
for filename in os.listdir(folder_path):
if not filename.endswith('.md'):
continue
filepath = os.path.join(folder_path, filename)
with open(filepath, 'r', encoding='utf-8') as f:
post = frontmatter.load(f)
# 1. Clean categories (remove invalid entries and deduplicate)
if 'categories' in post.metadata:
# Convert to set for deduplication + filter invalid categories
categories = list(set(post.metadata['categories']))
cleaned_categories = [
cat for cat in categories
if cat in valid_categories
]
post.metadata['categories'] = cleaned_categories
# 2. Clean tags (remove duplicates with categories)
if 'tags' in post.metadata:
current_cats = post.metadata.get('categories', [])
# Convert to set for deduplication + filter category duplicates
tags = list(set(post.metadata['tags']))
cleaned_tags = [
tag for tag in tags
if tag not in current_cats
]
post.metadata['tags'] = cleaned_tags
# Save modified file
with open(filepath, 'w', encoding='utf-8') as f_out:
f_out.write(frontmatter.dumps(post))
if __name__ == "__main__":
# Example usage (modify with your actual path)
md_folder = "./markdown_output"
clean_hugo_tags_categories(md_folder)
print(f"Processing completed: {len(os.listdir(md_folder))} files")
```
Step 4: Reducing Image Quantity
During the HTML-to-Markdown conversion, only the article content was extracted, so many of the cropped image variants from the original site became unnecessary. I therefore matched the converted Markdown files against the original site's images to identify only those the new site actually needs.
This step reduced the total number of images from 326,000 to 118,000.
- Extracting Image Links
Extract all image links from Markdown files. Since the image links were standardized during conversion, this process was straightforward.
View the extraction code
```python
import os
import re
import argparse
def extract_image_links(directory):
"""Extract image links from all md files in directory"""
image_links = set()
pattern = re.compile(r'https://soomal\.cc[^\s\)\]\}]*?\.jpg', re.IGNORECASE)
for root, _, files in os.walk(directory):
for filename in files:
if filename.endswith('.md'):
filepath = os.path.join(root, filename)
try:
with open(filepath, 'r', encoding='utf-8') as f:
content = f.read()
matches = pattern.findall(content)
if matches:
image_links.update(matches)
except Exception as e:
print(f"Error processing {filepath}: {str(e)}")
return sorted(image_links)
def save_links_to_file(links, output_file):
"""Save links to file"""
with open(output_file, 'w', encoding='utf-8') as f:
for link in links:
f.write(link + '\n')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Extract image links from Markdown')
parser.add_argument('--input', default='markdown_output', help='Path to Markdown directory')
parser.add_argument('--output', default='image_links.txt', help='Output file path')
args = parser.parse_args()
print(f"Scanning directory: {args.input}")
links = extract_image_links(args.input)
print(f"Found {len(links)} unique image links")
save_links_to_file(links, args.output)
print(f"Links saved to: {args.output}")
```
- Copying Corresponding Images
Use the extracted image links to locate the corresponding files in the original site directory and copy them, preserving the directory structure.
A. View Windows Copy Code
```python
import os
import shutil
import time
import sys
def main():
# Configuration
source_drive = "F:\\"
target_drive = "D:\\"
image_list_file = r"D:\trans-soomal\image_links.txt"
log_file = r"D:\trans-soomal\image_copy_log.txt"
error_log_file = r"D:\trans-soomal\image_copy_errors.txt"
print("Image copy script starting...")
# Record start time
start_time = time.time()
# Create log files
with open(log_file, "w", encoding="utf-8") as log, open(error_log_file, "w", encoding="utf-8") as err_log:
log.write(f"Image Copy Log - Start Time: {time.ctime(start_time)}\n")
err_log.write("Failed copies:\n")
try:
# Read image list
with open(image_list_file, "r", encoding="utf-8") as f:
image_paths = [line.strip() for line in f if line.strip()]
total_files = len(image_paths)
success_count = 0
fail_count = 0
skipped_count = 0
print(f"Found {total_files} images to copy")
# Process each file
for i, relative_path in enumerate(image_paths):
# Display progress
progress = (i + 1) / total_files * 100
sys.stdout.write(f"\rProgress: {progress:.2f}% ({i+1}/{total_files})")
sys.stdout.flush()
# Build full paths
source_path = os.path.join(source_drive, relative_path)
target_path = os.path.join(target_drive, relative_path)
try:
# Check if source exists
if not os.path.exists(source_path):
err_log.write(f"Source missing: {source_path}\n")
fail_count += 1
continue
# Check if target already exists
if os.path.exists(target_path):
log.write(f"File already exists, skipping: {target_path}\n")
skipped_count += 1
continue
# Create target directory
target_dir = os.path.dirname(target_path)
os.makedirs(target_dir, exist_ok=True)
# Copy file
shutil.copy2(source_path, target_path)
# Log success
log.write(f"[SUCCESS] Copied {source_path} to {target_path}\n")
success_count += 1
except Exception as e:
# Log failure
err_log.write(f"[FAILED] {source_path} -> {target_path} : {str(e)}\n")
fail_count += 1
# Calculate elapsed time
end_time = time.time()
elapsed_time = end_time - start_time
minutes, seconds = divmod(elapsed_time, 60)
hours, minutes = divmod(minutes, 60)
# Write summary
summary = f"""
================================
Copy operation completed
Start time: {time.ctime(start_time)}
End time: {time.ctime(end_time)}
Total duration: {int(hours)}h {int(minutes)}m {seconds:.2f}s
Total files: {total_files}
Successfully copied: {success_count}
Skipped (existing): {skipped_count}
Failed: {fail_count}
================================
"""
log.write(summary)
print(summary)
except Exception as e:
print(f"\nError occurred: {str(e)}")
err_log.write(f"Script error: {str(e)}\n")
if __name__ == "__main__":
main()
```
B. View Linux Copy Code
```bash
#!/bin/bash
# Configuration parameters
LINK_FILE="/user/image_links.txt" # Replace with actual link file path
SOURCE_BASE="/user/soomal.cc/index"
DEST_BASE="/user/images.soomal.cc/index"
LOG_FILE="/var/log/image_copy_$(date +%Y%m%d_%H%M%S).log"
THREADS=3 # Number of parallel copy jobs
# Start logging
{
echo "===== Copy Task Started: $(date) ====="
echo "Source base directory: $SOURCE_BASE"
echo "Destination base directory: $DEST_BASE"
echo "Link file: $LINK_FILE"
echo "Thread count: $THREADS"
# Path validation example
echo -e "\n=== Path Validation ==="
sample_url="https://soomal.cc/images/doc/20090406/00000007.jpg"
expected_src="${SOURCE_BASE}/images/doc/20090406/00000007.jpg"
expected_dest="${DEST_BASE}/images/doc/20090406/00000007.jpg"
echo "Example URL: $sample_url"
echo "Expected source path: $expected_src"
echo "Expected destination path: $expected_dest"
if [[ -f "$expected_src" ]]; then
echo "Validation successful: Example source file exists"
else
echo "Validation failed: Example source file missing! Please check paths"
exit 1
fi
# Create destination base directory
mkdir -p "${DEST_BASE}/images"
# Prepare parallel processing
echo -e "\n=== Processing Started ==="
total=$(wc -l < "$LINK_FILE")
echo "Total links: $total"
counter=0
# Processing function
process_link() {
local url="$1"
local rel_path="${url#https://soomal.cc}"
# Build full paths
local src_path="${SOURCE_BASE}${rel_path}"
local dest_path="${DEST_BASE}${rel_path}"
# Create destination directory
mkdir -p "$(dirname "$dest_path")"
# Copy file
if [[ -f "$src_path" ]]; then
if cp -f "$src_path" "$dest_path"; then
echo "SUCCESS: $rel_path"
return 0
else
echo "COPY FAILED: $rel_path"
return 2
fi
else
echo "MISSING: $rel_path"
return 1
fi
}
# Export function for parallel use
export -f process_link
export SOURCE_BASE DEST_BASE
# Use parallel for concurrent processing
echo "Starting parallel copying..."
parallel --bar --jobs $THREADS --progress \
--halt soon,fail=1 \
--joblog "${LOG_FILE}.jobs" \
--tagstring "{}" \
"process_link {}" < "$LINK_FILE" | tee -a "$LOG_FILE"
# Collect results
success=$(grep -c 'SUCCESS:' "$LOG_FILE")
missing=$(grep -c 'MISSING:' "$LOG_FILE")
failed=$(grep -c 'COPY FAILED:' "$LOG_FILE")
# Final statistics
echo -e "\n===== Copy Task Completed: $(date) ====="
echo "Total links: $total"
echo "Successfully copied: $success"
echo "Missing files: $missing"
echo "Copy failures: $failed"
echo "Success rate: $((success * 100 / total))%"
} | tee "$LOG_FILE"
# Save missing files list
grep '^MISSING:' "$LOG_FILE" | cut -d' ' -f2- > "${LOG_FILE%.log}_missing.txt"
echo "Missing files list: ${LOG_FILE%.log}_missing.txt"
```
Step 5: Compress Image Sizes
I had previously compressed the website's source images once, but that wasn't enough. My goal was to get the total image size under 10 GB, leaving open a possible future migration to Cloudflare R2.
- Convert JPG to WebP
In the earlier compression pass I had kept the images in JPG format, because the large number of HTML files referenced .jpg paths and changing the format would have broken access. Since the site is now migrating to Hugo, there is no longer any need to keep JPG, so I converted the images directly to WebP. In addition, since my page layout is 960px wide and I don't use any fancy lightbox plugin, resizing the images to 960px reduces their size further.
Actual tests showed that this pass brought the total image size down to 7.7 GB. However, I noticed a flaw in the processing logic: Soomal has many portrait images as well as landscape ones, and a 960px width looks rather small on a 4K display. I ultimately converted the images with the short edge capped at 1280px at 85% quality, which came to about 14 GB and fits within my VPS's 20 GB of storage. A test with the short edge capped at 1150px at 80% quality also met the 10 GB target.
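The script below is the 960px-width variant from the first test. For the short-edge-1280px / 85% quality settings described above, only the ImageMagick resize arguments need to change; one possible form, assuming ImageMagick's `^` (fill-area) and `>` (shrink-only) geometry flags, would be:

```python
# Hypothetical replacement for the `cmd` list in the script below:
# shrink only images whose short edge exceeds 1280px, keeping quality at 85.
cmd = [
    magick_path,
    str(img_path),
    "-resize", "1280x1280^>",   # '^': fit so the smaller side becomes 1280; '>': only shrink larger images
    "-quality", "85",
    "-define", "webp:lossless=false",
    str(temp_path)
]
```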
View Image Conversion Code
```python
import os
import subprocess
import time
import sys
import shutil
from pathlib import Path
def main():
# Configure paths
source_dir = Path("D:\\images") # Original image directory
output_dir = Path("D:\\images_webp") # WebP output directory
temp_dir = Path("D:\\temp_webp") # Temporary processing directory
magick_path = "C:\\webp\\magick.exe" # ImageMagick path
# Create necessary directories
output_dir.mkdir(parents=True, exist_ok=True)
temp_dir.mkdir(parents=True, exist_ok=True)
# Log files
log_file = output_dir / "conversion_log.txt"
stats_file = output_dir / "conversion_stats.csv"
print("Image conversion script starting...")
print(f"Source directory: {source_dir}")
print(f"Output directory: {output_dir}")
print(f"Temporary directory: {temp_dir}")
# Initialize log
with open(log_file, "w", encoding="utf-8") as log:
log.write(f"Image conversion log - Start time: {time.ctime()}\n")
# Initialize stats file
with open(stats_file, "w", encoding="utf-8") as stats:
stats.write("Original File,Converted File,Original Size (KB),Converted Size (KB),Space Saved (KB),Savings Percentage\n")
# Collect all image files
image_exts = ('.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.gif')
all_images = []
for root, _, files in os.walk(source_dir):
for file in files:
if file.lower().endswith(image_exts):
all_images.append(Path(root) / file)
total_files = len(all_images)
converted_files = 0
skipped_files = 0
error_files = 0
print(f"Found {total_files} image files to process")
# Process each image
for idx, img_path in enumerate(all_images):
try:
# Display progress
progress = (idx + 1) / total_files * 100
sys.stdout.write(f"\rProgress: {progress:.2f}% ({idx+1}/{total_files})")
sys.stdout.flush()
# Create relative path structure
rel_path = img_path.relative_to(source_dir)
webp_path = output_dir / rel_path.with_suffix('.webp')
webp_path.parent.mkdir(parents=True, exist_ok=True)
# Check if file already exists
if webp_path.exists():
skipped_files += 1
continue
# Create temporary file path
temp_path = temp_dir / f"{img_path.stem}_temp.webp"
# Get original file size
orig_size = img_path.stat().st_size / 1024 # KB
# Convert and resize using ImageMagick
cmd = [
magick_path,
str(img_path),
"-resize", "960>", # Resize only if width exceeds 960px
"-quality", "85", # Initial quality 85
"-define", "webp:lossless=false",
str(temp_path)
]
# Execute command
result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
if result.returncode != 0:
# Log conversion failure
with open(log_file, "a", encoding="utf-8") as log:
log.write(f"[ERROR] Failed to convert {img_path}: {result.stderr}\n")
error_files += 1
continue
# Move temporary file to target location
shutil.move(str(temp_path), str(webp_path))
# Get converted file size
new_size = webp_path.stat().st_size / 1024 # KB
# Calculate space savings
saved = orig_size - new_size
saved_percent = (saved / orig_size) * 100 if orig_size > 0 else 0
# Record statistics
with open(stats_file, "a", encoding="utf-8") as stats:
stats.write(f"{img_path},{webp_path},{orig_size:.2f},{new_size:.2f},{saved:.2f},{saved_percent:.2f}\n")
converted_files += 1
except Exception as e:
with open(log_file, "a", encoding="utf-8") as log:
log.write(f"[EXCEPTION] Error processing {img_path}: {str(e)}\n")
error_files += 1
# Completion report
total_size = sum(f.stat().st_size for f in output_dir.glob('**/*') if f.is_file())
total_size_gb = total_size / (1024 ** 3) # Convert to GB
end_time = time.time()
elapsed = end_time - start_time
mins, secs = divmod(elapsed, 60)
hours, mins = divmod(mins, 60)
with open(log_file, "a", encoding="utf-8") as log:
log.write("\nConversion Report:\n")
log.write(f"Total files: {total_files}\n")
log.write(f"Successfully converted: {converted_files}\n")
log.write(f"Skipped files: {skipped_files}\n")
log.write(f"Error files: {error_files}\n")
log.write(f"Output directory size: {total_size_gb:.2f} GB\n")
print("\n\nConversion completed!")
print(f"Total files: {total_files}")
print(f"Successfully converted: {converted_files}")
print(f"Skipped files: {skipped_files}")
print(f"Error files: {error_files}")
print(f"Output directory size: {total_size_gb:.2f} GB")
# Clean up temporary directory
try:
shutil.rmtree(temp_dir)
print(f"Cleaned temporary directory: {temp_dir}")
except Exception as e:
print(f"Error cleaning temporary directory: {str(e)}")
print(f"Log file: {log_file}")
print(f"Statistics file: {stats_file}")
print(f"Total time elapsed: {int(hours)} hours {int(mins)} minutes {secs:.2f} seconds")
if __name__ == "__main__":
main()
```
- Further Image Compression
I originally designed this step to further compress images if the initial conversion didn’t reduce the total size below 10GB. However, the first step successfully resolved the issue, making additional compression unnecessary. Nevertheless, I tested further compression by converting images to WebP with a maximum short edge of 1280px and 60% quality, which resulted in a total size of only 9GB.
View Secondary Compression Code
```python
import os
import subprocess
import time
import sys
import shutil
from pathlib import Path
def main():
# Configure paths
webp_dir = Path("D:\\images_webp") # WebP directory
temp_dir = Path("D:\\temp_compress") # Temporary directory
cwebp_path = "C:\\Windows\\System32\\cwebp.exe" # cwebp path
# Create temporary directory
temp_dir.mkdir(parents=True, exist_ok=True)
# Log files
log_file = webp_dir / "compression_log.txt"
stats_file = webp_dir / "compression_stats.csv"
print("WebP compression script starting...")
print(f"Processing directory: {webp_dir}")
print(f"Temporary directory: {temp_dir}")
# Initialize log
with open(log_file, "w", encoding="utf-8") as log:
log.write(f"WebP Compression Log - Start time: {time.ctime()}\n")
# Initialize statistics file
with open(stats_file, "w", encoding="utf-8") as stats:
stats.write("Original File,Compressed File,Original Size (KB),New Size (KB),Space Saved (KB),Savings Percentage\n")
# Collect all WebP files
all_webp = list(webp_dir.glob('**/*.webp'))
total_files = len(all_webp)
if total_files == 0:
print("No WebP files found. Please run the conversion script first.")
return
print(f"Found {total_files} WebP files to compress")
compressed_count = 0
skipped_count = 0
error_count = 0
# Process each WebP file
for idx, webp_path in enumerate(all_webp):
try:
# Display progress
progress = (idx + 1) / total_files * 100
sys.stdout.write(f"\rProgress: {progress:.2f}% ({idx+1}/{total_files})")
sys.stdout.flush()
# Original size
orig_size = webp_path.stat().st_size / 1024 # KB
# Create temporary file path
temp_path = temp_dir / f"{webp_path.stem}_compressed.webp"
# Perform secondary compression using cwebp
cmd = [
cwebp_path,
"-q", "75", # Quality parameter
"-m", "6", # Maximum compression mode
str(webp_path),
"-o", str(temp_path)
]
# Execute command
result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
if result.returncode != 0:
# Log compression failure
with open(log_file, "a", encoding="utf-8") as log:
log.write(f"[ERROR] Failed to compress {webp_path}: {result.stderr}\n")
error_count += 1
continue
# Get new file size
new_size = temp_path.stat().st_size / 1024 # KB
# Skip if the new file is larger than the original
if new_size >= orig_size:
skipped_count += 1
temp_path.unlink() # Delete temporary file
continue
# Calculate space savings
saved = orig_size - new_size
saved_percent = (saved / orig_size) * 100 if orig_size > 0 else 0
# Record statistics
with open(stats_file, "a", encoding="utf-8") as stats:
stats.write(f"{webp_path},{webp_path},{orig_size:.2f},{new_size:.2f},{saved:.2f},{saved_percent:.2f}\n")
# Replace original file
webp_path.unlink() # Delete original file
shutil.move(str(temp_path), str(webp_path))
compressed_count += 1
except Exception as e:
with open(log_file, "a", encoding="utf-8") as log:
log.write(f"[Error] Processing {webp_path} failed: {str(e)}\n")
error_count += 1
# Completion report
total_size = sum(f.stat().st_size for f in webp_dir.glob('**/*') if f.is_file())
total_size_gb = total_size / (1024 ** 3) # Convert to GB
end_time = time.time()
elapsed = end_time - start_time
mins, secs = divmod(elapsed, 60)
hours, mins = divmod(mins, 60)
with open(log_file, "a", encoding="utf-8") as log:
log.write("\nCompression Report:\n")
log.write(f"Files processed: {total_files}\n")
log.write(f"Successfully compressed: {compressed_count}\n")
log.write(f"Skipped files: {skipped_count}\n")
log.write(f"Error files: {error_count}\n")
log.write(f"Total output directory size: {total_size_gb:.2f} GB\n")
print("\n\nCompression completed!")
print(f"Files processed: {total_files}")
print(f"Successfully compressed: {compressed_count}")
print(f"Skipped files: {skipped_count}")
print(f"Error files: {error_count}")
print(f"Total output directory size: {total_size_gb:.2f} GB")
# Clean temporary directory
try:
shutil.rmtree(temp_dir)
print(f"Cleaned temporary directory: {temp_dir}")
except Exception as e:
print(f"Error cleaning temporary directory: {str(e)}")
print(f"Log file: {log_file}")
print(f"Stats file: {stats_file}")
print(f"Total duration: {int(hours)}h {int(mins)}m {secs:.2f}s")
if __name__ == "__main__":
main()
```
Implementation Plan
Selecting the Right Hugo Theme
For a Hugo project with tens of thousands of markdown files, choosing a theme can be quite challenging.
One visually appealing theme I tested had still not finished generating after more than three hours. Some themes threw constant errors during generation, while others produced over 200,000 files.
Ultimately, I settled on the most stable option - the PaperMod theme. By default, this theme generates only about 100 files, and the final website contains fewer than 50,000 files, which is relatively efficient.
Although it doesn’t meet Cloudflare Pages’ 20,000-file limit, it’s sufficiently lean. The build took 6.5 minutes on GitHub Pages and 8 minutes on Vercel.
However, some issues emerged during the build:
- Search functionality: Due to the massive article volume, the default index file reached 80MB, rendering it practically unusable. I had to limit indexing to only article titles and summaries.
- Sitemap generation: The default 4MB sitemap consistently failed to load in Google Search Console, though Bing Webmaster Tools handled it without issues.
- Pagination: With 12,000 tags and 20 articles per page, this would generate 60,000 files. Even after increasing to 200 articles per page, there were still 37,000 files (while other files totaled only 12,000).
The tag issue presents an optimization opportunity: only displaying the top 1,000 most-used tags while incorporating others into article titles. This could potentially reduce the file count below 20,000, meeting Cloudflare Pages’ requirements.
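A rough sketch of that idea (not implemented yet): count tag frequencies across all frontmatter, keep only the top 1,000, and drop the rest (which could then be folded into titles or body text). It assumes the same `python-frontmatter` package used in the earlier steps:

```python
import os
from collections import Counter

import frontmatter

def keep_top_tags(md_folder, keep=1000):
    """Sketch: keep only the `keep` most frequent tags across all articles (idea only)."""
    counts = Counter()
    posts = {}
    # First pass: load every post and count tag usage
    for name in os.listdir(md_folder):
        if not name.endswith('.md'):
            continue
        path = os.path.join(md_folder, name)
        post = frontmatter.load(path)
        posts[path] = post
        counts.update(post.metadata.get('tags', []))
    top_tags = {tag for tag, _ in counts.most_common(keep)}
    # Second pass: drop tags outside the top `keep`
    for path, post in posts.items():
        post.metadata['tags'] = [t for t in post.metadata.get('tags', []) if t in top_tags]
        with open(path, 'w', encoding='utf-8') as f:
            f.write(frontmatter.dumps(post))
```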
Choosing Static Site Hosting
The Hugo project itself is under 100MB (with 80MB being markdown files), making GitHub hosting feasible. Given GitHub Pages’ slower speeds, I opted for Vercel deployment. While Vercel’s 100GB bandwidth limit might seem restrictive, it should suffice for static content.
Selecting Image Hosting
I'm still evaluating options. I initially considered Cloudflare R2 but hesitated over the risk of exceeding the free-tier limits. For now, a budget $7/year “fake Alibaba Cloud” VPS serves as a temporary solution.