
Migrating Soomal.cc to Hugo

Earlier this year, after obtaining the source code for the Soomal.com website, I uploaded it to my VPS. However, the original site's outdated architecture made it inconvenient to manage and unfriendly to mobile devices, so I recently undertook a complete overhaul, converting and migrating the entire site to Hugo.

Migration Plan Design

I had long considered revamping Soomal.cc. I had previously run some preliminary tests but encountered numerous issues, which led me to shelve the project temporarily.

Challenges and Difficulties

  1. Large Volume of Articles

    Soomal contains 9,630 articles, with the earliest dating back to 2003, totaling 19 million words.

    The site also hosts 326,000 JPG images across more than 4,700 folders. Most images come in three sizes, though some are missing, resulting in a total size of nearly 70 GB.

  2. Complexity in Article Conversion

    The Soomal source code only includes HTML files for article pages. While these files might have been generated by the same program, preliminary tests revealed that the page structure had undergone multiple changes over time, with different tags used in various periods, making information extraction from the HTML files highly challenging.

    • Encoding Issues: The original HTML files use GB2312 encoding and were likely hosted on a Windows server, requiring special handling for character encoding and escape sequences during conversion.

    • Image Issues: The site contains a vast number of images, which are the essence of Soomal. However, these images use diverse tags and styles, making it difficult to extract links and descriptions without omissions.

    • Tags and Categories: The site has nearly 12,000 article tags and over 20 categories. However, the article HTML files contain no category information, which can only be found in the 2,000+ paginated category listing HTML files. The tags also present problems: some contain spaces or special characters, and some are duplicated within the same article.

    • Article Content: The HTML files include the main text, related articles, and tags, all nested inside a div with class Doc. Initially, I overlooked that the related-articles blocks use a lowercase doc class, which caused extraction errors during testing. It was only after noticing this discrepancy while browsing the site that I restarted the conversion project.

  3. Storage Solution Dilemma

    I initially hosted Soomal.cc on a VPS. Over a few months, despite low traffic, data usage soared to nearly 1.5TB. Although the VPS offers unlimited bandwidth, this was concerning. After migrating to Hugo, I found that most free hosting services impose restrictions: GitHub recommends repositories under 1GB, Cloudflare Pages limits a site to 20,000 files, Cloudflare R2's free tier caps storage at 10GB, and Vercel and Netlify both limit traffic to 100GB.


Conversion Methodology

Given the potential challenges in converting Soomal to Hugo, I devised a five-step migration plan.

Step 1: Convert HTML Files to Markdown

  1. Define Conversion Requirements

    • Extract Titles: Retrieve article titles from the <head> tag. For example, extract 谈谈手机产业链和手机厂商的相互影响 from <title>刘延作品 - 谈谈手机产业链和手机厂商的相互影响 [Soomal]</title>.
    • Extract Tags: Use keyword filtering to locate tags in the HTML, extract tag names, and enclose them in quotes to handle spaces in tag names.
    • Extract Main Text: Retrieve the article body from the div with class Doc and drop the nested lowercase doc blocks (related articles).
    • Extract Metadata: Gather publication dates, author information, and header images from the HTML.
    • Extract Images: Identify and extract all image references (e.g., smallpic, bigpic, smallpic2, wrappic).
    • Extract Special Content: Include subheadings, download links, tables, etc.
  2. File Conversion: With the requirements defined, I used Python scripts for the conversion; an example is shown below.

Conversion script example:
import os
import re
from bs4 import BeautifulSoup, Tag, NavigableString
from datetime import datetime

def convert_html_to_md(html_path, output_dir):
    try:
        # Read HTML files with GB2312 encoding
        with open(html_path, 'r', encoding='gb2312', errors='ignore') as f:
            html_content = f.read()
        
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # 1. Extract title
        title = extract_title(soup)
        
        # 2. Extract bookmark tags
        bookmarks = extract_bookmarks(soup)
        
        # 3. Extract title image and info
        title_img, info_content = extract_title_info(soup)
        
        # 4. Extract main content
        body_content = extract_body_content(soup)
        
        # Generate YAML frontmatter
        frontmatter = f"""---
title: "{title}"
date: {datetime.now().strftime('%Y-%m-%dT%H:%M:%S+08:00')}
tags: {bookmarks}
title_img: "{title_img}"
info: "{info_content}"
---\n\n"""
        
        # Generate Markdown content
        markdown_content = frontmatter + body_content
        
        # Save Markdown file
        output_path = os.path.join(output_dir, os.path.basename(html_path).replace('.htm', '.md'))
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(markdown_content)
            
        return f"Conversion successful: {os.path.basename(html_path)}"
    except Exception as e:
        return f"Conversion failed {os.path.basename(html_path)}: {str(e)}"

def extract_title(soup):
    """Extract title"""
    if soup.title:
        return soup.title.string.strip()
    return ""

def extract_bookmarks(soup):
    """Extract bookmark tags, each enclosed in quotes"""
    bookmarks = []
    bookmark_element = soup.find(string=re.compile(r'本文的相关书签:'))
    
    if bookmark_element:
        parent = bookmark_element.find_parent(['ul', 'li'])
        if parent:
            # Extract text from all <a> tags
            for a_tag in parent.find_all('a'):
                text = a_tag.get_text().strip()
                if text:
                    # Enclose each tag in quotes
                    bookmarks.append(f'"{text}"')
    
    return f"[{', '.join(bookmarks)}]" if bookmarks else "[]"

def extract_title_info(soup):
    """Extract title image and info content"""
    title_img = ""
    info_content = ""
    
    titlebox = soup.find('div', class_='titlebox')
    if titlebox:
        # Extract title image
        title_img_div = titlebox.find('div', class_='titleimg')
        if title_img_div and title_img_div.img:
            title_img = title_img_div.img['src']
        
        # Extract info content
        info_div = titlebox.find('div', class_='info')
        if info_div:
            # Remove all HTML tags, keeping only text
            info_content = info_div.get_text().strip()
    
    return title_img, info_content

def extract_body_content(soup):
    """Extract main content and process images"""
    body_content = ""
    doc_div = soup.find('div', class_='Doc')  # Note uppercase 'D'
    
    if doc_div:
        # Remove all nested div class="doc" (lowercase)
        for nested_doc in doc_div.find_all('div', class_='doc'):
            nested_doc.decompose()
        
        # Process images
        process_images(doc_div)
        
        # Iterate through all child elements to build Markdown content
        for element in doc_div.children:
            if isinstance(element, Tag):
                if element.name == 'div' and 'subpagetitle' in element.get('class', []):
                    # Convert to subheading
                    body_content += f"## {element.get_text().strip()}\n\n"
                else:
                    # Preserve other content
                    body_content += element.get_text().strip() + "\n\n"
            elif isinstance(element, NavigableString):
                body_content += element.strip() + "\n\n"
    
    return body_content.strip()

def process_images(container):
    """Process image content (Rules A/B/C)"""
    # A: Handle <li data-src> tags
    for li in container.find_all('li', attrs={'data-src': True}):
        img_url = li['data-src'].replace('..', 'https://soomal.cc', 1)
        caption_div = li.find('div', class_='caption')
        content_div = li.find('div', class_='content')
        
        alt_text = caption_div.get_text().strip() if caption_div else ""
        meta_text = content_div.get_text().strip() if content_div else ""
        
        # Create Markdown image syntax
        img_md = f"![{alt_text}]({img_url})\n\n{meta_text}\n\n"
        li.replace_with(img_md)
    
    # B: Process <span class="smallpic"> tags
    for span in container.find_all('span', class_='smallpic'):
        img = span.find('img')
        if img and 'src' in img.attrs:
            img_url = img['src'].replace('..', 'https://soomal.cc', 1)
            caption_div = span.find('div', class_='caption')
            content_div = span.find('div', class_='content')
            
            alt_text = caption_div.get_text().strip() if caption_div else ""
            meta_text = content_div.get_text().strip() if content_div else ""
            
            # Create Markdown image syntax
            img_md = f"![{alt_text}]({img_url})\n\n{meta_text}\n\n"
            span.replace_with(img_md)
            
    # C: Process <div class="bigpic"> tags
    for div in container.find_all('div', class_='bigpic'):
        img = div.find('img')
        if img and 'src' in img.attrs:
            img_url = img['src'].replace('..', 'https://soomal.cc', 1)
            caption_div = div.find('div', class_='caption')
            content_div = div.find('div', class_='content')
            
            alt_text = caption_div.get_text().strip() if caption_div else ""
            meta_text = content_div.get_text().strip() if content_div else ""
            
            # Create Markdown image syntax
            img_md = f"![{alt_text}]({img_url})\n\n{meta_text}\n\n"
            div.replace_with(img_md)

if __name__ == "__main__":
    input_dir = 'doc'
    output_dir = 'markdown_output'
    
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Process all HTML files
    for filename in os.listdir(input_dir):
        if filename.endswith('.htm'):
            html_path = os.path.join(input_dir, filename)
            result = convert_html_to_md(html_path, output_dir)
            print(result)

Step 2: Process Categories and Abstracts

Because the original HTML files contain no category information, the category listing pages had to be processed separately. Article abstracts were handled at the same time during this step.

  1. Extracting Category and Abstract Information

    A Python script extracts and formats the category and abstract information from the 2,000+ category pages.

Category extraction code:
import os
import re
from bs4 import BeautifulSoup
import codecs
from collections import defaultdict

def extract_category_info(folder_path):
    # Use defaultdict to automatically initialize nested dictionaries
    article_categories = defaultdict(set)  # Stores article ID to category mapping
    article_summaries = {}  # Stores article ID to abstract mapping
    
    # Iterate through all HTM files in the folder
    for filename in os.listdir(folder_path):
        if not filename.endswith('.htm'):
            continue
            
        file_path = os.path.join(folder_path, filename)
        
        try:
            # Read file with GB2312 encoding and convert to UTF-8
            with codecs.open(file_path, 'r', encoding='gb2312', errors='replace') as f:
                content = f.read()
                
            soup = BeautifulSoup(content, 'html.parser')
            
            # Extract category name
            title_tag = soup.title
            if title_tag:
                title_text = title_tag.get_text().strip()
                # Extract content before the first hyphen
                category_match = re.search(r'^([^-]+)', title_text)
                if category_match:
                    category_name = category_match.group(1).strip()
                    # Add quotes if category name contains spaces
                    if ' ' in category_name:
                        category_name = f'"{category_name}"'
                else:
                    category_name = "Unknown_Category"
            else:
                category_name = "Unknown_Category"
            
            # Extract article information
            for item in soup.find_all('div', class_='item'):
                # Extract article ID
                article_link = item.find('a', href=True)
                if article_link:
                    href = article_link['href']
                    article_id = re.search(r'../doc/(\d+)\.htm', href)
                    if article_id:
                        article_id = article_id.group(1)
                    else:
                        continue
                else:
                    continue
                
                # Extract article abstract
                synopsis_div = item.find('div', class_='synopsis')
                synopsis = synopsis_div.get_text().strip() if synopsis_div else ""
                
                # Store category information
                article_categories[article_id].add(category_name)
                
                # Store abstract (only once to avoid overwriting)
                if article_id not in article_summaries:
                    article_summaries[article_id] = synopsis
    
        except UnicodeDecodeError:
            # Attempt using GBK encoding as fallback
            try:
                with codecs.open(file_path, 'r', encoding='gbk', errors='replace') as f:
                    content = f.read()
                # Reprocess content...
                # Note: Repeated processing code omitted here; should be extracted as a function
                # For code completeness, we include the repeated logic
                soup = BeautifulSoup(content, 'html.parser')
                title_tag = soup.title
                if title_tag:
                    title_text = title_tag.get_text().strip()
                    category_match = re.search(r'^([^-]+)', title_text)
                    if category_match:
                        category_name = category_match.group(1).strip()
                        if ' ' in category_name:
                            category_name = f'"{category_name}"'
                    else:
                        category_name = "Unknown_Category"
                else:
                    category_name = "Unknown_Category"
                
                for item in soup.find_all('div', class_='item'):
                    article_link = item.find('a', href=True)
                    if article_link:
                        href = article_link['href']
                        article_id = re.search(r'../doc/(\d+)\.htm', href)
                        if article_id:
                            article_id = article_id.group(1)
                        else:
                            continue
                    else:
                        continue

                    synopsis_div = item.find('div', class_='synopsis')
                    synopsis = synopsis_div.get_text().strip() if synopsis_div else ""

                    article_categories[article_id].add(category_name)

                    if article_id not in article_summaries:
                        article_summaries[article_id] = synopsis

            except Exception as e:
                print(f"Error processing file {filename} (after trying GBK): {str(e)}")
                continue

        except Exception as e:
            print(f"Error processing file {filename}: {str(e)}")
            continue

    return article_categories, article_summaries

def save_to_markdown(article_categories, article_summaries, output_path):
    with open(output_path, 'w', encoding='utf-8') as md_file:
        # Write Markdown header
        md_file.write("# Article Categories and Summaries\n\n")
        md_file.write("> This file contains IDs, categories and summaries of all articles\n\n")
        
        # Sort by article ID
        sorted_article_ids = sorted(article_categories.keys(), key=lambda x: int(x))
        
        for article_id in sorted_article_ids:
            # Get sorted category list
            categories = sorted(article_categories[article_id])
            # Format as list string
            categories_str = ", ".join(categories)
            
            # Get summary
            summary = article_summaries.get(article_id, "No summary available")
            
            # Write Markdown content
            md_file.write(f"## Filename: {article_id}\n")
            md_file.write(f"**Categories**: {categories_str}\n")
            md_file.write(f"**Summary**: {summary}\n\n")
            md_file.write("---\n\n")

if __name__ == "__main__":
    # Configure input and output paths
    input_folder = 'Categories'  # Replace with your HTM folder path
    output_md = 'articles_categories.md'
    
    # Execute extraction
    article_categories, article_summaries = extract_category_info(input_folder)
    
    # Save results to Markdown file
    save_to_markdown(article_categories, article_summaries, output_md)
    
    # Print statistics
    print(f"Successfully processed data for {len(article_categories)} articles")
    print(f"Saved to {output_md}")
    print(f"Found {len(article_summaries)} articles with summaries")
  2. Writing Category and Summary Information to Markdown Files

    This step is relatively simple: the extracted category and summary data are written into the previously converted Markdown files one by one.

Writing script:
import io
import os
import re
import ruamel.yaml
from collections import defaultdict

def parse_articles_categories(md_file_path):
    """
    Parse articles_categories.md file to extract article IDs, categories and summaries
    """
    article_info = defaultdict(dict)
    current_id = None
    
    try:
        with open(md_file_path, 'r', encoding='utf-8') as f:
            for line in f:
                # Match filename
                filename_match = re.match(r'^## Filename: (\d+)$', line.strip())
                if filename_match:
                    current_id = filename_match.group(1)
                    continue
                
                # Match category information
                categories_match = re.match(r'^\*\*Categories\*\*: (.+)$', line.strip())
                if categories_match and current_id:
                    categories_str = categories_match.group(1)
                    # Clean category string, remove extra spaces and quotes
                    categories = [cat.strip().strip('"') for cat in categories_str.split(',')]
                    article_info[current_id]['categories'] = categories
                    continue
                
                # Match summary information
                summary_match = re.match(r'^\*\*Summary\*\*: (.+)$', line.strip())
                if summary_match and current_id:
                    summary = summary_match.group(1)
                    article_info[current_id]['summary'] = summary
                    continue
                
                # Reset current ID when encountering separator
                if line.startswith('---'):
                    current_id = None
    
    except Exception as e:
        print(f"Error parsing articles_categories.md file: {str(e)}")
    
    return article_info

def update_markdown_files(article_info, md_folder):
    """
    Update Markdown files by adding category and summary information to frontmatter
    """
    updated_count = 0
    skipped_count = 0
    
    # Initialize YAML parser
    yaml = ruamel.yaml.YAML()
    yaml.preserve_quotes = True
    yaml.width = 1000  # Prevent long summaries from line breaking
    
    for filename in os.listdir(md_folder):
        if not filename.endswith('.md'):
            continue
            
        article_id = filename[:-3]  # Remove .md extension
        file_path = os.path.join(md_folder, filename)
        
        # Check if information exists for this article
        if article_id not in article_info:
            skipped_count += 1
            continue
            
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            
            # Parse frontmatter
            frontmatter_match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
            if not frontmatter_match:
                print(f"No frontmatter found in file {filename}, skipping")
                skipped_count += 1
                continue
                
            frontmatter_content = frontmatter_match.group(1)
            
            # Convert frontmatter to dictionary
            data = yaml.load(frontmatter_content)
            if data is None:
                data = {}
            
            # Add category and summary information
            info = article_info[article_id]
            
            # Add categories
            if 'categories' in info:
                # If categories already exist, merge them (deduplicate)
                existing_categories = set(data.get('categories', []))
                new_categories = set(info['categories'])
                combined_categories = sorted(existing_categories.union(new_categories))
                data['categories'] = combined_categories
            
            # Add summary (if summary exists and is not empty)
            if 'summary' in info and info['summary']:
                # Only update if summary doesn't exist or new summary is not empty
                if 'summary' not in data or info['summary']:
                    data['summary'] = info['summary']
            
            # Regenerate frontmatter
            new_frontmatter = '---\n'
            with io.StringIO() as stream:
                yaml.dump(data, stream)
                new_frontmatter += stream.getvalue().strip()
            new_frontmatter += '\n---'

            # Replace original frontmatter
            new_content = content.replace(frontmatter_match.group(0), new_frontmatter)

            # Write to file
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(new_content)

            updated_count += 1

        except Exception as e:
            print(f"Error updating file {filename}: {str(e)}")
            skipped_count += 1

    return updated_count, skipped_count

if __name__ == "__main__":
    # Configure paths
    articles_md = 'articles_categories.md'  # Markdown file containing category and summary information
    md_folder = 'markdown_output'  # Folder containing Markdown articles
    
    # Parse articles_categories.md file
    print("Parsing articles_categories.md file...")
    article_info = parse_articles_categories(articles_md)
    print(f"Successfully parsed information for {len(article_info)} articles")
    
    # Update Markdown files
    print(f"\nUpdating category and summary information for {len(article_info)} articles...")
    updated, skipped = update_markdown_files(article_info, md_folder)
    
    # Print statistics
    print(f"\nProcessing complete!")
    print(f"Successfully updated: {updated} files")
    print(f"Skipped: {skipped} files")
    print(f"Articles with found information: {len(article_info)}")

Step 3: Convert Article Frontmatter

This step primarily involves correcting the frontmatter section of the output Markdown files to meet Hugo theme requirements.

  1. Revising Article Headers to Match the Frontmatter Specification: this mainly handles special characters, date formats, authors, featured images, tags, and categories. A sketch of the target frontmatter is shown below.
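
    For reference, this is roughly the shape of frontmatter the scripts aim to produce. The field names follow the scripts in this post (author and description come from the info line, cover from the title image, categories and summary from Step 2); the values below are made-up examples, except the title and author, which reuse the sample <title> quoted earlier.

```yaml
---
title: "谈谈手机产业链和手机厂商的相互影响"
date: 2014-01-01T00:00:00+08:00
author: ["刘延"]
description: "..."
summary: "..."
categories: ["Digital Devices", "Smartphones"]
tags: ["..."]
cover:
  image: "https://soomal.cc/images/doc/20090406/00000007.jpg"
  caption: ""
  alt: ""
  relative: false
---
```
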
Conversion code:
import os
import re
import frontmatter
import yaml
from datetime import datetime

def escape_special_characters(text):
    """Escape special characters in YAML"""
    # Escape backslashes while preserving already escaped characters
    return re.sub(r'(?<!\\)\\(?!["\\/bfnrt]|u[0-9a-fA-F]{4})', r'\\\\', text)

def process_md_files(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith(".md"):
            file_path = os.path.join(folder_path, filename)
            try:
                # Read file content
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                
                # Manually split frontmatter and content
                if content.startswith('---\n'):
                    parts = content.split('---\n', 2)
                    if len(parts) >= 3:
                        fm_text = parts[1]
                        body_content = parts[2] if len(parts) > 2 else ""
                        
                        # Escape special characters
                        fm_text = escape_special_characters(fm_text)
                        
                        # Recombine content
                        new_content = f"---\n{fm_text}---\n{body_content}"
                        
                        # Parse frontmatter using safe loading
                        post = frontmatter.loads(new_content)
                        
                        # Process info field
                        if 'info' in post.metadata:
                            info = post.metadata['info']
                            
                            # Extract date
                            date_match = re.search(r'On (\d{4}\.\d{1,2}\.\d{1,2} \d{1,2}:\d{2}:\d{2})', info)
                            if date_match:
                                date_str = date_match.group(1)
                                try:
                                    dt = datetime.strptime(date_str, "%Y.%m.%d %H:%M:%S")
                                    post.metadata['date'] = dt.strftime("%Y-%m-%dT%H:%M:%S+08:00")
                                except ValueError:
                                    # Keep original date as fallback
                                    pass
                            
                            # Extract author
                            author_match = re.match(r'^(.+?)作品', info)
                            if author_match:
                                authors = author_match.group(1).strip()
                                # Split multiple authors
                                author_list = [a.strip() for a in re.split(r'\s+', authors) if a.strip()]
                                post.metadata['author'] = author_list
                            
                            # Create description
                            desc_parts = info.split('|', 1)
                            if len(desc_parts) > 1:
                                post.metadata['description'] = desc_parts[1].strip()
                            
                            # Remove original info
                            del post.metadata['info']
                        
                        # Process title_img
                        if 'title_img' in post.metadata:
                            img_url = post.metadata['title_img'].replace("../", "https://soomal.cc/")
                            # Handle potential double slashes
                            img_url = re.sub(r'(?<!:)/{2,}', '/', img_url)
                            post.metadata['cover'] = {
                                'image': img_url,
                                'caption': "",
                                'alt': "",
                                'relative': False
                            }
                            del post.metadata['title_img']
                        
                        # Modify title
                        if 'title' in post.metadata:
                            title = post.metadata['title']
                            # Remove content before "-"
                            if '-' in title:
                                new_title = title.split('-', 1)[1].strip()
                                post.metadata['title'] = new_title
                        
                        # Save modified file
                        with open(file_path, 'w', encoding='utf-8') as f_out:
                            f_out.write(frontmatter.dumps(post))
            except Exception as e:
                print(f"Error processing file {filename}: {str(e)}")
                # Log error files for later review
                with open("processing_errors.log", "a", encoding="utf-8") as log:
                    log.write(f"Error in {filename}: {str(e)}\n")

if __name__ == "__main__":
    folder_path = "markdown_output"  # Replace with your actual path
    process_md_files(folder_path)
    print("Frontmatter processing completed for all Markdown files!")
  2. Streamlining Tags and Categories: Soomal.com originally had over 20 article categories, some of which were meaningless (e.g., the “All Articles” category), and there was significant overlap between categories and tags. To keep categories and tags unique, and to reduce the number of files generated during the final site build, I simplified them further.
Code for streamlining tags and categories:
import os
import yaml
import frontmatter

def clean_hugo_tags_categories(folder_path):
    """
    Clean up tags and categories in Hugo articles:
    1. Remove "All Articles" from categories
    2. Remove tags that duplicate categories
    """
    # Valid categories list ("All Articles" removed)
    valid_categories = [
        "Digital Devices", "Audio", "Music", "Mobile Digital", "Reviews", "Introductions", 
        "Evaluation Reports", "Galleries", "Smartphones", "Android", "Headphones", 
        "Musicians", "Imaging", "Digital Terminals", "Speakers", "iOS", "Cameras", 
        "Sound Cards", "Album Reviews", "Tablets", "Technology", "Applications", 
        "Portable Players", "Windows", "Digital Accessories", "Essays", "DACs", 
        "Audio Systems", "Lenses", "Musical Instruments", "Audio Codecs"
    ]
    
    # Process all Markdown files in the folder
    for filename in os.listdir(folder_path):
        if not filename.endswith('.md'):
            continue
            
        filepath = os.path.join(folder_path, filename)
        with open(filepath, 'r', encoding='utf-8') as f:
            post = frontmatter.load(f)
            
            # 1. Clean categories (remove invalid entries and deduplicate)
            if 'categories' in post.metadata:
                # Convert to set for deduplication + filter invalid categories
                categories = list(set(post.metadata['categories']))
                cleaned_categories = [
                    cat for cat in categories 
                    if cat in valid_categories
                ]
                post.metadata['categories'] = cleaned_categories
            
            # 2. Clean tags (remove duplicates with categories)
            if 'tags' in post.metadata:
                current_cats = post.metadata.get('categories', [])
                # Convert to set for deduplication + filter category duplicates
                tags = list(set(post.metadata['tags']))
                cleaned_tags = [
                    tag for tag in tags 
                    if tag not in current_cats
                ]
                post.metadata['tags'] = cleaned_tags
                
            # Save modified file
            with open(filepath, 'w', encoding='utf-8') as f_out:
                f_out.write(frontmatter.dumps(post))

if __name__ == "__main__":
    # Example usage (modify with your actual path)
    md_folder = "./markdown_output"
    clean_hugo_tags_categories(md_folder)
    print(f"Processing completed: {len(os.listdir(md_folder))} files")

Step 4: Reduce Image Count

During the HTML-to-Markdown conversion, only the article content was extracted, so many of the cropped image variants on the original site were no longer needed. I therefore matched the converted Markdown files against the original site's images to keep only those required by the new site.

This step reduced the total number of images from 326,000 to 118,000.

  1. Extracting Image Links: Extract all image links from the Markdown files. Since the image links were standardized during conversion, this step was straightforward.
Extraction code:
import os
import re
import argparse

def extract_image_links(directory):
    """Extract image links from all md files in directory"""
    image_links = set()
    pattern = re.compile(r'https://soomal\.cc[^\s\)\]\}]*?\.jpg', re.IGNORECASE)
    
    for root, _, files in os.walk(directory):
        for filename in files:
            if filename.endswith('.md'):
                filepath = os.path.join(root, filename)
                try:
                    with open(filepath, 'r', encoding='utf-8') as f:
                        content = f.read()
                        matches = pattern.findall(content)
                        if matches:
                            image_links.update(matches)
                except Exception as e:
                    print(f"Error processing {filepath}: {str(e)}")
    
    return sorted(image_links)

def save_links_to_file(links, output_file):
    """Save links to file"""
    with open(output_file, 'w', encoding='utf-8') as f:
        for link in links:
            f.write(link + '\n')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Extract image links from Markdown')
    parser.add_argument('--input', default='markdown_output', help='Path to Markdown directory')
    parser.add_argument('--output', default='image_links.txt', help='Output file path')
    args = parser.parse_args()

    print(f"Scanning directory: {args.input}")
    links = extract_image_links(args.input)
    
    print(f"Found {len(links)} unique image links")
    save_links_to_file(links, args.output)
    print(f"Links saved to: {args.output}")
  2. Copying Corresponding Images: Use the extracted image links to locate and copy the corresponding files from the original site's directory, preserving the directory structure.
A. Windows copy script:
import os
import shutil
import time
import sys

def main():
    # Configuration
    source_drive = "F:\\"
    target_drive = "D:\\"
    image_list_file = r"D:\trans-soomal\image_links.txt"
    log_file = r"D:\trans-soomal\image_copy_log.txt"
    error_log_file = r"D:\trans-soomal\image_copy_errors.txt"
    
    print("Image copy script starting...")
    
    # Record start time
    start_time = time.time()
    
    # Create log files
    with open(log_file, "w", encoding="utf-8") as log, open(error_log_file, "w", encoding="utf-8") as err_log:
        log.write(f"Image Copy Log - Start Time: {time.ctime(start_time)}\n")
        err_log.write("Failed copies:\n")
        
        try:
            # Read image list
            with open(image_list_file, "r", encoding="utf-8") as f:
                image_paths = [line.strip() for line in f if line.strip()]
            
            total_files = len(image_paths)
            success_count = 0
            fail_count = 0
            skipped_count = 0
            
            print(f"Found {total_files} images to copy")
            
            # Process each file
            for i, relative_path in enumerate(image_paths):
                # Display progress
                progress = (i + 1) / total_files * 100
                sys.stdout.write(f"\rProgress: {progress:.2f}% ({i+1}/{total_files})")
                sys.stdout.flush()
                
                # Build full paths
                source_path = os.path.join(source_drive, relative_path)
                target_path = os.path.join(target_drive, relative_path)
                
                try:
                    # Check if source exists
                    if not os.path.exists(source_path):
                        err_log.write(f"Source missing: {source_path}\n")
                        fail_count += 1
                        continue
                    
                    # Check if target already exists
                    if os.path.exists(target_path):
                        log.write(f"File already exists, skipping: {target_path}\n")
                        skipped_count += 1
                        continue
                    
                    # Create target directory
                    target_dir = os.path.dirname(target_path)
                    os.makedirs(target_dir, exist_ok=True)
                    
                    # Copy file
                    shutil.copy2(source_path, target_path)
                    
                    # Log success
                    log.write(f"[SUCCESS] Copied {source_path} to {target_path}\n")
                    success_count += 1
                    
                except Exception as e:
                    # Log failure
                    err_log.write(f"[FAILED] {source_path} -> {target_path} : {str(e)}\n")
                    fail_count += 1
            
            # Calculate elapsed time
            end_time = time.time()
            elapsed_time = end_time - start_time
            minutes, seconds = divmod(elapsed_time, 60)
            hours, minutes = divmod(minutes, 60)
            
            # Write summary
            summary = f"""
================================
Copy operation completed
Start time: {time.ctime(start_time)}
End time: {time.ctime(end_time)}
Total duration: {int(hours)}h {int(minutes)}m {seconds:.2f}s

Total files: {total_files}
Successfully copied: {success_count}
Skipped (existing): {skipped_count}
Failed: {fail_count}
================================
"""
            log.write(summary)
            print(summary)
            
        except Exception as e:
            print(f"\nError occurred: {str(e)}")
            err_log.write(f"Script error: {str(e)}\n")

if __name__ == "__main__":
    main()
B. Linux copy script:
#!/bin/bash

# Configuration parameters
LINK_FILE="/user/image_links.txt"  # Replace with actual link file path
SOURCE_BASE="/user/soomal.cc/index"
DEST_BASE="/user/images.soomal.cc/index"
LOG_FILE="/var/log/image_copy_$(date +%Y%m%d_%H%M%S).log"
THREADS=3  # Number of parallel copy jobs (adjust to the available CPU cores)

# Start logging
{
echo "===== Copy Task Started: $(date) ====="
echo "Source base directory: $SOURCE_BASE"
echo "Destination base directory: $DEST_BASE"
echo "Link file: $LINK_FILE"
echo "Thread count: $THREADS"

# Path validation example
echo -e "\n=== Path Validation ==="
sample_url="https://soomal.cc/images/doc/20090406/00000007.jpg"
expected_src="${SOURCE_BASE}/images/doc/20090406/00000007.jpg"
expected_dest="${DEST_BASE}/images/doc/20090406/00000007.jpg"

echo "Example URL: $sample_url"
echo "Expected source path: $expected_src"
echo "Expected destination path: $expected_dest"

if [[ -f "$expected_src" ]]; then
    echo "Validation successful: Example source file exists"
else
    echo "Validation failed: Example source file missing! Please check paths"
    exit 1
fi

# Create destination base directory
mkdir -p "${DEST_BASE}/images"

# Prepare parallel processing
echo -e "\n=== Processing Started ==="
total=$(wc -l < "$LINK_FILE")
echo "Total links: $total"
counter=0

# Processing function
process_link() {
    local url="$1"
    local rel_path="${url#https://soomal.cc}"
    
    # Build full paths
    local src_path="${SOURCE_BASE}${rel_path}"
    local dest_path="${DEST_BASE}${rel_path}"
    
    # Create destination directory
    mkdir -p "$(dirname "$dest_path")"
    
    # Copy file
    if [[ -f "$src_path" ]]; then
        if cp -f "$src_path" "$dest_path"; then
            echo "SUCCESS: $rel_path"
            return 0
        else
            echo "COPY FAILED: $rel_path"
            return 2
        fi
    else
        echo "MISSING: $rel_path"
        return 1
    fi
}

# Export function for parallel use
export -f process_link
export SOURCE_BASE DEST_BASE

# Use parallel for concurrent processing
echo "Starting parallel copying..."
parallel --bar --jobs $THREADS --progress \
         --halt soon,fail=1 \
         --joblog "${LOG_FILE}.jobs" \
         --tagstring "{}" \
         "process_link {}" < "$LINK_FILE" | tee -a "$LOG_FILE"

# Collect results
success=$(grep -c 'SUCCESS:' "$LOG_FILE")
missing=$(grep -c 'MISSING:' "$LOG_FILE")
failed=$(grep -c 'COPY FAILED:' "$LOG_FILE")

# Final statistics
echo -e "\n===== Copy Task Completed: $(date) ====="
echo "Total links: $total"
echo "Successfully copied: $success"
echo "Missing files: $missing"
echo "Copy failures: $failed"
echo "Success rate: $((success * 100 / total))%"

} | tee "$LOG_FILE"

# Save missing files list
grep '^MISSING:' "$LOG_FILE" | cut -d' ' -f2- > "${LOG_FILE%.log}_missing.txt"
echo "Missing files list: ${LOG_FILE%.log}_missing.txt"

Step 5: Compress Image Sizes

I had previously compressed the website's source images once, but it wasn't enough. My goal is to reduce the total image size to under 10 GB, in case I later migrate them to Cloudflare R2.

  1. Converting JPG to WebP: When I compressed the images previously, I kept them in JPG format to avoid breaking links in the site's numerous HTML files. Since this migration targets Hugo, there is no need to keep the JPG format, so I converted the images directly to WebP. In addition, since my page width is set to 960px and I am not using a lightbox plugin, resizing the images to 960px reduces the size further.

Actual tests showed that after this compression, the image size dropped to 7.7GB. However, I noticed a minor issue with the image processing logic. Soomal has many vertical images as well as horizontal ones, and 960px width appears somewhat small on 4K displays. I ultimately converted the images with the short edge set to a maximum of 1280px at 85% quality, resulting in a size of about 14GB, which fits within my VPS’s 20GB storage. I also tested with a short edge of 1150px at 80% quality, which met the 10GB requirement.
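
For the short-edge variant, I believe only the resize geometry in the script below needs to change; a minimal sketch, assuming ImageMagick's `^` (fill) and `>` (shrink-only) geometry flags combine as documented:

```python
# Hypothetical variant of the cmd list used in the script below:
# cap the SHORT edge at 1280px (shrink only, never enlarge) at quality 85.
cmd = [
    magick_path,
    str(img_path),
    "-resize", "1280x1280^>",          # ^ fills 1280x1280 (short edge -> 1280), > only shrinks larger images
    "-quality", "85",
    "-define", "webp:lossless=false",
    str(temp_path)
]
```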

Image conversion code:
import os
import subprocess
import time
import sys
import shutil
from pathlib import Path

def main():
    # Record start time for the final duration report
    start_time = time.time()

    # Configure paths
    source_dir = Path("D:\\images")  # Original image directory
    output_dir = Path("D:\\images_webp")  # WebP output directory
    temp_dir = Path("D:\\temp_webp")  # Temporary processing directory
    magick_path = "C:\\webp\\magick.exe"  # ImageMagick path
    
    # Create necessary directories
    output_dir.mkdir(parents=True, exist_ok=True)
    temp_dir.mkdir(parents=True, exist_ok=True)
    
    # Log files
    log_file = output_dir / "conversion_log.txt"
    stats_file = output_dir / "conversion_stats.csv"
    
    print("Image conversion script starting...")
    print(f"Source directory: {source_dir}")
    print(f"Output directory: {output_dir}")
    print(f"Temporary directory: {temp_dir}")
    
    # Initialize log
    with open(log_file, "w", encoding="utf-8") as log:
        log.write(f"Image conversion log - Start time: {time.ctime()}\n")
    
    # Initialize stats file
    with open(stats_file, "w", encoding="utf-8") as stats:
        stats.write("Original File,Converted File,Original Size (KB),Converted Size (KB),Space Saved (KB),Savings Percentage\n")
    
    # Collect all image files
    image_exts = ('.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.gif')
    all_images = []
    for root, _, files in os.walk(source_dir):
        for file in files:
            if file.lower().endswith(image_exts):
                all_images.append(Path(root) / file)
    
    total_files = len(all_images)
    converted_files = 0
    skipped_files = 0
    error_files = 0
    
    print(f"Found {total_files} image files to process")
    
    # Process each image
    for idx, img_path in enumerate(all_images):
        try:
            # Progress display
            progress = (idx + 1) / total_files * 100  
            sys.stdout.write(f"\rProgress: {progress:.2f}% ({idx+1}/{total_files})")  
            sys.stdout.flush()  
              
            # Create relative path structure  
            rel_path = img_path.relative_to(source_dir)  
            webp_path = output_dir / rel_path.with_suffix('.webp')  
            webp_path.parent.mkdir(parents=True, exist_ok=True)  
              
            # Check if file already exists  
            if webp_path.exists():  
                skipped_files += 1  
                continue  
              
            # Create temporary file path  
            temp_path = temp_dir / f"{img_path.stem}_temp.webp"  
              
            # Get original file size  
            orig_size = img_path.stat().st_size / 1024  # KB  
              
            # Convert and resize using ImageMagick  
            cmd = [  
                magick_path,  
                str(img_path),  
                "-resize", "960>",   # Resize only if width exceeds 960px  
                "-quality", "85",    # Initial quality 85  
                "-define", "webp:lossless=false",  
                str(temp_path)  
            ]  
              
            # Execute command  
            result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)  
              
            if result.returncode != 0:  
                # Log conversion failure  
                with open(log_file, "a", encoding="utf-8") as log:  
                    log.write(f"[ERROR] Failed to convert {img_path}: {result.stderr}\n")  
                error_files += 1  
                continue  
              
            # Move temporary file to target location  
            shutil.move(str(temp_path), str(webp_path))  
              
            # Get converted file size  
            new_size = webp_path.stat().st_size / 1024  # KB  
              
            # Calculate space savings  
            saved = orig_size - new_size  
            saved_percent = (saved / orig_size) * 100 if orig_size > 0 else 0  
              
            # Record statistics  
            with open(stats_file, "a", encoding="utf-8") as stats:  
                stats.write(f"{img_path},{webp_path},{orig_size:.2f},{new_size:.2f},{saved:.2f},{saved_percent:.2f}\n")  
              
            converted_files += 1  
          
        except Exception as e:  
            with open(log_file, "a", encoding="utf-8") as log:  
                log.write(f"[EXCEPTION] Error processing {img_path}: {str(e)}\n")  
            error_files += 1  
      
    # Completion report  
    total_size = sum(f.stat().st_size for f in output_dir.glob('**/*') if f.is_file())  
    total_size_gb = total_size / (1024 ** 3)  # Convert to GB  
      
    end_time = time.time()  
    elapsed = end_time - start_time
    mins, secs = divmod(elapsed, 60)  
    hours, mins = divmod(mins, 60)  
      
    with open(log_file, "a", encoding="utf-8") as log:  
        log.write("\nConversion Report:\n")  
        log.write(f"Total files: {total_files}\n")  
        log.write(f"Successfully converted: {converted_files}\n")  
        log.write(f"Skipped files: {skipped_files}\n")  
        log.write(f"Error files: {error_files}\n")  
        log.write(f"Output directory size: {total_size_gb:.2f} GB\n")  
      
    print("\n\nConversion completed!")  
    print(f"Total files: {total_files}")  
    print(f"Successfully converted: {converted_files}")  
    print(f"Skipped files: {skipped_files}")  
    print(f"Error files: {error_files}")  
    print(f"Output directory size: {total_size_gb:.2f} GB")  
      
    # Clean up temporary directory  
    try:  
        shutil.rmtree(temp_dir)  
        print(f"Cleaned temporary directory: {temp_dir}")  
    except Exception as e:  
        print(f"Error cleaning temporary directory: {str(e)}")  
      
    print(f"Log file: {log_file}")  
    print(f"Statistics file: {stats_file}")  
    print(f"Total time elapsed: {int(hours)} hours {int(mins)} minutes {secs:.2f} seconds")  

if __name__ == "__main__":  
    main()  
  2. Further Image Compression
    I originally designed this step to compress the images further if the first conversion did not bring the total size below 10GB. The first pass already solved the problem, so this step turned out to be unnecessary. Nevertheless, I tested further compression by converting the images to WebP with the short edge capped at 1280px and 60% quality, which resulted in a total size of only about 9GB.
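
    The script below re-encodes the existing WebP files at quality 75; for the 60%-quality test mentioned above, I assume only the quality argument changes (the resizing had already been applied in the previous step), roughly:

```python
# Hypothetical cwebp arguments for the 60%-quality test
cmd = [
    cwebp_path,
    "-q", "60",        # lower quality target
    "-m", "6",         # maximum compression effort
    str(webp_path),
    "-o", str(temp_path)
]
```
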
Secondary compression code:
import os  
import subprocess  
import time  
import sys  
import shutil  
from pathlib import Path  

def main():
    # Record start time for the final duration report
    start_time = time.time()

    # Configure paths
    webp_dir = Path("D:\\images_webp")  # WebP directory  
    temp_dir = Path("D:\\temp_compress")  # Temporary directory  
    cwebp_path = "C:\\Windows\\System32\\cwebp.exe"  # cwebp path  
      
    # Create temporary directory  
    temp_dir.mkdir(parents=True, exist_ok=True)  
      
    # Log files  
    log_file = webp_dir / "compression_log.txt"  
    stats_file = webp_dir / "compression_stats.csv"  
      
    print("WebP compression script starting...")  
    print(f"Processing directory: {webp_dir}")  
    print(f"Temporary directory: {temp_dir}")  
      
    # Initialize log  
    with open(log_file, "w", encoding="utf-8") as log:  
        log.write(f"WebP Compression Log - Start time: {time.ctime()}\n")  
      
    # Initialize statistics file  
    with open(stats_file, "w", encoding="utf-8") as stats:  
        stats.write("Original File,Compressed File,Original Size (KB),New Size (KB),Space Saved (KB),Savings Percentage\n")  
      
    # Collect all WebP files  
    all_webp = list(webp_dir.glob('**/*.webp'))  
    total_files = len(all_webp)  
      
    if total_files == 0:  
        print("No WebP files found. Please run the conversion script first.")  
        return  
      
    print(f"Found {total_files} WebP files to compress")  
      
    compressed_count = 0  
    skipped_count = 0  
    error_count = 0  
      
    # Process each WebP file  
    for idx, webp_path in enumerate(all_webp):  
        try:  
            # Display progress  
            progress = (idx + 1) / total_files * 100  
            sys.stdout.write(f"\rProgress: {progress:.2f}% ({idx+1}/{total_files})")  
            sys.stdout.flush()  
              
            # Original size  
            orig_size = webp_path.stat().st_size / 1024  # KB  
              
            # Create temporary file path  
            temp_path = temp_dir / f"{webp_path.stem}_compressed.webp"  
              
            # Perform secondary compression using cwebp  
            cmd = [  
                cwebp_path,  
                "-q", "75",  # Quality parameter  
                "-m", "6",   # Maximum compression mode  
                str(webp_path),  
                "-o", str(temp_path)  
            ]  
              
            # Execute command  
            result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)  
              
            if result.returncode != 0:  
                # Log compression failure  
                with open(log_file, "a", encoding="utf-8") as log:  
                    log.write(f"[ERROR] Failed to compress {webp_path}: {result.stderr}\n")  
                error_count += 1  
                continue  
              
            # Get new file size  
            new_size = temp_path.stat().st_size / 1024  # KB

            # Skip if the new file is larger than the original
            if new_size >= orig_size:
                skipped_count += 1
                temp_path.unlink()  # Delete temporary file
                continue

            # Calculate space savings
            saved = orig_size - new_size
            saved_percent = (saved / orig_size) * 100 if orig_size > 0 else 0

            # Record statistics
            with open(stats_file, "a", encoding="utf-8") as stats:
                stats.write(f"{webp_path},{webp_path},{orig_size:.2f},{new_size:.2f},{saved:.2f},{saved_percent:.2f}\n")

            # Replace original file
            webp_path.unlink()  # Delete original file
            shutil.move(str(temp_path), str(webp_path))
            compressed_count += 1

        except Exception as e:
            with open(log_file, "a", encoding="utf-8") as log:
                log.write(f"[Error] Processing {webp_path} failed: {str(e)}\n")
            error_count += 1

    # Completion report
    total_size = sum(f.stat().st_size for f in webp_dir.glob('**/*') if f.is_file())
    total_size_gb = total_size / (1024 ** 3)  # Convert to GB

    end_time = time.time()
    elapsed = end_time - start_time
    mins, secs = divmod(elapsed, 60)
    hours, mins = divmod(mins, 60)

    with open(log_file, "a", encoding="utf-8") as log:
        log.write("\nCompression Report:\n")
        log.write(f"Files processed: {total_files}\n")
        log.write(f"Successfully compressed: {compressed_count}\n")
        log.write(f"Skipped files: {skipped_count}\n")
        log.write(f"Error files: {error_count}\n")
        log.write(f"Total output directory size: {total_size_gb:.2f} GB\n")

    print("\n\nCompression completed!")
    print(f"Files processed: {total_files}")
    print(f"Successfully compressed: {compressed_count}")
    print(f"Skipped files: {skipped_count}")
    print(f"Error files: {error_count}")
    print(f"Total output directory size: {total_size_gb:.2f} GB")

    # Clean temporary directory
    try:
        shutil.rmtree(temp_dir)
        print(f"Cleaned temporary directory: {temp_dir}")
    except Exception as e:
        print(f"Error cleaning temporary directory: {str(e)}")

    print(f"Log file: {log_file}")
    print(f"Stats file: {stats_file}")
    print(f"Total duration: {int(hours)}h {int(mins)}m {secs:.2f}s")

if __name__ == "__main__":
    main()

Implementation Plan

Selecting the Right Hugo Theme

For a Hugo project with tens of thousands of markdown files, choosing a theme can be quite challenging.

I tested one visually appealing theme that still had not finished building after more than three hours. Some themes threw constant errors during generation, while others produced over 200,000 files.

Ultimately, I settled on the most stable option: the PaperMod theme. By default, this theme generates only about 100 files, and the final website contains fewer than 50,000 files, which is relatively efficient.

Although that still exceeds Cloudflare Pages' 20,000-file limit, it is lean enough. The build took 6.5 minutes on GitHub Pages and 8 minutes on Vercel.

However, some issues emerged during the build:

  • Search functionality: Due to the massive article volume, the default index file reached 80MB, rendering it practically unusable. I had to limit indexing to only article titles and summaries.
  • Sitemap generation: The default 4MB sitemap consistently failed to load in Google Search Console, though Bing Webmaster Tools handled it without issues.
  • Pagination: With 12,000 tags and 20 articles per page, pagination alone would generate 60,000 files. Even after increasing the page size to 200 articles, there were still 37,000 files (all other files totaled only about 12,000); see the config sketch after this list.
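
A rough config sketch for the search and pagination tweaks, assuming PaperMod's documented fuseOpts parameter and Hugo's paginate option (shrinking the 80MB index file itself additionally requires overriding the theme's index.json output template so that it omits the full article content):

```yaml
# hugo.yaml (sketch)
paginate: 200                        # more articles per list page -> fewer generated files
outputs:
  home: ["HTML", "RSS", "JSON"]      # the JSON output feeds PaperMod's Fuse.js search
params:
  fuseOpts:
    keys: ["title", "summary"]       # search only titles and summaries
```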

The tag issue presents an optimization opportunity: only displaying the top 1,000 most-used tags while incorporating others into article titles. This could potentially reduce the file count below 20,000, meeting Cloudflare Pages’ requirements.

Choosing Static Site Hosting

The Hugo project itself is under 100MB (with 80MB being markdown files), making GitHub hosting feasible. Given GitHub Pages’ slower speeds, I opted for Vercel deployment. While Vercel’s 100GB bandwidth limit might seem restrictive, it should suffice for static content.

Selecting Image Hosting

I am still evaluating options. I initially considered Cloudflare R2 but hesitated over the risk of exceeding its free tier limits. For now, I am using a budget $7/year “fake Alibaba Cloud” VPS as a temporary solution.
