Issues Arising from Missing System Language Encoding and Related Thoughts

In the afternoon, a classmate called me, saying that his computer had a serious problem.

He described the symptoms as follows:

When opening Baidu, Sina, NetEase, Tencent, and other pages in the browser, all display garbled characters;
Under the browser’s View -> Encoding options, only “Left to Right Document” and “Right to Left Document” are available, with various encodings like UTF-8 and GBK missing;
After opening QQ, only two input boxes and selection boxes are visible, and no text is displayed on the entire QQ interface after logging in;
The omnipotent Firefox works normally, with no garbled characters on any websites.
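The garbled pages are what you would expect when bytes written in one encoding are decoded with another. Here is a minimal Python sketch of that effect. It assumes the page was served as GBK and that the browser fell back to a Western code page (latin-1 here); both are assumptions for illustration, not something I verified on that machine.

```python
# Hypothetical page text; many Chinese portals of that era were served as GB2312/GBK.
page_text = "百度一下，你就知道"
raw_bytes = page_text.encode("gbk")

# With no Chinese decoder available, the bytes get read one by one as Western characters.
garbled = raw_bytes.decode("latin-1")

# Decoding with the matching encoding recovers the original text.
recovered = raw_bytes.decode("gbk")

print(garbled)
print(recovered)
```

The bytes themselves are never damaged; only the interpretation is wrong, which is why a browser that ships its own decoders (like Firefox) was unaffected.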

Upon receiving the computer, my first thought was that the encoding files were missing. Since I had no prior experience with such issues, I first checked the language tools in the Control Panel to see if there were any related options or switches, but found nothing.

My classmate told me that the problem appeared after he used 360 to uninstall a bunch of miscellaneous junk files.

After searching online, I learned that encoding support is provided by certain DLLs. I wondered whether 360 had unregistered some of them during the uninstallation.

So I tried re-registering every DLL by opening CMD and running: for %1 in (%windir%\system32\*.dll) do regsvr32.exe /s %1
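For anyone who would rather review the commands before running them, the same loop can be sketched in Python. The function name regsvr32_commands is my own invention for illustration; it only builds the command lines and does not execute anything.

```python
from pathlib import Path


def regsvr32_commands(directory):
    """Build one silent regsvr32 command (as an argument list) per DLL in directory."""
    dlls = sorted(Path(directory).glob("*.dll"))
    # /s is the same silent switch used in the CMD one-liner above.
    return [["regsvr32.exe", "/s", str(dll)] for dll in dlls]
```

On Windows, each returned list could be passed to subprocess.run from an administrator prompt, for example with regsvr32_commands(r"C:\Windows\system32").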

However, the problem persisted.

Continuing my online search, I discovered that on XP a single DLL, mlang.dll, is responsible for encoding.

I then went to zhaodll.com, searched for and downloaded it, and replaced the file of the same name in the system32 directory.

Some systems lock these files down tightly, but fortunately XP is not as restrictive as Windows 7 (where many files in system32 have their permissions assigned to TrustedInstaller), so I simply renamed the original file first and then copied in the replacement.

After replacement, I had to re-register the DLL file by running: regsvr32 mlang.dll.
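The rename-then-replace step can be sketched in Python as follows. This is only an illustration of the idea of keeping the original file around; replace_with_backup and the ".bak" suffix are hypothetical names of mine, and I did the actual replacement by hand in Explorer.

```python
import shutil
from pathlib import Path


def replace_with_backup(target, replacement, suffix=".bak"):
    """Rename target to target+suffix, then copy replacement into target's place."""
    target = Path(target)
    backup = target.with_name(target.name + suffix)
    target.rename(backup)              # keep the original so the change can be undone
    shutil.copy2(replacement, target)  # put the new file under the original name
    return backup
```

If the new file turns out to be wrong, renaming the backup back to the original name restores the previous state.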

Summary:

Encoding errors are a common issue. Website designers should keep this in mind when choosing an encoding for a site. Choosing Unicode (UTF-8) is more rational and aligns with international trends.
UTF-8 is an international encoding with good universality, allowing foreigners or users with issues like the one above to browse websites.
GBK is a national encoding with less universality than UTF-8, but UTF-8 occupies more database space than GBK for Chinese text: a Chinese character takes 3 bytes in UTF-8 versus 2 bytes in GBK.
For platforms like Discuz that heavily rely on GBK encoding (puzzling, as most plugins are GBK), using GBK is more rational. However, for platforms like WordPress, UTF-8 is undoubtedly the most suitable.
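The storage difference mentioned in the summary is easy to check. A small Python sketch, using a sample sentence of my own choosing:

```python
text = "编码问题很常见"  # 7 Chinese characters
gbk_size = len(text.encode("gbk"))
utf8_size = len(text.encode("utf-8"))
print(gbk_size, utf8_size)

# Plain ASCII text costs the same in both encodings.
ascii_text = "encoding"
ascii_gbk_size = len(ascii_text.encode("gbk"))
ascii_utf8_size = len(ascii_text.encode("utf-8"))
print(ascii_gbk_size, ascii_utf8_size)
```

So the extra space only matters for content that is mostly Chinese; markup and English text are unaffected.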

Trivia:

1. GB2312 contains 6763 Chinese characters.

In 1980, the national standard exchange code “Information Exchange Chinese Character Encoding Character Set–Basic Set” was promulgated, with the national standard number: GB2312-80. It selected 6763 Chinese characters, divided into two levels: 3755 in the first level (common characters) and 3008 in the second level (less common characters). It also included 682 non-Chinese characters: numbers, general symbols, Latin letters, Japanese kana, Greek letters, Russian letters, phonetic symbols, and phonetic letters.
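The two levels can be counted with Python's built-in gb2312 codec. This sketch relies on the standard's row/cell layout as I understand it (level one in rows 16 to 55, level two in rows 56 to 87, 94 cells per row) and on the codec rejecting unassigned cells; count_hanzi is a helper name of my own.

```python
def count_hanzi(first_row, last_row):
    """Count the assigned cells in the given GB2312 rows (inclusive)."""
    count = 0
    for row in range(first_row, last_row + 1):
        for cell in range(1, 95):
            try:
                # The EUC-CN byte pair for (row, cell) is (0xA0 + row, 0xA0 + cell).
                bytes([0xA0 + row, 0xA0 + cell]).decode("gb2312")
            except UnicodeDecodeError:
                continue  # unassigned cell
            count += 1
    return count


level1 = count_hanzi(16, 55)
level2 = count_hanzi(56, 87)
print(level1, level2, level1 + level2)
```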

2. GBK is an extension of GB2312, containing 21003 Chinese characters.

China, Japan, and South Korea jointly developed the “CJK Unified Chinese Character Encoding Character Set,” with the international standard number: ISO/IEC10646 and the national standard number: GB13000-90. This Chinese character encoding character set, commonly known as the large character set, includes 20902 Chinese characters, collecting simplified characters from the mainland’s first and second-level character sets, traditional characters from Taiwan’s “General Chinese Character Standard Exchange Code,” 58 special characters from Hong Kong, and 92 “Idu” characters from the Korean ethnic group in the Yanbian region. It even covers common Chinese characters used in Japanese and Korean, meeting various needs.
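To see what "extension of GB2312" means in practice, here is a hedged Python sketch that checks whether a few characters can be encoded in each set. The sample characters are my own picks, and can_encode is a hypothetical helper.

```python
def can_encode(ch, encoding):
    """Return True if ch has a representation in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# 好 is a common simplified character; 镕 and 龍 fall outside GB2312.
for ch in "好镕龍":
    print(ch, can_encode(ch, "gb2312"), can_encode(ch, "gbk"))

# Characters that exist in both sets keep the same bytes.
same_bytes = "好".encode("gb2312") == "好".encode("gbk")
print(same_bytes)
```

Because the shared characters keep their GB2312 byte values, old GB2312 text is still valid GBK.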

3. GB18030 is an extension of GBK, containing approximately 27000 Chinese characters.

In March 2000, the Ministry of Information Industry and the General Administration of Quality Supervision, Inspection, and Quarantine jointly released two new standards in Beijing. One is “Information Technology and Information Exchange Chinese Character Encoding Character Set, Extension of the Basic Set,” with the national standard number: GB18030-2000. It includes 27533 Chinese characters and also collects major minority languages such as Tibetan, Mongolian, and Uyghur, aiming to solve the input of rare Chinese characters and major minority languages in postal, household registration, financial, and geographic information systems.
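The minority-language support can be illustrated with Python's gb18030 codec. This is a sketch using a Tibetan word I picked as sample data; it shows that GB18030 keeps GBK's two-byte values and uses four bytes for characters GBK cannot represent.

```python
tibetan = "བོད"  # three Tibetan code points, chosen only as sample data

tibetan_size = len(tibetan.encode("gb18030"))
print(tibetan_size)

try:
    tibetan.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
print(gbk_ok)

# Characters already in GBK keep the same two bytes under GB18030.
compatible = "好".encode("gbk") == "好".encode("gb18030")
print(compatible)
```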

4. BIG5 is a traditional Chinese encoding, containing approximately 13000 Chinese characters.

The Hong Kong, Macau, and Taiwan regions commonly use BIG5 code (Big Five code), which is closely related to Taiwan’s “General Chinese Character Standard Exchange Code,” with the regional standard number: CNS11643. It selects over 13000 traditional Chinese characters. The “Overseas Traditional Version” of Qianma includes the BIG5 character set, allowing the input of over 13000 traditional Chinese characters.
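Since BIG5 and GBK are unrelated byte layouts, text saved in one and opened as the other turns into different characters. A short Python sketch, with sample words of my own choosing:

```python
traditional = "漢字"
simplified = "汉字"

big5_bytes = traditional.encode("big5")
print(len(big5_bytes))

# The simplified form 汉 has no BIG5 code.
try:
    simplified.encode("big5")
    simplified_ok = True
except UnicodeEncodeError:
    simplified_ok = False
print(simplified_ok)

# Reading BIG5 bytes as GBK does not give back the original text.
misread = big5_bytes.decode("gbk", errors="replace")
print(misread)
```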

5. Unicode (of which UTF-8 is one encoding) contains characters needed by all countries worldwide.

Unicode passed DIS (Draft International Standard) in June 1992, and version 2.0 was released in 1996. Its 16-bit code space holds 6811 symbols, 20902 Chinese characters, 11172 Korean Hangul syllables, 6400 private-use code points, and 20249 reserved code points, totaling 65534. In this fixed-width 16-bit form (UCS-2), every character has the same encoded size: an English letter “a” and a Chinese character “好” each occupy two bytes. UTF-8 is different: it is a variable-length encoding of the same character set, in which “a” takes 1 byte and “好” takes 3.
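The difference between the 16-bit form and UTF-8 can be checked directly. In this Python sketch, encoded_sizes is a helper name of my own, and "utf-16-le" stands in for the two-byte form (it matches UCS-2 for characters in the 16-bit range).

```python
def encoded_sizes(ch):
    """Return (bytes in UTF-16, bytes in UTF-8) for a single character."""
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-8"))


for ch in ["a", "好", "𠀀"]:
    print(ch, encoded_sizes(ch))
```

The last character, U+20000, lies beyond the original 65536 code points, so even UTF-16 needs four bytes for it.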

Some interesting articles about encoding can be found here:

http://note.sdo.com/u/1377503605#/c/i6MBR~jgeuKpnM01o000OO