MNBVC-Wiki

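While transcoding files according to their detected encodings, we found the following:

GBK is not fully compatible with CP936/MS936, so opening the affected content with the detected encoding can fail. Passing errors="ignore" avoids the exception, but the skipped content is lost. Sample data file: 224.txt, https://pan.baidu.com/s/1gBVBxv_WGie6W6hCIZxTXQ (extraction code: stfs)

Initial testing suggests the cause lies in how Python implements the GB18030, cp936 and ms936 encodings. Taking the garbled text in 224.txt as an example, the error occurs on line 2753, in the fragment `“100～120RMB/80～90$/65～70?左右”`. The ? should be a euro sign (€), but when Python decodes the file as GB18030, it raises a decoding error at that euro sign.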

On Windows, Simplified Chinese normally uses code page CP936, which represents the euro sign as the single byte 0x80. GB18030 does not assign the euro sign to 0x80.
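The mismatch can be reproduced in a few lines. The sketch below is only an illustration: the sample bytes are built by hand rather than read from 224.txt, and the helper name `try_decode` is hypothetical. It places a CP936-style 0x80 byte between ASCII text and two ordinary GBK characters, then tries Python's built-in codecs on it.

```python
# Hand-built sample: ASCII text, a CP936-style euro byte (0x80), then two
# ordinary GBK characters. These bytes are illustrative, not read from 224.txt.
SAMPLE = b"65-70" + b"\x80" + "左右".encode("gbk")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short error summary if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error at byte offset {exc.start}: {exc.reason}"


if __name__ == "__main__":
    # In CPython, "cp936" is an alias of the "gbk" codec
    for name in ("gbk", "cp936", "gb18030"):
        print(name, "->", try_decode(SAMPLE, name))
    # errors="ignore" avoids the exception but drops the euro byte entirely
    print("ignore ->", SAMPLE.decode("gbk", errors="ignore"))
```

All three codecs report an error at the 0x80 byte, and the errors="ignore" result contains no trace of the symbol, which is the data loss described above.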

Microsoft's detailed explanation of this issue follows:

What is GB18030?
GB18030-2000 is a new Chinese character encoding standard. The standard contains many characters and has some tough new conformance requirements. GB18030-2000 encodes characters in sequences of one, two, or four bytes. These sequences are defined as follows:

Single-byte: 0x00-0x7f
Two-byte: 0x81-0xfe + 0x40-0x7e, 0x80-0xfe
Four-byte: 0x81-0xfe + 0x30-0x39 + 0x81-0xfe + 0x30-0x39

The single-byte section applies the standard GB 11383 coding structure and principles by using the code points 0x00 through 0x7f. GB 11383 is identical to ISO 4873:1986.

The two-byte section uses two eight-bit sequences – much in the same manner as most DBCS (double-byte character sets) do – to express a character. The leading byte code points range from 0x81 through 0xfe. The trailing byte code points ranges from 0x40 through 0x7e and 0x80 through 0xfe. This section has the same problem as most DBCS in as much as some code points can be either a leading or trailing byte, thus making character delimitation more complicated.

The four-byte section uses the code points 0x30 through 0x39 as a way to extend the two-byte encodings. Which means the four-byte code points range from 0x81308130 through 0xfe39fe39.

Is GB18030 replacing the Windows Simplified Chinese code page (CP936)?
No, Windows code pages must be either one byte (SBCS) or a mix of one and two bytes (DBCS). This requirement is reflected throughout our code e.g. in data structures, program interfaces, network protocols and applications. The existing code page for Simplified Chinese, CP936, is a double byte code page. GB18030 is a four-byte code page i.e. every character is represented by one, two or four bytes. To replace CP936 with GB18030 would require rewriting much of the system. Even if we were to do this, such a system would not run regular applications nor interoperate with regular Windows.
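The byte ranges quoted above are enough to split a byte string into one-, two- and four-byte sequences, and doing so shows why a lone 0x80 has no place in that structure. The following is a minimal sketch under that reading of the ranges; the function name `split_gb18030` is hypothetical, and the code checks byte ranges only, not whether a sequence maps to an assigned character.

```python
from typing import List


def split_gb18030(data: bytes) -> List[bytes]:
    """Split data into 1-, 2- or 4-byte sequences using the quoted ranges.

    Only byte ranges are checked; whether each sequence maps to an
    assigned character is not verified.
    """
    sequences = []
    i = 0
    n = len(data)
    while i < n:
        b1 = data[i]
        length = 0
        if b1 <= 0x7F:
            # Single-byte section: 0x00-0x7f
            length = 1
        elif 0x81 <= b1 <= 0xFE and i + 1 < n:
            b2 = data[i + 1]
            if 0x40 <= b2 <= 0xFE and b2 != 0x7F:
                # Two-byte section: trail byte 0x40-0x7e or 0x80-0xfe
                length = 2
            elif (0x30 <= b2 <= 0x39 and i + 3 < n
                  and 0x81 <= data[i + 2] <= 0xFE
                  and 0x30 <= data[i + 3] <= 0x39):
                # Four-byte section: second and fourth bytes are 0x30-0x39
                length = 4
        if length == 0:
            raise ValueError(
                f"invalid or truncated sequence at offset {i} (byte 0x{b1:02x})"
            )
        sequences.append(data[i:i + length])
        i += length
    return sequences


if __name__ == "__main__":
    print(split_gb18030(b"A\x81\x30\x81\x30"))
    try:
        split_gb18030(b"\x80A")
    except ValueError as err:
        # 0x80 is neither a single byte nor a valid lead byte
        print(err)
```

Note that 0x80 is accepted as a trailing byte of a two-byte sequence but never as a first byte, so a CP936 euro sign stored as a standalone 0x80 falls outside all three sections.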

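zhangxu: The libicu library currently handles this case well. We can verify this with uconv, a Linux command built on libicu: `uconv -f gbk -t utf-8 224.txt > test.txt`

After running it, open test.txt and you can see that the fragment “100～120RMB/80～90$/65～70€左右” has been decoded correctly.

In Python, the following code calls libicu (through the PyICU wrapper) to decode the file: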

from icu import UnicodeString
 
 
def convert_encoding(input_file, output_file):
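    # Read the raw bytes; decoding is done by ICU, not by Python's codecs
    with open(input_file, "rb") as f_input:
        data = f_input.read()
    # Decode the bytes with ICU's GBK converter into a UnicodeString
    text = UnicodeString(data, "GBK")
    # Write the text out as UTF-8; the encoding is set explicitly so the
    # result does not depend on the platform default
    with open(output_file, "w", encoding="utf-8") as f_output:
        f_output.write(str(text))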
 
 
if __name__ == "__main__":
    input_file = "224.txt"
    output_file = "224_utf8.txt"
 
    convert_encoding(input_file, output_file)
    print("Conversion completed.")
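Where PyICU is not available, one possible fallback is a custom decode error handler that turns a stray 0x80 byte into the euro sign and leaves every other error untouched. This is a standard-library sketch offered as an assumption-laden alternative, not the approach used above: the handler name "cp936_euro" and the function names are hypothetical, and it covers only the 0x80 case, not any other difference between CP936 and Python's gbk codec.

```python
import codecs

EURO_BYTE = 0x80  # CP936 assigns the euro sign to this single byte


def cp936_euro_handler(exc):
    """Decode error handler: map a stray 0x80 byte to the euro sign."""
    if isinstance(exc, UnicodeDecodeError) and exc.object[exc.start] == EURO_BYTE:
        # Replace just this one byte and resume decoding right after it
        return ("\u20ac", exc.start + 1)
    # Any other failure is a genuine error and is re-raised unchanged
    raise exc


# "cp936_euro" is a hypothetical handler name chosen for this sketch
codecs.register_error("cp936_euro", cp936_euro_handler)


def decode_cp936(data: bytes) -> str:
    """Decode GBK bytes, treating a standalone 0x80 as the euro sign."""
    return data.decode("gbk", errors="cp936_euro")


if __name__ == "__main__":
    sample = b"65-70\x80" + "左右".encode("gbk")
    print(decode_cp936(sample))
```

Because the handler only runs when the codec reports an error, two-byte GBK sequences that legitimately use 0x80 as a trailing byte are decoded as usual.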

Java can also handle this case. The example below reads the file with the JDK's built-in MS936 charset through java.nio.charset:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
 
public class TextFileReader {
    public static void main(String[] args) {
        String filePath = "/Users/alan/Downloads/test/224.txt";
        String charsetName = "MS936";
        try {
            String fileContent = readTextFile(filePath, charsetName);
            System.out.println(fileContent);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    public static String readTextFile(String filePath, String charsetName) throws IOException {
        File file = new File(filePath);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), Charset.forName(charsetName)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append(System.lineSeparator());
            }
        }
        return sb.toString();
    }
}