Issue reading Japanese characters from file - C

Question

I am writing a program that reads a file of nearly 2 million lines. Each line is an integer ID, followed by a tab, followed by an artist name string.

6821361 Selinsgrove High School Chorus
10151460    greek-Antique
10236365    jnr walker & the all-stars
6878792 Grieg - Kraggerud, Kjekshus
6880556 Mr. Oiseau
6906305 stars on 54 (maxi single)
10584525    Jonie Mitchel
10299729    エリス レジーナ／アントニオ カルロス ジョビン

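The above is a sample of some entries in the file (note that some lines do not follow a specific format). My program works fine until it reaches the last line of the sample, and then it endlessly prints エリス レジーナ／アントニオ カルロス ジョビ\343\203.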

struct artist *read_artists(char *fname)
{
    FILE *file;
    struct artist *temp = (struct artist*)malloc(sizeof(struct artist));
    struct artist *head = (struct artist*)malloc(sizeof(struct artist));
    file = fopen("/Users/Daniel/Library/Developer/Xcode/DerivedData/project_Audioscrobbler_Artists-hgwyqpinuoxayzbmvarcjxryqnrz/Build/Products/Debug/artist_data.txt", "r");
    if(file == 0)
    {
        perror("fopen");
        exit(1);
    }
    int artist_ID;
    char artist_name[650];
    while(!feof(file))
    {
        fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name);
        temp = create_play(artist_ID, artist_name, 0, -1);
        head = add_play(head, temp);
        printf("%s\n", artist_name);
    }
    fclose(file);
    //print_plays(head);
    return head;
}

Above is my code for reading the file. Can you help explain what is going wrong?

Answer 1

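As noted in the comments, one problem is while(!feof(file)). The linked content explains in detail why this is not a good idea, but in summary, to quote one of the answers in the link:

while(!feof(file))...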

...is wrong because it tests for something that is irrelevant and fails to test for something that you need to know. The result is that you are erroneously executing code that assumes that it is accessing data that was read successfully, when in fact this never happened. - Kerrek SB

In your case, this usage does not cause your problem, but as Kerrek explains, it can mask it.

You can replace it with fgets(...):

char lineBuf[1000];//make length longer or shorter for your purpose
file = fopen("/Users/Daniel/Library/Developer/Xcode/DerivedData/project_Audioscrobbler_Artists-hgwyqpinuoxayzbmvarcjxryqnrz/Build/Products/Debug/artist_data.txt", "r");
if(!file) return -1;
while(fgets (lineBuf, sizeof(lineBuf), file))
{
    //process each line here
    //But processing Japanese characters
    //will require special considerations.
    //Refer to the link below for UNICODE tips
}
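
As a rough sketch of what "process each line here" could look like, the helper below splits a line that fgets() has already read into the integer ID and the name. The name parse_artist_line and its return convention are assumptions made for illustration, not part of the original code. The artist name is kept as raw UTF-8 bytes, so no field width can cut a character in half, provided lineBuf is long enough to hold the whole line.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the original post): splits one line of the
 * form "<id>\t<name>\n" in place. Returns 1 on success, 0 if malformed. */
int parse_artist_line(char *line, long *id, char **name)
{
    char *end;

    line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending, if any */

    *id = strtol(line, &end, 10);         /* parse the leading integer */
    if (end == line || *end != '\t')
        return 0;                         /* no digits, or no tab after them */

    *name = end + 1;                      /* the rest of the line, as raw bytes */
    return **name != '\0';                /* an empty name counts as malformed */
}
```

Inside the fgets() loop, a line for which parse_artist_line() returns 0 can simply be skipped or reported, which also covers the lines that do not follow the expected format.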

Unicode in C and C++...

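In particular, you will need to use a variable type large enough to hold the differently sized characters you will be processing. The link discusses this in great detail.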

An excerpt:

"char" no longer means character
I hereby recommend referring to character codes in C programs using a 32-bit unsigned integer type. Many platforms provide a
"wchar_t" (wide character) type, but unfortunately it is to be avoided since some compilers allot it only 16 bits—not enough to represent Unicode. Wherever you need to pass around an individual character, change "char" to "unsigned int" or similar. The only remaining use for the "char" type is to mean "byte".

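Edit:
In the comments above, you state that the string it fails on is 66 bytes long. Because you are reading into a 'char' array, the bytes needed to complete the character are cut off one byte short of the last necessary byte. An ASCII character fits in a single char. A Japanese character does not. If you were using an array of unsigned int rather than an array of char, the last byte would be included.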
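
The excerpt's suggestion of carrying character codes in a 32-bit unsigned integer can be illustrated with a small decoder. This is only a sketch under stated assumptions: utf8_decode is a hypothetical name, the input is assumed to be NUL-terminated, and the function is deliberately simplified (it does not reject overlong encodings or surrogate code points).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes the UTF-8 sequence starting at s into *cp.
 * Returns the sequence length in bytes (1 to 4), or 0 if the bytes do not
 * form a complete sequence. Simplified: overlong forms are not rejected. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;

    if (s[0] < 0x80)                { *cp = s[0];        return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* truncated or malformed sequence */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}
```

The three bytes 227, 131, 179 decode to the single code point U+30F3, while the first two bytes on their own are rejected as an incomplete sequence.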

Answer 2

OP's code fails because the result of fscanf() is not checked.

fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name);

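fscanf() reads the first 65 char of "エリス レジーナ／アントニオ カルロス ジョビン". Yet this string, encoded in UTF-8, has a length of 66. The final 'ン' is codes 227, 131, 179 (octal 343 203 263), and only the first 2 are read. When artist_name is printed, the following appears.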

エリス レジーナ／アントニオ カルロス ジョビ\343\203
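
The byte arithmetic can be checked with a small helper. This is only a sketch: utf8_prefix_len is a hypothetical name, and it assumes the input is valid UTF-8. It returns the length of the longest prefix, no longer than a given byte limit, that does not end in the middle of a character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: largest n <= limit such that the first n bytes of s
 * end on a UTF-8 character boundary. Assumes s is valid UTF-8. */
size_t utf8_prefix_len(const char *s, size_t limit)
{
    size_t n = strlen(s);

    if (n <= limit)
        return n;                 /* the whole string fits */

    n = limit;
    /* If the first excluded byte is a continuation byte (10xxxxxx), the cut
     * falls inside a character, so back up to that character's lead byte. */
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;
    return n;
}
```

For the 66-byte name above and a limit of 65 it returns 63, because bytes 64 and 65 (227 and 131) begin a character whose last byte (179) does not fit.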

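Now the problem begins. The last char, 179, remains in the file. The next fscanf() fails because char 179 does not convert to an int ("%d"), so fscanf() returns 0. Since the code does not check the result of fscanf(), it does not realize that artist_ID and artist_name are left over from the previous iteration, and so it prints the same text again.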

Because char 179 is never consumed, feof() never becomes true, and we have an infinite loop.

while(!feof(file)) hid this problem, but did not cause it.
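
The stuck state can be reproduced in isolation. The sketch below is a hypothetical demo, not the OP's code: it writes the leftover byte 179 (0xB3) and one well-formed line to a temporary file created with tmpfile(), then calls fscanf() with "%d" twice, the way two loop iterations would.

```c
#include <stdio.h>

/* Hypothetical demo: returns 1 if both fscanf() calls fail with 0, the
 * target variable is never written, and feof() stays false. Returns -1 if
 * the temporary file cannot be created. */
int stuck_on_bad_byte(void)
{
    FILE *f = tmpfile();
    int id = -1;
    int r1, r2, at_eof;

    if (f == NULL)
        return -1;

    fputc(0xB3, f);                      /* the byte left behind by %65[...] */
    fputs("\n6821361\tSelinsgrove\n", f);
    rewind(f);

    r1 = fscanf(f, "%d", &id);           /* 0xB3 is not a digit: returns 0 */
    r2 = fscanf(f, "%d", &id);           /* same byte again: returns 0 */
    at_eof = feof(f);                    /* input remains, so still false */
    fclose(f);

    return r1 == 0 && r2 == 0 && !at_eof && id == -1;
}
```

Each failed call leaves the offending byte in the stream, so a loop that only tests feof() never makes progress.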

The suggested fgets() is a good approach. Another is:

while (fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name) == 2) {
    temp = create_play(artist_ID, artist_name, 0, -1);
    head = add_play(head, temp);
    printf("%s\n", artist_name);
}

IOWs, verify the result of *scanf().
