Emoji reading discrepancy between different applications

Question

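I have a dataset of tweets/threads to process, along with separate annotation files. The annotation files consist of spans, each given as indices that correspond to a word or sentence. As you would expect, the indices are character positions in the tweet/thread file.

The problem appears when I process files that contain emojis. To give a concrete example:

Here is part of the file in question (download):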

TeamKhabib   @danawhite @seanshelby @arielhelwani @AliAbdelaziz00 #McTapper xxxxx://x.xx/xxxxxxxxxx
mmafan1709  @TeamKhabib @danawhite @seanshelby @arielhelwani @AliAbdelaziz00 Conor is Khabib hardest fight and Khabib is Conors hardest fight

I read the file in Python with the plain open function, passing encoding='utf-8':

with open('028_948124816611139589.branch318.txt.username_text_tabseparated', 'r', encoding='utf-8') as f:
    content = f.read()
    print(content[211:214])

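The annotation says the word and is in the span 211-214. Reading the file as shown above, I get ' kh' there instead.

When I use the indices from the annotation file to extract the spanned string, the string I get is off by 3 characters (shifted to the right). That is because, in the annotations, an emoji apparently takes up 2 positions, whereas Python reads it as one, hence the shift. It becomes even more obvious when I get the length of the file with len(list(file.read())). This returns 7809, while the actual length of the file is 7812. 7812 is the position at the end of the file that I get in VS Code from a plugin called vscode-position. Another file gives me a mismatch of 513 versus 527.

I have no problem reading the emojis. I can see them in my output/array, but the space they occupy in the encoding is different. My question is not answered by the other related questions.

Clearly there is a sensible way to read this file, because the files were read/created with some format/method/concept/encoding/whatever that this plugin and the annotators agree on, but open().read() does not.

I am using Python 3.8.

What am I missing here?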

Answer 1

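After some discussion, I believe the issue is that the spans were computed on Unicode strings that use surrogate pairs for Unicode code points > U+FFFF. Python 2 (on narrow builds) and other languages such as Java and C# store Unicode strings as UTF-16 code units, rather than as abstract code points the way Python 3 does. If I treat the test data as UTF-16LE-encoded, the answer comes out: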

# Important: the original file contains two tabs that SO doesn't display:
#  * between the first "TeamKhabib" and the smiley
#  * between "mmafan1709" and "@TeamKhabib"
# Use the download link while it is valid.

with open('test.txt', 'r', encoding='utf-8') as f:
    content = f.read()

# Re-encode as UTF-16LE so that every code unit is exactly two bytes.
b = content.encode('utf-16le')
# The span offsets count code units, so double them to get byte offsets.
print(b[211 * 2:214 * 2].decode('utf-16le'))

# result: and

The offsets need to be doubled because each UTF-16 code unit is two bytes, and the resulting bytes must then be decoded to display correctly.
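The same idea can be wrapped in a small helper so that annotation spans can be applied directly. This is only a minimal sketch, assuming the annotation offsets really do count UTF-16 code units; slice_utf16, utf16_len and the sample string are hypothetical names and data, not taken from the original files.

```python
def slice_utf16(text: str, start: int, end: int) -> str:
    """Return the part of text between two offsets given in UTF-16 code units."""
    data = text.encode('utf-16-le')  # two bytes per code unit, no BOM
    return data[start * 2:end * 2].decode('utf-16-le')


def utf16_len(text: str) -> int:
    """Length of text measured in UTF-16 code units."""
    return len(text.encode('utf-16-le')) // 2


# Hypothetical sample: one emoji above U+FFFF followed by plain text.
sample = '\U0001F600 this and that'

print(len(sample))                 # length in code points
print(utf16_len(sample))           # length in UTF-16 code units
print(slice_utf16(sample, 8, 11))  # span interpreted as UTF-16 offsets
print(sample[8:11])                # same span interpreted as code points
```

Each character above U+FFFF contributes one extra code unit, which would account for a gap such as 7809 versus 7812 if the file contains three such characters. Note that an offset falling in the middle of a surrogate pair would make the decode step fail.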

I specifically used utf-16le rather than utf-16, because the latter adds a BOM and throws the count off by another two bytes (or one code unit).
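The effect of the BOM can be checked with a short experiment. This is just an illustrative sketch on a made-up string, not on the original data:

```python
text = 'and'

with_bom = text.encode('utf-16')        # BOM followed by the data, native byte order
without_bom = text.encode('utf-16-le')  # data only, little-endian

# The BOM accounts for exactly two extra bytes, i.e. one code unit.
print(len(with_bom) - len(without_bom))

# The first two bytes are the BOM in one of the two byte orders.
print(with_bom[:2] in (b'\xff\xfe', b'\xfe\xff'))
```

With the plain utf-16 codec, every byte offset would have to be shifted by two to skip the BOM, so the BOM-free variant keeps the arithmetic simpler.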
