How can I handle these weird special characters messing up my print formatting?

Question

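I am printing a formatted table. But sometimes these user-generated characters take up more than one character width, and that messes up the formatting, as shown in the screenshot below...

The "title" column is formatted to be 68 characters wide. But these "special characters" take up more than one character width on screen while only counting as one character each. This pushes the column past its boundary.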

print('{0:16s}{3:<18s}{1:68s}{2:>8n}'.format((
    ' ' + streamer['user_name'][:12] + '..') if len(streamer['user_name']) > 12 else ' ' + streamer['user_name'],
    (streamer['title'].strip()[:62] + '..') if len(streamer['title']) > 62 else streamer['title'].strip(),
    streamer['viewer_count'],
    (gamesDic[streamer['game_id']][:15] + '..') if len(gamesDic[streamer['game_id']]) > 15 else gamesDic[streamer['game_id']]))

Any suggestions on how to handle these special characters?

Edit: I printed the offending string to a file.

( ) ✨ ＬＩＶＥ SUBS GET SNAPCHAT

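Edit 2:

Why don't these align on character boundaries?

Edit 3:

Today the first two characters produce odd output. But in each of the cases below, the columns are aligned.

The first character in isolation...

title[0]

The second character in isolation... title[1]

The first and second characters together... title[0] + title[1]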

Answer 1

I have written a custom string formatter based on @snakecharmerb's comment, but the "half character width" problem still persists:

import unicodedata

def fstring(string, max_length, align='l'):
    string = str(string)
    extra_length = 0
    for char in string:
        if unicodedata.east_asian_width(char) == 'F':
            extra_length += 1

    diff = max_length - len(string) - extra_length
    if diff > 0:
        return string + diff * ' ' if align == 'l' else diff * ' ' + string
    elif diff < 0:
        return string[:max_length-3] + '.. '

    return string

data = [{'user_name': 'shroud', 'game_id': 'Apex Legends', 'title': 'pathfinder twitch prime loot YAYA @shroud on socials for update', 'viewer_count': 66200},
        {'user_name': 'Amouranth', 'game_id': 'ASMR', 'title': '  ( ) ✨ ＬＩＶＥ  SUBS GET SNAPCHAT', 'viewer_count': 2261}]

for d in data:
    name = fstring(d['user_name'], 20)
    game_id = fstring(d['game_id'], 15)
    title = fstring(d['title'], 62)
    count = fstring(d['viewer_count'], 10, align='r')
    print('{}{}{}{}'.format(name, game_id, title, count))

It produces this output:

(I can't post it as text, because the formatting would be lost)
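One gap in the formatter above is that it only counts fullwidth ("F") characters as double width, and it truncates by character count rather than by columns. A rough variation is sketched below. It assumes that wide ("W") characters such as ✨ also take two columns, and that everything else takes one. The names display_width and fit are made up for this sketch. It is only an estimate: it does not account for zero-width or combining characters, it expects a column width of at least 2, and it cannot fix half-width differences introduced by the terminal or font.

```python
import unicodedata

def display_width(text):
    """Estimate terminal columns: 'F' and 'W' characters count as 2 (assumption)."""
    return sum(2 if unicodedata.east_asian_width(c) in ('F', 'W') else 1
               for c in text)

def fit(text, width, align='l'):
    """Pad or truncate text to an estimated `width` columns (hypothetical helper)."""
    text = str(text)
    if display_width(text) > width:
        # Keep characters until only two columns remain for the '..' marker.
        kept, used = [], 0
        for c in text:
            w = display_width(c)
            if used + w > width - 2:
                break
            kept.append(c)
            used += w
        text = ''.join(kept) + '..'
    pad = ' ' * (width - display_width(text))
    return text + pad if align == 'l' else pad + text

print(fit('✨ ＬＩＶＥ SUBS GET SNAPCHAT', 20) + '|')
print(fit('plain ascii title', 20) + '|')
```

If the estimate matches how the terminal draws the characters, the two "|" marks line up. If they do not, the mismatch is coming from the rendering rather than from the counting.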

Answer 2

I commented on the question:

The characters in "ＬＩＶＥ" are Fullwidth characters. A hacky way to deal with them might be to test their width with unicodedata.east_asian_width(char) (it will return "F" for fullwidth characters) and substitute with the final character of unicodedata.name(char) (or just count them as length 2)
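The substitution part of that comment could be sketched roughly as follows. The function name narrow_fullwidth is made up for illustration, and the sketch only handles fullwidth Latin letters, because taking the last character of the Unicode name gives the wrong result for other fullwidth characters (for example digits, whose names end in words like "ONE").

```python
import unicodedata

def narrow_fullwidth(text):
    """Replace fullwidth Latin letters with their ASCII counterparts (hypothetical helper)."""
    out = []
    for char in text:
        name = unicodedata.name(char, '')
        if unicodedata.east_asian_width(char) == 'F' and 'LETTER' in name:
            # e.g. 'FULLWIDTH LATIN CAPITAL LETTER L' -> 'L'
            letter = name[-1]
            out.append(letter.lower() if 'SMALL' in name else letter)
        else:
            # Leave everything else, including other fullwidth symbols, untouched.
            out.append(char)
    return ''.join(out)

print(narrow_fullwidth('ＬＩＶＥ SUBS'))
```

After substitution the letters are ordinary ASCII, so len() and the usual format widths apply to them again. Wide characters such as ✨ are not affected by this and still need separate handling.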

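This "answer" is essentially another comment, but it is too long for the comment field.

This hack - implemented in Alderven's answer - almost works for the OP, but the example string is rendered with an extra half character width (note that the example string does not contain any East Asian halfwidth characters).

I have been unable to reproduce this exact behaviour using the test statement below, where s is the example string from the question, while varying which width categories are removed from the tuple: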

import unicodedata as ud

print((s + (68 - (len(s) + sum(1 for x in s if ud.east_asian_width(x) in ('F', 'N', 'W')))) * 'x')+ '\n'+ ('x' * 68))

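In a Python 3.6 interpreter in Gnome terminal on Debian, using the default monospace regular font, removing fullwidth characters resulted in the example string apparently being rendered three characters longer than the equivalent string of "x"s.

Removing fullwidth and wide (East Asian Width "W") characters generated a string that seemed to render at the same length as the equivalent number of "x"s.

In Python 3.7 in a KDE Konsole terminal on OpenSuse, using the Ubuntu Monospace regular font, I was unable to generate strings that rendered at the same length, regardless of which combination of fullwidth, wide or neutral ("N") characters I removed.

I did notice that the sparkles character (✨) seemed to take up an extra half width when rendered on its own in Konsole, but I could not see any half-width difference when testing the full string.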
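When experimenting like this, it can help to see which width category each character of the string actually falls into. A small sketch is below. The helper name describe is made up, and the categories reported depend on the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata

def describe(text):
    """Return (char, code point, East Asian Width, name) for each character."""
    return [
        (c, 'U+%04X' % ord(c),
         unicodedata.east_asian_width(c),
         unicodedata.name(c, '<unnamed>'))
        for c in text
    ]

for row in describe('✨ ＬＩＶＥ'):
    print(*row)
```

This only reports what the Unicode database says about each character. It says nothing about how many cells a given terminal and font will really use to draw it.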

I suspect the problem lies in low-level rendering outside Python's control, as this note in the Unicode standard suggests:

Note: The East_Asian_Width property is not intended for use by modern terminal emulators without appropriate tailoring on a case-by-case basis. Such terminal emulators need a way to resolve the halfwidth/fullwidth dichotomy that is necessary for such environments, but the East_Asian_Width property does not provide an off-the-shelf solution for all situations. The growing repertoire of the Unicode Standard has long exceeded the bounds of East Asian legacy character encodings, and terminal emulations often need to be customized to support edge cases and for changes in typographical behavior over time.

python

unicode

terminal

string-formatting

python-unicode