Python: Replace "dumb quotation marks" with “curly ones” in a string

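I have a string like this:

"But that gentleman," looking at Darcy, "seemed to think the country was nothing at all."

I want this output:

“But that gentleman,” looking at Darcy, “seemed to think the country was nothing at all.”

Likewise, dumb single quotation marks should be converted to their curly counterparts. Read about the typographic rules here if you are interested.

I guess this has been solved before, but I cannot find a library or script that does it. SmartyPants (Perl) is the mother of all libraries to do this and there is a python port. But its output is HTML entities: &#8220;But that gentleman,&#8221; I just want a plain string with curly quotes. Any ideas?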

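Update:

I solved it as suggested by Padraig Cunningham:

  1. Use smartypants to make the typographic corrections
  2. Use HTMLParser().unescape to convert the HTML entities back to Unicode

This approach may be problematic if your input text contains HTML entities that you do not want converted, but in my case that is fine.

End of update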

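Can the input be trusted?

For now, the input simply has to be trusted. The string can contain an unclosed double quotation mark: "But be that gentleman, looking at Dary. It can also contain an unclosed single quotation mark: 'But be that gentleman, looking at Dary. Finally, it can contain a single quotation mark that is meant as an apostrophe: Don't go there.

I have already implemented an algorithm that tries to close these missing quotation marks properly, so that is not part of the question. For completeness, here is the code that closes the missing quotation marks: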

quotationMarkDictionary = [{
    'start': '"',
    'end': '"',
    },{
    'start': '“',
    'end': '”',
    },{
    'start': '\'',
    'end': '\'',
    },{
    'start': '‘',
    'end': '’'
    },{
    'start': '(',
    'end': ')'
    },{
    'start': '{',
    'end': '}'
    },{
    'start': '[',
    'end': ']'
    }]

# If the sentence has quotation marks (single, double, ...) and the number of
# opening quotation marks is larger than the number of closing quotation marks,
# append a closing quotation mark at the end of the sentence. Likewise, add an
# opening quotation mark to the beginning of the sentence if there are more
# closing marks than opening marks.
# assumedSentence is a dict that holds the text under the 'sentence' key.
for quotationMark in quotationMarkDictionary:
  numberOpenings = assumedSentence['sentence'].count(quotationMark['start'])
  numberClosings = assumedSentence['sentence'].count(quotationMark['end'])
  # Are the opening and closing marks the same? ('Wrong' marks.) Then just make sure there is an even number of them.
  # Compare with == and !=; `is` tests object identity and is unreliable for strings and integers.
  if quotationMark['start'] == quotationMark['end'] and numberOpenings % 2 != 0:
    # If sentence starts with this quotation mark, put the new one at the end
    if assumedSentence['sentence'].startswith(quotationMark['start']):
      assumedSentence['sentence'] += quotationMark['end']
    else:
      assumedSentence['sentence'] = quotationMark['end'] + assumedSentence['sentence']
  elif numberOpenings > numberClosings:
    assumedSentence['sentence'] += quotationMark['end']
  elif numberOpenings < numberClosings:
    assumedSentence['sentence'] = quotationMark['start'] + assumedSentence['sentence']
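
To see how this balancing logic behaves on the example inputs from the question, here is a minimal self-contained sketch of the same idea. The function name `balance_marks` and the shortened `PAIRS` list are made up for illustration and are not part of the code above.

```python
# Hypothetical, shortened version of the start/end list used above.
PAIRS = [('"', '"'), ('“', '”'), ("'", "'"), ('‘', '’'), ('(', ')')]


def balance_marks(sentence, pairs=PAIRS):
    """Return sentence with one missing mark per pair type added."""
    for start, end in pairs:
        openings = sentence.count(start)
        closings = sentence.count(end)
        if start == end:
            # Identical marks: only an odd count signals a missing mark.
            if openings % 2 != 0:
                if sentence.startswith(start):
                    sentence += end
                else:
                    sentence = end + sentence
        elif openings > closings:
            sentence += end
        elif openings < closings:
            sentence = start + sentence
    return sentence


print(balance_marks('"But be that gentleman, looking at Dary'))
print(balance_marks('(unclosed'))
print(balance_marks("Don't go there."))
```

Note that the last call prepends a quotation mark, because a lone apostrophe is counted like an unbalanced single quote. Any apostrophe handling would have to happen before this step.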

Skimming the docs, it looks like you are stuck with a .replace on top of smartypants:

# Assumes the function is imported with: from smartypants import smartypants
smartypants(r'"smarty" \"pants\"').replace('&#x201C;', '“').replace('&#x201D;', '”')

It might read better, though, if you alias the magic strings:

html_open_quote = '&#x201C;'
html_close_quote = '&#x201D;'
smart_open_quote = '“'
smart_close_quote = '”'
smartypants(r'"smarty" \"pants\"') \
    .replace(html_open_quote, smart_open_quote)  \
    .replace(html_close_quote, smart_close_quote)

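Assuming the input is well-formed, this can be done with regular expressions:

# coding=utf8
import re
sample = '\'Sample Text\' - "But that gentleman," looking at Darcy, "seemed to think the \'country\' was nothing at all." \'Don\'t convert here.\''
# The inner call curls the double quotes; the outer call curls single quotes
# that are delimited by whitespace or by the start/end of the string.
print(re.sub(r"(\s|^)\'(.*?)\'(\s|$)", r"\1‘\2’\3", re.sub(r"\"(.*?)\"", r"“\1”", sample)))

Output: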

‘Sample Text’ - “But that gentleman,” looking at Darcy, “seemed to think the ‘country’ was nothing at all.” ‘Don't convert here.’

Here I delimit the single quotes by assuming they sit at the beginning or end of a line or are surrounded by whitespace.
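
The same idea can be wrapped in a function and extended slightly. The following is only a sketch under two assumptions of mine: an apostrophe is any straight single quote with a word character on both sides, and lookarounds are used instead of capturing the surrounding whitespace, so that two quoted items separated by a single space are both matched. The name `curl_quotes` is hypothetical.

```python
import re


def curl_quotes(text):
    # Double quotes: pair them up lazily, left to right.
    text = re.sub(r'"(.*?)"', "“\\1”", text)
    # Single quotes: opening mark must not follow a non-space character,
    # closing mark must not be followed by a non-space character.
    text = re.sub(r"(?<!\S)'(.*?)'(?!\S)", "‘\\1’", text)
    # Remaining straight quotes between word characters are apostrophes.
    text = re.sub(r"(?<=\w)'(?=\w)", "’", text)
    return text


print(curl_quotes("'Don't go there,' she said."))
print(curl_quotes("'a' 'b'"))
```

A quoted word that is directly followed by punctuation, such as 'country'., is still left untouched by this sketch, so the whitespace assumption remains a real limitation.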

You can use HTMLParser to unescape the HTML entities returned by smartypants:

In [32]: from HTMLParser import HTMLParser

In [33]: s = "&#x201C;But that gentleman,&#x201D;"

In [34]: print HTMLParser().unescape(s)
“But that gentleman,”
In [35]: HTMLParser().unescape(s)
Out[35]: u'\u201cBut that gentleman,\u201d'
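
The session above is Python 2. On Python 3 the `HTMLParser().unescape` method is no longer available (it was removed in Python 3.9), and the standard library function `html.unescape` does the same job. A minimal sketch, using literal entity strings in place of real smartypants output:

```python
import html

s = "&#x201C;But that gentleman,&#x201D;"
curly = html.unescape(s)
print(curly)

# Every entity in the text is converted, not only the quotes,
# which matters if the input already contains entities such as &amp;.
mixed = "&#8220;Fish &amp; chips&#8221;"
print(html.unescape(mixed))
```

The second call illustrates the caveat from the question's update: entities that were already in the input are unescaped along with the quotation marks.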

To avoid encoding errors, you should use io.open when opening the file and specify encoding="the_encoding", or decode the string to unicode:

In [11]: s
Out[11]: '&#x201C;But that gentleman,&#x201D;\xe2'

In [12]: print  HTMLParser().unescape(s.decode("latin-1"))
“But that gentleman,”â

Since the question was originally asked, Python smartypants has gained an option to output the Unicode replacement characters directly:

u = 256

Output Unicode characters instead of numeric character references, for example, from &#8220; to left double quotation mark (“) (U+201C).

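For the simplest use cases, no regular expressions are needed:

quote_chars = ('"', "'", "`")


def to_smart_quotes(s):
    # Count per call, so that repeated calls start from a clean state.
    quote_chars_counts = dict.fromkeys(quote_chars, 0)
    output = []

    for c in s:
        if c in quote_chars_counts:
            # Even occurrences (0, 2, ...) open a quotation, odd ones close it.
            # Note: every quote character, including ' and `, becomes a double curly quote.
            new_ch = '“' if quote_chars_counts[c] % 2 == 0 else '”'
            quote_chars_counts[c] += 1
        else:
            new_ch = c
        output.append(new_ch)

    return ''.join(output)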

If needed, it is trivial to modify this to pull the replacements from a replacement map instead of using literals.
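
One way that modification might look is sketched below. The map `SMART_QUOTES` and the name `to_smart_quotes_mapped` are hypothetical, and the choice of which straight character maps to which curly pair is my assumption.

```python
# Hypothetical replacement map: straight character -> (opening, closing).
SMART_QUOTES = {
    '"': ('“', '”'),
    "'": ('‘', '’'),
}


def to_smart_quotes_mapped(s, replacements=SMART_QUOTES):
    # Fresh counters on every call, one per straight quote character.
    counts = dict.fromkeys(replacements, 0)
    output = []
    for c in s:
        if c in replacements:
            opening, closing = replacements[c]
            # Alternate: even occurrences open, odd occurrences close.
            output.append(opening if counts[c] % 2 == 0 else closing)
            counts[c] += 1
        else:
            output.append(c)
    return ''.join(output)


print(to_smart_quotes_mapped('"Hi," he said, \'bye\'.'))
```

Because the function simply alternates between opening and closing marks, an apostrophe in the input (as in Don't) throws off the pairing for single quotes, so this only suits text without apostrophes.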