Splitting string before name of month with regex

Question

I have a bunch of lines of random text, each ending with a timestamp. I am trying to split these lines right before the timestamp.

Current output:

Yes, I'd say so. Nov 08, 2014 UTC
Hell yes! Oct 01, 2014 UTC 
Anbefalt som bare det, løp og kjøp. Sep 16, 2014 UTC
Etc.

Desired output (by "tab" I mean the actual whitespace character):

Yes, I'd say so. <tab> Nov 08, 2014 UTC
Hell yes! <tab> Oct 01, 2014 UTC
Anbefalt som bare det, løp og kjøp. <tab> Sep 16, 2014 UTC
Etc.

So far I have used "replace" to put a tab before the month, like this:

my_string.replace("May ", "\tMay ").replace("Apr ", "\tApr ").replace("Mar ", "\tMar ").replace("Feb ", "\tFeb ") etc. (incomplete code)

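This works fine unless the random text itself mentions a month name, e.g. "I bought it last may, great stuff". Since the dates are formatted in one specific way, I would like to improve on this with a regex and wildcards if possible. Is there a way to put a tab before these dates? As you can see above, the dates are formatted as follows: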

[Three-letter abbreviation of the month] [two-digit day] [,] [four-digit year] [UTC]

E.g.

Oct 31, 2014 UTC

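Please excuse the amateur code and approach; I am an absolute regex n00b. I looked around here for an answer but came up short. I hope someone can help!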

Answer 1

You should be able to do this with a single regex that covers all months:

import re

lines = [
    "Yes, I'd say so. Nov 08, 2014 UTC",
    "Hell yes! Oct 01, 2014 UTC"
]

for ln in lines:
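    # \1 re-inserts the captured timestamp after the tab
    print(re.sub(r'(\w+\s\d{2}, \d{4} UTC)$', r'\t\1', ln))

Which returns:

Yes, I'd say so.    Nov 08, 2014 UTC
Hell yes!   Oct 01, 2014 UTC

How it works is simple: re.sub captures everything inside the parentheses of the first argument and assigns it to \1. The second argument, r'\t\1', is what we want to replace the match with.

In your case, you want to replace the match with the original string (represented by \1) with a tab (\t) added in front of it.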
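
A minimal sketch of a stricter variant, assuming the timestamp always ends the line. It spells out the month abbreviations instead of using \w+, and uses a lookahead so the tab replaces the whitespace before the date rather than being added next to it. The names DATE_AT_END and tab_before_date are illustrative and not part of the answer above.

```python
import re

# Assumed format: "Mon DD, YYYY UTC" at the very end of the line.
DATE_AT_END = re.compile(
    r"\s+(?=(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r" \d{2}, \d{4} UTC\s*$)"
)

def tab_before_date(line):
    """Replace the whitespace before a trailing timestamp with one tab."""
    return DATE_AT_END.sub("\t", line, count=1)

print(tab_before_date("I bought it last May, great stuff. Oct 31, 2014 UTC"))
```

Because the lookahead requires a two-digit day, a four-digit year and "UTC" at the end of the line, a month name inside the free text is left alone.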

Answer 2

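If you can always guarantee that the timestamp has that many words, you don't need a regex at all. Just split from the right with the built-in string methods and join the pieces back together, e.g.: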

s = "Yes, I'd say so. Nov 08, 2014 UTC"
split = s.rsplit(None, 4)
new = split[0] + '\t' + ' '.join(split[1:])
# "Yes, I'd say so.\tNov 08, 2014 UTC"
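
The same idea could be wrapped in a small helper. This is a sketch under the assumption that the timestamp is always the last four whitespace-separated tokens; tab_before_last_tokens is a hypothetical name. Lines with nothing in front of the timestamp are returned unchanged.

```python
def tab_before_last_tokens(line, n_tokens=4):
    """Put a tab before the last n_tokens whitespace-separated tokens."""
    parts = line.rsplit(None, n_tokens)
    if len(parts) <= n_tokens:
        # Nothing precedes the timestamp, so there is nothing to split off.
        return line
    return parts[0] + "\t" + " ".join(parts[1:])

print(tab_before_last_tokens("Hell yes! Oct 01, 2014 UTC"))
```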

Answer 3

Split at 16 characters from the end:

data = """Yes, I'd say so. Nov 08, 2014 UTC
Hell yes! Oct 01, 2014 UTC
Anbefalt som bare det, løp og kjøp. Sep 16, 2014 UTC"""

You can also reformat the date if needed.

from datetime import datetime
    
fmt = "%b %d, %Y %Z"

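for line in data.split("\n"):
    txt = line[:-16]                          # everything before the timestamp
    dt = datetime.strptime(line[-16:], fmt)   # parse the 16-character timestamp
    # strptime returns a naive datetime, so %Z would format as an empty
    # string; write "UTC" literally in the output format instead.
    print("{}\t{}".format(txt, dt.strftime("%b %d, %Y UTC")))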
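
Fixed-width slicing silently produces garbage when a line has no timestamp. A hedged sketch of a guard, assuming a two-digit day so the tail is always 16 characters long; split_fixed_tail and TAIL_LEN are hypothetical names, and parsing %b depends on an English (or C) locale.

```python
from datetime import datetime

TAIL_LEN = 16  # len("Nov 08, 2014 UTC"); assumes a two-digit day

def split_fixed_tail(line):
    """Return (text, timestamp), or (line, None) if the tail is not a date."""
    line = line.rstrip()
    tail = line[-TAIL_LEN:]
    try:
        # Only used for validation; the parsed value is discarded.
        datetime.strptime(tail, "%b %d, %Y UTC")
    except ValueError:
        return line, None
    return line[:-TAIL_LEN].rstrip(), tail

text, stamp = split_fixed_tail("Hell yes! Oct 01, 2014 UTC ")
print("{}\t{}".format(text, stamp))
```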

Answer 4

If you want to use a regex that matches each month name and adds a tab, use re.sub:

import re

lines = """Yes, I'd say so. Nov 08, 2014 UTC
Hell yes! Oct 01, 2014 UTC
Anbefalt som bare det, løp og kjøp. Sep 16, 2014 UTC"""

r = re.compile(r"\bJan\b|\bFeb\b|\bMar\b|\bApr\b|\bMay\b|\bJun\b|\bJul\b|\bAug\b|\bSep\b|\bOct\b|\bNov\b|\bDec\b")

for line in lines.splitlines():
    print(r.sub("\t"+r"\g<0>", line))

Output:

Yes, I'd say so.    Nov 08, 2014 UTC
Hell yes!   Oct 01, 2014 UTC
Anbefalt som bare det, løp og kjøp.     Sep 16, 2014 UTC

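The regex will find an exact match for any month abbreviation regardless of how the rest of the line is formatted.

To match exactly a month followed by whitespace, digits and a comma: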

r = re.compile(r"(\bJan\b)\s+\d+,|(\bFeb\b)\s+\d+,|(\bMar\b)\s+\d+,|(\bApr\b)\s+\d+,|"
               r"(\bMay\b)\s+\d+,|(\bJun\b)\s+\d+,|(\bJul\b)\s+\d+,|(\bAug\b)\s+\d+,|"
               r"(\bSep\b)\s+\d+,|(\bOct\b)\s+\d+,|(\bNov\b)\s+\d+,|(\bDec\b)\s+\d+,")
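
The stricter pattern can be written more compactly by building one alternation from a list of abbreviations. This is a sketch, not the answer's exact code; MONTHS and month_day are names introduced here for illustration.

```python
import re

MONTHS = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()

# One alternation instead of twelve repeated branches:
# month abbreviation, whitespace, digits, comma.
month_day = re.compile(r"\b(?:%s)\s+\d+," % "|".join(MONTHS))

line = "I bought it last May, great stuff. Oct 31, 2014 UTC"
result = month_day.sub("\t" + r"\g<0>", line)
print(result)
```

Here "May," in the free text is skipped because no digits follow it, while "Oct 31," gets a tab in front of it. The space that originally preceded the date is kept, so the tab comes after it.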
