Python read lines from file by regex

I have a text file that I want to read into a list in a specific format.

When I write:

with open('chat_history.txt', encoding='utf8') as f:
    mylist = [line.rstrip('\n') for line in f]

I get:

27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text

I want to get:

27/08/15, 15:45 - text continue text continue text 2
27/08/15, 16:10 - new text new text 2 new text 3
27/08/15, 19:55 - more text

I only want to split where the format \nDD/MM/YY, HH:MM - occurs. Unfortunately, I am no regex expert. I tried:

with open('chat_history.txt', encoding='utf8') as f:
    mylist = [line.rstrip('\n'r'[\d\d/\d\d/\d\d - ]') for line in f]

Same result. Thinking about it, it makes sense that this doesn't work. I would appreciate some help, though.
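A short sketch of why the attempt above cannot work: str.rstrip takes a set of characters to strip, not a pattern, so the regex-looking string is read as the individual characters newline, [, \, d, /, space, - and ]. On top of that, iterating over the file yields one line at a time, so no single line ever contains both a continuation and the next date. The variable name chars is only for illustration.

```python
# str.rstrip(chars) removes any trailing characters that occur in chars;
# it never interprets chars as a regular expression.
chars = '\n' + r'[\d\d/\d\d/\d\d - ]'  # the same string as in the attempt above

# A continuation line keeps its text and only loses the trailing newline ...
print(repr('new text 3\n'.rstrip(chars)))   # 'new text 3'

# ... but a line ending in 'd', '/', '-', ' ', '[' or ']' loses those too.
print(repr('continued\n'.rstrip(chars)))    # 'continue'
```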

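import re

with open('chat_history.txt', encoding='utf8') as f:
    # Read the whole file: iterating line by line never sees an interior '\n'.
    # Replace only the newlines that are not followed by a new date stamp.
    l = re.sub(r'\n(?!\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - )', ' ', f.read().rstrip('\n')).split('\n')

print(l)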

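Admittedly, this might be overkill, and I am sure there are other ways to achieve the same result. I would like to show a solution here that uses a (?(DEFINE)...) block with the newer regex module. Code first, explanation afterwards: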

import regex as re

string = """
27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text
"""

rx = re.compile(r'''
    (?(DEFINE)
        (?P<date>\d{2}/\d{2}/\d{2},\ \d{2}:\d{2}) # the date format
    )
    ^                    # anchor, start of the line
    (?&date)             # the previously defined format
    (?:(?!^(?&date)).)+  # "not date" as long as possible
''', re.M | re.X | re.S)


entries = (m.group(0).replace('\n', ' ') for m in rx.finditer(string))
for entry in entries:
    print(entry)

This yields:

27/08/15, 15:45 - text continue text continue text 2 
27/08/15, 16:10 - new text new text 2 new text 3 
27/08/15, 19:55 - more text 


Basically, this approach looks for date lines with blocks of text in between:

date
text1
text2
date
text3
date
text

... and puts them together like this:

date text1 text2
date text3
date text

The date format is defined in the date group; after that, the structure is as follows:

date "match as long as there's no date in the next line"

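This is achieved with a negative lookahead. Afterwards, every newline found is replaced with a space (in the generator expression, that is).
Obviously, the same result can be obtained without the regex module and the (?(DEFINE) block, but then we would have to repeat ourselves in the match and in the lookahead.
Lastly, see a demo on regex101.com for the expression.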
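As a rough sketch of the "repeat ourselves" variant mentioned above, the same idea can be written with the standard-library re module by interpolating the date pattern twice, once for the match and once inside the negative lookahead. The names DATE, text and entries are assumptions made for this illustration, and the trailing newline is stripped here so the entries carry no trailing space.

```python
import re

text = """27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text
"""

# The date format, written once and interpolated twice below.
DATE = r'\d{2}/\d{2}/\d{2}, \d{2}:\d{2}'

# A date at the start of a line, then any characters (including newlines)
# up to the position where the next line starts with a date.
rx = re.compile(rf'^{DATE}(?:(?!^{DATE}).)+', re.M | re.S)

entries = [m.group(0).rstrip('\n').replace('\n', ' ') for m in rx.finditer(text)]
for entry in entries:
    print(entry)
```

The stdlib re module has no (?(DEFINE)...) construct, which is why the pattern is assembled with an f-string instead.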

My solution uses a simpler regular expression than Jan's. The code around the regular expression is slightly more verbose, though.

First, the input file:

$ cat -e chat_history.txt
27/08/15, 15:45 - text$
continue text$
continue text 2$
27/08/15, 16:10 - new text$
new text 2$
new text 3$
27/08/15, 19:55 - more text$

The code:

import re

date_time_regex = re.compile(r'^\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - .*')

with open('chat_history.txt', encoding='utf8') as f:
    first_date = True
    for line in f:
        line = line.rstrip('\n')

        if date_time_regex.match(line):
            if not first_date:
                # Print a newline character before printing a date
                # if it is not the first date.
                print()
            else:
                first_date = False
        else:
            # Print a separator, without a newline character.
            print(' ', end='')

        # Print the original line, without a newline character.
        print(line, end='')

# Print the last newline character.
print()

Running the code (note that there are no trailing spaces):

$ python3 chat.py | cat -e
27/08/15, 15:45 - text continue text continue text 2$
27/08/15, 16:10 - new text new text 2 new text 3$
27/08/15, 19:55 - more text$
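Since the question asked for a list rather than printed output, the same line-by-line logic could also collect the merged entries. This is only a sketch under that assumption; the function name merge_entries and the constant DATE_LINE are made up for this example.

```python
import re

# A line that starts a new entry: "DD/MM/YY, HH:MM - "
DATE_LINE = re.compile(r'\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - ')

def merge_entries(lines):
    """Join continuation lines onto the preceding dated line."""
    entries = []
    for line in lines:
        line = line.rstrip('\n')
        if DATE_LINE.match(line) or not entries:
            # A new dated entry (or a leading line before any date).
            entries.append(line)
        else:
            # A continuation: append it to the current entry.
            entries[-1] += ' ' + line
    return entries

sample = [
    '27/08/15, 15:45 - text\n',
    'continue text\n',
    'continue text 2\n',
    '27/08/15, 16:10 - new text\n',
    'new text 2\n',
    'new text 3\n',
    '27/08/15, 19:55 - more text\n',
]
mylist = merge_entries(sample)
print(mylist)
```

With the real file, the call would be merge_entries(f) inside the with open(...) block, because a file object iterates over its lines.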