Split rows and add a delimiter to a text file using Python

Question

I'm cleaning up a messy .txt file (text IDs and raw text) for NLP analysis.

It currently looks like this:

@@0001 words 83 words, 90, words, 8989! @@0002 words, 98 words; words. @@0003 words 30 words ....

I'd like to convert it into a clean .txt or .csv format, with each text on its own line and a delimiter between the ID and the text.

ID   | text 
0001 | words 83 words, 90, words, 8989!
0002 | words, 98 words; words. 
0003 | words 30 words ....

The following code creates a .txt file with each text on its own line:

with open('/file_directory/file.txt', 'r') as file, open('/file_directory/file_cleaned.txt', 'w') as file2:
    for line in file:
        for word in line.split('@@'):
            file2.write(word + '\n')

For example:

0001 words 83 words, 90, words, 8989!
0002 words, 98 words; words. 
0003 words 30 words ....

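However, I can't figure out how to add the delimiter, because I haven't been able to match a specific series of integers or a specific integer length (e.g., numbers with 4 or more digits). Currently I'm trying to add the delimiter with a regex first and then split the lines, but I'm running into problems with both the regex and the file writing.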

import re
with open('/filedirectory/file.txt', 'r') as file, open('/filedirectory/file_cleaned.txt', 'w') as file2:
    text = file1.readlines()
    for line in text:
        text.re.split('^@\d{4,7}')
        for word in line.split('@@'):
           file2.write(word + '\n')

I get the error:

AttributeError: 'list' object has no attribute 're'

Any ideas would be much appreciated. Thanks!

Answer 1

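Regular expression objects and strings are two different types. For reference, the Python documentation lists every method of the string type and of the regex objects.

You get the error because you are trying to access a regex method on an object of type list: readlines() returns a list of strings, and re is a module, not an attribute of a list or a string.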
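To make the cause of the error concrete, here is a minimal sketch. The sample list stands in for what readlines() would return; its contents are made up for illustration and are not taken from the asker's file.

```python
import re

# Stand-in for the result of readlines(): a list of str (sample content is hypothetical).
lines = ["@@0001 first text @@0002 second text\n"]

# A list has no attribute named `re`, so this fails just like the code in the question.
try:
    lines.re
except AttributeError as exc:
    message = str(exc)

# `re` is a module: call its functions and pass the string in as an argument.
parts = re.split(r'@@', lines[0])

print(message)
print(parts)
```

The same applies to a single string: `line.re` would fail too, so the regex call always has the form `re.something(pattern, line)`.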

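For your purposes, though:

You can split a string with the string method str.split, which takes a literal separator (not a regular expression),
or you can use the re module to split on a pattern.

What your code does is mix the two.

You can either split on the literal marker:

splitlines = line.split('@@')

or split with a regular expression. Note that the pattern has to start with @@ rather than ^@, because each ID is preceded by two @ characters and most of them are not at the start of the line. The lookahead keeps the digits in the resulting pieces:

import re
splitlines = re.compile(r'@@(?=\d{4,7})').split(line)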
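If you want the ID and the text as separate values, one option is to put the digits in a capturing group, since re.split then keeps the captured text in its result. This is only a sketch: the 4 to 7 digit range comes from the pattern in the question, and the variable names are mine.

```python
import re

line = ("@@0001 words 83 words, 90, words, 8989! "
        "@@0002 words, 98 words; words. @@0003 words 30 words ....")

# With a capturing group, re.split returns the captured IDs between the text pieces.
pieces = re.split(r'@@(\d{4,7})', line)

# pieces[0] is whatever precedes the first ID (empty here);
# after that, IDs and texts alternate.
ids = pieces[1::2]
texts = [text.strip() for text in pieces[2::2]]

rows = [f"{text_id} | {text}" for text_id, text in zip(ids, texts)]
print("\n".join(rows))
```

Each entry in rows is one line of the form requested in the question, so writing them out is a matter of joining with newlines.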

Answer 2

Right, it goes without saying that a list object has no attribute re.

You can use

import re

with open('/file_directory/file.txt', 'r') as file, open('/file_directory/file_cleaned.txt', 'w') as file2:
    file2.write(re.sub(r'@@\d+', r'\n\g<0> | ', file.read()).lstrip())

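The regex matches @@ followed by one or more digits and replaces each match with a newline character, the whole match value, and a | character enclosed in single spaces. The final lstrip() removes the newline that was added before the first ID.

See the Python demo: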

import re
s = "@@0001 words 83 words, 90, words, 8989! @@0002 words, 98 words; words. @@0003 words 30 words ...."
print( re.sub(r'(@@\d+)', r'\n\1 | ', s).lstrip() )

Output:

@@0001 |  words 83 words, 90, words, 8989! 
@@0002 |  words, 98 words; words. 
@@0003 |  words 30 words ....
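The output above still carries the @@ prefix and has no header row. If you want something closer to the layout in the question, a rough sketch with the csv module could look like this. The helper name clean_file, the RECORD pattern, and the temporary demo files are my own assumptions, not part of the original code; point the function at your real paths instead.

```python
import csv
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: captures the digits after "@@" and the text up to
# the next "@@<digit>" marker or the end of the input.
RECORD = re.compile(r'@@(\d+)\s*(.*?)\s*(?=@@\d|\Z)', re.DOTALL)

def clean_file(src, dst):
    """Write one 'ID|text' row per record found in src (hypothetical helper)."""
    raw = Path(src).read_text(encoding='utf-8')
    with open(dst, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out, delimiter='|')
        writer.writerow(['ID', 'text'])
        for text_id, text in RECORD.findall(raw):
            # Collapse runs of whitespace (including newlines) so each record stays on one row.
            writer.writerow([text_id, ' '.join(text.split())])

# Demo on a temporary copy of the sample data from the question.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'file.txt'
    dst = Path(tmp) / 'file_cleaned.txt'
    src.write_text(
        "@@0001 words 83 words, 90, words, 8989! "
        "@@0002 words, 98 words; words. @@0003 words 30 words ....",
        encoding='utf-8',
    )
    clean_file(src, dst)
    result = dst.read_text(encoding='utf-8')

print(result)
```

Using csv.writer rather than plain string formatting means that a text which itself contains a | character gets quoted, so the file can still be read back unambiguously with csv.reader(..., delimiter='|').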
