Stripping comment lines from a string

Question

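I have a string variable that holds the contents of a large text file. Comments in the text file start with '#' and end at a newline.

What I want is to produce another string from this one, with every comment line (starting with '#' and ending with a newline) removed.

So I figured I could split the string along these lines: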

def transform_string(input):
    output = ''
    # Look for #
    sub_strs = input.split('#')
    for s in sub_strs:
        # Look for newline
        sub_sub_strs = s.split('\r\n')
        for j in sub_sub_strs:
            output += j

    return output

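However, it looks ugly, and I wonder whether there is a more elegant, Pythonic way to do it. It is also error-prone: since every '#' has one matching newline, I think I should split only at the first '\r\n' that follows it rather than at every '\r\n'.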

Answer 1

A regular expression will work:

# Python 2.7
import re

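def stripComment(text):
    # re.MULTILINE makes $ match at the end of every line, not only at the end of the string
    return re.sub(r'#.*$', '', text, flags=re.MULTILINE)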

print(stripComment("Hello there"))
# Hello there

print(stripComment("Hello #there"))
# Hello

This handles both whole-line comments and comments that start partway through a line (keeping whatever precedes the comment).
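
When a comment occupies a whole line, substituting only the comment text leaves that line's newline behind. As a minimal sketch, assuming whole-line comments start in the first column and that their newline should be removed too (`strip_comment_lines` is a made-up name for illustration):

```python
import re

# Hypothetical helper: drop every line whose first character is '#'.
# (?m) turns on multi-line mode, so ^ matches at the start of each line.
# [^\n]* consumes the rest of the line; \n? also removes its newline, if any.
COMMENT_LINE = re.compile(r'(?m)^#[^\n]*\n?')

def strip_comment_lines(text):
    return COMMENT_LINE.sub('', text)

sample = "keep 1\n# a comment\nkeep 2\n#last"
print(repr(strip_comment_lines(sample)))  # 'keep 1\nkeep 2\n'
```

Unlike the pattern above, this sketch deliberately leaves a '#' in the middle of a line alone.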

Answer 2

Since you mention that you are reading from a text file, it is better to do this while reading the file:

data = []
with open("input_file.txt") as f:
    for line in f:
        if not line.startswith("#"):
            data.append(line)

data = "".join(data)

The final join step is not ideal: if you can, process each line separately so that you do not need the whole file in memory.
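
To illustrate handling each line as it is read, here is a rough sketch that tallies the words in the non-comment lines without ever building the joined string. `count_words` is a hypothetical name, and `io.StringIO` merely stands in for an open file.

```python
import io

def count_words(lines):
    """Count whitespace-separated words in lines that do not start with '#'."""
    total = 0
    for line in lines:              # only one line is held at a time
        if line.startswith("#"):
            continue                # skip comment lines
        total += len(line.split())
    return total

# io.StringIO stands in for a file object; a file from open() iterates the same way.
fake_file = io.StringIO("alpha beta\n# ignored words here\ngamma\n")
print(count_words(fake_file))  # 3
```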

Answer 3

You can filter the lines with a comprehension (here, a generator expression):

>>> txt = """some lines
... #some commented
... some not
... #othe comment
... other line"""
>>> '\n'.join(line for line in txt.splitlines() if not line.startswith('#'))
'some lines\nsome not\nother line'
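
The question mentions '\r\n' line endings, which joining on '\n' would silently normalize. One possible variation, assuming Python 3 and that the original terminators should be preserved, asks `splitlines` to keep them and joins on the empty string:

```python
txt = "some lines\r\n#some commented\r\nsome not\r\n#other comment\r\nother line"

# keepends=True leaves each line's own terminator ('\r\n' here) attached,
# so joining on '' reproduces the surviving lines exactly.
kept = ''.join(
    line for line in txt.splitlines(keepends=True)
    if not line.startswith('#')
)
print(repr(kept))  # 'some lines\r\nsome not\r\nother line'
```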

Answer 4

A generator is probably the most Pythonic solution here:

def clean_input(filename):
    with open(filename, 'r') as f:
        for line in f:
            if not line.lstrip().startswith('#'):
                yield line

for line in clean_input('somefile.txt'):
    ...

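This lets you move comment stripping, or any other preprocessing you need, out of the actual file processing, where you simply iterate over the cleaned data.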
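
The same separation can be taken a step further by chaining small generators, each doing one preprocessing step. This is only a sketch: `skip_comments`, `strip_trailing_whitespace`, and `skip_blank` are hypothetical names, and each step takes any iterable of lines, so a file object works as well as a list.

```python
def skip_comments(lines):
    """Yield lines whose first non-blank character is not '#'."""
    for line in lines:
        if not line.lstrip().startswith('#'):
            yield line

def strip_trailing_whitespace(lines):
    """Yield each line without its trailing whitespace and newline."""
    for line in lines:
        yield line.rstrip()

def skip_blank(lines):
    """Yield only non-empty lines."""
    for line in lines:
        if line:
            yield line

raw = ["x = 1\n", "   # indented comment\n", "\n", "y = 2  \n"]
cleaned = skip_blank(strip_trailing_whitespace(skip_comments(raw)))
print(list(cleaned))  # ['x = 1', 'y = 2']
```

Because every stage is lazy, nothing is read or filtered until the final loop (or `list`) pulls lines through the chain.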
