Python: remove square brackets and the extraneous information between them
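I am trying to process a file and need to remove extraneous information from it. Specifically, I want to remove the square-bracket blocks: the brackets [] themselves, the text inside them, and everything between the first bracketed line and the closing [] line, while printing everything outside of them.
Below is my text file with a data sample: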
$ cat smb
Hi this is my config file.
Please dont delete it
[homes]
browseable = No
comment = Your Home
create mode = 0640
csc policy = disable
directory mask = 0750
public = No
writeable = Yes
[proj]
browseable = Yes
comment = Project directories
csc policy = disable
path = /proj
public = No
writeable = Yes
[]
This last second line.
End of the line.
Desired output:
Hi this is my config file.
Please dont delete it
This last second line.
End of the line.
What I have tried, based on my understanding and research:
$ cat test.py
with open("smb", "r") as file:
    for line in file:
        start = line.find( '[' )
        end = line.find( ']' )
        if start != -1 and end != -1:
            result = line[start+1:end]
            print(result)
Output:
$ ./test.py
homes
proj
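You can iterate over the lines of the file and collect them in a list, skipping everything from the first bracketed line through the closing [] line, then join the collected lines back together:
with open("smb", "r") as f:
    result = []
    skipping = False
    for line in f:
        stripped = line.strip()  # drop the trailing newline before testing
        if stripped.startswith("[") and stripped.endswith("]"):
            # a bracketed header starts the skipped region, a bare [] ends it
            skipping = stripped != "[]"
            continue
        if not skipping:
            result.append(line)
result = "".join(result)  # the lines still carry their own newlines
print(result)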
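The same line-by-line idea can be wrapped in a generator so that it works on any iterable of lines, not only on an open file. This is just a sketch: the function name strip_bracket_blocks and the inline sample are made up for illustration, and it assumes that a bare [] line always closes the region.

```python
def strip_bracket_blocks(lines):
    """Yield the lines that lie outside a [header] ... [] region."""
    skipping = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Every bracketed line is dropped; only a bare [] ends the region.
            skipping = stripped != "[]"
            continue
        if not skipping:
            yield line


# Invented sample, shorter than the file in the question.
sample = [
    "Hi this is my config file.\n",
    "[homes]\n",
    "public = No\n",
    "[]\n",
    "End of the line.\n",
]
cleaned = "".join(strip_bracket_blocks(sample))
print(cleaned)
```

Passing an open file object instead of the list should work the same way, since iterating a file also yields one line at a time.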
With the file read into a string,
extract = '''Hi this is my config file.
Please dont delete it
[homes]
browseable = No
comment = Your Home
create mode = 0640
csc policy = disable
directory mask = 0750
public = No
writeable = Yes
[proj]
browseable = Yes
comment = Project directories
csc policy = disable
path = /proj
public = No
writeable = Yes
[]
This last second line.
End of the line.
'''.split('\n[')[0]
gives you,
Hi this is my config file.
Please dont delete it
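.split('\n[') splits the string at every occurrence of the '\n[' character sequence, and [0] selects the descriptive lines above the first bracket. To keep the trailing lines as well:
with open("smb", "r") as f:
    extract = f.read()
tail = extract.split(']\n')
# text before the first '\n[' + newline + text after the last ']\n'
extract = extract.split('\n[')[0] + '\n' + tail[-1]
print(extract)
will read and output,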
Hi this is my config file.
Please dont delete it
This last second line.
End of the line.
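Using a regular expression:
import re
with open("smb", "r") as f:
    txt = f.read()
# replace the whole bracketed region with a single newline
txt = re.sub(r'(\n\[)(.*?)(\[]\n)', '\n', txt, flags=re.DOTALL)
print(txt)
Regex explanation:
(\n\[)  matches a newline followed by [
(\[]\n)  matches [] followed by a newline
(.*?)  lazily matches everything between (\n\[) and (\[]\n), which is the text that gets removed
re.DOTALL  makes . match newline characters too, so the match can span several lines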
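To see what the re.DOTALL flag contributes, here is a small sketch on an inline string (the sample text is invented for illustration). Without the flag, . stops at every newline, so the lazy group cannot reach the closing [] and nothing is replaced.

```python
import re

sample = "keep me\n[homes]\npublic = No\n[]\nkeep me too\n"
pattern = r'(\n\[)(.*?)(\[]\n)'

# With DOTALL, . also matches newlines, so the match spans several lines.
with_dotall = re.sub(pattern, '\n', sample, flags=re.DOTALL)

# Without DOTALL, .*? cannot cross a newline, so the pattern never matches.
without_dotall = re.sub(pattern, '\n', sample)

print(repr(with_dotall))
print(repr(without_dotall))
```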
Pandas update
The same logic can be applied with pandas:
import re
import pandas as pd
# read each line of the file (one row -> one line)
# note: newer pandas releases may reject '\n' as a separator
txt = pd.read_csv('smb', sep = '\n', header=None)
# join all the lines of the file, separating them with '\n'
txt = '\n'.join(txt[0].to_list())
# apply the regex to clean the text (the same as above)
txt = re.sub(r'(\n\[)(.*?)(\[]\n)', '\n', txt, flags=re.DOTALL)
print(txt)
Try r"(?s)\s*\[[^\[\]]*\](?:(?:(?!\[[^\[\]]*\]).)+\[[^\[\]]*\])*\s*"
and replace each match with r"\n".
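A minimal sketch of how that suggestion could be applied with re.sub. The pattern is the one quoted above; the sample string is invented and shorter than the file in the question.

```python
import re

pattern = re.compile(
    r"(?s)\s*\[[^\[\]]*\](?:(?:(?!\[[^\[\]]*\]).)+\[[^\[\]]*\])*\s*"
)

sample = (
    "Hi this is my config file.\n"
    "[homes]\n"
    "public = No\n"
    "[proj]\n"
    "path = /proj\n"
    "[]\n"
    "End of the line.\n"
)

# Everything from the first bracketed item through the last one,
# including the surrounding whitespace, collapses into one newline.
cleaned = pattern.sub("\n", sample)
print(cleaned)
```

Unlike the previous pattern, this one does not look for the bare [] marker; it simply chains bracketed items together until no further one follows.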
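Since you tagged pandas, let's give it a try:
import numpy as np
import pandas as pd
# '----' never occurs in the file, so every line ends up in a single column
df = pd.read_csv('smb', sep='----', header=None, engine='python')
# mark rows that start with `[`
s = df[0].str.startswith('[')
# drop the rows from the first `[` line through the last one
df = df.drop(np.arange(s.idxmax(), s[::-1].idxmax()+1))
# write to file if needed
df.to_csv('clean.txt', header=None, index=None)
Output (df):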
0
0 Hi this is my config file.
1 Please dont delete it
18 This last second line.
19 End of the line.
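If I understood correctly, you want everything before the first [ and after the last ]. If that is not the case, let me know and I will change my answer.
with open("smb", "r") as f:
    s = f.read()
head = s[:s.find('[')]
tail = s[s.rfind(']') + 1:]
# strip the newlines around the removed region before joining
result = head.strip("\n") + "\n" + tail.strip("\n")
print(result)
This gives you the desired output.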
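One edge case worth keeping in mind with find and rfind: both return -1 when the character is missing, and slicing with -1 silently produces the wrong pieces. Below is a sketch of the same approach as a function with a guard for that case; the name outside_brackets is hypothetical and the sample strings are invented.

```python
def outside_brackets(text):
    """Return the text before the first '[' and after the last ']'."""
    first = text.find("[")
    last = text.rfind("]")
    if first == -1 or last == -1:
        # No bracket pair: find/rfind returned -1, so leave the text alone.
        return text
    head = text[:first].strip("\n")
    tail = text[last + 1:].strip("\n")
    return head + "\n" + tail


print(outside_brackets("before\n[homes]\npublic = No\n[]\nafter\n"))
print(outside_brackets("no brackets here\n"))
```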
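Another option is to first match a bracketed line such as [homes], and then match all following lines that do not consist solely of [], because that is the closing marker.
You can get the match without (?s) or re.DOTALL, which prevents unnecessary backtracking, and replace the match with an empty string.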
^\s*\[[^][]*\](?:\r?\n(?![^\S\r\n]*\[]$).*)*\r?\n[^\S\r\n]*\[]$\s*
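Explanation
^  Start of the line
\s*  Match 0+ whitespace characters
\[[^][]*\]  Match [, then 0+ characters other than [ or ], then ]
(?:  Non-capturing group
  \r?\n  Match a newline
  (?!  Negative lookahead, assert that what is directly to the right is not
    [^\S\r\n]*\[]$  Match 0+ whitespace characters except newlines, then [] at the end of the line
  )  Close the lookahead
  .*  Match any character except a newline, 0+ times
)*  Close the non-capturing group and repeat it 0+ times
\r?\n  Match a newline
[^\S\r\n]*  Match 0+ whitespace characters except newlines
\[]$  Match [] and assert the end of the line
\s*  Match 0+ whitespace characters
Code example
import re
regex = r"^\s*\[[^][]*\](?:\r?\n(?![^\S\r\n]*\[]$).*)*\r?\n[^\S\r\n]*\[]$\s*"
with open("smb", "r") as file:
    data = file.read()
# pass the flag by keyword; re.MULTILINE lets ^ and $ match at every line
result = re.sub(regex, "", data, flags=re.MULTILINE)
print(result)
Output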
Hi this is my config file.
Please dont delete it
This last second line.
End of the line.
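Because the pattern is anchored to a [name] line and stops at the first bare [] line, it should remove each region separately when a text contains more than one. A small sketch on an invented sample (the sample is not from the question):

```python
import re

regex = r"^\s*\[[^][]*\](?:\r?\n(?![^\S\r\n]*\[]$).*)*\r?\n[^\S\r\n]*\[]$\s*"

# Two separate [name] ... [] regions with an ordinary line between them.
sample = "a\n[x]\nk = 1\n[]\nb\n[y]\nk = 2\n[]\nc\n"

# Each region is matched and removed on its own,
# so the line "b" between the two regions survives.
cleaned = re.sub(regex, "", sample, flags=re.MULTILINE)
print(repr(cleaned))
```

This differs from the approaches that cut from the first [ to the last ], which would also drop the line between the two regions.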
This is probably one of the most concise ways to do it.
import re
from pathlib import Path
res = '\n'.join(re.findall(r'^\w.*', Path('smb').read_text(), flags=re.M))
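Explanation:
Path creates a Path object for the file. Path.read_text() opens the file, reads its text, and closes it. The file contents are passed to re.findall, which uses the re.M flag to test every line in the file against the pattern '^\w.*'. The pattern only accepts lines that start with a word character, which eliminates lines that start with whitespace or a square bracket. Note that this relies on the key/value lines being indented in the actual file.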
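A short sketch of how this filter behaves, using an invented sample in which the key/value line is indented (an assumption about the real file). It also shows that the same line is kept once the indentation is gone.

```python
import re

# Hypothetical sample in which the key/value line is indented.
sample = (
    "Hi this is my config file.\n"
    "[homes]\n"
    "    public = No\n"
    "[]\n"
    "End of the line.\n"
)

# Only lines whose first character is a word character are kept.
kept = re.findall(r"^\w.*", sample, flags=re.M)
print("\n".join(kept))

# If the same key/value line is not indented, it is kept as well.
unindented = sample.replace("    public", "public")
kept_unindented = re.findall(r"^\w.*", unindented, flags=re.M)
print(kept_unindented)
```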
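You can test this pattern on Regex101:
(^\W)+?\[[\w\W]+?\[\](\W)+?(\w)
In code it looks like this:
import re
with open("smb", "r") as f:
    input_string = f.read()  # the string in which to replace
# The pattern needs a line that starts with a non-word character (for example
# a blank line) directly before the first [. Group 3 captures the first word
# character after the removed region, so the replacement puts it back.
result = re.sub(r"(^\W)+?\[[\w\W]+?\[\](\W)+?(\w)", r"\3", input_string, flags=re.MULTILINE)
Cheers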
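Since you tagged pandas and stated that text appears both before and after the square brackets, we can use str.contains together with boolean filtering to drop the rows between the first and the last square bracket.
import pandas as pd
# your_file is the path to the input file
df = pd.read_csv(your_file, sep='\t', header=None)
idx = df[df[0].str.contains(r'\[')].index
df1 = df.loc[~df.index.isin(range(idx[0], idx[-1] + 1))]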
0
0 Hi this is my config file.
1 Please dont delete it
18 This last second line.
19 End of the line.
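Your indexing is off. Other than that, the code looks fine.
Try:
start = -1
end = -1
targ = ""
with open("smb", "r") as file:
    for line in file:
        if start == -1 and "[" in line:
            # offset of the first "[" within the accumulated text
            start = len(targ) + line.index("[")
        if "]" in line:
            # offset of the last "]" seen so far
            end = len(targ) + line.index("]")
        targ = targ + line
if start != -1 and end != -1:
    # end + 2 also skips the newline that follows the last "]"
    targ = targ[:start] + targ[end + 2:]
print(targ)
This should work. Let me know if there are any problems. :)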
Using pandas:
import pandas as pd
df = pd.read_csv('smb.txt', sep='----', header=None, engine='python', names=["text"])
res = df.loc[~df.text.str.contains(r"=|\[.*\]")]
print(res)
text
0 Hi this is my config file.
1 Please dont delete it
18 This last second line.
19 End of the line.
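Explanation: exclude the rows that contain =, or that contain an opening bracket ([) followed by any characters (.*) and a closing bracket (]). The backslash (\) tells the regular expression engine not to treat the bracket as a special character.
Using only Python, with the same regular expression pattern and one extra line to handle empty entries:
import re
with open('smb.txt') as myfile:
    content = myfile.readlines()
pattern = re.compile(r"=|\[.*\]")
res = [ent.strip() for ent in content if not pattern.search(ent)]
res = [ent for ent in res if ent != ""]
print(res)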
['Hi this is my config file.',
'Please dont delete it',
'This last second line.',
'End of the line.']