使用 Python 提取嵌套括号中的句子
Extract sentences in nested parentheses using Python
我在一个目录中有多个 .txt
文件。
这是我的 .txt
个文件中 一个 的示例:
kkkkk;
select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
quit;
/* 1.xxxxx FROM xxxx_x_Ex_x */
proc sql; ("TRUuuuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
);
jjjjjj;
select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
quit;
/* 1.xxxxx FROM xxxx_x_Ex_x */ ()
proc sql; ("CUuuiiiiuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
);
我正在尝试提取我的 .txt
文件中嵌套括号中的所有句子。
我已经尝试了多种方法,例如 stacking 括号,但是当代码通过 .txt
文件之一进行解析时,我收到一个错误,提示 "list index out of range"。我猜是因为括号里什么都没有写。
我也一直在尝试使用 regex,使用此代码:
with open('lan sample text file.txt','r') as fd:
lines = fd.read()
check = set()
check.add("Select")
check.add("select")
check.add("SELECT")
check.add("from")
check.add("FROM")
check.add("From")
items=re.findall("(\(.*)\)",lines,re.MULTILINE)
for x in items:
print(x)
但我的输出是:
("xE'", PUT(xx.xxxx.),"'"
("TRUuuuth"
((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.)
("xE'", PUT(xx.xxxx.),"'"
("CUuuiiiiuth"
((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.)
我想要的输出应该是这样的:
("xE'", PUT(xx.xxxx.),"'")
("TRUuuuth")
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
)
("xE'", PUT(xx.xxxx.),"'")
("CUuuiiiiuth")
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
)
我会说我的解决方案不是优化的,但它会解决你的问题。
解决方案(只需将 test.txt 替换为您的文件名即可)
result = []
with open('test.txt','r') as fd:
# To keep track of '(' and ')' parentheses
parentheses_stack = []
# To keep track of complete word wrapped by ()
complete_word = []
# Iterate through each line in file
for words in fd.readlines():
# Iterate each character in a line
for char in list(words):
# Initialise the parentheses_stack when you find the first open '('
if char == '(':
parentheses_stack.append(char)
# Pop one open '(' from parentheses_stack when you find a ')'
if char == ')':
if not parentheses_stack = []:
parentheses_stack.pop()
if parentheses_stack == []:
complete_word.append(char)
# Collect characters in between the first '(' and last ')'
if not parentheses_stack == []:
complete_word.append(char)
else:
if not complete_word == []:
# Push the complete_word once you poped all '(' from parentheses_stack
result.append(''.join(complete_word))
complete_word = []
for res in result:
print(res)
结果:
WS:python rameshrv$ python3 /Users/rameshrv/Documents/python/test.py
("xE'", PUT(xx.xxxx.),"'")
("TRUuuuth")
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
)
("xE'", PUT(xx.xxxx.),"'")
()
("CUuuiiiiuth")
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
)
正如我所说,这是 Python: How to match nested parentheses with regex? 的副本,它显示了几种处理嵌套括号的方法,并非所有方法都是基于正则表达式的。一种方法确实需要 PYPI 存储库中的 regex
模块。如果 text
包含文件的内容,那么下面应该做你想做的:
import regex as re
text = """kkkkk;
select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
quit;
/* 1.xxxxx FROM xxxx_x_Ex_x */
proc sql; ("TRUuuuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
);
jjjjjj;
select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
quit;
/* 1.xxxxx FROM xxxx_x_Ex_x */ ()
proc sql; ("CUuuiiiiuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
);"""
regex = re.compile(r"""
(?<rec> #capturing group rec
\( #open parenthesis
(?: #non-capturing group
[^()]++ #anything but parenthesis one or more times without backtracking
| #or
(?&rec) #recursive substitute of group rec
)*
\) #close parenthesis
)
""", flags=re.VERBOSE)
for m in regex.finditer(text):
groups = m.captures('rec')
group = groups[-1] # the last group is the outermost nesting
if re.match(r'^\(+\s*\)+$', group):
continue # not interested in empty parentheses such as '( )'
print(group)
打印:
("xE'", PUT(xx.xxxx.),"'")
("TRUuuuth")
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
)
("xE'", PUT(xx.xxxx.),"'")
("CUuuiiiiuth")
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
)
我在一个目录中有多个 .txt
文件。
这是我的 .txt
个文件中 一个 的示例:
kkkkk;
select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
quit;
/* 1.xxxxx FROM xxxx_x_Ex_x */
proc sql; ("TRUuuuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
);
jjjjjj;
select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
quit;
/* 1.xxxxx FROM xxxx_x_Ex_x */ ()
proc sql; ("CUuuiiiiuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
);
我正在尝试提取我的 .txt
文件中嵌套括号中的所有句子。
我已经尝试了多种方法,例如 stacking 括号,但是当代码通过 .txt
文件之一进行解析时,我收到一个错误,提示 "list index out of range"。我猜是因为括号里什么都没有写。
我也一直在尝试使用 regex,使用此代码:
with open('lan sample text file.txt','r') as fd:
lines = fd.read()
check = set()
check.add("Select")
check.add("select")
check.add("SELECT")
check.add("from")
check.add("FROM")
check.add("From")
items=re.findall("(\(.*)\)",lines,re.MULTILINE)
for x in items:
print(x)
但我的输出是:
("xE'", PUT(xx.xxxx.),"'"
("TRUuuuth"
((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.)
("xE'", PUT(xx.xxxx.),"'"
("CUuuiiiiuth"
((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.)
我想要的输出应该是这样的:
("xE'", PUT(xx.xxxx.),"'")
("TRUuuuth")
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
)
("xE'", PUT(xx.xxxx.),"'")
("CUuuiiiiuth")
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
)
我会说我的解决方案不是优化的,但它会解决你的问题。
解决方案(只需将 test.txt 替换为您的文件名即可)
result = []
with open('test.txt','r') as fd:
# To keep track of '(' and ')' parentheses
parentheses_stack = []
# To keep track of complete word wrapped by ()
complete_word = []
# Iterate through each line in file
for words in fd.readlines():
# Iterate each character in a line
for char in list(words):
# Initialise the parentheses_stack when you find the first open '('
if char == '(':
parentheses_stack.append(char)
# Pop one open '(' from parentheses_stack when you find a ')'
if char == ')':
if not parentheses_stack = []:
parentheses_stack.pop()
if parentheses_stack == []:
complete_word.append(char)
# Collect characters in between the first '(' and last ')'
if not parentheses_stack == []:
complete_word.append(char)
else:
if not complete_word == []:
# Push the complete_word once you poped all '(' from parentheses_stack
result.append(''.join(complete_word))
complete_word = []
for res in result:
print(res)
结果:
WS:python rameshrv$ python3 /Users/rameshrv/Documents/python/test.py
("xE'", PUT(xx.xxxx.),"'")
("TRUuuuth")
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
)
("xE'", PUT(xx.xxxx.),"'")
()
("CUuuiiiiuth")
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
)
正如我所说,这是 Python: How to match nested parentheses with regex? 的副本,它显示了几种处理嵌套括号的方法,并非所有方法都是基于正则表达式的。一种方法确实需要 PYPI 存储库中的 regex
模块。如果 text
包含文件的内容,那么下面应该做你想做的:
import regex as re
text = """kkkkk;
select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
quit;
/* 1.xxxxx FROM xxxx_x_Ex_x */
proc sql; ("TRUuuuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
);
jjjjjj;
select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
quit;
/* 1.xxxxx FROM xxxx_x_Ex_x */ ()
proc sql; ("CUuuiiiiuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
);"""
regex = re.compile(r"""
(?<rec> #capturing group rec
\( #open parenthesis
(?: #non-capturing group
[^()]++ #anything but parenthesis one or more times without backtracking
| #or
(?&rec) #recursive substitute of group rec
)*
\) #close parenthesis
)
""", flags=re.VERBOSE)
for m in regex.finditer(text):
groups = m.captures('rec')
group = groups[-1] # the last group is the outermost nesting
if re.match(r'^\(+\s*\)+$', group):
continue # not interested in empty parentheses such as '( )'
print(group)
打印:
("xE'", PUT(xx.xxxx.),"'")
("TRUuuuth")
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
)
("xE'", PUT(xx.xxxx.),"'")
("CUuuiiiiuth")
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and
(xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
)