Python regex - get items from orgmode files
I have the following org-mode syntax:
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Technology [1/6]
- [X] Introduction to Networking - Charles Severance
- [ ] A Tour of C++ - Bjarne Stroustrup
- [ ] C++ How to Program - Paul Deitel
- [X] Computer Systems - Randal Bryant
- [ ] The C programming language - Brian Kernighan
- [ ] Beginning Linux Programming -Matthew and Stones
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2
I want to extract the items, for example:
getitems "Hardware"
I should get:
- [ ] adapt a programmable motor to a tripod to be used for panning
If I ask for "Reading - Health", I should get:
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2
I am using the following pattern:
pattern = re.compile("\*\* "+ head + " (.+?)\*?$", re.DOTALL)
The output when requesting "Reading - Technology" is:
- [X] Introduction to Networking - Charles Severance
- [ ] A Tour of C++ - Bjarne Stroustrup
- [ ] C++ How to Program - Paul Deitel
- [X] Computer Systems - Randal Bryant
- [ ] The C programming language - Brian Kernighan
- [ ] Beginning Linux Programming -Matthew and Stones
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2
I also tried:
pattern = re.compile("\*\* "+ head + " (.+?)[\*|\z]", re.DOTALL)
This second pattern works for every header except the last one.
The output when requesting "Reading - Health":
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
As you can see, it does not match the last line.
I am using Python 2.7 and findall.
I'm not sure you need a regex for the whole match. I would use a regex only to match the ** line, then return lines until you see the next ** line.
Something like:
import re

# re.escape keeps characters in `head` (e.g. "+") from being read as regex syntax
pattern = re.compile(r"\*\* " + re.escape(head))
start = False
output = []
for line in my_file:
    if pattern.match(line):
        start = True
        continue
    elif start and line.startswith("**"):  # next heading after the one we want
        break
    if start:
        output.append(line)
# now `output` should have the lines you want
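For a quick check, the loop above can be wrapped in a function and run against a small sample. This is only a sketch: the name `get_items`, the `SAMPLE` text, and the use of `splitlines()` in place of a real file are assumptions for illustration, not part of the original answer.

```python
import re

def get_items(lines, head):
    """Return the lines under the `** head` heading, up to the next `**` heading."""
    pattern = re.compile(r"\*\* " + re.escape(head))
    output = []
    started = False
    for line in lines:
        if pattern.match(line):
            # Found the requested heading; start collecting from the next line
            started = True
            continue
        if started and line.startswith("**"):
            # Reached the following heading; stop collecting
            break
        if started:
            output.append(line)
    return output

# Shortened sample in the same layout as the question's file
SAMPLE = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Convict Conditioning 1 and 2
"""

items = get_items(SAMPLE.splitlines(), "Reading - Health")
print("\n".join(items))
```

Because the loop only stops at a `**` line after the requested heading has been seen, the last section of the file is returned in full, which is the case the regex attempts in the question struggled with.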
If you are sure the character * does not occur in your items, you can use:
re.compile(r"\*\* "+head+r" \[\d+/\d+\]\n([^*]+)\*?")
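A minimal sketch of how that pattern might be used with findall follows. The helper name `find_items` and the shortened `TEXT` sample are assumptions for illustration, and `re.escape` is added here so that a heading containing regex metacharacters is matched literally.

```python
import re

# Shortened sample in the same layout as the question's file
TEXT = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Technology [1/6]
- [X] Introduction to Networking - Charles Severance
- [ ] A Tour of C++ - Bjarne Stroustrup
** Reading - Health [3/4]
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2
"""

def find_items(text, head):
    # [^*]+ consumes everything up to the next "*" or the end of the text,
    # so the final section is captured in full as well
    pattern = re.compile(r"\*\* " + re.escape(head) + r" \[\d+/\d+\]\n([^*]+)\*?")
    return pattern.findall(text)

print(find_items(TEXT, "Reading - Health"))
```

The captured block keeps its trailing newline. As stated above, this only works when no item contains a `*`, since the capture stops at the first one.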
You could achieve this with:
import re
string = """
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Technology [1/6]
- [X] Introduction to Networking - Charles Severance
- [ ] A Tour of C++ - Bjarne Stroustrup
- [ ] C++ How to Program - Paul Deitel
- [X] Computer Systems - Randal Bryant
- [ ] The C programming language - Brian Kernighan
- [ ] Beginning Linux Programming -Matthew and Stones
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2
"""
def getitems(section):
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    try:
        items = rx.search(string)
        return items.group('block')
    except AttributeError:
        # search() returned None: no such section
        return None

items = getitems('Reading - Technology')
print(items)
The heart of the code is the (condensed) expression:
^\*{2}.+[\n\r] # match the beginning of the line, followed by two stars, anything else in between and a newline
(?P<block> # open group "block"
(?: # non-capturing group
(?!^\*{2}) # a neg. lookahead, making sure no ** follows at the beginning of a line
[\s\S] # any character...
)+ # ...at least once
) # close group "block"
In the actual code, the search string is inserted after the **. See a demo for Reading - Technology on regex101.com.
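The commented breakdown can also be compiled as written by adding the `re.VERBOSE` flag, which keeps the explanation next to the pattern. The following is a sketch under a few assumptions: `build_section_regex` and `TEXT` are hypothetical names, and the literal space after the two stars is written as `[ ]` because verbose mode ignores unescaped whitespace (`re.escape` already escapes any spaces inside the section name).

```python
import re

def build_section_regex(section):
    """Compile the section-matching pattern in verbose form."""
    return re.compile(
        r"""
        ^\*{2}[ ]""" + re.escape(section) + r""".+[\n\r]   # heading line: two stars, the name, rest of line
        (?P<block>              # open group "block"
            (?:                 # non-capturing group
                (?!^\*{2})      # neg. lookahead: not at the start of another heading line
                [\s\S]          # any character, newlines included...
            )+                  # ...at least once
        )                       # close group "block"
        """,
        re.MULTILINE | re.VERBOSE,
    )

# Shortened sample in the same layout as the question's file
TEXT = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2
"""

match = build_section_regex("Reading - Health").search(TEXT)
print(match.group("block") if match else None)
```

The lookahead is checked before every character, so the block ends either just before the next line that starts with `**` or at the end of the text.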
As a follow-up, you could also return only the selected values, like so:
def getitems(section, selected=None):
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    try:
        items = rx.search(string).group('block')
    except AttributeError:
        # search() returned None: no such section
        return None
    if selected:
        # Keep only the checked items; leading indentation is optional
        rxi = re.compile(r'^[ \t]*- \[X\] (.+)', re.MULTILINE)
        return rxi.findall(items)
    return items

items = getitems('Reading - Health', selected=True)
print(items)