How to write a regular expression to extract years
How can I write a regular expression that extracts years from text? The years may appear in the following forms:
Case 1:
1970 - 1980 --> 1970, 1980
January 1920 - Feb 1930 --> 1920, 1930
May 1920 to September 1930 --> 1920, 1930
Case 2:
July 1945 --> 1945
Writing a regular expression for Case 1 is easy, but how do I handle Case 2?
\d{4} \s? (?: [^a-zA-Z0-9] | to) \s? \w+? \d{4}
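Regex: .*?([0-9]{4})(?:.*?([0-9]{4}))?
or .*?(\d{4})(?:.*?(\d{4}))?
Details:
() capturing group
(?:) non-capturing group
{n} matches exactly n times
.*? matches any character between zero and unlimited times, as few times as possible (lazy)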
Python code:
import re

def Years(text):
    # findall returns one tuple per match, holding the two capture groups
    return re.findall(r'.*?([0-9]{4})(?:.*?([0-9]{4}))?', text)

print(Years('January 1920 - Feb 1930'))
Output:
[('1920', '1930')]
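For Case 2 the optional second group does not participate in the match, so re.findall reports it as an empty string. The sketch below shows that behavior and one possible way to drop the empty slot. The helper name years_nonempty is hypothetical and not part of the original answer.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r'.*?([0-9]{4})(?:.*?([0-9]{4}))?')

def years_nonempty(text):
    """Hypothetical helper: flatten the group tuples and skip empty groups."""
    return [year for groups in PATTERN.findall(text) for year in groups if year]

print(PATTERN.findall('July 1945'))                   # [('1945', '')]
print(years_nonempty('July 1945'))                    # ['1945']
print(years_nonempty('May 1920 to September 1930'))   # ['1920', '1930']
```

The flattening step is only needed because the pattern has two capture groups. A pattern with no groups returns plain strings.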
Based on your requirements, you only need to match every 4-digit number:
import re
s = '''1970 - 1980
January 1920 - Feb 1930
May 1920 to September 1930
July 1945'''
p = re.compile(r'\b\d{4}\b')
s = s.splitlines()
for x in s:
    result = p.findall(x)
    print(result)
Output:
['1970', '1980']
['1920', '1930']
['1920', '1930']
['1945']
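The word boundaries in \b\d{4}\b matter: they keep the pattern from picking four digits out of a longer digit run or out of a token that mixes letters and digits. A minimal sketch of that effect follows. The function name extract_years and the extra sample strings are assumptions made for illustration.

```python
import re

# Same pattern as in the answer above: exactly four digits between word boundaries.
YEAR = re.compile(r'\b\d{4}\b')

def extract_years(text):
    """Hypothetical wrapper: return every standalone 4-digit number in text."""
    return YEAR.findall(text)

print(extract_years('July 1945'))                    # ['1945']
print(extract_years('order 12345 shipped in 1999'))  # ['1999']
print(extract_years('abc1999'))                      # []
```

Note that this still accepts any standalone 4-digit number, so a value such as a 4-digit ID would also be returned as a year.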