How to extract age and gender from Reddit post titles?
I am trying to scrape Reddit posts from several subreddits, where many of the post titles have this form:
s1 = "I [22M] and my partner (21F) are foo and bar"
s2 = "My (22m) and my partner (21m) are bar and foo"
I would like to write a function that parses each string and returns the age and gender pairs. So:
def parse(s1):
    ...
    return [(22, "male"), (21, "female")]
Essentially, each age/gender tag is a two-digit number followed by f, F, m, or M.
We can try using re.findall here:
import re
s1 = "I [22m] and my partner (21F) are foo and bar"
# Two capture groups (age, gender) so that findall returns tuples
matches = re.findall(r'(?:[\[(](\d+)([MF])[\])])', s1, re.IGNORECASE)
print(matches)
[('22', 'm'), ('21', 'F')]
You can try extracting the matches with this regular expression:
(?:[\[\(])(\d{1,2})([MF])(?:[\]\)]) /i
For the Python part, I would recommend the findall method of re:
import re
def parse(title):
    return re.findall(r'(?:\[|\()(\d{1,2})([MF])(?:\]|\))', title, re.IGNORECASE)
title = 'I [22M] and my partner (21F) are foo and bar'
matches = parse(title)
print(matches)
Edit:
You can modify the regular expression as follows to accommodate the new requirements you mentioned in the comments:
(?:[\[\(])(\d{1,2})\s?([MF]|male|female)(?:[\]\)]) /i
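The patterns above return string tuples such as ('22', 'M'), while the question asks for pairs like (22, "male"). A minimal sketch of that last conversion step follows, assuming the only gender tokens are m, f, male, and female in any case; the names TAG_RE and label are illustrative and not from the original posts.

```python
import re

# Bracketed tag: an opening [ or (, a 1-2 digit age, an optional space,
# a gender token, then a closing ] or ).
# The longer alternatives ("female", "male") are listed before the single letters.
TAG_RE = re.compile(r'[\[(](\d{1,2})\s?(female|male|f|m)[\])]', re.IGNORECASE)

def parse(title):
    """Return a list of (age, gender) pairs found in a post title."""
    pairs = []
    for age, gender in TAG_RE.findall(title):
        # Normalise "M"/"m"/"male" to "male"; every other token is a female variant.
        label = "male" if gender.lower().startswith("m") else "female"
        pairs.append((int(age), label))
    return pairs

print(parse("I [22M] and my partner (21F) are foo and bar"))
# [(22, 'male'), (21, 'female')]
print(parse("My (22m) and my partner (21 female) are bar and foo"))
# [(22, 'male'), (21, 'female')]
```

Like the patterns above, this does not check that the opening and closing brackets are of the same kind, so a tag such as [22M) would also be accepted.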
You can use a regex with re:
>>> import re
>>> re.findall(r'(?<=\[|\()[^\)\]]+', s1) # find text within () or []
['22M', '21F']
>>> re.findall(r'\d+', '22M') # find age
['22']
>>> re.findall(r'[fFmM]+', '22M') # find gender
['M']
This site is great for learning and practicing regular expressions: https://regex101.com/
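The two-step idea in this answer (pull out the bracketed text first, then look at each piece) could be combined into one function roughly as follows. This is only a sketch: the name parse_two_step is hypothetical, and it swaps the separate age and gender searches for a single re.fullmatch on each piece so that unrelated brackets such as "(from 2019)" are skipped.

```python
import re

def parse_two_step(title):
    """Collect bracketed text first, then read age and gender from each piece."""
    pairs = []
    # Step 1: capture the text inside (...) or [...]
    for chunk in re.findall(r'[\[(]([^\])]+)[\])]', title):
        # Step 2: the whole piece must be digits followed by one gender letter
        m = re.fullmatch(r'(\d+)\s*([fFmM])', chunk)
        if m:
            pairs.append((int(m.group(1)), m.group(2).upper()))
    return pairs

print(parse_two_step("I [22M] and my partner (21F) are foo and bar"))
# [(22, 'M'), (21, 'F')]
```

Splitting the work this way keeps each pattern short, at the cost of scanning every bracketed piece twice.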