Python pandas DataFrame words in context: get 3 words before and after
I am working in a Jupyter notebook and have a pandas DataFrame "data":
Question_ID | Customer_ID | Answer
1 | 234 | Data is very important to use because ...
2 | 234 | We value data since we need it ...
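I want to go through the text in the "Answer" column and get the three words before and after the word "data".
So in this case I would get "is very important"; "We value", "since we need".
Is there a good way to do this within a pandas DataFrame? So far I have only found solutions where "Answer" is its own file that is run through Python code (without a pandas DataFrame). While I realize I need to use the NLTK library, I haven't used it before, so I don't know what the best approach would be. (This is a good example: Extracting a word and its prior 10 word context to a dataframe in Python)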
A solution using a generator expression, re.findall and itertools.chain.from_iterable:
import itertools
import re
import pandas as pd

data = pd.read_csv('test.csv')  # replace with your own file path
# group 1: up to three words before "data"; group 2: up to three words after it
pattern = r'(\w*?\s*\w*?\s*\w*?\s+)(?=\bdata\b)|(?<=\bdata\b)(\s+\w*\s*\w*\s*\w*)'
data_adjacents = ((i for sublist in (list(filter(None, t)) for t in re.findall(pattern, l, re.I))
                   for i in sublist)
                  for l in data.Answer.tolist())
print(list(itertools.chain.from_iterable(data_adjacents)))
Output:
[' is very important', 'We value ', ' since we need']
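If the regex feels hard to maintain, a token-based variant may be easier to read. The sketch below is only one possible way to do it: the helper name words_in_context, its parameters and the new "context" column are made up for illustration, and the DataFrame is built inline from the two sample rows in the question instead of being read from a file.

```python
import re

import pandas as pd


def words_in_context(text, keyword="data", n=3):
    """Return a (before, after) pair of strings for each occurrence of keyword.

    Hypothetical helper: words are taken as runs of \\w characters, and the
    keyword is compared case-insensitively against whole tokens only.
    """
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            before = tokens[max(0, i - n):i]   # up to n words before the hit
            after = tokens[i + 1:i + 1 + n]    # up to n words after the hit
            hits.append((" ".join(before), " ".join(after)))
    return hits


# Sample frame mirroring the question; replace with your own DataFrame
data = pd.DataFrame({
    "Question_ID": [1, 2],
    "Customer_ID": [234, 234],
    "Answer": [
        "Data is very important to use because ...",
        "We value data since we need it ...",
    ],
})

# One list of (before, after) pairs per row
data["context"] = data["Answer"].apply(words_in_context)
print(data["context"].tolist())
# [[('', 'is very important')], [('We value', 'since we need')]]
```

Keeping the result as a column means each context stays next to its Question_ID and Customer_ID, which the flattened list above does not preserve.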
This might work:
import pandas as pd
import re

df = pd.read_csv('data.csv')
for value in df.Answer.values:
    non_data = re.split('Data|data', value)  # split the text, removing "data"
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    # note: this keeps the first three words of every piece, including the piece before "data"
    substrs = [term.split()[0:3] for term in terms_list]
    result = [' '.join(term) for term in substrs]  # join the words back into substrings
    print(result)
Output:
['is very important']
['We value', 'since we need']
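One thing to watch with the split approach: taking the first three words of every piece only gives the right "before" context when fewer than four words precede "data". A possible adjustment, sketched below with a made-up helper name split_context, takes the last n words of the piece before each hit and the first n words of the piece after it. It also splits on the whole word only, so "database" is left alone. It assumes n is at least 1.

```python
import re


def split_context(text, keyword="data", n=3):
    """Hypothetical helper: (before, after) strings around each whole-word hit."""
    # \b keeps "database" intact; IGNORECASE covers "Data" and "data"
    pattern = r"\b" + re.escape(keyword) + r"\b"
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    pairs = []
    # parts[i] precedes the i-th hit and parts[i + 1] follows it
    for i in range(len(parts) - 1):
        before = parts[i].split()[-n:]      # last n words before the hit
        after = parts[i + 1].split()[:n]    # first n words after the hit
        pairs.append((" ".join(before), " ".join(after)))
    return pairs


print(split_context("We value data since we need it ..."))
# [('We value', 'since we need')]
print(split_context("Every team in our company relies on data for planning"))
# [('company relies on', 'for planning')]
```

With a DataFrame this could be applied row by row, for example df.Answer.apply(split_context).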