How to extract the uppercase word in a string in all rows of a column in a pandas dataframe?
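I have attached the dataset. In the column named "transcription", I want to extract the uppercase words from the string in each row and make each one a feature of the dataframe, with the text that follows the uppercase word as the value of that data point under that feature.
The expected output is additional columns in the dataframe, each named after an uppercase word found in the string, with each data point holding a value under that feature. I have tried my best to explain.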
Link. Sample output (showing the first 2 data points)
Try this:
import pandas as pd

def cust_func(data):
    # Missing transcriptions (NaN) are not strings and have nothing to extract
    if not isinstance(data, str):
        return pd.Series(dtype="object")
    # Split the transcription on the "," delimiter; the pieces are re-joined later
    words = data.split(",")
    # Indices of pieces that are entirely uppercase and end with ":"
    column_idx = [i for i, w in enumerate(words)
                  if w.strip().endswith(":") and w.isupper()]
    # For each uppercase heading, join the pieces up to the next heading
    # (or to the end of the string) and store heading -> text in a dict
    result = {}
    for n, start in enumerate(column_idx):
        end = column_idx[n + 1] if n + 1 < len(column_idx) else None
        result[words[start].strip()] = ",".join(words[start + 1:end])
    return pd.Series(result, dtype="object")  # each key becomes a new column

# df is assumed to already hold the dataset with a `transcription` column
df = pd.concat([df, df.transcription.apply(cust_func)], axis=1)
df
The output looks like this (not all columns could be captured in a single screenshot):
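As a rough illustration of the same idea on made-up data, here is a minimal, self-contained sketch. The two sample rows and the helper name `split_sections` are hypothetical and not taken from the attached dataset. It is also a slight variant of the code above rather than the same implementation: it returns a plain dict per row, builds the new columns with `pd.DataFrame` from the list of dicts, and strips the trailing ":" from the column names.

```python
import pandas as pd

def split_sections(text):
    """Map each uppercase 'HEADING:' piece to the text that follows it."""
    if not isinstance(text, str):
        return {}  # e.g. NaN rows: no headings to extract
    sections = {}
    current = None
    for part in text.split(","):
        token = part.strip()
        if token.endswith(":") and token.isupper():
            # A new heading starts; drop the trailing colon for the column name
            current = token.rstrip(":").strip()
            sections[current] = []
        elif current is not None:
            # Text belonging to the most recent heading
            sections[current].append(part)
    # Re-join the pieces with the original "," delimiter
    return {name: ",".join(parts).strip() for name, parts in sections.items()}

# Hypothetical sample data, shaped like "HEADING:, text..."
df = pd.DataFrame({
    "transcription": [
        "SUBJECTIVE:, Patient reports cough, no fever.,PLAN:, Rest and fluids.",
        "HISTORY:, Prior surgery in 2010.,PLAN:, Follow up in two weeks.",
    ]
})

sections = pd.DataFrame(
    [split_sections(t) for t in df["transcription"]], index=df.index
)
out = pd.concat([df, sections], axis=1)
print(out)
```

Rows that lack a given heading get NaN in that column, so the number of new columns equals the number of distinct headings found across all rows.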