How to extract the uppercase words from a string in every row of a column in a pandas DataFrame?

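The dataset is attached. In the column named "transcription", I want to extract the uppercase word from the string in each row and make it a feature (column) of the DataFrame, with the text that follows the uppercase word as the value of that data point under that feature.

The expected output is another column in the DataFrame, named after the uppercase word found in the string, with each data point holding its value under that feature. I have tried my best to explain.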

Dataset link

Sample output (showing the first 2 data points)

Try this:

import pandas as pd

def cust_func(data):
    ## Missing or non-string transcriptions produce no new columns
    if not isinstance(data, str):
        return pd.Series(dtype=object)

    ## Split the transcription on the "," delimiter; the pieces are re-joined later
    words = data.split(",")

    ## Indices of the pieces that are entirely uppercase and end with ":"
    ## (these are the section headings)
    column_idx = [i for i, w in enumerate(words)
                  if w.strip().endswith(":") and w.isupper()]

    ## The text for each heading is everything between it and the next heading
    ## (or the end of the string for the last heading).
    ## Save each heading and its text in a dict; surrounding whitespace is stripped.
    result = {}
    for pos, start in enumerate(column_idx):
        end = column_idx[pos + 1] if pos + 1 < len(column_idx) else len(words)
        result[words[start].strip()] = ",".join(words[start + 1:end]).strip()
    return pd.Series(result, dtype=object)  ## each key becomes a new column

## apply() returns one Series per row; concat attaches them as new columns
df = pd.concat([df, df.transcription.apply(cust_func)], axis=1)
df

The output looks like this (not all columns could be captured in a single screenshot).
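For comparison, the same idea can be sketched with a regular expression instead of the manual index bookkeeping. This is a different technique from the answer's split-and-join loop, not a replacement for it. The toy DataFrame, the `HEADING` pattern, and the `split_sections` helper are all hypothetical names made up for illustration, and the pattern assumes headings consist only of capital letters, spaces, `/`, `&`, and `-`, followed by a colon and then a comma or the end of the string.

```python
import re
import pandas as pd

# Hypothetical toy data; the real dataset from the question is not reproduced here.
df = pd.DataFrame({
    "transcription": [
        "SUBJECTIVE:, Patient reports a cough, worse at night., MEDICATIONS:, None.",
        "HISTORY:, No prior surgery., PLAN:, Follow up in two weeks.",
        None,  # missing transcription
    ]
})

# Assumed heading shape: at the start of the string or right after a comma,
# an all-caps label ending in ":", followed by a comma or the end of the string.
HEADING = re.compile(r"(?:^|,)\s*([A-Z][A-Z /&-]*):\s*(?:,|$)")

def split_sections(text):
    """Return a Series mapping each uppercase heading to the text after it."""
    if not isinstance(text, str):
        return pd.Series(dtype=object)
    # With a capture group, re.split yields
    # [text before first heading, heading1, body1, heading2, body2, ...]
    parts = HEADING.split(text)
    headings = [h.strip() for h in parts[1::2]]
    bodies = [b.strip() for b in parts[2::2]]
    return pd.Series(dict(zip(headings, bodies)), dtype=object)

# One Series per row; pandas aligns them into columns, filling gaps with NaN.
sections = df["transcription"].apply(split_sections)
result = pd.concat([df, sections], axis=1)
print(result)
```

Rows that lack a given heading get NaN in that column, so the wide result can be sparse when the headings vary a lot between rows.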