
Label any text with multiple topics in sequence of their occurrence

I have a DataFrame with IDs and text, as shown below:


ID Text
1 I have completed my order
2 I have made the payment. When can I expect the order to be delivered?
3 I am unable to make the payment.
4 I am done with registration and payment. I need the order number?
5 I am unable to complete registration. How will I even order?

I have a few topics that I use to classify these texts: classes = ["order", "payment", "registration"]


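from collections import Counter

import pandas as pd

classes = ["order", "payment", "registration"]
for c in classes:
    word_counter = Counter()
    list_df = []
    field = "Text"
    # keep only the rows whose text mentions the current topic keyword
    df2 = df[df[field].str.contains(c)]
    list_df.append(df2)
    final_df = pd.concat(list_df)
    # one CSV per topic, e.g. ./order.csv
    final_df.to_csv("./" + c + ".csv")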

This generates 3 CSV files for me, which I later join back together:

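import os

import pandas as pd

file_list = []
os.chdir('<file path>')

for file in os.listdir():
    if file.endswith('.csv'):
        df = pd.read_csv(file, sep=",", encoding='ISO-8859-1')
        df['filename'] = file
        file_list.append(df)  # collect each per-topic frame for concatenation

df_topic = pd.concat(file_list, ignore_index=True)
# the topic is the file name without its extension, e.g. "order.csv" -> "order"
df_topic['topic'] = df_topic['filename'].str.split('.').str[0]
df_topic = df_topic.drop(columns='filename')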

The resulting DataFrame looks like this:

ID Text Topic
1 I have completed my order order
2 I have made the payment. When can I expect the order to be delivered? order
4 I am done with registration and payment. I need the order number? order
2 I have made the payment. When can I expect the order to be delivered? payment
3 I am unable to make the payment. payment
4 I am done with registration and payment. I need the order number? payment
4 I am done with registration and payment. I need the order number? registration
5 I am unable to complete registration. How will I even order? registration

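However, the problem you can see here is that the same text may also contain keywords of other classes and could be labelled as any of them (for example, the text with id=2 mentions both order and payment). Each id should have only one labelled record, so I would prefer to assign primary and secondary topics based on the order in which they occur from the start of the text. If a text has more than 2 topics, the first 2 take priority, but since we may need the third (or nth) topic in future, I would like to store all of them as a list in the last field (illustrated by the example for id = 4).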

ID Text Primary Topic Secondary Topic Identified Topics Topics List
1 I have completed my order order null 1 [order]
2 I have made the payment. When can I expect the order to be delivered? payment order 2 [payment,order]
3 I am unable to make the payment. payment null 1 [payment]
4 I am done with registration and payment. I need the order number? registration payment 3 [registration,payment,order]
5 I am unable to complete registration. How will I even order? registration order 2 [registration,order]


IIUC, you can use str.extractall combined with GroupBy.agg:

lst = ["order", "payment", "registration"]
regex = f'({"|".join(lst)})'  # if lst contains special chars, wrap in re.escape
df2 = df.join(df['Text'].str.extractall(regex)[0]  # one row per keyword match, in text order
              .groupby(level=0).agg(**{'Primary Topic': 'first',
                                       'Secondary Topic': lambda x: x.iloc[1] if len(x)>1 else 'null',
                                       'Identified Topics': 'nunique',
                                       'Topics List': list}))


   ID                                                                   Text Primary Topic Secondary Topic  Identified Topics                     Topics List
0   1                                              I have completed my order         order            null                  1                         [order]
1   2  I have made the payment. When can I expect the order to be delivered?       payment           order                  2                [payment, order]
2   3                                       I am unable to make the payment.       payment            null                  1                       [payment]
3   4      I am done with registration and payment. I need the order number?  registration         payment                  3  [registration, payment, order]
4   5           I am unable to complete registration. How will I even order?  registration           order                  2           [registration, order]
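
One caveat with the approach above: if a keyword occurs twice in the same text, 'Topics List' keeps both occurrences while 'Identified Topics' (nunique) counts it once, and the secondary topic can end up repeating the primary one. Below is a minimal, self-contained sketch of a variant that drops repeats while keeping the order of first occurrence. The helper name label_topics, the case-insensitive matching, and the use of missing values (rather than the string 'null') for absent topics are my own assumptions, not part of the original question.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Text": [
        "I have completed my order",
        "I have made the payment. When can I expect the order to be delivered?",
        "I am unable to make the payment.",
        "I am done with registration and payment. I need the order number?",
        "I am unable to complete registration. How will I even order?",
    ],
})


def label_topics(frame, topics, text_col="Text"):
    """Hypothetical helper: label rows with topics in order of first occurrence."""
    # re.escape guards against keywords that contain regex metacharacters.
    pattern = "(" + "|".join(re.escape(t) for t in topics) + ")"
    # One row per match, indexed by (original row, match number), in text order.
    # Case-insensitive matching is an assumption; matches are lowercased afterwards.
    matches = frame[text_col].str.extractall(pattern, flags=re.IGNORECASE)[0].str.lower()
    # dict.fromkeys drops repeated topics while preserving insertion order.
    ordered = matches.groupby(level=0).agg(lambda s: list(dict.fromkeys(s)))
    # Rows without any match are absent from `ordered`; give them an empty list.
    ordered = ordered.reindex(frame.index).apply(
        lambda v: v if isinstance(v, list) else []
    )
    out = frame.copy()
    out["Primary Topic"] = ordered.str[0]    # missing value if no topic was found
    out["Secondary Topic"] = ordered.str[1]  # missing value if fewer than 2 topics
    out["Identified Topics"] = ordered.str.len()
    out["Topics List"] = ordered
    return out


result = label_topics(df, ["order", "payment", "registration"])
print(result)
```

If you need the literal string 'null' as in the desired output, a fillna('null') on the two topic columns should do it.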