Label any text with multiple topics in order of their occurrence

I have a DataFrame with an ID and a text column, as shown below:

df1

ID Text
1 I have completed my order
2 I have made the payment. When can I expect the order to be delivered?
3 I am unable to make the payment.
4 I am done with registration and payment. I need the order number?
5 I am unable to complete registration. How will I even order?

I have a few topics that these texts can be classified into: classes = ["order", "payment", "registration"]

I am performing the following operation to get the results:

import pandas as pd

classes = ["order", "payment", "registration"]
field = "Text"
for c in classes:
    # Keep only the rows of df1 whose text mentions the current keyword
    df2 = df1[df1[field].str.contains(c)]
    print(c)
    # One CSV per topic; the file name is used as the topic label later on.
    # index=False keeps the row index out of the file.
    df2.to_csv("./" + c + ".csv", index=False)

This generates 3 CSV files for me, which I later join back together:

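import os
import pandas as pd

file_list = []
os.chdir('<file path>')

for file in os.listdir():
    if file.endswith('.csv'):
        df = pd.read_csv(file, sep=",", encoding='ISO-8859-1')
        df['filename'] = file
        file_list.append(df)

df_topic = pd.concat(file_list, ignore_index=True)
# The topic is the file name without its ".csv" extension
df_topic['Topic'] = df_topic['filename'].str.split('.').str[0]
# Pass the column by keyword; the positional axis argument is not accepted by recent pandas
df_topic = df_topic.drop(columns='filename')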

The resulting DataFrame looks like this:

ID Text Topic
1 I have completed my order order
2 I have made the payment. When can I expect the order to be delivered? order
4 I am done with registration and payment. I need the order number? order
5 I am unable to complete registration. How will I even order? order
2 I have made the payment. When can I expect the order to be delivered? payment
3 I am unable to make the payment. payment
4 I am done with registration and payment. I need the order number? payment
4 I am done with registration and payment. I need the order number? registration
5 I am unable to complete registration. How will I even order? registration

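However, the problem you can see here is that the same text may also contain keywords from other classes and could be labeled with any of them (for example, the text with id=2 mentions both order and payment). Each id should have only one labeled record, so I would prefer to assign a primary and a secondary topic based on the order in which the keywords appear from the start of the text. If a text has more than 2 topics, the first 2 take priority, but in case we need a third (or nth) topic in the future, I would like to store all of them as a list in the last field. (The row with id = 4 illustrates this.)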

ID Text Primary Topic Secondary Topic Identified Topics Topics List
1 I have completed my order order null 1 [order]
2 I have made the payment. When can I expect the order to be delivered? payment order 2 [payment,order]
3 I am unable to make the payment. payment null 1 [payment]
4 I am done with registration and payment. I need the order number? registration payment 3 [registration,payment,order]
5 I am unable to complete registration. How will I even order? registration order 2 [registration,order]
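
To pin down the rule for a single text: the following is only a minimal plain-Python sketch of the intended behaviour, not a DataFrame solution. The function name topics_in_order is made up for illustration, and it assumes that a keyword repeated within one text should be counted once.

```python
import re

def topics_in_order(text, topics):
    """Return the topics found in `text`, ordered by first occurrence."""
    # One alternation pattern; re.escape guards against special characters
    pattern = re.compile("|".join(re.escape(t) for t in topics))
    # dict.fromkeys keeps the first occurrence of each match and drops repeats
    return list(dict.fromkeys(m.group(0) for m in pattern.finditer(text)))

classes = ["order", "payment", "registration"]
found = topics_in_order(
    "I am done with registration and payment. I need the order number?", classes)

primary = found[0] if found else None
secondary = found[1] if len(found) > 1 else None
print(primary, secondary, len(found), found)
# registration payment 3 ['registration', 'payment', 'order']
```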

Is this doable? If not, what is a good way to approach this kind of labeling problem?

IIUC, you can use str.extractall combined with GroupBy.agg:

lst = ["order", "payment", "registration"]
regex = f'({"|".join(lst)})'  # if lst contains special chars, wrap in re.escape
df2 = df.join(df['Text']
              .str.extractall(regex)[0]
              .groupby(level=0).agg(**{'Primary Topic': 'first',
                                       'Secondary Topic': lambda x: x.iloc[1] if len(x)>1 else 'null',
                                       'Identified Topics': 'nunique',
                                       'Topics List': list})
               )

Output:

   ID                                                                   Text Primary Topic Secondary Topic  Identified Topics                     Topics List
0   1                                              I have completed my order         order            null                  1                         [order]
1   2  I have made the payment. When can I expect the order to be delivered?       payment           order                  2                [payment, order]
2   3                                       I am unable to make the payment.       payment            null                  1                       [payment]
3   4      I am done with registration and payment. I need the order number?  registration         payment                  3  [registration, payment, order]
4   5           I am unable to complete registration. How will I even order?  registration           order                  2           [registration, order]
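
One caveat worth checking on real data: str.extractall returns a keyword once per occurrence, so a text that mentions the same keyword twice could get it as both primary and secondary topic. The sketch below is one possible variant that de-duplicates the matches per row before deriving the columns. The sample rows are invented for illustration, and it leaves a missing secondary topic as None instead of the string 'null'.

```python
import pandas as pd

# Invented sample rows; the first one repeats "order" before "payment"
df = pd.DataFrame({
    "ID": [1, 2],
    "Text": ["My order is late. Where is my order? I already made the payment.",
             "I am unable to make the payment."],
})
lst = ["order", "payment", "registration"]
regex = f'({"|".join(lst)})'

# One entry per match, indexed by (row label of df, match number within the row)
matches = df["Text"].str.extractall(regex)[0]

# Per row: drop repeated topics while keeping the order of first occurrence
topics = matches.groupby(level=0).agg(lambda s: list(dict.fromkeys(s)))

result = df.join(pd.DataFrame({
    "Primary Topic": topics.map(lambda t: t[0]),
    "Secondary Topic": topics.map(lambda t: t[1] if len(t) > 1 else None),
    "Identified Topics": topics.map(len),
    "Topics List": topics,
}))
print(result)
```

Rows without any match do not appear in topics, so after the join they carry missing values in all four new columns.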