How do I feed an array of Tokenized Sentences to Word2vec to get embeddings?

Question

Hi all: I can't figure out the code I need to get the embeddings out of a word2vec model.

Here is the structure of my df (it is some Android-based logs):

logDateTime | lineNum | processID | threadID | priority | app | message | eventTemplate | eventID
ts int int int str str str str

Basically, I created a subset of unique events from the log messages and assigned each one a template with an associated ID:

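def eventCreation(df):
    # Mask digit runs and boolean literals so that messages differing only in values share a template
    df['eventTemplate'] = df['message'].str.replace(r'\d+', '*', regex=True)
    df['eventTemplate'] = df['eventTemplate'].str.replace('true', '*', regex=False)
    df['eventTemplate'] = df['eventTemplate'].str.replace('false', '*', regex=False)
    # Number the templates in order of first appearance: E1, E2, ...
    df['eventID'] = df.groupby('eventTemplate', sort=False).ngroup() + 1
    df['eventID'] = 'E' + df['eventID'].astype(str)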

def seqGen(arr, k):
    for i in range(len(arr)-k+1):
        yield arr[i:i+k]

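import os
import pandas as pd
from nltk.tokenize import word_tokenize

# define the variables here
cwd = os.getcwd()
# create a dataframe from the concatenated logs
# (process_files and getFiles are defined elsewhere and not shown)
df = pd.DataFrame.from_records(process_files(cwd, getFiles))
# call the functions that build up df (cleanDf and featureEng are not shown)
cleanDf(df)
featureEng(df)
eventCreation(df)
# tokenize each event template into a list of string tokens
df['eventToken'] = df['eventTemplate'].apply(word_tokenize)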
# collect every sliding window of 9 consecutive events
seq = []
eventArray = df[["eventToken"]].to_numpy()
for sequence in seqGen(eventArray, 9):
    seq.append(sequence)  # each window is a (9, 1) object array

So 'seq' ends up looking like this:

[array([['[*,com.blah.blach.blahMainblach] '],
        ['[*,*,*,com.blah.blah/.permission.ui.blah,finish-imm] '],
        ['[*,*,*,*,startingNewTask] '],
        ...,
        ['mfc, isSoftKeyboardVisible in WMS : * '],
        ['mfc, isSoftKeyboardVisible in WMS : * '],
        ['Calling a method in the system process without a qualified user: android.app.ContextImpl.startService:* android.content.ContextWrapper.startService:* android.content.ContextWrapper.startService:* com.blahblah.usbmountreceiver.USBMountReceiver.onReceive:* android.app.ActivityThread.handleReceiver:* ']],
       dtype=object),

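The sequences are arrays containing lists of tokenized log messages. The plan is that, after training the model, I can get the embedding of a log event by multiplying its one-hot vector by the weight matrix... There is more work to do, but I'm stuck on getting the embeddings.

I'm a newbie trying to develop an anomaly detection solution.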

Answer 1

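If you're using the Gensim library's implementation of Word2Vec in Python, it expects its corpus as a re-iterable sequence in which each item is itself a list of string tokens.

A list whose items are each a list of string tokens will work.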
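To make "re-iterable" concrete, here is a minimal sketch using only plain Python. Gensim's Word2Vec reads the corpus more than once (one pass to build the vocabulary, then one pass per training epoch), so a one-shot generator is used up after the first pass. The class name `TokenizedCorpus`, the whitespace split, and the sample lines are assumptions made up for illustration, not anything from your code.

```python
class TokenizedCorpus:
    """Re-iterable corpus: every call to __iter__ starts a fresh pass."""

    def __init__(self, raw_lines):
        self.raw_lines = raw_lines

    def __iter__(self):
        for line in self.raw_lines:
            # Hypothetical tokenizer: a plain whitespace split.
            yield line.split()


lines = ["mfc isSoftKeyboardVisible in WMS", "startingNewTask finish-imm"]

# A class with __iter__ (or a plain list) can be walked again and again.
corpus = TokenizedCorpus(lines)
first_pass = list(corpus)
second_pass = list(corpus)

# A generator expression is exhausted after a single pass.
one_shot = (line.split() for line in lines)
gen_first = list(one_shot)
gen_second = list(one_shot)  # nothing left to yield
```

Either a plain list of token lists or an object like `TokenizedCorpus` fits that description; the generator does not, because its second pass comes back empty.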

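Your seq is close, but:

it doesn't need to be (and so probably shouldn't be) a numpy array of objects.
each of your object items is a list (good), but each one has only a single untokenized string inside (bad). You need to break those strings into the individual 'words' that you want the model to learn.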
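The two points above could be addressed along these lines. This is a sketch under stated assumptions, not a definitive implementation: the helper names (`tokenize_event`, `to_corpus`, `train_embeddings`), the separator set in the regular expression, and the training settings are all hypothetical, and the sample rows are copied from the output shown in the question.

```python
import re


def tokenize_event(text):
    """Split one log-event template into string tokens.

    Assumption: whitespace, commas, square brackets and colons act as
    separators. Swap in whatever tokenizer suits your logs.
    """
    return [tok for tok in re.split(r"[\s,\[\]:]+", text) if tok]


def to_corpus(rows):
    """Turn rows that each hold one untokenized string into a plain
    list of token lists (no numpy object array needed)."""
    return [tokenize_event(row[0]) for row in rows]


def train_embeddings(corpus, vector_size=50):
    """Train a small Word2Vec model on the corpus (hypothetical settings)."""
    # Third-party import kept local so the helpers above work without Gensim.
    from gensim.models import Word2Vec
    # `vector_size` is the Gensim 4.x parameter name; Gensim 3.x called it `size`.
    return Word2Vec(sentences=corpus, vector_size=vector_size, min_count=1)


rows = [
    ["[*,*,*,*,startingNewTask] "],
    ["mfc, isSoftKeyboardVisible in WMS : * "],
]
corpus = to_corpus(rows)
```

With Gensim installed, `model = train_embeddings(corpus)` followed by `model.wv["startingNewTask"]` should give that token's vector of length `vector_size`. That lookup reads one row of the weight matrix `model.wv.vectors`, which is the same result you would get by multiplying the token's one-hot vector by that matrix.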

