How do I get the word-embedding matrix from ft_word2vec (sparklyr-package)?

I have another question in the word2vec universe. I am using the 'sparklyr' package, in which I call the ft_word2vec() function. I am having some trouble understanding the output: for the sentences/paragraphs I pass to ft_word2vec(), I always get the same number of vectors as there are rows, even when there are more words than sentences/paragraphs. To me this looks like I am getting paragraph vectors rather than word vectors. Maybe a code example helps to illustrate my problem?

# add your spark_connection here as 'spark_connection = '

# create example data frame
FK_data = data.frame(sentences = c("This is my first sentence",
  "It is followed by the second sentence",
  "At the end there is the last sentence"))

# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)

# prepare data for ft_word2vec (sentences have to be tokenized [=list of words instead of one string in each row])
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")

# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456) 
FK_train <- partitions$training
FK_test <- partitions$test

# given a training data set (FK_train) with a column "tokens" (each row = a list of strings)
mymodel = ft_word2vec(
  FK_train,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))

# I tried to get the data from spark with:
myemb = mymodel %>% sparklyr::collect()

Has anyone had a similar experience? Can someone explain what exactly the ft_word2vec() function returns? Do you have an example of how to obtain the word-embedding vectors with this function? Or does the returned column indeed contain paragraph vectors?

A colleague of mine found the solution! Once you know how it works, the documentation really starts to make sense!

# add your spark_connection here as 'spark_connection = '

# create example data frame
FK_data = data.frame(sentences = c("This is my first sentence",
  "It is followed by the second sentence",
  "At the end there is the last sentence"))

# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)

# prepare data for ft_word2vec (sentences have to be tokenized [=list of words instead of one string in each row])
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")

# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456) 
FK_train <- partitions$training
FK_test <- partitions$test

# CHANGES FOLLOW HERE:
# We have to pass the spark connection instead of the data. For me this was the
# confusing part, since I thought: no data -> no model.
# Maybe we can think of this step as an initialization: it returns an (unfitted) estimator.
mymodel = ft_word2vec(
  spark_connection,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))

# now that the estimator is initialized, fit it on the tokenized training data
# (this is where the word embeddings are actually learned)
w2v_model <- ml_fit(mymodel, FK_train)

# now we can collect the embedding vectors (one row per vocabulary word)
emb <- w2v_model$vectors %>% collect()
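
This also explains the original observation of one vector per row: Spark's Word2Vec transform averages the word vectors of each token list, so transforming a dataset yields paragraph-level vectors, while the fitted model's `vectors` table holds the per-word embedding matrix. A minimal sketch of the two outputs, assuming the fitted model `w2v_model` and the data from above:

```r
library(sparklyr)
library(dplyr)

# One averaged vector per sentence/paragraph -- what the question originally saw
# when collecting the transformed data:
sentence_vecs <- ml_transform(w2v_model, FK_train) %>% collect()

# One embedding vector per vocabulary word -- the word-embedding matrix:
word_vecs <- w2v_model$vectors %>% collect()

# Optionally, query the model for words with similar embeddings
# (the token "sentence" is just an illustrative example):
# ml_find_synonyms(w2v_model, "sentence", num = 3) %>% collect()
```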