Match error when using semi_join in sparklyr

Question

I am trying to join two tables in a Spark data frame to generate a list of matching n-grams.

Article list (df_sparklyr):

id  description
1   In order to investigate the role of calcium pathway in myeloid  differentiation, the expression level of genes related to calcium pathway in all trans retinoic acid (ATRA) induced NB4 cell differentiation was detected by cDNA microarray, some of which were further confirmed by quantitative real time RT PCR. At the same time, the expressions of these genes in NB4 R1 cells treated with ATRA and 8 CPT cAM P alone or in combination, and in differentiation of primary cells from ATRA induced newly diagnosed APL patients were detected by real time RT PCR. The results showed that during differentiation of ATRA induced NB4 cells, the expressions of genes related to calcium concentration had changed, the expression of downstream effectors in calcium pathway was up regulated and confirmed by real time RT PCR assay. The expression of genes related to calcium concentration did not change significantly when NB4 R1 cells were treated by ATRA or 8 CPT cAMP alone, but expression changes of those genes were similar to the changes in ATRA induced NB4 cell differentiation when NB4 R1 cells were treated by ATRA combined with 8 CPT cAMP. In addition, the expression changes of those genes in ATRA induced primary cells of patients with APL were also similar to changes in ATRA induced NB4 cell differentiation. It is concluded that calcium pathway may be involved in ATRA induced differentiation in APL cell.
2   This study was aimed to investigate the inhibitory effect of flavonoids of puerarin (PR) in different concentrations on proliferation of 4 kinds of acute myeloid leukemia (AML) cell lines (Kasumi 1, HL 60, NB4 and U937), and to explore its possible mechanism. The MTT method was used to detected the inhibitory effect of PR on proliferation of AML cell lines. The flow cytometry was adopted to determine the change of cell cycle in vitro. The results showed that a certain concentration of PR could inhibit the proliferation of these 4 cell lines effectively in time and dose dependent manners, and the intensity of inhibition on 4 kinds of AML cell lines was from high to low as follows: NB4>Kasumi 1>U937>HL 60. Meanwhile, PR could also change cycle process, cell proportion in G1 G0 phase decreased, cells in S phase increased and Sub diploid peak also appeared. It is concluded that PR can selectively inhibit the proliferation of 4 AML cell lines and block cell cycle process, especially for NB4 cells.
3   This study was aimed to investigate the effects of flavonoids of puerarin (PR) on apoptosis of acute promyelocytic leukemia (APL) cell line NB4 cells and its mechanism. The NB4 were treated with PR in vitro, the MTT assay was used to detect the inhibitory effect of PR on cell proliferation. The apoptosis of NB4 cells were detected by flow cytometry labelled with Annexin V PI. The expressions of pml rar alpha, bcl 2 and survivin were detected by real time reverse transcription polymerase chain reaction (real time RT PCR), the expressions of JNK, p38 MAPK, FasL, caspase 3, caspase 8 were detected by Western blot. The results showed that with the increasing of PR concentrations, the apoptosis rates of NB4 cells were gradually elevated. Simultaneously, the mRNA expression of pml rar alpha, bcl 2 and survivin decreased, while the protein expression of JNK, FasL, caspase 3 and caspase 8 increased, which presented the positive correlation to PR concentrations. When PR combined with arsenic trioxide (ATO), the expression levels of above mentioned mRNA and protein decreased or increased more significantly. It is concluded that PR can effectively induce the apoptosis of NB4 cells. PR combined with ATO displays synergistic effect. It may be triggered by the activation of JNK signal pathway.

Keyword list (dict_tbl):

   [1] "3 M SYNDROME"
   [2] "3-M SYNDROME"                                                                
   [3] "3-M SYNDROME 1"                                                              
   [4] "3M SYNDROME"                                                                 
   [5] "DOLICHOSPONDYLIC DYSPLASIA"                                                  
   [6] "GLOOMY FACE SYNDROME"                                                        
   [7] "LE MERRER SYNDROME"                                                          
   [8] "THREE M SYNDROME"                                                            
   [9] "YAKUT SHORT STATURE SYNDROME"                                                
  [10] "ABDOMINAL AORTIC ANEURYSM"                                                   
  [11] "ANEURYSM ABDOMINAL AORTIC"                                                   
  [12] "AORTIC ANEURYSM ABDOMINAL"                                                   
  [13] "AORTIC ANEURYSM FAMILIAL ABDOMINAL 1"                                        
  [14] "ABSENCE EPILEPSY"                                                            
  [15] "ABSENCE SEIZURE"                                                             
  [16] "CHILDHOOD ABSENCE EPILEPSY"                                                  
  [17] "JUVENILE ABSENCE EPILEPSY"                                                   
  [18] "PETIT MAL SEIZURE"                                                           
  [19] "PYKNOLEPSY"                                                                  
  [20] "ACANTHAMOEBA INFECTION"                                                      
  [21] "ACANTHAMOEBA INFECTIONS"                                                     
  [22] "ACANTHAMOEBA KERATITIS"                                                      
  [23] "ACCOMMODATIVE SPASM"

I am using the following code:

s_2 = df_sparklyr %>%
  ft_tokenizer("description", "words")%>%
  ft_ngram(input_col = "words", output_col = "ngrams")%>%
  semi_join(y = dict_tbl, by = c("ngrams" = "Keywords"))

I get the following error:

Error: org.apache.spark.sql.AnalysisException: cannot resolve '(outer() = RHS.Keywords)' due to data type mismatch: differing types in '(outer() = RHS.Keywords)' (array and string).;

Answer 1

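It looks like you missed a few things. The error says the join compares an array with a string: ft_ngram returns an array of n-grams for each row, while Keywords is a plain string column.

1. The n argument of ft_ngram sets how many tokens go into each n-gram.
2. The explode function turns the per-row list of n-grams into one n-gram per row.
3. For the join, it is much easier to simply rename the column you are joining on.

Here is the detailed approach; I hope it helps.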
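To see why the comparison fails and what explode changes, here is a small Spark-free sketch in base R. The sample sentences and the helper make_ngrams are made up for illustration; they only imitate what ft_tokenizer and ft_ngram do and are not part of sparklyr.

```r
# Local analogy only: illustrative data, not taken from the question.
descriptions <- c("Juvenile absence epilepsy was studied",
                  "No keyword here")
keywords <- tolower(c("ABSENCE EPILEPSY", "PETIT MAL SEIZURE"))

# Hypothetical helper: lowercase, split on whitespace, build n-grams.
make_ngrams <- function(text, n = 2) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# One list element per row: this is the "array" side of the failed comparison.
ngram_lists <- lapply(descriptions, make_ngrams, n = 2)

# "Explode": one row per (id, n-gram), so every n-gram is a plain string.
exploded <- data.frame(
  id = rep(seq_along(ngram_lists), lengths(ngram_lists)),
  ngrams = unlist(ngram_lists),
  stringsAsFactors = FALSE
)

# A string-to-string match, the local counterpart of the join, now works.
matched <- exploded[exploded$ngrams %in% keywords, ]
print(matched)
```

Before the explode step each row holds a whole vector of n-grams, which cannot be compared to a single keyword string; afterwards each row holds one string, so an equality join is well defined.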

Step 1: Generate the Spark data frame

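library(sparklyr)
library(dplyr)

# Assumption: a local Spark connection named sc (the connection setup is not shown in the original answer).
sc <- spark_connect(master = "local")

my_text <-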
'In order to investigate the role of calcium pathway in myeloid  differentiation, the expression level of genes related to calcium pathway in all trans retinoic acid (ATRA) induced NB4 cell differentiation was detected by cDNA microarray, some of which were further confirmed by quantitative real time RT PCR. At the same time, the expressions of these genes in NB4 R1 cells treated with ATRA and 8 CPT cAM P alone or in combination, and in differentiation of primary cells from ATRA induced newly diagnosed APL patients were detected by real time RT PCR. The results showed that during differentiation of ATRA induced NB4 cells, the expressions of genes related to calcium concentration had changed, the expression of downstream effectors in calcium pathway was up regulated and confirmed by real time RT PCR assay. The expression of genes related to calcium concentration did not change significantly when NB4 R1 cells were treated by ATRA or 8 CPT cAMP alone, but expression changes of those genes were similar to the changes in ATRA induced NB4 cell differentiation when NB4 R1 cells were treated by ATRA combined with 8 CPT cAMP. In addition, the expression changes of those genes in ATRA induced primary cells of patients with APL were also similar to changes in ATRA induced NB4 cell differentiation. It is concluded that calcium pathway may be involved in ATRA induced differentiation in APL cell.
This study was aimed to investigate the inhibitory effect of flavonoids of puerarin (PR) in different concentrations on proliferation of 4 kinds of acute myeloid leukemia (AML) cell lines (Kasumi 1, HL 60, NB4 and U937), and to explore its possible mechanism. The MTT method was used to detected the inhibitory effect of PR on proliferation of AML cell lines. The flow cytometry was adopted to determine the change of cell cycle in vitro. The results showed that a certain concentration of PR could inhibit the proliferation of these 4 cell lines effectively in time and dose dependent manners, and the intensity of inhibition on 4 kinds of AML cell lines was from high to low as follows: NB4>Kasumi 1>U937>HL 60. Meanwhile, PR could also change cycle process, cell proportion in G1 G0 phase decreased, cells in S phase increased and Sub diploid peak also appeared. It is concluded that PR can selectively inhibit the proliferation of 4 AML cell lines and block cell cycle process, especially for NB4 cells.
This study was aimed to investigate the effects of flavonoids of puerarin (PR) on apoptosis of acute promyelocytic leukemia (APL) cell line NB4 cells and its mechanism. The NB4 were treated with PR in vitro, the MTT assay was used to detect the inhibitory effect of PR on cell proliferation. The apoptosis of NB4 cells were detected by flow cytometry labelled with Annexin V PI. The expressions of pml rar alpha, bcl 2 and survivin were detected by real time reverse transcription polymerase chain reaction (real time RT PCR), the expressions of JNK, p38 MAPK, FasL, caspase 3, caspase 8 were detected by Western blot. The results showed that with the increasing of PR concentrations, the apoptosis rates of NB4 cells were gradually elevated. Simultaneously, the mRNA expression of pml rar alpha, bcl 2 and survivin decreased, while the protein expression of JNK, FasL, caspase 3 and caspase 8 increased, which presented the positive correlation to PR concentrations. When PR combined with arsenic trioxide (ATO), the expression levels of above mentioned mRNA and protein decreased or increased more significantly. It is concluded that PR can effectively induce the apoptosis of NB4 cells. PR combined with ATO displays synergistic effect. It may be triggered by the activation of JNK signal pathway.'


# Split the text into one description per line.
my_col <- my_text %>% strsplit(split = '\n') %>% unlist()

# One row per description, with a running id.
my_df <- tibble(id = seq_along(my_col), description = my_col)

# Copy the local data frame to Spark.
my_spark_df <- copy_to(sc, my_df, 'my_spark_df')

Step 2: Generate the keyword list

key_words <- c(
"3-M SYNDROME"                                                                
,"3-M SYNDROME 1"                                                              
,"3M SYNDROME"                                                                 
,"DOLICHOSPONDYLIC DYSPLASIA"                                                  
,"GLOOMY FACE SYNDROME"                                                        
,"LE MERRER SYNDROME"                                                          
,"THREE M SYNDROME"                                                            
,"YAKUT SHORT STATURE SYNDROME"                                                
,"ABDOMINAL AORTIC ANEURYSM"                                                   
,"ANEURYSM ABDOMINAL AORTIC"                                                   
,"AORTIC ANEURYSM ABDOMINAL"                                                   
,"AORTIC ANEURYSM FAMILIAL ABDOMINAL 1"                                        
,"ABSENCE EPILEPSY"                                                            
,"ABSENCE SEIZURE"                                                             
,"CHILDHOOD ABSENCE EPILEPSY"                                                  
,"JUVENILE ABSENCE EPILEPSY"                                                   
,"PETIT MAL SEIZURE"                                                           
,"PYKNOLEPSY"                                                                  
,"ACANTHAMOEBA INFECTION"                                                      
,"ACANTHAMOEBA INFECTIONS"                                                     
,"ACANTHAMOEBA KERATITIS"                                                      
,"ACCOMMODATIVE SPASM")



# ft_tokenizer lowercases its output, so lowercase the keywords before copying them to Spark.
key_words_spark_df <-
  copy_to(sc, tibble(key_words = tolower(key_words)), 'keywords_spark')

Join

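my_spark_df %>%
  ft_tokenizer("description", "words") %>%
  # n = 2 builds bigrams, so only two-word keywords can match
  ft_ngram(input_col = "words", output_col = "ngrams", n = 2) %>%
  # explode() turns the array column into one string n-gram per row
  mutate(ngrams = explode(ngrams)) %>%
  select(id, ngrams) %>%
  # rename so both tables share the name of the join column
  rename(key_words = ngrams) %>%
  inner_join(key_words_spark_df, by = "key_words")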
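
The keywords in the list are between one and five tokens long, while n = 2 only produces bigrams. One possible way to cover other lengths is to run the same pipeline for several values of n and stack the results. The following is a sketch under assumptions, not the original answer's method: it assumes a local Spark installation, and the helper match_ngrams and the toy tables are hypothetical.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Toy tables (made up for this sketch), with the same column names as above.
texts <- copy_to(sc, tibble(
  id = 1:2,
  description = c("Pyknolepsy is also called petit mal seizure",
                  "Juvenile absence epilepsy differs")
), "toy_texts", overwrite = TRUE)

keywords <- copy_to(sc, tibble(
  key_words = tolower(c("PYKNOLEPSY", "ABSENCE EPILEPSY",
                        "PETIT MAL SEIZURE", "JUVENILE ABSENCE EPILEPSY"))
), "toy_keywords", overwrite = TRUE)

# Hypothetical helper: match the n-grams of one size against the keyword table.
match_ngrams <- function(text_tbl, keyword_tbl, n) {
  text_tbl %>%
    ft_tokenizer("description", "words") %>%
    ft_ngram(input_col = "words", output_col = "ngrams", n = n) %>%
    mutate(key_words = explode(ngrams)) %>%
    select(id, key_words) %>%
    inner_join(keyword_tbl, by = "key_words")
}

# Run the pipeline for n = 1, 2, 3 and stack the results.
matches <- lapply(1:3, function(n) match_ngrams(texts, keywords, n)) %>%
  Reduce(union_all, .)

matched_local <- collect(matches)
print(matched_local)

spark_disconnect(sc)
```

Once the n-grams are exploded, both sides of the comparison are strings, so replacing inner_join with semi_join in the helper should also work if only the matching rows of the left table are needed. Note that ft_tokenizer splits on whitespace only, so tokens keep any attached punctuation such as a trailing comma.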
