Classifying identical patterns in words using R
I want to do a text mining analysis, but I have run into some trouble.
Using dput(), I am sharing a small part of my text.
text<-structure(list(ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3941L, 3941L,
3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3953L, 3953L,
3953L, 3953L, 3953L, 3953L, 3960L, 3960L, 3960L, 3960L, 3960L,
3960L, 3967L, 3967L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), GOODS_NAME = structure(c(19L,
17L, 15L, 18L, 16L, 23L, 21L, 14L, 22L, 20L, 6L, 2L, 10L, 8L,
7L, 13L, 5L, 11L, 7L, 12L, 4L, 3L, 9L, 9L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "* 2108609 SLOB.Mayon.OLIVK.67% 400ml", "* 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg",
"* 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35", "* 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g",
"197 Onion 1 kg", "2013077 MAKFA Makar.RAKERS 450g", "2030918 MARIA TRADITIONAL Biscuit 180g",
"2049750 MAKFA Makar.SHIGHTS 450g", "3420159 LEBED.Mol.past.3,4-4,5% 900g",
"3491144 LIP.NAP.ICE TEA green yellow 0.5 liter", "6788 MAKFA Makar.perya 450g",
"809 Bananas 1kg", "FetaXa Cheese product 60% 400g (", "Lemons 55+",
"MAKFA Macaroni feathers like. in / with", "Napkins paper color 100pcs PL",
"Package \"Magnet\" white (Plastiktre)", "Pasta Makfa snail flow-pack 450 g.",
"SHEBEKINSKIE Macaroni Butterfly №40", "SOFT Cotton sticks 100 PE (BELL",
"TENDER AGE Cottage cheese 10", "TOBUS steering-wheel 0.5kg flow"
), class = "factor")), .Names = c("ID_C_REGCODES_CASH_VOUCHER",
"GOODS_NAME"), class = "data.frame", row.names = c(NA, -61L))
(The NA rows are there by accident.)
The text consists of product names from receipts (cash vouchers).
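Since the NA rows are accidental, one option is to drop them and convert the factor column to plain character strings before any analysis. A minimal sketch on a toy data frame; the names `toy`, `keep` and `clean` and the three rows are made up for illustration and are not part of the posted data:

```r
# Toy stand-in for the posted data: two real rows plus one accidental NA row.
toy <- data.frame(
  ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3945L, NA),
  GOODS_NAME = factor(c("Lemons 55+", "809 Bananas 1kg", ""))
)

# Keep only rows that have a voucher ID and a non-empty product name.
keep <- !is.na(toy$ID_C_REGCODES_CASH_VOUCHER) &
  as.character(toy$GOODS_NAME) != ""
clean <- toy[keep, ]

# Work with character strings rather than factor codes from here on.
clean$GOODS_NAME <- as.character(clean$GOODS_NAME)
clean
```

The same filter should carry over to the real `text` data frame, assuming the accidental rows are exactly those with a missing ID and an empty name.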
I want to group all similar names.
For example, I manually picked MAKFA Makar (a Ukrainian name) and found 7 rows with the root or key word "MAKFA Makar":
Pasta Makfa snail flow-pack 450 g.
MAKFA Macaroni feathers like. in / with
2013077 MAKFA Makar.RAKERS 450g
2013077 MAKFA Makar.RAKERS 450g
6788 MAKFA Makar.perya 450g
2049750 MAKFA Makar.SHIGHTS 450g
2049750 MAKFA Makar.SHIGHTS 450g
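All of these product positions share the same root.
The class label must stay readable: MAKFA Makar, not something like MFAMKR.
As output I want to get: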
    Initially                                                class
 1  Pasta Makfa snail flow-pack 450 g.                       MAKFA Makar.
 2  MAKFA Macaroni feathers like. in / with                  MAKFA Makar.
 3  2013077 MAKFA Makar.RAKERS 450g                          MAKFA Makar.
 4  2013077 MAKFA Makar.RAKERS 450g                          MAKFA Makar.
 5  6788 MAKFA Makar.perya 450g                              MAKFA Makar.
 6  2049750 MAKFA Makar.SHIGHTS 450g                         MAKFA Makar.
 7  2049750 MAKFA Makar.SHIGHTS 450g                         MAKFA Makar.
 8  * 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35          kolb
 9  * 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg               Spikachki
10  809 Bananas 1kg                                          Bananas
11  Lemons 55+                                               Lemons
12  Napkins paper color 100pcs PL                            Napkins paper
13  SOFT Cotton sticks 100 PE (BELL                          Cotton sticks
14  SHEBEKINSKIE Macaroni Butterfly №40                      SHEBEKINSKIE Macaroni
15  * 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g  CAT seed
16  FetaXa Cheese product 60% 400g (                         Cheese
17  3491144 LIP.NAP.ICE TEA green yellow 0.5 liter           TEA
18  2030918 MARIA TRADITIONAL Biscuit 180g                   Biscuit
19  197 Onion 1 kg                                           Onion
20  TOBUS steering-wheel 0.5kg flow                          steering-wheel
21  Package "Magnet" white (Plastiktre)                      Package (Plastiktre)
22  * 2108609 SLOB.Mayon.OLIVK.67% 400ml                     Mayon
23  TENDER AGE Cottage cheese 10                             Cottage cheese
How can I classify products by their root word? (More precisely, by a pattern that is shared inside the words, such as Makar.Makfa or cheese.)
I think you can get where you want by cleaning the text and then clustering it. Here is a starter:
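text <- text[1:24, ]  # keep only the rows that actually contain data

library(quanteda)
library(quanteda.textstats)  # textstat_simil() lives here since quanteda 3.0
library(tidyverse)

sim <- text %>%
  pull(GOODS_NAME) %>%
  as.character() %>%
  quanteda::tokens(
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_separators = TRUE
  ) %>%
  quanteda::tokens_tolower() %>%
  # drop tokens that start with a digit, e.g. "450g"
  quanteda::tokens_remove(valuetype = "regex", pattern = "^\\d.*") %>%
  quanteda::dfm() %>%
  textstat_simil(method = "jaccard")

# turn similarity into a distance (1 - similarity) so hclust() can use it
d <- as.dist(1 - as.matrix(sim))
attr(d, "Labels") <- as.character(text$GOODS_NAME)
hc <- hclust(d, method = "average")

# draw the dendrogram into a temporary PDF
pdf(tf <- tempfile(fileext = ".pdf"), width = 20, height = 10)
plot(hc)
dev.off()
shell.exec(tf)  # Windows only; browseURL(tf) is a portable alternative

# cut where the average similarity drops below 0.1 (distance 0.9)
clusters <- cutree(hc, h = 0.9)
split(text, clusters)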
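To make the clustering step less opaque: the Jaccard similarity of two names is the number of tokens they share divided by the number of distinct tokens in either name. The sketch below computes it in base R for three of the posted names. The helpers `tokenize_name` and `jaccard` are hypothetical names introduced here, and the tokenizer is only a rough stand-in for quanteda's (it splits on whitespace and punctuation and drops tokens that start with a digit), so quanteda may produce different tokens for the same input.

```r
# Hypothetical helper: lower-case a name, split on whitespace and punctuation,
# then drop empty strings and tokens that start with a digit (e.g. "450g").
tokenize_name <- function(x) {
  parts <- strsplit(tolower(x), "[[:space:][:punct:]]+")[[1]]
  parts <- parts[nzchar(parts)]
  unique(parts[!grepl("^[0-9]", parts)])
}

# Jaccard similarity: size of the intersection over size of the union.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

x1 <- tokenize_name("2013077 MAKFA Makar.RAKERS 450g")  # makfa, makar, rakers
x2 <- tokenize_name("6788 MAKFA Makar.perya 450g")      # makfa, makar, perya
x3 <- tokenize_name("809 Bananas 1kg")                  # bananas

jaccard(x1, x2)  # 0.5: two shared tokens out of four distinct ones
jaccard(x1, x3)  # 0: nothing in common
```

Names that share a brand token end up close together, which is why the MAKFA rows tend to fall into one cluster, while single-word products such as Bananas stay on their own.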
Here is an approach that uses a vector of words to search for:
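# grep() treats every entry as a case-sensitive regular expression, so "kolb"
# (the data has "Kolb") and "Package (Plastiktre)" (the parentheses form a
# regex group) match nothing, and the "." in "MAKFA Makar." matches any character.
patt <- c("MAKFA Makar.", "kolb", "Spikachki", "Bananas", "Lemons",
          "Napkins paper", "Cotton sticks", "SHEBEKINSKIE Macaroni", "CAT seed", "Cheese",
          "TEA", "Biscuit", "Onion", "steering-wheel", "Package (Plastiktre)",
          "Mayon", "Cottage", "cheese")
# one data frame of matching rows per pattern; "Cottage" and "cheese" hit the
# same row, which is why it appears twice in the result
lst <- lapply(patt, function(x) text[grep(x, text$GOODS_NAME), ])
do.call(rbind.data.frame, lst)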
   ID_C_REGCODES_CASH_VOUCHER GOODS_NAME
15                       3953 2013077 MAKFA Makar.RAKERS 450g
19                       3960 2013077 MAKFA Makar.RAKERS 450g
20                       3960 6788 MAKFA Makar.perya 450g
23                       3967 2049750 MAKFA Makar.SHIGHTS 450g
24                       3967 2049750 MAKFA Makar.SHIGHTS 450g
22                       3960 * 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg
16                       3953 809 Bananas 1kg
 3                       3941 Lemons 55+
 2                       3941 Napkins paper color 100pcs PL
 7                       3945 SOFT Cotton sticks 100 PE (BELL
10                       3945 SHEBEKINSKIE Macaroni Butterfly №40
17                       3960 * 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g
 8                       3945 FetaXa Cheese product 60% 400g (
18                       3960 3491144 LIP.NAP.ICE TEA green yellow 0.5 liter
14                       3953 2030918 MARIA TRADITIONAL Biscuit 180g
11                       3953 197 Onion 1 kg
 6                       3945 TOBUS steering-wheel 0.5kg flow
12                       3953 * 2108609 SLOB.Mayon.OLIVK.67% 400ml
 9                       3945 TENDER AGE Cottage cheese 10
91                       3945 TENDER AGE Cottage cheese 10
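The result above returns the matching rows but not the class label the question asks for, and it misses names that differ in case or contain regex metacharacters. A possible variant, sketched under stated assumptions: the lookup `keys` and the function `classify` are hypothetical names introduced here, matching is done on lower-cased text with `fixed = TRUE` so every key is treated literally, and only a handful of the posted names are used.

```r
# A few of the posted product names, as plain character strings.
goods <- c(
  "Pasta Makfa snail flow-pack 450 g.",
  "2013077 MAKFA Makar.RAKERS 450g",
  "* 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35",
  "Package \"Magnet\" white (Plastiktre)",
  "TENDER AGE Cottage cheese 10",
  "FetaXa Cheese product 60% 400g ("
)

# Hypothetical lookup: lower-case search key -> class label. Order matters,
# because the first matching key wins ("cottage cheese" before "cheese").
keys <- c(
  "makfa"          = "MAKFA Makar.",
  "kolb"           = "kolb",
  "plastiktre"     = "Package (Plastiktre)",
  "cottage cheese" = "Cottage cheese",
  "cheese"         = "Cheese"
)

classify <- function(x, keys) {
  x_low <- tolower(x)
  out <- rep(NA_character_, length(x))
  for (k in names(keys)) {
    # fixed = TRUE treats the key as literal text, not as a regular expression.
    hit <- is.na(out) & grepl(k, x_low, fixed = TRUE)
    out[hit] <- keys[[k]]
  }
  out
}

result <- data.frame(Initially = goods, class = classify(goods, keys),
                     stringsAsFactors = FALSE)
result
```

Names that match no key keep `NA` as their class, which makes it easy to see which products still need a keyword. The keyword list itself still has to be maintained by hand.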