Extracting and counting unique word frequencies from a range
I have a column in which every row is a sentence. For example:
COLUMN1
R1: -Do you think they'll come, sir?
R2: -Oh they'll come, they'll come all right.
R3: Here. Stamp those and mail them.
R4: It's ringing.
R5: Would you walk Myron the other way?
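From this range, I want to extract a list of unique words (COLUMN2) along with a count of how often each one appears in the range (COLUMN3).
The tricky part is removing punctuation such as commas, periods, and so on.
So the desired result for the rows above is: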
COLUMN2 COLUMN3
Do 1
you 2
think 1
they'll 3
come 3
sir 1
Oh 1
all 1
right 1
Here 1
Stamp 1
those 1
and 1
mail 1
them 1
It's 1
ringing 1
Would 1
walk 1
Myron 1
the 1
other 1
way 1
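I tried using the SPLIT function to parse each row and put each word in its own cell, but I couldn't remove the punctuation or build the list of unique words (I know this will involve the UNIQUE function). I'm guessing the count will also involve the COUNTUNIQUE function.
Any guidance would be much appreciated!
You can try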
=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>'' group by Col1 label Count(Col1)''")
Change the range to suit.
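If it helps to see what that formula does step by step, here is a rough sketch of the same pipeline in Python, outside of Sheets. It assumes the sample sentences without the R1: labels, and the function name word_counts is made up for illustration. As I understand it, the inner QUERY(..., , 50000) joins all the rows into one space-separated string, which the sketch does with a plain join.

```python
import re
from collections import Counter

def word_counts(rows):
    """Count words across rows, keeping only letters, whitespace, '/' and apostrophes."""
    # Same character class as the formula's REGEXREPLACE: everything else is dropped.
    cleaned = [re.sub(r"[^A-Za-z\s/']", "", row) for row in rows]
    # Join all rows into one string, then split on whitespace into single words.
    words = " ".join(cleaned).split()
    # Case-sensitive tally, comparable to "group by Col1" with Count(Col1).
    return Counter(words)

# Sample rows from the question (assumed to be stored without the "R1:" labels).
rows = [
    "-Do you think they'll come, sir?",
    "-Oh they'll come, they'll come all right.",
    "Here. Stamp those and mail them.",
    "It's ringing.",
    "Would you walk Myron the other way?",
]

counts = word_counts(rows)
for word, n in sorted(counts.items()):
    print(word, n)
```

Because the apostrophe stays in the character class, they'll and It's survive as single words, while the leading hyphen in -Do is stripped.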
If you want to exclude a list of words (for example, the words in the range J1:J20), you can try
=ArrayFormula(query(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(J1:J20))&"\b' group by Col1 order by Count(Col1) desc label Count(Col1)''"))
Alternatively, you can add the exclusion list to the regular expression pattern itself...
=query(ArrayFormula(transpose(split(query(regexreplace(A1:A5, "[^A-Za-z\s/']|\b((?i)the|oh|or|and)\b" ,""),,50000)," "))), "Select Col1, Count(Col1) where Col1 <>'' group by Col1 order by Count(Col1) desc label Count(Col1)''")
UPDATED:
=ArrayFormula(substitute(query(transpose(split(query(regexreplace(substitute(C11:C, char(39), "_"), "[^A-Za-z\s_]" ,""),,50000)," ")), "Select Col1, Count(Col1) where not UPPER(Col1) matches '\b"&textjoin("|", 1, UPPER(substitute(G11:G,char(39),"_")))&"\b' group by Col1 order by Count(Col1) desc label Count(Col1)''", 0), "_", char(39)))
Or, using a different approach
=query(filter(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",), isna(match(upper(regexreplace(transpose(split(query(regexreplace(C11:C, "[^A-Za-z\s'-]" ,""),,50000)," ")), "^-",)), upper(filter(G11:G, len(G11:G))),0))), "Select Col1, count(Col1) group by Col1 order by count(Col1) desc label count(Col1)''", 0)
Try:
=ARRAYFORMULA(QUERY(TRANSPOSE(SPLIT(REGEXREPLACE(
TEXTJOIN(" ", 1, LOWER(A:A)), "\.|\,|\?", ), " ")),
"select Col1,count(Col1)
group by Col1
order by count(Col1) desc
label count(Col1)''", 0))
Or:
=ARRAYFORMULA(QUERY(TRANSPOSE(SPLIT(REGEXREPLACE(
QUERY(LOWER(A:A),,999^99), "[^a-z0-9а-я ]", ), " ")),
"select Col1,count(Col1)
group by Col1
order by count(Col1) desc
label count(Col1)''", 0))
UPDATE:
=ARRAYFORMULA(QUERY(TRANSPOSE(SPLIT(REGEXREPLACE(
QUERY(LOWER(A:A),,999^99), "[^a-z0-9 ]", ), " ")),
"select Col1,count(Col1)
where not Col1 matches 'the|and|i|you|its'
group by Col1
order by count(Col1) desc
label count(Col1)''", 0))
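To make the effect of that last variant easier to check, here is a minimal Python sketch of the same idea: lowercase everything, strip anything that is not a-z, 0-9 or a space, drop the excluded words, and sort by count in descending order. The name top_words and the sample rows are assumptions for illustration. The exclusion is done as a whole-word comparison, which is how I understand the matches operator in QUERY to behave.

```python
import re
from collections import Counter

def top_words(rows, stop_words):
    """Lowercase, strip non-alphanumerics, drop stop words, sort by count descending."""
    text = " ".join(row.lower() for row in rows)
    # Same class as the formula: anything that is not a-z, 0-9 or a space is removed.
    text = re.sub(r"[^a-z0-9 ]", "", text)
    stop = set(stop_words)
    # Keep only words that are not exactly equal to an excluded word.
    counts = Counter(w for w in text.split() if w not in stop)
    # most_common() returns (word, count) pairs, highest count first.
    return counts.most_common()

# Sample rows from the question (assumed to be stored without the "R1:" labels).
rows = [
    "-Do you think they'll come, sir?",
    "-Oh they'll come, they'll come all right.",
    "Here. Stamp those and mail them.",
    "It's ringing.",
    "Would you walk Myron the other way?",
]

result = top_words(rows, ["the", "and", "i", "you", "its"])
for word, n in result:
    print(word, n)
```

One thing to watch with this character class: apostrophes are removed too, so they'll becomes theyll and it's becomes its, which then falls under the exclusion list.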
You can use Mid, RegexReplace, Query, Split, and so on, like this:
=query(
  transpose(
    split(
      regexreplace(textjoin(" ", true, filter(mid(A11:A, 4, len(A11:A)), A11:A<>"")), "[>,.?/!-]", " "),
      " ", true, true
    )
  ),
  "Select Col1, Count(Col1) group by Col1 label Col1 'Column2', Count(Col1) 'Column3'"
)
Or, if the rows do not have the R1: ~ R5: prefix, use this instead:
=query(
  transpose(
    split(
      regexreplace(textjoin(" ", true, filter(A11:A, A11:A<>"")), "[>,.?/!-]", " "),
      " ", true, true
    )
  ),
  "Select Col1, Count(Col1) group by Col1 label Col1 'Column2', Count(Col1) 'Column3'"
)