Text mining and extracting words
I am trying to extract table names from SQL statements in R. For example, I import SQL queries into R, and one row will contain:
SELECT A , B
FROM Table.1 p
JOIN Table.2 pv
ON p.ProdID.1 = ProdID.1
JOIN Table.3 v
ON pv.BusID.1 = v.BusID
WHERE SubID = 15
ORDER BY v.Name;
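In R, I have been trying to use strsplit on the SQL statement to split each word into its own column and build a data frame, then find the entry that matches the word "from" and extract the next word, Table.1.
I am having trouble working out how to extract the other tables from the multiple joins. Is there a more efficient approach, or a package that I did not come across during my research? Any help would be greatly appreciated!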
Here is one way to do it using regular expressions:
lines <- strsplit("SELECT A, B
FROM Table.1 p
JOIN Table.2 pv
ON p.ProdID.1 = ProdID.1
JOIN Table.3 v
ON pv.BusID.1 = v.BusID
WHERE SubID = 15
ORDER BY v.Name;", split = "\n")[[1]]
sub(".*(FROM|JOIN) ([^ ]+).*", "\\2", lines[grep("(FROM|JOIN)", lines)]) # "Table.1" "Table.2" "Table.3"
Breakdown:
# Use grep to find the indices of any line containing 'FROM' or 'JOIN'
keywords_regex <- "(FROM|JOIN)"
line_indices <- grep(keywords_regex, lines) # gives: 2 3 5
table_lines <- lines[line_indices] # get just the lines that have table names
# Build a regular expression that captures the next word after either keyword
table_name_regex <- paste0(".*", keywords_regex, " ([^ ]+).*")
# The "\\2" means to replace each match with the contents of the second capture
# group, where a capture group is defined by parentheses in the regex
sub(table_name_regex, "\\2", table_lines)
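If the query arrives as one string rather than as separate lines, the same idea can be sketched with gregexpr and regmatches, which return every match instead of one per line. This is only a sketch under a few assumptions: the helper name extract_tables is made up for illustration, keywords are matched case-insensitively, and a table name is taken to be the run of non-whitespace characters after FROM or JOIN.

```r
# Hypothetical helper (not part of the answer above): return every table name
# that follows FROM or JOIN in a query held in a single string.
extract_tables <- function(sql) {
  # Keyword as a whole word, then whitespace (newlines included), then the
  # run of non-whitespace characters that follows it.
  pattern <- "\\b(FROM|JOIN)\\s+(\\S+)"
  # gregexpr finds all matches; regmatches pulls out the matched substrings,
  # e.g. "FROM Table.1"
  hits <- regmatches(sql, gregexpr(pattern, sql, ignore.case = TRUE, perl = TRUE))[[1]]
  # Keep only the second capture group, i.e. the table name
  sub(pattern, "\\2", hits, ignore.case = TRUE, perl = TRUE)
}

query <- "SELECT A, B
FROM Table.1 p
JOIN Table.2 pv
ON p.ProdID.1 = ProdID.1
JOIN Table.3 v
ON pv.BusID.1 = v.BusID
WHERE SubID = 15
ORDER BY v.Name;"

extract_tables(query) # "Table.1" "Table.2" "Table.3"
```

Because this looks only at the token right after the keyword, it will not cope with subqueries (FROM followed by an opening parenthesis) or with comma-separated table lists; a real SQL parser would be needed for those.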