How to Return the Stock Ticker for a Company Name in R
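I'm working on a project for which I need to scrape https://www.sec.gov/divisions/enforce/friactions/friactions2017.shtml.
Basically, I have compiled a list of SEC AAER releases, which ends up being a list of private and public companies. What I need to do is return each company's ticker. Any ideas for an R package that would be useful for this?
For example, I would want "PCRFY" returned for Panasonic Corporation. However, this could be a problem: KPMG has two listings, one is "KPMG" and the other is "KPMG Inc." How can I make sure both queries return a result?
An example of the call would be:
returnTicker(c("Panasonic Corporation", "Apple Corporation"))
which would return:
c("PCRFY", "AAPL")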
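Hopefully this is close to what you need. It does not use fuzzy matching, but it should give comparable results.
Partly adapted from another answer.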
# The TTR package includes stock symbols and names for NASDAQ, NYSE, and AMEX
library(TTR)
master <- TTR::stockSymbols()[,c('Name', 'Symbol')]
# We are going to clean up the company names by removing some unimportant words.
# Replace the words ' Incorporated', ' Corporated', and ' Corporation' with '' (no text), and put results in master$clean.
master <- cbind(master, clean = gsub(' Incorporated| Corporated| Corporation', '', master$Name))
# Some further cleaning of the master$clean column (the vertical bar | separates the alternatives we are removing).
# The \\b word boundary keeps ' Inc' from matching inside words such as 'Income'; the optional comma and period are removed too.
master$clean <- gsub(',? (Inc|Corp|Ltd)\\b\\.?', '', master$clean, perl = TRUE)
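# Clean some special characters. For explanations, check out http://www.endmemo.com/program/R/gsub.php
# Backslashes must be doubled inside an R string, otherwise '\(' is rejected as an unrecognized escape.
master$clean <- gsub("\\(The\\)|[.]|'|,", "", master$clean)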
# You should also apply the 3 cleaning steps above to your own company names.
# Lastly, scroll through your data; you may find some more character strings to remove.
# Create a data frame which would contain your company names....
yourCompanyNames <- data.frame(name = c('apple', 'microsoft', 'allstate', 'ramp capital'), stringsAsFactors = FALSE)
# This is the important part. Symbols are added to the data frame of yourCompanyNames....
yourCompanyNames$sym <- sapply(X = yourCompanyNames$name, FUN = function(YOUR.NAME) {
  master[grep(pattern = YOUR.NAME, x = master$clean, ignore.case = TRUE), 'Symbol']
})
# ------------ END ---------------
# I don't know how much R experience you have, so here is a quick explanation of what is happening, chunk by chunk...
# yourCompanyNames$sym <-
#     Create a new column in your data frame for the symbols we will be finding.
# sapply(X = yourCompanyNames$name, FUN = function(YOUR.NAME) {
#     sapply() applies a function (found on the next line) to your data (X).
# master[grep(
#     grep() searches for a string in a vector of strings and returns the indices where it is found. For example...
#     grep('hel', c('hello', 'world', 'help')) returns 1 and 3
# pattern = YOUR.NAME, x = master$clean, ignore.case = TRUE),
#     The pattern which grep() is looking for is YOUR.NAME, which is an individual company name from yourCompanyNames.
#     (Remember, we are moving through yourCompanyNames one by one.)
#     grep() looks for YOUR.NAME in each of the strings in master$clean, and ignores capitalization of the strings.
# 'Symbol'] })
#     We can simplify the second line to master[grep(), 'Symbol']
#     Since grep() returns the indices where YOUR.NAME is found in master$clean,
#     the second line gives us the symbols for the companies located at those indices (rows).
#     Finally, sapply() returns the list of symbols we found, and the list is added to yourCompanyNames$sym
# Using the 4 example companies from above, we get....
#   name          sym
# 1 apple         AAPL, APLE, DPS, MLP
# 2 microsoft     MSFT
# 3 allstate      ALL, ALL-PA, ALL-PB, ALL-PC, ALL-PD, ALL-PE, ALL-PF, ALL-PG
# 4 ramp capital
# The word 'apple' appeared in multiple names, and 'allstate' has multiple tickers.
# You may need to clean some of them up using fix(yourCompanyNames)
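One way to cut down on the ambiguous matches, and to make "KPMG" and "KPMG Inc." resolve to the same result, is to run both the master list and your queries through one cleaning function and prefer an exact match on the cleaned name. The sketch below is only an illustration of that idea: clean_name() is a hypothetical helper, returnTicker() is the name used in the question rather than an existing function, and the small master table is made-up stand-in data so the code runs without downloading anything from TTR.

```r
# Hypothetical stand-in for TTR::stockSymbols(); the rows are illustrative only.
master <- data.frame(
  Name   = c("Apple Inc.", "Apple Hospitality REIT, Inc.",
             "Microsoft Corporation", "Allstate Corporation (The)"),
  Symbol = c("AAPL", "APLE", "MSFT", "ALL"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply the three cleaning steps to any character vector,
# then lower-case and trim so that comparisons ignore case and stray spaces.
clean_name <- function(x) {
  x <- gsub(" Incorporated| Corporated| Corporation", "", x)
  x <- gsub(",? (Inc|Corp|Ltd)\\b\\.?", "", x, perl = TRUE)
  x <- gsub("\\(The\\)|[.]|'|,", "", x)
  tolower(trimws(x))
}

master$clean <- clean_name(master$Name)

# Return one ticker per query: an exact match on the cleaned name wins;
# otherwise fall back to the first partial match, or NA if nothing matches.
returnTicker <- function(company_names) {
  vapply(clean_name(company_names), function(q) {
    exact <- master$Symbol[master$clean == q]
    if (length(exact) > 0) return(exact[1])
    # fixed = TRUE treats the query as literal text, not as a regular expression
    partial <- master$Symbol[grepl(q, master$clean, fixed = TRUE)]
    if (length(partial) > 0) partial[1] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

returnTicker(c("Apple", "Apple Inc.", "Microsoft Corporation", "Ramp Capital"))
```

Because "Apple" and "Apple Inc." clean to the same key, both queries return the same ticker, and a private company such as the 'ramp capital' example comes back as NA instead of an empty result. Taking the first partial match is an arbitrary choice for the sketch; in practice you may prefer to return all candidates and review them by hand.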
Hope this helps, or at least puts you on the right path.
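The answer above deliberately avoids fuzzy matching. If misspelled or slightly different names turn out to be a problem, a fuzzy fallback could be sketched with the edit distance from utils::adist(), which ships with base R. This is a swapped-in technique, not part of the original answer: fuzzy_ticker() and its max_dist threshold are hypothetical, and the master table is again made-up stand-in data with an already cleaned, lower-case name column.

```r
# Hypothetical stand-in for the cleaned master table built above.
master <- data.frame(
  clean  = c("apple", "microsoft", "allstate"),
  Symbol = c("AAPL", "MSFT", "ALL"),
  stringsAsFactors = FALSE
)

# Return the symbol of the closest cleaned name, or NA when even the
# closest name is more than max_dist single-character edits away.
fuzzy_ticker <- function(query, max_dist = 2) {
  d <- utils::adist(tolower(query), master$clean)  # 1 x n matrix of edit distances
  best <- which.min(d)
  if (d[best] <= max_dist) master$Symbol[best] else NA_character_
}

fuzzy_ticker("Microsft")   # one missing letter, still within max_dist
fuzzy_ticker("Panasonic")  # nothing close enough in this toy table
```

A small max_dist keeps false positives down but will miss longer variations, so the threshold is something to tune against your own list of names.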