Creating a list of company news stories and matching dates
I am trying to create a list that groups company stock tickers with news headlines and their corresponding dates.
The head of the data looks basically like this:
{'ford-motor-co': "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup
Rivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian
shares at a price of .88, the company says. That followed an 8-million-share
sale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in
EV maker Rivian for 8.2 million\nBy Reuters\xa0-\xa0May 14, 2022 (Reuters) -
Ford Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive
Inc for about 8.2 million, or .88 apiece, the U.S. automaker said in a filing
on Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh
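I have managed to extract the dates and the stock ticker, but I can't figure out how to group each date with its associated headline.
import re

parsed_data = []
for stock, stock_news_table in stock_news_tables.items():
    date_data = re.findall(r'[A-Z][a-z]{2} \d{1,2}, \d{4}', str(stock_news_table))
    headline = stock_news_table
    #print(date_data)
    parsed_data.append([stock, date_data, headline])
The current output looks like this. As you can see, the headlines are separated by runs of multiple newlines: \n\n\n\n .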
[['ford-motor-co',
['May 14, 2022',
'May 14, 2022',
'May 13, 2022',
'May 13, 2022',
'May 13, 2022',
'May 13, 2022',
'May 12, 2022',
'May 12, 2022',
'May 12, 2022'],
"\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup Rivian\nBy The
Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 million Rivian shares at a
price of .88, the company says. That followed an 8-million-share sale earlier
in the week at about the same price.\n\n\n\n\n \nFord sells shares in EV maker
Rivian for 8.2 million\nBy Reuters\xa0-\xa0May 14, 2022 (Reuters) - Ford
Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Aut
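I managed to solve your problem using dateparser, a natural-language date parser, and two different regular expressions. Hopefully that is sufficient.
First, install dateparser: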
pip install dateparser
Then run the code:
import collections, re, dateparser

Stock = collections.namedtuple("Stock", ["name", "symbol", "headlines"])
# Remember: without re.DOTALL, '.' does not match newlines, so '.+' is equivalent to '[^\n]+'
headline_re = re.compile(r"\n\n ?\n(?P<headline>.+)\nBy .+?\xa0-\xa0(?P<date>[\w ,]+)")
symbol_re = re.compile(r"\(([A-Z]{1,4}:[A-Z]{1,4})\)")
input_data = {'ford-motor-co': (
    "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup "
    "Rivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian "
    "shares at a price of .88, the company says. That followed an 8-million-share "
    "sale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in "
    "EV maker Rivian for 8.2 million\nBy Reuters\xa0-\xa0May 14, 2022 (Reuters) - "
    "Ford Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive "
    "Inc for about 8.2 million, or .88 apiece, the U.S. automaker said in a filing "
    "on Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh")}
stocks = []
for name, data in input_data.items():
    headlines = []
    for match in headline_re.finditer(data):
        date_str = match.group("date")
        date = dateparser.parse(date_str)
        headlines.append((match.group("headline"), date))
    symbol = symbol_re.search(data).group(1)
    stocks.append(Stock(name, symbol, headlines))
This yields (stocks):
[Stock(name='ford-motor-co', symbol='NYSE:F', headlines=[('Ford Unloads More Shares in Electric-Vehicle Startup Rivian', datetime.datetime(2022, 5, 14, 20, 58, 28, 30552)), ('Ford sells shares in EV maker Rivian for 8.2 million', datetime.datetime(2022, 5, 14, 0, 0))])]
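If installing dateparser is not an option, the two date formats that appear in this sample ("May 14, 2022" and "20 hours ago") could probably be handled with the standard library alone. The following is only a minimal sketch, not part of the code above: parse_news_date and RELATIVE_RE are hypothetical names, only minute/hour/day units are covered, and %b assumes English month abbreviations (the default C locale).

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern for relative dates such as "20 hours ago".
RELATIVE_RE = re.compile(r"(\d+)\s+(minute|hour|day)s?\s+ago")

def parse_news_date(text, now=None):
    """Return a datetime for 'May 14, 2022' or '20 hours ago', else None."""
    text = text.strip()
    now = now or datetime.now()
    match = RELATIVE_RE.fullmatch(text)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        # timedelta accepts minutes=, hours= and days= keyword arguments.
        return now - timedelta(**{unit + "s": amount})
    try:
        return datetime.strptime(text, "%b %d, %Y")
    except ValueError:
        # Unknown format: let the caller decide what to do.
        return None
```

Passing an explicit now makes the relative case reproducible, which helps when the page was scraped some time before it is parsed.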
Make sure the symbol regex is correct; I am not sure what constraints stock-exchange symbols follow.
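Note also that symbol_re.search(data) returns None when the text contains no "(EXCHANGE:TICKER)" marker, and calling .group(1) on None raises AttributeError. A small guard, sketched here with a hypothetical helper find_symbol and the same pattern as above, would avoid that:

```python
import re

# Same pattern as in the answer; adjust the length limits if your tickers differ.
symbol_re = re.compile(r"\(([A-Z]{1,4}:[A-Z]{1,4})\)")

def find_symbol(text, default=None):
    """Return the first 'EXCHANGE:TICKER' found in text, or default if there is none."""
    match = symbol_re.search(text)
    return match.group(1) if match else default
```

In the loop, symbol = find_symbol(data) would then leave symbol as None for companies whose stories never mention a ticker, instead of stopping the whole run.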
You can use re.split.
The documentation says: "If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
So if you use r'([A-Z][a-z]{2} \d{1,2}, \d{4})':
import re

stock = 'ford-motor-co'
stock_news_table = """\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup
Rivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian
shares at a price of .88, the company says. That followed an 8-million-share
sale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in
EV maker Rivian for 8.2 million\nBy Reuters\xa0-\xa0May 14, 2022 (Reuters) -
Ford Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive
Inc for about 8.2 million, or .88 apiece, the U.S. automaker said in a filing
on Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh"""
date_data = re.split(r'([A-Z][a-z]{2} \d{1,2}, \d{4})' , str(stock_news_table))
headline = stock_news_table
date_data
This returns:
['\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup \nRivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian \nshares at a price of .88, the company says. That followed an 8-million-share \nsale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in \nEV maker Rivian for 8.2 million\nBy Reuters\xa0-\xa0',
'May 14, 2022',
' (Reuters) - \nFord Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive \nInc for about 8.2 million, or .88 apiece, the U.S. automaker said in a filing \non Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh']
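Because the list alternates between text and dates, each date can be paired with the headline that closes the text chunk before it. The following is a rough sketch under some assumptions: pair_dates_with_headlines is a hypothetical helper, every story is assumed to have the shape "headline, newline, By <source> - <date>", and each headline is assumed to sit on a single line (as in the original data, not the wrapped copy above). Stories with relative dates such as "20 hours ago" are not matched by this pattern, and a date that appears inside an article body is skipped only because its preceding line does not start with "By ".

```python
import re

DATE_RE = r'([A-Z][a-z]{2} \d{1,2}, \d{4})'

def pair_dates_with_headlines(text):
    """Return (date, headline) tuples using re.split with a capturing group."""
    parts = re.split(DATE_RE, text)
    pairs = []
    # parts alternates: text, date, text, date, ..., text
    for i in range(1, len(parts), 2):
        # Non-empty lines of the chunk that precedes this date.
        lines = [ln.strip() for ln in parts[i - 1].split("\n") if ln.strip()]
        # The chunk should end with the byline; the headline is the line before it.
        if len(lines) >= 2 and lines[-1].startswith("By "):
            pairs.append((parts[i], lines[-2]))
    return pairs
```

For example, parsed_data.append([stock, pair_dates_with_headlines(stock_news_table)]) would store the dates and headlines together rather than as two separate fields.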