Creating a list of company news stories and matching dates

Question

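I am trying to create a list that groups company stock tickers with news headlines and their corresponding dates.

The head of the data looks basically like this: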

{'ford-motor-co': "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup 
Rivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian 
shares at a price of .88, the company says. That followed an 8-million-share 
sale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in 
EV maker Rivian for 8.2 million\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - 
Ford Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive 
Inc for about 8.2 million, or .88 apiece, the U.S. automaker said in a filing 
on Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh

I have managed to extract the dates and the stock tickers, but I don't know how to group each date with its associated news headline.

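import re

parsed_data = []

for stock, stock_news_table in stock_news_tables.items():
    # Collect every date of the form "May 14, 2022" found in the raw text
    date_data = re.findall(r'[A-Z][a-z]{2} \d{1,2}, \d{4}', str(stock_news_table))
    # The headline is still the whole, unsplit block of text
    headline = stock_news_table
    #print(date_data)
    parsed_data.append([stock, date_data, headline])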

The current output looks like this. As you can see, the headlines are separated by runs of consecutive newlines (\n\n\n\n):

 [['ford-motor-co',
  ['May 14, 2022',
   'May 14, 2022',
   'May 13, 2022',
   'May 13, 2022',
   'May 13, 2022',
   'May 13, 2022',
   'May 12, 2022',
   'May 12, 2022',
   'May 12, 2022'],
  "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup Rivian\nBy The 
   Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 million Rivian shares at a 
   price of .88, the company says. That followed an 8-million-share sale earlier 
   in the week at about the same price.\n\n\n\n\n \nFord sells shares in EV maker 
   Rivian for 8.2 million\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - Ford 
   Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Aut

Answer 1

I managed to solve your problem using dateparser, a natural-language date parser, and two different regular expressions. I hope that is enough.

First, install dateparser:

pip install dateparser

Then run the code:

import collections, re, dateparser
Stock = collections.namedtuple("Stock", ["name", "symbol", "headlines"])

# Remember, '.' does not match newlines by default, so '.+' is equivalent to '[^\n]+'
headline_re = re.compile(r"\n\n ?\n(?P<headline>.+)\nBy .+?\xa0-\xa0(?P<date>[\w ,]+)")
symbol_re = re.compile(r"\(([A-Z]{1,4}:[A-Z]{1,4})\)")
input_data = {'ford-motor-co':(
    "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup "
    "Rivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian "
    "shares at a price of .88, the company says. That followed an 8-million-share "
    "sale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in "
    "EV maker Rivian for 8.2 million\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - "
    "Ford Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive "
    "Inc for about 8.2 million, or .88 apiece, the U.S. automaker said in a filing "
    "on Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh")}

stocks = []
for name, data in input_data.items():
    headlines = []
    for match in headline_re.finditer(data):
        date_str = match.group("date")
        date = dateparser.parse(date_str)
        headlines.append((match.group("headline"), date))
    symbol = symbol_re.search(data).group(1)
    stocks.append(Stock(name, symbol, headlines))

This yields (stocks):

[Stock(name='ford-motor-co', symbol='NYSE:F', headlines=[('Ford Unloads More Shares in Electric-Vehicle Startup Rivian', datetime.datetime(2022, 5, 14, 20, 58, 28, 30552)), ('Ford sells shares in EV maker Rivian for 8.2 million', datetime.datetime(2022, 5, 14, 0, 0))])]

Make sure the symbol regex is correct, since I am not sure what constraints stock exchanges place on symbols.
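One caveat about the symbol lookup: re.search returns None when nothing matches, so symbol_re.search(data).group(1) raises AttributeError for any text that contains no (EXCHANGE:TICKER) pattern. A minimal sketch of a guard follows; find_symbol is a hypothetical helper name, not part of the code above.

```python
import re

# Same pattern as in the answer: "(EXCHANGE:TICKER)", 1-4 capitals on each side
symbol_re = re.compile(r"\(([A-Z]{1,4}:[A-Z]{1,4})\)")

def find_symbol(text, default=None):
    """Hypothetical helper: return the first 'EXCHANGE:TICKER' in text, or default."""
    match = symbol_re.search(text)
    return match.group(1) if match else default

print(find_symbol("Ford Motor (NYSE:F) Co sold 7 million shares"))
print(find_symbol("a story with no ticker in it"))
```

Note that the {1,4} bound on the exchange part does not match a six-letter prefix such as NASDAQ, so the bounds may need widening depending on the data.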

Answer 2

You can use re.split.

The documentation says: "If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."

So if you use r'([A-Z][a-z]{2} \d{1,2}, \d{4})':

import re

stock = 'ford-motor-co'
stock_news_table =  """\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup 
Rivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian 
shares at a price of .88, the company says. That followed an 8-million-share 
sale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in 
EV maker Rivian for 8.2 million\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - 
Ford Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive 
Inc for about 8.2 million, or .88 apiece, the U.S. automaker said in a filing 
on Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh"""
date_data = re.split(r'([A-Z][a-z]{2} \d{1,2}, \d{4})' , str(stock_news_table))
headline = stock_news_table
date_data

This returns:

['\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup \nRivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian \nshares at a price of .88, the company says. That followed an 8-million-share \nsale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in \nEV maker Rivian for 8.2 million\nBy Reuters\xa0-\xa0',
 'May 14, 2022',
 '  (Reuters) - \nFord Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive \nInc for about 8.2 million, or .88 apiece, the U.S. automaker said in a filing \non Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh']
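The list returned by re.split alternates between text chunks and dates, so each date can be paired with the chunk that precedes it. The following is only a sketch under some assumptions: each headline sits on a single line directly above a "By <source>" byline, and the second entry in the sample is made up for illustration. pair_headlines_with_dates is a hypothetical name.

```python
import re

date_re = re.compile(r'([A-Z][a-z]{2} \d{1,2}, \d{4})')

def pair_headlines_with_dates(text):
    """Hypothetical helper: return (headline, date) tuples found in text."""
    parts = date_re.split(text)  # [text, date, text, date, ..., text]
    pairs = []
    for before, date in zip(parts[0::2], parts[1::2]):
        # Non-empty lines of the chunk that precedes this date
        lines = [ln.strip() for ln in before.split("\n") if ln.strip()]
        # Keep the date only if it directly follows a byline; the line
        # above the byline is assumed to be the headline
        if len(lines) >= 2 and lines[-1].startswith("By "):
            pairs.append((lines[-2], date))
    return pairs

# The second entry is invented sample data, not taken from the question
sample = (
    "\n\n\n\nFord sells shares in EV maker Rivian\n"
    "By Reuters\xa0-\xa0May 14, 2022  (Reuters) - Ford Motor sold shares.\n"
    "\n\n\n\nExample second headline\n"
    "By Reuters\xa0-\xa0May 13, 2022  (Reuters) - Example body text."
)
print(pair_headlines_with_dates(sample))
```

Relative dates such as "20 hours ago" are not matched by this pattern, so those stories are skipped; dates mentioned inside an article body are ignored because they do not follow a byline.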
