Natural language processing - extracting data
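I need help processing unstructured data from day-trading/swing-trading/investment recommendations. The unstructured data is in a CSV file.
Below are 3 sample paragraphs from which the data needs to be extracted: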
Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. with an intra-day target price of Rs 338 . The current market price of Coal India Ltd. is 325.15 . Chandan Taparia recommended to keep stop loss at Rs 318 .
Kotak Securities Limited has a buy call on Engineers India Ltd. with a target price of Rs 335 .The current market price of Engineers India Ltd. is Rs 266.05 The analyst gave a year for Engineers India Ltd. price to reach the defined target. Engineers India enjoys a healthy market share in the Hydrocarbon consultancy segment. It enjoys a prolific relationship with few of the major oil & gas companies like HPCL, BPCL, ONGC and IOC. The company is well poised to benefit from a recovery in the infrastructure spending in the hydrocarbon sector.
Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a target price of Rs 1150 .The current market price of Ceat Ltd. is Rs 1199.6 The time period given by the analyst is 1-3 days when Ceat Ltd. price can reach the defined target. Kunal Bothra maintained stop loss at Rs 1240.
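Extracting the 4 pieces of information from these paragraphs is the challenge. Each recommendation is framed differently, but essentially contains:
- target price
- stop-loss price
- current price
- duration
Not all of this information is necessarily available in every recommendation, but every recommendation has at least a target price.
I tried using regular expressions without much success. Can anyone guide me on how to extract this information using nltk?
My code so far for cleaning the data: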
import pandas as pd
import re
# etanalysis_final.csv has 4 columns:
# 0th column has the date and time.
# 1st column has a simple hint like 'Sell Ceat Ltd. target Rs 1150 : Kunal Bothra,Sell Ceat Ltd. at a price target of Rs 1150 and a stoploss at Rs 1240 from entry point'; not all the hints are the same, but I can rely on it for the recommender, Buy or Sell, and which stock.
# 4th column has the detailed recommendation given.
df = pd.read_csv('etanalysis_final.csv', encoding='ISO-8859-1')
df.DATE = pd.to_datetime(df.DATE)
df.dropna(inplace=True)
df['RECBY'] = df['C1'].apply(lambda x: re.split(':|\x96', x)[-1].strip())
df['ACT'] = df['C1'].apply(lambda x: x.split()[0].strip())
df['STK'] = df['C1'].apply(lambda x: re.split(r'\.|\,|:| target| has| and|Buy|Sell| with', x)[1])
# Getting the target price - not always correct
df['TGT'] = df['C4'].apply(lambda x: re.findall(r'\d+.', x)[0])
# Getting the stop loss price - not always correct
df['STL'] = df['C4'].apply(lambda x: re.findall(r'\d+.\d+', x)[-1])
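This is a hard problem, because each of the 4 pieces of information can be phrased in several different ways. Here is a naive approach that may work, although it needs validation. I'll do the example for the target price, but you can extend it to any of the others: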
CONTEXT = 6

def is_float(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

def get_target_price(s):
    words = s.split()
    n = words.index('target')  # raises ValueError if 'target' is absent
    # Clamp the start so a negative index does not wrap around the list
    words_in_range = words[max(0, n - CONTEXT):n + CONTEXT]
    return float(list(filter(is_float, words_in_range))[0])  # returns the first float found in the window
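This is a simple way to get you started, but you can add extra checks to make it safer. Things to improve:
- Make sure the token just before the proposed float is Rs.
- If no float is found within the context range, expand the context.
- If there is ambiguity, i.e. multiple instances of target or multiple floats within the context range, add user validation.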
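The three improvements listed above could be sketched roughly as follows. This is only an illustration, not a tested solution: the function name `price_candidates`, the regex tokenizer, the `require_rs` flag and the `max_context` limit are all assumptions made for the sketch. It returns every candidate price, closest to the keyword first, so the caller can ask for manual verification when more than one comes back.

```python
import re

# Numbers such as 338 or 1199.6, or words such as intra-day; punctuation is dropped.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+(?:[-'][A-Za-z]+)*")

def price_candidates(text, keyword="target", context=6, max_context=48, require_rs=True):
    """Return candidate prices near `keyword`, closest first ([] if none).

    Hypothetical helper: `require_rs` and `max_context` are assumptions,
    not part of the approach described above.
    """
    words = TOKEN.findall(text)
    lowered = [w.lower() for w in words]
    if keyword not in lowered:
        return []
    n = lowered.index(keyword)  # only the first occurrence is considered
    while context <= max_context:
        lo, hi = max(0, n - context), min(len(words), n + context + 1)
        found = []
        for i in range(lo, hi):
            is_number = words[i][0].isdigit()
            after_rs = i > 0 and lowered[i - 1] == "rs"
            if is_number and (after_rs or not require_rs):
                found.append((abs(i - n), float(words[i])))
        if found:
            return [price for _, price in sorted(found)]
        context *= 2  # nothing found: widen the window and retry
    return []

sample = ("Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. "
          "with an intra-day target price of Rs 338 . The current market price "
          "of Coal India Ltd. is 325.15 . Chandan Taparia recommended to keep "
          "stop loss at Rs 318 .")

prices = price_candidates(sample, "target")
if len(prices) > 1:
    print("ambiguous, please verify:", prices)
print(prices)  # [338.0]
```

Because the window still looks on both sides of the keyword, a neighbouring field's price can be picked up: searching this sample for 'current' returns the target's 338.0, since the actual current price has no Rs in front of it. Restricting the search to tokens after the keyword may be worth trying.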
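I got the solution:
The code here contains only the solution part of the problem. This solution could be improved considerably by using the fuzzywuzzy library.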
from nltk import word_tokenize

periods = ['year', "year's", 'day', 'days', "day's", 'month', "month's", 'week', "week's", 'intra-day', 'intraday']
stop = ['target', 'current', 'stop', 'period', 'stoploss']

def isfloat(w):
    # Helper used below but not defined in the original snippet
    try:
        float(w)
        return True
    except ValueError:
        return False

def extractinfo(row):
    if 'intra day' in row.lower():
        row = row.lower().replace('intra day', 'intra-day')
    # Keep only keywords, period words and numbers. Period words must be kept,
    # otherwise the duration checks below can never match anything.
    tks = [w for w in word_tokenize(row) if any([w.lower() in stop, w.lower() in periods, isfloat(w)])]
    tgt = ''
    crt = ''
    stp = ''
    prd = ''
    # For each keyword, take the token that immediately follows it
    if 'target' in tks:
        if len(tks[tks.index('target'):tks.index('target')+2]) == 2:
            tgt = tks[tks.index('target'):tks.index('target')+2][-1]
    if 'current' in tks:
        if len(tks[tks.index('current'):tks.index('current')+2]) == 2:
            crt = tks[tks.index('current'):tks.index('current')+2][-1]
    if 'stop' in tks:
        if len(tks[tks.index('stop'):tks.index('stop')+2]) == 2:
            stp = tks[tks.index('stop'):tks.index('stop')+2][-1]
    prdd = set(periods).intersection(tks)
    if 'period' in tks:
        pdd = tks[tks.index('period'):tks.index('period')+3]
        # Keep only the number and/or unit that follow 'period', e.g. '1 year'
        prd = ' '.join(t for t in pdd[1:] if t in periods or isfloat(t))
    elif len(prdd) > 0:
        prd = list(prdd)[0]
    return (crt, tgt, stp, prd)
The solution is fairly self-explanatory; if not, let me know.
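The fuzzywuzzy improvement mentioned above is not spelled out. One possible reading is that keyword variants or typos (for example 'stop-loss' or 'traget') get mapped to a canonical keyword before the lookup. The sketch below illustrates that idea with `difflib.get_close_matches` from the standard library as a stand-in, not with fuzzywuzzy itself. The `KEYWORDS` list, the function name `normalise_keyword` and the 0.8 cutoff are assumptions to be tuned on real data.

```python
from difflib import get_close_matches

# Hypothetical canonical keywords; the cutoff below is an assumed value.
KEYWORDS = ["target", "current", "stop", "stoploss", "period"]

def normalise_keyword(token, cutoff=0.8):
    """Map a possibly misspelled token to a canonical keyword, or None."""
    matches = get_close_matches(token.lower(), KEYWORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for token in ["Traget", "stop-loss", "price"]:
    print(token, "->", normalise_keyword(token))
```

Tokens that come back as None are simply not keywords, so they can be filtered out in the same way as before.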