Creating bigrams from a string using regex
I have a string like this:
"[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
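It is taken from an Excel file. It looks like an array, but because it was extracted from a file it is just a string.
What I need to do is:
a) Remove the [ ]
b) Split the string on , so that it actually becomes a new list
c) Take only the first string, i.e. u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'
d) Create the bigrams of the resulting string as an actual string separated by spaces (rather than as a list of bigrams):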
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to *extend*to~prepc_according_to+expectancy~-nsubj expectancy~-nsubj+is~parataxis is~parataxis+NUMBER~nsubj NUMBER~nsubj+NUMBER_SLOT
Here is the code snippet I have been working on:
text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
text = re.sub('^\[(.*)\]',"",text)
text = [text.split(",")[0]]
bigrams = [b for l in text for b in zip(l.split("+")[:-1], l.split("+")[1:])]
bigrams = [("+").join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = (' ').join(map(str, bigrams))
bigrams = ('').join(bigrams)
However, my regex seems to return nothing.
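A note on why the snippet above yields nothing: re.sub replaces the whole match, not just the brackets, and here the pattern ^\[(.*)\] matches the entire string, so substituting "" leaves an empty string. A minimal sketch, using a shortened stand-in string (an assumption for brevity, not the original data), shows the difference a backreference to the captured group makes:

```python
import re

# Shortened stand-in for the Excel cell contents (hypothetical sample data).
text = "[u'A+B+C', u'A+B+C']"

# The pattern spans the whole string, so replacing the match with ""
# removes everything.
emptied = re.sub(r'^\[(.*)\]', "", text)

# Replacing the match with the captured group keeps what was inside the brackets.
inner = re.sub(r'^\[(.*)\]$', r'\1', text)

print(repr(emptied))
print(inner)
```

The second call leaves the quoted, comma-separated items in place, ready to be split.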
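I have solved it. The regex needs to be applied in two passes: first to remove the brackets, then, after taking the first string, to remove the quotes:
import re

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
# First pass: strip the leading "[u" and the trailing "]".
text = re.sub(r'\[u|\]', "", text)
# Keep only the first list element, then strip its quotes (second pass).
text = text.split(",")[0]
text = re.sub(r'\'', "", text)
tokens = text.split("+")
# n tokens give n - 1 adjacent pairs; range(len(tokens) - 2) would drop the last one.
bigrams = [tokens[i:i+2] for i in range(len(tokens) - 1)]
bigrams = ' '.join("+".join(bigram) for bigram in bigrams)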
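Your string looks like a Python list of unicode strings, right?
You can evaluate it to get a list of unicode strings. A good way to do that is the ast.literal_eval function from the ast module.
Simply write: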
text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'," \
" u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
import ast
lines = ast.literal_eval(text)
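For context on why ast.literal_eval is a reasonable choice over plain eval here: it accepts only Python literals (strings, numbers, lists and so on) and raises an exception for anything else. A small sketch, with a shortened made-up cell value and assuming Python 3:

```python
import ast

# Shortened, made-up cell value in the same shape as the real one.
cell = "[u'A+B', u'C+D']"
parsed = ast.literal_eval(cell)
print(parsed)

# Anything that is not a plain literal is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```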
The result is a list of unicode strings:
for line in lines:
    print(line)
You will get:
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT
To compute the bigrams:
# Pair each token with its successor, for every string in the list.
bigrams = [b for l in lines for b in zip(l.split("+")[:-1], l.split("+")[1:])]
# Join each pair with "+", then join the pairs with spaces.
bigrams = ' '.join("+".join(bigram) for bigram in bigrams)
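Note that the comprehension above builds bigrams for every string in the list. If only the first string is wanted, as in the question, a sketch along these lines may do; the helper name first_bigrams is made up for illustration, and the code assumes Python 3:

```python
import ast


def first_bigrams(cell, sep="+"):
    """Return the space-separated bigrams of the first string in a list literal."""
    lines = ast.literal_eval(cell)      # parse the text into a real list
    tokens = lines[0].split(sep)        # keep only the first string
    pairs = zip(tokens, tokens[1:])     # each token with its successor
    return " ".join(sep.join(pair) for pair in pairs)


text = ("[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT',"
        " u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']")
print(first_bigrams(text))
```

Zipping the token list with itself shifted by one avoids index arithmetic, so there is no off-by-one to get wrong.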