Creating bigrams from a string using regex

I have a string like this:

"[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"

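It comes from an Excel file. It looks like a list, but because it was read from a file, it is just a string.

What I need to do is:

a) Remove the [ ]

b) Split the string on , so that I actually get a new list

c) Take only the first string, i.e. u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'

d) Build the bigrams of the resulting string and output them as one actual string separated by spaces (rather than as a list of bigrams):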

LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to *extend*to~prepc_according_to+expectancy~-nsubj expectancy~-nsubj+is~parataxis is~parataxis+NUMBER~nsubj NUMBER~nsubj+NUMBER_SLOT

This is the code snippet I have been working on:

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
text = re.sub('^\[(.*)\]',"",text)
text = [text.split(",")[0]]
bigrams = [b for l in text for b in zip(l.split("+")[:-1], l.split("+")[1:])]
bigrams = [("+").join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = (' ').join(map(str, bigrams))
bigrams = ('').join(bigrams)

However, my regex seems to return nothing.

I have solved it. The regex needs two passes: first remove the brackets, then take the first string, then remove the quotes:

import re

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
# First pass: strip the leading "[u" and the closing "]"
text = re.sub(r'\[u|\]', "", text)
# Keep only the first list element
text = text.split(",")[0]
# Second pass: drop the quotes
text = re.sub(r'\'', "", text)
text = text.split("+")
# n tokens yield n-1 bigrams, so the last start index is len(text)-2
bigrams = [text[i:i+2] for i in range(len(text)-1)]
bigrams = ["+".join(bigram) for bigram in bigrams]
bigrams = ' '.join(bigrams)
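
As far as I can tell, the first attempt returned nothing because the pattern '^\[(.*)\]' matches the whole string, brackets included, and re.sub then replaces that entire match with the empty string. A minimal sketch of the difference, using a shortened stand-in string (sample is made up for illustration, it is not the real data): replacing with the backreference \1 keeps what was inside the brackets.

```python
import re

# Shortened, hypothetical stand-in for the real string
sample = "[u'a+b+c', u'a+b+c']"

# The pattern matches the entire string, so everything is replaced by ""
emptied = re.sub(r'^\[(.*)\]', "", sample)

# Substituting the captured group keeps the text between the brackets
inner = re.sub(r'^\[(.*)\]', r"\1", sample)

print(repr(emptied))
print(repr(inner))
```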

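Your string looks like a Python list of unicode strings, right?

You can evaluate it to get that list of unicode strings. A good way to do so is the ast.literal_eval function from the ast module.

Simply write: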

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'," \
       " u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"

import ast

lines = ast.literal_eval(text)

The result is a list of unicode strings:

for line in lines:
    print(line)

You will get:

LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT    
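
One reason ast.literal_eval is a reasonable choice for text read from a file is that it only accepts Python literals (strings, numbers, lists, and so on) and raises an error for anything else, instead of executing it the way eval would. A rough sketch, where parse_list_literal is a hypothetical helper name and returning None on failure is just one possible convention:

```python
import ast

def parse_list_literal(raw):
    """Return the list encoded in raw, or None if raw is not a list literal."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Raised for expressions that are not plain literals or are malformed
        return None
    return value if isinstance(value, list) else None

print(parse_list_literal("[u'a+b', u'c+d']"))
print(parse_list_literal("__import__('os').getcwd()"))
```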

Compute the bigrams:

# Pair each "+"-separated token with its successor, for every line
bigrams = [b for l in lines for b in zip(l.split("+")[:-1], l.split("+")[1:])]
# Re-join each pair with "+", then join all bigrams with spaces
bigrams = ["+".join(bigram) for bigram in bigrams]
bigrams = ' '.join(bigrams)
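
Since the question only needs the first string, the steps above could be wrapped up roughly as follows. This is a sketch, not the only way to do it: first_line_bigrams and its sep parameter are names I made up, and it assumes the input is always a non-empty list literal.

```python
import ast

def first_line_bigrams(raw, sep="+"):
    """Parse a stringified list, take its first element, and return
    its bigrams as one space-separated string."""
    lines = ast.literal_eval(raw)
    tokens = lines[0].split(sep)
    # zip(tokens, tokens[1:]) pairs every token with the one after it
    return " ".join(sep.join(pair) for pair in zip(tokens, tokens[1:]))

text = ("[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT',"
        " u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']")

result = first_line_bigrams(text)
print(result)
```

Using zip on the token list and the same list shifted by one avoids index arithmetic, so there is no range bound to get wrong.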