Split function when writing an opened file in Python

Question

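I have a program in which I am supposed to take an external file, open it in Python, and then separate every word and every punctuation mark, including commas, apostrophes, and full stops. I am then supposed to save the file as the integer positions at which each word and punctuation mark occurs in the text.

For example: I like to code, because to code is fun. A computer's skeleton.

In my program, I have to save this as: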

1,2,3,4,5,6,3,4,7,8,9,10,11,12,13,14

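(For anyone who does not follow:) 1-I, 2-like, 3-to, 4-code, 5-(,), 6-because, 7-is, 8-fun, 9-(.), 10-A, 11-computer, 12-('), 13-s, 14-skeleton

This shows the position of each word; when a word is repeated, the position of its first occurrence is shown again.

Sorry for the long explanation, but here is my actual problem. So far I have done this: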

    file = open('newfiles.txt', 'r')
    with open('newfiles.txt','r') as file:
        for line in file:
            for word in line.split():
                 print(word)

The result is as follows:

  They
  say
  it's
  a
  dog's
  life,.....

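Unfortunately, splitting the file this way does not separate the words from the punctuation, and it does not print horizontally. .split does not work on a file object. Does anyone know a more effective way to split the file so that words are separated from punctuation, and then store the separated words and punctuation marks together in one list?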

Answer 1

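The built-in string method .split can only use simple delimiters. With no arguments, it just splits on whitespace. For more complex splitting behavior, the simplest approach is to use a regular expression: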

>>> s = "I like to code, because to code is fun. A computer's skeleton."
>>> import re
>>> delim = re.compile(r"""\s|([,.;':"])""")
>>> tokens = filter(None, delim.split(s))
>>> idx = {}
>>> result = []
>>> i = 1
>>> for token in tokens:
...     if token in idx:
...         result.append(idx[token])
...     else:
...         result.append(i)
...         idx[token] = i
...         i += 1
...
>>> result
[1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]
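
As a small illustration of why the filter(None, ...) step is there: when the pattern contains a capturing group, re.split also returns whatever the group captured. That is None whenever the whitespace branch matched, and empty strings can appear between adjacent delimiters. The short sample string below is made up for the demonstration.

```python
import re

# Same pattern as in the session above.
delim = re.compile(r"""\s|([,.;':"])""")

# The raw split result still contains None and '' entries.
raw = delim.split("code, fun.")
print(raw)

# filter(None, ...) drops every falsy item, leaving only real tokens.
tokens = list(filter(None, raw))
print(tokens)
```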

Also, I don't think your specification requires you to iterate over the file line by line. You should do something like:

with open('my file.txt') as f:
    s = f.read()

This puts the whole file into s as a single string. Note that I did not call open before the with statement; that extra call serves no purpose.
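
Putting the pieces together, a minimal sketch of the whole task might look like the following. The function names tokenize, first_positions and convert_file are illustrative assumptions, not anything from the question, and the output is written as one comma-separated line because that is how the question shows it.

```python
import re

# Same delimiter as in the session above: split on whitespace, and capture
# punctuation so that each mark is kept as its own token.
DELIM = re.compile(r"""\s|([,.;':"])""")


def tokenize(text):
    """Return the words and punctuation marks of text, in order."""
    # re.split yields None and '' around delimiter matches; drop those.
    return [tok for tok in DELIM.split(text) if tok]


def first_positions(tokens):
    """Number distinct tokens 1, 2, 3, ... in order of first appearance."""
    idx = {}
    result = []
    for token in tokens:
        if token not in idx:
            idx[token] = len(idx) + 1
        result.append(idx[token])
    return result


def convert_file(src_path, dst_path):
    """Hypothetical helper: read src_path, write the positions to dst_path."""
    with open(src_path) as f:
        text = f.read()
    positions = first_positions(tokenize(text))
    with open(dst_path, "w") as f:
        f.write(",".join(str(p) for p in positions))
    return positions
```

For instance, convert_file('newfiles.txt', 'positions.txt') would read the file from the question and write the numbers to positions.txt, which is a made-up output name.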

Answer 2

Use a regular expression to capture the relevant substrings:

import re

my_string = "I like to code, because to code is fun. A computer's skeleton."
matched = re.findall(r"(\w+)([',.]?)", my_string)  # Split up relevant pieces of text

Filter out the empty matches and add the rest to the result:

result = []
for word, punc in matched:
    result.append(word)
    if punc: # Check if punctuation follows the word
        result.append(punc)

Then write the result to your file:

with open("file.txt", "w") as f:
    f.write("\n".join(result))  # Write pieces on separate lines (writelines adds no newlines itself)

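The regular expression works by matching a run of word characters (\w+) and then optionally capturing one punctuation mark (an apostrophe, comma, or period) that immediately follows it.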
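
One limitation of that pattern is that a punctuation mark is captured only when it directly follows a word, and only if it is an apostrophe, comma, or period. As a possible variation, not part of the answer above, a single alternation can match either a run of word characters or any one character that is neither a word character nor whitespace. The names TOKEN and split_tokens are assumptions for illustration.

```python
import re

# \w+      a run of word characters (letters, digits, underscore)
# [^\w\s]  any single character that is neither a word character nor whitespace
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_tokens(text):
    """Return words and punctuation marks as separate list items."""
    return TOKEN.findall(text)


print(split_tokens("I like to code, because to code is fun. A computer's skeleton."))
```

Because there are no capturing groups here, findall returns a flat list of strings, so no second pass is needed to drop empty matches.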

Answer 3

You can solve this with a regular expression and split. Hopefully this points you in the right direction. Good luck!

import re
str1 = '''I like to code, because to code is fun. A computer's skeleton.'''

# Split the string into a list using a regex with a capturing group,
# so that the words themselves are kept in the result:
matches = [x.strip() for x in re.split(r"([a-zA-Z]+)", str1) if x not in ['', ' ']]
print(matches)
d = {}
i = 1
list_with_positions = []

# Now build the dictionary entries:
for match in matches:
    if match not in d:
        d[match] = i
        i += 1
    list_with_positions.append(d[match])

print(list_with_positions)

Here is the output. Note that the final period gets position #9:

['I', 'like', 'to', 'code', ',', 'because', 'to', 'code', 'is', 'fun', '.', 'A', 'computer', "'", 's', 'skeleton', '.']

[1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]
