How to strip a txt file of multiple things?

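I'm writing a function that reads data from a txt file; the text file has one sentence per line. I have 6 requirements for stripping the file so that it is usable later in my program: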

 1. Make everything lowercase
 2. Split the line into words
 3. Remove all punctuation, such as “,”, “.”, “!”, etc.
 4. Remove apostrophes and hyphens, e.g. transform “can’t” into “cant” and 
 “first-born” into “firstborn”
 5. Remove the words that are not all alphabetic characters (do not remove 
 “can’t” because you have transformed it to “cant”, similarly for 
 “firstborn”).
 6. Remove the words with less than 2 characters, like “a”. 

This is what I have so far...

def read_data(fp):
    file_dict={}
    fp=fp.lower
    fp=fp.strip(string.punctuation)
    lines=fp.readlines()

I'm a bit stuck, so how do I strip these 6 things from the file?

This can be done with a pair of regex substitutions, followed by a loop that drops every word shorter than 2 characters:

Code

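import re

with open("text.txt", "r") as fi:
    # Lowercase everything, then delete punctuation (including apostrophes
    # and hyphens). \s is kept so that line breaks still separate words.
    lowerFile = re.sub(r"[^\w\s]", "", fi.read().lower())
    # Delete every whitespace-delimited token that contains a non a-z character.
    lowerFile = re.sub(r"(^|\s)\S*[^a-z\s]\S*(?=$|\s)", "", lowerFile)
    # Split on whitespace and keep only words of at least 2 characters.
    words = [word for word in lowerFile.split() if len(word) >= 2]
    print(words)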

Input

I li6ke to swim, dance, and Run r8un88.

Output

['to', 'swim', 'dance', 'and', 'run']
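
If you would rather keep the words grouped per line and follow the six requirements in order, here is a rough sketch without regex. It is only one possible way to do it. The helper name clean_line and the _STRIP table are made up for illustration, read_data is assumed to receive an already opened text file, and "alphabetic" is assumed to mean ASCII letters only.

```python
import string

# Characters to delete: ASCII punctuation (which already includes ' and -)
# plus typographic quotes and dashes, which string.punctuation lacks.
# _STRIP is a hypothetical name used only in this sketch.
_STRIP = str.maketrans("", "", string.punctuation + "\u2018\u2019\u201c\u201d\u2013\u2014")


def clean_line(line):
    """Return the cleaned words of one line, applying the six steps in order."""
    words = line.lower().split()                  # steps 1 and 2
    words = [w.translate(_STRIP) for w in words]  # steps 3 and 4
    # Step 5 keeps words made only of ASCII letters (an assumption about what
    # "alphabetic" means here); step 6 drops words shorter than 2 characters.
    return [w for w in words if w.isascii() and w.isalpha() and len(w) >= 2]


def read_data(fp):
    """Assumes fp is an open text file (or any iterable of lines)."""
    return [clean_line(line) for line in fp]
```

For the sample input above, clean_line returns the same list as the regex version, and read_data gives one such list per line of the file.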