Python regex.split text + export each split as .txt, with each split word as the filename, to a specified folder

Question

Hi everyone! I'm learning Python and I'm trying to do several things with a text:

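1. Split the text with NLTK / regex.split
2. regex.split without empty results ('') and without a standalone '-', while keeping hyphenated words such as 'word-hyphen'
3. Export each split as a .txt file, with each split word as the filename, to a specified folder (using the step 2 result, without '' and standalone '-', so that no empty files are created)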

Step 1 is done:

# coding: utf-8
import nltk
s = "This sentence is in first place. This second sentence isn't in first place."
import regex
regex.split("[\s\.\,]", s)
['This', 'sentence', 'is', 'in', 'first', 'place', '', 'This', 'second', 'sentence', "isn't", 'in', 'first', 'place', '']

Steps 2 and 3 are what I'm still trying to do:

2. Don't return empty results ('') or a standalone '-', but keep 'word-hyphen'

What I have done so far for step 2:

# coding: utf-8
import nltk
s = "This sentence is in first place and contain a word-hyphen — Hello I am the second sentence and I'm in second place."
import regex
regex.split("[\s\.;!?…»,«\,]", s)
['This', 'sentence', 'is', 'in', 'first', 'place', 'and', 'contain', 'a', 'word-hyphen', '-', 'Hello', 'I', 'am', 'the', 'second', 'sentence', 'and', "I'm", 'in', 'second', 'place', '']

3. Export each split as a .txt file, with each split word as the filename, to a specified folder

Does anyone know how to do something like that?

Thanks for your help

Answer 1

You are not using nltk's regular expression engine. Maybe you want RegexpTokenizer?

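Since you don't use variables and you get this "automatic printing", I guess you are working at the command line or in IDLE. You will need variables for step 3, and at some point you will need a .py file as well. Let's start now; if I'm wrong, sorry.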

Since step 2 asks for no empty results, that suggests you already have a problem in step 1. Let's try RegexpTokenizer then:

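from nltk.tokenize import RegexpTokenizer
s = "This sentence is in first place. This second sentence isn't in first place."
# gaps=True: the pattern matches the separators, not the tokens
# raw string so that \s is passed to the regex engine unchanged
tokenizer = RegexpTokenizer(r"[\s\.\,]", gaps=True)
split = tokenizer.tokenize(s)
print(split)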

Output:

['This', 'sentence', 'is', 'in', 'first', 'place', 'This', 'second', 'sentence', "isn't", 'in', 'first', 'place']

No empty results here, so we're good.
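For comparison, roughly the same result can be had without NLTK. The sketch below assumes the standard-library `re` module, and `split_words` is a hypothetical helper name that is not in the original post. It simply filters out the empty strings that a plain split leaves behind when two separators are adjacent or when the text ends with one.

```python
import re

def split_words(text, pattern=r"[\s.,]"):
    """Split text on the separator pattern and drop empty strings (hypothetical helper)."""
    return [token for token in re.split(pattern, text) if token]

s = "This sentence is in first place. This second sentence isn't in first place."
print(split_words(s))
```

The list comprehension keeps only truthy tokens, which is what removes the '' entries seen in the question.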

For step 2, I don't understand your regex: just take the regex from step 1, "[\s\.\,]", and add the dash to it, "[\s\.\,—]":

from nltk.tokenize import RegexpTokenizer
s = "This sentence is in first place and contain a word-hyphen — Hello I am the second sentence and I'm in second place."
# the dash is now a separator; the hyphen inside 'word-hyphen' is not
tokenizer = RegexpTokenizer(r"[\s\.\,—]", gaps=True)
split = tokenizer.tokenize(s)
print(split)

Output:

['This', 'sentence', 'is', 'in', 'first', 'place', 'and', 'contain', 'a', 'word-hyphen', 'Hello', 'I', 'am', 'the', 'second', 'sentence', 'and', "I'm", 'in', 'second', 'place']
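If the standalone separator in the real text is an ordinary hyphen '-' rather than a long dash, adding it to the character class would also break 'word-hyphen' apart. One possible workaround, sketched here with the standard-library `re` module, is to split on whitespace and punctuation only and then discard tokens made up entirely of dash characters. The helper name `tokenize_keep_hyphenated` and the `DASHES` constant are assumptions for illustration, not part of the original answer.

```python
import re

# Hyphen-minus, en dash and em dash (assumed set of dash characters).
DASHES = "-\u2013\u2014"

def tokenize_keep_hyphenated(text):
    """Split on whitespace/punctuation, then drop empty and dash-only tokens."""
    tokens = re.split(r"[\s.,;!?]", text)
    # strip(DASHES) is empty for '' and for tokens made only of dashes,
    # but a hyphenated word keeps its inner hyphen and stays in the list.
    return [t for t in tokens if t.strip(DASHES)]

s = "This sentence contain a word-hyphen - Hello I'm in second place."
print(tokenize_keep_hyphenated(s))
```

The token itself is kept unchanged; `strip` is only used as a test, so 'word-hyphen' is not altered.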

For step 3, the simplest way should be something like this:

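import os.path
# raw string: without the r prefix, '\U' in 'C:\Users' is a syntax error in Python 3
path_to_files = r'C:\Users\username\Desktop\Split txt export'

for word in split:
    filename = word + '.txt'
    fullpath = os.path.join(path_to_files, filename)
    # one file per word; the word is both the filename and the content
    with open(fullpath, 'w', encoding='utf-8') as f:
        f.write(word)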
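The loop above assumes that the target folder already exists and that every word is a valid filename. A slightly more defensive variant could look like the sketch below. It uses `pathlib` from the standard library; `export_tokens` is a hypothetical helper name, and replacing the characters that Windows forbids in filenames with an underscore is an assumption about what is wanted, not something the question specifies.

```python
import re
import tempfile
from pathlib import Path

def export_tokens(tokens, folder):
    """Write each token to <folder>/<token>.txt and return the paths written (hypothetical helper)."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    written = []
    for token in tokens:
        # Replace characters that Windows does not allow in file names.
        safe_name = re.sub(r'[<>:"/\\|?*]', "_", token)
        if not safe_name:
            continue  # skip empty tokens so that no file named just '.txt' is created
        path = out_dir / f"{safe_name}.txt"
        path.write_text(token, encoding="utf-8")  # a repeated word overwrites its earlier file
        written.append(path)
    return written

# Usage with a temporary folder standing in for the real export directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = export_tokens(["This", "word-hyphen", "", "isn't"], tmp)
    print([p.name for p in paths])
```

Note that a word occurring several times in the text ends up as a single file, since each occurrence maps to the same filename.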
