Getting single letters instead of sentences after applying NLTK's sentence tokenizer in Python 3.5.1
import codecs, os
import re
import string
import mysql
import mysql.connector

y_ = ""

'''Searching and reading text files from a folder.'''
for root, dirs, files in os.walk("/Users/ultaman/Documents/PAN dataset/Pan Plagiarism dataset 2010/pan-plagiarism-corpus-2010/source-documents/test1"):
    for file in files:
        if file.endswith(".txt"):
            x_ = codecs.open(os.path.join(root, file), "r", "utf-8-sig")
            for lines in x_.readlines():
                y_ = y_ + lines

'''Tokenizing the sentences of the text file.'''
from nltk.tokenize import sent_tokenize
raw_docs = sent_tokenize(y_)
tokenized_docs = [sent_tokenize(y_) for sent in raw_docs]

'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = ''
for review in tokenized_docs:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    tokenized_docs_no_punctuation += (new_review)
print(tokenized_docs_no_punctuation)

'''Connecting and inserting tokenized documents without punctuation in database field.'''
def connect():
    for i in range(len(tokenized_docs_no_punctuation)):
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'test')
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""", (cursor.lastrowid, (tokenized_docs_no_punctuation[i])))
        conn.commit()
        conn.close()

if __name__ == '__main__':
    connect()
After running the above code, the result in the database looks like this:
|  2 | S | N |
|  3 | S | o |
|  4 | S |   |
|  5 | S | d |
|  6 | S | o |
|  7 | S | u |
|  8 | S | b |
|  9 | S | t |
| 10 | S |   |
| 11 | S | m |
| 12 | S | y |
| 13 | S |   |
| 14 | S | d |
It should be like:
1 | S | No doubt, my dear friend.
2 | S | no doubt.
I would suggest the following edits (use whichever you prefer); this is what I did to get your code running. Your problem is that review in "for review in tokenized_docs:" is already a string, so token in "for token in review:" ends up being a single character, as the quick check below shows.
tokenized_docs = ['"No doubt, my dear friend, no doubt; but in the meanwhile suppose we talk of this annuity.', 'Shall we say one thousand francs a year."', '"What!"', 'asked Bonelle, looking at him very fixedly.', '"My dear friend, I mistook; I meant two thousand francs per annum," hurriedly rejoined Ramin.', 'Monsieur Bonelle closed his eyes, and appeared to fall into a gentle slumber.', 'The mercer coughed;\nthe sick man never moved.', '"Monsieur Bonelle."']
'''Removing punctuation marks.'''
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    new_token = regex.sub(u'', review)
    if not new_token == u'':
        tokenized_docs_no_punctuation.append(new_token)
print(tokenized_docs_no_punctuation)
and got this:
['No doubt my dear friend no doubt but in the meanwhile suppose we talk of this annuity', 'Shall we say one thousand francs a year', 'What', 'asked Bonelle looking at him very fixedly', 'My dear friend I mistook I meant two thousand francs per annum hurriedly rejoined Ramin', 'Monsieur Bonelle closed his eyes and appeared to fall into a gentle slumber', 'The mercer coughed\nthe sick man never moved', 'Monsieur Bonelle']
The final format of the output is up to you. I prefer working with a list, but you could also join everything into a single string.
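For example, a one-line sketch of the join (assuming the tokenized_docs_no_punctuation list built above):
# Join the cleaned sentences into one space-separated string.
single_string = ' '.join(tokenized_docs_no_punctuation)
Keeping the list instead, the cleanup and the database insert look like this: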
nw = []
for review in tokenized_docs[0]:
    new_review = ''
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review += new_token
    nw.append(new_review)

'''Inserting into database'''
def connect():
    for j in nw:
        conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'Thesis')
        cursor = conn.cursor()
        cursor.execute("""INSERT INTO splitted_sentences(sentence_id, splitted_sentences) VALUES(%s, %s)""", (cursor.lastrowid, j))
        conn.commit()
        conn.close()

if __name__ == '__main__':
    connect()
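One more note: cursor.lastrowid is None before anything has been executed on that cursor, so the sentence_id values you insert may not be what you expect. If sentence_id is an AUTO_INCREMENT column (an assumption on my part), a rough sketch that reuses a single connection and lets MySQL assign the ids could look like this:
# Sketch only: assumes splitted_sentences.sentence_id is AUTO_INCREMENT.
conn = mysql.connector.connect(user = 'root', password = '', unix_socket = "/tmp/mysql.sock", database = 'Thesis')
cursor = conn.cursor()
cursor.executemany(
    """INSERT INTO splitted_sentences(splitted_sentences) VALUES(%s)""",
    [(sentence,) for sentence in nw])   # one row per cleaned sentence
conn.commit()
conn.close()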