Sentence tokenizer - spaCy to pandas
I am using spaCy to split a text into sentences and write the result to a pandas DataFrame.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io

import en_core_web_sm
import pandas as pd

# Load the English model and read the text file
nlp = en_core_web_sm.load()
with io.open('o.txt', encoding='utf8') as f:
    doc = nlp(f.read())

# Print each sentence with a 1-based number
for idno, sentence in enumerate(doc.sents):
    print('Sentence {}:'.format(idno + 1), sentence)

# Build a DataFrame directly from the sentence spans
Sentences = list(doc.sents)
df = pd.DataFrame(Sentences)
print(df)
Output:
Sentence 1: This is a sample sentence.
Sentence 2: This is a second sample sentence.
Sentence 3: This is a third sample sentence.
      0   1  2       3         4         5     6
0  This  is  a  sample  sentence         .  None
1  This  is  a  second    sample  sentence     .
2  This  is  a   third    sample  sentence     .
Expected output in pandas:
                                   0
0         This is a sample sentence.
1  This is a second sample sentence.
2   This is a third sample sentence.
How can I get the expected output?
What you can do is build a list of records and then convert it to a DataFrame:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io

import en_core_web_sm
import pandas as pd

# Load the English model and read the text file
nlp = en_core_web_sm.load()
with io.open('o.txt', encoding='utf8') as f:
    doc = nlp(f.read())

# Collect one record per sentence, storing the sentence as a plain string
d = []
for idno, sentence in enumerate(doc.sents):
    d.append({"id": idno, "sentence": str(sentence)})
    print('Sentence {}:'.format(idno + 1), sentence)

# One row per sentence, indexed by the sentence id
df = pd.DataFrame(d)
df.set_index('id', inplace=True)
print(df)
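The reason the string conversion matters can be sketched without spaCy. pandas unpacks every list-like element into a row of values, and a spaCy Span iterates over its tokens, so a list of spans becomes one column per token. The snippet below is only an illustration under that assumption: the nested lists are hypothetical stand-ins for `list(doc.sents)`, and the names `token_rows`, `wide` and `tall` are made up for the example.

```python
import pandas as pd

# Stand-in for list(doc.sents): each inner list plays the role of one Span,
# which iterates over its tokens (hypothetical data, no spaCy required).
token_rows = [
    ["This", "is", "a", "sample", "sentence", "."],
    ["This", "is", "a", "second", "sample", "sentence", "."],
    ["This", "is", "a", "third", "sample", "sentence", "."],
]

# List-like elements are unpacked: one column per token, short rows are padded.
wide = pd.DataFrame(token_rows)

# Plain strings are scalars to pandas: one row per sentence, a single column.
# With spaCy this list would be [sent.text for sent in doc.sents].
sentences = [" ".join(tokens[:-1]) + tokens[-1] for tokens in token_rows]
tall = pd.DataFrame({"sentence": sentences})

print(wide.shape)
print(tall)
```

The same idea allows a shorter variant of the loop above, assuming only the sentence text is needed: `pd.DataFrame({"sentence": [sent.text for sent in doc.sents]})`.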
You should be able to use pd.read_table(input_file_path)
and adjust the arguments so that your text is imported into a single column, which we will call df['text'].
Then try this:
df['sents'] = df['text'].apply(lambda x: list(nlp(x).sents))
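You will then have a new column in which each row holds a list of the sentences (spaCy Span objects) found in that row's text.
Good luck!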
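If one row per sentence is wanted rather than one list per row, the list column can be flattened afterwards. The following is a rough sketch, not a definitive implementation: the in-memory `raw` buffer stands in for the input file, and `split_sentences` is a hypothetical regex-based placeholder for spaCy so that the example runs without a model installed. With spaCy, the function body would be `[s.text for s in nlp(text).sents]`.

```python
import io
import re

import pandas as pd

# Stand-in for the input file: one document per line (hypothetical data).
raw = io.StringIO(
    "This is a sample sentence. This is a second sample sentence.\n"
    "This is a third sample sentence.\n"
)

# Read every line into a single column named 'text' (no header row, tab-separated).
df = pd.read_table(raw, header=None, names=["text"])


def split_sentences(text):
    """Naive placeholder for spaCy: split after '.', '!' or '?' plus whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


# One list of sentence strings per row, mirroring the apply() call above.
df["sents"] = df["text"].apply(split_sentences)

# explode() gives every list element its own row; the index is then renumbered.
tall = df.explode("sents").reset_index(drop=True)[["sents"]]
print(tall)
```

Storing strings instead of Span objects keeps the column easy to write to CSV or to compare; keep the spans only if token-level attributes are needed later.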