How to convert a list of tuples from a text file into columns
I have a text file that contains a list of tuples. I want to convert this list into columns.
The file contains the following data:
[(0, u'0.025*"minimalism" + 0.018*"diwali" + 0.018*"sunday" + 0.018*"minimalistics" + 0.018*"plant" + 0.010*"thought" + 0.010*"take" + 0.010*"httpstcog21yvu1vyo" + 0.010*"time" + 0.010*"cause"'),
(1, u'0.029*"panshet" + 0.022*"im" + 0.015*"video" + 0.015*"project" + 0.015*"shade" + 0.015*"nature" + 0.015*"motionphotography\u2026" + 0.015*"motionjpeg" + 0.015*"trip" + 0.015*"lake"'),
(2, u'0.013*"light" + 0.013*"take" + 0.013*"minimalist" + 0.013*"unm4sk" + 0.013*"first" + 0.013*"minimalism\u2026" + 0.013*"minimal" + 0.013*"possible" + 0.013*"quick" + 0.013*"story"')]
I want the output in the following format:
topic 0        topic 1        topic 2
minimalism     panshet        light
diwali         im             take
sunday         video          minimalist
minimalistics  project        unm4sk
plant          shade          first
Edit 1
with open('LDA.txt') as f:
    lis = [x.split() for x in f]
cols = [x for x in zip(*lis)]
for x in cols:
    print(x)
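Your first mistake is the way you load the data from the text file. (Writing the printed list to a text file is not the best way to save data in the first place. If you want to save Python objects, pickle is a better fit.)
In any case, the fix is simple. After reading the file, pass its contents to ast.literal_eval.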
import ast
with open('LDA.txt') as f:
    data = ast.literal_eval(f.read())
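The pickle suggestion above could look roughly like the sketch below. The sample data is shortened, and the file name LDA.pkl and the temporary directory are assumptions made for the demo, not part of the original question.

```python
import os
import pickle
import tempfile

# Sample data in the same shape as the question's file (shortened for brevity).
topics = [
    (0, '0.025*"minimalism" + 0.018*"diwali"'),
    (1, '0.029*"panshet" + 0.022*"im"'),
]

# 'LDA.pkl' is a hypothetical file name; a temp directory keeps the demo tidy.
path = os.path.join(tempfile.mkdtemp(), 'LDA.pkl')

# Save the Python object itself instead of its printed representation.
with open(path, 'wb') as f:
    pickle.dump(topics, f)

# Loading returns the original list of tuples, so no parsing step is needed.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == topics)
```

Keep in mind that pickle files should only be loaded from sources you trust.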
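Now for the part you have been waiting for. You can extract the words very easily with re.findall. For each tuple in the data, extract all of the quoted words and store them in a dictionary keyed by topic. Then pass the dictionary to the pd.DataFrame constructor.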
import re
import pandas as pd
d = {}
for i, y in data:
    # every word is wrapped in double quotes, so capture what sits between them
    d['topic {}'.format(i)] = re.findall(r'"(.*?)"', y)
df = pd.DataFrame(d)
df
   topic 0             topic 1             topic 2
0  minimalism          panshet             light
1  diwali              im                  take
2  sunday              video               minimalist
3  minimalistics       project             unm4sk
4  plant               shade               first
5  thought             nature              minimalism…
6  take                motionphotography…  minimal
7  httpstcog21yvu1vyo  motionjpeg          possible
8  time                trip                quick
9  cause               lake                story
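To see what the pattern does on a single topic string, here is a small sketch using only the standard library. The second pattern, which also captures the weight in front of each word, is an extension that is not part of the answer above.

```python
import re

topic_string = '0.025*"minimalism" + 0.018*"diwali" + 0.018*"sunday"'

# "(.*?)" captures the shortest run of characters between each pair of quotes.
words = re.findall(r'"(.*?)"', topic_string)

# A second group in front of the * also captures the weight of each word.
pairs = [(word, float(weight))
         for weight, word in re.findall(r'([\d.]+)\*"(.*?)"', topic_string)]

print(words)
print(pairs)
```

The non-greedy `.*?` matters here: a greedy `.*` would run from the first quote to the last one and return a single long match.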
If you want another way to tabulate the data (without using a DataFrame), see here (the second answer).
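One possible way to tabulate without a DataFrame, sketched with plain Python. This is not necessarily the approach from the linked answer; the sample data is shortened and format_row is a hypothetical helper name.

```python
import re

# Shortened sample in the same shape as the parsed file contents.
data = [
    (0, '0.025*"minimalism" + 0.018*"diwali" + 0.018*"sunday"'),
    (1, '0.029*"panshet" + 0.022*"im" + 0.015*"video"'),
    (2, '0.013*"light" + 0.013*"take" + 0.013*"minimalist"'),
]

headers = ['topic {}'.format(i) for i, _ in data]
columns = [re.findall(r'"(.*?)"', s) for _, s in data]

# zip(*columns) turns the per-topic word lists into per-row tuples.
# Note that zip stops at the shortest column if the lengths differ.
rows = list(zip(*columns))

# Pad every cell to the widest entry so the columns line up.
width = max(len(cell) for cells in [headers] + rows for cell in cells)


def format_row(cells):
    """Join cells into one left-aligned, space-padded line."""
    return '  '.join(cell.ljust(width) for cell in cells).rstrip()


table = [format_row(headers)] + [format_row(row) for row in rows]
print('\n'.join(table))
```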
I think the output looks like the __str__ format of a gensim LDA model's topics.
Instead of printing the topics, saving the string, and then post-processing it, like this:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
model.print_topics(3)
[out]:
[(51, '0.083*"response" + 0.083*"time" + 0.083*"graph" + 0.083*"trees" + 0.083*"eps" + 0.083*"computer" + 0.083*"survey" + 0.083*"interface" + 0.083*"user" + 0.083*"human"'), (48, '0.083*"response" + 0.083*"time" + 0.083*"graph" + 0.083*"trees" + 0.083*"eps" + 0.083*"computer" + 0.083*"survey" + 0.083*"interface" + 0.083*"user" + 0.083*"human"'), (42, '0.083*"response" + 0.083*"time" + 0.083*"graph" + 0.083*"trees" + 0.083*"eps" + 0.083*"computer" + 0.083*"survey" + 0.083*"interface" + 0.083*"user" + 0.083*"human"')]
you should use models.LdaModel.top_topics():
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
top3_topics = model.top_topics(corpus)[:3]
for topic, topic_score in top3_topics:
    word_scores, words = zip(*topic)
    top10_words = words[:10]
    print(top10_words)
[out]:
('time', 'response', 'user', 'computer', 'human', 'interface', 'system', 'survey', 'eps', 'trees')
('survey', 'minors', 'graph', 'computer', 'human', 'interface', 'user', 'system', 'time', 'response')
('computer', 'human', 'interface', 'user', 'system', 'time', 'survey', 'response', 'eps', 'trees')
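The `zip(*topic)` line in the loop above may be easier to follow on a small example. The sketch below uses made-up data and assumes the shape the loop implies: each topic is a list of (score, word) pairs that comes with a single score for the topic as a whole.

```python
# Hypothetical topic in the shape the loop above unpacks:
# a list of (score, word) pairs plus one score for the whole topic.
topic = [(0.12, 'time'), (0.10, 'response'), (0.08, 'user')]
topic_score = -1.5  # made-up value, for illustration only

# zip(*topic) transposes the pairs into one tuple of scores and one of words.
word_scores, words = zip(*topic)

print(word_scores)
print(words)
```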
If you want to put them into a pandas.DataFrame:
>>> import pandas as pd
>>>
>>> top10_words_per_topic = []
>>> for topic, topic_score in top3_topics:
...     word_scores, words = zip(*topic)
...     top10_words_per_topic.append(words[:10])
...
>>> df = pd.DataFrame(top10_words_per_topic).transpose()
>>> df.rename(columns={0:'Topic0', 1:'Topic1', 2:'Topic2'})
   Topic0     Topic1     Topic2
0  time       survey     computer
1  response   minors     human
2  user       graph      interface
3  computer   computer   user
4  human      human      system
5  interface  interface  time
6  system     user       survey
7  survey     system     response
8  eps        time       eps
9  trees      response   trees