Why can't I extract two specific fields from a CSV file and build a DataFrame from them?
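I have a CSV file containing a lot of tweets. I am trying to extract two specific pieces of text and build a DataFrame with the following information: date, hashtag.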
Created At,Text
Fri Jan 06 11:02:14 +0000 2017, #beta #betalab #mg Afiliada da Globo: Apresentador no AM é demitido após criticar governador
I want a result like this:
Below is one of the many approaches I have tried; none of them gives me the result I need.
I tried the following code:
import os
import re

import nltk
import pandas as pd

os.chdir(r'C:\Users\Documents')
dataset = pd.read_csv('Tweets_Mg.csv', encoding='utf-8')
dataset.drop_duplicates(['Text'], inplace=True)

def Preprocessing(instancia):
    stemmer = nltk.stem.RSLPStemmer()
    # Strip URLs, lowercase, and remove punctuation
    instancia = re.sub(r"http\S+", "", instancia).lower().replace('?','').replace('!','').replace('.','').replace(';','').replace('-','').replace(':','').replace(')','')
    # List of stopwords in Portuguese
    stopwords = set(nltk.corpus.stopwords.words('portuguese'))
    palavras = [stemmer.stem(i) for i in instancia.split() if i not in stopwords]
    return " ".join(palavras)

tweets = [Preprocessing(i) for i in dataset.Text]

def procurar_hashtags(tweet):
    return re.findall(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', tweet)

hashtag_list = [procurar_hashtags(i) for i in tweets]

def hashtag_top(hashtag_list):
    hashtag_df = pd.DataFrame(hashtag_list)
    # Stack the first nine hashtag columns into a single column
    hashtag_df = pd.concat([hashtag_df[0],hashtag_df[1],hashtag_df[2],
                            hashtag_df[3],hashtag_df[4],hashtag_df[5],
                            hashtag_df[6],hashtag_df[7],
                            hashtag_df[8]], ignore_index=True)
    hashtag_df = hashtag_df.dropna()
    hashtag_df = pd.DataFrame(hashtag_df)
    hashags_unicas = hashtag_df[0].value_counts()
    return hashags_unicas

hashtag_dataframe = hashtag_top(hashtag_list)
hashtag_dataframe[hashtag_dataframe>=25]
The result is not what I need: whatever I do, I can't capture the date together with the hashtags. I only get this:
#timbet 193
#glob 119
#operacaobetalab 118
#sigodevolt 77
What am I doing wrong?
You can use this as a starting point:
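from itertools import product
from pathlib import Path
import csv
import re

hashtag = re.compile(r'(#\w+)')
csvfile = Path('/path/to/your/file.csv')
tags_by_date = []

with csvfile.open(newline='', encoding='utf-8') as f:
    for line in csv.reader(f):
        # line[0] is the date, line[1] is the tweet text
        tags = hashtag.findall(line[1])
        if tags:
            # Wrap the date in a list; otherwise product() iterates
            # over the individual characters of the date string
            for date, tag in product([line[0]], tags):
                tags_by_date.append([date, tag])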
Here is a small proof of concept (far from a complete solution, since you didn't take the time to state your question more clearly):
>>> line
['Fri Jan 06 11:02:14 +0000 2017', ' #beta #betalab #mg Afiliada da Globo: Apresentador no AM é demitido após criticar governador']
>>> hashtag.findall(line[1])
['#beta', '#betalab', '#mg']
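To get from that proof of concept to one (date, hashtag) row per tag, something along these lines might work. It is only a sketch using the standard library: the helper name `tags_by_date`, the `TWITTER_DATE` format string (inferred from the single sample row and dependent on an English locale for the day and month names), and the in-memory `sample` input are assumptions for illustration, not part of the original code.

```python
import csv
import io
import re
from collections import Counter
from datetime import datetime

HASHTAG = re.compile(r"#\w+")
# Assumed date layout, inferred from the sample row in the question
TWITTER_DATE = "%a %b %d %H:%M:%S %z %Y"

def tags_by_date(lines):
    """Yield one (date, hashtag) pair for every hashtag in every row."""
    # skipinitialspace drops the blank that follows the comma in the sample
    reader = csv.DictReader(lines, skipinitialspace=True)
    for row in reader:
        day = datetime.strptime(row["Created At"], TWITTER_DATE).date()
        # Search the raw text, so the tags are not altered by stemming
        for tag in HASHTAG.findall(row["Text"]):
            yield day, tag.lower()

# Stand-in for the real file; replace with open(path, newline="", encoding="utf-8")
sample = io.StringIO(
    "Created At,Text\n"
    "Fri Jan 06 11:02:14 +0000 2017, #beta #betalab #mg Afiliada da Globo: "
    "Apresentador no AM é demitido após criticar governador\n"
)

pairs = list(tags_by_date(sample))
counts = Counter(pairs)  # number of occurrences per (date, hashtag)
```

If pandas is available, `pd.DataFrame(pairs, columns=["date", "hashtag"])` should turn these pairs into the two-column frame asked for, and grouping on both columns would give per-day counts. Extracting the tags before any stemming also matters: running the stemmer first appears to be why the output in the question shows shortened tags such as #glob.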