FeedParser, Removing Special Characters and Writing to CSV
I'm learning Python. I set myself a small goal: build an RSS scraper. I'm trying to collect the author, link, and title of each post, and from there write them to a CSV.
I've run into a few problems. I've been hunting for an answer since last night but can't seem to find a solution. I have a feeling I'm missing some piece of knowledge between what feedparser parses and moving that into a CSV, but I don't have the vocabulary yet to know what to Google.
- How do I remove special characters such as "[" and "'"?
- How do I write the author, link, and title to a new row as the file is created?
1) Special characters
rssurls = 'http://feeds.feedburner.com/TechCrunch/'
techart = feedparser.parse(rssurls)
# feeds = []
# for url in rssurls:
#     feeds.append(feedparser.parse(url))
# for feed in feeds:
#     for post in feed.entries:
#         print(post.title)
#     print(feed.entries)
techdeets = [post.author + " , " + post.title + " , " + post.link for post in techart.entries]
techdeets = [y.strip() for y in techdeets]
techdeets
Output: I get the information I need, but .strip() isn't stripping anything.
['Darrell Etherington , Spin launches first city-sanctioned dockless
bike sharing in Bay Area ,
http://feedproxy.google.com/~r/Techcrunch/~3/BF74UZWBinI/', 'Ryan
Lawler , With .3 million in funding, CarDash wants to change how you
get your car serviced ,
http://feedproxy.google.com/~r/Techcrunch/~3/pkamfdPAhhY/', 'Ron
Miller , AlienVault plug-in searches for stolen passwords on Dark Web
, http://feedproxy.google.com/~r/Techcrunch/~3/VbmdS0ODoSo/', 'Lucas
Matney , Firefox for Windows gets native WebVR support, performance
bumps in latest update ,
http://feedproxy.google.com/~r/Techcrunch/~3/j91jQJm-f2E/',...]
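(It turns out the "[" and "'" characters aren't in the strings at all; they are Python's repr of the list, shown when the list is echoed, so .strip() has nothing to remove. A quick check with shortened, made-up strings standing in for techdeets:)

```python
# shortened stand-ins for the techdeets strings
items = ["a , b , c", "d , e , f"]

as_shown = str(items)   # the [ and ' come from the list's repr, not the data
first = items[0]        # an individual string has no brackets or quotes in it
```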
2) Writing to CSV
import csv
savedfile = open('/test1.txt', 'w')
savedfile.write(str(techdeets) + "\n")
savedfile.close()
import pandas as pd
df = pd.read_csv('/test1.txt', encoding='cp1252')
df
Output: a dataframe with only 1 row and many columns.
You're almost there :-)
How about using pandas to create a dataframe first and then save it, continuing from your code:
df = pd.DataFrame(columns=['author', 'title', 'link'])
for i, post in enumerate(techart.entries):
    df.loc[i] = post.author, post.title, post.link
Then you can save it:
df.to_csv('myfilename.csv', index=False)
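A side note on those commas: some titles contain commas of their own (the Firefox headline above, for example), and to_csv quotes any field that contains a comma, so the columns survive intact. A small sketch using a row borrowed from your output (link replaced with a placeholder) and an in-memory buffer:

```python
import io
import pandas as pd

df = pd.DataFrame([{"author": "Lucas Matney",
                    "title": "Firefox for Windows gets native WebVR support, performance bumps in latest update",
                    "link": "http://example.com/post"}])

buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()
# the title field comes out wrapped in double quotes because it contains a comma
```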
Or
You can also write into the dataframe directly from the feedparser entries:
>>> import feedparser
>>> import pandas as pd
>>>
>>> rssurls = 'http://feeds.feedburner.com/TechCrunch/'
>>> techart = feedparser.parse(rssurls)
>>>
>>> df = pd.DataFrame()
>>>
>>> df['author'] = [post.author for post in techart.entries]
>>> df['title'] = [post.title for post in techart.entries]
>>> df['link'] = [post.link for post in techart.entries]
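One caveat with the list comprehensions above: a post that lacks one of the fields would make post.author raise AttributeError. feedparser entries are dict-like, so .get with a default is a safer way to build the same frame; a sketch using plain dicts to stand in for techart.entries:

```python
import io
import pandas as pd

# plain-dict stand-ins for feedparser entries (real entries are dict-like too)
entries = [
    {"author": "Jane Doe", "title": "First post", "link": "http://example.com/1"},
    {"title": "Post with no author", "link": "http://example.com/2"},
]

df = pd.DataFrame(
    [{"author": e.get("author", ""),
      "title": e.get("title", ""),
      "link": e.get("link", "")} for e in entries]
)

buf = io.StringIO()          # in-memory here; pass 'myfilename.csv' in practice
df.to_csv(buf, index=False)
```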