Writing scraped data to JSON using Python

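I am trying to save data I scraped into a JSON file using Python. My code is below. I can scrape the data, but I cannot save it to a JSON file. Can someone tell me where the problem is? I have not used anything like this before. I searched a lot for solutions but did not find one that fits exactly.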

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import json

for page in range(1,2):
    url = "https://whosebug.com/questions?tab=unanswered&page={}".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    Container = soup.find_all("div", {"class":"question-summary"})
    for i in Container:
        try:
            title = i.find("a", {"class":"question-hyperlink"}).get_text()
            det = i.find("div", {"class":"excerpt"}).get_text()
            tags = i.find("div",{"class":"tags"}).get_text()
            votes = i.find("div",{"class":"votes"}).get_text()
            ans = i.find("div",{"class":"status"}).get_text()
            views = i.find("div",{"class":"views"}).get_text()
            time = i.find("span",{"class":"relativetime"}).get_text()
            print(title, det, tags, votes, ans, views, time )
        except: AttributeError
        ## the problem starts from here.
def questions(f):
    job_dict = {}
    job_dict['Title'] = title
    job_dict['Description'] = det
    job_dict['Tags'] = tags
    job_dict['Votes'] = votes
    job_dict['Answers'] = ans
    job_dict['Views'] = views
    job_dict['Time'] = time

    json_job = json.dumps(job_dict)
    f.seek(0)
    txt = f.readline()
    if txt.endswith("}"):
        f.write(",")
    f.write(json_job)

The code below works:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import json

data = []
for page in range(1, 2):
    url = "https://whosebug.com/questions?tab=unanswered&page={}".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    Container = soup.find_all("div", {"class": "question-summary"})
    for i in Container:
        entry = {'title': i.find("a", {"class": "question-hyperlink"}).get_text(),
                 'det': i.find("div", {"class": "excerpt"}).get_text()}
        data.append(entry)
        # TODO Add more attributes


print(json.dumps(data))

Output:

[{"title": "JsTestDriver on NetBeans stops testing after a failed assertion", "det": "\r\n            I have set up JavaScript unit testing with JS Test Driver on Netbeans as per this Link. However, unlike the results in that tutorial, no more tests are executed after an assertion fails. How can I ...\r\n        "}, {"title": "Receiving kAUGraphErr_CannotDoInCurrentContext when calling AUGraphStart for playback", "det": "\r\n            I'm working with AUGraph and Audio Units API to playback and record audio in my iOS app. Now I have a rare issue when an AUGraph is unable to start with the following error:\r\n  result = ...\r\n        "}, {"title": "SilverStripe PHP Forms - If I nest a SelectionGroup inside a FieldGroup, one of the related SelectionGroup_Items' Radio Box does not show up. Why?", "det": "\r\n            I have a form that has two FieldGroups, and in one of the FieldGroups I have a SelectionGroup.\n\nThe SelectionGroup_Items show up in the form FieldGroup but the radio boxes to select one of the options ...\r\n        "}, {"title": "Implementing a Neural Network in Haskell", "det": "\r\n            I'm trying to implement a neural network architecture in Haskell, and use it on MNIST.\n\nI'm using the hmatrix package for the linear algebra.\nMy training framework is built using the pipes package.\n\n...\r\n        "}, {"title": "dequeueBuffer: can't dequeue multiple buffers without setting the buffer count", "det": "\r\n            I'm getting the error below on Android 4.4.2 Moto X 2013 in a Rhomobile 5.0.2 WebView app. The app is compiled with SDK 19 and minAPI 17.\n\nAfter some research it seems that this is an issue with ...\r\n        "}, {"title": "How to read audio data from a 'MediaStream' object in a C++ addon", "det": "\r\n            After sweating blood and tears I've finally managed to set up a Node C++ addon and shove a web-platform standard MediaStream object into one of its C++ methods for good. 
For compatibility across ...\r\n        "}, {"title": "Akka finite state machine instances", "det": "\r\n            I am trying to leverage Akka's finite state machine framework for my use case. I am working on a system that processes a request that goes through various states.\n\nThe request here is the application ...\r\n        "}, {"title": "How to use classes to \u201ccontrol dreams\u201d?", "det": "\r\n            Background\n\nI've been playing around with Deep Dream and Inceptionism, using the Caffe framework to visualize layers of GoogLeNet, an architecture built for the Imagenet project, a large visual ...\r\n        "}, {"title": "iOS : Use of HKObserverQuery's background update completionHandler", "det": "\r\n            HKObserverQuery has the following method that supports receiving updates in the background:\n\n- initWithSampleType:predicate:updateHandler:\r\nThe updateHandler has a completionHandler which has the ...\r\n        "}, {"title": "Representing Parametric Survival Model in 'Counting Process' form in JAGS", "det": "\r\n            The Problem\n\nI am trying to build a survival-model in JAGS that allows for time-varying covariates. 
I'd like it to be a parametric model - for example, assuming survival follows the Weibull ...\r\n        "}, {"title": "Separate cookie jar per WebView in OS X", "det": "\r\n            I've been trying to achieve the goal of having a unique (not shared) cookie jar per WebView in macOS (cookies management works different for iOS).\n\nAfter reading a lot of Whosebug questions and ...\r\n        "}, {"title": "Flexible Space in Android", "det": "\r\n            Using this tutorial to implement a Flexible Space pattern (the one with the collapsing toolbar).\n\nI'm trying to achieve a similar effect as in the Lollipop Contacts activity, which at the beginning ...\r\n        "}, {"title": "How do I upgrade to jlink (JDK 9+) from Java Web Start (JDK 8) for an auto-updating application?", "det": "\r\n            Java 8 and prior versions have Java Web Start, which auto-updates the application when we change it.  Oracle has recommended that users migrate to jlink, as that is the new Oracle technology.  So far, ...\r\n        "}, {"title": "Newly Published App reporting Version as \u201cUnknown\u201d in iTunes Connect", "det": "\r\n            New version of my app is 1.2. But in \"Sales and Trends\" in iTunes Connect I see \"unknown\" app version. Also new reviews not showing in App Store in \"Current version\" tab (only in all versions tab).\n...\r\n        "}, {"title": "VerifyError: Uninitialized object exists on backward branch / JVM Spec 4.10.2.4", "det": "\r\n            The JVM Spec 4.10.2.4 version 7, last paragraph, says\r\n  A valid instruction sequence must not have an uninitialized object on the operand stack or in a local variable at the target of a backwards ...\r\n        "}, {"title": "Pandas read_xml() method test strategies", "det": "\r\n            Interestingly, pandas I/O tools does not maintain a read_xml() method and the counterpart to_xml(). 
However, read_json proves tree-like structures can be implemented for dataframe import and read_html ...\r\n        "}, {"title": "Visual bug in Safari using jQuery Mobile - Content duplication", "det": "\r\n            I'm building a mobile app using jQuery Mobile 1.3.0, EaselJs 0.6.0 and TweenJs 0.4.0.\n\nSo, when I load the page, some content gets visually duplicated. The DIVs are not really duplicated, it is just ...\r\n        "}, {"title": "Saving child collections with OrmLite on Android with objects created from Jackson", "det": "\r\n            I have a REST service which I'm calling from my app, which pulls in a JSON object as a byte[] that is then turned into a nice nested collection of objects -- all of that bit works fine. What I then ...\r\n        "}, {"title": "Transitions with GStreamer Editing Services freezes, but works OK without transitions", "det": "\r\n            I'm trying to use gstreamer's GStreamer Editing Services to concatenate 2 videos, and to have a transition between the two.\n\nThis command, which just joins 2 segments of the videos together without a ...\r\n        "}, {"title": "Cannot log-in to rstudio-server", "det": "\r\n            I have previously successfully installed rstudio-server with brew install rstudio-server on a Mac OS X 10.11.4.\n\nNow, I am trying to login to rstudio-server 0.99.902 without success. From the client ...\r\n        "}, {"title": "How to implement Isotope with Pagination", "det": "\r\n            I am trying to implement isotope with pagination on my WordPress site (which obviously is a problem for most people). I've come up with a scenario which may work if I can figure a few things out.\n\nOn ...\r\n        "}, {"title": "Input range slider not working on iOS Safari when clicking on track", "det": "\r\n            I have a pretty straight-forward range input slider.  It works pretty well on all browsers except on iOS Safari.  
\n\nThe main problem I have is when I click on the slider track, it doesn't move the ...\r\n        "}, {"title": "Could not load IOSurface for time string. Rendering locally instead swift 4", "det": "\r\n            Could you help me with this problem when I running my project : \r\n  Could not load IOSurface for time string. Rendering locally instead\r\nI don't know what is going on with my codding ..... pleas help ....\r\n        "}, {"title": "Creating multiple aliases for the same QueryDSL path in Spring Data", "det": "\r\n            I have a generic Spring Data repository interface that extends QuerydslBinderCustomizer, allowing me to customize the query execution.  I am trying to extend the basic equality testing built into the ...\r\n        "}, {"title": "React Native WebView html <select> not opening options on Android tablets", "det": "\r\n            I am experiencing a very strange problem in React Native's WebView with HTML <select> tags on Android tablets.\n\nFor some reason, tapping on the rendered <select> button does not open the ...\r\n        "}, {"title": "In Xamarin.Forms Device.BeginInvokeOnMainThread() doesn\u2019t show message box from notification callback *only* in Release config on physical device", "det": "\r\n            I'm rewriting my existing (swift) iOS physical therapy app \"On My Nerves\" to Xamarin.Forms. It's a timer app to help people with nerve damage (like me!) do their desensitization exercises. 
You have ...\r\n        "}, {"title": "iOS 11: ATS (App Transport Security) no longer accepts custom anchor certs?", "det": "\r\n            I am leasing a self signed certificate using NSMutableURLRequest and when the certificate is anchored using a custom certificate with SecTrustSetAnchorCertificates IOS 11 fails with the following ...\r\n        "}, {"title": "What is an appropriate type for smart contracts?", "det": "\r\n            I'm wondering what is the best way to express smart contracts in typed languages such as Haskell or Idris (so you could, for example, compile it to run on the Ethereum network). My main concern is: ...\r\n        "}, {"title": "USB bulkTransfer between Android tablet and camera", "det": "\r\n            I would like to exchange data/commands between a camera and an Android tablet device using the bulkTransfer function. I wrote this Activity, but the method bulkTransfer returns -1 (error status). Why ...\r\n        "}, {"title": "ember-cli-code-coverage mocha showing 0% coverage when there are tests", "det": "\r\n            I'm using ember-cli-code-coverage with ember-cli-mocha. When I run COVERAGE=true ember test I'm getting 0% coverage for statements, functions, and lines. Yet, I have tests that are covering those ...\r\n        "}, {"title": "SNIReadSyncOverAsync Performance issue", "det": "\r\n            Recently I used dot Trace profiler to find the bottle necks in my application.\n\nSuddenly I have seen that in most of the places which is taking more time and more cpu usage too is ...\r\n        "}, {"title": "IOS: Text Selection in WKWebView (WKSelectionGranularityCharacter)", "det": "\r\n            I've got an app that uses a web view where text can be selected. It's long been an annoyance that you can't select text across a block boundary in UIWebView.  
WKWebView seems to fix this with a ...\r\n        "}, {"title": "Creating a shadow copy using the \u201cBackup\u201d context in a PowerShell", "det": "\r\n            I am in the process of writing a PowerShell script for backing up a windows computer using rsync. To this end, I am attempting to use WMI from said script to create a non-persistent Shadow copy with ...\r\n        "}, {"title": "`std::variant` vs. inheritance vs. other ways (performance)", "det": "\r\n            I'm wondering about std::variant performance. When should I not use it? It seems like virtual functions are still much better than using std::visit which surprised me!\n\nIn \"A Tour of C++\" Bjarne ...\r\n        "}, {"title": "Resources, scopes, permissions and policies in keycloak", "det": "\r\n            I want to create a fairly simple role-based access control system using Keycloak's authorizaion system. The system Keycloak is replacing allows us to create a \"user\", who is a member of one or more \"...\r\n        "}, {"title": "iOS Internal testing - Unable to download crash information?", "det": "\r\n            I have recently uploaded my app to the App Store for internal testing (TestFlight, iOS 8). I am currently the only tester. 
When I test using TestFlight, my app crashes; however, the same operation ...\r\n        "}, {"title": "Traversing lists and streams with a function returning a future", "det": "\r\n            Introduction\n\nScala's Future (new in 2.10 and now 2.9.3) is an applicative functor, which means that if we have a traversable type F, we can take an F[A] and a function A => Future[B] and turn them ...\r\n        "}, {"title": "Symfony2: how to get all entities of one type which are marked with \u201cEDIT\u201d ACL permission?", "det": "\r\n            Can someone tell me how to get all entities of one type which are marked with \"EDIT\" ACL permission?\n\nI would like to build a query with the Doctrine EntityManager.\r\n        "}, {"title": "How can we calculate \u201cflex-basis: auto & min-width\u201d and \u201cwidth at cross axis\u201d?", "det": "\r\n            I want to know how flex-basis: auto & min-width: 0 and width: auto is calculated (width is not set for parent element and flex item) . Therefore, I confirmed the specification of W3C. Is my ...\r\n        "}, {"title": "Rendering Angular components in Handsontable Cells", "det": "\r\n            In a project of mine I try to display Angular Components (like an Autocomplete Dropdown Search) in a table. Because of the requirements I have (like multi-selecting different cells with ctrl+click) I ...\r\n        "}, {"title": "Getting Symbols from debugged process MainModule", "det": "\r\n            I started writing a debugger in C#, to debug any process on my operating system. 
For now, it only can handle breakpoints (HW, SW, and Memory), but now I wanted to show the opcode of the process.\n\nMy ...\r\n        "}, {"title": "@Transactional in super classes not weaved when using load time weaving", "det": "\r\n            The project I am working on has a similar structure for the DAOs to the one bellow:\n\n/** \n* Base DAO class\n*/\n@Transactional    \npublic class JPABase {\n\n  @PersistenceContext\n  private EntityManager ...\r\n        "}, {"title": "Alert, confirm, and prompt not working after using History API on Safari, iOS", "det": "\r\n            After calling history.pushState in Safari on iOS, it's no longer possible to use alert(), confirm() or prompt(), when using the browser back button to change back.\n\nIs this an iOS bug? Are there any ...\r\n        "}, {"title": "ExoPlayer AudioTrack Stuttering", "det": "\r\n            I have my own implementation of TrackRenderer for a mp3 decoder, that I integrated. When a lollipop device goes to standby and comes back, its not always repeatable but the audio starts to stutter ...\r\n        "}, {"title": "How to diagnose COM-callable wrapper object creation failure?", "det": "\r\n            I am creating a COM object (from native code) using CoCreateInstance:\n\nconst \n   CLASS_GP2010: TGUID = \"{DC55D96D-2D44-4697-9165-25D790DD8593}\";\n\nhr = CoCreateInstance(CLASS_GP2010, nil, ...\r\n        "}, {"title": "libMobileGestalt MobileGestalt.c:890: MGIsDeviceOneOfType is not supported on this platform", "det": "\r\n            I am using Xcode 9 I kept getting this error when I load my app \r\n  libMobileGestalt MobileGestalt.c:890: MGIsDeviceOneOfType is not supported on this platform.\r\nHow to stop that?\r\n        "}, {"title": "How to add a builtin function in a GCC plugin?", "det": "\r\n            It is possible for a GCC plugin to add a new builtin function? If so, how to do it properly?\n\nGCC version is 5.3 (or newer). 
The code being compiled and processed by the plugin is written in C.\n\nIt is ...\r\n        "}, {"title": "Chain is null when retrieving private key", "det": "\r\n            I'm encrypting data in my app using a RSA keypair that I am storing in the Android keystore.\n\nI've been seeing NullPointerExceptions in the Play Store, but I have not been able to reproduce them:\n\n...\r\n        "}, {"title": "Managing the lifetimes of garbage-collected objects", "det": "\r\n            I am making a simplistic mark-and-compact garbage collector. Without going too much into details, the API it exposes is like this:\n\n/// Describes the internal structure of a managed object.\npub struct ...\r\n        "}, {"title": "Sneaking lenses and CPS past the value restriction", "det": "\r\n            I'm encoding a form of van Laarhoven lenses in OCaml but am having difficulty due to the value restriction.\n\nThe relevant code is as follows\n\nmodule Optic : sig\n  type (-'s, +'t, +'a, -'b) t\n  val ...\r\n        "}]

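Before the loop, you should create a list, all_jobs, to hold all the data.

Inside try, you should build a dictionary with the job data and append it to all_jobs.

After the loop, you can write everything to the file at once.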

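If you try to write each job separately, you may create an invalid JSON file, because the file needs [ at the beginning and ] at the end, and I don't see anything in your code that adds them.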
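To illustrate the point about the brackets, here is a minimal sketch using made-up records (the `jobs` list and its contents are hypothetical, not scraped data). Joining separately dumped objects with commas produces text that `json.loads` rejects, while dumping the whole list in one call produces a valid JSON array.

```python
import json

# Hypothetical sample records standing in for scraped questions.
jobs = [{"Title": "First"}, {"Title": "Second"}]

# Writing each record on its own, joined by commas, leaves out the [ and ].
piecewise = ",".join(json.dumps(job) for job in jobs)

try:
    json.loads(piecewise)
    piecewise_ok = True
except json.JSONDecodeError:
    piecewise_ok = False

# Dumping the list in one call adds the surrounding brackets.
whole = json.dumps(jobs)
round_trip = json.loads(whole)

print("piecewise text parses:", piecewise_ok)
print("whole list round-trips:", round_trip == jobs)
```

This is why collecting everything in a list and serializing once is the simpler route.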

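Also, the except block has to contain some code: at least the pass statement, though it is better to print a message saying that something went wrong. If you use only pass, you will never know that an error occurred, and sometimes that error is exactly what explains why the code gives no result.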
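A small sketch of the difference, assuming a helper named `safe_text` that exists only for this illustration. `find` returns `None` when nothing matches, so calling `.get_text()` on the result raises `AttributeError`; reporting it shows why a record was skipped.

```python
errors = []

def safe_text(element):
    """Return element.get_text(), or None after recording the error."""
    try:
        return element.get_text()
    except AttributeError as ex:
        # Report the problem instead of silently passing over it.
        errors.append(str(ex))
        print("Error:", ex)
        return None

# Passing None mimics the case where find() matched no tag.
result = safe_text(None)
```

With a bare `pass` in the except block, the same call would return nothing and leave no trace of what went wrong.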


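EDIT: Normally everything is written on one line, but it is still a correct JSON string and other tools have no problem reading it. If you want the data in the file to be formatted, you can add indentation, e.g. json.dumps(all_jobs, indent=2).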
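For comparison, a short sketch of both forms on a hypothetical one-record list. The two strings differ only in whitespace and parse back to the same data.

```python
import json

# Hypothetical sample data.
all_jobs = [{"Title": "Example", "Votes": "0"}]

compact = json.dumps(all_jobs)           # everything on one line
pretty = json.dumps(all_jobs, indent=2)  # one key per line, 2-space indent

print(compact)
print(pretty)
```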

You can also clean the text before saving it: .get_text(strip=True)
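A brief sketch of what `strip=True` does, using a hypothetical HTML fragment rather than the real page. Note that with nested tags the stripped pieces are joined with no separator by default, so passing a separator such as `" "` may be preferable for fields like the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment resembling a tags container.
html = '<div class="tags">\n  <a>python</a>\n  <a>json</a>\n</div>'
tag = BeautifulSoup(html, "html.parser").find("div", {"class": "tags"})

raw = tag.get_text()                  # keeps the surrounding whitespace
stripped = tag.get_text(strip=True)   # strips each piece, joins with ""
spaced = tag.get_text(" ", strip=True)  # strips each piece, joins with " "

print(repr(raw))
print(repr(stripped))
print(repr(spaced))
```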


from urllib.request import urlopen
from bs4 import BeautifulSoup
import json

all_jobs = []

for page in range(1, 2):
    url = "https://whosebug.com/questions?tab=unanswered&page={}".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    Container = soup.find_all("div", {"class":"question-summary"})
    for i in Container:
        try:
            title = i.find("a", {"class":"question-hyperlink"}).get_text() # .get_text(strip=True)
            det = i.find("div", {"class":"excerpt"}).get_text()
            tags = i.find("div",{"class":"tags"}).get_text()
            votes = i.find("div",{"class":"votes"}).get_text()
            ans = i.find("div",{"class":"status"}).get_text()
            views = i.find("div",{"class":"views"}).get_text()
            time = i.find("span",{"class":"relativetime"}).get_text()

            print(title, det, tags, votes, ans, views, time )

            job_dict = {}
            job_dict['Title'] = title
            job_dict['Description'] = det
            job_dict['Tags'] = tags
            job_dict['Votes'] = votes
            job_dict['Answers'] = ans
            job_dict['Views'] = views
            job_dict['Time'] = time

            all_jobs.append(job_dict)

        except AttributeError as ex:
            print('Error:', ex)

# --- after loop ---

# `with` closes the file even if writing fails
with open('output.json', 'w') as f:
    #f.write(json.dumps(all_jobs)) # all in one line
    f.write(json.dumps(all_jobs, indent=2))

EDIT: Using the Elasticsearch module

Importing directly into Elasticsearch:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es = Elasticsearch()

for page in range(2):
    url = "https://whosebug.com/questions?tab=unanswered&page={}".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")

    container = soup.find_all("div", {"class":"question-summary"})
    for item in container:
        try:
            job = {
                'Title': item.find("a", {"class":"question-hyperlink"}).get_text(strip=True),
                'Description': item.find("div", {"class":"excerpt"}).get_text(strip=True),
                'Tags': item.find("div",{"class":"tags"}).get_text(strip=True),
                'Votes': item.find("div",{"class":"votes"}).get_text(strip=True),
                'Answers': item.find("div",{"class":"status"}).get_text(strip=True),
                'Views': item.find("div",{"class":"views"}).get_text(strip=True),
                'Time': item.find("span",{"class":"relativetime"}).get_text(strip=True),
            }
        except AttributeError as ex:
            print('Error:', ex)
            continue

        # --- importing job to Elasticsearch ---

        # Elasticsearch index names must be lowercase
        res = es.index(index="whosebug", doc_type='job', body=job) # without `id` to autocreate `id`
        print(res['result'])


# --- searching ---

#es.indices.refresh(index="whosebug")

res = es.search(index="whosebug", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    #print(hit)
    print("%(Title)s: %(Tags)s" % hit["_source"])