Swift scraping a webpage using regex or alternative
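See the update below first.
I'm trying to scrape all of the moderators of a specified subreddit on reddit.
The API only lets you get the usernames of a subreddit's moderators, so initially I fetched all of those and then made an additional request for each profile to get the avatar URL. That ended up exceeding the API limit.
So instead I want to fetch the page source and paginate, collecting the 10 usernames and avatar URLs on each page. That way I poll the site with far fewer requests. I understand how to do the pagination part, but right now I'm trying to work out how to collect the usernames and their adjacent avatar URLs.
So take the following URL:
https://www.reddit.com/r/videos/about/moderators/
I would pull the whole page source, add each mod's username and avatar URL to a mod object, and then add that object to an array.
Is using regex on the string I get back a good idea?
Here is my code so far; any help would be great: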
    func tester() {
        let url = URL(string: "https://www.reddit.com/r/videos/about/moderators")!
        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            guard let data = data, error == nil else {
                print("\(error)")
                return
            }
            let string = String(data: data, encoding: .utf8)
            let regexUsernames = try? NSRegularExpression(pattern: "href=\"/user/[a-z0-9]\"", options: .caseInsensitive)
            var results = regexUsernames?.matches(in: string as String, options: [], range: NSRange(location: 0, length: string.length))
            let regexProfileURLs = try? NSRegularExpression(pattern: "><img src=\"[a-z0-9]\" style", options: .caseInsensitive)
            print("\(results)") // This shows as empty array
        }
        task.resume()
    }
I also tried the following, but got this error:
Can't form Range with upperBound < lowerBound
Code:
    func tester() {
        let url = URL(string: "https://www.reddit.com/r/videos/about/moderators")!
        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            guard let data = data, error == nil else {
                print("data was nil")
                return
            }
            guard let htmlString = String(data: data, encoding: .utf8) else {
                print("cannot cast data into string")
                return
            }
            let leftSideOfValue = "href=\"/user/"
            let rightSideOfValue = "\""
            guard let leftRange = htmlString.range(of: leftSideOfValue) else {
                print("cannot find range left")
                return
            }
            guard let rightRange = htmlString.range(of: rightSideOfValue) else {
                print("cannot find range right")
                return
            }
            let rangeOfTheValue = leftRange.upperBound..<rightRange.lowerBound
            print(htmlString[rangeOfTheValue])
        }
        task.resume()
    }
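Update:
I've got to the point where it gives me the first username, but I'm looping and getting the same username over and over. What is the best way to advance on each iteration? Is there a way to do something like let newHTMLString = htmlString.dropFirst(k: ?), replacing htmlString with the substring that follows the element we just extracted?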
    func tester() {
        let url = URL(string: "https://www.reddit.com/r/pics/about/moderators")!
        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            guard let data = data, error == nil else {
                print("data was nil")
                return
            }
            guard let htmlString = String(data: data, encoding: .utf8) else {
                print("cannot cast data into string")
                return
            }
            let counter = htmlString.components(separatedBy:"href=\"/user/")
            let count = counter.count
            for i in 0...count {
                let leftSideOfUsernameValue = "href=\"/user/"
                let rightSideOfUsernameValue = "\""
                let leftSideOfAvatarURLValue = "><img src=\""
                let rightSideOfAvatarURLValue = "\">"
                guard let leftRange = htmlString.range(of: leftSideOfUsernameValue) else {
                    print("cannot find range left")
                    return
                }
                guard let rightRange = htmlString.range(of: rightSideOfUsernameValue) else {
                    print("cannot find range right")
                    return
                }
                let username = htmlString.slice(from: leftSideOfUsernameValue, to: rightSideOfUsernameValue)
                print(username)
                guard let avatarURL = htmlString.slice(from: leftSideOfAvatarURLValue, to: rightSideOfAvatarURLValue) else {
                    print("Error")
                    return
                }
                print(avatarURL)
            }
        }
        task.resume()
    }
I also tried:
    let endString = String(avatarURL + rightSideOfAvatarURLValue)
    let endIndex = htmlString.index(endString.endIndex, offsetBy: 0)
    let substringer = htmlString[endIndex...]
    htmlString = String(substringer)
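On the update's question of how to move past each match: one option, sketched here, is to keep a cursor index and restrict each search to the text after it through the `range:` parameter of `range(of:)`. The same restriction addresses the `Can't form Range with upperBound < lowerBound` error, which can arise because an unrestricted search for `"` finds the first quote on the page, and that quote may sit before the `href` match. The helper name `slices(of:from:to:)` and the sample markup are assumptions made for illustration, not reddit's actual HTML.

```swift
import Foundation

/// Collects every substring found between `left` and the next `right`,
/// scanning forward so that no match is returned twice.
/// (Hypothetical helper, named for this sketch only.)
func slices(of text: String, from left: String, to right: String) -> [String] {
    var found: [String] = []
    var cursor = text.startIndex
    // Each search is limited to the part of the string after the cursor.
    while let leftRange = text.range(of: left, range: cursor..<text.endIndex),
          let rightRange = text.range(of: right, range: leftRange.upperBound..<text.endIndex) {
        found.append(String(text[leftRange.upperBound..<rightRange.lowerBound]))
        cursor = rightRange.upperBound // continue after the closing delimiter
    }
    return found
}

// Made-up sample markup, standing in for the downloaded page source.
let cursorSample = "<a href=\"/user/alice\">alice</a><a href=\"/user/bob\">bob</a>"
print(slices(of: cursorSample, from: "href=\"/user/", to: "\""))
// prints ["alice", "bob"]
```

Because the cursor is an index into the original string, nothing has to be copied or reassigned, so `htmlString` can stay a `let` constant.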
You should be able to pull all of the names and URLs into two separate arrays with a simple regex call, like this:
    import Foundation

    func tester() {
        let url = URL(string: "https://www.reddit.com/r/pics/about/moderators")!
        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            guard let data = data, error == nil else { return }
            guard let htmlString = String(data: data, encoding: .utf8) else { return }
            let names = htmlString.matching(regex: "href=\"/user/(.*?)\"")
            let imageUrls = htmlString.matching(regex: "><img src=\"(.*?)\" style")
            print(names)
            print(imageUrls)
        }
        task.resume()
    }

    extension String {
        /// Returns the first capture group of every match of `regex`,
        /// or the whole match when the pattern has no capture group.
        func matching(regex: String) -> [String] {
            guard let regex = try? NSRegularExpression(pattern: regex, options: []) else { return [] }
            // NSRange counts UTF-16 units, so build it from the string's indices rather than from `count`.
            let fullRange = NSRange(startIndex..<endIndex, in: self)
            return regex.matches(in: self, options: [], range: fullRange).compactMap { match in
                let group = match.numberOfRanges > 1 ? match.range(at: 1) : match.range
                guard let range = Range(group, in: self) else { return nil }
                return String(self[range])
            }
        }
    }
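As for why the pattern in the question returned an empty array: `[a-z0-9]` with no quantifier matches exactly one character, so `href="/user/[a-z0-9]"` can only match one-letter usernames. The sketch below runs both forms against a made-up HTML fragment without touching the network. The function name `firstGroups(of:in:)` and the sample markup are assumptions for illustration, and the real page structure may differ.

```swift
import Foundation

/// Returns capture group 1 of every match of `pattern` in `text`.
/// (Hypothetical helper, named for this sketch only.)
func firstGroups(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return []
    }
    // NSRegularExpression works in UTF-16 offsets, hence the NSRange conversion.
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: whole).compactMap { match in
        guard match.numberOfRanges > 1,
              let range = Range(match.range(at: 1), in: text) else { return nil }
        return String(text[range])
    }
}

// Made-up fragment standing in for one moderator entry.
let regexSample = "<a href=\"/user/alice\"><img src=\"https://example.com/a.png\" style=\"\"></a>"

// No quantifier: only a single-character username could match.
print(firstGroups(of: "href=\"/user/([a-z0-9])\"", in: regexSample))
// prints []

// With a quantifier the whole username is captured.
print(firstGroups(of: "href=\"/user/([a-z0-9_-]+)\"", in: regexSample))
// prints ["alice"]

print(firstGroups(of: "<img src=\"([^\"]+)\"", in: regexSample))
// prints ["https://example.com/a.png"]
```

Using `[^\"]+` rather than `.*?` for the URL is a design choice: it stops at the closing quote by construction and cannot run past it.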
Or you can create an object for each <div class="_1sIhmckJjyRyuR_z7M5kbI"> and then pull out the name and URL from it as needed.
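A minimal sketch of that per-block idea follows, assuming each moderator entry begins with the div marker quoted above and contains one `/user/` link and one `<img>` tag. The `Moderator` type, the helper names, and the sample markup are all hypothetical. The class name looks auto-generated, so it may change whenever the site is redeployed.

```swift
import Foundation

// Hypothetical model type holding one scraped entry.
struct Moderator: Equatable {
    let username: String
    let avatarURL: String
}

/// Text between `left` and the next `right` that follows it, or nil if either is missing.
func firstSlice(_ text: String, from left: String, to right: String) -> String? {
    guard let l = text.range(of: left),
          let r = text.range(of: right, range: l.upperBound..<text.endIndex) else { return nil }
    return String(text[l.upperBound..<r.lowerBound])
}

/// Splits the page on `blockMarker` and builds one Moderator per block.
func moderators(in html: String, blockMarker: String) -> [Moderator] {
    // dropFirst() skips whatever precedes the first block.
    return html.components(separatedBy: blockMarker).dropFirst().compactMap { (block: String) -> Moderator? in
        guard let name = firstSlice(block, from: "href=\"/user/", to: "\""),
              let avatar = firstSlice(block, from: "<img src=\"", to: "\"") else { return nil }
        return Moderator(username: name, avatarURL: avatar)
    }
}

// Made-up page built from two entries; not reddit's actual markup.
let marker = "<div class=\"_1sIhmckJjyRyuR_z7M5kbI\">"
let page = [
    marker + "<a href=\"/user/alice\"><img src=\"https://example.com/a.png\" style=\"\"></a></div>",
    marker + "<a href=\"/user/bob\"><img src=\"https://example.com/b.png\" style=\"\"></a></div>",
].joined()

for mod in moderators(in: page, blockMarker: marker) {
    print(mod.username, mod.avatarURL)
}
```

Working block by block keeps each username paired with its own avatar, which two independent arrays only guarantee if every entry has both fields.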