Retrieve synonyms and similarity
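I want to scrape a few pages from www.thesaurus.com.
I'm interested in both the synonyms and the antonyms of a word.
For example, for the word angry, I'm interested in the words in the first two blocks of the page (screenshots omitted). The page contains many more words, but I only care about those two blocks.
I can find those words (and other related words) with this code: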
import requests
from bs4 import BeautifulSoup

word = "angry"
url = 'https://www.thesaurus.com/browse/{}'.format(word)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# This class name looks auto-generated, so it may change between site builds.
word_ul = soup.find("ul", {"class": 'css-1lc0dpe et6tpn80'})

# Collect the text of every link inside that list.
returned_words_list = []
for elem in word_ul.find_all("a"):
    returned_words_list.append(elem.text.strip())
print(returned_words_list)
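But I'm also interested in the similarity (shown on the page as the color of each word).
Looking at the page source, there is a JSON-like blob: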
<script>window.INITIAL_STATE = {"routerReducer":{"location":null},"searchData":{"isFetchingTunaApi":false,"isFetchingSpellSuggestion":false,"isFetchingRelatedWordsApi":false,"searchTerm":"angry","tunaApiData":{"entry":"angry","type":"normal","slugLuna":"angry","slug":"angry","pronunciation":{"audio":{"audio\u002Fogg":"https:\u002F\u002Fstatic.sfdict.com\u002Faudio\u002Flunawav\u002FA04\u002FA0484200.ogg","audio\u002Fmpeg":"https:\u002F\u002Fstatic.sfdict.com\u002Faudio\u002FA04\u002FA0484200.mp3"},"spell":"\u003Cspan class=\"bold\"\u003Eang\u003C\u002Fspan\u003E-gree","ipa":"ˈæŋ gri"},"posTabs":[{"isInformal":null,"isVulgar":"0","definition":"being mad, often extremely mad","thesRid":"842","pos":"adj.","synonyms":[{"similarity":"100","isInformal":"0","isVulgar":null,"term":"annoyed","targetTerm":"annoyed","targetSlug":"annoyed"},{"similarity":"100","isInformal":"0","isVulgar":null,"term":"bitter","targetTerm":"bitter","targetSlug":"bitter"},{"similarity":"100","isInformal":"0","isVulgar":null,"term":"enraged","targetTerm":"enraged","targetSlug":"enraged"},.....
But I don't know how to read it. In the end I want output like this:
"synonyms":[{"similarity":"100","isInformal":"0","isVulgar":null,"term":"annoyed","targetTerm":"annoyed","targetSlug":"annoyed"},{"similarity":"100","isInformal":"0","isVulgar":null,"term":"bitter","targetTerm":"bitter","targetSlug":"bitter"},{"similarity":"100","isInformal":"0","isVulgar":null,"term":"enraged","targetTerm":"enraged","targetSlug":"enraged"},
"antonyms":[{"similarity":"-100","isInformal":"0","isVulgar":null,"term":"calm","targetTerm":"calm","targetSlug":"calm"},{"similarity":"-100","isInformal":"0","isVulgar":null,"term":"cheerful","targetTerm":"cheerful","targetSlug":"cheerful"},
from which I can read term and similarity (or simply an output list of tuples), like this:
[("annoyed", 100), ("bitter", 100)...]
[("calm", -100), ("cheerful", -100)...]
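Answer: the data is embedded in the page source as the window.INITIAL_STATE object, so it can be pulled out with re and parsed with json: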
import re
import json
import requests

url = 'https://www.thesaurus.com/browse/angry?s=t'

# Capture the object literal assigned to window.INITIAL_STATE in the page source.
txt = re.findall(r'INITIAL_STATE\s*=\s*({.*})', requests.get(url).text)[0]
data = json.loads(txt)

# print(json.dumps(data, indent=4))  # <-- uncomment to see all data
print(data.keys())
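The snippet above stops at printing the top-level keys. Below is a minimal sketch of how the (term, similarity) tuples asked for in the question might be built from the parsed state. It assumes the nesting seen in the excerpt from the question (searchData → tunaApiData → posTabs → synonyms/antonyms) and that the first posTabs entry is the one of interest. SAMPLE_HTML, extract_state and term_pairs are hypothetical names used only for illustration, and the sample is a shortened stand-in for the real page so the example runs without network access. As an alternative to the greedy regex, it uses json.JSONDecoder().raw_decode, which parses one JSON value and ignores any trailing text.

```python
import json
import re

# Hypothetical, shortened stand-in for the page source (the real page is far larger).
SAMPLE_HTML = (
    '<script>window.INITIAL_STATE = {"searchData":{"tunaApiData":{"posTabs":[{'
    '"pos":"adj.",'
    '"synonyms":[{"similarity":"100","term":"annoyed"},'
    '{"similarity":"100","term":"bitter"}],'
    '"antonyms":[{"similarity":"-100","term":"calm"},'
    '{"similarity":"-100","term":"cheerful"}]'
    '}]}}};</script>'
)


def extract_state(html):
    """Return the object assigned to INITIAL_STATE as a dict."""
    match = re.search(r'INITIAL_STATE\s*=\s*', html)
    if match is None:
        raise ValueError("INITIAL_STATE not found in page source")
    # raw_decode parses a single JSON value starting at the given index
    # and ignores whatever follows it (here: ';</script>').
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


def term_pairs(state, key, tab_index=0):
    """Return [(term, similarity), ...] for key 'synonyms' or 'antonyms'."""
    # Assumed layout, taken from the excerpt in the question.
    tabs = state["searchData"]["tunaApiData"]["posTabs"]
    entries = tabs[tab_index].get(key) or []
    # similarity is stored as a string such as "100" or "-100".
    return [(entry["term"], int(entry["similarity"])) for entry in entries]


state = extract_state(SAMPLE_HTML)
synonyms = term_pairs(state, "synonyms")
antonyms = term_pairs(state, "antonyms")
print(synonyms)  # [('annoyed', 100), ('bitter', 100)]
print(antonyms)  # [('calm', -100), ('cheerful', -100)]
```

Against the live site, the same two functions could be applied to requests.get(url).text in place of SAMPLE_HTML, provided the page still embeds the state in this shape. A word with several parts of speech may have more than one posTabs entry, which is why tab_index is a parameter.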