Wikipedia API call for specific content on a page

How can I make a Wikipedia API call to get the name, location, and country of the top 5 airports on this page?

http://en.wikipedia.org/wiki/List_of_the_world%27s_busiest_airports_by_passenger_traffic

Here you can see everything you need as pretty-printed JSON:

http://en.wikipedia.org/w/api.php?format=jsonfm&action=query&titles=List_of_the_world's_busiest_airports_by_passenger_traffic&prop=revisions&rvprop=content

Change ?format=jsonfm to ?format=json and you will get usable data.
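As a small sketch of that switch, the snippet below rewrites the pretty-printed URL from above into the raw JSON one. The variable names are only illustrative, and the actual fetch is left commented out because it needs network access.

```shell
# Pretty-printed (HTML) endpoint quoted in the text above.
pretty_url="http://en.wikipedia.org/w/api.php?format=jsonfm&action=query&titles=List_of_the_world's_busiest_airports_by_passenger_traffic&prop=revisions&rvprop=content"

# Swap the format parameter to ask for raw JSON instead of an HTML page.
json_url=$(printf '%s\n' "$pretty_url" | sed 's/format=jsonfm/format=json/')

printf '%s\n' "$json_url"

# To fetch it, quote the URL so the shell does not interpret & or ':
#   curl -sL "$json_url"
```

Quoting the URL is an alternative to escaping each & and ' with a backslash; -L makes curl follow a redirect if the server sends one.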

Solution:

Running this command on Linux gives you the rows of all the lists:

curl http://en.wikipedia.org/w/api.php?format=json\&action=query\&titles=List_of_the_world\'s_busiest_airports_by_passenger_traffic\&prop=revisions\&rvprop=content | sed 's|\\u||g' | grep -onE '\\n\|[0-9]+\.\|\|[^\\]*'

Each line of the output is one airport, in ranking order (30 or 50 airports per list, depending on the list).

And this command prints only the names, and nothing else:

curl http://en.wikipedia.org/w/api.php?format=json\&action=query\&titles=List_of_the_world\'s_busiest_airports_by_passenger_traffic\&prop=revisions\&rvprop=content | sed 's|\\u||g' | grep -onE '\\n\|[0-9]+\.\|\|[^\\]*' | grep -onE '} \[\[[^]]*]]' | sed 's/[\[|:}]//g; s/]]//; s/[0-9][0-9]*//g; s/ //'

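Note: all the lists on the page are concatenated, so the last entry is not really number 600. Only the first 30 lines carry their real rank; each following block of 30 or 50 lines (depending on the list you are looking at) comes from a different list.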
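Because the first list comes first in the concatenated output, the top 5 asked about in the question are simply the first five lines. A minimal sketch, using made-up placeholder names instead of real output:

```shell
# Hypothetical stand-in for the names printed by the command above
# (first list first, later lists appended after it).
names='Airport A
Airport B
Airport C
Airport D
Airport E
Airport F
Airport G'

# Keep only the first five lines, i.e. ranks 1 to 5 of the first list.
top5=$(printf '%s\n' "$names" | head -n 5)

printf '%s\n' "$top5"
```

In the real pipeline this would amount to appending | head -n 5 to the command.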

Explanation:

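I got the URL endpoint from here, then used curl to make a GET request to Wikipedia's API, which returned all the available data for the requested page. I then used regular expressions to parse out the values I needed. The regular expressions I used are: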

sed 's|\\u||g'

This one is run by sed (the stream editor). It finds every occurrence of \u (the prefix of a Unicode escape in JSON) and removes it. I need to do that because later I use the string '\n' (the escaped newline) as the separator between rows. It works through sed's s (substitute) command, which replaces every occurrence of the string \u with nothing. The backslash is doubled because it has to be escaped; otherwise it would be interpreted as part of the command.
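To see this step in isolation, here is a minimal sketch run on a made-up fragment rather than on the real API response. The sample string and variable names are assumptions for illustration only.

```shell
# Hypothetical fragment of the JSON text: \u2013 is an escaped en dash and
# \n is an escaped newline (a backslash followed by n, not a real line break).
sample='Hartsfield\u2013Jackson\n|-\n|2.||next row'

# Delete every literal \u so the only backslashes left belong to \n.
cleaned=$(printf '%s\n' "$sample" | sed 's|\\u||g')

printf '%s\n' "$cleaned"
```

Note that only the \u prefix goes away; the hex digits of the escape (2013 here) stay in the text until the later digit-stripping step removes them.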

grep -onE '\\n\|[0-9]+\.\|\|[^\\]*'

This regular expression is run by grep. First (as mentioned before) we match the escaped newline, \n; again, the backslash has to be escaped. Then we match the | character, which has to be escaped too. Next we match any number of digits with [0-9]+: everything inside [] is a set of characters, 0-9 is the range we want, and + means one or more. We also want the . character, which has to be escaped as well, followed by | twice. At this point we have matched the index, and now we want every character up to the end of the row, which is the next '\n'. Since we have already deleted the useless \u, all the backslashes left belong to newlines, so the set we need is [\\]. We want to negate it, so we add ^ in front of the backslashes, and the * then matches zero or more characters that are not backslashes. The -onE in front of the regular expression are the options passed to grep: o = print only the matched part, n = prefix each output line with its line number, and E = extended regular expressions.
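The row matching can be tried on a small sample without touching the network. The wikitext below is invented for illustration (the real page's markup may differ); it only mimics the \n|1.||... row shape described above.

```shell
# Hypothetical one-line input, as it would look after the \u removal:
# rows are separated by the two characters \n, not by real line breaks.
sample='intro text\n|1.||{{steady}} [[Airport A]]||[[City A]]\n|-\n|2.||{{increase}} [[Airport B]]||[[City B]]\n|-'

# Print each ranked row on its own line; -n prefixes the input line number,
# which is always 1 here because the whole input is a single line.
rows=$(printf '%s\n' "$sample" | grep -onE '\\n\|[0-9]+\.\|\|[^\\]*')

printf '%s\n' "$rows"
```

The \n|- separators are skipped because the expression requires digits right after the first |.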

grep -onE '} \[\[[^]]*]]'

At this point we have all the rows, each with all its available data, and we want just the names, which are enclosed in [[...]] and always come right after a }. This works the same way as before, except that the character we exclude this time is ] instead of \.
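A quick way to check this step is to feed it two invented rows. The sketch assumes the negated set is written [^]]*, with ] placed first inside the brackets so that it is taken literally; the row contents are placeholders.

```shell
# Hypothetical rows, one per line, shaped like the output of the previous step.
rows='\n|1.||{{steady}} [[Airport A]]||[[City A]]
\n|2.||{{increase}} [[Airport B]]||[[City B]]'

# Keep only the first wiki link that directly follows "} " on each row.
links=$(printf '%s\n' "$rows" | grep -onE '} \[\[[^]]*]]')

printf '%s\n' "$links"
```

The city links are not matched because they follow || rather than a }, and the -n option again adds a line-number prefix that the final sed step has to strip.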

sed 's/[\[|:}]//g; s/]]//; s/[0-9][0-9]*//g; s/ //'

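All this sed command does is delete the leftover non-alphabetic characters: it groups the backslashes, opening brackets, pipes, colons and closing braces inside [] and replaces them with nothing, then removes the closing ]], the digits, and the first space. It may not be the most efficient way to do it, but it works.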
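Here is the cleanup applied to two placeholder matches, as a sketch of what each substitution contributes. The input lines are invented for the example.

```shell
# Hypothetical matches from the previous grep, including the "N:" prefix
# that grep -n adds.
links='1:} [[Airport A]]
2:} [[Airport B]]'

# 1) drop \ [ | : }   2) drop the closing ]]
# 3) drop all digits  4) drop the first remaining space
names=$(printf '%s\n' "$links" |
  sed 's/[\[|:}]//g; s/]]//; s/[0-9][0-9]*//g; s/ //')

printf '%s\n' "$names"
```

One consequence of the digit substitution is that any digits inside an airport name would be removed as well.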

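Important: I have just noticed that the JSON contains spaces in some places and not in others, so I had to tweak the regular expressions slightly. I won't change the explanation above, because all I added is a ? wherever there may or may not be a space.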

curl http://en.wikipedia.org/w/api.php?format=json\&action=query\&titles=List_of_the_world\'s_busiest_airports_by_passenger_traffic\&prop=revisions\&rvprop=content | sed 's|\\u||g' | grep -E '\\n\|[0-9]+\.\|\|[^\\]*'  | grep -onE '} ?\[\[[^]]*]]' | sed 's/[\[|:}]//g; s/]]//; s/[0-9][0-9]*//g; s/ //'
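The whole adjusted pipeline can be exercised offline by wrapping it in a function and piping in a made-up sample. This is only a sketch: extract_names and the sample text are hypothetical, and the last substitution is written as s/^ // (a deliberate change from the command above), because when no space follows the } the unanchored s/ // would instead remove the first space inside the name.

```shell
# Read API-style JSON text on stdin and print one airport name per line.
extract_names() {
  sed 's|\\u||g' |
    grep -E '\\n\|[0-9]+\.\|\|[^\\]*' |
    grep -onE '} ?\[\[[^]]*]]' |
    sed 's/[\[|:}]//g; s/]]//; s/[0-9][0-9]*//g; s/^ //'
}

# Hypothetical sample: row 1 has a space after }}, row 2 does not.
sample='intro\u2013text\n|1.||{{steady}} [[Airport A]]||[[City A]]\n|-\n|2.||{{increase}}[[Airport B]]||[[City B]]\n|-'

names=$(printf '%s\n' "$sample" | extract_names)

printf '%s\n' "$names"
```

With real data the function would sit at the end of the curl command, for example curl ... | extract_names.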

The output is on pastebin here.

Further reading: this link will help you use regular expressions in JavaScript.

No curl needed: you can test the output of any request here.