Extracting bookmarks and folder hierarchy from Google Chrome with BeautifulSoup
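I have a large collection of bookmarks in Google Chrome, containing links, subfolders in between the links, and some subfolders with further subfolders.
Now I want to extract the URLs, together with some other information, as plain text for further processing.
To do that, I exported all bookmarks from the Google Chrome bookmark manager to an HTML file named bookmarks_8_2_21.html.
The sample portion of the file that I will use below is: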
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><H3 ADD_DATE="1606927410" LAST_MODIFIED="1620226362" PERSONAL_TOOLBAR_FOLDER="true">Bookmarks bar</H3>
<DL><p>
<DT><A HREF="javascript:location.href='org-protocol://capture?template=l&url='+encodeURIComponent(location.href)+'&title='+encodeURIComponent(document.title)+'&body='+encodeURIComponent(window.getSelection())" ADD_DATE="1607739285">org-capture-bookmark</A>
<DT><A HREF="https://www.google.de/" ADD_DATE="1554935207" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACIklEQVQ4jYWSS0iUURTHf/fe8RvHooE2VlT2FNqUGWmNEYUR9lhEEVJhUIsoXOQuap1Rq6KHNQt3LaPAIOxhlNTChUwLMU3NR1CklUzg6xvPd1ro2KhTHjjcA/e8/uf/hzmmqsUiEheRLhHxp/2TiDxQ1aK5+ZmFeSJSrwuYiMRVNZKuMxnFz51zu9T3GX/6iPGmRqS/F5WAUMEawuUVRI5UYjwPEWl2zlUYY8YMgIjUW2vPBkPfSV6uYbKvJ+uW3rZSojfuABAEQdw5d96oajHQqr7P8IUqpL8X43lEjp3EK4mBtfgt75l4+4po7U3cytWZPbcyjUlTidv642ipDu7foX7bh2zgs92jDhHpUlWdbNmuEw15OvqweqE7ZjboCAEFADrSjs1LkRM7NAt3+bWRebfYudFx9XguwFqbwePs9z/mT/6NLdAHMBpex28W0/C1Y1Zy05VFM75nUwiAZVGT/v5sgdcA3UurOPUrxvXOFhJD7fOmdn4LeNc5NbpkfWimv5mWZ8KXFKdfXqInOYBnc6gsPEjZ8mKssbQOtvEkMczYl0oK8z3un4lgppbYkhZS3Fp7bnD0Jxeba+lODmTFviFcxq29NeRHDUEQ1DnnqtNSjohIo3Nutx+keNz9gmf9zfQkB0ChYMkK9q2KcaLwMJFQGFV9Y4w5YIwZzyBBI2lRLcD9PVXN/SdFqlokInUi0iEiE9P+UUTuqurmufl/AKTzsFGmvUNUAAAAAElFTkSuQmCC"></A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_1</H3>
<DL><p>
<DT><A HREF="https://whosebug.com/" ADD_DATE="1605695883" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABXklEQVQ4jbWQsUsCYRjGn/fuSu/Sk3ALmlzNtoagKRqSaHMKGkKhEOV0KWispSXPQaglAnNobOgfaCyIcgicmxO9zFPv/N5WwTs5gt7x+5739/2eDwgw/bK67HcnBQG4Ag3L0LJ/BoBFDuDzTiGUCAywDC3bNbRtANCrwxaBziRZanAGcjADwR8AX1uGesEZyFGzXwO43VsKn07GaJa5lY/GMefUAYooEvaELDnCEW9M2I1V7GdPg04hlLAM7dYqqut67ftLNwdpMB5dgRfXdVMgHIFpx9egfbwYk0eDA2LKAWJMkK6cUOhOGdkpZmoQiy29OmwFq1AKb5CgQyakAXqQJKpELn/eJzPK1JKhPhHjk4EmMzUVmU/coVLkeXff672pk155YXUsxikCJQFeYVCSgCiAV920N311b+r37FslH413S+qaV86rggfIBbG38RRAN+2ZHzsTMKvGv80vvziHGAusG84AAAAASUVORK5CYII=">Stack Overflow - Where Developers Learn, Share, & Build Careers</A>
<DT><A HREF="https://stackexchange.com/" ADD_DATE="1605695914" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABIUlEQVQ4ja2SvUoDQRSFvztZDSIKWwbxB2zs7eyCWGljYWUlCKKm0kfwEQRR8A0kaGEfW+2VgD+JGxMlELOsWxhJdizMrmtM1hVyYGDm3Pm4c5grtJXLaaM0UV8HTKJVH7fM43RamgCG75Yn7bOkUot/wADkU7V5YAVA+eaAyEIc+LXRpPbRitUolsTf7F64Ouqi9dbi0fGC89WqKRCK8B+4rwoirB3d/YpQqDa4r753BUv7s9ERouC+6jvC3vmTSqixQsXmoWxHQhrcYUOnbk623SCCiCTjwO2uQw7JQYCEb47OLOWLFXsaeAGev5Z2QEYIjTxwKyI7Vnbj8keEXppaPgj/zmbxdOswXI81SICjIdMJ0/G0nvKUN2dlM9fdap8MMGR5HOUBZgAAAABJRU5ErkJggg==">Hot Questions - Stack Exchange</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Subfolder</H3>
<DL><p>
<DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAm0lEQVQ4je2SsQ3CMBBF30XMglKyBB7DEiNATSpnADYAeQtgEcIyR2EiHONgU9DxSv/736fTFyKMv+1BHEW0u9i2B2imZnZlM4C45zyL+DFNz/HaUhzQN+nAJ3NOfwv4ln/ALwLGgsyR6lGRtBsLYvxQrLOg28kGoSDalZcOn51tewhBFRg/KIAim6tdnmKt+og5czXr4301pz0AqgIzDZOACvcAAAAASUVORK5CYII=">Meta Stack Exchange</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Another Subfolder</H3>
<DL><p>
<DT><A HREF="https://en.wikipedia.org/wiki/Main_Page" ADD_DATE="1605696025" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia, the free encyclopedia</A>
</DL><p>
</DL><p>
<DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia</A>
<DT><A HREF="https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)" ADD_DATE="1605696102" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Beautiful Soup (HTML parser) - Wikipedia</A>
</DL><p>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_2</H3>
<DL><p>
<DT><A HREF="https://www.reddit.com/" ADD_DATE="1605696212" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACdUlEQVQ4jWXTTYjVdRTG8c/5/e+945hNMhLEgFRUA1lgNYHojJjpDXduQkSqjRiKJNSu2rRJV0VQi14IKohCW8UEIRJZ05uUGESrBBOsJocayRznvvxPizup0dmcxeF5zoHzfAOSCBKybRP2SpO4SUjMSjOKV+Ooz67VhKXKdUZc75CwWzGk/neAQBXUuSi9oevp+NRFiCRMGDHqiIa2rkSNgqu3Xc5aUQwLfR9bsCO+8FcjyLzBIQ1tPR2hIVWD7ZEeforMsPLGyh331g7u6rr4xzbDnseBkltNqezW0xfRIIqqpDrSytHQGg7Tb6XZM6mzWNDQUwuP5xbrI9veU+xU61tUBHpCB+vWs3qcd9+mhYbUxDK1UKm905AmpdQTbh2n2wnj97F5F7fcRd1j7WaOv89Pp0JzKP36c2hKTMmHymJulHlgQ51/X8i8MJdZ9/N/1e9lzp3LnD+fuW+izo0y26VTXPljDvrIKl5+gn330+/R67B3gjefYdUYzSFKdUVWZD1reaRvv0rffJT6fe6epP0YVYNGi8nt3LxmYHjyWDp1Ii2Xsj5bpM+VCEX6+kNZVTy4i3s2Mf878+fZsJ32o7JqcGJ6EK8SIcxUz91mVp2PaEZx+gcREcYn0ty5cPp7fjvD8HVpZFQceSEcfjG1hDo7evYH5BavGLJfR8dlDXeuLR7YkcZuH8T4l9Ph+Af8eLLW0tfS1PVSHPPkIMqTVhh2WNM2/UgLWesITQNWulJLGo6iytA1rdjpqEtXYVpjhTEHhT2qsoygXiKqlAFVvXoBr/nTs/GdS5Y4+y/OW01hD6awesn/rDCj7/X4xJfXav4BhnocQyGrEocAAAAASUVORK5CYII=">reddit: the front page of the internet</A>
<DT><A HREF="https://www.youtube.com/" ADD_DATE="1574152707" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABx0lEQVQ4jZ2TQWtTQRDHfzO7yUsNKSG0heJJ0YKnCvVSkHrV7+BBeu7Vk9+lH8CLN6EXk4Lo1V48lGClFBELGmmSmr73djy8fS8vll4c+LO7s7OzM///LhQmBmrg+uDtZrgIBQQAKyf/YQbiBexFt9t9niSP7ji3umrW7og0UfWL0ZaOzGYjmBxn2c/BdPpJxuNz+vDwFxxnqmaqZs4V8L5AuS6haqmqXcDpO3jCEN4YmEFmkMcxrSCSGqSh8GcGIY52Am91Ce5HJ5EYBRwbG45222HmUHVS+DX2DhASuKu3oAeolGSqCiDs7gqHh8L2thCCxOQSbxFAO9BToClzNQSJokwmsLUFgwHs78P6eklnlQho6c0axUKbTVhZgUYjHpeqCwE8kMWFxYNF9uVlGA5hbw8ODuqJzajqnPENPtfYDxU2N4OtrRVzkVDfC0V8/h2G/g98AR7ESkJF8tFRcYdzkOf15iRW6y/hlD48voCz+BbmELFrvhrG8OMjPBOAV3D7qXM791R73Var00qSJRoNTyifB5Dn+eVsNv19dTX+mmWj93n+4SWciM1LKkmqy7RoImFBqPqPfH39K7t/4A18f74nAH8Bjm35s3ZkOjEAAAAASUVORK5CYII=">YouTube</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">stuff</H3>
<DL><p>
<DT><A HREF="https://www.pgadmin.org/" ADD_DATE="1566393697" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAC4ElEQVQ4jU2TS2hdZRSFv7XPf26eNbV52jRp0tYYiVVDfUBBlEpHTuxIOhEzMHUi6EjswIEOBQV1YsFWyMCJCI6sGhBRJ1JDrVpMGnuJSXprTFqpjXmc8//bwb0R93zvvdb+9tL4qfeetBDOxFgMmeMOBiAhAHdHEu44gCC5kIXKQiz8VJA4CwwGKbowASbh7sSUGs2OEA44mHtyT3FY5ueCuw9sb27Gje3SKsEoo1PGqKY8o6UpR4AkortM8vrATMQYHfpDEcs40LPbHhrZq9mltdS5q8UO7u30mSvXdKlawx2KoqS1uclvb24Jl+fBMJOFzFJY3yjsscNDevXkExRlVB6yhlU4/cF5vvtlgVeeeZzDw336dfFPr9Zu6shIP1NfzvDZ97MKkrwsk2JKrG9u89qH0/TtaefFp4+ycnOd9186wehgD9dv/M3YUK+OH7kbgG9+rnoRnYAnudcPNz3zG1PTMwx072a47052tTUxOtjD5YU/mHjzY279s8WZl09wdGw/G1uFJAhIyOqHuv9AH2P7exkd6Karow1PTh0lxOjEmADIzGhQJngDG8DIvi59/fYkmRkA585foFq7wdhQL5++8Szg9Hd14O6Y1fEGAWVMxJT49qcFv7K8qkP9nVycrzG3tMrzb33C5FOP8PA9+7hUvc7iyl88eu8gRRkbChwPmSkz4/bGFqfPfk7XHW2UMSLBsfGDzF9b46OvfqS7o43XnzvukjS7tOZ5CIQQTEtrt/hhbtnnllfp6Wj31uZcABtbBSePPcD4oX7+V5r6YoYLc0tqb8nR+AvvFEWZzJMbUqoEM5CDK7nT1lzhvqFeDty1x4sy6vLvK35xvkYlD5jhenDy3WQiIdlOeMC1s66Mic3tkpgSIK+EjNbmHJyUIAtm2SKZBr0sI40k7igAkYeMpjyA5Dg4TkzJLcszeVo20ITQVRcSpP+MSvXsAcnrPxBTwpMnSUhW9eQT/wJc4GRalsmdmQAAAABJRU5ErkJggg==">pgAdmin - PostgreSQL Tools</A>
</DL><p>
<DT><A HREF="https://www.gnu.org/software/emacs/" ADD_DATE="1605696341" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs - GNU Project</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">other stuff</H3>
<DL><p>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">emacs</H3>
<DL><p>
<DT><A HREF="https://www.gnu.org/software/emacs/download.html" ADD_DATE="1605696357" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs download - GNU Project</A>
<DT><A HREF="https://www.gnu.org/software/emacs/documentation.html" ADD_DATE="1605696393" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs documentation - GNU Project</A>
</DL><p>
</DL><p>
<DT><A HREF="https://orgmode.org/" ADD_DATE="1605696413" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAClElEQVQ4jZ1TzU8TcRSc/f36Ddt2aWkLKbi0FCSUCCSiRUTExCBGYyJKotGLHv03jJ41Xjx48aQXPRA1xujBGJRoJCIYAhQo1NKvbZfdbbct2/WkIQIhOqdJ3ps5vJlHsT9qAOgAqrsN6V6qjqGhcQeloeELp29bOJe0sbI6/y8Gnu5TA89bONvNoxWh3dQaOOkd6DMvTn6JApC2LzLbOHsu5D7R6HZ3NVD9hnJxPFh2e0BKKlS1jHq+CbGZH6vxleV7qiDFqtDLm8mNZQYAejxs5Erk0OP2poYgCjLKiTXMRYah9EUAWYYsiLCxNtjsLKpaFVJOhMPNYeHz1xgJ2s2t1473vuoPtwWh69jMZaGAAZn6iFJsDQylcHpcMFktIJTAaDGBUAoxm0d6PfnCcL63/c7hUIs9KxehCBlUJAnR5hAyfh7leALFLQ2s046ipCATi7/TCSNpqtaQTSZL0+9f3zX467j+LR1Q8wK2ZBESMUI62AlHOAxhaRkr07PfNtP5h7KYi868/fDy72sbNB1qQUijJArQ1BIS/laAUmiyAr1aRUGS30w+m3iwV9wklU7PavkMNFlG1N0IJRAC4/XBQBgYTCawdY4RH8/zexlQO0NFr4u9muI8SAfaYA53wcAAhBAYjEYQSus9LU3XWadzOb4QndthsCjKC2WrNWs6Mzpay/NgCANCKXRdh8lqgdlmRbVSsbJu7pLFap//ubj0fUcTo8nclNXnzXsDB0ac9S4YTEYwhKAgSqioKkpFFZYaG2rr7GOlMp4K6+uZ3ZqIjsGewc4jA5eNFqNDSmXnYrG1T0o8keK8PgfX6DrmavbfknJifuL+o26GYXZ9rv3ARsbOPmmL9Jz6H/Ef8Dzv/M1/AdxXB/z0rsGnAAAAAElFTkSuQmCC">Org mode for Emacs</A>
</DL><p>
</DL><p>
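I want to extract the following information from this file:
- URL
- Description
- Add date
- Folder hierarchy / path of the URL
I managed the first three with BeautifulSoup, but I cannot seem to get the fourth. So let me try to explain it further.
Let's assume the following folder hierarchy: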
Bookmarks
\_Bookmarks bar
  \_Folder_1
    \_Subfolder
      \_Another Subfolder
  \_Folder_2
    \_stuff
    \_other stuff
      \_emacs
Ideally, I would like the URL in 'Another Subfolder' to produce output like this:
https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Bookmarks bar/Folder_1/Subfolder/Another Subfolder
But this output would already be very useful:
https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Another Subfolder
My current code is:
from bs4 import BeautifulSoup

def read_in_file(filename):
    f = open(filename, 'r')
    soup = BeautifulSoup(f.read(), 'html.parser')
    f.close()
    return soup

soup = read_in_file('bookmarks_8_2_21.html')
for line in soup.find_all('a'):
    print(line.get('href'))      # 1) URL: works
    print(line.get_text())       # 2) Description: works
    print(line.get('add_date'))  # 3) Add Date: works
    dir = soup.find('h3')        # 4) Folder hierarchy/path: not working
    print(dir.contents)          # only prints ['Bookmarks bar']
    print()
Output for one entry so far:
https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
['Bookmarks bar']
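I also experimented with siblings and found out how to print the folder hierarchy, but I cannot get it to work together with the rest of the code:
Code snippet:
for dir in soup.find_all('h3', recursive=True):
    print(dir.text)
Output: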
Bookmarks bar
Folder_1
Subfolder
Another Subfolder
Folder_2
stuff
other stuff
emacs
Thank you for your help and suggestions!
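The problem is probably related to how your bookmarks file is imported, or to how BS reads it. More specifically, to how it reads the Description Term, or <DT>, elements. These tags are never closed in your exported file, so the parser does not know where each one should end and may close it somewhere unexpected.
So I first closed the tags on the same line they open on; after that it should be easy for you to extract the data.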
from bs4 import BeautifulSoup

# The 'lxml' parser requires the lxml package to be installed.
with open('bookmarks.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

dt = soup.find_all('dt')
folder_name = ''
for i in dt:
    n = i.find_next()  # next element after the <dt>: an <h3> (folder) or an <a> (link)
    if n.name == 'h3':
        # Remember the most recently seen folder name.
        folder_name = n.text
        continue
    else:
        print(f'url = {n.get("href")}')
        print(f'website name = {n.text}')
        print(f'add date = {n.get("add_date")}')
        print(f'folder name = {folder_name}')
        print()
Here is part of the output; I hope it helps:
url = https://whosebug.com/
website name = Stack Overflow - Where Developers Learn, Share, & Build Careers
add date = 1605695883
folder name = Folder_1
url = https://stackexchange.com/
website name = Hot Questions - Stack Exchange
add date = 1605695914
folder name = Folder_1
url = https://meta.stackexchange.com/
website name = Meta Stack Exchange
add date = 1605695986
folder name = Subfolder
url = https://en.wikipedia.org/wiki/Main_Page
website name = Wikipedia, the free encyclopedia
add date = 1605696025
folder name = Another Subfolder
url = https://www.wikipedia.org/
website name = Wikipedia
add date = 1605696017
folder name = Another Subfolder
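Here I assume that any link below a folder name belongs to that folder, but that may not hold, for the reason I add below.
If you want more accurate results, you should also consider closing the p tags, since they are left open as well and the parser may close them anywhere.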
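The answer does not show how the tags were closed. As a rough sketch of one possible way, assuming every bookmark entry sits on its own line starting with <DT> (as in the export above), the hypothetical helper close_dt_and_drop_p below appends </DT> to each such line and strips the stray <p> tags. The function name and the line-based approach are assumptions for illustration, not part of the original answer:

```python
import re


def close_dt_and_drop_p(html_text):
    """Close each one-line <DT> entry and remove the unclosed <p> tags."""
    cleaned = []
    for line in html_text.splitlines():
        # The export writes a bare <p> after <DL> and </DL>; drop it.
        line = re.sub(r"<p>", "", line, flags=re.IGNORECASE)
        # Entries start with <DT> and never close it; close it on the same line.
        if line.lstrip().upper().startswith("<DT>"):
            line = line.rstrip() + "</DT>"
        cleaned.append(line)
    return "\n".join(cleaned)


# Shortened sample in the same shape as the exported file.
sample = """<DL><p>
<DT><H3>Folder_1</H3>
<DL><p>
<DT><A HREF="https://www.reddit.com/">reddit</A>
</DL><p>
</DL><p>"""

print(close_dt_and_drop_p(sample))
```

In the cleaned text, a folder's <DL> comes right after its closed <DT> line instead of inside it, so any code that maps links to folders has to take that into account.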
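The way forward would be to find the dl tags and walk through each of them separately, to work out which dt tag sits under which folder, i.e. which dl element.
This is a very specific kind of problem, because not everyone organizes their bookmarks the same way. Also keep in mind that the HTML differs depending on how the folders are organized. For example, the HTML file changes depending on whether links or subfolders come first.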
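The dl-by-dl walk suggested above can also be done without BeautifulSoup. The following is only a sketch and not the answerer's code: it swaps in Python's standard-library html.parser.HTMLParser and keeps a stack of folder names that grows on every <DL> and shrinks on every </DL>. The class name BookmarkParser, its attribute names, and the shortened SAMPLE input are made up for illustration, and it assumes, as in the export above, that each folder's <H3> is immediately followed by its <DL>:

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect url, title, add_date and folder path of every bookmark."""

    def __init__(self):
        super().__init__()
        self.folders = []    # folder names of the <DL> levels currently open
        self.pending = None  # <H3> text waiting for its <DL> to open
        self.capture = None  # "h3" or "a" while inside that element
        self.text = []       # text chunks of the element being captured
        self.link = None     # attributes of the <A> being read
        self.bookmarks = []  # finished records

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names.
        if tag == "h3":
            self.capture, self.text = "h3", []
        elif tag == "a":
            self.capture, self.text = "a", []
            self.link = dict(attrs)
        elif tag == "dl":
            # The outermost <DL> has no <H3>; None keeps push/pop balanced.
            self.folders.append(self.pending)
            self.pending = None

    def handle_data(self, data):
        if self.capture:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self.capture == "h3":
            self.pending = "".join(self.text).strip()
            self.capture = None
        elif tag == "a" and self.capture == "a":
            self.bookmarks.append({
                "url": self.link.get("href"),
                "title": "".join(self.text).strip(),
                "add_date": self.link.get("add_date"),
                "folder": "/".join(name for name in self.folders if name),
            })
            self.capture = None
        elif tag == "dl" and self.folders:
            self.folders.pop()


# Shortened sample in the same shape as the exported file (icons omitted).
SAMPLE = """<DL><p>
<DT><H3 ADD_DATE="1606927410">Bookmarks bar</H3>
<DL><p>
<DT><H3 ADD_DATE="1606928255">Folder_1</H3>
<DL><p>
<DT><H3 ADD_DATE="1606928255">Subfolder</H3>
<DL><p>
<DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986">Meta Stack Exchange</A>
<DT><H3 ADD_DATE="1606928255">Another Subfolder</H3>
<DL><p>
<DT><A HREF="https://en.wikipedia.org/wiki/Main_Page" ADD_DATE="1605696025">Wikipedia, the free encyclopedia</A>
</DL><p>
</DL><p>
<DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017">Wikipedia</A>
</DL><p>
</DL><p>
</DL><p>
"""

parser = BookmarkParser()
parser.feed(SAMPLE)
parser.close()
for b in parser.bookmarks:
    print(b["url"], b["title"], b["add_date"], b["folder"], sep="\n", end="\n\n")
```

For the Main_Page link this prints the path Bookmarks bar/Folder_1/Subfolder/Another Subfolder, and the Wikipedia link that follows the closed subfolders is attributed to Bookmarks bar/Folder_1 again. Because the stack only reacts to <DL> and </DL>, it does not depend on where a parser decides to close the <DT> and <p> tags.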
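I had been struggling with this problem too; I wanted the complete set of folders, i.e. the parent folder and the subfolders.
I wrote a simple function that finds the parent directories when you pass it a link element.
For example: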
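import bs4

def find_parent_dir(l):
    # Walk up the tree and return the folder names, outermost first.
    if l is None:
        return None
    # A <dt> that contains an <h3> is treated as a folder entry.
    # Note: the exported <DT> tags are unclosed, so which <dt> ends up
    # containing an <h3> depends on how the chosen parser nests them.
    if l.h3 and l.name == "dt":
        current_folder = l.h3.getText()
        parents = find_parent_dir(l.find_parent("dl"))
        if parents is None:
            return [current_folder]
        else:
            return parents + [current_folder]
    return find_parent_dir(l.parent)

with open("bookmarks_8_29_21.html") as fh:
    html_obj = bs4.BeautifulSoup(fh.read(), 'html.parser')

links = [link for link in html_obj.find_all("a")]
folders_path = find_parent_dir(links[0])
print(folders_path)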