Extracting bookmarks and folder hierarchy from Google Chrome with BeautifulSoup

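I have a collection of bookmarks in Google Chrome that contains links, subfolders between the links, and in some subfolders even more subfolders.
Now I want to extract the URLs, together with some other information, as plain text for further processing.
To do so, I exported all bookmarks from the Google Chrome bookmark manager into an HTML file named bookmarks_8_2_21.html.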

The sample section of the file that I will use in the following is:

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
     It will be read and overwritten.
     DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3 ADD_DATE="1606927410" LAST_MODIFIED="1620226362" PERSONAL_TOOLBAR_FOLDER="true">Bookmarks bar</H3>
    <DL><p>
        <DT><A HREF="javascript:location.href='org-protocol://capture?template=l&url='+encodeURIComponent(location.href)+'&title='+encodeURIComponent(document.title)+'&body='+encodeURIComponent(window.getSelection())" ADD_DATE="1607739285">org-capture-bookmark</A>
        <DT><A HREF="https://www.google.de/" ADD_DATE="1554935207" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACIklEQVQ4jYWSS0iUURTHf/fe8RvHooE2VlT2FNqUGWmNEYUR9lhEEVJhUIsoXOQuap1Rq6KHNQt3LaPAIOxhlNTChUwLMU3NR1CklUzg6xvPd1ro2KhTHjjcA/e8/uf/hzmmqsUiEheRLhHxp/2TiDxQ1aK5+ZmFeSJSrwuYiMRVNZKuMxnFz51zu9T3GX/6iPGmRqS/F5WAUMEawuUVRI5UYjwPEWl2zlUYY8YMgIjUW2vPBkPfSV6uYbKvJ+uW3rZSojfuABAEQdw5d96oajHQqr7P8IUqpL8X43lEjp3EK4mBtfgt75l4+4po7U3cytWZPbcyjUlTidv642ipDu7foX7bh2zgs92jDhHpUlWdbNmuEw15OvqweqE7ZjboCAEFADrSjs1LkRM7NAt3+bWRebfYudFx9XguwFqbwePs9z/mT/6NLdAHMBpex28W0/C1Y1Zy05VFM75nUwiAZVGT/v5sgdcA3UurOPUrxvXOFhJD7fOmdn4LeNc5NbpkfWimv5mWZ8KXFKdfXqInOYBnc6gsPEjZ8mKssbQOtvEkMczYl0oK8z3un4lgppbYkhZS3Fp7bnD0Jxeba+lODmTFviFcxq29NeRHDUEQ1DnnqtNSjohIo3Nutx+keNz9gmf9zfQkB0ChYMkK9q2KcaLwMJFQGFV9Y4w5YIwZzyBBI2lRLcD9PVXN/SdFqlokInUi0iEiE9P+UUTuqurmufl/AKTzsFGmvUNUAAAAAElFTkSuQmCC"></A>
        <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_1</H3>
        <DL><p>
            <DT><A HREF="https://whosebug.com/" ADD_DATE="1605695883" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABXklEQVQ4jbWQsUsCYRjGn/fuSu/Sk3ALmlzNtoagKRqSaHMKGkKhEOV0KWispSXPQaglAnNobOgfaCyIcgicmxO9zFPv/N5WwTs5gt7x+5739/2eDwgw/bK67HcnBQG4Ag3L0LJ/BoBFDuDzTiGUCAywDC3bNbRtANCrwxaBziRZanAGcjADwR8AX1uGesEZyFGzXwO43VsKn07GaJa5lY/GMefUAYooEvaELDnCEW9M2I1V7GdPg04hlLAM7dYqqut67ftLNwdpMB5dgRfXdVMgHIFpx9egfbwYk0eDA2LKAWJMkK6cUOhOGdkpZmoQiy29OmwFq1AKb5CgQyakAXqQJKpELn/eJzPK1JKhPhHjk4EmMzUVmU/coVLkeXff672pk155YXUsxikCJQFeYVCSgCiAV920N311b+r37FslH413S+qaV86rggfIBbG38RRAN+2ZHzsTMKvGv80vvziHGAusG84AAAAASUVORK5CYII=">Stack Overflow - Where Developers Learn, Share, &amp; Build Careers</A>
            <DT><A HREF="https://stackexchange.com/" ADD_DATE="1605695914" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABIUlEQVQ4ja2SvUoDQRSFvztZDSIKWwbxB2zs7eyCWGljYWUlCKKm0kfwEQRR8A0kaGEfW+2VgD+JGxMlELOsWxhJdizMrmtM1hVyYGDm3Pm4c5grtJXLaaM0UV8HTKJVH7fM43RamgCG75Yn7bOkUot/wADkU7V5YAVA+eaAyEIc+LXRpPbRitUolsTf7F64Ouqi9dbi0fGC89WqKRCK8B+4rwoirB3d/YpQqDa4r753BUv7s9ERouC+6jvC3vmTSqixQsXmoWxHQhrcYUOnbk623SCCiCTjwO2uQw7JQYCEb47OLOWLFXsaeAGev5Z2QEYIjTxwKyI7Vnbj8keEXppaPgj/zmbxdOswXI81SICjIdMJ0/G0nvKUN2dlM9fdap8MMGR5HOUBZgAAAABJRU5ErkJggg==">Hot Questions - Stack Exchange</A>
            <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Subfolder</H3>
            <DL><p>
                <DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAm0lEQVQ4je2SsQ3CMBBF30XMglKyBB7DEiNATSpnADYAeQtgEcIyR2EiHONgU9DxSv/736fTFyKMv+1BHEW0u9i2B2imZnZlM4C45zyL+DFNz/HaUhzQN+nAJ3NOfwv4ln/ALwLGgsyR6lGRtBsLYvxQrLOg28kGoSDalZcOn51tewhBFRg/KIAim6tdnmKt+og5czXr4301pz0AqgIzDZOACvcAAAAASUVORK5CYII=">Meta Stack Exchange</A>
                <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Another Subfolder</H3>
                <DL><p>
                    <DT><A HREF="https://en.wikipedia.org/wiki/Main_Page" ADD_DATE="1605696025" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia, the free encyclopedia</A>
                </DL><p>
            </DL><p>
            <DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia</A>
            <DT><A HREF="https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)" ADD_DATE="1605696102" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Beautiful Soup (HTML parser) - Wikipedia</A>
        </DL><p>
        <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_2</H3>
        <DL><p>
            <DT><A HREF="https://www.reddit.com/" ADD_DATE="1605696212" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACdUlEQVQ4jWXTTYjVdRTG8c/5/e+945hNMhLEgFRUA1lgNYHojJjpDXduQkSqjRiKJNSu2rRJV0VQi14IKohCW8UEIRJZ05uUGESrBBOsJocayRznvvxPizup0dmcxeF5zoHzfAOSCBKybRP2SpO4SUjMSjOKV+Ooz67VhKXKdUZc75CwWzGk/neAQBXUuSi9oevp+NRFiCRMGDHqiIa2rkSNgqu3Xc5aUQwLfR9bsCO+8FcjyLzBIQ1tPR2hIVWD7ZEeforMsPLGyh331g7u6rr4xzbDnseBkltNqezW0xfRIIqqpDrSytHQGg7Tb6XZM6mzWNDQUwuP5xbrI9veU+xU61tUBHpCB+vWs3qcd9+mhYbUxDK1UKm905AmpdQTbh2n2wnj97F5F7fcRd1j7WaOv89Pp0JzKP36c2hKTMmHymJulHlgQ51/X8i8MJdZ9/N/1e9lzp3LnD+fuW+izo0y26VTXPljDvrIKl5+gn330+/R67B3gjefYdUYzSFKdUVWZD1reaRvv0rffJT6fe6epP0YVYNGi8nt3LxmYHjyWDp1Ii2Xsj5bpM+VCEX6+kNZVTy4i3s2Mf878+fZsJ32o7JqcGJ6EK8SIcxUz91mVp2PaEZx+gcREcYn0ty5cPp7fjvD8HVpZFQceSEcfjG1hDo7evYH5BavGLJfR8dlDXeuLR7YkcZuH8T4l9Ph+Af8eLLW0tfS1PVSHPPkIMqTVhh2WNM2/UgLWesITQNWulJLGo6iytA1rdjpqEtXYVpjhTEHhT2qsoygXiKqlAFVvXoBr/nTs/GdS5Y4+y/OW01hD6awesn/rDCj7/X4xJfXav4BhnocQyGrEocAAAAASUVORK5CYII=">reddit: the front page of the internet</A>
            <DT><A HREF="https://www.youtube.com/" ADD_DATE="1574152707" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABx0lEQVQ4jZ2TQWtTQRDHfzO7yUsNKSG0heJJ0YKnCvVSkHrV7+BBeu7Vk9+lH8CLN6EXk4Lo1V48lGClFBELGmmSmr73djy8fS8vll4c+LO7s7OzM///LhQmBmrg+uDtZrgIBQQAKyf/YQbiBexFt9t9niSP7ji3umrW7og0UfWL0ZaOzGYjmBxn2c/BdPpJxuNz+vDwFxxnqmaqZs4V8L5AuS6haqmqXcDpO3jCEN4YmEFmkMcxrSCSGqSh8GcGIY52Am91Ce5HJ5EYBRwbG45222HmUHVS+DX2DhASuKu3oAeolGSqCiDs7gqHh8L2thCCxOQSbxFAO9BToClzNQSJokwmsLUFgwHs78P6eklnlQho6c0axUKbTVhZgUYjHpeqCwE8kMWFxYNF9uVlGA5hbw8ODuqJzajqnPENPtfYDxU2N4OtrRVzkVDfC0V8/h2G/g98AR7ESkJF8tFRcYdzkOf15iRW6y/hlD48voCz+BbmELFrvhrG8OMjPBOAV3D7qXM791R73Var00qSJRoNTyifB5Dn+eVsNv19dTX+mmWj93n+4SWciM1LKkmqy7RoImFBqPqPfH39K7t/4A18f74nAH8Bjm35s3ZkOjEAAAAASUVORK5CYII=">YouTube</A>
            <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">stuff</H3>
            <DL><p>
                <DT><A HREF="https://www.pgadmin.org/" ADD_DATE="1566393697" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAC4ElEQVQ4jU2TS2hdZRSFv7XPf26eNbV52jRp0tYYiVVDfUBBlEpHTuxIOhEzMHUi6EjswIEOBQV1YsFWyMCJCI6sGhBRJ1JDrVpMGnuJSXprTFqpjXmc8//bwb0R93zvvdb+9tL4qfeetBDOxFgMmeMOBiAhAHdHEu44gCC5kIXKQiz8VJA4CwwGKbowASbh7sSUGs2OEA44mHtyT3FY5ueCuw9sb27Gje3SKsEoo1PGqKY8o6UpR4AkortM8vrATMQYHfpDEcs40LPbHhrZq9mltdS5q8UO7u30mSvXdKlawx2KoqS1uclvb24Jl+fBMJOFzFJY3yjsscNDevXkExRlVB6yhlU4/cF5vvtlgVeeeZzDw336dfFPr9Zu6shIP1NfzvDZ97MKkrwsk2JKrG9u89qH0/TtaefFp4+ycnOd9186wehgD9dv/M3YUK+OH7kbgG9+rnoRnYAnudcPNz3zG1PTMwx072a47052tTUxOtjD5YU/mHjzY279s8WZl09wdGw/G1uFJAhIyOqHuv9AH2P7exkd6Karow1PTh0lxOjEmADIzGhQJngDG8DIvi59/fYkmRkA585foFq7wdhQL5++8Szg9Hd14O6Y1fEGAWVMxJT49qcFv7K8qkP9nVycrzG3tMrzb33C5FOP8PA9+7hUvc7iyl88eu8gRRkbChwPmSkz4/bGFqfPfk7XHW2UMSLBsfGDzF9b46OvfqS7o43XnzvukjS7tOZ5CIQQTEtrt/hhbtnnllfp6Wj31uZcABtbBSePPcD4oX7+V5r6YoYLc0tqb8nR+AvvFEWZzJMbUqoEM5CDK7nT1lzhvqFeDty1x4sy6vLvK35xvkYlD5jhenDy3WQiIdlOeMC1s66Mic3tkpgSIK+EjNbmHJyUIAtm2SKZBr0sI40k7igAkYeMpjyA5Dg4TkzJLcszeVo20ITQVRcSpP+MSvXsAcnrPxBTwpMnSUhW9eQT/wJc4GRalsmdmQAAAABJRU5ErkJggg==">pgAdmin - PostgreSQL Tools</A>
            </DL><p>
            <DT><A HREF="https://www.gnu.org/software/emacs/" ADD_DATE="1605696341" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs - GNU Project</A>
            <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">other stuff</H3>
            <DL><p>
                <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">emacs</H3>
                <DL><p>
                    <DT><A HREF="https://www.gnu.org/software/emacs/download.html" ADD_DATE="1605696357" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs download - GNU Project</A>
                    <DT><A HREF="https://www.gnu.org/software/emacs/documentation.html" ADD_DATE="1605696393" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs documentation - GNU Project</A>
                </DL><p>
            </DL><p>
            <DT><A HREF="https://orgmode.org/" ADD_DATE="1605696413" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAClElEQVQ4jZ1TzU8TcRSc/f36Ddt2aWkLKbi0FCSUCCSiRUTExCBGYyJKotGLHv03jJ41Xjx48aQXPRA1xujBGJRoJCIYAhQo1NKvbZfdbbct2/WkIQIhOqdJ3ps5vJlHsT9qAOgAqrsN6V6qjqGhcQeloeELp29bOJe0sbI6/y8Gnu5TA89bONvNoxWh3dQaOOkd6DMvTn6JApC2LzLbOHsu5D7R6HZ3NVD9hnJxPFh2e0BKKlS1jHq+CbGZH6vxleV7qiDFqtDLm8mNZQYAejxs5Erk0OP2poYgCjLKiTXMRYah9EUAWYYsiLCxNtjsLKpaFVJOhMPNYeHz1xgJ2s2t1473vuoPtwWh69jMZaGAAZn6iFJsDQylcHpcMFktIJTAaDGBUAoxm0d6PfnCcL63/c7hUIs9KxehCBlUJAnR5hAyfh7leALFLQ2s046ipCATi7/TCSNpqtaQTSZL0+9f3zX467j+LR1Q8wK2ZBESMUI62AlHOAxhaRkr07PfNtP5h7KYi868/fDy72sbNB1qQUijJArQ1BIS/laAUmiyAr1aRUGS30w+m3iwV9wklU7PavkMNFlG1N0IJRAC4/XBQBgYTCawdY4RH8/zexlQO0NFr4u9muI8SAfaYA53wcAAhBAYjEYQSus9LU3XWadzOb4QndthsCjKC2WrNWs6Mzpay/NgCANCKXRdh8lqgdlmRbVSsbJu7pLFap//ubj0fUcTo8nclNXnzXsDB0ac9S4YTEYwhKAgSqioKkpFFZYaG2rr7GOlMp4K6+uZ3ZqIjsGewc4jA5eNFqNDSmXnYrG1T0o8keK8PgfX6DrmavbfknJifuL+o26GYXZ9rv3ARsbOPmmL9Jz6H/Ef8Dzv/M1/AdxXB/z0rsGnAAAAAElFTkSuQmCC">Org mode for Emacs</A>
        </DL><p>
</DL><p>

I want to extract the following information from this file:

  1. URL
  2. Description
  3. Add date
  4. Folder hierarchy/path of the URL

With BeautifulSoup I have met the first three requirements, but I cannot seem to meet the fourth, so let me explain it in more detail.

Assume the following folder hierarchy:

Bookmarks 
\_Bookmarks bar
  \_Folder_1
    \_Subfolder
      \_Another Subfolder
  \_Folder_2
    \_stuff
    \_other stuff
      \_emacs

Ideally, for the URL in 'Another Subfolder' I would like output like the following:

https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Bookmarks bar/Folder_1/Subfolder/Another Subfolder

But this output would already be very helpful:

https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Another Subfolder

My current code is:

from bs4 import BeautifulSoup

def read_in_file(filename): 
    f = open(filename, 'r') 
    soup = BeautifulSoup(f.read(), 'html.parser')
    f.close()
    return soup

soup = read_in_file('bookmarks_8_2_21.html')
for line in soup.find_all('a'):
    print(line.get('href'))     # 1) URL:         works 
    print(line.get_text())      # 2) Description: works 
    print(line.get('add_date')) # 3) Add Date:    works

    dir = soup.find('h3') # 4) Folder hierarchy/path: not working
    print(dir.contents)   # only prints ['Bookmarks bar']

    print()

The output for that entry so far:

https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
['Bookmarks bar']

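I also experimented with siblings and found a way to print the folder hierarchy, but I could not get it to work together with the rest of the code:

Code snippet: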

for dir in soup.find_all('h3', recursive=True):
    print(dir.text)

Output:

Bookmarks bar
Folder_1
Subfolder
Another Subfolder
Folder_2
stuff
other stuff
emacs

Thanks for your help and suggestions!

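The problem probably lies in how your bookmark file was exported or in how BeautifulSoup reads it; more specifically, in how it reads the description term (<DT>) elements. These tags are never closed in your exported file, so the parser does not know where each one should end and closes it at a more or less arbitrary place.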
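To see this nesting concretely, here is a minimal sketch, assuming BeautifulSoup is used with Python's built-in html.parser backend (lxml and html5lib may build a different tree). The helper dt_nesting_depths and the two example snippets are made up for illustration; the helper counts how many <dt> elements enclose each <dt>.

```python
from bs4 import BeautifulSoup


def dt_nesting_depths(html, parser="html.parser"):
    """Return, for each <dt> in document order, how many <dt> ancestors it has."""
    soup = BeautifulSoup(html, parser)
    return [len(dt.find_parents("dt")) for dt in soup.find_all("dt")]


# Hypothetical two-bookmark snippets: one in the exported style (no </dt>),
# one with explicit closing tags.
UNCLOSED = ('<dl><dt><a href="https://one.example/">one</a>'
            '<dt><a href="https://two.example/">two</a></dl>')
CLOSED = ('<dl><dt><a href="https://one.example/">one</a></dt>'
          '<dt><a href="https://two.example/">two</a></dt></dl>')

print(dt_nesting_depths(UNCLOSED))  # the second <dt> ends up inside the first
print(dt_nesting_depths(CLOSED))    # both <dt> elements are siblings
```

With html.parser, the unclosed variant should report the second <dt> as a child of the first, which is why following .parent alone can lead to the wrong folder.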

So I closed each tag on the same line where it starts; after that you should be able to extract the data easily.

from bs4 import BeautifulSoup

# The 'lxml' parser requires the lxml package to be installed.
with open('bookmarks.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

folder_name = ''
for dt in soup.find_all('dt'):
    # The first element after each <dt> is either an <h3> (a folder
    # heading) or an <a> (a bookmark link).
    n = dt.find_next()
    if n.name == 'h3':
        # Remember the most recently seen folder name.
        folder_name = n.text
        continue
    print(f'url = {n.get("href")}')
    print(f'website name = {n.text}')
    print(f'add date = {n.get("add_date")}')
    print(f'folder name = {folder_name}')
    print()

Here is part of the output, which I hope helps:

url = https://whosebug.com/
website name = Stack Overflow - Where Developers Learn, Share, & Build Careers
add date = 1605695883
folder name = Folder_1

url = https://stackexchange.com/
website name = Hot Questions - Stack Exchange
add date = 1605695914
folder name = Folder_1

url = https://meta.stackexchange.com/
website name = Meta Stack Exchange
add date = 1605695986
folder name = Subfolder

url = https://en.wikipedia.org/wiki/Main_Page
website name = Wikipedia, the free encyclopedia
add date = 1605696025
folder name = Another Subfolder

url = https://www.wikipedia.org/
website name = Wikipedia
add date = 1605696017
folder name = Another Subfolder

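Here I assume that every link listed after a folder name belongs to that folder, but that is not always true, for the reasons given below. In the output above, for example, https://www.wikipedia.org/ is reported under Another Subfolder although it sits directly in Folder_1.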

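If you want more accurate results, you should also consider closing the p tags, because they are left open as well and the parser may close them anywhere.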

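The way forward is to find the dl tags and traverse each of them separately to work out which dt tag sits under which folder, i.e. under which dl element.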
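One way to sketch that traversal with BeautifulSoup is to start from each link and collect its enclosing dl elements. This is only an illustration, not the answerer's code: it relies on the observation that in the sample file every folder's <DL> directly follows that folder's <H3>, and it assumes the <DL> and </DL> tags are balanced. The names folder_path and extract_bookmarks are hypothetical.

```python
from bs4 import BeautifulSoup


def folder_path(link):
    """Return the names of the folders enclosing an <a> tag, outermost first."""
    names = []
    # find_parents yields the nearest enclosing <dl> first.
    for dl in link.find_parents("dl"):
        # Assumption: a folder's <dl> directly follows its <h3> label.
        heading = dl.find_previous_sibling("h3")
        if heading is not None:  # the outermost <dl> has no label
            names.append(heading.get_text())
    return names[::-1]


def extract_bookmarks(html, parser="html.parser"):
    """Return (url, description, add_date, folder_path) for every link."""
    soup = BeautifulSoup(html, parser)
    return [
        (a.get("href"), a.get_text(), a.get("add_date"), "/".join(folder_path(a)))
        for a in soup.find_all("a")
    ]
```

Passing the text of the exported file to extract_bookmarks should give one tuple per link, with the path in the form the question asks for, such as Bookmarks bar/Folder_1/Subfolder/Another Subfolder. Because only dl ancestors are inspected, the unclosed dt and p tags do not affect the result.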

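This is a fairly specific kind of problem, because not everyone organizes bookmarks the same way. You also have to keep in mind that the HTML differs depending on how the folders are organized. For example, whether links come first or subfolders come first changes the HTML file accordingly.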
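Because the order of links and subfolders varies, one order-independent option is to read the file as a stream and keep a stack of open folders: push on <DL>, pop on </DL>. The sketch below swaps BeautifulSoup for Python's standard-library html.parser.HTMLParser, so it is an alternative technique rather than the approach used in this answer; BookmarkParser and parse_bookmarks are hypothetical names.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (url, description, add_date, folder_path) from a bookmark export."""

    def __init__(self):
        super().__init__()
        self.folders = []     # labels of the currently open <dl> elements
        self.pending = None   # text of the latest <h3>, waiting for its <dl>
        self.capture = None   # 'h3' or 'a' while text is being collected
        self.text = []
        self.link = None
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "dl":
            self.folders.append(self.pending)  # None for the outermost <dl>
            self.pending = None
        elif tag in ("h3", "a"):
            self.capture = tag
            self.text = []
            if tag == "a":
                self.link = dict(attrs)

    def handle_data(self, data):
        if self.capture:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "dl":
            if self.folders:
                self.folders.pop()
        elif tag == self.capture:
            label = "".join(self.text)
            if tag == "h3":
                self.pending = label
            else:
                path = "/".join(name for name in self.folders if name)
                self.bookmarks.append(
                    (self.link.get("href"), label, self.link.get("add_date"), path)
                )
            self.capture = None


def parse_bookmarks(html):
    """Feed the whole document to the parser and return the collected rows."""
    parser = BookmarkParser()
    parser.feed(html)
    parser.close()
    return parser.bookmarks
```

Since the stack changes only on <DL> and </DL>, it makes no difference whether a folder lists its links before or after its subfolders, and the unclosed <DT> and <p> tags are simply ignored.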

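I struggled with this problem as well; I wanted the full set of folders, i.e. the parent folder and the subfolders.

I wrote a simple function that finds the parent directories when passed a link element.

For example: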

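import bs4

def find_parent_dir(l):
    """Return the names of the folders enclosing element l, outermost first."""
    if l is None:
        return None

    # A <dt> that holds an <h3> is a folder entry; record its name and
    # continue from the <dl> that encloses it.
    if l.h3 and l.name == "dt":
        current_folder = l.h3.getText()
        parents = find_parent_dir(l.find_parent("dl"))
        if parents is None:
            return [current_folder]
        else:
            return parents + [current_folder]

    # A folder's <dl> sits inside the folder's <dt>. For anything else, jump
    # straight to the enclosing <dl>: with html.parser the unclosed <dt>/<p>
    # tags nest, so single .parent steps can pass through an unrelated folder.
    if l.name == "dl":
        return find_parent_dir(l.parent)
    return find_parent_dir(l.find_parent("dl"))

with open("bookmarks_8_29_21.html") as fh:
    html_obj = bs4.BeautifulSoup(fh.read(), 'html.parser')

links = html_obj.find_all("a")

# Folder path of the first link in the file
folders_path = find_parent_dir(links[0])
print(folders_path)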