如何将 headers 和 Html <script> 中的值保存为 csv 文件中的 table？

Question

我是编写代码的新手。使用slenium和beautifulsoup，我设法在网页上的几十个脚本中找到了我想要的脚本。我正在寻找脚本 [17]。当执行这些代码时，脚本[17]给出的结果如下。

我代码的最后一部分


html=driver.page_source
soup=BeautifulSoup(html, "html.parser")
scripts=soup.find_all("script")
x=scripts[17]
print(x)

结果，输出注意：日期列表在脚本 [17] 之前。滑动条。虚拟数据

虚拟数据

<script language="JavaScript"> var theHlp='/yardim/matris.asp';var theTitle = 'Piyasa Değeri';var theCaption='Cam (TL)';var lastmod = '';var h='<a class=hisselink href=../Hisse/HisseAnaliz.aspx?HNO=';var e='<a class=hisselink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('Hisse',4,50);theCols[1] = new Array('2012.12',1,60);theCols[2] = new Array('2013.03',1,60);theCols[3] = new Array('2013.06',1,60);theCols[4] = new Array('2013.09',1,60);theCols[5] = new Array('2013.12',1,60);theCols[6] = new Array('2014.03',1,60);theCols[7] = new Array('2014.06',1,60);theCols[8] = new Array('2014.09',1,60);theCols[9] = new Array('2014.12',1,60);theCols[10] = new Array('2015.03',1,60);theCols[11] = new Array('2015.06',1,60);theCols[12] = new Array('2015.09',1,60);theCols[13] = new Array('2015.12',1,60);theCols[14] = new Array('2016.03',1,60);theCols[15] = new Array('2016.06',1,60);theCols[16] = new Array('2016.09',1,60);theCols[17] = new Array('2016.12',1,60);theCols[18] = new Array('2017.03',1,60);theCols[19] = new Array('2017.06',1,60);theCols[20] = new Array('2017.09',1,60);theCols[21] = new Array('2017.12',1,60);theCols[22] = new Array('2018.03',1,60);theCols[23] = new Array('2018.06',1,60);theCols[24] = new Array('2018.09',1,60);theCols[25] = new Array('2018.12',1,60);theCols[26] = new Array('2019.03',1,60);theCols[27] = new Array('2019.06',1,60);theCols[28] = new Array('2019.09',1,60);theCols[29] = new Array('2019.12',1,60);theCols[30] = new Array('2020.03',1,60);var theRows = new Array();
theRows[0] = new Array ('<b>'+h+'30>ANA</B></a>','1,114,919,783.60','1,142,792,778.19','1,091,028,645.38','991,850,000.48','796,800,000.38','697,200,000.34','751,150,000.36','723,720,000.33','888,000,000.40','790,320,000.36','883,560,000.40','927,960,000.42','737,040,000.33','879,120,000.40','914,640,000.41','927,960,000.42','1,172,160,000.53','1,416,360,000.64','1,589,520,000.72','1,552,500,000.41','1,972,500,000.53','2,520,000,000.67','2,160,000,000.58','2,475,000,000.66','2,010,000,000.54','2,250,000,000.60','2,077,500,000.55','2,332,500,000.62','3,270,000,000.87','2,347,500,000.63');
theRows[1] = new Array ('<b>'+h+'89>DEN</B></a>','55,200,000.00','55,920,000.00','45,960,000.00','42,600,000.00','35,760,000.00','39,600,000.00','40,200,000.00','47,700,000.00','50,460,000.00','45,300,000.00','41,760,000.00','59,340,000.00','66,600,000.00','97,020,000.00','81,060,000.00','69,300,000.00','79,800,000.00','68,400,000.00','66,900,000.00','66,960,000.00','71,220,000.00','71,520,000.00','71,880,000.00','60,600,000.00','69,120,000.00','62,640,000.00','57,180,000.00','89,850,000.00','125,100,000.00','85,350,000.00');
theRows[2] = new Array ('<b>'+h+'269>SIS</B></a>','4,425,000,000.00','4,695,000,000.00','4,050,000,000.00','4,367,380,000.00','4,273,120,000.00','3,644,720,000.00','4,681,580,000.00','4,913,000,000.00','6,188,000,000.00','5,457,000,000.00','6,137,000,000.00','5,453,000,000.00','6,061,000,000.00','6,954,000,000.00','6,745,000,000.00','6,519,000,000.00','7,851,500,000.00','8,548,500,000.00','9,430,000,000.00','9,225,000,000.00','10,575,000,000.00','11,610,000,000.00','9,517,500,000.00','13,140,000,000.00','12,757,500,000.00','13,117,500,000.00','11,677,500,000.00','10,507,500,000.00','11,857,500,000.00','9,315,000,000.00');
theRows[3] = new Array ('<b>'+h+'297>TRK</B></a>','1,692,579,200.00','1,983,924,800.00','1,831,315,200.00','1,704,000,000.00','1,803,400,000.00','1,498,100,000.00','1,803,400,000.00','1,884,450,000.00','2,542,160,000.00','2,180,050,000.00','2,069,200,000.00','1,682,600,000.00','1,619,950,000.00','1,852,650,000.00','2,040,600,000.00','2,315,700,000.00','2,641,200,000.00','2,938,800,000.00','3,599,100,000.00','4,101,900,000.00','5,220,600,000.00','5,808,200,000.00','4,689,500,000.00','5,375,000,000.00','3,787,500,000.00','4,150,000,000.00','3,662,500,000.00','3,712,500,000.00','4,375,000,000.00','3,587,500,000.00');
var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script>

我的目的是将此输出提取到 table 中并将其保存为 csv 文件。我怎样才能根据需要提取此脚本？

所有日期都应该在最上面，所有名字都应该在最右边，所有值都应该在两者之间。

Hisse      2012.12             2013.3             2013.4 ...
ANA      1,114,919,783.60   1,142,792,778.19     1,091,028,645.38  ...
DEN      55,200,000.00      55,920,000.00        45,960,000.00  ....
.
.
.

Answer 1

根据您的问题中提供的 <script>，您可以执行类似下面的代码来获得每个名称的日期列表 ANA, DEN ..:

for _ in range(1, len(aaa.split("<b>'"))-1):
  s = aaa.split("<b>'")[_].split("'")
  print(_)
  lst = []
  for i in s:
    if "</B>" in i:
      name = i.split('>')[1].split("<")[0]
      print("{} = ".format(name), end="")
    if any(j.isdigit() for j in i) and ',' in i:
      lst.append(i)
  print(lst)

这只是一个示例代码，所以它不是那么漂亮:)

希望对您有所帮助。

Answer 2

解决方案

自定义函数 process_scripts() 将生成您要查找的内容。我正在使用下面给出的虚拟数据（最后）。首先我们检查代码是否按预期执行，因此我们创建一个 pandas 数据框来查看输出。

You could also open this Colab Jupyter Notebook and run it on Cloud for free. This will allow you to not worry about any installation or setup and simply focus on examining the solution itself.

1。处理单个脚本

## Define CSV file-output folder-path
OUTPUT_PATH = './output'
## Process scripts
dfs = process_scripts(scripts = [s], 
                        output_path = OUTPUT_PATH, 
                        save_to_csv = False, 
                        verbose = 0)

print(dfs[0].reset_index(drop=True))

输出：

  Name           2012.12  ...            2019.12           2020.03
0  ANA  1,114,919,783.60  ...   3,270,000,000.87  2,347,500,000.63
1  DEN     55,200,000.00  ...     125,100,000.00     85,350,000.00
2  SIS  4,425,000,000.00  ...  11,857,500,000.00  9,315,000,000.00
3  TRK  1,692,579,200.00  ...   4,375,000,000.00  3,587,500,000.00

[4 rows x 31 columns]

2。处理所有脚本

您可以使用自定义函数 process_scripts() 处理所有脚本。代码如下。

## Define CSV file-output folder-path
OUTPUT_PATH = './output'
## Process scripts
dfs = process_scripts(scripts, 
                        output_path = OUTPUT_PATH, 
                        save_to_csv = True, 
                        verbose = 0)

## To clear the output dir-contents
#!rm -f $OUTPUT_PATH/*

我在 Google Colab 上做了这个，它按预期工作。

3。以 OS 不可知的方式创建路径

为 windows 或基于 unix 的系统创建路径可能非常不同。下面向您展示了一种实现该目标的方法，而不必担心 OS 您将运行代码。我也用过os library here. However, I would suggest you to look at the Pathlib library。

# Define relative path for output-folder
OUTPUT_PATH = './output'

# Dynamically define absolute path
pwd = os.getcwd() # present-working-directory
OUTPUT_PATH = os.path.join(pwd, os.path.abspath(OUTPUT_PATH))

4。代码：自定义函数`process_scripts()`

这里我们使用 regex（正则表达式）库，以及 pandas 以表格格式组织数据，然后写入 csv 文件。 tqdm 库用于在处理多个脚本时为您提供漂亮的 progressbar。请查看代码中的注释，了解如果您运行不是来自 jupyter notebook 时该怎么做。 os 库用于路径操作和创建输出目录。

#pip install -U pandas
#pip install tqdm
import pandas as pd
import re # regex
import os
from tqdm.notebook import tqdm
# Use the following line if not using a jupyter notebook
# from tqdm import tqdm

def process_scripts(scripts, 
                    output_path='./output', 
                    save_to_csv: bool=False, 
                    verbose: int=0):
    """Process all scripts and return a list of dataframes and 
    optionally write each dataframe to a CSV file.

    Parameters
    ----------
        scripts: list of scripts
        output_path (str): output-folder-path for csv files
        save_to_csv (bool):  default is False
        verbose (int): prints output for verbose>0

    Example
    -------
    OUTPUT_PATH = './output'
    dfs = process_scripts(scripts, 
                          output_path = OUTPUT_PATH, 
                          save_to_csv = True, 
                          verbose = 0)

    ## To clear the output dir-contents
    #!rm -f $OUTPUT_PATH/*
    """
    ## Define regex patterns and compile for speed
    pat_header = re.compile(r"theCols\[\d+\] = new Array\s*\([\'](\d{4}\.\d{1,2})[\'],\d+,\d+\)")
    pat_line = re.compile(r"theRows\[\d+\] = new Array\s*\((.*)\).*")
    pat_code = re.compile("([A-Z]{3})")

    # Calculate zfill-digits
    zfill_digits = len(str(len(scripts)))
    print(f'Total scripts: {len(scripts)}')

    # Create output_path
    if not os.path.exists(output_path):
        os.makedirs(output_path)

    # Define a list of dataframes:  
    #   An accumulator of all scripts
    dfs = []

    ## If you do not have tqdm installed, uncomment the 
    #  next line and comment out the following line.
    #for script_num, script in enumerate(scripts):
    for script_num, script in enumerate(tqdm(scripts, desc='Scripts Processed')):
        ## Extract: Headers, Rows
        #    Rows : code (Name: single column), line-data (multi-column)
        headers = script.strip().split('\n', 0)[0]
        headers = ['Name'] + re.findall(pat_header, headers)
        lines = re.findall(pat_line, script)
        codes = [re.findall(pat_code, line)[0] for line in lines]
        # Clean data for each row
        lines_data = dict()
        for line, code in zip(lines, codes):
            line_data = line.replace("','", "|").split('|')
            line_data[-1] = line_data[-1].replace("'", "")
            line_data[0] = code
            lines_data.update({code: line_data.copy()})
        if verbose>0:
            print('{}: {}'.format(script_num, codes))
        ## Load data into a pandas-dataframe
        #  and write to csv.
        df = pd.DataFrame(lines_data).T
        df.columns = headers
        dfs.append(df.copy()) # update list
        # Write to CSV
        if save_to_csv:
            num_label = str(script_num).zfill(zfill_digits)
            script_file_csv = f'Script_{num_label}.csv' 
            script_path = os.path.join(output_path, script_file_csv)   
            df.to_csv(script_path, index=False)
    return dfs

5。虚拟数据

## Dummy Data
s = """
<script language="JavaScript"> var theHlp='/yardim/matris.asp';var theTitle = 'Piyasa Değeri';var theCaption='Cam (TL)';var lastmod = '';var h='<a class=hisselink href=../Hisse/HisseAnaliz.aspx?HNO=';var e='<a class=hisselink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('Hisse',4,50);theCols[1] = new Array('2012.12',1,60);theCols[2] = new Array('2013.03',1,60);theCols[3] = new Array('2013.06',1,60);theCols[4] = new Array('2013.09',1,60);theCols[5] = new Array('2013.12',1,60);theCols[6] = new Array('2014.03',1,60);theCols[7] = new Array('2014.06',1,60);theCols[8] = new Array('2014.09',1,60);theCols[9] = new Array('2014.12',1,60);theCols[10] = new Array('2015.03',1,60);theCols[11] = new Array('2015.06',1,60);theCols[12] = new Array('2015.09',1,60);theCols[13] = new Array('2015.12',1,60);theCols[14] = new Array('2016.03',1,60);theCols[15] = new Array('2016.06',1,60);theCols[16] = new Array('2016.09',1,60);theCols[17] = new Array('2016.12',1,60);theCols[18] = new Array('2017.03',1,60);theCols[19] = new Array('2017.06',1,60);theCols[20] = new Array('2017.09',1,60);theCols[21] = new Array('2017.12',1,60);theCols[22] = new Array('2018.03',1,60);theCols[23] = new Array('2018.06',1,60);theCols[24] = new Array('2018.09',1,60);theCols[25] = new Array('2018.12',1,60);theCols[26] = new Array('2019.03',1,60);theCols[27] = new Array('2019.06',1,60);theCols[28] = new Array('2019.09',1,60);theCols[29] = new Array('2019.12',1,60);theCols[30] = new Array('2020.03',1,60);var theRows = new Array();
theRows[0] = new Array ('<b>'+h+'30>ANA</B></a>','1,114,919,783.60','1,142,792,778.19','1,091,028,645.38','991,850,000.48','796,800,000.38','697,200,000.34','751,150,000.36','723,720,000.33','888,000,000.40','790,320,000.36','883,560,000.40','927,960,000.42','737,040,000.33','879,120,000.40','914,640,000.41','927,960,000.42','1,172,160,000.53','1,416,360,000.64','1,589,520,000.72','1,552,500,000.41','1,972,500,000.53','2,520,000,000.67','2,160,000,000.58','2,475,000,000.66','2,010,000,000.54','2,250,000,000.60','2,077,500,000.55','2,332,500,000.62','3,270,000,000.87','2,347,500,000.63');
theRows[1] = new Array ('<b>'+h+'89>DEN</B></a>','55,200,000.00','55,920,000.00','45,960,000.00','42,600,000.00','35,760,000.00','39,600,000.00','40,200,000.00','47,700,000.00','50,460,000.00','45,300,000.00','41,760,000.00','59,340,000.00','66,600,000.00','97,020,000.00','81,060,000.00','69,300,000.00','79,800,000.00','68,400,000.00','66,900,000.00','66,960,000.00','71,220,000.00','71,520,000.00','71,880,000.00','60,600,000.00','69,120,000.00','62,640,000.00','57,180,000.00','89,850,000.00','125,100,000.00','85,350,000.00');
theRows[2] = new Array ('<b>'+h+'269>SIS</B></a>','4,425,000,000.00','4,695,000,000.00','4,050,000,000.00','4,367,380,000.00','4,273,120,000.00','3,644,720,000.00','4,681,580,000.00','4,913,000,000.00','6,188,000,000.00','5,457,000,000.00','6,137,000,000.00','5,453,000,000.00','6,061,000,000.00','6,954,000,000.00','6,745,000,000.00','6,519,000,000.00','7,851,500,000.00','8,548,500,000.00','9,430,000,000.00','9,225,000,000.00','10,575,000,000.00','11,610,000,000.00','9,517,500,000.00','13,140,000,000.00','12,757,500,000.00','13,117,500,000.00','11,677,500,000.00','10,507,500,000.00','11,857,500,000.00','9,315,000,000.00');
theRows[3] = new Array ('<b>'+h+'297>TRK</B></a>','1,692,579,200.00','1,983,924,800.00','1,831,315,200.00','1,704,000,000.00','1,803,400,000.00','1,498,100,000.00','1,803,400,000.00','1,884,450,000.00','2,542,160,000.00','2,180,050,000.00','2,069,200,000.00','1,682,600,000.00','1,619,950,000.00','1,852,650,000.00','2,040,600,000.00','2,315,700,000.00','2,641,200,000.00','2,938,800,000.00','3,599,100,000.00','4,101,900,000.00','5,220,600,000.00','5,808,200,000.00','4,689,500,000.00','5,375,000,000.00','3,787,500,000.00','4,150,000,000.00','3,662,500,000.00','3,712,500,000.00','4,375,000,000.00','3,587,500,000.00');
var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script>
"""

## Make a dummy list of scripts
scripts = [s for _ in range(10)]

如何将 headers 和 Html <script> 中的值保存为 csv 文件中的 table？

How can I save the headers and values in Html <script> as a table in the csv file?

python

regex

text-processing

beautifulsoup

解决方案

1。处理单个脚本

2。处理所有脚本

3。以 OS 不可知的方式创建路径

4。代码：自定义函数`process_scripts()`

5。虚拟数据

如何将 headers 和 Html <script> 中的值保存为 csv 文件中的 table？

How can I save the headers and values in Html <script> as a table in the csv file?

python

regex

text-processing

beautifulsoup

解决方案

1。处理单个脚本

2。处理所有脚本

3。以 OS 不可知的方式创建路径

4。代码：自定义函数process_scripts()

5。虚拟数据

4。代码：自定义函数`process_scripts()`