How to Construct a re.findall Regex in Python to Capture YouTube Timestamps

Question

Script

from __future__ import unicode_literals
import youtube_dl
import pandas as pd
import csv
import re

# Initialize YouTube-DL Array
ydl_opts = {}

# read the csv file
number_of_rows = pd.read_csv('single.csv')

# Scrape Online Product
def run_scraper():
    
    # Read CSV to List
    with open("single.csv", "r") as f:
        csv_reader = csv.reader(f)
        next(csv_reader)

        # Scrape Data From Store
        for csv_line_entry in csv_reader:
                        
            with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                meta = ydl.extract_info(csv_line_entry[0], download=False)
                description = meta['description']
                #print('Description    :', description)

                # Function to Capture Timestamp Descriptions
                get_links(description)
                

def get_links(description):

  # Format: Timestamp + Text
  description_text = re.findall(r'(\d{2}:\d{2}?.*)', description)
  print(description_text)
  print()

  # Format: Text + Timestamp
  description_text1 = re.findall(r'(.*\d{2}:\d{2}?)', description)
  print(description_text1)

run_scraper()

CSV file

Videos, Format
https://www.youtube.com/watch?v=kqtD5dpn9C8, Format: Timestamp + Text
https://www.youtube.com/watch?v=pJ3IPRqiD2M, Format: Text + Timestamp
https://www.youtube.com/watch?v=rfscVS0vtbw, No Regex in code
https://www.youtube.com/watch?v=t8pPdKYpowI, No Regex in code

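My script reads YouTube URLs from a CSV file and fetches each video's general description: introductions, links, timestamps, and so on.

I only want to capture the timestamp entries from the YouTube description (highlighted in a screenshot in the original post).

I know YouTube timestamp formats are inconsistent, so I included a few examples in the CSV file.

In my function get_links, I have partially extracted the timestamps for 2 of the 4 URLs listed in the CSV: the ones in Timestamp + Text and Text + Timestamp format.

I need a way to display only the text (description) part of each timestamp entry, regardless of which format is used, for all 4 URLs in the CSV.

Any help would be appreciated.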

Answer 1

Try:

import youtube_dl
import pandas as pd
import csv
import re

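# youtube-dl options (an empty dict uses the defaults)
ydl_opts = {}

# Finds a timestamp such as 1:23 or 01:23 anywhere in a line
r_pat = re.compile(r"\d+:\d+")
# Matches the timestamp (mm:ss or h:mm:ss) plus any surrounding non-letter characters
r_pat2 = re.compile(r"[^A-Za-z]*\d+:\d+:?\d*?[^A-Za-z]*")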

# Scrape Online Product
def run_scraper():

    # Read CSV to List
    with open("single.csv", "r") as f:
        csv_reader = csv.reader(f)
        next(csv_reader)

        # Scrape Data From Store
        for csv_line_entry in csv_reader:
            with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                meta = ydl.extract_info(csv_line_entry[0], download=False)
                description = meta["description"]
                out = get_links(description)
                print(*out, sep="\n")
                print("-" * 80)


def get_links(description):
    rv = []
    for line in description.splitlines():
        if r_pat.search(line):
            rv.append(r_pat2.sub("", line))
    return rv


run_scraper()

Prints:

[youtube] kqtD5dpn9C8: Downloading webpage
Introduction 
What You Can Do With Python 
Your First Python Program 
Variables
Receiving Input
Type Conversion
Strings
Arithmetic Operators 
Operator Precedence 
Comparison Operators 
Logical Operators
If Statements
Exercise
While Loops
Lists
List Methods
For Loops
The range() Function 
Tuples
--------------------------------------------------------------------------------
[youtube] pJ3IPRqiD2M: Downloading webpage
Python Course
What is Python
Why choose Python
Features of Python
Applications of Python
Salary Trends
Quiz
Installing Python
Python Variable
Python Tokens


...and so on.
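
To see what the two patterns do without calling youtube-dl, here is a minimal offline sketch of the same filter-then-strip idea. The sample description is made up for illustration and is not taken from any of the videos above. The added .strip() is my own small tweak to drop leftover whitespace; it is not part of the code above.

```python
import re

# Same two patterns as in the answer above.
r_pat = re.compile(r"\d+:\d+")
r_pat2 = re.compile(r"[^A-Za-z]*\d+:\d+:?\d*?[^A-Za-z]*")

# Hypothetical description text, invented for this example.
SAMPLE_DESCRIPTION = """\
Learn Python in this full course for beginners.

00:00 Introduction
01:30 - What You Can Do With Python
Installing Python (05:45)
Python Variable - 1:02:10
Subscribe for more videos!
"""


def get_links(description):
    """Return the text part of every line that contains a timestamp."""
    rv = []
    for line in description.splitlines():
        # Keep only lines that contain something like mm:ss.
        if r_pat.search(line):
            # Remove the timestamp and the non-letter characters around it,
            # then strip any leftover whitespace (an addition in this sketch).
            rv.append(r_pat2.sub("", line).strip())
    return rv


titles = get_links(SAMPLE_DESCRIPTION)
print(*titles, sep="\n")
```

Because r_pat2 removes every non-letter character next to the timestamp, a title that begins with a digit or a non-ASCII letter (for example "10:00 3 Ways to Loop") would lose that leading part. If your descriptions contain such titles, the character class would need to be adjusted.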
