How do I parse a movie script in R for lines of dialogue that have consistent spacing?
'''
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
RIDER
Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
KAT
Leave it
He persists.
KAT (continuing)
I said, leave it!
RIDER
Hey -- sorry.
'''
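I'm scraping some movie scripts that I want to use for text analysis. I only want to extract the dialogue from each script, and the dialogue appears to be indented by a fixed amount.
So, for example, I want the line "Hey -- sorry.". I know the indentation is 20 spaces and that it is consistent throughout the script. How can I read only that line, along with every other line that has the same indentation?
I was thinking of using read.fwf to read it as fixed width.
What do you all think?
I'm scraping from URLs like this one:
https://imsdb.com/scripts/10-Things-I-Hate-About-You.html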
library(tidytext)
library(tidyverse)
text <- c("PADUA HIGH SCHOOL - DAY
Welcome to Padua High School,, your typical urban-suburban
high school in Portland, Oregon. Smarties, Skids, Preppies,
Granolas. Loners, Lovers, the In and the Out Crowd rub sleep
out of their eyes and head for the main building.
PADUA HIGH PARKING LOT - DAY
KAT STRATFORD, eighteen, pretty -- but trying hard not to be
-- in a baggy granny dress and glasses, balances a cup of
coffee and a backpack as she climbs out of her battered,
baby blue '75 Dodge Dart.
A stray SKATEBOARD clips her, causing her to stumble and
spill her coffee, as well as the contents of her backpack.
The young RIDER dashes over to help, trembling when he sees
who his board has hit.
RIDER
Hey -- sorry.
Cowering in fear, he attempts to scoop up her scattered
belongings.
KAT
Leave it
He persists.
KAT (continuing)
I said, leave it!
She grabs his skateboard and uses it to SHOVE him against a
car, skateboard tip to his throat. He whimpers pitifully
and she lets him go. A path clears for her as she marches
through a pack of fearful students and SLAMS open the door,
entering school.
INT. GIRLS' ROOM - DAY
BIANCA STRATFORD, a beautiful sophomore, stands facing the
mirror, applying lipstick. Her less extraordinary, but
still cute friend, CHASTITY stands next to her.
BIANCA
Did you change your hair?
CHASTITY
No.
BIANCA
You might wanna think about it
Leave the girls' room and enter the hallway.
HALLWAY - DAY- CONTINUOUS
Bianca is immediately greeted by an admiring crowd, both
boys
and girls alike.
BOY
(adoring)
Hey, Bianca.
GIRL
Awesome shoes.
The greetings continue as Chastity remains wordless and
unaddressed by her side. Bianca smiles proudly,
acknowledging her fans.
GUIDANCE COUNSELOR'S OFFICE - DAY
CAMERON JAMES, a clean-cut, easy-going senior with an open,
farm-boy face, sits facing Miss Perky, an impossibly cheery
guidance counselor.")
names_stopwords <- c("^(rider|kat|chastity|bianca|boy|girl)")
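text %>%
  as_tibble() %>%
  # one row per script line (unnest_tokens also lowercases the text)
  unnest_tokens(text, value, token = "lines") %>%
  # keep lines that contain a run of at least 15 whitespace characters
  filter(str_detect(text, "\\s{15,}")) %>%
  mutate(text = str_trim(text)) %>%
  # drop the speaker cues
  filter(!str_detect(text, names_stopwords))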
Output:
# A tibble: 9 x 1
text
<chr>
1 hey -- sorry.
2 leave it
3 i said, leave it!
4 did you change your hair?
5 no.
6 you might wanna think about it
7 (adoring)
8 hey, bianca.
9 awesome shoes.
You can include more character names in the names_stopwords pattern.
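If typing every character name by hand becomes tedious, the pattern could be derived from the script itself. The following is only a minimal base R sketch, not part of the answer above. It assumes that speaker cues are all-caps lines, optionally followed by a parenthetical such as "(continuing)". The `script` vector and its indentation widths are made up for illustration and are not measured from the real page.

```r
# Hypothetical sample; the 30/20 space indents are assumptions for illustration.
script <- c(
  "A stray SKATEBOARD clips her, causing her to stumble and",
  paste0(strrep(" ", 30), "RIDER"),
  paste0(strrep(" ", 20), "Hey -- sorry."),
  "He persists.",
  paste0(strrep(" ", 30), "KAT (continuing)"),
  paste0(strrep(" ", 20), "I said, leave it!")
)

# Keep only deeply indented lines (speaker cues and dialogue), then trim them.
indented <- trimws(grep("^\\s{15,}", script, value = TRUE))

# Assumption: a cue is an all-caps name, optionally followed by a parenthetical.
is_cue <- grepl("^[A-Z][A-Z .'-]*(\\s*\\([^)]*\\))?$", indented, perl = TRUE)

# Build the stop-word pattern from the cues instead of listing names by hand.
cue_names <- unique(tolower(sub("\\s*\\(.*$", "", indented[is_cue])))
names_stopwords <- paste0("^(", paste(cue_names, collapse = "|"), ")")

# Everything that is indented but not a cue is treated as dialogue.
dialogue <- indented[!is_cue]
```

One caveat with this heuristic: a line of dialogue written entirely in capitals (for example a shouted "NO.") would be mistaken for a cue, so the result is worth spot-checking.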
You can try the following:
library(magrittr)  # provides %>%
url <- 'https://imsdb.com/scripts/10-Things-I-Hate-About-You.html'
url %>%
  # Read the webpage line by line
  readLines() %>%
  # Remove '<b>' and '</b>' tags
  gsub('<b>|</b>', '', .) %>%
  # Keep only lines that begin with at least 20 whitespace characters
  grep('^\\s{20,}', ., value = TRUE) %>%
  # Trim leading and trailing whitespace
  trimws() %>%
  # Drop lines that are entirely upper case (speaker cues)
  grep('^([A-Z]+\\s?)+$', ., value = TRUE, invert = TRUE)
#[1] "Hey -- sorry." "Leave it" "KAT (continuing)"
#[4] "I said, leave it!" "Did you change your hair?" "No."
#...
#...
I have tried to clean it up as much as possible, but it may need more cleaning depending on what you actually want to extract.
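As one example of such further cleaning, cues like "KAT (continuing)" and stand-alone parentheticals like "(adoring)" still come through the filter. The sketch below shows one way they might be handled while also recording who speaks each line. It is a sketch under stated assumptions, not a definitive implementation: the `lines` vector is a hypothetical stand-in for the trimmed lines produced by the indentation filter, and the cue pattern assumes that speaker names are written in capitals.

```r
# Hypothetical input: trimmed lines that survived the indentation filter.
lines <- c("RIDER", "Hey -- sorry.", "KAT (continuing)", "I said, leave it!",
           "BOY", "(adoring)", "Hey, Bianca.")

# Assumption: a cue is an all-caps name, optionally followed by a parenthetical.
is_cue <- grepl("^[A-Z][A-Z .'-]*(\\s*\\([^)]*\\))?$", lines, perl = TRUE)
# A stand-alone parenthetical such as "(adoring)" is a direction, not speech.
is_paren <- grepl("^\\(.*\\)$", lines)

# Carry the most recent cue forward so every line knows its speaker.
# The leading NA covers any lines that appear before the first cue.
cue_names <- sub("\\s*\\(.*$", "", lines[is_cue])
speaker <- c(NA_character_, cue_names)[cumsum(is_cue) + 1]

# Keep only spoken lines, paired with their speaker.
keep <- !is_cue & !is_paren
dialogue <- data.frame(speaker = speaker[keep], line = lines[keep],
                       stringsAsFactors = FALSE)
```

Keeping the speaker alongside each line makes it possible to group or count dialogue per character later, which a plain character vector of lines does not allow.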