Need to format table from BigQuery for specific words in the body of text

I'm using Google BigQuery to query the reddit comments database. I'll start with the query I'm working on:

SELECT
  DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
  subreddit,
  author AS comment_author,
  ups AS upvotes,
  LOWER(body)
FROM
  [fh-bigquery:reddit_comments.2015_01]
WHERE
  body CONTAINS 'acid'
  OR body CONTAINS 'ecstasy'
  OR body CONTAINS 'fire'
  OR body CONTAINS 'heroin'
LIMIT 10;

I have a list of about 30 drug-related words that I need to pull from the reddit database (for brevity, I've limited it to a few here).

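I'm having trouble with two things:

  1. I want to query the database correctly, but many of the returned results don't meet the criteria, i.e. they don't contain any of the words I'm matching on.
  2. I want to create a column that shows the specific word that matched, so if a comment matches the word 'drug', that word would appear in a 'word_matched' column alongside the body, author, date, etc.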

I've also tried using a regular expression to match the words, but that doesn't seem to help either:

  WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))

Any help would be greatly appreciated. Thanks, everyone!

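I'd suggest debugging this with REGEXP_EXTRACT. I tried running your query, and it kept finding things like "meth" inside "something", which is probably what you're seeing. You may want to check for word boundaries around the matches, because some of the words you're searching for are contained in quite a few ordinary, non-drug-related words.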

Something like the following should help with debugging:

SELECT
  DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
  subreddit,
  author AS comment_author,
  ups AS upvotes,
  REGEXP_EXTRACT(body, '(drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)') AS match,
  LOWER(body)
FROM
  [fh-bigquery:reddit_comments.2015_01]
WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))
LIMIT 10;
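
The substring behaviour described above is easy to reproduce outside BigQuery. The following is a minimal Python sketch, not BigQuery code: it uses Python's `re` module as a rough stand-in for REGEXP_EXTRACT, and the shortened word list, the helper name `first_match`, and the sample sentences are all made up for illustration.

```python
import re

# Shortened, illustrative word list (the real query uses about 30 words).
WORDS = ["drug", "acid", "meth", "heroin"]

# Same shape as the pattern in the query: a bare alternation.
SUBSTRING = re.compile("(" + "|".join(WORDS) + ")")
# The same alternation wrapped in \b word boundaries.
WHOLE_WORD = re.compile(r"\b(" + "|".join(WORDS) + r")\b")


def first_match(pattern, body):
    """Return the first captured word, or None if nothing matches."""
    found = pattern.search(body.lower())
    return found.group(1) if found else None


false_positive = "There is something odd about this comment"
true_positive = "Meth is mentioned here as a whole word"

print(first_match(SUBSTRING, false_positive))   # meth (inside "something")
print(first_match(WHOLE_WORD, false_positive))  # None
print(first_match(WHOLE_WORD, true_positive))   # meth
```

Without boundaries the pattern reports "meth" for a comment that never mentions it; with `\b` on both sides only the genuine whole-word occurrence is returned.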

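The query below addresses both points in the question:
1. Output only rows where a word matches as a whole word, not as part of another, different word. This is easy to achieve with the REGEXP_MATCH function.
2. Have a column containing all the matched words. (I think having all the matched words makes more sense than having just one, as in the question.)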

SELECT
    [date],
    subreddit,
    comment_author,
    upvotes,
    GROUP_CONCAT(word) AS matches, 
    body
FROM (
  SELECT 
    [date],
    subreddit,
    comment_author,
    upvotes,
    body,
    word
  FROM (
    SELECT
      DATE(SEC_TO_TIMESTAMP(created_utc)) AS [date],
      subreddit,
      author AS comment_author,
      ups AS upvotes,
      LOWER(body) AS body
    FROM
      [fh-bigquery:reddit_comments.2015_01]
    WHERE REGEXP_MATCH(body, r'\b(drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)\b')
  ) x 
  CROSS JOIN (
    SELECT SPLIT(list,'|') AS word FROM 
    (SELECT 'drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers' AS list)
  ) y
  HAVING body CONTAINS word
)
GROUP BY [date], subreddit, comment_author, upvotes, body
LIMIT 1000

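The solution above provides the list of matched words on a best-effort basis, so please note:
If the matches column contains a single word, that word is definitely an exact match.
However, if the column contains several words, one of them is still an exact match, but the others may not be.
I think that for long bodies it is still valuable to have them, at least as a hint about what to look for. For example: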

drug,meth,heroin,alcohol,benzos it also inhibits the reuptake of serotonin and norepinephrine which gives a hell of a lot worse withdrawal symptoms than most other drugs(incl. heroin, meth, coke and etc.). from what i have heard the only things that rival tramadol it terms of withdrawal are benzos and alcohol.
liquor,beer,alcohol,booze       1. reinforce #3 - it is not cheap to live here. not by any stretch. expect to pay more than the rest of the country pays for everything. even franchises that operate nation-wide have special wa/perth pricing. 2. petrol has literally just dropped to  this past month, i wouldn't go as far as quoting that as our average price just yet. average is still between .20-1.30. 3. parking is free at beaches & parks, do not expect to get free parking anywhere in the city though. if you're using public parking in the city all day, expect to pay  unless you get in early. 4. forget bribing the cops, don't even call them "mate". last time i was pulled over (last week, random stop) i said "evening mate" as i was handing him my license and was responded with "don't call me mate, i'm not your friend, i don't know you". 5. unlike the rest of the world, regular stores do not sell alcohol here. liquor stores only, don't expect to buy beer from a gas station or grocery store. 6. rent is expensive, food is expensive, booze is expensive, being alive is expensive.
drug,meth,heroin,beer           that's simply not true. first there's a difference between legalization and decriminalization. second, some european countries have places to go to safely use drugs. there is middle ground between allowing heroin to be sold all over town and having users go to prison. heroin, meth and some other drugs are not good things for society and their use should encouraged by making it as easy to buy as a 6 pack of beer. i'm not really sure why you can't see a middle ground because it's clearly not as black and white as you say. you can go after the dealers while leaving the users alone.
drug,fire,joint,smoke           not a story about a rave, but still relevant i think: i was working a job called "fire watch," which is just what it sounds like, at a nine inch nails concert a few years ago. our comrades, the security workers, were far from seasoned professionals. they were mostly college temps with a yellow security tee shirt and a flashlight; they didn't even have radios. the job is basically to make sure people don't go into restricted areas. ...but this one boy scout took it upon himself to tame the metal masses. mid-concert, he pulled me close and shouted "they're smoking pot!" i shrugged, and shot him an "and?" look. i guess he thought i should care because technically a joint is a tiny dangerous drug fire, and i was on the fire crew. he then proceeded to disappear into the crowd, shoving people out of the way on his heroic journey toward the countless smoke puff origins. the next time i saw him he was bleeding out of his face and getting a flashlight in the eyes from an onsite emt. i guess it's pretty harsh to say that he deserved the beating, but it's hard to argue that he didn't go asking for it. i guess the moral of my story is that security people are just people, and some people's shittyness is inflamed when combined with authority. it sounds like your event just happened to be warded by a gaggle of douches, probably being captained by king fuckwad who really wanted to be a cop, but couldn't pass the exams.
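
Why the word list is only best-effort can be seen in a small emulation. The Python sketch below is a stand-in for the query, not BigQuery code: rows are filtered with a whole-word regex (like REGEXP_MATCH with `\b`), while the word list is built with a plain substring test (like CONTAINS). The word subset, the function name `best_effort_matches`, and the sample sentence are invented for illustration, and for simplicity the body is lowercased before both steps.

```python
import re

WORDS = ["drug", "meth", "heroin", "beer"]  # illustrative subset
WHOLE_WORD = re.compile(r"\b(" + "|".join(WORDS) + r")\b")


def best_effort_matches(body):
    """Emulate the query: whole-word row filter, then substring word list."""
    body = body.lower()
    if not WHOLE_WORD.search(body):
        return None  # the row would be dropped by the WHERE clause
    # CONTAINS-style check: plain substring test, no word boundaries.
    return ",".join(word for word in WORDS if word in body)


sample = "heroin and other drugs do something to people"
print(best_effort_matches(sample))  # drug,meth,heroin
```

Only "heroin" occurs as a whole word here. "drug" comes from "drugs" and "meth" from "something", which is exactly the kind of extra entry the note above warns about.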

Note: if you need a list of exact matches only, that is still relatively easy to do with BigQuery User-Defined Functions.
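
The logic needed for an exact-match-only list is small. The sketch below shows it in Python purely as an illustration of the idea; it is not a BigQuery UDF and makes no claim about the UDF API. The function name `exact_matches` and the word subset are hypothetical.

```python
import re

WORDS = ["drug", "meth", "heroin", "beer"]  # illustrative subset
WHOLE_WORD = re.compile(r"\b(" + "|".join(WORDS) + r")\b")


def exact_matches(body):
    """Return each listed word that occurs as a whole word, in first-seen order."""
    seen = []
    for word in WHOLE_WORD.findall(body.lower()):
        if word not in seen:  # keep each word once
            seen.append(word)
    return ",".join(seen)


print(exact_matches("heroin and other drugs do something to people"))  # heroin
print(exact_matches("Beer, meth and more beer"))                       # beer,meth
```

Because every entry comes from a word-boundary match, "drugs" and "something" no longer contribute "drug" or "meth" to the list.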