Python Polars regex - remove non-English, keep numbers, punctuation and emojis

Question

I have Python code for this task.

import re
import string

emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
shrink_whitespace_reg = re.compile(r'\s{2,}')

def clean_text(raw_text):
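    # Join the punctuation characters with a backslash so that most of them are escaped inside the class.
    reg = re.compile(r'({})|[^a-zA-Z0-9 -{}]'.format(emoji_pat, "\\".join(list(string.punctuation)))) # line a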
    result = reg.sub(lambda x: ' {} '.format(x.group(1)) if x.group(1) else ' ', raw_text)
    return shrink_whitespace_reg.sub(' ', result).lower()

I tried using polars.internals.series.StringNameSpace.contains

But I got an exception:
ComputeError: regex error: Syntax(

regex parse error:
    ([--☀-⛿✀-➿])|[^a-zA-Z0-9 -!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\}\~]
                     ^^
error: unclosed character class

Example with non-English text:

texts = ['水虫対策にはコレが一番ですね','','I love polars!-ã„ã¤ã‚‚ã•ã‚‰ã•ã‚‰.','So good .']
import pandas as pd
df = pd.DataFrame({'text':texts})

d = df.text.apply(clean_text)

Expected:

0                    
1                  
2    i love polars! .
3         so good  .
Name: text, dtype: object

Another question:

Is it faster than using re?

Answer 1

import string

import polars as pl

emoji_pat = "[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"

texts = ['水虫対策にはコレが一番ですね','','I |love|  polars!-ã„ã¤ã‚‚ã•ã‚‰ã•ã‚‰.','So good       .']

df = pl.DataFrame(pl.Series("text", texts))

In [78]: df
Out[78]:
shape: (4, 1)
┌─────────────────────────────────────┐
│ text                                │
│ ---                                 │
│ str                                 │
╞═════════════════════════════════════╡
│ 水虫対策にはコレが一番ですね        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│                                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ I |love|  polars!-ã„ã¤ã‚‚ã•ã‚‰ã•... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ So good       .                   │
└─────────────────────────────────────┘

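# Add a cleaned column (the Rust regex engine requires "[" inside [] to be escaped).
# with_columns is used here; the older with_column spelling is no longer available in recent Polars releases.
df_cleaned = df.with_columns(
    pl.col("text").str.replace_all(
        # Remove every run of characters that is not ASCII alphanumeric, space, punctuation or emoji.
        "[^a-zA-Z0-9 " + string.punctuation.replace("[", r"\[") + emoji_pat + "]+",
        ""
    ).str.replace_all(
        # Collapse runs of two or more whitespace characters into a single space.
        r"\s{2,}", " "
    ).str.to_lowercase().alias("text_cleaned")
)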

In[79]: df_cleaned
Out[79]:
shape: (4, 2)
┌─────────────────────────────────────┬────────────────────┐
│ text                                ┆ text_cleaned       │
│ ---                                 ┆ ---                │
│ str                                 ┆ str                │
╞═════════════════════════════════════╪════════════════════╡
│ 水虫対策にはコレが一番ですね        ┆                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│                                 ┆                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ I |love|  polars!-ã„ã¤ã‚‚ã•ã‚‰ã•... ┆ i |love| polars!-. │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ So [good]       .                 ┆ so [good]  .     │
└─────────────────────────────────────┴────────────────────┘
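
The snippet above escapes only "[" inside the character class. A slightly more defensive sketch, which is not part of the original answer, escapes every character that can act as an operator inside a class. The set \ [ ] ^ - & ~ is an assumption based on the Rust regex syntax, where &&, -- and ~~ are class set operators. The helper names escape_for_class, build_removal_pattern and clean_text_reference are hypothetical, and the check uses Python's re only, so it says nothing about speed.

```python
import re
import string

# Characters treated as special inside a character class.
# Assumption: this set targets Rust-style regex engines as well as Python's re.
CLASS_META = set("\\[]^-&~")

# Same emoji ranges as emoji_pat above, without the surrounding brackets.
EMOJI_RANGES = "\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF"


def escape_for_class(chars):
    """Backslash-escape only the characters that are special inside [...]."""
    return "".join("\\" + ch if ch in CLASS_META else ch for ch in chars)


def build_removal_pattern():
    """Build a pattern matching runs of characters that should be removed."""
    keep = "a-zA-Z0-9 " + escape_for_class(string.punctuation) + EMOJI_RANGES
    return "[^" + keep + "]+"


def clean_text_reference(text):
    """Pure-Python reference for the three steps: remove, collapse spaces, lowercase."""
    text = re.sub(build_removal_pattern(), "", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text.lower()


print(clean_text_reference("So [good]       ."))
```

With Polars, the string returned by build_removal_pattern() would take the place of the concatenated pattern in the first replace_all call; that substitution has not been run here.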
