Getting numeric data from extracted texts using python

Question

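I have already extracted users' tweets, their locations, and other important tweet information. The next step is to extract water level data (that is, if a tweet contains a number followed by 'm' or 'meter', it can be treated as water level data).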

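A sample of the dataset looks like this ('text' is the column name of the extracted tweets, and 'df' is the name of the DataFrame that contains the 'text' column):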

text
there is 12m water here
I saw a 5m wave height

I tried the following code:

length = len(df['text'])
for i in range(length):
    if df.loc[df['text'].str.contains('%d'+ 'm')] or if df.loc[df['text'].str.contains('%d'+ 'meter')] :
        df.loc[df['remarks']]== 'YES'
    else:
        df.loc[df['remarks']] == 'NO'

The error I get is:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

All I know is that '%d' is used for numbers, but I am no expert in Python. Can anyone help me fix the code above?

Answer 1

You should use a regular expression, for example:

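import re

txt = "The rain is 12m"

# One or more digits, optional whitespace, then the unit "m", "meter" or "meters".
# A pattern such as "\d[\d]*m*" leaves the "m" optional, so it matches any bare number.
x = re.findall(r"\d+\s*(?:meters?|m)\b", txt)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")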
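The same idea can be applied to the whole 'text' column at once, which also avoids the ValueError from the question: that error appears because `if` is given an entire DataFrame rather than a single True/False value. The following is only a minimal sketch, assuming pandas is installed. The third sample row, the `water_level_m` column name, and the choice to accept decimals such as "1.5m" are assumptions added for illustration and are not part of the original question.

```python
import pandas as pd

# Sample data mirroring the question; the third row is a made-up negative case.
df = pd.DataFrame({"text": [
    "there is 12m water here",
    "I saw a 5m wave height",
    "no flooding reported today",
]})

# A number (optionally with a decimal part), optional whitespace,
# then "m", "meter" or "meters" ending at a word boundary.
pattern = r"(\d+(?:\.\d+)?)\s*(?:meters?|m)\b"

# str.extract returns the first capture group per row, NaN where nothing matches.
df["water_level_m"] = df["text"].str.extract(pattern, expand=False).astype(float)

# Flag rows element-wise instead of looping and testing a DataFrame inside `if`.
df["remarks"] = df["water_level_m"].notna().map({True: "YES", False: "NO"})

print(df)
```

The word boundary `\b` keeps text such as "5 minutes" from being read as a water level. Only the first number-plus-unit pair in each tweet is captured; if a tweet may contain several, `str.extractall` or `str.findall` could be used instead.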

Tags: python, string, twitter, numeric, data-extraction