How can I categorize tweets with the Google Cloud Natural Language API, if that is possible at all?

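I'm trying to use the Google Cloud Natural Language API to classify/categorize tweets, so that I can filter out tweets that are not relevant to my audience (which is weather related). I understand it must be tricky for an AI solution to classify a small amount of text, but I would have imagined it would at least make a guess on text like this: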

Wind chills of zero to -5 degrees are expected in Northwestern Arkansas into North-Central Arkansas extending into portions of northern Oklahoma during the 6-9am window . #arwx #okwx

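I've tested several tweets, but only very few get a classification; the rest return no result (or "No categories found. Try a longer text input." if I try through the GUI).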

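Is it pointless to hope for this to work? Or is it possible to lower the threshold for classification? An "educated guess" from the NLP solution would be better than no filter at all. Is there an alternative solution (other than training my own NLP model)?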

Edit: to clarify:

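Ultimately, I'm using the Google Cloud Platform Natural Language API to classify tweets. To test it, I'm using the GUI (linked above). I can see that very few of the tweets I test in the GUI get a classification from GCP NLP, i.e. the categories come back empty.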

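The ideal outcome would be for GCP NLP to provide a category guess for the tweet text instead of an empty result. I assume the NLP model drops any result with a confidence below X%. It would be interesting to know whether that threshold can be configured.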

I assume tweet classification must have been done before. Is there any other way to solve this?

Edit 2: the code that classifies a tweet:

async function classifyTweet(tweetText) {
   const language = require('@google-cloud/language');
   const client = new language.LanguageServiceClient({projectId, keyFilename});
   //const tweetText = "Some light snow dusted the ground this morning, adding to the intense snow fall of yesterday. Here at my Warwick station the numbers are in, New Snow 19.5cm and total depth 26.6cm. A very good snow event. Photos to be posted. #ONStorm #CANWarnON4464 #CoCoRaHSON525"
   const document = {
      content: tweetText,
      type: 'PLAIN_TEXT',
   };   
   const [classification] = await client.classifyText({document});
   
   console.log('Categories:');
   classification.categories.forEach(category => {
     console.log(`Name: ${category.name}, Confidence: ${category.confidence}`);
   });
   
   return classification.categories
}

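I have taken a close look at the current state of Cloud Natural Language, and my answer to your main question is that, as things stand, it is not possible to get a classification for this kind of text. A workaround, however, is to base your own categories on the output obtained from analyzing the input text.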

Given that we are not using a custom model for this, only the options that Cloud Natural Language offers, one tentative approach to the problem is the following:

First, I adapted the code from the official samples to what we need here, to illustrate the point:

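# Note: the `enums` module is part of google-cloud-language 1.x;
# it was removed in the 2.0 release of the library.
from google.cloud import language_v1
from google.cloud.language_v1 import enums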


def sample_cloud_natural_language_text(text_content):
    """ 
    Args:
      text_content The text content to analyze. Must include at least 20 words.
    """

    client = language_v1.LanguageServiceClient()
    type_ = enums.Document.Type.PLAIN_TEXT

    language = "en"
    document = {"content": text_content, "type": type_, "language": language}


    print("=====CLASSIFY TEXT=====")
    response = client.classify_text(document)
    for category in response.categories:
        print(u"Category name: {}".format(category.name))
        print(u"Confidence: {}".format(category.confidence))


    print("=====ANALYZE TEXT=====")
    response = client.analyze_entities(document)
    for entity in response.entities:
        print(f">>>>> ENTITY {entity.name}")  
        print(u"Entity type: {}".format(enums.Entity.Type(entity.type).name))
        print(u"Salience score: {}".format(entity.salience))

        for metadata_name, metadata_value in entity.metadata.items():
            print(u"{}: {}".format(metadata_name, metadata_value))

        for mention in entity.mentions:
            print(u"Mention text: {}".format(mention.text.content))
            print(u"Mention type: {}".format(enums.EntityMention.Type(mention.type).name))


if __name__ == "__main__":
    #text_content = "That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows."
    text_content="Wind chills of zero to -5 degrees are expected in Northwestern Arkansas into North-Central Arkansas extending into portions of northern Oklahoma during the 6-9am window"
    
    sample_cloud_natural_language_text(text_content)

Output:

=====CLASSIFY TEXT=====
=====ANALYZE TEXT=====
>>>>> ENTITY Wind chills
Entity type: OTHER
Salience score: 0.46825599670410156
Mention text: Wind chills
Mention type: COMMON
>>>>> ENTITY degrees
Entity type: OTHER
Salience score: 0.16041776537895203
Mention text: degrees
Mention type: COMMON
>>>>> ENTITY Northwestern Arkansas
Entity type: ORGANIZATION
Salience score: 0.07702474296092987
mid: /m/02vvkn4
wikipedia_url: https://en.wikipedia.org/wiki/Northwest_Arkansas
Mention text: Northwestern Arkansas
Mention type: PROPER
>>>>> ENTITY North
Entity type: LOCATION
Salience score: 0.07702474296092987
Mention text: North
Mention type: PROPER
>>>>> ENTITY Arkansas
Entity type: LOCATION
Salience score: 0.07088913768529892
mid: /m/0vbk
wikipedia_url: https://en.wikipedia.org/wiki/Arkansas
Mention text: Arkansas
Mention type: PROPER
>>>>> ENTITY window
Entity type: OTHER
Salience score: 0.06348973512649536
Mention text: window
Mention type: COMMON
>>>>> ENTITY Oklahoma
Entity type: LOCATION
Salience score: 0.04747137427330017
wikipedia_url: https://en.wikipedia.org/wiki/Oklahoma
mid: /m/05mph
Mention text: Oklahoma
Mention type: PROPER
>>>>> ENTITY portions
Entity type: OTHER
Salience score: 0.03542650490999222
Mention text: portions
Mention type: COMMON
>>>>> ENTITY 6
Entity type: NUMBER
Salience score: 0.0
value: 6
Mention text: 6
Mention type: TYPE_UNKNOWN
>>>>> ENTITY 9
Entity type: NUMBER
Salience score: 0.0
value: 9
Mention text: 9
Mention type: TYPE_UNKNOWN
>>>>> ENTITY -5
Entity type: NUMBER
Salience score: 0.0
value: -5
Mention text: -5
Mention type: TYPE_UNKNOWN
>>>>> ENTITY zero
Entity type: NUMBER
Salience score: 0.0
value: 0
Mention text: zero
Mention type: TYPE_UNKNOWN

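As you can see, classify text does not help much (the result is empty). Analyze text (the entity analysis), on the other hand, does return some values, and we can use those to build our own categories. The trick (and also the hard work) is to create a pool of keywords for each category we define, and then match the entities returned by the analysis against those pools. As for the categories themselves, we can look at the current list of available categories published by Google to get an idea of what categories should look like.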
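To make the keyword-pool idea more concrete, here is a minimal sketch of how it could look. Everything in it is an assumption for illustration only: the `KEYWORD_POOLS` contents, the `guess_categories` helper, and the `min_score` cut-off are hypothetical names and values, not part of the Natural Language API. It takes `(name, salience)` pairs like the ones printed above and adds up the salience of every entity that hits a category's pool.

```python
from collections import defaultdict

# Hypothetical keyword pools: category names and keywords are illustrative
# and would have to be curated by hand for a real audience.
KEYWORD_POOLS = {
    "weather": {"wind", "chill", "chills", "snow", "degrees", "storm", "rain"},
    "sports": {"game", "team", "score", "season"},
}


def guess_categories(entities, pools=KEYWORD_POOLS, min_score=0.1):
    """Score each category by the summed salience of matching entities.

    entities: iterable of (entity_name, salience) pairs, e.g. taken from
    the entities of an analyze_entities response.
    Returns a list of (category, score) pairs, best first, keeping only
    categories whose score is at least min_score (an arbitrary cut-off).
    """
    scores = defaultdict(float)
    for name, salience in entities:
        # Compare on lower-cased whole words of the entity name.
        words = set(name.lower().split())
        for category, keywords in pools.items():
            if words & keywords:
                scores[category] += salience
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(category, score) for category, score in ranked if score >= min_score]


# A few of the entities from the output above, as (name, salience) pairs.
entities = [
    ("Wind chills", 0.46825599670410156),
    ("degrees", 0.16041776537895203),
    ("Northwestern Arkansas", 0.07702474296092987),
    ("window", 0.06348973512649536),
]
result = guess_categories(entities)
print(result)
```

With these pools, only "Wind chills" and "degrees" match, so the tweet ends up with a single "weather" guess. Weighting by salience is just one possible design choice; counting matches, or ignoring low-salience entities entirely, would work along the same lines.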

I don't think the ability to lower the bar is implemented in the current version, but it is something that could be requested from Google as a feature.