Bag of Words with json array
I'm trying to build a custom bag of words by following this tutorial.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]
vectorizer = CountVectorizer()
print( vectorizer.fit_transform(corpus).todense() )
print( vectorizer.vocabulary_ )
This script prints:
[[1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 1]
[0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0]]
{u'all': 0, u'sunshine': 20, u'some': 18, u'down': 3, u'reason': 13, u'looks': 9, u'in': 7, u'outer': 12, u'sits': 17, u'row': 14, u'toy': 24, u'from': 5, u'like': 8, u'for': 4, u'space': 19, u'this': 22, u'sit': 16, u'when': 25, u'cat': 1, u'to': 23, u'cats': 2, u'she': 15, u'loves': 10, u'furby': 6, u'the': 21, u'my': 11}
So here is my problem: I have a json file with this data structure:
[
    {
        "id": "1",
        "class": "positive",
        "tags": [
            "tag1",
            "tag2"
        ]
    },
    {
        "id": "2",
        "class": "negative",
        "tags": [
            "tag1",
            "tag3"
        ]
    }
]
So I tried to vectorize the tags arrays, but without success.
I tried things like:
data = json.load(open('data.json'));
print( vectorizer.fit_transform(data).todense() )
And also:
for element in data:
    print( vectorizer.fit_transform(element).todense() )
    # or
    print( vectorizer.fit_transform(element['tags']).todense() )
Neither of them worked. Any ideas?
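CountVectorizer.fit_transform expects an iterable of strings, one raw document per element, so passing the whole list of dicts (or a single dict) gives it nothing to tokenize. The trick is to turn each record's list of tags into one space-separated string and vectorize those strings. A minimal sketch without pandas, assuming the data.json shown above:

import json
from sklearn.feature_extraction.text import CountVectorizer

data = json.load(open('data.json'))
docs = [' '.join(element['tags']) for element in data]  # one string per record, e.g. 'tag1 tag2'

vectorizer = CountVectorizer()
print(vectorizer.fit_transform(docs).todense())
print(vectorizer.vocabulary_)

The pandas route below does the same thing step by step: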
1. Read the json file into a DataFrame with pandas
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_json('data.json', orient='values')
print(df)
This is what your DataFrame looks like:
Out[]:
      class  id          tags
0  positive   1  [tag1, tag2]
1  negative   2  [tag1, tag3]
2. Convert the tags column from list to str
df['tags'] = df['tags'].apply(lambda x: ' '.join(x))
print(df)
This converts tags into space-separated strings:
Out[]:
      class  id       tags
0  positive   1  tag1 tag2
1  negative   2  tag1 tag3
3. Feed the tags column (a pandas Series) into CountVectorizer
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(df['tags']).todense())
print(vectorizer.vocabulary_)
This produces the output you want:
Out[]:
[[1 1 0]
[1 0 1]]
{'tag1': 0, 'tag2': 1, 'tag3': 2}
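A side note: the join in step 2 can also be skipped by giving CountVectorizer a callable analyzer, in which case each already-tokenized list of tags is used as-is (lowercasing and the token pattern are bypassed). A rough sketch against the same data.json, which should give the same matrix and vocabulary as step 3:

df = pd.read_json('data.json', orient='values')           # tags column still holds lists
vectorizer = CountVectorizer(analyzer=lambda tags: tags)   # treat each list as pre-tokenized tokens
print(vectorizer.fit_transform(df['tags']).todense())
print(vectorizer.vocabulary_)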