Sparse Matrix and DataFrame in Python Pandas
I am trying to replicate the project "Binary Classification: Twitter sentiment analysis" in Python.
The steps are:
Step 1: Get data
Step 2: Text preprocessing using R
Step 3: Feature engineering
Step 4: Split the data into train and test
Step 5: Train prediction model
Step 6: Evaluate model performance
Step 7: Publish prediction web service
I am currently at Step 4, but I don't think I can continue.
import pandas
import re
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import cross_validation
#read the dataset of tweets
header_row=['sentiment','tweetid','date','query', 'user', 'text']
train = pandas.read_csv("training.1600000.processed.noemoticon.csv",names=header_row)
#keep only the right columns
train = train[["sentiment","text"]]
#remove punctuation, special characters and numbers, and lowercase the text
def remove_spch(text):
    return re.sub("[^a-z]", ' ', text.lower())
train['text'] = train['text'].apply(remove_spch)
#Feature Hashing
def tokens(doc):
    """Extract tokens from doc.

    This uses a simple regex to break strings into tokens.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))
n_features = 2**18
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True)
X = hasher.transform(tokens(d) for d in train['text'])
#Feature selection: keep the best 20,000 features using chi-square
X_new = SelectKBest(chi2, k=20000).fit_transform(X, train['sentiment'])
#Using Stratified KFold, split my data to train and test
skf = cross_validation.StratifiedKFold(X_new, n_folds=2)
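I am sure the last line is wrong, because X_new contains only the 20,000 features and not the sentiment column from the Pandas DataFrame. How do I "join" the sparse matrix X_new with the DataFrame train, include it in cross_validation, and then use it for a classifier?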
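You should pass your class labels to StratifiedKFold, not the feature matrix, and then use skf as an iterator. On each iteration it yields the indices of the training set and the test set, which you can use to split both the sparse matrix and the labels. No join is needed, because row i of X_new corresponds to row i of train.
See the code example in the official scikit-learn documentation: StratifiedKFold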
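The index-based split described above can be sketched as follows. This is a minimal illustration rather than a drop-in fix. It assumes a newer scikit-learn release, where the sklearn.cross_validation module used in the question has been replaced by sklearn.model_selection and the labels are passed to split() instead of the constructor. The random sparse matrix, the 0/1 labels, and the LogisticRegression classifier are stand-ins chosen for the example; with the real data, y would be train['sentiment'].values.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Stand-ins for the question's data (assumptions for this sketch):
# X_new plays the role of the SelectKBest output, y of train['sentiment'].values.
rng = np.random.RandomState(0)
X_new = sparse.random(40, 10, density=0.3, format="csr", random_state=rng)
y = np.array([0, 1] * 20)

# The splitter only needs the labels to build stratified folds;
# it returns row indices, so the features and labels stay separate objects.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

scores = []
test_label_counts = []
for train_idx, test_idx in skf.split(X_new, y):
    # CSR matrices and NumPy arrays can both be indexed with the same row indices.
    X_train, X_test = X_new[train_idx], X_new[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    clf = LogisticRegression()  # any classifier that accepts sparse input would do
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
    test_label_counts.append(np.bincount(y_test).tolist())

print(scores)
```

Because the same index arrays select rows from both X_new and y, the sparse matrix never has to be converted to a dense DataFrame, which matters at 1.6 million rows by 20,000 columns.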