Text classification of a large dataset in Python

Question

I have 2.2 million data samples to classify into more than 7500 categories. I am using pandas and Python's scikit-learn to do so.

Below is a sample of my dataset:

itemid      description                                         category
11802974    SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters      Architectural Diffusers
10688548    ANTIQUE BRONZE FINISH PUSHBUTTON  switch            Door Bell Pushbuttons
9836436     Descente pour Cable tray fitting and accessories    Tray Cable Drop Outs

Here are the steps I followed:

Preprocessing
Vector representation
Training

    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer

    dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=False)
    # Replace non-letters and digits with spaces, then lowercase.
    dataset['description'] = dataset['description'].str.replace(r'[^a-zA-Z]', ' ', regex=True)
    dataset['description'] = dataset['description'].str.replace(r'[\d]', ' ', regex=True)
    dataset['description'] = dataset['description'].str.lower()

    stop = stopwords.words('english')
    lemmatizer = WordNetLemmatizer()

    # Remove stop words, collapse repeated whitespace, then tokenize.
    dataset['description'] = dataset['description'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ', regex=True)
    dataset['description'] = dataset['description'].str.replace(r'\s\s+', ' ', regex=True)
    dataset['description'] = dataset['description'].apply(word_tokenize)

    # Lemmatize every token once per part-of-speech tag, dropping duplicates.
    ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    POS_LIST = [NOUN, VERB, ADJ, ADV]
    for tag in POS_LIST:
        dataset['description'] = dataset['description'].apply(
            lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
    dataset['description'] = dataset['description'].apply(lambda x: " ".join(x))

    countvec = CountVectorizer(min_df=0.0005)
    documenttermmatrix = countvec.fit_transform(dataset['description'])
    # In scikit-learn >= 1.0 this method is named get_feature_names_out().
    column = countvec.get_feature_names()

    y_train = dataset['category'].tolist()

    del dataset
    del stop
    del tag

The resulting documenttermmatrix is a SciPy CSR matrix with 12k features and 2.2 million samples.
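
To put a number on how small such a matrix is, here is a rough back-of-the-envelope sketch. The sample and feature counts come from the question; the average of 8 distinct terms per description and the int64/int32 array types are assumptions made for illustration, not measurements.

```python
# Rough memory estimate for a SciPy CSR matrix, using plain arithmetic.
n_samples = 2_200_000     # from the question
n_features = 12_000       # from the question
avg_terms_per_row = 8     # ASSUMPTION: hypothetical average of distinct terms per description

nnz = n_samples * avg_terms_per_row  # number of stored non-zero entries

# CSR keeps three arrays: data (one value per non-zero), indices (one column
# index per non-zero) and indptr (one offset per row, plus one).
data_bytes = nnz * 8                 # ASSUMPTION: int64 counts
indices_bytes = nnz * 4              # ASSUMPTION: int32 column indices
indptr_bytes = (n_samples + 1) * 4   # ASSUMPTION: int32 row offsets
csr_bytes = data_bytes + indices_bytes + indptr_bytes

# The same matrix stored densely as float64, for comparison.
dense_bytes = n_samples * n_features * 8

print(f"CSR:   {csr_bytes / 1e9:.2f} GB")
print(f"dense: {dense_bytes / 1e9:.1f} GB")
```

Under these assumptions the sparse form takes a fraction of a gigabyte, while a dense copy of the same matrix would exceed 128 GB.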

For training, I tried xgboost through its scikit-learn interface:

model = XGBClassifier(silent=False,n_estimators=500,objective='multi:softmax',subsample=0.8)
model.fit(documenttermmatrix,y_train,verbose=True)

2-3 minutes after executing the above code, I got this error:

OSError: [WinError 541541187] Windows Error 0x20474343

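I also tried scikit-learn's naive Bayes, but there I got a memory error.

Problem

I used a SciPy sparse matrix, which consumes very little memory, and I deleted all unused objects before running xgboost or naive Bayes. The system I am using has 128 GB of RAM, yet I still run into memory problems during training.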
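
One way a sparse input can still exhaust memory is through the number of classes rather than the number of features. The sketch below is plain arithmetic. It assumes that some training step allocates a dense float64 array with one row per sample and one column per class, which is an assumption about library internals that the question does not confirm. The helper name `dense_label_bytes` and the batch size of 10,000 are hypothetical.

```python
def dense_label_bytes(n_rows, n_classes, bytes_per_value=8):
    """Size in bytes of a dense (n_rows x n_classes) array of float64 values."""
    return n_rows * n_classes * bytes_per_value


n_samples = 2_200_000   # from the question
n_classes = 7_500       # from the question ("more than 7500")
ram_gb = 128            # from the question

# ASSUMPTION: one dense row per sample, one column per class.
full_gb = dense_label_bytes(n_samples, n_classes) / 1e9

# The same array for a hypothetical mini-batch of 10,000 rows.
batch_gb = dense_label_bytes(10_000, n_classes) / 1e9

print(f"all samples at once: {full_gb:.0f} GB (RAM: {ram_gb} GB)")
print(f"one 10,000-row batch: {batch_gb:.1f} GB")
```

If that assumption holds, a single such array would already be larger than the available RAM, whereas processing the rows in batches would keep each allocation well under a gigabyte.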

I am new to Python. Is there anything wrong with my code? Can anyone tell me how I can use memory efficiently and proceed further?

Answer 1

I think I can explain the problem in your code. The OS error appears to be:

ERROR_DS_RIDMGR_DISABLED
8263 (0x2047)

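"The directory service detected that the subsystem that allocates relative identifiers is disabled. This can occur as a protective mechanism when the system determines that a significant portion of relative identifiers (RIDs) have been exhausted."

via https://msdn.microsoft.com/en-us/library/windows/desktop/ms681390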

I think you exhausted a significant portion of the RIDs at this step of your code:

dataset['description'] = dataset['description'].apply(lambda x: 
list(set([lemmatizer.lemmatize(item,tag) for item in x])))

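You are passing the lemmatizer into a lambda, but a lambda is anonymous, so it looks like you might be making 2.2 million copies of that lemmatizer at runtime.

You should try changing the low_memory flag to true whenever you have a memory issue.

In response to the comment:

I checked the Pandas documentation, and you can define a function outside of dataset['description'].apply() and then reference that function in the call to dataset['description'].apply(). Here is how I would write said function.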

def lemmatize_descriptions(x):
    return list(set([lemmatizer.lemmatize(item, tag) for item in x]))

Then, the call to apply() would be:

dataset['description'] = dataset['description'].apply(lemmatize_descriptions)
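
The function above still reads `tag` from the enclosing scope. A minimal, self-contained sketch of the same idea with the tag passed explicitly is shown below. `ToyLemmatizer` is a hypothetical stand-in for NLTK's WordNetLemmatizer so the example runs without NLTK data, and plain lists stand in for the pandas column; `lemmatize_tokens` is likewise an illustrative name.

```python
from functools import partial


class ToyLemmatizer:
    """HYPOTHETICAL stand-in for nltk's WordNetLemmatizer (toy rules only)."""

    def lemmatize(self, word, pos="n"):
        # Strip a plural "s" for nouns and a trailing "ing" for verbs.
        if pos == "n" and word.endswith("s") and len(word) > 3:
            return word[:-1]
        if pos == "v" and word.endswith("ing") and len(word) > 5:
            return word[:-3]
        return word


lemmatizer = ToyLemmatizer()  # created once and shared by every call


def lemmatize_tokens(tokens, tag):
    """Lemmatize each token for one part-of-speech tag, dropping duplicates."""
    # sorted() gives a deterministic order, which a bare set does not.
    return sorted({lemmatizer.lemmatize(token, tag) for token in tokens})


descriptions = [["heaters", "heating", "switch"], ["trays", "cable"]]
for tag in ["n", "v"]:
    # partial() binds the current tag; with pandas, `step` could be handed
    # to Series.apply in place of the list comprehension used here.
    step = partial(lemmatize_tokens, tag=tag)
    descriptions = [step(tokens) for tokens in descriptions]

print(descriptions)
```

Binding the tag with `partial` keeps the function free of hidden globals, and a single lemmatizer object is reused for every row.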

Here is the documentation.
