Cleaning Data and Filtering Series
I am analyzing a dataset of job postings from Indeed. My problem is filtering the job descriptions and extracting skills that contain special characters. For example, I cannot get 'c#' into the plot with the following code:
import pandas as pd
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

def cleanData(desc):
    # Tokenize, lowercase, and drop English stop words.
    desc = word_tokenize(desc)
    desc = [word.lower() for word in desc]
    desc = [word for word in desc if word not in stop_words]
    return desc

tags_df = df["Description"].apply(cleanData)

# Count token frequencies across all descriptions and sort descending.
result = tags_df.apply(Counter).sum().items()
result = sorted(result, key=lambda kv: kv[1], reverse=True)
result_series = pd.Series({k: v for k, v in result})

# Keep only the skills of interest and plot their counts.
skills = ["java", "c#", "c++", "javascript", "sql", "python", "php", "html", "css"]
filter_series = result_series.filter(items=skills)
filter_series.plot(kind='bar', figsize=(20, 5))
That said, I do still capture words like 'c++', 'asp.net', and 'react.js'. Thanks for any help.
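The behaviour can be reproduced in isolation: `word_tokenize` delegates word-level splitting to `TreebankWordTokenizer`, whose default punctuation rules include `#` (but not `+` or mid-word `.`), so 'c#' is split apart while 'c++' survives. A minimal sketch of the symptom:

```python
from nltk.tokenize import TreebankWordTokenizer

# word_tokenize() delegates to this tokenizer; its default PUNCTUATION
# rules contain the character class [;@#$%&], so '#' gets padded with
# spaces and split off, while '+' is left untouched.
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize('My favorite programming languages are c# and c++')
print(tokens)  # 'c#' comes out as two tokens 'c' and '#'; 'c++' stays whole
```

This is why counting tokens afterwards never finds a 'c#' entry to plot.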
You can modify the behavior of the nltk tokenizer by changing its punctuation regular expressions:
from nltk.tokenize import TreebankWordTokenizer
import re

tokenizer = TreebankWordTokenizer()
# Same rules as NLTK's defaults, but with '#' removed from the
# punctuation character class so 'c#' stays a single token.
tokenizer.PUNCTUATION = [
    (re.compile(r"([:,])([^\d])"), r" \1 \2"),
    (re.compile(r"([:,])$"), r" \1 "),
    (re.compile(r"\.\.\."), r" ... "),
    (re.compile(r"[;@$%&]"), r" \g<0> "),  # '#' dropped from this class
    (
        re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$'),
        r"\1 \2\3 ",
    ),  # Handles the final period.
    (re.compile(r"[?!]"), r" \g<0> "),
    (re.compile(r"([^'])' "), r"\1' "),
]
text = 'My favorite programming languages are c# and c++'
tokens = tokenizer.tokenize(text)
print(tokens)
Output:
['My', 'favorite', 'programming', 'languages', 'are', 'c#', 'and', 'c++']
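To tie this back to your pipeline, here is a hedged end-to-end sketch that reuses the patched tokenizer inside `cleanData`. The two job descriptions and the tiny inline stop-word set are illustrative assumptions, standing in for your `df` and NLTK's English stopword list:

```python
import re
from collections import Counter
import pandas as pd
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# NLTK's default rules with '#' removed from the punctuation class.
tokenizer.PUNCTUATION = [
    (re.compile(r"([:,])([^\d])"), r" \1 \2"),
    (re.compile(r"([:,])$"), r" \1 "),
    (re.compile(r"\.\.\."), r" ... "),
    (re.compile(r"[;@$%&]"), r" \g<0> "),
    (re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$'), r"\1 \2\3 "),
    (re.compile(r"[?!]"), r" \g<0> "),
    (re.compile(r"([^'])' "), r"\1' "),
]

# Tiny stop-word set for illustration; in the real pipeline you would
# keep stopwords.words('english').
STOP_WORDS = frozenset({"we", "are", "and", "for", "looking"})

def cleanData(desc):
    tokens = tokenizer.tokenize(desc)
    return [w.lower() for w in tokens if w.lower() not in STOP_WORDS]

# Hypothetical stand-in for the Indeed DataFrame.
df = pd.DataFrame({"Description": [
    "We are hiring c# and sql developers",
    "Looking for c++ and c# engineers",
]})

tags = df["Description"].apply(cleanData)
counts = pd.Series(dict(tags.apply(Counter).sum()))
skills = ["java", "c#", "c++", "javascript", "sql", "python"]
print(counts.filter(items=skills))  # 'c#' now appears with its own count
```

Because '#' is no longer split off, `result_series` gains a 'c#' entry and `filter(items=skills)` can pick it up for the bar plot.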