Finding matching words with ngrams
Dataset:
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[['Id', 'bigram']]
Id bigram
1952043 [(Swimming,Pool),(Pool,in),(in,the),(the,roof),(roof,top),
1918916 [(Luxury,Apartments),(Apartments,consisting),(consisting,11),
1645751 [(Flat,available),(available,sale),(sale,Medavakkam),
1270503 [(Toddler,Pool),(Pool,with),(with,Jogging),(Jogging,Tracks),
1495638 [(near,medavakkam),(medavakkam,junction),(junction,calm),
I have a Python file (Categories.py) that contains an unsupervised classification of property/land features.
category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'),
('Swimming Pool', 'IN','Recreation_Ammenities'),
('Toddler Pool', 'IN', 'Recreation_Ammenities'),
('Jogging Tracks', 'IN', 'Recreation_Ammenities')]
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']
Finding matching words between the bigram column and the category list:
tokens=pd.Series(df["bigram"])
Lid=pd.Series(df["Id"])
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.Recreation])))
When I run the code above, I get this error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
I need help with this.
My desired output is:
Id bigram Recreation_Amenities
1952043 [(Swimming,Pool),(Pool,in),(in,the),.. Swimming Pool
1918916 [(Luxury,Apartments),(Apartments,.. Luxury Apartments
1645751 [(Flat,available),(available,sale)..
1270503 [(Toddler,Pool),(Jogging,Tracks).. Toddler Pool,Jogging Tracks
1495638 [(near,medavakkam),..
Something along these lines should work for you:
def match_bigrams(row):
    categories = []
    for bigram in row.bigram:
        joined = ' '.join(list(bigram))
        if joined in Recreation:
            categories.append(joined)
    return categories

df['Recreation_Amenities'] = df.apply(match_bigrams, axis=1)
print(df)
Id bigram Recreation_Amenities
0 1952043 [(Swimming, Pool), (Pool, in), (in, the), (the... [Swimming Pool]
1 1918916 [(Luxury, Apartments), (Apartments, consisting... [Luxury Apartments]
2 1645751 [(Flat, available), (available, sale), (sale, ... []
3 1270503 [(Toddler, Pool), (Pool, with), (with, Jogging... [Toddler Pool, Jogging Tracks]
4 1495638 [(near, medavakkam), (medavakkam, junction), (... []
Each bigram is joined with a space so that you can test whether it is in your category list (i.e. if joined in Recreation).
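To try the loop-based approach in isolation, here is a minimal self-contained sketch. The sample rows are a hypothetical subset of the table in the question, and the lowercase name recreation and the two-argument match_bigrams helper are illustrative choices, not part of the original code. It assumes the phrases can be held in a set, which keeps each membership test cheap.

```python
import pandas as pd

# Hypothetical sample data mirroring a few rows of the question's table.
df = pd.DataFrame({
    "Id": [1952043, 1645751, 1270503],
    "bigram": [
        [("Swimming", "Pool"), ("Pool", "in"), ("in", "the")],
        [("Flat", "available"), ("available", "sale")],
        [("Toddler", "Pool"), ("Pool", "with"),
         ("with", "Jogging"), ("Jogging", "Tracks")],
    ],
})

# Phrases taken from the question; a set is an assumption made here
# so that each "in" test does not scan a list.
recreation = {"Luxury Apartments", "Swimming Pool",
              "Toddler Pool", "Jogging Tracks"}


def match_bigrams(bigrams, phrases):
    """Return the space-joined bigrams found in phrases, in text order."""
    return [" ".join(pair) for pair in bigrams if " ".join(pair) in phrases]


df["Recreation_Amenities"] = df["bigram"].apply(
    lambda bigrams: match_bigrams(bigrams, recreation)
)
print(df[["Id", "Recreation_Amenities"]])
```

Because the function iterates over the bigrams, matches come out in the order they occur in the text, and a row without any match gets an empty list.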
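You can join each tuple with a space, then use a nested list comprehension to find the phrases that are present in Recreation, and apply it to the column, i.e.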
df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in [' '.join(i) for i in x]])
Suppose you have a data frame
Id bigram
0 1270503 [(Toddler, Pool), (Pool, with), (with, Jogging), (Jogging, Tracks)]
1 1952043 [(Swimming, Pool), (Pool, in), (in, the), (the, roof), (roof, top)]
2 1918916 [(Luxury, Apartments), (Apartments, consisting), (consisting, 11)]
3 1495638 [(near, medavakkam), (medavakkam, junction), (junction, calm)]
4 1645751 [(Flat, available), (available, sale), (sale, Medavakkam)]
and you have the Recreation list, i.e.
Recreation = ['Luxury Apartments', 'Swimming Pool', 'Toddler Pool', 'Jogging Tracks']
then
df['Recreation_Amenities'] = df['bigram'].apply(lambda x : [j for j in Recreation if j in [' '.join(i) for i in x]])
Output of df['Recreation_Amenities']:
0 [Toddler Pool, Jogging Tracks]
1 [Swimming Pool]
2 [Luxury Apartments]
3 []
4 []
Name: Recreation_Amenities, dtype: object
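The question asked for a comma-separated string rather than a list, so the comprehension can be adapted as sketched below. This is only one possible variant: the helper name matched_phrases and the sample rows are hypothetical, and the joined bigrams are collected into a set once per row instead of being rebuilt for every phrase.

```python
import pandas as pd

Recreation = ["Luxury Apartments", "Swimming Pool",
              "Toddler Pool", "Jogging Tracks"]

# Hypothetical sample rows based on the data frame shown above.
df = pd.DataFrame({
    "Id": [1270503, 1952043, 1495638],
    "bigram": [
        [("Toddler", "Pool"), ("Pool", "with"),
         ("with", "Jogging"), ("Jogging", "Tracks")],
        [("Swimming", "Pool"), ("Pool", "in"), ("in", "the")],
        [("near", "medavakkam"), ("medavakkam", "junction")],
    ],
})


def matched_phrases(bigrams):
    """Join each bigram with a space, then list the Recreation phrases found."""
    joined = {" ".join(pair) for pair in bigrams}
    # Iterating over Recreation means the result follows the order of
    # that list, not the order in which the phrases occur in the text.
    return ",".join(phrase for phrase in Recreation if phrase in joined)


df["Recreation_Amenities"] = df["bigram"].apply(matched_phrases)
print(df[["Id", "Recreation_Amenities"]])
```

Rows with no matching phrase end up with an empty string, which corresponds to the blank cells in the desired output.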