TShopping

Title: CountVectorizer in Python natural language processing

Author: woff  Time: 2021-5-4 22:07

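• By default, all text is converted to lowercase.
• By default, the features are sorted in ascending (alphabetical) order.
• The feature names produced by tokenization can be listed with:
• vectorizer.get_feature_names() (in scikit-learn 1.0 and later, use vectorizer.get_feature_names_out(); the old name was removed in 1.2)
• Next, we find how many times each feature occurs in each document: print(X.toarray())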

Code

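from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'this document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
# Learn the vocabulary and build the sparse document-term count matrix
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
print(vectorizer.get_feature_names_out())
print(X)            # sparse form: (document index, feature index) and its count
print()
print(X.toarray())  # dense form: one row per document, one column per feature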

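To make the defaults listed earlier more concrete, here is a rough pure-Python sketch of the same counting. It is not scikit-learn's implementation; it only assumes the documented defaults (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically). The names TOKEN_PATTERN, tokenize, features and counts are made up for this illustration.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'this document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumption: mirrors CountVectorizer's default token pattern
# (words of at least two word characters; punctuation is dropped).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    # Lowercase first, then split into tokens.
    return TOKEN_PATTERN.findall(text.lower())

# Vocabulary: every distinct token, sorted in ascending order.
features = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document, one column per feature, each cell a count.
counts = []
for doc in corpus:
    tally = Counter(tokenize(doc))
    counts.append([tally[f] for f in features])

print(features)
for row in counts:
    print(row)
```

'This' and 'this' end up in the same column because of the lowercasing, and 'document' is counted twice in the second row.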

• idxs=np.array(sorted(score_dict.items(),key=lambda x:x[1],reverse=True))[:return_num,0]
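• score_dict is a dict that holds the similarity scores; we sort its items by score, largest first (reverse=True).
• Converting the sorted result to a NumPy array lets us use array indexing to pick out the N entries we want.
• ndarray[:return_num,0] is a slice: the part before the comma selects rows and the part after it selects columns. With return_num defaulting to 3, it takes column 0 (the keys) of rows 0, 1 and 2.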
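The top-N selection described above can also be sketched in plain Python. The contents of score_dict here are invented for the example (document index mapped to a similarity score); the original post does not show them. One reason to consider this form: np.array stores all values in a single dtype, so integer keys mixed with float scores come back as floats, while a list comprehension keeps the keys as they are.

```python
# Hypothetical similarity scores: document index -> similarity score
score_dict = {0: 0.12, 1: 0.87, 2: 0.45, 3: 0.91, 4: 0.30}
return_num = 3

# Sort (key, score) pairs by score, highest first.
ranked = sorted(score_dict.items(), key=lambda x: x[1], reverse=True)

# Keep only the keys of the first return_num pairs,
# the same rows and column that [:return_num, 0] selects.
idxs = [key for key, _ in ranked[:return_num]]

print(idxs)  # [3, 1, 2]
```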
