With the proliferation of online social media and review platforms, a plethora of opinionated data has been logged, bearing great potential for supporting decision-making processes. Sentiment analysis studies people's sentiments in the text they produce, such as product reviews, blog comments, and forum discussions. It enjoys wide applications in fields as diverse as politics (e.g., analysis of public sentiment toward policies), finance (e.g., analysis of market sentiment), and marketing (e.g., product research and brand management).
Since sentiments can be categorized as discrete polarities or scales (e.g., positive and negative), we can consider sentiment analysis as a text classification task, which transforms a variable-length text sequence into a fixed-length text category. In this chapter, we will use Stanford's Large Movie Review Dataset for sentiment analysis. It consists of a training set and a test set, each containing 25,000 movie reviews downloaded from IMDb. In both datasets, there are equal numbers of "positive" and "negative" labels, indicating different sentiment polarities.
PyTorch:

```python
import os
import torch
from torch import nn
from d2l import torch as d2l
```
MXNet:

```python
import os
from mxnet import np, npx
from d2l import mxnet as d2l

npx.set_np()
```
16.1.1. Reading the Dataset
First, download and extract this IMDb review dataset in the path ../data/aclImdb.
```python
#@save
d2l.DATA_HUB['aclImdb'] = (d2l.DATA_URL + 'aclImdb_v1.tar.gz',
                           '01ada507287d82875905620988597833ad4e0903')

data_dir = d2l.download_extract('aclImdb', 'aclImdb')
```

```
Downloading ../data/aclImdb_v1.tar.gz from http://d2l-data.s3-accelerate.amazonaws.com/aclImdb_v1.tar.gz...
```
Next, read the training and test datasets. Each example is a review together with its label: 1 for "positive" and 0 for "negative".
```python
#@save
def read_imdb(data_dir, is_train):
    """Read the IMDb review dataset text sequences and labels."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                   label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '')
                data.append(review)
                labels.append(1 if label == 'pos' else 0)
    return data, labels

train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[:60])
```

```
# trainings: 25000
label: 1 review: Henry Hathaway was daring, as well as enthusiastic, for his
label: 1 review: An unassuming, subtle and lean film, "The Man in the White S
label: 1 review: Eddie Murphy really made me laugh my ass off on this HBO sta
```
16.1.2. Preprocessing the Dataset
Treating each word as a token and filtering out words that appear fewer than 5 times, we create a vocabulary from the training dataset.
```python
train_tokens = d2l.tokenize(train_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])
```
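To make the `min_freq` filtering concrete, here is a minimal pure-Python sketch of frequency-based vocabulary construction. The `build_vocab` helper and the toy corpus are illustrative, not the actual `d2l.Vocab` implementation:

```python
from collections import Counter

def build_vocab(tokens, min_freq=5, reserved=('<unk>', '<pad>')):
    """Build a vocabulary keeping only tokens with frequency >= min_freq."""
    # Count token frequencies across all tokenized lines
    counter = Counter(tok for line in tokens for tok in line)
    # Reserved tokens come first, then the surviving tokens in sorted order
    idx_to_token = list(reserved) + sorted(
        tok for tok, freq in counter.items() if freq >= min_freq)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

corpus = [['good', 'movie'], ['good', 'film'], ['good', 'good', 'movie'],
          ['bad'], ['good', 'movie'], ['movie', 'movie']]
idx_to_token, token_to_idx = build_vocab(corpus, min_freq=5)
print(idx_to_token)  # ['<unk>', '<pad>', 'good', 'movie']
```

Here "film" and "bad" each appear only once, so they fall below `min_freq` and would map to the unknown token.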
After tokenization, let's plot the histogram of review lengths in tokens.
```python
d2l.set_figsize()
d2l.plt.xlabel('# tokens per review')
d2l.plt.ylabel('count')
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));
```
As we expected, the reviews have varying lengths. To process a minibatch of such reviews at each time, we set the length of each review to 500 with truncation and padding, which is similar to the preprocessing step for the machine translation dataset in Section 10.5.
PyTorch:

```python
num_steps = 500  # sequence length
train_features = torch.tensor([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
print(train_features.shape)
```

```
torch.Size([25000, 500])
```
MXNet:

```python
num_steps = 500  # sequence length
train_features = np.array([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
print(train_features.shape)
```

```
(25000, 500)
```
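The truncation-and-padding rule itself is simple; the following self-contained sketch mirrors the behavior of `d2l.truncate_pad` on plain Python lists:

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a token sequence to exactly num_steps items."""
    if len(line) > num_steps:
        return line[:num_steps]  # truncate a long sequence
    return line + [padding_token] * (num_steps - len(line))  # pad a short one

print(truncate_pad([1, 2, 3], 5, 0))           # [1, 2, 3, 0, 0]
print(truncate_pad([1, 2, 3, 4, 5, 6], 5, 0))  # [1, 2, 3, 4, 5]
```

Either way, every review ends up with exactly `num_steps` token indices, so the results can be stacked into a single `(25000, 500)` tensor.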
16.1.3. Creating Data Iterators
Now we can create data iterators. At each iteration, a minibatch of examples is returned.
PyTorch:

```python
train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])), 64)

for X, y in train_iter:
    print('X:', X.shape, ', y:', y.shape)
    break
print('# batches:', len(train_iter))
```

```
X: torch.Size([64, 500]) , y: torch.Size([64])
# batches: 391
```
MXNet:

```python
train_iter = d2l.load_array((train_features, train_data[1]), 64)

for X, y in train_iter:
    print('X:', X.shape, ', y:', y.shape)
    break
print('# batches:', len(train_iter))
```

```
X: (64, 500) , y: (64,)
# batches: 391
```
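Conceptually, the iterator just slices shuffled (features, labels) pairs into fixed-size chunks, with a smaller final batch when the dataset size is not divisible by the batch size (hence 391 batches of 25,000 examples at batch size 64). A minimal pure-Python sketch of that logic, not the `d2l.load_array` implementation:

```python
import random

def minibatches(features, labels, batch_size, is_train=True):
    """Yield (X, y) minibatches over paired features and labels."""
    indices = list(range(len(features)))
    if is_train:
        random.shuffle(indices)  # reshuffle for each training epoch
    for i in range(0, len(indices), batch_size):
        batch = indices[i:i + batch_size]
        yield [features[j] for j in batch], [labels[j] for j in batch]

features = [[k] * 4 for k in range(10)]  # 10 toy examples of length 4
labels = [k % 2 for k in range(10)]
batches = list(minibatches(features, labels, batch_size=3, is_train=False))
print(len(batches))   # 4 batches: sizes 3, 3, 3, 1
print(batches[0][0])  # [[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]
```

In the real pipeline, `d2l.load_array` wraps the tensors in a framework-native data loader that performs this batching (and shuffling when `is_train=True`) efficiently.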
16.1.4. Putting It All Together
Finally, we wrap up the above steps into the load_data_imdb function. It returns training and test data iterators and the vocabulary of the IMDb review dataset.
PyTorch:

```python
#@save
def load_data_imdb(batch_size, num_steps=500):
    """Return data iterators and the vocabulary of the IMDb review dataset."""
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    train_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    test_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
    train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),
                                batch_size)
    test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),
                               batch_size, is_train=False)
    return train_iter, test_iter, vocab
```
MXNet:

```python
#@save
def load_data_imdb(batch_size, num_steps=500):
    """Return data iterators and the vocabulary of the IMDb review dataset."""
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    train_features = np.array([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    test_features = np.array([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
    train_iter = d2l.load_array((train_features, train_data[1]), batch_size)
    test_iter = d2l.load_array((test_features, test_data[1]), batch_size,
                               is_train=False)
    return train_iter, test_iter, vocab
```
16.1.5. Summary
情感分析研究人們?cè)谄渖傻奈谋局械那楦?,這被認(rèn)為是將變長(zhǎng)文本序列轉(zhuǎn)換為固定長(zhǎng)度文本類別的文本分類問題。
After preprocessing, we can load Stanford's Large Movie Review Dataset (the IMDb review dataset) into data iterators with a vocabulary.
16.1.6. Exercises
1. What hyperparameters in this section can we modify to accelerate the training of sentiment analysis models?
2. Can you implement a function to load the dataset of Amazon reviews into data iterators and labels for sentiment analysis?