Efficient way to selectively replace vectors in a tensor in PyTorch

Posted by debugcn in Dev

Amrith Krishna

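Given a batch of text sequences, the batch is converted into a tensor in which each word is represented by a word embedding, i.e. a vector of 300 dimensions. I need to selectively replace the vectors of certain specific words with a new set of embeddings. Moreover, the replacement should not happen for every occurrence of those words, only for randomly chosen occurrences. Currently, I have the following code to achieve this. It uses two for loops to go through every word and checks whether the word is in a specified list, splIndices. Then it checks whether that occurrence needs to be replaced, based on the True/False value in selected_.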

But can this be done more efficiently?

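The code below may not be an MWE, but I have tried to simplify it by stripping out the specifics so as to focus on the problem. Please ignore the semantics or purpose of the code, as they may not be properly represented in this snippet. The question is about improving performance.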


import torch
import torch.nn as nn

splIndices = [45, 62, 2983, 456, 762]  # vocabulary indices that need to be replaced
splFreqs = 2000  # assuming the words in splIndices occur 2000 times
# Boolean tensor, one entry per occurrence. Note that "> 0.2" makes about 80% of
# the entries True; "< 0.2" would make about 20% of them True.
selected_ = torch.Tensor(2000).uniform_(0, 1) > 0.2
replIndexCtr = 0  # counter for selected_

# Dictionary with the replacement vectors. This is a dummy function.
# The original function depends on some property of the word
diffVector = {45: torch.Tensor(300).uniform_(0, 1), ...... 762: torch.Tensor(300).uniform_(0, 1) }

embeding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
tempVals = x  # shape [32, 41] - batch of 32 sequences with 41 words each
x = embeding(x) # shape [32, 41, 300] - the sequence now has replaced vocab indices with embeddings

# iterate through batch for sequences
for i, item in enumerate(x):
    # iterate sequences for words
    for j, stuff in enumerate(item):
        if tempVals[i][j].item() in splIndices:
            if selected_[replIndexCtr]:
                x[i, j] = diffVector[tempVals[i][j].item()]
            # advance once per special-word occurrence, whether or not it was replaced
            replIndexCtr += 1

麦克斯塔尼

It can be vectorized as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, sentence_size, vocab_size, emb_size = 3, 2, 15, 1

# Zero the weights and use a constant bias, so the output value marks which embedder produced it
embedder_1 = nn.Linear(vocab_size, emb_size)
embedder_1.weight.data.fill_(0)
embedder_1.bias.data.fill_(200)

embedder_2 = nn.Linear(vocab_size, emb_size)
embedder_2.weight.data.fill_(0)
embedder_2.bias.data.fill_(404)

# Indices of the words that need a different embedding
replace_list = [3, 5, 7, 9]

# Build a binary mask highlighting the special words' indices
mask = torch.zeros(batch_size, sentence_size, vocab_size)
mask[..., replace_list] = 1

# Make random dataset
data_indices = torch.randint(0, vocab_size, (batch_size, sentence_size))
data_onehot = F.one_hot(data_indices, vocab_size)

# Check whether the one-hot vector of a word overlaps the replace mask
replace_mask = mask.long() * data_onehot
replace_mask = torch.sum(replace_mask, dim=-1).bool()  # a boolean mask is needed for indexing

data_emb = torch.empty(batch_size, sentence_size, emb_size)

# Fill default embeddings
data_emb[~replace_mask] = embedder_1(data_onehot[~replace_mask].float())
if replace_mask.any():  # at least one word needs replacing
    # Fill special embeddings
    data_emb[replace_mask] = embedder_2(data_onehot[replace_mask].float())

print(data_indices)
print(replace_mask.byte())  # printed as 0/1 for readability
print(data_emb.squeeze(-1).int())

Here is an example of a possible output:

# Word indices
tensor([[ 6,  9],
        [ 5, 10],
        [ 4, 11]])
# Embedding replacement mask
tensor([[0, 1],
        [1, 0],
        [0, 0]], dtype=torch.uint8)
# Resulting replacement
tensor([[200, 404],
        [404, 200],
        [200, 200]], dtype=torch.int32)
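
For the setup in the question (real vocabulary indices, per-word replacement vectors, and replacement of only a random subset of occurrences), the one-hot step can probably be skipped altogether. The following is a minimal sketch, not the answer author's code. It assumes PyTorch 1.10 or later for `torch.isin`. The names `replacement`, `embed_with_replacement` and `replace_prob` are made up for illustration, and the sizes are small placeholders rather than the 32 x 41 x 300 shapes from the question.

```python
import torch
import torch.nn as nn

# Placeholder sizes for illustration (assumption, not from the question)
vocab_size, emb_size = 3000, 8

# Vocabulary indices whose vectors may be replaced (same values as splIndices)
spl_indices = torch.tensor([45, 62, 2983, 456, 762])

# Regular embedding table; randomly initialised here instead of pretrained
embedding = nn.Embedding(vocab_size, emb_size)

# Hypothetical stand-in for diffVector: a second table with one row per
# vocabulary index. Only the rows listed in spl_indices are ever used.
replacement = torch.zeros(vocab_size, emb_size)
replacement[spl_indices] = torch.rand(len(spl_indices), emb_size)


def embed_with_replacement(token_ids, replace_prob=0.2):
    """Embed token_ids, swapping in replacement vectors for a random
    subset of the occurrences of the special words."""
    # True wherever the token is one of the special words
    is_special = torch.isin(token_ids, spl_indices)
    # Independent random draw per position; True with probability replace_prob
    chosen = torch.rand(token_ids.shape) < replace_prob
    replace_mask = is_special & chosen

    base = embedding(token_ids)        # [batch, seq, emb]
    repl = replacement[token_ids]      # [batch, seq, emb]
    # Broadcast the mask over the embedding dimension and pick per position
    out = torch.where(replace_mask.unsqueeze(-1), repl, base)
    return out, replace_mask


# Usage: a random batch of 4 sequences with 6 tokens each
token_ids = torch.randint(0, vocab_size, (4, 6))
vectors, mask = embed_with_replacement(token_ids)
print(vectors.shape)
print(mask)
```

The full-size `replacement` table trades memory for simplicity. If only a handful of words are special, a compact table plus an index lookup from vocabulary index to row would do the same job. Because `torch.where` builds a new tensor rather than writing into the embedding output in place, gradients still reach `embedding.weight` for the positions that were not replaced.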


Edited on 2021-04-1
