There are two main models: `UniChannelTransformer` and `OmniChannelTransformer`. To see the detailed models, check this file. The goal of these models is to support an arbitrary number of input channels and to use a configuration for creating a transformer which can support those channels. To instantiate the models you need `ChannelConfiguration`s and an `OmniTransformerCoreConfig`; together these create the transformer models. The transformer's configuration supports routing configuration amongst the information channels.

Simplification of what the `OmniChannelTransformer` tries to achieve:
# Given an arbitrary number of input channels,
# get a condensed feature vector from an attention operation across channels.
def f(*xs):  # xs = (x1, x2, ..., xN)
    # Each xi is an input channel with shape (B, Si, D), where Si is
    # the sequence length of input channel xi.
    # Cross-channel transformer jazz amongst the input channels.
    return torch.Tensor(B, D)  # condensed (B, D) feature vector
Simplification of what the `UniChannelTransformer` tries to achieve:

# Given an arbitrary number of input channels,
# get a condensed feature vector from an attention operation on each sequence.
def f(*xs):  # xs = (x1, x2, ..., xN)
    # Each xi is an input channel with shape (B, Si, D), where Si is
    # the sequence length of input channel xi.
    # Self-attention transformer jazz within each input channel.
    return torch.Tensor(B, D)  # condensed (B, D) feature vector
Each input channel of the model needs an explicit configuration, instantiated like the one given below:
from dataclasses import dataclass, field
from typing import List

import torch.nn as nn

@dataclass
class ChannelConfiguration:
    """
    Configuration given for each channel, based on which the
    omni-channel transformer will create an embedding layer.

    name:
        Name of the channel configuration.
    channel_type:
        The type of variable: categorical vs. continuous.
        Values can be 'continuous' | 'discrete'.
        If 'discrete', an embedding layer will be created.
        If 'continuous', a linear layer is attached instead.
    input_dim:
        If channel_type == 'continuous':
            dimension of an individual item in the channel's sequence.
        If channel_type == 'discrete':
            number of categorical values.
    embedding_size:
        Size of the channel's embedding. Super useful when it comes
        to configuring the 1d convolutions.
    no_embedding:
        If True, no embedding layer will be created/used for this channel.
    embedding_layer:
        An instantiated nn.Module that embeds the channel's inputs.
    use_position_embed:
        Whether position embeddings will be used in
        any of the transformer layers.
    route_to_everything:
        Boolean that enforces that this channel will
        route to every other channel.
    restricted_channels:
        If `route_to_everything` is False, this specifies the channels
        to which the current channel's cross-channel routing is restricted.
    """
    name: str = ''
    channel_type: str = 'discrete'
    input_dim: int = None
    embedding_size: int = None
    no_embedding: bool = False
    embedding_layer: nn.Module = None
    use_position_embed: bool = True
    route_to_everything: bool = True
    restricted_channels: List[str] = field(default_factory=lambda: [])
    def to_json(self):
        return dict(
            name=self.name,
            channel_type=self.channel_type,
            input_dim=self.input_dim,
            embedding_size=self.embedding_size,
            no_embedding=self.no_embedding,
            embedding_layer=None,  # nn.Module is not JSON-serializable
            use_position_embed=self.use_position_embed,
            route_to_everything=self.route_to_everything,
            restricted_channels=self.restricted_channels,
        )

    def __post_init__(self):
        if not self.no_embedding and self.embedding_layer is None:
            raise Exception(
                "If `no_embedding` is False, then `embedding_layer` needs to be provided to map the inputs")
        if self.no_embedding and self.input_dim is None:
            raise Exception(
                "If no embedding is given, then the dimension of an individual item in the input sequence is required")
        if not self.route_to_everything and len(self.restricted_channels) == 0:
            raise Exception(
                "If ChannelConfiguration.route_to_everything=False then at least one channel is required in ChannelConfiguration.restricted_channels")
`ChannelConfiguration` explanations:

- `name: str`: Name identifier of the input channel. `name` is a unique identifier and should be maintained even in the data loading.
- `channel_type: str`: Type of input variable for the channel: are the channel's input values discrete values like tokens, or continuous d-dimensional vectors? `channel_type == 'discrete'` or `channel_type == 'continuous'`.
- `no_embedding: bool`: This flag decides whether the input channel will undergo an embedding layer for transformation of the base sequence. If `True` then no embeddings will be applied; if `False` then embeddings will be applied. The below two arguments of the `ChannelConfiguration` depend on `no_embedding`:
  - `input_dim: int`: If `no_embedding` is `True` then the dimension of the input channel's individual items is required for the 1d-conv operation, which brings all sequences to the same dimensionality. Mandatory if `no_embedding` is `True`.
  - `embedding_layer: nn.Module`: If `no_embedding` is `False` then this layer converts the input channel to an embedding sequence. Mandatory if `no_embedding` is `False`.
- `use_position_embed: bool`: Informs whether positional embeddings are summed with each input sequence before it is passed to the transformer layers.
- `route_to_everything: bool`: Informs whether this channel will perform the cross-channel attention operation with every other `ChannelConfiguration` provided to the `OmniTransformerCoreConfig`. The following argument depends on the value of this boolean:
  - `restricted_channels: List[str]`: If `route_to_everything == False` then at least one channel is required for restricted cross-attention.
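Putting the routing fields together, a sketch of a continuous channel that restricts its cross-channel attention to a single other channel (the channel names are made up for illustration):

# A continuous channel whose items are 128-dimensional vectors.
state_channel = ChannelConfiguration(
    name='states',
    channel_type='continuous',
    input_dim=128,                    # dimension of each item in the sequence
    no_embedding=True,                # skip the embedding layer; the 1d-conv
                                      # projects the sequence instead
    route_to_everything=False,        # restrict cross-channel routing ...
    restricted_channels=['actions'],  # ... to the 'actions' channel only
)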
To instantiate an omni-channel transformer you need to instantiate an `OmniTransformerCoreConfig`, which consists of the transformer params and the configurations for the individual input channels (`ChannelConfiguration`s).
@dataclass
class OmniTransformerCoreConfig:
    num_layers: int = 3
    dropout: float = 0.1
    num_heads: int = 4
    scale: float = 0.2
    embd_pdrop: float = 0.1
    layer_norm_epsilon: float = 0.00001
    resid_pdrop: float = 0.1
    attn_pdrop: float = 0.1
    # If pooling_strategy == 'mean', the mean over all positions is used;
    # 'cls' pools from the CLS token.
    pooling_strategy: str = 'cls'  # Can be 'cls' or 'mean'
    # This is the size of the embedding that goes into the transformer.
    transformer_embedding_size: int = 256
    # The per-channel configs (with embedding layers) go here.
    channel_configurations: List[ChannelConfiguration] = field(
        default_factory=list)
    debug: bool = False  # unused flag for now

    def to_json(self):
        pass  # todo
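As a sketch, assembling the core config might look like the following, reusing the illustrative `action_channel` and `state_channel` from above; the commented-out model constructor is an assumption, since the actual signature lives in the model file:

config = OmniTransformerCoreConfig(
    num_layers=3,
    num_heads=4,
    pooling_strategy='mean',          # pool the output sequence by averaging
    transformer_embedding_size=256,   # common embedding size inside the transformer
    channel_configurations=[action_channel, state_channel],
)
# Hypothetical usage; check the model file for the real constructor:
# model = OmniChannelTransformer(config)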
This class lets channels be implemented as subclasses, so they can construct their channels at instantiation time of the entire model. This abstraction also helps performance because it makes the creation of the embedding layer explicit.
import abc

class ChannelMaker(metaclass=abc.ABCMeta):
    def __init__(self,
                 name: str = '',
                 channel_type: str = 'discrete',
                 input_dim: int = None,
                 embedding_size: int = None,
                 no_embedding: bool = False,
                 embedding_layer: nn.Module = None,
                 use_position_embed: bool = True,
                 route_to_everything: bool = True,
                 restricted_channels: List[str] = None) -> None:
        self.name = name
        self.channel_type = channel_type
        self.input_dim = input_dim
        self.embedding_size = embedding_size
        self.no_embedding = no_embedding
        self.embedding_layer = embedding_layer
        self.use_position_embed = use_position_embed
        self.route_to_everything = route_to_everything
        # Avoid a mutable default argument by creating the list here.
        self.restricted_channels = restricted_channels if restricted_channels is not None else []

    def make_channel(self) -> ChannelConfiguration:
        raise NotImplementedError

    def from_json(self, json_dict) -> ChannelConfiguration:
        raise NotImplementedError
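A hypothetical subclass, sketching the intended pattern (the `TokenChannelMaker` name and its lazy embedding creation are assumptions for illustration):

class TokenChannelMaker(ChannelMaker):
    """Builds a discrete token channel, creating its embedding layer
    only when the channel (and hence the model) is instantiated."""

    def make_channel(self) -> ChannelConfiguration:
        # The embedding layer is created here, not in __init__.
        embedding = nn.Embedding(self.input_dim, self.embedding_size)
        return ChannelConfiguration(
            name=self.name,
            channel_type=self.channel_type,
            input_dim=self.input_dim,
            embedding_size=self.embedding_size,
            no_embedding=self.no_embedding,
            embedding_layer=embedding,
            use_position_embed=self.use_position_embed,
            route_to_everything=self.route_to_everything,
            restricted_channels=self.restricted_channels,
        )

maker = TokenChannelMaker(name='tokens', input_dim=512, embedding_size=64)
token_channel = maker.make_channel()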
- Bigger models are finding better decision boundaries with smaller batch sizes.
- Smaller models are also doing well with bigger batch sizes.
- Sentence-grounding-based data augmentation is extremely beneficial in boosting training results.
- Sentence grounding means that when creating training tuples we create
- The transformer's embedding size was tuned down to as small as 16, but it still finds pretty distinct boundaries.