orkgnlp.annotation.tdm.encoder.TdmDataset

class TdmDataset(text, labels, tokenizer, max_input_sizes)

Bases: Dataset

TdmDataset is a custom torch.utils.data.Dataset that handles the tokenization of sequences and can then be passed to a torch.utils.data.DataLoader for batch creation.

Parameters
  • text (str) – Input text (hypothesis) to be concatenated with all known labels (premises).

  • labels (DataFrame) – TDM gold labels, given as a single-column DataFrame.

  • tokenizer (PreTrainedTokenizer) – Tokenizer for tokenizing the texts.

  • max_input_sizes (int) – Maximum length of a tokenized sequence, including special tokens.
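
The pattern described by these parameters can be sketched as follows. This is a minimal illustration, not the orkgnlp implementation: PairDatasetSketch and toy_tokenizer are hypothetical names, a plain list stands in for the single-column DataFrame, a whitespace tokenizer stands in for a PreTrainedTokenizer, and the order in which label and text are joined is an assumption. The class only exposes __len__ and __getitem__, which is the map-style interface that torch.utils.data.Dataset subclasses implement, so the sketch runs without torch installed.

```python
from typing import Callable, Dict, List


def toy_tokenizer(first: str, second: str, max_length: int) -> Dict[str, List[int]]:
    """Hypothetical stand-in for calling a PreTrainedTokenizer on a text pair."""
    tokens = ["[CLS]"] + first.split() + ["[SEP]"] + second.split() + ["[SEP]"]
    tokens = tokens[:max_length]  # naive truncation; the limit counts special tokens too
    padding = max_length - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * padding
    tokens = tokens + ["[PAD]"] * padding
    # Map each token to a small deterministic integer id, for illustration only.
    input_ids = [sum(ord(ch) for ch in tok) % 1000 for tok in tokens]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


class PairDatasetSketch:
    """Map-style dataset: one item per label, each paired with the same input text."""

    def __init__(
        self,
        text: str,
        labels: List[str],
        tokenizer: Callable[[str, str, int], Dict[str, List[int]]],
        max_input_sizes: int,
    ) -> None:
        self.text = text
        self.labels = list(labels)
        self.tokenizer = tokenizer
        self.max_input_sizes = max_input_sizes

    def __len__(self) -> int:
        # One example per known label.
        return len(self.labels)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        # Assumption: the label is encoded first and the input text second.
        return self.tokenizer(self.labels[index], self.text, self.max_input_sizes)


# Made-up labels and text, used only to exercise the sketch.
labels = [
    "question answering; SQuAD; F1",
    "image classification; ImageNet; accuracy",
]
dataset = PairDatasetSketch(
    "We evaluate our reader on SQuAD.", labels, toy_tokenizer, max_input_sizes=16
)
first_item = dataset[0]
```

Each item has a fixed length of max_input_sizes, which is what lets a data loader stack items into batches. In real use the tokenizer would typically return tensors rather than Python lists.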

Methods