Graphic by Billy Huynh – https://unsplash.com/de/@billy_huy

Unlocking the Power of T5: The Versatile Language Model for Text-to-Text Tasks

T5 is a powerful language model capable of performing a wide range of text-to-text tasks, including text classification, language translation or text summarization. The aim of this post is to introduce this pretrained transformer to the reader.

Henrik Bartsch

The texts in this article were partly composed with the help of artificial intelligence and were corrected and revised by us.

Introduction

In today’s world, we are constantly exposed to large amounts of information, often in the form of long texts from which we have to filter out what is important to us. Whether it is scientific papers, business analyses, or news and media, careful research can take a lot of time. In this post, I will introduce a tool that helps with exactly that: T5.

What is T5?

T5 (Text-to-Text Transfer Transformer) is a language model developed by Google and designed specifically for text-to-text tasks. Like the popular BERT (Bidirectional Encoder Representations from Transformers), it is based on the transformer architecture; unlike BERT, however, it uses a full encoder-decoder model and casts every task as generating output text from input text. It builds on the advances of its predecessors and refines them to produce coherent and contextually relevant text output. 1

T5 is a deep learning model that uses a transformer architecture to process input and output text. The transformer is a neural network designed for sequential input data, such as text, which it processes through a special mechanism called self-attention. We can think of self-attention as a mechanism that enriches the embedding of each input element with information about its context. In other words, the self-attention mechanism allows the model to evaluate the importance of different elements in an input sequence and dynamically adjust their influence on the output. This is particularly important for language processing tasks, where the meaning of a word can change depending on its context within a sentence or document. As a result, the model is able to understand the relationships between words and phrases in the input text and generate responses that are contextually relevant. 2 3
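To make this idea more concrete, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. It is purely illustrative: T5’s actual implementation uses multiple attention heads, learned relative position biases, and further details that are omitted here.

self_attention_sketch.py
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
  # x: (sequence_length, embedding_dim) input embeddings
  q = x @ w_q  # queries
  k = x @ w_k  # keys
  v = x @ w_v  # values

  # Each token attends to every other token; the softmax weights decide
  # how strongly the context influences each position.
  scores = q @ k.T / (k.shape[-1] ** 0.5)
  weights = F.softmax(scores, dim=-1)
  return weights @ v

# Example: a sequence of 4 tokens with 8-dimensional embeddings
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])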

A large corpus of textual data is used to train T5, which enables it to recognize patterns and relationships in language. During training, T5 is given a task prefix together with a piece of input text and is trained to produce the corresponding target text. This process enables the model to generate text output that is coherent and relevant to the context. 4 5

One of the key advantages of T5 is its ability to handle a wide range of text-to-text tasks. Unlike language models designed for a single task, such as machine translation or text summarization, T5 can be adapted to a variety of tasks, making it a flexible tool for many applications. 1
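In practice, each task is selected purely through a short textual prefix in front of the input. The prefixes below are the ones used in this post; the input/output pairs are illustrative (the translation pair is taken from the T5 paper):

# Every task is expressed as a plain text-to-text pair via a task prefix:
#   "translate English to German: That is good."  ->  "Das ist gut."
#   "summarize: <long article text>"              ->  "<short summary>"
#   "cola sentence: <sentence to check>"          ->  "acceptable" / "unacceptable"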

Application Examples

In the following, we will illustrate the advantages of the model with a series of examples in which this pre-trained transformer can be used. To run the code below, three packages are required: torch, transformers, and sentencepiece. If not already installed, these packages can be installed using pip.
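For example:

pip install torch transformers sentencepiece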

The first step is to prepare the session by inserting all the necessary imports:

t5_test.py
from transformers import T5Tokenizer, T5ForConditionalGeneration

We can then load T5 with the following code. This loads not only the neural network itself, but also the tokenizer, which converts natural language into tokens that the model can process.

t5_test.py
tokenizer = T5Tokenizer.from_pretrained("t5-large", model_max_length=1024)
model = T5ForConditionalGeneration.from_pretrained("t5-large", max_length=1024)

Once this code has been executed, we can begin to explore the first applications of T5.

All examples in this post use t5-large. However, t5-small, t5-base, t5-3b, or t5-11b can also be used; the performance of the models varies with their number of parameters.
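For example, loading the smaller t5-base checkpoint only requires changing the checkpoint name; the rest of the code stays the same:

t5_test.py
tokenizer = T5Tokenizer.from_pretrained("t5-base", model_max_length=1024)
model = T5ForConditionalGeneration.from_pretrained("t5-base", max_length=1024)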

Summarizing texts

Let’s start with an example that, in my experience, often yields good results: text summarization. In a first step, we define a function that processes the text using the tokenizer and the neural network, and then we look at examples. Summaries are in principle possible in English, German, French, and Romanian; this also holds for all the following examples.

Accordingly, we start with the implementation of the text processing function. Following the model card, the task prefix summarize: is used.

t5_test.py
def summarize_text(text):
  # Prepend the task prefix so T5 knows which task to perform.
  task = "summarize: " + text
  input_ids = tokenizer.encode(task, return_tensors="pt", max_length=1024, truncation=True)

  # Constrain the summary length to roughly 80-100 tokens.
  output = model.generate(input_ids, min_length=80, max_length=100)

  text_summary = tokenizer.decode(output[0], skip_special_tokens=True)
  print(text_summary)

We can use the text parameter to pass in the original text, which the model then summarizes and prints. In the following, two texts from the ICIJ are used as examples for text summarization.
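A call to the function then looks like this (the article text is abbreviated here for space):

t5_test.py
summarize_text("But the Deforestation Inc. investigation led by the International Consortium of Investigative Journalists [...]")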

Original Text

But the Deforestation Inc. investigation led by the International Consortium of Investigative Journalists identified flaws in how some companies operations and products were certified as sustainable. ICIJ, in collaboration with media partners, found that a lightly supervised environmental auditing industry approved green labels for products linked to deforestation, illegal logging and authoritarian regimes. “These repeated scandals are driven by deep-seated structural flaws in how these schemes operate,” said Sam Lawson, who heads Earthsight, an environmental organization that has reported extensively on the abuse of certification labels in the forestry sector. [Source]

Summarized Text

a new investigation by the international consortium of investigative journalists identifies flaws in how some companies operations and products were certified as sustainable. ICIJ found that a lightly supervised environmental auditing industry approved green labels for products linked to deforestation, illegal logging and authoritarian regimes. “these repeated scandals are driven by deep-seated structural flaws in how these schemes operate,” says earthsight’s lawson

In the example above, we can see that the model performs the summarization as instructed. All the important information has been kept, while less relevant information has been discarded. Although the merged sentences lose some grammatical correctness, this hardly matters as long as the content is combined correctly.

Original Text

While negotiations on the New York state deforestation bill continue, a bipartisan effort at the federal level is also underway. A group of U.S. lawmakers recently introduced in both the House and Senate a new bill dubbed the Fostering Overseas Rule of Law and Environmentally Sound Trade Act, or FOREST Act, a bill which is intended to make the U.S. market deforestation-free and that would allow prosecutors to go after companies or individuals that import products derived from illegal deforestation. Also this month, the U.K. government announced a long-anticipated law intended to stop large companies from using commodities such as soy or beef produced in areas where forests were illegally logged. (Conservationists have criticized the new law for its limited scope as it does not include rubber, which is a major driver of deforestation.) [Source]

Summarized Text

a bipartisan effort at the federal level is also underway. a group of lawmakers recently introduced a new bill. the bill is intended to make the u.s. market deforestation-free. also this month, the u.k. government announced a long-anticipated law. a group of u.s. lawmakers also introduced a bill to make the u.s. market deforestation-

Just like in the last example, relevant information has been kept, while less relevant information was discarded. A look at the T5 paper shows that the prefix TL;DR: was used during training rather than summarize:. We can therefore also use it to summarize our texts.

t5_test.py
task = "TL;DR: But the Deforestation Inc. investigation [...]"

input = tokenizer.encode(task, return_tensors="pt", max_length=1024, truncation=True)
output = model.generate(input, min_length=80, max_length=100)

text_summary = tokenizer.decode(output[0], skip_special_tokens=True)
print(text_summary)

This code snippet uses the first text we summarized above; due to its length, it is abbreviated and not written out in full in the code. If we summarize the text with this code, we get the following output:

Summarized Text

TL;DR:..: TL;DR: TL;DR:DR:DR: TL;DR: TL;DR: in how some companies operations and products were certified as sustainable. by the and and and and found and that approved green labels for products linked to deforestation. that the thata approved. TL;DR

With this prefix, we get a different version of the summary. Unlike before, the capitalization of the original text is retained, but the output also contains a number of repeated TL;DR: fragments and stray dots. In our opinion, summaries created with the summarize: prefix are of higher quality than those created with TL;DR:.

It should be noted that these are general-purpose summaries. Applying the model to, say, meeting minutes and then filtering for information relevant to a specific person is currently not possible.

Translating texts

To get started with translating text, we begin with examples in which English text is translated into German. We can define a function for this as follows:

t5_test.py
def translate_to_german(text):
  # Prepend the task prefix and tokenize the input text.
  task = "translate English to German: "
  inputs = tokenizer(task + text, return_tensors="pt", padding=True)

  # Greedy decoding (do_sample=False) gives deterministic output.
  output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False
  )

  print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))

We will then start with a simple example that might occur in everyday life.
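The sentence is simply passed to the function defined above:

t5_test.py
translate_to_german("Hello there! How can I help you?")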

Original Text

Hello there! How can I help you?

Output

Hallo, wie kann ich Ihnen helfen?

This example shows that the translation quality is good. Let’s continue with a more complex case that uses technical terms.

Original Text

The engine was close to overheating, when I was about to check it.

Output

Der Motor war kurz vor der Überhitzung, als ich es überprüfen wollte.

Again, the model seems to work well. In addition, we can try more advanced English, in this case from an investigative journalism outlet:

Original Text

A US billionaire took over a tropical island pension fund — then hundreds of millions of dollars allegedly went missing. [Source]

Output

Die ersten zehn Jahre nach der Geburt von lteren hat sich die deutsche Gesellschaft für psychische Erkrankungen (PSK) in der Schweiz etabliert.

In this example, we can clearly see that the model is struggling: the output is completely unrelated to the input.

After translating from English to German, we can also look at the reverse case.

t5_test.py
def translate_german(text):
  # Same structure as above, with the opposite translation direction.
  task = "translate German to English: "
  inputs = tokenizer(task + text, return_tensors="pt", padding=True)

  output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False
  )

  print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))

To test the capabilities of the model, we start here with simple examples.

Original Text

Heute geht es mir sehr gut.

Output

Heute geht es mir sehr gut.

Original Text

Meine Meinung zu diesem Thema ist sehr komplex, entsprechend kann ich diese nicht in einem Satz zusammenfassen.

Output

Meine Meinung zu diesem Thema ist sehr komplex, entsprechend kann ich diese nicht in einem Satz zusammenfassen.

We can see here that the model does not perform any translation. This also happens in more complex examples:

Original Text

Geflüchtete sind oft schwer traumatisiert. Oft entwickeln sich daraus psychische Erkrankungen und verhindern eine Integration. Betroffene können zur Gefahr für sich selbst werden - oder sogar für andere. Trotzdem werden sie fast nie therapiert. Ein Systemversagen mit Ansage. [Source]

Output

Die ersten zehn Jahre nach der Geburt von lteren hat sich die deutsche Gesellschaft für psychische Erkrankungen (PSK) in der Schweiz etabliert.

Original Text

Robustes Clustering mittels k-Means. Um ähnliche Gruppen in unbekannten Daten zu identifizieren und die Komplexität zu reduzieren, kann Clustering verwendet werden. Hier wird der k-Means-Algorithmus beschrieben, der für Clustering verwendet werden kann. [Source]

Output

Um ähnliche Gruppen in unbekannten Daten zu identifizieren und die Komplexität zu reduzieren, kann Clustering verwendet werden.

In these more complex examples, the input text is not even copied; instead, (serious) errors appear in the output. Therefore, when translating from German, the model should be checked for suitability or fine-tuned before use to ensure adequate translations.

To summarize: in our tests, translation between English and German only works when translating from English to German. Translation in the opposite direction is not usable.

CoLA

As a final example, let’s look at CoLA, which stands for “Corpus of Linguistic Acceptability”. With this prefix, the model’s task is to judge whether an input sentence is linguistically acceptable. It does not mark the error itself, but only classifies the input as acceptable or unacceptable. 1 6 A text checking function is easy to define:

t5_test.py
def cola_sentence(sentence):
  # Prepend the task prefix for the acceptability check.
  task = "cola sentence: " + sentence
  input_ids = tokenizer.encode(task, return_tensors="pt", max_length=1024, truncation=True)

  # min_length=80 can force the model to emit filler tokens
  # after the actual keyword (see the outputs below).
  output = model.generate(input_ids, min_length=80, max_length=100)

  text_summary = tokenizer.decode(output[0], skip_special_tokens=True)
  print(text_summary)

Then we can look at a number of examples, starting with a simple but incorrect sentence:
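The corresponding call looks like this:

t5_test.py
cola_sentence("This is house falling.")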

Original Text

This is house falling.

Original Output

unacceptable - equivalent - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Shortened Output

unacceptable

The original output of the model looks arbitrary, but only the keyword acceptable or unacceptable is relevant for us. Therefore, we will work with these shortened outputs in the following (one way to derive them is sketched below). We see that this simple sentence was classified correctly.
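Since only the leading keyword matters, the raw output can be shortened with a small post-processing step, for example like this (a sketch, not part of the model itself):

t5_test.py
def shorten_cola_output(raw_output):
  # Keep only the leading keyword ("acceptable" or "unacceptable").
  return raw_output.split()[0]

print(shorten_cola_output("unacceptable - equivalent - -"))  # unacceptable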

After that, we are going to look at two more complex sentences.

Original Text

My bearded dragon appears happy as of late.

Shortened Output

acceptable

Original Text

The coca cola company tries to invest in previously unknown business branches.

Shortened Output

acceptable

In both cases a correct classification was achieved. Classification seems to be more difficult for the model when special characters are used:

Original Text

My dog was running for an hour straight.

Original Text

My d0g was running for an hour straight.

Original Text

My dOg was running for an hour straight.

Original Text

My d=g was running for an hour straight.

Shortened Output

acceptable

For all of these inputs, the model classifies the sentence as containing no errors, even though special characters clearly appear in places where they do not belong.

If we now insert a grammatical error into the sentence, the model recognizes this error.

Original Text

My dog was running for two hour straight.

Shortened Output

unacceptable

In summary, it probably makes sense to remove special characters from the input text if they are not relevant to the meaning; one possible approach is sketched below. If they cannot be removed, one has to accept that errors caused by special characters may go undetected. Apart from that, the neural network seems to understand even more complex sentences well in this task area, making it an interesting tool for automatic text checking.
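A minimal sketch using a regular expression could look like this; which characters to keep depends entirely on the application:

t5_test.py
import re

def strip_special_characters(text):
  # Keep word characters, whitespace and common punctuation; drop the rest.
  return re.sub(r"[^\w\s.,;:!?'-]", "", text)

print(strip_special_characters("My d=g was running for an hour straight."))
# My dg was running for an hour straight.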

Further applications

T5 provides a foundation for performing a number of different tasks on texts. It is not an optimal solution for every task, but a basis for further development that can already handle a range of tasks with acceptable to good quality. For more complex tasks or application-specific customization, fine-tuning of the model may be necessary and will (significantly) improve performance. The way tasks are given to the model, and the way it is trained, make it fundamentally very flexible, and thus a good tool for a range of future tasks on which it has not yet been trained. 4 5

TL;DR

In summary, T5 is a powerful language model that can perform a variety of text-to-text tasks. Its ability to understand natural language and generate contextually relevant information makes it a valuable tool for many applications such as chatbots, language translation, and content generation.

What we have seen is that the quality of the summaries is good. Audience-specific summaries are not possible, so we have to limit ourselves to general ones.

In terms of translation, we found that translating from German to English is not possible. Translation from English to German also has its difficulties, but for translation tasks that are not too demanding, the model is well suited.

When checking texts for correctness, we have seen that the model delivers good results. Only the use of special characters causes difficulties, which should be kept in mind and checked.

Author’s note: The examples given here are based on personal experience and cannot be guaranteed to be completely accurate. Especially with neural networks, there is always a risk that individual inputs are processed incorrectly. To the best of my knowledge, the examples are reproduced here as they occurred.

Sources

Footnotes

  1. arxiv.org

  2. arxiv.org

  3. sebastianraschka.com

  4. huggingface.co

  5. huggingface.co

  6. arxiv.org