Robots Writing Stories

A closer look at language generation

Recent applications of AI increasingly include ‘creative’ tasks: creating art, music, text, code and so on. Even generating a Eurovision song is possible today! In this slightly technical blog we will discuss how we can generate a new Shakespeare play using LSTM networks.


Portrait of Edmond Belamy, 2018, created by a GAN (Generative Adversarial Network), sold for US$432,500 on Oct. 25, 2018, at Christie’s in New York.

Creative AI

Whether an AI can actually be creative is a matter of definitions. The fact of the matter is that it is now common practice to use deep learning to generate data, and language generation in particular is an incredibly interesting example of this. The output quality of such experiments may vary a lot, and we’re still a long way from robots writing original, high-quality novels or automated news coverage of the Super Bowl. Nevertheless, the field of language generation will improve over the upcoming years and will have a major impact on how we create textual content.

Language generation

Language generation can be used in many applications. These range from functional applications, such as auto-completing e-mails or generating advertising texts, to creative applications like writing a new Eurovision song. While techniques for language generation have existed for a long time, the recent emergence of deep learning has, as in so many other fields, caused enormous improvements. In our example we will discuss LSTM networks and summarise the issues and pitfalls of this method. By training a network on Shakespeare’s work and generating new text from it, the network will essentially write an original piece of “Shakespearean” text.


Before we can work with any text data, we need to transform it to make it interpretable by an algorithm. This is done through tokenization.

With tokenization we map every token in a piece of text to a unique number. The entire text is therefore transformed into a sequence of numbers. The neural network’s task is to predict the next number in the sequence.


Text tokenization can be done in different ways. One possibility is to create a token for every character in the text. This has the advantage of reducing the number of possible options for a prediction, as a text usually has fewer than 100 unique characters. However, tokenizing even a short sentence will create a long sequence. Moreover, a single prediction error will be very noticeable, as it will cause a spelling mistake or a gibberish word.


Another possibility is to create a token for every word in the text. This means the network now has to choose between thousands of possible tokens for each prediction. However, we can now fit larger texts into a sequence of a certain length. If we use sequences with a length of 50, character tokenization allows us to store 50 characters, or about one single sentence. On the other hand, word tokenization allows us to store 50 words, or about a single paragraph. This means that more information can be stored in a single sequence. A second important benefit of tokenizing words instead of characters is that wrong predictions often still produce coherent sentences. Errors in word tokenization are less noticeable than errors in character tokenization.

Which of these tokenization methods works best depends on the language. In some languages, a single noun can have many different forms, based on its grammatical case. Examples of these languages are Russian, Turkish, Finnish and Hungarian. In these languages, character tokenization produces better output than word tokenization. However, for languages like English and Dutch, it is better to tokenize each word. Therefore, in our example, we create a token per word in the text.
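As a rough sketch, word tokenization can be done in a few lines of plain Python (the example text is purely illustrative; real pipelines also handle punctuation, casing and unknown words):

```python
# Minimal word-level tokenization sketch: map each unique word to an integer ID.
text = "to be or not to be"
words = text.split()

# Build the vocabulary: one unique ID per distinct word.
vocab = {}
for w in words:
    if w not in vocab:
        vocab[w] = len(vocab)

# The text becomes a sequence of numbers the network can work with.
sequence = [vocab[w] for w in words]
print(vocab)     # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
print(sequence)  # [0, 1, 2, 3, 0, 1]
```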


In order to predict the next token in a sequence, we use a Long Short-Term Memory (LSTM) network. An LSTM network is a type of recurrent neural network, which can model data sequences and predict the next item in a sequence.

For example, given a history of stock market values over time, an LSTM can predict the next stock value. LSTMs are also crucial for decision-making in complex gaming AIs: both DeepMind’s AlphaStar and OpenAI Five use LSTMs to beat human players in StarCraft II and Dota 2 respectively.

While there are many different methods that can predict the next word in a sequence, we specifically choose an LSTM model for its ability to handle large gaps in the input. In longer sentences, a next word might be determined by a word that is placed far away.

If we take the following sentence as example:

“When I was in the supermarket, I saw a woman in a white shirt holding two large bags full of …”

We might say that the word “groceries” would be the most logical prediction. However, this is mainly due to the word “supermarket”, which is 15 word-tokens away from the prediction.

Whereas most other models have trouble dealing with large amounts of information, an LSTM can erase part of its memory, only saving the relevant information. Therefore it is able to deal with longer sequences and forget irrelevant words in between. By doing so, it will create more consistent sentences or paragraphs.

Gyver explains LSTM

An LSTM network chains multiple so-called cells. These cells perform matrix operations on the memory of the model. Each cell receives the memory and the predicted output h_{i-1} from the previous cell, as well as the next input token x_i directly from the text. Within each cell, three operations take place.

  • First, an “input gate” determines what part of the memory should be updated by the new input token.
  • Then, a “forget gate” determines what part of the memory should be erased.
  • Lastly, an “output gate” generates output based on the past sequence and the memory. The next token is predicted by a classifier using these output weights.
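The three gates can be sketched in a deliberately simplified, scalar form (the weight names and values below are invented for illustration; a real LSTM uses weight matrices and bias vectors):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy scalar LSTM step: memory c and output h are single numbers here,
# and each gate has one weight for the input token x and one for h.
def lstm_step(x, h_prev, c_prev, w):
    i = sigmoid(w["wi_x"] * x + w["wi_h"] * h_prev)    # input gate
    f = sigmoid(w["wf_x"] * x + w["wf_h"] * h_prev)    # forget gate
    o = sigmoid(w["wo_x"] * x + w["wo_h"] * h_prev)    # output gate
    g = math.tanh(w["wg_x"] * x + w["wg_h"] * h_prev)  # candidate memory
    c = f * c_prev + i * g   # erase part of the memory, then add new input
    h = o * math.tanh(c)     # output passed to the classifier and next cell
    return h, c

# One step with made-up weights of 0.5 everywhere.
weights = {k: 0.5 for k in ["wi_x", "wi_h", "wf_x", "wf_h",
                            "wo_x", "wo_h", "wg_x", "wg_h"]}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=weights)
```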


Now that we have defined how an LSTM cell functions, we need to set a few hyperparameters before the network can be trained. The correct values for these hyperparameters are highly dependent on the dataset: each language behaves differently, and texts from children's books are easier to learn than texts from encyclopedias. The size of the dataset also plays a crucial role. Therefore, the hyperparameters are best set by experimentation.

  • Firstly, we need to determine the size of the output h_i that each cell produces. A large output from each cell allows us to transfer more information, but can cause more noise to be passed to the classifier. A small output will cause the classifier to make inaccurate predictions.
  • Secondly, we need to set the number of LSTM layers in the network. We can pass the output of one LSTM cell as input to a different LSTM cell with its own memory. Having multiple LSTM layers with different memories allows us to better model non-linearities, but increases the computational cost, which slows the network down.
  • Lastly, we need to determine the length of the input sequences. When training with short sequences, we can not model dependencies between longer pieces of text.
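To make the last point concrete, here is a sketch of how fixed-length input sequences are cut from the token stream; each window is paired with the token that follows it as the prediction target (the token values and sizes are illustrative):

```python
# Slice a token sequence into fixed-length training windows.
def make_windows(tokens, seq_length):
    """Pair each window of seq_length tokens with the next token as target."""
    windows = []
    for i in range(len(tokens) - seq_length):
        windows.append((tokens[i:i + seq_length], tokens[i + seq_length]))
    return windows

tokens = [3, 1, 4, 1, 5, 9, 2, 6]
print(make_windows(tokens, 3)[0])  # ([3, 1, 4], 1)
```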

However, training with long sequences causes a whole new problem: the problem of overfitting.

The art of balancing

When we use long sequences and train the model on the data for a long time, we are able to produce long and coherent texts. However, on close inspection, something strange is going on. The generated text is an exact match with a part of the text in the dataset!

This means that instead of learning how to write like Shakespeare, the network has learned to memorize and recite Shakespeare's complete works. While this would be impressive if done by a human, in this case it’s just another example of overfitting.

Overfitting is a big problem in all sub-fields of machine learning but in most cases it can be easily avoided. However, in language generation, it is a bit more complicated. While single predictions can be correct, humans evaluate complete sentences and paragraphs. This means that a bit of randomness is needed to avoid reciting the original texts.

Our network produces a probability per word. A simple sampling strategy would be to always choose the word with the highest probability. However, this leads to a lot of repetition. To create more diversity in the produced texts, we can sample directly from the probability distribution. Thus, a word with a probability of 0.01 given an input sequence has a 1% chance of being generated.
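A minimal sketch of this sampling strategy, with invented words and probabilities:

```python
import random

# Hypothetical distribution predicted by the network for the next word.
words = ["groceries", "flowers", "swords"]
probs = [0.90, 0.09, 0.01]

def sample_next_word(words, probs):
    # random.choices draws proportionally to the weights, so "swords"
    # is generated roughly 1% of the time instead of never.
    return random.choices(words, weights=probs, k=1)[0]
```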

While this improves diversity it also allows improbable words to sometimes be generated. A single strange prediction can cause an entire sentence to look odd.

This can be avoided by manipulating the probability distribution.

By using temperature, we scale the predicted scores per word before they are turned into a probability distribution by the softmax function, making the entire distribution flatter or spikier. By making a distribution spikier, small chances are made even smaller, further reducing the risk of weird predictions.
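A sketch of temperature scaling, assuming the network outputs raw scores (logits) that a softmax turns into probabilities; the numbers are illustrative:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Dividing by a temperature below 1 makes the distribution spikier;
    # a temperature above 1 makes it flatter.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))  # the plain softmax
print(softmax_with_temperature(logits, 0.5))  # spikier: top word more likely
```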


We can also modify the sampling method.

In top-k sampling, we consider only the k most probable words, setting the probabilities of other words to zero. A downside of top-k sampling is that the probability distributions for different sequences can have a very different shape. In narrow distributions only a few tokens are plausible while in broader distributions many different tokens could be correct.

To avoid this downside we can apply top-p sampling, where we rank the probabilities and consider only the words that together cross the threshold probability p. This means we sample from more possible tokens in broad distributions than in narrow distributions.
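Top-p filtering can be sketched as follows (the words and probabilities are made up):

```python
# Keep the most probable words until their cumulative probability crosses p,
# drop the rest, and renormalize the kept probabilities.
def top_p_filter(word_probs, p):
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        total += prob
        if total >= p:
            break
    # Renormalize so the kept probabilities sum to 1 again.
    return {w: pr / total for w, pr in kept.items()}

word_probs = {"groceries": 0.6, "flowers": 0.25, "bread": 0.1, "swords": 0.05}
print(top_p_filter(word_probs, 0.9))  # "swords" is filtered out
```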


Below is a demonstration of how the predicted texts will look when training the network. Structure emerges over time with sentences looking more similar to real text as time passes.

After training for a short period of time, the network will learn which words occur frequently, but does not yet understand which words follow others.

is the off the , and , thou and ,

After training a while longer, logical word combinations appear, but the sentences overall still don’t make any sense.

and in that I , thee will be , is

When training for a sufficient amount of time, we can see sentence structure come up.

and 'tis therefore thee shall feel , and with thy sword


In this blog we discussed how we could generate a new play by Shakespeare using LSTM networks. The result is a short text which clearly resembles Shakespeare’s style. However, the quality of this basic example is not sufficient to write an actual play. Balancing between nonsense and complete input replication is a difficult task, where sampling techniques can help. But for now, AI-written text still looks quite different from human-written text. Since the field of natural language generation is very active, we expect these differences to become much smaller over time.

We hope this blog has given you some basic understanding of how this super interesting sub-field of machine learning works. One thing is for sure: these methods will have a major impact on how we create textual content in the upcoming years.
