I remember the first time I saw a deep learning text generation project that was truly compelling and delightful to me. It was in 2016, when Andy Herd generated new Friends scenes by training a recurrent neural network on all the show's episodes. Herd's work went pretty viral at the time.

At the time I dabbled a bit with Andrej Karpathy's tutorials for character-level RNNs; his work and tutorials undergird a lot of the STUNT TEXT GENERATION work we see in the world. Python is not my strongest language, though, and I never had a real motivation to understand the math of what was going on. I watched masters like Janelle Shane instead.

TensorFlow for R has changed that for me. Not only is the R interface that RStudio has developed just beautiful, but these fun text generation projects now provide a way into understanding how these neural network models work at all, and how they deal with text in particular. Let's step through how to take the text of Pride and Prejudice and generate ✨ new ✨ Jane-Austen-esque text.

This code borrows heavily from a couple of excellent sources:

- Jonathan Nolis' project on offensive license plates (that link is for their code; you can read a great narrative explanation as well)
- RStudio's example code for text generation

Before starting, you will need to install keras, so be sure to check out the details on installation.

## Tokenize

We are going to train a character-level language model, which means the model will take a single character and then predict what character should come next, based on the ones that have come before. First step? We need to take Pride and Prejudice and divide it up into individual characters.

The code below keeps both capital and lowercase letters and builds a model that learns when to use which one. This is computationally more intensive than training a model that only learns about the letters themselves in lowercase; if you want to start off with that kind of model, change to the default behavior of tokenize_characters(), lowercase = TRUE.

```r
library(keras)
library(tidyverse)
library(janeaustenr)
library(tokenizers)

max_length <- 40

text <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  pull(text) %>%
  str_c(collapse = " ") %>%
  tokenize_characters(lowercase = FALSE, strip_non_alphanum = FALSE, simplify = TRUE)

print(sprintf("Corpus length: %d", length(text)))
```

```
## [1] "Corpus length: 684767"
```

```r
chars <- text %>%
  unique() %>%
  sort()

print(sprintf("Total characters: %d", length(chars)))
```

```
## [1] "Total characters: 74"
```

A good start!

## CHOP CHOP CHOP

Next we want to cut the whole text into pieces: sequences of max_length characters. These will be the chunks of text that we use for training.

```r
dataset <- map(
  seq(1, length(text) - max_length - 1, by = 3),
  ~list(sentence = text[.x:(.x + max_length - 1)],
        next_char = text[.x + max_length])
)
```
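If you want to sanity-check what these chunks look like, you can collapse the first one back into a string. This quick peek is just for illustration; it assumes the dataset structure built above, where each element holds a sentence of max_length characters plus the single next_char the model should learn to predict.

```r
# Collapse the first training sequence (max_length characters) back
# into readable text
str_c(dataset[[1]]$sentence, collapse = "")

# ...and the single character the model should predict comes next
dataset[[1]]$next_char
```

Notice that seq() steps through the text three characters at a time, so neighboring chunks overlap heavily; a smaller step would give more (and more redundant) training examples, while a larger step gives fewer.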