Using an RNN to generate Bill Wurtz notes
Textgenrnn is fun
Bill Wurtz is an American musician who became reasonably famous through short musical videos posted to Vine and YouTube. I was searching through his website the other day, stumbled upon a page labeled notebook, and thought I should check it out.
Bill’s notebook is a large (about 580 posts) collection of random thoughts, ideas, and sometimes just collections of words. A prime source of entertainment, and neural network inputs.
“If you are looking to burn something, fire may be just the ticket” - Bill Wurtz
Choosing the right tool for the job
If you haven’t noticed yet, I’m building a neural net to generate notes based on his writing style and content. Anyone who has read my first post will know that I have already done a similar project in the past. This means it’s time to reuse some code!
For this project, I decided to use an amazing library by @minimaxir called textgenrnn. This Python library will handle all of the heavy (and light) work of training an RNN on a text dataset, then generating new text.
Building a dataset
This project was a joke, so I didn’t bother with properly grabbing each post, categorizing them, and parsing them. Instead, I built a little script to pull every HTML file from Bill’s website and regex out the body. This ended up leaving some artifacts in the output, but I don’t really mind.
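import re
import sys
import requests


def loadAllUrls():
    # Grab the notebook index and pull out every linked page
    page = requests.get("https://billwurtz.com/notebook.html").text
    links = re.findall(r"HREF=\"(.*)\"style", page)
    return links


def dumpEach(urls):
    for url in urls:
        # Flatten each page onto one line so a single regex can grab the body
        page = requests.get(f"https://billwurtz.com/{url}").text.strip().replace(
            "</br>", "").replace("<br>", "").replace("\n", " ")
        data = re.findall(r"</head>(.*)", page, re.MULTILINE)

        # Skip pages with nothing after the </head> tag
        if len(data) == 0:
            continue

        print(data[0])


urls = loadAllUrls()
# Status goes to stderr so it stays out of the redirected posts file
print(f"Loaded {len(urls)} pages", file=sys.stderr)
dumpEach(urls)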
This script will print each of Bill’s notes to the console (on its own line). I used a simple redirect to write this to a file.
python3 scrape.py > posts.txt
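The artifacts left in posts.txt are most likely stray tags and HTML entities, though that is an assumption. If so, a small cleanup pass before training could look like the sketch below. It uses only the standard library, and clean_note and clean_lines are names made up for this example, not part of the original script.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # any leftover HTML tag
SPACE_RE = re.compile(r"\s+")    # runs of whitespace


def clean_note(raw):
    """Strip leftover tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    # Decode entities after stripping tags so "&lt;" stays as literal text
    text = html.unescape(text)
    return SPACE_RE.sub(" ", text).strip()


def clean_lines(lines):
    """Clean each scraped line and drop the ones that end up empty."""
    cleaned = (clean_note(line) for line in lines)
    return [note for note in cleaned if note]


# Made-up sample input, not real scraped data
sample = ["<body><p>hi &amp; bye</p></body>", "  <br> "]
print(clean_lines(sample))
```

Running the lines of posts.txt through clean_lines before training should give the RNN less markup to imitate.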
Training
To train the RNN, I just used some of textgenrnn’s example code to read the posts file and build an HDF5 file that stores the RNN’s trained weights.
from textgenrnn import textgenrnn
generator = textgenrnn()
generator.train_from_file("/path/to/posts.txt", num_epochs=100)
This takes quite a while to run, so I offloaded it to a Droplet, and left it running overnight.
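Once training finishes, generating notes from the saved weights could look roughly like the sketch below. It assumes the weights ended up in a file such as textgenrnn_weights.hdf5 (what the library's default settings should produce; adjust the path if yours differs). The temperature of 0.5 is just a starting point, and load_generator and generate_notes are helper names made up for this example.

```python
def load_generator(weights_path):
    """Build a textgenrnn model from previously trained weights."""
    # Imported lazily because textgenrnn pulls in TensorFlow on import
    from textgenrnn import textgenrnn

    return textgenrnn(weights_path=weights_path)


def generate_notes(generator, count=5, temperature=0.5):
    """Sample `count` notes and return the non-empty ones as clean strings."""
    notes = generator.generate(
        count, temperature=temperature, return_as_list=True)
    return [note.strip() for note in notes if note.strip()]


# Example usage (the path is an assumption, see above):
# generator = load_generator("textgenrnn_weights.hdf5")
# for note in generate_notes(generator, count=10):
#     print(note)
```

Lower temperatures tend to give safer, more repetitive notes, while higher ones get weirder, which is probably what you want for Bill Wurtz.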
The results
Here are some of my favorite generated notes:
“note: do not feel better”
“hi I am a car.”
“i am stuff and think about this before . this is it, the pond. how do they make me feel better?”
“i am still about the floor”
Not perfect, but it is readable English, so I call it a win!
Play with the code
I have uploaded the basic code, the scraped posts, and a partial HDF5 file to GitHub for anyone to play with. Maybe make a Twitter bot out of this?