Mining Hamlet

A lot of Shakespeare’s tragic heores don’t dominate the first act of their plays. Instead, other characters speak about them, setting the scene for exploring their personalities as the play unfolds. This is the case of Julius Caesar, Macbeth and Othello (but not of King Lear).

In this post I go over the text of Hamlet, Prince of Denmark, using the quantity of lines spoken by character to visualize the dynamic of the play. We’ll get the chance to use dplyr and some regular expressions.

Getting the text of the plays

The text of Shakespeare’s plays is available from the gutenberg package. I downloaded the text and made it available as a data frame here so you don’t have to.

books <- readRDS('shakespeare_plays.rds')
hamlet <- books %>%  
  filter(title == "Hamlet, Prince of Denmark")

Extracting the Character names

We can use a regular expressions to extract character names from the lines of the play. Most characters names appear abreviated (Ham. for Hamlet, Hor. for Horatio).

lag and cumsum are useful inside call to mutate to look at consecutive lines in the data frame. I also create a line number index with row_number.

# Fran., Ham., Pol.
CHAR_REGEX <- regex("^([A-Z][a-z]*)\\.")
# Stage Dir
# [Enter Horatio and Marcellus.]
STAGEDIR_REGEX <- regex("(\\[.+\\])")

hamlet <- hamlet %>% 
  mutate(char_name = str_match(text, CHAR_REGEX)[,2]) %>% 
  mutate(stage_dir = str_match(text, STAGEDIR_REGEX)[,2],
         char_name = if_else(!is.na(stage_dir), "director", char_name)) %>%
  mutate(start_speech = !is.na(char_name) & 
           lag(text) == "") %>%
  mutate(speech_idx = cumsum(start_speech)) %>% 
  mutate(line = row_number()) %>% 
  select(text, char_name, start_speech, speech_idx, line)

Now we need to create a data frame of speeches. Each line is a speech in the play, with the character that speaks it and the number of lines it lasts.

# Build a df with speech, start line, length char
speeches_df <- hamlet %>% 
  group_by(speech_idx) %>% 
  summarise(char_name = first(char_name), 
            line = first(line),
          speech_length = as.integer(n()-2)) %>% 
  dplyr::filter(char_name != "director")

The longest speech is by Hamlet (duh!), and starts at line 2677.

speeches_df %>% 
  arrange(-speech_length)
## # A tibble: 1,077 x 4
##    speech_idx char_name  line speech_length
##         <int> <chr>     <int>         <int>
##  1        498 Ham        2677            60
##  2        234 Ghost      1296            50
##  3         69 King        383            39
##  4        761 Ham        4039            36
##  5        154 Laer        846            35
##  6        522 Ham        2857            35
##  7        480 Pol        2553            34
##  8        479 Ham        2518            33
##  9        568 Ham        3139            32
## 10         84 King        502            31
## # ... with 1,067 more rows

Lets take a look at the text of the speech:

hamlet %>% 
  filter(line %in% 2677:2690) %>% 
  select(text) 
## # A tibble: 14 x 1
##    text                                           
##    <chr>                                          
##  1 Ham.                                           
##  2 Ay, so, God b' wi' ye!                         
##  3 Now I am alone.                                
##  4 O, what a rogue and peasant slave am I!        
##  5 Is it not monstrous that this player here,     
##  6 But in a fiction, in a dream of passion,       
##  7 Could force his soul so to his own conceit     
##  8 That from her working all his visage wan'd;    
##  9 Tears in his eyes, distraction in's aspect,    
## 10 A broken voice, and his whole function suiting 
## 11 With forms to his conceit? And all for nothing!
## 12 For Hecuba?                                    
## 13 What's Hecuba to him, or he to Hecuba,         
## 14 That he should weep for her? What would he do,

Lets focus on the characters with the most lines:

top_speakers <- speeches_df %>% 
  group_by(char_name)  %>%
  summarize(total_lines = sum(speech_length)) %>% 
  arrange(-total_lines) %>% 
  head(6)

Use inner_join to discard the less important characters:

# Keep speeches by these speakers
speeches_df_main <- speeches_df %>% 
  inner_join(top_speakers, by = "char_name") %>% 
  filter(!is.na(char_name))

The last thing we need is a column with the cumulative lines spoken by each character:

speeches_df_main <- speeches_df_main %>% 
   group_by(char_name) %>% 
   mutate(cum_lines = as.integer(cumsum(speech_length))) %>% 
   ungroup() %>% 
   mutate(char_name = fct_recode(char_name,
             "Horatio"       = "Hor",
             "King Claudius" = "King",
             "Laertes"       = "Laer",
             "Polonius"      = "Pol",
             "Ophelia"       = "Oph",
             "Hamlet"        = "Ham"))

Now we can plot the play:

# color palette
col.pal <- RColorBrewer::brewer.pal(8, "Set2")

# Plot Play
g <- ggplot(speeches_df_main, aes(line, cum_lines, fill = char_name)) + 
  guides(colour = guide_legend(title = NULL)) +
  geom_area(alpha = 0.8) + 
  guides(fill = guide_legend(title = NULL)) +
  scale_fill_brewer(palette = "Set2") + 
  labs(title = "Cumulative lines", subtitle = "By Character") + 
  xlab("Line") + 
  ylab("Spoken Lines") +
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        panel.border = element_blank()) +
  scale_x_continuous(expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0)) + 
  facet_wrap(~char_name) + 
  guides(fill="none")

g

The plot shows the dynamic of the play quite nicely. Horatio, Hamlet’s friend figures quite prominenty at the beggining. Polonius has a lot of lines in the middle of the play, until he’s caught behind the arras just before line 4000. Ophelia dies around line 5000.

Towards the end, Hamlet eats up the whole play in his showdown with King Claudius.