Temporal dynamics of words and emotions

Temporal analysis lets us assess how the treatment of a topic, as reflected in the use of particular words or emotions, has changed over time.

Load libraries

library(tidytext)
library(dplyr)
library(ggplot2)

Read data

climate_text <- read.csv(file = "data/climate_text.csv")
load("res/stop_words2.rda")

Estimate the total number of words per TV station (counted after stop words are removed).

climate_date <- as_tibble(climate_text) %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words2)

totals <- climate_date %>% 
  count(station) %>% 
  dplyr::rename(total_words = n)
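
The tokenize, filter and count steps above can be tried on a few made-up rows. This is only a minimal sketch: toy_text and toy_stop_words are hypothetical stand-ins for climate_text and stop_words2, and the stop-word table is assumed to have a column named word.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for climate_text
toy_text <- tibble(
  station = c("CNN", "CNN", "MSNBC"),
  text = c("The climate is changing", "A real threat", "Climate change is real")
)

# Hypothetical stop-word table (assumed to share the column name `word`)
toy_stop_words <- tibble(word = c("the", "is", "a"))

toy_tokens <- toy_text %>%
  # one lowercased token per row, stored in a column called `word`
  unnest_tokens(word, text) %>%
  # drop every token that appears in the stop-word table
  anti_join(toy_stop_words, by = "word")

# count() with `name` gives the total directly, without a separate rename step
toy_totals <- toy_tokens %>%
  count(station, name = "total_words")

toy_totals
```

With these rows, CNN keeps four tokens and MSNBC keeps three once the stop words are gone.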

Sentiment analysis

To run the sentiment analysis:

1. Join the lexicon with the cleaned, tokenized dataset.
2. Summarize the emotional tone.

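climate_date_sentiment <- climate_date %>% 
  # attach the per-station totals computed above
  inner_join(totals, by = "station") %>% 
  # NRC can assign several categories to a word, so each token may yield several rows
  inner_join(get_sentiments("nrc"), by = "word", relationship = "many-to-many")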

climate_date_sentiment %>% 
  count(sentiment, word, sort = TRUE)

## # A tibble: 1,955 × 3
##    sentiment    word          n
##    <chr>        <chr>     <int>
##  1 positive     real        125
##  2 trust        real        125
##  3 positive     president   112
##  4 trust        president   112
##  5 surprise     trump        86
##  6 positive     talk         68
##  7 anticipation time         58
##  8 anger        threat       57
##  9 fear         threat       57
## 10 negative     threat       57
## # ℹ 1,945 more rows
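
In the output above, 'real' appears under both positive and trust with the same count, and 'threat' under anger, fear and negative. That is the effect of the many-to-many join: each token is repeated once per category the lexicon assigns to it. A minimal sketch of this follows, using a hand-made mini_lexicon (a hypothetical stand-in whose word-sentiment pairs are copied from the rows printed above) so that it runs without fetching the NRC lexicon.

```r
library(dplyr)

# Hypothetical mini lexicon; the pairs mirror rows of the printed output
mini_lexicon <- tibble(
  word = c("real", "real", "threat", "threat", "threat"),
  sentiment = c("positive", "trust", "anger", "fear", "negative")
)

# Hypothetical tokens, already cleaned
toy_tokens <- tibble(
  station = c("CNN", "CNN", "MSNBC"),
  word = c("real", "threat", "real")
)

toy_sentiment <- toy_tokens %>%
  inner_join(mini_lexicon, by = "word", relationship = "many-to-many")

# 3 tokens become 7 rows: each 'real' yields 2 rows, 'threat' yields 3
nrow(toy_sentiment)

toy_sentiment %>%
  count(sentiment, word, sort = TRUE)
```

Because of this duplication, counts taken after the join describe word-sentiment pairs rather than individual tokens.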

We can identify the words that contribute most to a given tone (sentiment).

climate_date_sentiment %>% 
  count(sentiment, word, sort = TRUE) %>% 
  # group by sentiment
  group_by(sentiment) %>%
  # take the top 10 words for each sentiment
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  # visualize (ggplot)
  ggplot(aes(x=word, y=n, fill=sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  labs(title = "Top 10 words by sentiment", 
       x = "", y = "")

Note that proper nouns such as Gore and Trump appear. You can always remove these words from your dataset (or from the sentiment lexicon) using anti_join().
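
As a rough sketch of that removal, the snippet below filters a hypothetical token table with anti_join(). The proper_nouns table and the toy tokens are made up for illustration; in the real pipeline the same call would be applied to climate_date before joining the lexicon.

```r
library(dplyr)

# Hypothetical list of proper nouns to exclude
proper_nouns <- tibble(word = c("gore", "trump"))

# Hypothetical tokens standing in for climate_date
toy_tokens <- tibble(word = c("trump", "threat", "gore", "real"))

# anti_join() keeps only the rows of toy_tokens with no match in proper_nouns
clean_tokens <- toy_tokens %>%
  anti_join(proper_nouns, by = "word")

clean_tokens$word  # "threat" "real"
```

Filtering the tokens rather than the lexicon keeps the lexicon intact for reuse with other datasets.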

Which TV station has the highest proportion of negative words?

climate_date_sentiment %>% 
  # count using three arguments
  count(station, sentiment, total_words) %>% 
  ungroup() %>% 
  # Make a new percent column with mutate 
  mutate(percent = n / total_words) %>%
  # Filter for only negative words
  filter(sentiment == "negative") %>%
  # Arrange by descending percent
  arrange(desc(percent))

## # A tibble: 3 × 5
##   station  sentiment total_words     n percent
##   <chr>    <chr>           <int> <int>   <dbl>
## 1 FOX News negative         3586   387  0.108 
## 2 CNN      negative         3673   314  0.0855
## 3 MSNBC    negative         6095   486  0.0797

Which 'negative' words does each TV station use?

climate_date_sentiment %>% 
  # filter for only negative words
  filter(sentiment == "negative") %>%
  # count by word and station
  count(word, station) %>%
  # group by station
  group_by(station) %>%
  # take the top 10 words for each station
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(paste(word, station, sep = "__"), n)) %>%
  # visualize
  ggplot(aes(x=word, y=n, fill=station)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ station, nrow = 2, scales = "free") +
  coord_flip() + 
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) + 
  labs(title = "Top 10 negative words by news station", 
       x = "", y = "")

Time series

To visualize tone or sentiment over time, we will use the floor_date() function from the lubridate package. This function rounds dates down to the time unit you specify. Here we use it to group and count positive and negative words over time. For example, with unit = "1 month", each date is rounded down to the first day of its month; with unit = "6 months", as in the first plot, each date is rounded down to the start of its half-year.
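
A small sketch of the rounding behaviour follows. The two timestamps are hypothetical and assume the year-month-day hour:minute:second layout that ymd_hms() parses.

```r
library(lubridate)

# Hypothetical broadcast timestamps
dates <- ymd_hms(c("2017-03-15 10:30:00", "2017-08-02 18:00:00"))

# Round down to the first day of each month: 2017-03-01 and 2017-08-01
monthly <- floor_date(dates, unit = "1 month")

# Round down to the start of each half-year: 2017-01-01 and 2017-07-01
half_yearly <- floor_date(dates, unit = "6 months")

monthly
half_yearly
```

Every date in the same period collapses to a single value, which is what makes the later group_by(date) and count(date, ...) steps work.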

# Load the lubridate package
library(lubridate)

sentiment_by_time <- climate_date %>%
  # convert show_date(chr) into date
  mutate(show_date = ymd_hms(show_date)) %>% 
  # define a new column using floor_date()
  mutate(date = floor_date(show_date, unit = "6 months")) %>%
  # group by date
  group_by(date) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  # implement sentiment analysis using the NRC lexicon
  inner_join(get_sentiments("nrc"),relationship = "many-to-many") 

sentiment_by_time %>%
  # Filter for positive and negative words
  filter(sentiment %in% c("positive", "negative")) %>%
  # Count by date, sentiment, and total_words
  count(date, sentiment, total_words) %>%
  ungroup() %>%
  mutate(percent = n / total_words) %>%
  # Set up the plot with aes()
  ggplot(aes(x=date, y=percent, color=sentiment)) +
  geom_line(linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", linewidth = 0.75) +
  expand_limits(y = 0) + 
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Visualizing sentiment over time", 
       x = "", y = "")

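climate_date %>%
  # start from climate_date so each token is counted once
  # (sentiment_by_time holds one row per word-sentiment pair)
  # convert show_date(chr) into date
  mutate(show_date = ymd_hms(show_date)) %>% 
  # define a new column that rounds each date down to the first day of its month
  mutate(date = floor_date(show_date, unit = "1 month")) %>%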
  filter(word %in% c("threat", "hoax", "money",
                     "terrorism", "scientific", "hurricane")) %>% 
  # count by date and word
  count(date, word) %>%
  ungroup() %>%
  # Set up your plot with aes()
  ggplot(aes(x=date, y=n, color=word)) +
  # Make facets by word
  facet_wrap(~word, scales = "free") +
  geom_line(linewidth = 1.5, show.legend = FALSE) +
  expand_limits(y = 0)

What an interesting plot! Words such as 'hoax' have only come into use recently, whereas 'scientific' has had several moments of high intensity and 'money' shows a decline in its monthly counts. It is also easy to see when a hurricane was being discussed.