Temporal dynamics of words and emotions
Temporal analysis lets us assess how the treatment of a topic, measured through the use of particular words or emotions, has changed over time.
Load libraries
library(tidytext)
library(dplyr)
library(ggplot2)
Read data
climate_text <- read.csv(file = "data/climate_text.csv")
load("res/stop_words2.rda")
Estimate the total number of words per TV station.
climate_date <- as_tibble(climate_text) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words2)
totals <- climate_date %>%
  count(station) %>%
  dplyr::rename(total_words = n)
Sentiment analysis
To implement the sentiment analysis:
1. Join the lexicon to the cleaned, tokenized dataset.
2. Summarize the emotional tone.
climate_date_sentiment <- climate_date %>%
  inner_join(totals, by = "station") %>%
  inner_join(get_sentiments("nrc"), by = "word", relationship = "many-to-many")
climate_date_sentiment %>%
  count(sentiment, word, sort = TRUE)
## # A tibble: 1,955 × 3
## sentiment word n
## <chr> <chr> <int>
## 1 positive real 125
## 2 trust real 125
## 3 positive president 112
## 4 trust president 112
## 5 surprise trump 86
## 6 positive talk 68
## 7 anticipation time 58
## 8 anger threat 57
## 9 fear threat 57
## 10 negative threat 57
## # ℹ 1,945 more rows
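In the output above, the same word appears under more than one sentiment with identical counts ("real" is listed as both positive and trust). This follows from the join: a lexicon can tag one word with several sentiments, so every occurrence of that word is duplicated once per tag. The sketch below illustrates the effect with a hypothetical three-row lexicon (`mini_lexicon`) and a toy token table; neither is part of the original data.

```r
library(dplyr)

# Hypothetical excerpt in the same shape as get_sentiments("nrc")
mini_lexicon <- tibble(
  word      = c("real", "real", "threat"),
  sentiment = c("positive", "trust", "fear")
)

# Toy tokens: "real" occurs twice, "threat" once, "climate" is not in the lexicon
tokens <- tibble(word = c("real", "threat", "real", "climate"))

# Each occurrence of "real" matches two lexicon rows, hence many-to-many
joined <- tokens %>%
  inner_join(mini_lexicon, by = "word", relationship = "many-to-many")

# Words missing from the lexicon are dropped by inner_join()
sentiment_counts <- joined %>%
  count(sentiment, word, sort = TRUE)

sentiment_counts
```

Because of this duplication, the counts per sentiment should be read as the number of word-sentiment matches, not the number of distinct tokens.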
We can identify the words that contribute most to a given tone (sentiment).
climate_date_sentiment %>%
  count(sentiment, word, sort = TRUE) %>%
  # group by sentiment
  group_by(sentiment) %>%
  # take the top 10 words for each sentiment, ranked by n
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  # visualize (ggplot)
  ggplot(aes(x = word, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  labs(title = "Top10 Words by Sentiment",
       x = "", y = "")
Note that proper nouns such as Gore and Trump appear. You can always remove such words from your dataset (or from the sentiment lexicon) with anti_join().
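As a minimal sketch of that removal, assuming a hand-made exclusion list: `tokens` and `custom_stop` below are hypothetical stand-ins with made-up values, not objects from the original analysis.

```r
library(dplyr)

# Toy table standing in for the tokenized, sentiment-tagged data (made-up values)
tokens <- tibble(
  word      = c("gore", "trump", "threat", "real"),
  sentiment = c("negative", "surprise", "fear", "trust")
)

# Hypothetical list of proper nouns to exclude
custom_stop <- tibble(word = c("gore", "trump"))

# anti_join() keeps only the rows of `tokens` that have no match in `custom_stop`
tokens_clean <- tokens %>%
  anti_join(custom_stop, by = "word")

tokens_clean
```

The same pattern applies to the lexicon itself: running anti_join() on the lexicon before the inner join should keep those words from ever being matched.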
Which TV station has the highest proportion of negative words?
climate_date_sentiment %>%
  # count using three arguments
  count(station, sentiment, total_words) %>%
  ungroup() %>%
  # Make a new percent column with mutate
  mutate(percent = n / total_words) %>%
  # Filter for only negative words
  filter(sentiment == "negative") %>%
  # Arrange by descending percent
  arrange(desc(percent))
## # A tibble: 3 × 5
## station sentiment total_words n percent
## <chr> <chr> <int> <int> <dbl>
## 1 FOX News negative 3586 387 0.108
## 2 CNN negative 3673 314 0.0855
## 3 MSNBC negative 6095 486 0.0797
Which 'negative' words does each TV station use?
climate_date_sentiment %>%
  # filter for only negative words
  filter(sentiment == "negative") %>%
  # count by word and station
  count(word, station) %>%
  # group by station
  group_by(station) %>%
  # take the top 10 words for each station, ranked by n
  top_n(10, n) %>%
  ungroup() %>%
  # build a word__station key so each bar is ordered within its own facet
  mutate(word = reorder(paste(word, station, sep = "__"), n)) %>%
  # visualize
  ggplot(aes(x = word, y = n, fill = station)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ station, nrow = 2, scales = "free") +
  coord_flip() +
  # strip the "__station" suffix from the axis labels
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  labs(title = "Top10 Negative words by News Station",
       x = "", y = "")
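The plot code above pastes the word and the station into one key before calling reorder(), then removes the station again when drawing the axis labels. A plausible reading is that a factor has a single level order, so a word shared by several stations could not be ranked independently in each facet; a composite key makes every bar its own level. The snippet below only illustrates the label-cleaning step, using hypothetical keys.

```r
# Hypothetical composite keys, as produced by paste(word, station, sep = "__")
keys <- c("threat__CNN", "threat__FOX News", "pollution__MSNBC")

# Remove everything from the "__" separator onward, leaving only the word
labels <- gsub("__.+$", "", keys)

labels
```

The tidytext package also provides reorder_within() and scale_x_reordered(), which wrap the same idea and may be a tidier alternative.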
Time series
To visualize tone or sentiment over time, we will use the floor_date() function from the lubridate package. This function rounds dates down to the time unit you specify. Here we use it to group and count the use of positive and negative words over time. For example, with unit = "1 month", each date is rounded down to the first day of its month.
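A short sketch of how the rounding behaves, using two hypothetical timestamps (not taken from the dataset) in the format that ymd_hms() parses:

```r
library(lubridate)

# Hypothetical broadcast timestamps
show_date <- ymd_hms(c("2016-03-17 21:00:00", "2016-11-08 23:30:00"))

# Round down to the first day of each month
by_month <- floor_date(show_date, unit = "1 month")

# Round down to the start of each six-month period (1 January or 1 July)
by_half <- floor_date(show_date, unit = "6 months")

format(by_month, "%Y-%m-%d")
format(by_half, "%Y-%m-%d")
```

A wider unit puts more observations into each bin, which should give a smoother but coarser series.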
# Load the lubridate package
library(lubridate)
sentiment_by_time <- climate_date %>%
  # convert show_date (chr) into a date-time
  mutate(show_date = ymd_hms(show_date)) %>%
  # define a new column using floor_date()
  mutate(date = floor_date(show_date, unit = "6 months")) %>%
  # count the total words in each period
  group_by(date) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  # implement sentiment analysis using the NRC lexicon
  inner_join(get_sentiments("nrc"), by = "word", relationship = "many-to-many")
sentiment_by_time %>%
  # Filter for positive and negative words
  filter(sentiment %in% c("positive", "negative")) %>%
  # Count by date, sentiment, and total_words
  count(date, sentiment, total_words) %>%
  ungroup() %>%
  mutate(percent = n / total_words) %>%
  # Set up the plot with aes()
  ggplot(aes(x = date, y = percent, color = sentiment)) +
  # linewidth replaces the size aesthetic for lines (deprecated in ggplot2 3.4.0)
  geom_line(linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, lty = 2.5, lwd = 0.75) +
  expand_limits(y = 0) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Visualizing sentiment over time",
       x = "", y = "")
sentiment_by_time %>%
  # convert show_date into a date-time
  mutate(show_date = ymd_hms(show_date)) %>%
  # define a new column that rounds each date down to the first day of its month
  mutate(date = floor_date(show_date, unit = "1 month")) %>%
  filter(word %in% c("threat", "hoax", "money",
                     "terrorism", "scientific", "hurricane")) %>%
  # count by date and word
  count(date, word) %>%
  ungroup() %>%
  # Set up your plot with aes()
  ggplot(aes(x = date, y = n, color = word)) +
  # Make facets by word
  facet_wrap(~ word, scales = "free") +
  geom_line(linewidth = 1.5, show.legend = FALSE) +
  expand_limits(y = 0)
What an interesting plot! Words such as 'hoax' have only been used recently, whereas 'scientific' has had several moments of high intensity and 'money' shows a decline in its monthly use. It is also easy to identify when a hurricane was being discussed.