This is the first post of a three-part series, where I go from fetching Twitter users and preparing the data to visualizing it (if I wanted to show everything I've done in a single post, it would be almost as long as my first one! And believe me: nobody wants that 😝 ):
- How to fetch Twitter users with R: this one; the title is kind of self-explanatory…
- How to deal with `ggplotly` huge maps: where I go through the details of why I chose not to use it, generating the HTML with `plot_geo` instead.
- How to plot animated maps with `gganimate`: again, pretty obvious subject.
Finally I present my favourite visualization here.
I should warn you that there are a lot of emojis in this series, courtesy of the `emo` package, which Hadley recently released and I fanatically adopted 😇
Let's get started!
## Getting Twitter users
I had to learn how to retrieve data from the Twitter API, and I chose to use the `rtweet` package, which is super easy to use! Since I only use public data, I don't have to worry about getting my Twitter personal access token.

Every R-Ladies chapter uses a standard handle, with the RLadiesLocation format (thankfully they are very compliant with this!). I use the `rtweet::search_users` function, setting the query to be searched with `q = 'RLadies'` and the number of users to retrieve with `n = 1000`, the maximum for a single search. As I want a dataframe as a result, I set the `parse` parameter to `TRUE`. This way I get 1,000 rows of users, with 36 variables about them. I'm only showing the variables I'm going to use, but there is a lot of extra information there.
```r
library(rtweet)

users <- search_users(q = 'RLadies', n = 1000, parse = TRUE)
```
Let's see what it returns:
```r
library(DT)

datatable(users[, c(2:5)], rownames = FALSE, options = list(pageLength = 5))
```
As I get so many duplicate users (nearly half of them!), I suspect it retrieves a user not only if `q` matches the user's _description_, _name_ or _screen\_name_ (handle), but also if it matches something they tweeted (neither the [Twitter API documentation](https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-search) nor the [`rtweet::search_users` one](https://cran.r-project.org/web/packages/rtweet/rtweet.pdf) are clear about this).
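To make the duplicate problem concrete, here is a minimal sketch on a made-up tibble (the handles and descriptions are invented for illustration, not actual search results):

```r
library(dplyr)

# Toy stand-in for the search_users() result: the third row exactly
# duplicates the first one
users_demo <- tibble::tibble(
  screen_name = c('RLadiesSF', 'RLadiesNYC', 'RLadiesSF'),
  description = c('R-Ladies San Francisco', 'R-Ladies New York',
                  'R-Ladies San Francisco')
)

# How many rows are exact duplicates of an earlier row?
sum(duplicated(users_demo))            # 1

# Keep one copy of each user
distinct_users <- distinct(users_demo)
nrow(distinct_users)                   # 2
```

`dplyr::distinct()` and base `unique()` are interchangeable here; both drop rows that are identical across all columns.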
I use `DT::datatable` in case someone wants to go through what's on the whole table (of course I'm thinking about the R-Ladies community here 😍 ). It was not easy to set up the environment for my blog to show this table (it uses the `htmlwidgets` package), but luckily my hubby was more than willing to help me with that part 😅 If you are using RStudio it is as simple as installing the `DT` package, or you can always use `knitr::kable(head(users[, c(2:5)]), format = "html")` to see the first rows.
## Cleaning the data
First I remove all the duplicates, and then, using a regular expression, I keep only the handles that comply with the stipulated format. I filter out three additional handles:
- ‘RLadies’, whose name is ‘Royal Ladies’ and I assume has something to do with royalty by the crown on their profile picture 👸
- ‘RLadies_LF’, a Japanese account whose name Google Translate renders as ‘Rakuten Ichiba fashion delivery’.
- ‘RLadiesGlobal’, because it is not a chapter, so I don't want to include it on the plot.
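The filter condition can be sanity-checked on a handful of made-up handles (these example handles are invented for illustration):

```r
library(stringr)

# Hypothetical handles showing what the filter keeps and drops
handles <- c('RLadiesSF', 'RLadiesGlobal', 'Royal_Ladies', 'rladiesnyc')

# str_detect() is case-sensitive, so 'rladiesnyc' is dropped,
# and the three excluded handles are removed explicitly
keep <- str_detect(handles, '^(RLadies).*') &
  !handles %in% c('RLadies', 'RLadies_LF', 'RLadiesGlobal')

handles[keep]   # "RLadiesSF"
```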
Then I format the `created_at` date variable as `%Y-%m-%d` (just because seeing the hours, minutes and seconds annoys me!), generate the age in days (`age_days`) and select the variables I will use for my analysis.
```r
library(dplyr)
library(lubridate)
library(stringr)
library(tidyr)

rladies <- unique(users) %>%
  filter(str_detect(screen_name, '^(RLadies).*') &
           !screen_name %in% c('RLadies', 'RLadies_LF', 'RLadiesGlobal')) %>%
  mutate(created_at = format(as.Date(created_at), format = '%Y-%m-%d'),
         age_days = difftime(as.Date('2017-5-15'), created_at, unit = 'days')) %>%
  select(screen_name, location, created_at,
         followers = followers_count, age_days)
```
One final fix: I have some missing values on `location` that I'll need for geocoding the chapters, so I use an auxiliary table `lookup` to match each `screen_name` with its `location`:
```r
library(tibble)

lookup <- tibble(screen_name = c('RLadiesLx', 'RLadiesMTL', 'RLadiesSeattle'),
                 location    = c('Lisbon', 'Montreal', 'Seattle'))

rladies <- rladies %>%
  left_join(lookup, by = 'screen_name') %>%
  mutate(location = coalesce(location.x, location.y)) %>%
  select(-location.x, -location.y)
```
There are two additional chapters with no presence on Twitter: one in Taipei, Taiwan, and the other in Warsaw, Poland. I add them according to their creation date, using the number of members on their Meetup account as followers.
```r
rladies <- rladies %>%
  add_row(screen_name = 'RLadiesTaipei',
          location = 'Taipei',
          created_at = as.Date('2014-11-15'),
          followers = 347) %>%
  add_row(screen_name = 'RLadiesWarsaw',
          location = 'Warsaw',
          created_at = as.Date('2016-11-15'),
          followers = 80)

datatable(rladies, rownames = FALSE, options = list(pageLength = 5))
```
As my ultimate goal is to plot the chapters on a map, I need to obtain the latitude and longitude for each one of them. That's when the `ggmap` package really comes in handy: it interacts with Google Maps to retrieve the coordinates from the location, and I don't even have to worry about getting the location into a specific format, because it is so good that it doesn't need one! (My first try was actually extracting the cities using regular expressions, but many of the chapters didn't match or matched wrongly, so I tried it this way and it worked perfectly!)
Since the `ggmap::geocode` function returns two columns, I first thought about calling it twice: once for the longitude and once for the latitude. But I didn't like that approach because it was awfully inefficient, and geocoding takes some (really long!) time. It would have been something like this:
```r
library(ggmap)

rladies <- rladies %>%
  mutate(lon = geocode(location)[, 1],
         lat = geocode(location)[, 2])
```
Doing some research (and benefitting from Amelia's super helpful suggestion!), I finally decided to use the `purrr::map` function to capture both values in a single column of the dataframe, and then transform it into two separate columns with `tidyr::unnest`. All of this without ever leaving the `tidyverse` world 😏

I'm doing it in two steps to show the intermediate result, with the two columns in a single variable of the dataframe.
```r
library(ggmap)
library(purrr)

rladies <- rladies %>%
  mutate(longlat = purrr::map(.$location, geocode))
```
| screen_name     | location          | created_at | followers | age_days | longlat               |
|:----------------|:------------------|:-----------|----------:|---------:|:----------------------|
| RLadiesSF       | San Francisco     | 2012-10-15 |       916 |     1673 | -122.41942, 37.77493  |
| RLadiesNYC      | New York          | 2016-09-01 |       309 |      256 | -74.00594, 40.71278   |
| RLadiesIstanbul | İstanbul, Türkiye | 2016-09-06 |       436 |      251 | 28.97836, 41.00824    |
| RLadiesBCN      | Barcelona, Spain  | 2016-10-11 |       377 |      216 | 2.173404, 41.385064   |
| RLadiesColumbus | Columbus, OH      | 2016-10-04 |       179 |      223 | -82.99879, 39.96118   |
| RLadiesBoston   | Boston, MA        | 2016-09-06 |       259 |      251 | -71.05888, 42.36008   |
```r
rladies <- rladies %>%
  unnest()
```
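For anyone without a Google Maps quota, the map-then-unnest pattern can be reproduced on toy data with a stand-in for `geocode` (the `fake_geocode` function, the `demo` tibble and the coordinates below are all invented for illustration):

```r
library(dplyr)
library(purrr)
library(tidyr)

# Stand-in for ggmap::geocode(): like the real function, it returns a
# one-row data frame with lon/lat, here keyed on the location string
fake_geocode <- function(location) {
  coords <- list(Lisbon   = c(-9.139337, 38.722252),
                 Montreal = c(-73.567256, 45.501689))
  data.frame(lon = coords[[location]][1], lat = coords[[location]][2])
}

demo <- tibble::tibble(location = c('Lisbon', 'Montreal')) %>%
  mutate(longlat = map(location, fake_geocode)) %>%  # list-column of data frames
  unnest(longlat)                                    # spread into lon / lat columns

demo   # a two-row tibble with location, lon and lat columns
```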
That's it! Now the dataframe is ready to use for visualizing these Twitter users on a map (considering their size and date of creation), and for making some interactive maps and animations!
If you enjoyed this article, check out the next one of the series here or the code in my GitHub repo. You are also welcome to leave your comments and suggestions below or mention me on Twitter. Thank you for reading 😉