How to fetch Twitter users with R

May 15, 2017 · 16858 words · 80 minute read

This is the first post in a three-part series that goes from fetching Twitter users and preparing the data to visualizing it (if I tried to show everything I've done in a single post, it would be almost as long as my first one, and believe me: nobody wants that 😝):

How to fetch Twitter users with R: this one, the title is kind of self-explanatory…
How to deal with ggplotly huge maps: where I go through the details of why I chose not to use ggplotly and use plot_geo instead to generate the HTML.
How to plot animated maps with gganimate: again, pretty obvious subject.

Finally I present my favourite visualization here.

I should warn you that there are a lot of emojis in this series, courtesy of the emo package Hadley recently released and I fanatically adopted 😇

Let's get started!

Getting Twitter users

I had to learn how to retrieve data from the Twitter API, and I chose to use the rtweet package, which is super easy to use! Since I only use public data I don't have to worry about getting my Twitter personal access token.

Every R-Ladies chapter uses a standard handle in the RLadiesLocation format (thankfully they are very compliant with this!). I use the rtweet::search_users function, setting the search query with q = 'RLadies' and the number of users to retrieve with n = 1000, which is the maximum for a single search. As I want a dataframe as the result, I set the parse parameter to TRUE. This way I get 1,000 rows of users with 36 variables each. I'm only showing the variables I'm going to use, but there is a lot of extra information in there.

library(rtweet)

users <- search_users(q = 'RLadies',
                      n = 1000,
                      parse = TRUE)

Let's see what it returns:

library(DT)
datatable(users[, c(2:5)], rownames = FALSE,
          options = list(pageLength = 5))

As I get so many duplicate users (nearly half of them!), I suspect the search returns a user if `q` matches their _description_, _name_ or _screen\_name_ (handle), but also if it matches something they tweeted (neither the [Twitter API documentation](https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-search) nor the [`rtweet::search_users` one](https://cran.r-project.org/web/packages/rtweet/rtweet.pdf) is clear about this).
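A minimal sketch of how the share of duplicates could be measured. It runs on a made-up toy table (`toy_users` and its values are assumptions for illustration) rather than on the real API result:

```r
library(dplyr)

# Toy stand-in for the `users` data frame returned by search_users();
# the values are invented for illustration.
toy_users <- tibble::tibble(
  user_id     = c("1", "2", "2", "3", "3"),
  screen_name = c("RLadiesSF", "RLadiesNYC", "RLadiesNYC",
                  "RLadiesBCN", "RLadiesBCN")
)

# Share of rows that exactly repeat an earlier row
dup_share <- mean(duplicated(toy_users))

# How many times each handle appears, most frequent first
handle_counts <- toy_users %>%
  count(screen_name, sort = TRUE)
```

Running the same two steps on the real `users` table would show how many rows `unique()` is going to drop later on.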

I used DT::datatable just in case someone wants to go through what's in the whole table (of course I'm thinking about the R-Ladies community here 😍 ). It was not easy to set up the environment for my blog to show this table (it uses the htmlwidgets package), but luckily my hubby was more than willing to help me with that part 😅 If you are using RStudio it is as simple as installing the DT package, or you can always use knitr::kable(head(users[, c(2:5)]), format = "html") to see the first rows.

Cleaning the data

First I remove all the duplicates, and then I keep only the handles that comply with the stipulated format, using a regular expression. I filter out 3 additional handles:

‘RLadies’, whose name is ‘Royal Ladies’ and I assume has something to do with royalty by the crown on their profile picture 👸
‘RLadies_LF’, a Japanese account whose description Google Translate renders as ‘Rakuten Ichiba fashion delivery’.
‘RLadiesGlobal’, because it is not a chapter, so I don't want to include it on the plot.

Then I convert the created_at variable to a date in %Y-%m-%d format (just because seeing the hours, minutes and seconds annoys me!), compute the account age in days as age_days, and select the variables I will use for my analysis.

library(dplyr)
library(lubridate)
library(stringr)
library(tidyr)

# Keep unique users whose handle follows the RLadiesLocation pattern,
# dropping the three accounts listed above
rladies <- unique(users) %>%
  filter(str_detect(screen_name, '^RLadies') &
           !screen_name %in% c('RLadies', 'RLadies_LF', 'RLadiesGlobal')) %>%
  # Keep created_at as a Date (it prints as %Y-%m-%d) so date arithmetic works
  mutate(created_at = as.Date(created_at),
         age_days = as.numeric(difftime(as.Date('2017-05-15'), created_at, units = 'days'))) %>%
  select(screen_name, location, created_at, followers = followers_count, age_days)
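To see what the handle filter and the age calculation do without calling the API, here is a small sketch on made-up inputs (`handles`, `excluded` and `age_sf` are illustrative names, not part of the original code):

```r
library(stringr)

# Invented handles: one valid chapter, the three excluded accounts,
# and two that should fail the pattern.
handles  <- c("RLadiesSF", "RLadies", "RLadies_LF", "RLadiesGlobal",
              "IloveRLadies", "rladiesfan")
excluded <- c("RLadies", "RLadies_LF", "RLadiesGlobal")

# The pattern is anchored at the start and is case-sensitive
keep <- str_detect(handles, "^RLadies") & !handles %in% excluded
chapters <- handles[keep]

# Age in days of an account created on 2012-10-15, as of 2017-05-15
age_sf <- as.numeric(difftime(as.Date("2017-05-15"),
                              as.Date("2012-10-15"),
                              units = "days"))
```

Only "RLadiesSF" survives the filter, and `age_sf` comes out as 1673, which matches the age shown for RLadiesSF in the table further down.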

One final fix: I have some missing values in location, which I'll need for geocoding the chapters, so I use an auxiliary table, lookup, to match each screen_name with its location through dplyr::left_join.

library(tibble)
lookup <- tibble(screen_name = c('RLadiesLx', 'RLadiesMTL' , 'RLadiesSeattle'), 
                 location    = c('Lisbon'   , 'Montreal'   , 'Seattle'      ))

rladies <- rladies %>%
  left_join(lookup, by = 'screen_name') %>%
  mutate(location = coalesce(location.x, location.y)) %>%
  select(-location.x, -location.y)
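The join-then-coalesce pattern can be tried on a two-row toy table first. This is only a sketch: `chapters_toy`, `lookup_toy` and `filled` are made-up names and the data is invented.

```r
library(dplyr)

# Toy chapters table with one missing location
chapters_toy <- tibble::tibble(
  screen_name = c("RLadiesSF", "RLadiesLx"),
  location    = c("San Francisco", NA)
)
lookup_toy <- tibble::tibble(screen_name = "RLadiesLx", location = "Lisbon")

filled <- chapters_toy %>%
  # Both tables have a `location` column, so the join produces
  # location.x (original) and location.y (from the lookup)
  left_join(lookup_toy, by = "screen_name") %>%
  # coalesce() takes the first non-NA value, so existing locations win
  mutate(location = coalesce(location.x, location.y)) %>%
  select(-location.x, -location.y)
```

Note that coalesce() only fills NA values. If the missing locations arrived as empty strings instead (an assumption about the data, not something I checked here), dplyr::na_if(location, "") could convert them to NA before the join.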

There are two additional chapters with no presence on Twitter: one in Taipei, Taiwan, and the other in Warsaw, Poland. I add them according to their creation date, using the number of members on their Meetup account as followers.

rladies <- rladies %>% 
  add_row(      
    screen_name = 'RLadiesTaipei',
    location = 'Taipei',
    created_at = as.Date('2014-11-15'),
    followers = 347) %>% 
  add_row(      
    screen_name = 'RLadiesWarsaw',
    location = 'Warsaw',
    created_at = as.Date('2016-11-15'),
    followers = 80)

datatable(rladies, rownames = FALSE,
          options = list(pageLength = 5))
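One thing to keep in mind: add_row() fills any column that isn't supplied with NA, so rows added this way have no age_days. A possible way to handle that, sketched on a toy table (`chapters_toy` is a made-up name), is to recompute the age after adding the rows:

```r
library(dplyr)
library(tibble)

chapters_toy <- tibble(
  screen_name = "RLadiesSF",
  created_at  = as.Date("2012-10-15"),
  age_days    = 1673
) %>%
  # age_days is not supplied here, so it is NA for the new row
  add_row(screen_name = "RLadiesTaipei",
          created_at  = as.Date("2014-11-15")) %>%
  # Recompute the age for every row from its creation date
  mutate(age_days = as.numeric(difftime(as.Date("2017-05-15"),
                                        created_at,
                                        units = "days")))
```

This relies on created_at being a Date column in both the existing and the new rows.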

As my ultimate goal is to plot the chapters on a map, I need the latitude and longitude of each one. That's where the ggmap package really comes in handy: it queries Google Maps to retrieve the coordinates for each location, and I don't even have to clean the location into a specific format first. (My first attempt was to extract the cities using regular expressions, but many of the chapters didn't match or matched wrongly, so I tried it this way and it worked perfectly!)

Since the ggmap::geocode function returns 2 columns, I thought about calling it twice: once for the longitude and once for the latitude. But I didn't like it because it was awfully inefficient, and the geocoding takes some (really long!) time. It was going to be something like this:

library(ggmap)

rladies <- rladies %>% 
  mutate(lon = geocode(location)[,1],
         lat = geocode(location)[,2])

Doing some research (and benefitting from Amelia's super helpful suggestion!) I finally decided to use the purrr::map function to capture both values in a single column of the dataframe, and then split it into two separate columns with tidyr::unnest. All of this without ever leaving the tidyverse world 😏

I'm doing it in two steps to see the intermediate result, with the two columns in a single variable of the dataframe.

library(ggmap)
library(purrr)

rladies <- rladies %>% 
  mutate(longlat = purrr::map(.$location, geocode))

screen_name	location	created_at	followers	age_days	longlat
RLadiesSF	San Francisco	2012-10-15	916	1673	-122.41942, 37.77493
RLadiesNYC	New York	2016-09-01	309	256	-74.00594, 40.71278
RLadiesIstanbul	İstanbul, Türkiye	2016-09-06	436	251	28.97836, 41.00824
RLadiesBCN	Barcelona, Spain	2016-10-11	377	216	2.173404, 41.385064
RLadiesColumbus	Columbus, OH	2016-10-04	179	223	-82.99879, 39.96118
RLadiesBoston	Boston, MA	2016-09-06	259	251	-71.05888, 42.36008

# Expand the nested longlat column into separate lon and lat columns
rladies <- rladies %>% 
  unnest(longlat)
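The map-then-unnest idea can be tested offline with a stand-in for the geocoder. In this sketch `fake_geocode` is a hypothetical helper that mimics the one-row `lon`/`lat` data frame that ggmap::geocode returns, and the toy table holds just two chapters:

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical offline replacement for ggmap::geocode():
# looks the location up in a hard-coded list instead of calling Google Maps.
fake_geocode <- function(location) {
  coords <- list(
    "San Francisco" = c(-122.41942, 37.77493),
    "New York"      = c(-74.00594, 40.71278)
  )
  tibble::tibble(lon = coords[[location]][1],
                 lat = coords[[location]][2])
}

chapters_toy <- tibble::tibble(
  screen_name = c("RLadiesSF", "RLadiesNYC"),
  location    = c("San Francisco", "New York")
)

# Step 1: one list-column holding a small data frame per chapter
nested <- chapters_toy %>%
  mutate(longlat = map(location, fake_geocode))

# Step 2: expand the list-column into regular lon and lat columns
flat <- nested %>%
  unnest(longlat)
```

Each location is geocoded only once, which is the point of this approach compared with calling the geocoder separately for longitude and latitude.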

That's it! Now the dataframe is ready for me to use it for visualizing these Twitter users on a map (considering their sizes and dates of creation), and make some interactive maps and animations!

If you enjoyed this article, check out the next one of the series here or the code in my GitHub repo. You are also welcome to leave your comments and suggestions below or mention me on Twitter. Thank you for reading 😉