In January 2018, I wrote a couple of blog posts outlining some analysis I’d performed on followers of popular Finnish Twitter profiles. A few people asked that I share the tools used to perform that research. Today, I’ll share a tool similar to the one I used to conduct that research, and at the same time, illustrate how to obtain data about a Twitter account’s followers.
This tool uses Tweepy to connect to the Twitter API. In order to enumerate a target account’s followers, I like to start by using Tweepy’s followers_ids() function to get a list of Twitter ids of accounts that are following the target account. A single call returns up to 5000 ids, which can be saved for later use (since both screen_name and name can be changed, but the account’s id never changes). Once I’ve obtained a list of Twitter ids, I can use Tweepy’s lookup_users(user_ids=batch) to obtain Twitter User objects for each Twitter id. As far as I know, this isn’t exactly the documented way of obtaining this data, but it suits my needs. /shrug
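In sketch form, the two-step pattern (fetch ids once, then hydrate them 100 at a time) looks like this. Note that `batched()` is a hypothetical helper name of my own, and the commented-out Tweepy calls assume an already-authenticated `api` object:

```python
def batched(ids, size=100):
    # Split a list of ids into chunks of at most `size` items
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# With an authenticated Tweepy API object, hydration would look like:
#   follower_ids = api.followers_ids("some_account")
#   users = []
#   for chunk in batched(follower_ids):
#       users += api.lookup_users(user_ids=chunk)

# 250 ids split into chunks of 100:
print([len(chunk) for chunk in batched(list(range(250)))])  # [100, 100, 50]
```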
Once a full set of Twitter User objects has been obtained, we can perform analysis on it. In the following tool, I chose to look at the account age and friends_count of each account returned, print a summary, and save a summarized form of each account’s details as json, for potential further processing. Here’s the full code:
```python
from tweepy import OAuthHandler
from tweepy import API
from collections import Counter
from datetime import datetime
import calendar
import sys
import json
import os
import io
import re

# Helper functions to load and save intermediate steps
def save_json(variable, filename):
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(json.dumps(variable, indent=4, ensure_ascii=False))

def load_json(filename):
    ret = None
    if os.path.exists(filename):
        try:
            with io.open(filename, "r", encoding="utf-8") as f:
                ret = json.load(f)
        except (IOError, ValueError):
            pass
    return ret

def try_load_or_process(filename, processor_fn, function_arg):
    # All intermediate files in this script are saved as json
    if os.path.exists(filename):
        print("Loading " + filename)
        return load_json(filename)
    ret = processor_fn(function_arg)
    print("Saving " + filename)
    save_json(ret, filename)
    return ret

# Some helper functions to convert between different time formats
# and perform date calculations
def twitter_time_to_object(time_string):
    # Twitter timestamps look like "Wed Aug 27 13:08:45 +0000 2008";
    # strip out the "+0000" offset so strptime can parse the rest
    twitter_format = "%a %b %d %H:%M:%S %Y"
    match_expression = r"^(.+)\s(\+[0-9][0-9][0-9][0-9])\s([0-9][0-9][0-9][0-9])$"
    match = re.search(match_expression, time_string)
    if match is not None:
        first_bit = match.group(1)
        last_bit = match.group(3)
        new_string = first_bit + " " + last_bit
        return datetime.strptime(new_string, twitter_format)

def time_object_to_unix(time_object):
    # timegm treats the tuple as UTC, matching Twitter's timestamps
    return calendar.timegm(time_object.timetuple())

def twitter_time_to_unix(time_string):
    return time_object_to_unix(twitter_time_to_object(time_string))

def get_utc_unix_time():
    return calendar.timegm(datetime.utcnow().timetuple())

def seconds_since_twitter_time(time_string):
    return int(get_utc_unix_time()) - int(twitter_time_to_unix(time_string))

# Get a list of follower ids for the target account
def get_follower_ids(target):
    return auth_api.followers_ids(target)

# The Twitter API allows us to batch-query 100 accounts at a time, so
# we create batches of 100 follower ids and gather Twitter User objects
# for each batch
def get_user_objects(follower_ids):
    batch_len = 100
    num_batches = len(follower_ids) // batch_len
    batches = (follower_ids[i:i + batch_len]
               for i in range(0, len(follower_ids), batch_len))
    all_data = []
    for batch_count, batch in enumerate(batches):
        sys.stdout.write("\rFetching batch: " + str(batch_count) +
                         "/" + str(num_batches))
        sys.stdout.flush()
        users_list = auth_api.lookup_users(user_ids=batch)
        all_data += [u._json for u in users_list]
    sys.stdout.write("\n")
    return all_data

# Create one-week ranges and find accounts whose age falls within each
# range's boundaries. We create the ranges and labels first and then
# iterate them while going through the whole list of user data, to
# speed things up.
def make_ranges(user_data, num_ranges=20):
    range_step = 604800  # one week, in seconds
    ranges = {}
    labels = {}
    for x in range(num_ranges):
        label = "%02d" % x + " - " + "%02d" % (x + 1) + " weeks"
        labels[label] = []
        ranges[label] = {"start": x * range_step,
                         "end": (x + 1) * range_step}
    fields = ["screen_name", "name", "created_at", "friends_count",
              "followers_count", "favourites_count", "statuses_count"]
    for user in user_data:
        if "created_at" in user:
            account_age = seconds_since_twitter_time(user["created_at"])
            for label, timestamps in ranges.items():
                if timestamps["start"] < account_age < timestamps["end"]:
                    id_str = user["id_str"]
                    entry = {id_str: {}}
                    for f in fields:
                        if f in user:
                            entry[id_str][f] = user[f]
                    labels[label].append(entry)
    return labels

if __name__ == "__main__":
    account_list = []
    if len(sys.argv) > 1:
        account_list = sys.argv[1:]
    if len(account_list) < 1:
        print("No parameters supplied. Exiting.")
        sys.exit(0)

    consumer_key = ""
    consumer_secret = ""
    access_token = ""
    access_token_secret = ""
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    auth_api = API(auth)

    for target in account_list:
        print("Processing target: " + target)

        # Get a list of Twitter ids for followers of the target account
        # and save it
        filename = target + "_follower_ids.json"
        follower_ids = try_load_or_process(filename, get_follower_ids, target)

        # Fetch Twitter User objects for each Twitter id found and save
        # the data
        filename = target + "_followers.json"
        user_objects = try_load_or_process(filename, get_user_objects,
                                           follower_ids)
        total_objects = len(user_objects)

        # Record a few details about each account that falls within the
        # specified age ranges
        ranges = make_ranges(user_objects)
        filename = target + "_ranges.json"
        save_json(ranges, filename)

        # Print a few summaries
        print("\n\t\tFollower age ranges")
        print("\t\t===================")
        total = 0
        following_counter = Counter()
        for label, entries in sorted(ranges.items()):
            print("\t\t" + str(len(entries)) +
                  " accounts were created within " + label)
            total += len(entries)
            for entry in entries:
                for id_str, values in entry.items():
                    if "friends_count" in values:
                        following_counter[values["friends_count"]] += 1
        print("\t\tTotal: " + str(total) + "/" + str(total_objects))
        print("\n\t\tMost common friends counts")
        print("\t\t==========================")
        total = 0
        for num, count in following_counter.most_common(20):
            total += count
            print("\t\t" + str(count) + " accounts are following " +
                  str(num) + " accounts")
        print("\t\tTotal: " + str(total) + "/" + str(total_objects) + "\n")
```
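To see the bucketing idea behind make_ranges in isolation: each bucket is one week (604800 seconds) wide, so an account's bucket can equivalently be computed by floor-dividing its age by a week. A standalone sketch, not part of the script itself (and slightly simplified, since it includes the lower boundary of each bucket):

```python
WEEK = 604800  # seconds in one week

def week_label(account_age_seconds, num_ranges=20):
    # Return the "NN - MM weeks" bucket for an account age, or None
    # if the account is older than num_ranges weeks
    bucket = account_age_seconds // WEEK
    if 0 <= bucket < num_ranges:
        return "%02d - %02d weeks" % (bucket, bucket + 1)
    return None

print(week_label(3 * 86400))    # 3 days old   -> 00 - 01 weeks
print(week_label(10 * 86400))   # 10 days old  -> 01 - 02 weeks
print(week_label(200 * 86400))  # > 20 weeks   -> None
```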
Let’s run this tool against a few accounts and see what results we get. First up: @realDonaldTrump
As we can see, over 80% of @realDonaldTrump’s last 5000 followers are very new accounts (less than 20 weeks old), with a majority of those being under a week old. Here’s the top friends_count values of those accounts:
No obvious pattern is present in this data.
Next up, an account I looked at in a previous blog post – @niinisto (the president of Finland).
Many of @niinisto’s last 5000 followers are new Twitter accounts, though not in as large a proportion as in the @realDonaldTrump case. In both cases this is to be expected, since both accounts are recommended to new users of Twitter. Let’s look at the friends_count values for the above set.
In some cases, clicking through the creation of a new Twitter account (next, next, next, finish) will create an account that follows 21 Twitter profiles. This can explain the high proportion of accounts in this list with a friends_count value of 21. However, we might expect to see the same (or an even stronger) pattern with the @realDonaldTrump account, and we don’t. I’m not sure why this is the case, but it could be that Twitter has some automation in place to auto-delete programmatically created accounts. If you look at the output of my script, you’ll see that between fetching the list of Twitter ids for the last 5000 followers of @realDonaldTrump and fetching the full Twitter User objects for those ids, 3 accounts “went missing” (hence the tool only collected data for 4997 accounts).
Finally, just for good measure, I ran the tool against my own account (@r0zetta).
Here you see a distribution that’s probably common for non-celebrity Twitter accounts. Not many of my followers have new accounts. What’s more, there’s absolutely no pattern in the friends_count values of these accounts:
Of course, there are plenty of other interesting analyses that can be performed on the data collected by this tool. Once the script has been run, all data is saved on disk as json files, so you can process it to your heart’s content without having to run additional queries against Twitter’s servers. As usual, have fun extending this tool to your own needs, and if you’re interested in reading some of my other guides or analyses, here’s a full list of those articles.
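For instance, re-loading the saved ranges file needs no API calls at all. Here's a minimal sketch — `count_per_range` is my own illustrative helper, and the filename in the comment follows the script's `<target>_ranges.json` convention:

```python
import io
import json

def count_per_range(ranges):
    # Count how many accounts landed in each age bucket
    return {label: len(entries) for label, entries in sorted(ranges.items())}

# Typical usage after the script has run:
#   with io.open("r0zetta_ranges.json", "r", encoding="utf-8") as f:
#       print(count_per_range(json.load(f)))

sample = {"00 - 01 weeks": [{"123": {}}, {"456": {}}], "01 - 02 weeks": []}
print(count_per_range(sample))  # {'00 - 01 weeks': 2, '01 - 02 weeks': 0}
```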
Article Link: https://labsblog.f-secure.com/2018/02/27/how-to-get-twitter-follower-data-using-python-and-tweepy/