- Notifications
You must be signed in to change notification settings - Fork15
Dump the Twitter stream to JSON and CSV, then apply filters, reject non-English content, do sentiment analysis, and more.
License
cantino/twitter_to_csv
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
A tool for exporting the Twitter stream into a CSV file.
(sudo) gem install twitter_to_csv
Perspectives on Acne: What Twitter Can Teach Health Care Providers Shive M, Bhatt M, Cantino A, Kvedar J, Jethwani K., JAMA Dermatol. 2013;149(5):621-622. doi:10.1001/jamadermatol.2013.248.
Twitter: an opportunity for public health campaignsMackenzie R Wehner, Mary-Margaret Chren, Melissa L Shive, Jack S Resneck, Sherry Pagoto, Andrew B Seidenberg, Eleni Linos, The Lancet 12 July 2014 (Volume 384 Issue 9938 Pages 131-132 DOI: 10.1016/S0140-6736(14)61161-2)
You might start by first running the script for a while to dump the Twitter stream into a JSON file:
twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \ --access-token <your twitter access token> --access-token-secret <your twitter access token secret> \ --json out.json --filter your,keywords,of,interest
Then, later, you could export to CSV:
twitter_to_csv --replay-from-file out.json --csv out.csv \ --fields text,created_at,user.name,retweeted_status.id,retweeted_status.favorited,...
Alternatively, you can always stream directly to CSV:
twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \ --access-token <your twitter access token> --access-token-secret <your twitter access token secret> \ --filter your,keywords,of,interest --csv out.csv \ --fields text,created_at,user.name,retweeted_status.id,retweeted_status.favorited,...
Twitter requires all API access over oAuth. Follow these instructions to get register and authorize a free Twitter API Application:
- Visithttps://apps.twitter.com/app/new
- Enter something like "my twitter_to_csv" as the name, "Using twitter_to_csv Ruby gem" as the description, and "https://github.com/cantino/twitter_to_csv" as the website.
- Click "manage API keys"
- Copy down the "API key" and "API secret"
- Click "Create my access token" at the bottom of the page, refreshing until your new access token shows up at the bottom of the page. Copy down your "Access token" and "Access token secret".
- You're all set!
You may want to limit to Tweets that appear to be written in English.
twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \ --access-token <your twitter access token> --access-token-secret <your twitter access token secret> \ --require-english --fields ...
This filter isn't perfect and will have both false positives and false negatives, but it works fairly well.
You can extract URLs, Hashtags, and User Mentions from the tweet into their own columns by using--url-columns
,--hashtag-columns
, and--user-mention-columns
. For example, you could use--url-columns 3
to get up to 3 extracted URLs in their own columns.
Twitter To CSV can compute an average sentiment score for each tweet. Provide--compute-sentiment
to use this feature. TheAFINN-111 valence database is used to look up the valence of each recognized word, then the average is computed, only considering words that have some known valence associated. That is, "I love cheese" only has one word with valence, "love" with a score of 3, so the average is 3. "I love cheese and like bread", on the other hand, has two words with valence, "love" (3) and "like" (2), and so has an average valence of (3 + 2) / 2, or 2.5. The library will break hyphenated words up and score them as separate words unless the whole thing has a single known valence.
Once you have a recorded Twitter stream, you can rollup retweets in various ways. Here is an example that collapses retweets into theretweet_count
field of the original tweet, only outputs tweets with at least 1 retweet, ignores retweets that happened more than 7 days after the original tweet, and outputs retweet count columns at half an hour, 2 hours, and 2 days after the original tweet:
twitter_to_csv --replay-from-file out.json -c out.csv \ --retweet-mode rollup \ --retweet-threshold 1 \ --retweet-window 7 \ --retweet-counts-at 0.5,2,48 \ --fields retweet_count,text
Note that all of the retweet features require you to--replay-from-file
because they parse the stream backwards. They WILL NOT function correctly when you're reading directly from the stream .
To select a specific window of time in a pre-recorded stream bycreated_at
, pass in--start
and--end
, for example:
twitter_to_csv --replay-from-file out.json \ --start "Mon Mar 07 07:42:22 +0000 2011" \ --end "Mon Mar 08 07:42:22 +0000 2011" \ ...
Sometimes the Twitter API goes down. You can analyze a json output file to see where data gaps (of over 10 minutes, in this case) have occurred:
twitter_to_csv --replay-from-file out.json --analyze-gaps 10
If you're doing research on Twitter, it may make sense to construct composite binary variables predicated on the existence of certain tokens in a tweet. For example, you may wanta variable calledsf that is true if the tweet containssan francisco ORsf ORbay area (but notarea by the bay). You could do that as follows:
twitter_to_csv --replay-from-file out.json \ --csv out.csv \ -w "sf: san francisco OR sf OR bay area" \
If you did want to allowarea by the bay, you might need something like:
-w "sf: san francisco OR sf OR (bay AND area)"
Usetwitter_to_csv --help
to see all available options:
Usage: twitter_to_csv [options]These four fields are required. Please see the README to learn how to get them for your Twitter account. --api-key KEY Twitter API key --api-secret SECRET Twitter API secret --access-token TOKEN Twitter access token --access-token-secret SECRET Twitter access token secretGeneral settings: -c, --csv FILE The CSV file to append to, or - for STDOUT -j, --json FILE The JSON file to append to, or - for STDOUT -f, --filter KEYWORDS Keywords to ask Twitter to filter on -x, --fields FIELDS Fields to include in the CSV --date-fields FIELD_NAMES Break these fields into separate numerical columns for weekday, day, month, your, hour, minute, and second. -e, --require-english [STRATEGY] Attempt to filter out non-English tweets. This will have both false positives and false negatives. The strategy can be either 'uld' to use the UnsupervisedLanguageDetection Ruby gem, 'lang' to use Twitter's guessed 'lang' attribute, or 'both' to only remove tweets that both Twitter and ULD think are non-English. This is most conservative and is the default. -v, --[no-]verbose Run verbosely -r, --replay-from-file FILENAME Replay tweets from a JSON dump file --analyze-gaps MINUTES Look at the stream and display gap information for gaps longer than MINUTES --sample-fields NUMBER_OF_SAMPLES Record NUMBER_OF_SAMPLES tweets and then print out all of the field names seen. Use to find out what can be passed to. --url-columns NUM_COLUMNS Extract up to NUM_COLUMNS urls from the status and include them in the CSV --hash-columns NUM_COLUMNS Extract up to NUM_COLUMNS hashtags (#foo) from the status and include them in the CSV --user-columns NUM_COLUMNS Extract up to NUM_COLUMNS user mentions (@foo) from the status and include them in the CSV -s, --compute-sentiment Compute an average sentiment score for each status using the AFINN-111 sentiment dictionary --compute-word-count Include a word count for each status in the output CSV --normalize-source Return just the domain name from the Tweet source (i.e., tweetdeck, facebook) --remove-quotes This option strips all double quotes from the output to help some CSV parsers. --prefix-ids Prefix any field ending in _id or _id_str with 'id' to force parsing as a string in some programs. -w "NAME:WORD AND WORD AND WORD", --bool-word-field Create a named CSV column that is true when the word expression matches, false otherwise. Word expressions are boolean expressions where neighboring words must occur sequentially and you can use parentheses, AND, and OR to test for occurrence relationships. Examples: keyword_any:tanning booth OR tanning booths OR tanningbooth keyword_both:tanning AND booth keyword_complex:tanning AND (booth OR bed) This option can be used multiple times. --start TIME Ignore tweets with a created_at earlier than TIME --end TIME Ignore tweets with a created_at later than TIMEIf you would like to do special retweet handling, use the following options.For these to function, you must be using --replay-from-file. The replay will be performed in reverse. --retweet-mode MODE Determine how to handle retweets Options are just 'ROLLUP' --retweet-threshold COUNT Only consider statuses with at least COUNT retweets --retweet-window WINDOW Ignore retweets that occur beyond WINDOW days Additionally, statuses where WINDOW days have not yet passed will be ignored. --retweet-counts-at HOURS Output the number of retweets seen at specific times after the original tweet -h, --help Show this message --version Show version
Use--sample-fields 1000
to output the occurrence count of different Twitter fields, like so:
twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \ --access-token <your twitter access token> --access-token-secret <your twitter access token secret> \ --sample-fields 1000
Here's a list of fields and their occurrences in a 50,000 tweet dataset:
id 50000id_str 50000created_at 50000text 50000source 50000truncated 50000in_reply_to_status_id 50000in_reply_to_status_id_str 50000in_reply_to_user_id 50000in_reply_to_user_id_str 50000in_reply_to_screen_name 50000user.id 50000user.id_str 50000user.name 50000user.screen_name 50000user.location 50000user.url 50000user.description 50000user.protected 50000user.followers_count 50000user.friends_count 50000user.listed_count 50000user.created_at 50000user.favourites_count 50000user.utc_offset 50000user.time_zone 50000user.geo_enabled 50000user.verified 50000user.statuses_count 50000user.lang 50000user.contributors_enabled 50000user.is_translator 50000user.profile_background_color 50000user.profile_background_image_url 50000user.profile_background_image_url_https 50000user.profile_background_tile 50000user.profile_image_url 50000user.profile_image_url_https 50000user.profile_banner_url 41614user.profile_link_color 50000user.profile_sidebar_border_color 50000user.profile_sidebar_fill_color 50000user.profile_text_color 50000user.profile_use_background_image 50000user.default_profile 50000user.default_profile_image 50000user.following 50000user.follow_request_sent 50000user.notifications 50000geo 48656coordinates 48656place 48664contributors 50000retweet_count 50000favorite_count 50000favorited 50000retweeted 50000filter_level 50000lang 36041entities.hashtags[].text 11024entities.hashtags[].indices[] 11024entities.user_mentions[].screen_name 22368entities.user_mentions[].name 22368entities.user_mentions[].id 22368entities.user_mentions[].id_str 22368entities.user_mentions[].indices[] 22368retweeted_status.created_at 13959retweeted_status.id 13959retweeted_status.id_str 13959retweeted_status.text 13959retweeted_status.source 13959retweeted_status.truncated 13959retweeted_status.in_reply_to_status_id 13959retweeted_status.in_reply_to_status_id_str 13959retweeted_status.in_reply_to_user_id 13959retweeted_status.in_reply_to_user_id_str 13959retweeted_status.in_reply_to_screen_name 13959retweeted_status.user.id 13959retweeted_status.user.id_str 13959retweeted_status.user.name 13959retweeted_status.user.screen_name 13959retweeted_status.user.location 13959retweeted_status.user.url 13959retweeted_status.user.description 13959retweeted_status.user.protected 13959retweeted_status.user.followers_count 13959retweeted_status.user.friends_count 13959retweeted_status.user.listed_count 13959retweeted_status.user.created_at 13959retweeted_status.user.favourites_count 13959retweeted_status.user.utc_offset 13959retweeted_status.user.time_zone 13959retweeted_status.user.geo_enabled 13959retweeted_status.user.verified 13959retweeted_status.user.statuses_count 13959retweeted_status.user.lang 13959retweeted_status.user.contributors_enabled 13959retweeted_status.user.is_translator 13959retweeted_status.user.profile_background_color 13959retweeted_status.user.profile_background_image_url 13959retweeted_status.user.profile_background_image_url_https 13959retweeted_status.user.profile_background_tile 13959retweeted_status.user.profile_image_url 13959retweeted_status.user.profile_image_url_https 13959retweeted_status.user.profile_banner_url 11028retweeted_status.user.profile_link_color 13959retweeted_status.user.profile_sidebar_border_color 13959retweeted_status.user.profile_sidebar_fill_color 13959retweeted_status.user.profile_text_color 13959retweeted_status.user.profile_use_background_image 13959retweeted_status.user.default_profile 13959retweeted_status.user.default_profile_image 13959retweeted_status.user.following 13959retweeted_status.user.follow_request_sent 13959retweeted_status.user.notifications 13959retweeted_status.geo 13728retweeted_status.coordinates 13728retweeted_status.place 13724retweeted_status.contributors 13959retweeted_status.retweet_count 13959retweeted_status.favorite_count 13959retweeted_status.entities.hashtags[].text 2438retweeted_status.entities.hashtags[].indices[] 2438retweeted_status.entities.urls[].url 361retweeted_status.entities.urls[].expanded_url 361retweeted_status.entities.urls[].display_url 361retweeted_status.entities.urls[].indices[] 361retweeted_status.favorited 13959retweeted_status.retweeted 13959retweeted_status.possibly_sensitive 916retweeted_status.lang 13959entities.urls[].url 3662entities.urls[].expanded_url 3662entities.urls[].display_url 3662entities.urls[].indices[] 3662possibly_sensitive 5339entities.media[].id 1736entities.media[].id_str 1736entities.media[].indices[] 1736entities.media[].media_url 1736entities.media[].media_url_https 1736entities.media[].url 1736entities.media[].display_url 1736entities.media[].expanded_url 1736entities.media[].type 1736entities.media[].sizes.thumb.w 1736entities.media[].sizes.thumb.h 1736entities.media[].sizes.thumb.resize 1736entities.media[].sizes.large.w 1736entities.media[].sizes.large.h 1736entities.media[].sizes.large.resize 1736entities.media[].sizes.small.w 1736entities.media[].sizes.small.h 1736entities.media[].sizes.small.resize 1736entities.media[].sizes.medium.w 1736entities.media[].sizes.medium.h 1736entities.media[].sizes.medium.resize 1736geo.type 1344geo.coordinates[] 1344coordinates.type 1344coordinates.coordinates[] 1344place.id 1336place.url 1336place.place_type 1336place.name 1336place.full_name 1336place.country_code 1336place.country 1336place.bounding_box.type 1336place.bounding_box.coordinates[][][] 1336entities.media[].source_status_id 621entities.media[].source_status_id_str 621retweeted_status.entities.user_mentions[].screen_name 1379retweeted_status.entities.user_mentions[].name 1379retweeted_status.entities.user_mentions[].id 1379retweeted_status.entities.user_mentions[].id_str 1379retweeted_status.entities.user_mentions[].indices[] 1379retweeted_status.entities.media[].id 609retweeted_status.entities.media[].id_str 609retweeted_status.entities.media[].indices[] 609retweeted_status.entities.media[].media_url 609retweeted_status.entities.media[].media_url_https 609retweeted_status.entities.media[].url 609retweeted_status.entities.media[].display_url 609retweeted_status.entities.media[].expanded_url 609retweeted_status.entities.media[].type 609retweeted_status.entities.media[].sizes.thumb.w 609retweeted_status.entities.media[].sizes.thumb.h 609retweeted_status.entities.media[].sizes.thumb.resize 609retweeted_status.entities.media[].sizes.medium.w 609retweeted_status.entities.media[].sizes.medium.h 609retweeted_status.entities.media[].sizes.medium.resize 609retweeted_status.entities.media[].sizes.large.w 609retweeted_status.entities.media[].sizes.large.h 609retweeted_status.entities.media[].sizes.large.resize 609retweeted_status.entities.media[].sizes.small.w 609retweeted_status.entities.media[].sizes.small.h 609retweeted_status.entities.media[].sizes.small.resize 609retweeted_status.place.id 235retweeted_status.place.url 235retweeted_status.place.place_type 235retweeted_status.place.name 235retweeted_status.place.full_name 235retweeted_status.place.country_code 235retweeted_status.place.country 235retweeted_status.place.bounding_box.type 235retweeted_status.place.bounding_box.coordinates[][][] 235retweeted_status.geo.type 231retweeted_status.geo.coordinates[] 231retweeted_status.coordinates.type 231retweeted_status.coordinates.coordinates[] 231retweeted_status.entities.media[].source_status_id 42retweeted_status.entities.media[].source_status_id_str 42place.attributes.street_address 2
About
Dump the Twitter stream to JSON and CSV, then apply filters, reject non-English content, do sentiment analysis, and more.