cantino/twitter_to_csvPublic

NotificationsYou must be signed in to change notification settings
Fork15
Star64

Dump the Twitter stream to JSON and CSV, then apply filters, reject non-English content, do sentiment analysis, and more.

License

MIT license

64 stars 15 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
bin		bin
lib		lib
spec		spec
.gitignore		.gitignore
.rspec		.rspec
.ruby-gemset		.ruby-gemset
.ruby-version		.ruby-version
.travis.yml		.travis.yml
Gemfile		Gemfile
LICENSE.txt		LICENSE.txt
README.markdown		README.markdown
Rakefile		Rakefile
twitter_to_csv.gemspec		twitter_to_csv.gemspec

Repository files navigation

Twitter To CSV

A tool for exporting the Twitter stream into a CSV file.

(sudo) gem install twitter_to_csv

Publications

Perspectives on Acne: What Twitter Can Teach Health Care Providers Shive M, Bhatt M, Cantino A, Kvedar J, Jethwani K., JAMA Dermatol. 2013;149(5):621-622. doi:10.1001/jamadermatol.2013.248.

Twitter: an opportunity for public health campaignsMackenzie R Wehner, Mary-Margaret Chren, Melissa L Shive, Jack S Resneck, Sherry Pagoto, Andrew B Seidenberg, Eleni Linos, The Lancet 12 July 2014 (Volume 384 Issue 9938 Pages 131-132 DOI: 10.1016/S0140-6736(14)61161-2)

Usage

Quick Example

You might start by first running the script for a while to dump the Twitter stream into a JSON file:

twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \               --access-token <your twitter access token> --access-token-secret <your twitter access token secret> \               --json out.json --filter your,keywords,of,interest

Then, later, you could export to CSV:

twitter_to_csv --replay-from-file out.json --csv out.csv \               --fields text,created_at,user.name,retweeted_status.id,retweeted_status.favorited,...

Alternatively, you can always stream directly to CSV:

twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \               --access-token <your twitter access token> --access-token-secret <your twitter access token secret> \               --filter your,keywords,of,interest --csv out.csv \               --fields text,created_at,user.name,retweeted_status.id,retweeted_status.favorited,...

Getting your Twitter API Key and Token

Twitter requires all API access over oAuth. Follow these instructions to get register and authorize a free Twitter API Application:

Visithttps://apps.twitter.com/app/new
Enter something like "my twitter_to_csv" as the name, "Using twitter_to_csv Ruby gem" as the description, and "https://github.com/cantino/twitter_to_csv" as the website.
Click "manage API keys"
Copy down the "API key" and "API secret"
Click "Create my access token" at the bottom of the page, refreshing until your new access token shows up at the bottom of the page. Copy down your "Access token" and "Access token secret".
You're all set!

Requiring English

You may want to limit to Tweets that appear to be written in English.

twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \               --access-token <your twitter access token> --access-token-secret <your twitter access token secret> \               --require-english --fields ...

This filter isn't perfect and will have both false positives and false negatives, but it works fairly well.

URLS, Hashtags, and User Mentions

You can extract URLs, Hashtags, and User Mentions from the tweet into their own columns by using--url-columns,--hashtag-columns, and--user-mention-columns. For example, you could use--url-columns 3 to get up to 3 extracted URLs in their own columns.

Sentiment Tagging

Twitter To CSV can compute an average sentiment score for each tweet. Provide--compute-sentiment to use this feature. TheAFINN-111 valence database is used to look up the valence of each recognized word, then the average is computed, only considering words that have some known valence associated. That is, "I love cheese" only has one word with valence, "love" with a score of 3, so the average is 3. "I love cheese and like bread", on the other hand, has two words with valence, "love" (3) and "like" (2), and so has an average valence of (3 + 2) / 2, or 2.5. The library will break hyphenated words up and score them as separate words unless the whole thing has a single known valence.

Handling of Retweets

Once you have a recorded Twitter stream, you can rollup retweets in various ways. Here is an example that collapses retweets into theretweet_count field of the original tweet, only outputs tweets with at least 1 retweet, ignores retweets that happened more than 7 days after the original tweet, and outputs retweet count columns at half an hour, 2 hours, and 2 days after the original tweet:

twitter_to_csv --replay-from-file out.json -c out.csv \               --retweet-mode rollup \               --retweet-threshold 1 \               --retweet-window 7 \               --retweet-counts-at 0.5,2,48 \               --fields retweet_count,text

Note that all of the retweet features require you to--replay-from-file because they parse the stream backwards. They WILL NOT function correctly when you're reading directly from the stream .

Selecting Windows

To select a specific window of time in a pre-recorded stream bycreated_at, pass in--start and--end, for example:

twitter_to_csv --replay-from-file out.json \               --start "Mon Mar 07 07:42:22 +0000 2011" \               --end "Mon Mar 08 07:42:22 +0000 2011" \               ...

Mind the Gap

Sometimes the Twitter API goes down. You can analyze a json output file to see where data gaps (of over 10 minutes, in this case) have occurred:

twitter_to_csv --replay-from-file out.json --analyze-gaps 10

Creating Dynamic Binary Fields

If you're doing research on Twitter, it may make sense to construct composite binary variables predicated on the existence of certain tokens in a tweet. For example, you may wanta variable calledsf that is true if the tweet containssan francisco ORsf ORbay area (but notarea by the bay). You could do that as follows:

twitter_to_csv --replay-from-file out.json \               --csv out.csv \               -w "sf: san francisco OR sf OR bay area" \

If you did want to allowarea by the bay, you might need something like:

               -w "sf: san francisco OR sf OR (bay AND area)"

Other Options

Usetwitter_to_csv --help to see all available options:

Usage: twitter_to_csv [options]These four fields are required.  Please see the README to learn how to get them for your Twitter account.        --api-key KEY                Twitter API key        --api-secret SECRET          Twitter API secret        --access-token TOKEN         Twitter access token        --access-token-secret SECRET Twitter access token secretGeneral settings:    -c, --csv FILE                   The CSV file to append to, or - for STDOUT    -j, --json FILE                  The JSON file to append to, or - for STDOUT    -f, --filter KEYWORDS            Keywords to ask Twitter to filter on    -x, --fields FIELDS              Fields to include in the CSV        --date-fields FIELD_NAMES    Break these fields into separate numerical columns for weekday, day, month, your, hour, minute, and second.    -e, --require-english [STRATEGY] Attempt to filter out non-English tweets. This will have both false positives and false negatives.                                     The strategy can be either 'uld' to use the UnsupervisedLanguageDetection Ruby gem,                                     'lang' to use Twitter's guessed 'lang' attribute, or 'both' to only remove tweets that                                     both Twitter and ULD think are non-English.  This is most conservative and is the default.    -v, --[no-]verbose               Run verbosely    -r, --replay-from-file FILENAME  Replay tweets from a JSON dump file        --analyze-gaps MINUTES       Look at the stream and display gap information for gaps longer than MINUTES        --sample-fields NUMBER_OF_SAMPLES                                     Record NUMBER_OF_SAMPLES tweets and then print out all                                     of the field names seen.  Use to find out what can be passed to.        --url-columns NUM_COLUMNS    Extract up to NUM_COLUMNS urls from the status and include them in the CSV        --hash-columns NUM_COLUMNS   Extract up to NUM_COLUMNS hashtags (#foo) from the status and include them in the CSV        --user-columns NUM_COLUMNS   Extract up to NUM_COLUMNS user mentions (@foo) from the status and include them in the CSV    -s, --compute-sentiment          Compute an average sentiment score for each status using the AFINN-111 sentiment dictionary        --compute-word-count         Include a word count for each status in the output CSV        --normalize-source           Return just the domain name from the Tweet source (i.e., tweetdeck, facebook)        --remove-quotes              This option strips all double quotes from the output to help some CSV parsers.        --prefix-ids                 Prefix any field ending in _id or _id_str with 'id' to force parsing as a string in some programs.    -w "NAME:WORD AND WORD AND WORD",        --bool-word-field            Create a named CSV column that is true when the word expression matches, false otherwise.                                     Word expressions are boolean expressions where neighboring words must occur sequentially                                     and you can use parentheses, AND, and OR to test for occurrence relationships.  Examples:                                       keyword_any:tanning booth OR tanning booths OR tanningbooth                                       keyword_both:tanning AND booth                                       keyword_complex:tanning AND (booth OR bed)                                     This option can be used multiple times.        --start TIME                 Ignore tweets with a created_at earlier than TIME        --end TIME                   Ignore tweets with a created_at later than TIMEIf you would like to do special retweet handling, use the following options.For these to function, you must be using --replay-from-file.  The replay will be performed in reverse.        --retweet-mode MODE          Determine how to handle retweets                                     Options are just 'ROLLUP'        --retweet-threshold COUNT    Only consider statuses with at least COUNT retweets        --retweet-window WINDOW      Ignore retweets that occur beyond WINDOW days                                     Additionally, statuses where WINDOW days have not yet passed will be ignored.        --retweet-counts-at HOURS    Output the number of retweets seen at specific times after the original tweet    -h, --help                       Show this message        --version                    Show version

Field names

Use--sample-fields 1000 to output the occurrence count of different Twitter fields, like so:

twitter_to_csv --api-key <your twitter api key> --api-secret <your twitter api secret> \               --access-token <your twitter access token> --access-token-secret <your twitter access token secret> \               --sample-fields 1000

Here's a list of fields and their occurrences in a 50,000 tweet dataset:

id                                                            50000id_str                                                        50000created_at                                                    50000text                                                          50000source                                                        50000truncated                                                     50000in_reply_to_status_id                                         50000in_reply_to_status_id_str                                     50000in_reply_to_user_id                                           50000in_reply_to_user_id_str                                       50000in_reply_to_screen_name                                       50000user.id                                                       50000user.id_str                                                   50000user.name                                                     50000user.screen_name                                              50000user.location                                                 50000user.url                                                      50000user.description                                              50000user.protected                                                50000user.followers_count                                          50000user.friends_count                                            50000user.listed_count                                             50000user.created_at                                               50000user.favourites_count                                         50000user.utc_offset                                               50000user.time_zone                                                50000user.geo_enabled                                              50000user.verified                                                 50000user.statuses_count                                           50000user.lang                                                     50000user.contributors_enabled                                     50000user.is_translator                                            50000user.profile_background_color                                 50000user.profile_background_image_url                             50000user.profile_background_image_url_https                       50000user.profile_background_tile                                  50000user.profile_image_url                                        50000user.profile_image_url_https                                  50000user.profile_banner_url                                       41614user.profile_link_color                                       50000user.profile_sidebar_border_color                             50000user.profile_sidebar_fill_color                               50000user.profile_text_color                                       50000user.profile_use_background_image                             50000user.default_profile                                          50000user.default_profile_image                                    50000user.following                                                50000user.follow_request_sent                                      50000user.notifications                                            50000geo                                                           48656coordinates                                                   48656place                                                         48664contributors                                                  50000retweet_count                                                 50000favorite_count                                                50000favorited                                                     50000retweeted                                                     50000filter_level                                                  50000lang                                                          36041entities.hashtags[].text                                      11024entities.hashtags[].indices[]                                 11024entities.user_mentions[].screen_name                          22368entities.user_mentions[].name                                 22368entities.user_mentions[].id                                   22368entities.user_mentions[].id_str                               22368entities.user_mentions[].indices[]                            22368retweeted_status.created_at                                   13959retweeted_status.id                                           13959retweeted_status.id_str                                       13959retweeted_status.text                                         13959retweeted_status.source                                       13959retweeted_status.truncated                                    13959retweeted_status.in_reply_to_status_id                        13959retweeted_status.in_reply_to_status_id_str                    13959retweeted_status.in_reply_to_user_id                          13959retweeted_status.in_reply_to_user_id_str                      13959retweeted_status.in_reply_to_screen_name                      13959retweeted_status.user.id                                      13959retweeted_status.user.id_str                                  13959retweeted_status.user.name                                    13959retweeted_status.user.screen_name                             13959retweeted_status.user.location                                13959retweeted_status.user.url                                     13959retweeted_status.user.description                             13959retweeted_status.user.protected                               13959retweeted_status.user.followers_count                         13959retweeted_status.user.friends_count                           13959retweeted_status.user.listed_count                            13959retweeted_status.user.created_at                              13959retweeted_status.user.favourites_count                        13959retweeted_status.user.utc_offset                              13959retweeted_status.user.time_zone                               13959retweeted_status.user.geo_enabled                             13959retweeted_status.user.verified                                13959retweeted_status.user.statuses_count                          13959retweeted_status.user.lang                                    13959retweeted_status.user.contributors_enabled                    13959retweeted_status.user.is_translator                           13959retweeted_status.user.profile_background_color                13959retweeted_status.user.profile_background_image_url            13959retweeted_status.user.profile_background_image_url_https      13959retweeted_status.user.profile_background_tile                 13959retweeted_status.user.profile_image_url                       13959retweeted_status.user.profile_image_url_https                 13959retweeted_status.user.profile_banner_url                      11028retweeted_status.user.profile_link_color                      13959retweeted_status.user.profile_sidebar_border_color            13959retweeted_status.user.profile_sidebar_fill_color              13959retweeted_status.user.profile_text_color                      13959retweeted_status.user.profile_use_background_image            13959retweeted_status.user.default_profile                         13959retweeted_status.user.default_profile_image                   13959retweeted_status.user.following                               13959retweeted_status.user.follow_request_sent                     13959retweeted_status.user.notifications                           13959retweeted_status.geo                                          13728retweeted_status.coordinates                                  13728retweeted_status.place                                        13724retweeted_status.contributors                                 13959retweeted_status.retweet_count                                13959retweeted_status.favorite_count                               13959retweeted_status.entities.hashtags[].text                     2438retweeted_status.entities.hashtags[].indices[]                2438retweeted_status.entities.urls[].url                          361retweeted_status.entities.urls[].expanded_url                 361retweeted_status.entities.urls[].display_url                  361retweeted_status.entities.urls[].indices[]                    361retweeted_status.favorited                                    13959retweeted_status.retweeted                                    13959retweeted_status.possibly_sensitive                           916retweeted_status.lang                                         13959entities.urls[].url                                           3662entities.urls[].expanded_url                                  3662entities.urls[].display_url                                   3662entities.urls[].indices[]                                     3662possibly_sensitive                                            5339entities.media[].id                                           1736entities.media[].id_str                                       1736entities.media[].indices[]                                    1736entities.media[].media_url                                    1736entities.media[].media_url_https                              1736entities.media[].url                                          1736entities.media[].display_url                                  1736entities.media[].expanded_url                                 1736entities.media[].type                                         1736entities.media[].sizes.thumb.w                                1736entities.media[].sizes.thumb.h                                1736entities.media[].sizes.thumb.resize                           1736entities.media[].sizes.large.w                                1736entities.media[].sizes.large.h                                1736entities.media[].sizes.large.resize                           1736entities.media[].sizes.small.w                                1736entities.media[].sizes.small.h                                1736entities.media[].sizes.small.resize                           1736entities.media[].sizes.medium.w                               1736entities.media[].sizes.medium.h                               1736entities.media[].sizes.medium.resize                          1736geo.type                                                      1344geo.coordinates[]                                             1344coordinates.type                                              1344coordinates.coordinates[]                                     1344place.id                                                      1336place.url                                                     1336place.place_type                                              1336place.name                                                    1336place.full_name                                               1336place.country_code                                            1336place.country                                                 1336place.bounding_box.type                                       1336place.bounding_box.coordinates[][][]                          1336entities.media[].source_status_id                             621entities.media[].source_status_id_str                         621retweeted_status.entities.user_mentions[].screen_name         1379retweeted_status.entities.user_mentions[].name                1379retweeted_status.entities.user_mentions[].id                  1379retweeted_status.entities.user_mentions[].id_str              1379retweeted_status.entities.user_mentions[].indices[]           1379retweeted_status.entities.media[].id                          609retweeted_status.entities.media[].id_str                      609retweeted_status.entities.media[].indices[]                   609retweeted_status.entities.media[].media_url                   609retweeted_status.entities.media[].media_url_https             609retweeted_status.entities.media[].url                         609retweeted_status.entities.media[].display_url                 609retweeted_status.entities.media[].expanded_url                609retweeted_status.entities.media[].type                        609retweeted_status.entities.media[].sizes.thumb.w               609retweeted_status.entities.media[].sizes.thumb.h               609retweeted_status.entities.media[].sizes.thumb.resize          609retweeted_status.entities.media[].sizes.medium.w              609retweeted_status.entities.media[].sizes.medium.h              609retweeted_status.entities.media[].sizes.medium.resize         609retweeted_status.entities.media[].sizes.large.w               609retweeted_status.entities.media[].sizes.large.h               609retweeted_status.entities.media[].sizes.large.resize          609retweeted_status.entities.media[].sizes.small.w               609retweeted_status.entities.media[].sizes.small.h               609retweeted_status.entities.media[].sizes.small.resize          609retweeted_status.place.id                                     235retweeted_status.place.url                                    235retweeted_status.place.place_type                             235retweeted_status.place.name                                   235retweeted_status.place.full_name                              235retweeted_status.place.country_code                           235retweeted_status.place.country                                235retweeted_status.place.bounding_box.type                      235retweeted_status.place.bounding_box.coordinates[][][]         235retweeted_status.geo.type                                     231retweeted_status.geo.coordinates[]                            231retweeted_status.coordinates.type                             231retweeted_status.coordinates.coordinates[]                    231retweeted_status.entities.media[].source_status_id            42retweeted_status.entities.media[].source_status_id_str        42place.attributes.street_address                               2

About

Dump the Twitter stream to JSON and CSV, then apply filters, reject non-English content, do sentiment analysis, and more.

Releases

9tags

Packages

No packages published

Languages

Ruby100.0%

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Twitter To CSV

Publications

Usage

Quick Example

Getting your Twitter API Key and Token

Requiring English

URLS, Hashtags, and User Mentions

Sentiment Tagging

Handling of Retweets

Selecting Windows

Mind the Gap

Creating Dynamic Binary Fields

Other Options

Field names

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Languages

Movatterモバイル変換

License

cantino/twitter_to_csv

Folders and files

Latest commit

History

Repository files navigation

Twitter To CSV

Publications

Usage

Quick Example

Getting your Twitter API Key and Token

Requiring English

URLS, Hashtags, and User Mentions

Sentiment Tagging

Handling of Retweets

Selecting Windows

Mind the Gap

Creating Dynamic Binary Fields

Other Options

Field names

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Languages

Packages