Characterization of Twitter Profiles, with an application to offline influence detection
- Copyright 2014-15 Jean-Valère Cossu, Nicolas Dugué & Vincent Labatut
TwitterInfluence is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information, see licence.txt.
- Lab site: http://lia.univ-avignon.fr
- GitHub repo: https://github.com/CompNet/Influence
- Contact: Jean-Valère Cossu <jean-valere.cossu@alumni.univ-avignon.fr>
These scripts extract features describing Twitter users from raw Twitter data (tweets, profile information, as well as external data). Once the features are extracted, various forms of SVMs are trained, and logistic regressions are performed, to classify and rank the users. These operations are conducted on different subgroups of features. The details of the process are given in [CDL'15] and [CLD'16]. The scripts were applied to the classification/ranking of Twitter users in terms of offline influence, based on the RepLab 2014 dataset.
Please cite [CLD'16] if you use our scripts. We would also be very interested to know your context of application and/or modification, so please let us know.
    @Article{Cossu2016,
      author  = {Cossu, Jean-Valère and Labatut, Vincent and Dugué, Nicolas},
      title   = {A Review of Features for the Discrimination of {Twitter} Users: Application to the Prediction of Offline Influence},
      journal = {Social Network Analysis and Mining},
      year    = {2016},
      volume  = {6},
      pages   = {25},
      doi     = {10.1007/s13278-016-0329-x},
    }
Note that the software may evolve depending on our future research work.
The project is composed of the following folders:
- `preprocessing`: programs related to the preparation of the data.
  - `retrieval`: Java classes used to retrieve additional Twitter and Klout data.
    - `Main.java`: retrieves additional features related to a Twitter profile.
    - `MainGetTweets.java`: retrieves the tweets posted by users.
- `cooccurrence`: programs aiming at extracting cooccurrence networks and the related features.
  - `net-extraction.R`: extracts the cooccurrence networks based on a collection of documents, each one corresponding to all the tweets published by a user (User-as-Document approach). The script also computes the vector features based on topological centrality measures, through the `igraph` library.
  - `net-distance.R`: processes the distance between each pair of cooccurrence networks (each network corresponds to a user).
  - `DistanceProcessor.java`: same as above, but in Java instead of R, and much faster. The program is multithreaded; by comparison, `SerialDistanceProcessor.java` performs the same processing in a non-parallel way.
- `processing`: programs implementing the classification and ranking tasks, as well as their evaluation.
  - `cosine_bot_xxx.pl`: scripts handling the Bag-of-Tweets based user classification. They compute the probability that each tweet of a given user was written by an influencer, then associate to each user an influence score according to two methods (`xxx` can be replaced by `Sum` or `Count`).
- `postprocessing`: programs used to analyze and convert the processing results.
  - `regression`: PLS-PM scripts, used to study more thoroughly the relation between certain features and the class.
    - `plspm4influence.R`: this script aims at finding relations between feature categories and at verifying the efficiency of the proposed influence conceptual model.
- `format`: scripts related to the format of the input data.
  - `format_trec_xxx(.multi).pl`: these scripts convert the RepLab data to the TREC format.
Here are the third-party software tools used in this version:
- Part of the SVM-based experiments were made using Multi-Class Support Vector Machine by Thorsten Joachims.
- `plspm4influence.R` relies on the `plspm` R package.
- `net-extraction.R` relies on the `igraph` R package.
- `net-distance.R` and `net-extraction.R` use the `foreach` and `doParallel` R packages.
- The Java project may be imported into any IDE (Eclipse, IntelliJ or NetBeans) and then built. It can also be built by running `javac MainGetTweets.java` and `javac Main.java`.
- All Perl scripts work on Windows (via Cygwin) and Unix systems with Perl 5, version 14, subversion 4; no additional module is required.
- To use Multi-Class Support Vector Machine, just download the binaries at the following address: http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html
- To use the R scripts, just install the appropriate packages in the usual way, using the command `install.packages("xxxx")` for package `xxxx` (a combined example is given below).
- Logistic regression, Random Forest, PCA and SVM can also be run using Python scripts with the scikit-learn library.
To get Twitter profile data, run `java Main fileWithToken fileWithAccounts fileToWriteAccounts` with:

- `fileWithToken`: a file containing Twitter tokens,
- `fileWithAccounts`: a file containing the Twitter accounts to crawl,
- `fileToWriteAccounts`: the name of the file in which to write the results.

To get the tweets posted by specific Twitter accounts, run `java MainGetTweets fileWithToken fileWithAccounts fileToWriteAccounts` with:

- `fileWithToken`: a file containing Twitter tokens,
- `fileWithAccounts`: a file containing the Twitter accounts to crawl,
- `fileToWriteAccounts`: the name of the file in which to write the results.
The `fileWithToken` file has to be written as follows, with one token per line (each line giving a Twitter account name followed by its token):

    twitterAccountName 1346266278-3orvewl5mfCO1xfEEt1gN064uWnjyNyGRDzHO6c1r65PNGNh62dUg28M8eyUJNkXekomzWNyguSXXqW6Q

The `fileWithAccounts` file has to contain one user id per line.
If you want to use the cooccurrence-based features, you first need to apply the `net-extraction.R` script. Before running it, you must edit the beginning of the script to set up the variables `in.folder` (input folder containing a collection of text documents, each one corresponding to the concatenation of all the tweets published by a given user), `out.folder` (output folder, in which all the produced data files will be placed) and `log.file` (see the sketch below). The produced files include the list of terms for the whole collection (`terms.txt`) and, for each user:

- the list of terms used by the user (`localterms.txt`),
- the user's term cooccurrence matrix (`cooccurrences.txt`),
- the cooccurrence network with all the nodal centrality measures, in Graphml format (`wordnetwork.graphml`),
- the same centrality vectors in a separate text file (`local.features.txt`),
- their average values over the whole network, as well as other global network measures such as the density (`global.features.txt`).
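Below is a minimal sketch of what this setup could look like, together with a rough illustration of the kind of processing performed for a single user; the folder names, the example file `user_123.txt`, and the use of consecutive-term pairs as cooccurrences are assumptions made for this example, not necessarily the choices of the actual script.

```R
## Hypothetical configuration, to adapt at the beginning of net-extraction.R
## (the paths below are placeholders, not the project defaults)
in.folder  <- "data/uad"           # one text file per user (all his/her tweets)
out.folder <- "out/cooccurrence"   # where terms.txt, *.graphml, etc. are written
log.file   <- "out/net-extraction.log"

## Rough illustration of the User-as-Document processing for one user, assuming
## cooccurrences are simply pairs of consecutive terms (the real script may use
## a different window and weighting)
library(igraph)
doc   <- tolower(scan(file.path(in.folder, "user_123.txt"), what = character(), quote = ""))
prs   <- cbind(head(doc, -1), tail(doc, -1))     # consecutive-term pairs
g     <- simplify(graph_from_edgelist(prs, directed = FALSE))

## Nodal centrality features, of the kind stored in local.features.txt
feats <- data.frame(
  term        = V(g)$name,
  degree      = degree(g),
  betweenness = betweenness(g),
  closeness   = closeness(g),
  eigenvector = eigen_centrality(g)$vector
)
write_graph(g, file.path(out.folder, "user_123_wordnetwork.graphml"), format = "graphml")
write.table(feats, file.path(out.folder, "user_123_local.features.txt"), row.names = FALSE)
```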
The distance between cooccurrence networks can be processed using either the `net-distance.R` R script or the `SerialDistanceProcessor.java` class (faster). For the R script, like before, you must set up the variables at the beginning. For the Java class, the same modifications must be performed in the `main` method.
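As a rough sketch of the kind of parallel pairwise computation involved (using the `foreach`/`doParallel` packages mentioned above), the snippet below computes a placeholder distance, a Jaccard distance between the vocabularies of two networks, which is not necessarily the measure implemented in `net-distance.R`:

```R
library(igraph)
library(foreach)
library(doParallel)

## Placeholder distance between two cooccurrence networks: Jaccard distance
## between their vocabularies (node sets); the actual measure may differ.
net.distance <- function(g1, g2) {
  v1 <- V(g1)$name
  v2 <- V(g2)$name
  1 - length(intersect(v1, v2)) / length(union(v1, v2))
}

files <- list.files("out/cooccurrence", pattern = "\\.graphml$", full.names = TRUE)
nets  <- lapply(files, read_graph, format = "graphml")

registerDoParallel(cores = 4)               # parallel backend
prs <- t(combn(length(nets), 2))            # all pairs of networks
dists <- foreach(k = seq_len(nrow(prs)), .combine = rbind, .packages = "igraph") %dopar% {
  i <- prs[k, 1]
  j <- prs[k, 2]
  c(net1 = i, net2 = j, distance = net.distance(nets[[i]], nets[[j]]))
}
```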
The `cosine_bot_xxx.pl` scripts expect two text files (training and test set) as input, formatted as follows: `tweet_id`, `user_id`, `domain_id`, `language`, 3 unused fields, `tweet_content`, `reference_tag` (for influence), and an unused field. The script loads the files into memory and builds the model, freeing memory as it runs (6 GB of RAM should be enough). The script only uses one core/thread, so its speed mainly depends on the maximum CPU frequency.
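To give a quick picture of this Bag-of-Tweets scheme (the actual implementation is in Perl and may use a different weighting), here is a toy R illustration: each tweet is scored by cosine similarity against the term-frequency profile of each training class, and a user's score is then obtained either by summing the influencer similarities (`Sum`) or by counting the tweets closer to the influencer profile (`Count`).

```R
## Toy illustration of the Bag-of-Tweets cosine scoring (assumption: raw
## term frequencies, no smoothing or idf weighting)
tokenize <- function(x) {
  w <- unlist(strsplit(tolower(x), "[^a-z0-9]+"))
  w[nzchar(w)]
}
term.vector <- function(texts, vocab) {
  as.numeric(table(factor(unlist(lapply(texts, tokenize)), levels = vocab)))
}
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)) + 1e-12)

## train.influencer / train.other: training tweets of each class (character vectors)
## user.tweets: the tweets of one test user (character vector)
score.user <- function(user.tweets, train.influencer, train.other, method = "Sum") {
  vocab    <- unique(tokenize(c(train.influencer, train.other, user.tweets)))
  prof.inf <- term.vector(train.influencer, vocab)     # influencer class profile
  prof.oth <- term.vector(train.other, vocab)          # non-influencer class profile
  scores   <- vapply(user.tweets, function(tw) {
    v <- term.vector(tw, vocab)
    c(inf = cosine(v, prof.inf), oth = cosine(v, prof.oth))
  }, c(inf = 0, oth = 0))
  if (method == "Sum") {
    sum(scores["inf", ])                      # Sum: add up the influencer similarities
  } else {
    sum(scores["inf", ] > scores["oth", ])    # Count: tweets closer to the influencer profile
  }
}
```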
The `cosine_uad_xxx.pl` scripts expect two text files (training and test set) as input, formatted as follows: `user_id`, `reference_tag` (for influence), an unused field, `user_document`, and the number of tweets in the selected domain/language (the domains are Automotive and Banking, and the languages English and Spanish, as in the RepLab 2014 dataset).
The `plspm4influence.R` script expects as input a text file whose first field is the user id and whose second field is the reference tag associated to each user (the field separator is a single tab); the next fields are the variables you target for analysis. You can access each variable in the source code through its column index. The first line (the header) contains the field IDs. To use `plspm4influence.R`, just run the source through R and query R on the variable you are interested in to get more details about it. The script itself produces the internal and external model figures. To select another domain, just change the data file name at the beginning of the script.
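As a rough sketch of how such an analysis is set up with the `plspm` package (the file name, block names, column indices and mode choices below are hypothetical, to be replaced by the actual feature categories of your data file):

```R
library(plspm)

## Hypothetical data layout: column 1 = user id, column 2 = reference tag,
## remaining columns = the feature variables, tab-separated with a header line
data <- read.table("automotive_features.txt", header = TRUE, sep = "\t")

## Inner (structural) model: a lower-triangular 0/1 matrix stating which latent
## variable influences which; here two hypothetical feature blocks explain Influence
Activity  <- c(0, 0, 0)
Network   <- c(0, 0, 0)
Influence <- c(1, 1, 0)
path.mat  <- rbind(Activity, Network, Influence)
colnames(path.mat) <- rownames(path.mat)

## Outer (measurement) model: column indices of the manifest variables of each block
blocks <- list(3:5, 6:8, 2)

fit <- plspm(data, path.mat, blocks, modes = rep("A", 3))
summary(fit)                    # path coefficients, loadings, R², ...
plot(fit)                       # inner model figure
plot(fit, what = "loadings")    # outer model figure
```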
Outputs from the `cosine_xxx.pl` scripts are translated to the TREC-EVAL tool format using the corresponding `format_trec_xxx.pl` script. The TREC-EVAL format is the following: `domain_id`, `unused_field`, `user_id`, `user_rank`, `score`, `system_name`.
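For illustration, a converted run file line could look as follows (hypothetical values, whitespace-separated as expected by trec_eval, with `Q0` as the conventional unused field):

```
automotive Q0 user_42 1 0.8731 cosine_bot_Sum
```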
Raw data are available through the official RepLab page: http://nlp.uned.es/replab2014/ (direct link: http://nlp.uned.es/replab2014/replab2014-dataset.tar.gz).
RepLab 2014 uses Twitter data in English and Spanish. The balance between both languages depends on the availability of data for each of the profiles included in the dataset.
The training dataset consists of 7,000 Twitter profiles (all with at least 1,000 followers) related to the Automotive and Banking domains; evaluation is performed separately for each domain. Each profile consists of (i) the author name, (ii) the profile URL and (iii) the last 600 tweets published by the author at crawling time, and has been manually labelled by reputation experts either as “opinion maker” (i.e. an author with reputational influence) or “non-opinion maker”. The objective is to find out which authors have more reputational influence (who the opinion makers are) and which profiles are less influential or have no influence at all.
Since the Twitter Terms of Service do not allow redistribution of tweets, only tweet IDs and screen names are provided. The RepLab organizers provide details about how to download the tweets.
The system outputs from our scripts for these data are freely available on Zenodo.
- [CLD'16] J.-V. Cossu, N. Dugué & V. Labatut. A Review of Features for the Discrimination of Twitter Users: Application to the Prediction of Offline Influence. Social Network Analysis and Mining, 6(1):25, 2016. DOI: 10.1007/s13278-016-0329-x - ⟨hal-01203171⟩
- [CDL'15] J.-V. Cossu, N. Dugué & V. Labatut. Detecting Real-World Influence Through Twitter. 2nd European Network Intelligence Conference (ENIC), 2015, 228-233. DOI: 10.1109/ENIC.2015.20 - ⟨hal-01164453⟩