Characterization of Twitter Profiles, with an application to offline influence detection
- Copyright 2014-15 Jean-Valère Cossu, Nicolas Dugué & Vincent Labatut
TwitterInfluence is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information, see licence.txt.
- Lab site: http://lia.univ-avignon.fr
- GitHub repo: https://github.com/CompNet/Influence
- Contact: Jean-Valère Cossu <jean-valere.cossu@alumni.univ-avignon.fr>
These scripts extract features describing Twitter users from raw Twitter data (tweets, profile information, as well as external data). Once the features are extracted, various forms of SVMs are trained, and logistic regressions are performed, to classify and rank the users. These operations are conducted on different subgroups of features. The details of the process are given in [CDL'15] and [CLD'16]. The scripts were applied to the classification/ranking of Twitter users in terms of offline influence, based on the RepLab 2014 dataset.
Please cite [CLD'16] if you use our scripts. We would also be very interested to know your context of application and/or modification, so please let us know.
    @Article{Cossu2016,
      author  = {Cossu, Jean-Valère and Labatut, Vincent and Dugué, Nicolas},
      title   = {A Review of Features for the Discrimination of {Twitter} Users: Application to the Prediction of Offline Influence},
      journal = {Social Network Analysis and Mining},
      year    = {2016},
      volume  = {6},
      pages   = {25},
      doi     = {10.1007/s13278-016-0329-x},
    }
Note that the software may evolve depending on our future research work.
The project is composed of the following folders:
- `preprocessing`: programs related to the preparation of the data.
  - `retrieval`: Java classes used to retrieve additional Twitter and Klout data.
    - `Main.java`: retrieves additional features related to a Twitter profile.
    - `MainGetTweets.java`: retrieves the tweets posted by users.
- `cooccurrence`: programs aiming at extracting cooccurrence networks and the related features.
  - `net-extraction.R`: extracts the cooccurrence networks based on a collection of documents, each one corresponding to all the tweets published by a user (User-as-Document approach). The script also computes the vector features based on topological centrality measures, through the `igraph` library.
  - `net-distance.R`: processes the distance between each pair of cooccurrence networks (each network corresponds to a user).
  - `DistanceProcessor.java`: same as above, but in Java instead of R, and much faster. The program is multithreaded; by comparison, `SerialDistanceProcessor.java` performs the same processing in a non-parallel way.
- `processing`: programs implementing the classification and ranking tasks, as well as their evaluation.
  - `cosine_bot_xxx.pl`: scripts handling the Bag-of-Tweets based user classification. They compute the probability that each tweet of a given user was written by an influencer, then associate to each user an influence score according to two methods (`xxx` can be replaced by `Sum` or `Count`).
- `postprocessing`: programs used to analyze and convert the processing results.
  - `regression`: PLS-PM scripts, used to study more thoroughly the relation between certain features and the class.
    - `plspm4influence.R`: this script aims at finding relations between feature categories and at verifying the efficiency of the proposed influence conceptual model.
- `format`: scripts related to the format of the input data.
  - `format_trec_xxx(.multi).pl`: these scripts convert the RepLab data to the TREC format.
Here are the third-party software tools used in this version:
- Part of the SVM-based experiments were made using Multi-Class Support Vector Machine by Thorsten Joachims.
- `plspm4influence.R` relies on the `plspm` R package.
- `net-extraction.R` relies on the `igraph` R package.
- `net-distance.R` and `net-extraction.R` use the `foreach` and `doParallel` R packages.
- The Java project may be imported into any IDE (Eclipse, IntelliJ or NetBeans) and then built. It can also be built by running `javac MainGetTweets.java` and `javac Main.java`.
- All Perl scripts work on Windows (via Cygwin) and Unix systems with Perl 5, version 14, subversion 4; no additional module is required.
- To use Multi-Class Support Vector Machine, just download the binaries at the following address: http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html
- To use the R scripts, just install the appropriate packages in the usual way, using the command `install.packages("xxxx")` for package `xxxx` (a combined example is given below).
- Logistic regression, Random Forest, PCA and SVM can also be run using Python scripts with the scikit-learn library.
To get Twitter profile data, run `java Main fileWithToken fileWithAccounts fileToWriteAccounts` with:

- `fileWithToken`: a file containing Twitter tokens,
- `fileWithAccounts`: a file containing the Twitter accounts to crawl,
- `fileToWriteAccounts`: the name of the file in which to write the results.

To get the tweets posted by specific Twitter accounts, run `java MainGetTweets fileWithToken fileWithAccounts fileToWriteAccounts` with:

- `fileWithToken`: a file containing Twitter tokens,
- `fileWithAccounts`: a file containing the Twitter accounts to crawl,
- `fileToWriteAccounts`: the name of the file in which to write the results.
The `fileWithToken` file has to be written as follows, with one token per line (each line giving a Twitter account name followed by its token):

    twitterAccountName 1346266278-3orvewl5mfCO1xfEEt1gN064uWnjyNyGRDzHO6c1r65PNGNh62dUg28M8eyUJNkXekomzWNyguSXXqW6Q

The `fileWithAccounts` file has to contain one user id per line.
If you want to use the cooccurrence-based features, you first need to apply the `net-extraction.R` script. Before running it, you must edit the beginning of the script to set up the variables `in.folder` (input folder containing a collection of text documents, each one corresponding to the concatenation of all the tweets published by a given user), `out.folder` (output folder, in which all the produced data files will be placed) and `log.file` (see the sketch below). The produced files include the list of terms for the whole collection (`terms.txt`) and, for each user:

- the list of terms used by the user (`localterms.txt`),
- the user's term cooccurrence matrix (`cooccurrences.txt`),
- the cooccurrence network with all the nodal centrality measures, in Graphml format (`wordnetwork.graphml`),
- the same centrality vectors in a separate text file (`local.features.txt`),
- their average values over the whole network, as well as other global network measures such as the density (`global.features.txt`).
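Below is a minimal sketch of what this setup could look like, together with a rough illustration of the kind of processing performed for a single user; the folder names, the example file `user_123.txt`, and the use of consecutive-term pairs as cooccurrences are assumptions made for this example, not necessarily the choices of the actual script.

```R
## Hypothetical configuration, to adapt at the beginning of net-extraction.R
## (the paths below are placeholders, not the project defaults)
in.folder  <- "data/uad"           # one text file per user (all his/her tweets)
out.folder <- "out/cooccurrence"   # where terms.txt, *.graphml, etc. are written
log.file   <- "out/net-extraction.log"

## Rough illustration of the User-as-Document processing for one user, assuming
## cooccurrences are simply pairs of consecutive terms (the real script may use
## a different window and weighting)
library(igraph)
doc   <- tolower(scan(file.path(in.folder, "user_123.txt"), what = character(), quote = ""))
prs   <- cbind(head(doc, -1), tail(doc, -1))     # consecutive-term pairs
g     <- simplify(graph_from_edgelist(prs, directed = FALSE))

## Nodal centrality features, of the kind stored in local.features.txt
feats <- data.frame(
  term        = V(g)$name,
  degree      = degree(g),
  betweenness = betweenness(g),
  closeness   = closeness(g),
  eigenvector = eigen_centrality(g)$vector
)
write_graph(g, file.path(out.folder, "user_123_wordnetwork.graphml"), format = "graphml")
write.table(feats, file.path(out.folder, "user_123_local.features.txt"), row.names = FALSE)
```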
The distance between cooccurrence networks can be processed using either the `net-distance.R` R script or the `SerialDistanceProcessor.java` class (faster). For the R script, like before, you must set up the variables at the beginning. For the Java class, the same modifications must be performed in the `main` method.
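As a rough sketch of the kind of parallel pairwise computation involved (using the `foreach`/`doParallel` packages mentioned above), the snippet below computes a placeholder distance, a Jaccard distance between the vocabularies of two networks, which is not necessarily the measure implemented in `net-distance.R`:

```R
library(igraph)
library(foreach)
library(doParallel)

## Placeholder distance between two cooccurrence networks: Jaccard distance
## between their vocabularies (node sets); the actual measure may differ.
net.distance <- function(g1, g2) {
  v1 <- V(g1)$name
  v2 <- V(g2)$name
  1 - length(intersect(v1, v2)) / length(union(v1, v2))
}

files <- list.files("out/cooccurrence", pattern = "\\.graphml$", full.names = TRUE)
nets  <- lapply(files, read_graph, format = "graphml")

registerDoParallel(cores = 4)               # parallel backend
prs <- t(combn(length(nets), 2))            # all pairs of networks
dists <- foreach(k = seq_len(nrow(prs)), .combine = rbind, .packages = "igraph") %dopar% {
  i <- prs[k, 1]
  j <- prs[k, 2]
  c(net1 = i, net2 = j, distance = net.distance(nets[[i]], nets[[j]]))
}
```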
The `cosine_bot_xxx.pl` scripts expect two text files (training and test set) as input, formatted as follows: `tweet_id`, `user_id`, `domain_id`, `language`, 3 unused fields, `tweet_content`, `reference_tag` (for influence), and an unused field. The script loads the files into memory and builds the model, freeing memory as it runs (6 GB of RAM should be enough). The script only uses one core/thread, so its speed mainly depends on the maximum CPU frequency.
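To give a quick picture of this Bag-of-Tweets scheme (the actual implementation is in Perl and may use a different weighting), here is a toy R illustration: each tweet is scored by cosine similarity against the term-frequency profile of each training class, and a user's score is then obtained either by summing the influencer similarities (`Sum`) or by counting the tweets closer to the influencer profile (`Count`).

```R
## Toy illustration of the Bag-of-Tweets cosine scoring (assumption: raw
## term frequencies, no smoothing or idf weighting)
tokenize <- function(x) {
  w <- unlist(strsplit(tolower(x), "[^a-z0-9]+"))
  w[nzchar(w)]
}
term.vector <- function(texts, vocab) {
  as.numeric(table(factor(unlist(lapply(texts, tokenize)), levels = vocab)))
}
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)) + 1e-12)

## train.influencer / train.other: training tweets of each class (character vectors)
## user.tweets: the tweets of one test user (character vector)
score.user <- function(user.tweets, train.influencer, train.other, method = "Sum") {
  vocab    <- unique(tokenize(c(train.influencer, train.other, user.tweets)))
  prof.inf <- term.vector(train.influencer, vocab)     # influencer class profile
  prof.oth <- term.vector(train.other, vocab)          # non-influencer class profile
  scores   <- vapply(user.tweets, function(tw) {
    v <- term.vector(tw, vocab)
    c(inf = cosine(v, prof.inf), oth = cosine(v, prof.oth))
  }, c(inf = 0, oth = 0))
  if (method == "Sum") {
    sum(scores["inf", ])                      # Sum: add up the influencer similarities
  } else {
    sum(scores["inf", ] > scores["oth", ])    # Count: tweets closer to the influencer profile
  }
}
```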
The `cosine_uad_xxx.pl` scripts expect two text files (training and test set) as input, formatted as follows: `user_id`, `reference_tag` (for influence), an unused field, `user_document`, and the number of tweets in the selected domain/language (the domains are Automotive and Banking, and the languages English and Spanish, as in the RepLab 2014 dataset).
The `plspm4influence.R` script expects as input a text file whose first field is the user id and whose second field is the reference tag associated to each user (the field separator is a single tab); the next fields are the variables you target for analysis. You can access each variable in the source code through its column index. The first line (the header) contains the field IDs. To use `plspm4influence.R`, just run the source through R and query R on the variable you are interested in to get more details about it. The script itself produces the internal and external model figures. To select another domain, just change the data file name at the beginning of the script.
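As a rough sketch of how such an analysis is set up with the `plspm` package (the file name, block names, column indices and mode choices below are hypothetical, to be replaced by the actual feature categories of your data file):

```R
library(plspm)

## Hypothetical data layout: column 1 = user id, column 2 = reference tag,
## remaining columns = the feature variables, tab-separated with a header line
data <- read.table("automotive_features.txt", header = TRUE, sep = "\t")

## Inner (structural) model: a lower-triangular 0/1 matrix stating which latent
## variable influences which; here two hypothetical feature blocks explain Influence
Activity  <- c(0, 0, 0)
Network   <- c(0, 0, 0)
Influence <- c(1, 1, 0)
path.mat  <- rbind(Activity, Network, Influence)
colnames(path.mat) <- rownames(path.mat)

## Outer (measurement) model: column indices of the manifest variables of each block
blocks <- list(3:5, 6:8, 2)

fit <- plspm(data, path.mat, blocks, modes = rep("A", 3))
summary(fit)                    # path coefficients, loadings, R², ...
plot(fit)                       # inner model figure
plot(fit, what = "loadings")    # outer model figure
```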
Outputs from the `cosine_xxx.pl` scripts are translated to the TREC-EVAL tool format using the corresponding `format_trec_xxx.pl` script. The TREC-EVAL format is the following: `domain_id`, `unused_field`, `user_id`, `user_rank`, `score`, `system_name`.
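For illustration, a converted run file line could look as follows (hypothetical values, whitespace-separated as expected by trec_eval, with `Q0` as the conventional unused field):

```
automotive Q0 user_42 1 0.8731 cosine_bot_Sum
```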
Raw data are available through the official RepLab page: http://nlp.uned.es/replab2014/ (direct link: http://nlp.uned.es/replab2014/replab2014-dataset.tar.gz).
RepLab 2014 uses Twitter data in English and Spanish. The balance between both languages depends on the availability of data for each of the profiles included in the dataset.
The training dataset consists of 7,000 Twitter profiles (all with at least 1,000 followers) related to the Automotive and Banking domains; evaluation is performed separately for each domain. Each profile consists of (i) the author name, (ii) the profile URL and (iii) the last 600 tweets published by the author at crawling time, and has been manually labelled by reputation experts either as “opinion maker” (i.e. an author with reputational influence) or “non-opinion maker”. The objective is to find out which authors have more reputational influence (who the opinion makers are) and which profiles are less influential or have no influence at all.
Since the Twitter Terms of Service do not allow redistribution of tweets, only tweet IDs and screen names are provided. The RepLab organizers provide details about how to download the tweets.
The system outputs from our scripts for these data are freely available on Zenodo.
- [CLD'16] J.-V. Cossu, N. Dugué & V. Labatut. A Review of Features for the Discrimination of Twitter Users: Application to the Prediction of Offline Influence. Social Network Analysis and Mining, 6(1):25, 2016. DOI: 10.1007/s13278-016-0329-x - ⟨hal-01203171⟩
- [CDL'15] J.-V. Cossu, N. Dugué & V. Labatut. Detecting Real-World Influence Through Twitter. 2nd European Network Intelligence Conference (ENIC), 2015, 228-233. DOI: 10.1109/ENIC.2015.20 - ⟨hal-01164453⟩