R-package for text mining with the Corpus Workbench (CWB) as backend

Introducing polmineR

Motivation

Purpose: The focus of the package ‘polmineR’ is the interactive analysis of corpora using R. Core objectives for the development of the package are performance, usability, and a modular design.

Aims: Key aims for developing the package are:

To keep the original text accessible. A seamless integration of qualitative and quantitative steps in corpus analysis supports validation, based on inspecting the text behind the numbers.
To provide a library with standard tasks. It is an open source platform that will make text mining more productive, avoiding prohibitive costs to reimplement basics, or to run many lines of code to perform a basic task.
To create a package that makes the creation and analysis of subcorpora (‘partitions’) easy. A particular strength of the package is to support contrastive/comparative research.
To offer performance for users with a standard infrastructure. The package picks up the idea of a three-tier software design. Corpus data are managed and indexed by using the Open Corpus Workbench (CWB). The CWB is particularly efficient for storing large corpora and offers a powerful language for querying corpora, the Corpus Query Processor (CQP).
To support sharing consolidated and documented data, following the ideas of reproducible research.

Background: The polmineR package was specifically developed to make full use of the XML annotation structure of the corpora created in the PolMine project (see polmine.sowi.uni-due.de). The core PolMine corpora are corpora of plenary protocols. In these corpora, speakers, parties etc. are structurally annotated. The polmineR package is meant to help make full use of this rich annotation structure.

Core polmineR functionality

To demonstrate the core functionality of the package, we load polmineR.

library(polmineR)

The package includes two small sample corpora (REUTERS and GERMAPARLMINI). Here we want to use somewhat bigger “real life” corpora (Europarl and GermaParl). The cwbtools package offers an installation mechanism, so we install this package first.

install.packages("cwbtools")

We now install Europarl …

europarl <- "http://corpora.linguistik.uni-erlangen.de/demos/download/Europarl3-CWB-2010-02-28.tar.gz"
cwbtools::corpus_install(tarball = europarl)

… and the GermaParl corpus of parliamentary debates.

cwbtools::corpus_install(doi="10.5281/zenodo.3742113")

partition (and partition_bundle)

All methods can be applied to a whole corpus, as well as to partitions (i.e. subcorpora). Use the metadata of a corpus (so-called s-attributes) to define a subcorpus.

ep2006 <- partition("EUROPARL-EN", text_year = "2006")
#> ... get encoding: latin1
#> ... get cpos and strucs
size(ep2006)
#> [1] 3100529

barroso <- partition("EUROPARL-EN", speaker_name = "Barroso", regex = TRUE)
#> ... get encoding: latin1
#> ... get cpos and strucs
size(barroso)
#> [1] 98142

Partitions can be bundled into partition_bundle objects, and most methods can be applied to a whole corpus, a partition, or a partition_bundle object alike. Consult the package vignette to learn more.
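The same workflow can be tried without downloading Europarl. The following is a minimal sketch that assumes the REUTERS sample corpus shipped with the package is available, and that it carries s-attributes named id and places. Those names are assumptions; run s_attributes() first to see what your installation offers.

```r
library(polmineR)
use("polmineR")  # activate the sample corpora shipped with the package

# List the s-attributes (metadata fields) available for defining subcorpora
s_attributes("REUTERS")

# Inspect the values of one s-attribute ("places" is assumed to exist)
head(s_attributes("REUTERS", "places"))

# A single partition: all documents whose "places" value matches "kuwait"
kuwait <- partition("REUTERS", places = "kuwait", regex = TRUE)
size(kuwait)

# A partition_bundle: one partition per value of the "id" s-attribute
articles <- partition_bundle("REUTERS", s_attribute = "id", progress = FALSE, verbose = FALSE)
length(articles)
```

Looking at the available values before calling partition() helps to avoid empty subcorpora caused by a misspelt attribute value.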

count (using CQP syntax)

Counting occurrences of a feature in a corpus, a partition or in the partitions of a partition_bundle is a basic operation. By offering access to the query syntax of the Corpus Query Processor (CQP), the polmineR package exposes a query syntax that goes far beyond regular expressions. See the CQP documentation to learn more.

count("EUROPARL-EN", "France")
#>     query count         freq
#> 1: France  5517 0.0001399122

count("EUROPARL-EN", c("France", "Germany", "Britain", "Spain", "Italy", "Denmark", "Poland"))
#>      query count         freq
#> 1:  France  5517 1.399122e-04
#> 2: Germany  4196 1.064114e-04
#> 3: Britain  1708 4.331523e-05
#> 4:   Spain  3378 8.566676e-05
#> 5:   Italy  3209 8.138089e-05
#> 6: Denmark  1615 4.095673e-05
#> 7:  Poland  1820 4.615557e-05

count("EUROPARL-EN", '"[pP]opulism"')
#>            query count         freq
#> 1: "[pP]opulism"   107 2.713542e-06
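To see how the count and freq columns relate, here is a small sketch on the REUTERS sample corpus. It assumes that the sample corpus is available and that freq is the count divided by the corpus size, which is what the output above suggests; the query terms are arbitrary choices for illustration.

```r
library(polmineR)
use("polmineR")  # activate the sample corpora shipped with the package

# Plain token lookups: one row per query, with columns query, count and freq
cnt <- count("REUTERS", query = c("oil", "barrel"))
cnt

# Under the assumption stated above, this reproduces the freq column
cnt[["count"]] / size("REUTERS")

# A CQP query: single quotes delimit the R string, double quotes the CQP token regex
count("REUTERS", query = '"barrel.*"', cqp = TRUE)
```

Setting cqp = TRUE explicitly makes the intent clear to readers of the code.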

dispersion (across one or two dimensions)

The dispersion method is there to analyse the dispersion of a query, or a set of queries, across one or two dimensions (absolute and relative frequencies). The CQP syntax can be used.

populism <- dispersion("EUROPARL-EN", "populism", s_attribute = "text_year", progress = FALSE)
pop_regex <- dispersion("EUROPARL-EN", '"[pP]opulism"', s_attribute = "text_year", cqp = TRUE, progress = FALSE)
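A sketch of the difference between absolute and relative frequencies, using the GERMAPARLMINI sample corpus. The s-attribute name date, the query term and the freq argument are assumptions that may not hold for every package version; check s_attributes("GERMAPARLMINI") and the dispersion help page before relying on them.

```r
library(polmineR)
use("polmineR")  # activate the sample corpora shipped with the package

# Absolute counts of a query, broken down by one s-attribute
by_date <- dispersion("GERMAPARLMINI", query = "Integration", s_attribute = "date", progress = FALSE)
by_date

# Relative frequencies instead of raw counts (freq = TRUE is assumed to be supported)
by_date_rel <- dispersion("GERMAPARLMINI", query = "Integration", s_attribute = "date", freq = TRUE, progress = FALSE)
by_date_rel
```

Relative frequencies are usually the better basis for comparison when the subcorpora differ in size.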

cooccurrences (to analyse collocations)

The cooccurrences method is used to analyse the context of a query (including some statistics).

islam <- cooccurrences("EUROPARL-EN", query = 'Islam', left = 10, right = 10)
islam <- subset(islam, rank_ll <= 100)
dotplot(islam)

features (keyword extraction)

Compare partitions to identify features / keywords (using statistical tests such as chi-square).

library(magrittr)  # provides the %>% pipe used below
ep_2002 <- partition("EUROPARL-EN", text_year = "2002", p_attribute = "word")
ep_pre_2002 <- partition("EUROPARL-EN", text_year = 1997:2001, p_attribute = "word")
features(ep_2002, ep_pre_2002, included = FALSE) %>%
  subset(rank_chisquare <= 10) %>%
  format() %>%
  knitr::kable(format = "markdown")

rank_chisquare	word	count_coi	count_ref	exp_coi	chisquare
1	2002	1694	782	398.96	5011.70
2	Johannesburg	479	21	80.57	2348.97
3	Seville	378	26	65.10	1792.96
4	Barcelona	706	528	198.84	1542.16
5	’s	10694	36727	7641.03	1457.07
6	2003	549	329	141.47	1399.45
7	Copenhagen	575	430	161.94	1256.06
8	terrorism	1221	1917	505.63	1206.67
9	02	233	2	37.87	1198.75
10	candidate	1217	2088	532.54	1048.84
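To make the exp_coi and chisquare columns easier to read, here is a sketch of how an expected count and a Pearson chi-square statistic can be derived from a 2x2 contingency table. It is a textbook calculation in base R and not necessarily polmineR's own implementation. The function name chisq_keyness is hypothetical, and the two corpus sizes in the usage line are made up, so the result will not match the table.

```r
# Pearson chi-square for one token, comparing a corpus of interest (coi)
# with a reference corpus (ref). Hypothetical helper, base R only.
chisq_keyness <- function(count_coi, count_ref, size_coi, size_ref) {
  # Observed 2x2 table: rows = token / all other tokens, columns = coi / ref
  observed <- matrix(
    c(count_coi, size_coi - count_coi,
      count_ref, size_ref - count_ref),
    nrow = 2,
    dimnames = list(c("token", "other"), c("coi", "ref"))
  )
  # Expected counts under independence: row total * column total / grand total
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  list(
    exp_coi = expected["token", "coi"],
    chisquare = sum((observed - expected)^2 / expected)
  )
}

# Counts taken from the first table row; the two corpus sizes are made up
res <- chisq_keyness(count_coi = 1694, count_ref = 782, size_coi = 4e6, size_ref = 2e7)
res
```

A token scores high when its observed count in the corpus of interest is far from the count expected if it were spread evenly across both corpora.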

kwic (also known as concordances)

So what happens in the context of a word, or a CQP query? To attain valid research results, reading will often be necessary. The kwic method will help: it uses the conveniences of DataTables, with output shown in the Viewer pane of RStudio.

kwic("EUROPARL-EN", "Islam", meta = c("text_date", "speaker_name")) %>%
  as.data.frame() %>%
  .[1:8, ] %>%
  knitr::kable(format = "markdown", escape = FALSE)

meta	left	node	right
1996-05-09 Oostlander	, as for example with	Islam	here in Europe , so
1996-05-09 Féret	promotion of the study of	Islam	in Europe ’ , with
1996-05-09 Féret	seem to have forgotten that	Islam	makes no distinction between spiritual
1996-06-05 von Habsburg	, the old arguments against	Islam	are trotted out time and
1996-06-05 von Habsburg	various shades of opinion within	Islam	must not simply be lumped
1996-06-05 von Habsburg	there are various groups within	Islam	and that many of them
1996-07-17 Blot	represented by the growth of	Islam	to the south and east
1996-09-18 Stirbois	rushing into the arms of	Islam	. A fortnight later ,
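Outside RStudio, or when the hits should be processed further, the concordances can be handled as a plain table. A minimal sketch on the REUTERS sample corpus, assuming the sample corpus is available; the query and the window size are arbitrary choices.

```r
library(polmineR)
use("polmineR")  # activate the sample corpora shipped with the package

# Concordances for a query, with five tokens of context on each side
oil <- kwic("REUTERS", query = "oil", left = 5, right = 5)

# Convert to a plain data.frame to inspect or export the hits
oil_df <- as.data.frame(oil)
head(oil_df[, c("left", "node", "right")])
```

The left, node and right columns correspond to the columns of the table above.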

read (the full text)

Corpus analysis involves moving from text to numbers, and back again. Use the read method to inspect the full text of a partition (a speech given by Chancellor Angela Merkel in this case).

merkel <- partition("GERMAPARL", speaker = "Angela Merkel", date = "2013-09-03")
read(merkel)

as.TermDocumentMatrix (for text mining purposes)

Many advanced methods in text mining require term document matrices as input. Based on the metadata of a corpus, these data structures can be obtained in a fast and flexible manner, for performing topic modelling, machine learning etc.

speakers <- partition_bundle("EUROPARL-EN", s_attribute = "speaker_id", progress = FALSE, verbose = FALSE)
speakers_count <- count(speakers, p_attribute = "word", progress = TRUE)
tdm <- as.TermDocumentMatrix(speakers_count, col = "count")
dim(tdm)
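The same pipeline can be run on the REUTERS sample corpus to get a feeling for the resulting object. This sketch assumes that the sample corpus has an s-attribute named id identifying documents, and that the slam package is installed (it provides the sparse matrix class the result is assumed to build on).

```r
library(polmineR)
use("polmineR")  # activate the sample corpora shipped with the package

# One partition per document (the "id" s-attribute is an assumption)
docs <- partition_bundle("REUTERS", s_attribute = "id", progress = FALSE, verbose = FALSE)
docs_count <- count(docs, p_attribute = "word", progress = FALSE)

# Terms in rows, documents in columns
tdm <- as.TermDocumentMatrix(docs_count, col = "count")
dim(tdm)

# Most frequent terms across all documents, summed over the sparse matrix
head(sort(slam::row_sums(tdm), decreasing = TRUE), 10)
```

Keeping the matrix sparse matters for larger corpora, where a dense matrix of terms by documents would quickly exhaust memory.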

Installation

Windows

The following instructions assume that you have installed R. If not, install it from CRAN. An installation of RStudio is highly recommended.

The CRAN release of polmineR can be installed using install.packages(); all dependencies will be installed, too.

install.packages("polmineR")

To install the most recent development version that is hosted in a GitHub repository, use the installation mechanism offered by the devtools package.

install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")

Check the installation by loading polmineR and activating the corpora included in the package.

library(polmineR)
corpus()

macOS

Install binary package from CRAN

CRAN offers polmineR as a binary package both for Intel processors (x86_64 architecture) and the newer Apple silicon chips (arm64 architecture). If R and RStudio are not yet installed, follow these preparatory steps.

Installing XQuartz is recommended. The available image works for Intel and Apple chips. Note that XQuartz capability is configured and used by R only if XQuartz has been installed before installing R.
Install R. Note that packages are available for both Intel 64-bit and Apple silicon arm64 chip architectures. Install what applies for you, see the R for macOS site.
Install RStudio. The free version of RStudio Desktop is enough. Starting with version 1.4, Apple silicon is supported. When installing RStudio, users with Apple silicon chips are asked to install Rosetta (say yes).
When starting RStudio the first time, you may be asked to install the Command Line Developer Tools. This is not necessary for basic polmineR usage, but recommended. Note that downloading the Command Line Developer Tools may require a stable internet connection and still take some time.

Then run this command for installing polmineR:

install.packages("polmineR")

The installation mechanism will determine which binary version you require and install all required dependencies.

Install development version / build from source

For installing the development version of polmineR and building the package from source, the Command Line Developer Tools need to be installed. Install them from a terminal window as follows.

xcode-select --install

If you haven’t done so already, install XQuartz, R and RStudio (see previous instructions for binary installation).

The devtools package exposes a convenient and commonly used installation mechanism for installing a package from GitHub. First install the devtools package, which involves the installation of several dependencies.

install.packages("devtools")  # unless devtools is already installed

Then use the install_github() function as follows.

devtools::install_github("PolMine/polmineR",ref="dev")

The development version of polmineR may require the installation of a development version of RcppCWB: polmineR interacts with the Corpus Workbench (CWB) via RcppCWB, an R package which exposes the C-level functions of the CWB. If you want or need to install a development version of RcppCWB, several system dependencies need to be fulfilled for compiling the package from source. See the README of the RcppCWB GitHub repository for instructions.

Checking the installation

Check whether everything works by loading polmineR, and see whether the demo corpora included in the package are listed.

library(polmineR)
corpus()
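The check can also be scripted, which is convenient on machines set up automatically. This is a sketch under two assumptions: that corpus() returns a table with a column named corpus holding the corpus names, and that the sample corpora are called REUTERS and GERMAPARLMINI as stated earlier.

```r
# Install polmineR only if it is missing, then report the installed version
if (!requireNamespace("polmineR", quietly = TRUE)) {
  install.packages("polmineR")
}
library(polmineR)
packageVersion("polmineR")

# The sample corpora should be among the corpora listed
# (the column name "corpus" is an assumption)
available <- corpus()
c("REUTERS", "GERMAPARLMINI") %in% available[["corpus"]]
```

If the last line does not return TRUE for a corpus, print the result of corpus() and compare the names by eye before concluding that the installation is broken.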

Linux (Ubuntu)

If you have not yet installed R on your Ubuntu machine, there are good instructions at ubuntuuser. To install base R, enter the following in the terminal.

sudo apt-get install r-base r-recommended

Make sure that you have installed the latest version of R. The following commands will add the R repository to the package sources and run an update. The second line assumes that you are using Ubuntu 16.04.

sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com E084DAB9
sudo add-apt-repository 'deb http://ftp5.gwdg.de/pub/misc/cran/bin/linux/ubuntu xenial/'
sudo apt-get update
sudo apt-get upgrade

It is highly recommended to install RStudio, a powerful IDE for R. Output of polmineR methods is generally optimized to be displayed using RStudio facilities. If you are working on a remote server, running RStudio Server may be an interesting option to consider.

The RcppCWB package, the interface used by polmineR to query CWB corpora, will require the pcre, glib and pkg-config libraries. They can be installed as follows. In addition, libxml2 is installed, a dependency of the R package xml2 that is used for manipulating HTML output.

sudo apt-get install libglib2.0-dev libssl-dev libcurl4-openssl-dev
sudo apt-get install libxml2-dev
sudo apt-get install libprotobuf-dev protobuf-compiler

The system requirements will now be fulfilled. From R, install RcppCWB first, and then polmineR.

install.packages("RcppCWB")
install.packages("polmineR")

Use devtools to install the development version of polmineR from GitHub.

install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")

You may want to install packaged corpora to run examples in the vignette and the man pages.

library(polmineR)
corpus()

To have access to all package functions and to run all package tests, the installation of further system requirements and packages is required. The xlsx dependency requires that rJava is installed and configured for R. That is done on the shell:

sudo apt-get install openjdk-8-jre
sudo R CMD javareconf

To run package tests including (re-)building the manual and vignettes, a working installation of LaTeX is required, too. Be aware that this may be a time-consuming operation.

sudo apt-get install texlive-full texlive-xetex

Now install the remaining packages from within R.

install.packages(pkgs= c("rJava","xlsx","tidytext"))

Quoting polmineR

The polmineR package has been developed to be useful for research. If you publish research results making use of polmineR, please include the following citation in your publications.

Blaette, Andreas (2020). polmineR: Verbs and Nouns for Corpus Analysis. R package version v0.8.5. http://doi.org/10.5281/zenodo.4042093
