A self-hosted search engine for documents.
ICIJ/datashare
Demo | Download | Documentation | User Guide
Datashare is open-source software developed by the International Consortium of Investigative Journalists (ICIJ). You can use it for free on your computer, or install it on your server and analyse your documents with collaborative features.
@ICIJorg publishes video tweets of new features with the hashtag #ICIJDatashare.
This repository is only the backend part of Datashare.
Please find the frontend here: https://github.com/ICIJ/datashare-client.
Datashare is a free, open-source desktop application developed by the non-profit International Consortium of Investigative Journalists (ICIJ).
Datashare allows investigative journalists to:
- access all their documents in one place, locally on their computer, while securing them from potential third-party interference
- search PDFs, images, texts, spreadsheets, slides and any other files simultaneously
- automatically detect and filter by people, organizations and locations
You're welcome to suggest translations on Datashare's Crowdin: https://crwd.in/datashare. Please contact us if you would like to add a language.
You can download the script at datashare.icij.org.
To access the web GUI, go to your documents folder and launch `path/to/datashare.sh`,
then connect to Datashare at http://localhost:8080
You can also use the Datashare Docker container solely for its HTTP-exposed name-finding API.
Just run:
docker run -ti -p 8080:8080 -v /path/to/dist/:/home/datashare/dist icij/datashare:0.10 -m NER
A bit of explanation:
- `-p 8080:8080` maps the container's port 8080 to port 8080 on the host, so you can access Datashare at localhost:8080 (if you want to access it at localhost:8081, change it to `-p 8081:8080`)
- `-m NER` runs Datashare without any index, in stateless mode
- `-v /path/to/dist:/home/datashare/dist` maps the directory from which the NLP models are read (and to which they are downloaded if they don't exist)
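To make the role of each flag concrete, here is a minimal Python sketch that assembles the same `docker run` invocation from its variable parts. The helper name `docker_ner_command` and its parameters are hypothetical and only for illustration. The image tag and container-side paths are copied from the command above.

```python
import shlex


def docker_ner_command(dist_dir, host_port=8080, image="icij/datashare:0.10"):
    """Build the docker argv for the stateless NER mode (illustrative helper)."""
    return [
        "docker", "run", "-ti",
        # host port on the left, container port (always 8080 here) on the right
        "-p", f"{host_port}:8080",
        # host directory holding the NLP models, mounted where the container reads them
        "-v", f"{dist_dir}:/home/datashare/dist",
        image,
        # run without an index, in stateless mode
        "-m", "NER",
    ]


# Exposing the API on localhost:8081 instead of 8080 only changes the -p value.
print(shlex.join(docker_ner_command("/path/to/dist", host_port=8081)))
```

Only the left-hand side of `-p` and `-v` changes with your machine. The right-hand side is what the container expects.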
Then query the server with curl:
curl -i localhost:8080/api/ner/findNames/CORENLP --data-binary @path/to/a/file.txt
The last part of the path (CORENLP) is the framework. You can choose among CORENLP, IXAPIPE, MITIE or OPENNLP.
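The same request can be sketched in Python using only the standard library. The function names `find_names_url` and `find_names` are hypothetical. The endpoint path and framework names come from the curl example above. The response format is not documented here, so the sketch simply returns the body as text, assuming it is UTF-8.

```python
from urllib import request

# The four frameworks named above.
FRAMEWORKS = ("CORENLP", "IXAPIPE", "MITIE", "OPENNLP")


def find_names_url(framework, base_url="http://localhost:8080"):
    """Return the name-finding endpoint URL for the given framework."""
    if framework not in FRAMEWORKS:
        raise ValueError(f"unknown framework: {framework!r}")
    return f"{base_url}/api/ner/findNames/{framework}"


def find_names(text, framework="CORENLP", base_url="http://localhost:8080"):
    """POST raw bytes to the endpoint, as `curl --data-binary` does.

    Returns the response body decoded as UTF-8 (assumed encoding).
    """
    req = request.Request(find_names_url(framework, base_url), data=text, method="POST")
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

With the container running, something like `find_names(open("path/to/a/file.txt", "rb").read(), "OPENNLP")` would send the file to the OPENNLP pipeline.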
Implementations
TikaDocument from ICIJ/extract
Apache Tika v1.18 (Apache Licence v2.0)
with Tesseract v4.0 alpha
Support
Info: languages other than those listed below are not supported. We encourage you to reach out to the maintainers of the original NLP projects to support your preferred language.
Implementations
org.icij.datashare.text.nlp.corenlp.CorenlpPipeline
Stanford CoreNLP v3.8.0 (Conditional Random Fields), Composite GPL v3+
org.icij.datashare.text.nlp.ixapipe.IxapipePipeline
Ixa Pipes Nerc v1.6.1 (Perceptron), Apache Licence v2.0
org.icij.datashare.text.nlp.mitie.MitiePipeline
MIT Information Extraction v0.8 (Structural Support Vector Machines), Boost Software License v1.0
org.icij.datashare.text.nlp.opennlp.OpennlpPipeline
Apache OpenNLP v1.6.0 (Maximum Entropy), Apache Licence v2.0
Natural Language Processing Stages Support
NlpStage |
---|
TOKEN |
SENTENCE |
POS |
NER |
Named Entity Recognition Language Support
NlpStage.NER | ENGLISH | SPANISH | GERMAN | FRENCH | CHINESE |
---|---|---|---|---|---|
NlpPipeline.Type.CORENLP | X | X | X | (w/ EN) | X |
NlpPipeline.Type.OPENNLP | X | X | - | X | - |
NlpPipeline.Type.IXAPIPE | X | X | X | - | - |
NlpPipeline.Type.MITIE | X | X | X | - | - |
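When picking a framework for the name-finding API, the table above can be treated as a simple lookup. The following sketch transcribes it into a dictionary. The names `NER_SUPPORT` and `pipelines_for` are hypothetical, and the "(w/ EN)" note for CoreNLP French is recorded only as a comment.

```python
# Transcribed from the NER language support table above.
# CoreNLP French is marked "(w/ EN)" in the table; it is counted as supported here.
NER_SUPPORT = {
    "CORENLP": {"ENGLISH", "SPANISH", "GERMAN", "FRENCH", "CHINESE"},
    "OPENNLP": {"ENGLISH", "SPANISH", "FRENCH"},
    "IXAPIPE": {"ENGLISH", "SPANISH", "GERMAN"},
    "MITIE": {"ENGLISH", "SPANISH", "GERMAN"},
}


def pipelines_for(language):
    """Return the pipeline types whose NER stage supports the given language."""
    lang = language.upper()
    return [name for name, langs in NER_SUPPORT.items() if lang in langs]


print(pipelines_for("german"))   # ['CORENLP', 'IXAPIPE', 'MITIE']
print(pipelines_for("chinese"))  # ['CORENLP']
```

An empty list means none of the four pipelines lists that language.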
Named Entity Categories Support
NamedEntity.Category |
---|
ORGANIZATION |
PERSON |
LOCATION |
Parts-of-Speech Language Support
NlpStage.POS | ENGLISH | SPANISH | GERMAN | FRENCH |
---|---|---|---|---|
NlpPipeline.Type.CORENLP | X | X | X | X |
NlpPipeline.Type.OPENNLP | X | X | X | X |
NlpPipeline.Type.IXAPIPE | X | X | X | X |
NlpPipeline.Type.MITIE | - | - | - | - |
Implementations
org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer
Elasticsearch v7.9.1, Apache Licence v2.0
Requires JDK 11, Maven 3 and a running PostgreSQL database (hostname `postgres`) with two databases, `datashare` and `test`, with write access for user `test` / password `test`. You'll also need a running Elasticsearch instance with `elasticsearch` as hostname, and a Redis server named `redis` as well.
mvn validate
mvn -pl commons-test -am install
mvn -pl datashare-db liquibase:update
mvn test
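Because the build expects three services to be reachable by hostname, a quick pre-flight check before running Maven can save a failed test run. The sketch below is one possible way to do it. The hostnames come from the requirements above, while the ports are each service's usual default and are an assumption, as are the names `SERVICES`, `port_open` and `unreachable_services`.

```python
import socket

# Hostnames are those required above; ports are the usual defaults (assumed).
SERVICES = {
    "postgres": 5432,
    "elasticsearch": 9200,
    "redis": 6379,
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def unreachable_services(check=port_open):
    """Return the sorted hostnames for which check(host, port) is False."""
    return sorted(host for host, port in SERVICES.items() if not check(host, port))
```

Calling `unreachable_services()` returns an empty list when all three hosts accept connections. The `check` parameter is injectable so the logic can be exercised without a network.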
It is important to keep `datashare` and `datashare-client` up to date by pulling from each repository's master branch.
To ensure that updates are registered, `make clean dist` must be run locally from each repository.
If dependencies have been updated on `datashare-client`, run `yarn` before `make clean dist`.
If the database models have changed within `datashare`, run the following commands before `make clean dist`:
sh datashare-db/scr/reset_datashare_db.sh
mvn -pl commons-test -am install
mvn -pl datashare-db liquibase:update
mvn test
Datashare is released under the GNU Affero General Public License.
We welcome feedback as well as contributions!
For any bug, question, comment or (pull) request,
please contact us at datashare@icij.org