Clearailhc/ACE2005-toolkit


Focused on ACE 2005 data preprocessing, this toolkit provides document-level, sentence-level and BIO-style golden data preprocessing. The only thing you need is the ACE05 raw data. Hope you enjoy! 😎

License

MIT license

ACE2005-toolkit

ACE 2005 data preprocessing

File structure

ACE2005-toolkit
├── ace_2005 (the ACE2005 raw data)
│   ├── data
│   │   └── ...
│   ├── docs
│   │   └── ...
│   ├── dtd
│   │   └── ...
│   └── index.html
├── cache_data (empty before run)
│   ├── Arabic/
│   ├── Chinese/
│   └── English/
├── filelist (train/dev/test doc files)
│   ├── ace.ar.dev
│   ├── ace.ar.test
│   ├── ace.ar.train
│   ├── ace.en.dev
│   ├── ace.en.test
│   ├── ace.en.train
│   ├── ace.zh.dev
│   ├── ace.zh.test
│   └── ace.zh.train
├── output (final output, empty before run)
│   ├── BIO (BIO output)
│   │   ├── train/
│   │   ├── test/
│   │   └── dev/
│   └── ...
├── udpipe (udpipe files)
│   ├── arabic-padt-ud-2.5-191206
│   ├── chinese-gsd-ud-2.5-191206
│   └── english-ewt-ud-2.5-191206
├── ace_parser.py
├── extract.py
├── format.py
├── transform.py
├── udpipe.py
├── requirements.txt
└── run.sh
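
Before running anything, it can help to confirm that the raw data sits where the tree above expects it. The following is a minimal sketch, not part of the toolkit: the helper name missing_dirs and the choice of which directories to treat as required are assumptions made for illustration.

```python
from pathlib import Path

# Directory names are taken from the tree above. Treating exactly these
# three as "required" is an assumption made for this sketch.
EXPECTED_DIRS = ["ace_2005/data", "filelist", "udpipe"]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]


# An empty list means every expected directory was found under the
# current working directory.
print(missing_dirs("."))
```

Run it from the repository root; any name it prints is a directory that still needs to be put in place.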

Preprocess steps

Download the ACE2005 raw data and rename the directory to ace_2005;
Install all the requirements with pip install -r requirements.txt;
Start preprocessing with bash run.sh en; en can be replaced by zh or ar;
Enter n to split the data according to the file lists, or enter y followed by the train/dev/test ratios (e.g. 0.8 0.1 0.1) to split the data by sentences;
Enter y to transform the data into BIO-style format. The transformed data will be written to output/BIO/, and each train (test or dev) split will be transformed into 4 BIO-style JSON files (token, entity_BIO, event_trigger_BIO and event_argument_BIO);
The final output will be in the directory output/.
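
To make the sentence-level option concrete, here is a minimal sketch of a ratio-based train/dev/test split such as 0.8 0.1 0.1. It is an illustration only and not the toolkit's own code: the function name split_sentences, the fixed-seed shuffle and the rounding rule are all assumptions.

```python
import random


def split_sentences(sentences, rates=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the sentences and cut them into train/dev/test by rate.

    Hypothetical helper: the toolkit may shuffle or round differently.
    """
    if abs(sum(rates) - 1.0) > 1e-9:
        raise ValueError("rates must sum to 1")
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = int(len(items) * rates[0])
    n_dev = int(len(items) * rates[1])
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to test
    return train, dev, test


train, dev, test = split_sentences(range(10))
print(len(train), len(dev), len(test))  # 8 1 1
```

Giving the remainder to the test split guarantees that every sentence lands in exactly one split even when the rates do not divide the sentence count evenly.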

Output format

The output is saved in separate files under output/, and each file can be loaded with json.loads(). After loading, the data is a Python list, and each item in the list is a Python dict:

{
    "sentence": "Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region.",
    "tokens": [
        "Orders", "went", "out", "today", "to", "deploy", "17,000", "U.S.",
        "Army", "soldiers", "in", "the", "Persian", "Gulf", "region", "."
    ],
    "golden-entity-mentions": [
        {
            "entity-id": "CNN_CF_20030303.1900.02-E4-186",
            "entity-type": "GPE:Nation",
            "text": "U.S",
            "sent_id": "4",
            "position": [7, 7],
            "head": {
                "text": "U.S",
                "position": [7, 7]
            }
        },
        ...
    ],
    "golden-event-mentions": [
        {
            "event-id": "CNN_CF_20030303.1900.02-EV1-1",
            "event_type": "Movement:Transport",
            "arguments": [
                {
                    "text": "17,000 U.S. Army soldiers",
                    "sent_id": "4",
                    "position": [6, 9],
                    "role": "Artifact",
                    "entity-id": "CNN_CF_20030303.1900.02-E25-1"
                },
                {
                    "text": "the Persian Gulf region",
                    "sent_id": "4",
                    "position": [11, 15],
                    "role": "Destination",
                    "entity-id": "CNN_CF_20030303.1900.02-E76-191"
                }
            ],
            "text": "Orders went out today to deploy 17,000 U.S. Army soldiers\nin the Persian Gulf region",
            "sent_id": "4",
            "position": [0, 15],
            "trigger": {
                "text": "deploy",
                "position": [5, 5]
            }
        },
        ...
    ],
    "golden-relation-mentions": [
        {
            "relation-id": "CNN_CF_20030303.1900.02-R1-1",
            "relation-type": "ORG-AFF:Employment",
            "text": "17,000 U.S. Army soldiers",
            "sent_id": "4",
            "position": [6, 9],
            "arguments": [
                {
                    "text": "17,000 U.S. Army soldiers",
                    "sent_id": "4",
                    "position": [6, 9],
                    "role": "Arg-1",
                    "entity-id": "CNN_CF_20030303.1900.02-E25-1"
                },
                {
                    "text": "U.S. Army",
                    "sent_id": "4",
                    "position": [7, 8],
                    "role": "Arg-2",
                    "entity-id": "CNN_CF_20030303.1900.02-E66-157"
                }
            ]
        },
        ...
    ]
}

The output files contain all the golden annotations for entities, events and relations.
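
As an illustration of how these records can be consumed, the sketch below loads a trimmed copy of the example above with json.loads() and derives entity BIO tags from the position fields. It rests on several assumptions: position is read as an inclusive [start, end] pair of token indices (which matches the example), the full entity-type string is used as the tag label, and later mentions overwrite earlier ones when they overlap. The function name entity_bio_tags is hypothetical, and the toolkit's own BIO files may use a different label scheme.

```python
import json


def entity_bio_tags(record):
    """Build one BIO tag per token from the golden entity mentions.

    Assumes "position" holds inclusive [start, end] token indices.
    """
    tags = ["O"] * len(record["tokens"])
    for mention in record["golden-entity-mentions"]:
        start, end = mention["position"]
        label = mention["entity-type"]
        tags[start] = "B-" + label
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label
    return tags


# A trimmed copy of the example record, keeping only the fields used here.
raw = """
[
  {
    "tokens": ["Orders", "went", "out", "today", "to", "deploy", "17,000", "U.S.",
               "Army", "soldiers", "in", "the", "Persian", "Gulf", "region", "."],
    "golden-entity-mentions": [
      {"entity-type": "GPE:Nation", "text": "U.S", "position": [7, 7]}
    ]
  }
]
"""

records = json.loads(raw)  # a Python list of dicts
tags = entity_bio_tags(records[0])
for token, tag in zip(records[0]["tokens"], tags):
    print(token, tag)
```

With this record, the token at index 7 receives B-GPE:Nation and every other token is tagged O.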

Adjustment

You can change the file names listed in filelist/, which directly changes which files belong to train/dev/test. By default we use a 529/30/40 split.
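
After editing the lists, a quick count per split helps confirm the division you intended. This is a minimal sketch under an assumption that the source does not confirm: each file in filelist/ is taken to contain one document name per line. The function name count_split_docs is hypothetical; the file name pattern ace.<lang>.<split> comes from the file structure section.

```python
from pathlib import Path


def count_split_docs(filelist_dir, lang="en"):
    """Count the non-empty lines in each ace.<lang>.<split> file.

    Assumes one document name per line, which is not confirmed by the README.
    """
    counts = {}
    for split in ("train", "dev", "test"):
        path = Path(filelist_dir) / f"ace.{lang}.{split}"
        with open(path, encoding="utf-8") as f:
            counts[split] = sum(1 for line in f if line.strip())
    return counts
```

Calling count_split_docs("filelist", "en") from the repository root returns a dict with one count per split, which can then be compared against the division you expect.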

Related work

Email us

If you have any questions, contact us at haochenli@pku.edu.cn.
