Focusing on ACE 2005 data preprocessing, this toolkit provides doc-level, sentence-level and BIO-style golden data preprocessing. The only thing you need is the ACE05 raw data. Hope you enjoy! 😎


Clearailhc/ACE2005-toolkit


ACE 2005 data preprocess

File structure

ACE2005-toolkit
├── ace_2005 (the ACE2005 raw data)
│   ├── data
│   │   └── ...
│   ├── docs
│   │   └── ...
│   ├── dtd
│   │   └── ...
│   └── index.html
├── cache_data (empty before run)
│   ├── Arabic/
│   ├── Chinese/
│   └── English/
├── filelist (train/dev/test doc files)
│   ├── ace.ar.dev
│   ├── ace.ar.test
│   ├── ace.ar.train
│   ├── ace.en.dev
│   ├── ace.en.test
│   ├── ace.en.train
│   ├── ace.zh.dev
│   ├── ace.zh.test
│   └── ace.zh.train
├── output (final output, empty before run)
│   ├── BIO (BIO output)
│   │   ├── train/
│   │   ├── test/
│   │   └── dev/
│   └── ...
├── udpipe (udpipe files)
│   ├── arabic-padt-ud-2.5-191206
│   ├── chinese-gsd-ud-2.5-191206
│   └── english-ewt-ud-2.5-191206
├── ace_parser.py
├── extract.py
├── format.py
├── transform.py
├── udpipe.py
├── requirements.txt
└── run.sh

Preprocess steps

  1. Download the ACE2005 raw data and rename the directory to ace_2005;
  2. Install all the requirements with pip install -r requirements.txt;
  3. Start preprocessing with bash run.sh en (en can be replaced by zh or ar);
  4. Enter n to get data divided by filelist, or enter y and a train/dev/test rate (e.g. 0.8 0.1 0.1) to get data divided by sentences;
  5. Enter y to transform the data into BIO-type format. The transformed data will be in output/BIO/; each train (test or dev) set is transformed into 4 BIO-style json files (token, entity_BIO, event_trigger_BIO and event_argument_BIO);
  6. The final output will be in the directory output/.
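
To make the sentence-level option in step 4 more concrete, here is a minimal sketch of cutting a list of sentences by a train/dev/test rate such as 0.8 0.1 0.1. It is not the toolkit's own code: the function name split_by_rate, the shuffling, and the seed parameter are assumptions made for illustration, and the actual script may divide the data differently.

```python
import random


def split_by_rate(sentences, rates=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sentences and cut them into train/dev/test by the given rates.

    Hypothetical helper: the shuffle and the seed are assumptions,
    not behaviour confirmed by the toolkit.
    """
    if len(rates) != 3 or abs(sum(rates) - 1.0) > 1e-9:
        raise ValueError("rates must be three numbers that sum to 1")

    items = list(sentences)
    random.Random(seed).shuffle(items)  # deterministic for a fixed seed

    n_train = int(len(items) * rates[0])
    n_dev = int(len(items) * rates[1])
    return {
        "train": items[:n_train],
        "dev": items[n_train:n_train + n_dev],
        # Whatever is left after rounding goes to the test split.
        "test": items[n_train + n_dev:],
    }


if __name__ == "__main__":
    parts = split_by_rate([f"sentence {i}" for i in range(10)])
    print({name: len(part) for name, part in parts.items()})
```

Giving the remainder to the last split keeps every sentence in exactly one split even when the rates do not divide the sentence count evenly.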

Output format

The output is saved in separate files under output/. Each file can be loaded with json.loads(). After loading, the data is a Python list, and each element is a Python dict:

{
    "sentence": "Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region.",
    "tokens": [
        "Orders", "went", "out", "today", "to", "deploy", "17,000", "U.S.",
        "Army", "soldiers", "in", "the", "Persian", "Gulf", "region", "."
    ],
    "golden-entity-mentions": [
        {
            "entity-id": "CNN_CF_20030303.1900.02-E4-186",
            "entity-type": "GPE:Nation",
            "text": "U.S",
            "sent_id": "4",
            "position": [7, 7],
            "head": {
                "text": "U.S",
                "position": [7, 7]
            }
        },
        ...
    ],
    "golden-event-mentions": [
        {
            "event-id": "CNN_CF_20030303.1900.02-EV1-1",
            "event_type": "Movement:Transport",
            "arguments": [
                {
                    "text": "17,000 U.S. Army soldiers",
                    "sent_id": "4",
                    "position": [6, 9],
                    "role": "Artifact",
                    "entity-id": "CNN_CF_20030303.1900.02-E25-1"
                },
                {
                    "text": "the Persian Gulf region",
                    "sent_id": "4",
                    "position": [11, 15],
                    "role": "Destination",
                    "entity-id": "CNN_CF_20030303.1900.02-E76-191"
                }
            ],
            "text": "Orders went out today to deploy 17,000 U.S. Army soldiers\nin the Persian Gulf region",
            "sent_id": "4",
            "position": [0, 15],
            "trigger": {
                "text": "deploy",
                "position": [5, 5]
            }
        },
        ...
    ],
    "golden-relation-mentions": [
        {
            "relation-id": "CNN_CF_20030303.1900.02-R1-1",
            "relation-type": "ORG-AFF:Employment",
            "text": "17,000 U.S. Army soldiers",
            "sent_id": "4",
            "position": [6, 9],
            "arguments": [
                {
                    "text": "17,000 U.S. Army soldiers",
                    "sent_id": "4",
                    "position": [6, 9],
                    "role": "Arg-1",
                    "entity-id": "CNN_CF_20030303.1900.02-E25-1"
                },
                {
                    "text": "U.S. Army",
                    "sent_id": "4",
                    "position": [7, 8],
                    "role": "Arg-2",
                    "entity-id": "CNN_CF_20030303.1900.02-E66-157"
                }
            ]
        },
        ...
    ]
}
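
As a rough illustration of reading this format, the sketch below loads one output file and pulls the entity, event and relation mentions out of a single record. The helper names load_sentences and summarize are hypothetical, and so is the path in the usage part, since the exact file names under output/ depend on your run. The dictionary keys are the ones shown in the sample record.

```python
import json


def load_sentences(path):
    """Load one output file; the result is a list of sentence dicts."""
    with open(path, encoding="utf-8") as f:
        return json.loads(f.read())


def summarize(record):
    """Collect the golden mentions of one sentence record (hypothetical helper)."""
    entities = [
        (m["text"], m["entity-type"])
        for m in record["golden-entity-mentions"]
    ]
    events = [
        (
            m["trigger"]["text"],
            m["event_type"],  # note: underscore here, unlike "entity-type"
            [(a["role"], a["text"]) for a in m["arguments"]],
        )
        for m in record["golden-event-mentions"]
    ]
    relations = [
        (m["relation-type"], [(a["role"], a["text"]) for a in m["arguments"]])
        for m in record["golden-relation-mentions"]
    ]
    return {"entities": entities, "events": events, "relations": relations}


if __name__ == "__main__":
    # Hypothetical path: replace it with a file that exists in your output/ directory.
    for record in load_sentences("output/train.json")[:3]:
        print(record["sentence"])
        print(summarize(record))
```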

The output files contain all the golden data for entities, events and relations.
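
The position fields make it possible to derive BIO tags from a record. The sketch below shows the general idea for entity mentions, assuming that position is an inclusive [start, end] pair of token indices, as the "U.S" mention at [7, 7] in the sample suggests. The function bio_tags is hypothetical and is not the toolkit's transform step; in particular, skipping overlapping mentions is a simplification chosen here, and the real BIO files may treat nested mentions differently.

```python
def bio_tags(tokens, mentions, type_key="entity-type"):
    """Build one BIO tag per token from mentions with inclusive positions.

    Hypothetical helper. When two mentions overlap, the earlier one in the
    list wins and the later one is skipped (a simplification).
    """
    tags = ["O"] * len(tokens)
    for mention in mentions:
        start, end = mention["position"]  # assumed inclusive on both ends
        if any(tag != "O" for tag in tags[start:end + 1]):
            continue  # overlaps a span that is already tagged
        label = mention[type_key]
        tags[start] = "B-" + label
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label
    return tags


if __name__ == "__main__":
    tokens = ["Orders", "went", "out", "today", "to", "deploy", "17,000", "U.S.",
              "Army", "soldiers", "in", "the", "Persian", "Gulf", "region", "."]
    mentions = [{"entity-type": "GPE:Nation", "position": [7, 7]}]
    for token, tag in zip(tokens, bio_tags(tokens, mentions)):
        print(token, tag)
```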

Adjustment

You can change the file names listed in filelist/, which directly changes which files belong to train/dev/test. We use a default (529/30/40) division.
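
After editing the lists, it can help to check how many documents ended up in each split. The sketch below assumes that each filelist file holds one document name per line, which is not stated above, so treat it as a guess about the format. The helpers read_doc_ids and split_sizes are hypothetical names.

```python
from pathlib import Path


def read_doc_ids(lines):
    """Return the non-empty, stripped lines (assumed: one document per line)."""
    return [line.strip() for line in lines if line.strip()]


def split_sizes(filelist_dir="filelist", lang="en"):
    """Count the documents listed in ace.<lang>.train/dev/test."""
    sizes = {}
    for split in ("train", "dev", "test"):
        path = Path(filelist_dir) / f"ace.{lang}.{split}"
        with open(path, encoding="utf-8") as f:
            sizes[split] = len(read_doc_ids(f))
    return sizes


if __name__ == "__main__":
    # Run from the toolkit root so that filelist/ can be found.
    print(split_sizes("filelist", "en"))
```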

Related work

Email us

For any questions, contact us at haochenli@pku.edu.cn.
