Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Doc and its sentences and tokens. Can also be used as a command-line tool.


This module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom pipeline component to a spaCy, spacy-stanza, or spacy-udpipe pipeline. It also provides an easy-to-use function to quickly initialize a parser, as well as a ConllParser class with built-in functionality to parse files or text.

Note that the module simply takes a parser's output and puts it in a formatted string adhering to the linked CoNLL-U format. The output tags depend on the spaCy model used. If you want Universal Dependencies tags as output, I advise you to use this library in combination with spacy-stanza, which is a spaCy interface using stanza and its models behind the scenes. Those models use the Universal Dependencies formalism and yield state-of-the-art performance. stanza is a new and improved version of stanfordnlp. As an alternative to the Stanford models, you can use the spaCy wrapper for UDPipe, spacy-udpipe, which is slightly less accurate than stanza but much faster.

Installation

By default, this package automatically installs only spaCy as a dependency. Because spaCy's models are not necessarily trained on Universal Dependencies conventions, their output labels are not UD either. By using spacy-stanza or spacy-udpipe, we get the easy-to-use interface of spaCy as a wrapper around stanza and UDPipe respectively, including their models that are trained on UD data.

NOTE: spacy-stanza and spacy-udpipe are not installed automatically as dependencies of this library, because that might be too much overhead for those who don't need UD. If you wish to use their functionality, you have to install them manually or use one of the available options as described below.

If you want to retrieve CoNLL info as a pandas DataFrame, this library will automatically export it if it detects that pandas is installed. See the Usage section for more.

To install the library, simply use pip.

# only includes spacy by default
pip install spacy_conll

A number of options are available to make installation of additional dependencies easier:

# include spacy-stanza and spacy-udpipe
pip install spacy_conll[parsers]
# include pandas
pip install spacy_conll[pd]
# include pandas, spacy-stanza and spacy-udpipe
pip install spacy_conll[all]
# include pandas, spacy-stanza and spacy-udpipe and additional libraries for testing and formatting
pip install spacy_conll[dev]

Usage

When the ConllFormatter is added to a spaCy pipeline, it adds CoNLL properties for Token, sentence Span and Doc. Note that arbitrary Spans are not included and do not receive these properties.

On all three of these levels, two custom properties are exposed by default, ._.conll and its string representation ._.conll_str. However, if you have pandas installed, then ._.conll_pd will be added automatically, too! A short usage sketch follows the list below.

  • ._.conll: raw CoNLL format

    • in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.
    • in sentence Span: a list of its tokens' ._.conll dictionaries (list of dictionaries).
    • in a Doc: a list of its sentences' ._.conll lists (list of list of dictionaries).
  • ._.conll_str: string representation of the CoNLL format

    • in Token: tab-separated representation of the contents of the CoNLL fields, ending with a newline.
    • in sentence Span: the expected CoNLL format where each row represents a token. When ConllFormatter(include_headers=True) is used, two header lines are included as well, as per the CoNLL format.
    • in Doc: all its sentences' ._.conll_str combined and separated by new lines.
  • ._.conll_pd: pandas representation of the CoNLL format

    • in Token: a Series representation of this token's CoNLL properties.
    • in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column headers.
    • in Doc: a concatenation of its sentences' DataFrames, leading to a new DataFrame whose index is reset.

You can use spacy_conll in your own Python code as a custom pipeline component, or you can use the built-in command-line script which offers typically needed functionality. See the following sections for more.

In Python

This library offers the ConllFormatter class which serves as a custom spaCy pipeline component. It can be instantiated as follows. It is important that you import spacy_conll before adding the pipe!

import spacy
import spacy_conll  # importing spacy_conll makes the "conll_formatter" component available

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("conll_formatter", last=True)

Because this library supports different spaCy wrappers (spacy, stanza, and udpipe), a convenience function is available as well. With utils.init_parser you can easily instantiate a parser with a single line. You can find the function's signature below. Have a look at the source code to read more about all the possible arguments, or try out the examples.

NOTE: is_tokenized does not work for spacy-udpipe. Using is_tokenized for spacy-stanza also affects sentence segmentation, effectively only splitting on new lines. With spacy, is_tokenized disables sentence splitting completely (see the sketch below the signature).

def init_parser(
    model_or_lang: str,
    parser: str,
    *,
    is_tokenized: bool = False,
    disable_sbd: bool = False,
    exclude_spacy_components: Optional[List[str]] = None,
    parser_opts: Optional[Dict] = None,
    **kwargs,
)
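To illustrate the note above, here is a minimal sketch of is_tokenized with the regular spacy parser (assuming the en_core_web_sm model is installed; include_headers is simply forwarded to the ConllFormatter as a keyword argument):

from spacy_conll import init_parser

# With "spacy" as parser and is_tokenized=True, the input is treated as already
# space-separated and no sentence splitting is performed, so the whole string
# below ends up as a single sentence in the output.
nlp = init_parser("en_core_web_sm", "spacy", is_tokenized=True, include_headers=True)
doc = nlp("This is a pretokenized sentence .")
print(doc._.conll_str)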

For instance, if you want to load a Dutch stanza model in silent mode with the CoNLL formatter already attached, you can simply use the following snippet. parser_opts is passed to the stanza pipeline initialisation automatically. Any other keyword arguments (kwargs), on the other hand, are passed to the ConllFormatter initialisation.

fromspacy_conllimportinit_parsernlp=init_parser("nl","stanza",parser_opts={"verbose":False})

The ConllFormatter allows you to customize the extension names, and you can also specify conversion maps for the output properties.

To illustrate, here is an advanced example, showing the more complex options:

  • ext_names: changes the attribute names to a custom key by using a dictionary.
  • conversion_maps: a two-level dictionary that looks like {field_name: {tag_name: replacement}}. In other words, you can specify in which field a certain value should be replaced by another. This is especially useful when you are not satisfied with the tagset of a model and wish to change some tags to an alternative.
  • field_names: allows you to change the default CoNLL-U field names to your own custom names. Similar to the conversion map above, you should use any of the default field names as keys and add your own key as value. Possible keys are: "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC".

The example below

  • shows how to manually add the component;
  • changes the custom attribute conll_pd to pandas (conll_pd is only available if pandas is installed);
  • converts any nsubj deprel tag to subj.
importspacynlp=spacy.load("en_core_web_sm")config= {"ext_names": {"conll_pd":"pandas"},"conversion_maps": {"deprel": {"nsubj":"subj"}}}nlp.add_pipe("conll_formatter",config=config,last=True)doc=nlp("I like cookies.")print(doc._.pandas)

This is the same as:

fromspacy_conllimportinit_parsernlp=init_parser("en_core_web_sm","spacy",ext_names={"conll_pd":"pandas"},conversion_maps={"deprel": {"nsubj":"subj"}})doc=nlp("I like cookies.")print(doc._.pandas)

The snippets above will output a pandas DataFrame by using ._.pandas rather than the standard ._.conll_pd, and all occurrences of nsubj in the deprel field are replaced by subj.

   ID     FORM   LEMMA    UPOS    XPOS                                       FEATS  HEAD DEPREL DEPS           MISC
0   1        I       I    PRON     PRP  Case=Nom|Number=Sing|Person=1|PronType=Prs     2   subj    _              _
1   2     like    like    VERB     VBP                     Tense=Pres|VerbForm=Fin     0   ROOT    _              _
2   3  cookies  cookie    NOUN     NNS                                 Number=Plur     2   dobj    _  SpaceAfter=No
3   4        .       .   PUNCT       .                              PunctType=Peri     2  punct    _  SpaceAfter=No

Another initialization example, which would replace the column names "UPOS" with "upostag" and "XPOS" with "xpostag":

importspacynlp=spacy.load("en_core_web_sm")config= {"field_names": {"UPOS":"upostag","XPOS":"xpostag"}}nlp.add_pipe("conll_formatter",config=config,last=True)

Reading CoNLL into a spaCy object

It is possible to read a CoNLL string or text file and parse it as a spaCy object. This can be useful if you have raw CoNLL data that you wish to process in different ways. The process is straightforward.

from spacy_conll import init_parser
from spacy_conll.parser import ConllParser

nlp = ConllParser(init_parser("en_core_web_sm", "spacy"))

doc = nlp.parse_conll_file_as_spacy("path/to/your/conll-sample.txt")
'''or straight from raw text:
conllstr = """
# text = From the AP comes this story :
1   From    from    ADP     IN    _                                                       3   case    3:case      _
2   the     the     DET     DT    Definite=Def|PronType=Art                               3   det     3:det       _
3   AP      AP      PROPN   NNP   Number=Sing                                             4   obl     4:obl:from  _
4   comes   come    VERB    VBZ   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    0:root      _
5   this    this    DET     DT    Number=Sing|PronType=Dem                                6   det     6:det       _
6   story   story   NOUN    NN    Number=Sing                                             4   nsubj   4:nsubj     _
"""
doc = nlp.parse_conll_text_as_spacy(conllstr)
'''

# Multiple CoNLL entries (separated by two newlines) will be included as different sentences in the resulting Doc
for sent in doc.sents:
    for token in sent:
        print(token.text, token.dep_, token.pos_)

Command line

Upon installation, a command-line script is added under the alias parse-as-conll. You can use it to parse a string or file into CoNLL format, given a number of options.

parse-as-conll -h
usage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR] [-o OUTPUT_FILE]
                      [-c OUTPUT_ENCODING] [-s] [-t] [-d] [-e] [-j N_PROCESS] [-v]
                      [--ignore_pipe_errors] [--no_split_on_newline]
                      model_or_lang {spacy,stanza,udpipe}

Parse an input string or input file to CoNLL-U format using a spaCy-wrapped parser. The output
can be written to stdout or a file, or both.

positional arguments:
  model_or_lang         Model or language to use. SpaCy models must be pre-installed, stanza
                        and udpipe models will be downloaded automatically
  {spacy,stanza,udpipe}
                        Which parser to use. Parsers other than 'spacy' need to be installed
                        separately. For 'stanza' you need 'spacy-stanza', and for 'udpipe' the
                        'spacy-udpipe' library is required.

optional arguments:
  -h, --help            show this help message and exit
  -f INPUT_FILE, --input_file INPUT_FILE
                        Path to file with sentences to parse. Has precedence over 'input_str'.
                        (default: None)
  -a INPUT_ENCODING, --input_encoding INPUT_ENCODING
                        Encoding of the input file. Default value is system default. (default:
                        cp1252)
  -b INPUT_STR, --input_str INPUT_STR
                        Input string to parse. (default: None)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Path to output file. If not specified, the output will be printed on
                        standard output. (default: None)
  -c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING
                        Encoding of the output file. Default value is system default. (default:
                        cp1252)
  -s, --disable_sbd     Whether to disable spaCy automatic sentence boundary detection. In
                        practice, disabling means that every line will be parsed as one
                        sentence, regardless of its actual content. When 'is_tokenized' is
                        enabled, 'disable_sbd' is enabled automatically (see 'is_tokenized').
                        Only works when using 'spacy' as 'parser'. (default: False)
  -t, --is_tokenized    Whether your text has already been tokenized (space-separated). Setting
                        this option has as an important consequence that no sentence splitting
                        at all will be done except splitting on new lines. So if your input is
                        a file, and you want to use pretokenised text, make sure that each line
                        contains exactly one sentence. (default: False)
  -d, --include_headers
                        Whether to include headers before the output of every sentence. These
                        headers include the sentence text and the sentence ID as per the CoNLL
                        format. (default: False)
  -e, --no_force_counting
                        Whether to disable force counting the 'sent_id', starting from 1 and
                        increasing for each sentence. Instead, 'sent_id' will depend on how
                        spaCy returns the sentences. Must have 'include_headers' enabled.
                        (default: False)
  -j N_PROCESS, --n_process N_PROCESS
                        Number of processes to use in nlp.pipe(). -1 will use as many cores as
                        available. Might not work for a 'parser' other than 'spacy' depending
                        on your environment. (default: 1)
  -v, --verbose         Whether to always print the output to stdout, regardless of
                        'output_file'. (default: False)
  --ignore_pipe_errors  Whether to ignore a priori errors concerning 'n_process'. By default we
                        try to determine whether processing works on your system and stop
                        execution if we think it doesn't. If you know what you are doing, you
                        can ignore such pre-emptive errors, though, and run the code as-is,
                        which will then throw the default Python errors when applicable.
                        (default: False)
  --no_split_on_newline
                        By default, the input file or string is split on newlines for faster
                        processing of the split up parts. If you want to disable that behavior,
                        you can use this flag. (default: False)

For example, parsing a single line, multi-sentence string:

parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers

# sent_id = 1
# text = I like cookies.
1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      2       nsubj   _       _
2       like    like    VERB    VBP     Tense=Pres|VerbForm=Fin 0       ROOT    _       _
3       cookies cookie  NOUN    NNS     Number=Plur     2       dobj    _       SpaceAfter=No
4       .       .       PUNCT   .       PunctType=Peri  2       punct   _       _

# sent_id = 2
# text = What about you?
1       What    what    PRON    WP      _       2       dep     _       _
2       about   about   ADP     IN      _       0       ROOT    _       _
3       you     you     PRON    PRP     Case=Acc|Person=2|PronType=Prs  2       pobj    _       SpaceAfter=No
4       ?       ?       PUNCT   .       PunctType=Peri  2       punct   _       SpaceAfter=No

For example, parsing a large input file and writing output to a given output file, using four processes:

parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4

Credits

The first version of this library was inspired by initial work by rgalhama and has evolved a lot since then.
