Authors: Konstantin Todorov and Giovanni Colavizza

Affiliation: Institute for Logic, Language and Computation (ILLC), University of Amsterdam, The Netherlands

Keyword(s): Machine Learning, Language Models, Optical Character Recognition (OCR).

Abstract: Neural language models are the backbone of modern natural language processing applications. Their use on textual heritage collections that have undergone Optical Character Recognition (OCR) is therefore also increasing. Nevertheless, our understanding of the impact OCR noise could have on language models is still limited. We assess the impact of OCR noise on a variety of language models, using data in Dutch, English, French and German. We find that OCR noise poses a significant obstacle to language modelling, with language models increasingly diverging from their noiseless targets as OCR quality decreases. On small corpora, simpler models, including PPMI and Word2Vec, consistently outperform transformer-based models in this respect.
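For readers unfamiliar with PPMI (positive pointwise mutual information), one of the simpler models named in the abstract, the following is a minimal sketch of how a model trained on noisy text can drift from its noiseless counterpart. It is not the authors' code or evaluation protocol: the toy sentences, the OCR-style character substitutions, the window size, and the helper names (`cooccurrence_counts`, `ppmi`, `cosine`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count symmetric (word, context) pairs within a fixed window."""
    pairs = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(word, tokens[j])] += 1
    return pairs


def ppmi(pairs):
    """PPMI(w, c) = max(0, log2(P(w, c) / (P(w) * P(c)))), stored sparsely."""
    total = sum(pairs.values())
    word_totals = Counter()
    ctx_totals = Counter()
    for (w, c), n in pairs.items():
        word_totals[w] += n
        ctx_totals[c] += n
    scores = {}
    for (w, c), n in pairs.items():
        pmi = math.log2(n * total / (word_totals[w] * ctx_totals[c]))
        if pmi > 0:  # keep only positive associations
            scores[(w, c)] = pmi
    return scores


def cosine(scores_a, scores_b, word):
    """Cosine similarity between the sparse PPMI rows of `word` in two models."""
    row_a = {c: v for (w, c), v in scores_a.items() if w == word}
    row_b = {c: v for (w, c), v in scores_b.items() if w == word}
    dot = sum(v * row_b.get(c, 0.0) for c, v in row_a.items())
    norm_a = math.sqrt(sum(v * v for v in row_a.values()))
    norm_b = math.sqrt(sum(v * v for v in row_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: the same two sentences, clean and with invented
# OCR-style character confusions (e.g. "the" -> "tbe", "port" -> "p0rt").
clean = [["the", "old", "ship", "left", "the", "port"],
         ["the", "ship", "reached", "the", "port"]]
noisy = [["tbe", "old", "ship", "left", "the", "p0rt"],
         ["the", "sh1p", "reached", "tlie", "port"]]

clean_model = ppmi(cooccurrence_counts(clean))
noisy_model = ppmi(cooccurrence_counts(noisy))

print("clean vs clean:", cosine(clean_model, clean_model, "ship"))
print("clean vs noisy:", cosine(clean_model, noisy_model, "ship"))
```

In this sketch, corrupted tokens split one word's statistics across several spellings, so the noisy row for "ship" shares fewer contexts with the clean row and the similarity falls below that of the clean model with itself. A real assessment would use large corpora and an established evaluation procedure rather than a single-word cosine.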

CC BY-NC-ND 4.0

Paper citation in several formats:
Todorov, K. and Colavizza, G. (2022). An Assessment of the Impact of OCR Noise on Language Models. In Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART; ISBN 978-989-758-547-0; ISSN 2184-433X, SciTePress, pages 674-683. DOI: 10.5220/0010945100003116

@conference{icaart22,
author={Konstantin Todorov and Giovanni Colavizza},
title={An Assessment of the Impact of OCR Noise on Language Models},
booktitle={Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2022},
pages={674-683},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010945100003116},
isbn={978-989-758-547-0},
issn={2184-433X},
}

TY - CONF
JO - Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - An Assessment of the Impact of OCR Noise on Language Models
SN - 978-989-758-547-0
IS - 2184-433X
AU - Todorov, K.
AU - Colavizza, G.
PY - 2022
SP - 674
EP - 683
DO - 10.5220/0010945100003116
PB - SciTePress
