Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistakes In Dutch

Wouter Mercelis

-- Creative Commons Attribution NonCommercial 4.0 International License

Pages - 9 - 23 | Revised - 30-04-2021 | Published - 01-06-2021

Published in International Journal of Computational Linguistics (IJCL)

Volume - 12 | Issue - 2 | Publication Date - June 2021

KEYWORDS

NLP, Dutch, AI, Spelling Correction, Transfer Learning.

ABSTRACT

This paper describes a lightweight, scalable model that predicts whether a Dutch verb ends in -d, -t or -dt. The confusion of these three endings is a common Dutch spelling mistake. If the predicted ending is different from the ending as written by the author, the system will signal the dt-mistake. This paper explores various data sources to use in this classification task, such as the Europarl Corpus, the Dutch Parallel Corpus and a Dutch Wikipedia corpus. Different architectures are tested for the model training, focused on a transfer learning approach with ULMFiT. The trained model can predict the right ending with 99.4% accuracy, and this result is comparable to the current state-of-the-art performance. Adjustments to the training data and the use of other part-of-speech taggers may further improve this performance. As discussed in this paper, the main advantages of the approach are the short training time and the potential to use the same technique with other disambiguation tasks in Dutch or in other languages.
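The detection step described in the abstract (compare the predicted ending with the written one and signal a mismatch) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `written_ending` and `is_dt_mistake` are hypothetical, and `toy_predictor` is only a stand-in for the trained classifier, which is not reproduced here.

```python
from typing import Callable, Optional

# "dt" is checked first so that a word ending in -dt is not read as ending in -t.
ENDINGS = ("dt", "d", "t")


def written_ending(verb: str) -> Optional[str]:
    """Return the ending (-dt, -d or -t) as written, or None if there is none."""
    lowered = verb.lower()
    for ending in ENDINGS:
        if lowered.endswith(ending):
            return ending
    return None


def is_dt_mistake(
    verb: str,
    context: str,
    predict_ending: Callable[[str, str], str],
) -> bool:
    """Signal a dt-mistake when the predicted ending differs from the written one."""
    actual = written_ending(verb)
    if actual is None:
        # The word does not end in -d, -t or -dt, so it is out of scope.
        return False
    return predict_ending(verb, context) != actual


def toy_predictor(verb: str, context: str) -> str:
    """Hypothetical stand-in for the trained model: always predicts -dt."""
    return "dt"


if __name__ == "__main__":
    # "hij word ziek" should be "hij wordt ziek"; the toy predictor flags it.
    print(is_dt_mistake("word", "hij word ziek", toy_predictor))
    print(is_dt_mistake("wordt", "hij wordt ziek", toy_predictor))
```

In a real system, `predict_ending` would wrap the trained classifier and receive whatever context representation that model expects. The comparison logic stays the same.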

ABSTRACTING & INDEXING

1	Google Scholar

2	Semantic Scholar

3	refSeek

4	BibSonomy

5	Doc Player

6	J-Gate

7	Scribd

8	SlideShare

MANUSCRIPT AUTHORS

Mr. Wouter Mercelis

Faculteit Letteren/Onderzoekseenheid Taalkunde/Onderzoeksgroep Kwantitatieve Lexicologie en Variatielinguïstiek (QLVL), KU Leuven, Leuven, 3000 - Belgium

mercelisw@gmail.com
