AUEB Library - Digital Repository

PYXIDA Institutional Repository and Digital Library

Collections :	AUEB Institutional Repository / School of Informatics / Department of Informatics / Postgraduate dissertations

Title :	Error detection in English and Greek texts written by foreign learners

Alternative Title :	Εντοπισμός λαθών σε αγγλικά και ελληνικά κείμενα γραμμένα από ξένους μαθητές

Creator :	Στρουμπούλη, Ελευθερία / Stroumpouli, Eleftheria

Contributor :	Παυλόπουλος, Ιωάννης (Supervisor); Ανδρουτσόπουλος, Ίων (Examiner); Λουρίδας, Παναγιώτης (Examiner); Athens University of Economics and Business, Department of Informatics (Degree granting institution)

Type :	Text

Extent :	44p.

Language :	en

Identifier :	http://www.pyxida.aueb.gr/index.php?op=view_object&object_id=8125

Abstract :	This thesis aims to build a system that detects sentences with grammatical errors written by learners of English as a foreign language, and that detects grammatical, syntactic and semantic errors in corresponding Greek sentences. The work has two goals: 1) to classify a given sentence as correct or incorrect, and 2) to construct a Greek corpus with artificial errors. For the second goal, real texts written by refugees and immigrants were studied, together with language exercises containing deliberate mistakes, in order to identify the most common mistakes to add to the new corpus. These mistakes were inserted by an algorithm that applies each error with a specific probability rather than in every possible case, so that the result looks more realistic. For the first goal, after appropriate preprocessing of the data, three classifiers and a neural network were implemented. The Logistic Regression, Support Vector Machine and Decision Tree classifiers achieved state-of-the-art scores on the English texts, while on the Greek sentences containing errors they need further tuning. The neural model, an LSTM RNN, achieved lower scores than the classifiers on the English texts and fairly good scores on the Greek texts.

Subject :	Natural Language Processing (NLP); Binary classification; Error detection; Neural Networks (NN); Texts written by foreign learners; Neural language models; Long Short-Term Memory (LSTM)

Date Available :	2020-11-30 17:02:23

Date Issued :	2020-10-30

Date Submitted :	2020-11-30 17:02:23

Access Rights :	Free access

Licence :

File: Stroumpouli_2020.pdf

Type: application/pdf
