Francisco Rangel (a, b), Paolo Rosso (b), Alexandra L. Uitdenbogerd (c), Julian Brooke (d)

(a) Autoritas Consulting, Spain

(b) Universitat Politècnica de València, Spain

(c) RMIT University, Australia

(d) Thomson Reuters, Canada

Proceedings of the 2nd Workshop on Stylistic Variation, Co-located with NAACL, New Orleans, Louisiana, June 5, Association for Computational Linguistics, pp. 39-43, 2018

In this paper, we approach the task of native language identification in a realistic cross-corpus scenario, where a model is trained on the available data and must predict the native language of authors whose texts come from a different corpus. We propose a statistical embedding representation that yields a significant improvement over common state-of-the-art single-layer approaches in identifying Chinese, Arabic, and Indonesian as native languages in this cross-corpus scenario. The proposed approach is shown to be competitive even when the data is scarce and imbalanced.
