Abstract
Authorship identification is the process of extracting and analysing the writing styles of authors to identify the authorship. From the writing style, the author and his/her different characteristics can be recognised, which is very useful in digital forensics and cyber investigations. In the literature, authorship identification tasks were addressed on both long and short documents and performed on different languages, such as English, Arabic, Chinese, and Greek. This survey has reviewed the authorship identification tasks for the Arabic language to contribute to this area of research by exploring Arabic language performance and challenges. A total of 27 prominent Arabic studies of each authorship identification domain were reviewed considering the used data, selected features, utilised methods, and results. After a review of the various studies, it was concluded that the results of authorship identification tasks vary based on mostly the selected features and used dataset. Furthermore, the effective features differ from one dataset to another based on the various types of the Arabic language. However, all authorship identification tasks involving the Arabic language face considerable challenges with data pre-processing due to the challenging Arabic concatenative morphology.
Original language | English |
---|---|
Journal | ACM Transactions on Asian and Low-Resource Language Information Processing |
DOIs | |
Publication status | E-pub ahead of print - 22 Sept 2022 |
Keywords
- General Computer Science