Arabic Computational Resources Tools
Producing resources for machine learning and NLP development is the main focus of the Arabic Computational Resources Group. In addition to its datasets, the group has developed the following prototypes:
- Sharjah Arabic Romanization Tools
These are a group of tools that convert Arabic script into Roman script using several Romanization systems:
- International Phonetic Alphabet (IPA). The International Phonetic Association symbols are the most important Romanization system. They are used to render vowelized Arabic in a form that represents how words are spoken. The transcription that our IPA tool uses is broad and phonological: the output does not represent co-articulation, assimilation, or morphophonemic alternation, but the tool can syllabify it. For accurate Romanization, please ensure that the Arabic input text is fully vowelized. If it is not, the IPA output will not faithfully represent how the Arabic is pronounced.
Try it out:
https://romanization.sharjah.ac.ae/
بِسمِ اللاهِ الرَّحمانِ الرَّحِيمِ
bismi ʔallaːhi ʔalɾaħmaːni ʔal-ɾaħiːmi
bis.mi ʔal.laː.hi ʔal.ɾaħ.maː.ni ʔal-.ɾa.ħiː.mi.
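The syllabified line above can be approximated with a very small rule. The following is a minimal sketch, not the tool's actual implementation: it assumes a broad transcription in which the only vowels are a, i, and u, length is marked with ː, and every syllable begins with exactly one consonant. The names `VOWELS`, `LENGTH`, and `syllabify` are hypothetical.

```python
# Assumed inventory for a broad Arabic transcription (illustrative only).
VOWELS = set("aiu")
LENGTH = "\u02d0"  # the IPA length mark "ː"


def syllabify(word: str) -> str:
    """Insert '.' before every non-initial consonant that is directly followed by a vowel."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        is_consonant = ch not in VOWELS and ch != LENGTH
        if i > 0 and is_consonant and nxt in VOWELS:
            out.append(".")  # a consonant + vowel pair opens a new syllable
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    phrase = "bismi ʔallaːhi ʔalɾaħmaːni"
    print(" ".join(syllabify(w) for w in phrase.split()))
    # prints: bis.mi ʔal.laː.hi ʔal.ɾaħ.maː.ni
```

Because the rule only looks one character ahead, it depends on fully vowelized input, which is the same requirement the IPA tool states.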
- ALA-LC Romanization System. This is the fully diacritical Library of Congress system that is used to Romanize Arabic bibliographic information; it was first published in 1991. Its Romanization is close to that of the German Oriental Society’s system, which is used internationally in scientific publications by Arabists. It also requires full vowelization of Arabic texts, and the Library of Congress tool can vowelize a text automatically before Romanizing it.
Try it out:
https://romanization.sharjah.ac.ae/CongressPageView/
بِسمِ اللاهِ الرَّحمانِ الرَّحِيمِ
bismi al-lāhi al-rraḥmāni al-rraḥīmi
bis.mi. al-.l.āhi. al-r.raḥ.mā.ni. al-r.ra.ḥī.mi.
- Brill’s Encyclopedia of Islam 3 Romanization System. This transcription system is based on a modified version of the German Oriental Society’s 1935 transliteration system, which eventually became the German Institute for Standardization’s DIN 31635 (1982). It has become the standard system for Romanizing Arabic in academic publications, especially by Arabists. The system is mainly diacritical, but it has some digraphs.
Try it out:
https://romanization.sharjah.ac.ae/BrillPageView/
بِسمِ اللاهِ الرَّحمانِ الرَّحِيمِ
bismi al-lāhi al-rraḥmāni al-rraḥīmi
bis.mi. al-.l.āhi. al-r.raḥ.mā.ni. al-r.ra.ḥī.mi.
- Arabica Romanization System. This transcription system is also based on the German Oriental Society’s 1935 transliteration system, which eventually became the German Institute for Standardization’s DIN 31635 (1982). It has become the standard system for Romanizing Arabic in academic publications, especially by Arabists. The system is fully diacritical.
Try it out:
https://romanization.sharjah.ac.ae/ArabicaPageView/
بِسمِ اللاهِ الرَّحمانِ الرَّحِيمِ
bismi al-lāhi al-rraḥmāni al-rraḥīmi
bis.mi. al-.l.āhi. al-r.raḥ.mā.ni. al-r.ra.ḥī.mi.
- Buckwalter Romanization System. This transcription system is an ASCII-based system that represents Arabic orthography one-to-one, allowing users to type or convert text exactly as it appears in Arabic without adding morphological information. For instance, ya’ is represented as y regardless of whether it is pronounced as a vowel (ī) or as a consonant (y), while the alef mamdouda and the alef maqsoura are represented differently, as A and Y respectively. The system represents diacritics only if they are present in the input. It was produced in Ken Beesley’s ALPNET Arabic Project in 1988 and was designed by Tim Buckwalter and Derek Foxley.
Try it out:
https://romanization.sharjah.ac.ae/BuckwalterPageView/
بِسمِ اللاهِ الرَّحمانِ الرَّحِيمِ
bismi AllAhi AlraHmAni AlraHiymi
bis.mi. AllAhi. AlraH.mAni. Alra.Hiy.mi.
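Because Buckwalter is a one-to-one character mapping, it can be sketched as a plain lookup table. The sketch below uses the commonly published Buckwalter table; it is an illustration under that assumption and may differ in detail from the table the Sharjah tool uses (for example, in how shadda is handled). The names `_block`, `BUCKWALTER`, `to_buckwalter`, and `from_buckwalter` are hypothetical.

```python
def _block(start: int, symbols: str) -> dict:
    """Map consecutive Unicode code points, beginning at `start`, to ASCII symbols."""
    return {chr(start + i): s for i, s in enumerate(symbols)}


# Commonly published Buckwalter table (assumed, not taken from the Sharjah tool).
BUCKWALTER = {
    **_block(0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza (U+0621) .. ghain (U+063A)
    **_block(0x0640, "_fqklmnhwYyFNKaui~o"),          # tatweel (U+0640) .. sukun (U+0652)
    "\u0670": "`",  # dagger alef
    "\u0671": "{",  # alef wasla
}
REVERSE = {ascii_sym: arabic for arabic, ascii_sym in BUCKWALTER.items()}


def to_buckwalter(text: str) -> str:
    """Replace each Arabic character with its symbol; leave everything else as is."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def from_buckwalter(text: str) -> str:
    """Inverse mapping; possible because the table is one-to-one."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    word = "\u0628\u0650\u0633\u0645\u0650"  # ba, kasra, sin, mim, kasra
    print(to_buckwalter(word))  # prints: bismi
```

Diacritics appear in the output only when they are present in the input, since the function maps characters and never infers missing vowels.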
- Sharjah Arabic Sentence Aligner
This is a tool that allows the user to match a source text with its translation at the phrase, clause, sentence, or paragraph level. It is used to compile parallel corpora, which translators and natural language processing specialists use for tasks such as training neural machine translation models, bilingual lexicography, and cross-linguistic analysis.
Try it out:
https://alignment.sharjah.ac.ae/
Here is an aligned segment of texts:
- Al-Murshid Arabic-English Dictionary.
This is a dictionary whose nucleus was authored by Prof. Abdul Fattah Abu-Ssaydeh and published in print form in 2013. Al-Murshid is a corpus-informed, collocation-aware, contemporary Arabic-English bilingual dictionary. It offers entries with (1) part of speech (POS) categorization; (2) domain of use; (3) definitions; (4) synonyms; (5) multi-word sequences that the entry enters into, either as a multiword term or as examples of use; and (6) English glosses. Furthermore, Al-Murshid is linked with Peter Mark Roget’s Thesaurus of English Words and Phrases, so every entry also has its ontological designation. It is also linked to a parallel (translation) corpus, so all entries are given example sentences that illustrate their use in both English and Arabic.
Try it out:
https://dictionary.sharjah.ac.ae/discsearch_page/