Open Vocabulary Arabic Diacritics Restoration

Yasser Hifny

Diacritics restoration is a necessary component of Arabic text-to-speech systems. When diacritics are present, the phonetic transcription algorithm can be implemented with a few rules. The dominant approach restores Arabic diacritics by language model scoring, and the language model is usually built over a fixed vocabulary. Because Arabic is a morphologically rich language, the number of out-of-vocabulary (OOV) words is large, and the diacritization algorithm fails to restore diacritics for these words. In this letter, we present a novel approach that supports open vocabulary diacritics restoration based on the Byte Pair Encoding (BPE) method. The BPE method segments words into variable-length sub-word units, so an open vocabulary can be covered from a fixed dictionary of sub-word units. On the Tashkeela diacritization task, this open vocabulary approach outperforms the word-based and character-based methods commonly used in the literature.
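
To make the sub-word idea concrete, the following is a minimal sketch of generic BPE on undiacritized words. It is an illustration only, not the letter's implementation: the toy corpus, the number of merges, the `</w>` end-of-word marker, and the function names `learn_bpe`, `merge_pair`, and `segment` are all assumptions introduced here, and the abstract does not say how diacritics are attached to the units.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a word list (hypothetical helper)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment any word, seen or unseen, by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy training corpus (assumed for illustration): words sharing the root k-t-b.
corpus = ["كتاب", "كاتب", "مكتوب", "كتب", "مكتبة", "كتابة"]
merges = learn_bpe(corpus, num_merges=10)

# A word that does not occur in the corpus is still covered by sub-word units.
oov_word = "مكاتب"
units = segment(oov_word, merges)
print(units)
```

Because every word decomposes into characters in the worst case, a word absent from the training corpus still maps onto the fixed unit inventory, which is the property the abstract relies on to handle OOV words. A word-level vocabulary has no such fallback.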
