You probably want to train an AI model (text-to-speech, LLM, ...) and need to tokenize Arabic text. I have done some tokenization, but only for English and German, so I'm not sure how much I can help here.

The general idea is to turn any non-numeric representation into a numeric one. Instead of letters, smileys, or any other kind of data, you want numbers that stand for the same content, because a model can only work with numbers. For Arabic, plain ASCII will not work since it only covers basic Latin characters, but a Unicode encoding such as UTF-8 does cover Arabic letters and can be used the way the blog does. A font should not matter here, since it only changes how the letters are drawn, not how they are encoded.

Another approach, which I think is the better alternative, is to use an existing Arabic tokenizer. Have a look at Hugging Face. Coming up with your own tokenizer can be a good idea if you know what you're doing, but note that the quality of your LLM depends heavily on your tokenizer. So I would suggest going with an existing Arabic tokenizer; chances are it's better than what a non-expert in this field can come up with.

I'm not sure if this has helped you, but for the moment I can't invest more time in this. All the best, my friend.
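
For what it's worth, the "letters to numbers" idea above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a real tokenizer: the function names (`to_code_points`, `to_utf8_bytes`, `build_vocab`, `encode`, `decode`) and the sample word are made up for this example, and a character-level vocabulary is just one simple assumption about how IDs could be assigned.

```python
# Minimal sketch: three ways to turn Arabic text into numbers.
# All helper names here are hypothetical, chosen for illustration only.

def to_code_points(text: str) -> list[int]:
    """Map each character to its Unicode code point."""
    return [ord(ch) for ch in text]

def to_utf8_bytes(text: str) -> list[int]:
    """Map the text to its UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign a small integer ID to every distinct character in the corpus."""
    chars = sorted({ch for line in corpus for ch in line})
    return {ch: i for i, ch in enumerate(chars)}

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Replace each character with its vocabulary ID."""
    return [vocab[ch] for ch in text]

def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Invert encode(): map IDs back to characters."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)

sample = "سلام"  # example word, four Arabic letters
vocab = build_vocab([sample])

print(to_code_points(sample))        # one number per letter
print(len(to_utf8_bytes(sample)))    # Arabic letters take two bytes each in UTF-8
print(decode(encode(sample, vocab), vocab) == sample)  # round trip
```

Real tokenizers (for example the pretrained ones on Hugging Face) usually work on subword units rather than single characters, so treat this only as a picture of the basic mapping, not as something to train an LLM on.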