The HICMA dataset consists of 5031 images spread across five styles: Kufic, Thuluth, Naskh, Diwani and Muhaqqaq. It also contains a CSV file with the corresponding labels for all images. The dataset can be downloaded using the download button below.
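
As a quick sanity check after downloading, the label file can be used to count images per style. The snippet below is a minimal sketch, not part of the official tooling: the file name `labels.csv` and the column name `style` are assumptions, so check the header of the CSV shipped with the dataset and adjust them.

```python
import csv
from collections import Counter
from pathlib import Path


def count_images_per_style(labels_csv, style_column="style"):
    """Count label rows per calligraphy style.

    `style_column` is a hypothetical column name; replace it with the
    actual header used in the dataset's CSV file.
    """
    counts = Counter()
    # utf-8-sig reads UTF-8 files with or without a byte order mark.
    with Path(labels_csv).open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        if style_column not in fieldnames:
            raise KeyError(
                f"column {style_column!r} not found; available: {fieldnames}"
            )
        for row in reader:
            counts[row[style_column]] += 1
    return counts
```

Calling `count_images_per_style("labels.csv")` (path assumed) returns a `Counter` keyed by style name. If the file matches the description above, it should have five keys and its values should sum to 5031.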

For more information on the dataset and its benchmarking tools, see the paper published in the Proceedings of ArabicNLP 2023, co-located with EMNLP 2023, found here.

The HICMA Dataset Benchmarking Tool is a utility for assessing the performance of Optical Character Recognition (OCR) models on the HICMA dataset. It is also available on GitHub here.
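
This page does not say which metrics the benchmarking tool reports. Character error rate (CER) is a common way to score OCR output against a ground-truth transcription, so the sketch below illustrates that general idea only. The function names are hypothetical and are not taken from the tool.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

The comparison works on Unicode code points, so Arabic diacritics count as separate characters. Whether to strip or normalize them before scoring is a choice the evaluator has to make and state.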

License

The HICMA dataset and its benchmarking tool are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

Citation

Anis Ismail, Zena Kamel, and Reem Mahmoud. 2023. HICMA: The Handwriting Identification for Calligraphy and Manuscripts in Arabic Dataset. In Proceedings of ArabicNLP 2023, pages 24–32, Singapore (Hybrid). Association for Computational Linguistics.

BibTeX

        @inproceedings{ismail-etal-2023-hicma,
          title = "{HICMA}: The Handwriting Identification for Calligraphy and Manuscripts in {A}rabic Dataset",
          author = "Ismail, Anis  and
            Kamel, Zena  and
            Mahmoud, Reem",
          editor = "Sawaf, Hassan  and
            El-Beltagy, Samhaa  and
            Zaghouani, Wajdi  and
            Magdy, Walid  and
            Abdelali, Ahmed  and
            Tomeh, Nadi  and
            Abu Farha, Ibrahim  and
            Habash, Nizar  and
            Khalifa, Salam  and
            Keleg, Amr  and
            Haddad, Hatem  and
            Zitouni, Imed  and
            Mrini, Khalil  and
            Almatham, Rawan",
          booktitle = "Proceedings of ArabicNLP 2023",
          month = dec,
          year = "2023",
          address = "Singapore (Hybrid)",
          publisher = "Association for Computational Linguistics",
          url = "https://aclanthology.org/2023.arabicnlp-1.3",
          pages = "24--32",
          abstract = "Arabic is one of the most globally spoken languages with more than 313 million speakers worldwide. Arabic handwriting is known for its cursive nature and the variety of writing styles used. Despite the increase in effort to digitize artistic and historical elements, no public dataset was released to deal with Arabic text recognition for realistic manuscripts and calligraphic text. We present the Handwriting Identification of Manuscripts and Calligraphy in Arabic (HICMA) dataset as the first publicly available dataset with real-world and diverse samples of Arabic handwritten text in manuscripts and calligraphy. With more than 5,000 images across five different styles, the HICMA dataset includes image-text pairs and style labels for all images. We further present a comparison of the current state-of-the-art optical character recognition models in Arabic and benchmark their performance on the HICMA dataset, which serves as a baseline for future works. Both the HICMA dataset and its benchmarking tool are made available to the public under the CC BY-NC 4.0 license in the hope that the presented work opens the door to further enhancements of complex Arabic text recognition.",
      }