This is a dataset of Croatian-language sentences for anyone interested in natural language processing. The dataset is stored as a Spark DataFrame in snappy-compressed ORC format. Most of the sentences come from the Croatian Wikipedia dump and the OpenSubtitles project. Some entries still do not belong here: misspelled or grammatically incorrect sentences, logs, text in other languages, and stray tags. However, those should be minimal. The goal of this project was to create a reusable dataset of standardised Croatian for various purposes.
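Since the data is stored as ORC, it can be read straight back into Spark. The following is a minimal sketch, assuming PySpark is installed; the path `croatian_sentences.orc` and the helper name `load_sentences` are hypothetical, and the column names are not documented here, so the sketch prints the schema rather than assuming one.

```python
# Hypothetical local path to the downloaded dataset; adjust to your setup.
DATASET_PATH = "croatian_sentences.orc"


def load_sentences(path=DATASET_PATH):
    """Read the ORC dataset into a Spark DataFrame (illustrative helper)."""
    # Imported here so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("croatian-sentences").getOrCreate()
    # ORC files record their own compression codec, so snappy needs no extra option.
    return spark.read.orc(path)


if __name__ == "__main__":
    df = load_sentences()
    df.printSchema()            # inspect the actual column names first
    print(df.count())           # number of entries
    df.show(5, truncate=False)  # peek at a few full sentences
```

Checking the schema first is deliberate: filtering out the stray logs or foreign-language entries mentioned above depends on knowing which column holds the sentence text.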

Size

  • 14.7 million entries
  • 840 MB uncompressed size
  • 460 MB snappy compressed ORC
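As a rough sanity check on these figures, the average entry size and the compression ratio can be worked out directly. This is a back-of-the-envelope sketch that assumes 1 MB means 10^6 bytes and treats the listed sizes as exact; the function name `size_summary` is made up for illustration.

```python
ENTRIES = 14_700_000       # entries, from the list above
UNCOMPRESSED_MB = 840      # uncompressed size in MB
COMPRESSED_MB = 460        # snappy-compressed ORC size in MB


def size_summary(entries=ENTRIES, uncompressed_mb=UNCOMPRESSED_MB,
                 compressed_mb=COMPRESSED_MB):
    """Return (average bytes per entry, compressed/uncompressed ratio)."""
    # Assumption: 1 MB = 1,000,000 bytes.
    bytes_per_entry = uncompressed_mb * 1_000_000 / entries
    ratio = compressed_mb / uncompressed_mb
    return bytes_per_entry, ratio


if __name__ == "__main__":
    per_entry, ratio = size_summary()
    print(f"about {per_entry:.0f} bytes per entry uncompressed")
    print(f"compressed size is about {ratio:.0%} of the original")
```

Under those assumptions an entry averages roughly 57 bytes, which fits short sentence-length text, and the compressed file is a little over half the uncompressed size.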


GitHub repo