Croatian Language Dataset

This is a dataset of Croatian sentences for anyone interested in natural language processing. It is stored as a Spark DataFrame in Snappy-compressed ORC format. Most of the sentences come from the Croatian Wikipedia dump and the OpenSubtitles project. Some entries still do not belong here: misspelled or grammatically incorrect sentences, log output, text in other languages, and stray tags. These should be minimal, however. The goal of the project is to provide a reusable dataset of standardised Croatian for various purposes.
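Because the data is distributed as ORC, it can be read back with Spark's built-in ORC reader. The following is a minimal PySpark sketch, not the project's own loading code. The function names `load_sentences` and `preview` and the path in the usage comment are hypothetical, and no column names are assumed because the schema is not described here.

```python
def load_sentences(spark, path):
    """Read the ORC dataset into a Spark DataFrame.

    `spark` is an existing SparkSession; `path` points at the downloaded
    ORC file or directory (hypothetical, adjust to your setup).
    """
    # The compression codec (Snappy here) is recorded in the ORC file
    # metadata, so no codec option has to be passed when reading.
    return spark.read.orc(path)


def preview(df, n=5):
    """Return the first `n` rows as a list, for a quick sanity check."""
    return df.limit(n).collect()


# Usage sketch (requires PySpark to be installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("croatian-dataset").getOrCreate()
#   df = load_sentences(spark, "croatian_dataset.orc")  # hypothetical path
#   df.printSchema()  # inspect the actual column names first
#   print(preview(df))
```

Calling `printSchema()` first is the safer route, since some entries are known to be noisy and any filtering depends on the actual column layout.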

Size

14.7 million entries
840 MB uncompressed
460 MB as Snappy-compressed ORC
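As a rough sanity check on these figures, the compression ratio and the average entry size can be worked out directly. This sketch assumes that "MB" means 10^6 bytes and that the entry count is exactly 14.7 million; both are approximations of the rounded numbers listed above.

```python
# Figures taken from the size list; treated as approximate.
entries = 14_700_000
uncompressed_bytes = 840 * 10**6   # assumes 1 MB = 10^6 bytes
compressed_bytes = 460 * 10**6

# Fraction of the original size that remains after compression.
compression_ratio = compressed_bytes / uncompressed_bytes

# Average uncompressed size of one entry, in bytes.
bytes_per_entry = uncompressed_bytes / entries

print(f"compressed/uncompressed: {compression_ratio:.3f}")
print(f"average bytes per entry: {bytes_per_entry:.1f}")
```

Under these assumptions, the compressed file is a little over half the uncompressed size, and an average entry is a few dozen bytes, which is consistent with short single sentences.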

GitHub repo

Croatian Language Dataset
