Genomics ETL

This is a demo data engineering project built on the Azure cloud. An ETL pipeline implemented in Azure Data Factory (ADF) ingests and transforms the Illumina Platinum Genomes dataset. Terraform is used to provision the infrastructure. Deployment pipelines are implemented in GitHub Actions in accordance with official Microsoft guidelines, with a working CI/CD process between the DEV and PROD environments. The DEV ADF instance is connected to a Git repository, and ARM templates are propagated to the PROD environment through a GitHub Action.

Databricks visualization (image)

Deploy Databricks on Azure with Terraform

This is a minimal example of deploying the Databricks service on Azure. The cluster will have a minimum of 1 node and a maximum of 5. The node type will be the smallest one available, and the Spark version will be the latest one with long-term support. You will be able to log in automatically with your SSO user. Auto-termination is set to 20 minutes.

Databricks logo

Amazon review dataset translations for sentiment analysis

This repository contains translations of 150,000 randomly selected entries from the Amazon review dataset originally created by Julian McAuley and Jianmo Ni, which contains over 20 GB of data. The goal is to create smaller datasets for sentiment analysis in languages other than English, since many publicly available datasets already exist for English. Translation is performed with Microsoft Azure's Translator cloud service.
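Selecting a fixed number of random entries from a file too large to hold in memory can be done in a single pass with reservoir sampling. The sketch below is only an illustration of that general technique, not necessarily how the entries in this repository were selected; the class name `ReservoirSample` and the `review-N` strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

class ReservoirSample {
    // Picks up to k items uniformly at random from a stream of unknown length (Algorithm R).
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item); // fill the reservoir with the first k items
            } else {
                // Keep the new item with probability k / seen by drawing a slot in [0, seen).
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for lines read from the original dataset.
        List<String> reviews = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            reviews.add("review-" + i);
        }
        List<String> picked = sample(reviews.iterator(), 5, new Random(42));
        System.out.println(picked.size() + " entries sampled");
    }
}
```

Because only the reservoir is kept in memory, the same method would work with an iterator over the lines of a multi-gigabyte file.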

AWS Spanish Translation

Java OCP exam code snippets

These are my preparation exercises for the OCP exam. Each example is a self-contained, executable code snippet. I usually base my examples on a meaningful illustration with clear functionality, as in this case of BiPredicate lambda functions:

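        // Requires imports: java.util.HashMap, java.util.Map, java.util.function.BiPredicate
        // Map of release year to movie title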
        Map<Integer, String> henryFondaMovies = new HashMap<>();
        henryFondaMovies.put(1962, "The Longest Day");
        henryFondaMovies.put(1940, "The Grapes of Wrath");
        henryFondaMovies.put(1964, "Fail-Safe");
        henryFondaMovies.put(1957, "12 Angry Men");
        henryFondaMovies.put(1937, "You Only Live Once");
        henryFondaMovies.put(1938, "The Mad Miss Manton");

        BiPredicate<Integer, String> yearFilter = ((year, title) -> year > 1950);
        BiPredicate<Integer, String> titleFilterNumeric = ((year, title) -> title.matches(".*\\d+.*"));
        BiPredicate<Integer, String> titleFilterThe = ((year, title) -> title.contains("The"));

        // this will print only movies released after 1950
        for (Map.Entry<Integer, String> entry : henryFondaMovies.entrySet())
            if (yearFilter.test(entry.getKey(), entry.getValue()))
                System.out.println("after 1950: " + entry.getValue());

        // this will print only movies that contain a number in the title
        for (Map.Entry<Integer, String> entry : henryFondaMovies.entrySet())
            if (titleFilterNumeric.test(entry.getKey(), entry.getValue()))
                System.out.println("number in title: " + entry.getValue());

        // this will print only movies that contain "The" in the title
        for (Map.Entry<Integer, String> entry : henryFondaMovies.entrySet())
            if (titleFilterThe.test(entry.getKey(), entry.getValue()))
                System.out.println("the in title: " + entry.getValue());
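The filters above can also be combined with the default methods `and`, `or`, and `negate` of `BiPredicate`. The following is a small sketch along those lines; the class `BiPredicateComposition` and the helper `select` are hypothetical names introduced for illustration, and a `TreeMap` is used so that the output order is predictable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiPredicate;

class BiPredicateComposition {
    // Returns the titles whose (year, title) pair passes the filter, in map iteration order.
    static List<String> select(Map<Integer, String> movies, BiPredicate<Integer, String> filter) {
        List<String> result = new ArrayList<>();
        movies.forEach((year, title) -> {
            if (filter.test(year, title)) {
                result.add(title);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // TreeMap iterates in ascending key (year) order.
        Map<Integer, String> movies = new TreeMap<>();
        movies.put(1940, "The Grapes of Wrath");
        movies.put(1957, "12 Angry Men");
        movies.put(1962, "The Longest Day");
        movies.put(1964, "Fail-Safe");

        BiPredicate<Integer, String> after1950 = (year, title) -> year > 1950;
        BiPredicate<Integer, String> hasThe = (year, title) -> title.contains("The");

        // Both conditions must hold.
        System.out.println(select(movies, after1950.and(hasThe)));         // [The Longest Day]
        // Logical complement of the year filter.
        System.out.println(select(movies, after1950.negate()));            // [The Grapes of Wrath]
        // Either released in 1950 or earlier, or containing "The".
        System.out.println(select(movies, after1950.negate().or(hasThe))); // [The Grapes of Wrath, The Longest Day]
    }
}
```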


Computer Science courses

List of Computer Science university courses freely available on YouTube

CS50 2020 at Harvard

MIT 6.006 Introduction to Algorithms 2020

Questions and answers portal in Spanish - nuestras-preguntas.net

www.nuestras-preguntas.net was my demo project for a Spanish-language questions and answers platform with full-text search capability. The application was completely configurable, so switching to another language could be accomplished easily by providing translations for the existing fields. It ran in production for nine months, with two million question and answer pairs. The CI/CD pipeline was built in Jenkins, and the application ran in a Docker container on the DigitalOcean hosting service. The whole system had four containers: a Spring Boot application, an Elasticsearch database, a Let's Encrypt certbot, and an Nginx proxy. They were orchestrated with Docker Compose. The VM instance had 8 GB of RAM, and Elasticsearch was by far the largest consumer of resources. My architectural decisions were mainly guided by cost reduction, and comparable cloud solutions would have been orders of magnitude more expensive.

The application's unique feature was anonymous usage, meaning that registration with an email address was not necessary to submit a question or an answer, or to vote. I developed a custom encryption protocol, based mostly on the AES algorithm, so that traffic between client and server could not be reverse-engineered and automated. To defend against abuse, there were also throttling services and honeypot traps against bots. Anti-scraping mechanisms included a rotating HTML schema (with no effect on user experience or functionality), combined with detection of an unusually high number of requests, which triggered responses with poisoned content.
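As an illustration of the throttling idea, here is a minimal sliding-window limiter. It is a sketch under assumptions, not the code that ran on the site: the class name `RequestThrottle`, the use of a client key such as an IP address, and the limits in `main` are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class RequestThrottle {
    private final int maxRequests;
    private final long windowMillis;
    // Timestamps of recent accepted requests, per client key (hypothetical: e.g. an IP address).
    private final Map<String, Deque<Long>> history = new HashMap<>();

    RequestThrottle(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request fits within the limit; false means the caller
    // could reject it or, as described above, answer with decoy content.
    synchronized boolean allow(String clientKey, long nowMillis) {
        Deque<Long> times = history.computeIfAbsent(clientKey, k -> new ArrayDeque<>());
        // Drop timestamps that have fallen out of the window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() >= windowMillis) {
            times.pollFirst();
        }
        if (times.size() >= maxRequests) {
            return false;
        }
        times.addLast(nowMillis);
        return true;
    }

    public static void main(String[] args) {
        // At most 3 requests per client in any 1000 ms window.
        RequestThrottle throttle = new RequestThrottle(3, 1000);
        for (int i = 0; i < 5; i++) {
            boolean allowed = throttle.allow("203.0.113.7", 100L * i);
            System.out.println("request " + i + " allowed: " + allowed);
        }
    }
}
```

In this run the first three requests are allowed and the last two are refused, because all five fall inside the same one-second window.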


Java 8 code examples for the OCA exam

After doing a couple of mock exams, I realised that it is necessary to be familiar with code that does compile just as much as with code that does NOT compile. The following examples include both forms of code. Incorrect statements are commented out, and can easily be turned back into code that produces compile-time errors.

The examples are not ordered in any way. They are just grouped by similarity (e.g. arrays, flowcontrol, lambdas, exceptions).

One code example is not necessarily designed to explain only one thing; it can illustrate several syntactic or programming principles of the Java language.

One example is not enough to explain or understand a principle. Therefore, the code contains seemingly redundant lines that nonetheless help in recognising the pattern behind the particular case.
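A short sketch in the spirit of these examples is shown below. It is written for this overview rather than copied from the repository, and the class name `CompileOrNot` is hypothetical. Each commented-out line would cause a compile-time error if uncommented.

```java
class CompileOrNot {
    static String run() {
        int i = 10;
        long l = i;          // compiles: implicit widening from int to long
        // int j = l;        // does NOT compile: possible lossy conversion from long to int
        int j = (int) l;     // compiles: explicit narrowing cast

        float f = 1.5f;      // compiles: float literal
        // float g = 1.5;    // does NOT compile: 1.5 is a double literal

        byte b = 10;         // compiles: constant expression fits in a byte
        // b = b + 1;        // does NOT compile: b + 1 has type int
        b += 1;              // compiles: compound assignment casts implicitly

        final int k;
        k = 1;               // compiles: first assignment to a blank final variable
        // k = 2;            // does NOT compile: k might already have been assigned

        return j + " " + f + " " + b + " " + k;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "10 1.5 11 1"
    }
}
```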

Top 500 companies in Croatian IT - Superset

The first image is a dashboard composed of all the other figures. The second figure is a bar chart of companies grouped by city and activity. The capital city (Zagreb), where most of the companies are registered, clearly dominates. Activity code 6201, which denotes ‘Computer programming’, is also clearly dominant. Interestingly, the other cities have companies only in this domain, and none in consulting (6202), wireless communication (6120), data analysis (6311), equipment management (6203), or even uncategorized activity (6209).

The third image represents profit allocation with a pie chart. Roughly two thirds of annual profits come from computer programming. The fourth image is a word cloud that sizes each city name in proportion to its number of employees. As the capital city with the most companies, Zagreb clearly has the largest word size. The fifth is a chord diagram that connects company location with activity. Again, the prominence of Zagreb and programming is easily noticeable.

The sixth image is a bubble chart. Each data point represents one company. The X axis is mapped to the value of ‘Capital and reserves’, and the Y axis to the annual revenue. Both scales are logarithmic. Bubble size is linked to the number of employees in the company, while bubble color is linked to the activity type. Red indicates the ‘Computer programming’ category. The big purple circle in the middle is Croatian Telecom (Hrvatski Telekom). The big bluish circle in the bottom right corner is VipNet. Because Superset reads these numbers as strings, the ratios are not correct, and the same figure should be redone properly with another tool.
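The general problem with numbers stored as strings can be shown in a few lines: text is compared character by character, so magnitudes are lost until the values are parsed. This is only a generic illustration with made-up values, not a description of how Superset handles the data internally; the class name `StringVsNumber` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class StringVsNumber {
    // Sorts the raw text values; comparison is character by character.
    static List<String> sortedAsText(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }

    // Parses each value to a long before comparing, so magnitude decides the order.
    static List<String> sortedAsNumbers(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.comparingLong((String s) -> Long.parseLong(s)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("9", "100", "25"); // made-up values
        System.out.println(sortedAsText(values));    // [100, 25, 9]
        System.out.println(sortedAsNumbers(values)); // [9, 25, 100]
    }
}
```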
