In this hands-on article, we will build two of the most interesting technologies in the Big Data ecosystem: Apache Kudu and Apache Impala. We will build both from source code, package them, and deploy them with the minimum configuration needed to make them functional (we will leave performance-related configuration aside on this occasion). Additionally, we will run Apache Impala as independently of HDFS as possible.
Apache Kudu and Apache Impala publish their versions as “Source Release” only, so no binaries of these technologies are made available. This forces users to build the projects from source code. Most projects in the Big Data ecosystem and the Apache Software Foundation are based on Java code, which makes them easier to build and, in particular, to distribute (in most cases only a JVM is needed on the target machine).
Apache Kudu and Apache Impala are different: both projects have a C++ code base, which makes building them, and in particular distributing them to the target systems, somewhat more complicated.
This is a series of articles about coding Apache NiFi, intended to help newcomers (like me) understand this awesome project and its development community. In this series we are going to study the Apache NiFi ecosystem from a coding perspective. The idea is to learn step by step, because modifying a code base as large as Apache NiFi is a complex endeavor.
First Step: The Build System
Probably one of the first steps toward understanding the project is to analyze its build structure, which is based on Apache Maven. This first article focuses on that build tool.
This article shows how to set up JetBrains IntelliJ IDEA to debug and develop the Apache MiNiFi tool.
JetBrains IntelliJ IDEA is a powerful IDE, but it can be a little complicated to set up, in particular when the upstream project doesn’t provide clear instructions for developing with this kind of IDE.
This work attempts to create a framework for making good architectural decisions when faced with data challenges: a systematic way of approaching complex Big Data projects.
The document follows an architectural blueprint in order to classify the components correctly within the Big Data architecture landscape. We can think of this blueprint as a way of standardizing the names and roles of the different technologies used in an end-to-end Big Data project.
When Big Data projects reach a certain point, they should be agile, adaptable systems that can be easily modified, which requires a fair understanding of the software stack as a whole. This work tries to help with the decision of which components to use by considering each one's area of focus.
Many of the ideas, classifications and descriptions are based on the work of other authors. I’ve paraphrased many sentences or entire paragraphs because I have not found a better way to express the ideas. At the end of this document I’ve listed the references per author.