This is a series of articles about coding Apache NiFi, written to help newcomers (like me) understand this awesome project and its development community. In this series we are going to study the Apache NiFi ecosystem from a coding perspective. The idea is to learn step by step, because modifying a code base as large as Apache NiFi is a complex endeavor.
First Step: The Build System
Probably one of the first steps towards understanding the project is to analyze its build structure, which is based on Apache Maven. This first article focuses on this build tool.
Apache Maven Introduction
Apache Maven is one of the most popular and widespread build automation tools for Java. Other tools in this field are Apache Ant (with Apache Ivy) and Gradle.
Apache Maven is probably the de facto standard for building Java applications. It has established foundations that are also used by other build tools:
- Maven Standard Directory Layout
- Artifact Naming
- Artifact Repository Infrastructure
We can install the latest version from maven.apache.org without interfering with the Maven already installed on the system by using the alternatives command:
$ alternatives --display mvn
mvn - status is auto.
 link currently points to /usr/share/maven/bin/mvn
/usr/share/maven/bin/mvn - priority 0
 slave mvnDebug: /usr/share/maven/bin/mvnDebug
 slave mvn1: /usr/share/maven/bin/mvn.1.gz
 slave mvnDebug1: /usr/share/maven/bin/mvn.1.gz
Current `best' version is /usr/share/maven/bin/mvn.
$ which mvn
/usr/bin/mvn
$ sudo alternatives --install /usr/bin/mvn mvn /opt/apache-maven-3.6.3/bin/mvn 2
$ sudo alternatives --config mvn
There are 2 programs which provide 'mvn'.
  Selection    Command
-----------------------------------------------
   1           /usr/share/maven/bin/mvn
*+ 2           /opt/apache-maven-3.6.3/bin/mvn
Enter to keep the current selection[+], or type selection number:
Here are the key concepts Maven provides:
- Set of build standards for all projects.
- Standard life cycles for building, testing, packaging, deploying, publishing.
- Default Tasks execution.
- Common declarative Project Object Model (POM) as building block.
- Dependencies management.
- Artifact repository infrastructures.
- Modular Design (Plug-ins) and re-usability design.
According to the official Apache Maven documentation, the philosophy behind Apache Maven is:
Maven was born of the very practical desire to make several projects at Apache work in the same way. So that developers could freely move between these projects, knowing clearly how they all worked by understanding how one of them worked. If a developer spent time understanding how one project built it was intended that they would not have to go through this process again when they moved on to the next project. The same idea extends to testing, generating documentation, generating metrics and reports, testing and deploying.
Central Place: The POM File
A Maven project is guided by a control file named “Project Object Model” (POM), or the POM file. This file is named pom.xml, and it’s an XML representation of this “Project Object Model”. In fact, in the Maven world, a project need not contain any code at all, merely a pom.xml file.
So the pom.xml file is the core of a project’s configuration in Maven. It is a single configuration file that contains the majority of information required to build a project in just the way you want. The POM is huge and can be daunting in its complexity.
The pom.xml file contains information about the project and the configuration Maven needs to build it, such as dependencies, build directory, source directory, test source directory, plugins, goals, and so on.
Maven reads the pom.xml file and then executes the goal. Before Maven 2 this file was named project.xml; since Maven 2 (and also in Maven 3) it is named pom.xml.
This figure is a listing of the elements directly under the POM’s project element. Notice that modelVersion contains 4.0.0. That is currently the only supported POM version, and is always required.
Note: The POM 4.0.0 XSD and descriptor reference documentation
Maven Coordinates are used to identify artifacts. They uniquely identify a project, a dependency, or a plugin defined in a POM. Each entity is uniquely identified by the combination of a group identifier, an artifact identifier, and a version.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>domain.organization.my-item</groupId>
  <artifactId>my-project</artifactId>
  <version>1.0</version>
</project>
The POM defined above is the bare minimum that Maven allows.
groupId:artifactId:version are all required fields (although groupId and version do not need to be explicitly defined if they are inherited from a parent; more on inheritance later). The three fields act much like an address and timestamp in one. They mark a specific place in a repository, acting like a coordinate system for Maven projects.
- groupId – typically unique to an organization. Often the organization’s reverse domain is used, but not always: it can be just ‘junit’.
- artifactId – typically the project name; a descriptor for the artifact.
- version – refers to a specific version of the project.
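To make the "coordinate system" idea concrete, here is a minimal Python sketch (not Maven code) that parses a groupId:artifactId:version string and derives the relative path such an artifact would have in a repository that follows the standard Maven layout. The class and method names are made up for illustration, and the sketch ignores details such as classifiers and timestamped SNAPSHOT file names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coordinates:
    """Hypothetical helper type: the three mandatory Maven coordinates."""
    group_id: str
    artifact_id: str
    version: str

    @classmethod
    def parse(cls, text: str) -> "Coordinates":
        # Expect exactly three non-empty, colon-separated fields.
        parts = text.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {text!r}")
        return cls(*parts)

    def repository_path(self, packaging: str = "jar") -> str:
        # Standard layout: dots in the groupId become directory separators,
        # followed by artifactId/version/artifactId-version.packaging.
        group_dir = self.group_id.replace(".", "/")
        file_name = f"{self.artifact_id}-{self.version}.{packaging}"
        return f"{group_dir}/{self.artifact_id}/{self.version}/{file_name}"


coords = Coordinates.parse("domain.organization.my-item:my-project:1.0")
print(coords.repository_path())
# prints: domain/organization/my-item/my-project/1.0/my-project-1.0.jar
```

The same path is used relative to the local repository folder and relative to the base URL of a remote repository, which is why the three fields behave like an address.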
Version numbering: most Maven projects follow the Semantic Versioning scheme (https://semver.org/). Typically:
- Major Version
- Minor Version
- Incremental Version (or patch level)
However, you can follow other patterns, for example by adding:
- Build number (from CI tools), or
- Qualifiers (such as Beta, Alpha, and so on).
In Maven, the qualifier named SNAPSHOT is of special interest, for example 1.2.3-SNAPSHOT. The SNAPSHOT suffix is an important qualifier for Maven’s behavior:
- Tells Maven this is a development version
- Not stable, and Maven should check for newer versions
- Maven will first check locally, then check remote repositories (By default, Maven will check remote repositories once per day).
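The version pattern described above can be sketched in a few lines of Python. This is a simplified illustration only: the regular expression and the Version type are assumptions made for this example, and Maven's real version parsing and comparison rules are more elaborate than this.

```python
import re
from typing import NamedTuple, Optional


class Version(NamedTuple):
    """Hypothetical, simplified view of a Maven version string."""
    major: int
    minor: int
    incremental: int
    qualifier: Optional[str]

    @property
    def is_snapshot(self) -> bool:
        # A development version carries the SNAPSHOT qualifier.
        return self.qualifier is not None and self.qualifier.upper().endswith("SNAPSHOT")


# major[.minor[.incremental]][-qualifier]
_PATTERN = re.compile(r"^(\d+)(?:\.(\d+))?(?:\.(\d+))?(?:-(.+))?$")


def parse_version(text: str) -> Version:
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported version format: {text!r}")
    major, minor, incremental, qualifier = match.groups()
    # Missing minor or incremental parts default to 0.
    return Version(int(major), int(minor or 0), int(incremental or 0), qualifier)


print(parse_version("1.2.3-SNAPSHOT"))
print(parse_version("1.2.3-SNAPSHOT").is_snapshot)  # prints: True
print(parse_version("1.0").is_snapshot)             # prints: False
```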
Now that we have our unique identifier groupId:artifactId:version, there is one more standard label: the project’s packaging.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  ...
  <packaging>war</packaging>
  ...
</project>
The current core packaging values are:
Maven Repositories are locations where project artifacts are stored. We can classify them into the following types:
- Local Repository: Repository on the local file system, typically under the $HOME/.m2/ folder.
- Central Repository: Public repository hosted by the Maven community, by default at https://repo.maven.apache.org/maven2/
- Remote Repository: Other locations, which can be public or private (JBoss community, Android community, privately hosted in companies, and so on).
The Central Repository is one of the most powerful features of the Maven community. There is a user-friendly view at: https://mvnrepository.com/
Maven Effective POM
All Maven projects inherit a default configuration from the so-called Super POM. The Super POM provides common standards for directories and configuration, so we can write a simple pom.xml for our project without rewriting the same settings over and over. The Effective POM is therefore the result of this equation:
Effective POM = Super POM + Simplest POM (our project POM)
So the Effective POM is the complete POM with all inherited properties. You can check your effective POM with the command below.
mvn help:effective-pom
We can also write the output of this command to a text file:
mvn help:effective-pom -Doutput=effective-pom-result.txt
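The equation "Effective POM = Super POM + project POM" can be pictured as a recursive merge in which the project's values win over the inherited defaults. The Python sketch below models both POMs as plain dictionaries, which is an assumption made purely for illustration: the keys shown are a tiny hand-picked subset, and Maven's real model merging has element-specific rules that this toy version does not reproduce.

```python
from typing import Any, Dict


def merge_pom(parent: Dict[str, Any], child: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where child values override or extend parent values."""
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections (such as <build>) are merged key by key.
            merged[key] = merge_pom(merged[key], value)
        else:
            merged[key] = value
    return merged


# Illustrative defaults only, loosely modelled on the standard directory layout.
super_pom = {
    "build": {
        "directory": "target",
        "sourceDirectory": "src/main/java",
        "testSourceDirectory": "src/test/java",
    },
}

# Our "simplest POM", plus one override to show precedence.
project_pom = {
    "groupId": "domain.organization.my-item",
    "artifactId": "my-project",
    "version": "1.0",
    "build": {"sourceDirectory": "src"},
}

effective = merge_pom(super_pom, project_pom)
print(effective["build"]["sourceDirectory"])  # prints: src
print(effective["build"]["directory"])        # prints: target
```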
A dependency is an artifact which your Maven project depends upon, typically a JAR or another POM. Two concepts are important here:
- Transitive Dependencies: A dependency can depend on other dependencies, so we end up with many levels of dependencies.
- Cyclic Dependencies: A problem that can arise with transitive dependencies and that Maven does not support (A depends on B, and B depends on A).
Another important concept at this point is Dependency Mediation, which determines what version to use when multiple versions of the same dependency are encountered. The rule is “Nearest Definition”:
“nearest definition” means that the version used will be the closest one to your project in the tree of dependencies. For example, if dependencies for A, B, and C are defined as A -> B -> C -> D 2.0 and A -> E -> D 1.0, then D 1.0 will be used when building A because the path from A to D through E is shorter. You could explicitly add a dependency to D 2.0 in A to force the use of D 2.0.
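The "nearest definition" rule can be modelled as a breadth-first walk over the dependency tree: the first version of an artifact that the walk reaches is the one closest to the project, and it wins. The following Python sketch reproduces the A/B/C/D/E example from the quote. It is an illustration of the rule, not Maven's resolver; the graph structure and function name are invented for this example, and ties at the same depth are resolved in favor of the first declaration.

```python
from collections import deque
from typing import Dict, List, Tuple

Artifact = Tuple[str, str]  # (name, version)
Graph = Dict[Artifact, List[Artifact]]  # direct dependencies, in declaration order


def mediate(root: Artifact, graph: Graph) -> Dict[str, str]:
    """Pick one version per artifact name using the nearest-definition rule."""
    chosen: Dict[str, str] = {}
    queue = deque(graph.get(root, []))
    while queue:
        name, version = queue.popleft()
        if name in chosen:
            continue  # a nearer (or earlier declared) definition already won
        chosen[name] = version
        # Children are queued behind everything at the current depth.
        queue.extend(graph.get((name, version), []))
    return chosen


graph: Graph = {
    ("A", "1.0"): [("B", "1.0"), ("E", "1.0")],
    ("B", "1.0"): [("C", "1.0")],
    ("C", "1.0"): [("D", "2.0")],
    ("E", "1.0"): [("D", "1.0")],
}
print(mediate(("A", "1.0"), graph)["D"])  # prints: 1.0

# Declaring D 2.0 directly in A makes it the nearest definition.
forced = dict(graph)
forced[("A", "1.0")] = [("B", "1.0"), ("E", "1.0"), ("D", "2.0")]
print(mediate(("A", "1.0"), forced)["D"])  # prints: 2.0
```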
Maven Standard Directory Layout
According to the official Apache Maven documentation: The next figure documents the directory layout expected by Maven and the directory layout created by Maven. Please try to conform to this structure as much as possible; however, if you can’t, these settings can be overridden via the project descriptor.
Maven Build Lifecycles
Maven is based around the central concept of a Build Lifecycle. This Maven lifecycle concept covers many of the steps that are expected to occur in a project’s development lifetime.
- A build lifecycle is a pre-defined group of build steps, or stages, called PHASES.
- Each phase can be bound to one or more PLUGIN GOALS (everything in Maven is a plugin). A Maven plugin is a container that supplies GOALS, and the code implemented in the goals is what does the real work. Each of a plugin’s goals can be bound to any of the lifecycle phases, so a phase can be bound to one or many plugin goals, and one plugin goal can be used in one or many phases of different lifecycles.
- These plugin goals are called in sequence. When invoking mvn <phase>, Maven passes through all phases (every time) and executes all goals (supplied by plugins) that have been bound to any of the phases prior and up to (and including) the given phase. If a phase has no goal bound to it, nothing is done, but the phase is passed nevertheless.
There are three built-in build lifecycles:
- The default lifecycle handles your project build and deployment.
- The clean lifecycle handles project cleaning, removing all artifacts from the working directory.
- The site lifecycle handles the creation of your project’s site documentation.
Each of these build lifecycles is defined by a different list of build phases, wherein a build phase represents a stage in the lifecycle.
The Default Build Lifecycle
The default lifecycle comprises the following phases (also known as build phases). For a complete list of the lifecycle phases, refer to the Lifecycle Reference.
- validate – validate the project is correct and all necessary information is available.
- compile – compile the source code of the project.
- test – test the compiled source code using a suitable unit testing framework. These tests should not require the code be packaged or deployed.
- package – take the compiled code and package it in its distributable format, such as a JAR.
- verify – run any checks on results of integration tests to ensure quality criteria are met.
- install – install the package into the local repository, for use as a dependency in other projects locally.
- deploy – done in the build environment, copies the final package to the remote repository for sharing with other developers and projects.
These lifecycle phases (plus the other lifecycle phases not shown here) are executed sequentially to complete the default lifecycle. Given the lifecycle phases above, this means that when the default lifecycle is used, Maven will first validate the project, then will try to compile the sources, run those against the tests, package the binaries (e.g. jar), run integration tests against that package, verify the integration tests, install the verified package to the local repository, then deploy the installed package to a remote repository.
Note: A build phase is made up of plugin goals. However, even though a build phase is responsible for a specific step in the build lifecycle, the manner in which it carries out those responsibilities may vary. And this is done by declaring the PLUGIN GOALS bound to those build phases. A plugin goal represents a specific task (finer than a build phase) which contributes to the building and managing of a project. It may be bound to zero or more build phases. A goal not bound to any build phase could be executed outside of the build lifecycle by direct invocation. The order of execution depends on the order in which the goal(s) and the build phase(s) are invoked.
For example, consider the command below. The clean and package arguments are build phases, while the dependency:copy-dependencies is a goal (of a plugin). Example:
mvn clean dependency:copy-dependencies package
If this were to be executed, the clean phase will be executed first (meaning it will run all preceding phases of the clean lifecycle, plus the clean phase itself), and then the dependency:copy-dependencies goal, before finally executing the package phase (and all its preceding build phases of the default lifecycle).
Moreover, if a goal is bound to one or more build phases, that goal will be called in all those phases.
Furthermore, a build phase can also have zero or more goals bound to it. If a build phase has no goals bound to it, that build phase will not execute. But if it has one or more goals bound to it, it will execute all those goals.
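As a rough mental model of how a command line such as mvn clean dependency:copy-dependencies package is expanded, consider the Python sketch below. It is not how Maven is implemented: the default lifecycle is abridged to the phases listed in this article, and the phase-to-goal bindings are a small illustrative subset of what a jar project typically has.

```python
from typing import Dict, List

LIFECYCLES: Dict[str, List[str]] = {
    "clean": ["pre-clean", "clean", "post-clean"],
    # Abridged: only the default lifecycle phases listed in this article.
    "default": ["validate", "compile", "test", "package", "verify", "install", "deploy"],
}

# Illustrative subset of phase -> goal bindings for a jar project.
BINDINGS: Dict[str, List[str]] = {
    "clean": ["clean:clean"],
    "compile": ["compiler:compile"],
    "test": ["surefire:test"],
    "package": ["jar:jar"],
    "install": ["install:install"],
    "deploy": ["deploy:deploy"],
}


def plan(arguments: List[str]) -> List[str]:
    """Expand command-line arguments into the ordered list of goals to run."""
    steps: List[str] = []
    for argument in arguments:
        if ":" in argument:
            steps.append(argument)  # a goal invoked directly, outside any phase
            continue
        for phases in LIFECYCLES.values():
            if argument in phases:
                # Run every phase up to and including the requested one;
                # phases without bound goals contribute nothing.
                for phase in phases[: phases.index(argument) + 1]:
                    steps.extend(BINDINGS.get(phase, []))
                break
        else:
            raise ValueError(f"unknown phase: {argument}")
    return steps


print(plan(["clean", "dependency:copy-dependencies", "package"]))
# prints: ['clean:clean', 'dependency:copy-dependencies', 'compiler:compile', 'surefire:test', 'jar:jar']
```

Note how validate and pre-clean are passed but produce no work in this model, because no goal is bound to them.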
Maven Build Profiles
A build profile is a set of configuration values that can be used to set or override default values of a Maven build. Using a build profile, you can customize the build for different environments, such as production versus development.
Profiles are specified in the pom.xml file using its profiles element (the activeProfiles element belongs to settings.xml) and are triggered in a variety of ways. Profiles modify the POM at build time and are used to give parameters different values for different target environments; as such, profiles can easily lead to differing build results from different members of your team.
Maven build profiles are a way of modifying the behavior of our build.
Build profiles can be defined at three levels of scope:
- Per Project: Defined in the POM itself
- Per User: Defined in the file
- Per System Wide (Global): Defined in the global Maven-settings
After we create one or more profiles we can start using them, or in other words, activating them. We can use the following goal to see which profiles are active in our default build:
A profile can be activated in different ways:
- Active by Default using configuration
<profile>
  <id>my-profile</id>
  <activation>
    <activeByDefault>true</activeByDefault>
  </activation>
</profile>
- Based on a Property: We can activate profiles on the command-line. However, sometimes it’s more convenient if they’re activated automatically. For instance, we can base it on a -D system property
<profile>
  <id>active-on-property-environment</id>
  <activation>
    <property>
      <name>environment</name>
    </property>
  </activation>
</profile>
mvn package -Denvironment
- Based on JDK version
<profile>
  <id>active-on-jdk-11</id>
  <activation>
    <jdk>11</jdk>
  </activation>
</profile>
- Based on the Operating System
<profile>
  <id>active-on-windows-10</id>
  <activation>
    <os>
      <name>windows 10</name>
      <family>Windows</family>
      <arch>amd64</arch>
      <version>10.0</version>
    </os>
  </activation>
</profile>
- Based on a File: For example we activate the execution of profile only if the my-custom-file.html is not yet present
<activation>
  <file>
    <missing>target/my-custom-file.html</missing>
  </file>
</activation>
- The more standard way: Using the parameter -P
mvn package -P profile1,profile2
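The activation rules above can be summarized in a small Python model. It is a sketch under simplifying assumptions: the Profile type and its fields are invented for this example, only three triggers are modelled (explicit -P selection, presence of a property, and a JDK version prefix), and all profiles are assumed to live in the same POM. One detail worth showing is that a profile marked activeByDefault stays active only as long as no other profile in that POM is activated.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Profile:
    """Hypothetical, reduced view of a <profile> element."""
    id: str
    active_by_default: bool = False
    property_name: Optional[str] = None  # <activation><property><name>
    jdk: Optional[str] = None            # <activation><jdk>, matched as a prefix


def active_profiles(
    profiles: List[Profile],
    properties: Dict[str, str],
    java_version: str,
    requested: Set[str],
) -> List[str]:
    triggered = [
        p.id
        for p in profiles
        if p.id in requested                                   # mvn ... -P id
        or (p.property_name is not None and p.property_name in properties)
        or (p.jdk is not None and java_version.startswith(p.jdk))
    ]
    if triggered:
        return triggered
    # Defaults apply only when nothing else in the same POM was activated.
    return [p.id for p in profiles if p.active_by_default]


profiles = [
    Profile("my-profile", active_by_default=True),
    Profile("active-on-property-environment", property_name="environment"),
    Profile("active-on-jdk-11", jdk="11"),
]

print(active_profiles(profiles, {}, "17.0.2", set()))
# prints: ['my-profile']
print(active_profiles(profiles, {"environment": "true"}, "17.0.2", set()))
# prints: ['active-on-property-environment']
```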
Maven Archetypes are project templates ready for bootstrapping a Maven Project. Apache Maven provides a variety of standard archetypes to serve as starters for common Java projects. Maven Archetypes are also available from a variety of 3rd parties.
For example let’s create a simple Apache Maven project based on the Java quick start template (archetype):
$ mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DarchetypeArtifactId=maven-archetype-quickstart
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------< org.apache.maven:standalone-pom >-------------------
[INFO] Building Maven Stub Project (No POM) 1
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] >>> maven-archetype-plugin:3.1.2:generate (default-cli) > generate-sources @ standalone-pom >>>
[INFO]
[INFO] <<< maven-archetype-plugin:3.1.2:generate (default-cli) < generate-sources @ standalone-pom <<<
[INFO]
[INFO]
[INFO] --- maven-archetype-plugin:3.1.2:generate (default-cli) @ standalone-pom ---
[INFO] Generating project in Interactive mode
[INFO] Archetype [org.apache.maven.archetypes:maven-archetype-quickstart:1.4] found in catalog remote
Define value for property 'groupId': MyGroupID
Define value for property 'artifactId': MyArtifactID
Define value for property 'version' 1.0-SNAPSHOT: :
Define value for property 'package' MyGroupID: : MyPackage
Confirm properties configuration:
groupId: MyGroupID
artifactId: MyArtifactID
version: 1.0-SNAPSHOT
package: MyPackage
 Y: : Y
[INFO] ----------------------------------------------------------------------------
[INFO] Using following parameters for creating project from Archetype: maven-archetype-quickstart:1.4
[INFO] ----------------------------------------------------------------------------
[INFO] Parameter: groupId, Value: MyGroupID
[INFO] Parameter: artifactId, Value: MyArtifactID
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: package, Value: MyPackage
[INFO] Parameter: packageInPathFormat, Value: MyPackage
[INFO] Parameter: package, Value: MyPackage
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: groupId, Value: MyGroupID
[INFO] Parameter: artifactId, Value: MyArtifactID
[INFO] Project created from Archetype in dir: /home/user/MyArtifactID
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 25.395 s
[INFO] Finished at: 2020-04-27T08:47:06+02:00
[INFO] ------------------------------------------------------------------------
$ tree MyArtifactID/
MyArtifactID/
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── MyPackage
    │           └── App.java
    └── test
        └── java
            └── MyPackage
                └── AppTest.java
The Maven Wrapper is an excellent choice for projects that need a specific version of Maven (or for users that don’t want to install Maven at all). Instead of installing many versions of it in the operating system, we can just use the project-specific wrapper script. It’s the same concept as Gradle Wrapper.
First, we need to go to the main folder of the project and run this command:
$ mvn -N io.takari:maven:wrapper
We can also specify the version of Maven
$ mvn -N io.takari:maven:wrapper -Dmaven=3.5.2
The option -N means --non-recursive, so that the wrapper will only be applied to the main project of the current directory, not to any submodules.
After executing the goal, we’ll have more files and directories in the project:
- mvnw: an executable Unix shell script used in place of a fully installed Maven.
- mvnw.cmd: the Batch version of the above script.
- .mvn: the hidden folder that holds the Maven Wrapper Java library and its properties file.
Note: This plugin is in the process of being added to Apache Maven as an ASF project: http://incubator.apache.org/ip-clearance/maven-wrapper.html
A multi-module project is built from an aggregator POM that manages a group of submodules. Usually the aggregator is located in the project’s root directory and must have packaging of type pom.
It’s important to point out that each module is effectively a Maven project that just happens to inherit from its parent module. The submodules are regular Maven projects, and they can be built separately or through the aggregator POM.
The mechanism in Maven that handles multi-module projects is referred to as the Reactor. This part of the Maven core does the following:
- The Reactor collects all the available modules to build.
- The Reactor then runs the selected build lifecycle against each module.
- The Reactor determines the build order of the modules.
- By default the Reactor builds modules sequentially. Optionally, it can use threads to build modules in parallel.
Some factors determining the build order used by The Reactor:
- A project dependency on another module in the build.
- A plugin declaration where the plugin is another module in the build.
- A plugin dependency on another module in the build.
- A build extension declaration on another module in the build.
- The order declared in the <modules> element (if no other rule applies).
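The ordering rules above boil down to a topological sort that falls back to the declared order when no dependency applies. Here is a minimal Python sketch of that idea. The module names are hypothetical, the function is not Maven's actual sorter, and all four kinds of inter-module relationships from the list are collapsed into a single "depends on" mapping.

```python
from typing import Dict, List


def reactor_order(modules: List[str], depends_on: Dict[str, List[str]]) -> List[str]:
    """Order modules so that each one comes after the modules it depends on."""
    declared = set(modules)
    ordered: List[str] = []
    state: Dict[str, str] = {}  # module -> "visiting" or "done"

    def visit(module: str) -> None:
        if state.get(module) == "done":
            return
        if state.get(module) == "visiting":
            raise ValueError(f"cyclic dependency involving {module}")
        state[module] = "visiting"
        for dependency in depends_on.get(module, []):
            if dependency in declared:  # external artifacts do not affect the order
                visit(dependency)
        state[module] = "done"
        ordered.append(module)

    # Walking in declaration order keeps the <modules> order as the tie-breaker.
    for module in modules:
        visit(module)
    return ordered


# Hypothetical three-module build declared as app, core, api.
print(reactor_order(["app", "core", "api"], {"app": ["core"], "core": ["api"]}))
# prints: ['api', 'core', 'app']
print(reactor_order(["app", "core", "api"], {}))
# prints: ['app', 'core', 'api']
```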
Overview of Maven in Apache NiFi
With the above Apache Maven introduction, hopefully we have enough resources to analyze the Apache NiFi structure by means of Apache Maven’s capabilities.
NiFi Sub modules
We can rely on the Maven Reactor to inspect the Maven sub modules defined in Apache NiFi. For example, we can use the following command to get the Reactor output:
mvn help:evaluate -Dexpression=project.modules
Maven reports 472 modules, so we have 472 projects in folders (and nested folders). Most of the modules tagged [pom] are probably projects with nested modules too, so we can use the following command to inspect the sub modules of such a module.
mvn -f nifi-toolkit help:evaluate -Dexpression=project.modules | grep '\[INFO\].*.*\[\(...\|maven-archetype\)]'
This example shows that the sub module nifi-toolkit has 10 child sub modules.
Inspecting the NiFi Maven sub modules by means of the Maven CLI is not the best approach in this case because of the huge number of modules. Probably the best way is to use an IDE; for example, the Maven plugin in IntelliJ IDEA makes this task easier.
Let’s focus on the child sub modules of the root Maven pom.
<modules>
  <module>nifi-commons</module>
  <module>nifi-api</module>
  <module>nifi-framework-api</module>
  <module>nifi-bootstrap</module>
  <module>nifi-mock</module>
  <module>nifi-nar-bundles</module>
  <module>nifi-assembly</module>
  <module>nifi-docs</module>
  <module>nifi-maven-archetypes</module>
  <module>nifi-external</module>
  <module>nifi-toolkit</module>
  <module>nifi-docker</module>
  <module>nifi-system-tests</module>
</modules>
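If we only need the list of direct sub modules, reading the <modules> element straight from a pom.xml is a lightweight alternative to running Maven. The Python sketch below does this with the standard library. The embedded POM is a shortened, made-up sample (its groupId, artifactId and version are placeholders, not NiFi's real values), and the sketch only looks at top-level modules, not at modules declared inside profiles.

```python
import xml.etree.ElementTree as ET
from typing import List

# POM elements live in this XML namespace, so lookups must be namespace-aware.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def list_modules(pom_xml: str) -> List[str]:
    """Return the module names declared directly under <project><modules>."""
    root = ET.fromstring(pom_xml)
    return [
        module.text.strip()
        for module in root.findall("m:modules/m:module", POM_NS)
        if module.text
    ]


# Placeholder aggregator POM with a subset of the modules listed above.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>example-root</artifactId>
  <version>1.0</version>
  <packaging>pom</packaging>
  <modules>
    <module>nifi-commons</module>
    <module>nifi-api</module>
    <module>nifi-toolkit</module>
  </modules>
</project>"""

print(list_modules(SAMPLE_POM))
# prints: ['nifi-commons', 'nifi-api', 'nifi-toolkit']
```

To apply it to a real checkout, the same function could be fed the contents of the root pom.xml and then, recursively, the pom.xml of each returned module folder.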
Sub module nifi-commons
Recently the NiFi code structure was reworked, discussion here:
This module is the result of the above discussion. It contains shared code and implementations, such as security provider implementations for authentication and authorization. It’s the home for shared, top-level, reusable libraries of components used across the other projects.
According to committer Kevin Doran:
A collection of libraries for NiFi projects. Currently, just security-related libraries, but in the future this could hold other shared code as well.
Sub module nifi-api
Contains any Java APIs that need to be agreed upon by multiple top-level projects, such as nifi-framework-api or nifi-bootstrap.
Sub module nifi-framework-api
The set of libraries used in the Core Framework development. The nifi-framework-api modules are designed for code reuse in different components of the NiFi project.
Sub module nifi-bootstrap
This module is the entry point (the main method) for starting the Apache NiFi system. It is the first place where a developer can begin to study the code.
Sub module nifi-mock
This module is known as NiFi’s Mock Framework. The nifi-mock module can be used in conjunction with JUnit to provide extensive testing of components. The Mock Framework is mostly aimed at testing Processors, as these are by far the most commonly developed extension point. However, the framework does provide the ability to test Controller Services as well.
Sub module nifi-nar-bundles
This is one of the core modules of NiFi: it is the container of every NiFi Processor and Service within the platform, and it holds the source code of every NAR file built in the platform. NAR stands for NiFi Archive, and it is the way Apache NiFi packages its artifacts. The reason for creating a custom packaging for a NiFi processor is to provide a bit of Java ClassLoader isolation. The processors and controller services all come from different companies and contributors, utilizing differing versions of libraries (such as Apache Commons). A NAR file provides isolation from the potential issue of NoClassDefFoundError exceptions being thrown because the wrong version of a dependency was already loaded in the ClassLoader by a different processor. A NAR file essentially mirrors the Web application ARchive (WAR) or Java ARchive (JAR) formats.
Sub module nifi-assembly
This module is based on the Assembly Plugin for Maven, which enables NiFi developers to combine project output into a single distributable archive that also contains dependencies, modules, site documentation, and other files.
Sub module nifi-docs
This module is dedicated to building the Apache NiFi documentation. It is based on the AsciiDoc tool http://asciidoc.org/ and the Asciidoctor Maven Plugin https://asciidoctor.org/docs/asciidoctor-maven-plugin from the Asciidoctor project https://asciidoctor.org, the maintainer of the official definition of the AsciiDoc syntax.
Sub module nifi-maven-archetypes
In this module NiFi provides two Maven Archetypes for easily creating the Processor bundle and Controller Service bundle project structures.
Example of usage of the Maven Processor Archetype:
mvn archetype:generate -DarchetypeGroupId=org.apache.nifi -DarchetypeArtifactId=nifi-processor-bundle-archetype -DarchetypeVersion=1.0.0 -DnifiVersion=1.0.0
Example of usage of the Maven Controller Service Archetype:
mvn archetype:generate -DarchetypeGroupId=org.apache.nifi -DarchetypeArtifactId=nifi-service-bundle-archetype -DarchetypeVersion=1.0.0 -DnifiVersion=1.0.0
Sub module nifi-external
The nifi-external module is a location where components can be developed by the NiFi team that are not intended to be used directly by NiFi but are to be used within other frameworks in order to integrate with NiFi.
The sub modules within this module are:
Sub module nifi-toolkit
This module is the home of the NiFi Toolkit, which contains command line utilities for administrators to support NiFi maintenance in standalone and clustered environments.
For example the NiFi CLI:
Or NiFi Notify:
Sub module nifi-docker
This module contains utilities for running Apache NiFi stand-alone in Docker: a Docker Compose definition, a Dockerfile, scripts for integration tests, and other container-related actions.
Note: This module is based on the Spotify Dockerfile Maven Plugin: https://github.com/spotify/dockerfile-maven
Note: According to the note in the GitHub project, this plugin no longer seems to be evolving.
Sub module nifi-system-tests
This module is dedicated to performing integration tests that start an entire NiFi application, create/update/delete components, and verify behavior via the REST API.
NiFi Build Profiles
We can get the list of Build Profiles defined in Apache NiFi with the following command:
mvn help:all-profiles | grep -v INFO | less
Fortunately, the Apache NiFi developers comment most of the profiles in the pom files. For example, these are some of the comments about profiles in the top-level pom file:
Performs execution of Integration Tests using the Maven FailSafe Plugin. The view of integration tests in this context are those tests interfacing with external sources and services requiring additional resources or credentials that cannot be explicitly provided. Also appropriate for tests which depend on inter-thread and/or network or having timing considerations which could make the tests brittle on various environments.
Checks style and licensing requirements. This is a good idea to run for contributions and for the release process. While it would be nice to run always these plugins can considerably slow the build and have proven to create unstable builds in our multi-module project and when building using multiple threads. The stability issues seen with Checkstyle in multi-module builds include false-positives and false negatives.
This profile will disable DocLint which performs strict JavaDoc processing which was introduced in JDK 8. These are technically errors in the JavaDoc which we need to eventually address. However, if a release is performed using JDK 8 or newer, the JavaDoc generation would fail. By activating this profile when running on JDK 8 or newer we can ensure the JavaDocs continue to generate successfully.
This profile, activating when compiling on Java versions above 1.8, provides configuration changes to allow NiFi to be compiled on those JDKs.
We can get a list of repositories used in Apache NiFi with this tricky command:
mvn dependency:list-repositories | grep url | sort | uniq
In this article we have been able to gather a lot of information about Apache NiFi using Apache Maven commands for inspecting the project. Obviously we have to go deeper in order to be able to hack the code: This is only the beginning of an exciting journey!