Introduction

In this hands-on article we will build two of the most interesting technologies in the Big Data ecosystem: Apache Kudu and Apache Impala. We will build both from source code, package them, and deploy them with the minimum configuration needed to make them functional (we will leave performance-related configuration aside on this occasion). In addition, we will run Apache Impala as independently of HDFS as possible.

Apache Kudu and Apache Impala publish their versions as "source releases" only, so no official binaries are made available and users are expected to build the projects from source code. Most projects in the Big Data ecosystem and the Apache Software Foundation are Java based, which makes them comparatively easy to build and, above all, to distribute (in most cases a JVM on the target machine is all that is needed).

That is not the case for Apache Kudu and Apache Impala: both code bases are C++, which makes building them, and especially distributing them to the target systems, somewhat more complicated.

Built Versions

To build the two technologies we will use the "main" branch of both projects, so the result corresponds to the upcoming, still-in-development versions; in particular:

  1. Apache Kudu – 1.17.0-SNAPSHOT
  2. Apache Impala – 4.1.0-SNAPSHOT

To build the source code we will use CentOS 7.x as the development platform, since we are going to deploy the technologies onto CentOS 7.x based hosts.

Building Kudu – 1.17.0-SNAPSHOT

sudo yum -y install autoconf automake curl cyrus-sasl-devel cyrus-sasl-gssapi \
  cyrus-sasl-plain flex gcc gcc-c++ gdb git java-1.8.0-openjdk-devel \
  krb5-server krb5-workstation libtool make openssl-devel patch pkgconfig \
  redhat-lsb-core rsync unzip vim-common which tree vim

# The SCL repository has to be enabled before devtoolset-8 can be installed:
sudo yum -y install centos-release-scl-rh
sudo yum -y install devtoolset-8

git clone https://github.com/apache/kudu kudu.git

cd kudu.git

build-support/enable_devtoolset.sh thirdparty/build-if-necessary.sh

mkdir -p build/release

cd build/release 

../../build-support/enable_devtoolset.sh \
 ../../thirdparty/installed/common/bin/cmake \
 -DCMAKE_BUILD_TYPE=release \
 ../.. 

make -j4 

sudo mkdir /opt/kudu && sudo chown -R ${USER}: /opt/kudu 

make DESTDIR=/opt/kudu install

In the case of Apache Kudu, thanks to the “make install” rule, it is easy to organize the artifacts and their dependencies for subsequent packaging and deployment. Let’s take a look at the generated artifacts.

/opt/kudu/
    └── usr
        └── local
            ├── share (lots of files)
            ├── include (lots of files)
            ├── bin
            │   └── kudu (the CLI)
            ├── lib64
            │   ├── libkudu_client.so -> libkudu_client.so.0
            │   ├── libkudu_client.so.0 -> libkudu_client.so.0.1.0
            │   └── libkudu_client.so.0.1.0
            └── sbin
                ├── kudu-master
                └── kudu-tserver
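
Because the target hosts will not have the build toolchain installed, it is worth checking, before packaging, which shared libraries the generated binaries expect at runtime. A quick sanity check on the build machine (standard ldd usage):

ldd /opt/kudu/usr/local/sbin/kudu-master | grep -i 'not found'
ldd /opt/kudu/usr/local/sbin/kudu-tserver | grep -i 'not found'
ldd /opt/kudu/usr/local/bin/kudu | grep -i 'not found'

No output means every dependency resolves on this machine; the remaining runtime dependencies (cyrus-sasl, openssl, krb5, etc.) correspond to the packages we install on the target nodes in the installation step below.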

# We have to copy the web UI document root manually:
cp -r kudu.git/www/ /opt/kudu/usr/local/

# We have to copy the Kudu HMS plugin manually:
mkdir -p /opt/kudu/usr/local/lib
cp kudu.git/build/release/bin/hms-plugin.jar /opt/kudu/usr/local/lib

Packaging 

rm -fr /opt/kudu/usr/local/share
rm -fr /opt/kudu/usr/local/include

tar cvzf kudu-1.17.0-SNAPSHOT.tar.gz /opt/kudu   # ~514 MB compressed

Installation

All Machines

scp kudu-1.17.0-SNAPSHOT.tar.gz <all-nodes>:

sudo yum install cyrus-sasl-gssapi cyrus-sasl-plain cyrus-sasl-devel krb5-server krb5-workstation openssl lzo-devel tzdata

sudo mkdir -p /var/lib/kudu/wal
sudo mkdir -p /var/lib/kudu/data
sudo mkdir -p /var/log/kudu

sudo adduser -r kudu
sudo chown -R kudu: /var/lib/kudu
sudo chown -R kudu: /var/log/kudu

sudo tar xvzf kudu-1.17.0-SNAPSHOT.tar.gz -C /
sudo mkdir /opt/kudu/conf
sudo chown -R kudu: /opt/kudu/
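
If the cluster has more than a handful of nodes, the per-node steps above can be scripted. A minimal sketch, assuming passwordless SSH and passwordless sudo, and a hypothetical nodes.txt file with one hostname per line:

for host in $(cat nodes.txt); do
  scp kudu-1.17.0-SNAPSHOT.tar.gz "${host}":
  ssh "${host}" '
    sudo yum -y install cyrus-sasl-gssapi cyrus-sasl-plain cyrus-sasl-devel \
      krb5-server krb5-workstation openssl lzo-devel tzdata
    sudo mkdir -p /var/lib/kudu/wal /var/lib/kudu/data /var/log/kudu
    sudo adduser -r kudu
    sudo chown -R kudu: /var/lib/kudu /var/log/kudu
    sudo tar xvzf kudu-1.17.0-SNAPSHOT.tar.gz -C /
    sudo mkdir -p /opt/kudu/conf
    sudo chown -R kudu: /opt/kudu/
  '
done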

Single Kudu Master Machine

cat << EOF | sudo tee /opt/kudu/conf/master.gflagfile
--webserver_doc_root=/opt/kudu/usr/local/www
--log_dir=/var/log/kudu
--fs_wal_dir=/var/lib/kudu/wal
--fs_data_dirs=/var/lib/kudu/data
--rpc_encryption=optional
--rpc_authentication=optional
--rpc_negotiation_timeout_ms=5000
EOF

sudo chown kudu: /opt/kudu/conf/master.gflagfile

cat << EOF | sudo tee /etc/systemd/system/kudu-master.service
[Unit]
Description=Apache Kudu Master Server
Documentation=http://kudu.apache.org 

[Service]
Environment=KUDU_HOME=/var/lib/kudu
ExecStart=/opt/kudu/usr/local/sbin/kudu-master --flagfile=/opt/kudu/conf/master.gflagfile
TimeoutStopSec=5
Restart=on-failure
User=kudu

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start kudu-master
sudo systemctl enable kudu-master
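
At this point the master should be running; by default its RPC interface listens on port 7051 and its web UI on port 8051, so a quick check looks like this:

sudo systemctl status kudu-master
curl -s http://kudu-master1.node.keedio.cloud:8051/ > /dev/null && echo 'master web UI is up'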

Kudu Tablet Server Machines

cat << EOF | sudo tee /opt/kudu/conf/tserver.gflagfile
--tserver_master_addrs=kudu-master1.node.keedio.cloud:7051
--webserver_doc_root=/opt/kudu/usr/local/www
--log_dir=/var/log/kudu
--fs_wal_dir=/var/lib/kudu/wal
--fs_data_dirs=/var/lib/kudu/data 
--rpc_encryption=optional
--rpc_authentication=optional
--rpc_negotiation_timeout_ms=5000
EOF

sudo chown kudu: /opt/kudu/conf/tserver.gflagfile

cat << EOF | sudo tee /etc/systemd/system/kudu-tserver.service 
[Unit]
Description=Apache Kudu Tablet Server
Documentation=http://kudu.apache.org 

[Service]
Environment=KUDU_HOME=/var/lib/kudu
ExecStart=/opt/kudu/usr/local/sbin/kudu-tserver --flagfile=/opt/kudu/conf/tserver.gflagfile
TimeoutStopSec=5
Restart=on-failure
User=kudu

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start kudu-tserver
sudo systemctl enable kudu-tserver
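
With the tablet servers registered against the master, the kudu CLI included in the tarball can verify the overall health of the cluster:

/opt/kudu/usr/local/bin/kudu cluster ksck kudu-master1.node.keedio.cloud:7051

It should report the master and every tablet server (and, later on, every tablet) as healthy.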

Building Impala – 4.1.0-SNAPSHOT

git clone https://github.com/apache/impala.git

cd impala

export IMPALA_HOME=$PWD
./bin/bootstrap_system.sh

source ./bin/impala-config.sh
./buildall.sh -noclean -notests

Packaging

mkdir -p /opt/impala/lib64
mkdir /opt/impala/sbin
mkdir /opt/impala/lib
mkdir /opt/impala/bin

cp -r impala/www/ /opt/impala/

cp \
  impala/toolchain/toolchain-packages-gcc7.5.0/gcc-7.5.0/lib64/libgcc_s.so.1 \
  impala/toolchain/toolchain-packages-gcc7.5.0/gcc-7.5.0/lib64/libstdc++.so.6.0.24 \
  /opt/impala/lib64

cp impala/toolchain/toolchain-packages-gcc7.5.0/kudu-67ba3cae45/debug/lib64/libkudu_client.so.0.1.0 /opt/impala/lib64/

cp -P impala/be/build/debug/service/* /opt/impala/sbin/
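
Depending on how impalad was linked, it may look up the Kudu client library by its SONAME (typically libkudu_client.so.0) rather than by the full file name copied above. A hedged check, and the symlink to add to the package if that turns out to be the case:

# Which kudu client object does impalad actually request?
ldd impala/be/build/debug/service/impalad | grep -i kudu

# If it asks for libkudu_client.so.0, provide it inside the package:
ln -sf libkudu_client.so.0.1.0 /opt/impala/lib64/libkudu_client.so.0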

cp impala/fe/target/impala-frontend-4.1.0-SNAPSHOT.jar /opt/impala/lib

cp impala/fe/target/dependency/*.jar /opt/impala/lib

cp -r impala/shell/build/impala-shell-4.1.0-SNAPSHOT /opt/impala/bin/shell

cp -r impala/toolchain/cdp_components-23144489/apache-hive-3.1.3000.7.2.15.0-88-bin/ /opt/hive

cp -r impala/toolchain/cdp_components-23144489/hadoop-3.1.1.7.2.15.0-88/ /opt/hadoop

tar cvzf impala-4.1.0-SNAPSHOT.tar.gz -P \
    /opt/impala/ \
    /opt/hive/ \
    /opt/hadoop/   # ~1861 MB compressed

Installation

All Machines

scp impala-4.1.0-SNAPSHOT.tar.gz <all-nodes>:

sudo yum install -y java-1.8.0-openjdk-devel
sudo adduser -r impala
sudo mkdir /var/log/impala
sudo chown -R impala: /var/log/impala
sudo mkdir -p /opt/impala/conf
sudo chown -R impala: /opt/impala/conf

sudo tar xzf impala-4.1.0-SNAPSHOT.tar.gz -P -C /
sudo chown -R impala: /opt/impala/

sudo tee /etc/ld.so.conf.d/impala.conf <<EOF
/opt/impala/lib64
EOF
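
For the dynamic linker to pick up the new search path, its cache has to be refreshed; afterwards the bundled libraries should be visible to ldconfig:

sudo ldconfig
ldconfig -p | grep -E 'libkudu_client|libstdc'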

HMS on the Master Machine

We install the Hive Metastore on the Kudu and Impala master node (it can of course be installed on a separate host).

sudo yum install -y postgresql postgresql-server postgresql-contrib postgresql-jdbc

sudo postgresql-setup initdb

sudo sed -i "s/#listen_addresses =.*/listen_addresses = '*'/g" /var/lib/pgsql/data/postgresql.conf

echo 'host  all  all 0.0.0.0/0 md5' | sudo tee -a /var/lib/pgsql/data/pg_hba.conf

sudo systemctl start postgresql
sudo systemctl enable postgresql

sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres';"
sudo -u postgres psql -c "CREATE DATABASE metastore;"

psql postgresql://postgres:postgres@kudu-master1.node.keedio.cloud -c '\l' | grep metastore

sudo adduser -r hive
sudo chown -R hive: /opt/hive/
sudo ln -s /usr/share/java/postgresql-jdbc.jar /opt/hive/lib/
sudo cp /opt/kudu/usr/local/lib/hms-plugin.jar /opt/hive/lib
sudo chown hive: /opt/hive/lib/hms-plugin.jar

sh impala-config.sh

/opt/hive/bin/schematool -initSchema -dbType postgres
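
If the initialization went through, schematool can confirm the schema version it just created (standard schematool usage):

/opt/hive/bin/schematool -info -dbType postgres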


sudo systemctl start metastore
sudo systemctl start impala-statestore
sudo systemctl start impala-catalog
sudo systemctl start impala-admission

sudo systemctl enable metastore
sudo systemctl enable impala-statestore
sudo systemctl enable impala-catalog
sudo systemctl enable impala-admission
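
The statestore and catalog daemons expose debug web UIs on ports 25010 and 25020 by default, which gives a quick way to confirm they started correctly; the Metastore logs can be inspected through journald:

curl -s http://kudu-master1.node.keedio.cloud:25010/ > /dev/null && echo 'statestored is up'
curl -s http://kudu-master1.node.keedio.cloud:25020/ > /dev/null && echo 'catalogd is up'
sudo journalctl -u metastore -n 20 --no-pager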

Multiple Impala Servers

The Impala servers run only on the Kudu tablet server nodes (the workers).

sudo adduser -r hive
sudo chown -R hive: /opt/hive/
sudo ln -s /usr/share/java/postgresql-jdbc.jar /opt/hive/lib/
sudo cp /opt/kudu/usr/local/lib/hms-plugin.jar /opt/hive/lib
sudo chown hive: /opt/hive/lib/hms-plugin.jar

sh impala-config.sh

sudo systemctl start impala
sudo systemctl enable impala
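
With at least one impalad running, the impala-shell we packaged earlier can be used for a quick smoke test (the launcher path below assumes the directory layout created in the packaging step):

/opt/impala/bin/shell/impala-shell -i $(hostname -f) -q 'select version();'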

impala-config.sh helper script

The impala-config.sh helper invoked on each node above corresponds to the commands that follow: it creates the systemd units and the configuration files used by the Impala daemons and the Hive Metastore.

cat << EOF | sudo tee /etc/systemd/system/impala.service
[Unit]
Description=Apache Impala Daemon
Documentation=http://impala.apache.org

[Service]
EnvironmentFile=/opt/impala/conf/impala.env
ExecStart=/opt/impala/sbin/impalad --flagfile=/opt/impala/conf/impala.gflagfile
TimeoutStopSec=5
Restart=on-failure
User=impala

[Install]
WantedBy=multi-user.target
EOF

cat << EOF | sudo tee /etc/systemd/system/impala-catalog.service
[Unit]
Description=Apache Impala Catalog Daemon
Documentation=http://impala.apache.org

[Service]
EnvironmentFile=/opt/impala/conf/impala.env
ExecStart=/opt/impala/sbin/catalogd --flagfile=/opt/impala/conf/catalog.gflagfile
TimeoutStopSec=5
Restart=on-failure
User=impala

[Install]
WantedBy=multi-user.target
EOF

cat << EOF | sudo tee /etc/systemd/system/impala-statestore.service
[Unit]
Description=Apache Impala StateStore Daemon
Documentation=http://impala.apache.org

[Service]
EnvironmentFile=/opt/impala/conf/impala.env
ExecStart=/opt/impala/sbin/statestored --flagfile=/opt/impala/conf/statestore.gflagfile
TimeoutStopSec=5
Restart=on-failure
User=impala

[Install]
WantedBy=multi-user.target
EOF

cat << EOF | sudo tee /etc/systemd/system/impala-admission.service
[Unit]
Description=Apache Impala Admission Control Daemon
Documentation=http://impala.apache.org

[Service]
EnvironmentFile=/opt/impala/conf/impala.env
ExecStart=/opt/impala/sbin/admissiond --flagfile=/opt/impala/conf/admission.gflagfile
TimeoutStopSec=5
Restart=on-failure
User=impala

[Install]
WantedBy=multi-user.target
EOF

cat << EOF | sudo tee /opt/impala/conf/impala.env
IMPALA_HOME=/opt/impala
JAVA_HOME=/usr/lib/jvm/java/
CLASSPATH=/opt/impala/lib/*:/opt/hive/lib/*
HADOOP_HOME=/opt/hadoop
HIVE_HOME=/opt/hive
HIVE_CONF=/opt/hive/conf
EOF

sudo chown impala: /opt/impala/conf/impala.env

cat << EOF | sudo tee /opt/impala/conf/impala.gflagfile
--abort_on_config_error=false
--log_dir=/var/log/impala
--state_store_host=kudu-master1.node.keedio.cloud
--catalog_service_host=kudu-master1.node.keedio.cloud
--admission_service_host=kudu-master1.node.keedio.cloud
--kudu_master_hosts=kudu-master1.node.keedio.cloud
--enable_legacy_avx_support=true
EOF

sudo chown impala: /opt/impala/conf/impala.gflagfile

cat << EOF | sudo tee /opt/impala/conf/catalog.gflagfile
--kudu_master_hosts=kudu-master1.node.keedio.cloud
--log_dir=/var/log/impala
--enable_legacy_avx_support=true
EOF

sudo chown impala: /opt/impala/conf/catalog.gflagfile

cat << EOF | sudo tee /opt/impala/conf/statestore.gflagfile
--kudu_master_hosts=kudu-master1.node.keedio.cloud
--log_dir=/var/log/impala
--enable_legacy_avx_support=true
EOF

sudo chown impala: /opt/impala/conf/statestore.gflagfile

cat << EOF | sudo tee /opt/impala/conf/admission.gflagfile
--kudu_master_hosts=kudu-master1.node.keedio.cloud
--log_dir=/var/log/impala
--enable_legacy_avx_support=true
EOF

sudo chown impala: /opt/impala/conf/admission.gflagfile

cat << EOF | sudo tee /etc/systemd/system/metastore.service 
[Unit]
Description=Apache Hive Metastore - HMS
Documentation=http://hive.apache.org

[Service]
Type=simple
Environment=JAVA_HOME=/usr/lib/jvm/java/
Environment=HADOOP_HOME=/opt/hadoop
ExecStart=/opt/hive/bin/hive --service metastore --hiveconf hive.root.logger=DEBUG,console
TimeoutStopSec=5
Restart=on-failure
User=hive

[Install]
WantedBy=multi-user.target
EOF

cat << EOF | sudo tee /opt/hive/conf/hive-site.xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://kudu-master1.node.keedio.cloud:9083</value>
    <description>Thrift URI for the remote metastore. 
      Used by metastore client to connect to remote metastore.
    </description>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/var/lib/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>

  <property>
    <name>metastore.task.threads.always</name>
    <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask</value>
  </property>

  <property>
    <name>metastore.expression.proxy</name>
    <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://kudu-master1.node.keedio.cloud:5432/metastore</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>postgres</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>postgres</value>
  </property> 

  <property>
    <name>hive.metastore.transactional.event.listeners</name>
    <description>
      Key integration point with Kudu; requires the hms-plugin.jar artifact
      from the Kudu build on the Hive classpath.
    </description>
    <value>org.apache.hive.hcatalog.listener.DbNotificationListener,org.apache.kudu.hive.metastore.KuduMetastorePlugin</value>
  </property> 

  <property>
    <name>hive.metastore.disallow.incompatible.col.type.changes</name>
    <value>false</value>
  </property>

  <property>
      <!-- Required for automatic metadata sync. -->
      <name>hive.metastore.dml.events</name>
      <value>true</value>
  </property>

  <property>
     <name>hive.metastore.event.db.notification.api.auth</name>
     <value>false</value>
     <description>
       Should metastore do authorization against database notification 
       related APIs such as get_next_notification.
       If set to true, then only the superusers in proxy settings have the permission
     </description>
  </property>
</configuration>
EOF

cat << EOF | sudo tee /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///var/lib/hive</value>
    <description>fake hdfs: the default filesystem is the local filesystem</description>
  </property>
</configuration>
EOF

cat << EOF | sudo tee /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/cluster/nn</value>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/cluster/1/dn,/cluster/2/dn</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
EOF
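
Since the "fake hdfs" above points the default filesystem at the local disk, the warehouse directory referenced in hive-site.xml has to exist and be writable by the daemons; a likely prerequisite on the nodes running the Metastore and Impala (paths taken from the configuration above):

sudo mkdir -p /var/lib/hive/warehouse
sudo chown -R hive: /var/lib/hive
sudo chmod -R 775 /var/lib/hive

Depending on which user ends up writing table data, the impala user may also need write access here (for example by adding it to the hive group).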

Enabling Kudu Sync with Metastore

Kudu has an optional feature which allows it to integrate its own catalog with the Hive Metastore (HMS). The HMS is the de-facto standard catalog and metadata provider in the Hadoop ecosystem. When the HMS integration is enabled, Kudu tables can be discovered and used by external HMS-aware tools (in our case the Impala Server).

Kudu has tight integration with Apache Impala, allowing you to use Impala to insert, query, update, and delete data from Kudu tablets using Impala’s SQL syntax, as an alternative to using the Kudu APIs to build a custom Kudu application. In addition, you can use JDBC or ODBC to connect existing or new applications written in any language, framework, or business intelligence tool to your Kudu data, using Impala as the broker.

To enable the integration we append the --hive_metastore_uris flag to the Kudu master's gflagfile and restart the service; the resulting file looks like this:

cat /opt/kudu/conf/master.gflagfile
--webserver_doc_root=/opt/kudu/usr/local/www
--log_dir=/var/log/kudu
--fs_wal_dir=/var/lib/kudu/wal
--fs_data_dirs=/var/lib/kudu/data
--rpc_encryption=optional
--rpc_authentication=optional
--rpc_negotiation_timeout_ms=5000
--hive_metastore_uris=thrift://kudu-master1.node.keedio.cloud:9083

sudo systemctl restart kudu-master
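
A simple end-to-end check of the integration (the table name below is just an example): create a Kudu table through Impala and verify it is visible from the Kudu side as well.

/opt/impala/bin/shell/impala-shell -i <worker-node> \
  -q "CREATE TABLE default.kudu_smoke_test (id BIGINT, name STRING, PRIMARY KEY (id)) PARTITION BY HASH (id) PARTITIONS 4 STORED AS KUDU"

/opt/impala/bin/shell/impala-shell -i <worker-node> \
  -q "INSERT INTO default.kudu_smoke_test VALUES (1, 'hello')"

# The table should now also be listed by the Kudu CLI:
/opt/kudu/usr/local/bin/kudu table list kudu-master1.node.keedio.cloud:7051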

Final Architecture