Setting Up a Spark Streaming Development Environment
This guide builds a development environment with the minimum configuration needed for Spark Streaming development.
All components (Kafka + Spark + Hadoop) are installed on a single server.
Performance, security, and similar concerns are not addressed.
Because of JDK version differences, two separate accounts are created: the kafka account runs on JDK 11 and the spark account on JDK 8.
Installing the JDK
sudo apt-get update
sudo apt-get install openjdk-8-jdk
sudo apt-get install openjdk-11-jdk
# select java 8
sudo update-alternatives --config java
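To check which JDK the shell now uses (optional):
java -version
# openjdk version "1.8.0_..." means Java 8 is active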
Installing Kafka
sudo adduser kafka
sudo su - kafka
wget https://dlcdn.apache.org/kafka/2.8.1/kafka_2.13-2.8.1.tgz
tar xvfz kafka_2.13-2.8.1.tgz
mkdir kafka_2.13-2.8.1/logs
mkdir kafka_2.13-2.8.1/data
vi .bashrc
......
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
......
Installing Spark and Hadoop
sudo adduser spark
sudo su - spark
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
tar xvfz hadoop-3.2.3.tar.gz
wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar xvfz spark-3.2.1-bin-hadoop3.2.tgz
vi ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/spark/hadoop-3.2.3/
export SPARK_HOME=/home/spark/spark-3.2.1-bin-hadoop3.2/
export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_HOME/bin/:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
exit
sudo su - spark
mkdir -p /home/spark/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/spark/hadoop/hadoopdata/hdfs/datanode
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
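A couple of optional checks (a minimal sketch; the StrictHostKeyChecking option assumes a reasonably recent OpenSSH) confirm that the PATH entries from .bashrc and passwordless SSH are working:
hadoop version
spark-submit --version
# should log in and exit without prompting for a password
ssh -o StrictHostKeyChecking=accept-new localhost exit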
vi $HADOOP_HOME/etc/hadoop/core-site.xml
......
## enter the following between <configuration> and </configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
......
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
......
## enter the following between <configuration> and </configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>localhost:9868</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/spark/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/spark/hadoop/hadoopdata/hdfs/datanode</value>
</property>
......
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
......
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
......
hdfs namenode -format
start-dfs.sh
start-yarn.sh
hadoop fs -mkdir -p /tmp
hadoop fs -chmod 777 /tmp
jps
9664 ResourceManager
9824 NodeManager
10130 Jps
9060 NameNode
9466 SecondaryNameNode
9247 DataNode
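As a quick sanity check that HDFS is writable (a minimal sketch; the file name is arbitrary):
echo "hello hdfs" | hadoop fs -put - /tmp/smoke-test.txt
hadoop fs -cat /tmp/smoke-test.txt
hadoop fs -rm /tmp/smoke-test.txt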
Startup Commands
sudo su - kafka
kafka_2.13-2.8.1/bin/zookeeper-server-start.sh kafka_2.13-2.8.1/config/zookeeper.properties &
kafka_2.13-2.8.1/bin/kafka-server-start.sh kafka_2.13-2.8.1/config/server.properties &
exit
sudo su - spark
start-dfs.sh
start-yarn.sh
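To check that the broker is actually accepting connections (run from the kafka account; a minimal check):
kafka_2.13-2.8.1/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092
# the broker prints the API versions it supports if it is reachable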
Shutdown Commands
sudo su - spark
stop-yarn.sh
stop-dfs.sh
exit
sudo su - kafka
kafka_2.13-2.8.1/bin/kafka-server-stop.sh
kafka_2.13-2.8.1/bin/zookeeper-server-stop.sh
Additional Kafka Configuration
vi kafka_2.13-2.8.1/config/zookeeper.properties
......
dataDir=/home/kafka/kafka_2.13-2.8.1/data
......
Allow remote connections to Kafka, disable automatic topic creation, and enable topic deletion.
On an AWS EC2 instance, the security group must allow inbound connections from the instance's public IP.
Note that when an AWS security group is referenced as the source of a rule, traffic is allowed based on the private IP addresses of the resources associated with that security group.
sudo su - kafka
vi kafka_2.13-2.8.1/config/server.properties
......
advertised.listeners=PLAINTEXT://54.180.XXX.XXX:9092
auto.create.topics.enable=false
delete.topic.enable=true
log.dirs=/home/kafka/kafka_2.13-2.8.1/logs
......
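The broker reads server.properties only at startup, so restart Kafka after changing these values. To confirm remote access from a client machine (a minimal check, assuming Kafka's command-line tools are available there; the advertised address is kept as configured above):
bin/kafka-topics.sh --list --bootstrap-server 54.180.XXX.XXX:9092
# a topic list returned without a timeout means the broker is reachable from outside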
Creating and Managing Kafka Topics
# with a single broker, the replication factor cannot be greater than 1
kafka_2.13-2.8.1/bin/kafka-topics.sh --create \
--zookeeper localhost:2181 \
--replication-factor 1 \
--partitions 20 \
--topic test
kafka_2.13-2.8.1/bin/kafka-topics.sh --describe \
--zookeeper localhost:2181 \
--topic test
kafka_2.13-2.8.1/bin/kafka-topics.sh --list \
--zookeeper localhost:2181
kafka_2.13-2.8.1/bin/kafka-topics.sh --alter \
--zookeeper localhost:2181 \
--topic test \
--partitions 40
kafka_2.13-2.8.1/bin/kafka-topics.sh --delete \
--zookeeper localhost:2181 \
--topic test
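While the test topic still exists (i.e., before the --delete command above is run), the console clients bundled with Kafka can confirm produce/consume end to end; a minimal sketch:
kafka_2.13-2.8.1/bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test
# type a few messages, then press Ctrl-C
kafka_2.13-2.8.1/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
# the messages typed above should appear; press Ctrl-C to stop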
Additional Spark Configuration
Log Level
sudo su - spark
cp spark-3.2.1-bin-hadoop3.2/conf/log4j.properties.template spark-3.2.1-bin-hadoop3.2/conf/log4j.properties
vi spark-3.2.1-bin-hadoop3.2/conf/log4j.properties
......
log4j.rootCategory=WARN, console
......
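To confirm that spark-shell can reach YARN and HDFS, a short interactive check can be run from the spark account. This is a minimal sketch: HADOOP_CONF_DIR is not exported in the .bashrc above but is required for --master yarn, and the job itself is arbitrary.
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
spark-shell --master yarn
scala> spark.range(100).count()
scala> :quit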
Additional Hadoop Configuration
Grant the spark account sudo privileges (so commands can run without a password).
# select vim as the editor
sudo update-alternatives --config editor
sudo visudo
......
spark ALL=(ALL) NOPASSWD:ALL
......
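To confirm passwordless sudo for the spark account (optional):
sudo -n whoami
# should print root without asking for a password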
Installing Hive
First, install MariaDB and create a database account.
sudo su - spark
wget https://dlcdn.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar xvfz apache-hive-3.1.2-bin.tar.gz
vi ~/.bashrc
......
export HIVE_HOME=/home/spark/apache-hive-3.1.2-bin/
export PATH=$HIVE_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
exit
sudo su - spark
cp $HIVE_HOME/conf/hive-env.sh.template $HIVE_HOME/conf/hive-env.sh
vi $HIVE_HOME/conf/hive-env.sh
......
HADOOP_HOME=$HADOOP_HOME
......
cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml
vi $HIVE_HOME/conf/hive-site.xml
......
## add/update the following properties (inside the <configuration> element)
<configuration>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive-${user.name}</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/${user.name}</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/${user.name}_resources</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.scratch.dir.permission</name>
<value>733</value>
<description>The permission for the user specific scratch directories that get created.</description>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
<description>Whether to include the current database in the Hive prompt.</description>
</property>
</configuration>
hadoop fs -mkdir -p /user/hive/warehouse
# hadoop fs -chown -R hive /user/hive
# bug fix: replace Hive's bundled guava 19 with the newer guava shipped with Hadoop (avoids a guava NoSuchMethodError when starting Hive)
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
hive
hive> exit;
Creating the Schema (Derby)
If an error related to SessionHiveMetaStoreClient occurs,
remove the local Derby files (the commented commands below) and run the schema initialization again.
# rm derby.log
# rm -rf metastore_db/
schematool -initSchema -dbType derby
hive
......
hive (default)> show tables;
OK
Time taken: 0.857 seconds
......
hive (default)> CREATE TABLE T1 (ID STRING);
hive (default)> INSERT INTO T1(ID) VALUES('aaaaa');
hive (default)> SELECT * FROM T1;
Configuring MariaDB
Download the JDBC driver for MySQL (not the MariaDB driver).
wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-5.1.49.zip
unzip mysql-connector-java-5.1.49.zip
cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49.jar $HIVE_HOME/lib/
vi $HIVE_HOME/conf/hive-site.xml
......
## add/update the following properties (inside the <configuration> element)
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/db_test?createDatabaseIfNotExist=true</value>
<description>JDBC connection string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>testuser</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>test1234</value>
<description>password to use against metastore database</description>
</property>
</configuration>
schematool -initSchema -dbType mysql
hive
......
hive (default)> show tables;
OK
Time taken: 0.665 seconds
hive (default)> CREATE TABLE T1 (ID STRING);
hive (default)> INSERT INTO T1(ID) VALUES('aaaaa');
hive (default)> SELECT * FROM T1;
hive (default)> DROP TABLE T1;
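To confirm that the metastore schema ended up in MariaDB rather than Derby (a minimal check, using the testuser/db_test values configured above):
mysql -u testuser -p -e "SHOW TABLES" db_test
# metastore tables such as DBS and TBLS should be listed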