Setting Up a Spark Streaming Development Environment

April 14, 2022

This post builds a development environment with the minimum configuration needed for Spark Streaming development.
All components (Kafka + Spark + Hadoop) are installed on a single server.
Performance and security concerns are not addressed.

Because of a JDK version difference, two separate user accounts are created (the kafka account uses Java 11, while the spark account uses Java 8).

Installing the JDK

sudo apt-get update
sudo apt-get install openjdk-8-jdk
sudo apt-get install openjdk-11-jdk

# Select Java 8
sudo update-alternatives --config java
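
To confirm which JDK the shell currently resolves, a quick check looks like this (the exact alternative paths vary by distribution):

java -version
update-alternatives --display java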

Installing Kafka

sudo adduser kafka
sudo su - kafka
wget https://dlcdn.apache.org/kafka/2.8.1/kafka_2.13-2.8.1.tgz
tar xvfz kafka_2.13-2.8.1.tgz
mkdir kafka_2.13-2.8.1/logs
mkdir kafka_2.13-2.8.1/data
vi .bashrc
......
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
......

Installing Spark and Hadoop

sudo adduser spark
sudo su - spark
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
tar xvfz hadoop-3.2.3.tar.gz

wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar xvfz spark-3.2.1-bin-hadoop3.2.tgz
vi ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/spark/hadoop-3.2.3/
export SPARK_HOME=/home/spark/spark-3.2.1-bin-hadoop3.2/
export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_HOME/bin/:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
exit
sudo su - spark

mkdir -p /home/spark/hadoop/hadoopdata/hdfs/namenode
mkdir -p /home/spark/hadoop/hadoopdata/hdfs/datanode
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
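
Passwordless SSH to localhost is required because start-dfs.sh and start-yarn.sh launch the daemons over SSH; a quick check (assuming sshd is running on the default port) is:

ssh -o StrictHostKeyChecking=accept-new localhost 'echo "passwordless ssh OK"'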

vi $HADOOP_HOME/etc/hadoop/core-site.xml
......
## Add the following between <configuration> and </configuration>
<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>
......
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
......
## Add the following between <configuration> and </configuration>
<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.namenode.secondary.http-address</name>
   <value>localhost:9868</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/home/spark/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/home/spark/hadoop/hadoopdata/hdfs/datanode</value>
</property>
......
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
......
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
......
hdfs namenode -format
start-dfs.sh
start-yarn.sh

hadoop fs -mkdir -p /tmp
hadoop fs -chmod 777 /tmp
jps
9664 ResourceManager
9824 NodeManager
10130 Jps
9060 NameNode
9466 SecondaryNameNode
9247 DataNode
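
If the daemons above appear in the jps output, HDFS and YARN status can also be checked with the commands below (the report contents will differ by machine):

hdfs dfsadmin -report
yarn node -list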

Startup Commands

sudo su - kafka
kafka_2.13-2.8.1/bin/zookeeper-server-start.sh kafka_2.13-2.8.1/config/zookeeper.properties &
kafka_2.13-2.8.1/bin/kafka-server-start.sh kafka_2.13-2.8.1/config/server.properties &
exit

sudo su - spark
start-dfs.sh
start-yarn.sh
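
With HDFS and YARN running, a simple way to confirm that Spark jobs can run on YARN is to submit the bundled SparkPi example (the examples jar name below assumes the Spark 3.2.1 binary distribution; adjust it for other versions):

spark-submit --master yarn --deploy-mode client \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.1.jar 10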

Shutdown Commands

sudo su - spark
stop-yarn.sh
stop-dfs.sh
exit

sudo su - kafka
kafka_2.13-2.8.1/bin/kafka-server-stop.sh
kafka_2.13-2.8.1/bin/zookeeper-server-stop.sh

Additional Kafka Configuration

vi kafka_2.13-2.8.1/config/zookeeper.properties
......
dataDir=/home/kafka/kafka_2.13-2.8.1/data
......

Allowing remote access to Kafka, disabling automatic topic creation, enabling topic deletion

If you are running on an AWS EC2 instance, the security group must allow
inbound connections on port 9092 from the instance's public IP.

Allowing access by specifying a security group as the traffic source only
permits traffic based on the private IP addresses of the resources associated
with that security group, so it does not cover connections made to the public IP.

sudo su - kafka
vi kafka_2.13-2.8.1/config/server.properties
......
advertised.listeners=PLAINTEXT://54.180.XXX.XXX:9092
auto.create.topics.enable=false
delete.topic.enable=true
log.dirs=/home/kafka/kafka_2.13-2.8.1/logs
......
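
After editing server.properties, restart the broker so the new settings take effect. Remote connectivity can then be checked roughly as follows (replace 54.180.XXX.XXX with the actual public IP):

kafka_2.13-2.8.1/bin/kafka-broker-api-versions.sh --bootstrap-server 54.180.XXX.XXX:9092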

Creating Kafka Topics

kafka_2.13-2.8.1/bin/kafka-topics.sh --create \
    --zookeeper localhost:2181 \
    --replication-factor 1 \
    --partitions 20 \
    --topic test

kafka_2.13-2.8.1/bin/kafka-topics.sh --describe \
    --zookeeper localhost:2181 \
    --topic test

kafka_2.13-2.8.1/bin/kafka-topics.sh --list \
    --zookeeper localhost:2181

kafka_2.13-2.8.1/bin/kafka-topics.sh --alter \
    --zookeeper localhost:2181 \
    --topic test \
    --partitions 40

kafka_2.13-2.8.1/bin/kafka-topics.sh --delete \
    --zookeeper localhost:2181 \
    --topic test
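
Once a topic exists, a quick end-to-end check can be done with the console producer and consumer (run each in its own terminal; lines typed into the producer should appear in the consumer):

kafka_2.13-2.8.1/bin/kafka-console-producer.sh \
    --bootstrap-server localhost:9092 \
    --topic test

kafka_2.13-2.8.1/bin/kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic test \
    --from-beginning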

Additional Spark Configuration

Log Level

sudo su - spark

cp spark-3.2.1-bin-hadoop3.2/conf/log4j.properties.template spark-3.2.1-bin-hadoop3.2/conf/log4j.properties
vi spark-3.2.1-bin-hadoop3.2/conf/log4j.properties
......
log4j.rootCategory=WARN, console
......

Additional Hadoop Configuration

Grant the spark account sudo privileges (so that sudo runs without a password).

# Select vim as the editor
sudo update-alternatives --config editor

sudo visudo
......
spark ALL=(ALL) NOPASSWD:ALL
......
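
A quick way to confirm that the NOPASSWD rule took effect (run as the spark user):

sudo -n true && echo "passwordless sudo OK"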

Installing Hive

First, install MariaDB and create a database account for the Hive metastore, as sketched below.
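
A minimal sketch of that step, assuming the database name (db_test) and credentials (testuser / test1234) that hive-site.xml references later; run it from an account with sudo privileges:

sudo apt-get install mariadb-server
sudo mysql -u root
MariaDB [(none)]> CREATE DATABASE db_test;
MariaDB [(none)]> CREATE USER 'testuser'@'localhost' IDENTIFIED BY 'test1234';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON db_test.* TO 'testuser'@'localhost';
MariaDB [(none)]> FLUSH PRIVILEGES;
MariaDB [(none)]> exit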

sudo su - spark

wget https://dlcdn.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar xvfz apache-hive-3.1.2-bin.tar.gz
vi ~/.bashrc
......
export HIVE_HOME=/home/spark/apache-hive-3.1.2-bin/
export PATH=$HIVE_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

exit
sudo su - spark
cp $HIVE_HOME/conf/hive-env.sh.template $HIVE_HOME/conf/hive-env.sh

vi $HIVE_HOME/conf/hive-env.sh
......
HADOOP_HOME=/home/spark/hadoop-3.2.3
......
cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml

vi $HIVE_HOME/conf/hive-site.xml
......
## Set the following properties (they already exist in the copied template; replace their values)
<configuration>

  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
    <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/${user.name}</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/tmp/${user.name}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>
  <property>
    <name>hive.scratch.dir.permission</name>
    <value>733</value>
    <description>The permission for the user specific scratch directories that get created.</description>
  </property>
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Whether to include the current database in the Hive prompt.</description>
  </property>

</configuration>
hadoop fs -mkdir -p /user/hive/warehouse
# hadoop fs -chown -R hive /user/hive
# bug fix: Hive 3.1.2 bundles guava-19.0, which conflicts with Hadoop 3.2's guava and causes a NoSuchMethodError; replace it with Hadoop's copy
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

hive

hive> exit;

Creating the Metastore Schema (Derby)

If an error related to SessionHiveMetaStoreClient occurs, uncomment the rm
commands below to remove the existing Derby metastore and run the initialization again.

# rm derby.log
# rm -rf metastore_db/
schematool -initSchema -dbType derby
hive
......
hive (default)> show tables;
OK
Time taken: 0.857 seconds
......
hive (default)> CREATE TABLE T1 (ID STRING);
hive (default)> INSERT INTO T1(ID) VALUES('aaaaa');
hive (default)> SELECT * FROM T1;

Configuring MariaDB

Download the JDBC driver for MySQL (not the MariaDB driver; MySQL Connector/J also works with MariaDB).

wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-5.1.49.zip
unzip mysql-connector-java-5.1.49.zip
cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49.jar $HIVE_HOME/lib/
vi $HIVE_HOME/conf/hive-site.xml
......
## Set the following properties (replacing the Derby defaults already in hive-site.xml)
<configuration>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/db_test?createDatabaseIfNotExist=true</value>
  <description>JDBC connection string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>testuser</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>test1234</value>
  <description>password to use against metastore database</description>
</property>

</configuration>
schematool -initSchema -dbType mysql

hive
......
hive (default)> show tables;
OK
Time taken: 0.665 seconds

hive (default)> CREATE TABLE T1 (ID STRING);
hive (default)> INSERT INTO T1(ID) VALUES('aaaaa');
hive (default)> SELECT * FROM T1;
hive (default)> DROP TABLE T1;
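
As a final check, the metastore tables created by schematool should be visible in MariaDB, and the Hive warehouse directory should exist in HDFS (a rough verification using the testuser credentials above):

mysql -u testuser -ptest1234 db_test -e "SHOW TABLES;"
hadoop fs -ls /user/hive/warehouse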
