Spark3
Apache Spark is an open-source big data processing framework designed to process large-scale datasets quickly. It provides distributed computing capabilities and supports both batch and stream processing. Spark offers a rich set of APIs across multiple programming languages (Java, Scala, Python, R) and runs on different cluster managers (such as Hadoop YARN and Kubernetes). Through in-memory computing and a highly optimized execution engine, Spark significantly speeds up data processing and is widely used for data analytics, machine learning, and graph computing.
Basic Configuration
Download the package
wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
Extract the package
tar -zxvf spark-3.5.4-bin-hadoop3.tgz -C /usr/local/software/
ln -s /usr/local/software/spark-3.5.4-bin-hadoop3 /usr/local/software/spark
Configure environment variables
cat >> ~/.bash_profile <<"EOF"
## SPARK_HOME
export SPARK_HOME=/usr/local/software/spark
export PATH=$PATH:$SPARK_HOME/bin
EOF
source ~/.bash_profile
Check the version
spark-shell --version
Spark Standalone (High-Availability Cluster)
Spark ships with a standalone cluster manager that lets you deploy Spark applications without depending on any external resource manager. You build the cluster by starting and configuring standalone Master and Worker nodes.
This mode is intended for development environments only; in production, run Spark on YARN instead.
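For reference, submitting the bundled SparkPi example on YARN looks roughly like the sketch below (a minimal sketch, assuming the HADOOP_CONF_DIR exported later in spark-env.sh points at a running YARN cluster; resource settings come from spark-defaults.conf):
# Hypothetical YARN submission for comparison with the standalone mode used below
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.4.jar 1000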
This document uses the following three servers; the Description column lists the processes that run on each node.
| IP Address | Hostname | Description |
|---|---|---|
| 192.168.1.131 | bigdata01 | Master Worker HistoryServer |
| 192.168.1.132 | bigdata02 | Master Worker HistoryServer |
| 192.168.1.133 | bigdata03 | Master Worker HistoryServer |
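The configuration files below are copied between nodes over SSH, so every node should resolve these hostnames and accept passwordless logins. A quick sanity check from bigdata01 (a sketch; assumes the hostnames are in /etc/hosts or DNS and SSH keys are already distributed):
# Each iteration should print the node's IP and hostname without a password prompt
for host in bigdata01 bigdata02 bigdata03; do
  getent hosts $host
  ssh -o BatchMode=yes $host hostname
done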
Cluster Configuration
Edit the configuration files on bigdata01, then distribute them to the other nodes.
Configure spark-env.sh
cat >> $SPARK_HOME/conf/spark-env.sh <<"EOF"
export JAVA_HOME=/usr/local/software/jdk8
export HADOOP_HOME=/usr/local/software/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export SPARK_MASTER_HOST=bigdata01
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=8081
export SPARK_DAEMON_MEMORY=4g
EOF
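The Master and Worker ports configured above must be free on every node before the first start. One way to check (a sketch, assuming the ss utility from iproute2 is installed):
# Prints any existing listeners on the Spark ports; 'ports are free' means none
ss -tln | grep -E ':(7077|7078|8080|8081)\b' || echo 'ports are free'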
Configure workers
cat > $SPARK_HOME/conf/workers <<EOF
bigdata01
bigdata02
bigdata03
EOF
Configure spark-defaults.conf
cat >> $SPARK_HOME/conf/spark-defaults.conf <<EOF
## Spark Config
spark.eventLog.enabled true
spark.eventLog.dir hdfs://atengcluster/tmp/logs/spark
spark.eventLog.rolling.enabled true
spark.eventLog.rolling.maxFileSize 128m
spark.history.ui.port 18080
spark.history.retainedApplications 50
spark.history.fs.logDirectory hdfs://atengcluster/tmp/logs/spark
spark.driver.cores 1
spark.driver.memory 2g
spark.driver.memoryOverhead 2g
spark.executor.instances 3
spark.executor.cores 1
spark.executor.memory 4g
spark.executor.memoryOverhead 4g
spark.task.maxFailures 8
spark.sql.shuffle.partitions 8
spark.default.parallelism 8
# HA
spark.deploy.recoveryMode ZOOKEEPER
spark.deploy.zookeeper.url bigdata01:2181,bigdata02:2181,bigdata03:2181
spark.deploy.zookeeper.dir /spark
EOF
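The HA settings above rely on the ZooKeeper ensemble, which must be running before the Masters start. A quick reachability check (a sketch; assumes nc is installed and ZooKeeper's srvr four-letter command is whitelisted):
# Each node should answer with its ZooKeeper server stats
for host in bigdata01 bigdata02 bigdata03; do
  echo srvr | nc $host 2181
done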
Distribute the configuration files
scp $SPARK_HOME/conf/{spark-env.sh,workers,spark-defaults.conf} bigdata02:$SPARK_HOME/conf/
scp $SPARK_HOME/conf/{spark-env.sh,workers,spark-defaults.conf} bigdata03:$SPARK_HOME/conf/
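Equivalently, as a loop over the remaining nodes:
for host in bigdata02 bigdata03; do
  scp $SPARK_HOME/conf/{spark-env.sh,workers,spark-defaults.conf} $host:$SPARK_HOME/conf/
done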
Update the Master host on the other nodes
Change SPARK_MASTER_HOST to the local node's own hostname.
[admin@bigdata02 ~]$ vi $SPARK_HOME/conf/spark-env.sh
export SPARK_MASTER_HOST=bigdata02
[admin@bigdata03 ~]$ vi $SPARK_HOME/conf/spark-env.sh
export SPARK_MASTER_HOST=bigdata03
Create the log directory
hadoop fs -mkdir -p /tmp/logs/spark
Start the Cluster
Start the services
bigdata01: Master Worker
bigdata02: Master Worker
bigdata03: Master Worker
Master Web: http://bigdata01:8080
[admin@bigdata01 ~]$ $SPARK_HOME/sbin/start-all.sh
[admin@bigdata02 ~]$ $SPARK_HOME/sbin/start-master.sh
[admin@bigdata03 ~]$ $SPARK_HOME/sbin/start-master.sh
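After startup, each node should be running the expected daemons, and exactly one Master should report ALIVE while the other two report STANDBY. A quick check (a sketch; the Master web UI serves a JSON summary under /json/, though the exact field formatting may vary by version):
# Expect Master and Worker JVMs in the jps listing on each node
jps
# Query each Master's recovery state (ALIVE or STANDBY)
for host in bigdata01 bigdata02 bigdata03; do
  curl -s http://$host:8080/json/ | grep -o '"status"[^,}]*'
done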
Start the HistoryServer service
bigdata01: HistoryServer
bigdata02: HistoryServer
bigdata03: HistoryServer
HistoryServer Web: http://bigdata01:18080
[admin@bigdata01 ~]$ $SPARK_HOME/sbin/start-history-server.sh
[admin@bigdata02 ~]$ $SPARK_HOME/sbin/start-history-server.sh
[admin@bigdata03 ~]$ $SPARK_HOME/sbin/start-history-server.sh
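Once the HistoryServer is up, completed applications can also be listed through its REST API (the standard /api/v1 monitoring endpoint):
# Returns a JSON array; it stays empty until applications have finished
curl -s http://bigdata01:18080/api/v1/applications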
Stop the services
[admin@bigdata01 ~]$ $SPARK_HOME/sbin/stop-history-server.sh
[admin@bigdata02 ~]$ $SPARK_HOME/sbin/stop-history-server.sh
[admin@bigdata03 ~]$ $SPARK_HOME/sbin/stop-history-server.sh
[admin@bigdata01 ~]$ $SPARK_HOME/sbin/stop-all.sh
[admin@bigdata02 ~]$ $SPARK_HOME/sbin/stop-master.sh
[admin@bigdata03 ~]$ $SPARK_HOME/sbin/stop-master.sh
Configure the services to start on boot
Spark Master
Set up the Master service on bigdata01, bigdata02, and bigdata03.
Create the systemd unit file
sudo tee /etc/systemd/system/spark-master.service <<"EOF"
[Unit]
Description=Spark Master
Documentation=https://spark.apache.org
After=network.target
[Service]
Type=forking
Environment="SPARK_HOME=/usr/local/software/spark"
ExecStart=/usr/local/software/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.master.Master 1
ExecStop=/usr/local/software/spark/sbin/spark-daemon.sh stop org.apache.spark.deploy.master.Master 1
Restart=on-failure
RestartSec=30
TimeoutStartSec=120
TimeoutStopSec=180
StartLimitIntervalSec=600
StartLimitBurst=3
KillMode=control-group
KillSignal=SIGTERM
SuccessExitStatus=143
User=admin
Group=ateng
[Install]
WantedBy=multi-user.target
EOF
Start the service
sudo systemctl daemon-reload
sudo systemctl enable spark-master.service
sudo systemctl start spark-master.service
sudo systemctl status spark-master.service
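If the unit fails to start, the daemon's own logs are under $SPARK_HOME/logs, and the unit's journal can be followed with journalctl:
# Follow the unit's journal to diagnose startup failures
sudo journalctl -u spark-master.service -f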
Spark Worker
Set up the Worker service on bigdata01, bigdata02, and bigdata03.
Create the systemd unit file
sudo tee /etc/systemd/system/spark-worker.service <<"EOF"
[Unit]
Description=Spark Worker
Documentation=https://spark.apache.org
After=network.target
[Service]
Type=forking
Environment="SPARK_HOME=/usr/local/software/spark"
ExecStart=/usr/local/software/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.worker.Worker 1 spark://bigdata01:7077,bigdata02:7077,bigdata03:7077
ExecStop=/usr/local/software/spark/sbin/spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1
Restart=on-failure
RestartSec=30
TimeoutStartSec=120
TimeoutStopSec=180
StartLimitIntervalSec=600
StartLimitBurst=3
KillMode=control-group
KillSignal=SIGTERM
SuccessExitStatus=143
User=admin
Group=ateng
[Install]
WantedBy=multi-user.target
EOF
Start the service
sudo systemctl daemon-reload
sudo systemctl enable spark-worker.service
sudo systemctl start spark-worker.service
sudo systemctl status spark-worker.service
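With the Master and Worker units active on all three nodes, each node should be running both JVMs. A quick check over SSH (a sketch; assumes passwordless SSH and that jps is on the PATH of non-interactive shells):
for host in bigdata01 bigdata02 bigdata03; do
  ssh $host jps | grep -E 'Master|Worker'
done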
Spark HistoryServer
Set up the HistoryServer service on bigdata01, bigdata02, and bigdata03.
Create the systemd unit file
sudo tee /etc/systemd/system/spark-history-server.service <<"EOF"
[Unit]
Description=Spark HistoryServer
Documentation=https://spark.apache.org
After=network.target
[Service]
Type=forking
Environment="SPARK_HOME=/usr/local/software/spark"
ExecStart=/usr/local/software/spark/sbin/spark-daemon.sh start org.apache.spark.deploy.history.HistoryServer 1
ExecStop=/usr/local/software/spark/sbin/spark-daemon.sh stop org.apache.spark.deploy.history.HistoryServer 1
Restart=on-failure
RestartSec=30
TimeoutStartSec=120
TimeoutStopSec=180
StartLimitIntervalSec=600
StartLimitBurst=3
KillMode=control-group
KillSignal=SIGTERM
SuccessExitStatus=143
User=admin
Group=ateng
[Install]
WantedBy=multi-user.target
EOF
Start the service
sudo systemctl daemon-reload
sudo systemctl enable spark-history-server.service
sudo systemctl start spark-history-server.service
sudo systemctl status spark-history-server.service
Using the Services
spark-submit
Submit a job to the Spark Standalone cluster
spark-submit \
--master spark://bigdata01:7077,bigdata02:7077,bigdata03:7077 \
--deploy-mode cluster \
--total-executor-cores 2 \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.5.4.jar 1000
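With --deploy-mode cluster the driver runs on one of the Workers, so the job's output appears in the driver logs on that node (reachable through the web UI). To see the result directly in the terminal, the same job can be submitted in client mode:
spark-submit \
  --master spark://bigdata01:7077,bigdata02:7077,bigdata03:7077 \
  --deploy-mode client \
  --total-executor-cores 2 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.4.jar 1000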
spark-sql
Connect to the Spark Standalone cluster with Spark SQL
Spark on Hive persistent storage is not configured here, so the in-memory catalog is used; by default this creates a spark-warehouse directory under the current working directory.
spark-sql \
--conf spark.sql.catalogImplementation=in-memory \
--conf spark.sql.legacy.createHiveTableByDefault=false \
--master spark://bigdata01:7077,bigdata02:7077,bigdata03:7077 \
--total-executor-cores 2
Create a table
CREATE TABLE my_table_spark (
id INT,
name STRING
);
Insert data
INSERT INTO my_table_spark VALUES
(1, 'John'),
(2, 'Jane'),
(3, 'Bob'),
(4, 'Alice');
Query data
SELECT * FROM my_table_spark;
SELECT count(*) FROM my_table_spark;
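Statements can also be run non-interactively with -e, which is convenient for scripting. Note that the in-memory catalog does not persist across sessions, so a new invocation will not see my_table_spark (a sketch using a simple smoke-test query):
spark-sql \
  --conf spark.sql.catalogImplementation=in-memory \
  --master spark://bigdata01:7077,bigdata02:7077,bigdata03:7077 \
  --total-executor-cores 2 \
  -e 'SHOW TABLES;'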
Clean up the directory
rm -rf spark-warehouse/