I've recently been migrating some batch applications to Spark Streaming, starting with a deployment on standalone mode. Below are the issues I ran into, recorded for future reference.
Environment: Spark 2.2.1, with scripts in Python 3 (running in a pyenv virtualenv, set up following https://ask.hellobi.com/blog/seng/3084).
1. OutOfMemoryError
Running examples/src/main/python/pi.py 10000 fails with:
java.lang.OutOfMemoryError: Java heap space
Per http://www.cnblogs.com/wrencai/p/4231934.html, when no driver memory is specified explicitly, the default allocation is small (512 MB in the older release that post describes; Spark 2.x defaults to 1 GB), which is not enough here. The fix is to configure spark-env.sh:
# SPARK_EXECUTOR_MEMORY < SPARK_DRIVER_MEMORY < per-NodeManager memory on a YARN cluster
export SPARK_EXECUTOR_MEMORY=8g
export SPARK_DRIVER_MEMORY=16g
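The same limits can also be set per job on the spark-submit command line rather than globally in spark-env.sh; a minimal sketch (the master URL and memory sizes are illustrative for this setup):
$SPARKBASE/spark-submit \
  --master spark://master:7077 \
  --driver-memory 16g \
  --executor-memory 8g \
  examples/src/main/python/pi.py 10000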
2. Mismatched Python versions on master and slaves
Exception: Python in worker has different version 2.7 than that in driver 3.6,
PySpark cannot run with different minor versions.
Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Fix: add activation of the Python environment to ~/.bashrc on every node:
pyenv deactivate
pyenv activate env365
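After updating ~/.bashrc, it's worth confirming that every node now reports the same interpreter; a quick check (hostnames are illustrative, and bash -i forces an interactive shell so ~/.bashrc is sourced):
for h in master slave1 slave2; do ssh "$h" "bash -ic 'python -V'"; done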
I haven't yet worked out how to pin a specific Python version at spark-submit time; for the virtualenv approach, see the Hortonworks article:
https://community.hortonworks.com/articles/104949/using-virtualenv-with-pyspark-1.html
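As a likely starting point, the two environment variables named in the error message can be exported in conf/spark-env.sh (or passed per job via the spark.pyspark.python / spark.pyspark.driver.python properties added in Spark 2.1). I haven't verified this in my setup, and the pyenv interpreter path below is an assumption:
# assumed pyenv path; point both driver and workers at the same Python 3
export PYSPARK_PYTHON=$HOME/.pyenv/versions/env365/bin/python
export PYSPARK_DRIVER_PYTHON=$HOME/.pyenv/versions/env365/bin/python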
3. Python applications cannot use cluster deploy mode on standalone
Error message (from $SPARKBASE/spark-submit --deploy-mode cluster):
Error: Cluster deploy mode is currently not supported for python applications on standalone clusters.
Reference: http://spark.apache.org/docs/2.2.1/submitting-applications.html
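Per the linked docs, Python applications on a standalone cluster can only run in client deploy mode, so the workaround is simply to drop the flag (or pass --deploy-mode client explicitly); the master URL below is illustrative:
$SPARKBASE/spark-submit \
  --master spark://master:7077 \
  --deploy-mode client \
  examples/src/main/python/pi.py 10000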