diff --git a/run-Spark-on-SeaweedFS.md b/run-Spark-on-SeaweedFS.md
new file mode 100644
index 0000000..4598409
--- /dev/null
+++ b/run-Spark-on-SeaweedFS.md
@@ -0,0 +1,33 @@
+# Installation for Spark
+Follow the instructions in the Spark docs:
+* https://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration
+* https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration
+
+## Installation inheriting from Hadoop cluster configuration
+
+Inheriting from the Hadoop cluster configuration is usually the easiest way.
+
+To make the Hadoop configuration files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/conf/spark-env.sh` to a location containing the configuration file `core-site.xml`, usually `/etc/hadoop/conf`.
+
+## Installation not inheriting from Hadoop cluster configuration
+
+Copy `seaweedfs-hadoop2-client-x.x.x.jar` to all executor machines.
+
+Add the following to `spark/conf/spark-defaults.conf` on every node running Spark:
+```
+spark.driver.extraClassPath /path/to/seaweedfs-hadoop2-client-x.x.x.jar
+spark.executor.extraClassPath /path/to/seaweedfs-hadoop2-client-x.x.x.jar
+```
+
+And modify the configuration at runtime:
+
+```
+./bin/spark-submit \
+  --name "My app" \
+  --master local[4] \
+  --conf spark.eventLog.enabled=false \
+  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
+  --conf spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \
+  --conf spark.hadoop.fs.defaultFS=seaweedfs://localhost:8888 \
+  myApp.jar
+```
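
For the "inheriting from Hadoop cluster configuration" path above, the `core-site.xml` picked up via `HADOOP_CONF_DIR` needs the SeaweedFS filesystem entries. Below is a minimal sketch that mirrors the two properties passed as `--conf spark.hadoop.*` in the `spark-submit` example; the `localhost:8888` filer address is a placeholder to replace with your own filer:

```
<configuration>
  <!-- Register the SeaweedFS Hadoop-compatible filesystem implementation -->
  <property>
    <name>fs.seaweedfs.impl</name>
    <value>seaweed.hdfs.SeaweedFileSystem</value>
  </property>
  <!-- Optional: make seaweedfs:// the default filesystem; point this at your filer -->
  <property>
    <name>fs.defaultFS</name>
    <value>seaweedfs://localhost:8888</value>
  </property>
</configuration>
```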
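
Once the client jar and configuration are in place, a quick way to confirm Spark can reach SeaweedFS is to write a small file over the `seaweedfs://` scheme and read it back. This is a minimal Scala sketch (e.g. pasted into `spark-shell`); the filer address and the `/tmp/spark-smoke-test` path are assumptions to adapt to your cluster:

```
// Smoke test: write a few lines to SeaweedFS, read them back, and count them.
// Assumes the seaweedfs-hadoop2-client jar is on the driver/executor classpath
// and the filer is reachable at localhost:8888 (placeholder address).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("seaweedfs-smoke-test")
  .config("spark.hadoop.fs.seaweedfs.impl", "seaweed.hdfs.SeaweedFileSystem")
  .getOrCreate()

import spark.implicits._

// Write a small text dataset to a SeaweedFS path (hypothetical path, adjust as needed).
Seq("hello", "seaweedfs", "from spark").toDS()
  .write.mode("overwrite").text("seaweedfs://localhost:8888/tmp/spark-smoke-test")

// Read it back and check that all three lines arrive.
val lines = spark.read.textFile("seaweedfs://localhost:8888/tmp/spark-smoke-test")
println(s"read back ${lines.count()} lines")
```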