The currently recommended way for Hadoop to access SeaweedFS is via the [SeaweedFS Hadoop Compatible File System](Hadoop-Compatible-File-System). It is the most efficient option, with the client talking directly to the filer for metadata and to the volume servers for file content.
However, the downside is that you need to add the SeaweedFS jar to the classpath and change some Hadoop settings, roughly as sketched below.
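For reference, the setup described on that page amounts to putting the SeaweedFS Hadoop client jar on the classpath and adding settings to `core-site.xml` roughly along these lines (the filer host and port here are placeholders; see that page for the exact, current property names):
```
<configuration>
    <property>
        <name>fs.seaweedfs.impl</name>
        <value>seaweed.hdfs.SeaweedFileSystem</value>
    </property>
    <property>
        <name>fs.seaweed.filer.host</name>
        <value>localhost</value>
    </property>
    <property>
        <name>fs.seaweed.filer.port</name>
        <value>8888</value>
    </property>
</configuration>
```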
# HDFS access to SeaweedFS via the S3 connector
The S3A connector is already included in Hadoop distributions, so you can use it directly.

In your `pom.xml`, use properties like these:
```
<properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.12.11</scala.version>
    <spark.version>3.1.2</spark.version>
    <hadoop.version>3.3.1</hadoop.version>
    <spark.pom.scope>compile</spark.pom.scope>
</properties>
```
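These properties alone don't pull in the S3A classes. A sketch of a matching `dependencies` section, assuming the application runs outside a full Hadoop installation (the S3A filesystem ships in the `hadoop-aws` artifact):
```
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>${spark.version}</version>
        <scope>${spark.pom.scope}</scope>
    </dependency>
    <!-- Provides org.apache.hadoop.fs.s3a.S3AFileSystem and the bundled AWS SDK -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
```
`spark-sql` brings in `spark-core` transitively, which covers the `SparkSession` and `RDD` classes used below.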
And add this to your code (shown here as a complete, minimal program):
```
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.SparkSession;

public class SparkDemoFromS3 {
    public static void main(String[] args) {
        // Build a local Spark session
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .config("spark.eventLog.enabled", "false")
                .config("spark.driver.memory", "1g")
                .config("spark.executor.memory", "1g")
                .appName("SparkDemoFromS3")
                .getOrCreate();

        // Point the S3A connector at the SeaweedFS S3 gateway (default port 8333)
        Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
        hadoopConf.set("fs.s3a.access.key", "admin");
        hadoopConf.set("fs.s3a.secret.key", "xx");
        hadoopConf.set("fs.s3a.endpoint", "ip:8333");
        hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true");
        hadoopConf.set("fs.s3a.path.style.access", "true");
        hadoopConf.set("fs.s3a.connection.ssl.enabled", "false");
        hadoopConf.set("fs.s3a.multiobjectdelete.enable", "false");
        hadoopConf.set("fs.s3a.directory.marker.retention", "keep");
        hadoopConf.set("fs.s3a.change.detection.version.required", "false");
        hadoopConf.set("fs.s3a.change.detection.mode", "warn");

        // Read a text file from the bucket, count its lines, and write a copy back
        RDD<String> rdd = spark.sparkContext().textFile("s3a://bk002/test1.txt", 1);
        System.out.println(rdd.count());
        rdd.saveAsTextFile("s3a://bk002/testcc/t2");
    }
}
```
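The S3A connector is not tied to Spark. A minimal sketch using the plain Hadoop `FileSystem` API, with the same placeholder endpoint, credentials, and bucket as above (the class name is just for illustration):
```
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListSeaweedFSBucket {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same SeaweedFS S3 gateway settings as in the Spark example
        conf.set("fs.s3a.access.key", "admin");
        conf.set("fs.s3a.secret.key", "xx");
        conf.set("fs.s3a.endpoint", "ip:8333");
        conf.set("fs.s3a.path.style.access", "true");
        conf.set("fs.s3a.connection.ssl.enabled", "false");

        // List the bucket root through the S3A connector
        FileSystem fs = FileSystem.get(new URI("s3a://bk002/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```
The same `fs.s3a.*` settings can also go into `core-site.xml`, so that command-line tools such as `hadoop fs -ls s3a://bk002/` work against SeaweedFS, provided `hadoop-aws` is on the classpath.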