The currently recommended way for Hadoop to access SeaweedFS is the [SeaweedFS Hadoop Compatible File System](Hadoop-Compatible-File-System). It is the most efficient approach: the client reads metadata directly from the filer and file content directly from the volume servers.

However, the downside is that you need to add the SeaweedFS jar to the classpath and change some Hadoop settings.
# HDFS Access to SeaweedFS via the S3 Connector

The S3A connector is already included in Hadoop distributions, so you can use it directly.

Use the following Maven properties:
```
<properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.12.11</scala.version>
    <spark.version>3.1.2</spark.version>
    <hadoop.version>3.3.1</hadoop.version>
    <spark.pom.scope>compile</spark.pom.scope>
</properties>
```
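To make the S3A filesystem classes available, the build also needs the `hadoop-aws` artifact on the classpath. A minimal sketch of the dependency, reusing the `hadoop.version` property above (the exact set of transitive AWS SDK jars may vary by Hadoop version):

```
<dependencies>
    <!-- Provides org.apache.hadoop.fs.s3a.S3AFileSystem -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
```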
And add this to your code:
```
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.SparkSession;

// Build a local Spark session for the demo
SparkSession spark = SparkSession.builder()
    .master("local[*]")
    .config("spark.eventLog.enabled", "false")
    .config("spark.driver.memory", "1g")
    .config("spark.executor.memory", "1g")
    .appName("SparkDemoFromS3")
    .getOrCreate();

// Point the S3A connector at the SeaweedFS S3 gateway and relax the
// S3A checks that assume AWS-specific behavior
Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
hadoopConf.set("fs.s3a.access.key", "admin");
hadoopConf.set("fs.s3a.secret.key", "xx");
hadoopConf.set("fs.s3a.endpoint", "ip:8333");
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true");
hadoopConf.set("fs.s3a.path.style.access", "true");
hadoopConf.set("fs.s3a.connection.ssl.enabled", "false");
hadoopConf.set("fs.s3a.multiobjectdelete.enable", "false");
hadoopConf.set("fs.s3a.directory.marker.retention", "keep");
hadoopConf.set("fs.s3a.change.detection.version.required", "false");
hadoopConf.set("fs.s3a.change.detection.mode", "warn");

// Read a file from a SeaweedFS bucket, count its lines, and write the RDD back
RDD<String> rdd = spark.sparkContext().textFile("s3a://bk002/test1.txt", 1);
System.out.println(rdd.count());
rdd.saveAsTextFile("s3a://bk002/testcc/t2");
```
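The same `fs.s3a.*` settings can also be supplied through Hadoop configuration instead of in code, which is how a plain HDFS client (for example `hdfs dfs`) would reach SeaweedFS. A minimal `core-site.xml` sketch, assuming the same endpoint and credentials as the example above:

```
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>admin</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>xx</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>ip:8333</value>
    </property>
    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>
    <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
    </property>
</configuration>
```

With `hadoop-aws` on the classpath and these properties in place, commands such as `hdfs dfs -ls s3a://bk002/` should work against the SeaweedFS S3 gateway.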