From 8035c7ca05d4cc06c42bd1098f501edb4a497b3b Mon Sep 17 00:00:00 2001 From: Chris Lu Date: Sat, 21 Nov 2020 13:55:34 -0800 Subject: [PATCH] Updated Hadoop Benchmark (markdown) --- Hadoop-Benchmark.md | 73 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 72 insertions(+), 1 deletion(-) diff --git a/Hadoop-Benchmark.md b/Hadoop-Benchmark.md index 18277f4..e29a01d 100644 --- a/Hadoop-Benchmark.md +++ b/Hadoop-Benchmark.md @@ -71,4 +71,75 @@ bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.0 20/07/17 15:59:38 INFO fs.TestDFSIO: IO rate std deviation: 13.33 20/07/17 15:59:38 INFO fs.TestDFSIO: Test exec time sec: 22.76 20/07/17 15:59:38 INFO fs.TestDFSIO: -``` \ No newline at end of file +``` + +# Independent Benchmarks + +## Sugon (中科曙光) +To test the performance of HDFS and SeaweedFS, we did a comparison by running 4 common spark operators, such as `count`, `group by`, `join` and `write`, for `group by` and `join` , there is a `count` followed to act. + +The basic configuration information of cluster is as follows: + +- HDFS: + + Node number: 25 + + Total disks: 36disk * 25node = 900disk + + Disk capacity: 3.7T SATA + + Total disk capacity: 3.19PB + + Replication: 5 + +- SeaweedFS: + + + Node number: 6(3+3 rack) + + Disk capacity: 3.7T SATA + + Cluster max volume: 21500 + + Total disk capacity: 799TB + + Replication policy: 010 + +Here are the details and results of our test. At the beginning of the test, we put our data to both HDFS and SeaweedFS. The amount of the data is 100 million records, and stored in 200 parquet files. The size of each parquet file is about 89 MB. We ran spark on yarn with 20 executors. In spark, we got two DataFrames by reading parquet from HDFS and HCFS separately, then executed `count`, `group by` and `join` by 100 times , and `write` by 10 times, on each DataFrame. + +As for `count`, SeaweedFS's advantage is obvious. The average time of the DataFrame from HDFS is 4.05 seconds, while SeaweedFS is only 0.659. Following is the result: + +| Summary | HDFS | SeaweedFS | +| ------- | ----- | ----- | +| Count | 100 | 100 | +| Mean | 4.050 | 0.659 | +| Stddev | 0.264 | 0.941 | +| Min | 3.678 | 0.392 | +| Max | 5.692 | 9.688 | + + + +As for `write`, we wrote the DataFrame from HDFS to SeaweedFS, and wrote the DataFrame from HCFS to SeaweedFS. Following is the result: + +| Summary | HDFS | SeaweedFS | +| ------- | ------- | ------- | +| Count | 10 | 10 | +| Mean | 234.279 | 232.078 | +| Stddev | 26.823 | 12.652 | +| Min | 216.931 | 214.349 | +| Max | 307.330 | 252.375 | + + + +As for `group by`, following is the result: + +| Summary | HDFS | SeaweedFS | +| ------- | ------ | ------ | +| Count | 100 | 100 | +| Mean | 14.121 | 12.515 | +| Stddev | 1.972 | 1.255 | +| Min | 12.879 | 11.322 | +| Max | 32.296 | 22.573 | + + + +As for `join`, every DataFrame join with itself on one column. Following is the result: + +| Summary | HDFS | SeaweedFS | +| ------- | ------ | ------ | +| Count | 100 | 100 | +| Mean | 25.684 | 23.897 | +| Stddev | 0.934 | 1.381 | +| Min | 24.006 | 22.275 | +| Max | 30.991 | 30.279 | +