Updated Independent Benchmarks (markdown)

Chris Lu 2020-11-18 10:48:03 -08:00
parent c624ce79cd
commit 51b424d209

@@ -20,7 +20,7 @@ The basic configuration information of cluster is as follows:
+ Total disk capacity: 799TB
+ Replication policy: 010
- Here are the details and results of our test. At the beginning of the test, we put our data to both HDFS and HCFS. The amount of the data is 100 million records, and stroed in 200 parquet files. The size of each parquet file is about 89 MB. We ran spark on yarn with 20 executors. In spark, we got two DataFrames by reading parquet from HDFS and HCFS separately, then executed `count`, `group by` and `join` by 100 times , and `write` by 10 times, on each DataFrame.
+ Here are the details and results of our test. At the beginning of the test, we put our data into both HDFS and HCFS. The data is 100 million records, stored in 200 parquet files; each parquet file is about 89 MB. We ran Spark on YARN with 20 executors. In Spark, we created two DataFrames by reading the parquet files from HDFS and HCFS separately, then executed `count`, `group by`, and `join` 100 times each, and `write` 10 times, on each DataFrame.
As for `count`, HCFS's advantage is obvious: the average time for the HDFS DataFrame is 4.05 seconds, while for the HCFS DataFrame it is only 0.659 seconds. The result is as follows:
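For context, a minimal sketch of how such a comparison could be driven from Spark is shown below. This is not the original benchmark code: the paths, the `seaweedfs://` scheme, and the filer address are assumptions about the cluster layout, and only the `count` loop is shown; the same timing pattern would apply to `group by`, `join`, and `write`.

```scala
import org.apache.spark.sql.SparkSession

object HcfsBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HDFS vs HCFS parquet benchmark")
      .getOrCreate()

    // Hypothetical paths: the actual dataset location and the HCFS (SeaweedFS) filer
    // address depend on the cluster configuration.
    val hdfsDf = spark.read.parquet("hdfs:///benchmark/data")
    val hcfsDf = spark.read.parquet("seaweedfs://filer:8888/benchmark/data")

    // Time a single Spark action; the test above repeats each action 100 times
    // (10 times for `write`) and reports the average.
    def time[A](label: String)(action: => A): Unit = {
      val start = System.nanoTime()
      action
      val seconds = (System.nanoTime() - start) / 1e9
      println(f"$label%-16s $seconds%.3f s")
    }

    for (i <- 1 to 100) {
      time(s"count/HDFS #$i")(hdfsDf.count())
      time(s"count/HCFS #$i")(hcfsDf.count())
    }

    spark.stop()
  }
}
```

A job like this would be submitted with `spark-submit --master yarn --num-executors 20 ...` to match the 20-executor setup described above.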