From 51b424d209aaf0f7a8997e1dd61d9dc29dec21fd Mon Sep 17 00:00:00 2001
From: Chris Lu
Date: Wed, 18 Nov 2020 10:48:03 -0800
Subject: [PATCH] Updated Independent Benchmarks (markdown)

---
 Independent-Benchmarks.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Independent-Benchmarks.md b/Independent-Benchmarks.md
index 898c662..3b8059b 100644
--- a/Independent-Benchmarks.md
+++ b/Independent-Benchmarks.md
@@ -20,7 +20,7 @@ The basic configuration information of cluster is as follows:
 + Total disk capacity: 799TB
 + Replication policy: 010
 
-Here are the details and results of our test. At the beginning of the test, we put our data to both HDFS and HCFS. The amount of the data is 100 million records, and stroed in 200 parquet files. The size of each parquet file is about 89 MB. We ran spark on yarn with 20 executors. In spark, we got two DataFrames by reading parquet from HDFS and HCFS separately, then executed `count`, `group by` and `join` by 100 times , and `write` by 10 times, on each DataFrame.
+Here are the details and results of our test. At the beginning of the test, we loaded our data into both HDFS and HCFS: 100 million records, stored in 200 parquet files of about 89 MB each. We ran Spark on YARN with 20 executors. In Spark, we created two DataFrames by reading the parquet files from HDFS and HCFS separately, then executed `count`, `group by`, and `join` 100 times each, and `write` 10 times, on each DataFrame.
 
 As for `count`, HCFS's advantage is obvious. The average time of the DataFrame from HDFS is 4.05 seconds, while HCFS takes only 0.659 seconds. Here is the result:
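
For readers who want to reproduce the methodology described in the patched paragraph, here is a minimal PySpark sketch of the benchmark loop. The original benchmark code is not part of this page, so everything here is an assumption: the parquet paths, the `seaweedfs://filer:8888` HCFS endpoint, and the `some_key` group/join column are placeholders, and the SeaweedFS Hadoop client is assumed to be on the Spark classpath.

```python
# Hedged sketch of the benchmark described above: time count / group by /
# join 100 times each and write 10 times, on a DataFrame read from each
# file system. Paths, filer address, and column names are placeholders.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-vs-hcfs-benchmark").getOrCreate()


def timed_runs(label, action, runs):
    """Run `action` `runs` times and print the average wall-clock time."""
    start = time.time()
    for _ in range(runs):
        action()
    avg = (time.time() - start) / runs
    print(f"{label}: avg {avg:.3f}s over {runs} runs")


for name, path in [
    ("HDFS", "hdfs:///benchmark/data.parquet"),                  # placeholder path
    ("HCFS", "seaweedfs://filer:8888/benchmark/data.parquet"),   # placeholder path
]:
    df = spark.read.parquet(path)
    timed_runs(f"{name} count", lambda: df.count(), 100)
    # force the group-by to execute by counting the resulting groups
    timed_runs(f"{name} group by",
               lambda: df.groupBy("some_key").count().count(), 100)
    # self-join on a key column, with count() as the triggering action
    timed_runs(f"{name} join",
               lambda: df.join(df, "some_key").count(), 100)
    timed_runs(f"{name} write",
               lambda: df.write.mode("overwrite").parquet(path + ".out"), 10)

spark.stop()
```

Averaging wall-clock time over repeated actions matches how the page reports results (average seconds per operation), though a sketch like this does not control for caching between runs.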