Created Cloud Cache Benefits (markdown)

Chris Lu 2021-08-16 01:14:40 -07:00
parent dd766e91c6
commit 235fb7c73b

# Context
Nowadays, the trend is to go to cloud storage, since "everybody is doing it".
## Cloud is not for everyone
But after actually using cloud storage, many users find:
* The cloud cost is too high. On [[AWS S3|https://aws.amazon.com/s3/pricing/]], the storage cost looks relatively cheap at around $0.023 per GB per month, but other costs can add up quickly:
  * API cost for PUT, POST, and LIST requests is $0.005 per 1000 requests.
  * Transfer out cost is $0.09 per GB.
* The network latency is high.
* The response latency is not consistent.
* Any code change may increase your total cost.
* Watching the cost limits engineers' creativity and development speed.
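
To see how the extra charges listed above can outweigh the storage fee, here is a minimal sketch of the arithmetic. The prices are the ones quoted in the list; the workload numbers (10 TB stored, 50 million requests, the dataset read out twice) are purely hypothetical, and real bills depend on region, tiers, and current pricing.

```python
# Prices quoted in the list above (USD). Check the AWS S3 pricing page for current values.
STORAGE_PER_GB_MONTH = 0.023
REQUEST_PER_1000 = 0.005      # PUT, POST, LIST requests
TRANSFER_OUT_PER_GB = 0.09


def monthly_cost(stored_gb, billed_requests, transfer_out_gb):
    """Return (storage, requests, transfer) cost in USD for one month."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    requests = billed_requests / 1000 * REQUEST_PER_1000
    transfer = transfer_out_gb * TRANSFER_OUT_PER_GB
    return storage, requests, transfer


# Hypothetical workload: 10 TB stored, 50 million PUT/LIST requests,
# and the full dataset transferred out of the cloud twice in the month.
storage, requests, transfer = monthly_cost(10_000, 50_000_000, 20_000)
total = storage + requests + transfer
print(f"storage=${storage:,.2f} requests=${requests:,.2f} "
      f"transfer=${transfer:,.2f} total=${total:,.2f}")
```

Under these assumptions, storage is about $230 out of a total of about $2,280; the transfer out charge dominates.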
## SeaweedFS can be a good choice
SeaweedFS can be good because:
* Freedom to read your own data, any time you want!
* Freedom to develop new features with a fixed budget.
* Faster high-capacity storage hardware is also getting cheaper.
* Local access latency.
* Avoids the noisy neighbor problem.
* Cross data center replication gives high data redundancy and availability.
However, how can SeaweedFS work with data that is already in the cloud?
# Design
![SeaweedFS Remote Storage](https://raw.githubusercontent.com/chrislusf/seaweedfs/master/note/SeaweedFS_RemoteMount.png)
# Benefits
* Cached Locally
  * Fast metadata operations.
  * Fast read and write at local network latency and throughput.
  * Faster and cheaper hardware.
  * Avoid noisy neighbors.
  * Minimum cost. Download data once.
* Scalable Capacity
  * Just pre-cache everything. No more delay on the first uncached read.
  * No need to search for the best caching strategy for different data access patterns.
* Easy To Manage
  * Warm up cache by folder, file name pattern, file size, file age, etc.
  * Uncache by folder, file name pattern, file size, file age, etc.
  * Optionally write data back to cloud storage.
* Flexible
  * Can write data back to work with existing cloud ecosystems.
  * Can transparently switch to different cloud storage vendors.
  * Can detach from the cloud storage if you decide to move off the cloud.
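
The "Easy To Manage" criteria above (folder, file name pattern, file size, file age) amount to a filter over file metadata. The sketch below only illustrates that idea; `FileEntry` and `select` are hypothetical names invented for this example and are not SeaweedFS's implementation or command syntax.

```python
import fnmatch
from dataclasses import dataclass


@dataclass
class FileEntry:
    path: str     # full path, e.g. "/data/train/a.jpg"
    size: int     # bytes
    mtime: float  # last-modified time, seconds since the epoch


def select(entries, now, folder="/", pattern="*", min_size=0, max_size=None,
           min_age=0, max_age=None):
    """Return entries under `folder` that match the name, size, and age limits."""
    prefix = folder.rstrip("/") + "/"
    chosen = []
    for e in entries:
        if not e.path.startswith(prefix):
            continue
        name = e.path.rsplit("/", 1)[-1]
        if not fnmatch.fnmatchcase(name, pattern):
            continue
        if e.size < min_size or (max_size is not None and e.size > max_size):
            continue
        age = now - e.mtime  # seconds since last modification
        if age < min_age or (max_age is not None and age > max_age):
            continue
        chosen.append(e)
    return chosen


DAY = 86_400
now = 1_700_000_000.0  # fixed timestamp so the example is reproducible
entries = [
    FileEntry("/data/train/a.jpg", 2_000, now - 3_600),       # 1 hour old
    FileEntry("/data/train/b.txt", 500, now - 3_600),
    FileEntry("/data/train/old.jpg", 2_000, now - 40 * DAY),  # 40 days old
    FileEntry("/data/eval/c.jpg", 2_000, now - 3_600),
]

# Warm-up candidates: every JPEG under /data/train.
warm = select(entries, now, folder="/data/train", pattern="*.jpg")
# Uncache candidates: anything under /data older than 30 days.
evict = select(entries, now, folder="/data", min_age=30 * DAY)
```

The same predicate serves both directions: one call picks what to warm up, another picks what to uncache, which is why the two bullet points list identical criteria.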
# Possible Use Cases
* Machine learning
  * Problem
    * Training jobs need to repeatedly visit a large set of files.
    * The randomized access pattern is hard to cache.
  * With SeaweedFS Cloud Cache
    * Users can explicitly ask SeaweedFS Cloud Cache to cache one whole folder.
    * This increases training speed and reduces API and network costs.
    * Users can access data through FUSE mounted folders.
* Data Hoarding
  * Problem
    * With cloud capacity and storage tiering, saving data files there may be a good idea.
    * Recently uploaded files are very likely to be accessed again.
  * With SeaweedFS Cloud Cache
    * Users can explicitly ask SeaweedFS Cloud Cache to uncache by file age.
    * Users can also choose to never uncache, basically treating the cloud copy as a backup.
* Big Data
  * Problem
    * Run MapReduce, Spark, and Flink jobs on mounted folders for faster computation.
  * With SeaweedFS Cloud Cache
    * Avoids slow cloud storage metadata access.
    * Large amounts of data access will not increase cost.
    * Writes data back to work with cloud ecosystems.
* Cloud Storage Vendor Agnostic
  * Problem
    * Different datasets may need to be on different vendors, based on access pattern, latency, cost, etc.
  * With SeaweedFS Cloud Cache
    * Transparently switch from one vendor to another.
* Move Off Cloud
  * Problem
    * Cloud storage is costly!
  * With SeaweedFS Cloud Cache
    * Helps the transition from on-cloud to off-cloud.
    * When you are happy with it, just stop the write back process (and cancel the monthly payment to the cloud vendor!).
* Support multiple access methods
  * Problem
    * You may need to access cloud data by HDFS, HTTP, S3 API, WebDAV, or FUSE mount.
  * With SeaweedFS Cloud Cache
    * Multiple ways to access remote storage.
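
For the machine learning case above, the effect of downloading data once can be put in rough numbers. This is a sketch under stated assumptions: the training machines sit outside the cloud provider's network, so every read from cloud storage is billed as transfer out at the $0.09 per GB quoted earlier; each epoch reads the whole dataset; request charges are ignored. The dataset size and epoch count are hypothetical.

```python
TRANSFER_OUT_PER_GB = 0.09  # USD per GB, the figure quoted earlier on this page


def transfer_cost(dataset_gb, epochs, cached):
    """Transfer-out cost in USD. A locally cached dataset is downloaded only once."""
    downloads = 1 if cached else epochs
    return dataset_gb * downloads * TRANSFER_OUT_PER_GB


# Hypothetical training job: a 2 TB dataset read in full for 50 epochs.
dataset_gb, epochs = 2_000, 50
direct_cost = transfer_cost(dataset_gb, epochs, cached=False)
cached_cost = transfer_cost(dataset_gb, epochs, cached=True)
print(f"direct=${direct_cost:,.2f} cached=${cached_cost:,.2f}")
```

With these numbers, reading directly from the cloud costs about $9,000 in transfer fees, versus about $180 when the folder is cached once and every later epoch is served locally.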