diff --git a/Cloud-Cache-Benefits.md b/Cloud-Cache-Benefits.md
new file mode 100644
index 0000000..07ebfa5
--- /dev/null
+++ b/Cloud-Cache-Benefits.md
@@ -0,0 +1,91 @@
+# Context
+Nowadays, the trend is to go to cloud storage, since "everybody is doing it".
+
+## Cloud is not for everyone
+But after really using cloud storage, many users will find:
+
+* The cloud cost is too high. On [[AWS S3|https://aws.amazon.com/s3/pricing/]], storage looks relatively cheap (but not really) at around $0.023 per GB per month. But there are other costs which can add up quickly:
+  * API cost for PUT, POST, LIST requests is $0.005 per 1,000 requests.
+  * Transfer out cost is $0.09 per GB.
+* The network latency is high.
+* The response latency is not consistent.
+* Any code change may increase your total cost.
+* Watching the cost limits engineers' creativity and development speed.
+
+## SeaweedFS can be a good choice
+
+SeaweedFS can be good because:
+
+* Freedom to read your own data, any time you want!
+* Freedom to develop new features with a fixed budget.
+* Faster high-capacity storage hardware is also getting cheaper.
+* Local access latency.
+* Avoid the noisy neighbor problem.
+* Cross data center replication gives high data redundancy and availability.
+
+However, how can SeaweedFS work with data already on the cloud?
+
+# Design
+
+![SeaweedFS Remote Storage](https://raw.githubusercontent.com/chrislusf/seaweedfs/master/note/SeaweedFS_RemoteMount.png)
+
+# Benefits
+
+* Cached Locally
+  * Fast metadata operations.
+  * Fast read and write at local network latency and throughput.
+  * Faster and cheaper hardware.
+  * Avoid noisy neighbors.
+  * Minimum cost. Download data once.
+* Scalable Capacity
+  * Just pre-cache everything. No more delay on first uncached read.
+  * No need to work hard to find the best caching strategy for different data access patterns.
+* Easy To Manage
+  * Warm up the cache by folder, file name pattern, file size, file age, etc.
+  * Uncache by folder, file name pattern, file size, file age, etc.
+  * Optionally write data back to cloud storage.
+* Flexible
+  * Can write data back to work with existing cloud ecosystems.
+  * Can transparently switch to different cloud storage vendors.
+  * Can detach from cloud storage if you decide to move off the cloud.
+
+# Possible Use Cases
+
+* Machine learning
+  * Problem
+    * Training jobs need to repeatedly visit a large set of files.
+    * The randomized access pattern is hard for caching.
+  * With SeaweedFS Cloud Cache
+    * Users can explicitly ask SeaweedFS Cloud Cache to cache one whole folder.
+    * Increase training speed and reduce API cost and network cost.
+    * Users can access data with FUSE-mounted folders.
+* Data Hoarding
+  * Problem
+    * With cloud capacity and storage tiering, saving data files there may be a good idea.
+    * Recently uploaded files are very likely to be accessed again.
+  * With SeaweedFS Cloud Cache
+    * Users can explicitly ask SeaweedFS Cloud Cache to uncache by file age.
+    * Users can also choose to never uncache, basically treating the cloud copy as a backup.
+* Big Data
+  * Problem
+    * Run MapReduce, Spark, and Flink jobs on mounted folders for faster computation.
+  * With SeaweedFS Cloud Cache
+    * Avoid slow cloud storage metadata access.
+    * Large amounts of data access will not increase cost.
+    * Write back data to work with cloud ecosystems.
+* Cloud Storage Vendor Agnostic
+  * Problem
+    * Different datasets may need to be on different vendors, based on access pattern, latency, cost, etc.
+    * Transparently switch from one vendor to another.
+* Move Off Cloud
+  * Problem
+    * Cloud storage is costly!
+  * With SeaweedFS Cloud Cache
+    * Help to transition from on-cloud to off-cloud.
+    * When you are happy with it, just stop the write back process (and cancel the monthly payment to the cloud vendor!).
+* Support multiple access methods
+  * Problem
+    * You may need to access cloud data by HDFS, or HTTP, or S3 API, or WebDAV, or FUSE Mount.
+  * With SeaweedFS Cloud Cache
+    * Multiple ways to access remote storage.
+
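The S3 prices listed under "Cloud is not for everyone" can be turned into a rough monthly estimate. The following is a minimal sketch in Python that uses only the three figures quoted in the page (storage, PUT/POST/LIST requests, transfer out). The helper name `estimate_monthly_cost` and the workload numbers are hypothetical, chosen for illustration, and a real bill includes items the page does not price (for example GET requests).

```python
# Prices quoted in the wiki page for AWS S3 (USD).
STORAGE_PER_GB_MONTH = 0.023   # storage, per GB per month
WRITE_LIST_PER_1000 = 0.005    # PUT, POST, LIST requests, per 1,000 requests
TRANSFER_OUT_PER_GB = 0.09     # data transfer out, per GB


def estimate_monthly_cost(stored_gb, write_list_requests, transfer_out_gb):
    """Return a rough monthly cost breakdown in USD.

    Hypothetical helper: it covers only the three price items above.
    """
    storage = stored_gb * STORAGE_PER_GB_MONTH
    requests = write_list_requests / 1000 * WRITE_LIST_PER_1000
    transfer = transfer_out_gb * TRANSFER_OUT_PER_GB
    return {
        "storage": storage,
        "requests": requests,
        "transfer": transfer,
        "total": storage + requests + transfer,
    }


if __name__ == "__main__":
    # Assumed workload: 10 TB stored, 2 million PUT/LIST requests,
    # and the whole dataset read 5 times a month (e.g. training epochs).
    stored_gb = 10 * 1000
    direct = estimate_monthly_cost(stored_gb, 2_000_000, 5 * stored_gb)
    # With a local cache, the data is downloaded from the cloud only once.
    cached = estimate_monthly_cost(stored_gb, 2_000_000, 1 * stored_gb)
    for name, cost in (("direct", direct), ("cached", cached)):
        print(name, {key: round(value, 2) for key, value in cost.items()})
```

Under these assumed numbers the total is roughly $4,740 when every read goes to the cloud and roughly $1,140 when the data is downloaded once, and the difference comes entirely from the transfer-out charge. This is the "Download data once" benefit expressed as arithmetic, not a measurement of any real deployment.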