# Remote Storage Architecture

Chris Lu 2021-08-10 00:33:25 -07:00
Nowadays, the trend is to go to cloud storage, since "everybody is doing it".
## Cloud is not for everyone
But after really using cloud storage, many users will find:
* The cloud cost is too high. On [[AWS S3|https://aws.amazon.com/s3/pricing/]], storage itself looks relatively cheap at around $0.023 per GB per month, but other costs add up quickly:
* API cost for PUT, POST, LIST requests is $0.005 per 1000 requests
* Transfer out cost is $0.09 per GB.
* The network latency is high.
* The response latency is not consistent.
* Any code changes may increase your total cost.
* Constantly watching costs limits engineers' creativity and development speed.
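For a sense of scale, the rates quoted above can be plugged into a quick back-of-the-envelope calculation. The workload numbers here are made-up assumptions, not measurements:

```python
# Rough monthly S3 bill for a hypothetical workload, using the list
# prices quoted above (illustrative only; check current AWS pricing).
STORAGE_PER_GB = 0.023        # $ per GB per month
REQUEST_PER_1000 = 0.005      # $ per 1000 PUT/POST/LIST requests
EGRESS_PER_GB = 0.09          # $ per GB transferred out

stored_gb = 10 * 1024         # assume 10 TB stored
put_requests = 10_000_000     # assume 10M writes/listings per month
egress_gb = 1 * 1024          # assume 1 TB read back out

storage_cost = stored_gb * STORAGE_PER_GB               # 235.52
request_cost = put_requests / 1000 * REQUEST_PER_1000   # 50.00
egress_cost = egress_gb * EGRESS_PER_GB                 # 92.16
total = storage_cost + request_cost + egress_cost
print(f"${total:.2f}/month")  # prints $377.68/month
```

Even for this modest workload, API calls and egress add roughly 60% on top of the raw storage cost, and they scale with how the code *accesses* the data, not just how much data there is.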
## SeaweedFS can be a good choice
SeaweedFS can be good because:
* Freedom to develop new features with a fixed budget.
* Faster high-capacity storage hardware is also getting cheaper.
* Low and consistent local access latency.
* Avoid noisy neighbor problem.
* Cross data center replication gives high data redundancy and availability.
However, how can SeaweedFS work with data that is already in the cloud?
# SeaweedFS Remote Storage Cache
With this feature, SeaweedFS can cache data that lives in the cloud, both metadata and file content. Given SeaweedFS's unlimited scalability, the cache size is effectively unlimited. Any local changes can be written back to the cloud asynchronously.
```
read:  [HDFS|Mount|HTTP|S3|WebDAV] <== Filer (metadata cache) <== Volume Servers (data cache) <== Cloud
write: [HDFS|Mount|HTTP|S3|WebDAV] ==> Filer (metadata cache) ==> Volume Servers (data cache) ==> `weed filer.remote.sync` ==> Cloud
```
## Mount Remote Storage
The remote storage, e.g., AWS S3, can be [[configured|Configure Remote Storage]] and [[mounted|Mount Remote Storage]] directly to an empty folder in SeaweedFS.
On mount, all the metadata will be pulled down and cached to the local filer store.
The metadata is used for all metadata operations, such as listing, directory traversal, reading file sizes, comparing file modification times, etc. These operations stay free and fast as usual, without any API calls to the cloud.
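As a sketch of the workflow, the two steps can be run from `weed shell`. The remote name `cloud1`, the bucket, the local directory, and the credential values are placeholder assumptions; see the linked [[Configure Remote Storage]] and [[Mount Remote Storage]] pages for the exact flags in your version:

```
$ weed shell
> remote.configure -name=cloud1 -type=s3 -s3.access_key=xxx -s3.secret_key=yyy
> remote.mount -dir=/buckets/b1 -remote=cloud1/bucket1
```

After the mount, `/buckets/b1` can be browsed like any other SeaweedFS folder, with the listings served entirely from the local filer store.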
## Cache/Uncache File Content
A file has metadata and its content.
By default, the file content is cached to local volume servers on the first read.
Sometimes you may want to fetch the content of a whole set of files ahead of time, but warming up the cache by opening and reading every file is tedious.
Here you can run the command `remote.cache -dir=xxx` in `weed shell`, which caches all files under the specified directory.
Correspondingly, you can run `remote.uncache -dir=xxx` in `weed shell` to drop the local copies and avoid duplicated storage.
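Putting the two commands together, a hypothetical warm-up and clean-up session might look like this (the directory path is a placeholder):

```
$ weed shell
> remote.cache -dir=/buckets/b1/dataset     # pull down file content for everything under this directory
> remote.uncache -dir=/buckets/b1/dataset   # later, drop the local copies; the metadata stays cached
```

This is handy before and after a batch job: cache the working set once, run the job against fast local reads, then uncache to reclaim local disk.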
## Write Back Cache
The cache is written back by the `weed filer.remote.sync` process.
If `weed filer.remote.sync` is not running, the cache is effectively read-only. Changes are not forbidden, but any data changes will not be propagated back to the cloud.
The asynchronous write back will not slow down any local operations.
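A minimal sketch of starting the write-back process is below. The filer address and directory are assumptions for illustration; check `weed filer.remote.sync -h` for the flags in your version:

```
# runs alongside the filer and continuously propagates local changes to the cloud
$ weed filer.remote.sync -filer=localhost:8888 -dir=/buckets/b1
```

Because the sync is a separate long-running process, it can be stopped and restarted without affecting local reads and writes; it simply resumes propagating the accumulated changes.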