Updated Remote Storage Architecture (markdown)

Chris Lu 2021-08-14 08:02:48 -07:00
parent 1889bb0072
commit c6ac297d4d

@ -44,7 +44,7 @@ The remote storage, e.g., AWS S3, can be [[configured|Configure Remote Storage]]
```
# in "weed shell"
> remote.configure -name=cloud1 -type=s3 -access_key=xyz -secret_key=yyy
> remote.mount -dir=xxx -remote=cloud1/bucket
> remote.mount -dir=/path/to/xxx -remote=cloud1/bucket
```
On mount, all the metadata will be pulled down and cached to the local filer store.
@ -58,21 +58,20 @@ If the cloud data has any changes, just run `remote.mount -dir=xxx -remote=cloud
By default, the file content is [[cached|Cache Remote Storage]] to local volume servers on the first read.
Sometimes you may want to fetch all file content for a set of files. Warming up the cache by opening and reading every file is tedious.
Instead, you can run the command `remote.cache -dir=/path/to/xxx/cacheNeeded` in `weed shell`, which caches all files under the specified directory.
Instead, you can run the command `remote.cache -dir=xxx` in `weed shell`, which caches all files under the specified directory.
Correspondingly, you can run `remote.uncache -dir=xxx` in `weed shell` to purge the local cache.
To purge the local cache, run `remote.uncache -dir=/path/to/xxx/cacheNeeded` in `weed shell`.
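When several directories need warming or purging, the commands above can be generated rather than typed one by one. The following is a minimal sketch, not part of SeaweedFS: `build_remote_commands` is a hypothetical helper, and `/path/to/xxx/alsoNeeded` is a made-up second directory. It only prints the command lines, so nothing is executed until the output is handed to `weed shell`.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SeaweedFS): print one
# `remote.cache` or `remote.uncache` command per directory.
# Usage: build_remote_commands ACTION DIR...
#   ACTION is "cache" or "uncache".
build_remote_commands() {
    action="$1"
    shift
    for dir in "$@"; do
        printf 'remote.%s -dir=%s\n' "$action" "$dir"
    done
}

# Dry run: show the commands that would warm two directories.
# The second path is an illustrative assumption.
build_remote_commands cache /path/to/xxx/cacheNeeded /path/to/xxx/alsoNeeded
```

Assuming `weed shell` reads commands from standard input and can reach your cluster with its default options, the output could be piped straight in, for example `build_remote_commands cache /path/to/xxx/cacheNeeded | weed shell`. Check this against your own setup before relying on it.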
## Write Back Cache
The cache is written back by the `weed filer.remote.sync` process. This asynchronous write back does not slow down any local operations.
Local changes are written back by the `weed filer.remote.sync` process, which is asynchronous and does not slow down any local operations.
If `weed filer.remote.sync` is not running, data changes are not propagated back to the cloud.
# Possible Use Cases
* Machine learning training jobs need to repeatedly visit a large set of files. Caching the files locally increases training speed and reduces API and network costs.
* Saving data files. With cloud capacity and storage tiering, keeping data files in the cloud may be a good idea. The cache can save the programming effort.
* Saving data files. With cloud capacity and storage tiering, keeping data files in the cloud may be a good idea. This feature can save the programming effort.
* Run Spark/Flink jobs on mounted folders for faster computation.
* Multiple access methods (HDFS/HTTP/S3/WebDav/Mount) are available for remote storage, so you are not tied to one specific way of accessing it.
* If you plan to move off cloud, you can start with SeaweedFS Remote Storage Cache. When you are happy with it, just stop the write back process (and cancel the monthly payment to the cloud vendor!).