Updated Super Large Directories (markdown)

Chris Lu 2020-12-22 02:53:32 -08:00
parent fe0237e570
commit ce08c9ef61

If a super large directory has too many files or subfolders, even listing its entries can be a challenge.
For example, in the Cassandra filer store, each entry has this schema:
```
CREATE TABLE filemeta (
    directory varchar,
    name varchar,
    meta blob,
    PRIMARY KEY (directory, name)
) WITH CLUSTERING ORDER BY (name ASC);
```
The `directory` is the partitioning key, so all entries within the same directory are partitioned to the same data node. This is fine for most cases. However, if one directory has billions of child entries, that data node will not perform well.
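For instance, listing a directory translates into a single-partition query, which that one data node must answer alone (a sketch against the schema above; the path is illustrative):

```
SELECT name, meta FROM filemeta
 WHERE directory = '/buckets/users'
 ORDER BY name;
```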
This is actually a common case when user names, ids, or UUIDs are used as child entry names. Usually a separate index is built to translate each name to a file id, and the file id is then used to access the data directly, giving up all the convenience of a file system.
We need a way to spread the data across all data nodes without sacrificing too much. In a sense, we want SeaweedFS to be a hybrid: as efficient and scalable as a distributed key-value store, while still offering the familiar file system operations.
# How it works
This is currently implemented in Cassandra and Redis.
For super large directories, the directory is hashed and combined with the entry name as `<directory hash, name>`, which is used as the partitioning key. This ensures that the directory children are evenly distributed across all data nodes.
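A minimal sketch of what such a table could look like in CQL (the `dirhash` column and the composite partition key here are illustrative assumptions, not the actual SeaweedFS schema):

```
CREATE TABLE filemeta (
    dirhash varchar,    -- hash of the full directory path, e.g. md5('/buckets/users')
    name    varchar,
    meta    blob,
    PRIMARY KEY ((dirhash, name))
);
```

With `((dirhash, name))` as a composite partition key, each child entry hashes to its own partition, so writes and point lookups are spread across the whole cluster.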
## The Downside