From 6d8fa37e4cb0e0fec009532a1767df3384724148 Mon Sep 17 00:00:00 2001
From: Chris Lu
Date: Mon, 28 Dec 2020 17:27:40 -0800
Subject: [PATCH] Updated Super Large Directories (markdown)

---
 Super-Large-Directories.md | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/Super-Large-Directories.md b/Super-Large-Directories.md
index b7a2775..2f2dc1e 100644
--- a/Super-Large-Directories.md
+++ b/Super-Large-Directories.md
@@ -19,9 +19,28 @@ We need a way to spread the data to all data nodes, without sacrificing too much
 
 # How it works?
 
-This is currently implemented in Cassandra and Redis.
-
-For super large directories, data is partitioned by ``. This ensures the directory children are evenly distributed to all data nodes.
+This is currently implemented in Cassandra and Redis. Super large directories sacrifice the directory listing functionality to keep the directory scalable. As the directory entry names are usually user IDs or UUIDs, the list is already stored in some other storage, so listing all child entries can be achieved by other approaches.
+
+Only the direct children of a super large directory cannot be listed. For deeper-level directories, listing still works. For example, if `/home/users/` is configured as a super large directory, listing `/home/users/` does not work, but listing `/home/users/user1` and `/home/users/user1/books` still works.
+
+```
+ /home/users/user1/books/book1.txt
+ /home/users/user1/books/book2.txt
+```
+
+## Cassandra Implementation
+In Cassandra, for normal directories, data has a primary key of ``, where the `directory hash` is the partitioning key. This data layout enables directory listing via a range query with the directory hash as the prefix.
+
+However, this means all the child entries are physically located on one Cassandra node. When the directory has billions of child entries, that Cassandra node will be overloaded.
+
+So for large directories configured in Cassandra, SeaweedFS uses the `` as the partitioning key, so all child entries in that directory are evenly spread across all Cassandra data nodes.
+
+## Redis Implementation
+In Redis, for normal directories, the list of child entries is stored in one key~value entry as ``.
+
+However, as the number of child entries grows, reading and writing this key~value entry becomes slower and slower.
+
+So for large directories configured in Redis, SeaweedFS skips the step of recording the child entries.
 
 ## The Downside