From ce08c9ef61b4a0c79579f1bd25deb125952017eb Mon Sep 17 00:00:00 2001
From: Chris Lu
Date: Tue, 22 Dec 2020 02:53:32 -0800
Subject: [PATCH] Updated Super Large Directories (markdown)

---
 Super-Large-Directories.md | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/Super-Large-Directories.md b/Super-Large-Directories.md
index d902578..97bfe7f 100644
--- a/Super-Large-Directories.md
+++ b/Super-Large-Directories.md
@@ -2,17 +2,7 @@
 
 If one super large directory has way too many files or sub folders, the file listing itself can be a challenge.
 
-For example, for Cassandra filer store, all the children in a directory are physically stored in one Cassandra data node. This is fine for most cases. However, if there are billions of child entries for one directory, the data node would not be able to query or even store the child list.
-
-This is actually a common case when user name, id, or UUID are used as sub folder or file names. Usually a separate index is built to translate names to file id, and use file id to access data directory, giving up all the convenience from a file system.
-
-We need a way to spread the data to all data nodes, without sacrificing too much.
-
-In a sense, we want SeaweedFS to be as efficient and scalable as a hybrid of distributed key value store, while still using the familiar file system operations.
-
-# How it works?
-
-This is currently implemented in Cassandra and Redis. In Cassandra, each entry has this schema:
+For example, for Cassandra filer store, each entry has this schema:
 ```
  CREATE TABLE filemeta (
     directory varchar,
@@ -21,9 +11,17 @@ This is currently implemented in Cassandra and Redis. In Cassandra, each entry h
     PRIMARY KEY (directory, name)
  ) WITH CLUSTERING ORDER BY (name ASC);
 ```
-The directory is the clustering key. So the entries with the same directory is clustered to the same data node.
+The directory is the partitioning key. So the entries with the same directory is partitioned to the same data node. This is fine for most cases. However, if there are billions of child entries for one directory, the data node would not perform well.
+
+This is actually a common case when user name, id, or UUID are used as child entries. Usually a separate index is built to translate names to file id, and use file id to access data directory, giving up all the convenience from a file system.
+
+We need a way to spread the data to all data nodes, without sacrificing too much. In a sense, we want SeaweedFS to be as efficient and scalable as a hybrid of distributed key value store, while still using the familiar file system operations.
+
+# How it works?
+
+This is currently implemented in Cassandra and Redis.
 
-For super large directories, the directory is hashed and combined together with name as ``, which is used as the partitioning key. This ensures the directory children are evenly distributed to all data nodes.
+For super large directories, data is partitioned by ``. This ensures the directory children are evenly distributed to all data nodes.
 
 ## The Downside