Updated Super Large Directories (markdown)

Chris Lu 2020-12-28 17:27:40 -08:00
parent a955e4cf54
commit 6d8fa37e4c

@ -19,9 +19,28 @@ We need a way to spread the data to all data nodes, without sacrificing too much
# How it works?
This is currently implemented in Cassandra and Redis. Super large directories sacrifice directory listing to keep the directory scalable. Because the entry names in such a directory are usually user IDs or UUIDs, the list of names is usually already kept in some other storage, so listing all child entries can be achieved by other means.
For super large directories, data is partitioned by `<directory hash, name>`. This ensures the directory children are evenly distributed to all data nodes.
Only the direct children of a super large directory cannot be listed. Listing still works for deeper-level directories. For example, if `/home/users/` is configured as a super large directory, listing `/home/users/` would not work, but listing `/home/users/user1` and `/home/users/user1/books` still works.
```
/home/users/user1/books/book1.txt
/home/users/user1/books/book2.txt
```
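The listing rule above can be sketched as a small check. This is only an illustration: `SUPER_LARGE_DIRS` and `can_list` are hypothetical names, not part of SeaweedFS, and the real configuration mechanism may differ.

```python
# Hypothetical configuration: directories marked as super large.
SUPER_LARGE_DIRS = {"/home/users"}


def can_list(directory: str) -> bool:
    """Return True if listing `directory` is expected to work.

    Only the super large directory itself loses listing;
    its descendants are ordinary directories.
    """
    normalized = directory.rstrip("/") or "/"
    return normalized not in SUPER_LARGE_DIRS


print(can_list("/home/users/"))             # False
print(can_list("/home/users/user1"))        # True
print(can_list("/home/users/user1/books"))  # True
```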
## Cassandra Implementation
In Cassandra, for normal directories, data has a primary key of `<directory hash, name>`, where the `directory hash` is the partitioning key. This data layout enables directory listing via a range query with the directory hash as the prefix.
However, this means all the child entries are physically located in one Cassandra node. When the directory has billions of child entries, that Cassandra node will be overloaded.
So for super large directories configured in Cassandra, SeaweedFS uses the `<full_path>` as the partitioning key, and all child entries in that directory are spread evenly across all Cassandra data nodes.
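To make the effect of the partitioning key concrete, here is a minimal simulation. It is a sketch under stated assumptions: `NODE_COUNT`, `node_for`, `partition_key`, and `placement` are hypothetical helpers, and an MD5 hash taken modulo the node count stands in for Cassandra's real partitioner and token ring.

```python
import hashlib
from collections import Counter

NODE_COUNT = 4  # hypothetical cluster size


def node_for(partition_key: str) -> int:
    """Map a partition key to a node (simplified stand-in for a token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODE_COUNT


def partition_key(directory: str, name: str, super_large: bool) -> str:
    """Normal directories partition by directory; super large ones by full path."""
    if super_large:
        return f"{directory}/{name}"
    return directory


def placement(directory: str, names, super_large: bool) -> Counter:
    """Count how many child entries land on each node."""
    return Counter(
        node_for(partition_key(directory, n, super_large)) for n in names
    )


names = [f"user{i}" for i in range(10_000)]
normal = placement("/home/users", names, super_large=False)
spread = placement("/home/users", names, super_large=True)

print(len(normal))  # number of nodes holding entries for a normal directory
print(len(spread))  # number of nodes holding entries for a super large directory
```

With the directory as the partitioning key, every child maps to the same node. With the full path as the key, children scatter across nodes, which is also why a single range query can no longer enumerate them.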
## Redis Implementation
In Redis, for normal directories, the list of child entries is stored in one key-value entry as `<path, sorted_set_of_child_entry_names>`.
However, as the number of child entries grows, reading and writing this key-value entry becomes slower and slower.
So for super large directories configured in Redis, SeaweedFS skips the operation that records the child entries.
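The Redis behavior can be modeled with an in-memory stand-in. This is a sketch, not the SeaweedFS Redis store: `TinyFilerStore` and its methods are hypothetical, plain dictionaries replace Redis, a sorted Python list replaces the sorted set, and storing each entry under its full path is an assumption made for the illustration.

```python
import bisect


class TinyFilerStore:
    """In-memory model of per-directory child sets (hypothetical)."""

    def __init__(self, super_large_dirs=()):
        self.entries = {}   # assumed layout: full path -> entry metadata
        self.children = {}  # directory -> sorted list of child names
        self.super_large_dirs = set(super_large_dirs)

    def insert(self, directory, name, meta):
        self.entries[f"{directory}/{name}"] = meta
        if directory in self.super_large_dirs:
            return  # skip recording the child name for super large directories
        names = self.children.setdefault(directory, [])
        i = bisect.bisect_left(names, name)
        if i == len(names) or names[i] != name:
            names.insert(i, name)  # keep the list sorted and duplicate-free

    def find(self, directory, name):
        """Point lookups work regardless of directory size."""
        return self.entries.get(f"{directory}/{name}")

    def list_children(self, directory):
        if directory in self.super_large_dirs:
            raise NotImplementedError("listing is disabled for super large directories")
        return list(self.children.get(directory, []))


store = TinyFilerStore(super_large_dirs=["/home/users"])
store.insert("/home/users", "user1", {"is_dir": True})
store.insert("/home/users/user1", "books", {"is_dir": True})

print(store.find("/home/users", "user1"))        # {'is_dir': True}
print(store.list_children("/home/users/user1"))  # ['books']
```

Lookups by exact path still succeed inside the super large directory, while the per-directory child set, the structure that grows without bound, is never written for it.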
## The Downside