Updated Super Large Directories (markdown)

Chris Lu 2020-12-29 13:27:43 -08:00
parent 03d115d783
commit 5a06b3214b

@@ -1,4 +1,10 @@
-# Why this is needed?
+# Why is a super large directory needed?
+This is actually a common case. For example, entity ids such as user names, ids, IP addresses, URLs, or UUIDs can be used as sub directory names, and under each sub directory, related unstructured data can be colocated, such as user photos, URL text and images, or logs.
+If a separate lookup is used to translate the entity id to a file id, and the file id is then used to access the data, all the convenience of a file system is given up.
# Why is a super large directory challenging?
If one super large directory has far too many files or sub directories, even listing the files becomes a challenge.
@@ -13,9 +19,7 @@ For example, for Cassandra filer store, each entry has this schema:
```
The directory is the partitioning key, so entries under the same directory are partitioned to the same data node. This is fine for most cases. However, if there are billions of direct child entries under one directory, that one data node would not perform well.
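
To make the hot-partition problem concrete, here is a minimal CQL sketch, assuming a simplified table shape (the column names are illustrative, not necessarily the exact SeaweedFS schema):

```
-- Illustrative sketch: 'directory' alone is the partition key and 'name'
-- is a clustering column, so all children of a directory share one partition.
CREATE TABLE filemeta (
    directory text,
    name      text,
    meta      blob,
    PRIMARY KEY ((directory), name)
);

-- Listing a directory is a single-partition read on one replica set:
SELECT name FROM filemeta WHERE directory = '/home/alice';
-- With billions of direct children, that one partition becomes a hot spot.
```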
-This is actually a common case when user name, id, or UUID are used as child entries. Usually a separate index is built to translate names to file id, and use file id to access data directory, giving up all the convenience from a file system.
-We need a way to spread the data to all data nodes, without sacrificing too much. In a sense, we want SeaweedFS to be as efficient and scalable as a hybrid of distributed key value store, while still using the familiar file system operations.
+We need a way to spread the data to all data nodes without sacrificing too much. In a sense, we want SeaweedFS to be as efficient and scalable as a distributed key value store, while still using the familiar file system operations.
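
As a sketch of the direction (with assumed names, not necessarily how SeaweedFS actually implements it), the child name can be folded into the partition key so that siblings hash to different data nodes:

```
-- Sketch: partition by a hash of the full path instead of the directory,
-- so each child entry lands on its own partition, spread across the cluster.
CREATE TABLE filemeta_spread (
    path_hash bigint,   -- e.g. a hash of directory + '/' + name, computed client-side
    directory text,
    name      text,
    meta      blob,
    PRIMARY KEY ((path_hash), directory, name)
);

-- Point lookups by full path still work: compute path_hash and query it.
-- The trade-off: listing one directory is no longer a single-partition read.
```

That loss of cheap directory listing is the kind of sacrifice that has to be weighed, which the next section addresses.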
# How does it work?