Why would you need a super large directory?
This is actually a common case. Entity IDs such as user names, IDs, IP addresses, URLs, or UUIDs can be used as subdirectory names, and the number of entity IDs can be very large. Under each subdirectory, more unstructured data can be colocated, such as user avatars, uploaded files, access logs, URL text, images, audio, and video.
You could manually translate the entity ID to a file ID with a separate lookup, and use the file ID to access data. This is exactly what SeaweedFS does internally. However, this manual approach not only reinvents the wheel, but also gives up the conveniences of a file system, such as deeper directories.
Assume you are bootstrapping a startup with potentially millions of users, but currently only a few test accounts. You need to spend your time meeting user requirements, not designing data structures and schemas to store customer data for different cases. Instead of optimizing early on, you can start with a folder for each account and move on. SeaweedFS can make this simple approach future-proof.
Why is a super large directory challenging?
If a super large directory has too many files or subfolders, the file listing itself can become a challenge.
For example, for the Cassandra filer store, each entry has this schema:
CREATE TABLE filemeta (
    directory varchar,
    name varchar,
    meta blob,
    PRIMARY KEY (directory, name)
) WITH CLUSTERING ORDER BY (name ASC);
The directory is the partitioning key, so entries with the same directory are partitioned to the same data node. This is fine for most cases. However, if there are billions of direct child entries under one directory, that data node would not perform well.
We need a way to spread the data to all data nodes, without sacrificing too much. In a sense, we want SeaweedFS to be as efficient and scalable as a distributed key value store, while still using the familiar file system operations.
How does it work?
This is currently implemented for Cassandra and Redis. Super large directories sacrifice the directory listing functionality to keep the directory scalable. As the directory entry names are usually user IDs or UUIDs, the list is already stored in some other storage, so listing all child entries can be achieved by other approaches.
Only the direct children of the super large directory cannot be listed. For deeper-level directories, listing still works. For example, if /home/users/ is configured as a super large directory, listing /home/users/ would not work, but listing /home/users/user1 and /home/users/user1/books still works.
/home/users/user1/books/book1.txt
/home/users/user1/books/book2.txt
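
The listing rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not SeaweedFS code; the names SUPER_LARGE_DIRECTORIES, normalize, and can_list are hypothetical.

```python
import posixpath

# Hypothetical configuration, mirroring the /home/users example above.
SUPER_LARGE_DIRECTORIES = ["/home/users"]


def normalize(path: str) -> str:
    """Drop trailing slashes so '/home/users/' equals '/home/users'."""
    return posixpath.normpath(path)


def can_list(directory: str, super_large=SUPER_LARGE_DIRECTORIES) -> bool:
    """Listing is refused only for the super large directory itself.

    Deeper directories, such as /home/users/user1, remain listable.
    """
    target = normalize(directory)
    return all(target != normalize(d) for d in super_large)


for d in ["/home/users/", "/home/users/user1", "/home/users/user1/books"]:
    print(d, "->", "listable" if can_list(d) else "not listable")
```

Only an exact match against a configured path disables listing, which is why the deeper levels keep working.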
Cassandra Implementation
In Cassandra, for normal directories, data has a primary key of <directory hash, name>, where the directory hash is the partitioning key. This data layout enables directory listing via range query with the directory hash as the prefix.
However, this means all the child entries are physically located in one Cassandra node. When the directory has billions of child entries, that Cassandra node will be overloaded.
So for large directories configured in Cassandra, SeaweedFS uses the <full_path> as the partitioning key. All child entries in that directory are then evenly spread out across all Cassandra data nodes.
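
To see why the choice of partitioning key matters, here is a toy simulation. It does not use Cassandra at all: the node count, the MD5-based node_for placement, and the placement helper are assumptions made for illustration, and real Cassandra places partitions with its own partitioner and token ranges.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def node_for(partition_key: str) -> int:
    """Toy placement: hash the partition key and map it to a node index."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def placement(directory: str, names, super_large: bool) -> Counter:
    """Count how many child entries land on each node."""
    counts = Counter()
    for name in names:
        # Normal directory: the partition key is the directory, shared by all children.
        # Super large directory: the partition key is the full path of each child.
        key = f"{directory}/{name}" if super_large else directory
        counts[node_for(key)] += 1
    return counts


names = [f"user{i}" for i in range(10_000)]
normal = placement("/home/users", names, super_large=False)
spread = placement("/home/users", names, super_large=True)
print("normal directory:     ", dict(normal))
print("super large directory:", dict(spread))
```

In the normal layout every entry falls on a single node, while the full-path layout distributes the entries across all nodes. The cost is that no single partition contains the whole directory any more, which is why listing is given up.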
Redis Implementation
In Redis, for normal directories, the list of child entries is stored in one key-value entry as <path, sorted_set_of_child_entry_names>.
However, as the number of child entries grows, reading and writing this key-value entry becomes slower and slower.
So for large directories configured in Redis, SeaweedFS skips this operation, and the list of child entries is not stored.
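
The behavior described above can be modeled with a small in-memory stand-in. This sketch does not talk to Redis and is not the SeaweedFS implementation; the ToyFilerStore class and its method names are hypothetical, and plain dicts and sets take the place of Redis keys and sorted sets.

```python
class ToyFilerStore:
    """In-memory model of a store that skips child lists for super large directories."""

    def __init__(self, super_large_directories=()):
        self.entries = {}   # full path -> metadata
        self.children = {}  # directory -> set of child names
        self.super_large = set(super_large_directories)

    def insert(self, directory, name, meta):
        # The entry itself is always stored under its full path.
        self.entries[f"{directory}/{name}"] = meta
        if directory not in self.super_large:
            # Normal directory: also maintain the per-directory child set.
            self.children.setdefault(directory, set()).add(name)

    def find(self, directory, name):
        # Point lookups never need the child set, so they work everywhere.
        return self.entries.get(f"{directory}/{name}")

    def list_children(self, directory):
        if directory in self.super_large:
            raise NotImplementedError("listing a super large directory is not supported")
        return sorted(self.children.get(directory, ()))


store = ToyFilerStore(["/home/users"])
store.insert("/home/users", "user1", b"dir")
store.insert("/home/users/user1", "avatar.png", b"file")
print(store.find("/home/users", "user1"))
print(store.list_children("/home/users/user1"))
```

Inserts into the super large directory touch only one key, so their cost does not grow with the number of siblings.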
The Downside
The consequences are:
- Directory listing for this folder is not supported.
- Filer metadata import and export for this folder is not supported. You can still do it for specific child folders though.
- Once this is configured, it cannot be changed back easily. You will need to write code to iterate over all sub entries for that.
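
Since the directory itself cannot be listed, iterating over its sub entries has to start from a list of entity IDs kept elsewhere. A minimal sketch, assuming you have such a list and some way to look up a single path; iter_children, known_ids, and lookup are hypothetical names, and a dict stands in for the real lookup here.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def iter_children(
    directory: str,
    known_ids: Iterable[str],
    lookup: Callable[[str], Optional[bytes]],
) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, metadata) for every known ID that exists under `directory`.

    `known_ids` comes from an external source such as a user table.
    `lookup` is a point lookup that returns None for a missing path.
    """
    base = directory.rstrip("/")
    for entity_id in known_ids:
        path = f"{base}/{entity_id}"
        meta = lookup(path)
        if meta is not None:
            yield path, meta


# A dict plays the role of the filer for this illustration.
fake_filer = {"/home/users/user1": b"dir", "/home/users/user3": b"dir"}
found = list(iter_children("/home/users/", ["user1", "user2", "user3"], fake_filer.get))
print(found)
```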
How to configure it?
In filer.toml for Cassandra/Redis, there is an option superLargeDirectories. For example, if you will have a lot of user data under /home/users:
[cassandra]
...
superLargeDirectories = [
"/home/users",
]
This assumes that /home/users is an empty folder.
As you can see, it supports multiple super large directories. However, never change or remove the entries in superLargeDirectories or the data will be lost!
Note that with a path specific filer store, the superLargeDirectories path is relative to the path specific store root. For example, if you wanted to make an entire S3 bucket have its own filer store and be a super large directory, you need to configure it like this:
location = "/buckets/mybucket"
superLargeDirectories = ["/"]
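
To make the relative-path rule concrete, the sketch below computes which absolute paths a path specific configuration refers to. It only illustrates the rule stated above, under the assumption that relative paths are simply joined onto the store root; absolute_super_large_dirs is a hypothetical helper, not part of SeaweedFS.

```python
import posixpath


def absolute_super_large_dirs(location, super_large_directories):
    """Join each store-relative path onto the store root `location`."""
    root = posixpath.normpath(location)
    result = []
    for rel in super_large_directories:
        # Strip the leading slash so join() treats the path as relative to root.
        joined = posixpath.join(root, rel.lstrip("/"))
        result.append(posixpath.normpath(joined))
    return result


print(absolute_super_large_dirs("/buckets/mybucket", ["/"]))
# prints ['/buckets/mybucket']
```

With location set to /buckets/mybucket, the entry "/" therefore refers to the bucket root itself, not to the root of the whole filer.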