NIBFS: The Non-Indexed Blob File System
A High-Efficiency, Capability-Based Storage System for Archival Websites

Author: Jeremy T. Hilliker

Abstract

Image archival is a popular and high-profile Web 2.0 service. We have examined the problem domain of internet archival websites, particularly image hosting sites, to discover the usage and characteristics unique to their problem domain. We have used this knowledge to determine how these services' needs differ from those offered by traditional POSIX filesystems, and have constructed a tailored filesystem to better meet those needs. In particular, our system reduces the disk seeks caused by unnecessary meta-data and by the domain's long-tailed access distribution. Our filesystem offers a 40 to 55% improvement in throughput over standard filesystems, which is significant since these services are I/O bound. Increasing I/O efficiency allows these services to serve more content with fewer resources, and to scale better.

Introduction

Web 2.0 websites are built to allow users to collaborate with each other, share information, and create content [5]. One application of Web 2.0 is online photo albums and image hosting. These websites allow users to post images to their accounts to be shared with their peers or their audience. This feature has been adopted by many social networking websites, which allow users to post images to be viewed by members of their social circle. Different sites place more or less emphasis on social networking versus image hosting. Imageshack is strictly an image hosting site with no social component. Flickr is primarily an image archival site with a small social component (tagging, comments, friends, and groups). 4chan is an image message board with equal emphasis on image sharing and social interaction. Facebook is a social networking site with an image sharing component. Other sites such as Photobucket, Picasa, Blogger, Google Video, and YouTube provide differing mixes of social networking, content sharing, and archival.
Facebook is currently the largest social networking site on the internet, and the 5th most popular website overall [6]. Though the primary function of Facebook does not appear to be image archival, it is nevertheless the largest image archival site on the internet, with over 6.5 billion images (with 4 or 5 sizes each) occupying 540 terabytes of storage [2]. 475,000 images are served per second (including 200,000 profile images per second), and 100 million images are uploaded per week. Facebook currently serves 99.8% of its profile requests and 92% of its photo requests through content distribution networks (CDNs) such as Akamai and Limelight [2]. This results in approximately 452,600 images served through CDNs per second, and 22,400 images served per second by their own servers. The cost of using CDNs to this extent is prohibitive, but Facebook has had to use them due to its inability to scale its own servers and services fast enough; it would like to reduce its reliance on CDNs to lower costs [2]. Facebook has 10,000 servers [3] and has had to borrow $100 million to purchase more [4]. Facebook's scale of image hosting (like that of the other image hosts) makes it an interesting research case for discovering whether novel approaches to archival storage in this usage model can offer significant improvement, in either space or throughput, over existing approaches.

Problem

Facebook began with the naive approach of storing their image files in a traditional filesystem served over NFS by clusters of NetApp servers. These servers became heavily I/O bound, with as many as 15 disk reads required to serve each request [2]. The high number of reads was due to the meta-data (data about data) used by traditional file systems. Every time a file is opened, the file's name and path must be resolved to an i-node, which acts as a bookmark for the file. Each directory (a component of the path) has an i-node, and that i-node maps to the directory's contents on disk.
When the final component of the path is resolved, the system gets an i-node which points to the file to be read. Only then can the filesystem begin to read the contents of the file. Each of these i-nodes is a piece of data that describes how to find the file on disk. In a deep directory structure, a large number of directory entries and their corresponding i-nodes must be read before the system can reach the requested file.

Facebook's first optimization was to develop a system to cache the mapping of filenames to their final i-node, and to modify the Linux kernel to allow files to be opened directly with a reference to that i-node. This reduced the number of required disk reads to approximately 3 per file [2]. They could perform this optimization because Facebook does not allow files to be moved, renamed, or deleted within their storage system, so names always resolve to the same i-node.

File meta-data can be cached in other ways, but most of it is unneeded for Facebook's application, so caching it wastes cache space which could be used for actual file data. Facebook, and most other image archival sites, are driven by databases which contain nearly all of the application's required meta-data. Having the filesystem duplicate this meta-data, or having it store unneeded meta-data, wastes resources. Recognizing that optimizations can be made by eliminating unneeded meta-data and by dropping support for unneeded filesystem functions can lead to a more efficient storage system for archival websites. The remainder of this paper explores the usage models of archival websites (particularly image archives) and builds an efficient storage system for those models. Finally, the performance and characteristics of that system are compared with traditional filesystems and with Facebook's solution to the same problem.
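The cost of path resolution described above, and the effect of caching the name-to-i-node mapping, can be illustrated with a toy model. This is only a sketch under simplifying assumptions (one read per directory i-node, one per directory's contents, one for the file's i-node and first data block); the function names and the example path are ours, not part of Facebook's system.

```python
# Toy model of path-resolution cost in a traditional filesystem.
# Assumption: each path component costs two reads (the directory's
# i-node, then the directory's contents), plus one final read for the
# file's i-node and first data block.

def reads_without_cache(path: str) -> int:
    components = [c for c in path.split("/") if c]
    return 2 * len(components) + 1

def reads_with_inode_cache(path: str, cache: dict) -> int:
    """Model of caching name -> i-node and opening by i-node directly.

    Safe only because files are never moved, renamed, or deleted,
    so a name always resolves to the same i-node.
    """
    if path in cache:
        return 1  # open directly by cached i-node, read the data
    cache[path] = object()  # stand-in for the resolved i-node number
    return reads_without_cache(path)

cache = {}
p = "/vol07/users/1234/album9/img_0042.jpg"   # hypothetical deep path
print(reads_without_cache(p))                 # 11 reads, cold
print(reads_with_inode_cache(p, cache))       # 11: first access fills cache
print(reads_with_inode_cache(p, cache))       # 1: subsequent opens
```

The model exaggerates neither direction: real caches (dentry/i-node caches) absorb some of these reads, which is why Facebook observed roughly 15 reads falling to roughly 3, not to 1.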
Analysis of Problem Domain

Image archival sites (and the image portions of social networking sites) have some unique characteristics which differentiate them from regular filesystem usage models.

Access Pattern

On social sites such as Facebook, image access patterns follow a known distribution. Profile pictures (photos identifying a user, shown in user listings and as the main photo on a user's profile) are accessed the most frequently, followed by the most recently uploaded photos, followed by a very long tail [2]. This access pattern presumably translates to other archival sites: identifying images are accessed most frequently, as they are embedded in the greatest number of pages; next comes the most recent content, as users distribute it and view what is new; then comes the long tail of the website's archived content. Even on image boards such as 4chan, we can imagine a similar scenario. Images which start a thread of conversation will be accessed the most (as they are shown on the forum summary pages), the most recent images (which are on the first page of the forum) will be accessed next most often, followed by the long tail of old images in old conversation threads.

Filesystem Usage

The largest difference between archival storage systems and regular file systems is how the files are used and what is done to them. These systems are primarily driven by read operations: a file is submitted once and retrieved many times. This behavior makes an image archive resemble a write-once-read-many (WORM) storage system. When files are added, they are submitted as complete blocks (not streams), and they are not subsequently modified or appended. The systems are intended to be archives, so file deletion is rare. Related files are often submitted in batches (through a web form with multiple file upload boxes), and related files are often retrieved close together (photos from the same album).
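The write-once-read-many usage model described above can be captured by a deliberately narrow storage interface. The following is a hypothetical sketch, not the NIBFS API: the class and method names are ours, and content-hash object identifiers are one possible choice, used here only to make the write-once property concrete.

```python
import hashlib

class BlobStore:
    """Minimal WORM blob-store sketch: blobs are submitted whole,
    never modified, renamed, or appended, and rarely deleted."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store a complete blob once; return its object identifier (OID)."""
        oid = hashlib.sha1(data).hexdigest()
        # Write-once: re-submitting identical data is a no-op.
        self._blobs.setdefault(oid, data)
        return oid

    def get(self, oid: str) -> bytes:
        """Reads dominate: the same blob is retrieved many times."""
        return self._blobs[oid]

    # Intentionally absent: rename(), append(), truncate(), chmod().
    # The archival usage model never needs them, so a tailored system
    # can drop the meta-data that supports them.

store = BlobStore()
oid = store.put(b"...jpeg bytes...")
assert store.get(oid) == b"...jpeg bytes..."
```

The point of the sketch is the omissions: every dropped operation is meta-data the filesystem no longer has to read before serving a request.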
Filesystem Access

The files in these systems are always accessed by the user through a uniform resource locator (URL). These URLs have differing structures between systems, but they share some common characteristics. Example URLs are available at [1]. Analysis of these URLs reveals that they are most often not user-friendly. The URLs often contain some kind of volume identifier, an object identifier (OID), some kind of hash (presumably for replication and load-balancing), and a size indicator. The path is almost always completely meaningless to the user; the filename component is sometimes meaningful, but often not. The following table presents a summary of the characteristics of archival site URLs.

[Table: URL characteristics of Facebook, Flickr, Photobucket, and Picasa]
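To make the common URL anatomy concrete, the following sketch parses a URL of the general shape just described: a volume identifier, a hash, an OID, and a size indicator. The hostname, path layout, and pattern are hypothetical; the real sites each use their own layout, so this regex is illustrative only.

```python
import re

# Hypothetical archival-image URL pattern: volume id, hash, OID, size code.
URL_RE = re.compile(
    r"https?://[^/]+"            # host (CDN or origin server)
    r"/v(?P<volume>\d+)"         # volume identifier
    r"/(?P<hash>[0-9a-f]+)"      # hash, e.g. for replication/load-balancing
    r"/(?P<oid>\d+)_(?P<size>[a-z])\.jpg"  # object id and size indicator
)

m = URL_RE.match("http://photos.example.com/v31/8f3a9c/7141090_n.jpg")
print(m.groupdict())
# {'volume': '31', 'hash': '8f3a9c', 'oid': '7141090', 'size': 'n'}
```

Every field a web application needs to locate the image is thus carried in the URL itself, which is why the path can be meaningless to the user: it addresses storage, not content.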