SSRC publication: Organizing, Indexing, and Searching Large-Scale File Systems

Published as Storage Systems Research Center Technical Report UCSC-SSRC-09-09.

Abstract

The world is moving towards a digital infrastructure. This move is driving the demand for data storage and has already resulted in file systems that contain petabytes of data and billions of files. In the near future, file systems will be storing exabytes of data and trillions of files. This data growth has introduced the key question of how to effectively find and manage data in this growing sea of information. Unfortunately, file organization and retrieval methods have not kept pace with data volumes. Large-scale file systems continue to rely on hierarchical namespaces that make finding and managing files difficult. As a result, there has been an increasing demand for search-based file access. A number of commercial file search solutions have become popular on desktop and small-scale enterprise systems. However, providing effective search and indexing at the scale of billions of files is not a simple task. Current solutions rely on general-purpose index designs, such as relational databases, to provide search. General-purpose indexes can be ill-suited for file system search and can limit performance and scalability. Additionally, current search solutions are designed as applications that are separate from the file system. Providing search through a separate application requires file attributes and modifications to be replicated into separate index structures, which presents consistency and efficiency problems at large scales. This thesis addresses these problems through novel approaches to organizing, indexing, and searching files in large-scale file systems. We conduct an analysis of large-scale file system properties using workload and snapshot traces to better understand the kinds of data being stored and how it is used. This analysis represents the first major workload study since 2001 and the first major study of enterprise file system contents and workloads in over a decade.
Our analysis shows that a number of important workload properties have changed since previous studies (e.g., read-to-write byte ratios have decreased to 2:1 from 4:1 or higher in past studies) and examines properties that are relevant to file organization and search. Other important observations include highly skewed workload distributions and clustering of metadata attribute values in the namespace. We hypothesize that file search performance and scalability can be improved with file-system-specific index solutions. We present the design of new file metadata and file content indexing approaches that exploit key file system properties from our study. These designs introduce novel file-system-optimized index partitioning, query execution, and versioning techniques. We show that search performance can be improved by one to four orders of magnitude compared to traditional approaches. Additionally, we hypothesize that directly integrating search into the file system can address the consistency and efficiency problems with separate search applications. We present new metadata and semantic file system designs that introduce novel disk layout, indexing, and updating methods to enable effective search without degrading normal file system performance. We then discuss ongoing challenges and how this work may be extended in the future.

Publication date:
December 2009

Authors:
Andrew Leung

Projects:
Scalable File System Indexing
HECURA: Scalable Data Management
Ultra-Large Scale Storage

Available media

Full paper text: PDF

Bibtex entry

@techreport{leung-ssrctr09-09,
  author       = {Andrew Leung},
  title        = {Organizing, Indexing, and Searching Large-Scale File Systems},
  institution  = {University of California, Santa Cruz},
  number       = {UCSC-SSRC-09-09},
  month        = dec,
  year         = {2009},
}

Last modified 28 May 2019