As Linux clusters have matured as platforms for low-cost, high-performance parallel computing, software packages to provide many key services have emerged, especially in areas such as message passing and networking. One area devoid of support, however, has been parallel file systems, which are critical for high-performance I/O on such clusters. We have developed a parallel file system for Linux clusters, called the Parallel Virtual File System (PVFS). PVFS is intended both as a high-performance parallel file system that anyone can download and use and as a tool for pursuing further research in parallel I/O and parallel file systems for Linux clusters.
In this paper, we describe the design and implementation of PVFS and present performance results on the Chiba City cluster at Argonne. We provide performance results for a workload of concurrent reads and writes for various numbers of compute nodes, I/O nodes, and I/O request sizes. We also present performance results for MPI-IO on PVFS, both for a concurrent read/write workload and for the BTIO benchmark. We compare the I/O performance when using a Myrinet network versus a Fast Ethernet network for I/O-related communication in PVFS. We obtained read and write bandwidths as high as 700 Mbytes/sec with Myrinet and 225 Mbytes/sec with Fast Ethernet.
Related work in parallel and distributed file systems can be divided roughly into three groups:
Commercial parallel file systems
Distributed file systems
Research parallel file systems.
The first group comprises commercial parallel file systems such as PFS for the Intel Paragon, PIOFS and GPFS for the IBM SP, HFS for the HP Exemplar, and XFS for the SGI Origin2000. These file systems provide the high performance and functionality desired for I/O-intensive applications but are available only on the specific platforms on which the vendor has implemented them. (SGI, however, has recently released XFS for Linux. SGI is also developing a version of XFS for clusters, called CXFS, but, to our knowledge, CXFS is not yet available for Linux clusters.)
The second group comprises distributed file systems such as NFS, AFS/Coda, InterMezzo, xFS, and GFS. These file systems are designed to provide distributed access to files from multiple client machines, and their consistency semantics and caching behavior are designed accordingly for such access. The types of workloads resulting from large parallel scientific applications usually do not mesh well with file systems designed for distributed access; in particular, distributed file systems are not designed for the high-bandwidth concurrent writes that parallel applications typically require.
The third group comprises a number of research projects in the areas of parallel I/O and parallel file systems, such as PIOUS, PPFS, and Galley. PIOUS focuses on viewing I/O from the viewpoint of transactions, PPFS research focuses on adaptive caching and prefetching, and Galley looks at disk-access optimization and alternative file organizations. These file systems may be freely available but are mostly research prototypes, not intended for everyday use by others.
PVFS Design and Implementation
As a parallel file system, the primary goal of PVFS is to provide high-speed access to file data for parallel applications. In addition, PVFS provides a cluster-wide consistent name space, enables user-controlled striping of data across disks on different I/O nodes, and allows existing binaries to operate on PVFS files without the need for recompiling. Like many other file systems, PVFS is designed as a client-server system with multiple servers, called I/O daemons. I/O daemons typically run on separate nodes in the cluster, called I/O nodes, which have disks attached to them. Each PVFS file is striped across the disks on the I/O nodes.
Application processes interact with PVFS via a client library. PVFS also has a manager daemon that handles only metadata operations such as permission checking for file creation, open, close, and remove operations. The manager does not participate in read/write operations; the client library and the I/O daemons handle all file I/O without the intervention of the manager. The clients, I/O daemons, and the manager need not be run on different machines, although doing so may result in higher performance. PVFS is primarily a user-level implementation; no kernel modifications or modules are necessary to install or operate the file system.
PVFS Manager and Metadata
A single manager daemon is responsible for the storage of and access to all the metadata in the PVFS file system. Metadata, in the context of a file system, refers to information describing the characteristics of a file, such as permissions, the owner and group, and, more important, the physical distribution of the file data. In the case of a parallel file system, the distribution information must include both file locations on disk and disk locations in the cluster. Unlike a traditional file system, where metadata and file data are all stored on the raw blocks of a single device, parallel file systems must distribute this data among many physical devices. In PVFS, for simplicity, we chose to store both file data and metadata in files on existing local file systems rather than directly on raw devices.
PVFS files are striped across a set of I/O nodes in order to facilitate parallel access. The specifics of a given file distribution are described with three metadata parameters: base I/O node number, number of I/O nodes, and stripe size. These parameters, together with an ordering of the I/O nodes for the file system, allow the file distribution to be completely specified. An example of some of the metadata fields for a file /pvfs/foo is given in Table 1. The pcount field specifies that the data is spread across three I/O nodes, base specifies that the first (or base) I/O node is node 2, and ssize specifies that the stripe size, the unit by which the file is divided among the I/O nodes, is 64 Kbytes. The user can set these parameters when the file is created, or PVFS will use a default set of values.
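To make the role of the three distribution parameters concrete, the following minimal Python sketch maps a logical file offset to an I/O node and an offset within that node's local file. It is an illustration only, not the PVFS implementation: it assumes stripe units are placed round-robin starting at the base node, that node numbers wrap around the file system's ordered list of I/O nodes, and that 64 Kbytes means 65,536 bytes. The names StripeMeta, locate, and total_io_nodes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripeMeta:
    """Distribution parameters named in the text (hypothetical container)."""
    base: int    # number of the first (base) I/O node for the file
    pcount: int  # number of I/O nodes the file is spread across
    ssize: int   # stripe size in bytes


def locate(meta: StripeMeta, offset: int, total_io_nodes: int):
    """Return (I/O node, local offset) for a logical file offset.

    Assumption for this sketch: stripe units are assigned round-robin
    to the file's pcount nodes, beginning at node `base` and wrapping
    around the ordered list of `total_io_nodes` nodes.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if not 0 < meta.pcount <= total_io_nodes:
        raise ValueError("pcount must be between 1 and total_io_nodes")

    stripe_index, within = divmod(offset, meta.ssize)
    slot = stripe_index % meta.pcount         # which of the file's nodes
    full_rounds = stripe_index // meta.pcount  # stripe units already on that node
    node = (meta.base + slot) % total_io_nodes
    local_offset = full_rounds * meta.ssize + within
    return node, local_offset


# Parameters from the /pvfs/foo example: base 2, pcount 3, ssize 64 Kbytes.
foo = StripeMeta(base=2, pcount=3, ssize=64 * 1024)
```

Under these assumptions, and with a hypothetical file system of eight I/O nodes, successive 64-Kbyte stripe units of /pvfs/foo would fall on nodes 2, 3, and 4 before returning to node 2, so a large contiguous request can be served by three nodes at once.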