The Data Capacitor II and DC-WAN2 high-speed file systems at Indiana University

Note:
To request project space on the Data Capacitor II, submit the Data Capacitor II project allocation request form.

System overview

The High Performance File Systems (HPFS) unit of UITS Research Technologies operates two separate high-speed file systems for temporary storage of research data. Both use the open source Lustre parallel distributed file system running on a version of the Linux operating system:

  • Data Capacitor II: Data Capacitor II (DC2) is a large-capacity, high-throughput, high-bandwidth Lustre-based file system serving all IU campuses. It is mounted on the UITS research computing systems.
  • Data Capacitor Wide Area Network 2: The Data Capacitor Wide Area Network 2 (DC-WAN2) is a large, high-speed data storage facility serving all IU campuses and several research centers throughout the nation. The DC-WAN2 file system lets researchers access remote data as if they were stored locally and share large amounts of data with researchers at multiple remote sites.

Usage policies

  • Scratch directories: DC2 scratch directories are created automatically for all users with accounts on IU's research computing systems. If you have an account on an IU research computing system, your DC2 scratch directory is mounted at (replace username with your IU username):
      /N/dc2/scratch/username
    

    DC2 scratch space is not allocated, and its total capacity fluctuates based on project space requirements.

    DC2 scratch space is not intended for permanent storage, and data are not backed up. Files in scratch space may be purged if they have not been accessed for more than 60 days. To archive scratch space data, move files to the Scholarly Data Archive (SDA); see Access the SDA at IU.

  • Project space: Project directories on the DC2 and DC-WAN2 file systems are reserved for research projects with atypical requirements that cannot be met by other systems.

    The UITS HPFS team's allocation committee evaluates requests for project space on a case-by-case basis.

    Projects receive a default quota of 10 TB; project owners can request quota increases if additional space is needed.

    Project space is not intended for permanent storage, and data are not backed up. Files in project space may be purged if they have not been accessed for more than 180 days. To archive project space data, move files to the Scholarly Data Archive (SDA).

    File system space not allocated to projects will be available as scratch space and will vary depending on file system usage.

    To request project space, submit the Data Capacitor II project allocation request form.
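Because both scratch and project files are purged based on access time, it can be useful to check when a file was last read. A minimal sketch using stat (the file name here is a stand-in; on DC2 you would point stat at a file in your scratch or project directory, and access times may be updated lazily depending on mount options):

```shell
# Stand-in file; on DC2 this would be e.g. /N/dc2/scratch/username/myfile
touch demo_file

# %x prints the last access time, %n the file name
stat -c 'Last access: %x  (%n)' demo_file
```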

Note:
The DC2 and DC-WAN2 file systems are not designed for storing a large number of small files. If you need to store a large number of small files, use a compression utility (e.g., tar or gzip) to bundle them into a small number of large files. Failure to do so can negatively impact performance of these file systems and strain their file-count (inode) capacities.
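As a sketch of the bundling approach the note describes, tar can combine a directory of many small files into a single compressed archive (the directory and file names below are illustrative):

```shell
# Create a directory of small files for demonstration
mkdir -p small_files
for i in 1 2 3; do echo "data $i" > "small_files/part$i.txt"; done

# Bundle and compress them into one archive (one large file, one inode)
tar -czf small_files.tar.gz small_files

# Verify the archive contents before removing the originals
tar -tzf small_files.tar.gz
```

To recover the original files later, extract the archive with tar -xzf small_files.tar.gz.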

System information

Maintenance

The scheduled monthly maintenance window for IU's high-performance computing systems is the second Sunday of each month, 7am-7pm.

Data Capacitor II (DC2)

System configuration
  Machine type:        Data storage
  Operating system:    CentOS Linux v6.8, kernel 2.6.32
  Processor cores:     4-6
  CPUs:                2 per node
  Nodes:               26
  RAM:                 48-96 GB DDR2
  Network:             56-Gb FDR InfiniBand
  Storage:             Connected via 56-Gb FDR InfiniBand to DataDirect Networks SFA12K storage controllers
Storage information
  File system:         Lustre 2.5
  Total disk space:    5 PB
  Total scratch space: Varies based on system usage
  Aggregate I/O:       40 GBps
  Availability scope:  All IU campuses
  Quotas:              10 TB default; more upon request

Data Capacitor Wide Area Network 2 (DC-WAN2)

System configuration
  Machine type:        Data storage
  Operating system:    CentOS Linux v6.8, kernel 2.6.32
  Processor cores:     10
  CPUs:                2 per node
  Nodes:               6
  RAM:                 256 GB DDR4
  Network:             10/40-Gb Ethernet
  Storage:             Connected via 40-Gb QDR InfiniBand
Storage information
  File system:         Lustre 2.8.0
  Total disk space:    651 TB
  Aggregate I/O:       40 Gbps to storage servers; 10 Gbps to metadata servers (1 active / 1 backup)
  Availability scope:  All IU campuses and other US sites

System access

  • Data Capacitor II: The DC2 file system is mounted on Big Red II, Karst, and Carbonate as /N/dc2/, and behaves like any other disk device on those machines. If you have an account on Big Red II, Karst, or Carbonate, you can access your DC2 scratch directory at /N/dc2/scratch/username (replace username with your IU username).
  • DC-WAN2: The DC-WAN2 file system provides project space for IU researchers and can be mounted on Big Red II, Karst, Carbonate, and other remote systems. If you have project space on the DC-WAN2 file system, you can access it at /N/dcwan/projects/project_name (replace project_name with the personal or group username associated with your DC-WAN2 project space).

List files

You should use ls -l only when necessary, or only on directories containing small amounts of data. Running ls -l in a DC2 or DC-WAN2 directory to list its contents and associated metadata (e.g., ownership, permissions, and file size information for each file) can cause performance issues for you and other users, particularly if the directory contains a large amount of data.

Due to its parallel architecture, Lustre performs file and metadata operations separately. When you run ls -l in a DC2 or DC-WAN2 directory, the system contacts Lustre's Metadata Server (MDS) to get your data's location, ownership, and permissions information. However, to retrieve file size information, the system must contact multiple Object Storage Servers (OSSs), which in turn must contact multiple Object Storage Targets (OSTs) that store the data objects that make up your files. When the load on one or more OSS nodes is high, your ls -l command may hang; other users on the file system may experience latency issues, as well.

Furthermore, some IU systems have ls (without any options) aliased to ls --color=tty, which enables the use of colors to distinguish file types. With the alias, running ls initiates a full lookup to determine the color associated with each file, which (as with ls -l) requires communication with the OSSs and the OSTs. Without the alias, running ls contacts the MDS only (i.e., it does not initiate a full lookup involving the OSSs and OSTs). To avoid potential performance issues, you can override the ls --color=tty alias, preventing ls from initiating a full lookup. To do so, add the following line to your shell profile:

 unalias ls
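To see whether ls is currently aliased in your session before overriding it, you can inspect it with type (output varies by shell and system):

```shell
type ls                          # reports an alias such as `ls --color=tty' if one is set
unalias ls 2>/dev/null || true   # removes the alias for the current session only;
                                 # the shell-profile line above makes the change permanent
```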

Using ls to list information about individual files creates much less overhead on the file system:

  • To check for the existence of a file (e.g., my_file), use:
     ls my_file
  • To see all details for a specific file (e.g., my_file), use:
     ls -l my_file

Sort files by age

Data Capacitor II and DC-WAN2 are intended for temporary storage of research computing data. Files in scratch directories may be purged if they have not been accessed for more than 60 days. Files in project directories may be purged if they have not been accessed for more than 180 days.

To determine which files located in or below the present working directory are the oldest (and at risk of being purged), you can list them by age (oldest to newest) using the find command; for example:

 find . -type f -exec ls -lhtr '{}' +

In the command above:

  • The dot (.) directs find to search the present working directory and its subdirectories.
  • The -type f test limits the find search to regular files.
  • The -exec ls -lhtr '{}' + action makes find run the ls command on its search results, treating the intervening arguments as options to that command.
  • The + directive ends the -exec action and builds a file list from the find search results, substituting as many file names as possible for the {} string in a single ls invocation (unlike the \; terminator, which runs ls once per file).
  • The ls command parses the file list and (given the options provided) displays the results in long format (-l), with human-readable file sizes (-h), sorted by modification time (-t), in reverse order (-r) so the oldest files appear first.
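To narrow the listing to files actually at risk of being purged, find's -atime test can filter by access time; the 60-day threshold below matches the scratch purge policy, and the demo files are created here only for illustration:

```shell
mkdir -p demo
touch demo/recent.dat

# Backdate the access time of one file to simulate stale data
touch -a -t 201801010000 demo/old.dat

# List only regular files not accessed in the last 60 days
find demo -type f -atime +60
```

Only demo/old.dat appears in the output, because recent.dat was accessed moments ago.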

To perform the same operation on a directory that's not the present working directory, use the same command and options, but replace the dot (.) with the full path to the directory in question; for example:

  • For a directory in your scratch space (replace <username> with your IU username and some_other_dir with the directory you want to sort):
     find /N/dc2/scratch/<username>/some_other_dir -type f -exec ls -lhtr '{}' +
  • For a directory in your project space (replace <project_name> with your project's name and some_other_dir with the directory you want to sort):
     find /N/dc2/projects/<project_name>/some_other_dir -type f -exec ls -lhtr '{}' +

Transfer files

The Data Capacitor II and DC-WAN2 file systems are parallel high-performance file systems. Files are not "transferred" to these file systems; instead, the DC2 and DC-WAN2 file systems are mounted on computational resources, making them accessible from those resources as directory paths (e.g., /N/dc2/scratch/username). To read or write a file on the DC2 or DC-WAN2 file system, use the same standard Linux commands you use for reading and writing files in your computational system's local directories.
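For example, writing results into scratch space uses an ordinary cp; the path below is a local stand-in for /N/dc2/scratch/username, which exists only on IU systems:

```shell
SCRATCH=./dc2_scratch_demo        # stand-in for /N/dc2/scratch/username
mkdir -p "$SCRATCH"

echo "run output" > results.txt
cp results.txt "$SCRATCH/"        # a plain copy; no transfer tool is involved
ls "$SCRATCH"
```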

Specify DC2 or DC-WAN2 as a requirement for your batch job

For instructions, see For a batch job on an IU research computing system, how do I specify the required parallel file system?

Work with protected health information

Your responsibilities

Important:
Storing data containing protected health information (PHI) regulated by the Health Insurance Portability and Accountability Act of 1996 (HIPAA) on the DC-WAN2 file system is not permitted.

If you use the Data Capacitor II file system to store data containing PHI:

  • You and/or the project's principal investigator (PI) are responsible for ensuring the privacy and security of that data, and complying with applicable federal and state laws/regulations and institutional policies. IU's policies regarding HIPAA compliance require the appropriate Institutional Review Board (IRB) approvals and a data management plan.
  • You and/or the project's PI are responsible for implementing HIPAA-required administrative, physical, and technical safeguards to any person, process, application, or service used to collect, process, manage, analyze, or store PHI.
Note:
Although PHI is one type of institutional data classified as Critical at IU, other types of institutional data classified as Critical are not permitted on Research Technologies systems. For help determining which institutional data elements classified as Critical are considered PHI, see About protected health information (PHI) data elements in the classifications of institutional data.

Official classification levels for institutional data at IU are defined by the university's data management rules and policies. If you have questions about the classifications of institutional data, use the Data Sharing and Handling (DSH) tool or contact the appropriate Data Steward.

To determine the most sensitive classification of institutional data you can store on any given UITS service, see the Choosing an appropriate storage solution section of About dedicated file storage services and IT services with storage components appropriate for sensitive institutional data, including research data containing protected health information.

For more, see Your legal responsibilities for protecting data containing protected health information (PHI) when using UITS Research Technologies systems and services.

Technical safeguards

You should employ the following technical safeguards when working with PHI:

  • Set directory permissions: The permissions for a directory containing PHI should be set to grant read, write, and execute access to the owner (you) only. No access at all should be granted to group members and other users.

    To change the permissions of an existing file or directory, use the chmod command. For example, to grant read, write, and execute access to phi_file to its owner only, on the command line, enter:

     chmod 700 phi_file

    The above command will set the Unix permissions to look like this:

     -rwx------ 1 <username> uits 40 Sep 13 15:12 phi_file

    Alternatively, to configure your user environment so that every new file and directory gets the same permission level (accessible only by the owner), add the following line to your shell profile:

     umask 077
  • Encrypt data at rest: While data containing PHI are at rest (i.e., when you are not working with them), they should be encrypted; see Recommended tools for encrypting data containing HIPAA-regulated PHI.
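The effect of the umask and chmod settings above can be checked directly; a quick sketch with throwaway file and directory names:

```shell
umask 077                 # newly created files/directories: owner access only
touch phi_demo_file       # created with mode 600 (-rw-------)
mkdir phi_demo_dir        # created with mode 700 (drwx------)
ls -ld phi_demo_file phi_demo_dir
```

Note that umask masks out permission bits at creation time, so plain files come out as 600 (no execute bit) while directories come out as 700.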

Reference

For more about the Lustre file system, see the Lustre wiki.

Acknowledge grant support

The Indiana University cyberinfrastructure, managed by the Research Technologies division of UITS, is supported by funding from several grants, each of which requires you to acknowledge its support in all presentations and published works stemming from research it has helped to fund. Conscientious acknowledgment of support from past grants also enhances the chances of IU's research community securing funding from grants in the future. For the acknowledgment statement(s) required for scholarly printed works, web pages, talks, online publications, and other presentations that make use of this and/or other grant-funded systems at IU, see If I use IU's research cyberinfrastructure, what sources of funding do I need to acknowledge in my published work?

Support

For technical support or general information about the DC2 and DC-WAN2 file systems, contact the UITS High Performance File Systems group.

For after-hours support, call Data Center Operations (812-855-9910), and ask to have High Performance File Systems contacted.

To receive maintenance and downtime information, subscribe to the hpfs-maintenance-l@indiana.edu mailing list; see Subscribe to an IU List mailing list.


This is document avvh in the Knowledge Base.
Last modified on 2018-11-07 12:29:26.

Contact us

For help or to comment, email the UITS Support Center.