
6C. Technical Infrastructure:
FILE MANAGEMENT

Key Concepts

introduction
keeping track
image databases
storage
storage types
storage needs

 


 

 

DETERMINING STORAGE NEEDS

Formula to compute storage needs
Basic storage capacity requirements can be estimated by simple calculation:

Total storage needed = # of image files x average file size x 1.25

Example: A collection of 3,000 text images, each averaging 75KB, would require about 225MB for the image files alone, or about 281MB once the 1.25 multiplier is applied. Many other factors can increase storage needs. OCR text for the same pages might run 3KB per page, or about 1/25th the space required for the corresponding image file. The number and size of derivative files, and whether they are permanently stored or created on the fly, could add further to storage requirements.

In addition, all storage technologies involve a certain amount of wasted space. The precise amount depends on factors such as the storage technology used, total capacity, partition size, and average file size. Some experimentation may be necessary to determine the approximate percentage of wasted space, but it needs to be taken into account in estimating storage needs. The 1.25 multiplier in the formula above is a generous overage intended to cover such concerns.
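The formula can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_storage_kb` is hypothetical, and the conversion of 1,000KB per MB is an assumption chosen to match the round figures in the example above.

```python
def estimate_storage_kb(num_files, avg_file_kb, overage=1.25):
    """Total storage needed = # of files x average file size x overage."""
    return num_files * avg_file_kb * overage

KB_PER_MB = 1000  # assumption: decimal units, matching the 225MB figure

# 3,000 text images at 75KB each, without and with the 1.25 overage.
raw_kb = estimate_storage_kb(3000, 75, overage=1.0)
padded_kb = estimate_storage_kb(3000, 75)

# OCR text for the same pages at 3KB per page, before any overage.
ocr_kb = estimate_storage_kb(3000, 3, overage=1.0)

print(f"Images alone:      {raw_kb / KB_PER_MB:.2f} MB")
print(f"Images + overage:  {padded_kb / KB_PER_MB:.2f} MB")
print(f"OCR text:          {ocr_kb / KB_PER_MB:.2f} MB")
```

Derivative files can be estimated the same way, one call per derivative type, and added to the total.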

Cost of storage can be approximated as follows:

Total storage cost formula

Total storage cost = total storage needed x cost per unit of storage

This provides only a rough estimate, since it includes just basic drive and media costs. Other costs related to storage could include racks and enclosures, backup power supply, cables, cards, storage management software, etc. Expect unit storage costs for large systems that include redundancy, high reliability and very high performance to be substantially higher than for routine desktop storage. Check with your systems staff for a more complete picture.
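As a rough sketch, the cost formula might be coded as follows. The function name `storage_cost` and the `other_costs` parameter are hypothetical; the latter is an assumed lump sum for extras such as racks, power supplies, and software. The $2/GB unit price is for illustration only.

```python
def storage_cost(total_gb, cost_per_gb, other_costs=0.0):
    """Total storage cost = total storage needed x cost per unit of storage.

    other_costs is an optional lump sum (same currency) for items the
    basic formula leaves out, such as enclosures or management software.
    """
    return total_gb * cost_per_gb + other_costs

# About 281MB (0.28125 GB) of storage at an illustrative $2 per GB.
basic = storage_cost(0.28125, 2.0)
print(f"Drive/media cost only: ${basic:.2f}")

# The same storage with a hypothetical $150 of additional hardware.
with_extras = storage_cost(0.28125, 2.0, other_costs=150.0)
print(f"With extras:           ${with_extras:.2f}")
```

The storage amount and the unit price must use the same unit (here, GB) for the result to be meaningful.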

Reality Check

A collection of 10,000 4 x 5-inch transparencies is scanned at 400 dpi, 24-bit color, and then losslessly compressed at a 1.3:1 ratio. Calculate the cost of hard disk storage (at $2/GB) needed for this collection.
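One way to work through this exercise is sketched below. It is not an official answer key. It assumes 1 GB = 1,000,000,000 bytes, and it reports the result both without and with the 1.25 overage from the storage formula, since the exercise does not say whether to apply it. The function name `uncompressed_bytes` is hypothetical.

```python
def uncompressed_bytes(width_in, height_in, dpi, bit_depth):
    """Uncompressed image size: pixel count x bits per pixel / 8."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bit_depth / 8

BYTES_PER_GB = 1e9      # assumption: decimal gigabytes
NUM_IMAGES = 10_000
COMPRESSION_RATIO = 1.3
COST_PER_GB = 2.0       # US dollars, from the exercise
OVERAGE = 1.25          # from the storage formula

per_image = uncompressed_bytes(4, 5, 400, 24)   # 1600 x 2000 pixels, 3 bytes each
compressed = per_image / COMPRESSION_RATIO
total_gb = NUM_IMAGES * compressed / BYTES_PER_GB

cost_raw = total_gb * COST_PER_GB
cost_with_overage = cost_raw * OVERAGE

print(f"Per image, uncompressed: {per_image / 1e6:.1f} MB")
print(f"Collection, compressed:  {total_gb:.1f} GB")
print(f"Cost without overage:    ${cost_raw:.2f}")
print(f"Cost with 1.25 overage:  ${cost_with_overage:.2f}")
```

Using binary gigabytes (1,073,741,824 bytes) instead would lower both figures by about 7 percent.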



Choosing a particular technology can be confusing. Consider magnetic disk, where the options include ATA (also called EIDE and UDMA), SCSI (wide/narrow, Ultra II/III/160/320, LVD, etc.), Firewire (IEEE-1394), USB, Fibre Channel, etc. The number of choices is growing, with higher performance versions of most of these technologies in the works.

For small collections, both during image capture and delivery, desktop ATA, USB and Firewire storage may be all that's required. The current implementation of ATA (now being called parallel ATA to distinguish it from its successor) has topped out at 133 MB/sec transfer rate and will gradually be replaced by serial ATA, which starts at 150 MB/sec and goes up from there. USB 2.0 and Firewire (IEEE 1394a) both run at about 50 MB/sec, though IEEE 1394b is expected to double that performance.
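To put these transfer rates in perspective, the sketch below estimates the best-case time to move a collection over each interface. The rates are the peak figures quoted above; sustained throughput on real drives is lower, so treat the results as lower bounds. The names `PEAK_MB_PER_SEC` and `transfer_seconds` are hypothetical, as is the 225MB collection size.

```python
# Peak interface rates in MB/sec, as quoted in the text above.
PEAK_MB_PER_SEC = {
    "parallel ATA": 133,
    "serial ATA": 150,
    "USB 2.0 / Firewire (IEEE 1394a)": 50,
}

def transfer_seconds(size_mb, rate_mb_per_sec):
    """Best-case transfer time: data size divided by peak rate."""
    return size_mb / rate_mb_per_sec

collection_mb = 225  # hypothetical collection size

for name, rate in PEAK_MB_PER_SEC.items():
    seconds = transfer_seconds(collection_mb, rate)
    print(f"{name}: at least {seconds:.1f} seconds")
```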

SCSI is an older storage technology that has, through a continuing series of upgrades, maintained a performance lead over most other technologies. SCSI used to be the choice for high-performance (and high-cost) desktop storage, but while still available, it is less and less common in desktop systems. However, SCSI is still very popular for high-performance networked disk arrays. It is also one of the most important technologies for NAS and SAN installations.

NAS (network-attached storage) can provide large quantities (terabytes) of hard disk storage in a storage appliance that attaches to existing, traditional network servers. NAS is fairly simple to set up and maintain and is usually quite reliable. However, NAS has limited expandability, and large numbers of NAS units can become difficult to manage. NAS is usually based on SCSI drives, though some units use ATA.

SAN (storage area network) is primarily for very large installations that require maximum performance and flexibility. SANs allow better integration and sharing of backup facilities, and help to keep traffic between storage devices (e.g., for backup) off Ethernet networks. However, SANs can be quite complex to establish and often require outside help to install the required infrastructure and avoid interoperability problems. SANs operate over a Fibre Channel infrastructure (not Ethernet), using either SCSI or Fibre Channel drives.

The various removable media technologies (both disk and tape) can be considered mostly secondary storage technologies. That is, they are well suited for backup, off-site storage, and storage of material that doesn't need to be immediately accessible. Also, if scanning is outsourced, many vendors return image files on some form of removable media. Despite its low density, CD-R is now a low-cost and widely accepted standard. However, at 650 MB capacity, it may not be suitable for large collections and/or very large files. DVD-R, at up to 9.4 GB/disk for double-sided media, is a possible alternative, and some manufacturers are predicting a 100-year life expectancy. However, if experience with CD-Rs is any indication, media quality can vary substantially among manufacturers, and even from batch to batch. It is also unclear how long any of the DVD formats will remain usable, since higher-capacity, next-generation DVD formats are already on the horizon, and backward compatibility questions are unanswered. Committing to new removable media formats for archival storage remains something of a risky business, and all media should be regarded as temporary.
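A quick way to judge whether a removable format suits a collection is to count the discs it would take. The sketch below uses the capacities quoted above (650 MB for CD-R, 9.4 GB for double-sided DVD-R) and a hypothetical 50 GB collection. The function name `discs_needed` is an assumption, as is the use of decimal units (1 GB = 1,000 MB).

```python
import math

# Capacities in MB, from the figures quoted in the text.
CAPACITY_MB = {
    "CD-R": 650,
    "DVD-R (double-sided)": 9400,
}

def discs_needed(collection_mb, capacity_mb):
    """Minimum number of discs, rounding any partial disc up to a whole one."""
    return math.ceil(collection_mb / capacity_mb)

collection_mb = 50_000  # hypothetical 50 GB collection

for media, capacity in CAPACITY_MB.items():
    count = discs_needed(collection_mb, capacity)
    print(f"{media}: at least {count} discs")
```

This is a minimum. Because a single image file generally cannot be split across discs, and each disc carries some file system overhead, the real count is likely to be somewhat higher.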

Computer Considerations
The main consideration will be the level of support provided for the chosen peripheral bus (e.g., SCSI, Firewire) and the computer's ability to keep up with its peripherals. Peripheral bus speeds now routinely exceed those of the computer's internal bus, so some bottlenecks are unavoidable. Try to minimize them; otherwise the performance advantage of high-speed storage is lost. Advanced storage architectures such as RAID or Fibre Channel are mostly supported on multi-user platforms such as Windows NT/2000 or Unix/Linux. SCSI is an option on many systems, but won't necessarily come with the base configuration. Make sure that the operating system and system BIOS support the size of disk array you need and that there is sufficient space for needed expansion cards.

© 2000-2003 Cornell University Library/Research Department

 