By Bill Tarket, Technical Lead

FDI’s commitment to our customers is to provide transparent, meaningful access to their content – all their content, whether it was created 30 years ago or needs to be accessible 30 years from now.

To that end, we provide innovative tooling and solutions to enable customers to take content from legacy mainframe-based applications with IMS/IDMS/VSAM-based storage structures and access it using web-based applications.

As we’ve built our practice, we’ve identified several common issues our customers face:

  • Old database systems that are not relational (IMS, IDMS, VSAM)
  • Custom applications written in COBOL with numerous copybooks (record definitions)
  • Mainframe systems using the EBCDIC (Extended Binary Coded Decimal Interchange Code) character set, typically (but not exclusively) hosted on IBM operating systems

A Case in Point: Archiving Two COBOL Systems

In the case of one recent project, the client was decommissioning two in-house, custom-built invoicing and materials management applications that had been in use for more than 30 years. These COBOL applications were deployed on an IBM mainframe and used two different database systems.

One of the applications used the IBM IMS database. IMS is a hierarchical database developed 50 years ago by IBM. The other was built on a Computer Associates CA-IDMS database, initially released in 1973, which uses a CODASYL-style storage architecture.

Our challenge was to take these disparate systems and archive the data into a format that our customer could confidently expect to access for the next 30 years.

In addition to the extended access requirement, this client (like most clients) wanted their data stored in an industry-neutral format, accessible with open-source tooling, in a solution that supports multiple data models, multiple character-set encodings, a flexible security model, and an extremely low cost of ownership.

Why COBOL Data Presents Archiving Challenges

When dealing with mainframe applications on database platforms that are 30-plus years old, we encountered several major challenges.

These included:

  • Domain expertise
  • Hierarchical data
  • Variable length headers
  • Variable length footers
  • Record separators
  • Record database keys

For both applications we were decommissioning, the original programmers and maintenance programmers were still around to help with the COBOL logic and copybooks (the record layouts and structures for the different record types). Had those programmers and domain experts not been available, this would have been a much more difficult project to complete.

Don’t Underestimate the Value of Your Internal Experts

We see our customers losing domain knowledge about legacy systems all the time as employees retire or leave the organization. Our guidance is to move forward aggressively with this type of archiving project.

For some customers, there is a sense that doing nothing is the lowest risk approach. Our experience is that the do-nothing strategy is actually quite risky; lose the domain expertise and understanding of the data model, suffer a major hardware/backup failure, and you’ve caused yourself a host of problems that can be very expensive to remediate.

IMS Database Challenges: Queries, Variable Length Headers, and Record Separators

Because of the hierarchical nature of data storage in IMS, all the associated records in the hierarchy are grouped together based on the location of the parent record. What this means for archiving is that the records usually (unless the database designer had the foresight to anticipate relational databases) carry no foreign-key relationship information in the data records themselves to link them together. Without that information, the data is much harder to query.
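
To preserve those relationships in the archive, the parent/child linkage has to be reconstructed while the hierarchy is being unloaded. Here is a minimal Python sketch of that idea – the segment names, tuple layout, and wrapper fields are invented for illustration and are not our actual extract or wrapper formats:

    import uuid

    def wrap_segments(segments):
        """Walk an IMS-style extract parent-first and attach generated keys.

        segments is assumed to be an iterable of (level, segment_name, payload)
        tuples in hierarchical (parent-before-child) order -- illustrative only.
        Returns flat, wrapped records that carry the explicit parent/child
        linkage the original hierarchy never stored.
        """
        parent_keys = {}   # level -> generated key of the latest record at that level
        wrapped = []
        for level, name, payload in segments:
            record_key = str(uuid.uuid4())           # synthetic key for this record
            parent_key = parent_keys.get(level - 1)  # key of the enclosing parent, if any
            wrapped.append({
                "metadata": {                        # wrapper metadata, not application data
                    "segment": name,
                    "record_key": record_key,
                    "parent_key": parent_key,        # generated foreign-key relationship
                },
                "data": payload,                     # still laid out per the copybook
            })
            parent_keys[level] = record_key
        return wrapped

    # Example: an invoice parent followed by two line-item children
    records = wrap_segments([
        (1, "INVOICE",  {"INV-NO": "0012345"}),
        (2, "LINEITEM", {"ITEM": "BOLT", "QTY": "10"}),
        (2, "LINEITEM", {"ITEM": "NUT",  "QTY": "10"}),
    ])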

The dump utility used for the IMS database made figuring out the variable-length headers and footers of the extract files quite difficult. The length varies for each of the database segments extracted, and it depends on the utility being used and the number of record types included in the segment.

To complicate matters further, there were the occasional record separators. These separators were variable in length (if they existed in a file, they were usually either 1 or 2 characters), but at least the length was consistent throughout the extract files during processing.
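
To give a feel for what the extract parsing has to cope with, here is a rough Python sketch with made-up header, footer, separator, and record lengths, and with fixed-length records assumed for simplicity – the real values vary by utility and by segment, which is exactly why they ended up as configuration settings (more on that below):

    def split_extract(raw: bytes, header_len: int, footer_len: int,
                      record_len: int, sep_len: int) -> list:
        """Strip a variable-length header/footer and split out fixed-length
        records that may be joined by a 1- or 2-byte separator.  All of the
        lengths here are illustrative inputs; in practice they differ per file.
        """
        body = raw[header_len:len(raw) - footer_len]
        records = []
        pos = 0
        while pos + record_len <= len(body):
            records.append(body[pos:pos + record_len])
            pos += record_len + sep_len   # skip the separator, if the file uses one
        return records

    # Hypothetical extract: 16-byte header, two 20-byte records, 1-byte separator, no footer
    sample = b"H" * 16 + b"A" * 20 + b"\x00" + b"B" * 20
    print(split_extract(sample, header_len=16, footer_len=0, record_len=20, sep_len=1))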

IDMS Dump Record Challenges

For the IDMS database, there were similar issues with the file structures of the dump records: each of the “pages” dumped from the database contained the number of records on the page, the starting and ending database record keys, the space available, and the records themselves. Each record had a database key indicating its page and line number, stored in 4 bytes, where the first 3 bytes indicate the page and the last byte indicates the line.
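
That 4-byte key is straightforward to take apart once the layout is known. Here is a minimal Python sketch, assuming big-endian (mainframe) byte order and the page/line split just described – actual IDMS sites can configure how the key is divided, so treat this as illustrative:

    def decode_dbkey(dbkey: bytes):
        """Split a 4-byte IDMS-style database key into (page, line).

        Assumes the layout described above: the first 3 bytes hold the page
        number and the last byte holds the line number, big-endian order.
        """
        if len(dbkey) != 4:
            raise ValueError("expected a 4-byte database key")
        page = int.from_bytes(dbkey[:3], "big")
        line = dbkey[3]
        return page, line

    # 0x0001F4 is page 500, 0x07 is line 7
    print(decode_dbkey(bytes([0x00, 0x01, 0xF4, 0x07])))   # -> (500, 7)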

You can see from these examples that these older database technologies tend to express the logical model in the COBOL code and associated copybooks, and the physical model (records on a database page, record-interdependency pointers based on the physical location of a record, file storage structures and application data expressed on the same database page, etc.) in the database itself.

All of this made for some difficult, yet very interesting, challenges! How could we provide a repeatable process to transform this heterogeneous, disparate set of data models into a coherent, schema-less, NoSQL-based storage design that would be easily accessible by our web-based solution?

The Solution: Amping Up ETL

We overcame all of these challenges by adding the following capabilities to the FDI ETL Acceleration Suite:

  1. Generating foreign relationship keys
    We tackled this issue by programmatically generating keys into the record wrapper metadata (the approach sketched earlier in the IMS discussion), which preserved the integrity of the original IMS data inside the wrapped record. The application data still follows the copybook layout, and the generated keys make it easy to link related records in future queries.
  2. Making the COBOL translator highly configurable
    We addressed the other items on the issues list by making the translator configurable enough to accommodate many customer- and data-specific quirks (a rough sketch of this configuration follows the list below).

    1. Variable header and footer lengths – configurable per extract file
    2. Hierarchy structure – occasionally clients only want to archive certain records of the structure while dropping less important values; we made this part configurable as well to save archive space and cost
    3. Separator lengths – configurable per extract file
    4. Leading zeros – configurable preservation, since numeric values sometimes need to keep their leading zeros so they can be matched in searches
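
To make that configurability concrete, here is a rough sketch of the kind of per-file settings involved. The field names and values are invented for illustration; they are not the actual FDI ETL Acceleration Suite configuration format:

    from dataclasses import dataclass, field

    @dataclass
    class ExtractConfig:
        """Illustrative per-file translator settings (not the real product schema)."""
        header_length: int = 0            # variable-length dump header to skip
        footer_length: int = 0            # variable-length dump footer to drop
        separator_length: int = 0         # 0, 1, or 2 bytes between records
        keep_segments: set = field(default_factory=set)  # hierarchy pruning
        preserve_leading_zeros: bool = True               # keep "0012345" searchable

    CONFIGS = {
        "INVOICE.EXTRACT":  ExtractConfig(header_length=32, separator_length=1,
                                          keep_segments={"INVOICE", "LINEITEM"}),
        "MATERIAL.EXTRACT": ExtractConfig(header_length=48, footer_length=16,
                                          separator_length=2,
                                          preserve_leading_zeros=False),
    }

    def settings_for(file_name: str) -> ExtractConfig:
        """The translator looks up per-file settings rather than hard-coding them."""
        return CONFIGS.get(file_name, ExtractConfig())

    print(settings_for("INVOICE.EXTRACT").separator_length)   # -> 1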

This solution met the needs of the client. Moreover, with our highly configurable solution for converting IMS and IDMS data, we can quickly run other customers’ data through the converter to determine what types of data issues may be encountered during the archiving process.

While this set of projects had some interesting data modeling challenges, we’ve been successful in helping our customers maintain access to their data while future-proofing their data model. It’s a real treat to watch our customers access 30-year-old data through our web solution. Customers are accelerating data center closures and cloud migrations using our solution, and it’s great to help them achieve these strategic initiatives, all while reducing the cost of ownership of their legacy data.

Bonus Insight

As with many software development projects, there are little mistakes that can cause big delays in time and money. Here’s an example: we store all our data encoded in UTF-8, which is a multi-byte encoding that is fantastic for representing data from a wide variety of character sets. Typical US-based IBM systems encode characters using EBCDIC (code page 37), which is a single-byte character encoding. In addition, COBOL applications frequently have data fields that are not characters – typically these fields express dates, integers, floats, etc. OK, so far, so good! The small problem (with big consequences!) we encountered is that customers will dump the data using a backup utility, then FTP the content into our development environment. Note that the dump file has lots of data structures encoded in it: dump program headers and footers, multiple data types in the application data, record pointers, etc.

Always have the files transferred from the mainframe to the network drive using BINARY mode in whatever file transfer protocol is used.

If this does not happen, inappropriate character conversion will occur during the file transfer. It’s easy to correct once identified, but it can cause a major project delay when you’re pushing terabytes of data through a conversion pipeline!
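
To see why this matters, here is a small Python example using the standard cp037 codec. The record layout is made up: an 8-byte EBCDIC character field followed by a 4-byte COMP-3 (packed decimal) amount. Character data survives an EBCDIC-to-ASCII translation; the packed bytes do not, so a text-mode transfer silently destroys them:

    def unpack_comp3(raw: bytes, scale: int = 0):
        """Decode a COBOL COMP-3 (packed decimal) field: two digits per byte,
        with the final nibble holding the sign (0xD means negative)."""
        nibbles = []
        for b in raw:
            nibbles.extend((b >> 4, b & 0x0F))
        sign = -1 if nibbles[-1] == 0xD else 1
        value = 0
        for digit in nibbles[:-1]:
            value = value * 10 + digit
        return sign * value / (10 ** scale)

    # Hypothetical 12-byte record: EBCDIC text "INV00042" + packed amount 12,345.67
    record = "INV00042".encode("cp037") + b"\x12\x34\x56\x7C"

    invoice_no = record[:8].decode("cp037")        # character data: translation-safe
    amount = unpack_comp3(record[8:], scale=2)     # binary data: must arrive untouched
    print(invoice_no, amount)                      # -> INV00042 12345.67

    # A text-mode FTP transfer would run *every* byte through an EBCDIC-to-ASCII
    # table, including the four packed bytes above -- the amount becomes garbage,
    # and the damage is not obvious until the conversion pipeline chokes on it.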


Bill Tarket is a Technical Lead for Flatirons Digital Innovations. As a consultant he’s helped multiple businesses archive and decommission many old applications in numerous fields like Healthcare, Financial, and Manufacturing. When he’s not leading these efforts to save this old data from extinction, you can find him on the golf course, playing with his dogs, or attending family sporting events throughout the year.