Solving the challenge of seismic data management—from compression to open-sourcing

The challenges orbiting seismic data are quite peculiar to our industry. The size of such data has grown almost exponentially over the past decades, and recent acquisition surveys now produce petabits of data daily. It is thus no surprise that data formats have evolved tremendously over the years and have been adapted to fit specific workflows or software solutions, adding to the data-management complexity. Disk space is cheap, it is true; but moving data around is not. Seismic data (pre- or post-stack) and their associated attributes have become so huge that they still represent a major cost item in any given company's IT infrastructure.

There are two (complementary) strategies for solving the challenge of seismic data management:

  • Size reduction

  • Format standardization

File-size reduction—compression

Compute power available to the geophysicists and geologists analyzing the acquired seismic has also advanced tremendously. However, bandwidth from the disk to the end user has not progressed at the same pace, bound by the physical space constraints of the wiring between CPUs and disks (on-premises or in the cloud), making the reading and writing, i.e. the input/output (I/O), of seismic data more and more of a bottleneck.

One way to get around this I/O bottleneck is to reduce the volume of data we have to transfer (Vermeer et al., 1996), recognizing that it may be faster to spend some CPU time compressing the data and thereby save time on the communication, thanks to the more compact representation of the seismic. In practice, seismic compression has been a niche tool, used only when the alternative is to transfer data over very slow or very expensive links, for example satellite communication at a remote field location.

This lack of available compute power led users of popular seismic interpretation workstations to accept computationally cheap, low-quality compression methods in order to achieve better storage capacity and I/O speeds for their data. One of the most popular methods has been to store post-stack seismic in 8-bit format. In practice, this gave a 4:1 compression ratio compared to the standard 32-bit float SEG-Y format.

This method does have some serious quality issues:

  • 8-bit data has a maximum dynamic range of roughly 6 dB per bit x 8 bits = 48 dB, which is well below the potential dynamic range of the signal from modern acquisition systems, leading to loss of useful signal.

  • Seismic amplitudes tend to be strongest in the shallow section and much weaker deep in the subsurface (unless artificial amplitude-gain processes have been applied), so the weak, deep reflectors tend to be severely clipped (see the sketch after this list).
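
To make the numbers above concrete, here is a minimal Python sketch (our own illustration, not code from any interpretation package) of what a single global 8-bit scale factor does to a trace whose deep reflectors are roughly 60 dB weaker than the shallow ones. The synthetic trace and the shallow/deep contrast are illustrative assumptions only.

```python
# Minimal sketch: global 8-bit quantization of a 32-bit float trace.
# The synthetic trace and the ~60 dB shallow/deep amplitude contrast are
# illustrative assumptions, not real survey data.
import numpy as np

rng = np.random.default_rng(42)
trace = rng.standard_normal(2000).astype(np.float32)
trace[1000:] *= 1e-3                      # deep half: ~60 dB weaker than the shallow half

# One scale factor for the whole trace gives the classic 4:1 size reduction.
scale = np.abs(trace).max() / 127.0
quantized = np.clip(np.round(trace / scale), -128, 127).astype(np.int8)
reconstructed = quantized.astype(np.float32) * scale

# The weak deep samples all round to zero, so the deep reflectors are lost.
deep_error = reconstructed[1000:] - trace[1000:]
snr_db = 20 * np.log10(np.linalg.norm(trace[1000:]) / np.linalg.norm(deep_error))
print(f"size reduction: {trace.itemsize}:{quantized.itemsize}")             # 4:1
print(f"deep-section signal-to-quantization-noise ratio: {snr_db:.1f} dB")  # ~0 dB
```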

With the advent of modern workstation and cloud compute infrastructure, the time may be ripe to revisit the concept of advanced interactive seismic compression.

All compression methods (lossless or lossy) are based on the exploitation of redundancies, whether these redundancies manifest themselves as symmetries, repetitions, or other similar correlations, in 1-, 2-, 3-, or even n-dimensional space.

For example, if a 2D image is composed mainly of many repetitions of one complex 2D shape (say a tree shape, or a house shape), then one efficient way to compress the image is to store the complex shape only once, and then store just the discrete locations (together with a scale parameter) at which the shape occurs. By analyzing the input image with statistical methods such as autocorrelation, one can identify the shapes that occur most often in the image, and hence represent most of the information more compactly.
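
As a toy 1D illustration of this "store the shape once, then store where it occurs" idea (purely pedagogical, not an actual compression algorithm), a trace built from scaled copies of a single wavelet is fully described by the wavelet plus a short list of (position, scale) pairs:

```python
# Toy illustration only: represent a trace as one wavelet + (position, scale) pairs.
import numpy as np

# The common "shape": a zero-phase Ricker wavelet sampled on 21 points.
t = np.linspace(-1.0, 1.0, 21)
f = 2.0
wavelet = (1 - 2 * (np.pi * f * t) ** 2) * np.exp(-(np.pi * f * t) ** 2)

# The compact representation: the shape itself plus where (and how strongly) it occurs.
events = [(40, 1.0), (120, -0.6), (200, 0.3)]

# Reconstruction of the 256-sample trace from the compact representation.
trace = np.zeros(256)
for position, scale in events:
    trace[position:position + wavelet.size] += scale * wavelet

# Stored: 21 wavelet samples + 3 (position, scale) pairs, instead of 256 samples.
```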

The same concept applies to seismic compression, where the basic shape is usually assumed to be some form of wavelet kernel, whether in 1D, 2D, 3D or more dimensions. Because we are essentially, and iteratively, replacing details in the image with reconstructed versions of the common wavelet shape until we are satisfied with the estimated reconstruction quality, there will inevitably be some loss of residual detail, hence the term "lossy" compression. The level of acceptable loss of residual data is generally set by some sort of "compressed signal quality" ("sq") parameter, usually represented by a single value expressed in decibels (dB).
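
The article does not spell out how such a dB figure is computed; one common convention, which we assume here, is the ratio between the energy of the original signal and the energy of the compression noise:

```python
import numpy as np

def signal_quality_db(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Assumed convention: 20*log10(||signal|| / ||compression noise||) in dB."""
    noise = reconstructed.astype(np.float64) - original.astype(np.float64)
    return 20.0 * np.log10(np.linalg.norm(original) / np.linalg.norm(noise))
```

Under this convention, the sq values quoted in Fig 1 would mean that the introduced compression noise is 6, 18 or 30 dB weaker than the seismic itself.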

Fig 1. Input seismic section in the top left corner. Upper right panel shows introduced compression noise (plotted with the same color scale as the input image) for sq = 6 dB (149:1 compression ratio). Lower left image shows compression noise for sq = 18 dB (20:1 compression), and lower right image for sq = 30 dB (8:1 compression). For sq = 36 dB there is no visible compression noise at all (at a 6:1 ratio). (Figure based on data supplied by Geoscience Australia).

Compression ratios ultimately depend on the information content embedded in the seismic, and information content generally varies a lot throughout a 3D seismic cube. In the shallow part we will have high bandwidth, and hence high vertical and lateral resolution. Deep down in the seismic volume much of the high-frequency content will have been attenuated, and we will hence have much poorer vertical and lateral resolution. It is therefore reasonable to expect that much higher compression ratios should be achievable for seismic data deep in the cube than in the shallow part. We will hence partition our 3D seismic volumes into sub-cubes (3D bricks) which are compressed separately. This approach is not new and has been used successfully in the past, e.g. by Hale (1998) and by Vermeer (2000).

The choice of compression brick size is a delicate balancing act. If we choose a small brick size, we will have high similarity (i.e. low entropy) between the samples in the sub-cube, and hence a potential for high compression ratios. However, the total number of bits to compress will be fairly low, which puts a lower upper bound on the maximum compression ratio we can achieve. If the brick size is, say, 4x4x4 samples with 32-bit float precision, then the total size of the data is 4x4x4x32 = 2048 bits, and the theoretical maximum compression ratio is hence 2048:1 (reached only in the extreme case where the whole brick is compressed down to a single bit). It is also a question of compute cost. Small bricks have fewer "degrees of freedom", meaning that there are fewer compression opportunities (due, for example, to fewer and shorter repetitive combinations to scan for). It is hence computationally cheaper to compress a small brick than a large brick, although the latter has a larger potential for size reduction.
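
The same upper-bound arithmetic, repeated for a few other cubic brick sizes, is a quick back-of-the-envelope calculation and nothing more:

```python
# Upper bound on the compression ratio of an n x n x n brick of 32-bit samples,
# taking the extreme case where the whole brick compresses down to a single bit.
for n in (4, 8, 16, 64):
    total_bits = n ** 3 * 32
    print(f"{n}x{n}x{n} brick: {total_bits} bits, theoretical maximum ratio {total_bits}:1")
```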

With small brick sizes we will also have more boundaries between sub-cubes than with larger brick sizes, but the similarity of the reconstructed data across those boundaries will be reasonably high. Boundary artifacts are a real problem with many common bricked compression methods (for example the popular JPEG method), which introduce visible discontinuities at the brick boundaries. If, on the other hand, we choose a larger brick size, we will have more variability in each sub-cube, but also the potential for larger maximum compression ratios and fewer brick boundaries to deal with. The end-user workflow will often dictate the optimal compression strategy (e.g. structural interpretation, heavy on large-scale visualization, versus quantitative interpretation, with high amplitude-precision needs).

The optimal brick size for compressing seismic samples may not be the same as the optimal brick size for random access of sub-cubes in a large 3D volume. When we execute geophysical algorithms on the 3D seismic data (for example semi-automatic horizon tracking, interactive RGB spectral decomposition or 3D filtering), the likelihood of accessing chunks of neighboring samples varies with the geophysical process at hand.

It is also a matter of hardware. We need geophysically practical brick sizes that are also optimal for the available hardware, e.g. memory cache sizes, data packet sizes in the I/O protocol layer, and sector sizes on the storage medium in use.

We have therefore decided to decouple the brick size for compression and the brick size for random access in the OpenZGY format. The random-access macro-brick size is by default 64x64x64 samples. This is not hard-coded, so users can choose other sizes, but many years of experience with the development of interactive geoscience applications have led us to believe this is a good practical size for a variety of geoscience workflows. The compression brick size, on the other hand, is only 4x4x4 samples; this is the micro-brick size. A full 3D seismic volume is thus by default partitioned into a set of 64x64x64-sample sub-cubes (the macro-bricks), which in turn are partitioned into sets of 4x4x4-sample bricks (the micro-bricks). The micro-bricks are then compressed individually and independently of each other, possibly in parallel.
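
A minimal Python sketch of this two-level bricking follows; the 64 and 4 sample edge lengths are the defaults quoted above, while the helper function and variable names are ours and not part of the OpenZGY API.

```python
# Sketch of two-level bricking: macro-bricks for random access,
# micro-bricks as the units that get compressed independently.
import numpy as np

MACRO = 64   # default random-access brick edge length (samples)
MICRO = 4    # default compression brick edge length (samples)

def iter_bricks(volume, edge):
    """Yield ((i, j, k), sub_cube) for every edge**3 brick of a 3D array."""
    ni, nj, nk = volume.shape
    for i in range(0, ni, edge):
        for j in range(0, nj, edge):
            for k in range(0, nk, edge):
                yield (i, j, k), volume[i:i + edge, j:j + edge, k:k + edge]

volume = np.zeros((128, 128, 128), dtype=np.float32)   # stand-in for a seismic cube
for origin, macro_brick in iter_bricks(volume, MACRO):
    micro_bricks = list(iter_bricks(macro_brick, MICRO))
    # Each of these (64/4)**3 = 4096 micro-bricks can be compressed
    # independently of the others, possibly in parallel.
    assert len(micro_bricks) == (MACRO // MICRO) ** 3
```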

The compression algorithm we settled on is the well-known ZFP algorithm, developed by a team at the Lawrence Livermore National Laboratory. You can read more about the ZFP algorithm, and even find the team's BSD-licensed open-source code, in their ZFP GitHub repository.

Mathematically, the ZFP algorithm is based on a clever generalization of the discrete cosine transform (DCT), and the top priority for the ZFP researchers was compute efficiency, i.e. maximizing compression and decompression throughput. The code is hence extremely fast, it can achieve very high compression ratios (100:1 is theoretically achievable), and it offers strong parametric control of compression signal quality.
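
A hedged example of what applying ZFP to one macro-brick could look like, assuming the zfpy Python bindings published with the ZFP project; the tolerance value is an arbitrary illustration and is not the same thing as OpenZGY's sq parameter.

```python
# Compress one stand-in macro-brick with ZFP's fixed-accuracy mode via zfpy.
# Random data compresses poorly; real seismic, with its coherent wavelets,
# does considerably better. The tolerance below is an arbitrary assumption.
import numpy as np
import zfpy

rng = np.random.default_rng(0)
brick = rng.standard_normal((64, 64, 64)).astype(np.float32)

compressed = zfpy.compress_numpy(brick, tolerance=1e-3)   # fixed-accuracy mode
restored = zfpy.decompress_numpy(compressed)

ratio = brick.nbytes / len(compressed)
noise = restored - brick
quality_db = 20 * np.log10(np.linalg.norm(brick) / np.linalg.norm(noise))
print(f"compression ratio ~{ratio:.1f}:1, reconstruction quality ~{quality_db:.1f} dB")
```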

Format Standardization—OpenZGY

With various seismic acquisition techniques applied to different environments over the years (the first vibroseis surveys date from the 1950s, and the first offshore seismic survey was recorded in the 1930s), and with the evolution of the hardware and software used to visualize and interpret the data over the past five decades, it is no surprise that the seismic format heritage across the industry is quite rich.

Many initiatives have tried to standardize formats (SEG-Y and its various revisions are a good example). Such projects have always aimed at making the data portable and readable by various applications; rarely has the focus been on data-access performance (e.g. for visualization or processing). This is mainly linked to how every actor in our industry has dealt with intellectual property: with performance considered a key differentiating criterion, access patterns and the associated data structures have mostly been part of the "secret" recipe of a given software solution.


Fig 2. Charisma seismic interpretation software in 1983.

The OSDU™ data platform

Such mindsets have evolved in a profound way in recent years. Open source was initially a scary concept that made the corporate lawyers squeak! Now it has become a new way of working, and it is how businesses and various initiatives are ramping up their capabilities in our industry. An outstanding example of an open-source platform is the Open Subsurface Data Universe (OSDU)™. The code for the OSDU data platform has been contributed by multiple organizations; the upcoming R3 (third release) is the first commercially available version, and the first to be built on the open-source code of the DELFI data ecosystem, which was contributed by Schlumberger.

With OSDU data platform R3 around the corner (planned for late 2020), we will see the contributions of many actors make their way into the open-sourced code base, thus becoming accessible to everybody with a device connected to the internet. This will enrich what the founding members (in particular Shell) have already pushed out to the world. A whole data ecosystem, including ingestion tools, enrichment solutions and data structures, is up for grabs.

In this context, an entire solution for ingesting, storing, compressing, reading and writing seismic data is being open-sourced. The famous ZGY format of the Petrel E&P software platform becomes OpenZGY and is now ready for dissection.

Its source code and the associated APIs are contributed without restrictions to the OSDU initiative and will soon be visible in a dedicated repository on GitLab. With it, all software vendors, service companies and operators will be able to read and write ZGY files. Everyone will be able to leverage the exabits (10^18 bits) of ZGY files that have been generated over the past two decades, compressed or not, and read them directly from within their applications, connecting tightly at the data layer with one of the most commonly used subsurface characterization platforms: the Petrel platform.
They will also have the capacity to create their own OpenZGY files and optimize their I/O-size strategy by leveraging the ZFP compression libraries that accompany the code base.
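
As a flavor of what that could look like in practice, here is a sketch loosely modeled on the OpenZGY Python wrapper; every class, method and parameter name below should be treated as an assumption, so consult the actual repository for the real interface before relying on it.

```python
# Assumed API sketch only: names and signatures are loosely modeled on the
# OpenZGY Python wrapper and may differ from the real code base.
import numpy as np
from openzgy.api import ZgyReader, ZgyWriter   # assumed module path

brick = np.zeros((64, 64, 64), dtype=np.float32)

# Read one 64x64x64 macro-brick starting at the cube origin.
with ZgyReader("survey.zgy") as reader:         # assumed constructor
    size, dtype = reader.size, reader.datatype  # assumed properties
    reader.read((0, 0, 0), brick)               # assumed (start, buffer) signature

# Write the brick back out to a new file of the same nominal size.
with ZgyWriter("copy.zgy", size=size, datatype=dtype) as writer:  # assumed keywords
    writer.write((0, 0, 0), brick)
```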


Fig 3. Petrel 2010 - seismic scalability focus

Final Words

We strongly believe that this initiative is going to make a phenomenal impact from various perspectives:

  • Efficiency—as data doesn’t need to be duplicated and can be “sized” in a fit-for-purpose manner

  • Openness—as every application will be able to access, in an equal manner, data from a single source without any hidden functionality that would disadvantage some parties

  • Integrity—as portability needs for the data are minimized and associated errors avoided

  • Sustainability—as energy-consuming infrastructure can be scaled optimally

This commitment coupled with the OSDU initiative has the potential to be genuinely transformative. It is not only an open-sourcing exercise, but also a transfer of ownership that will give the industry the capacity to unite around existing and proven technology and drive the way it wants to move forward.

Leveraging OpenZGY within the seismic data-management services of the OSDU data platform also makes it possible to stream high-throughput, cloud-native datasets directly into your application, tackling the portability of such data types in a highly performant manner from an open-standard perspective.

Authors information:

A couple of guys who have always loved playing with seismic and the tools around it. One is a nutritional-facts nerd, the other only cares about good wine; we will let you guess who's who.

Victor Aarre is Geophysics Advisor at the Schlumberger Norway Technology Center (SNTC), where the Petrel E&P software platform is developed. Victor holds a Cand. Scient. degree in Scientific Computing from the University of Bergen (1993). He has worked in various Research & Engineering positions in Schlumberger since 1995, mainly within the seismic discipline. His current personal research focus is on seismic signal processing, fault/fracture mapping, automatic Earth Model Building and Machine Learning/AI technologies. He is a member of SEG and EAGE, holds 18 granted or pending US patents, and has 20+ peer-reviewed publications.

Jimmy is responsible for fostering, promoting and funneling Digital Innovation Projects in Software Integrated Solutions (SIS), Schlumberger's software division. He makes sure that emerging technologies, combining cloud-based solutions, artificial intelligence, edge compute and various other interesting pieces of tech, are evaluated and integrated into the wider exploration-to-production workflow with maximum return on investment. He joined the oil and gas industry in 2008 after graduating from IFP and the University of Strasbourg with degrees in Petroleum Geosciences and Geophysics, respectively. He has since held various positions in Schlumberger's software product line, from product ownership for the development of various parts of the subsurface characterization workflow to service-delivery management for the Scandinavian business unit.


Victor Aarre, Jimmy Klinger

Seismic Insights Miners


Useful Links:

Petrel Geophysics

Openness in DELFI

OSDU