Delta Encoding ASCII Decimal Strings to Achieve Better Compression Ratio

Introduction

Compression algorithms benefit from delta encoding, which makes the data more compressible. Delta encoding of correlated ASCII decimal strings improves the overall compression ratio, thereby providing a considerable benefit to compression algorithms.

ASCII Decimal String

An ASCII decimal string consists of a set of printable characters, which range from codes 0×30 to 0×39 that represent the digits from 0 to 9.

Delta Encoding ASCII Decimal Strings

Textual and semi-textual data can contain ASCII decimal strings that have the same digit lengths and are correlated. For example, the 8-digit decimal strings are in an incremental order as shown in Figure 1.

12345678 12345679 12345680 12345681

Figure 1: Correlated Decimal Strings

This understanding is useful for delta encoding decimal strings. The first decimal string has the base value, and each subsequent decimal string has the value difference from the previous value. The encoding is performed in a digit-by-digit manner rather than as a whole integer. The encoding can produce results similar to the one shown in Figure 2.

12345678 00000001 00000011 00000001

Figure 2: Delta-encoded Decimal Strings

In real-world data, there may be other data between the decimal strings, such as strings as shown in Figure 3.

ID 12345678 ID 12345679 ID 12345680 ID 12345681

Figure 3: Correlated Decimal Strings with Strings In Between

The delta coder recognizes the decimal strings regardless of location. The coder ignores the other strings during the encoding process. The encoding can result in a display similar to the one shown in Figure 4.

ID 12345678 ID 00000001 ID 00000011 ID 00000001

Figure 4: Delta-encoded Decimal Strings with Strings In Between

In real-world data, there may be other data between the decimal strings, such as decimal strings with different lengths as shown in Figure 5.

ID99 12345678 ID100 12345679 ID101 12345680 ID102 12345681

Figure 5: Correlated Decimal Strings with Decimal Strings In Between

The delta coder performs multiple passes on the data and encodes decimal strings with different lengths separately. In Figure 6, the coder has encoded decimal strings with 8 digits. Similarly, decimal strings with different lengths are encoded independently.

ID99 12345678 ID100 00000001 ID101 00000011 ID102 00000001

Figure 6: Delta-encoded Decimal Strings with Strings and Decimal Strings In Between

Delta encoding is beneficial for compression algorithms because it increases the occurrence of the same characters. In this example, it is the zero character. For this reason, many compression algorithms can compress data more efficiently. However, if the decimal strings are not correlated, the delta encoding process may not improve the compression ratio, or it could even make it worse.

Experimental Analysis

In this experiment, the efficiency of delta encoding is tested using a real-world PDF (Portable Document Format) file, which is a semi-textual format and represents the varying degrees of correlated ASCII decimal strings.

Ten delta encoding methods are applied to the sample. Delta encoding procedures range from encoding 1-digit decimal strings to 10-digit decimal strings. One combined delta encoding method is applied to the sample. It combines the five delta encoding methods that are the strongest individually. Seven compression methods are applied to the original and delta-encoded files. Each method is set to the maximum compression ratio.

Results

Combined delta encoding contributes to varying degrees of compression improvements ranging from 3.5 percent to 18.8 percent as shown in Table 1. PPMd, BZip2, and RAR achieve 18.8 percent, 15.8 percent, and 14.7 percent improvements respectively. ZPAQ achieves the least, but still the 3.5 percent result is noteworthy. Without delta encoding, ZPAQ produces the smallest file, followed by LZMA and Brotli. With combined delta encoding, ZPAQ produces the smallest file, but Brotli achieves a better result than LZMA. Delta encoding with 2, 4, 5, 6 and 10 digits improves the compression ratio. However, delta encoding with 1 and 9 digits deteriorates the compression ratio.

Table 1: PDF Sample Results
Delta encoding method ZIP:Deflate BZip2 7z:PPMd 7z:LZMA 7z:Brotli RAR ZPAQ
None (original sample) 6,798,779 6,769,657 7,013,053 6,040,012 6,151,065 6,750,897 5,692,034
1-digit 6,804,274 6,788,146 7,042,361 6,057,956 6,167,812 6,773,170 5,704,985
2-digit 6,753,716 6,721,548 6,977,163 6,034,762 6,134,957 6,722,578 5,680,818
3-digit 6,807,662 6,765,354 7,029,890 6,076,672 6,170,911 6,808,052 5,718,458
4-digit 6,747,452 6,734,268 6,942,803 6,032,576 6,134,263 6,708,992 5,685,910
5-digit 6,369,822 6,252,086 6,397,184 5,863,166 5,884,288 6,283,388 5,631,217
6-digit 6,649,723 6,609,305 6,804,199 5,975,727 6,055,567 6,596,946 5,668,890
7-digit 6,798,901 6,769,197 7,012,649 6,039,961 6,150,558 6,751,004 5,692,130
8-digit 6,798,790 6,769,490 7,012,362 6,041,421 6,150,680 6,750,935 5,692,051
9-digit 6,798,795 6,769,657 7,013,069 6,040,028 6,151,081 6,750,913 5,692,042
10-digit 6,591,545 6,456,421 6,579,591 5,943,496 5,987,127 6,452,858 5,598,654
Combined of 2, 4, 5, 6, 10 digits 5,920,912 (12.9%) 5,699,267 (15.8%) 5,696,300 (18.8%) 5,668,295 (6.2%) 5,580,474 (9.3%) 5,760,317 (14.7%) 5,494,643 (3.5%)

The size of the original PDF file is 20,642,355 bytes. The file size in yellow indicates that the compression ratio is better after individual delta encoding.

Reliability

The delta encoding presented in this analysis is lossless and reversible. The delta encoder used in this experiment can undo the delta encoding to restore the original data.

Conclusion

The experiment provides conclusive evidence for the benefits of delta encoding correlated ASCII decimal strings, which increases the occurrence of the same decimal characters and improves the overall compression ratio.

Article Link: https://suszter.com/reversingonwindows/delta-encoding-ascii-decimal-strings-to-achieve-better-compression-ratio