LLMs in Action: Text Compression and Decompression Techniques

Ahmed
9 min read · May 11, 2024


Table of Contents:

1- Introduction

2- Methodology

3- Testing and Results

4- Discussion and Analysis

5- Conclusion

1- Introduction

Modern technology has birthed monumental achievements in the form of large language models (LLMs). These models undergo rigorous training on vast datasets of textual information, meticulously absorbing the intricacies of word relationships and contextual nuances within expansive documents. This understanding enables LLMs to generate coherent and contextually relevant text, making them indispensable tools in natural language processing.

The capabilities of LLMs have spurred discussions and raised concerns, particularly regarding their potential interaction with copyrighted material. Recent debates have focused on the possibility of LLMs inadvertently reproducing copyrighted text due to the extensive training data they assimilate. This issue prompts a deeper exploration into the inner workings of LLMs and their relationship with the data they are trained on. The core question that arises is whether it is feasible to extract training text from LLMs. This question is not entirely new or devoid of prior discussion, yet it remains a pivotal point of interest.

Exploring further, one might ponder the extent to which LLMs can autonomously generate text that diverges from their training data, specifically whether they can replicate text they have never encountered directly. This line of inquiry leads to considerations about the underlying mechanisms of LLMs and their potential for extrapolation beyond their training parameters. It also motivates an investigation into avenues for enhancing the efficiency of small-scale LLMs, encompassing strategies such as pre-training methodologies and the implementation of varied compression algorithms.

Through a comparative analysis of these pathways toward efficiency, shown in Figure 1, the study endeavors to discern the inherent trade-offs associated with compressing LLMs. While compression offers advantages such as reduced computational overhead and minimized storage requirements, it can concurrently influence the model's performance metrics, including its accuracy and dependability. Thus, this investigation seeks to unveil the nuanced effects of compression on LLM trustworthiness, providing valuable insights into achieving a balance between efficiency gains and maintaining optimal model performance across diverse applications and contexts.

Figure 1: An exploration of different strategies for optimizing smaller LLMs, such as pre-training and varying compression algorithms, reveals the impact of compression on a range of trustworthiness metrics.

In this context, the focus shifts to the possibility of using LLMs for text compression, a novel application stemming from understanding and manipulating the intricate web of linguistic relationships these models encapsulate. The exploration of text compression using LLMs not only sheds light on their versatility but also opens doors to innovative approaches in data processing and analysis. Figure 2 below shows that encoding the sequence ‘AIXI’ through arithmetic coding with a probabilistic model P results in the binary code ‘0101001’. The compression process assigns intervals to symbols based on the probabilities given by P, progressively narrowing the interval to produce the compressed bits representing the original message. During decoding, arithmetic coding uses the incoming compressed bits to initialize intervals and iteratively matches intervals with symbols to reconstruct the original message, again using the probabilities from P.

Figure 2: Encoding ‘AIXI’ with arithmetic coding using a model P yields the binary ‘0101001’. Compression assigns intervals to symbols based on P’s probabilities, gradually generating compressed bits. Decoding matches intervals with symbols to rebuild the message, using P’s probabilities. (Marktechpost, 2023)
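To make the interval idea concrete, the sketch below implements a minimal, float-based arithmetic coder. The alphabet, probabilities, and representative code point are illustrative assumptions rather than the model P or the exact ‘0101001’ bitstream from Figure 2; production coders use integer arithmetic and emit bits incrementally.

# A minimal, float-based sketch of arithmetic coding (illustration only).
# The symbol probabilities below are assumed, not taken from the figure's model P.
ALPHABET = {"A": 0.4, "I": 0.4, "X": 0.2}

def cumulative(probs):
    """Map each symbol to its [low, high) slice of the unit interval."""
    ranges, low = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (low, low + p)
        low += p
    return ranges

def encode(message, probs):
    """Narrow [low, high) once per symbol; any point in the final interval encodes the message."""
    ranges, low, high = cumulative(probs), 0.0, 1.0
    for sym in message:
        span = high - low
        s_low, s_high = ranges[sym]
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2  # a representative code point inside the final interval

def decode(code, length, probs):
    """Find which symbol's slice contains the code point, emit it, then rescale."""
    ranges, out = cumulative(probs), []
    for _ in range(length):
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= code < s_high:
                out.append(sym)
                code = (code - s_low) / (s_high - s_low)
                break
    return "".join(out)

point = encode("AIXI", ALPHABET)
print(point, decode(point, 4, ALPHABET))  # round-trips back to 'AIXI'

The wider the model makes a symbol's slice, the less the interval shrinks when that symbol occurs, which is why text the model considers likely compresses to fewer bits.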

This article explains these complexities in a practical way: it illustrates a methodology for extracting text from LLMs, explores the nuances of text compression using these models, and contemplates the broader implications and potential applications of such endeavors. Through this exploration, we aim to deepen our understanding of LLMs’ capabilities, their interaction with textual data, and the possibilities they present for transformative advancements in natural language understanding and data processing.

2- Methodology

The methodology employed in this study revolves around utilizing llama.cpp’s Python bindings to develop a solution for text compression using large language models (LLMs). The approach is structured around several key functions that facilitate the extraction and manipulation of text within LLMs.

2.1. load_document(filename): This function is designed to read a text file and tokenize it using the model’s tokenizer. If the text surpasses the model’s context window, it is segmented into smaller parts that fit within this window, preventing token overflow and ensuring compatibility with the model’s processing capabilities.
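As a rough sketch of what this function might look like with llama.cpp’s Python bindings (the original implementation is not reproduced here; llm is an assumed llama_cpp.Llama instance):

# A possible shape for load_document; `llm` is assumed to be a llama_cpp.Llama instance.
def load_document(filename, context_window=None):
    """Read a text file, tokenize it, and split it into chunks that fit the context window."""
    with open(filename, "r", encoding="utf-8") as f:
        text = f.read()
    tokens = llm.tokenize(text.encode("utf-8"))
    limit = context_window or llm.n_ctx()
    # split the token list into pieces no longer than the model's context window
    chunks = [tokens[i:i + limit] for i in range(0, len(tokens), limit)]
    # return each chunk as text for downstream processing
    return [llm.detokenize(chunk).decode("utf-8", errors="ignore") for chunk in chunks]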

2.2. generate_text(prompt, max_tokens=1): The generation of text occurs in incremental steps using a specified prompt. The function generates a specified number of tokens at a time with a temperature setting of 0.0 and a static seed. This process effectively continues the text from where the input text ceased, maintaining coherence and context within the generated text.
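A possible shape for this function, again assuming a llama_cpp.Llama instance named llm whose static seed was fixed when the model was constructed (for example Llama(model_path=..., seed=42)):

# A sketch of generate_text; the seed is assumed to be fixed on the model itself.
def generate_text(prompt, max_tokens=1):
    """Continue the prompt by a fixed number of tokens, deterministically."""
    # temperature 0.0 makes sampling greedy, so repeated calls produce identical output
    output = llm(prompt, max_tokens=max_tokens, temperature=0.0)
    return output["choices"][0]["text"]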

2.3. compress_text(source_text): The compression function aims to compress the input text by iteratively generating segments of text using the LLM. Each generated segment is compared with the corresponding section of the source text. If the generated text aligns with the source text, it is incorporated into the compressed string; otherwise, the character directly from the source document is added to the compressed string. To track the generated text, the function records the number of tokens generated and places that number between delimiters for easy reconstruction during decompression.

Code 1: Compress text by generating and comparing segments to the source text.

generated_text = ""
compressed_string = ""
gen_count = 0
i = 0
# let's loop until we have generated the entire source text
while generated_text != source_text:
# get a new token
part = generate_text(generated_text)
# if our generated text aligns with the source text then tally it
if source_text.startswith(str(generated_text + part)) and len(part) > 0:
gen_count += 1
generated_text += part
i = len(generated_text)
if debug:
print(BLUE + part + RESET, end="", flush=True)
# if not, then grab a letter from the source document
# hopefully we'll be back on track during the next loop
else:
i += 1
if gen_count > 0:
compressed_string += f"{re.escape(DELIMITER)}{gen_count}{re.escape(DELIMITER)}"
gen_count = 0
generated_text += source_text[i - 1]
compressed_string += source_text[i - 1]
if debug:
print(source_text[i - 1], end="", flush=True)
Figure 3: The script is being processed by the model, where the blue text corresponds to text generated by the LLM, while the white text originates from the source text.

2.4. decompress_text(compressed_text): In contrast, the decompression function reverses the compression process. It splits the compressed text using the delimiter and reconstructs the original text by generating missing parts or directly appending text from the compressed string.

Code 2: The code decompresses text by splitting it into sections, generating text based on generation counts, and appending it to the decompressed string.

decompressed_text = ""
# split the parts into sections, text and generation counts
parts = re.split(rf'({re.escape(DELIMITER)}\d+{re.escape(DELIMITER)})', compressed_text)

for part in parts:
# if we're looking at a generation count, then generate text
if re.match(rf'{re.escape(DELIMITER)}\d+{re.escape(DELIMITER)}', part):
number = int(part[1:-1])
for count in range(number):
part = generate_text(decompressed_text)
if debug:
print(GREEN + part + RESET, end="", flush=True)
decompressed_text = decompressed_text + part
else:
# just add the text to the decompressed string
decompressed_text += part
if debug:
print(part, end="", flush=True)
Figure 4: The compressed draft of the post outlines warnings about its quality, explains the methodology, details the testing phase with results, and describes the model’s processing of the script with color-coded text differentiation.

Testing involved applying these functions to two texts: the first chapter of “Alice’s Adventures in Wonderland” and a draft of this post. The results demonstrated significant reductions in text size through compression, with successful decompression validating the efficacy of the compression algorithm. The compressed draft of the post warns about its unfinished state, discusses training large language models (LLMs) on extensive datasets to understand word relationships and contexts within documents, touches on text extraction and reproduction, and demonstrates compression and model processing with color-coded text differentiation.
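Assuming the functions above are wrapped as described, a round-trip test can be sketched as follows (the file name and reporting are illustrative, not the exact test harness used):

# hypothetical end-to-end round trip over the chunks of a source document
for chunk in load_document("alice_chapter_1.txt"):
    compressed = compress_text(chunk)
    restored = decompress_text(compressed)
    assert restored == chunk  # the compression-decompression cycle must be lossless
    print(f"original: {len(chunk)} chars -> compressed: {len(compressed)} chars")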

3- Testing and Results

The testing phase of this study involved the application of the developed methodology to two distinct texts. The primary objective was to assess the efficacy of the text extraction, compression, and decompression processes using large language models (LLMs). The first text selected for testing was the initial chapter of “Alice’s Adventures in Wonderland.” This choice was made due to the assumption that this text would likely be part of the LLM’s training data, ensuring a baseline for comparison and validation of the methodology.

The application of the methodology to the selected text revealed notable outcomes. The compression process yielded significant reductions in text size, demonstrating the model’s ability to compress textual data effectively. The compressed text, while substantially smaller in size, retained the essential content and structure of the original text, indicating the strength of the compression algorithm. Following successful compression, the decompression process was initiated to restore the compressed text to its original form. The decompression algorithm effectively reconstructed the text, demonstrating the reversibility and reliability of the compression-decompression cycle. Quantitative analysis of the results showcased the extent of text reduction achieved through compression.

Figure 5: This is a compressed version of a post, discussing various aspects of language models, their training, and text generation processes, along with insights on compression and its implications on text quality. (LaurieWired, May 2024)

Comparing the size of the original text to the compressed text revealed substantial reductions, highlighting the efficiency of the compression algorithm in reducing data size while preserving essential content and meaning. These results provide empirical evidence of the feasibility and effectiveness of utilizing LLMs for text compression purposes. The successful application of the methodology to real-world texts underscores the practicality and scalability of the approach, paving the way for further exploration and utilization of LLMs in data compression and manipulation tasks.

4- Discussion and Analysis

In this section, we highlight the multifaceted aspects of utilizing LLMs for text compression, ranging from technical scalability and performance to ethical and societal considerations. Further research and exploration in these areas can unlock new possibilities and applications for LLMs in data processing and optimization tasks. The findings from the testing phase prompt a deeper exploration into the implications and potential applications of utilizing large language models (LLMs) for text compression and manipulation tasks.

4.1. Feasibility and Practicality: The successful application of the developed methodology highlights the feasibility and practicality of using LLMs for text compression. The efficient reduction in text size while preserving essential content underscores the potential of LLMs in data optimization and storage efficiency.

4.2. Scalability and Performance: An important aspect to consider is the scalability and performance of the compression algorithm across different text sizes and complexities. Further research and testing can provide insights into the algorithm’s scalability and its ability to handle larger and more diverse textual datasets.

4.3. Data Identification and Copyright Concerns: One intriguing aspect raised by this study is the potential for using LLM-based compression methods to identify data used in model training. This capability could address copyright concerns and aid in data attribution and validation within the context of LLM-generated content.

4.4. Model Variability and Generalization: Exploring how different LLM architectures and models impact compression performance is another avenue for future investigation. Comparing results across multiple LLMs can provide insights into model variability and generalization capabilities in text compression tasks.

While this study focused on text compression, the principles and methodologies developed can potentially be extended to other data types, such as images or audio. Investigating the adaptability of LLMs in compressing diverse data formats opens doors to broader applications in data optimization and storage.

5- Conclusion

The exploration of text compression using large language models (LLMs) has unveiled promising avenues for data processing and manipulation. The successful application of the developed methodology to real-world texts, exemplified by the compression and subsequent decompression of the first chapter of “Alice’s Adventures in Wonderland,” underscores the practicality and effectiveness of using LLMs for text compression tasks. The findings from this study contribute significantly to our understanding of LLM capabilities in data optimization and storage efficiency. The efficient reduction in text size while preserving essential content showcases the potential of LLMs in addressing data volume challenges and enhancing data processing workflows. Moving forward, several key areas warrant further exploration and research. These include:

  • Investigating the scalability of the compression algorithm across diverse text sizes and complexities is essential for its practical application in real-world projects. Optimizing the algorithm’s performance and resource utilization can further enhance its efficiency and effectiveness.
  • Comparing the performance of the compression algorithm across different LLM architectures and models can provide valuable insights into model variability and generalization capabilities. Understanding how various LLMs handle text compression tasks can inform model selection and optimization strategies.
  • Exploring the adaptability of LLMs in compressing diverse data formats beyond text, such as images or audio, opens up new possibilities for data optimization and storage efficiency across multiple domains.

The successful application of LLMs for text compression represents a significant step towards harnessing the capabilities of advanced natural language processing technologies for data manipulation and optimization. Continued research and exploration in this domain hold promise for innovative advancements and practical applications in data processing workflows. While this proof of concept demonstrates promising results in text compression using LLMs, further exploration is warranted to address scalability, model variability, and potential applications beyond text data. This work opens avenues for research at the intersection of LLMs, data compression, and data identification.

References:

https://o565.com/llm-text-compression/

https://mattmahoney.net/dc/
