
The tech world is buzzing with excitement over the meteoric rise of ChatGPT and other Large Language Models (LLMs), making them a focal point of the year's most significant tech stories. These models have astounded us with their remarkable capabilities, but as they dominate headlines, concerns about privacy have emerged as a pressing issue. Wired and other specialized press outlets have amplified these concerns, citing notable cases like Italy's ChatGPT ban for privacy reasons and Amazon's warning to employees about leaking corporate secrets.

To fully capitalize on the immense potential of LLM technologies, corporations must regain control by developing their own chatbots. Similarly, SaaS providers, as they integrate conversational AI into their offerings, need to rebuild end-users' trust through a strong commitment to security and confidentiality. This is especially crucial in Cloud environments where queries to these models may involve highly sensitive information, such as internal documentation, confidential secrets, and private correspondence.

Cosmian offers a solution to tackle these issues. In this blog post, we will leverage Cosmian's open-source technologies to run your own HuggingFace open-source models confidentially in the Cloud. All computations are conducted within secure enclaves, ensuring data never appears in clear text. Only the end-user can decipher the answers, providing unparalleled privacy and control.

Running LLMs on Intel SGX 🗜️

One of the main challenges in serving a confidential LLM in the Cloud is keeping data encrypted at all times, including during computation, a phase vulnerable to memory dumps. We tackle this issue using Intel SGX. This means running models with billions of parameters on a CPU protected by SGX, as these protections have yet to come to GPUs.

Fortunately, we found a really cool open-source project named GGML!
This repository provides multiple tools to run LLMs efficiently on a CPU, including:

  • Aggressive quantization methods, such as 5-bit and 4-bit integers (sketched below).
  • Massively parallel computing on CPU through AVX.
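
To build intuition for what such quantization does, here is a simplified sketch in the spirit of GGML's Q4_0 scheme, which stores one scale per block of 32 weights and maps each weight to a 4-bit integer. This is an illustrative approximation only; the real implementation packs two 4-bit values per byte in optimized C kernels:

import numpy as np

BLOCK_SIZE = 32  # GGML quantizes weights in blocks of 32 values

def quantize_q4(weights: np.ndarray):
    """Block-wise 4-bit quantization: one float scale per block of 32 weights."""
    blocks = weights.reshape(-1, BLOCK_SIZE)
    # Scale each block so its largest magnitude maps into the 4-bit range [-8, 7]
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales = np.maximum(scales, 1e-12)  # avoid division by zero on all-zero blocks
    quantized = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return quantized, scales

def dequantize_q4(quantized: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Approximate reconstruction of the original weights."""
    return (quantized.astype(np.float32) * scales).reshape(-1)

weights = np.random.randn(64).astype(np.float32)
q, s = quantize_q4(weights)
print("max abs error:", np.abs(dequantize_q4(q, s) - weights).max())

At roughly 4.5 bits per weight (4-bit values plus the per-block scales), a float16 model shrinks to about 28% of its original size, which matches the 14G to 4G drop measured below.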

We did some testing with different quantization parameters:

| Model     | Quantization | Model size | Context size | Eval time per token | Perplexity |
|-----------|--------------|------------|--------------|---------------------|------------|
| Falcon-7B | float16      | 14G        | 14G          | 121 ms              | 10.3582    |
| Falcon-7B | 8 bits       | 8G         | 8G           | 82 ms               | 10.3675    |
| Falcon-7B | 5 bits       | 5G         | 5G           | 74 ms               | 10.9167    |
| Falcon-7B | 4 bits       | 4G         | 4G           | 68 ms               | 12.4256    |

Quantizing significantly reduces the disk space and memory required during computation. It does impact model quality, which is measured using the perplexity computed on wikitext-2-raw (lower is better). We found that 5-bit quantization was a good compromise between size, speed, and quality.
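
For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of the evaluation text. A minimal sketch of the computation:

import math

def perplexity(token_log_probs: list) -> float:
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning probability 0.1 to every token has perplexity 10
print(perplexity([math.log(0.1)] * 4))  # ~10.0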

We then benchmarked the inference time of popular models with 5-bit quantization inside an SGX enclave. All benchmarks were run on an Intel(R) Xeon(R) Gold 6312U CPU @ 2.40GHz using 48 threads.

| Model        | Model size | Context size | Eval time per token |
|--------------|------------|--------------|---------------------|
| MPT-7B       | 4G         | 4.5G         | 85 ms               |
| Falcon-7B    | 5G         | 5G           | 102 ms              |
| Pythia-12B   | 7.5G       | 14G          | 115 ms              |
| GPT-NeoX-20B | 13G        | 24G          | 172 ms              |
| MPT-30B      | 19G        | 20G          | 238 ms              |
| Falcon-40B   | 27G        | 27G          | 352 ms              |

The results were promising: without further optimization, inference on encrypted hardware was only about 35% slower than on an unprotected CPU.

Consequently, we developed an MSE app to serve a language model for inference.

Creating the Microservice Encryption (MSE) application 👷‍♂️

Cosmian Microservice Encryption (MSE) makes it easy to deploy confidential web applications written in Python. The code runs within secure enclaves powered by Intel SGX, protecting data and metadata from the underlying cloud provider that owns the hardware infrastructure.
The full code of the application created for this blog post is open-source and available at https://github.com/Cosmian/mse-example-gpt.

MSE apps are built like regular Flask APIs in Python. We used the ctransformers library to load GGML models from Python.

import os
from base64 import b64decode
from http import HTTPStatus
from json import dumps
from pathlib import Path

from ctransformers import AutoModelForCausalLM
from flask import Flask, Response, jsonify, request

app = Flask(__name__)

# The model is stored in the current working directory (./mse_src)
# More information: https://docs.staging.cosmian.com/microservice_encryption_home/develop/#the-paths
CWD_PATH = Path(os.getenv("MODULE_PATH")).resolve()

# Maximum number of tokens generated per response
# (value assumed for this example; adjust to your needs)
MAX_RESPONSE_SIZE = 256

llm: AutoModelForCausalLM

@app.before_first_request
def init():
    """
    Function to initialize the model before handling any requests.
    Here the model is loaded from disk but it could be downloaded from a secure source.
    """
    global llm
    model_path = str(CWD_PATH / "ggml-model-q4_0.bin")
    try:
        llm = AutoModelForCausalLM.from_pretrained(model_path, model_type="gpt-neox")
    except ValueError as e:
        print(f"Model initialization error: {e}")

Then, we can create the /generate endpoint to generate text from a user query.

@app.route("/generate", methods=["POST"])
def generate():
    """Route for generating a response based on a query."""
    query = request.json.get("query")
    if not query:
        return Response(status=HTTPStatus.BAD_REQUEST)

    # Generate a response using the model
    res = llm(query, seed=123, threads=3, max_new_tokens=MAX_RESPONSE_SIZE)

    return jsonify({"response": res})


Finally, we can make the experience more interactive by streaming the response through server-sent events and injecting the conversation history back into the prompt.

@app.route("/generate")
def chat():
    """
    Route for generating a stream response based on a prompt
    containing a query and chat history.
    """
    b64_prompt = request.args.get("prompt")
    if not b64_prompt:
        return Response(status=HTTPStatus.BAD_REQUEST)

    # Truncate context to leave space for answer
    prompt = b64decode(b64_prompt).decode("utf-8")
    max_context_size = llm.context_length - MAX_RESPONSE_SIZE
    context_tokens = llm.tokenize(prompt)[-max_context_size:]

    def stream_response():
        msg_id = 0
        # Stream model tokens as they are being generated
        for token in llm.generate(context_tokens, seed=123, threads=3):
            msg_id += 1
            msg_str = dumps(llm.detokenize(token))

            yield f"id: {msg_id}\nevent: data\ndata: {msg_str}\n\n"

            if msg_id == MAX_RESPONSE_SIZE:
                break

        # End stream
        yield f"id: {msg_id + 1}\nevent: end\ndata: {{}}\n\n"

    # Create SSE response
    return Response(stream_response(), mimetype="text/event-stream")
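
On the client side, the prompt must be base64-encoded into the prompt query parameter and the event stream parsed frame by frame. Here is a minimal sketch using requests; the URL and certificate path are placeholders to adapt to your own deployment:

from base64 import b64encode
from json import loads

import requests

APP_URL = "https://your-app.cosmian.app"  # placeholder
CERT_PATH = "./cert.pem"  # placeholder: certificate kept after deployment

prompt = "User: What is computer science?\nAssistant:"
b64_prompt = b64encode(prompt.encode("utf-8")).decode("utf-8")

with requests.get(
    f"{APP_URL}/generate",
    params={"prompt": b64_prompt},
    verify=CERT_PATH,
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        # Frames look like: 'id: 1', 'event: data', 'data: "token"'
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload != "{}":  # the final "end" event carries an empty object
                print(loads(payload), end="", flush=True)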

Deploying your application 🚀

Setup

We recommend cloning the example repository to follow the instructions smoothly.
Ensure you have git-lfs installed to download the model (EleutherAI/pythia-1b) from the repo.
To use custom models, please read this. You might need to upgrade the hardware configuration to run models with more than 1B parameters.
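
As a rough rule of thumb, a quantized model needs about parameters × bits / 8 bytes of memory, before accounting for the context buffer and runtime overhead:

def model_size_gb(n_params: float, bits: int) -> float:
    """Back-of-envelope size of a quantized model."""
    return n_params * bits / 8 / 1e9

print(model_size_gb(7e9, 5))  # ~4.4 GB: close to the 5G measured for Falcon-7B
print(model_size_gb(1e9, 4))  # ~0.5 GB: Pythia-1b in q4_0 fits the 4096M configuration used below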

Now that you have all the necessary files, you are ready to deploy your code on MSE!
Create an account on https://console.cosmian.com and install the mse-cli:

# install
$ pip install mse-cli
# login
$ mse login

Local testing

Before deploying, you can test the application locally (you need to have docker installed):

mse-example-gpt$ mse test
Starting the docker: ghcr.io/cosmian/mse-flask:20230228091325...
...
[2023-06-28 14:02:01 +0000] [15] [INFO] Running on http://0.0.0.0:5000 (CTRL + C to quit)

The app is running; now for a quick test:

$ curl -X POST http://localhost:5000/generate \
     -H 'Content-Type: application/json' \
     -d '{"query":"User data protection is important for AI applications since"}'

{
    "response": " it protects users' privacy, security and personal information. This includes storing and protecting the data associated with an application so that no unauthorized use can be made of this data. In particular, this type of protection allows for user authentication based on biometric data. The authentication of a user's identity based on their unique fingerprints or"
}

Deployment on MSE

mse-example-gpt$ mse deploy
...
Deploying your app 'demo-mse-gpt' with 4096M memory and 3.00 CPU cores...
...
💡 You can now test your application:

     curl https://$APP_DOMAIN_NAME/health --cacert $CERT_PATH

Congrats 🎊 you've just deployed your first app on MSE!

Keep the URL and certificate path to perform requests to the MSE app.

💡 You should be able to see your app on https://console.cosmian.com/apps; click on it to view more information, including the URL.

As before, you can query your application using curl:

$ curl -X POST https://$APP_DOMAIN_NAME/generate --cacert $CERT_PATH \
     -H 'Content-Type: application/json' \
     -d '{"query":"User data protection is important for AI applications since"}'

{
    "response": " it protects users' privacy, security and personal information. This includes storing and protecting the data associated with an application so that no unauthorized use can be made of this data. In particular, this type of protection allows for user authentication based on biometric data. The authentication of a user's identity based on their unique fingerprints or"
}

However, using curl is not very practical, so we developed simple Python clients to interact with the app.
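
A minimal version of such a client boils down to a few lines. Here is a sketch using requests; the URL and certificate path are placeholders for the values you kept after deployment:

import requests

APP_URL = "https://your-app.cosmian.app"  # placeholder
CERT_PATH = "./cert.pem"  # placeholder

resp = requests.post(
    f"{APP_URL}/generate",
    json={"query": "User data protection is important for AI applications since"},
    verify=CERT_PATH,
)
print(resp.json()["response"])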

A confidential chat with your AI assistant 😎

We provide a command-line chat client that you can use to ask questions to your application!

mse-example-gpt/clients_example$ python chat.py https://$APP_DOMAIN_NAME

User> What is computer science?
Assistant> 

The definition of the term computer science includes both academic disciplines such as engineering and computer science. It is an interdisciplinary field which attempts to integrate mathematics, information theory, computational algorithms, artificial intelligence, logic programming, distributed computing, computer architecture, cryptography, communication networks, information processing, artificial vision and control systems

User> Summarize it in a sentence
Assistant> .

How many different things can you name? (3, 6)

As we can see, the Pythia model was not fine-tuned for chat usage, so the results are not very good.

We deployed a confidential chat app with the Falcon-7B model fine-tuned by OpenAssistant; you can try it at https://98d154989818fd3a.cosmian.app:

$ python chat.py --prompt https://98d154989818fd3a.cosmian.app

User> What is computer science?
Assistant> Computer science is the scientific study of computer systems. Computer scientists work on the design, development and application of computer hardware, software and networks. Computer scientists analyze computational problems, including efficiency, reliability, correctness and security, to find and develop algorithms and techniques that address them. Computer scientists also work with software development, including software architecture, software engineering and software testing. Computer scientists design and implement databases, operating systems and networks, while also studying the theory of computation and computational complexity.

User> Summarize it in a sentence
Assistant> Computer science is the scientific study of computer systems, focusing on the design, development and application of hardware, software and networks, as well as the analysis of computational problems and the design of software, systems and networks.

Conclusion

Thank you for reading 🙂
We hope that this "how-to" helped you embrace game-changing LLM technologies.

If you're interested in building your own "ChatGPT" with confidence while earning the trust of your board, investors, and customers, we invite you to book a 30-minute demo with our experts. And embark on a journey of innovation, trust, and privacy with Cosmian end-to-end encryption.
