The tech world is buzzing with excitement over the meteoric rise of ChatGPT and other Large Language Models (LLMs), making them a focal point of the year’s most significant tech stories. These models have astounded us with their remarkable capabilities, but as they dominate headlines, privacy has emerged as a pressing concern. Wired and other specialized press outlets have amplified these concerns, citing notable cases like Italy’s ChatGPT ban over privacy issues and Amazon cautioning its employees against leaking corporate secrets.

To fully capitalize on the immense potential of LLM technologies, corporations must regain control by developing their own chatbots. Similarly, SaaS providers, as they integrate conversational AI into their offerings, need to rebuild end-users’ trust through a strong commitment to security and confidentiality. This is especially crucial in Cloud environments, where queries to these models may involve highly sensitive information such as internal documentation, confidential secrets, and private correspondence.

Cosmian offers a solution to tackle these issues. In this blog post, we will leverage Cosmian’s open-source technologies to run your own HuggingFace open-source models confidentially in the Cloud. All computations are conducted within secure enclaves, ensuring data never appears in clear text. Only the end-user can decipher the answers, providing unparalleled privacy and control.

Running LLMs on Intel SGX 🗜️

One of the main challenges in serving a confidential LLM in the Cloud is the need to keep data encrypted at all times, including during computation, a phase vulnerable to memory dumps. We tackle this issue using Intel SGX. Since SGX protection is currently limited to CPUs (it has yet to come to GPUs), this means running models with billions of parameters on an SGX-protected CPU.

Fortunately, we found a really cool open-source project named GGML!
This repository provides multiple tools to efficiently run LLMs on a CPU, including:

  • Aggressive quantization methods, such as 5-bit and 4-bit integers.
  • Massively parallel computing on CPU through AVX.
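
To give an intuition of how block-wise integer quantization works, here is a minimal pure-Python sketch. It is a simplification of GGML’s actual formats, which pack values and per-block scales into a binary layout; the function names and the example weights are purely illustrative:

```python
def quantize(block, bits=4):
    """Map a block of floats to `bits`-bit signed integers plus one scale."""
    max_int = 2 ** (bits - 1) - 1
    # One scale per block, mapping the largest magnitude to the integer range
    scale = max(abs(x) for x in block) / max_int or 1.0
    q = [max(-max_int - 1, min(max_int, round(x / scale))) for x in block]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized values."""
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.75, 0.1]
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)
error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"max error: {error:.3f}")
```

Each weight is stored in 4 bits plus a shared per-block scale, and the reconstruction error stays bounded by half a quantization step.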

We did some testing with different quantization parameters:

Model      Quantization  Model size  Context size  Eval time per token  Perplexity
Falcon-7B  float16       14G         14G           121 ms               10.3582
Falcon-7B  8 bits        8G          8G            82 ms                10.3675
Falcon-7B  5 bits        5G          5G            74 ms                10.9167
Falcon-7B  4 bits        4G          4G            68 ms                12.4256

Quantizing significantly reduces the disk space and memory required during computation. It does impact model quality, which we measure using the perplexity computed on wikitext-2-raw (lower is better). We found that 5-bit quantization was a good compromise between size, speed, and quality.
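
The size reductions above follow directly from the bits-per-weight arithmetic. This back-of-the-envelope sketch covers weights only; real GGML files add per-block scales and metadata, which is why the measured sizes in the table are slightly larger:

```python
def weight_storage_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the model weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B-parameter model at various quantization levels
for bits in (16, 8, 5, 4):
    print(f"Falcon-7B at {bits:>2} bits ~ {weight_storage_gb(7e9, bits):.1f} GB")
```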

We then benchmarked the inference time of popular models with 5-bit quantization inside an SGX enclave. All benchmarks were run on an Intel(R) Xeon(R) Gold 6312U CPU @ 2.40GHz using 48 threads.

Model         Model size  Context size  Eval time per token
MPT-7B        4G          4.5G          85 ms
Falcon-7B     5G          5G            102 ms
Pythia-12B    7.5G        14G           115 ms
GPT-NeoX-20B  13G         24G           172 ms
MPT-30B       19G         20G           238 ms
Falcon-40B    27G         27G           352 ms

The results were promising: without further optimization, inference on encrypted hardware was only about 35% slower.

Consequently, we developed an MSE app to serve a language model for inference.

Creating the Microservice Encryption (MSE) application 👷‍♂️

Cosmian Microservice Encryption (MSE) makes it easy to deploy confidential web applications written in Python. The code runs within secure enclaves powered by Intel SGX, protecting data and metadata against the underlying cloud provider that owns the hardware infrastructure.
The full code of the application created for this blog post is open source and available at

MSE apps are built like regular Flask APIs in Python. We used the ctransformers library to load GGML models from Python.

import os
from base64 import b64decode
from http import HTTPStatus
from json import dumps
from pathlib import Path

from ctransformers import AutoModelForCausalLM
from flask import Flask, Response, jsonify, request

app = Flask(__name__)

# Maximum number of tokens to generate per answer (illustrative value)
MAX_RESPONSE_SIZE = 256

# The model is stored in the current working directory (./mse_src)
# More information:
CWD_PATH = Path(os.getenv("MODULE_PATH")).resolve()

llm: AutoModelForCausalLM


def init():
    """Initialize the model before handling any requests.

    Here the model is loaded from disk, but it could be downloaded
    from a secure source.
    """
    global llm
    model_path = str(CWD_PATH / "ggml-model-q4_0.bin")
    try:
        llm = AutoModelForCausalLM.from_pretrained(model_path, model_type="gpt-neox")
    except ValueError as e:
        print(f"Model initialization error: {e}")

Then, we can create the /generate endpoint to generate text from a user query.

@app.route("/generate", methods=["POST"])
def generate():
    """Route for generating a response based on a query."""
    query = request.json.get("query")
    if not query:
        return Response(status=HTTPStatus.BAD_REQUEST)

    # Generate a response using the model
    res = llm(query, seed=123, threads=3, max_new_tokens=MAX_RESPONSE_SIZE)

    return jsonify({"response": res})


Finally, we can create a more interactive experience by streaming the response through server-sent events and injecting the conversation history back into the prompt.

@app.route("/chat", methods=["GET"])
def chat():
    """Route for generating a streamed response based on a prompt
    containing a query and chat history."""
    b64_prompt = request.args.get("prompt")
    if not b64_prompt:
        return Response(status=HTTPStatus.BAD_REQUEST)

    # Truncate the context to leave space for the answer
    prompt = b64decode(b64_prompt).decode("utf-8")
    max_context_size = llm.context_length - MAX_RESPONSE_SIZE
    context_tokens = llm.tokenize(prompt)[-max_context_size:]

    def stream_response():
        msg_id = 0
        # Stream model tokens as they are being generated
        for token in llm.generate(context_tokens, seed=123, threads=3):
            msg_id += 1
            msg_str = dumps(llm.detokenize(token))

            yield f"id: {msg_id}\nevent: data\ndata: {msg_str}\n\n"

            if msg_id == MAX_RESPONSE_SIZE:
                break

        # End the stream
        yield f"id: {msg_id + 1}\nevent: end\ndata: {{}}\n\n"

    # Create the SSE response
    return Response(stream_response(), mimetype="text/event-stream")
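
On the client side, the prompt must be base64-encoded into the `prompt` query parameter, and the SSE frames emitted by `stream_response()` must be parsed back into tokens. Here is a minimal stdlib-only sketch; the chat route path, the prompt template, and the helper names are illustrative assumptions, not the repository’s client code:

```python
from base64 import b64encode
from urllib.parse import urlencode

def build_chat_url(base_url, history, query):
    """Base64-encode the chat history plus the new query into the `prompt`
    query parameter (the "User:"/"Assistant:" template is an assumption)."""
    prompt = "".join(f"User: {q}\nAssistant: {a}\n" for q, a in history)
    prompt += f"User: {query}\nAssistant:"
    b64 = b64encode(prompt.encode("utf-8")).decode("ascii")
    return f"{base_url}/chat?{urlencode({'prompt': b64})}"

def parse_sse(lines):
    """Minimal parser for the id/event/data frames the server streams."""
    event = {}
    for line in lines:
        if not line:  # a blank line terminates an event
            if event:
                yield event
            event = {}
            continue
        field, _, value = line.partition(": ")
        event[field] = value
    if event:
        yield event
```

Feeding the decoded response lines of a streaming HTTP GET on this URL to `parse_sse` yields one event per token; the client stops when the `event` field is `end`, and each `data` field is a JSON-encoded token string.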

Deploying your application 🚀


We recommend cloning the example repository to follow the instructions smoothly.
Ensure you have git-lfs installed to download the model (EleutherAI/pythia-1b) from the repo.
To use custom models, please read this. You might need to upgrade the hardware configuration to run models with more than 1B parameters.

Now that you have all the necessary files, you are ready to deploy your code on MSE!
Create an account on, then install the mse-cli:

# install
$ pip install mse-cli
# login
$ mse login

Local testing

Before deploying, you can test the application locally (you need to have docker installed):

mse-example-gpt$ mse test
Starting the docker:
[2023-06-28 14:02:01 +0000] [15] [INFO] Running on (CTRL + C to quit)

The app is running, now a quick test:

$ curl -X POST http://localhost:5000/generate \
     -H 'Content-Type: application/json' \
     -d '{"query":"User data protection is important for AI applications since"}'

    "response": " it protects users' privacy, security and personal information. This includes storing and protecting the data associated with an application so that no unauthorized use can be made of this data. In particular, this type of protection allows for user authentication based on biometric data. The authentication of a user's identity based on their unique fingerprints or"

Deployment on MSE

mse-example-gpt$ mse deploy
Deploying your app 'demo-mse-gpt' with 4096M memory and 3.00 CPU cores...
💡 You can now test your application: 

     curl https://$APP_DOMAIN_NAME/health --cacert $CERT_PATH

Congrats 🎊 you’ve just deployed your first app on MSE!

Keep the URL and certificate path to perform requests to the MSE app.

💡 You should be able to see your app on; click on it to view more information about your app, including the URL.

As before, you can query your application using curl:

$ curl -X POST https://$APP_DOMAIN_NAME/generate --cacert $CERT_PATH \
     -H 'Content-Type: application/json' \
     -d '{"query":"User data protection is important for AI applications since"}'

    "response": " it protects users' privacy, security and personal information. This includes storing and protecting the data associated with an application so that no unauthorized use can be made of this data. In particular, this type of protection allows for user authentication based on biometric data. The authentication of a user's identity based on their unique fingerprints or"

However, using curl is not very practical, so we developed simple Python clients to interact with the app.
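
The clients in the repository are the reference, but a minimal stdlib-only equivalent for the /generate endpoint could look like this sketch (the function names are our own, not the repository’s):

```python
import json
import ssl
from urllib.request import Request, urlopen

def build_generate_request(base_url: str, query: str) -> Request:
    """Build the POST request consumed by the /generate endpoint."""
    return Request(
        f"{base_url}/generate",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(base_url: str, query: str, cert_path: str) -> str:
    """Send the query, verifying TLS against the certificate returned
    at deployment (the `--cacert` file in the curl examples)."""
    ctx = ssl.create_default_context(cafile=cert_path)
    with urlopen(build_generate_request(base_url, query), context=ctx) as resp:
        return json.load(resp)["response"]
```

Pinning the request to the deployment certificate is what ensures the client only ever talks to the attested enclave, not to an ordinary web server.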

A confidential chat with your AI assistant 😎

We provide a command-line chat client that you can use to ask questions to your application!

mse-example-gpt/clients_example$ python https://$APP_DOMAIN_NAME

User> What is computer science?

Assistant> The definition of the term computer science includes both academic disciplines such as engineering and computer science. It is an interdisciplinary field which attempts to integrate mathematics, information theory, computational algorithms, artificial intelligence, logic programming, distributed computing, computer architecture, cryptography, communication networks, information processing, artificial vision and control systems

User> Summarize it in a sentence
Assistant> .

How many different things can you name? (3, 6)

As we can see, the Pythia model was not fine-tuned for chat usage, so the results are not very good.

We deployed a confidential chat app with the Falcon-7B model fine-tuned by OpenAssistant; you can try it with the URL

$ python --prompt

User> What is computer science?
Assistant> Computer science is the scientific study of computer systems. Computer scientists work on the design, development and application of computer hardware, software and networks. Computer scientists analyze computational problems, including efficiency, reliability, correctness and security, to find and develop algorithms and techniques that address them. Computer scientists also work with software development, including software architecture, software engineering and software testing. Computer scientists design and implement databases, operating systems and networks, while also studying the theory of computation and computational complexity.

User> Summarize it in a sentence
Assistant> Computer science is the scientific study of computer systems, focusing on the design, development and application of hardware, software and networks, as well as the analysis of computational problems and the design of software, systems and networks.


Thank you for reading 🙂
We hope that this “how-to” helped you embrace game-changing LLM technologies.

If you’re interested in building your own “ChatGPT” with confidence while earning the trust of your board, investors, and customers, we invite you to book a 30-minute demo with our experts, and embark on a journey of innovation, trust, and privacy with Cosmian end-to-end encryption.

