Translate PDF Text to New Language with ChatGPT

Translating PDF documents into different languages allows businesses to communicate with their customers and partners in their preferred language. For example, a business can translate its customer support documentation into different languages so that its customers can easily find the information they need.

PDF text translation is especially important in countries where businesses are required to provide certain documents, such as contracts and invoices, in the local language. Translating PDF documents into different languages allows businesses to comply with these legal requirements.

Businesses can also use translated PDF documents to reach new markets and increase sales. For example, a business can translate its marketing materials into different languages to target new customers in other countries.

Let's step through an example of translating extracted text from a PDF file using pdfRest and OpenAI's ChatGPT.

Environment

For convenience, we will set up an environment running Jupyter. One way to do that is to create a Python environment and activate it:

python -m venv .venv
. ./.venv/bin/activate

Then install Jupyter.

python -m pip install jupyter

You'll also need to install the other Python packages required by these sample notebooks. Those are in a file called requirements.txt, available in our GitHub repository.

python -m pip install -r requirements.txt

Run Jupyter, opening this notebook.

jupyter notebook extract-and-translate.ipynb

API keys

You'll need to sign up for two API keys in order to use this example: one from OpenAI and one from pdfRest.

Create a file called .env in the same directory as this notebook, and place the keys into it, like this:

OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
PDFREST_API_KEY=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

We start by importing the Python modules we need and loading the API keys from the .env file.

import os
from pathlib import Path

import openai
import requests
from dotenv import load_dotenv
from IPython.display import display_markdown
from requests_toolbelt import MultipartEncoder

load_dotenv()

openai.api_key = os.getenv("OPENAI_API_KEY")
pdfrest_api_key = os.getenv("PDFREST_API_KEY")

REQUEST_TIMEOUT = 30
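
Note that os.getenv() returns None when a variable is missing, so a typo in .env would only surface later as an authentication error. A minimal sketch of an early check, assuming a hypothetical helper named require_env() that is not part of the original notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear
    error if it is missing or empty.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

With a helper like this, the key lookups could be written as require_env("PDFREST_API_KEY") so that a missing key fails immediately with a readable message.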

Extracting the text

Below, we'll define a function that extracts the text from a PDF document represented by a path on disk. It will get the full text by page, returning the JSON data from the endpoint.

def extract_text(document: Path) -> dict:
    """Extract text on a page-by-page basis from a document, and
    return the extracted text"""

    extract_endpoint_url = "https://api.pdfrest.com/extracted-text"

    # Define the file to upload, and request full text on a per-page basis.
    # The file stays open until the request has been sent, because
    # MultipartEncoder streams its contents during the POST.
    with document.open(mode="rb") as pdf_file:
        request_data = [
            ("file", (document.name, pdf_file, "application/pdf")),
            ("full_text", "by_page"),
        ]

        mp_encoder_upload = MultipartEncoder(fields=request_data)

        # Let's set the headers that the upload endpoint expects.
        # Since MultipartEncoder is used, the 'Content-Type' header gets set to
        # 'multipart/form-data' via the content_type attribute below.
        headers = {
            "Accept": "application/json",
            "Content-Type": mp_encoder_upload.content_type,
            "Api-Key": pdfrest_api_key,
        }

        print("Sending POST request to extract text endpoint...")
        response = requests.post(
            extract_endpoint_url,
            data=mp_encoder_upload,
            headers=headers,
            timeout=REQUEST_TIMEOUT,
        )

    # Print the response status code and raise an exception if the request fails
    print("Response status code: " + str(response.status_code))
    response.raise_for_status()

    return response.json()

TranslationChatbot

Let's define a chatbot whose main purpose is translation. This is a Python class, which makes a persistent object that can be used for a continuing conversation.

We start with a system instruction. The system instruction indicates to OpenAI what the purpose of the conversation is, what role it should take, and any additional instructions.

When translating, we also prepend the material to be translated with an instruction to translate to English.

Each interaction is recorded in self.messages, which contains content and a role:

system means that the content is a system instruction. System instructions are usually present at the start of a conversation, but are typically not presented to the user, for instance, in ChatGPT.
user means that the content is part of the conversation that was uttered by the user.
assistant means that the content is a reply from the AI.
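
To make the three roles concrete, here is a small sketch of what a conversation history might look like as plain Python data. The build_messages() helper and the sample turns (including the assistant reply) are hand-written for illustration and are not part of the original notebook:

```python
def build_messages(system_instruction: str, turns: list) -> list:
    """Build a chat history: one system message followed by (role, content) turns."""
    allowed_roles = {"user", "assistant"}
    messages = [{"role": "system", "content": system_instruction}]
    for role, content in turns:
        if role not in allowed_roles:
            raise ValueError(f"unexpected role: {role!r}")
        messages.append({"role": role, "content": content})
    return messages


# Illustrative conversation; the assistant reply is written by hand,
# not produced by a model.
history = build_messages(
    "You are a helpful translator.",
    [
        ("user", "Please translate the following to English:\n\nΚαλημέρα"),
        ("assistant", "Good morning"),
        ("user", "Is that greeting formal or informal?"),
    ],
)
for message in history:
    print(f"{message['role']}: {message['content']!r}")
```

The whole list is sent with every request, which is how the model can answer follow-up questions about earlier turns.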

This class makes it easy to have a conversation with GPT-4. We call translate_text() to supply text to be translated, and chat() if we want to continue the conversation.

class TranslationChatbot:
    """A chatbot that specializes in translation, but can have a continuing conversation."""

    SYSTEM_INSTRUCTION = """
    You are a helpful translator. Given an input text, translate
    it to the requested language. If there are any ambiguities,
    or things that couldn't be translated, please
    mention them after the translation.

    The output can use Markdown for formatting.
    """

    TRANSLATION_INSTRUCTION = """
    Please translate the following to English:

    """

    def __init__(self):
        self.messages = [
            {"content": self.SYSTEM_INSTRUCTION, "role": "system"},
        ]

    def get_openai_response(self, new_message):
        """Request chat completion from OpenAI, and update the messages with the reply. Returns the response from OpenAI."""
        self.messages.append(new_message)
        response = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=self.messages,
        )
        message = response["choices"][0]["message"]
        self.messages.append(message)
        return response

    def translate_text(self, text: str) -> str:
        """Translate text, and return OpenAI's reply."""

        response = self.get_openai_response(
            {"content": f"{self.TRANSLATION_INSTRUCTION}{text}", "role": "user"}
        )
        message = response["choices"][0]["message"]
        return message["content"]

    def converse(self, text: str) -> str:
        """Add a message to the conversation, and return OpenAI's reply."""
        response = self.get_openai_response({"content": text, "role": "user"})
        message = response["choices"][0]["message"]
        return message["content"]

    def chat(self, text: str) -> None:
        """A simple method for chatting. OpenAI returns results formatted with Markdown,
        which may contain text styling and lists, so the reply is displayed rather than returned."""
        display_markdown(self.converse(text), raw=True)
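
The class above hard-codes English as the target language in TRANSLATION_INSTRUCTION. If you want other target languages, one option is to build the instruction from a parameter. The sketch below assumes a hypothetical build_translation_request() helper that is not part of the original class:

```python
def build_translation_request(text: str, target_language: str = "English") -> dict:
    """Return a user message asking for a translation into target_language.

    Hypothetical helper; the original class hard-codes English instead.
    """
    instruction = f"Please translate the following to {target_language}:\n\n"
    return {"role": "user", "content": f"{instruction}{text}"}


request = build_translation_request("Good morning", target_language="Greek")
print(request["content"])
```

A message built this way has the same shape as the one translate_text() passes to get_openai_response(), so it could be adapted into the class with a small change.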

Extract the text

Here, we simply call extract_text() with the path to the input document. In this case, the PDF file contains Article 1 of the Universal Declaration of Human Rights in Greek.

After that, we get the text of the first page. As the code shows, the fullText dictionary contains a pages array with one entry per page. The code takes the first entry, at index 0, and retrieves its text.

extracted_text = extract_text(Path("pdf/UDHR_Article_1_Greek.pdf"))
page_1_text = extracted_text["fullText"]["pages"][0]["text"]

Sending POST request to extract text endpoint...
Response status code: 200
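
For documents with more than one page, the same access pattern can be applied to every entry in pages. The following is a minimal sketch that assumes only the shape used above (fullText, pages, text); the page_texts() helper and the sample dictionary are hand-written stand-ins, and a real response may carry additional fields:

```python
def page_texts(extracted: dict) -> list:
    """Return the text of every page, in order, from an extract_text() result.

    Assumes only the shape used above: extracted["fullText"]["pages"][i]["text"].
    """
    return [page["text"] for page in extracted["fullText"]["pages"]]


# Hand-written stand-in for a response, used so the sketch runs offline.
sample = {
    "fullText": {
        "pages": [
            {"text": "First page"},
            {"text": "Second page"},
        ]
    }
}
print(page_texts(sample))
```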

Using the TranslationChatbot

Create a TranslationChatbot and ask it to translate the text of the page.

The chatbot retains the history of the conversation, so that we can make further inquiries about the text that was translated.

Since this code is running in the context of a Jupyter notebook, we use display_markdown() to print styled output. GPT-4 also provides Markdown-formatted content, so if the response has any lists or tables in it, they will render nicely.

chatbot = TranslationChatbot()
display_markdown(f"**Text before translation:** {page_1_text}", raw=True)
translated_text = chatbot.translate_text(page_1_text)
display_markdown(f"**Text after translation:** {translated_text}", raw=True)

Text before translation: ΑΡΘΡΟ 1 ' Ολοι οι άνθρωποι γεννιούνται ελεύθεροι και ίσοι στην αξιοπρέπεια και τα δικαιώματα. Είναι προικισμένοι με λογική και συνείδηση, και οφείλουν να συμπεριφέρονται μεταξύ τους με πνεύμα αδελφοσύνης.

Text after translation: ARTICLE 1: All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
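
The example above translates a single page. A rough sketch of extending it to a list of page texts follows. It takes the translation step as a callable, so it can be tried without an API call; translate_pages() and fake_translate() are hypothetical names introduced here for illustration:

```python
from typing import Callable


def translate_pages(pages: list, translate: Callable[[str], str]) -> list:
    """Translate each page with the supplied callable.

    Blank pages are passed through unchanged so no request is made for them.
    """
    results = []
    for text in pages:
        results.append(translate(text) if text.strip() else text)
    return results


# Stand-in translator so the sketch runs without calling OpenAI.
def fake_translate(text: str) -> str:
    return f"[translated] {text}"


print(translate_pages(["Σελίδα ένα", "", "Σελίδα δύο"], fake_translate))
```

In the notebook, the callable could be chatbot.translate_text. Keep in mind that a single TranslationChatbot appends every page to its history, so for long documents it may be worth creating a fresh chatbot per page to keep each request small.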

Conclusion

Translating your PDF documents into different languages can help you reach new markets, improve customer service, and comply with legal requirements. The pdfRest Extract Text API Tool pairs well with OpenAI's ChatGPT API for translating PDF document text into new languages. Give the above example a try, and let us know if there's anything we can do to help.
