Using Python to ingest PDFs

Add PDFs or DOCX files to your dataset in Python

To add a PDF or DOCX file, you'll first need to read the file and convert its content to text. You can use the pdfplumber library for PDF files and the python-docx library for DOCX files.

First, install the required libraries (the example below also assumes the Embedbase Python client is already installed):

pip install pdfplumber python-docx

Then, you can use the following code to read a PDF or DOCX file and add its content to your dataset:

import pdfplumber
from docx import Document
from embedbase_client.client import EmbedbaseClient
 
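def read_pdf(file_path):
    # Concatenate the text of every page in the PDF.
    with pdfplumber.open(file_path) as pdf:
        content = ""
        for page in pdf.pages:
            # extract_text() may return None for pages without a text layer
            # (e.g. scanned images), so fall back to an empty string.
            content += (page.extract_text() or "") + "\n"
    return content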
 
def read_docx(file_path):
    doc = Document(file_path)
    content = ""
    for paragraph in doc.paragraphs:
        content += paragraph.text + "\n"
    return content
 
embedbase = EmbedbaseClient('https://api.embedbase.xyz', '<grab me here https://app.embedbase.xyz/>')
 
pdf_file_path = "path/to/your/pdf_file.pdf"
docx_file_path = "path/to/your/docx_file.docx"
 
pdf_content = read_pdf(pdf_file_path)
docx_content = read_docx(docx_file_path)
 
dataset_id = 'document-content'
data = embedbase.dataset(dataset_id).chunk_and_batch_add([{'data': pdf_content}, {'data': docx_content}])
print(data)

Replace path/to/your/pdf_file.pdf and path/to/your/docx_file.docx with the actual file paths of your PDF and DOCX files.
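
If you have more than two files, it may be easier to pick the reader from the file extension instead of calling each function by hand. The following is a minimal sketch of that idea, not part of the Embedbase client: the helper name build_documents and its readers argument are hypothetical. It only builds the list of {'data': ...} dictionaries in the shape passed to chunk_and_batch_add above, and skips files with an unregistered extension or no extractable text.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def build_documents(
    file_paths: Iterable[str],
    readers: Dict[str, Callable[[str], str]],
) -> List[dict]:
    """Read each file with the reader registered for its extension.

    `readers` maps a lowercase extension (such as ".pdf") to a function
    that takes a file path and returns the file's text. This helper is a
    hypothetical illustration, not an Embedbase API.
    """
    documents = []
    for file_path in file_paths:
        extension = Path(file_path).suffix.lower()
        reader = readers.get(extension)
        if reader is None:
            continue  # no reader registered for this file type
        text = reader(file_path).strip()
        if text:  # skip files that produced no text
            documents.append({"data": text})
    return documents
```

With the functions defined earlier, you could call build_documents([pdf_file_path, docx_file_path], {".pdf": read_pdf, ".docx": read_docx}) and pass the result to chunk_and_batch_add.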