I am working on a project where I need to convert a PDF containing a large table (thousands of rows) into a JSON array of objects. The PDF's table has headers that should be used as the keys of each JSON object, and the respective cell values should be the values. Each row of the table should become one object in the array.
I have tried libraries like pdf-parse and pdfjs-dist from npm, but they didn't extract the data correctly.
What is the best approach to extract the table data from the PDF in the format I need? Should I handle this processing on the frontend in React and then send the resulting JSON to the backend (which is built with Python), or should I send the PDF to the backend and handle the conversion there?
PDF is not a structured data format but a display-oriented one; it is in fact better described as a programming language for a rendering engine.
To render the three words "The lazy fox", the PDF-generating software can choose to emit either:

```
The lazy <move to bottom right> 36 <come back to the page> fox
```

or

```
The nice fox <move back to the start position of "nice"> <draw a white rectangle over the word "nice"> lazy
```

(the first interleaves the page number "36" in the middle of the sentence; the second paints over "nice" so that the rendered text reads "The lazy fox").
Thus the ability to extract content in a structured way from your PDF can vary greatly, depending on what produced it.
Your first mission is to ensure you have only one stable source of PDFs.
Do not expect to build a general-purpose "any PDF containing tables" to JSON converter.
OK, let's say you're fine with that: you just have to get the juice out of that one specific PDF, and once done, you'll shelve the whole project and never touch it again (no "Manu, the engine you gave us in 2025 doesn't work anymore on the 2027 version of the PDF, can you fix it please?").
Your best bet then is to try tools, starting from the simplest.
First try PDF-to-text extractors like pdf-parse (but please give an excerpt of its output!). Don't count on them to produce a pretty table; instead, try to find a pattern in the output. If your output looks like:
```
col1
col2
col3
col1
col2
col3
pagenumber
col1
col2
col3
```
then you're good to go with some loops, parsing, detection and steering.
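Here is a minimal sketch of that kind of loop, assuming pdf-parse's flat text output, a three-column table, and standalone page-number lines as in the dump above. The header names and the filters are placeholders; adapt them to whatever your real output looks like:

```typescript
import { readFileSync } from "fs";
import pdfParse from "pdf-parse";

// Hypothetical header names; hard-code the ones from your real PDF
// or read them from the first lines of the dump.
const HEADERS = ["col1", "col2", "col3"];

async function pdfToJson(path: string): Promise<Record<string, string>[]> {
  const { text } = await pdfParse(readFileSync(path));
  const lines = text
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0)          // drop blank lines
    .filter((l) => !/^\d+$/.test(l))      // drop standalone page numbers
    .filter((l) => !HEADERS.includes(l)); // drop header rows repeated at page breaks
  const rows: Record<string, string>[] = [];
  // One object per group of HEADERS.length consecutive lines.
  for (let i = 0; i + HEADERS.length <= lines.length; i += HEADERS.length) {
    const row: Record<string, string> = {};
    HEADERS.forEach((h, j) => (row[h] = lines[i + j]));
    rows.push(row);
  }
  return rows;
}
```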
Be warned that you may have some manual iterations to do, for example if the table's data is hard to distinguish from the page numbers, headers, or footers, or if the table contains multi-line cells:
```
col1
col2
second line of col2 that you could mistake for a col3
col3
```
Then this becomes a cycle of "parse the PDF to a .txt -> regex to JSON -> verify consistency -> if it fails, edit the .txt -> regex to JSON -> verify -> […]".
This would be the most efficient solution, depending of course on the guts of your particular PDF.
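For the "verify consistency" step, a cheap trick is to check every parsed row against a per-column pattern and report the rows that fail, so you know exactly where to edit the .txt before re-running. The column names and patterns below are purely hypothetical examples:

```typescript
// Hypothetical per-column shape checks; derive these from your real data.
const COLUMN_PATTERNS: Record<string, RegExp> = {
  col1: /^\d+$/,        // e.g. a numeric id
  col2: /^[A-Za-z ]+$/, // e.g. a name
  col3: /^\d+\.\d{2}$/, // e.g. an amount
};

// Returns the indices of rows that fail any column check.
function findBadRows(rows: Record<string, string>[]): number[] {
  const bad: number[] = [];
  rows.forEach((row, i) => {
    for (const [key, pattern] of Object.entries(COLUMN_PATTERNS)) {
      if (!(key in row) || !pattern.test(row[key])) {
        bad.push(i); // inspect this row, fix the .txt, re-run the cycle
        break;
      }
    }
  });
  return bad;
}
```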
Level 2 would be to parse the PDF drawing instructions themselves (pdfjs-dist may be good at this) to detect the "pen moves" between text tokens, and then place each token on a map, knowing that tokens at the same ordinate (y) with successive abscissas (x) are adjacent words, or cells.
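A sketch of that idea, assuming pdfjs-dist's getTextContent() API, where transform[4] and transform[5] of each text item are its x and y coordinates. Real tables will need fuzzier row matching than the exact rounding used here, and the import path varies by pdfjs-dist version:

```typescript
import { readFileSync } from "fs";
// Under Node, recent pdfjs-dist versions may require the legacy build,
// e.g. "pdfjs-dist/legacy/build/pdf.mjs".
import { getDocument } from "pdfjs-dist";

// Group text tokens by their y coordinate to reconstruct table rows.
async function extractRows(path: string): Promise<string[][]> {
  const data = new Uint8Array(readFileSync(path));
  const pdf = await getDocument({ data }).promise;
  const rows: string[][] = [];
  for (let p = 1; p <= pdf.numPages; p++) {
    const page = await pdf.getPage(p);
    const content = await page.getTextContent();
    // Bucket tokens by rounded y (transform[5]); tokens sharing a
    // baseline belong to the same visual row.
    const byY = new Map<number, { x: number; str: string }[]>();
    for (const item of content.items) {
      if (!("str" in item) || !item.str.trim()) continue;
      const x = item.transform[4];
      const y = Math.round(item.transform[5]);
      if (!byY.has(y)) byY.set(y, []);
      byY.get(y)!.push({ x, str: item.str });
    }
    // PDF y grows upward, so sort descending to read top-to-bottom,
    // then sort each row's cells left-to-right.
    for (const y of [...byY.keys()].sort((a, b) => b - a)) {
      rows.push(byY.get(y)!.sort((a, b) => a.x - b.x).map((t) => t.str));
    }
  }
  return rows;
}
```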
But I'm not sure it's worth the effort, and then you could go to Level 3…
In case you need a fully automated workflow that Level 1 can't provide (for your specific PDF), you could use pdfjs-dist to render the PDF to images, pushing those images through table-aware OCR software that outputs something more suitable for the "regex to JSON" last step of Level 1.
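For the rendering half of that pipeline, here is a sketch assuming pdfjs-dist paired with the canvas npm package under Node; the import path, the pairing with node-canvas, and the 2x scale are all assumptions (pick whatever resolution your OCR tool likes):

```typescript
import { writeFileSync } from "fs";
import { createCanvas } from "canvas";
// As above, Node may need the legacy build of pdfjs-dist.
import { getDocument } from "pdfjs-dist";

// Render each page to a PNG that a table-aware OCR tool can consume.
async function renderPages(data: Uint8Array): Promise<void> {
  const pdf = await getDocument({ data }).promise;
  for (let p = 1; p <= pdf.numPages; p++) {
    const page = await pdf.getPage(p);
    const viewport = page.getViewport({ scale: 2.0 }); // 2x for OCR accuracy
    const canvas = createCanvas(viewport.width, viewport.height);
    // node-canvas's context is cast because pdfjs expects a DOM
    // CanvasRenderingContext2D type.
    const ctx = canvas.getContext("2d") as unknown as CanvasRenderingContext2D;
    await page.render({ canvasContext: ctx, viewport }).promise;
    writeFileSync(`page-${p}.png`, canvas.toBuffer("image/png"));
  }
}
```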