We have been tasked with taking the following folder full of PDF files and extracting each file’s table data and combining the data into a single Excel table.
Loading the Data Into Power Query
We begin this adventure by starting Excel and creating a new, blank workbook.
Click Data (tab) -> Get & Transform (group) -> Get Data -> From File -> From Folder.
Navigate to the PDF folder’s location and click Open.
From here, we see the metadata for each file in the selected folder. Click the Transform Data button to load the metadata into Power Query.
Power Query displays the metadata in the Power Query Editor window.
Featured Course
Master Excel Power Query – Beginner to Pro
Extracting the PDF File Contents
We don’t want the file metadata; we want the contents of each file.
To extract each of the PDF file contents, click the Expand Binaries button in the top-right of the Content column.
In the Combine Binaries dialog box, we see a listing of all tables and pages in the first PDF.
As each file has but one page with a single table, selecting either the “Table001” or “Page001” options will likely yield the same result.
If you have multiple tables on a single page, and you want all tables, selecting “Page001” may prove more effective than selecting each table separately.
Selecting tables is usually preferred as it avoids unwanted PDF content, like paragraph-style text, and more properly identifies table characteristics, like headers.
Click OK to begin the combining process.
Processing the PDF Files
Power Query will automatically create a series of queries, parameters, custom functions, and transformation steps to iterate through each PDF in the folder. This iterative process will extract and combine the data from each PDF table into a single table.
Keeping Track of the Data’s Place of Origin
Before we load the results into Excel, we want to ensure that we can properly identify the region from which each transaction originated.
The file’s name contains the region name at the end of the filename.
We will use a transformation tool to extract the text immediately following the last underscore character and before the period just before the file’s extension.
Click Transform (tab) -> Text Column (group) -> Extract -> Text Between Delimiters.
In the Text Between Delimiters dialog box, select an underscore as the Start Delimiter and a period as the End Delimiter.
Because there are multiple underscores in the filenames, we need to search for the underscore character starting from the end of the filename. Set the Scan for the Start Delimiter option to “From the end of the input”.
Click OK and rename the column to “Store”.
Click Home (tab) -> Close (group) -> Close & Load to send the results to an Excel table on a new sheet.
We see that we have combined 351 rows of fragmented data into a single table.
Featured Course
Power Excel Bundle
Adding PDF Data to the Folder
If we were to receive periodic updates of new information in the form of additional PDF files, we only perform two steps to bring the data into the output table.
- Add the new PDF to the folder of existing PDF files.
- Refresh the output table in Excel.
Updating the Query With Additional Transformations
If we need to update the query to perform additional transformations, like combining the First Name and Last Name fields into a single Full Name field, we merely edit the query and add the step.
- Open the query in the Power Query Editor.
- Select the FirstName and LastName columns (in that order).
- Click Transform (tab) -> Text Column (group) -> Merge Columns.
- In the Merge Columns dialog box, select “Space” as the Separator and name the new column “Full Name”.
- Click OK to apply the transformation.
- Click Close & Load to send the updated results back to Excel.
Dealing with Non-PDF Files
Suppose someone unintentionally (we’re assuming no one is trying to sabotage us) places a Microsoft Word document in our folder of PDF files.
When we refresh the data, we will be met with an unwanted response.
If we reopen the query in the Power Query Editor, selecting the query and clicking the Refresh button, the SOURCE step shows that a new file has been detected in the folder. Unfortunately, it is not a PDF file.
We need to ensure only PDF files are selected from this folder.
To do this, ensure the SOURCE step is selected and filter the Extension column to only select “.pdf” files.
Clicking Close & Load to send the results back to Excel shows that all is well with the world.
Non-PDF Files Can “Slip Through the Cracks”
Even though we applied a filter to eliminate the non-PDF files from the folder examination, the process only worked because we got a bit lucky.
Remember how the Microsoft Word document appeared at the bottom of the metadata list?
If we add a Microsoft Excel file…
Refreshing the query shows the newly added filed at the SOURCE step.
The new Excel file appears at the top of the list because of the alphabetical sorting by filename.
Of course, we added a step to filter out all non-PDF files.
Unbeknownst to us, another query that was created for us named “Sample File” is selecting the first file in the metadata list of files for examination and processing.
When the query was created, the first file in the list was a proper PDF file.
Now the metadata list shows the Excel file as the first file.
The “Sample File” query performed its folder scan, independent of the original query’s folder scan.
This causes the “Transform Sample File” query to fail because it is attempting to perform PDF transformations on an Excel file.
Correcting the Error in the Sample File
To correct this issue, we need to apply the same filter step to the “Sample File” as we did to the main query.
Select the SOURCE step of the “Sample File” query and perform the same filter operation as earlier.
Close & Load the updated queries back into Excel.
All is well with the world, once again.
Leila Gharani
I'm a 6x Microsoft MVP with over 15 years of experience implementing and professionals on Management Information Systems of different sizes and nature.
My background is Masters in Economics, Economist, Consultant, Oracle HFM Accounting Systems Expert, SAP BW Project Manager. My passion is teaching, experimenting and sharing. I am also addicted to learning and enjoy taking online courses on a variety of topics.