Using PdfParser without Composer #117

Closed
apmuthu opened this issue Sep 26, 2016 · 33 comments · Fixed by #388

Comments

@apmuthu

apmuthu commented Sep 26, 2016

Alternative Autoloader built in

Since v0.18.2 you don't need to do the following steps to use PDFParser without Composer. Please check https://github.com/smalot/pdfparser#install for further information on our alternative autoloader.


❗ Outdated

Last checked in 2020

Updated file: vendor-autoload.zip - See #117 (comment)

Composer generates ../vendor/autoload.php, which we include in our scripts to get access to PdfParser. If we wish to freeze our install and manage it without Composer, we can create this file by hand with the following content:

<?php
/**
 * this file acts as vendor/autoload.php
 */

/*
Using PDFParser without Composer
Folder structure
================
webroot
  pdfdemos
    INV001.pdf # test PDF file to extract text from for demo
    test.php # our operational demo file
  vendor
    autoload.php
    smalot
      pdfparser # unpack from git master https://github.com/smalot/pdfparser/archive/master.zip release is 0.9.25 dated 2015-09-15
        docs # optional
        samples # optional
        src
          Smalot
            PdfParser
*/

$prerequisites = array();

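/**
 * TODO: ADAPT THIS PATH TO pdfparser
 * It must point at the directory that contains Parser.php, which in the
 * folder structure above is vendor/smalot/pdfparser/src/Smalot/PdfParser
 */
$pdfparser = '/host/path/to/pdfparser';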

$prerequisites['pdfparser'] = array (
    $pdfparser.'/Config.php',
    $pdfparser.'/Parser.php',
    $pdfparser.'/Document.php',
    $pdfparser.'/Header.php',
    $pdfparser.'/PDFObject.php',
    $pdfparser.'/Element.php',
    $pdfparser.'/Encoding.php',
    $pdfparser.'/Font.php',
    $pdfparser.'/Page.php',
    $pdfparser.'/Pages.php',
    $pdfparser.'/Element/ElementArray.php',
    $pdfparser.'/Element/ElementBoolean.php',
    $pdfparser.'/Element/ElementString.php',
    $pdfparser.'/Element/ElementDate.php',
    $pdfparser.'/Element/ElementHexa.php',
    $pdfparser.'/Element/ElementMissing.php',
    $pdfparser.'/Element/ElementName.php',
    $pdfparser.'/Element/ElementNull.php',
    $pdfparser.'/Element/ElementNumeric.php',
    $pdfparser.'/Element/ElementStruct.php',
    $pdfparser.'/Element/ElementXRef.php',
    $pdfparser.'/Encoding/StandardEncoding.php',
    $pdfparser.'/Encoding/ISOLatin1Encoding.php',
    $pdfparser.'/Encoding/ISOLatin9Encoding.php',
    $pdfparser.'/Encoding/MacRomanEncoding.php',
    $pdfparser.'/Encoding/WinAnsiEncoding.php',
    $pdfparser.'/Font/FontCIDFontType0.php',
    $pdfparser.'/Font/FontCIDFontType2.php',
    $pdfparser.'/Font/FontTrueType.php',
    $pdfparser.'/Font/FontType0.php',
    $pdfparser.'/Font/FontType1.php',
    $pdfparser.'/RawData/FilterHelper.php',
    $pdfparser.'/RawData/RawDataParser.php',
    $pdfparser.'/XObject/Form.php',
    $pdfparser.'/XObject/Image.php'
);

foreach($prerequisites as $project => $includes) {
    foreach($includes as $mapping => $file) {
      require_once $file;
    }
}

/*
// Information for comparison with composer
use Datamatrix;
use PDF417;
use QRcode;
use TCPDF;
use TCPDF2DBarcode;
use TCPDFBarcode;
use TCPDF_COLORS;
use TCPDF_FILTERS;
use TCPDF_FONTS;
use TCPDF_FONT_DATA;
use TCPDF_IMAGES;
use TCPDF_IMPORT;
use TCPDF_PARSER;
use TCPDF_STATIC;
*/
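
The fixed list above has to be edited whenever the library adds or renames a file, which is what several later comments in this thread deal with. As an alternative, here is a minimal sketch of a class-name-based autoloader. It is not part of PdfParser; the helper name pdfparserClassToPath and the $srcDir location are assumptions chosen to match the folder structure shown above, where class Smalot\PdfParser\Parser lives in src/Smalot/PdfParser/Parser.php.

```php
<?php
/**
 * Hypothetical helper (not part of PdfParser): map a class name in the
 * Smalot\PdfParser namespace to a file below the library's src/ directory.
 * Returns null for classes outside that namespace.
 */
function pdfparserClassToPath(string $class, string $srcDir): ?string
{
    $prefix = 'Smalot\\PdfParser\\';
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null;
    }
    // Namespace separators become directory separators.
    return rtrim($srcDir, '/') . '/' . str_replace('\\', '/', $class) . '.php';
}

// Assumption: this file is saved as vendor/autoload.php, next to smalot/.
$srcDir = __DIR__ . '/smalot/pdfparser/src';

spl_autoload_register(function (string $class) use ($srcDir): void {
    $file = pdfparserClassToPath($class, $srcDir);
    if ($file !== null && is_file($file)) {
        require_once $file;
    }
});
```

With this approach files are loaded on first use, so a new class file in the library should be picked up without touching the autoloader, provided the library keeps one class per file under src/.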

We can now create a test.php in the deployment folder (pdfdemos here) with:

<?php
// Resolve paths relative to this script, not the current working directory
require_once __DIR__ . '/../vendor/autoload.php';

$file = __DIR__ . '/INV001.pdf';   // test PDF stored next to this script

$parser   = new \Smalot\PdfParser\Parser();
$document = $parser->parseFile($file);
$pages    = $document->getPages();
$content  = $pages[0]->getText();  // text of the first page only

// Escape the text so that characters such as < and & display correctly
echo '<pre>' . htmlspecialchars($content) . '</pre>';

EDIT 1 by k00ni: added updated PHP code from @ndmax. Also removed tecnickcom/tcpdf (not needed anymore) and added code highlighting.

@rajeshgozoom

Wow man, you are great. That is really awesome and it worked for me...

Thanks a TON!!!

@kaustavdey

Sir, it's simply great. Can you also help to install Tesseract OCR for PHP without Composer?

@apmuthu
Author

apmuthu commented Mar 7, 2018

I have not used this wrapper: Tesseract OCR for PHP. If you succeed in deploying it, do let us know what issues crop up.

I have only used the compiled version of Tesseract OCR.

@yousaf50

yousaf50 commented Aug 5, 2018

Wow, sweet work. Love from Pakistan.

@federicovilla

federicovilla commented Oct 26, 2018

Great hint, apmuthu. This solves my issue, since I cannot run Composer on my remote server.
I'm trying to use your modified installation and autoload file in a CodeIgniter site, so I installed all files and the autoload file into the third_party folder.
I have created a library with the following code:

class Pdfparser {
    function __construct() {
        require_once '/third_party/vendor/autoload.php';
    }
}

After that I added this method to the controller:

function file_pdf_parser($file) {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($file);
    $text = $pdf->getText();
    return $text;
}

but when I try to execute the method I get the following error:
[26-Oct-2018 08:18:58 UTC] PHP Fatal error: Class 'Smalot\PdfParser\Parser' not found in /home/gavsit/public_html/application/controllers/adm/Files.php on line 324

Any hint to solve this? Thanks a lot

@apmuthu
Author

apmuthu commented Oct 31, 2018

Make the path to the autoload file relative, as below (or fully absolute, using the full Linux path):

class Pdfparser {
    function __construct() {
        require_once ('third_party/vendor/autoload.php');
    }
 }
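
A relative path passed to require_once is looked up via include_path and the current working directory, so it can behave differently depending on which script starts the request. A minimal sketch that anchors the path on the library file's own location instead is shown below. The helper name resolveAutoloadPath is hypothetical, and the sketch assumes the library is saved as application/libraries/Pdfparser.php with the autoload file in application/third_party/vendor/; adjust both to your layout.

```php
<?php
/**
 * Hypothetical helper: return the absolute path of the hand-written
 * autoload file, or throw if it is not where we expect it.
 */
function resolveAutoloadPath(string $applicationDir): string
{
    $path = rtrim($applicationDir, '/') . '/third_party/vendor/autoload.php';
    if (!is_file($path)) {
        throw new RuntimeException("PdfParser autoload file not found: $path");
    }
    return $path;
}

class Pdfparser
{
    public function __construct()
    {
        // Assumes this file is saved as application/libraries/Pdfparser.php,
        // so dirname(__DIR__) is the application/ directory.
        require_once resolveAutoloadPath(dirname(__DIR__));
    }
}
```

Throwing early with the resolved path in the message makes a wrong location visible immediately, rather than surfacing later as a "Class not found" error.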

@doganoo
Contributor

doganoo commented Oct 31, 2018

What exactly do you mean by “freeze”? Composer doesn’t update any files as long as you don’t run “composer update” and don’t change your composer.json and/or composer.lock files.

Therefore, just run “composer update” on your instance and enjoy?!

@apmuthu
Author

apmuthu commented Nov 7, 2018

For those who cannot use composer for whatever reason (offline, stability, unfamiliarity, etc), this thread lists an alternative.

@ndmax

ndmax commented Mar 26, 2019

I really do wish this was baked into pdfparser (and other projects that assume Composer). I don't use Composer, and don't want to add yet another dependency/package manager for one project submodule. Composer isn't installed in my production environment.

I don't mean to suggest ejecting Composer (it's a common standard and super easy if you're already using it), but simply including an installation procedure that doesn't require Composer. An example would be the DOMPDF project.

In any case, thanks @apmuthu for this post.

@ndmax

ndmax commented Mar 27, 2019

Also: Object.php is now PDFObject.php

@apmuthu
Author

apmuthu commented Mar 27, 2019

@ndmax: Thanks for the info and glad for the +1 composer-less implementation initiative.

Yes, you're right, the renaming of the file will now require, in the code snippet above, the replacement of the line:

include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/Object.php";

with

include_once  $vendorDir . "/smalot/pdfparser/src/Smalot/PdfParser/PDFObject.php";

Stands corrected in the opening post above.

@WillRun4Cake

WillRun4Cake commented May 4, 2020

Apmuthu, thank you. This thread was very helpful.
Below I have attached an example for doing this without Composer. This should be extracted to the /var/www/html directory or public_html directory.

Three additional things to mention:

  1. I just cloned the master branch on May 3, 2020 and the Object.php has still not been renamed to PDFObject.php
  2. Make sure you have the php-mbstring extension installed, e.g.
    sudo apt-get install php-mbstring
    yum install php-mbstring
  3. You may also need to enable in php.ini:
    extension=mbstring
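
Whether mbstring is really available to the PHP process that serves your pages can be checked from PHP itself. The following is a minimal sketch; the helper name missingExtensions is hypothetical, while extension_loaded is a standard PHP function.

```php
<?php
/**
 * Hypothetical helper: return the names of the required extensions
 * that are not loaded in the current PHP process.
 */
function missingExtensions(array $required): array
{
    return array_values(array_filter($required, function (string $name): bool {
        return !extension_loaded($name);
    }));
}

$missing = missingExtensions(['mbstring']);
if ($missing !== []) {
    echo 'Missing PHP extension(s): ' . implode(', ', $missing) . PHP_EOL;
}
```

Run this through the web server rather than only on the command line, since the two can use different php.ini files.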

Examples:

example.zip
example.tar.gz

@apmuthu
Author

apmuthu commented May 4, 2020

Thanks Joseph. Stands corrected.

@hemaantjd

Thank you WillRun4Cake... It's working great... https://www.hemantjadhav.com/

@j0k3r
Collaborator

j0k3r commented Jul 27, 2020

Duplicates #279

j0k3r closed this as completed Jul 27, 2020
@ndmax

ndmax commented Jul 28, 2020

For those who have been manually implementing w/o Composer, the following two new includes are now required:
/RawData/FilterHelper.php
/RawData/RawDataParser.php

@k00ni
Collaborator

k00ni commented Jul 29, 2020

Because this issue seems interesting for some people, we could pin it. Furthermore, information like the last of @ndmax should be merged into the initial post so that new people find them quickly without the need to scroll through the whole issue.

What do you think?

CC @j0k3r

j0k3r pinned this issue Jul 29, 2020
@j0k3r
Collaborator

j0k3r commented Jul 29, 2020

@k00ni I agree. I've pinned the issue. Can you update the initial post?

@k00ni
Collaborator

k00ni commented Jul 29, 2020

@ndmax, @apmuthu and others: I would like to keep the initial post up-to-date. Can you tell me, what needs to be changed?

As far as I saw, these files need to be added:

/RawData/FilterHelper.php
/RawData/RawDataParser.php

k00ni reopened this Jul 29, 2020
@ndmax

ndmax commented Jul 29, 2020

Thanks @k00ni and @j0k3r. Relative to the project directory, I think a current full list of includes would look something like this:

/Parser.php
/Document.php
/Header.php
/PDFObject.php
/Element.php
/Encoding.php
/Font.php
/Page.php
/Pages.php
/Element/ElementArray.php
/Element/ElementBoolean.php
/Element/ElementString.php
/Element/ElementDate.php
/Element/ElementHexa.php
/Element/ElementMissing.php
/Element/ElementName.php
/Element/ElementNull.php
/Element/ElementNumeric.php
/Element/ElementStruct.php
/Element/ElementXRef.php
/Encoding/StandardEncoding.php
/Encoding/ISOLatin1Encoding.php
/Encoding/ISOLatin9Encoding.php
/Encoding/MacRomanEncoding.php
/Encoding/WinAnsiEncoding.php
/Font/FontCIDFontType0.php
/Font/FontCIDFontType2.php
/Font/FontTrueType.php
/Font/FontType0.php
/Font/FontType1.php
/RawData/FilterHelper.php
/RawData/RawDataParser.php
/XObject/Form.php
/XObject/Image.php

@k00ni
Collaborator

k00ni commented Aug 13, 2020

Sorry for the late response, @ndmax. I hope it's not too much of an ask: can you put your file list (as includes) into the code from @apmuthu so I can copy it directly?

@apmuthu
Author

apmuthu commented Aug 13, 2020

vendor-autoload.zip
The order of the files given by @ndmax is correct and the ../vendor/autoload.php file stands attached herewith.

@ndmax

ndmax commented Aug 16, 2020

Sorry I missed this @k00ni and thanks @apmuthu for posting up a drop-in file.

For what it's worth, I implement pdfparser as part of a larger class responsible for all PDF-related functionality. Here's how the pdfparser component gets wired up (trimmed a bit for simplicity):

<?php

  $prerequisites = array();

  $pdfparser = '/host/path/to/pdfparser';

  $prerequisites['pdfparser'] = array (
    $pdfparser.'/Parser.php',
    $pdfparser.'/Document.php',
    $pdfparser.'/Header.php',
    $pdfparser.'/PDFObject.php',
    $pdfparser.'/Element.php',
    $pdfparser.'/Encoding.php',
    $pdfparser.'/Font.php',
    $pdfparser.'/Page.php',
    $pdfparser.'/Pages.php',
    $pdfparser.'/Element/ElementArray.php',
    $pdfparser.'/Element/ElementBoolean.php',
    $pdfparser.'/Element/ElementString.php',
    $pdfparser.'/Element/ElementDate.php',
    $pdfparser.'/Element/ElementHexa.php',
    $pdfparser.'/Element/ElementMissing.php',
    $pdfparser.'/Element/ElementName.php',
    $pdfparser.'/Element/ElementNull.php',
    $pdfparser.'/Element/ElementNumeric.php',
    $pdfparser.'/Element/ElementStruct.php',
    $pdfparser.'/Element/ElementXRef.php',
    $pdfparser.'/Encoding/StandardEncoding.php',
    $pdfparser.'/Encoding/ISOLatin1Encoding.php',
    $pdfparser.'/Encoding/ISOLatin9Encoding.php',
    $pdfparser.'/Encoding/MacRomanEncoding.php',
    $pdfparser.'/Encoding/WinAnsiEncoding.php',
    $pdfparser.'/Font/FontCIDFontType0.php',
    $pdfparser.'/Font/FontCIDFontType2.php',
    $pdfparser.'/Font/FontTrueType.php',
    $pdfparser.'/Font/FontType0.php',
    $pdfparser.'/Font/FontType1.php',
    $pdfparser.'/RawData/FilterHelper.php',
    $pdfparser.'/RawData/RawDataParser.php',
    $pdfparser.'/XObject/Form.php',
    $pdfparser.'/XObject/Image.php'
  );

  foreach($prerequisites as $project => $includes) {
    foreach($includes as $mapping => $file) {
      require_once $file;
    }
  }

?>

@k00ni
Collaborator

k00ni commented Aug 17, 2020

OK, I adapted @apmuthu's initial post with the code from @ndmax. I hope I didn't miss anything. AFAIK the classes in the use section (like Datamatrix) are no longer needed or available, because I removed tecnickcom/tcpdf a while ago.

@WillRun4Cake

I have updated my example for using PdfParser when Composer is not an option.

It now aligns with the v0.17.1 release and includes 3 new files:
RawData/FilterHelper.php
RawData/RawDataParser.php
Encoding/PostScriptGlyphs.php

You can download, extract and copy this to your webroot directory to start using PdfParser immediately.
Follow the example code in text_extractor_example.php to invoke PdfParser.
Only the classes/ directory is required. The following files are for example and may be deleted:
readme.md
test.pdf (replace with your own pdf)
text_extractor_example.php (example code)
index.php (use your existing index.php)

This should be placed in the /var/www/html directory or public_html webroot directory, as applicable.

Two additional things to mention:
1. Make sure you have the php-mbstring extension installed, e.g.
sudo apt-get install php-mbstring
OR
yum install php-mbstring
2. You may also need to enable in php.ini:
extension=mbstring

PdfParser Team, I noticed that Object.php was renamed to PDFObject.php. Thanks for making this fix.
example.zip
example.tar.gz

@WillRun4Cake

WillRun4Cake commented Nov 1, 2020 via email

@ndmax

ndmax commented Jan 16, 2021

Hello Everyone,

If you're still implementing this project manually and you're suddenly getting 500s with Fatal error: Uncaught Error: Class 'Smalot\PdfParser\Config' not found there's a new include:

'/Config.php'

@apmuthu @k00ni it might be worth adding $pdfparser.'/Config.php', to the code snippet for manual include.

@k00ni
Collaborator

k00ni commented Jan 18, 2021

@ndmax thank you for the hint. I updated the initial post accordingly.


In my opinion Composer should be the way to go, but I acknowledge that there is also the need to use other ways. Therefore my proposal:

  • create a new file in root folder, named autoload.php or something like that
  • move includes from initial post into that file
  • close this issue with a notice about that file

Advantages:

  • It's easier for developers to just require this file and have everything they need (no need to parse this issue again and again when there are upgrades)
  • We can extend this file by using pull requests, which makes it easier to keep it up to date than using this issue (and copy paste).

What do you think?

CC @j0k3r @apmuthu

@j0k3r
Collaborator

j0k3r commented Jan 18, 2021

That might be a good idea. We should warn people that the mbstring extension will then be required, because the Symfony polyfill (required in composer.json) won't be installed.
After adding that new autoload.php file, we should also add a custom test to ensure it works as expected, so that the test fails if we add a new file and forget to update the autoload.php file.
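
One way such a check could look is sketched below. This is not the test that was eventually added to the project; the helper name findUnlistedFiles is hypothetical. It walks a source directory and reports every .php file that is missing from a given include list, using the same leading-slash relative format as the file lists earlier in this thread.

```php
<?php
/**
 * Hypothetical helper: list the .php files below $srcDir (as paths
 * relative to $srcDir, with a leading slash) that are absent from $listed.
 */
function findUnlistedFiles(string $srcDir, array $listed): array
{
    $srcDir = rtrim($srcDir, '/');
    $found = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($srcDir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && $file->getExtension() === 'php') {
            // Strip the base directory and normalise separators.
            $relative = substr($file->getPathname(), strlen($srcDir));
            $found[] = str_replace('\\', '/', $relative);
        }
    }
    $unlisted = array_values(array_diff($found, $listed));
    sort($unlisted);
    return $unlisted;
}
```

A test could call this with the library's source directory and the list from the autoload file, and fail when the result is not empty.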

@ndmax

ndmax commented Jan 18, 2021

I think it's a good approach @k00ni. It eliminates the cat-and-mouse for the few who incorporate manually, without disrupting the majority who install with Composer. Comment instructions at the top of autoload.php and a brief reference to install options in README.md would point everyone down the right path.

@apmuthu
Author

apmuthu commented Jan 19, 2021

The said autoload.php should not be overwritten by any unintended composer update / unzip.

A file called autoload_standalone.php that gets included if present may be the answer.

In order not to break dependencies of certain missing libraries / files, the optional set of includes can be commented out and enabled as needed by the end user.

Notes in the file stating from which commit onwards each include applies would help users match it to the appropriate version.

@k00ni
Collaborator

k00ni commented Feb 9, 2021

There is a new pull request which may help in the future. Once it passes there should be no more manual work required to use PDFParser without Composer.

Feedback is appreciated at #388

@DionardoMarques

Thanks a lot, it worked perfectly! Regards from Brazil.
