Usage

You can configure OCR processing via Hub's workflow engine. To do so, create a new flow via Settings → Flow → Add new flow (if you don't see OCR file there, the app isn't installed properly or you forgot to activate it).

Useful triggers

Trigger OCR if file was created or updated

If you want newly uploaded files to be processed via OCR, or want to process files that were updated, use the When conditions File created, File updated, or both.

A typical setup for processing incoming PDF files and adding a text layer to them might look like this:

PDF setup

⚠️ Make sure to use the File MIME type → is → PDF documents operator; otherwise you might not be able to save the workflow, as discussed here.

Trigger OCR on tag assigning

If you have existing files which you want to process after they have been created, or if you want to filter manually which files are processed, you can use the Tag assigned event to trigger the OCR process if a user adds a specific tag to a file. Such a setup might look like this:

Tag assigned setup

After that you should be able to add a file to the OCR processing queue by assigning the configured tag to a file:

Tag assign frontend 1

Tag assign frontend 2

Settings

Per workflow settings

Anyone who can create new workflows (admin or regular user) can configure settings for the OCR processing for a specific workflow. These settings are only applied to the specific workflow and do not affect other workflows.

Per workflow settings

Currently the following settings are available per workflow:

Name: Description
OCR language: The languages to be used for OCR processing. The languages can be chosen from a dropdown list. For PDF files this setting corresponds to the -l parameter of ocrmypdf. Please note that you'll have to install the appropriate languages as described in the ocrmypdf documentation.
Assign tags after OCR: These tags will be assigned to the file after it has been successfully processed.
Remove tags after OCR: These tags will be removed from the file after it has been successfully processed. If the file does not have a given tag, that tag will just be skipped.
OCR mode: Controls how files that already have OCR content are processed. For PDF files this setting corresponds to the --skip-text, --redo-ocr and --force-ocr parameters of ocrmypdf. See the official docs for additional information.
  • Skip text: skip pages completely that already contain text. Such a page will not be touched and will just be copied to the final output.
  • Redo OCR: perform a detailed text analysis to split up pages into areas with and without text.
  • Force OCR: all pages will be rasterized to images and OCR will be performed on every page.
Keep original file version: If the switch is set, the original file (before applying OCR) will be kept. This is done by giving the file version the label Before OCR. This version will be excluded from the automatic expiration process (see here for details).
Keep original file modification date: Restore the modification date of the original file. The original modification date will be applied to the newly created file version. This is useful if you need to preserve the file modification date, for example to be able to sort files accordingly.
Send success notification: Usually the workflow only sends a notification to the user if the OCR process failed. If this option is activated, the user will also be notified when a document has been processed successfully via OCR.
Remove background*: If the switch is set, the OCR processor will try to remove the background of the document before processing and set a white background instead. For PDF files this setting corresponds to the --remove-background parameter of ocrmypdf.
⚠️ Please note that this flag currently only works with ocrmypdf versions prior to 13. It might be added again in future versions. See here for details. ⚠️
Custom ocrMyPdf CLI arguments: If you want to pass custom arguments to the ocrmypdf CLI, you can do so here. Please note that the arguments are passed to the CLI as they are, so make sure to use the correct syntax. Check the official docs for more information.

* For ocrmypdf the parameter --remove-background is incompatible with --redo-ocr.
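As a rough illustration of how these settings relate to the ocrmypdf command line, the sketch below assembles a command from a language list, an OCR mode and the remove-background switch. The helper name ocr_command and the file names are hypothetical, and the app may well invoke ocrmypdf differently internally; only the flags themselves are the ones named in the table above.

```shell
# Hypothetical helper: prints the ocrmypdf command line that a given
# combination of per-workflow settings roughly corresponds to.
# Usage: ocr_command LANGS MODE REMOVE_BG INPUT OUTPUT
#   LANGS      e.g. "deu+eng" (ocrmypdf joins several languages with "+")
#   MODE       skip-text | redo-ocr | force-ocr
#   REMOVE_BG  yes | no
ocr_command() {
    langs=$1
    mode=$2
    remove_bg=$3
    input=$4
    output=$5

    case $mode in
        skip-text|redo-ocr|force-ocr) ;;
        *) echo "unknown OCR mode: $mode" >&2; return 1 ;;
    esac

    # --remove-background cannot be combined with --redo-ocr (see the footnote).
    if [ "$remove_bg" = yes ] && [ "$mode" = redo-ocr ]; then
        echo "--remove-background is incompatible with --redo-ocr" >&2
        return 1
    fi

    cmd="ocrmypdf -l $langs --$mode"
    if [ "$remove_bg" = yes ]; then
        cmd="$cmd --remove-background"
    fi
    printf '%s\n' "$cmd $input $output"
}

# Example: German and English, skipping pages that already contain text.
ocr_command deu+eng skip-text no scan.pdf scan_ocr.pdf
# prints: ocrmypdf -l deu+eng --skip-text scan.pdf scan_ocr.pdf
```

Running the printed command by hand on a sample file can be a quick way to check that the required language packs are installed before saving the workflow.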

Global settings

As a Hub administrator you're able to configure global settings which apply to all configured OCR workflows on the current system. Go to Settings → Flow and scroll down to Workflow OCR:

Global settings

Currently the following settings can be applied globally:

Name: Description
Processor cores: Defines the number of processor cores to use for OCR processing. When the input is a PDF file, this corresponds to the ocrmypdf CPU limit. This setting can be especially useful if you have a small backend system with limited processing power.

Testing your configuration

To test whether your file gets processed properly, follow these steps:

  1. Upload a new file which meets the criteria you defined when creating the workflow.
  2. Go to your server's console and change into the Hub installation directory (e.g. cd /var/www/html/nextcloud).
  3. Execute the cron job file manually, e.g. by typing sudo -u www-data php cron.php (this is the command you usually set up to be executed by the Linux crontab).
  4. If everything went fine, you should see that a new version of your file was created. If you uploaded a PDF file, you should now be able to select text in it, provided it contained at least one image with scanned text.
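Steps 2 and 3 can be wrapped in a small script so the cron run is easy to repeat while testing. This is only a sketch: the installation path and the web server user are the example values from above, and the names NC_DIR, WEB_USER, DRY_RUN, run and run_cron_once are assumptions introduced here. By default the script only prints the command instead of executing it.

```shell
# Assumed defaults; adjust them to your installation.
NC_DIR=${NC_DIR:-/var/www/html/nextcloud}   # installation directory
WEB_USER=${WEB_USER:-www-data}              # user the web server runs as

# Print the command when DRY_RUN=1 (the default here), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Steps 2 and 3: run the background jobs once as the web server user.
run_cron_once() {
    run sudo -u "$WEB_USER" php -f "$NC_DIR/cron.php"
}

run_cron_once
```

Setting DRY_RUN=0 in the environment executes the command for real. Afterwards, check the file's version history as described in step 4.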

File versions

Get feedback via Notifications

The Workflow OCR app supports sending notifications to the user in case anything went wrong during the asynchronous OCR processing. To enable this feature, you have to install and enable the Notifications app in your Hub instance.

Notifications

How it works

General

General diagram

PDF

For processing PDF files, the external command line tool OCRmyPDF is used. The tool is always invoked with the --skip-text parameter so that it will skip pages which already contain text. Please note that with this parameter set, it's currently not possible to analyze pages with mixed content (see #113 for further information).

Images

For processing single images (currently jpg and png are supported), ocrmypdf converts the image to a PDF. The converted PDF file will then be OCR processed and saved as a new file with the original filename and the extension .pdf (for example, myImage.jpg will be saved as myImage.jpg.pdf). The original image file will remain untouched.
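The naming rule can be expressed as a tiny helper. This is an illustrative sketch only: the function name ocr_output_name is hypothetical, and it decides by file extension for simplicity, which is an assumption made for this example and not necessarily how the app identifies file types.

```shell
# Hypothetical helper: prints the name of the file that holds the OCR
# result for a given input file name.
ocr_output_name() {
    case $1 in
        *.pdf)
            # PDFs keep their name; the result becomes a new file version.
            printf '%s\n' "$1"
            ;;
        *.jpg|*.jpeg|*.png)
            # Images are converted: ".pdf" is appended to the full name.
            printf '%s\n' "$1.pdf"
            ;;
        *)
            echo "unsupported input: $1" >&2
            return 1
            ;;
    esac
}

ocr_output_name myImage.jpg   # prints: myImage.jpg.pdf
ocr_output_name scan.pdf      # prints: scan.pdf
```

Because the extension is appended and not replaced, myImage.jpg and myImage.png in the same folder end up as two distinct PDF files.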

Troubleshooting

Generic troubleshooting guide

Since this app does its main work asynchronously, controlled by the NC cron, troubleshooting gets slightly more complicated. That's why we suggest following this guide if you're facing any issues:

  1. Create your OCR workflow with triggers and conditions to your taste.
  2. Temporarily decrease the server's log level to 0 (debug).
  3. Try to trigger the workflow according to the conditions you've set (for example by uploading a new PDF file or setting a new tag).
  4. Check your database table oc_jobs. It should contain a new job for the OCR processing like this: OCA\WorkflowOcr\BackgroundJobs\ProcessFileJob | {"filePath":"some.pdf","settings":"{\"languages\":[\"eng\"]}"}. If that's not the case, you can stop here: you're facing a condition issue. The nextcloud.log file content might help you find out why your workflow was not added to the queue.
  5. If you can see a new job for the OCR process, run cron.php once manually (for example by running sudo -u www-data php -f /var/www/nextcloud/cron.php).
  6. Inspect your nextcloud.log file (e.g. by using the logreader). You should see various outputs pointing you in the right direction (for example, the output of the ocrmypdf process).
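Steps 2, 4 and 6 of this guide could be scripted roughly as follows. Treat it as a sketch under assumptions: the installation path, the web server user, the default oc_ table prefix and the default log location (data/nextcloud.log) may differ on your system, and the function names are made up for this example.

```shell
# Assumed defaults; adjust them to your installation.
NC_DIR=${NC_DIR:-/var/www/nextcloud}
WEB_USER=${WEB_USER:-www-data}

# Step 2: set the log level to 0 (debug) via occ.
set_debug_loglevel() {
    sudo -u "$WEB_USER" php "$NC_DIR/occ" \
        config:system:set loglevel --value=0 --type=integer
}

# Step 4: print an SQL query that lists queued OCR jobs.
# Assumes the default table prefix "oc_".
ocr_jobs_query() {
    printf '%s\n' "SELECT id, class, argument FROM oc_jobs WHERE class LIKE '%ProcessFileJob';"
}

# Step 6: show the last 20 log lines mentioning OCR.
# Takes the log file as optional first argument.
show_ocr_log() {
    grep -i 'ocr' "${1:-$NC_DIR/data/nextcloud.log}" | tail -n 20
}

# Only the query is printed here; pipe it into your database client.
ocr_jobs_query
```

With a MySQL or MariaDB backend, for instance, the query could be piped into the mysql client using your own database name and credentials. An empty result after triggering the workflow points to a condition issue, as described in step 4. Remember to raise the log level again once you're done.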

The Hub Workflowengine

This app is built on top of the Hub Workflowengine, which makes it quite flexible and customizable. The tradeoff is that some misbehaviors might be related to the app itself, while others have their origin in the Workflowengine. As a rule of thumb, everything related to the triggers and conditions sections on the left-hand side comes from the NC Workflowengine, while the settings on the right-hand side are OCR app specific:

NC Workflowengine

Please keep that in mind when troubleshooting issues. Of course, feel free to open new issues here, but we might need to redirect you to the official NC Server project.

You can check issues related to the Workflowengine by trying to reproduce the same behaviour with different workflow-based apps. If they behave in the same way in terms of triggers and conditions, the issue is most likely related to the NC Workflowengine itself and cannot be fixed here.

Development

Dev setup

Tools and packages you need for development:

  • make
  • node and npm
  • composer (will be automatically installed when running make build)
  • Properly set up PHP environment
  • Webserver (like Apache)
  • Xdebug and an Xdebug connector for your IDE (for example https://marketplace.visualstudio.com/items?itemName=felixfbecker.php-debug) if you want to debug PHP code
  • PHP IDE (we recommend VSCode)

You can then build and install the app by cloning this repository into the Hub apps folder and running make build.

cd /var/www/<Hub_INSTALL>/apps
git clone https://github.com/R0Wi/workflow_ocr.git workflow_ocr
cd workflow_ocr
make build
 

Limitations

 

  • Currently only PDF documents (application/pdf) and single images (image/jpeg and image/png) can be used as input. Other MIME types are currently ignored but might be added in the future.

  • All input file types currently produce a single PDF output file. No other output file format is supported at the moment.

  • PDF metadata (like author, comments, ...) might not be available in the converted output PDF document. This is limited by the capabilities of ocrmypdf (see ocrmypdf/OCRmyPDF#327).

  • Currently files are only processed based on workflow events, so there is no batch mechanism for applying OCR to already existing files. This is a feature which might be added in the future. To apply OCR to a single file that already exists, you can use the "tag assigned" workflow trigger.

  • If you encounter any problems with the OCR processing, you can always restore the original file via Hub's version history.

    File versions

    If you want to clean up the file history for all files and only preserve the newest file version, you can use
    sudo -u www-data php occ versions:cleanup

 
