Feature List Recognition Server 4.0 Release Multilingual - Part Nr. 1135/6 Build 4.0.3.1167
Recognition Server 4.0 R2 Feature List
Product: Recognition Server 4.0 Release 2
Part & Build: Part: 1135-6 Build 4.0.3.1167
Confidentiality: Partners and End Customers
Release Date: 14.11.2014
Table of contents
Introduction
About the document
About the product
Release 2 - Key features and enhancements
Release 1 Multilingual - Key features and enhancements
Release 1 (English and Russian User Interface) - Key features and enhancements
Installing the new version
Licensing
1. New features and improvements
1.1. Server features
1.1.1. Internal database
1.1.2. Separate workflow queues
1.1.3. Server exceptions folder
1.1.4. Easy recovery without data loss
1.1.5. Support of Failover cluster
© ABBYY. All rights reserved.
1.2. Workflow features
1.2.1. New workflow type: Document Library
1.2.2. Periodical crawling of document libraries
1.2.3. Processing of documents in SharePoint libraries
1.2.4. Using IFilter for processing PDF files in MS SharePoint
1.2.5. Processing files by mask
1.2.6. Flexible detection of blank pages
1.2.7. Extended options on Indexing tab
1.2.8. Overwriting files in the output folder
1.2.9. SSL for POP3 email servers
1.2.10. New processing parameters
1.2.10.1. KeepPages property
1.2.10.2. Despeckle option
1.2.10.3. Extending the font set
1.2.10.4. Enabling recognition of text inside pictures
1.2.10.5. Limiting the number of recognized pages in a file
1.2.10.6. New barcode type – USPS-4CB (Intelligent Mail Barcode)
1.2.10.7. Disabled image compression of lossy JBIG2 type
1.2.11. Export to ePub3 format
1.2.12. Settings of units measurement for export to ALTO XML
1.2.13. Export to specific column types in SharePoint
1.3. PDF processing features
1.3.1. Improved MRC compression
1.3.2. PDF/A standards and PDF versions
1.3.3. Export to PDF/A-3 format
1.3.4. Tagged PDF enabled by default
1.3.5. Possibility to skip processing of PDF with text layer
1.3.6. Injection of a text layer in existing PDF files
1.3.7. Using PDF text layer for recognition results improvement
1.3.8. Using PDF text layer for generating quality output files of different formats
1.3.9. Fast WEB View mode for PDF files
1.4. Technological advances
1.4.1. Special mode for processing of plans and drawings
1.4.2. Speed increase for Arabic OCR
1.5. Administration features
1.5.1. Updated UI of the Administration Console
1.5.2. User management via Active Directory groups
1.5.3. Logging of operators activities
1.5.4. Improved logging
1.5.5. Notification for the administrator includes server and workflow names in the message text
1.5.6. In advance notification about license expiry
1.5.7. Soft stop of workflow processing
1.5.8. Job cancellation without the loss of files
1.6. Operator stations
1.6.1. Scanning Station: Sending registration parameters values to index fields
1.6.2. Selection of documents for verification and indexing
1.6.3. Saving of interim verification results
1.6.4. Timeout of inactivity
1.7. Scripting
1.7.1. Access to subsequent pages from the document assembly script
1.7.2. Detecting the workflow name by script
1.8. Compatibility features and limitations
1.8.1. Discontinued support of Windows XP and Windows Server 2003
1.8.2. Compatibility with FineReader Engine 11
1.9. Changes in API and XML result
1.9.1. Page tracking in XML result
1.9.2. Installation of COM-based API and Web API on 64-bit systems
1.9.3. Changes in COM-based API and Web API
2. UI and Documentation localization
Introduction
About the document
This document describes the new features implemented in ABBYY Recognition Server 4.
About the product
ABBYY Recognition Server 4 introduces new technology, including improved recognition of Arabic texts, better integration with SharePoint, new PDF processing features, and other improvements. Main server features such as stability, performance, and auto-recovery have been revised and improved. This version also includes functionality for processing read-only folders, advanced logging, some UI changes, and bug fixes.
Release 2 - Key features and enhancements
Part #: 1135/6, build # 4.0.3.1167, OCR Technologies build # 13.0.15.131, release date: 14/11/2014
New features and changes in Release 2 are marked in blue here and in the document below.
The major features:
• Improved MRC compression method
• Using IFilter for processing PDF files in MS SharePoint
• Processing the SharePoint document libraries:
o Crawling of the complete SharePoint site (including multiple libraries and folders)
o Periodical re-crawling settings
• Export to specific column types in SharePoint (support of Date, Number, and selected other formats)
• Export to PDF/A-3
Other improvements:
• Improved e-mail notifications:
o In advance notifications about license expiry
o Information on server name in the message text
• Sending registration parameters values from Scanning Station to index fields
• Soft stop of the workflow processing
• Support of failover cluster
• Using PDF text layer for generating output files
• Blank page detection parameters
• New barcode type - USPS-4CB (Intelligent Mail Barcode)
• New export format: ePub3
• Settings of units measurement for export to ALTO XML
• Disabled image compression of lossy JBIG2 type
• Tagged PDF enabled by default
• Possibility to combine values from several areas into one index field
• Access to subsequent pages from the document assembly script
• Detecting the workflow name by script
Release 1 Multilingual - Key features and enhancements
Part #: 1135/5, build # 4.0.2.952, OCR Technologies build number 13.0.13.21, release date: 14/08/2014
• Translation of UI and help into the following languages:
o French
o German
o Italian
o Spanish
o Chinese
o Portuguese (Brazil)
o Czech
o Hungarian
o Polish
Release 1 (English and Russian User Interface) - Key features and enhancements
Part #: 1135/4, build # 4.0.2.943, OCR Technologies build number 13.0.13.15, release date: May 27, 2014
• Improved fault tolerance and logging
• Processing documents in “read-only” mode
• Processing of documents in SharePoint libraries
• Enhanced work with PDF files
• Better support for construction drawings
• Faster recognition of Arabic texts
• User management via Active Directory
Installing the new version
Recognition Server 4 can be installed on the same computer where Recognition Server 3.5 or previous versions were installed. The configuration of a previous version of ABBYY Recognition Server can be imported into ABBYY Recognition Server 4. For further information, please see the System Administrator’s Guide, “Upgrade from the previous versions of ABBYY Recognition Server.”
Note. Some changes have been made to the XML Result file schema and the corresponding API object. This may require modifications to custom code written to integrate ABBYY Recognition Server with third-party systems. Details are given below in this document and in the XMLResult description article in the help file.
Licensing
Recognition Server 4 requires licenses generated specifically for this version of the product. It cannot work with a license generated for Recognition Server 3.5 or earlier.
1. New features and improvements
1.1. Server features
1.1.1. Internal database
The current system state is now stored in an SQLite database. This database is installed together with Recognition Server and is invisible to users.
1.1.2. Separate workflow queues
Each workflow has a separate queue, which allows it to work independently of other workflows and prevents it from stopping if another workflow has a lot of jobs waiting in its queue. The default maximum number of jobs in the queue of each workflow is 50. This number can be changed in the Configuration.xml file: MaxJobsCount="50". (In previous versions of Recognition Server, this parameter controlled the length of the combined server queue.)
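The queue limit change described above can be sketched in Python. This is only an illustration under stated assumptions: the layout of Configuration.xml is not shown in this document, so the sample fragment and its element names (Configuration, Workflow, Name) are hypothetical. Only the MaxJobsCount attribute name comes from the text above. The sketch avoids relying on any element name by updating every element that carries the attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element names are placeholders, because the real
# Configuration.xml structure is not described in this document.
SAMPLE = '<Configuration><Workflow Name="Invoices" MaxJobsCount="50"/></Configuration>'


def set_max_jobs_count(xml_text: str, new_limit: int) -> str:
    """Return xml_text with every MaxJobsCount attribute set to new_limit."""
    if new_limit < 1:
        raise ValueError("queue length must be positive")
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Touch only elements that already define the attribute.
        if "MaxJobsCount" in element.attrib:
            element.set("MaxJobsCount", str(new_limit))
    return ET.tostring(root, encoding="unicode")


updated = set_max_jobs_count(SAMPLE, 200)
```

Back up the real configuration file before editing it, and check the product help for whether the server must be restarted for the change to take effect.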
1.1.3. Server exceptions folder
A new folder for server exceptions is created in %programdata%\ABBYY Recognition Server 4.0\RS4WF\Exceptions. This folder contains files that failed due to a server error (for example, if the database or configuration file is corrupted) when the server cannot publish files to the workflow’s exceptions folder.
1.1.4. Easy recovery without data loss
Recovery after a server failure is now smoother and does not require manual copying of files between folders. The processing of jobs resumes automatically, or, if the server cannot resume it, the files are published in the server’s Exceptions folder. GUIDs are no longer used in file names, so it is always possible to find a file by its name. During the processing of jobs in Recognition Server, files are stored in the folder %programdata%\ABBYY Recognition Server 4.0\RS4WF\Images\. File names match the source file names, with one difference: the job ID is added at the beginning of the file name.
1.1.5. Support of Failover cluster
Operation on a failover cluster is supported. Recognition Server instances can be installed on separate nodes of a failover cluster. All Recognition Server settings can be stored in a shared folder that is available to the cluster. Please note: this feature has not been tested. Testing can be done upon request. (Detailed instructions for installing Recognition Server on a failover cluster will be available later.) Implemented in: Release 2.
1.2. Workflow features
1.2.1. New workflow type: Document Library
Recognition Server 4 features a new workflow type called “Document Library” designed to process folders in “read-only” mode. In this type of workflow, the input files are not moved away from the input folder.
Users can now specify a root folder where the documents are stored as the input folder, and the files will be processed without affecting the contents and structure of the root folder. The root folder structure will be mirrored in the output directory. A workflow of the Document Library type stops after all files in the specified library have been processed. If a new file is added to the library, the workflow must be started again. Only newly added files will be processed then. (It is also possible to define a schedule for the workflow to process newly added files every day, week, etc.) If all files need to be re-processed with new workflow settings, use the Restart command (located under the arrow next to the Start button). A document library might be quite large and require significant time to process. A progress bar indicating the current processing state helps to estimate the remaining processing time.
1.2.2. Periodical crawling of document libraries
To ensure fast processing of newly added files, a crawling frequency can be set for workflows of the Document Library type. For periodical processing, the option Crawl for new files in library every: should be enabled. The crawling period can be selected from the drop-down list (from 10 minutes to 11 hours) or typed manually, e.g. “2 hours”, “12 hours”, etc. After periodical crawling is enabled and the workflow is started, the system monitors the document library and counts down the time until the next crawl. If the Crawl for new files in library every: option is not enabled, the library is crawled only once. The start time of crawling depends on the Workflow Activity settings (General tab). Periodical crawling can also be configured in the configuration file (Configuration.xml). The EnablePeriodicCrawling parameter enables or disables periodical crawling; the possible values are True and False (the default is False). The CrawlingInterval parameter sets the crawling interval in milliseconds (the default value is 7200000 ms, i.e. 2 hours).
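Because CrawlingInterval is expressed in milliseconds, it can help to compute the value from a human-readable period. The following Python sketch is an illustration only: the helper name and the accepted text format ("N minutes" or "N hours") are assumptions made for this example and are not part of the product.

```python
import re

# Milliseconds per supported unit.
MS_PER_UNIT = {"minute": 60_000, "hour": 3_600_000}


def crawling_interval_ms(period: str) -> int:
    """Convert a period such as '2 hours' or '10 minutes' to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(minute|hour)s?\s*", period.lower())
    if match is None:
        raise ValueError(f"unrecognised period: {period!r}")
    amount = int(match.group(1))
    unit = match.group(2)
    return amount * MS_PER_UNIT[unit]
```

For example, crawling_interval_ms("2 hours") yields 7200000, which matches the default value of CrawlingInterval quoted above.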
Implemented in: Release 2.
1.2.3. Processing of documents in SharePoint libraries
A SharePoint library can be used as a source of images for a Document Library-type workflow.
It is possible to configure the workflow to process all images in:
• a site
• a particular library or several libraries
• a folder or several folders.
If the Export output files to source library option is enabled when configuring the MS SharePoint input source, the output parameters will always include an output file whose export destination is the SharePoint source libraries. Output files are saved into the same libraries/folders in which the input files are located. The format and naming schema of a file can be configured. By default, the output files are saved under the same names as the input files. If a file already exists, a new version is created. If the Export output files to source library option is not enabled, then the output settings can be configured as usual, including saving the files into any SharePoint library/folder. Only one library/folder can be selected. If the same folder or library is used as both an input and an output location, the resulting files can overwrite the source files, be added to the library (folder) as new versions of the same documents, or be saved with new filenames in addition to the existing files. This is controlled by the If file exists setting in the Output Format Settings dialog:
1. Create new name – a new file will be created, the name will contain a prefix.
2. Overwrite file – the source file will be overwritten.
3. Use SharePoint versioning options – this option is available only if document versioning is turned on in the SharePoint library. The source file will be replaced with the new file and the document version will be raised in accordance with the versioning policy in the SharePoint library.
Limitations:
1. Recognition Server is able to process one site including all its libraries per workflow. For child sites, separate workflows should be created.
2. When the workflow is configured to save the resulting files in the same document library that serves as an input source, the option to create a job per each folder (For each folder option) is unavailable. Each file from the library will be imported by Recognition Server as a new job.
Possibility to indicate several libraries as input was implemented in Release 2.
1.2.4. Using IFilter for processing PDF files in MS SharePoint
Due to a change in Microsoft policy (lifting of the Microsoft ban for PDFs), Recognition Server IFilter can be used again for detection and indexing of PDF files stored in SharePoint 2013. To enable this possibility, the cumulative update package for SharePoint Server 2013 should be installed. This package can be downloaded here: http://support2.microsoft.com/default.aspx?scid=kb;EN‐US;2882989
Please note: The update for MS SharePoint should be installed before the installation of Recognition Server 4 Release 2. If Recognition Server 4 Release 2 has already been installed, install the update for MS SharePoint, then run the installation of Recognition Server 4 Release 2 again and use the Repair command to modify the installation. Implemented in: Release 2.
1.2.5. Processing files by mask
Recognition Server 4 offers an option to filter the files that should be processed using a mask for file names. The program will process only those files whose names and extensions fit the mask. The mask setting is available in the workflow properties, 1.Input tab, Select files to process. You can use the “?” and “*” symbols in the mask. “?” stands for any single character and “*” stands for any number of any characters. For instance, mask *.* will select all files, mask *.tiff will select only files with the “.tiff” extension, and mask image*.* will select files of all types whose names start with the word “image”. For workflows of Hot Folder and Mail type, the default mask is *.*, i.e. all files from the Input folder are processed. For workflows of Document Library type, the default mask selects files of all supported image formats: *.bmp;*.dib;*.rle;*.dcx;*.djvu;*.djv;*.gif;*.jb2;*.jbig2;*.jp2;*.j2k;*.jpf;*.jpx;*.jpc;*.jpg;*.jpeg;*.pcx;*.pdf;*.png;*.tif;*.tiff;*.wdp;*.wmp.
The Other files setting specifies what should be done with files that do not fit the mask:
• Exceptions folder - files that do not fit the mask will be placed in the Exceptions folder.
• Output folder - files that do not fit the mask will be placed in the Output folder.
• No action - files that do not fit the mask will be ignored. Note: It is not recommended to use the No action option for workflows of Hot Folder type, as this may cause the hot folder to fill up with unprocessed files.
Note. Unprocessed files are always handled as a separate job in Recognition Server. If the workflow creates one job per folder, and in the folder it finds some files that need not be processed, it will create one job for the files that should be processed and another job for the unprocessed files.
Examples of using a mask:
• For Hot Folder-type workflow: Sometimes scanners create *.tmp files in the hot folder where the scanning is performed. A mask allows Recognition Server to ignore such files.
• For Document Library-type workflow: It may be required to recreate in the output folder exactly the same structure of files as in the input document library. Recognition Server can process all the image files by a mask, and move all the other files to the Output folder as is.
• For Mail-type workflow: Besides the attached images, the e-mail message may contain a logo or a signature in a GIF file. Recognition Server can process the attachment and ignore the GIF file.
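The mask behaviour described in this section can be approximated with Python's standard fnmatch module. This is a sketch for illustration, not the matching code used by Recognition Server. Case-insensitive comparison is an assumption, and fnmatch also gives special meaning to square brackets, which the text above does not mention. Unlike the product description of *.*, fnmatch requires a literal dot in the name for that pattern to match.

```python
from fnmatch import fnmatchcase


def matches_mask(filename: str, mask: str) -> bool:
    """Return True if filename fits any ';'-separated pattern in mask."""
    name = filename.lower()
    patterns = [p.strip().lower() for p in mask.split(";") if p.strip()]
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def split_by_mask(filenames, mask):
    """Partition filenames into (files to process, other files)."""
    to_process = [f for f in filenames if matches_mask(f, mask)]
    other_files = [f for f in filenames if not matches_mask(f, mask)]
    return to_process, other_files


# Sample hot folder contents, including a scanner temp file and a mail logo.
files = ["image001.tiff", "scan.tmp", "logo.gif", "image_cover.pdf"]
to_process, other_files = split_by_mask(files, "*.tiff;*.pdf")
```

Here to_process holds the TIFF and PDF files, while other_files holds the files that would be handled according to the Other files setting.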
The files that constitute a failed job can also be moved to the Output folder or the Exceptions folder, or be ignored by Recognition Server. This behavior is controlled by the Save failed jobs to setting on the 4.Quality control tab of the workflow properties. Note. If the user chooses to move unprocessed or failed files to the Output folder, and the workflow contains several output folders, the files will be copied to all of them.
1.2.6. Flexible detection of blank pages
Scanned empty pages can contain marks produced by the scanner or noise caused by shadows. This can confuse the recognition technology, because the page may appear to contain text. To avoid this, it is now possible to define flexibly what counts as a blank page. The Blank Pages Detection option in the 3.Document Separation settings of the Workflow Properties reduces incorrect blank page detection for low-quality images and for images that contain noise or non-textual objects created during scanning. Margins, the percentage of blackness, and the maximum number of objects allowed on a blank page can be specified in the Document Separation parameters.
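To show how the three parameters named above (margins, percentage of blackness, maximum number of objects) could interact, here is a small Python sketch. It is not ABBYY's detection algorithm, which this document does not describe. The page is modelled as a grid of 0 (white) and 1 (black) pixels, an "object" is assumed to be a group of 4-connected black pixels, and all function and parameter names are hypothetical.

```python
from collections import deque


def count_objects(pixels):
    """Count 4-connected groups of black pixels (value 1) in a 2D grid."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    objects = 0
    for y in range(height):
        for x in range(width):
            if pixels[y][x] != 1 or seen[y][x]:
                continue
            objects += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one object
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and pixels[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return objects


def is_blank(page, margin=0, max_black_percent=1.0, max_objects=3):
    """Treat a page as blank if the area inside the margins is nearly empty."""
    rows = page[margin:len(page) - margin]
    inner = [row[margin:len(row) - margin] for row in rows]
    total = sum(len(row) for row in inner)
    if total == 0:
        return True  # nothing left to inspect after cropping the margins
    black_percent = 100.0 * sum(sum(row) for row in inner) / total
    return black_percent <= max_black_percent and count_objects(inner) <= max_objects
```

In this model, a page with a dark scanner shadow along its edges and a single speck is reported as blank once a margin excludes the shadow, while the same page with margin=0 is not.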
1.2.7. Extended options on Indexing tab
Other changes on the 5.Indexing tab of the Workflow Properties dialog box:
• Order of document types can be changed using the Up and Down buttons.
• The default document type can be selected using the Default type checkbox.
1.2.8. Overwriting files in the output folder
If a file with the same name already exists in the output folder, Recognition Server can now overwrite it with the new output file. If the option Rewrite if file exist is not checked, a 4-digit index will be added to the file name. In the XML Result, the RewriteIfFileExists property has been added to the FormatSettings object.
1.2.9. SSL for POP3 email servers
Communication with POP3 email servers over SSL is now supported. Port 995 should be specified in the Port number field.
1.2.10. New processing parameters
1.2.10.1. KeepPages property
New property KeepPages has been added to the Configuration.xml file, under the ExportFormat section. The default value of this property is false. Set the value to true if you want to preserve page breaks as they were in the original file when exporting to DOC, DOCX or RTF formats.
1.2.10.2. Despeckle option
The option Despeckle images is now available in the workflow settings, Process tab, Advanced Processing Settings dialog. This option removes noise from the image. During despeckling, the program also removes background dots or boundary lines on raster forms. By default the option is switched off because in some cases it can make recognition worse or even cause the loss of information. We recommend switching it on only after you have tested it on several samples and made sure that it helps to remove noise on your images. The corresponding API property is RemoveGarbage.
1.2.10.3. Extending the font set
Some languages require fonts that are not included in the standard font set. For example, characters in Chinese, Japanese, Korean or Thai require special fonts to be displayed properly. These characters may not be correctly displayed on the Verification Station or in the text of the output document because, by default, ABBYY Recognition Server uses only the standard font set. This is done to keep the text representation uniform across all processing stations. If a special font is required, it can be enabled with the new parameter AllowedFontsMode, available in the RecognitionParams section of the Configuration.xml file. Possible values are:
• Default – In this mode, only the following fonts will be used: Arial, Times New Roman, and Courier New.
• All – All possible fonts will be used. Please note that processing will take longer in this case. It is also important that the same set of fonts is installed on all the stations; otherwise the result might be different on different computers.
A custom font set can also be defined in addition to the standard font set. A list of additional fonts can be added below the RecognitionParams section using the element AdditionalAllowedFont. The following example illustrates adding font AngsanaUPC to the set of standard fonts: <AdditionalAllowedFont>AngsanaUPC</AdditionalAllowedFont>
1.2.10.4. Enabling recognition of text inside pictures
To speed up processing, recognition of text inside picture objects is disabled by default. If you need to recognize text in pictures, you can enable this feature in the Configuration.xml file. This can only be done for the Quality recognition mode. The name of the parameter is ProhibitHiddenTextDetection, and the default value is true; set it to false to enable recognition of text inside pictures.
1.2.10.5. Limiting the number of recognized pages in a file
Sometimes it is necessary to extract the text only from the first few pages of a document. This option can now be switched on via an XML ticket: D:\Output Folder
If this option is set, only recognized pages will be output in text-based formats and counted against the license.
Notes:
• This setting will work only if document assembly is disabled (option Create one document for each file in job is selected); otherwise it will be ignored.
• This setting will be ignored if the output format is PDF.
• If Indexing or Verification is included in the workflow, all pages of the document will be opened on that station, but only the recognized pages will be available for indexing and editing. The operator will be able to recognize other pages on the Verification Station, if necessary. In this case, the newly recognized pages will also be counted against the license.
• This feature can be switched on via the Administration console for IFilter and Google Search Appliance workflows.
1.2.10.6. New barcode type – USPS-4CB (Intelligent Mail Barcode)
USPS-4CB barcodes, which are used on mail in the USA and are required by the US Postal Service, are now supported. These barcode values can now be recognized in documents. This barcode type can also be used for document separation (it can be selected as such in the workflow settings).
Implemented in: Release 2.
1.2.10.7. Disabled image compression of lossy JBIG2 type
Lossy JBIG2 image compression has been removed from the UI and internal compression parameters, as it produced low-quality output files. Implemented in: Release 2.
1.2.11. Export to ePub3 format
Export of output files to ePub v.3 format is supported. Implemented in: Release 2.
1.2.12. Settings of units measurement for export to ALTO XML
A unit of measurement (pixels, inches, or millimeters) can be selected when configuring export to ALTO XML format.
Implemented in: Release 2.
1.2.13. Export to specific column types in SharePoint
Export of index fields to specific SharePoint column types is supported:
• Single line of text
• Multiple lines of text
• Choice (menu to choose from)
• Number
• Currency
• Date and Time
• Yes/No (checkbox)
• Hyperlink or Picture
• Managed Metadata
The document attributes (index fields) should be mapped to the appropriate content types imported from the selected SharePoint library. To configure the mapping, click the Settings button. Then select the SharePoint document library in the output parameters. In the Mapping Document Attributes to SharePoint Columns window, establish the links between the RS document types (created on the Indexing tab) and the SharePoint content types (retrieved from the selected library). After the appropriate SharePoint content type is selected, the RS document attributes (index fields) can be mapped to the SharePoint columns.
Implemented in: Release 2.
1.3.
PDF processing features
1.3.1. Improved MRC compression
The MRC compression method for output PDF files has been improved. It delivers noticeably better visual quality while keeping file sizes almost as small as in previous Recognition Server releases. In terms of minimizing file size and preserving visual quality, the results are on par with those offered by the major players on the market.
The improved compression method is now used by default in all new workflows as well as in previously created workflows where the compressed PDF output format, Enhanced compression (MRC), is enabled. To disable the updated MRC and use the previous compression mode, set the LegacyMRCMode flag to True in the Configuration.xml file of the ABBYY Recognition Server settings.
To manage the quality/size parameters of the output files, the Max Quality – (balanced) – Min Size profiles can be selected. These profiles configure the settings for the desired PDF output automatically. For instance, when the Min Size profile is selected, the quality parameter is set to 30% and MRC compression is enabled.
Implemented in: Release 2.
1.3.2. PDF/A standards and PDF versions
Export settings for PDF and PDF/A have been extended with new options: it is now possible to specify the PDF version and the PDF/A standard. The list of available PDF/A standards includes:
• PDF/A-1a and PDF/A-1b
• PDF/A-2a, PDF/A-2b and PDF/A-2u
1.3.3. Export to PDF/A-3 format
Export of output files to PDF/A-3 format is now supported. It is possible to select the PDF/A-3a, PDF/A-3b, or PDF/A-3u standard of the PDF/A format. Please note: attachments cannot be written into the output PDF/A-3 file. Implemented in: Release 2.
1.3.4. Tagged PDF enabled by default
When a new output format for saving documents to PDF files is added, the option Enable tagged PDF (compatible with Adobe Acrobat 5.0 or above) is now enabled by default. Without this option, words in the output PDF file could contain extra spaces between individual letters, which would lead to problems with full-text search within the PDF file. Please note: this option may result in up to a 10% increase in file size. Implemented in: Release 2.
1.3.5. Possibility to skip processing of PDF with text layer
It is now possible to skip processing of PDF files that contain a text layer. Such files are moved to the output folder if the user selects the option Do not modify files with high quality text layer. The user can also select a detection mode:
• In Fast mode, the application simply detects whether there is a text layer in the file. If a text layer is detected, the file is moved to the output folder, ignoring the export settings. The pages of this file are not counted against the license for PDF export; note, however, that if there are output formats other than PDF, OCR is performed and the pages are counted.
• In Thorough mode, the application compares the text layer in the PDF with OCR results (a piece of text on each page is recognized and compared). If the text layer and the OCR results coincide, the file is moved to the output folder without re-recognition. However, because partial OCR is done on each page for comparison purposes, all pages are counted against the license.
When the text layer is compared to OCR results, the default threshold is 5%. This means that the program uses the OCR results if the texts differ by more than 5%. This threshold can be changed in the Configuration.xml file: SkipRecognizePdfsWithTextLayerCoefficient="25". This setting is located in the ExportFormat node of the PDF format properties.
Notes:
• Documents that are skipped in Fast mode are not sent to operator stations (indexing or verification).
• This setting is only applicable to input files in PDF format.
• This option works for single files only. If two files need to be merged together, or one file needs to be split into several documents, OCR is performed on all resulting files.
• This option works only when jobs are created “For each file”, which is typical for the scenario of preserving source PDF files.
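The threshold rule of the Thorough mode can be illustrated with a minimal Python sketch. It is not the product's algorithm: the way the difference between the two texts is measured is not documented here, so the sketch assumes a character-level similarity from the standard difflib module, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

DEFAULT_THRESHOLD_PERCENT = 5.0  # default threshold stated in the text above


def difference_percent(text_layer: str, ocr_text: str) -> float:
    """Return how much two texts differ, in percent (0 means identical)."""
    # Assumption: difflib similarity stands in for the product's own metric.
    similarity = SequenceMatcher(None, text_layer, ocr_text).ratio()
    return (1.0 - similarity) * 100.0


def keep_existing_text_layer(text_layer: str, ocr_text: str,
                             threshold_percent: float = DEFAULT_THRESHOLD_PERCENT) -> bool:
    """True if the texts differ by no more than the threshold.

    True means the source text layer is trusted; False means the
    OCR results would be used instead.
    """
    return difference_percent(text_layer, ocr_text) <= threshold_percent
```

Raising the threshold makes the check more tolerant of differences between the text layer and the OCR results, so more files pass through without re-recognition.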
1.3.6. Injection of a text layer in existing PDF files
Sometimes a newly recognized text layer needs to be added to PDF files while the image quality has to be preserved as it was in the original files. Original PDF files can also contain bookmarks, annotations, attachments or other features which need to be preserved. In such cases, it is now possible to inject the recognized text into the file while preserving the image quality and all of the PDF file’s features. The Modify text layer only option is available in the Output settings dialog for PDF and PDF/A formats. Note. This setting is only applicable to input files in PDF format.
1.3.7. Using PDF text layer for recognition results improvement
When PDF files with a text layer are OCRed by Recognition Server, the source text layer is used to improve the recognition results. For example, uncertainly recognized characters are checked against the information in the text layer and copied from there.
1.3.8. Using PDF text layer for generating quality output files of different formats
If an imported PDF file contains a text layer, it can be reused for creating quality output files in PDF and other formats, for example, PDF/A, ALTO XML, etc. When OCR is run, the original text layer in the imported files is detected. The quality of each original text character is evaluated before it is copied to the resulting file. This algorithm ensures the same or even better quality of the output file compared to the original file. Please note that the license counter is decreased even if the original files contain a text layer. Implemented in: Release 2.
1.3.9. Fast Web View mode for PDF files
The Fast Web View option is available in the Output settings dialog for PDF and PDF/A formats. If it is enabled, a preview of pages is created for faster opening when the file is published on the web.
1.4.
Technological advances
1.4.1. Special mode for processing of plans and drawings
The processing of technical drawings such as construction plans has been significantly improved. A special mode, Processing mode for technical drawings, has been introduced for such images; it is available on the 2. Process tab of the workflow settings. It is recommended to enable this mode for documents containing a lot of small details. In this mode, the program ignores graphical objects on the page but tries to extract all the textual information. Recognition in this mode is performed in three directions:
• Direction of main orientation, automatically detected
• Rotated clockwise to main orientation
• Rotated counterclockwise to main orientation
In the XML output file, the orientation of the text is indicated in the orientation attribute:
• RotatedClockwise
• RotatedCounterclockwise
• If not indicated, the orientation is normal (horizontal text)
Note. The usage of this mode can slow down image processing.
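As an illustration of reading the orientation attribute, the following Python sketch counts text fragments by orientation. The element names page and text in the sample are hypothetical (the actual XML output schema is not shown in this document), and "Normal" is only a label chosen here for fragments without the attribute.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the orientation attribute and its two
# values are taken from the text above.
SAMPLE = """<page>
  <text>Title block</text>
  <text orientation="RotatedClockwise">Section A-A</text>
  <text orientation="RotatedCounterclockwise">Scale 1:50</text>
</page>"""


def count_orientations(xml_text: str) -> Counter:
    """Count text elements per orientation; a missing attribute means normal text."""
    root = ET.fromstring(xml_text)
    return Counter(el.get("orientation", "Normal") for el in root.iter("text"))
```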
1.4.2. Speed increase for Arabic OCR
The speed of Arabic OCR has been increased significantly. Performance was measured on 2500 pages of Arabic text with export to RTF format. The test showed a speed increase of 17-20% compared to Recognition Server 3.5.
1.5.
Administration features
1.5.1. Updated UI of the Administration Console
The main window of the Administration Console has been redesigned for better usability. A workflow status pane that displays the current state of the selected workflow has been added. The information presented on the status pane includes:
• Workflow state: started or stopped
• Start time
• Stop time (if the workflow is stopped)
• Duration
• Total number of jobs
• Number of processed jobs
• Number of files that were copied to the output folder without processing
• Number of failed jobs
• The paths to the Output folders
• The path to the Exception folder
For a Document Library-type workflow that is currently running, the status pane also displays a progress bar with the percentage of processing completed.
If an error occurred in the workflow, the status pane will provide the error description.
1.5.2. User management via Active Directory groups
In the Users node of the Administration Console, it is now possible to add user groups from Active Directory. The Verifier or Indexer role can be assigned to a group, and all members of the group receive the permissions corresponding to the role. Users added to the AD group later are automatically granted permissions to work with Recognition Server. When a group is added in Recognition Server, the group name should be complete and include the domain name.
1.5.3. Logging of operator activities
The XML Result file now contains information about the operators who verified and indexed the documents. This information is available in the properties verificationUserName and indexingUserName inside JobDocument. If indexing and verification are not part of the workflow, these properties remain empty. The XML Result also contains the time of document indexing and verification. The Jobs Log contains information about rejected jobs: who rejected the job and on which station it was rejected.
1.5.4. Improved logging
The Jobs Log contains records about every job finished by Recognition Server. The Details pane consists of two tabs: the Files tab, which shows the input and output files of the corresponding job as well as the paths to the files, and the Details tab, with detailed information about the job. The Jobs Log is no longer limited to 500 records: its size or longevity can be controlled by the administrator via the Jobs Log Properties dialog box. Records in the Jobs Log can be searched using the Find button; search based on a mask is also supported.
By default, the Jobs Log contains two views: all jobs and failed jobs. It is possible to create custom views by applying custom filters to the log. This can be done via Create Custom View command in the context menu.
1.5.5. Notification for the administrator includes server and workflow names in the message text
The server name and the workflow names are included in the text of the notification messages that are automatically sent by e-mail to the administrator. This information helps the administrator identify and solve possible problems faster. The subject of the email message has the following structure (to be used for filtering the emails): ABBYY Recognition Server ():
Implemented in: Release 2.
1.5.6. Advance notification about license expiry
Notifications about upcoming license expiry have been added. Notifications can be received based on the following settings:
• Percentage of pages remaining in the license
• Number of days left before the license expires
Implemented in: Release 2.
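The two notification criteria can be sketched as a simple check. This is an illustration only: the function and parameter names are hypothetical, and combining the two criteria with a logical OR is an assumption, since the text does not say how they interact.

```python
from datetime import date


def license_warning_due(pages_left: int, pages_total: int,
                        expiry: date, today: date,
                        min_percent_left: float, min_days_left: int) -> bool:
    """Return True if either configured limit has been reached."""
    # Share of the page allowance that is still unused, in percent.
    percent_left = 100.0 * pages_left / pages_total if pages_total else 0.0
    # Calendar days remaining until the license expires.
    days_left = (expiry - today).days
    # Assumption: either criterion alone is enough to trigger a notification.
    return percent_left <= min_percent_left or days_left <= min_days_left
```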
1.5.7. Soft stop of workflow processing
It is possible to stop the processing of jobs using the so-called “soft” stop mechanism. With this stop function, the processing of all current jobs is finished, but new jobs are not accepted for processing. After the results of all current jobs are published, the workflow is stopped. For this manual “soft” stop, the Stop command should be used. (Note: If processing runs by schedule, the workflows are always stopped “softly”.)
If job processing must be interrupted immediately, the Stop immediately command should be used. In this case all current jobs are stopped without completion and their processing is postponed. The computing power is freed instantly. The postponed jobs are finished once the workflow is started again.
Implemented in: Release 2.
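The difference between the two stop commands can be shown with a toy model in Python. The Workflow class below is purely hypothetical and does not reflect the product's internals; it only mirrors the behaviour described above: a soft stop finishes the current jobs, while an immediate stop postpones them until the next start.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Toy model of a workflow queue (hypothetical, for illustration)."""
    current: deque = field(default_factory=deque)   # jobs being processed
    postponed: list = field(default_factory=list)   # jobs interrupted by an immediate stop
    finished: list = field(default_factory=list)    # jobs whose results were published
    accepting: bool = True

    def submit(self, job: str) -> bool:
        """Accept a new job unless the workflow has been stopped."""
        if not self.accepting:
            return False
        self.current.append(job)
        return True

    def stop(self) -> None:
        """Soft stop: refuse new jobs, finish everything already in progress."""
        self.accepting = False
        while self.current:
            self.finished.append(self.current.popleft())

    def stop_immediately(self) -> None:
        """Immediate stop: refuse new jobs, postpone current jobs unfinished."""
        self.accepting = False
        self.postponed.extend(self.current)
        self.current.clear()

    def start(self) -> None:
        """Restart: postponed jobs are picked up again."""
        self.accepting = True
        self.current.extend(self.postponed)
        self.postponed.clear()
```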
1.5.8. Job cancellation without the loss of files
It is now possible to cancel or delete a job without deleting the files that constitute the job. The Reject Job/Reject All Jobs command cancels the job(s) and saves the files together with the corresponding XML Result(s) in the Exception folder of the workflow. The Delete Job/Delete All Jobs command deletes the job(s); the files that constitute the jobs are placed in the server’s Exception folder.
1.6.
Operator stations
1.6.1. Scanning Station: Sending registration parameters values to index fields When scanning a batch, the registration parameters entered for a document can be imported as values of the document’s index fields. The lists of index fields (document types and their attributes) must be pre‐configured in the workflow properties (Indexing tab). To use this new option, specify the batch sending parameters on the Scanning Station: select the desired workflow and import the list of index fields by clicking the Import Registration Parameters button. When creating a batch, select the desired Batch Type, assemble the documents and assign the Document Types in the Registration Parameters window.
After processing the batch in Recognition Server, the documents with pre‐filled index fields’ values will be displayed in the Indexing station. It is possible to skip the indexing stage by using the following code in the indexing script: “SkipManualIndexing = true;”. In this case index fields’ values will be exported according to the workflow settings.
The values of document registration parameters can be obtained from indexing or export script by using the standard Attributes object. Also they are accessible from the XML result file.
Please note:
• Only values of the parameters imported from the workflow can be sent as index fields to Recognition Server (however, it is possible to create more registration parameters in the batch type settings at the Scanning Station).
• The types of the entered values should correspond to the types of the index fields specified in the workflow properties.
Implemented in: Release 2.
1.6.2. Selection of documents for verification and indexing Operators of Verification and Indexing Stations can now select documents manually from the queue and process them ahead of others. The button on both stations toggles between manual and automatic modes of receiving documents.
The button should be used to open the Select Document for… dialog box. In that dialog, the operator can find the required document by sorting documents by name, priority or creation date. A new information pane displays the number of documents in the queue. This pane appears:
• Between tasks in manual mode
• When connection with the server is lost
• When the current document is returned to the queue after the timeout has been reached
1.6.3. Saving of interim verification results
It is now possible to save interim changes in the document on both Verification and Indexing Stations using the Document > Save command or Ctrl+S. When the station is closed during document verification or indexing, the operator is prompted to save the results. The document with the saved changes is returned to the server and becomes available to other operators. Note. On the Indexing Station, it is only possible to save results after the document type is selected.
1.6.4. Timeout of inactivity To prevent documents from sitting idle on operator stations, documents are returned to the queue after a timeout of inactivity has been reached. In the previous version of Recognition Server, the timeout value was hardcoded to 120 minutes. That proved insufficient for verifying large documents such as books. Now the timeout value can be changed in the Recognition Server Properties dialog box, or in the configuration file Configuration.xml (change the value in OperatorStationInactiveTimeoutInMinutes="120" in the QueueManager node).
Note. This timeout is applied to all workflows and to all jobs on Verification and Indexing Stations.
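For administrators who prefer to script the change, the following Python sketch updates the timeout attribute in an XML string. Only the attribute name and the QueueManager node name come from the text above; the surrounding layout of Configuration.xml (root element, nesting, absence of namespaces) is an assumption, so the sample is hypothetical.

```python
import xml.etree.ElementTree as ET

ATTRIBUTE = "OperatorStationInactiveTimeoutInMinutes"

# Hypothetical minimal layout; the real file contains many more settings.
SAMPLE = ('<Configuration>'
          '<QueueManager OperatorStationInactiveTimeoutInMinutes="120" />'
          '</Configuration>')


def set_inactivity_timeout(xml_text: str, minutes: int) -> str:
    """Return the configuration text with the timeout attribute updated."""
    root = ET.fromstring(xml_text)
    # The QueueManager node may be the root itself or nested anywhere below it.
    node = root if root.tag == "QueueManager" else root.find(".//QueueManager")
    if node is None:
        raise ValueError("QueueManager node not found")
    node.set(ATTRIBUTE, str(minutes))
    return ET.tostring(root, encoding="unicode")
```

It is prudent to back up the configuration file before applying a change of this kind to a real installation.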
1.6.5. Import of document types and index fields
Document types and index fields can now be imported into a workflow from an XML or CSV file. This feature is useful if the same field needs to be used in different workflows. The feature is available on the Indexing tab of the Workflow Properties (click the Import… button). Imported files should have the following structure:
XML
Indexing.xml CSV
DocumentType type1 type1 type2
FieldName IsObligitary FieldType bbb List ccc SingleLine test TRUE MultipleLines
PossibleValues IsDefault Field1;Field2;Field3 TRUE Don't say Do this 1; test twice
1.6.6. Quick input of index fields When the operator starts typing an index field value, the values starting with the same letter will be automatically selected from the list of allowed values.
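The behaviour can be pictured as a prefix lookup in the list of allowed values. The sketch below is an illustration, not the station's implementation; the function name is hypothetical, and case-insensitive matching is an assumption.

```python
from typing import List, Optional


def first_matching_value(typed: str, allowed_values: List[str]) -> Optional[str]:
    """Return the first allowed value that starts with the typed text."""
    prefix = typed.casefold()
    if not prefix:
        return None  # nothing typed yet, nothing to select
    for value in allowed_values:
        # Assumption: matching ignores letter case.
        if value.casefold().startswith(prefix):
            return value
    return None
```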
1.6.7. Possibility to combine values from several regions into one index field
It is possible to combine information from several areas of the page into one index field. This option is convenient for multi-line index fields. To combine the values, hold the CTRL key and click the regions that contain the values to be used as a single index field. The values are aggregated and separated with spaces automatically. Implemented in: Release 2.
1.6.8. Station UI improvements
The following UI improvements have been made:
• When the operator starts typing a value into the index field on the Indexing Station, the values starting with the same characters will be automatically selected from the list of allowed values.
• The new Warnings button on the Verification Station allows the user to hide/show the warnings pane. The button also displays the number of issued warnings.
• The number of low-confidence characters is displayed on the Check Spelling button on the Verification Station.
• The Reject All Documents button on both Indexing and Verification stations is hidden from the toolbar and is only available in the menu. The Reject All Documents command should not be used very often because it rejects all documents of the job while the operator works on the current document only. The Reject command rejects only the current document.
• Information about the number of documents in the queue is now displayed in the status bar on both Indexing and Verification stations.
1.7.
Scripting
1.7.1. Access to subsequent pages from the document assembly script
A new property, RecognizedPage: UserProperty, has been added to the page object to enable document assembly based on the analysis of subsequent pages. The decision on whether a page belongs to a document can be made based on information about the following pages, for example, the same ID value on all the pages. Implemented in: Release 2.
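The idea of deciding on a page by looking at the pages that follow it can be sketched outside the product's scripting environment. The Python model below is hypothetical: pages are represented only by an ID value (None when no ID was found on the page), and the rule for pages without an ID is an assumption chosen for illustration.

```python
from typing import List, Optional


def assemble(page_ids: List[Optional[str]]) -> List[List[int]]:
    """Group page indexes into documents by ID, peeking ahead when a page has no ID."""
    documents: List[List[int]] = []
    current_id: Optional[str] = None
    for index, page_id in enumerate(page_ids):
        if page_id is None:
            # Look ahead: borrow the ID of the next page that has one;
            # if there is none, stay with the current document.
            page_id = next((p for p in page_ids[index + 1:] if p is not None),
                           current_id)
        if not documents or page_id != current_id:
            documents.append([])  # the ID changed, so a new document starts here
            current_id = page_id
        documents[-1].append(index)
    return documents
```

With this rule, a page without an ID that sits between two pages of the same document stays in that document, while one that precedes a page with a different ID opens the next document.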
1.7.2. Detecting the workflow name by script A new property was added for a page object to get the workflow name for the page that is being processed ‐ RecognizedPage: WorkflowName. This possibility allows copying scripts to several workflows without manual modifications. Implemented in: Release 2.
1.8.
Compatibility features and limitations
1.8.1. Discontinued support of Windows XP and Windows Server 2003 Recognition Server 4 does not support and cannot be installed on Windows XP and Windows Server 2003.
1.8.2. Compatibility with FineReader Engine 11 Recognition Server 4 supports export to internal FineReader format which is compatible with FineReader Engine version 11. For such export, the FineReader Internal format (*.layout, *.image) should be selected as output format. Two files with *.layout and *.image extensions will be created. This feature is helpful when FineReader Engine and Recognition Server are used in combination.
1.9.
Changes in API and XML result
1.9.1. Page tracking in XML result The XML Result in Recognition Server 4 shows the correspondence between original and resulting files and pages in them:
• The InputFile section has two new properties: Id and Pages. Id is the identifier of the input file. Pages represents the collection of the input file’s pages.
• For each page of the input file, the Page section has the following properties: Id, which identifies the page in the input file, and PageNumber, the number of the page in the input file.
• The JobDocument section has a new property, Pages, which represents the pages of the output document. For each page in the output document, the Page section has the following properties: FileId, the identifier of the input file this page belonged to, and PageId, the page identifier within the input file.
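The page tracking information can be used to trace each output page back to its source. The Python sketch below assumes a layout for the XML Result: the section and property names (InputFile, JobDocument, Pages, Page, Id, PageNumber, FileId, PageId) come from the description above, but the root element name and the choice of attributes rather than child elements are assumptions, so the sample is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML Result fragment: two input files, two output documents.
SAMPLE = """<XmlResult>
  <InputFile Id="f1">
    <Pages><Page Id="p1" PageNumber="1"/><Page Id="p2" PageNumber="2"/></Pages>
  </InputFile>
  <InputFile Id="f2">
    <Pages><Page Id="p1" PageNumber="1"/></Pages>
  </InputFile>
  <JobDocument>
    <Pages><Page FileId="f1" PageId="p1"/><Page FileId="f2" PageId="p1"/></Pages>
  </JobDocument>
  <JobDocument>
    <Pages><Page FileId="f1" PageId="p2"/></Pages>
  </JobDocument>
</XmlResult>"""


def map_output_pages(xml_text: str):
    """For each output document, list (input file ID, input page number) per page."""
    root = ET.fromstring(xml_text)
    # Index input pages by (file ID, page ID).
    page_numbers = {}
    for input_file in root.iter("InputFile"):
        for page in input_file.iter("Page"):
            key = (input_file.get("Id"), page.get("Id"))
            page_numbers[key] = int(page.get("PageNumber"))
    # Resolve every output page through that index.
    result = []
    for document in root.iter("JobDocument"):
        result.append([(p.get("FileId"), page_numbers[(p.get("FileId"), p.get("PageId"))])
                       for p in document.iter("Page")])
    return result
```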
1.9.2. Installation of COM-based API and Web API on 64-bit systems
Both the Web API and the COM API are automatically deployed by the installer on 64-bit operating systems without the need for manual setup.
1.9.3. Changes in COM‐based API and Web API The namespace of the COM API has been changed from ABBYYRecognitionServer3 to ABBYYRecognitionServer. The namespace of the Web API has been changed from RSSoapService3 to RSSoapService. The following objects were added to COM‐based and Web‐based API to trace the correspondence between original and resulting files and pages in them:
InputFile
This object represents one input image file and the results of the processing of this file.
Properties
Name | Type | Description
Pages | Pages, read-only | Returns a collection of pages of the input file.
ID | String, read-only | Unique identifier of the input file, generated by RS.
Pages
This object represents a collection of Page objects.
Page
This object represents a page of the input file. This is a child object of InputFile.
Properties
Name | Type | Description
ID | String, read-only | Unique page identifier, generated by RS.
Number | String, read-only | Page number in the input file.
JobDocument
This object represents one output document.
Properties
Name | Type | Description
PagePositions | PagePositions, read-only | Returns a collection of pages of the output document with the information about each page’s position in the input files.
PagePositions
This object represents a collection of PagePosition objects.
PagePosition
This object represents a page in the output document and information about this page’s position in the input files. This is a child object of JobDocument.
Properties
Name | Type | Description
FileId | String, read-only | ID of the input file the page belonged to.
PageId | String, read-only | ID of the page in that input file.
The following method has been added to the COM-based and Web-based API to support the ability to delete a job after asynchronous processing:
DeleteJob
A method that deletes a job and all of its images has been added to the IClient interface.
Interface | Name | Description
IClient | DeleteJob (string jobId) | Deletes a job with all of its images.
The following objects have been added to the COM-based API only, in order to:
• Check if the workflow is started or stopped
• Check if there is a connection with the server
• Check if indexing and/or verification is switched on in the workflow and change indexing or verification settings
The WorkflowState property has been added to the IWorkflow interface.
Interface | Name | Type | Description
IWorkflow | WorkflowState | WorkflowStateEnum, read-only | Returns the state of the workflow.
WorkflowStateEnum is an enumeration that lists possible workflow states.
Name | Description
WS_ApplyingSettings | The state of the workflow after it has been started and before the processing has begun. At this stage, the program checks if it can access the folder that contains the input documents. This state is very short in duration. The words "Applying Settings" or "Starting" are displayed in the console.
WS_Crawling | At this stage, the program checks the folders of the Document Library workflow. It counts files, adds them to the database, and prepares to process them. The word "Crawling" is displayed in the console.
WS_Finishing | The state of the workflow when processing is coming to an end. At this stage, the program completes publishing the files. The words "Finishing Processing" are displayed in the console.
WS_NotAvailable | The state of the workflow that is inaccessible. The words "Not Available" are displayed in the console, together with the reason why the workflow cannot be accessed.
WS_Processing | The principal state of the workflow, when files are being received, processed, and recognized. The word "Processing" is displayed in the console.
WS_StartingProcess | The state of the workflow after the start command has been executed and before information about the beginning of processing has been returned. The word "Starting" is displayed in the console.
WS_Suspended | The state of the workflow that has been stopped. The word "Stopped" is displayed in the console.
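A client application that polls the workflow state may want to translate the enumeration names into the console wording. The Python sketch below does this with a plain dictionary. It does not call the COM API; how the state value reaches the script, and the helper for deciding whether a workflow counts as running, are assumptions made for illustration.

```python
# Console wording per state, as listed in the table above.
CONSOLE_TEXT = {
    "WS_ApplyingSettings": "Applying Settings",  # the console may also show "Starting"
    "WS_Crawling": "Crawling",
    "WS_Finishing": "Finishing Processing",
    "WS_NotAvailable": "Not Available",
    "WS_Processing": "Processing",
    "WS_StartingProcess": "Starting",
    "WS_Suspended": "Stopped",
}

# Assumption: every state other than these two is treated as "started".
NOT_RUNNING = {"WS_Suspended", "WS_NotAvailable"}


def describe(state_name: str) -> str:
    """Return the console wording for a state name, or the name itself if unknown."""
    return CONSOLE_TEXT.get(state_name, state_name)


def is_running(state_name: str) -> bool:
    """Return True for known states that are neither stopped nor unavailable."""
    return state_name in CONSOLE_TEXT and state_name not in NOT_RUNNING
```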
Besides workflow states, it is possible to get the state of the server via the Connect method.
Interface | Name | Description
IClient | Connect(string serverName) | Establishes a connection with the server. If the server is stopped, there will be a COMException with this text: “ABBYY Recognition Server is not available: The client has successfully connected to the server, but the server is not running.”
A property that represents the server’s Exceptions folder has been added to the IClient interface.
Interface | Name | Type | Description
IClient | ServerExceptionsFolder | string, read-only | Returns the folder with the server’s exceptions.
It is now possible to switch verification on or off using IXmlTicket.
Interface | Name | Type | Description
IRecognitionParams | VerificationMode | VerificationModeEnum | Returns the verification mode.
IRecognitionParams | VerificationModeThreshold | double | Sets the accuracy threshold for selective verification.
VerificationModeEnum is an enumeration that lists possible verification modes.
Name | Description
DVM_DoNotVerify | Verification is switched off.
DVM_VerifyAlways | All documents are verified.
DVM_VerifyIfThresholdExceeded | Only documents with the number of low-confidence characters above the value indicated in VerificationModeThreshold are verified.
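The three verification modes amount to a simple decision rule, sketched below in Python. The enumeration member names are taken from the table above; the Python class, the function name, and the interpretation of the threshold as a plain count of low-confidence characters are assumptions (the text does not state the unit of VerificationModeThreshold).

```python
from enum import Enum


class VerificationMode(Enum):
    """Python stand-in for VerificationModeEnum (member names from the table above)."""
    DVM_DoNotVerify = "DVM_DoNotVerify"
    DVM_VerifyAlways = "DVM_VerifyAlways"
    DVM_VerifyIfThresholdExceeded = "DVM_VerifyIfThresholdExceeded"


def needs_verification(mode: VerificationMode,
                       low_confidence_chars: float,
                       threshold: float) -> bool:
    """Decide whether a document goes to the Verification Station."""
    if mode is VerificationMode.DVM_DoNotVerify:
        return False
    if mode is VerificationMode.DVM_VerifyAlways:
        return True
    # Selective verification: only values above the threshold qualify.
    return low_confidence_chars > threshold
```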
2. UI and Documentation localization
ABBYY Recognition Server 4 is localized according to the table below. The language newly supported in Release 2 (the Help file for the Scanning Station in French) is marked "+ (new in R2)".
Component | English | Russian | French | German | Italian | Spanish | Chinese | Portuguese (Brazil) | Czech | Hungarian | Polish
Resources
Console | + | + | + | + | + | + | + | + | + | + | +
Indexing Station | + | + | + | + | + | + | + | + | + | + | +
Verification Station | + | + | + | + | + | + | + | + | + | + | +
Scanning Station | + | + | + | + | + | + | + | + | + | + | +
Protection | + | + | + | + | + | + | + | + | + | + | +
Help
Console | + | + | + | + | + | + | - | - | - | - | -
Indexing Station | + | + | + | + | + | + | - | - | - | - | -
Verification Station | + | + | + | + | + | + | - | - | - | - | -
Scanning Station | + | + | + (new in R2) | + | - | - | - | - | - | - | -
Open API | + | - | - | - | - | - | - | - | - | - | -
Admin Guide | + | + | + | + | + | + | - | - | - | - | -
EULA | + | + | + | + | + | + | + | + | + | + | +
Installer
Recognition Server | + | + | + | + | + | + | + | + | + | + | +
IFilter | + | + | + | + | + | + | + | + | + | + | +
Autorun | + | + | + | + | + | + | + | + | + | + | +