Tika now has a module for deep learning powered by the DL4J toolkit. The initial included model is InceptionV3, so with this module Tika can use deep learning, natively in Java, for metadata/text extraction from images using the Inception model (Github-165).

A new parser for sentiment analysis using categorical (multi-class: angry, sad, neutral, like, love) and binary (positive/negative) models was added, leveraging the USC data science work (TIKA-2016).

Tika now has the ability to automatically detect objects in videos, using OpenCV and TensorFlow (TIKA-2322).

Change default behavior to parse embedded documents even if the user forgets to specify a Parser.class in the ParseContext (TIKA-2096). Users who wish to parse only the container document should set an EmptyParser as the Parser.class in the ParseContext.
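
The container-only setup described above can be sketched as follows. This is a minimal illustration, assuming tika-core is on the class path; the class name ContainerOnlyContext is hypothetical and not part of Tika.

```java
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;

// Hypothetical helper: builds a ParseContext that suppresses embedded-document parsing.
class ContainerOnlyContext {
    static ParseContext build() {
        ParseContext context = new ParseContext();
        // EmptyParser produces no output, so embedded documents are effectively skipped.
        context.set(Parser.class, EmptyParser.INSTANCE);
        return context;
    }
}
```

The resulting context would be passed as the last argument of Parser.parse(stream, handler, metadata, context).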

Change default behavior of Office Parsers to _not_ extract macros. Users must call setExtractMacros(true) on OfficeParserConfig to enable macro extraction (TIKA-2302).

Added tika-eval module (TIKA-1332).

Unified logging across Tika: SLF4J as the logging API, Apache Log4j as the implementation, with JCL and JUL bridges in standalone tools like tika-app, tika-batch and tika-server (TIKA-2245).

Add parser for XLSB files (TIKA-1195).

Add parsers for EMF/WMF files (TIKA-2246/TIKA-2247).

Add parsers for WordPerfect and QuattroPro (.qpw) files. Contributed by Pascal Essiembre (TIKA-1946 and TIKA-2228).

Add experimental SAX parser for .pptx files. To select this parser, set useSAXPptxExtractor(true) on OfficeParserConfig (TIKA-2210).

Add experimental SAX parser for .docx files. To select this parser, set useSAXDocxExtractor(true) on OfficeParserConfig (TIKA-1321, TIKA-2191).
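
Opting in to the experimental SAX parsers for .docx and .pptx might look like the sketch below. It assumes tika-parsers is on the class path and that the properties named above are exposed through the bean-style setters setUseSAXDocxExtractor and setUseSAXPptxExtractor; check these against the OfficeParserConfig of the Tika version in use. The class name SaxOoxmlContext is hypothetical.

```java
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;

// Hypothetical helper: enables the experimental SAX-based .docx and .pptx extractors.
class SaxOoxmlContext {
    static ParseContext build() {
        OfficeParserConfig config = new OfficeParserConfig();
        // Assumed setter names for the useSAXDocxExtractor / useSAXPptxExtractor properties.
        config.setUseSAXDocxExtractor(true);
        config.setUseSAXPptxExtractor(true);

        ParseContext context = new ParseContext();
        // Parsers look the config up in the context by its class.
        context.set(OfficeParserConfig.class, config);
        return context;
    }
}
```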

Add mime detection and parser for Word 2006ML format (TIKA-2179).

Bug fix for WordPerfect via Pascal Essiembre (TIKA-2352).

Added "text-main" equivalent option to tika-server via /tika/main (TIKA-2343).
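
A request to the new endpoint can be sketched with the JDK's own HTTP client (Java 11+). The base URL passed in, the use of PUT with an Accept: text/plain header, and the class name TikaMainRequest are assumptions for illustration; adjust them to the actual tika-server deployment.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds a request asking tika-server for the main content of a document.
class TikaMainRequest {
    static HttpRequest build(String serverBase, byte[] document) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/tika/main"))
                // Ask for plain text back (assumed to be what the endpoint produces).
                .header("Accept", "text/plain")
                // The raw document bytes are sent as the request body.
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running server, for example one started locally at http://localhost:9998.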

Enabled configuration of the EncodingDetector used by parsers that extend AbstractEncodingDetectorParser (TIKA-2273).

Prevent easily preventable OOMs for both detection and parsing of some compression formats (TIKA-2330).

Extract images and thumbnails from ODT via Sam Bayer (TIKA-2295).

Fix potential NPE in FeedParser via Julien Nioche (TIKA-2269).

Official mime types for BMP, EMF and WMF have been registered with IANA, so switch to these (image/bmp, image/emf, image/wmf) (TIKA-2250).

Be more parsimonious with BufferedInputStreams via Josh Hight (TIKA-2244).

Enable handling of hyphenated language codes in TesseractOCRParser via Graham Russell (TIKA-2231).

Improve style tags in ODT (TIKA-2242).

Add container detection for embedded MSEquation files (TIKA-2238).

Add parsing of JBIG2 and extraction of JBIG2 from PDFs when required dependencies are added to class path by user. Contributed by Pascal Essiembre (TIKA-2232).

Add mime magic for the OneNote family (.one / .onetoc / .onepkg); no parser (TIKA-2224).

Add configurability of "preserve-interword-spacing" to TesseractOCRParser (TIKA-2190).

Upgrade PDFBox to 2.0.6 and JempBox to 1.8.13 (TIKA-2361).

Refactor MockParser to consolidate service loading and mime types into tika-core/src/test (TIKA-2195).

Enabled extraction of embedded objects from headers, footers, footnotes, endnotes and comments in legacy .docx parser (TIKA-2192).

Allow extraction of PDActions (including JavaScript) from PDFs (TIKA-2090). This is turned off by default. Users must setExtractActions(true) on the PDFParserConfig.
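
Turning on PDAction extraction can be sketched as below, assuming tika-parsers is on the class path. The class name PdfActionsContext is hypothetical.

```java
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;

// Hypothetical helper: enables extraction of PDActions (including JavaScript) from PDFs.
class PdfActionsContext {
    static ParseContext build() {
        PDFParserConfig config = new PDFParserConfig();
        // Off by default; must be switched on explicitly.
        config.setExtractActions(true);

        ParseContext context = new ParseContext();
        // The PDF parser reads its config from the context by class.
        context.set(PDFParserConfig.class, config);
        return context;
    }
}
```

Because extracted actions may contain script content from untrusted files, leaving the default in place is the conservative choice unless the output is actually needed.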

Change default behavior in experimental .docx parser to ignore deleted text to align with .doc (TIKA-2187).

Upgrade to Apache POI 3.16 (TIKA-2116, TIKA-2181, TIKA-2329).

Allow configuration of timeout for ForkParser (TIKA-2170).

Add extraction of .jpx inline images from PDFs when required dependencies are added by user to class path (TIKA-2175).

Add .jpx, .jp2, .ppm to formats handled by Tesseract (TIKA-2174).

Upgrade "provided" Sqlite to 3.16.1 (TIKA-2334).

Upgrade CXF version to 3.0.12 (TIKA-2292).

Add Lingo24 Language Detector (TIKA-2297).

Further mime magic for WebVTT (TIKA-1772).

Extend support for increased PSM options up to 13 for modern versions of Tesseract (TIKA-2357).
