Using Datafari to extract text for academic research on NLU and NLP

Extracting raw text for Natural Language Understanding (NLU) or Natural Language Processing (NLP) is often a tedious and time-consuming task. Any student or researcher who has had to prepare such a pipeline knows what we are talking about: first assess the available open source technologies (very often Apache Tika), then understand how they work, put documents in a folder, and get it all working through trial and error, probably with a Python script.
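To make that concrete, here is a minimal sketch of the kind of hand-rolled script we mean. It is an illustration only, not something taken from Datafari: the function names `tika_extract` and `extract_folder` are hypothetical, and the Tika call assumes the third-party `tika` Python package is installed along with a Java runtime.

```python
from pathlib import Path
from typing import Callable


def tika_extract(path: Path) -> str:
    """Extract text with the third-party tika package (assumed installed, needs Java)."""
    from tika import parser  # imported lazily so the rest of the sketch runs without it

    parsed = parser.from_file(str(path))
    return parsed.get("content") or ""  # "content" can be None for some files


def extract_folder(
    src: Path, dst: Path, extract: Callable[[Path], str] = tika_extract
) -> list[Path]:
    """Write one .txt file into dst for every file found under src."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(p for p in src.rglob("*") if p.is_file()):
        try:
            text = extract(doc)
        except Exception as err:  # the trial-and-error part: report and move on
            print(f"skipped {doc}: {err}")
            continue
        # Flatten the relative path so files from subfolders do not collide.
        out = dst / ("_".join(doc.relative_to(src).parts) + ".txt")
        out.write_text(text.strip(), encoding="utf-8")
        written.append(out)
    return written
```

Calling `extract_folder(Path("docs"), Path("text_out"))` would then walk a hypothetical `docs` folder and write the extracted text to `text_out`. Even this small sketch leaves out retries, large-file handling, and any source other than a local folder, which is where the effort usually goes.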

This is what we had in mind when preparing documentation on how to use Datafari Community Edition for exactly that purpose. After all, Datafari is an enterprise search solution, so text extraction is already part of its core mission: indexing documents and making them searchable.

With the documentation we provide, researchers can set up a fully operational pipeline that reads from a specific shared folder, extracts the text (via Apache Tika), and outputs it to a dedicated folder. With a bit more motivation, researchers can go further and use connectors other than the file share, as the pipeline can work with any data source.

Discover now how to extract text from any document thanks to Datafari.

France Labs Enterprise Search Blog

blog on Enterprise Search, Solr, Datafari, ManifoldCF