Timothy Spann

Developer Advocate

Timothy Spann is speaking at the following session:

Real-Time Cloud Native Open Source Streaming Of Any Data to Apache Solr

Americas | 3:40PM - 3:40PM

Using Apache Pulsar and Apache NiFi, we can parse any document in real time at scale. We receive many documents via cloud storage, email, social channels, and internal document stores. We want to make all of the content and metadata available in Apache Solr for categorization, full-text search, optimization, and combination with other datastores. We will stream not only documents but also REST feeds, logs, and IoT data. Once data is produced to Pulsar topics, it can instantly be ingested into Solr through the Pulsar Solr Sink.

Using a number of open source tools, we have created a scalable, real-time data flow that parses any document. We use Apache Tika for document processing with real-time language detection, Apache OpenNLP for natural language processing, and Stanford CoreNLP, spaCy, and TextBlob for sentiment analysis. We will walk everyone through creating an open source flow of documents using Apache NiFi as our integration engine. We can convert PDF, Excel, and Word documents to HTML and/or text. We can also extract the text and apply sentiment analysis and NLP categorization to generate additional metadata about our documents. We will also extract and parse images; if they contain text, we can extract it with TensorFlow and Tesseract.

Intended Audience

Data Engineers, Search Engineers, Programmers, Analysts, Data Scientists, Operators

Attendee Takeaway

You will learn how to use open source streaming tools to sink data to Solr at scale.

All Levels