Principal Artificial Intelligence Engineer
The MITRE Corporation
Tim has been working in natural language processing since 2002. Over the last 5+ years, his focus has shifted to content/metadata extraction (and evaluation), advanced search, and relevance tuning. Tim is a member of the Apache Software Foundation (ASF), the chair/VP of Apache Tika, a committer and PMC member on Apache PDFBox (since 2016) and Apache POI (since 2013), and a committer on Apache Lucene/Solr (since 2018). Tim holds a Ph.D. in Classical Studies, and in a former life, he was a professor of Latin and Greek.
Tim Allison is speaking at the following sessions:
(R)Evolving Relevance Tuning with Genetic Algorithms
This talk builds on work by Simon Hughes and others that applies genetic algorithms (GAs) and random search to find optimal parameters for relevance ranking. While manual tuning can be useful, the parameter space is too vast to be confident that one has found optimal parameters without overfitting. We'll present Quaerite (https://github.com/mitre/quaerite), an open source toolkit that allows users to specify experiment parameters and then run a random search and/or a GA to identify the best settings given ground truth. We'll offer an overview of mapping the Solr parameter space to a GA problem, the importance of the baked-in n-fold cross-validation, and the surprises and successes found with deployed search systems.
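To make the idea concrete, the following is a minimal sketch of a GA searching over relevance parameters. The parameter names (field boost weights, as in a Solr edismax qf), the toy fitness function, and all function names are illustrative assumptions for this sketch, not Quaerite's API; in practice, fitness would be an evaluation metric such as NDCG computed against ground-truth judgments.

```python
import random

# Hypothetical parameter space: field boost weights (illustrative, not Quaerite's API).
PARAM_SPACE = {
    "title_boost": (0.0, 10.0),
    "body_boost": (0.0, 10.0),
    "anchor_boost": (0.0, 5.0),
}

def random_candidate():
    """Draw one candidate uniformly from the parameter space."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in PARAM_SPACE.items()}

def fitness(candidate):
    # Stand-in for a real evaluation metric (e.g., NDCG@10 on judged queries).
    # This toy function has a known optimum at title=7, body=2, anchor=1.
    return -((candidate["title_boost"] - 7) ** 2
             + (candidate["body_boost"] - 2) ** 2
             + (candidate["anchor_boost"] - 1) ** 2)

def crossover(a, b):
    # Uniform crossover: each parameter is taken from one parent at random.
    return {k: random.choice((a[k], b[k])) for k in PARAM_SPACE}

def mutate(c, rate=0.2):
    # With probability `rate`, resample a parameter from its full range.
    out = dict(c)
    for k, (lo, hi) in PARAM_SPACE.items():
        if random.random() < rate:
            out[k] = random.uniform(lo, hi)
    return out

def evolve(pop_size=30, generations=40):
    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection; best half survives
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

best = evolve()
```

Because the best half of each generation is carried over unchanged, the best-so-far candidate is never lost; after a few dozen generations the population converges near the optimum of the fitness surface.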
Anyone interested in leveraging machine learning to improve relevance should find this useful. They'll learn about the overall technique and critical aspects for rigorous implementation.
This is intended for a technical audience already familiar with Solr parameters and relevance tuning.
Evaluating Content/Text Extraction at Scale
Apache Tika is widely used as a critical enabling technology for search in Solr and other systems. This library allows parsing and text extraction of numerous file formats. When text extraction goes wrong, though, the reliability of search is seriously undermined. Typically, search engineers pay little attention to the quality of the extracted content and hope for the best.
This talk offers an overview of the tika-eval module and discusses ways of scaling its NLP/language-modeling-based metrics to identify mojibake, corrupt text, and/or bad OCR at scale. Search engineers can use these statistics to determine whether to index a document, flag it as potentially corrupt, or apply more computationally expensive methods to try to extract reliable text.
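As a rough illustration of this kind of triage, the sketch below computes a few cheap text-quality heuristics and uses thresholds to flag likely-corrupt extractions. These metrics, names, and thresholds are simplified assumptions for illustration only; tika-eval itself computes much richer statistics, including language-model-based scores.

```python
import unicodedata

def text_quality_metrics(text):
    """Cheap heuristics in the spirit of extraction-quality profiling.

    Simplified stand-ins, not tika-eval's actual metrics or API.
    """
    if not text:
        return {"replacement_ratio": 1.0, "control_ratio": 1.0, "alpha_ratio": 0.0}
    n = len(text)
    # U+FFFD marks characters the decoder could not interpret (mojibake signal).
    replacement = text.count("\ufffd") / n
    # Control characters other than ordinary whitespace suggest binary junk.
    control = sum(1 for c in text
                  if unicodedata.category(c) == "Cc" and c not in "\n\r\t") / n
    # Natural-language text is dominated by letters and whitespace.
    alpha = sum(c.isalpha() or c.isspace() for c in text) / n
    return {"replacement_ratio": replacement,
            "control_ratio": control,
            "alpha_ratio": alpha}

def looks_corrupt(text, max_replacement=0.01, max_control=0.01, min_alpha=0.6):
    """Flag a document for skipping, re-extraction (e.g., OCR), or review."""
    m = text_quality_metrics(text)
    return (m["replacement_ratio"] > max_replacement
            or m["control_ratio"] > max_control
            or m["alpha_ratio"] < min_alpha)
```

A pipeline might index documents that pass, and route flagged ones to a slower fallback such as a different parser or OCR.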
Anyone processing documents at scale who is concerned about reliability should find this topic useful. Attendees will learn about the available metrics and how to integrate the tika-eval module's metrics at scale.
This is intended for a technical audience.