Rafał Kuć

Software Engineer
Sematext Group, Inc.

Software engineer, trainer, consultant and occasional author. Some would say that he is an all-in-one battle weapon, concentrated mostly on Lucene, Solr and Elasticsearch. However, he also likes all the other cool stuff happening in the IT world, and he likes to share his knowledge by giving talks at various meetups and conferences.

Rafał Kuć is speaking at the following session/s

Tweaking the Base Score: Lucene/Solr Similarities Explained

Thursday | 1:30PM - 2:10PM | Jefferson East

Lucene has a lot of options for configuring similarity, and Solr inherits them. Similarity forms the base of your relevancy score: how similar is this document to the query? The default similarity (BM25) is a good start, but you may need to tweak it for your use case. In this session, you will learn how BM25 works and how you may want to change its parameters. Then, we'll move to other similarity classes: DFR, DFI, IB and LM. You will learn the thinking behind them, how that thinking translates to the similarity score, and which parameters allow you to tweak how the score evolves based on things like term frequency or document length. By the end, you'll have a good understanding of which similarity options are likely to work well for your use case. You'll know which tunables are available and whether you need to implement a custom similarity class. As an example, we'll focus on e-commerce, where you often end up ignoring term frequency altogether.
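
To make the BM25 tunables concrete, here is a minimal sketch of the textbook BM25 score for a single term. It is an illustration rather than Lucene's actual code: the function name and argument names are hypothetical, and Lucene's own implementation may differ in constant factors. The defaults shown (k1 = 1.2, b = 0.75) are the usual BM25 defaults.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document, in terms
    avg_doc_len -- average document length in the index
    doc_count   -- number of documents in the index
    doc_freq    -- number of documents containing the term
    k1          -- term frequency saturation (0 ignores term frequency)
    b           -- document length normalization (0 ignores length)
    """
    if tf == 0:
        return 0.0  # the term does not occur, so it contributes nothing
    # Rarer terms get a higher inverse document frequency.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average documents are penalized in proportion to b.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Hypothetical index statistics, chosen only for illustration.
stats = dict(doc_len=20, avg_doc_len=20, doc_count=1000, doc_freq=10)

default_once = bm25_term_score(tf=1, **stats)
default_many = bm25_term_score(tf=10, **stats)

# With k1 = 0, only presence of the term matters, not how often it occurs.
flat_once = bm25_term_score(tf=1, k1=0.0, **stats)
flat_many = bm25_term_score(tf=10, k1=0.0, **stats)
```

With the defaults, `default_many` is higher than `default_once`, although the gain flattens as the term repeats. With `k1=0`, `flat_once` and `flat_many` are equal, which is one way to get the "ignore term frequency" behavior mentioned for e-commerce.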

Attendee Takeaway
1) What the built-in Lucene/Solr similarities are and what they do
2) Which similarity to use for which use case
3) How to use a custom similarity class in Solr

Intended Audience 
Lucene/Solr users interested in how scoring works, the ideas behind default scoring options and how to configure them

Level:
All Levels