Monday, April 25, 2011

Faster way to get distinct values from Lucene Query

Currently I do it like this:

IndexSearcher searcher = new IndexSearcher(lucenePath);
Hits hits = searcher.Search(query);
Document doc;
List<string> companyNames = new List<string>();

for (int i = 0; i < hits.Length(); i++)
{
    doc = hits.Doc(i);
    companyNames.Add(doc.Get("companyName"));
}
searcher.Close();

companyNames = companyNames.Distinct().Skip(offSet ?? 0).ToList();
return companyNames.Take(count ?? companyNames.Count).ToList();

As you can see, I first collect ALL the field values (several thousand), then remove the duplicates, optionally skip some, and take a page of what is left.

I feel like there should be a better way to do this.
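
For reference, the distinct/skip/take step described above can be made lazy, so that enumeration stops as soon as enough distinct values have been found. This is a minimal sketch, not a Lucene feature: the class and method names (DistinctPaging, DistinctPage) are hypothetical, and it assumes the caller passes in a lazily evaluated sequence of field values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctPaging
{
    // Hypothetical helper: returns one "page" of distinct values.
    // Distinct, Skip and Take are all deferred, so the source sequence
    // is only read until the page is full (or the source runs out).
    public static List<string> DistinctPage(IEnumerable<string> values, int? offset, int? count)
    {
        IEnumerable<string> page = values
            .Distinct()          // drop duplicates, keeping first occurrences
            .Skip(offset ?? 0);  // no offset means start at the beginning

        if (count.HasValue)
        {
            page = page.Take(count.Value); // no count means take everything
        }

        // Materialize here, while the underlying searcher is still open.
        return page.ToList();
    }
}
```

With the code from the question, the input could be something like Enumerable.Range(0, hits.Length()).Select(i => hits.Doc(i).Get("companyName")), and the call has to happen before searcher.Close(). When a count is given, documents past the last needed value are never loaded. Without one, every hit is still read, just as before.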

From stackoverflow
  • I'm not sure there is, honestly, as Lucene doesn't provide 'distinct' functionality. I believe that with Solr you can use a facet search to achieve this, but if you want it in plain Lucene, you'd have to write some sort of facet functionality yourself. So as long as you don't run into any performance issues, you should be OK doing it this way.

    borisCallens : Ok, thanks for letting me know.
  • Tying this question to an earlier question of yours (re: "Too many clauses"), I think you should definitely be looking at term enumeration from the index reader. Cache the results (I used a sorted dictionary keyed on the field name, with a list of terms as the data, to a max of 100 terms per field) until the index reader becomes invalid and away you go.

    Or perhaps I should say that, when faced with a problem similar to yours, that's what I did.

    Hope this helps,

    borisCallens : Could you elaborate on what you mean by "Term Enumeration"? Do you mean enumerating all my documents and getting those fields so I can use C#'s StartsWith()?
    borisCallens : +1 for seeing the question behind the question
    Moleski : Have a look at the Terms member function of the IndexReader class. BTW, I found out a good deal about this kind of thing by having a look at the Luke source code. Very interesting!
    borisCallens : I'm not a big fan of Luke actually. I don't know why, but it takes ages for each query to parse. Way slower than my own queries.
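
The term enumeration suggested in the second answer could look roughly like the sketch below. It assumes the same method-style Lucene.Net 2.x API that the question uses (later versions expose Term, Field and Text as properties instead). The class and method names (FieldTerms, GetTerms) and the maxTerms parameter are hypothetical. Two caveats: the enumeration covers every term of the field in the whole index, not just the hits of a particular query, and it only returns whole company names if the field was indexed untokenized.

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;

public static class FieldTerms
{
    // Hypothetical helper: returns up to maxTerms distinct terms of one field.
    // Terms in the index are unique and sorted, so no Distinct() is needed.
    public static List<string> GetTerms(IndexReader reader, string field, int maxTerms)
    {
        List<string> terms = new List<string>();

        // Terms(Term) positions the enumeration on the first term that is
        // greater than or equal to the given one, i.e. the start of the field.
        TermEnum termEnum = reader.Terms(new Term(field, ""));
        try
        {
            while (terms.Count < maxTerms)
            {
                Term term = termEnum.Term();

                // A null term or a different field name means we have
                // run past the last term of the requested field.
                if (term == null || term.Field() != field)
                {
                    break;
                }

                terms.Add(term.Text());

                if (!termEnum.Next())
                {
                    break; // no more terms in the index
                }
            }
        }
        finally
        {
            termEnum.Close();
        }

        return terms;
    }
}
```

The result can then be cached per field name, as the answer describes, and rebuilt whenever the index reader is reopened.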
