Monday, April 25, 2011

Using Perl, how do I show the context around a search term in the search results?

I am writing a Perl script that is searching for a term in large portions of text. What I would like to display back to the user is a small subset of the text around the search term, so the user can have context of where this search term is used. Google search results are a good example of what I am trying to accomplish, where the context of your search term is displayed under the title of the link.

My basic search is using this:

if ($text =~ /$search/i ) {
    print "${title}:${text}\n";
}

($title holds the title of the item in which the search term was found.) This prints too much, though, since $text sometimes holds hundreds of lines of text.

This is going to be displayed on the web, so I could just provide the title as a link to the actual text, but there is no context for the user.

I tried modifying my regex to capture 4 words before and 4 words after the search term, but ran into problems if the search term was at the very beginning or very end of $text.

What would be a good way to accomplish this? I tried searching CPAN because I'm sure someone has a module for this, but I can't think of the right terms to search for. I would like to do this without modules if possible, because getting modules installed here is a pain. Does anyone have any ideas?

Thanks in advance!

Brian

From stackoverflow
  • You could try the following:

    if ($text =~ /(.*)$search(.*)/i ) {
    
      my @before_words = split ' ', $1;
      my @after_words = split ' ',$2;
    
      my $before_str = get_last_x_words_from_array(@before_words);
      my $after_str = get_first_x_words_from_array(@after_words); 
    
      print $before_str . ' ' . $search . ' ' . $after_str;
    
    }
    

    Some code obviously omitted, but this should give you an idea of the approach.

    As far as extracting the title ... I think this approach does not lend itself to that very well.
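
For reference, a minimal sketch of how the omitted helpers might look. The names last_words, first_words and snippet are hypothetical (they are not from the answer above), and the \Q...\E quoting and the /s flag are added assumptions so that the term is matched literally and the text may span several lines.

```perl
use strict;
use warnings;

# Hypothetical helpers: return up to $count words from the end or the
# start of a word list, joined back into a single string.
sub last_words {
    my ($count, @words) = @_;
    my $start = @words > $count ? @words - $count : 0;
    return join ' ', @words[$start .. $#words];
}

sub first_words {
    my ($count, @words) = @_;
    my $end = @words > $count ? $count - 1 : $#words;
    return join ' ', @words[0 .. $end];
}

# Build "before MATCH after" with up to $count words on each side.
# Returns nothing when the term does not occur in $text.
sub snippet {
    my ($text, $search, $count) = @_;
    # \Q...\E treats the term literally; /s lets .* span newlines.
    # (.*?) is non-greedy, so the first occurrence is used.
    return unless $text =~ /(.*?)(\Q$search\E)(.*)/is;
    my ($before, $match, $after) = ($1, $2, $3);
    my @parts = (
        last_words($count, split ' ', $before),
        $match,
        first_words($count, split ' ', $after),
    );
    # Drop empty pieces so a match at the very start or end of the
    # text does not produce stray spaces.
    return join ' ', grep { length } @parts;
}

print snippet("The quick brown fox jumps over the lazy dog", "jumps", 2), "\n";
```

Because the word lists are simply shorter when the term sits at the very beginning or end of $text, this version needs no special case for those positions.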

  • Your initial attempt at 4 words before/after wasn't too far off.

    Try:

    if ($text =~ /((\S+\s+){0,4})($search)((\s+\S+){0,4})/i) {
        my ($pre, $match, $post) = ($1, $3, $4);
        ...
    }
    
    BrianH : Okay, that works perfectly now, but takes a *very* long time. Using the same data, mine (which doesn't return correct results :) ) runs in less than 1 second. I changed the code to your snippet, and it ran for more than 15 seconds... Any guesses on how to improve performance?
    BrianH : if ($text =~ /((\S+\s+){0,4})($search)((\S+\s+){0,4})/ ) { print "$1$3$4\n"; } This produces the right output, and it flies. Thanks so much for your help!
    BrianH : I basically removed the ?: - not sure why that reduces performance to have them in, though...
    BrianH : Oooh - sorry - it wasn't the ?: - somehow I removed the /i from the end. My search was running fast because it was running case sensitive. When I add the /i back on the end, the performance slows *way* down. Your original solution works perfectly!
    BrianH : So now I need to figure out how to perform this matching case-insensitive, and still be fast...
    denkfaul : It looks like it works with or without the ?:, it just creates another captured variable if you leave it out. I'll leave this as is unless someone can pipe in and explain what is better in this case :)
    BrianH : Sorry to confuse - my 4th comment explains it was actually yours doing a case-INsensitive match (which I want) that was causing the slowness. If I only search for the term without words around it, case insensitive matches go very fast.
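
As an illustration of the ?: point discussed in these comments, here is a sketch of the same pattern with non-capturing inner groups, so the three pieces land in $1, $2 and $3. The function name context_parts and the $words parameter are hypothetical, and quotemeta is an added assumption so that a search term containing regex metacharacters is matched literally. It makes no attempt to address the speed problem described above.

```perl
use strict;
use warnings;

# Return (pre, match, post) with up to $words words of context on each
# side, or an empty list when $search does not occur in $text.
sub context_parts {
    my ($text, $search, $words) = @_;
    my $term = quotemeta $search;    # match the term literally
    # (?:...) groups without capturing, so only the three outer
    # groups are numbered. {0,N} allows fewer than N words, which
    # covers a match at the very start or end of the text.
    if ($text =~ /((?:\S+\s+){0,$words})($term)((?:\s+\S+){0,$words})/i) {
        return ($1, $2, $3);
    }
    return;
}

my ($pre, $match, $post) =
    context_parts("a b c d e f jumps over the lazy dog", "JUMPS", 2);
print "$pre$match$post\n" if defined $match;
```

The returned pieces keep their original whitespace, so concatenating them reproduces the text exactly as it appeared.
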
  • You can use $` and $' to get the string before and after the match. Then truncate those values appropriately. But as blixtor points out, shlomif is correct to suggest using @+ and @- to avoid the performance penalty imposed by $` and $':

    $foo =~ /(match)/;
    
    my $match = $1;
    #my $before = $`;
    #my $after = $';
    my $before = substr($foo, 0, $-[0]);
    my $after =  substr($foo, $+[0]);
    
    $after =~ s/((?:(?:\w+)(?:\W+)){4}).*/$1/s;  # /s lets .* cross newlines in multi-line text
    $before = reverse $before;                   # reverse the string to limit backtracking.
    $before =~ s/((?:(?:\W+)(?:\w+)){4}).*/$1/s;
    $before = reverse $before;
    
    print "$before -> $match <- $after\n";
    
    BrianH : Hmm - this actually performs great, even when I turn on case insensitive matching...
    daotoad : The reverse trick for grabbing from the back of a string came from a post on Perlmonks called sexeger - http://www.perlmonks.org/index.pl?node_id=33410
    : Using the special variables $` and $' incurs a performance penalty for ALL regexes used anywhere in the program. See shlomif's answers for a better way.
  • I would suggest using the match-offset arrays @- and @+ (see perldoc perlvar) to find where in the string the match starts and ends.

    : +1. That's the best answer, imho. It does not do any unnecessary matching around the real 'match' and does not incur the performance penalty of using $` and $'.
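
A minimal sketch of that suggestion, assuming a word-based window is wanted. The function name offset_context and the $words parameter are hypothetical; only @-, @+, substr, split and splice are standard Perl.

```perl
use strict;
use warnings;

# Return up to $words words before the match, the match itself, and up to
# $words words after it, using the offsets Perl records in @- and @+.
sub offset_context {
    my ($text, $search, $words) = @_;
    return unless $text =~ /\Q$search\E/i;

    my ($start, $end) = ($-[0], $+[0]);    # offsets of the whole match
    my $match  = substr $text, $start, $end - $start;
    my @before = split ' ', substr($text, 0, $start);
    my @after  = split ' ', substr($text, $end);

    # Trim each side to at most $words words.
    splice @before, 0, @before - $words if @before > $words;
    splice @after, $words               if @after  > $words;

    return join ' ', @before, $match, @after;
}

print offset_context("The quick brown fox jumps over the lazy dog", "FOX", 2), "\n";
```

Only the search term itself goes through the regex engine here; the surrounding words are picked out with plain string operations. For very large texts it may be worth cutting a fixed number of characters on each side with substr before splitting, so that the whole text is not split into words; that refinement is left out of the sketch.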
