I am writing a Perl script that is searching for a term in large portions of text. What I would like to display back to the user is a small subset of the text around the search term, so the user can have context of where this search term is used. Google search results are a good example of what I am trying to accomplish, where the context of your search term is displayed under the title of the link.
My basic search is using this:
if ($text =~ /$search/i ) {
print "${title}:${text}\n";
}
($title contains the title of the item the search term was found in) This is too much though, since sometimes $text will be holding hundreds of lines of text.
This is going to be displayed on the web, so I could just provide the title as a link to the actual text, but there is no context for the user.
I tried modifying my regex to capture 4 words before and 4 words after the search term, but ran into problems if the search term was at the very beginning or very end of $text.
What would be a good way to accomplish this? I tried searching CPAN because I'm sure someone has a module for this, but I can't think of the right terms to search for. I would like to do this without modules if possible, because getting modules installed here is a pain. Does anyone have any ideas?
Thanks in advance!
Brian
-
You could try the following:
if ($text =~ /(.*)$search(.*)/i ) {
    my @before_words = split ' ', $1;
    my @after_words = split ' ', $2;
    my $before_str = get_last_x_words_from_array(@before_words);
    my $after_str = get_first_x_words_from_array(@after_words);
    print $before_str . ' ' . $search . ' ' . $after_str;
}
Some code obviously omitted, but this should give you an idea of the approach.
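The two helpers named above are not shown in the answer. One possible way to fill them in is sketched below; the bodies and the fixed context size of 4 words (borrowed from the question) are assumptions, not the answerer's code.

```perl
use strict;
use warnings;

# Assumed context size; the question asked for 4 words on each side.
my $CONTEXT_WORDS = 4;

# Hypothetical body: join the last $CONTEXT_WORDS words (or fewer if the
# list is shorter) into one space-separated string.
sub get_last_x_words_from_array {
    my @words = @_;
    my $start = @words > $CONTEXT_WORDS ? @words - $CONTEXT_WORDS : 0;
    return join ' ', @words[$start .. $#words];
}

# Hypothetical body: join the first $CONTEXT_WORDS words (or fewer).
sub get_first_x_words_from_array {
    my @words = @_;
    my $end = @words > $CONTEXT_WORDS ? $CONTEXT_WORDS - 1 : $#words;
    return join ' ', @words[0 .. $end];
}
```

Because both helpers fall back to whatever words are available, a match at the very beginning or very end of $text simply yields a shorter (possibly empty) context string.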
As far as extracting the title ... I think this approach does not lend itself to that very well.
-
Your initial attempt at 4 words before/after wasn't too far off.
Try:
if ($text =~ /((\S+\s+){0,4})($search)((\s+\S+){0,4})/i) {
    my ($pre, $match, $post) = ($1, $3, $4);
    ...
}
BrianH : Okay, that works perfectly now, but takes a *very* long time. Using the same data, mine (which doesn't return correct results :) ) runs in less than 1 second. I changed the code to your snippet, and it ran for more than 15 seconds... Any guesses on how to improve performance?
BrianH : if ($text =~ /((\S+\s+){0,4})($search)((\S+\s+){0,4})/ ) { print "$1$3$4\n"; } This produces the right output, and it flies. Thanks so much for your help!
BrianH : I basically removed the ?: - not sure why having them in reduces performance, though...
BrianH : Oooh - sorry - it wasn't the ?: - somehow I removed the /i from the end. My search was running fast because it was running case sensitive. When I add the /i back on the end, the performance slows *way* down. Your original solution works perfectly!
BrianH : So now I need to figure out how to perform this matching case-insensitive, and still be fast...
denkfaul : It looks like it works with or without the ?:, it just creates another matched variable if you don't. I'll leave this as is unless someone can pipe in and explain what is better in this case :)
BrianH : Sorry to confuse - my 4th comment explains it was actually yours doing a case-INsensitive match (which I want) that was causing the slowness. If I only search for the term without words around it, case insensitive matches go very fast.
-
You can use $` and $' to get the string before and after the match. Then truncate those values appropriately. But as blixtor points out, shlomif is correct to suggest using @+ and @- to avoid the performance penalty imposed by $` and $'.
$foo =~ /(match)/;
my $match = $1;
#my $before = $`;
#my $after = $';
my $before = substr($foo, 0, $-[0]);
my $after = substr($foo, $+[0]);
$after =~ s/((?:(?:\w+)(?:\W+)){4}).*/$1/;
$before = reverse $before; # reverse the string to limit backtracking.
$before =~ s/((?:(?:\W+)(?:\w+)){4}).*/$1/;
$before = reverse $before;
print "$before -> $match <- $after\n";
BrianH : Hmm - this actually performs great, even when I turn on case insensitive matching...
daotoad : The reverse trick for grabbing from the back of a string came from a post on Perlmonks called sexeger - http://www.perlmonks.org/index.pl?node_id=33410
: Using the special variables $` and $' incurs a performance penalty for ALL regexes used anywhere in the program. See shlomif's answers for a better way.
-
I would suggest using the positional parameters @+ and @- (see perldoc perlvar) to find where in the string the match starts and how long it is.
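A minimal sketch of that idea, assuming a hypothetical helper named context_snippet and a default of 4 context words (neither is from the answer): match only the search term itself, read its offsets from $-[0] and $+[0], then cut the surrounding words out with substr and split.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to $count words of context on each side
# of the first case-insensitive match of $search in $text, or undef when
# there is no match.
sub context_snippet {
    my ($text, $search, $count) = @_;
    $count = 4 unless defined $count;

    # \Q...\E treats the search term as literal text, not as a pattern.
    return unless $text =~ /\Q$search\E/i;

    # $-[0] and $+[0] are the start and end offsets of the whole match.
    my ($start, $end) = ($-[0], $+[0]);
    my $match  = substr($text, $start, $end - $start);
    my @before = split ' ', substr($text, 0, $start);
    my @after  = split ' ', substr($text, $end);

    # Keep only the nearest $count words on each side.
    splice(@before, 0, @before - $count) if @before > $count;
    splice(@after, $count)               if @after  > $count;

    return join ' ', @before, $match, @after;
}

print context_snippet('The quick brown fox jumps over the lazy dog', 'FOX', 2), "\n";
# prints "quick brown fox jumps over"
```

Since the regex only has to find the term itself, a match at the start or end of the text needs no special handling. Note that this sketch splits on whitespace, so a term found inside a longer word is shown separated from the rest of that word.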
: +1. That's the best answer, imho. It does not do any unnecessary matching around the real 'match' and does not incur the performance penalty of using $` and $'.