Partial String Match for Birds

The eBird mobile app, as far as I can tell, does not allow mistyped characters when searching for a bird species to add to a checklist. This, to me, seems like a problem! In the field I'm often holding a camera or binoculars and only have one hand available for typing. In that situation I'd like to quickly find and enter a bird on the checklist and minimize the number of seconds my eyes have to be on my phone.

I thought to myself "it can't be that hard to come up with an algorithm" so I threw together a JavaScript page to develop one. I ended up with a minimalist JavaScript page for comparing algorithms, and another page that runs the final algorithm.

I've published the code on GitHub, with an MIT license, and I've submitted a feedback report to the eBird folks to make them aware of it.

I don't think it's worth an in-depth post on this, so I'll try to be brief. What follows is my step-by-step process for writing the simple algorithm.

Test Bird List

For testing, I list of bird species names. If I tested against a list that was larger than any birder was likely to actually be searching, I would know whether my algorithm was fast enough.

For a large list, I went with eBird's year-round checklist for the entire United States. It doesn't have separate entries for subspecies for each bird, but at more than 2,100 entries, the list is plenty long for testing.

Levenshtein Distance

For a partial string match, the first thing I thought of was the Levenshtein distance. This "distance" is the minimum number of characters that need to be added, removed, or changed to get from a string x to another string y. For example, if you intend to type in a bird name like robin but you accidentally type tobin, then your typed search term has distance 1 from the target. I figured this would be the starting point for a partial string match algorithm.

Levenshtein Distance of Prefix

Next, I figured you'd want to find, for the number of characters you've typed, n, matches for the first n characters of any component "word" for a bird species name.

In other words, split each bird species name into component words with a regular expression:

// Acanthis/Spinus sp. --> [ "Acanthis", "Spinus", "sp." ]
// Black-throated Gray/Hermit Warbler --> [ "Black", "throated", "Gray", "Hermit", "Warbler" ]
// Steller's Jay --> [ "Steller's", "Jay" ]
const splitRegex = /[^A-Za-z'.]+/;

Then take the first n characters of each of those words (we'll call each of them a prefix), and find the Levenshtein distance from the search term to the prefix.

For the example results below, I'll include the Levenshtein distance for each match.

If we enter an exact match, we get a few 0 distance results, then some 1 distance results, and so on:

search term: nuth
results (showing Lev. dist.):
0: Brown-headed Nuthatch
0: Pygmy Nuthatch
0: Red-breasted Nuthatch
0: White-breasted Nuthatch
0: nuthatch sp.
1: Ash-throated/Nutting's Flycatcher
1: Clark's Nutcracker
1: Downy x Nuttall's Woodpecker (hybrid)
1: Nuttall's Woodpecker
1: Nuttall's x Hairy Woodpecker (hybrid)
...

This is looking good. I can find nuthatches just by typing nuth as expected. But this is how eBird's app already works:

Let's say we mistype a character, and accidentally type buth:

search term: buth
results (showing Lev. dist.):
1: Brown-headed Nuthatch
1: Bushtit
1: Buteo sp.
1: Buteo/eagle sp.
1: Geranoaetus/Buteo sp.
1: Japanese Bush Warbler
1: Pygmy Nuthatch
1: Red-breasted Nuthatch
1: White-breasted Nuthatch
1: nuthatch sp.
...

Here we have no matches with distance 0. But we do have 10 matches with a distance of 1. If you meant to search for a nuthatch, and you type buth, this algorithm is already showing some useful partial matches where eBird app would show no results:

Levenshtein Distance of Substring

But checking just component word prefixes isn't good enough. Say we are searching for Three-toed Woodpecker but, in our haste from typing while excited, we type thre toed woodpec:

search term: thre toed woodpec
results (showing Lev. dist.):
6: Pileated Woodpecker
9: Arizona Woodpecker
9: Nuttall's Woodpecker
9: Red-headed Woodpecker
10: American Three-toed Woodpecker  <----- our target!
10: American Three-toed/Black-backed Woodpecker
10: Crested Auklet
10: Crested Duck
10: Lewis's Woodpecker
10: Red-bellied Woodpecker
...

With prefix searching, we don't get the expected result at the top of the list. Pileated Woodpecker is first with a distance of 6. Our target American Three-toed Woodpecker doesn't appear until the 5th position, with a distance of 10.

To get our target to the top of the list with the "prefix" approach would require synthesizing new "multi-words" to check from each adjacent pair of words (three and toed becomes three-toed), then merge every 3-word sequence into another "multi-word" (three and toed and woodpecker becomes three-toed woodpecker), and so on. This would lead to a large list of component words and "multi-words" to check.

The "prefix" approach performs even more poorly when the search term isn't a prefix but instead a run of characters in the middle of a bird name. For example, say we are searching for Dunlin and search by typing nlin:

search term: nlin
results (showing Lev. dist.):
2: Acanthis/Spinus sp.
2: Barn x Cliff Swallow (hybrid)
2: Black-chinned Hummingbird
2: Black-chinned Sparrow
2: Black-chinned x Anna's Hummingbird (hybrid)
2: Black-chinned x Broad-billed Hummingbird (hybrid)
2: Black-chinned x Calliope Hummingbird (hybrid)
2: Black-chinned x Costa's Hummingbird (hybrid)
2: Black-chinned x Rufous Hummingbird (hybrid)
2: Chinese Egret
...

Our target doesn't even appear in the results!

Instead, a better approach may be to use a sliding n-character substring instead of an n-character prefix. By "sliding" I mean that every substring of length n would be tested starting from the left, with an offset of 0, and then moving both the start and end of the substring right by 1 character at a time.

Those substrings to test would be:

search term: nlin
target: dunlin
substrings of length 4:
dunl (lev. dist. of 4)
unli (lev. dist. of 2)
nlin (lev. dist. of 0)

search term: thre toed woodpec
target: american three-toed woodpecker
substrings of length 17:
american three-to (lev. dist. of 16)
merican three-toe (lev. dist. of 15)
erican three-toed (lev. dist. of 15)
rican three-toed  (lev. dist. of 14)
ican three-toed w (lev. dist. of 13)
can three-toed wo (lev. dist. of 11)
an three-toed woo (lev. dist. of 9)
n three-toed wood (lev. dist. of 7)
 three-toed woodp (lev. dist. of 5)
three-toed woodpe (lev. dist. of 3)
hree-toed woodpec (lev. dist. of 3)
ree-toed woodpeck (lev. dist. of 5)
ee-toed woodpecke (lev. dist. of 6)
e-toed woodpecker (lev. dist. of 7)

Now let's see how this algorithm performs for our problematic search terms nlin and thre toed woodpec:

search term: nlin
results (showing Lev. dist.):
0: Dunlin
0: Dunlin x White-rumped Sandpiper (hybrid)
1: Anhinga
1: Azure Gallinule
1: Barnacle x Cackling Goose (hybrid)
1: Berylline Hummingbird
1: Black Francolin
1: Black-bellied Whistling-Duck
1: Black-bellied x Fulvous Whistling-Duck (hybrid)
1: Blue Bunting
...
search term: thre toed woodpec
results (showing Lev. dist.):
3: American Three-toed Woodpecker
5: Great Spotted Woodpecker
6: Golden-fronted Woodpecker
6: Golden-fronted x Red-bellied Woodpecker (hybrid)
6: Golden-fronted/Red-bellied Woodpecker
6: Pileated Woodpecker
6: Red-bellied Woodpecker
6: White-headed Woodpecker
7: American Three-toed/Black-backed Woodpecker
7: Black-backed Woodpecker
...

As expected for nlin, our target Dunlin appears at the top of the list with a Levenshtein distance of 0. And American Three-toed Woodpecker is at the top of the list, as it should be, when searching thre toed woodpec.

Levenshtein Distance of Substring, Prefix Bonus

The "substring" approach gives much better results than "prefix" but it's still not good enough. In the case where I see a Lincoln's Sparrow, and type the sensible search term lin:

search term: lin
results (showing Lev. dist.):
0: Azure Gallinule
0: Barnacle x Cackling Goose (hybrid)
0: Berylline Hummingbird
0: Black Francolin
0: Black-bellied Whistling-Duck
0: Black-bellied x Fulvous Whistling-Duck (hybrid)
0: Bobolink
0: Brahminy Starling
0: Brambling
0: Brant x Cackling Goose (hybrid)
...

Lincoln's Sparrow starts with the search term lin, but that bird doesn't appear on the results list! Because we're finding matching substrings, we get a lot of 0-distance results. But we need to give a "bonus" to bird names where a component word starts with the search term. This is 2-dimensional sorting: first sort by Levenshtein distance, then within each group of results of the same Levenshtein distance, sort by whether a component word starts with the search term. This tweak to the algorithm seems to work as expected, and five new birds move to the top of the list:

search term: lin
results (showing Lev. dist.):
0: Eurasian Linnet
0: Lincoln's Sparrow
0: Lincoln's/Swamp Sparrow
0: Lined Seedeater
0: White-lined Tanager
0: Azure Gallinule
0: Barnacle x Cackling Goose (hybrid)
0: Berylline Hummingbird
0: Black Francolin
0: Black-bellied Whistling-Duck
...

Another example is the search term tern, where we'd expect all types of tern to be at the front of the results list. Without the "prefix bonus" American Bittern sneaks into the 2nd place:

search term: tern
results (showing Lev. dist.):
0: Aleutian Tern
0: American Bittern
0: Arctic Tern
0: Black Tern
0: Black x Eastern Phoebe (hybrid)
0: Bridled Tern
0: Caspian Tern
0: Cassin's/Western Kingbird
0: Common Tern
0: Common x Arctic Tern (hybrid)
...

With the "prefix bonus" 2-dimensional results sorting, American Bittern is properly outranked by all tern species:

search term: tern
results (showing Lev. dist.):
0: Aleutian Tern
0: Arctic Tern
0: Black Tern
0: Bridled Tern
0: Caspian Tern
0: Common Tern
0: Common x Arctic Tern (hybrid)
0: Common/Arctic Tern
0: Common/Forster's Tern
0: Elegant Tern
...

Levenshtein Distance of Substring, Prefix Bonus, Name Starts With Bonus

Another unexpected result is shown with a search like town. I would expect bird names starting with Townsend's ... to be displayed first, but instead the results are:

search term: town
results (showing Lev. dist.):
0: Black-throated Gray x Townsend's Warbler (hybrid)
0: Black-throated Gray/Townsend's Warbler
0: Leach's/Townsend's Storm-Petrel
0: Leach's/Townsend's Storm-Petrel (dark-rumped)
0: Leach's/Townsend's Storm-Petrel (white-rumped)
0: Leach's/Townsend's/Ainley's Storm-Petrel
0: Townsend's Solitaire
0: Townsend's Storm-Petrel
0: Townsend's Warbler
0: Townsend's x Black-throated Green Warbler (hybrid)
...

The Black-throated ... Townsend's Warbler results contain the word ... Townsend's ... but are sorted alphabetically before the Townsend's ... birds. The fix is to add another dimension to the sorting of the results: whether the overall bird name starts with the search term. Now, with 3-dimensional sorting: first sort by Levenshtein distance, then by whether a component word starts with the search term, then by whether the overall bird name starts with the term, we get the expected sorting of the results:

search term: town
results (showing Lev. dist.):
0: Townsend's Solitaire
0: Townsend's Storm-Petrel
0: Townsend's Warbler
0: Townsend's x Black-throated Green Warbler (hybrid)
0: Townsend's x Hermit Warbler (hybrid)
0: Townsend's/Hermit Warbler
0: Black-throated Gray x Townsend's Warbler (hybrid)
0: Black-throated Gray/Townsend's Warbler
0: Leach's/Townsend's Storm-Petrel
0: Leach's/Townsend's Storm-Petrel (dark-rumped)
...

Early Stopping

Next, I experimented with a couple optimizations: stopping the Levenshtein distance function early, and stopping the substring sliding early.

After comparing the first N characters in the search and target strings, the Levenshtein distance algorithm can be abandoned if we have a distance greater than will be shown in the results because any subsequent character comparisons can never lower the total distance count.

The performance results below vary quite a bit. Part of the variance comes from the arbitrary limit of 16 results I am using. Once 16 birds with a Levenshtein distance of 0 are found, the sliding substring portion of the algorithm can be stopped.

The results very quite a bit depending on the search term, but in general there's either a slight performance gain for early Levenshtein distance stopping, or a negligible performance penalty.

Another potential optimization was to abandon the substring "sliding" once any substring is found to have a Levenshtein distance of 0, since no other substring can improve upon that distance. After implementing this, I found that this did not provide any performance benefit. Since very few bird name words would have an exact match toward "the left side" of the word, only a small percentage of bird names would trigger this early stopping early enough in the "sliding" to make a difference.

Switching to .indexOf()

Say we want to return the best 16 matches for a search (or more than 16 if more are tied with the best Levenshtein distance). Once we have 16 results with a distance of 0, we no longer need to use the Levenshtein distance function for all substrings, and can instead use JavaScript's built-in .indexOf() function. Using this function should be much faster than looking for a Levenshtein distance of 0.

A quick performance test shows that this seems to be the case:

Since we quickly find 16 0-distance Levenshtein substrings for crow, including all the bird names containing crowned, this new optimization is able to switch to the faster .indexOf() function and run twice as fast overall.

Levenshtein Distance of Substring with Early Stopping and Bonuses For Prefix, Name Starts With, Whole Word

In searching for crow I noticed that the final result sorting wasn't I wanted:

search term: crow
results (showing Lev. dist.):
0: Crowned Slaty Flycatcher
0: crow sp.
0: crow/raven sp.
0: American Crow
0: American/Fish Crow
0: Black-crowned Night-Heron
0: Black-crowned x Yellow-crowned Night-Heron (hybrid)
0: Black-crowned/Yellow-crowned Night-Heron
0: Blue-crowned Parakeet
0: Broad-billed x Violet-crowned Hummingbird (hybrid)
0: Dark-eyed Junco x White-crowned Sparrow (hybrid)
0: Fish Crow
0: Golden-crowned Kinglet
0: Golden-crowned Sparrow
0: Golden-crowned Warbler
...

Since "American Crow" contains the component word Crow where the whole word has a Levenshtein distance of 0 from the search term crow, it should be given a "bonus" to improve its ranking in the final results. One way of doing this ranking bonus is to adjust its Levenshtein distance down from 0 to -1. If we also do the "name starts with" and "component word prefix" bonuses that way, by adjusting those entries' best Levenshtein substring distance, we end up with the following:

search term: crow
results (showing adjusted Lev. dist.):
-3: crow sp.
-3: crow/raven sp.
-2: American Crow
-2: American/Fish Crow
-2: Crowned Slaty Flycatcher
-2: Fish Crow
-2: Hawaiian Crow
-2: Hooded Crow
-2: House Crow
-2: House x Fish Crow (hybrid)
-2: Pied Crow
-2: Tamaulipas Crow
-1: Black-crowned Night-Heron
-1: Black-crowned x Yellow-crowned Night-Heron (hybrid)
-1: Black-crowned/Yellow-crowned Night-Heron
-1: Blue-crowned Parakeet
-1: Broad-billed x Violet-crowned Hummingbird (hybrid)
-1: Dark-eyed Junco x White-crowned Sparrow (hybrid)
-1: Golden-crowned Kinglet
-1: Golden-crowned Sparrow
-1: Golden-crowned Warbler
...

American Crow is now sorted ahead of Crowned Slaty Flycatcher, but crow sp. is then in first place! To fix this, we can add a penalty of 1 distance unit to all bird entries ending with sp.. After doing this, the results look like:

search term: crow
results (showing adjusted Lev. dist.):
-2: American Crow
-2: American/Fish Crow
-2: Crowned Slaty Flycatcher
-2: Fish Crow
-2: Hawaiian Crow
-2: Hooded Crow
-2: House Crow
-2: House x Fish Crow (hybrid)
-2: Pied Crow
-2: Tamaulipas Crow
-2: crow sp.
-2: crow/raven sp.
-1: Black-crowned Night-Heron
-1: Black-crowned x Yellow-crowned Night-Heron (hybrid)
-1: Black-crowned/Yellow-crowned Night-Heron
-1: Blue-crowned Parakeet
-1: Broad-billed x Violet-crowned Hummingbird (hybrid)
-1: Dark-eyed Junco x White-crowned Sparrow (hybrid)
-1: Golden-crowned Kinglet
-1: Golden-crowned Sparrow
-1: Golden-crowned Warbler
...

Much better. When crow is searched, the birds with exact matches are moved to the top of the results list. Crowned Slaty Flycatcher is highly ranked because of the "name starts with" bonus, but otherwise all the crow species are at the top of the list.

Multi-Word Searching

We're still not done! Next let's look at multi-word search terms. Here are some results for:

search term: lin spa 
results (showing adjusted Lev. dist.):
1: Fox Sparrow
1: loon sp.
2: Bell's Sparrow
2: Carolina Parakeet
2: Chipping Sparrow
...
search term: nor fli
results (showing adjusted Lev. dist.):
1: Boreal Chickadee
1: Boreal Owl
1: Boreal/Northern Saw-whet Owl
1: Poo-uli
2: Altamira Oriole
...
search term: ash fly
results (showing adjusted Lev. dist.):
1: Least Flycatcher
2: Crowned Slaty Flycatcher
2: La Sagra's Flycatcher
2: Mugimaki Flycatcher
3: Acadian Flycatcher
...
search term: wh cr sp
results (showing adjusted Lev. dist.):
3: Brown Creeper
3: Anser sp.
3: eider sp.
3: finch sp.
4: black-and-white shearwater sp.
...

As you can see, when using exact prefixes of several component words in a bird's name, we don't get the intended results. The fix for this should be straightforward: split the search term up into component words, run the search algorithm for each search term word, then sum the adjusted Levenshtein distance for each found species name.

For the nor fli example, when trying to find Northern Flicker, the nor search word gives an adjusted Levenshtein distance of -2 for Northern Flicker. The fli search word returns a -1 adjusted distance for Northern Flicker. Together, those results can be added to create an overall adjusted distance of -3 for both terms, nor fli.

After making this change, we seem to get the intended results!

search term: lin spa 
results (showing adjusted Lev. dist.):
-3: Lincoln's Sparrow
-3: Lincoln's/Swamp Sparrow
-2: Lined Seedeater
-2: sparrow/warbler sp. (trilling song)
-1: American Tree Sparrow
...
search term: nor fli
results (showing adjusted Lev. dist.):
-3: Northern Flicker
-3: Northern x Gilded Flicker (hybrid)
-3: Northern/Gilded Flicker
-2: Northern Beardless-Tyrannulet
-2: Northern Bobwhite
...
search term: ash fly
results (showing adjusted Lev. dist.):
-4: Ash-throated Flycatcher
-4: Ash-throated/Nutting's Flycatcher
-2: Ashy Storm-Petrel
-1: Acadian Flycatcher
-1: Alder Flycatcher
...
search term: wh cr sp
results (showing adjusted Lev. dist.):
-4: White-crowned Sparrow
-4: White-crowned x Golden-crowned Sparrow (hybrid)
-4: White-crowned x Harris's Sparrow (hybrid)
-4: White-crowned x White-throated Sparrow (hybrid)
-4: White-crowned/White-throated Sparrow
...

The search term is split into component words with the same regex given above /[^A-Za-z'.]+/. This means that searching for white-crowned will return the same results as when searching white crowned.

Performance

I think the algorithm is working nicely at this point. To test performance, I set up a loop that runs the algorithm some number of times, only updating the displayed results once, at the end.

For eBird's year-round checklist for the entire United States, which has over 2,100 entries (more than would be searched for a typical checklist), we get the following performance in Firefox on my M1 Mac:

The above times are for searches limited to 16 results, but because the top 16 results are determined by Levenshtein distance, including ties for 16th place, I believe a large portion of the bird list has to be searched for most search terms.

JavaScript Pages and GitHub Repo

The algorithms developed step-by-step here and be run in JavaScript on this page, and the final algorithm can be run in JavaScript using this page.

The GitHub repository for the code is here.