A database I am working on has a lot of records (40M+) and the guys mining it are currently using python to process it. This is very powerful (nltk) but also quite slow. One of the guys has found that you can do fuzzy string matching straight in Postgres and the feature is quite advanced!

  • fuzzystrmatch is a PostgreSQL module providing functions to determine similarities and distances between strings.
  • pg_trgm is a PostgreSQL module to determine the similarity of ASCII alphanumeric text based on trigram matching. This module offers indexing options too :)
  • pg_similarity is a third party module to support similarity queries on PostgreSQL

The first two can be installed with a simple CREATE EXTENSION command. The last module.. you need to get it, compile it and install it yourself.

References

HTH,