Dictionaries and Word Lists for Programmers

I love to play with words, and I especially love to play with words programatically. I’ve written three small apps (so far) which use some form of a dictionary to create readable, humorous text:

I’ve had some people ask, so here are some great resources that I’ve found while building these apps.

Dictionaries

/usr/share/dict/words (~235k words)
Available on any *nix system, this word list is a local way to check for words using a simple grep. You can also read the file in as long as you have permission to do so. Won’t work well if you’re trying to write something for the internet or windows.
Most of these dictionaries are licensed very freely, but you should check on your own system. Versions of this are available online, e.g. the FreeBSD version at  https://svnweb.freebsd.org/csrg/share/dict/ (click “words”)

GCIDE (~185k words)
http://www.ibiblio.org/webster/ or http://gcide.gnu.org.ua/download
This dictionary contains words and definitions. Very useful if you actually want to look up the words you are using. Sources for the definitions are available as well. There are two versions – GCIDE which comes in a strange format and needs its own reader software, and GCIDE-XML. Licensed under GNU.

SCOWL and friends (variable word count)
http://wordlist.aspell.net/
A very complete set of wordlists, used for the aspell spell checker. My favorite part is the customizable interface where you can create your own custom dictionary. Many links and different dictionaries are available on this page, including some with part-of-speech and/or inflection data. Be aware: many versions of SCOWL contain swears and racial slurs.
Variable licensure, but all are released for private or commercial use as long as you maintain a version of the license.

CMU Pronouncing Dictionary (~134k words)
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Contains not only words but their phonemes, meaning this is a great dictionary for text-to-speech, rhyming, and syllable counting. There are CMUDict libraries for node/browser, just node, python, and many other languages (send links please).
The file is copyright Carnegie Mellon, but is unrestricted for personal or commercial use.

Corpora (Lists of ~1k words)
https://github.com/dariusk/corpora
A really neat set of lists, broken down by category (e.g. /data/foods/beer_styles.json). A great starting resource for small projects that don’t need an extensive dictionary of the English language. Licensed under CC0 (no copyright).

Wordnik Developer (Web API)
http://developer.wordnik.com/
A powerful web API. From the site: “request definitions, example sentences, spelling suggestions, related words like synonyms and antonyms, phrases containing a given word, word autocompletion, random words, words of the day, and much more.”
Free 15k calls per hour, licensed for anything that isn’t a direct clone of Wordnik itself.

(Please, send any more dictionaries you know of my way and I’ll add them to the list!)

Libraries

FastTag / jsPos
https://github.com/mark-watson/fasttag_v2 (java) and https://code.google.com/p/jspos/ (js)
Java and Javascript libraries to tag parts of speech in words. Very handy if you’re doing any sort of lexical generation.
Licensed under LGPL3 or Apache 2 licenses

(Please, send more libraries my way if you know of them!)

People / Blogs

Peter Norvig
Probably my greatest source of inspiration on this front is Peter Norvig’s How to Write a Spelling Corrector (python). He shows that you don’t need any sort of fancy tooling or arcane knowledge to write something that at first seems complex – just don’t be afraid of making the computer do a lot of work for you. That’s what they do – they do work really, really fast. (see, for example, this scrabble solver)

Allison Parrish
Allison plays with words in amazing ways. She is the brains behind @everyword. Her website and research are full of great inspiration for playing with words and experimenting with language.

(please, send anyone doing interesting things with words my way and I’ll add them to the list!)

One thought on “Dictionaries and Word Lists for Programmers

  1. Pingback: Tuesday is Link Day! Still Breathing - online home of Chris Taylor - web developer, musician, geek

Leave a Reply

Your email address will not be published. Required fields are marked *

 

This site uses Akismet to reduce spam. Learn how your comment data is processed.