Monthly Archives: December 2015

Dictionaries and Word Lists for Programmers

I love to play with words, and I especially love to play with words programatically. I’ve written three small apps (so far) which use some form of a dictionary to create readable, humorous text:

I’ve had some people ask, so here are some great resources that I’ve found while building these apps.

Dictionaries

/usr/share/dict/words (~235k words)
Available on any *nix system, this word list is a local way to check for words using a simple grep. You can also read the file in as long as you have permission to do so. Won’t work well if you’re trying to write something for the internet or windows.
Most of these dictionaries are licensed very freely, but you should check on your own system. Versions of this are available online, e.g. the FreeBSD version at  https://svnweb.freebsd.org/csrg/share/dict/ (click “words”)

GCIDE (~185k words)
http://www.ibiblio.org/webster/ or http://gcide.gnu.org.ua/download
This dictionary contains words and definitions. Very useful if you actually want to look up the words you are using. Sources for the definitions are available as well. There are two versions – GCIDE which comes in a strange format and needs its own reader software, and GCIDE-XML. Licensed under GNU.

SCOWL and friends (variable word count)
http://wordlist.aspell.net/
A very complete set of wordlists, used for the aspell spell checker. My favorite part is the customizable interface where you can create your own custom dictionary. Many links and different dictionaries are available on this page, including some with part-of-speech and/or inflection data. Be aware: many versions of SCOWL contain swears and racial slurs.
Variable licensure, but all are released for private or commercial use as long as you maintain a version of the license.

CMU Pronouncing Dictionary (~134k words)
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Contains not only words but their phonemes, meaning this is a great dictionary for text-to-speech, rhyming, and syllable counting. There are CMUDict libraries for node/browser, just node, python, and many other languages (send links please).
The file is copyright Carnegie Mellon, but is unrestricted for personal or commercial use.

Corpora (Lists of ~1k words)
https://github.com/dariusk/corpora
A really neat set of lists, broken down by category (e.g. /data/foods/beer_styles.json). A great starting resource for small projects that don’t need an extensive dictionary of the English language. Licensed under CC0 (no copyright).

Wordnik Developer (Web API)
http://developer.wordnik.com/
A powerful web API. From the site: “request definitions, example sentences, spelling suggestions, related words like synonyms and antonyms, phrases containing a given word, word autocompletion, random words, words of the day, and much more.”
Free 15k calls per hour, licensed for anything that isn’t a direct clone of Wordnik itself.

(Please, send any more dictionaries you know of my way and I’ll add them to the list!)

Libraries

FastTag / jsPos
https://github.com/mark-watson/fasttag_v2 (java) and https://code.google.com/p/jspos/ (js)
Java and Javascript libraries to tag parts of speech in words. Very handy if you’re doing any sort of lexical generation.
Licensed under LGPL3 or Apache 2 licenses

(Please, send more libraries my way if you know of them!)

People / Blogs

Peter Norvig
Probably my greatest source of inspiration on this front is Peter Norvig’s How to Write a Spelling Corrector (python). He shows that you don’t need any sort of fancy tooling or arcane knowledge to write something that at first seems complex – just don’t be afraid of making the computer do a lot of work for you. That’s what they do – they do work really, really fast. (see, for example, this scrabble solver)

Allison Parrish
Allison plays with words in amazing ways. She is the brains behind @everyword. Her website and research are full of great inspiration for playing with words and experimenting with language.

(please, send anyone doing interesting things with words my way and I’ll add them to the list!)

Project Idea: “Lanky” PHP Middleware Language

I keep not writing weekend web apps because I know that my projects generally follow this trajectory:
1. Build a database
2. Build some thin PHP middleware to expose an API to that database
3. Build a slick react/angular/whatever frontend
4. Realize a bunch of things I’m missing in my API and add them, lamenting the entire time that my “thin” PHP middleware is starting to bulge at the seams.

I keep realizing that that PHP layer is not really necessary. R0ml gave a great Recurse Center talk (if someone has notes I’ll link it)  on a postgres extension which renders the middleware unnecessary.

I’m not quite ready to make that leap and throw out the middleware entirely (even though I think it’s a great idea…). So I’d like to build a language to really easily describe a PHP API. My goal is to have the PHP routing be as thin as possible and mostly handled by one fast function. In fact, I’d love to eventually have this language implemented in node, or even C/C++ for a pure-speed kind of thing. I’d also love to figure out a way to cache the parsed document so that I don’t have to parse the entire thing every time, but that’s a future optimization.

I’ve started a github repository https://github.com/ertyseidohl/lanky  with an example of how I’d like a Lanky document to look at https://github.com/ertyseidohl/lanky/blob/master/thoughts.txt.

I may end up abandoning this at some point because of time restrictions but that’s why I figured I’d get my first ideas about it online – maybe if there’s something good in there someone can steal it, or even better, point me to a project which already does this but better.

I called it “Lanky” because that was the first synonym for “thin” that didn’t have outright negative connotations or an existing PHP framework (slim, gossamer, etc).

From Boulder, CO

–Erty