I needed some sample address book data for a project that I'm working on. Because of the number of people who may see it, I didn't want to use real address book entries, so I wrote some Python scripts to generate some.
I spread it across a few scripts because I wanted to generate data for different schemas. I put the main data generation functions in one file and then call those functions and format the data in scripts that are specialized for their particular output format. You can use these to generate data for a relational database, XML, your favorite RDF flavor, or whatever you like. The basic library has functions such as firstName() and zipCode() to generate random values, with some, like middleName() and note(), sometimes returning nothing. I have two scripts that use the library: one generates a CSV file that emulates one exported from Microsoft Outlook 2003, and the other emulates the CSV address file exported by Eudora 7. (Did you know that Eudora can't import the CSV files that it exports?)
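To make the library-plus-formatter split concrete, here is a minimal sketch of the idea; it is not the actual scripts. The function names firstName(), middleName(), and zipCode() come from the description above, but lastName(), writeCsv(), the tiny name lists, the 50% chance of an empty middle name, and the column headers are all assumptions made for illustration. In particular, the headers are placeholders, not the real Outlook or Eudora export columns.

```python
import csv
import io
import random

# Hypothetical stand-in lists; the real library draws on much longer ones.
FIRST_NAMES = ["Albert", "Victoria", "James", "Mary"]
LAST_NAMES = ["Freeman", "Smith", "Johnson"]


def firstName(rng=random):
    """Pick a random first name."""
    return rng.choice(FIRST_NAMES)


def middleName(rng=random, chance=0.5):
    """Return a middle name some of the time, otherwise an empty string.

    The 0.5 default is an assumed value, not one taken from the original.
    """
    return rng.choice(FIRST_NAMES) if rng.random() < chance else ""


def lastName(rng=random):
    """Pick a random surname (hypothetical helper)."""
    return rng.choice(LAST_NAMES)


def zipCode(rng=random):
    """Return five random digits; no check that the ZIP code really exists."""
    return "%05d" % rng.randint(0, 99999)


def writeCsv(out, count, rng=random):
    """Format-specific part: write `count` fake entries as CSV rows."""
    writer = csv.writer(out)
    # Placeholder headers, not the real Outlook or Eudora column names.
    writer.writerow(["First Name", "Middle Name", "Last Name", "Zip"])
    for _ in range(count):
        writer.writerow(
            [firstName(rng), middleName(rng), lastName(rng), zipCode(rng)]
        )


if __name__ == "__main__":
    buf = io.StringIO()
    writeCsv(buf, 5)
    print(buf.getvalue())
```

A second output format would reuse the same generator functions and replace only writeCsv(). Passing a seeded random.Random instance as rng makes a run repeatable when that is useful.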
The data is pretty US-oriented, but a few tweaks should adapt it for other countries. It randomly picks first and middle names from the US census list of most popular male and female names, and surnames from the census list of most popular last names. It took very little web searching to find the most popular street names and US cities, and for employer names I went with the last 100 of the Fortune 500.
The generated data has plenty of incongruities. Middle names are randomly picked separately from first names, so male and female names are often mixed. The same happens with city and state names, so that Albert Victoria Freeman Jr. may live in Baltimore, California. To convert an employer name to a domain name for a work email address, I just took out spaces and punctuation, converted to lowercase, and put ".com" at the end, which can result in some long domain names.
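The employer-to-domain conversion described above can be sketched in a few lines. The function names employerToDomain() and workEmail(), the sample employer name, and the first.last form of the address are assumptions for illustration; only the strip, lowercase, and append ".com" steps come from the description. This version also drops any non-ASCII letters, which is a simplification.

```python
import re


def employerToDomain(employer):
    """Remove spaces and punctuation, lowercase the result, append ".com".

    Anything that is not an ASCII letter or digit is dropped.
    """
    return re.sub(r"[^A-Za-z0-9]", "", employer).lower() + ".com"


def workEmail(first, last, employer):
    """Build a work address; the first.last local part is an assumed format."""
    return "%s.%s@%s" % (first.lower(), last.lower(), employerToDomain(employer))


if __name__ == "__main__":
    # Made-up employer name, not one taken from the Fortune 500 list.
    print(employerToDomain("Acme Widgets & Gears, Inc."))
    print(workEmail("Albert", "Freeman", "Acme Widgets & Gears, Inc."))
```

Because nothing is abbreviated, a long company name turns into an equally long domain, which matches the behavior noted above.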
I've always enjoyed generating random content that faked the appearance of semantic value. One event in particular inspired me about twenty-three years ago, when the only programming languages I knew were Microsoft Basic and dBase II. I was in the early stages of a "poetry" generation program that only had seven or eight possible verbs, and all the nouns were pronouns, and it came out with this:
It thinks. It scares her.
(Try to picture it on green and white paper in a dot matrix font.) The heart of all of these programs is the random function; when coding for fun, seeing different output each time is often more entertaining than seeing consistent output. I've recently figured out how I can generate multi-part music from an XSLT script, which I'll make public somewhere once I have the time to actually implement it and write it up.