17 November 2015

13 ways to make your writing look more professional

Simple copyediting things.

I’ve done some copyediting as part of my job, especially with marketing material. Certain basic mistakes come up so often that I made a list that I’ve been tempted to give to whoever gave me the original content and say “please make sure that it doesn’t have any of these problems first!” I didn’t, but for those who are interested, following these simple rules will make your writing look more professional. The nice thing about these is that, unlike with truly good writing, no skill and very little work is required to put them into practice. They’re all just a matter of paying attention.

  1. Never give someone something to read that you haven’t spell checked. If it has typos that a spell checker would have caught, it’s like saying “my time is so much more valuable than yours that I couldn’t bother doing this simple, mechanical two-minute task before giving this to you.” If you’re writing with a tool that doesn’t have a spell checker, paste the text into Microsoft Word or LibreOffice and look for the red squiggly lines. If a spell checker doesn’t recognize a company name and you’re not 100% sure of its spelling, take ten seconds to check it on their website, especially if someone from that company may see the piece.

  2. Only put one space after a period, question mark, or exclamation mark ending a sentence, not two. People used two in the days of manual typewriters for hard copy manuscripts that would be submitted to typesetters, but as with the carriage returns that we formerly added to the end of every single line on typewriters, we now leave it up to the computer to decide how much spacing is appropriate. If you put two spaces after a period, your word processor will put too much space there.

  3. In something published by an American company, punctuation at the end of a quoted phrase goes inside the quotes, “like this,” not outside, “like this”. In the UK they do it outside. This is a stickier issue with technical writing, where you may be referring to specific strings of quoted text; for example, if I write that a password is “swordfish”, I don’t want readers thinking that the comma is part of the password. The important thing is to be consistent within a document.

  4. In a bulleted or numbered list, either end all the bullets with punctuation that treats the bullets as complete sentences or end none of them that way. Don’t do this:

    • Go out the front door

    • Pull the mail out of the mailbox.

    • Bring the mail back inside

    • Leave the mail on the dining room table.

  5. The items of a list like that should be grammatically consistent: all complete sentences or all grammatically consistent phrases (for example, all noun phrases) with no complete sentences. For example, if the first item says “Easier setup and installation” and the second says “Wide choice of reports,” then no other items in that list should be complete sentences.

  6. Put consistent spacing around em dashes and don’t confuse them with hyphens. A hyphen is the keyboard character that usually connects words being used together as a single adjective as in user-friendly interface or in-memory database. An em dash (named for being the width of the letter “m”) is used for appositive phrases. It’s often written with two hyphens--like this--which Microsoft Word and LibreOffice will convert to an em dash character. In HTML, you can enter &mdash; or just paste the character from somewhere else. (An en dash is a bit narrower and used for date ranges. Handy hint when you're unloading your last few tiles at the end of a Scrabble game: both em and en are legal words.) Em dashes should either have a space on both sides — like that, or on neither side—like that. Pick one spacing convention and make sure that all the em dashes in a given document are spaced consistently.

  7. Some phrases may or may not use initial caps, like Artificial Intelligence. If you capitalize such a phrase, do it consistently throughout a document. Don’t refer to Artificial Intelligence in the first paragraph and artificial intelligence in the fourth. Also, with phrases that may or may not be written as one word, pick one and be consistent; don’t write “filename” in one paragraph and “file name” further on in the document. (Early drafts of this blog post made this mistake with “spellcheck.”)

  8. We use apostrophes to stand in for a missing letter in a contraction (such as standing in for the “o” from “is not” in “isn’t”) or for the possessive, as in Jim’s car, so never ever use “it’s” as a possessive—“it’s” can only be used as a contraction for “it is.” Don't use an apostrophe and an “s” to indicate a plural. (Some people make exceptions for numbers like 1990’s.)

  9. Use English instead of Latin abbreviations: “for example” instead of “e.g.” and “that is” instead of “i.e.” Instead of saying “etc.,” introduce a list with “such as” to indicate that the list is incomplete and that there are probably more entries. For example, say “baseball teams such as the Mets, Yankees, and Red Sox” and not “Mets, Yankees, Red Sox, etc.”

  10. In the age of the web, underlining means hypertext link. Don't use it for anything else because it clutters a layout. (In the old days, it was an indication to a typesetter to italicize text.) For emphasis, use bold or italics. For example: Never use an apostrophe and an “s” to indicate a plural.

  11. Check that all the links work. As with spell checking, this is best done (or redone) just before sending a document off to someone, because if you do it and then make many other edits, those edits may introduce new problems.

  12. If a product name is trademarked, only put the trademark symbol after the first mention of the product in a document. Here is what one intellectual property attorney tells us:

    In written documents — articles, press releases, promotional materials, and the like — it is only necessary to use a symbol with the first instance of the mark, or with the most prominent placement of the mark. It is a common misconception that each and every instance of the mark should bear a trademark symbol. Overuse creates visual clutter and may detract from the aesthetic appeal of the piece. Provided there is at least one conspicuous use of the TM, SM, or ® on the face of the writing, do not be afraid to eliminate superfluous markings.

  13. Don't say “and/or.” If necessary, rewrite the sentence. In general, the use of slashes to indicate indecision is a bad idea. Decide on something, or rewrite the sentence.
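
A few of the rules above are mechanical enough that a script can flag them before a human reads the piece. The following is a minimal Python sketch, not a complete copyediting tool: the function names, the labels, and the choice of which rules to check (2, 6, 9, and 13) are all assumptions made for illustration.

```python
import re

# Label/pattern pairs. The labels and the selection of rules are illustrative.
CHECKS = [
    ("two spaces after a sentence", re.compile(r"[.?!]  +\S")),
    ("Latin abbreviation", re.compile(r"\b(?:e\.g\.|i\.e\.|etc\.)")),
    ("and/or", re.compile(r"\band/or\b", re.IGNORECASE)),
]

EM_DASH = "\u2014"


def lint(text):
    """Return (line_number, label) pairs for every check that matches a line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS:
            if pattern.search(line):
                findings.append((number, label))
    return findings


def em_dash_spacing(text):
    """Count em dashes with a space on both sides versus on neither side."""
    spaced = len(re.findall(f" {EM_DASH} ", text))
    unspaced = len(re.findall(rf"\S{EM_DASH}\S", text))
    return {"spaced": spaced, "unspaced": unspaced}
```

If `em_dash_spacing` reports nonzero counts for both styles, the document mixes the two conventions from rule 6. A script like this only finds candidates; deciding whether each one is really a problem still takes a person.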

Please add any comments to this Google+ post.

17 October 2015

Data wrangling, feature engineering, and dada

And surrealism, and impressionism...

Man Ray assemblage

In my data science glossary, the entry for data wrangling gives this example: "If you have 900,000 birthYear values of the format yyyy-mm-dd and 100,000 of the format mm/dd/yyyy and you write a Perl script to convert the latter to look like the former so that you can use them all together, you're doing data wrangling." Data wrangling isn't always cleanup of messy data, but can also be more creative, downright fun work that qualifies as what machine learning people call "feature engineering," which Charles L. Parker described as "when you use your knowledge about the data to create fields that make machine learning algorithms work better." In other words, you're creating new fields (or features, or properties, or attributes, depending on your modeling frame of mind) from existing data to let systems do more with that data.
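
The glossary example mentions a Perl script; here is a rough Python equivalent of that kind of cleanup, as a sketch only. The function name `to_iso` is made up for illustration, and it assumes that any value not shaped like mm/dd/yyyy should pass through untouched.

```python
import re

# Matches dates such as 7/4/1976 or 07/04/1976 (month/day/year).
US_DATE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")


def to_iso(value):
    """Convert mm/dd/yyyy to yyyy-mm-dd; return any other value unchanged."""
    match = US_DATE.match(value.strip())
    if not match:
        return value
    month, day, year = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"
```

Calling `to_iso("07/04/1976")` returns `"1976-07-04"`, while a value that is already in yyyy-mm-dd form comes back as it went in.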

New York's Museum of Modern Art released metadata about their complete collection on GitHub, and I recently had a great time doing some data wrangling with it. I managed to transform the data so that it could answer interesting questions such as "who are the youngest painters in MoMA's collection?" and "on average, which country's painters make the biggest paintings?" Neither of these questions could be answered with a query against their original data.

I enjoyed working with this data so much because I went to MoMA pretty regularly during my years in New York City. In addition to iconic paintings such as Picasso's Demoiselles d'Avignon, Dalí's Persistence of Memory, and van Gogh's The Starry Night, they have many key works by my own favorites such as Marcel Duchamp and Man Ray. My wife and I were members there for several years, which let us go to the members' special openings of some exhibits, and through a friend of hers we sometimes got to go to the more exclusive pre-members' openings where we'd see celebrities such as Chuck Close and David Bowie.

The data

The data on GitHub is a comma-separated values (CSV) file with 123,920 rows and 14 columns that have labels across the top such as "ArtistBio", "Medium", and "Dimensions". The feature engineering fun comes from looking in the more descriptive fields to find patterns that identify pieces of data that can be stored on their own with more structure so that they're easier to query. For example, the smaller of their two Monet Water Lilies paintings has a "Dimensions" value of "6' 6 1/2" x 19' 7 1/2" (199.5 x 599 cm)" and Man Ray's assemblage Indestructible Object (or Object to Be Destroyed) has a value of "8 7/8 x 4 3/8 x 4 5/8" (22.5 x 11 x 11.6 cm)". Along with that optional third dimension, other variations in this column include the use of the symbol "×" instead of the letter "x" and descriptive additions such as "Approx." (174 works) or "irregular" (101).

I wrote a Python script that churned through this data and used regular expressions to pull individual pieces of information from several different fields. (Regular expressions, also known as regexes, offer ways to look for patterns in data such as "four numeric digits followed by optional space, a hyphen, optional space, and then either two or four digits". O'Reilly has a whole book about them.) For the Dimensions field, my script pulled out the metric width, height, and, if included, the depth and descriptive note. My script, available with the resulting data on GitHub, converts all the input fields and new data to RDF so that I could query it with SPARQL. For example, when writing the previous paragraph, I knocked out some quick SPARQL queries to find that the script had pulled "Approx." from the Dimensions data 174 times and "irregular" 101 times.
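
To give a feel for this kind of extraction, here is a minimal sketch of a regular expression that handles the Dimensions examples quoted above. It is not the actual script: the function name and dictionary keys are hypothetical, and it assumes that the first metric number is the height and the second is the width.

```python
import re

# Two or three metric numbers in parentheses, separated by "x" or "×",
# for example "(199.5 x 599 cm)" or "(22.5 x 11 x 11.6 cm)".
METRIC = re.compile(
    r"\(\s*(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)"
    r"(?:\s*[x×]\s*(\d+(?:\.\d+)?))?\s*cm\s*\)"
)

# The two descriptive additions mentioned in the text.
NOTE = re.compile(r"\b(approx\.|irregular)", re.IGNORECASE)


def parse_dimensions(value):
    """Return metric height, width, optional depth, and optional note."""
    match = METRIC.search(value)
    if not match:
        return None
    height, width, depth = match.groups()
    note = NOTE.search(value)
    return {
        "height_cm": float(height),
        "width_cm": float(width),
        "depth_cm": float(depth) if depth else None,
        "note": note.group(1) if note else None,
    }
```

Values that don't contain a parenthesized metric measurement return `None`, which makes it easy to count how many rows the pattern misses and decide whether it needs another tweak.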

I considered also outputting the results to a new CSV table with additional columns for the extracted properties, but when an artist like Elizabeth Catlett is listed as both American and Mexican, I wanted to output these two separate facts about her, which would require two columns or a separate artist nationality table to handle artists with multiple values for this field. This would be a pain with table-based data, but of course, it's not an issue with RDF.

Artist nationalities came from the CSV file's ArtistBio column, which had simple descriptions such as "(Swiss, born 1943)" and more complex ones such as "(French and Swiss, born Switzerland 1944)" and "(American, born Germany. 1886-1969)". For each work's artist, my Python script's regular expressions pulled out nationality values, where they were born if specified, their birth years, and their death years (if specified) into separate RDF triples.
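
A sketch of how the ArtistBio examples above could be broken into separate facts follows. Again, this is an illustration and not the original script: the function names, the property names, and the use of plain Python tuples in place of real RDF triples are all assumptions.

```python
import re

# A birthplace is whatever sits between "born" and the four-digit birth year.
BIRTHPLACE = re.compile(r"born\s+([A-Za-z][^\d.]*?)\.?\s+\d{4}")


def parse_artist_bio(bio):
    """Split a value such as "(French and Swiss, born Switzerland 1944)"."""
    text = bio.strip().strip("()")
    nationality_part, _, rest = text.partition(",")
    nationalities = [n for n in re.split(r"\s+and\s+", nationality_part.strip()) if n]
    place = BIRTHPLACE.search(rest)
    years = [int(year) for year in re.findall(r"\b\d{4}\b", rest)]
    return {
        "nationalities": nationalities,
        "birthplace": place.group(1) if place else None,
        "birth_year": years[0] if years else None,
        "death_year": years[1] if len(years) > 1 else None,
    }


def to_triples(artist, facts):
    """Emit one (subject, property, value) tuple per fact."""
    triples = [(artist, "nationality", n) for n in facts["nationalities"]]
    for key in ("birthplace", "birth_year", "death_year"):
        if facts[key] is not None:
            triples.append((artist, key, facts[key]))
    return triples
```

An artist with two nationalities simply gets two nationality tuples, with no extra column or lookup table needed, which is the advantage over a flat CSV layout described above.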

Not counting the header row and blank cells, the MoMA CSV file has 1,625,710 pieces of information in it. The resulting RDF has 2,364,277 triples, so it's clearly much richer.

Queries to play with the new data

I could make many interesting queries against the original CSV values that were converted to triples with no manipulation, but the value of this feature engineering is clearer if we look at queries that take advantage of the new, extracted data. (For those interested in the geekier details, each bullet below links to the actual SPARQL query and results.) You'll see that a common theme among the queries is doing a bit of arithmetic with numeric values extracted from the more descriptive CSV values, such as multiplying height by width to determine a work's area.

  • What's the single largest painting? At 798,972 square cm, James Rosenquist's F-111. I knew of and had seen this work, but didn't realize until looking at his Wikipedia page just now that F-111 was how this important sixties pop artist first came to the art world's attention.

  • What's the largest photograph? Mariah Robertson's 11, which uses a thirty-inch-wide one-hundred-foot roll of photographs as part of a three-dimensional work. (I might not consider this a "Photograph", but that is its Classification value in the original CSV data.)

  • What's the largest three-dimensional work? The 1994 installation Stations by Bill Viola, who first became known as a video artist. (The piece includes five video projections.)

  • How many painters come from each country? No surprise that the U.S. leads with 494 artists, followed by France, Germany, Britain, and other European countries until you get to Argentina in seventh place and Japan in eighth. The full list has 52 countries, and I thought Argentina's high placement was interesting; off the top of my head I can't name a single artist from that country.

  • What's the average painting size by country? This query filters out countries with fewer than eleven paintings in the collection to increase the chance of getting a representative sampling, and again it's not a surprise that the U.S. leads with an average painting size of 28,244 square cm. (I'm sure Rosenquist helped here.) The next few are Germany, Britain, Japan, and Italy, all with average sizes over 20,000 square cm. The Russians have the smallest paintings, with the 32 of them having an average size of 6,758 square cm. I'm sure that closer analysis would find smaller or larger sizes to be favored by particular artists who are well represented in MoMA's collection and who skew the averages for their countries.

  • What are the oldest pieces in the collection and who made them? Besides a brocade from 1600 by "unknown", there are four "Black basalt with glazed interior" works dated 1768 such as this sugar bowl. These are pretty old for a museum of modern art, but if you look at any of them you'll see why they fit right into the collection. And, they're credited to a familiar name: Josiah Wedgwood, founder of the company that bears his name.

  • Who are the five youngest painters with work in the collection? One work apparently co-credited to two artists gives us a total of six names, all born in the eighties, and none of whom I've heard of.
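
The arithmetic behind the average-size query is simple enough to sketch outside of SPARQL. The Python below is only an illustration of the logic (multiply height by width, group by country, drop countries with too few paintings); the record layout and the function name are hypothetical, and the real queries ran against the RDF data.

```python
from collections import defaultdict


def average_area_by_country(paintings, min_count=11):
    """Average height * width per country, largest first.

    Countries with fewer than min_count paintings are left out.
    """
    areas = defaultdict(list)
    for painting in paintings:
        area = painting["height_cm"] * painting["width_cm"]
        areas[painting["nationality"]].append(area)
    averages = {
        country: sum(values) / len(values)
        for country, values in areas.items()
        if len(values) >= min_count
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

The `min_count` default of 11 mirrors the "fewer than eleven paintings" filter described above; lowering it lets small samples into the ranking, where a single oversized canvas can dominate a country's average.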

Most of these queries focus on work in specific media because broader versions often ran into data anomalies that led to odd answers. For example, a query for the work in the collection that took the longest to create showed several photographs that apparently took over a hundred years. I assume that the elapsed time represented the span between the exposure of the negatives and the creation of the prints in MoMA's collection. A query for the oldest living artist seemed simple enough--just look for the earliest birth year with no corresponding death year--but it turned out that there was no death date recorded for one artist born in 1731. (Sometimes the data has question marks as a birth or death date, but I didn't want to store those in a property that I'd use to perform arithmetic.) A query about the youngest artist in the whole collection found that it was someone named "Technology will save us" born in 2012--clearly a collective founded in that year and not a person. Also, since all artist names and information are properties of a "work", an artist whose name is spelled two different ways will be considered as two different artists with the current setup.

Other odd answers led to tweaks to the regular expressions and other logic in the data conversion and queries, but at some point, unless someone's paying you otherwise, you've got to quit and make the best you can of what you have. (On this topic, I highly recommend Jeni Tennison's classic Five Stages of Data Grief.)

Even if my script doesn't create perfect data about every work in MoMA's collection, the data it creates still offers plenty to query. I think it demonstrates pretty nicely how data wrangling techniques such as the use of regular expressions--in addition to cleaning up messes such as badly formatted data--can do the kind of feature engineering that improves a dataset to make it even more useful.

Photo of Man Ray's "Indestructible Object (or Object to be Destroyed)" by Chris Barker via Flickr (CC BY-NC-ND 2.0)


19 September 2015

My data science glossary

Complete with a dot org domain name.

Lately I've been studying up on the math and technology associated with data science because there are so many interesting things going on. Despite taking many notes, I found myself learning certain important terms, seeing them again later, and then thinking "What was that again? P-values? Huh?"

So, I turned a portion of my notes into a glossary to make these things easy to look up when I wanted to remember them. I decided that I may as well publish this glossary in case others found it helpful, or if they had suggestions or corrections. And, when I found that the domain name datascienceglossary.org wasn't taken, I couldn't resist grabbing it.

Now it's up and ready for the world: datascienceglossary.org. I also took the opportunity to try out Bootstrap to see how easily it might make my new little website look presentable on Android and Apple phones and tablets in addition to bigger screens. It was pretty easy, especially after I found their documentation page. (In the past, I've found that many CSS frameworks that are supposed to make your life easier have horrible if any documentation--"just look at our fabulous examples" isn't enough; if the class values that we're supposed to assign to our HTML elements are packed with cryptic little abbreviations, then tell us what all the abbreviations stand for.)

I hope my data science glossary is useful to some people. I know it will be useful to me, especially the next time I forget what "P-value" means.
