17 January 2016

The past and present of hypertext

You know, links in the middle of sentences.

I've been thinking lately about the visionary optimism of the days when people dreamed of the promise of large-scale hypertext systems. I'm pretty sure they didn't mean linkless content down the middle of a screen with columns of ads to the left and right of it, which is much of what we read off of screens these days. I certainly don't want to start one of those rants of "the World Wide Web is deficient because it's missing features X and Y, which by golly we had in the HyperThingie™ system that I helped design back in the 80s, and the W3C should have paid more attention to us" because I've seen too many of those. The web got so popular because Tim Berners-Lee found such an excellent balance between which features to incorporate and which (for example, central link management) to skip.

The idea of inline links, in which words and phrases in the middle of sentences link to other documents related to those words and phrases, was considered an exciting thing back when we got most of our information from printed paper. A hypertext system had links between the documents stored in that system, and the especially exciting thing about a "world wide" hypertext system was that any document could link to any other document in the world.

But who does, in 2016? The reason I've been thinking more about the past and present of hypertext (a word that, sixteen years into the twenty-first century, is looking a bit quaint) is that since adding a few links to something I was writing at work recently, I've been more mindful of which major web sites include how many inline links and how many of those links go to other sites. For example, while reading the article Bayes's Theorem: What's the Big Deal? on Scientific American's site recently, I found myself thinking "good for you guys, with all those useful links to other web sites right in the body of your article!"

To get some idea of relative proportions of internal links, external links, and linkless text on today's successful websites, I went to a top 15 most popular blogs list and did some random checking of articles on these sites. (An exercise for the reader to make up for my haphazard skimming: write some scripts to scrape some editorial content from each site, count the internal and external links, and produce a bar chart.) Because these are professionally managed sites, I imagine that management at some of them encourages links to other articles on the same site and discourages links to other sites as a matter of policy, because they want to keep their readers looking at their advertisers' ads.
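As a starting point for that exercise, here is a minimal sketch of the counting step in Python, using only the standard library. It assumes that "internal" means "same host name as the page itself"; the sample HTML, the example.com URLs, and the names LinkCounter and count_links are made up for illustration. Fetching pages, isolating the editorial content, and drawing the bar chart are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCounter(HTMLParser):
    """Counts <a href> links as internal or external relative to a page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc.lower()
        self.internal = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before comparing hosts.
        target = urlparse(urljoin(self.page_url, href))
        if target.scheme not in ("http", "https"):
            return  # skip mailto:, javascript:, and similar
        if target.netloc.lower() == self.page_host:
            self.internal += 1
        else:
            self.external += 1


def count_links(html, page_url):
    """Return (internal, external) link counts for one page's HTML."""
    counter = LinkCounter(page_url)
    counter.feed(html)
    return counter.internal, counter.external


# Hypothetical sample content: one internal link, one external, one mailto.
sample = """
<p>Read <a href="/2016/01/other-post">our earlier post</a>,
<a href="http://example.org/paper">the original paper</a>, and
<a href="mailto:editor@example.com">write to us</a>.</p>
"""
print(count_links(sample, "http://example.com/2016/01/this-post"))  # prints (1, 1)
```

Comparing exact host names is a deliberate simplification: www.example.com and example.com would count as different sites, and so would two sites owned by the same organization. A real survey would probably need a hand-built table of which hosts belong together.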

There is a gray area between internal and external links: linking to other sites that are part of the same organization, such as the many links in a Business Insider article to Tech Insider articles, or the many links between members of the Gawker Media stable, which is heavily represented in the top 15.

Of those top 15:

  • Huffington Post: a mix of internal and external links, but their number of external links fits with their business model of being a hub of other sites' content.

  • All about the internal links: TMZ, Mashable, Gawker, The Daily Beast, Engadget, Jezebel (where most external links are to their Gawker Media sibling Gawker).

  • Deadspin: a reasonable percentage of external links.

  • Gawker Media's video game site Kotaku: long stretches of text with no links, and others with both internal and external links.

  • TechCrunch: mostly internal links, plus several to Gizmodo, even though TechCrunch is an AOL site and Gizmodo a Gawker Media site.

  • Gawker Media's Lifehacker, which is probably the site I visit most of all those listed here: external links if an article describes the external site's article, company, or product, but otherwise, internal links.

  • Perez Hilton: mostly internal links; external links tend to be redirected via goo.gl, I suppose so that Mr. Hilton's people can track which external links get clicked.

  • Gawker Media's Gizmodo: plenty of external links, even to non-Gawker sites, for a gadget site that I assume is mostly interested in helping advertisers sell gadgets.

  • Cheezburger: textual content not much of an issue here.

I'm guessing that there is no policy across all of Gawker Media about the use of links, but that each of their major properties has some sort of policy in place. (For an interesting, explicit enumeration of one carefully managed site's linking policy, see the guidelines at IBM developerWorks.)

One particularly link-rich bit of content that I read regularly is Data is Plural, which ironically is delivered via email—a technology that had a firm foothold in the Internet before Berners-Lee came up with the Web, and which most young people today only use to communicate with us old people.

Who even thinks about hypertext as hypertext anymore? A quick look at the former Usenet newsgroup (and now Google Group) alt.hypertext shows an average of about one new message or comment per month for the last few years, including spam. (Compare January of 1998, when the newsgroup had 39 topics with one or more postings in that one month.) The most recent topic shown is titled "NCSA Mosaic for X 0.10 available" from Marc Andreessen, posted—I thought—last month, making me think "isn't he a bit busy for Mosaic these days?" It turned out that last month someone added a comment to his original 1993 post. A relatively recent new topic is Paul Ford's January 2014 query "Do documents have a chance? Or is the future more and smarter optimized applications?" Actually, that makes a solid answer to my question that began this paragraph: Paul Ford, and I'm really looking forward to his upcoming book.

'Afternoon, A Story' package

The hypertext "novel" I bought in 1994 for $25

Please add any comments to this Google+ post.

20 December 2015

My new job

Lots of cutting-edge technologies, 18 minutes from my home.

I recently began a new full-time position as a technical writer at Commonwealth Computer Research, Inc., more commonly known as CCRi. CCRi was doing large-scale data science long before the term "data science" became so popular; one company founder also directs the University of Virginia's Data Science Institute. They also do a lot of work with distributed machine learning and other cutting-edge technologies, especially in the area of geospatial analytics. The chance to work with so many different interesting new technologies and smart people—engineering and math PhDs tend to be the norm instead of the exception—right here in Charlottesville, after telecommuting for over eight years, was just too good to pass up.

Having recently grown to over 80 employees, CCRi has gotten large enough that it's become difficult for everyone there to know about all the technology and projects going on in other parts of the company. Part of my role will be to help with that, documenting these things so that it's easier for people to find connections between the different existing and new efforts underway. I'll also be helping them with marketing and business development.

RDF and SPARQL do play a role in some of the projects there, mostly using the Rya triplestore because of its use of Apache Accumulo for storage. Accumulo is a key-value NoSQL database built on Hadoop whose design is based on Google's Bigtable database, and it plays an important part in several CCRi projects.

One of the biggest projects at CCRi is GeoMesa, which its product page describes as "an open-source solution maintained and supported by CCRi for storing, indexing, querying, transforming, and visualizing spatio-temporal data at scale in Accumulo." For a start, it adds to Accumulo what PostGIS adds to PostgreSQL: datatypes, functions, and more features that make it easy to store and query geospatial data. Going beyond that, GeoMesa lets you store spatio-temporal data, so that event timestamps can play a role in applications that use GeoMesa. Apache Kafka provides GeoMesa with some nice infrastructure for handling real-time streaming data. For example, it was used to create this animated U.S. map of tweets over the 2015 Super Bowl week.
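To make "spatio-temporal" concrete, here is a small sketch in plain Python of the kind of question such a system answers: which events fall inside a bounding box and also inside a time window? This is not GeoMesa's API, and it scans a list in memory instead of using an index at scale; the Event class, the in_box_and_window function, and the sample data are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    """A hypothetical geotagged, timestamped record (for example, a tweet)."""
    lon: float
    lat: float
    when: datetime
    text: str


def in_box_and_window(event, box, start, end):
    """True if the event lies inside the bounding box and the time window.

    box is (min_lon, min_lat, max_lon, max_lat); start is inclusive and
    end is exclusive.
    """
    min_lon, min_lat, max_lon, max_lat = box
    return (min_lon <= event.lon <= max_lon
            and min_lat <= event.lat <= max_lat
            and start <= event.when < end)


# Made-up sample events.
events = [
    Event(-78.48, 38.03, datetime(2015, 2, 1, 18, 30), "kickoff!"),
    Event(-78.48, 38.03, datetime(2015, 2, 3, 9, 0), "back to work"),
    Event(-122.42, 37.77, datetime(2015, 2, 1, 18, 45), "go team"),
]

charlottesville_area = (-78.6, 37.9, -78.3, 38.2)
game_day = (datetime(2015, 2, 1), datetime(2015, 2, 2))

matches = [e.text for e in events
           if in_box_and_window(e, charlottesville_area, *game_day)]
print(matches)  # prints ['kickoff!']
```

The second event is in the right place at the wrong time and the third is at the right time in the wrong place, so only the first one matches. A purely geospatial query could not tell the first two apart.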

As alternatives to using Accumulo for storage, GeoMesa can also use Apache HBase and Google Cloud Bigtable, the public version of Google's internal Bigtable storage system. After Google heard about this, they contacted CCRi about a partnership, which was exciting enough in this town for a local TV station to run the news story shown below. That video is fun, but if you only have a minute and a half to watch a video about GeoMesa, I recommend the GeoMesa on Google BigTable one, which shows off some of the excellent visualizations that are possible.

In addition to products like GeoMesa and others that you can see on the website, the company does applied research, often for government agencies. (I'm learning a lot about those—did you know that the U.S. has an Office for Anticipating Surprise?) In this era of Big Data, the question sometimes comes up of how to best make use of all this data now that tools for working with such large quantities of it have become more easily available. CCRi's capabilities such as predictive analytics, optimization, and text analysis are helping customers get more out of this data in settings ranging from international sales patterns to battlefields. If anyone wants to contact me to learn more, I'd be happy to set them up with the right people to tell them about the kinds of services CCRi offers.

17 November 2015

13 ways to make your writing look more professional

Simple copyediting things.

I’ve done some copyediting as part of my job, especially with marketing material. Certain basic mistakes come up so often that I made a list that I’ve been tempted to give to whoever gave me the original content and say “please make sure that it doesn’t have any of these problems first!” I didn’t, but for those who are interested, following these simple rules will make your writing look more professional. The nice thing about these is that, unlike with truly good writing, no skill and very little work is required to put them into practice. They’re all just a matter of paying attention.

  1. Never give someone something to read that you haven’t spell checked. If it has typos that a spell checker would have caught, it’s like saying “my time is so much more valuable than yours that I couldn’t bother doing this simple, mechanical two-minute task before giving this to you.” If you’re writing with a tool that doesn’t have a spell checker, paste the text into Microsoft Word or LibreOffice and look for the red squiggly lines. If a spell checker doesn’t recognize a company name and you’re not 100% sure of its spelling, take ten seconds to check it on their website, especially if someone from that company may see the piece.

  2. Only put one space after a period, question mark, or exclamation mark ending a sentence, not two. People used two in the days of manual typewriters for hard copy manuscripts that would be submitted to typesetters, but as with the carriage returns that we formerly added to the end of every single line on typewriters, we now leave it up to the computer to decide how much spacing is appropriate. If you put two spaces after a period, your word processor will put too much space there.

  3. In something published by an American company, punctuation at the end of a quoted phrase goes inside the quotes, “like this,” not outside, “like this”. In the UK they do it outside. This is a stickier issue with technical writing, where you may be referring to specific strings of quoted text; for example, if I write that a password is “swordfish”, I don’t want readers thinking that the comma is part of the password. The important thing is to be consistent within a document.

  4. In a bulleted or numbered list, either end all the bullets with punctuation that treats the bullets as complete sentences or end none of them that way. Don’t do this:

    • Go out the front door

    • Pull the mail out of the mailbox.

    • Bring the mail back inside

    • Leave the mail on the dining room table.

  5. The items of a list like that should be grammatically consistent: all complete sentences or all grammatically consistent phrases (for example, all noun phrases) with no complete sentences. For example, if the first item says “Easier setup and installation” and the second says “Wide choice of reports,” then no other items in that list should be complete sentences.

  6. Put consistent spacing around em dashes and don’t confuse them with hyphens. A hyphen is the keyboard character that usually connects words being used together as a single adjective as in user-friendly interface or in-memory database. An em dash (named for being the width of the letter “m”) is used for appositive phrases. It’s often written with two hyphens--like this--which Microsoft Word and LibreOffice will convert to an em dash character. In HTML, you can enter &mdash; or just paste the character from somewhere else. (An en dash is a bit narrower and used for date ranges. Handy hint when you’re unloading your last few tiles at the end of a Scrabble game: both em and en are legal words.) Em dashes should either have a space on both sides — like that, or on neither side—like that. Pick one spacing convention and make sure that all the em dashes in a given document are spaced consistently.

  7. Some phrases may or may not use initial caps, like Artificial Intelligence. If you capitalize such a phrase, do so consistently throughout a document. Don’t refer to Artificial Intelligence in the first paragraph and artificial intelligence in the fourth. Also, with phrases that may or may not be written as one word, pick one and be consistent; don’t write “filename” in one paragraph and “file name” further on in the document. (Early drafts of this blog post made this mistake with “spellcheck.”)

  8. We use apostrophes to stand in for a missing letter in a contraction (such as standing in for the “o” from “is not” in “isn’t”) or for the possessive, as in Jim’s car, so never ever use “it’s” as a possessive—“it’s” can only be used as a contraction for “it is.” Don't use an apostrophe and an “s” to indicate a plural. (Some people make exceptions for numbers like 1990’s and abbreviations such as M.D.'s.)

  9. Use English instead of Latin abbreviations: “for example” instead of “e.g.” and “that is” instead of “i.e.” Instead of saying “etc.,” introduce a list with “such as” to indicate that the list is incomplete and that there are probably more entries. For example, say “baseball teams such as the Mets, Yankees, and Red Sox” and not “Mets, Yankees, Red Sox, etc.”

  10. In the age of the web, underlining means hypertext link. Don't use it for anything else because it clutters a layout. (In the old days, it was an indication to a typesetter to italicize text.) For emphasis, use bold or italics. For example: Never use an apostrophe and an “s” to indicate a plural.

  11. Check that all the links work. As with spell checking, this is best done (or redone) just before sending a document off to someone, because if you do it and then make many other edits, those edits may introduce new problems.

  12. If a product name is trademarked, only put the trademark symbol after the first mention of the product in a document. Here is what one intellectual property attorney tells us:

    In written documents — articles, press releases, promotional materials, and the like — it is only necessary to use a symbol with the first instance of the mark, or with the most prominent placement of the mark. It is a common misconception that each and every instance of the mark should bear a trademark symbol. Overuse creates visual clutter and may detract from the aesthetic appeal of the piece. Provided there is at least one conspicuous use of the TM, SM, or ® on the face of the writing, do not be afraid to eliminate superfluous markings.

  13. Don’t say “and/or.” In general, using slashes to indicate indecision is a bad idea. Decide on something, or rewrite the sentence.
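
Because several of these rules are purely mechanical, a script can catch some violations before a human reader does. The following is a rough sketch in Python, not a complete checker: it covers only rules 2, 6, 9, and 13, the regular expressions are simple guesses that will produce some false positives, and the names lint and LINE_CHECKS are made up for this example.

```python
import re

EM_DASH = "\u2014"

# (rule number from the list above, pattern, message) for line-by-line checks.
LINE_CHECKS = [
    (2, re.compile(r"[.?!] {2,}(?=\S)"), "extra space after sentence-ending punctuation"),
    (9, re.compile(r"\b(?:e\.g\.|i\.e\.|etc\.)", re.IGNORECASE), "Latin abbreviation"),
    (13, re.compile(r"\band/or\b", re.IGNORECASE), "and/or"),
]


def lint(text):
    """Return a list of (line_number, rule_number, message) tuples."""
    problems = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern, message in LINE_CHECKS:
            if pattern.search(line):
                problems.append((line_number, rule, message))
    # Rule 6 is a whole-document check, reported with line number 0:
    # flag a document that mixes spaced and unspaced em dashes.
    spaced = f" {EM_DASH} " in text
    tight = re.search(rf"\S{EM_DASH}\S", text) is not None
    if spaced and tight:
        problems.append((0, 6, "inconsistent spacing around em dashes"))
    return problems


# A made-up draft that breaks all four of the checked rules.
draft = (
    "Our product is fast.  It supports CSV, JSON, etc.\n"
    "Install it on Windows and/or Linux.\n"
    f"Setup is easy {EM_DASH} really{EM_DASH}and quick.\n"
)
for problem in lint(draft):
    print(problem)
```

Running this on the sample draft reports rules 2 and 9 on line 1, rule 13 on line 2, and the document-wide rule 6 problem. Rules such as consistent capitalization or grammatically parallel list items still need a person paying attention.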
