Welcome to the Interdome: databases

1/16/2010

Talking with machines

On the Twitter microsyntax front, Project EPIC is working out an emergency syntax for Haiti rescue communication efforts.

It's interesting to me because the syntax itself is not new; it just uses hashtags. What is new is that the hashtags mark pre-set message components, as befits the important elements of a relayed communication.

From the link above:

Our team and collaborators are proposing a Tweet-friendly hashtag-based syntax to help direct Twitter communications for more efficient data extraction for those communicating about the Haiti earthquake disaster. Use only requires modifications of Tweet messages to make information pieces that refer to #location, #status, #needs, #damage and several other elements of emergency communications more machine parsable.

EXAMPLE1: #haiti #imok #name John Doe #loc Mirebalais Shelter #status minor injuries

EXAMPLE2: #haiti #need #transport #loc Jacmel #num 10 #info medical volunteers looking for big boat to transport to PAP

EXAMPLE3: #haiti #need #translator #contact @pierrecote

EXAMPLE4: #haiti #ruok #name Camelia Siquineau #loc Hotel Montana

EXAMPLE5: #haiti #ruok #name Raymonde Lafrotune #loc Delmas 3, Rue Menelas #1

EXAMPLE6: #haiti #offering #volunteers #translators #loc Florida #contact @FranceGlobal

PRIMARY TAG
#need
#offering
#imok
#ruok
#damage
#injured
#road
.....

SECONDARY TAG
Need/Offering Descriptor Tags
#food
#water
#fuel
#medical or #med
#shelter
#transport
#volunteers... can shorten to #vols
#translator
#status
#financial or #money
#information or #info
#supplies [list specific supplies needed]
.....

Data tags
#name [name]
#loc [location]
#num [amount or capacity]
#contact [email, phone, link, other]
#photo [link to photo]
#source [source of info]
#status [status]
.....

End Tag
#info [other information]

Overall order is not as important as tag-descriptor connection.

In a time of crisis, it makes sense not to quibble about whether slashes, backslashes, other symbols, or certain pre-set abbreviations make the most sense. And so they've actually put together something quite sensible: they've basically converted a Twitter post into a DB data record, with hashtags as delimiters. Any firehose-sniffing program should be able to pick out and synthesize the relevant information from this list of tags.

They don't have any parsing programs showcased on the site, but they are live, tweeting with this syntax (@epiccolorado), and it shouldn't be too hard (for someone other than me) to build one pretty quickly.
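For the curious, here is a minimal sketch in Python of what such a parser might look like. It is not anything Project EPIC has published. The function name, the record layout, and the rule that the first tag is the event identifier are assumptions made for illustration; the tag sets are copied from the lists above, and aliases like #med or #vols are not handled.

```python
import re

# A tag must start with a letter, so "#1" in a street address stays in the value.
TAG_RE = re.compile(r"#([A-Za-z]\w*)")

# Tag sets copied from the proposal's lists; anything else is treated
# as a need/offering descriptor (an assumption of this sketch).
PRIMARY = {"need", "offering", "imok", "ruok", "damage", "injured", "road"}
DATA = {"name", "loc", "num", "contact", "photo", "source", "status", "info"}


def parse_epic_tweet(text):
    """Turn one tweet into a dict that looks like a small database record."""
    record = {"event": None, "primary": [], "descriptors": {}, "data": {}}
    matches = list(TAG_RE.finditer(text))
    for i, m in enumerate(matches):
        tag = m.group(1).lower()
        # The value of a tag is the text up to the next tag (or end of tweet).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = text[m.end():end].strip()
        if i == 0:
            record["event"] = tag          # assumed: identifier leads, e.g. #haiti
        elif tag in PRIMARY:
            record["primary"].append(tag)
        elif tag in DATA:
            record["data"][tag] = value
        else:
            record["descriptors"][tag] = value
    return record


if __name__ == "__main__":
    print(parse_epic_tweet(
        "#haiti #need #transport #loc Jacmel #num 10 "
        "#info medical volunteers looking for big boat to transport to PAP"
    ))
```

Because each value is attached to the tag in front of it, the sketch does not care much about overall tag order, which matches the note that the tag-descriptor connection is what counts.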

There are some things I really like about this.

- It's simple. It takes a convention people already know, and re-uses it.

- It's basically making a simple little code book. One could print out the list of commonly-used tags on an index card, and in only a few seconds put together a message readable to the network of people looking for this format.

- It is identifying a basic sentence structure, one level up from "twitter syntax". This is new for Twitter semiotics. If you look at a commonly used syntax, such as the re-tweet, you will see a variety of different amalgamations of the syntax. Some put the "RT" first, or last, or some are now using "via" rather than "RT". Some RT only the last person in the RT chain, some put the first, or some put all. None of this matters, of course, because the message is still getting across. But with this EPIC format, the order of the tags matters, and yet is still a bit flexible. It leads with the identifier, "#haiti", and then continues in a line of primary, secondary, data, and then additional information tags to shape the message in an understandable way. The simplest way of forming a regular sentence is with [Subject] -> [Verb]. Then, you can expand that to [Subject] -> [Verb] -> [Object]. And then, [Subject] -> [Verb] -> [Object] -> [Adjective]. And then, [Subject] -> [Verb] -> [Adverb] -> [Object] -> [Adjective]. You get the idea. The position changes depending on what language you use, but our system of language is basically a database, assigning values to these different data types in a particular record, and then parsing the record in conjunction with other records. This EPIC format is doing that with the basic information types for crucial rescue information.

- It's readable by humans as well as machines. Anyone looking at a tweet in this format could tell what it means. In this way, it fits into the main trending flow of #haiti tweets, but can also be pulled out from the noise. It is a very ingenious, although simple, middle ground between an incomprehensible DB record and a common-language sentence. This is where I see the microsyntax on Twitter heading... some common, comprehensible ground between XML script and common language punctuation. It is an understandable written language, but syntaxed to be capable of being metadata.

It will be interesting to see how well this works in Haiti, but thinking ahead to the next disaster, they should print up laminated index cards with these tags on them, and syntax examples on the other side. They can air drop them, or distribute them with Twitterized cell phones. The beauty is that anyone can contribute to the information collection, using whatever means happens to work: cell phone, SMS, Internet, Twitter app, or even potentially voice. Add geotagging to the metadata, and you are getting near instant, localized, specifically formatted information from the ground. It should be pretty easy to go back and rank the tweets coming in, as DB reports are verified, bumping up users who provide good information. Any responder on the ground could easily be linked into the overall real-time awareness DB, without having to transfer on phone and radio, or waiting for confirmed contact. Report is made, and then the responder can go about his/her work.
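The ranking idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class name, the report format (a dict with a "user" key), and the smoothed scoring formula are all hypothetical, not part of anything EPIC has described.

```python
from collections import defaultdict


class ReporterScores:
    """Track how often each user's reports were later verified."""

    def __init__(self):
        self.verified = defaultdict(int)
        self.total = defaultdict(int)

    def record_outcome(self, user, was_verified):
        # Called once a report has been checked on the ground.
        self.total[user] += 1
        if was_verified:
            self.verified[user] += 1

    def score(self, user):
        # Smoothed ratio: a user with no history starts at 0.5, so one
        # lucky (or unlucky) report does not dominate the ranking.
        return (self.verified[user] + 1) / (self.total[user] + 2)

    def rank(self, reports):
        # Reports from users with a better track record come first.
        return sorted(reports, key=lambda r: self.score(r["user"]), reverse=True)
```

A responder's incoming queue could then be sorted with `rank()`, so reports from users with a good track record bubble up without anyone having to stop and vouch for them.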

Just wait until this sort of thing goes audible. Ten codes, the codes police and dispatch use over the radio, are currently being phased out all across the country because they are not unified, and sometimes cause confusion in hectic situations. But these are merely translations. Each ten code simply stands for something else. What if they were syntactical codes, to let a computer or human listening know what sort of information was being read over the air? What if we started using a vocal "click" to denote a hashtag, so the next spoken word would be known as an indexable primary or secondary tag, giving additional meaning to the data spoken next? It would be "plain speech", but plain speech imbued with metadata for easy compilation into DB-style records. With voice-to-text capture on the radio feed, there could be one open channel, with everyone speaking at once. The computer would capture the speech, complete with hashtags, and publish it to a readable timeline on the screen. The radio metadata (the unit's number is already included silently in the broadcast in current technology) would allow dispatch or particular units to follow the timeline of only particular units, say, those involved in a particular response. You could listen to the open feed for instant vocal communication, or you could filter the feed to particular data tags.

Language has a great potential for cyborgization. Cybernetics is an extension of our logical thought processes, so there is no reason why our thought processes can't increase our computerized tools by interfacing our current age-old communication techniques with our new technology. Speak the future.

12/07/2009

IF(mp3=digital, createnewrecord, ctrl+A, Del)

I can't believe what a nerd I am. Look at this post I just wrote, and thought was a good idea! Can you believe the nerdy title I gave it? Wow. Anyway, posting anyway, as an example of the weird analytical stuff I actually think about during the day.

So I did this really stupid thing about a year and a half ago. While working as a karaoke DJ (this wasn’t the stupid part, okay?) I decided to copy over the external hard drive of DJ tunes to my own hard drive. I knew it wouldn’t be the best music of course, but I thought, here’s a chance to get all those classic party songs they put on those monthly mainstream DJ compilations, and well, I just never could turn down an opportunity to make my music collection more encyclopedic.

Big mistake.

Not only did I severely over-estimate the ratio of “classic party songs” to pure crap, I also forgot to take into account that the guy whose hard drive it was is one of the most unorganized, non-encyclopedic people I’ve ever met. My music library became clogged with unlabeled, mis-labeled, duplicate tracks, most of which I didn’t want anyway, with their titles written in all caps. To the tune of about 200 gigs.

If you are not acquainted with the true depths of my analytical neurosis, let’s just say that such a poorly organized “library” has been a heavy weight bearing on my database soul for the past year and a half.

But never fear dear reader, because I am working through. Little by little, I am making my way through the genres and deleting, re-categorizing, consolidating, stripping, and re-writing the metadata. I first did “alternative”, “punk/hardcore”, “classical”, and “jazz” so at least I could listen to some music without going crazy. I got rid of the “other” and “uncategorized” categories little by little, and eventually consolidated “hip-hop”, “hip hop/rap”, “gangsta rap”, “rap/hip-hop”, and “hip-hop/R&B”. Last night, I finally finished “pop”. The only ones left are “rock/pop” and “rock”, which are large, but by this time I am being brutal with my deletion, so I hope to finish this week. If I don’t immediately recognize the name, it goes. If they have a single song I don’t like, it goes. If they have a single song with a Christmas theme… Ctrl+A, Del.
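If the metadata were scriptable, the genre consolidation would not have to be done by hand. A minimal sketch in Python, assuming the genre strings have already been pulled out of the library somehow; the alias table and the choice of "hip-hop" as the surviving label are mine, just for illustration.

```python
# Hypothetical alias table: every variant spelling maps to one canonical genre.
GENRE_ALIASES = {
    "hip-hop": "hip-hop",
    "hip hop/rap": "hip-hop",
    "gangsta rap": "hip-hop",
    "rap/hip-hop": "hip-hop",
    "hip-hop/r&b": "hip-hop",
}


def canonical_genre(raw):
    """Lowercase, collapse whitespace, then look the genre up in the table."""
    key = " ".join(raw.lower().split())
    # Unknown genres pass through in normalized form rather than being guessed at.
    return GENRE_ALIASES.get(key, key)


if __name__ == "__main__":
    for g in ["HIP-HOP/R&B", "  Hip Hop/Rap ", "Jazz"]:
        print(repr(g), "->", canonical_genre(g))
```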

Throughout this process, I have had much time to lament how horrible the music player programs are at sorting music. I use iTunes primarily (iPhone user). But for sorting purposes, I also tried Songbird, Media Monkey, Windows Media Player, and Winamp. They differ a little bit, but without buying extra modules, there really isn’t any improvement. The best thing one can do is to sort by a metadata category, and just brute force your way through it. Even so-called “duplicate” finders are pretty weak, with no way to qualify how close or far a supposed duplicate might match its pair. And then, they are remarkably proprietary. iTunes is notorious (at least among the people who discuss music library databases online) for not allowing the language of its library files to be touched. There are some Applescripts out there for making some changes, but amazingly, it is very hard to re-organize a music library any other way than through a browser.
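What I mean by qualifying a duplicate is something like a similarity score with an adjustable threshold. Here is a rough Python sketch using the standard library's difflib; comparing plain "artist - title" strings is an assumption made for the example, and a real tool would probably also look at duration or file size.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(label):
    # Ignore case and stray whitespace, e.g. titles typed in all caps.
    return " ".join(label.lower().split())


def similarity(a, b):
    """Return a score from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_duplicates(tracks, threshold=0.85):
    """List (score, a, b) for every pair at or above the threshold, best first."""
    pairs = []
    for a, b in combinations(tracks, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((score, a, b))
    return sorted(pairs, reverse=True)
```

Comparing every pair is quadratic, so on a 200-gig library you would want to group by artist first, but the point is that the threshold is a knob the user gets to turn.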

Just so we’re clear, I’m talking about the music library, which is different than the actual mp3s on your hard drive. The library is basically a database file, in some derivative of XML, for organizing the track names, numbers, artwork, actual file locations, and other metadata for display through the player’s browser window. There is a re-write process between the file itself and the database (what iTunes calls “organizing”, or maybe “mediaTunes” now?) that will adjust the actual metadata of the mp3 to cohere with the library database.

Now, I know when I say this, the reason it is so is because so few people have the disposition to categorization that I have, but all the same—the databases available for media organization are abysmal. I don’t really see why—it is easy enough to add XML interpretation into a program. Your word processor can probably do it. But I guess in the effort to make media players as “cleanlined” as possible, (i.e. iPod/iTunes-like) these are abandoned in favor of tools that let the program do all the work.

And I’m not interested in trashing the iTunes mentality, because through it all, they’ve still put together an excellent media player. Sure, it’s a bit heavy for a media player program. And it has a tendency to do things “automatically” that really screw up—like losing user-uploaded artwork trying to auto-download it, and we don’t even need to get into the DRM stuff. But as basically a front end for their music store, it is still pretty damn usable for someone like me, who has only bought maybe two things from the iTunes Store ever.

For example, I love the Smart Playlists. This is the sort of functionality I’m talking about. These are basically database queries, where you can define ranges of the metadata variables like “times played” or “date last played”, and insert randomization and total record quantity. I have several personal “radio stations” made from these tools, and they work great. Of course, there is not as much flexibility as I would like. The same thing goes for the Genius function, which is basically a personalization query, based on variables iTunes doesn’t disclose. Of course, you can’t edit this, and for someone with +100 gigs of mp3s and a computer 5 years old, it kind of gums up the works. But it’s the right idea.
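To make the "database query" point concrete, here is roughly what a Smart Playlist looks like when written as actual SQL, using Python's built-in sqlite3 module. This is only a sketch: the table layout, column names, and sample rows are invented for the example and have nothing to do with how iTunes actually stores its library.

```python
import sqlite3

# An in-memory stand-in for a music library (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tracks (
           id INTEGER PRIMARY KEY,
           title TEXT NOT NULL,
           genre TEXT,
           play_count INTEGER DEFAULT 0,
           last_played TEXT
       )"""
)
# last_played holds ISO dates, so plain string comparison sorts correctly.
conn.executemany(
    "INSERT INTO tracks (title, genre, play_count, last_played) VALUES (?, ?, ?, ?)",
    [
        ("Track A", "jazz", 2, "2009-01-10"),
        ("Track B", "jazz", 40, "2009-11-30"),
        ("Track C", "jazz", 0, None),
        ("Track D", "rock", 1, "2008-05-01"),
    ],
)


def smart_playlist(conn, genre, max_plays, not_played_since, limit):
    """Rarely played tracks in a genre, shuffled, capped at `limit` records."""
    rows = conn.execute(
        """SELECT title FROM tracks
           WHERE genre = ?
             AND play_count <= ?
             AND (last_played IS NULL OR last_played < ?)
           ORDER BY RANDOM()
           LIMIT ?""",
        (genre, max_plays, not_played_since, limit),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(smart_playlist(conn, "jazz", 5, "2009-06-01", 10))
```

The range on play count, the cutoff on last-played date, the randomization, and the record limit are the same knobs the Smart Playlist dialog exposes; the difference is that here you could add any other condition you liked.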

The thing I realized, while deleting 50+ copies of duplicate shit-club mixes of Akon’s three biggest songs of 2007, was that despite the hysteria about intellectual property insinuating that a song is infinitely replicable, and a mere collection of digital bits, we still don’t look at our music files as data. There is an aspect of the commodity in every mp3; it takes on more than what it is. An mp3, to a consumer, is purely the music experience, not the possession of data which can create the music experience. My DJ associate with bad file habits thinks to himself, I want this song in my music collection, and adds it in, with no thought of where it will go. When he wants to play the song, he searches for that particular track, and plays it. There is no browsing, no querying, no organization. The more duplicate tracks he has, with different spellings and different data in different categories, the more likely he’ll find an instance of it when he searches for it in the search bar. The entire analytical process is, Want->Get. It’s the purest sort of production/consumption there is.

This is good for record companies, who try to institute the fear that if they can’t make money, then you won’t have any more mp3s. Actually, with DRM, they’re probably right. But it isn’t true: being able to drag and drop an entire collection of mp3s proves the point. An mp3 is only data. Music has long since passed the point of expressive performance, and has entered the realm of digital data, along with many other aspects of our life. Now, expressive performance, the actual production and consumption, lives within the differences of binary digits.

So what are you going to do? Well, as any database administrator will tell you—stop doing that! That is, having poor data habits. We know to back up our data, and to be careful where we get our data, but now we need to learn to organize it. A well-kept database is a useful database. Only one item of data per variable, each record separate, no duplicates, proper linking conventions. Clean query programming. It’s just what makes sense.

Of course, no 13-year-old just starting their mp3 collection is going to do this. You just throw ‘em in a file as you download them. So instead of instituting my Universal Rules of Epistemological Fortitude, as I would like to do, I instead look to the media players. I want MS Access, with a media player function. I want to write combo boxes for my playlists. I want SQL queries in my mp3 queries. I want to add IF/THEN statements to my iPhone syncing. Maybe with some top-down redesign of software, we could start treating our mp3s as what they are: valuable data.

"..."

"For centuries the situation in literature was such that a small number of writers faced many thousands of times that number of readers. Then, towards the end of the last century, there came a change. As the press grew in volume, making ever-increasing numbers of new political, religious, scientific, professional and local organs available to its readership, larger and larger sections of that readership (gradually at first) turned into writers. It began with the daily newspapers opening their 'correspondence columns' to such people, and it has now reached a point where few Europeans involved in the labour process could fail, basically, to find some opportunity or other to publish an experience at work, a complaint, a piece of reporting or something similar. The distinction between writer and readership is thus in the process of losing its fundamental character. That distinction is becoming a functional one, assuming a different form from one case to the next."

--Walter Benjamin, The Work of Art in the Age of Mechanical Reproduction
