Talking to the Machine
When I was 18, I decided to write a novel by dictating it into a tape recorder. The idea, as I recall, was to avoid the quagmire of writing a first draft, and jump directly to the revision stage. What I found, though, was that in seeking to make my writing process easier, I had made it far more difficult, for the taped draft, once transcribed, was full of endless sentences, rambling digressions, and other conversational misdirections that rendered it literally impossible to read. The trouble with dictation, I came to understand, had to do with the dichotomy between spoken and written speech, the way that, by its nature, talk is loose and formless, while writing cannot help but be more controlled. Even the most natural writer has two distinct voices, one for conversation and one for the page. And in trying to merge mine, I realized that the key to writing was writing, and that there could be no alternative to putting in the necessary work.

Over the years, I’ve done my best to stick to that maxim, never again adopting dictation as a writing tool. Yet today, I’m leaving history behind to take another stab at translating talk into prose. My motivation here is simple— to test-drive Dragon NaturallySpeaking Preferred voice recognition software, a computer program that theoretically “recognizes” its user’s speech and re-creates it in the form of written words. Experience to the contrary, such a notion continues to fascinate me, if for no other reason than my own laziness, my desire to find an easier way to work. After all, should it live up to even a fraction of its promise, voice recognition might represent a kind of missing link between
language and its electronic simulacra, enabling us to collapse at least some of the distance between writing and speech. Sure, I tell myself, this is dictation, but unlike the static receptivity of the tape recorder, it’s dictation of an interactive variety, in which you can see your sentences take shape as you frame them, much as you would with traditional text.

Of course, when it comes to voice recognition, interactivity takes place on a number of levels all at once. For me, the first involves actually having to leave my house and head to the office, where the NaturallySpeaking software has been installed on one of the paper’s machines. It takes a lot of memory, after all, to run a voice recognition program, a lot more memory than I have at home. But if, on the one hand, I’m looking forward to seeing how NaturallySpeaking works on a computer that can handle it, I’m also more than a little wary about using it in what is, essentially, a public setting, with people working all around me as I sit in an office and talk to myself.

That sense only increases when I settle in before the computer and prepare to customize, or “train,” the software to identify the inflections of my words. First, I adjust the headset until the microphone is the proper distance from my mouth; then, I talk my way through a couple of preliminary steps, making sure, as NaturallySpeaking reminds me, to enunciate every syllable, as though I were talking to a recalcitrant child. After a minute or two, I start to feel more comfortable, but still, I can’t keep from speaking quietly, at times so quietly the software can’t “hear” me.

NaturallySpeaking asks me to spend 30 minutes reading into the microphone, offering a choice of pre-programmed texts, including Alice’s Adventures in Wonderland and Charlie and the Chocolate Factory, which I recently read to my four-year-old son. There’s something delightfully whimsical about discovering works like these in the middle of a software training process, and, as I begin to read from Charlie, I imagine a room full of computer programmers laughing somewhere, as if they and I are sharing a joke. Yet before too long, I feel the knife’s edge of self-consciousness again, as if, in doing this, I’ve slipped the bounds of logic and fallen headfirst down the rabbit hole. In that regard, it might have been more appropriate to select Alice, more reflective of the effort to “communicate” with a computer— an idea that is, essentially, absurd.

How absurd begins to be apparent once I finish reading and start, as it were, to “write.” I speak a sentence and watch as, seconds later, it emerges on my screen. Although it’s a bit disconcerting to dictate punctuation marks— “When I was 18 comma,” I say, “I decided to write a novel by dictating it into a tape recorder period”— I am attracted by the computer’s fluidity, its ability (or so it seems) to understand. That’s the lure of voice recognition software, and, using it, I feel as if I’m living in science fiction, like I’ve become a character in a film. The first time I ever saw voice recognition was in the movie Being There, when the character played by Melvyn Douglas has a heart attack while dictating, and we watch his computer record his struggling gasps. And sitting here, with paragraphs flowing like water across my monitor, it’s hard not to be tempted into believing that writing might one day be this simple, this unfiltered and direct.

As compelling as that sounds, however, it’s not long before I start to come up against the odd accommodations required by technology. The software is unable to “hear” exactly what I say. The phrase “unfiltered and direct,” for instance, with which I ended the last paragraph, originally appeared as “on filters and her act,” while “dichotomy” came out as “dike hot to me” when I used it earlier on. Then there are the non-words that come out as indiscriminate bits of verbiage. An accidental sigh is recorded as “ahead,” while the rustle of my fingertips against the microphone yields the nonsensical “do for hot and her hot day.” There’s something intriguing about this, not only because it makes me reconsider the relationship of sounds to language, but because of NaturallySpeaking’s strangely adolescent way of turning nearly every misinterpretation into a dirty joke. It’s a subtle bit of subversion, the flip side of the sensibility that asks office workers to read aloud from Charlie and the Chocolate Factory, and, in confronting it, I get the sense that the computer has been possessed by a prankster spirit, albeit one in digital form.

That prankster continues to emerge as the afternoon progresses, especially after the heat impels me to open up my office door. It’s a Friday, and the people outside are cleaning up various odds and ends, sorting papers, talking about their weekend plans. Without thinking about it, I begin to speak more softly, but no one pays me much attention; what’s surprising, rather, is the way the NaturallySpeaking software seems to listen in on everybody else. Someone drops a book, and the thud appears as “that.” A second person starts talking and his syllables create phrases that have nothing to do with what I mean to write. Before I can delete the errant sequence, the mike picks up another ambient burst. “Jesus fucking Christ,” I mutter, only to see it show up as “Jesus Fontaine Cranston,” a misreading that moves me, finally, to laughter— which NaturallySpeaking deciphers as “perfect effective at the back to.”

On the one hand, these are minor problems, correctable by continued training of the software, or finding a quieter place to work. Less easily resolved, though, is the way all this does the opposite of what it’s supposed to, which only highlights the ongoing limitations of the electronic age. Were I typing, after all, interpretation would not be an issue; my words would appear as I intended, without having to be “read.” But if this isn’t writing, then neither does it come with the flow, the rhythm, of speech. For one thing, there’s the matter of all that enunciation, which makes you hyper-aware of everything you say. Still more problematic are the expectations we bring to spoken language— that it is part of a conversation between sentient beings. Computers do not engender this, no matter how well they are programmed; they are, instead, essentially lifeless, composed of plastic and circuitry, with no intelligence of their own. As such, using voice recognition software is like performing a monologue, but a monologue where the textures of one’s language have been altered to fit the limitations of the medium, rather than the other way around.

In many ways, this piece represents a case in point. My initial intention was to “write” it using NaturallySpeaking, to dictate my ideas directly onto the page. What I encountered, however, were many of the problems that had undermined my novel: overblown sentences, meandering paragraphs, and thoughts that ebbed and flowed in a distressingly conversational way. Ultimately, I rewrote about half of what you’ve read here, trying to find connection in the rhythm of my keys. On some level, I suppose, that’s antithetical— or, as NaturallySpeaking hears it, “and medical”— to the concept of voice recognition, with its insistence on conflating text and speech. And that’s the rub. Although we’ve come a long way since I dictated my novel into a tape recorder, I’ll be sticking to my keyboard from now on.