Note: This is an unfinished "slush-pile" article: not only is it unfinished, but it may change drastically or be deleted.
I'd like my cell phone to predict who I'm about to call before I call them, and put rapid-dial buttons at the top of the home screen for them. I'd like it to figure out when I need to buy milk, eggs and bread, add them to my shopping list and buzz when I drive near a supermarket. I'd like it to figure out when I've completed a ToDo item, and create new ones for me. I'd like it to prepare as much of my tax return as it can, so all that's left for me to do is make corrections and file. I'd like it to know when I'm hurt and call an ambulance, or drunk and call a taxi. I want it to look for bargains and steer me to them at the right time and place. And I'd like it to do all this without me having to take it from my pocket.
This is what I call Passive Data Entry (PDE): a product of carrying around multi-sensory, always-on computers loaded with learning and pattern-recognizing algorithms. And while it would take User Experience (UX) to a new level, it will also make privacy harder to maintain; there will be a price to pay for something that could shave decades of routine labor off our lives.
The mechanism behind PDE is what you could call "confirmed educated guesswork", which might be driven by things like Bayesian classifiers, genetic algorithms, crowd-sourced data, neural networks, Natural Language Processing, Expert Trees and other products of AI and UI research. Apple's Siri might be the halfway product--she uses machine-learning algorithms and contextual clues to make a good guess at what we mean when we talk to her, but she still needs us to activate her explicitly and confirm any irreversible or expensive action. You could think of Passive Data Entry as a tweak to Siri's implementation so that she's always listening, always making guesses, always preparing what she thinks we might need, even if we never take her out of our pocket.
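The "confirmed educated guesswork" above can be sketched with a toy Bayesian classifier: count how often each action followed each context feature, then rank candidate actions by an unnormalized posterior. This is a minimal illustration, not any shipping implementation; the feature names and actions are invented.

```python
from collections import Counter, defaultdict

class NextActionGuesser:
    """Toy Bayesian guesser: P(action | context) from observed history.
    Feature names and actions are made up for illustration."""

    def __init__(self):
        self.action_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # feature -> action -> count

    def observe(self, context, action):
        """Record that `action` happened under `context` (a set of features)."""
        self.action_counts[action] += 1
        for f in context:
            self.feature_counts[f][action] += 1

    def guess(self, context):
        """Return (best_action, score); score is an unnormalized posterior."""
        total = sum(self.action_counts.values())
        best, best_score = None, 0.0
        for action, count in self.action_counts.items():
            score = count / total  # prior
            for f in context:      # naive likelihoods, Laplace-smoothed
                score *= (self.feature_counts[f][action] + 1) / (count + 2)
            if score > best_score:
                best, best_score = action, score
        return best, best_score

g = NextActionGuesser()
for _ in range(5):
    g.observe({"weekday", "evening", "driving_home"}, "call_spouse")
g.observe({"weekend", "morning"}, "call_mother")
action, score = g.guess({"weekday", "evening", "driving_home"})
```

The guess is never acted on directly; it only decides what to prepare and what to put within a thumb's reach.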
Eric Schmidt, CEO of Google, thinks this is possible:
"If I look at enough of your messaging and your location, and use Artificial Intelligence, we can predict where you are going to go."
A computer in our pocket doesn't have to take irreversible action, but it can sort our address book to put copies of speculatively relevant numbers at the top of the list, and the only penalty for a false positive is a flick of the thumb to get at the options below--hardly a blow to UX in the worst case, a stupendous win in the best.
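A crude version of that sorting needs nothing more than the call log: weight each call by how recently it happened and pin the top scorers. This is a sketch under invented parameters (the half-life, the scoring function); a real phone would also weigh location, time of day and calendar context.

```python
import math
import time

def rank_contacts(call_log, now=None, half_life_days=30.0, top_n=3):
    """Rank contacts by recency-weighted call frequency.
    call_log: list of (contact_name, unix_timestamp)."""
    now = now if now is not None else time.time()
    decay = math.log(2) / (half_life_days * 86400)  # per-second decay rate
    scores = {}
    for name, ts in call_log:
        age = max(0.0, now - ts)
        # A call loses half its weight every `half_life_days` days.
        scores[name] = scores.get(name, 0.0) + math.exp(-decay * age)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]  # pinned to the top; the rest one flick away
```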
But sorting address-book records only whets the appetite; try these on for size:
Yet there's more, of a private nature:
We already have an early form of Passive Data Entry in GPS-equipped cameras that geotag our photos, but they can't usually be programmed with rules to protect our privacy.
Another problem is manufacturers' complicity: the Disneyesque Apple won't give its iPhone any "adultery" features, but that might be for the sake of image rather than moral duty, and those features aren't really about enabling immoral behavior anyway; they're just more poignant when described that way. Consider these:
Privacy and safety could actually improve with PDE, given the right programming and configurability. Just how flexible it'll be will depend on the makers and the market.
Siri doesn't eliminate any input device; she just swaps one for another. If we all wore keyboards beneath our fingers at all times and could type as fast as we speak, Siri wouldn't use a microphone. What she does eliminate are the large chunks of user interaction that slow us down. Speech recognition has been around for decades, and a copy of Dragon Dictate can fill a form quite nicely as long as we give mouse, keyboard or vocal input to guide it from one field to another, all while looking at the screen and thinking about how to tell the computer what we mean. Siri and her kind use as much context as they can to guess the best way to break down what I mean when I say "Remind me to call my sister when I get home". Once upon a time I added a phone number for "Janet" to my address book; sometime later I held down the Home button and said "Janet is my sister"; some other time I stood in the middle of my living room and said "This is my home". Without all that, Siri couldn't dial shit.
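The slot-filling behind that command can be caricatured in a few lines: previously confirmed facts resolve "my sister" to a contact and "home" to a geofence. A toy string-matching parser under invented data; real assistants use statistical NLP, not substring checks.

```python
# Learned facts, as in "Janet is my sister" / "This is my home".
# Names, numbers and coordinates below are made up.
facts = {
    "relations": {"my sister": "Janet"},
    "places": {"home": (51.5007, -0.1246)},
}
contacts = {"Janet": "+1-555-0123"}

def parse_reminder(utterance):
    """Turn 'Remind me to call my sister when I get home' into an
    executable (action, trigger) pair using previously confirmed facts."""
    action, trigger = None, None
    for alias, name in facts["relations"].items():
        if "call " + alias in utterance:
            action = ("dial", contacts[name])      # resolve the relation
    for place, coords in facts["places"].items():
        if "get " + place in utterance or "at " + place in utterance:
            trigger = ("geofence", coords)         # fire on arrival
    return action, trigger
```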
An iPhone 4S is an orgy of sensors: it has a microphone, accelerometer, magnetometer (compass), gyroscope, four different radios (WiFi, GPS, Bluetooth, and multi-band cell), proximity detector, internal thermometer (mine knows if I left it under a car window on a hot day), two cameras--one at 8 megapixels, a moisture sensor (try claiming warranty after you drop it in the bathtub), and a capacitive touch sensor. In the future Apple and other smartphone makers may add more: altimeters, barometers, an external thermometer, a speaker and microphone tuned to enable sonic range finding, a laser and detector, IrDA, a fifth or sixth radio for Near Field Communications or FM/SW/Marine/CB/Ham, a touch-pressure sensor, a Geiger counter, a galvanometer, a sphygmomanometer, an anemometer. Google is already demonstrating head-mounted sensor arrays ("Project Glass") that provide video and audio capture from the wearer's perspective, so the phone can now tell what the user is looking at.
There are also derivative sensors based on pattern-recognizing algorithms that fuse inputs over time, like the Graffiti engine in the original Palm Pilot or the gesture recognizers in today's phones and tablets. Combine an accelerometer, altimeter and GPS and the phone will know whether you've flown in an airplane, risen in a hot air balloon, climbed a mountain or ridden the elevator to the top of a skyscraper. Pair the gyroscope with the microphone and it knows when you've gone to sleep. Grab some frames from the camera and it knows whether it's in your pocket or on the night table. Listen to your voice and it knows if you're under stress. The operating system might even be taught the magnetic signature of an MRI machine and make the phone shriek for its life before it's destroyed in a multi-tesla bear hug.
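The airplane/balloon/mountain/elevator distinction above comes down to fusing a few crude features. A rule-based sketch with thresholds invented for illustration (a real fused classifier would be learned, not hand-written):

```python
def classify_ascent(climb_rate_mps, speed_mps, vibration):
    """Guess the activity from fused features: climb rate from the
    altimeter, ground speed from GPS, vibration from the accelerometer.
    All thresholds are illustrative, not calibrated."""
    if climb_rate_mps > 5 and speed_mps > 100:
        return "airplane"            # fast climb, fast ground track
    if climb_rate_mps > 2 and speed_mps < 5 and vibration < 0.1:
        return "elevator or balloon" # smooth vertical motion, no walking
    if 0 < climb_rate_mps <= 2 and speed_mps < 3 and vibration > 0.5:
        return "climbing on foot"    # slow ascent with footstep jolts
    return "unknown"
```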
Passive Data Entry is about converting explicit user interaction into implicit interaction--converting commands into confirmations. Because the computer can never be sure of a guess it can never--under most circumstances--commit to an irreversible or expensive action, but it can waste trillions of cycles anticipating the user for the price of a napkin.
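The commit rule in that paragraph fits in one function: act silently only when the action is cheap, reversible and confidently guessed; otherwise downgrade the command to a confirmation. A sketch with an invented interface, where `confirm` stands in for a UI prompt:

```python
def act(guess, confidence, reversible, cost, threshold=0.9,
        confirm=lambda g: False):
    """Commit to a guessed action only when it is free and reversible;
    otherwise ask the user or just keep the guess warm."""
    if reversible and cost == 0 and confidence >= threshold:
        return "done"      # e.g. re-sort the address book silently
    if confirm(guess):
        return "done"      # user tapped "yes" to the prepared action
    return "queued"        # speculate all you like, spend nothing
```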
What Siri can't do yet is bring you a cup of tea when you're audibly stressed, but that's not because she can't physically brew it; she could, if there were an API to the tea machine, hotel room service, office catering, or the cafe across the street. An API is a very powerful thing because it can attach peripherals that do anything, even signal other humans to do something the computer can't do itself. And APIs are already abundant: they can summon taxis, book flights and hotel rooms, change TV channels, and vacuum your carpet.
APIs also bring new kinds of sensing and sense-making, such as the cloud service that analyzes a few seconds of microphone input to identify a song, the photo recognizer that names the face you're looking at and adds it to your address book, and the search engine that identifies gas stations and restaurants nearby. But in doing so they also leak private data: that clip of a song in the background can also include a conversation, and the GPS coordinates sent to a location service can compromise an alibi.
I personally tolerate a navigation app on my phone that gives away my location to a server somewhere as I drive from one place to another, because it has to constantly download map data for specific regions, but I might not be willing to run an app that sends audio or video captures every few minutes unless I knew they would stay absolutely private. There is no safe way to solve this yet: homomorphic encryption is still too impractical, and servers must keep activity logs. It's also ironic that the very mechanism that enables PDE is its greatest threat to adoption. To see why, let's look at how PDE works.
Physics is based on the principles of consistency and conservation: every action has a reaction, and energy and matter are conserved, no matter what. Even nuclear reactions that convert matter to energy or vice versa are predictably consistent and proportional. Actions also have more than one reaction: walking across a room displaces air, perturbs the gravitational field, creates noise, casts a shadow, changes the temperature distribution and more, all in a consistent fashion. Multi-sensory devices can exploit that by cross-referencing what one kind of sensor says against another. The altimeter and the barometer are near-identical devices, and they'll agree with each other when you climb a hill. The GPS can be wrong when the satellite signal bounces off a large building, but the magnetometer can feel the steel-framed skyscraper and agree with where WiFi triangulation suggests you ought to be.
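The altimeter/barometer cross-check is concrete enough to write down: derive an altitude from pressure with the standard barometric formula and see whether it agrees with the GPS altitude. The tolerance below is an illustrative guess, not a calibrated figure.

```python
def baro_altitude_m(pressure_hpa, sea_level_hpa=1013.25):
    """Standard barometric formula: altitude (m) from station pressure."""
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))

def altitudes_agree(pressure_hpa, gps_alt_m, tolerance_m=50.0):
    """Cross-reference two independent sensors: if the barometer and the
    GPS disagree badly, one of them is probably being fooled."""
    return abs(baro_altitude_m(pressure_hpa) - gps_alt_m) < tolerance_m
```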
We do this in our own heads: we hear the noise of an engine rise in pitch, then see a car advancing toward us down the street, and in our brains we bind the two inputs into a single concept. It's partly intuition and partly learning, but it works a lot like the Expert Systems that came out of AI research in the 70s and 80s: a particular signature is detected by one sensor, which selects a branch of a decision tree containing other questions to look out for, eliminating branches recursively until only a few probabilities are left. For example:
At some point, maybe after a few more branches, the computer decides that there's an approaching vehicle, and the whole process might take only a second. Confidence in the prediction can be increased with more inputs, more branches that test other possibilities, and better rules for correlating low-level inferences (such as the rise-in-pitch with decrease-in-distance).
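The branch-pruning described above can be caricatured as a tiny hand-written tree over a sensor snapshot. Every key, threshold and confidence value here is invented for illustration; a real expert system would have hundreds of branches and learned weights.

```python
def infer(snapshot):
    """Walk a tiny expert tree: each answer prunes the alternatives,
    ending in a (conclusion, confidence) leaf."""
    if snapshot["pitch_trend"] == "rising":            # microphone
        if snapshot["object_growing_in_view"]:         # camera
            if snapshot["engine_band_energy"] > 0.5:   # spectral check
                return ("approaching vehicle", 0.9)
            return ("approaching cyclist?", 0.5)
        return ("vehicle nearby, not approaching", 0.4)
    return ("no vehicle", 0.7)
```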
This basic mechanism can also be used to figure out if you're cheating on your wife:
At some point the computer decides there's a good chance that you're having an affair, not because it saw you directly, but because your detectable behavior is consistent with patterns discovered in the activity of thousands of others.
Furthermore, this doesn't have to run on your phone: it could run on a server with access to your phone and credit card records, and it could do its job days or months after the event. Any advance in the kind of technology that makes things like Google Now work can also be used by marketing firms, the IRS, or a prosecuting attorney; the results are simply better the closer the sensory data gets to you.