Passive Data Entry and the fine line between convenience, spookiness and privacy

Note: This is an unfinished "slush-pile" article: not only is it unfinished, but it may change drastically or be deleted.

 I'd like my cell phone to predict who I'm about to call before I call them, and put rapid-dial buttons at the top of the home screen for them. I'd like it to figure out when I need to buy milk, eggs and bread, add them to my shopping list and buzz when I drive near a supermarket. I'd like it to figure out when I've completed a ToDo item, and create new ones for me. I'd like it to prepare as much of my tax return as it can, so all that's left for me to do is make corrections and file. I'd like it to know when I'm hurt and call an ambulance, or drunk and call a taxi. I want it to look for bargains and steer me to them at the right time and place. And I'd like it to do all this without me having to take it from my pocket.

 This is what I call Passive Data Entry (PDE), the product of carrying around multi-sensory, always-on computers loaded with learning and pattern-recognizing algorithms. While it would take User Experience (UX) to a new level, it would also make privacy harder to maintain; there will be a price to pay for something that could shave decades of routine labor off our lives.

 The mechanism behind PDE is what you could call "confirmed educated guesswork", which might be driven by things like Bayesian classifiers, genetic algorithms, crowd-sourced data, neural networks, Natural Language Processing, expert systems and other products of AI and UI research. Apple's Siri might be a halfway step: she uses machine-learning algorithms and contextual clues to make a good guess at what we mean when we talk to her, but she still needs us to activate her explicitly and confirm any irreversible or expensive action. You could think of Passive Data Entry as a tweak to Siri's implementation so that she's always listening, always making guesses, always preparing what she thinks we might need, even if we never take her out of our pocket.

 Eric Schmidt, CEO of Google, thinks this is possible:

 "If I look at enough of your messaging and your location, and use Artificial Intelligence, we can predict where you are going to go."

Consider these:
  • Maybe I often call my mom when the GPS knows I'm in the drugstore around the corner from her house
  • I tend to call my boss when the accelerometer, GPS and real-time traffic data suggest I've been stuck in a jam and it's a few minutes before I'm due at work
  • I'm probably going to call whoever I've got an appointment with a few minutes before or after that calendar entry is due, or when I'm near their property
  • It's a good bet I'll want to set up a conference call on speakerphone when the gyroscope knows the phone has been placed on a flat surface, like a table
  • If the GPS knows I'm at a known hiking trail and the accelerometer detects I've just descended several feet at 9.8 m/s², then I'll probably want to call an emergency number. Or if I go from 60+ MPH to zero in less than a second. Or if the internal thermometer detects a sudden temperature spike inconsistent with today's weather but consistent with a building on fire. Or if the microphone picks up the sound of my scream
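The physical cues in the last bullet reduce to simple arithmetic. One detail matters here: an accelerometer in free fall reads close to zero, not 9.8 m/s², because it measures the force holding it up. A fall therefore shows up as a run of near-zero readings. The sketch below assumes a trace of acceleration magnitudes sampled at a fixed rate. The 0.3 g threshold and the sample trace are hypothetical, not tuned values.

```python
G = 9.80665  # standard gravity, m/s^2

def free_fall_height(magnitudes, rate_hz, threshold=0.3 * G):
    """Estimate the drop height of the longest free-fall episode in a trace.

    magnitudes: accelerometer readings |a| in m/s^2, one per sample.
    A phone at rest reads about 1 g; in free fall it reads close to 0,
    so the longest run below the (hypothetical) threshold is treated as
    the fall and converted to a height with h = g * t^2 / 2.
    """
    longest = run = 0
    for a in magnitudes:
        run = run + 1 if a < threshold else 0
        longest = max(longest, run)
    t = longest / rate_hz
    return 0.5 * G * t * t

def deceleration_in_g(speed_mph, stop_seconds):
    """Average deceleration, in multiples of g, for stopping from speed_mph."""
    speed_ms = speed_mph * 0.44704  # 1 mph = 0.44704 m/s exactly
    return speed_ms / stop_seconds / G

# Hypothetical 100 Hz trace: at rest, 0.6 s of free fall, an impact spike, rest.
trace = [G] * 50 + [0.4] * 60 + [4 * G] * 5 + [G] * 50
print(round(free_fall_height(trace, 100), 2))  # 1.77 (metres, about 5.8 feet)
print(round(deceleration_in_g(60, 1.0), 2))    # 2.74
```

Stopping from 60 MPH in one second averages about 2.7 g, which is far harder than ordinary braking. That makes it a plausible crash cue.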
 A computer in our pocket doesn't have to take irreversible action, but it can sort our address book to put copies of speculatively relevant numbers at the top of the list. The only penalty for false positives is that I must flick my thumb a bit to get at the options below: probably not a severe impact to UX even in the worst case, and a stupendous win in the best case.
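Here is a minimal sketch of that re-sorting. It assumes each contact carries hand-set or learned weights for context tags such as "traffic_jam". The tags, the weights and the `rank_contacts` helper are all hypothetical. The function returns suggestions alongside the untouched full list, so a wrong guess costs only a scroll.

```python
from dataclasses import dataclass, field

@dataclass
class Contact:
    name: str
    number: str
    # Hypothetical weights: context tag -> how strongly that context
    # predicts a call to this person.
    cues: dict = field(default_factory=dict)

def rank_contacts(contacts, context_tags, top_n=3):
    """Return (suggestions, full_list) for the current context.

    Suggestions are the best-scoring contacts for the active context tags;
    the full alphabetical list is returned unchanged, so the regular UI is
    always one thumb-flick away.
    """
    def score(contact):
        return sum(contact.cues.get(tag, 0.0) for tag in context_tags)

    suggestions = sorted((c for c in contacts if score(c) > 0),
                         key=score, reverse=True)[:top_n]
    full_list = sorted(contacts, key=lambda c: c.name)
    return suggestions, full_list

book = [
    Contact("Mom", "555-0101", {"near_moms_drugstore": 0.8}),
    Contact("Boss", "555-0102", {"traffic_jam": 0.6, "late_for_work": 0.7}),
    Contact("Janet", "555-0103", {"weekend": 0.3}),
]
top, everyone = rank_contacts(book, {"traffic_jam", "late_for_work"})
print([c.name for c in top])       # ['Boss']
print([c.name for c in everyone])  # ['Boss', 'Janet', 'Mom']
```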

 But sorting address-book records only whets the appetite; try these on for size:
  • My ToDo list suppresses a deposit reminder when the GPS says I've been to the bank, and I confirm it later from the notification panel
  • It also creates a ToDo item that reminds me to fill out an expense report (and pre-populates half of it) when I've been at a restaurant for more than 30 minutes during business hours
  • It uses my playlist history to remind me of upcoming concerts and links to buy tickets
  • The icon to launch Stop & Shop's Scan-It mobile app appears on the lock-screen when I enter one of their supermarkets
  • So does Amazon's barcode-scanning app whenever I walk into a Target or Wal-Mart
  • My fitness tracker is woken and fed pedometer data when the accelerometer detects the motion of a good walk or run, or the cycling app is woken when I walk to the garage, wobble a bit, and then start traveling at bicycle speeds
  • The microphone overhears a song on the store's PA, and there's a link to buy it when you open the music-store app
  • It hears me take a call from my mom and exclaim "Oh my god! Is he okay?", and has the earliest and fastest flight itinerary ready to confirm and book after I hang up
  • The magnetometer picks up the oscillating thrum of a refrigerator, the clock says it's breakfast time, and when the microphone hears you say "aw, damn!" the phone adds milk to your shopping list
  • Any voice uttering what sounds like a phone number goes into your address book along with time and place
  • Your W-2 was delivered electronically that afternoon, your email client or PDF viewer knows you've been reading it, and WiFi triangulation thinks you've been sitting at your dining-room table for an hour past dinner. Suddenly you stand up and walk to your flat-panel TV; the phone turns it on and begins beaming video tutorials from TurboTax and the IRS
  • The gyroscope realizes the phone has been put on a table and the microphone starts to hear voices. One voice says "My name is Richard", another says "My name is Susan", another says "My name is Chris", and from then on it keeps a transcript of every word said, annotated with the name of the speaker
 Yet there's more, of a private nature:
  • It deletes my browser history when I enter a police station
  • It sends my wife's calls to voicemail and suppresses my GPS location when I go to a motel
  • If that phone number it overheard was in a bar, it uses a codename for the address-book entry that I set up earlier
  • It hides job interviews in my calendar whenever I'm at my current employer's office
 We already have an early form of Passive Data Entry when we take photos with a GPS-equipped camera that tags each picture with its location, but such cameras can't usually be programmed with rules to protect our privacy.
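Rules like these could be as simple as a table that maps the kind of place you're in to a transformation applied before anything is saved. The sketch below assumes some place lookup has already classified the current location, for example as "motel" or "bar". The `Rule` type, the categories and the actions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict  # a photo, call-log entry, contact, etc.

@dataclass
class Rule:
    place_category: str                  # e.g. "motel", from a hypothetical place lookup
    action: Callable[[Record], Record]   # rewrites a record before it is stored

def strip_location(record: Record) -> Record:
    """Drop GPS coordinates, as a privacy-minded camera might."""
    return {k: v for k, v in record.items() if k not in ("lat", "lon")}

def use_codename(alias: str) -> Callable[[Record], Record]:
    """Return an action that files a contact under a codename chosen earlier."""
    return lambda record: {**record, "name": alias}

def apply_rules(record: Record, place_category: str, rules: list) -> Record:
    """Run every rule whose place category matches, in order."""
    for rule in rules:
        if rule.place_category == place_category:
            record = rule.action(record)
    return record

rules = [Rule("motel", strip_location), Rule("bar", use_codename("Plumber"))]
photo = {"file": "IMG_0042.JPG", "lat": 40.71, "lon": -74.01}
print(apply_rules(photo, "motel", rules))  # {'file': 'IMG_0042.JPG'}
print(apply_rules(photo, "park", rules))   # unchanged: no rule for parks
```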

 Another problem is manufacturers' complicity: the Disneyesque Apple won't give its iPhone any "adultery" features, but that might be for the sake of image rather than moral duty, and those features aren't really about enabling immoral behavior anyway; they're just more poignant when described that way. Consider these:
  • Any calendar, reminder or ToDo entry related to gift stores, banquet halls or mobile DJs is obfuscated in the days before a birthday or anniversary
  • Phone and browser history related to funeral homes is hidden in the vicinity of a hospital, retirement community, or an elderly relative's home
  • All memory is scrubbed entirely when the microphone hears someone uttering the Miranda warning
  • The phone begins recording and streaming audio and video to a cloud service the instant it hears its owner say "I don't want any trouble"
  • When the accelerometer detects a sudden fall, the microphone picks up a crunch or a groan, the gyroscope records no movement for 30 seconds, and the owner is 50 years or older, it dials for an ambulance and speaks the user's current location
 Privacy and safety could actually improve with PDE, given the right programming and configurability. Just how flexible it'll be will depend on the makers and the market.

An avalanche of input

 Siri doesn't eliminate an input device; she just swaps one for another. If we all wore keyboards beneath our fingers at all times and could type as fast as we speak, then Siri wouldn't need a microphone. What she does is eliminate large chunks of user interaction that slow us down. Speech recognition has been around for decades, and a copy of Dragon Dictate can fill a form quite nicely as long as we give mouse, keyboard or vocal input to guide it from one field to another, all while looking at the screen and thinking about how to tell the computer what we mean. Siri and her like use as much context as they can to guess the best way to break up what I mean when I say "Remind me to call my sister when I get home". At some point in the past I added a phone number for "Janet" to my address book, sometime later I held down the Home button and said "Janet is my sister", and some other time I stood in the middle of my living room and said "This is my home"; without that, Siri couldn't dial shit.
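Below is a toy version of that resolution step. It assumes the earlier statements ("Janet is my sister", "This is my home") have been stored as simple lookup tables. The regular expression, the tables and the coordinates are all hypothetical, and Siri's real language handling is far more flexible. The point of the sketch is the fallback: when a piece of context is missing, it returns nothing and the ordinary UI takes over.

```python
import re

# Context taught earlier; all values here are hypothetical.
relations = {"sister": "Janet"}            # from "Janet is my sister"
places = {"home": (40.7128, -74.0060)}     # from "This is my home"
address_book = {"Janet": "555-0103"}

PATTERN = re.compile(r"remind me to call my (\w+) when I get (\w+)", re.IGNORECASE)

def parse_reminder(utterance):
    """Turn one fixed phrasing into a location-triggered call reminder."""
    m = PATTERN.fullmatch(utterance.strip().rstrip("."))
    if m is None:
        return None  # unrecognized phrasing: fall back to explicit UI
    person = relations.get(m.group(1).lower())
    where = places.get(m.group(2).lower())
    number = address_book.get(person) if person else None
    if where is None or number is None:
        return None  # missing context: ask instead of guessing
    return {"action": "call", "who": person, "number": number, "when_near": where}

print(parse_reminder("Remind me to call my sister when I get home"))
```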

 An iPhone 4S is an orgy of sensors: it has a microphone, accelerometer, magnetometer (compass), gyroscope, four different radios (WiFi, GPS, Bluetooth, and multi-band cell), proximity detector, internal thermometer (mine knows if I left it under a car window on a hot day), two cameras (one at 8 megapixels), moisture sensor (try claiming warranty after you drop it in the bathtub), and a capacitive touch sensor. In the future Apple and other smartphone makers may add more: altimeters, barometers, an external thermometer, a speaker and microphone tuned to enable sonic range finding, laser and detector, IrDA, a fifth or sixth radio for Near Field Communication or FM/SW/Marine/CB/Ham, touch-pressure sensor, Geiger counter, galvanometer, sphygmomanometer, anemometer. Google is currently demonstrating head-mounted sensor arrays ("Project Glass") that provide video and audio capture from the wearer's perspective, so a phone could tell what the user is looking at.

 There are also derivative sensors based on pattern-recognizing algorithms that fuse inputs over time, like the Graffiti engine in the original Palm Pilot or gesture recognizers in today's phones and tablets. Combine an accelerometer, altimeter and GPS and it will know whether you're a passenger on an airplane or hot air balloon, have climbed a mountain, or have ridden the elevator to the top of a skyscraper. Pair the gyroscope with the microphone and it knows when you've gone to sleep. Grab some frames from the camera and it knows whether it's in your pocket or on your night table. Listen to your voice and it knows if you're under stress. The operating system might even be taught the magnetic signature of an MRI machine and make the phone shriek for its life before it's destroyed in a multi-tesla bear hug.
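A fused "sensor" of that kind can start as a handful of rules over derived quantities. The sketch below assumes the altimeter and GPS readings have already been turned into a vertical rate and a ground speed. Every threshold is an illustrative guess, not a measured value, and a real system would learn them. The accelerometer could add further cues, such as an elevator's jolt at start and stop, but those are omitted here.

```python
def classify_climb(vertical_mps, ground_speed_mps):
    """Guess how the user is gaining altitude from two fused readings.

    vertical_mps: rate of altitude change, e.g. from the altimeter over time
    ground_speed_mps: horizontal speed, e.g. from successive GPS fixes
    All thresholds are hypothetical.
    """
    if vertical_mps <= 0.1:
        return "not climbing"
    if ground_speed_mps > 60:          # faster than ordinary road traffic
        return "airplane"
    if vertical_mps > 1.5 and ground_speed_mps < 0.5:
        return "elevator"              # fast vertical, no horizontal motion
    if vertical_mps < 0.5 and ground_speed_mps < 2.5:
        return "hiking"                # slow climb at walking pace
    if ground_speed_mps < 15:
        return "hot air balloon"       # climbing while drifting with the wind
    return "unknown"

for reading in [(5.0, 0.1), (0.2, 1.2), (2.0, 4.0), (10.0, 120.0)]:
    print(reading, classify_climb(*reading))
```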

 Passive Data Entry is about converting explicit user interaction into implicit interaction: converting commands into confirmations. Because the computer can never be sure of a guess, under most circumstances it shouldn't commit to an irreversible or expensive action, but it can waste trillions of cycles anticipating the user for the price of a napkin.
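One way to encode that rule assumes each guess arrives with a confidence and a cost class. The `Cost` categories, the 0.2 threshold and the action labels are hypothetical. Cheap, reversible guesses are acted on speculatively, and everything else is staged as a confirmation.

```python
from dataclasses import dataclass
from enum import Enum

class Cost(Enum):
    FREE = "free"                  # re-sort a list, pre-fill a form
    EXPENSIVE = "expensive"        # book a flight, buy tickets
    IRREVERSIBLE = "irreversible"  # send a message, delete data

@dataclass
class Guess:
    description: str
    confidence: float  # 0.0 to 1.0
    cost: Cost

def dispatch(guess: Guess, threshold: float = 0.2) -> str:
    """Convert a guess into a speculative action or a pending confirmation."""
    if guess.confidence < threshold:
        return "discard"
    if guess.cost is Cost.FREE:
        return "prepare now"       # a wrong guess costs only a thumb-flick
    return "ask to confirm"        # never commit without the user

print(dispatch(Guess("put Boss at top of contacts", 0.6, Cost.FREE)))     # prepare now
print(dispatch(Guess("book earliest flight home", 0.95, Cost.EXPENSIVE)))  # ask to confirm
```

Even a 95%-confident guess to book a flight only produces a confirmation. Confidence decides whether to bother the user at all; cost decides whether the phone may act alone.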

An expansion of power and a leak in privacy

 What Siri can't do yet is bring you a cup of tea when you're audibly stressed, but that's not because she can't physically brew it: she could, if there were an API to the tea machine, hotel room service, office catering, or the cafe across the street. An API is a very powerful thing because it can attach peripherals that do anything, even signal other humans to do something the computer can't do itself. And APIs are already abundant: they can summon taxis, book flights and hotel rooms, change TV channels, and vacuum your carpet.

 APIs also bring new kinds of sensing and sense-comprehension, such as the cloud service that analyzes a few seconds of microphone input to identify a song, the photo recognizer that names the face you're looking at and adds the person to your address book, and the search engine that identifies nearby gas stations and restaurants. But in doing so they also leak private data: that clip of a song in the background can also include a conversation, and the GPS coordinates sent to a location service can compromise an alibi.

 I personally tolerate a navigation app on my phone that gives away my location to a server somewhere as I drive from one place to another, because it has to constantly download map data for specific regions, but I might not be willing to run an app that sends audio or video captures every few minutes unless I knew they would stay absolutely private. There is no safe way to solve this yet: homomorphic encryption is still too impractical, and servers must keep activity logs. It's also ironic that the very mechanism that enables PDE is also its greatest threat to adoption. To see why, let's look at how PDE works.

The basic mechanism of multi-input PDE and the myth of privacy

 Physics is based on the principles of consistency and conservation: every action has a reaction, and all energy and matter are conserved, no matter what. Even nuclear reactions that convert matter to energy or vice versa are still predictably consistent and proportional. Actions also have more than one reaction: walking across a room displaces air, affects the gravitational field, creates noise, casts a shadow, changes the temperature distribution and more, all in a consistent fashion. Multi-sensory devices can exploit that by cross-referencing what one kind of sensor says with another. The altimeter and barometer are similar devices, and they'll agree with each other when you climb a hill. The GPS can be wrong if the satellite signal is being bounced off a large building, but the magnetometer can feel the steel-framed skyscraper and agree with where WiFi triangulation suggests you ought to be.
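Cross-referencing can be made concrete with inverse-variance weighting, a standard way to combine independent estimates of the same quantity. It is swapped in here as one simple option, not as what any particular phone does. The sketch converts barometric pressure to altitude using the common standard-atmosphere approximation, fuses that with a GPS altitude, and flags the case where one sensor disagrees with the consensus. The uncertainty figures and the 3-sigma test are assumptions.

```python
def barometric_altitude(pressure_hpa, sea_level_hpa=1013.25):
    """Altitude in metres from air pressure, standard-atmosphere approximation:
    h = 44330 * (1 - (p / p0) ** (1 / 5.255))."""
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1 / 5.255))

def fuse(estimates):
    """Combine (value, std_dev) estimates by inverse-variance weighting.

    Returns the fused value and whether every estimate lies within
    3 standard deviations of it (a crude, assumed consistency check).
    """
    weights = [1.0 / (sd * sd) for _, sd in estimates]
    fused = sum(w * v for w, (v, _) in zip(weights, estimates)) / sum(weights)
    consistent = all(abs(v - fused) <= 3 * sd for v, sd in estimates)
    return fused, consistent

baro = (barometric_altitude(1001.3), 2.0)  # hypothetical: about 100 m, +/- 2 m
gps_ok = (104.0, 15.0)                     # hypothetical GPS altitude, +/- 15 m
gps_bounced = (400.0, 15.0)                # hypothetical multipath error
print(fuse([baro, gps_ok]))       # close to 100 m, consistent
print(fuse([baro, gps_bounced]))  # inconsistent: one sensor disagrees
```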

 We do this in our own heads: we hear the noise of an engine rise in pitch, then see a car advancing toward us down the street, and our brains bind the two inputs to represent a single concept. It's partly intuition and partly learning, but it works a lot like the Expert Systems that came out of AI research in the 70s and 80s: a particular signature is detected by one sensor, which selects a branch of a decision tree containing other questions to look out for, eliminating branches recursively until only a few possibilities are left. For example:
  1. The microphone is taking 22 thousand samples per second
  2. An algorithm analyzing a series of those samples identifies a sawtooth waveform
  3. The "sawtooth" branch of possibilities is chosen for the next test, eliminating the consideration of other waveforms for now
  4. More samples indicate that the pitch is rising
  5. A "rising pitch" branch is chosen from the "sawtooth" trunk
  6. "Rising pitch" contains branches that ask questions from the radar or camera, so those sensors are turned on
  7. Operating on another thread, the computer goes through a tree of possibilities that positively indicate an approaching object
  8. That event selects the "Rising sawtooth noise, approaching object" branch
  9. The choice of branch is strengthened when the rate of rising pitch matches the rate at which the object approaches; alternatives are now dropped from consideration
  10. The camera has now identified the relative size of the object
  11. The "car sized" branch gets selected
 At some point, maybe after a few more branches, the computer decides that there's an approaching vehicle, and the whole process might take only a second. Confidence in the prediction can be increased with more inputs, more branches that test other possibilities, and better rules for correlating low-level inferences (such as the rise-in-pitch with decrease-in-distance).
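The walk through the tree can be sketched in a few lines. This is a deliberately simplified, hypothetical structure: each node asks one yes/no question of the current readings and names the sensors it would switch on, and the walk returns the first chain of branches that holds. Real expert systems would also carry confidence scores (as in step 9) and explore alternatives in parallel. Both are omitted here.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    label: str
    test: Callable[[dict], bool]                # question asked of the readings
    sensors: tuple = ()                         # sensors to switch on if chosen
    children: list = field(default_factory=list)

def walk(node, readings, active):
    """Return the chain of branch labels chosen from `node` down, or None.

    Sensors named by each chosen node are added to `active`. In a real
    system their readings would only exist after they were switched on;
    here they are simply present in the `readings` dict.
    """
    if not node.test(readings):
        return None
    active.update(node.sensors)
    for child in node.children:
        path = walk(child, readings, active)
        if path is not None:
            return [node.label] + path
    return [node.label]

tree = Node("sawtooth", lambda r: r["waveform"] == "sawtooth", children=[
    Node("rising pitch", lambda r: r["pitch_slope"] > 0, sensors=("camera",), children=[
        Node("approaching object", lambda r: r.get("range_slope", 0) < 0, children=[
            Node("car sized", lambda r: 1.0 < r.get("size_m", 0) < 6.0),
        ]),
    ]),
])

active = set()
readings = {"waveform": "sawtooth", "pitch_slope": 12.0, "range_slope": -8.0, "size_m": 4.5}
print(walk(tree, readings, active))  # ['sawtooth', 'rising pitch', 'approaching object', 'car sized']
print(active)                        # {'camera'}
```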

 This basic mechanism can also be used to figure out if you're cheating on your wife:
  1. A phone call to your wife is made from your office at 4:45pm
  2. The "After work plans" branch is selected
  3. The microphone picks up ambient sound and, with a sub-tree similar to the approaching-car example, decides that you're probably at a bar
  4. The "After work socialization" branch is selected
  5. A female voice has been present for more than 30 minutes, but it can't be matched to your wife
  6. The "Non-spouse socialization" branch is selected
  7. Your GPS location changes to a motel 3 miles away from the bar
 At some point the computer decides there's a good chance that you're having an affair, not because it saw you directly, but because your detectable behavior is consistent with patterns discovered in the activity of thousands of others.
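The "patterns discovered in the activity of thousands of others" can be sketched as a naive Bayes model, one common choice among the techniques listed earlier, swapped in here for illustration. Given a labelled population of past cases, the model estimates each feature's log-likelihood ratio from counts with Laplace smoothing. Confidence is then updated as each branch's evidence arrives. The feature names and the tiny population below are hypothetical. For simplicity, only evidence that has actually been observed is counted; a full model would also account for features known to be absent.

```python
import math

def learn(population):
    """Estimate a prior log-odds and per-feature log-likelihood ratios.

    population: list of (features, label) pairs, where features is a set of
    strings and label is True when the hypothesis held. Laplace smoothing
    keeps features never seen in one class from producing infinite ratios.
    """
    pos = [f for f, label in population if label]
    neg = [f for f, label in population if not label]
    vocab = set().union(*(f for f, _ in population))
    ratios = {}
    for feat in vocab:
        p_pos = (sum(feat in f for f in pos) + 1) / (len(pos) + 2)
        p_neg = (sum(feat in f for f in neg) + 1) / (len(neg) + 2)
        ratios[feat] = math.log(p_pos / p_neg)
    prior = math.log((len(pos) + 1) / (len(neg) + 1))
    return prior, ratios

def probability(prior, ratios, observed):
    """Posterior probability given the features observed so far."""
    log_odds = prior + sum(ratios.get(f, 0.0) for f in observed)
    return 1.0 / (1.0 + math.exp(-log_odds))

population = [  # hypothetical labelled history
    ({"call_home_445pm", "bar_audio", "unknown_voice_30min", "motel_gps"}, True),
    ({"call_home_445pm", "bar_audio", "unknown_voice_30min"}, True),
    ({"call_home_445pm", "bar_audio"}, False),
    ({"call_home_445pm"}, False),
    ({"bar_audio"}, False),
    (set(), False),
]
prior, ratios = learn(population)
seen = []
for step in ["call_home_445pm", "bar_audio", "unknown_voice_30min", "motel_gps"]:
    seen.append(step)
    print(step, round(probability(prior, ratios, seen), 2))
```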

 Furthermore, this doesn't have to run on your phone: it could run on a server with access to your phone and credit-card records, and it could do its job days or months after the event happened. Any advance in the kind of technology that makes things like Google Now work can also be used by marketing firms, the IRS, or the prosecution's lawyer; the results are simply better if they can access sensory data that's closer to you.

###### Older version of article below

Pruning the user-interaction tree

Apple's Siri and her kind improve on UI by eliminating chunks of user interaction; they don't just convert your voice into words--we've been doing that for decades--they try to extract meaning from the input without making us think about how to deliver meaning. Dragon Dictate and other software can fill out fields in a form but you have to tell it when to jump to the next field on the form--after thinking about which field and what to put in it--and when to "Okay" or "Submit" the form. Siri is programmed to figure out those distinctions herself so that I don't have to say "Phone... Janet... Mobile... Dial" while looking at the screen to make sure each grunt has activated the correct mode and selection. If we all had keyboards under our fingers at all times and could type as fast as we speak, then Siri wouldn't need a microphone.

Siri also uses history and clues, so if I say "Remind me to call my sister when I get home" the command can be broken down into executable actions if I'd previously said "Janet is my sister" and "This is my home" while standing in my living room. We're at a point where we can use ambient information to fix meaning by probability, so even while the computer is in a Chinese room and doesn't really comprehend what we're asking, it can still chop tens of branches off the tree of user interactions. A big tree can be pruned to a short list, sorted to the top of the screen, with the regular old UI interaction tree just a thumb-flick away in case the computer's guesses were wrong.