2013-11-20

Television Viewing & the Deanonymization of Large Sparse Datasets.


[preamble: this is not me writing against collecting data and analysing user behaviour, including TV viewing actions. I cherish the fact that Netflix recommends different things to different family members, and I'm happy for the iPlayer team to get some generic use data and recognise that nobody actually wants to watch Graham Norton purely from the way that all viewers stop watching before the introductory credits are over. What is important here is that I get things in exchange: suggestions, content. What appears to be going on here is that a device I bought is sending details on TV watching activity so as to better place adverts on a bit of the screen I paid for, possibly in future even interstitially during the startup of a service like Netflix or iPlayer. I don't appear to have got anything in exchange, and nobody asked me if I wanted the adverts, let alone the collection of the details of myself and my family, including an 11-year-old child.]

Graham Norton on iPlayer


Just after Christmas I wandered down to Richer Sounds and bought a new TV, first one in a decade, probably second TV we've owned since the late 1980s. My goal was a large monitor with support for free to air DTV and HD DTV, along with the HDMI and RGB ports to plug in useful things, including a (new) PS3 which would run iPlayer and Netflix. I ended up getting a deeply discounted LG Smart TV as the "smart" bits came with the monitor that I wanted.

I covered the experience back in March, where I stated that I felt that the smart bit was AOL-like in its collection of icons of things I didn't want and couldn't delete, its dumbed-down versions of Netflix and iPlayer, and its unwanted adverts in the corner. But that's it: the Netflix tablet/TV integration compensates for the weak TV interface, and avoids the problem of PS3 access time limits on school nights, as the PS3 can stay hidden until weekends.

Last week I finally acceded to the TV's "new update available" popups, after which came the "reboot your TV" message. Which I did, to then get told that I had to accept an updated privacy policy. I started to look at this, but after screen 4 of 20+ gave up, mentioning it briefly on that social networking stuff (who give me things like Elephant-Bird in exchange for their logging my volunteered access -access where I turn off location notification in all devices).

I did later regret not capturing that entire privacy policy by camera, and tried to see if I could find it online, but at the time the search term "LG SmartTV privacy policy" returned next to nothing apart from a really good policy for the LG UK web site, which even goes into the detail of identifying each cookie and its role. I couldn't see the policy after a quick perusal of the TV menus, so that was it.

Only a few days later, Libby Miller pointed me at an article by DoctorBeet, who'd spun Wireshark up to listen to what the TV was saying, showing that his LG TV does an HTTP form POST to a remote site on every channel change, as well as sending details of filenames on USB sticks.

This is a pretty serious change from what a normal television does. DoctorBeet went further and looked at why. Primarily it appears to be for advert placement, including in that corner of the "smart" portal, or at start time after you select "premium" content like iPlayer or Netflix. I haven't seen that, which is good -an extra 1.5MB download for an advert I'd have to stare through is not something I'd have been happy with.

Anyway, go look at his article, or even a captured request.

I'm thinking of setting up Wireshark to do the same for an evening. I made an attempt yesterday, but as the TV is CAT-5 to a 1 Gb/s hub, then an ether over power bridge to get into the base station, it's harder than I'd thought. My entire wired network is on switched ports so I can't packet sniff, and the 100 Mb/s hub I dredged up from the loft turned out to be switched too. That means I'd have to do something innovative like use the WEP-only 802.11b ether to wifi bridge I also found in that box, hooked up to an open wifi base station plugged into the real router. Maybe at the weekend. A couple of days' logs would actually be an interesting dataset even if it just logs PS3 activity hours as time-on-HDMI-port-1.

What I did do is go to the "opt out of adverts" settings page DoctorBeet had found, scrolled down and eventually followed some legal info link to get back to the privacy settings. Which I did photo this time, and which are now up on Flickr.

Some key points of this policy

Information considered to be non-personally-identifiable includes MAC addresses and "information about the live content you are watching"



LG Smart TV Privacy Policy


That's an interesting concept, which I will get back to. For now, note that that specific phrase is not indexed anywhere in BigTable, implying it is not published anywhere that Google can index it.
Phrase not found: "information about the live content you are watching"

Or "until you sit through every page with a camera this policy doesn't get out much"

If you have issues, don't use the television

LG Smart TV Privacy Policy

That's at least consistent with customer support.

Anyway, there are a lot more slides. One of them gives a contact who, when you tap his name into LinkedIn, turns out to be not only the head of legal at LGE UK, but also one hop away from me: datamining in action.

Now, returning to a key point: Is TV channel data Non-personal information?

Alternatively: If I had the TV viewing data of a large proportion of a country, how would I deanonymize it?

The answer there is straightforward: I'd use the work of [Arvind Narayanan and Vitaly Shmatikov, 2008], Robust De-anonymization of Large Sparse Datasets.

In that seminal paper, Narayanan and Shmatikov took the anonymized Netflix dataset of (viewers->(movies, ratings)+), and deanonymized it by comparing film reviews on Netflix with IMDb reviews, looking for reviews that appeared on IMDb shortly after a Netflix review, with ratings matching or close to that of the Netflix review. They then took the sequence of a viewer's watched movies and looked to see if a large set of their Netflix reviews met that match criteria. At the end of which they managed to deanonymize some Netflix viewers -correlating them with an IMDb reviewer many standard deviations out from any other candidate. They could then use this match to identify those movies which the viewer had seen and yet not reviewed on IMDb.
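
To make the shape of that matching concrete, here is a rough Python sketch. It is a simplification written for this post, not the paper's actual algorithm: the function names, the rating and date tolerances, the rarity weighting and the 1.5 threshold are all illustrative assumptions. The idea it tries to capture is that agreeing on an obscure film counts for more than agreeing on a blockbuster, and that a match is only accepted when the best candidate stands well clear of the runner-up.

```python
import math
from statistics import pstdev


def similarity(anon, public, popularity, rating_tol=1, day_tol=14):
    """Score one anonymized record against one public reviewer profile.

    anon, public: dict of movie -> (rating, day_number)
    popularity:   dict of movie -> number of people who rated it
    All tolerances are assumed values, chosen only for illustration.
    """
    score = 0.0
    for movie, (rating, day) in public.items():
        if movie not in anon:
            continue
        anon_rating, anon_day = anon[movie]
        close_rating = abs(anon_rating - rating) <= rating_tol
        close_date = abs(anon_day - day) <= day_tol
        if close_rating and close_date:
            # Rare titles carry more weight than popular ones.
            score += 1.0 / math.log(2 + popularity[movie])
    return score


def best_match(anon_records, public, popularity, min_eccentricity=1.5):
    """Return the id of the stand-out anonymized record, or None."""
    scores = {uid: similarity(rec, public, popularity)
              for uid, rec in anon_records.items()}
    if len(scores) < 2:
        return None
    ranked = sorted(scores, key=scores.get, reverse=True)
    spread = pstdev(list(scores.values()))
    if spread == 0:
        return None
    # How many standard deviations the winner is ahead of the runner-up.
    eccentricity = (scores[ranked[0]] - scores[ranked[1]]) / spread
    return ranked[0] if eccentricity >= min_eccentricity else None


# Made-up data: three anonymized viewers and one public reviewer.
anon_records = {
    "u1": {"Obscure Film": (5, 100), "Blockbuster": (4, 110), "Cult Classic": (2, 120)},
    "u2": {"Blockbuster": (4, 111)},
    "u3": {"Blockbuster": (3, 90), "Other": (5, 50)},
}
popularity = {"Obscure Film": 3, "Blockbuster": 50000, "Cult Classic": 40, "Other": 1000}
reviewer = {"Obscure Film": (5, 102), "Blockbuster": (4, 112), "Cult Classic": (1, 125)}
match = best_match(anon_records, reviewer, popularity)
```

A reviewer who has only rated the blockbuster matches two records equally well, so nothing stands out and the sketch declines to name anyone, which is the outlier effect in miniature.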

The authors had some advantages: both Netflix and IMDb had reviews, albeit on a different scale. The TV details don't, so the process would be more ad hoc:

  1. Discard all events that aren't movies
  2. Assume that anything where the user comes in late to some threshold isn't a significant "watch event" and discard.
  3. Assume that anything where the user watches all the way to the end is a significant "watch event" and may be reviewed later.
  4. Treat watch events where the viewer changes channel some distance into a movie -say 20 min- as a significant watch failure event, which may be reviewed negatively.
  5. Consider watch events where the user was on the same channel for some time before the movie began as less significant than when they tuned in early.
  6. If information is collected when a user explicitly records a movie, a "recording event", that is treated even more significantly.
  7. Go through the IMDb data looking for any reviews appearing a short time after a significant set of watch events, expecting higher ratings from significant watch events and recording events, and potentially low ratings from a significant watch failure.
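
Steps 1 to 6 above could be sketched in Python roughly as follows. Everything here is hypothetical: the `Viewing` fields are my guess at what could be derived from channel-change logs plus a programme guide, and the thresholds and weights are arbitrary placeholders rather than tuned values. The list doesn't say what to do with someone who bails out in the first few minutes, so the sketch assumes those are discarded.

```python
from dataclasses import dataclass


@dataclass
class Viewing:
    """One viewer's time on a channel during one broadcast (assumed schema)."""
    is_movie: bool
    movie_minutes: int       # running time of the broadcast
    minutes_late: int        # minutes after the start that the viewer tuned in
    left_at: int             # minutes after the start that the viewer left
    minutes_on_channel_before: int = 0
    recorded: bool = False


def classify(v, late_threshold=10, failure_after=20, lingering=30):
    """Return (label, weight) for a significant event, or None to discard it."""
    if not v.is_movie:                      # step 1
        return None
    if v.recorded:                          # step 6: strongest signal
        return ("recording", 2.0)
    if v.minutes_late > late_threshold:     # step 2
        return None
    # Step 5: already being on the channel suggests the viewer just stayed put.
    weight = 0.5 if v.minutes_on_channel_before >= lingering else 1.0
    if v.left_at >= v.movie_minutes:        # step 3
        return ("watch", weight)
    if v.left_at >= failure_after:          # step 4
        return ("watch_failure", weight)
    return None                             # left very early: assumed noise


events = [
    Viewing(is_movie=True, movie_minutes=100, minutes_late=0, left_at=100),
    Viewing(is_movie=True, movie_minutes=100, minutes_late=0, left_at=35),
    Viewing(is_movie=True, movie_minutes=100, minutes_late=40, left_at=100),
    Viewing(is_movie=False, movie_minutes=30, minutes_late=0, left_at=30),
]
labels = [classify(e) for e in events]
```

The labelled events are what step 7 would then line up against IMDb review dates, expecting good ratings after "watch" and "recording" events and poor ones after a "watch_failure".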

I don't know how many matches you'd get here -as the paper shows, it's the real outliers you find, especially the watchers of obscure content.

Even so, the fact that it would be possible to identify at least one viewer this way shows that TV watching data is personal information. And I'm confident that it can be done, based on the maths and the specific example in the Robust De-anonymization of Large Sparse Datasets paper.

Conclusion: irrespective of the cookie debate, TV watching data may be personal -so the entire dataset of individual users must be treated this way, with all the restrictions on EU use of personal data, and the rights of those of us with a television.
