Biometric identification of users
This is a short description of what is possible with, and some problems with, a few biometric identification systems that might be used at Wikimedia sites.
The following reflects the editor's personal opinion, and others may disagree.
There are two different problems. The first is often described as the turnstile problem, where there exists a single preexisting positive identification and then a new attempted identification of possibly the same person. The second is the mass identification problem, where there exist several preexisting positive identifications and then a new attempted identification of an unknown person.
In the first case the sample space has one item, while in the second case it has several items and may be virtually unbounded.
The first case (the turnstile problem) is what may arise if you want to verify a person's identity, for example after a claim that he or she owns a specific account. In these cases an error rate of one percent or a few percent might be acceptable, in particular if positive identification is accompanied by other factors.
Often this kind of use tries to falsify an identification. If any sameness test fails for the person's identification, then the identification is rejected.
The second case (the mass identification problem) is what may arise if you want to identify an unknown person, given one or a few biometric features (traits), from a possibly very large set of persons. In this case the precision needed to reach an acceptable error rate is very much higher than in the first case. The system must reject all the other persons and still keep a sufficiently low error rate.
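How fast the error grows with the number of candidates can be made concrete with a small calculation. The sketch below assumes that every comparison is an independent trial with the same false match rate, which is a simplification. The function names and the numbers are illustrative and not taken from any real system.

```python
def false_match_probability(per_comparison_rate: float, candidates: int) -> float:
    """Probability that at least one of `candidates` wrong persons matches.

    Assumes independent comparisons that all share the same false match rate.
    """
    return 1.0 - (1.0 - per_comparison_rate) ** candidates


def required_rate(target: float, candidates: int) -> float:
    """Per-comparison false match rate needed to keep the overall rate at `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / candidates)


# Turnstile case: one stored identity to compare against.
print(false_match_probability(0.01, 1))
# Mass identification: the same one percent test against 1000 stored identities.
print(false_match_probability(0.01, 1000))
# Per-comparison rate needed to get back to one percent overall.
print(required_rate(0.01, 1000))
```

Under these assumptions a test that is acceptable at the turnstile is almost certain to produce at least one false match when run against a thousand candidates.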
Often this kind of use tries to confirm an identification. If and only if the sameness test holds for the person's identification, and the sameness tests fail for all the other persons, then the identification is accepted.
Sometimes the second case is simplified by saying that only some subset of persons is available, that is, a shift from an open to a closed world. That shift may not be supported by the actual system. In particular, on a wiki that allows free creation of accounts and even anonymous editing, it is unlikely that any real argument can be made for a closed world.
(If you as a CheckUser (CU) want to reject a claim that two users are the same, then you might use the first case, but if you want to support your claim that two users are the same, then you should use the second case.)
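The two decision rules can be written out side by side. This is a minimal sketch with hypothetical names (`verify`, `identify`, `close_enough`) and a made-up tolerance; real systems use scored comparisons rather than a yes/no test.

```python
from typing import Callable, Dict, Optional, Sequence

Sample = Sequence[float]
SameTest = Callable[[Sample, Sample], bool]


def verify(claimed: Sample, probe: Sample, tests: Sequence[SameTest]) -> bool:
    """Turnstile case: the claim is rejected as soon as any sameness test fails."""
    return all(test(claimed, probe) for test in tests)


def identify(enrolled: Dict[str, Sample], probe: Sample, test: SameTest) -> Optional[str]:
    """Mass identification: accept only if exactly one enrolled person matches."""
    matches = [name for name, sample in enrolled.items() if test(sample, probe)]
    return matches[0] if len(matches) == 1 else None


def close_enough(a: Sample, b: Sample, tolerance: float = 0.1) -> bool:
    """Hypothetical sameness test: every feature differs by at most `tolerance`."""
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


enrolled = {"A": [0.2, 0.5], "B": [0.8, 0.1], "C": [0.25, 0.45]}
print(identify(enrolled, [0.8, 0.12], close_enough))   # only B is close
print(identify(enrolled, [0.22, 0.48], close_enough))  # A and C both match
```

Note that `identify` returns nothing both when nobody matches and when more than one person matches. The second situation becomes more common as the set of enrolled persons grows.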
There are rather few biometrics that can be used in an online content editing system. They are limited by what is observable and what is controllable. It is possible to observe how text is typed, and what is typed. Navigation actions, that is mouse movements, can also be observed to some degree.
The following three types are known to the editor of this page.
How users write
One type of biometrics is how you type. This usually consists of timing between different keys or “characters”. This might be visualized as a kind of key-map where some keys are closer together than other keys. The more data (text) available, the better the map. Better systems might add uncertainty in the location of some keys on the map, and might also be able to handle variations in typing speed.
The actual “map” might not look like a map at all, but might instead be some vector representation. More modern approaches try to estimate speedy paths, which say something about how a user uses multiple fingers to do touch typing.
Biometrics on how users type is mostly done with statistics as features and ordinary classification. It is possible to reformulate the input data and use recurrent neural networks as a first leg in a Siamese neural network.
Known error sources are users alternating between ways of typing (one, two, four, or ten fingers), and software that adds jitter to the typing to obfuscate identity. To improve the measurement and get good statistics it is common to make the user type a known text of some length. This is, however, only useful if a user tries to prove ownership of an account. That is, it is hard to type the same way as another user, but easy to type in a dissimilar way.
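A very small version of such a key-map is the mean delay for each ordered pair of keys. The sketch below assumes key presses are available as (key, time) pairs; the names `KeyEvent` and `digraph_profile` are made up for the illustration, and real systems would also use key release times and handle typing speed.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

KeyEvent = Tuple[str, float]  # (key, press time in seconds)


def digraph_profile(events: List[KeyEvent]) -> Dict[Tuple[str, str], float]:
    """Mean delay between consecutive key presses, per ordered key pair."""
    delays: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for (key_a, time_a), (key_b, time_b) in zip(events, events[1:]):
        delays[(key_a, key_b)].append(time_b - time_a)
    return {pair: mean(values) for pair, values in delays.items()}


events = [("t", 0.0), ("h", 0.5), ("e", 0.75), ("t", 1.0), ("h", 1.25)]
profile = digraph_profile(events)
print(profile[("t", "h")])  # average of the two observed t-to-h delays
```

Key pairs with a short mean delay are the ones that sit "close together" on the map for this particular user.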
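As an illustration of the statistics-plus-classification approach, the sketch below compares timing profiles by the mean absolute difference over the key pairs they share, and picks the closest known profile. The profile format (mean delay per key pair), the names, and the distance measure are assumptions made for the example, not a description of any deployed system.

```python
from typing import Dict, Optional, Tuple

Profile = Dict[Tuple[str, str], float]  # mean delay in seconds per ordered key pair


def profile_distance(a: Profile, b: Profile) -> Optional[float]:
    """Mean absolute difference in delay over the key pairs both profiles share."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # nothing to compare
    return sum(abs(a[pair] - b[pair]) for pair in shared) / len(shared)


def nearest_profile(known: Dict[str, Profile], probe: Profile) -> Optional[str]:
    """Name of the known profile closest to the probe, or None if none is comparable."""
    best_name, best_distance = None, None
    for name, profile in known.items():
        distance = profile_distance(profile, probe)
        if distance is None:
            continue
        if best_distance is None or distance < best_distance:
            best_name, best_distance = name, distance
    return best_name


known = {
    "alice": {("t", "h"): 0.125, ("h", "e"): 0.25},
    "bob": {("t", "h"): 0.5, ("h", "e"): 0.5},
}
print(nearest_profile(known, {("t", "h"): 0.25, ("h", "e"): 0.25}))
```

A nearest-profile classifier like this always names someone as long as any profile is comparable. That is a closed-world answer, so a distance threshold or a separate sameness test is needed before the result says anything about identity.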
What users write
A second type of biometrics is what you type. This usually consists of words, and parts of words. In some cases it might also consist of phrase structures, with some or all phrase structures replaced by word classes. Instead of word classes the words might be replaced by vector representations.
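The simplest form of such a representation counts parts of words and compares the counts. The sketch below uses character trigrams and cosine similarity as a stand-in; both choices, and the function names, are assumptions made for the illustration.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lower-cased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two count vectors; 0.0 if either is empty."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = char_ngrams("I reckon the article needs more sources.")
second = char_ngrams("I reckon this section needs sources too.")
print(cosine_similarity(first, second))
```

A score close to 1 means the two texts use much the same character sequences. Short texts give unstable scores, so a comparison like this needs a fair amount of text from each user.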
A map of words might be constructed by running a word2vec-algorithm, and such a map might replace ordinary word classes.
Biometrics on what users type was earlier implemented as hidden Markov models, but is now more commonly done with recurrent neural networks. These can then be used as a first leg in a Siamese neural network.
Identification of the writers of the United States Declaration of Independence is a common exercise in NLP classes in the USA. In other countries similar exercises are given, but with other texts.
Known error sources are regiolects, sociolects, and domain specific terms. Especially domain specific terms can be a hard problem, as they tend to be shared among persons from a specific technical or scientific background. Users without the same background tend to make up words when describing a phenomenon. A system would then measure a high degree of similarity between users with the same technical or scientific background, perhaps even claim they are the same, while users without this background (i.e. ordinary editors at Wikipedia) will be measured as different. It is possible to counteract this by using domain specific stop-words. In some countries there are organized efforts to claim ownership of domain specific terms, and that may turn counteracting this into a really difficult task.
Another known error source is spelling and grammar correcting programs. Such programs tend to flatten the biometric fingerprints and move them closer to each other. Such programs might have detectors for pet phrases, and if these are removed it will be difficult to identify single users.
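The effect of domain specific stop-words can be shown with a toy comparison. The two sentences, the stop-word list, and the function names below are invented for the illustration, and vocabulary overlap is only a crude stand-in for a real similarity measure.

```python
from collections import Counter
from typing import Iterable, Set


def word_profile(text: str, stop_words: Iterable[str] = ()) -> Counter:
    """Word counts with the given stop-words left out."""
    blocked: Set[str] = {w.lower() for w in stop_words}
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return Counter(w for w in words if w and w not in blocked)


def shared_vocabulary(a: Counter, b: Counter) -> float:
    """Jaccard overlap of the two vocabularies, from 0.0 to 1.0."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


text_a = "The enantiomer shows chirality, I reckon."
text_b = "Chirality of the enantiomer is kinda obvious."
domain_terms = ["enantiomer", "chirality"]  # hypothetical domain stop-words

plain = shared_vocabulary(word_profile(text_a), word_profile(text_b))
filtered = shared_vocabulary(
    word_profile(text_a, domain_terms), word_profile(text_b, domain_terms)
)
print(plain, filtered)
```

With the shared technical terms removed, what remains is closer to each writer's personal vocabulary, and the two writers no longer look alike merely because they share a field.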
A third type of biometrics is what you learn, which might be reflected in how you react. Typically, some tones are played or colors are flashed, a short moment after a letter or number is shown, and then you are expected to type that letter or number. When this is repeated a few times the user starts to remember the sequence, and in particular the typing tends to move in front of the shown letter or number. The sequence can be pretty long, and it will still work, even if the user claims not to remember the sequence.
Variations are added characters that don't belong in the sequence, and they should trigger a substantial delay, unless the replay is faked by an adversary.
Known error sources are unfamiliar keyboards or other devices.
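The check described above can be sketched as a comparison between response times for characters that belong to the learned sequence and for the added ones. The `Trial` structure and the threshold `min_gap` are hypothetical and not calibrated against any real data.

```python
from statistics import mean
from typing import List, NamedTuple


class Trial(NamedTuple):
    is_foil: bool   # True if the character does not belong to the learned sequence
    latency: float  # seconds from the character being shown to the key press;
                    # negative when the key was pressed before the character appeared


def looks_learned(trials: List[Trial], min_gap: float = 0.15) -> bool:
    """True if sequence items are typed clearly faster than the added characters.

    `min_gap` is a hypothetical threshold, not a calibrated value.
    """
    learned = [t.latency for t in trials if not t.is_foil]
    foils = [t.latency for t in trials if t.is_foil]
    if not learned or not foils:
        return False  # cannot judge without both kinds of trial
    return mean(foils) - mean(learned) >= min_gap


genuine = [Trial(False, -0.05), Trial(False, 0.0), Trial(True, 0.4),
           Trial(False, -0.1), Trial(True, 0.35)]
replayed = [Trial(False, -0.05), Trial(True, -0.05), Trial(False, -0.05)]
print(looks_learned(genuine), looks_learned(replayed))
```

A blind replay of recorded timings shows no extra delay on the added characters, which is what the check looks for.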
(It is not known if this kind of system has been used in any type of mass surveillance system. The properties should imply it is only useful for authentication purposes.)
Information to the users
For any legitimate use the users must be informed, and in particular the tracked users, but in general all affected users. If biometric data is harvested from user contributions,
The tracked users in the identification case are not limited to the specific user under observation; they include all users compared to that user, and in the general case (extreme maximum limit) all contributors to the wiki. In the turnstile case it is still necessary to inform the user under observation, but without knowing in advance who this user will be, all users must be informed.
It has been claimed that biometric identification can be done outside “Wikipedia”, and thus it is not necessary to inform the users. A system outside a wiki-instance will have no clear purpose, and it may be very difficult to get an acceptance for legitimate use according to GDPR. Details about the information that must be provided are listed in Regulation (EU) 2016/679 art. 13 for cases where personal data are collected from the data subject itself, and art. 14 for cases where personal data have not been obtained from the data subject itself.
It seems like the latest point where the users may be informed in the art. 13 case is when the data is obtained, that is when the biometric data is extracted from the contributions. In the art. 14 case it seems like the latest point is when the biometric data is accessed with a legitimate interest.
There is a pretty clear difference between fingerprinting the browser or machine and fingerprinting the user with some biometric signature. In some jurisdictions it even seems like a biometric fingerprint would be sensitive private information, which could make it necessary to use additional security measures.
The CitySense case
The company COWI had provided a system, CitySense, for travel time analysis along roads during construction for the state-owned road company Nye Veier. The main purpose is to quantify traffic disruption. NRK (Norwegian Broadcasting Corporation) ran a story about the system, "The road company Nye Veier registers your mobile phone", that had a few interesting quotes. The director of technology Atle Årnes at the Norwegian Data Protection Authority said (quotes are translated and slightly edited for clarity):
Gathering information that can be linked to an individual in one place, and then checking if the same information linking the individual exists at another place, is processing of personal data. Whether these are handled by encrypting and truncating the information does not matter, they can still be used to connect a passing individual to a previous location in place and time.
To say that information such as this is anonymous, and thus claim the processed information is not linked to a person, is in our opinion incorrect.
The director of communications Christian Altmann at Nye Veier replied (Quote is translated and slightly edited for clarity.)
The conclusion is that identifiable personal data is not processed, and that regulations therefore do not apply. The assessment is based on the fact that the MAC address, which can potentially identify individuals, is anonymized immediately and automatically before the information is registered in the system. MAC addresses are therefore not stored. This means, among other things, that there is no obligation to provide information to the public, in line with GDPR art. 13 ff.
The company (COWI) now concludes that the EU Privacy Regulation (GDPR) will apply. This information will be sent to the Norwegian Data Protection Authority and will form the basis for further dialogue on how the matter will be put in order.
If a biometric system to track users is sufficiently similar to CitySense, then it is not sufficient to simply obfuscate the identifier to make the system fall outside GDPR (General Data Protection Regulation). A synthetic identifier made for the purpose, that is a vector representation, would have the same role as an obfuscated identifier. It is not clear whether a hash into a subspace, that is introducing deniability, is sufficient to make the system fall outside GDPR.
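Why obfuscation does not remove the link can be shown with a few lines of code. This is a generic sketch of salted hashing with truncation; it is not a description of how CitySense works, and the function name, salt, and addresses are made up.

```python
import hashlib


def pseudonym(identifier: str, salt: str, keep_hex_chars: int = 8) -> str:
    """Hash an identifier with a salt and keep only the first hex characters."""
    digest = hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()
    return digest[:keep_hex_chars]


# The same device seen at two hypothetical sensors gets the same pseudonym,
# so the two sightings can still be linked without storing the address itself.
sensor_a = pseudonym("00:11:22:33:44:55", salt="day-1")
sensor_b = pseudonym("00:11:22:33:44:55", salt="day-1")
print(sensor_a == sensor_b)

# Keeping a single hex character maps all devices onto 16 values, so many
# devices share each pseudonym. This is the "hash into a subspace" idea.
print(len(pseudonym("00:11:22:33:44:55", salt="day-1", keep_hex_chars=1)))
```

The original identifier is never stored, yet the pseudonym does the same job of connecting one observation to another. Shortening the pseudonym trades linking precision for deniability, and whether that is enough legally is the open question raised above.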
The Clearview case
The company Clearview AI makes a system for identification of people given that a facial photo is provided, and they have previous portraits of the person. It is known that the company has scraped over three billion facial images from various social sites without permission. The company uses the scraped images to create identifiers (vectors) to identify real people, or “natural persons” as they are called in GDPR-lingo. The company and its business practices have thus created discussions.
Some of the on-going discussions are about Directive (EU) 2016/680 art. 10 and its interpretation
Processing of special categories of personal data
Processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person's sex life or sexual orientation shall be allowed only where strictly necessary, subject to appropriate safeguards for the rights and freedoms of the data subject, and only:
- (a) where authorised by Union or Member State law;
- (b) to protect the vital interests of the data subject or of another natural person; or
- (c) where such processing relates to data which are manifestly made public by the data subject.
Our core problem in this context is
the processing of … biometric data for the purpose of uniquely identifying a natural person. It is the attempt to identify a recurring user which is problematic, given what is said in the CitySense case.
Even if it is a bit weird, the phrase
subject to appropriate safeguards for the rights and freedoms of the data subject is troublesome. It is quite obvious that using a biometric system to limit the rights and freedoms of a legitimate user would be highly problematic, and so is banning someone who is in fact a vandal, if that is the sole purpose of the system. If the purpose is (art. 10 (b) above)
to protect the vital interests of the data subject or of another natural person then the use is legal. The question is: is protecting a wiki-instance from vandalism a vital interest? The wiki-instance itself is not a natural person, so it is not included in this context.
Art. 10 (c) above clearly allows use
where such processing relates to data which are manifestly made public by the data subject. Is this sufficient for repurposing previous contributions for identification purposes? From the discussions about Clearview it seems like repurposing published photos, which may be viewed as contributions, is not acceptable.
In Norway there is a provision in the Law on Processing of Personal Data (personopplysningsloven) § 3, "The relationship to freedom of expression and information", that can be used with art. 10 (a) above to allow processing of personal data solely for journalistic purposes or for the purpose of academic, artistic or literary expression. It does not include processing for safeguarding such expressions.
It is the editor's opinion that biometrics may be used for reclaiming a lost account, or as a tool for collecting anonymous accounts into a federated one, but it should not be used as an involuntary means of identification. Proper identification should be implemented, if possible, and not involuntary back-door identification through use of biometrics.
The users affected by such biometric identification should be clearly informed about it, and no attempts should be made at clandestine identification of uninformed users.
It could also seem like the current CheckUser facility is out of line with GDPR, as it does not inform the users when someone has run a check where they have been targeted. It could be difficult to inform the anonymous users, but logged-in users should be notified.
- "Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)". eur-lex.europa.eu. Retrieved 2020-09-07.
- Gildestad, Bjørn Atle (2020-08-26). "Vegselskapet Nye Veier registrerer mobiltelefonen din". NRK (in nn-NO). Retrieved 2020-09-07.
- Stortingets President (2020-09-03). "Spørsmål nr 2420 til skriftlig besvarelse fra stortingsrepresentant Kirsti Leirtrø til samferdselsminister Knut Arild Hareide" (PDF). stortinget.no (in nb-NO). Retrieved 2020-09-07.
- "Clearview AI Wants To Sell Its Facial Recognition Software To Authoritarian Regimes Around The World". BuzzFeed News. Retrieved 2020-09-07.
- Stolton, Samuel (2020-09-03). "MEPs furious over Commission's ambiguity on Clearview AI scandal". www.euractiv.com (in en-GB). Retrieved 2020-09-07.
- "Directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, and on the free movement of such data, and repealing Council Framework Decision 2008/977/JHA". eur-lex.europa.eu. Retrieved 2020-09-07.
- "Lov om behandling av personopplysninger (personopplysningsloven) - Kapittel 2. Lovens saklige og geografiske virkeområde - Lovdata". lovdata.no. Retrieved 2020-09-07.