New tool to identify sockpuppets based on writing style

Giraffe Stapler · Unread post by **Giraffe Stapler** » Mon Sep 13, 2021 2:38 pm

New tool to identify sockpuppets based on writing style
Checkusers on the English Wikipedia will soon have access to a new tool aimed at identifying misuse of multiple accounts based on a person's writing style. masz, developed by Ladsgroup, uses natural language processing to create an individual 'fingerprint' of a user based on the way they use language on talk pages. Checkusers can log into a web interface to compare the fingerprints of two accounts or list accounts with similar fingerprints. The tool is already live on several projects and is expected to start running on enwiki after phab:T290793 is resolved. – Joe (talk) 07:15, 13 September 2021 (UTC)

Discussion here.

No comment.

Unread post by **Zoloft** » Mon Sep 13, 2021 2:53 pm

Giraffe Stapler wrote: ↑
Mon Sep 13, 2021 2:38 pm

New tool to identify sockpuppets based on writing style
Checkusers on the English Wikipedia will soon have access to a new tool aimed at identifying misuse of multiple accounts based on a person's writing style. masz, developed by Ladsgroup, uses natural language processing to create an individual 'fingerprint' of a user based on the way they use language on talk pages. Checkusers can log into a web interface to compare the fingerprints of two accounts or list accounts with similar fingerprints. The tool is already live on several projects and is expected to start running on enwiki after phab:T290793 is resolved. – Joe (talk) 07:15, 13 September 2021 (UTC)
Discussion here.

No comment.

It's a hammer, given to a varied group of people. Some will use it to ensure their opinion sounds legitimate.

boom · Unread post by **boom** » Mon Sep 13, 2021 3:46 pm

Cool. This will allow us to take evidence fabrication to a whole new level.

Will the tool be subject to the same anti-fishing guidelines as the CU itself? I'm afraid the servers might not be able to withstand the load otherwise.

Poetlister · Unread post by **Poetlister** » Mon Sep 13, 2021 4:45 pm

Of course, the late SlimVirgin claimed to be able to detect sockpuppets by this method. I hope that this program is more reliable.

Vigilant · Unread post by **Vigilant** » Mon Sep 13, 2021 4:53 pm

Was it coded by Jehochmann and Durova ?

Unread post by **Midsize Jake** » Mon Sep 13, 2021 9:08 pm

Vigilant wrote: ↑
Mon Sep 13, 2021 4:53 pm
Was it coded by Jehochmann and Durova ?

Even better — it was coded by this guy:

Anyway, this sort of innovation was probably inevitable, but even if Mr. Sarabadani has actual talent, it probably can't help but make joe-jobbing much easier for people who are willing to read their opponents' posts for comprehension but can't afford to subscribe to multiple VPNs at once.

Bezdomni · Unread post by **Bezdomni** » Mon Sep 13, 2021 10:42 pm

Believe it or not, the pleonasm "has (...) received positive reception" has been added to the Cambridge English dictionary. (as an example, with Wikipedia as its source: §)

I wonder how many of the 500+ en.wp occurrences were typed by Cirt.

Vigilant · Unread post by **Vigilant** » Mon Sep 13, 2021 11:30 pm

Midsize Jake wrote: ↑
Mon Sep 13, 2021 9:08 pm

Vigilant wrote: ↑
Mon Sep 13, 2021 4:53 pm
Was it coded by Jehochmann and Durova ?
Even better — it was coded by this guy:

Anyway, this sort of innovation was probably inevitable, but even if Mr. Sarabadani has actual talent, it probably can't help but make joe-jobbing much easier for people who are willing to read their opponents' posts for comprehension but can't afford to subscribe to multiple VPNs at once.

I will boldly predict that this will be nearly as funny as that shite tool that was supposed to determine aggression in text.

Name escapes me at the moment.

Poetlister · Unread post by **Poetlister** » Tue Sep 14, 2021 10:44 am

Midsize Jake wrote: ↑
Mon Sep 13, 2021 9:08 pm

Vigilant wrote: ↑
Mon Sep 13, 2021 4:53 pm
Was it coded by Jehochmann and Durova ?
Even better — it was coded by this guy:

Anyway, this sort of innovation was probably inevitable, but even if Mr. Sarabadani has actual talent, it probably can't help but make joe-jobbing much easier for people who are willing to read their opponents' posts for comprehension but can't afford to subscribe to multiple VPNs at once.

We all know how good WMF developers can be. But did he also develop the algorithms? If so, does he have the necessary expertise in AI?

ArmasRebane · Unread post by **ArmasRebane** » Tue Sep 14, 2021 3:09 pm

Poetlister wrote: ↑
Mon Sep 13, 2021 4:45 pm
Of course, the late SlimVirgin claimed to be able to detect sockpuppets by this method. I hope that this program is more reliable.

It's probably marginally so. This kind of computer-synthesized analysis is a little less likely to be prone to human pattern-matching behaviors, but I'd be highly surprised if it could put out any sort of definitive link.

I imagine this will be used like other behavioral evidence. This isn't going to suddenly turn up a bunch of unknown socks, especially since, if they're smart, people who have been trying to evade detection should have been doing stuff to change their language anyhow.

As for enabling Joe-jobs, what, are people going to constantly run their sock accounts' edits into the machine to try and match someone else's output?

Giraffe Stapler · Unread post by **Giraffe Stapler** » Tue Sep 14, 2021 3:18 pm

Poetlister wrote: ↑
Tue Sep 14, 2021 10:44 am
We all know how good WMF developers can be. But did he also develop the algorithms? If so, does he have the necessary expertise in AI?

I said "no comment" but I'm going to comment anyway. It's not clear to me if there actually is any "AI" in this. It seems like straight statistical analysis of word use, but I've only seen the same couple of graphs everyone else has. (Word distributions of two users in fawiki 1.png and Word distributions of two users in fawiki 2.png)

The talk about restricting use to Checkusers made me laugh. Google "stylometry". There are plenty of papers on machine learning and stylometry and no shortage of projects implementing those papers. If you want to do a stylometric analysis of Wikipedia editors, you can already do it without this tool. It's a good project for someone, actually.

There are some things to be considered, though. Do you use everything that the editor has written on Wikipedia, or just what they have written outside of article space? I am quite sure that a fair percentage of what gets added to articles is just cut-and-pasted from the sources with minor edits like splicing two sentences together or leaving out an unnecessary clause. That's going to muddy up your "fingerprint". But if you only use non-article space edits, you might have trouble getting enough text for a meaningful analysis. Not all sockmasters are given to ranting on talk pages (although it does seem to be a common trait).

There seems to be a suggestion that they are storing the "fingerprints" that this tool generates. I don't know why they would do that unless they intended to check them repeatedly. So the tool isn't for comparing two users like an editor interaction tool, it's for identifying users. It gives you the possibility of searching stored fingerprints for matches. And since this is based on Wikipedia edits, it means that data retention is no longer an issue. Lots of potential for past bad behaviour to be uncovered...
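
For anyone who wants to see what "straight statistical analysis of word use" plus stored, searchable fingerprints might look like, here is a minimal sketch. It is not masz's code, which isn't visible here. It assumes a fingerprint is just relative word frequencies and that accounts are compared by cosine similarity; the function names (`fingerprint`, `similarity`, `closest_matches`) and the sample texts are made up for illustration.

```python
import math
import re
from collections import Counter


def fingerprint(text):
    """Relative frequency of each lowercase word in the text (hypothetical helper)."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprints (0.0 means no shared words)."""
    dot = sum(value * fp_b.get(word, 0.0) for word, value in fp_a.items())
    norm_a = math.sqrt(sum(value * value for value in fp_a.values()))
    norm_b = math.sqrt(sum(value * value for value in fp_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_matches(target_fp, stored, top_n=3):
    """Rank stored fingerprints by similarity to the target, best match first."""
    scored = [(name, similarity(target_fp, fp)) for name, fp in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up talk-page samples; a real analysis needs far more text per account.
stored = {
    "UserA": fingerprint("I respectfully defer to the consensus, in acknowledgement of my error."),
    "UserB": fingerprint("lol no that source is garbage and you know it"),
}
suspect = fingerprint("I respectfully defer to the closing admin, in acknowledgement of my mistake.")
ranking = closest_matches(suspect, stored)
print(ranking)
```

Once the fingerprints are stored, nothing in a scheme like this needs the original logs again, which is why retention stops being a limit. Whether the real tool weights words, filters quoted text, or uses something other than cosine similarity is unknown from what has been published.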

MrErnie · Unread post by **MrErnie** » Tue Sep 14, 2021 5:14 pm

Let's all start saying "respectfully defer to," and "acknowledgement of my," and linking diffs by saying "at DIFF" and see how many of us get blocked as Cirtpuppets.

ArmasRebane · Unread post by **ArmasRebane** » Tue Sep 14, 2021 8:46 pm

MrErnie wrote: ↑
Tue Sep 14, 2021 5:14 pm
Let's all start saying "respectfully defer to," and "acknowledgement of my," and linking diffs by saying "at DIFF" and see how many of us get blocked as Cirtpuppets.

This would seem to be the advantage of computer-synthesized stylometry over user recognition: a joe job built on someone's trademark phrases is probably less likely to work, because those phrases are only part of their overall corpus.

Basically, they'd have to be much better at aping someone else's style to appear indistinguishable.

Unread post by **Midsize Jake** » Tue Sep 14, 2021 10:37 pm

ArmasRebane wrote: ↑
Tue Sep 14, 2021 8:46 pm
This would seem to be the advantage of computer-synthesized stylometry over user recognition: a joe job built on someone's trademark phrases is probably less likely to work, because those phrases are only part of their overall corpus.

Basically, they'd have to be much better at aping someone else's style to appear indistinguishable.

You may well be right — we'll probably just have to wait and see how well (if at all) the software works. My point earlier was that if you're being reasonably subtle about it, and really trying to get someone else in trouble via joe-jobbing, you might be more likely to be successful at casting suspicion against the targeted user because the software is more likely to notice what you're doing. It's going to be processing the edit samples much faster (and therefore in much greater volume) than a human can, and of course it also doesn't sleep, and perhaps more importantly, it isn't hindered by compassion or the nice person's tendency to give people the benefit of the doubt. (IOW, "oh no, he would never do such a terrible thing on Wikipedia, of all places.")

Anyhoo, if the software in question works properly, presumably that means there will be (relatively) few false positives. It will probably "score" new-ish users as it compares them to more established ones, and only report comparisons that produce scores over a certain threshold (say, 75% likely). So if you're joe-jobbing, it just becomes a question of how similar you have to be to reach the reporting threshold, right? IMO it really depends on how good the algorithm is, and like you say, how good the joe-jobber is. So (at the risk of repeating myself repetitively) we'll just have to wait and see, I guess.
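
A rough sketch of that reporting-threshold idea, assuming the tool produces a similarity score between 0 and 1 for each pair of accounts. The scores, account names, and the `report_matches` helper below are all hypothetical, and a raw score is not the same thing as "75% likely".

```python
def report_matches(scores, threshold=0.75):
    """Return (account, score) pairs at or above the threshold, highest first."""
    hits = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for one new-ish account against three established ones.
scores = {"EstablishedA": 0.91, "EstablishedB": 0.74, "EstablishedC": 0.40}

print(report_matches(scores))        # [('EstablishedA', 0.91)]
print(report_matches(scores, 0.70))  # [('EstablishedA', 0.91), ('EstablishedB', 0.74)]
```

Lowering the threshold reports more pairs and presumably more false positives. For the joe-jobber, the target is simply whatever score clears the cutoff.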

Tarc · Unread post by **Tarc** » Wed Sep 15, 2021 1:37 am

This triggers a random ANI memory. There was a user who was dragged there a few times for refusing to communicate on talk pages; they only left the occasional few words in edit summaries. They expressed a fear of being identifiable from word patterns.

Poetlister · Unread post by **Poetlister** » Wed Sep 15, 2021 3:27 pm

Midsize Jake wrote: ↑
Tue Sep 14, 2021 10:37 pm
Anyhoo, if the software in question works properly

We're talking about stuff produced by a WMF developer. No doubt it will work quite as well as, ... let's see, ... the visual editor?

Ming · Unread post by **Ming** » Thu Sep 16, 2021 6:12 am

Ming does feel that this needs to be tested out in the open where everyone can see how well it works before people start applying it as if it were a reliable oracle. For that matter, it needs to be entirely open-source.

Giraffe Stapler · Unread post by **Giraffe Stapler** » Thu Sep 16, 2021 3:22 pm

Ming wrote: ↑
Thu Sep 16, 2021 6:12 am
Ming does feel that this needs to be tested out in the open where everyone can see how well it works before people start applying it as if it were a reliable oracle. For that matter, it needs to be entirely open-source.

Ah. It is open-source. If you could see it, which you can't, you would be allowed to use the code or create your own version of it, but it is understood that anyone allowed to see it will not do this. Open-source is really just licensing, although people usually and reasonably expect that open-source code is freely accessible.

Wikipediocracy
