Leak of FIFA World Cup 2018 Visitors


I haven’t kept this blog nicely updated, have I? Well, family, work, seasonal depressions, you know, the usual excuses. However, I recently came across an interesting leak, allegedly containing personal information of visitors to the FIFA World Cup 2018, which was held in Russia.

The database, posted on BreachForums at the end of November 2023, is a CSV file with 1,726,904 lines. The data is organized into columns containing full names, phone numbers, dates of birth, email addresses, and documents (passport numbers with the issuing authority and dates).

Some time ago, I read an article on Hamatti’s blog about a tool named csvkit, and I finally have a bit of time to try it out. Csvkit is a collection of tools for working with CSV files. I won’t attempt to write a tutorial on the toolkit; my aim is just to try it out. If you’d like to learn how to use it, please see the official tutorial.

The original file was pipe-delimited, so I used good old sed to change | into commas. Only after that did I discover that csvkit has a dedicated tool, csvformat, which converts CSV files to a custom output format.
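For reference, the csvformat route could look roughly like the sketch below. The file names and the sample row are made up for illustration. The -d option sets the input delimiter, and the output defaults to commas. Unlike a blanket sed substitution, csvformat quotes any field that already contains a comma, which could matter for a free-form column such as document.

```shell
# Made-up sample standing in for the original pipe-delimited dump.
cat > fifa_raw.txt <<'EOF'
fio|phone|birthday|email|document
IVANOV IVAN IVANOVICH|9000000001|01.01.1988|TEST@EXAMPLE.COM|0000 000000, issued 01.01.2015
EOF

# Naive approach: every pipe becomes a comma, so the comma inside the
# last field silently turns five columns into six.
sed 's/|/,/g' fifa_raw.txt > fifa_sed.csv

# csvformat: read pipe-delimited input (-d '|') and write standard CSV.
# Fields that contain commas get wrapped in double quotes.
csvformat -d '|' fifa_raw.txt > fifa.csv
```

Running csvcut -n on both results is a quick way to check whether the column count survived the conversion.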

To see the names and the indices of columns in the file we’ll use csvcut, a tool for filtering and truncating CSV files:

$ csvcut -n fifa.csv
  1: fio
  2: phone
  3: birthday
  4: email
  5: document

All the columns are self-explanatory, except the first one, “fio”, which in Russian is an abbreviation of last name, first name and patronymic, and denotes a person’s full name. The last column contains identification documents together with the issuing authorities.

Who doesn’t like stats, right? csvstat outputs descriptive statistics on the columns of a CSV file. Here we’ll use it in combination with csvcut: we’ll “cut” the phone, birthday and email columns from our file and pass them to csvstat.

$ csvcut -c phone,birthday,email fifa.csv | csvstat

1. "phone"

    Type of data:          Text
    Contains null values:  True (excluded from calculations)
    Unique values:         1260577
    Longest value:         19 characters
    Most common values:    None (23132x)
                           9687824071 (3168x)
                           9687824068 (1879x)
                           351914386719 (1276x)
                           9032174134 (1265x)

2. "birthday"

    Type of data:          Text
    Contains null values:  True (excluded from calculations)
    Unique values:         30272
    Longest value:         10 characters
    Most common values:    01.01.1988 (307x)
                           01.01.1987 (303x)
                           01.01.1985 (291x)
                           01.01.1990 (285x)
                           01.01.1986 (282x)

3. "email"

    Type of data:          Text
    Contains null values:  True (excluded from calculations)
    Unique values:         1035206
    Longest value:         204 characters
    Most common values:    None (194107x)
                           FANIDREPORT@EMG.RU (3233x)
                           BELGIUM@TRAVELDOCS.EU (1673x)
                           TEST@GMAIL.COM (1361x)
                           MAGGIEBAO@MICESHANGHAI.COM.CN (1354x)

Cool, right? We can use csvstat, for instance, to count the number of unique values found within a column:

$ csvstat -c email --unique fifa.csv 
1035206

One of the most interesting tools in the kit is csvsql, which allows running SQL statements against a CSV file. As a basic example, let’s count the number of email addresses with the Finnish TLD:

$ csvsql --query "SELECT email FROM fifa WHERE email LIKE '%.fi'" fifa.csv | wc -l
2597
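One caveat: csvsql prints a header row before the results, so piping to wc -l counts that line too, and the figure may be one higher than the number of matching rows. Letting SQL do the counting sidesteps this. A minimal sketch, using made-up sample rows and a column alias (n) chosen here for illustration:

```shell
# Made-up sample rows; the real fifa.csv is not reproduced here.
cat > fifa.csv <<'EOF'
fio,phone,birthday,email,document
PERSON ONE,358400000001,01.01.1988,ONE@EXAMPLE.FI,X
PERSON TWO,358400000002,02.02.1987,TWO@EXAMPLE.FI,X
PERSON THREE,9000000003,03.03.1985,THREE@EXAMPLE.COM,X
EOF

# Two rows match, but wc -l sees three lines: the "email" header plus the rows.
csvsql --query "SELECT email FROM fifa WHERE email LIKE '%.fi'" fifa.csv | wc -l

# COUNT(*) returns the number of matching rows directly,
# printed as a header line (n) followed by the count.
csvsql --query "SELECT COUNT(*) AS n FROM fifa WHERE email LIKE '%.fi'" fifa.csv
```

The pattern '%.fi' still matches the upper-case addresses because LIKE in SQLite, which csvsql uses by default, is case-insensitive for ASCII characters.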

Now I’ll stop torturing you with commands and their results. Of course, one might say that nothing is needed beyond sed, awk and grep, but I find csvkit a very handy toolkit for analysing CSV files.

So, we know that there are 2597 email addresses with the Finnish TLD, and obviously not everyone uses a .fi email address. Counting occurrences of Finnish phone numbers gives 10222; however, the number of unique phone numbers is 8733. You see, the same phone number can be used for multiple visitors. For instance, a company may have registered its employees to attend the football tournament, and thus all the employees would be registered with the same phone number. Luckily for the Finns, the passport numbers exposed in this leak are no longer valid, because Finnish passports expire 5 years after issuance, so any passport used in 2018 had expired by the time the data was posted.
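The gap between the two phone counts is simply total rows versus distinct values. A minimal sketch of how such a pair of figures could be produced, assuming Finnish numbers are stored with a leading 358 country code (the exact format in the file is an assumption) and using made-up rows and column aliases:

```shell
# Made-up sample: two visitors share one Finnish number.
cat > fifa.csv <<'EOF'
fio,phone,birthday,email,document
PERSON ONE,358400000001,01.01.1988,ONE@EXAMPLE.FI,X
PERSON TWO,358400000001,02.02.1987,TWO@EXAMPLE.FI,X
PERSON THREE,358400000002,03.03.1985,THREE@EXAMPLE.COM,X
PERSON FOUR,9000000003,04.04.1990,FOUR@EXAMPLE.COM,X
EOF

# -I (--no-inference) keeps every column as text, so LIKE compares the
# phone numbers as strings rather than as numbers.
# total counts matching rows; distinct_phones counts each number once.
csvsql -I --query "SELECT COUNT(*) AS total, COUNT(DISTINCT phone) AS distinct_phones FROM fifa WHERE phone LIKE '358%'" fifa.csv
```

On this sample the query reports three matching rows but only two distinct numbers, the same kind of gap as between 10222 and 8733.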

One of the frequently used email addresses piqued my interest: guestmanagement@fifa.org. In fact, it was used 1,270 times. I started randomly going through people with that email address, to see who they are. First I noticed two ancient chaps, one born in 1925 and the other in 1926: Aleksei Paramonov and Nikita Simonyan. I searched for them and discovered that they are old and legendary Soviet football players. Cool, so this email address was used for guests of FIFA. Next I thought, well, maybe there are personal details of the football athletes who were participating in the World Cup. Now, I am not really good with the names of any athletes, especially in football. There are just too many of them, and I recognize only a few. While I was going through meaningless names (at least to me), I recognized one: Didier Drogba!

As the email address suggests, guestmanagement@fifa.org is for the guests of FIFA. Thanks, Captain Obvious! I searched for other overly popular players like Messi and Neymar; both were present in the data set, with their own email addresses. In Messi’s case, it was some really random-looking gmail.com address, probably for privacy purposes. In Neymar’s case, I discovered 2 entries that match the full name and date of birth of the athlete. One had a hotmail.br address, the other an address belonging to an employee of the Brazilian Football Confederation.

This post was written by me, a human being, not an LLM. I guess we’ll start using #noLLM hashtags, similar to the #nofilter ones on Instagram 🙂
