Ashley Madison is an online dating service, launched in 2001 and based in Canada. Its target group is individuals who are married or in a relationship. Its motto is: "Life is short. Have an affair." On 15 July this year the news broke that hackers had stolen its customer data, including emails, names, home addresses, credit card information, etc. The hackers demanded that the service shut down, or they would publish the data. On 22 July the first data were released. When the company did not shut down, all customer data was released on 18 August. A final batch of data was released on 20 August 2015.
At the beginning the data was only available on the so-called "dark web", but it later leaked to mainstream websites. The data released on 18 August consisted of a dump of about 9-10 GB. It has been claimed that data of that size is difficult to analyse using standard computer hardware. In this post I will show how this data can be analysed and how some conclusions can be drawn. I will abstain from disseminating any personal information, as I do not want to contribute to destroying the lives of those affected by the hack. I will not divulge credit card details or anything like that. I will also not explain in detail how to obtain the data dump, even though that step has been explained in great detail on many websites. I will, however, show how to work with the data and point out some potentially interesting aspects of it.
The following analysis was done together with Bjoern Schelter, who helped with the MySQL connection and who is also a member of this community. The first part of the post is a little bit dry: it explains how to connect to a MySQL database. I hope that the second part makes up for it; it contains information on the gender balance and reveals some other, hopefully interesting, bits of information.
In order to analyse this large data set we will use MySQL. I am working on Mac OS X, and there are very easy instructions online for how to install MySQL. The basic steps are:
- Download MySQL from here.
- Install the program.
Run the following command in the terminal:
sudo /usr/local/mysql/support-files/mysql.server start
Generate a file called ~/.bash_profile if it does not yet exist.
cd ; nano .bash_profile
Copy this line
export PATH="/usr/local/mysql/bin:$PATH"
into the file and save.
Set the root password for MySQL (from the command line):
/usr/local/mysql/bin/mysqladmin -u root password 'yourrootpasswordhere'
Fix the MySQL socket error:
sudo mkdir /var/mysql
sudo ln -s /tmp/mysql.sock /var/mysql/mysql.sock
Then create this file:
sudo nano /Library/LaunchDaemons/com.mysql.mysql.plist
and copy this into it:
<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
<key>KeepAlive</key>
<true />
<key>Label</key>
<string>com.mysql.mysqld</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/mysql/bin/mysqld_safe</string>
<string>--user=mysql</string>
</array>
</dict>
</plist>
Then save.
Now MySQL is installed. We now proceed to construct our database from the dump. We follow the instructions on this website.
The folder you would get if you downloaded the dump is called dmps. It contains several zip files. After unzipping, the folder should look like this:
Ok. Now we need to run the following commands as the root user in MySQL (after logging in from the terminal using "mysql -u root -p"):
CREATE DATABASE aminno;
CREATE DATABASE am;
CREATE USER 'am'@'localhost' IDENTIFIED BY 'loyaltyandfidelity';
GRANT ALL PRIVILEGES ON aminno.* TO 'am'@'localhost';
GRANT ALL PRIVILEGES ON am.* TO 'am'@'localhost';
The next step will take a while: we have to import the database as a normal user.
$ mysql -D am -uam -ployaltyandfidelity < am_am.dump
$ mysql -D aminno -uam -ployaltyandfidelity < aminno_member.dump
$ mysql -D aminno -uam -ployaltyandfidelity < aminno_member_email.dump
$ mysql -D aminno -uam -ployaltyandfidelity < member_details.dump
$ mysql -D aminno -uam -ployaltyandfidelity < member_login.dump
Depending on your computer, each of these will run for 4-10 hours; a solid state disk seems to make a huge difference here. While this is running you might already start analysing some of the other files. The file swappernet_QA_User_Table.txt contains many passwords in plain text, with no encryption whatsoever. I believe that nowadays this would be considered bad practice when it comes to handling people's passwords.
swappernet = Import["/Users/thiel/Desktop/dmps/ashleymadisondump/swappernet_QA_User_Table.txt", "CSV"];
passwords = Select[swappernet, Length[#] > 3 &][[;; , 4]];
Grid[(Reverse@SortBy[Tally[passwords], #[[2]] &])[[1 ;; 25]], Frame -> All]
The
resulting table contains surprisingly simple passwords. Some of the top
passwords are inappropriate for this community, and do not further our
understanding, so I have blurred them out. This is the result:
There are
Length[swappernet]
(*765607*)
765607 entries in the file, and the number of really, really unsafe passwords is quite low; still, the 200 or so most frequent passwords would unlock about 10% of the profiles. This is a bit shocking, particularly because the accounts contain sensitive information.
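As a rough check of that 10% figure, we can compare the combined count of the 200 most frequent passwords with the total number of entries; here is a minimal sketch that reuses the passwords list defined above:
(* Share of accounts covered by the 200 most frequent passwords. *)
top200 = Reverse[SortBy[Tally[passwords], Last]][[1 ;; 200]];
N[Total[top200[[All, 2]]]/Length[passwords]]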
Let's move on to the main databases. We will start with the database "am". First, let's increase the heap space for Java a bit, like so:
<< JLink`;
InstallJava[];
ReinstallJava[CommandLine -> "java", JVMArguments -> "-Xmx4096m"]
To connect to the database we need the following driver:
Needs["DatabaseLink`"];
JDBCDrivers["MySQL(Connector/J)"]
I found that the easiest way to connect to the database is to use the DatabaseExplorer:
DatabaseExplorer[]
The interface is quite straightforward to use. You need to add a new database: click on the first icon ("add database") in the upper left and follow the instructions. You link the two databases we generated above, "am" and "aminno". As the type of the database you choose "MySQL(Connector/J)". The host is "localhost", and the password is "loyaltyandfidelity". Ok. Now we are ready to use the Wolfram Language. Let's start with the am database.
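If you prefer to skip the GUI, the connection can also be opened directly with a JDBC specification; this is only a sketch, assuming the "am" database and the "am" user created above:
(* Alternative to the DatabaseExplorer: connect directly via JDBC. *)
Needs["DatabaseLink`"];
OpenSQLConnection[JDBC["MySQL(Connector/J)", "localhost/am"],
 "Username" -> "am", "Password" -> "loyaltyandfidelity"]
Below I will use the named connection that was configured via the DatabaseExplorer: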
am = OpenSQLConnection["am", "Name" -> "am", "Password" -> "loyaltyandfidelity"]
So let's see what we've got:
columns = SQLExecute[SQLSelect["am", {"am_am_member"}, {SQLColumn[{"am_am_member", "*"}]}, None,
"SortingColumns" -> None, "MaxRows" -> 1, "Distinct" -> False, "GetAsStrings" -> False, "ShowColumnHeadings" -> True]][[1]]
Ok, that's a lot of information. Let's start with the gender balance (I use 35 million entries; the entire database has around 37 million entries):
gender = SQLExecute[SQLSelect["am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}]}, None,
"SortingColumns" -> None, "MaxRows" -> 35000000, "Distinct" -> False, "GetAsStrings" -> False, "ShowColumnHeadings" -> True]];
We then tally this:
gendertally = Tally[Flatten@gender[[2 ;;]]][[1 ;; 2]] /. {2 -> "Male", 1 -> "Female"}
(*{{"Male", 27546956}, {"Female", 4414808}}*)
This tells us that of the 35 million members we analysed, about 27.5 million are male users and 4.4 million are female users. Next we chart it:
PieChart[Apply[Labeled, Reverse[{{Style["Male", Large], gendertally[[1, 2]]},
{Style["Female", Large], gendertally[[2, 2]]}}, 2], {1}], PlotTheme -> "Business"]
It is easy to calculate the percentages:
N[#[[2]]/Total[gendertally[[All, 2]]]] & /@ gendertally
This suggests that there are about 86.2% males and 13.8% females in this subsample of 35000000 entries (altogether there are slightly more than 35 million entries in the database). Let's have a look at the dates of birth; this time I will choose a 2 million subset.
dob = SQLExecute[SQLSelect["am", {"am_am_member"}, {SQLColumn[{"am_am_member", "dob"}],
SQLColumn[{"am_am_member", "gender"}]}, !SQLStringMatchQ[SQLColumn[{"am_am_member", "dob"}], "%0000-00-00%"],
"SortingColumns" -> None, "MaxRows" -> 2000000,"Distinct" -> False, "GetAsStrings" -> False, "ShowColumnHeadings" -> True]];
We have also retrieved the gender so that we can plot histograms for each gender:
DateHistogram[{Select[dob, #[[2]] == 2 &][[All, 1, 1]], Select[dob, #[[2]] == 1 &][[All, 1, 1]]}, 130]
The yellow histogram shows the males and the grey one the females. Let's look at heights by gender:
genderheight = SQLExecute[SQLSelect["am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}],
SQLColumn[{"am_am_member", "profile_height"}]}, None, "SortingColumns" -> None, "MaxRows" -> 500000,
"Distinct" -> False, "GetAsStrings" -> False, "ShowColumnHeadings" -> True]];
Here is the respective histogram by gender:
Histogram[{(Select[genderheight[[2 ;;]] /. {2 -> "Male", 1 -> "Female"}, #[[1]] == "Female" &])[[All, 2]],
(Select[genderheight[[2 ;;]] /. {2 -> "Male", 1 -> "Female"}, #[[1]] == "Male" &])[[All, 2]]}, {120, 220, 1},
"Probability", AxesLabel -> {"height", "probability"}, LabelStyle -> Directive[Bold, Medium], ImageSize -> Large]
So, as we would have anticipated, women are on average slightly shorter... Nothing new here. Let's have a look at what men and women are looking for:
seekingfemale = Select[SQLExecute[SQLSelect["am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}],
SQLColumn[{"am_am_member", "profile_caption"}]}, None, "SortingColumns" -> None, "MaxRows" -> 500000,
"Distinct" -> False, "GetAsStrings" -> False, "ShowColumnHeadings" -> True]], #[[1]] == 1 &];
seekingmale = Select[SQLExecute[SQLSelect["am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}],
SQLColumn[{"am_am_member", "profile_caption"}]}, None, "SortingColumns" -> None, "MaxRows" -> 500000,
"Distinct" -> False, "GetAsStrings" -> False, "ShowColumnHeadings" -> True]], #[[1]] == 2 &];
Let's determine the important words here:
wordsfemale = DeleteStopwords[Flatten[TextWords@Select[Flatten[seekingfemale], StringQ]]];
wordsmale = DeleteStopwords[Flatten[TextWords@Select[Flatten[seekingmale], StringQ]]];
Here are the corresponding word-clouds:
Grid[{{"Female", "Male"}, {WordCloud[wordsfemale, IgnoreCase -> True],WordCloud[wordsmale, IgnoreCase -> True]}}, Frame -> True]
Quite clearly, very similar words come up. Does this mean that males and females are similar in what they are looking for? To elucidate this further, we look at one of the entries of the database, "pref_opento".
prefsfemale =
Select[SQLExecute[
SQLSelect[
"am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}],
SQLColumn[{"am_am_member", "pref_opento"}]}, None,
"SortingColumns" -> None, "MaxRows" -> 50000,
"Distinct" -> False, "GetAsStrings" -> False,
"ShowColumnHeadings" -> True]], #[[1]] == 1 &];
prefsmale =
Select[SQLExecute[
SQLSelect[
"am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}],
SQLColumn[{"am_am_member", "pref_opento"}]}, None,
"SortingColumns" -> None, "MaxRows" -> 50000,
"Distinct" -> False, "GetAsStrings" -> False,
"ShowColumnHeadings" -> True]], #[[1]] == 2 &];
Here's a histogram of the preferences:
Histogram[{ToExpression@(Flatten@
StringSplit[prefsfemale[[1 ;; 50000, 2]], "|"]),
ToExpression@(Flatten@
StringSplit[prefsmale[[1 ;; 50000, 2]], "|"])}, {1, 52,
1}, "Probability",
AxesLabel -> {"Number Preference", "Probability"},
LabelStyle -> Directive[Bold, Medium], ImageSize -> Large]
Males are blue, females orange/yellow. Each number corresponds to one (often rather explicit) preference. In the dump there is a table that allows you to interpret what these numbers mean, but for the purpose of this post that would only be unnecessarily explicit, so I avoid spelling out the preferences.
Let's look at the account types. It turns out that an entry of "1" in the database describes a paid-for account, whereas "2" means free.
accounttypefemale =
Select[SQLExecute[
SQLSelect[
"am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}],
SQLColumn[{"am_am_member", "account_type"}]}, None,
"SortingColumns" -> None, "MaxRows" -> 50000,
"Distinct" -> False, "GetAsStrings" -> False,
"ShowColumnHeadings" -> True]], #[[1]] == 1 &];
accounttypemale =
Select[SQLExecute[
SQLSelect[
"am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}],
SQLColumn[{"am_am_member", "account_type"}]}, None,
"SortingColumns" -> None, "MaxRows" -> 50000,
"Distinct" -> False, "GetAsStrings" -> False,
"ShowColumnHeadings" -> True]], #[[1]] == 2 &];
For females we get:
SortBy[Tally[accounttypefemale[[All, 2]]], #[[1]] &] /. {1 -> "Paid", 2 -> "Free"}
(*{{"Paid", 4798}, {"Free", 868}}*)
For males we get:
SortBy[Tally[accounttypemale[[All, 2]]], #[[1]] &] /. {1 -> "Paid", 2 -> "Free"}
(*{{"Paid", 44307}, {"Free", 22}}*)
Virtually all males have a paid-for account. That is interesting. Let's have a look at when the accounts were created:
createdonfemale =
Select[SQLExecute[
SQLSelect[
"am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}],
SQLColumn[{"am_am_member", "createdon"}]}, !
SQLStringMatchQ[SQLColumn[{"am_am_member", "createdon"}],
"%0000-00-00 00:00:00%"], "SortingColumns" -> None,
"MaxRows" -> 1000000, "Distinct" -> False,
"GetAsStrings" -> False,
"ShowColumnHeadings" -> True]], #[[1]] == 1 &];
createdonmale =
Select[SQLExecute[
SQLSelect[
"am", {"am_am_member"}, {SQLColumn[{"am_am_member", "gender"}],
SQLColumn[{"am_am_member", "createdon"}]}, !
SQLStringMatchQ[SQLColumn[{"am_am_member", "createdon"}],
"%0000-00-00 00:00:00%"], "SortingColumns" -> None,
"MaxRows" -> 1000000, "Distinct" -> False,
"GetAsStrings" -> False,
"ShowColumnHeadings" -> True]], #[[1]] == 2 &];
Here's a time line plot for the females:
TimelinePlot@createdonfemale[[1 ;; 1000, 2, 1]]
Here is the one for the males:
TimelinePlot@createdonmale[[1 ;; 1000, 2, 1]]
These plots are not very telling; they only show that there might be interesting information in the arrival times.
DateHistogram[{createdonmale[[;; , 2, 1]], createdonfemale[[;; , 2, 1]]}, 500, PlotRange -> {{DateObject[{2001, 1, 1}],
DateObject[{2006, 7, 1}]}, All}, AxesLabel -> {"date", "number of registrations"}, LabelStyle -> Directive[Bold, Medium],
ImageSize -> Large]
It is interesting to see that there are large peaks. Also, apart from the very beginning, there are always more men signing up than women. To study this further we can look at the differences between consecutive arrival times; the time series is obviously not stationary, but could there be an underlying Poisson process?
arrivaltimesfemale =
Table[DateDifference[createdonfemale[[i, 2, 1]], createdonfemale[[i + 1, 2, 1]], Quantity[1, "Seconds"]], {i, 1, 50000}];
arrivaltimesmale =
Table[DateDifference[createdonmale[[i, 2, 1]], createdonmale[[i + 1, 2, 1]], Quantity[1, "Seconds"]], {i, 1, 50000}];
Here are the histograms:
Grid[{{"female",
Histogram[QuantityMagnitude /@ arrivaltimesfemale, "Log",
ImageSize -> Medium]}, {"male",
Histogram[QuantityMagnitude /@ arrivaltimesmale, "Log",
ImageSize -> Medium]}}, Frame -> All]
That is quite interesting. The histograms appear to be quite different. Let's check what type of distributions the Wolfram Language believes these to be:
FindDistribution[Select[(QuantityMagnitude /@ arrivaltimesfemale[[1 ;; 2000]]), # > 0 &], 3]
(*{MixtureDistribution[{0.921977, 0.0780234}, {LogNormalDistribution[7.64884, 1.99741], GammaDistribution[3.03315, 11677.4]}],
MixtureDistribution[{0.938047, 0.0619529}, {LogNormalDistribution[7.66677, 1.97593], LogNormalDistribution[10.7177, 0.441292]}],
LogNormalDistribution[7.84679, 2.04246]}*)
and for the males:
FindDistribution[Select[(QuantityMagnitude /@ arrivaltimesmale[[1 ;; 2000]]), # > 0 &], 3]
(*{ParetoDistribution[3564.88, 2.03308, 1.08316, 1.00978], LogNormalDistribution[7.22247, 1.7025], FrechetDistribution[1.03896, 1219.71, -317.249]}*)
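A somewhat more direct check of the Poisson hypothesis is also possible: under a homogeneous Poisson process the inter-arrival times should be exponentially distributed, and that can be tested explicitly. This is only a sketch, reusing the arrivaltimesmale data from above; I do not show the output here.
(* Test whether the male inter-arrival times are compatible with an
   exponential distribution, as a homogeneous Poisson process would imply. *)
samples = Select[QuantityMagnitude /@ arrivaltimesmale[[1 ;; 2000]], # > 0 &];
DistributionFitTest[samples, ExponentialDistribution[1/Mean[samples]], "TestDataTable"]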
Next, we look at the geospatial distribution. Luckily, the database comes with GPS coordinates. (It also contains the address of the member.) I will not display any GPS coordinates in this post, so as not to identify individuals. Instead I will display points on a world map and a density distribution for the United States.
geocordsfull = SQLExecute[
SQLSelect["am", {"am_am_member"}, {SQLColumn[{"am_am_member", "latitude"}], SQLColumn[{"am_am_member", "longitude"}]}, None,
"SortingColumns" -> None, "Distinct" -> False, "MaxRows" -> 1000000, "GetAsStrings" -> False, "ShowColumnHeadings" -> True]];
First, the global distribution of a random sample of 10000 of the retrieved coordinates:
GeoListPlot[GeoPosition /@ (DeleteCases[RandomChoice[geocordsfull, 10000], {0., 0.}]),
GeoBackground -> "ReliefMap", GeoCenter -> GeoPosition[{0, 0}], ImageSize -> Large]
Ok, next the plot for the US. We first select the coordinates that fall inside a bounding box around the contiguous United States, then calculate a density estimate, and then plot it over the map of the United States:
adulterercoordsusa = Select[RandomChoice[geocordsfull, 20000], (24.9382 < #[[1]] && #[[1]] < 49.37 && -67 > #[[2]] && #[[2]] > -124) &];
adultererDensityDistribution = SmoothKernelDistribution[adulterercoordsusa, "SheatherJones"];
The final stage is to plot the density and the map of the US:
cplot = ContourPlot[PDF[adultererDensityDistribution, {y, x}],
   Evaluate@
    Flatten[{x, {#[[1, 1, 2]], #[[2, 1, 2]]} &@
       GeoBoundingBox[Entity["Country", "UnitedStates"]]}],
   Evaluate@
    Flatten[{y, {#[[1, 1, 1]], #[[2, 1, 1]]} &@
       GeoBoundingBox[Entity["Country", "UnitedStates"]]}],
   ColorFunction -> "Rainbow", Frame -> False,
   PlotRange -> {0, 0.027}, Contours -> 205, MaxRecursion -> 2,
   PlotRangePadding -> 0, ContourStyle -> None];
GeoGraphics[{GeoStyling[{"GeoImage", cplot}],
Polygon[Entity["Country", "UnitedStates"]], Black, Opacity[1],
PointSize[0.0015], Point[Reverse /@ adulterercoordsusa]},
GeoRange -> Entity["Country", "UnitedStates"], ImageSize -> Large]
There is much more interesting stuff in the am database, but let's proceed to the aminno database.
CloseSQLConnection[am];
OpenSQLConnection["aminno", "Name" -> "aminno_member", "Password" -> "loyaltyandfidelity"]
Again as above you can use the
DatabaseExplorer[]
to make the database available to Mathematica. Let's look at the IP addresses that were used for signing up:
TableForm@(Reverse@
SortBy[Tally[
DeleteCases[
Flatten[SQLExecute[
SQLSelect[
"aminno", {"aminno_member"}, {SQLColumn[{"aminno_member",
"signupip"}]}, None, "SortingColumns" -> None,
"MaxRows" -> 1000000, "Timeout" -> 10, "Distinct" -> False,
"GetAsStrings" -> False, "ShowColumnHeadings" -> True]]],
""][[2 ;;]]], #[[2]] &])[[1 ;; 25]]
This gives:
The interesting bit is that it contains the IP address 127.0.0.1, which is "localhost". The first entry is an AOL address, for example. Now we can ask how many people have certain email domains. First we get all email addresses. I will not display individual email addresses, to protect individuals.
emails = DeleteCases[(Flatten@SQLExecute[SQLSelect["aminno", {"aminno_member_email"},
{SQLColumn[{"aminno_member_email", "email"}]}, None, "SortingColumns" -> None, "MaxRows" -> 10000000,
"Timeout" -> 100, "Distinct" -> False, "GetAsStrings" -> False, "ShowColumnHeadings" -> True]])[[2 ;;]], ""];
We can count how many have a ".mil" address.
Length[Select[emails, StringMatchQ[#, __ ~~ ".mil"] &]]
(*3588*)
This means that for the entire 37 million entries we would expect about 13276 ".mil" addresses.
Length[Select[emails, StringMatchQ[#, __ ~~ ".gov.uk"] &]]
(*34*)
There is another entry in the database called bc_mail_last_time. It gives the time when the member last checked their email. If the member never checked their email, that entry contains all zeros. We therefore look at all members with a non-zero value in that entry, and split them by gender:
emailactivity = (SQLExecute[SQLSelect["aminno", {"aminno_member"},
{SQLColumn[{"aminno_member", "bc_mail_last_time"}],SQLColumn[{"aminno_member", "gender"}]},
!SQLStringMatchQ[SQLColumn[{"aminno_member", "bc_mail_last_time"}],
"%0000-00-00 00:00:00%"], "SortingColumns" -> None, "MaxRows" -> 2000000, "Timeout" -> 1100,
"Distinct" -> False, "GetAsStrings" -> False, "ShowColumnHeadings" -> True]][[2 ;;]]);
So how many of the first 2000000 entries in the database, i.e. people, were female?
Length@(Select[emailactivity, #[[2]] == 1 &][[All, 1, 1]])
(*254*)
Oops, that is frighteningly low! The remainder are men:
(Length@Select[emailactivity, #[[2]] == 2 &][[All, 1, 1]])
(*1999746*)
Wow, that means that there are very few active women indeed. If I want to analyse the entire database, Java runs out of heap space. Luckily, there is a command to fix that:
<< JLink`;
InstallJava[];
ReinstallJava[CommandLine -> "java", JVMArguments -> "-Xmx16384m"]
Nice, now we've got all we need.
emailactivity = (SQLExecute[
SQLSelect[
"aminno", {"aminno_member"}, {SQLColumn[{"aminno_member",
"bc_mail_last_time"}],
SQLColumn[{"aminno_member", "gender"}]}, !
SQLStringMatchQ[
SQLColumn[{"aminno_member", "bc_mail_last_time"}],
"%0000-00-00 00:00:00%"], "SortingColumns" -> None,
"MaxRows" -> 21000000, "Timeout" -> 1500, "Distinct" -> False,
"GetAsStrings" -> False, "ShowColumnHeadings" -> True]][[2 ;;]]);
That last command retrieves up to 21 million entries that do not have all zeros in the last-reply field, i.e. the "sign-of-life" field. In a moment we will see that this covers all people who show at least some activity beyond registration. Out of these there are
Length@(Select[emailactivity, #[[2]] == 1 &][[All, 1, 1]])
(*1492*)
only 1492 females!!! And there are:
Length@(Select[emailactivity, #[[2]] == 2 &][[All, 1, 1]])
(*20269675*)
20269675 males. Note that the two numbers add up to less than 21 million, i.e. the row limit was not reached, so we have extracted all members who were at least briefly active. This means that of the more than 20 million users who actually used the system beyond registration, only 1492 were women. That corresponds to:
100*1492/(1492 + 20269675) // N
i.e. about 0.0074% women. This suggests that many men have paid to meet women who were not, or at least not actively, in the system.
This analysis is far from complete. It just shows what can be achieved with Mathematica on a standard laptop. The dataset contains many different types of data, and I think that this analysis could be considered a first step towards "big data". The Wolfram Language is quite capable of analysing rather large datasets even on standard home equipment. It is a really nice tool for big data that is not only large but also very heterogeneous. Its functions for text analysis, its efficient algorithms for operations on large data sets, and its curated data have all proven to be useful. There are many more things that the Wolfram Language can help to extract from this particular database, but this post is already far too long, for which I apologise. If there is any interest in this kind of analysis, or if there are questions/suggestions about what else to do with the data, I will be happy to post more, as long as it does not disclose any personal information of the people involved.
Cheers,
Marco