Unmasked: What 10 million passwords reveal about the people who choose them
A lot is known about passwords. Most are short, simple, and pretty easy to crack. But much less is known about the psychological reasons a person chooses a specific password. Most experts recommend coming up with a strong password to avoid a data breach. But why do so many internet users still prefer weak passwords?
We’ve analyzed the password choices of 10 million people, from CEOs to scientists, to find out what they reveal about the things we consider easy to remember and hard to guess.
Who is the first superhero that comes to mind? What about a number between one and 10? And finally, a vibrant color? Quickly think of each of those things if you haven’t already, and then combine all three into a single phrase.
Now, it’s time for us to guess it.
Is it Superman7red? No, no: Batman3Orange? If we guessed any one of the individual answers correctly, it’s because humans are predictable. And that’s the problem with passwords. True, we gave ourselves the advantage of some sneakily chosen questions, but that’s nothing compared to the industrial-scale sneakiness of purpose-built password-breaking software. HashCat, for instance, can take 300,000 guesses at your password a second (depending on how it’s hashed), so even if you chose Hawkeye6yellow, your secret phrase would, sooner or later, not be secret anymore.
Passwords are so often easy to guess because many of us think of obvious words and numbers and combine them in simple ways. We wanted to explore this concept and, in doing so, see what we could find out about how a person’s mind works when he or she arranges words, numbers, and (hopefully) symbols into a (probably not very) unique order.
We began by choosing two data sets to analyze.
Two Data Sets, Several Caveats
The first data set is a dump of 5 million credentials that first showed up in September 2014 on a Russian Bitcoin forum.1 They appeared to be Gmail accounts (and some Yandex.ru), but further inspection showed that, while most of the emails included were valid Gmail addresses, most of the plain-text passwords were either old Gmail ones (i.e., no longer active) or passwords that were not used with the associated Gmail addresses. Nevertheless, WordPress.com reset 100,000 accounts and said that a further 600,000 were potentially at risk.2 The dump appears to be several years’ worth of passwords that were collected from various places, by various means. For our academic purposes, however, this didn’t matter. The passwords were still chosen by Gmail account holders, even if they weren’t for their own Gmail accounts, and given that 98 percent were no longer in use, we felt we could safely explore them.3
We used this data set, which we’ll call the “Gmail dump,” to answer demographic questions (especially those related to the genders and ages of password-choosers). We extracted these facts by searching the 5 million email addresses for any that contained first names and years of birth. For example, if an address was [email protected], it was coded as a male born in 1984. This method of inference can be tricky. We won’t bore you with too many technical details here, but by the end of the coding process, we had 485,000 of the 5 million Gmail addresses coded for gender and 220,000 coded for age. At this point, it’s worth bearing in mind the question, “Do users who include their first names and years of birth in their email addresses choose different passwords than those who don’t?”—because it’s theoretically possible they do. We’ll discuss that more a bit later.
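As a rough illustration of this kind of coding, here is a minimal sketch in Python. It is not the procedure we actually ran: the name lists, the function name infer_demographics, the example.com addresses, and the rule that a two-digit number from 30 to 99 means a birth year in the 1900s are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny name lists. A real analysis would need a
# much larger dictionary and a policy for names used by more than one gender.
MALE_NAMES = {"john", "michael", "david"}
FEMALE_NAMES = {"mary", "sarah", "emma"}

# A four-digit year starting with 19, or a standalone two-digit number.
YEAR = re.compile(r"(?<!\d)(19\d{2}|\d{2})(?!\d)")


def infer_demographics(email):
    """Return (gender, birth_year) guessed from the part before the @ sign.

    Either value is None when nothing can be inferred.
    """
    local = email.split("@", 1)[0].lower()

    gender = None
    # Only names that stand alone between digits or punctuation are matched,
    # so "johnsmith" is missed. That keeps false positives down.
    for token in re.findall(r"[a-z]+", local):
        if token in MALE_NAMES:
            gender = "male"
            break
        if token in FEMALE_NAMES:
            gender = "female"
            break

    birth_year = None
    match = YEAR.search(local)
    if match:
        digits = match.group(1)
        if len(digits) == 4:
            birth_year = int(digits)
        elif int(digits) >= 30:
            # Assumption: a two-digit number from 30 to 99 is a year in the 1900s.
            birth_year = 1900 + int(digits)

    return gender, birth_year


print(infer_demographics("john.smith84@example.com"))
print(infer_demographics("mary1979@example.com"))
```

Even this toy version shows why the inference is tricky: a number such as 84 could just as easily be a lucky number or a jersey number as a year of birth.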
For now, though, here’s how the users we coded were divided by decade of birth and gender.
The Gmail dump, or at least those people in it with first names and/or years of birth in their addresses, was skewed toward men and people born in the ’80s. This is probably because of the demographic profiles of the sites whose databases were compromised to form the dump. Searching for addresses in the dump that contained the + symbol (added by Gmail users to track what sites do with their email addresses), revealed that a large number of the credentials originated from File Dropper, eHarmony, an adult tube site, and Friendster.
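The + symbol search can be sketched in a few lines of Python. The helper names plus_tag and tag_counts and the sample addresses are hypothetical; the only fact assumed about Gmail is that text after a + in the local part is ignored for delivery, which lets users tag the address they give to each site.

```python
from collections import Counter


def plus_tag(email):
    """Return the lower-cased tag after '+' in the local part, or None."""
    local = email.split("@", 1)[0]
    if "+" not in local:
        return None
    tag = local.split("+", 1)[1].lower()
    return tag or None


def tag_counts(emails):
    """Count how often each plus tag appears across a list of addresses."""
    tags = (plus_tag(email) for email in emails)
    return Counter(tag for tag in tags if tag)


# Hypothetical sample addresses, for illustration only.
sample = [
    "alice+eharmony@example.com",
    "bob@example.com",
    "carol+eHarmony@example.com",
    "dave+friendster@example.com",
]
print(tag_counts(sample).most_common())
```

The most frequent tags hint at which sites the credentials were originally taken from.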
The second data set, and the one we’ve used to gather most of our results, was generously released by security consultant Mark Burnett, through his site xato.net.4 It consists of 10 million passwords, which were gathered from all corners of the web over a period of several years. Mark collected publicly dumped, leaked, and published lists from thousands of sources to build possibly one of the most comprehensive lists of real passwords ever. To read more about this data set, check out the FAQ on his blog.5
We won’t spend too long giving you really basic facts about this data set (like all the averages). That’s been done many times before. Instead, let’s just look at the 50 most used passwords of the 10 million. Then we’ll step into potentially more interesting territory.
As you can see, and probably already know, the most common passwords are all shining examples of things that straight away pop into someone’s mind when a website prompts him or her to create a password. They are all extremely easy to remember and, by virtue of that fact, child’s play to guess using a dictionary attack. When Mark Burnett analyzed 3.3 million passwords to determine the most common ones in 2014 (all of which are in his bigger list of 10 million), he found that 0.6 percent were 123456. And using the top 10 passwords, a hacker could, on average, guess 16 out of 1,000 passwords.
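The "16 out of 1,000" style of figure is simply the share of all passwords covered by the most common few. A minimal sketch of that calculation, using a hypothetical function name and a made-up sample list rather than the real data:

```python
from collections import Counter


def top_n_coverage(passwords, n=10):
    """Return the fraction of passwords that are among the n most common."""
    if not passwords:
        return 0.0
    counts = Counter(passwords)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(passwords)


# Made-up sample for illustration only: 14 passwords, 9 of which are
# one of the two most common choices.
sample = (
    ["123456"] * 6
    + ["password"] * 3
    + ["qwerty"] * 2
    + ["x7!Lq", "tr0ub4dor", "horse-staple"]
)
print(top_n_coverage(sample, n=2))
```

An attacker who tries only the top n guesses against every account succeeds, on average, for exactly that fraction of accounts.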
However, fewer people than in previous years are using the kinds of passwords seen above. Users are becoming slightly more conscious of what makes a password strong, for instance by adding a number or two at the end of a text phrase. That makes it better, right?
“I’ll Add a Number to Make it More Secure.”
Nearly half a million, or 420,000 (8.4 percent), of the 10 million passwords ended with a number between 0 and 99. And more than one in five people who added those numbers simply chose 1. Perhaps they felt this was the easiest to remember. Or maybe they were prompted by the site to include a number with their base word choice. The other most common choices were 2, 3, 12 (presumably thought of as one-two, rather than 12), 7, and so on. It’s been noted that when you ask a person to think of a number between one and 10, most say seven or three (hence our guesses in the introduction), and people seem to have a bias toward thinking of prime numbers.6, 7 This could be at play here, but it’s also possible that single digits are chosen as alternatives to passwords people already use but want to use again without “compromising” their credentials on other sites.
It’s a moot point, though, when you consider that a decent password cracker can very easily append a number, or several thousand, to its dictionary of words or brute-force approach. What a password’s strength really comes down to is entropy.
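Both sides of this can be sketched briefly in Python. The first function counts numeric endings the way described above, under the assumption that the number must follow a non-digit and have at most two digits (so abc123 and 123456 are not counted). The second shows how little work it takes for a cracker to append 0 through 99 to each dictionary word. The function names are hypothetical, and this is not how HashCat or any particular tool implements its rules.

```python
import re
from collections import Counter

# A non-digit, then one or two digits, then the end of the password.
TRAILING_NUMBER = re.compile(r"[^0-9]([0-9]{1,2})\Z")


def suffix_counts(passwords):
    """Count the one- or two-digit numbers that passwords end with."""
    counts = Counter()
    for password in passwords:
        match = TRAILING_NUMBER.search(password)
        if match:
            counts[match.group(1)] += 1
    return counts


def with_numeric_suffixes(words):
    """Yield each word with every number from 0 to 99 appended."""
    for word in words:
        for number in range(100):
            yield f"{word}{number}"


sample = ["password1", "monkey12", "abc123", "123456", "dragon1", "letmein"]
print(suffix_counts(sample))
print(sum(1 for _ in with_numeric_suffixes(["hawkeye", "batman"])))
```

Appending a number multiplies the attacker's work by only 100 per word, which is trivial next to the guessing rates quoted earlier.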
Evaluating Password Entropy
In simple terms, the more entropy a password has, the stronger it tends to be. Entropy increases with the length of the password and the variation of the characters that comprise it. However, while the variation in the characters used does affect its entropy score (and how hard it is to guess), the length of the password is more significant. This is because each additional character multiplies the number of possible combinations, so the search space grows exponentially with length and becomes much harder to take wild guesses at.
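To make the length-versus-variety point concrete, here is a minimal sketch of the simplest possible estimate: length multiplied by log2 of the size of the character pool. It assumes every character is chosen at random, which real users do not do, so treat it as an upper bound rather than a realistic score. The function names and the four character classes are choices made for this example.

```python
import math
import string


def charset_size(password):
    """Estimate the character pool from the classes that appear in the password."""
    size = 0
    if any(c in string.ascii_lowercase for c in password):
        size += 26
    if any(c in string.ascii_uppercase for c in password):
        size += 26
    if any(c in string.digits for c in password):
        size += 10
    if any(c in string.punctuation for c in password):
        size += len(string.punctuation)  # 32 ASCII symbols
    return size


def naive_entropy_bits(password):
    """Return length * log2(pool size), assuming purely random characters."""
    size = charset_size(password)
    if size == 0:
        return 0.0
    return len(password) * math.log2(size)


for candidate in ["password", "aB3$aB3$", "a" * 16]:
    print(candidate, round(naive_entropy_bits(candidate), 1))
```

Under this estimate, eight lowercase letters come to about 37.6 bits, eight characters drawn from all four classes to about 52.4 bits, and sixteen lowercase letters to about 75.2 bits. Doubling the length does more than widening the character set.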
The average length of a password from the Gmail dump was eight characters (e.g. password), and there was no significant difference between the average length of men’s passwords compared to women’s.
What about entropy, which is a more accurate reflection of password strength than character length alone?
The average entropy of a password from the Gmail dump was 21.6, which isn’t a particularly easy thing to conceptualize. The chart on the left gives a clearer picture. Again, there was only a negligible difference between the men and women, but there were a lot more passwords with close to zero entropy than over 60.
The example passwords vary by a character or two as the entropy rises. Generally speaking, the entropy scales with length, and increasing the range of characters by including numbers, capitals, and symbols helps too.
So how did we calculate entropy for all 5 million passwords from the Gmail dump?
There are lots of ways to calculate password entropy, and some methods are more rudimentary (and less realistic) than others. The most basic assumes that a password can only be guessed by trying every single combination of its characters. A more intelligent approach, however, recognizes that humans—as we’ve seen—are addicted to patterns, and therefore certain assumptions can be made about most of their passwords. And based on those assumptions, rules for attempting to guess their passwords can be established and used to significantly speed up the cracking process (by chunking combinations of characters into commonly used patterns). It’s all very clever and we can take no credit for it. Instead, credit goes to Dan Wheeler, who created the entropy estimator we used. It’s called Zxcvbn, and it can be seen and read about in detail here.8
In brief, it builds a “knowledge” of how people unknowingly include patterns in their passwords into its estimation of what a good password cracker would need to do to determine those patterns. For example, password, by a naive estimation, has an entropy of 37.6 bits. Zxcvbn, however, scores it zero (the lowest and worst entropy score) because it accounts for the fact that every word list used by password crackers contains the word password. It does a similar thing with other more common patterns, like leet speak (adding numb3rs to words to m@ke them seemingly less gue55able).
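The idea of seeing through leet speak can be sketched as follows. This is not Zxcvbn's algorithm, only a simplified illustration: the substitution table, the tiny word list, and the function names are all assumptions made for the example, and each symbol is mapped to a single letter even though some (such as 1) are used for more than one.

```python
# Assumed substitution table: one plain letter per leet character.
LEET = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# Hypothetical stand-in for a cracker's word list.
WORD_LIST = {"password", "numbers", "make", "guessable", "lincoln"}


def normalise(password):
    """Lower-case the password and undo common leet substitutions."""
    return password.lower().translate(LEET)


def in_word_list(password):
    """Return True if the password is a listed word once leet is undone."""
    return normalise(password) in WORD_LIST


for candidate in ["password", "P@55w0rd", "numb3rs", "gue55able", "ns8vfpob"]:
    print(candidate, in_word_list(candidate))
```

Once a password reduces to a listed word, the substitutions add only a handful of extra guesses, which is why an estimator that models them scores such passwords so low.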
It also scores other passwords, which at first glance look very random, as having zero entropy. qaz2wsx (the 30th most common password), for instance, looks pretty random, right? In fact, it’s anything but. It’s actually a keyboard pattern (an easily repeatable “walk” from one key on a keyboard to the next). Zxcvbn itself is named after one such pattern.
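A rough way to spot a keyboard walk is to measure how many consecutive characters sit on neighbouring keys. The sketch below does that on a simplified QWERTY grid that ignores the physical stagger between rows. It is an illustration only, not the method Zxcvbn uses, and the names ROWS, is_adjacent, and walk_ratio are hypothetical.

```python
# Simplified QWERTY layout: each string is one row, and column position
# stands in for physical position (row stagger is ignored).
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
POSITION = {
    key: (row_index, col_index)
    for row_index, row in enumerate(ROWS)
    for col_index, key in enumerate(row)
}


def is_adjacent(a, b):
    """Return True if two different keys are neighbours on the grid."""
    if a == b or a not in POSITION or b not in POSITION:
        return False
    (row_a, col_a), (row_b, col_b) = POSITION[a], POSITION[b]
    return abs(row_a - row_b) <= 1 and abs(col_a - col_b) <= 1


def walk_ratio(password):
    """Return the share of consecutive character pairs that are adjacent keys."""
    lowered = password.lower()
    if len(lowered) < 2:
        return 0.0
    steps = list(zip(lowered, lowered[1:]))
    return sum(is_adjacent(a, b) for a, b in steps) / len(steps)


for candidate in ["zxcvbn", "qaz2wsx", "hawkeye"]:
    print(candidate, round(walk_ratio(candidate), 2))
```

In qaz2wsx, every step except the jump from z back up to 2 lands on a neighbouring key, so the password is really two short walks glued together.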
We pulled out the 20 most used keyboard patterns from the 10 million passwords data set. We chose to exclude patterns of numbers, like 123456, because they’re only sort of keyboard walks, and there are also so many of them at the top of the most used password list that there wouldn’t have been space to see some of the more interesting ones if we had included them.
Nineteen of the 20 keyboard patterns above look about as predictable as you might expect, except for the last one: Adgjmptw. Can you guess why that ranked among the most used patterns?
You probably don’t need to, as you’ve almost certainly already looked below.
Although we very much doubt we’re the first to spot it, we’ve not yet found any other reference to this keyboard pattern being among the most commonly used in passwords. Yet it ranks 20th above.
In case you haven’t realized, it’s generated by pressing 2 through 9 on a smartphone’s dial pad (the first letter printed on each key, taken in order, spells out the pattern).
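The mapping is easy to check. The sketch below uses the standard letter assignment found on telephone keypads; the variable names are just for illustration.

```python
# Standard telephone keypad letters for keys 2 through 9.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Take the first letter on each key, in key order.
pattern = "".join(KEYPAD[digit][0] for digit in "23456789")
print(pattern)  # adgjmptw
```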
We were initially confused about this pattern because most people don’t type letters with a dial pad; they use the QWERTY layout. Then we remembered phones like the BlackBerry, which has a physical keyboard with numbers always in view on the keys.
This pattern poses an interesting question: How will password selection change as more people create them on touch devices that make certain characters (like symbols and capitals) harder to select than when using a regular keyboard?
Of course, keyboard patterns, especially those above, are no problem at all for any good password cracker. Passpat uses several keyboard layouts and a clever algorithm to measure the likelihood that a password is made from a keyboard pattern.9 And other tools exist for generating millions of keyboard patterns, to compile and use them as a list, rather than wasting time trying to crack the same combinations by brute force.10
Most people don’t use keyboard patterns though. They stick to the classic and frequently insecure method of choosing a random word.
Now you can see why we guessed Batman and Superman at the start of this article: they are the most used superhero names in the 10 million passwords data set. An important point about the above lists is that it’s sometimes hard to know in what sense a person uses a word when they include it in their password. For example, in the colors list, black might sometimes refer to the last name Black; the same goes for other words with dual contexts. To minimize this issue when counting the frequencies of the above words, we approached each list separately. The colors, for example, were only counted when passwords started with the name of the color and ended with numbers or symbols. This way, we avoided counting red in Alfred and blue in BluesBrothers. Using this conservative approach will, of course, mean we missed many legitimate names of colors, but it seems better to know the above list only contains “definites.”
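The conservative color rule can be sketched with a regular expression. This is an illustration under assumptions, not the exact script we ran: the color list is a short hypothetical sample, and the rule is read strictly, so a password counts only if it starts with a color name and everything after it is one or more digits or symbols.

```python
import re
from collections import Counter

# Hypothetical short list; the real analysis used a longer one.
COLORS = ["red", "blue", "black", "green"]

# A color name, then one or more characters that are not letters.
COLOR_PATTERN = re.compile(r"(%s)[^a-z]+" % "|".join(COLORS))


def count_colors(passwords):
    """Count passwords made of a color name followed only by digits or symbols."""
    counts = Counter()
    for password in passwords:
        match = COLOR_PATTERN.fullmatch(password.lower())
        if match:
            counts[match.group(1)] += 1
    return counts


sample = ["red123", "Alfred", "BluesBrothers", "Blue!", "blue7", "black"]
print(count_colors(sample))
```

Alfred fails because it does not start with a color, and BluesBrothers fails because letters follow the color name. A bare black is also skipped under this strict reading, which is the kind of undercounting the conservative approach accepts.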
Other lists had different rules. We didn’t include cats and dogs in the animals list because cat appears in too many other words. Instead, we counted cats and dogs separately and found that they’re used an almost identical number of times. However, cats is used a lot more in conjunction with Wild- and Bob- (sports teams) than dogs is used in other phrases. So we’d say dogs probably wins.
The most common nouns and verbs were only counted if they appeared in the top 1,000 nouns and top 1,000 verbs used in everyday English. Otherwise the lists would have been full of nouns like password and verbs like love.
Not that love isn’t an interesting word. It’s actually used surprisingly often in passwords. We found it 40,000 separate times in the 10 million passwords and a lot in the 5 million Gmail credentials too.
When we counted the frequency of love in the passwords of the people whose ages we inferred from their usernames, those born in the ’80s and ’90s used it slightly more often than older people.
In the Gmail data, 1.4 percent of the women’s passwords contained love, compared to 0.7 percent of men’s. In other words, based on this data at least, women appear to use the word love in their passwords twice as often as men. This finding follows in the footsteps of other recent research on the word love in passwords. A team at the University of Ontario Institute of Technology reported that ilove[male name] was four times more common than ilove[female name]; iloveyou was 10 times more common than iloveme; and <3 was the second most common method of combining a symbol with a number.11
Now that we’ve learned a bit about the most common words and numbers in passwords, the most used keyboard patterns, the concept of password entropy, and the relative futility of simple password obfuscation methods like leet speak, we can move onto our final port of call. It’s the most personal and, potentially, the most interesting.
Passwords of the Rich and Powerful
Mark Burnett notes on his website that password dumps are worryingly frequent.12 Crawling fresh dumps is how he compiled the 10 million passwords data set, after all. The other events that seem to be hitting the headlines on an ever-more-frequent basis are high-profile hacks of celebrities and corporations. Jennifer Lawrence et al. and Sony immediately spring to mind. We were curious about how the Gmail data could potentially be used to determine which high-profile people were affected by this dump in particular. In other words, whose passwords were published? We did it by using Full Contact’s Person API, which takes a list of email addresses and runs them through the APIs of several major social networking sites like Twitter, LinkedIn, and Google+. Then it provides new data points for any it finds, like age, gender, and occupation.13
We already knew a few fairly high-profile people were in the Gmail dump. For instance, Mashable noted a month after the list was released that one of its reporters was included (the password listed for him was his Gmail password, but several years old and no longer in use).14 But we didn’t think Full Contact would turn up so many more.
Within the 78,000 matches we found, there were hundreds of very high-profile people. We’ve selected about 40 of the most notable below. A few very important points:
1. We’ve deliberately not identified anyone by name.
2. The company logos represent those organizations the individuals work for now and not necessarily when they were using the password listed for them.
3. There’s no way of knowing where the passwords were originally used. They may have been personal Gmail passwords, but it’s more likely that they were used on other sites like File Dropper. It’s therefore possible that many of the weak passwords are not representative of the passwords the individuals currently use at work, or anywhere else for that matter.
4. Google confirmed that when the list was published, less than 2 percent (100,000) of the passwords might have worked with the Gmail addresses they were paired with. And all affected account holders were required to reset their passwords. In other words, the passwords below—while still educational—are no longer in use. Instead, they’ve been replaced by other, hopefully more secure, combinations.
If the passwords hadn’t been reset, however, the situation would be more of a concern. Several studies have shown that a number of us use the same passwords for multiple services.15 And given that the list below includes a few CEOs, many journalists, and someone very high up at the talent management company of Justin Bieber and Ariana Grande, this dump could have caused a lot of chaos. Thankfully it didn’t, and now can’t.
The most noticeable thing about the passwords above is how many of them would be woefully easy to guess if an offline cracking process were used against them. The strongest of the bunch once belonged to a GitHub developer (ns8vfpobzmx098bf4coj) and, with an entropy of 96, it looks almost too random. It was probably created by a random password generator or password manager. The weakest belonged to a senior IBM manager (123456), which—conversely—seems so basic that it was surely used for a throwaway sign-up somewhere. Many of the others strike enough of a balance between complexity and simplicity to suggest that their owners cared about making them secure and wanted to safeguard the accounts they were chosen for.
A couple of interesting standouts to finish: the Division Chief for the U.S. Department of State whose password (but not name) was linco1n (Lincoln) and the Huffington Post writer who followed in Mulder’s footsteps (from the X-Files) and chose trustno1. And more generally, it’s interesting to see just how many of the high-profile people we selected did exactly what so many of the rest of us do: combine our names, dates of birth, simple words, and a couple of numbers to make lousy passwords. We guess it makes sense though. Even President Obama recently admitted that he once used the password 1234567. A password with a much higher entropy score would have been PoTuS.1776. Although, to a clever cracker, that might have been a little obvious.
***
So what about your own passwords? While reading this post you likely thought about yourself and wondered, “Could somebody guess the password to my online banking, email, or blog?” If you use one of the big email providers, like Gmail, you shouldn’t have to worry too much about your password being guessed through a brute-force attack. Gmail cuts off illegitimate attempts almost immediately. Your online banking is likely similarly protected. If you have a blog, though, the situation is more complicated because, in simple terms, there are more potential ways for an attacker to find a way in, so each must be proactively secured to keep them out. The point is never to take password security for granted, and to settle on a system for creating passwords that are easy for you to remember but still hard for anyone else to figure out.
The team at WP Engine spends a lot of time and continuous effort keeping our customers’ WordPress sites secure. Our secure WordPress hosting platform integrates into WordPress itself and protects our customers’ sites against brute-force attacks on their passwords with intelligent, reactive software that constantly learns and adapts to threats and takes action. We also safeguard our customers from attacks that have nothing to do with password guessing, like sniffing login attempts and SQL injections. WP Engine provides the best managed WordPress hosting platform, powering brands and the enterprise to reach global audiences with WordPress technology.
Download our WordPress security White Paper and learn about the 10 best practices for securing a WordPress deployment, including how to safely generate, store, and regularly change passwords.
References
1. http://www.dailydot.com/crime/google-gmail-5-million-passwords-leaked/
2. http://www.eweek.com/blogs/security-watch/wordpress-resets-100000-passwords-after-google-account-leak.html
3. https://xato.net/passwords/ten-million-passwords
4. https://xato.net/passwords/ten-million-passwords-faq/
5. http://groups.csail.mit.edu/uid/deneme/?p=628
6. http://micro.magnet.fsu.edu/creatures/pages/random.html
7. http://www.dailymail.co.uk/news/article-2601281/Why-lucky-7-really-magic-number.html
8. https://blogs.dropbox.com/tech/2012/04/zxcvbn-realistic-password-strength-estimation/
9. http://digi.ninja/projects/passpat.php
10. https://github.com/Rich5/Keyboard-Walk-Generators
11. http://www.thestar.com/news/gta/2015/02/13/is-there-love-in-your-online-passwords.html
12. https://xato.net/passwords/understanding-password-dumps
13. https://www.fullcontact.com/developer/person-api/
14. http://mashable.com/2014/09/10/5-million-gmail-passwords-leak/
15. http://www.jbonneau.com/doc/DBCBW14-NDSS-tangled_web.pdf