Analyze Chrome’s Browsing History with Python


In this edition, I'll walk through a little Python script that finds the websites you visit most often. Assuming that you don't clear your history regularly, I advise you to brace yourself for a sizeable impact or perhaps some regret upon seeing your browsing activity.

I'm writing for a general audience here, so I'm going to jump straight into the layout of the little Python (2.7) script I used to analyze my Chrome history on Windows, which eventually led to a little epiphany moment.

Chrome’s history database

Chrome stores its data locally in an SQLite database. So all we need to do here is write a short Python script that connects to that database, queries the necessary fields, and extracts the required data: the URLs visited and their corresponding visit counts.

On a Windows machine, this database can usually be found at the following path.

C:\Users\<YOURUSERNAME>\AppData\Local\Google\Chrome\User Data\Default

As my code targets Chrome on Windows, I'll leave it to you to tweak it for Linux and macOS. The following code gives us the path on Windows.

data_path = os.path.join(os.path.expanduser('~'), 'AppData', 'Local',
                         'Google', 'Chrome', 'User Data', 'Default')
history_db = os.path.join(data_path, 'History')
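If you want the script to work on Linux and macOS too, a sketch like the following should get you started. The macOS and Linux locations below are Chrome's usual defaults; a custom profile directory would live elsewhere.

```python
import os
import sys

def chrome_history_path():
    """Return the likely path to Chrome's History database on this OS.

    These are Chrome's default profile locations; adjust if you use a
    non-default profile directory.
    """
    home = os.path.expanduser('~')
    if sys.platform.startswith('win'):
        return os.path.join(home, 'AppData', 'Local', 'Google',
                            'Chrome', 'User Data', 'Default', 'History')
    elif sys.platform == 'darwin':
        return os.path.join(home, 'Library', 'Application Support',
                            'Google', 'Chrome', 'Default', 'History')
    else:  # assume a Linux-like system
        return os.path.join(home, '.config', 'google-chrome',
                            'Default', 'History')
```

Using os.path.join also sidesteps the backslash-escaping headaches you can run into when hardcoding Windows paths as string literals.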

Now that we have the path to the database, we can run some SQL queries on it. To do that, you'd need to import the sqlite3 library that ships with Python. The variables are self-explanatory; if necessary, refer to the sqlite3 documentation.

c = sqlite3.connect(history_db)
cursor = c.cursor()
select_statement = "SELECT urls.url, urls.visit_count FROM urls, visits WHERE urls.id = visits.url;"
cursor.execute(select_statement)

results = cursor.fetchall()
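One caveat: Chrome keeps the History file locked while the browser is running, so sqlite3 may raise a "database is locked" error. A simple workaround, sketched below with an illustrative helper (connect_to_copy is not part of the script above), is to copy the file somewhere temporary and query the copy.

```python
import os
import shutil
import sqlite3
import tempfile

def connect_to_copy(history_db):
    """Copy the History file to a temp directory and connect to the copy,
    so a running Chrome instance can't lock us out."""
    tmp_dir = tempfile.mkdtemp()
    tmp_db = os.path.join(tmp_dir, 'History')
    shutil.copy2(history_db, tmp_db)
    return sqlite3.connect(tmp_db)
```

Alternatively, just close Chrome before running the script.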

The important thing here is the select_statement. How did I know which fields to fetch? You could either download an SQLite browser and get a glimpse of the schema, or query the sqlite_master table from a console to list the tables.
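For instance, querying sqlite_master lists every table along with its CREATE statement. The list_tables function here is just an illustrative helper, not part of the script above.

```python
import sqlite3

def list_tables(db_path):
    """Return (name, create_sql) for every table in an SQLite database."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table';"
    ).fetchall()
    conn.close()
    return rows

# e.g.: for name, sql in list_tables(history_db): print(name)
```

Running this against Chrome's History database shows tables like urls and visits, whose columns the select_statement above relies on.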

When the program runs, the fetchall function returns a list of tuples. Go ahead and try printing it. You'd end up with a console full of tuples, each holding a URL and the number of times it was visited. Though this gives a gist of our history, it isn't helpful in this format: the URLs repeat, and the counts are scattered along the list, one row per recorded visit. However, we can loop through every occurrence of a URL, keep our own count, and store the two as a key-value pair in a dictionary.

Parsing URLs

If you print the tuples in the console, you'll notice the counts are much lower than expected. That's because HTTP and HTTPS URLs, along with subdomains, are counted separately. While the following parsing code is not the most Pythonic way to do it, I decided to go with a nifty function that splits the URL into components and returns the plain, naked domain. For a more robust approach, I'd recommend the urlparse module and its netloc attribute.

def parse(url):
	try:
		parsed_url_components = url.split('//')
		sublevel_split = parsed_url_components[1].split('/', 1)
		domain = sublevel_split[0].replace("www.", "")
		return domain
	except IndexError:
		print "URL format error!"
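For reference, here's roughly what that urlparse-based version could look like (note that netloc is an attribute of the parsed result, not a function; parse_netloc is an illustrative name, not from the original script):

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def parse_netloc(url):
    """Extract the bare domain using the standard library's URL parser."""
    domain = urlparse(url).netloc
    if domain.startswith('www.'):
        domain = domain[4:]
    return domain
```

Unlike the manual split, this handles URLs without a path (or with unusual schemes) gracefully instead of raising IndexError.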

Creating and sorting the URL-count dictionary

As seen above, all we have to do now is loop through the list of tuples, parse each URL, and use the result as a dictionary key, incrementing its value on every occurrence.

sites_count = {} #dict makes iterations easier :D

for url, count in results:
	url = parse(url)
	if url in sites_count:
		sites_count[url] += 1
	else:
		sites_count[url] = 1
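As an aside, collections.Counter from the standard library does the same occurrence counting in one line. The snippet below is a self-contained sketch with a tiny stand-in for the fetchall() results; like the loop above, it counts rows rather than summing the visit_count column.

```python
from collections import Counter

def parse(url):
    """Same splitting logic as above, repeated here so the snippet runs standalone."""
    return url.split('//')[1].split('/', 1)[0].replace('www.', '')

# A tiny stand-in for the (url, visit_count) tuples that fetchall() returns
results = [
    ('https://www.example.com/a', 3),
    ('http://example.com/b', 1),
    ('https://docs.python.org/2/', 2),
]

# One count per row, keyed by the parsed domain
sites_count = Counter(parse(url) for url, count in results)
print(sites_count.most_common())
```

Counter.most_common() also gives you the sorted-by-value view for free.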

As Python 2's dictionaries are inherently unordered, the result should be sorted by value for a clean output. I've used OrderedDict to do that.

sites_count_sorted = OrderedDict(sorted(sites_count.items(), key=operator.itemgetter(1), reverse=True))
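If you'd rather plot only the top fifty or so sites instead of everything, you can slice the ordered dictionary before plotting. A sketch with toy data (top_n is just an illustrative helper):

```python
from collections import OrderedDict
from itertools import islice

# Toy stand-in for the sorted dictionary built above
sites_count_sorted = OrderedDict(
    [('a.com', 50), ('b.com', 20), ('c.com', 5), ('d.com', 1)]
)

def top_n(ordered, n):
    """Keep only the first n entries of an ordered mapping."""
    return OrderedDict(islice(ordered.items(), n))
```

Passing top_n(sites_count_sorted, 50) to the plotting code keeps the x-axis readable.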

Analyzing and plotting the results

Now that we have everything in good shape, all that's left is to put up a strong face and muster the energy to feel whatever it is you'll feel on finding your most visited sites. I hardly have any experience plotting with matplotlib, so the following is the simplest graph I could come up with.

plt.bar(range(len(sites_count_sorted)), sites_count_sorted.values(), align='edge')
plt.xticks(range(len(sites_count_sorted)), sites_count_sorted.keys(), rotation=45)
plt.show()

While you might choose to plot only the first fifty or hundred sites, I chose to plot every damn URL I visited. This is my graph as of April 20, 2016, showing a zoomed-in snapshot of the primary sites I spend my time on.

The graph showing my most visited sites.

Time well spent!

Complete code – Python history analyzer for Chrome

Feel free to fork it and tweak it.

import os
import sqlite3
import operator
from collections import OrderedDict
import matplotlib.pyplot as plt

def parse(url):
	try:
		parsed_url_components = url.split('//')
		sublevel_split = parsed_url_components[1].split('/', 1)
		domain = sublevel_split[0].replace("www.", "")
		return domain
	except IndexError:
		print "URL format error!"

def analyze(results):

	prompt = raw_input("[.] Type <c> to print or <p> to plot\n[>] ")

	if prompt == "c":
		for site, count in results.items():
			print site, count
	elif prompt == "p":
		plt.bar(range(len(results)), results.values(), align='edge')
		plt.xticks(range(len(results)), results.keys(), rotation=45)
		plt.show()
	else:
		print "[.] Uh?"
		quit()

#path to the user's Chrome history database
data_path = os.path.join(os.path.expanduser('~'), 'AppData', 'Local',
                         'Google', 'Chrome', 'User Data', 'Default')
history_db = os.path.join(data_path, 'History')

#querying the db
c = sqlite3.connect(history_db)
cursor = c.cursor()
select_statement = "SELECT urls.url, urls.visit_count FROM urls, visits WHERE urls.id = visits.url;"
cursor.execute(select_statement)

results = cursor.fetchall() #tuple

sites_count = {} #dict makes iterations easier :D

for url, count in results:
	url = parse(url)
	if url in sites_count:
		sites_count[url] += 1
	else:
		sites_count[url] = 1

sites_count_sorted = OrderedDict(sorted(sites_count.items(), key=operator.itemgetter(1), reverse=True))

analyze(sites_count_sorted)

If you have any issues, do let me know in the comments or on GitHub. Have fun.

This post was first published on April 20, 2016.


Karthikeyan KC

Aeronautical engineer, dev, science fiction author, gamer, and an explorer. I am the creator of Geekswipe. I love writing about physics, aerospace, astronomy, and Python. I created Swyde. Currently working on Arclind Mindspace.


37 Responses

  1. ARUN PRAKASH

    I got an error like this. Please tell me what the error is and how to clear it.
    files = os.listdir(data_path)
    FileNotFoundError: [WinError 3] The system cannot find the path specified: ‘C:\\Users\\Gayathri\\AppData\\Local\\Google\\Chrome\\User Data\\Default’

  2. Is there any way I can make this work in Jupyter notebook?

  3. Hello. The code is working, but I'm getting a "database is locked" error. How can I solve this? Thanks a lot.

  4. Balasubramanian

    (Unicode error) 'unicodeescape' codec can't decode bytes in position 28-29: truncated \UXXXXXXXX escape

    • Balasubramanian

      data_path=os.path.expanduser('~')+"\AppData\Local\Google\Chrome\User Data\Default" taking "(Unicode error) ‘unicodeescape’ codec can’t decode bytes in position 28-29: truncated \UXXXXXXXX escape" Error!!! how can it be solved ?

      • balasubramanian

        I have modified Data_path like data_path=os.path.expanduser('~')+ r"\AppData\Local\Google\Chrome\User Data\Default" and applied "r+"\AppData\Local\Google\Chrome\User Data\Default Now getting sqlite3.OperationalError: database is locked
        How to unlock the database? Pls explain.

    • If your ‘\’ is not escaped properly, this might happen. In the following response, I see that you have converted the path into raw string. That should work.

      • Balasubramanian

        Now I am getting the error for url, credentials in credential.iteritems():" below AttributeError: 'dict' object has no attribute 'iteritems'.

        How can this be resolved?

  5. Hey! Nice article. I wanted to know how I can scrape the browsing history from a remote computer or any device. What information do I need, and is that possible with Python?

    • What are you actually trying to accomplish by this? The above code is to get one’s own browser history and it is supposed to be a fun script for learning and analysis purposes. You are asking to remotely access a browser and record its history. I can’t help you here.

  6. how to get chrome search history using vb.net. give ans

    • Please stop spamming the comment section. You need not ask me more than once. The above script is written in Python. The logic is self explanatory. Go figure out a way to write one in VB .net. Also, note that I’ll have to place a ban on you if you keep spamming the comments.

  7. I love this! Now can you turn it into something button pushing simple for a Python beginner so I could just run it and see my outputs? Just kidding….. I was looking for something I can use to help me determine some client time for billing and ran across you …… I know this isn’t it, but I still loved it when I ran across it…… you are great! thanks for sharing!! I’ve got to come back to this when I know more about how to work in Python on my machine…. I’m a programmer from the days of Cobol and C (before the +) and I’m self learning web development and doing ok ….. it’s just hard to know where to start sometimes …. some days I think I need to go back to school and start over, but who would pay my bills? lol

  8. You write some of the best Python tutorials online for specific tasks, and include easy-to-understand introductions. Thank you. I hope in the future you will also keep in mind those using Python 3.x; some things like the Windows path and input() have changed, and the above code won't work unless the user knows about those changes.

  9. Hi

    Great tutorial – I was able to extract the history, no worries.

    I cannot plot the graph though.
    I'm wondering if it is because:
    1 – your analysis code looks different from the analyze method in the complete code – is there a reason for this?
    2 – perhaps I have not fed the right parameter/arg into my copy of your analyze method?

    I am relatively new – so any help would be greatly appreciated

    regards,

    m :)

    • Hello! :) This is the part that plots the graph.

      plt.bar(range(len(results)), results.values(), align='edge')
      plt.xticks(rotation=45)
      plt.xticks(range(len(results)), results.keys())
      plt.show()

      Would help if you could give me any info about the errors.

  10. Hey, I tried following you! But once I make a connection object (a cursor) and try printing my results using the fetchall function, I get an error saying the DB is locked. What should I do now? I'm stuck here.

  11. RandomeJi

    is there a way to get the search keyword, page title and URL altogether? By using one SQL statement?

  12. Hi,
    I just tried your code and get the following error:

    sqlite3.OperationalError: no such table: urls

    Has the name of the table changed?

    • No, the table names haven’t changed as of V 55.0.2883.87. You can list all the tables by using the following snippet.

      #list all tables
      list_tables = "SELECT name FROM sqlite_master WHERE type='table';"
      cursor.execute(list_tables)
      print cursor.fetchall()

      The urls table exists.
      Chrome tables

  1. August 16, 2016

    […] take our previous Python code that we used to analyze our browsing history and tinker it a bit to steal our own passwords from […]
