Playing with Stack Overflow

October 1, 2016

How similar are Stack Overflow users? Let’s measure the similarity of users by looking at the types of questions they answer.

I downloaded a data dump of all Stack Overflow posts from 2015 (a 10 GB XML file). The file is too large to load into memory in one go with read(). Don’t try it, or you’ll probably get a memory error like I did.

A better solution is to parse out the questions posted in 2015 and build a pandas DataFrame with 4 columns: Id, CreationDate, OwnerUserId, and the first tag in Tags. Then we can save the DataFrame to a csv file using to_csv(). Alternatively, you could write the parsed rows straight to a csv file and then create the DataFrame using read_csv().

Here are the libraries I use for analyzing large files:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import datetime
import xml.etree.ElementTree as et
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import sklearn.metrics as metrics
import seaborn as sns
import csv
%matplotlib inline

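To iterate over a huge XML file without loading all of it at once, I recommend the iterparse() function in the xml.etree.ElementTree module. Instead of building the entire tree up front, iterparse() yields elements as they are parsed, so we can read the fields we want and then clear the elements we no longer need. This is how I iterated through the XML file and extracted those fields.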

# create iterator over elem tree 
itertree = iter(et.iterparse('stackoverflow-posts-2015.xml', ('start', 'end')))

# get root elem
event, root = next(itertree)

questionPost = {}
questionPosts = []

for event, elem in itertree:
    if event == 'start' and elem.tag == 'row' and elem.get('PostTypeId','') == '1':
        # reinit dict for new row
        questionPost = {}    
        # get question Id
        questionPost['Id'] = elem.get('Id','')
        # get CreationDate
        questionPost['CreationDate'] = elem.get('CreationDate','')
        # get OwnerUserId
        questionPost['OwnerUserId'] = elem.get('OwnerUserId','')
        # get first in Tags
        t = elem.get('Tags','')
        questionPost['Tags'] = t[t.find("<")+1:t.find(">")]
        # add current post row to aggregate list of question posts
        questionPosts.append(questionPost) 
    # clear root to prevent excess references in memory
    root.clear() 
 
question_df = pd.DataFrame(questionPosts, columns=['Id', 'CreationDate', 'OwnerUserId', 'Tags'])
question_df.to_csv('question_dataframe.csv', index=False)
qdf = pd.read_csv('question_dataframe.csv')
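To see the streaming pattern in isolation, here is a minimal sketch that runs on a tiny in-memory sample instead of the real dump. The sample XML and the helper names first_tag and iter_questions are made up for illustration; only the attribute names (Id, PostTypeId, OwnerUserId, Tags) come from the code above. Unlike the loop above, this version reads attributes on the 'end' event and guards against rows without tags.

```python
import io
import xml.etree.ElementTree as et

# Hypothetical three-row sample in the same shape as the posts dump.
SAMPLE = b"""<posts>
  <row Id="1" PostTypeId="1" OwnerUserId="10" Tags="&lt;python&gt;&lt;pandas&gt;" />
  <row Id="2" PostTypeId="2" OwnerUserId="11" ParentId="1" />
  <row Id="3" PostTypeId="1" OwnerUserId="12" Tags="" />
</posts>"""


def first_tag(tags):
    """Return the first tag of a string like '<python><pandas>', or '' if none."""
    start, end = tags.find('<'), tags.find('>')
    if start == -1 or end == -1:
        return ''
    return tags[start + 1:end]


def iter_questions(source):
    """Yield one dict per question row without keeping the whole tree in memory."""
    context = iter(et.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == 'row':
            if elem.get('PostTypeId') == '1':
                yield {
                    'Id': elem.get('Id', ''),
                    'OwnerUserId': elem.get('OwnerUserId', ''),
                    'Tags': first_tag(elem.get('Tags', '')),
                }
            # drop rows that were already processed so memory stays flat
            root.clear()


questions = list(iter_questions(io.BytesIO(SAMPLE)))
print(questions)
```

Writing the loop as a generator means the rows could also be written straight to a csv file one at a time, rather than collected in a list first.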

Here is part of the question DataFrame (qdf) we’re working with. Notice that it consists of 2,530,504 rows.

Now that we’ve gathered information on the Stack Overflow questions from 2015, we can start looking at answers. Let’s extract information from the answer rows using the same method we used for questions.

itertree2 = iter(et.iterparse('stackoverflow-posts-2015.xml', ('start', 'end')))
event, root = next(itertree2)

answerers = []
for event, elem in itertree2:
    # get answerer information if this is an answer post
    if event == 'start':
        if elem.tag == 'row' and elem.get('PostTypeId','') == '2':
            answer = {}
            answer['AnswererId'] = elem.get('OwnerUserId','')
            answer['ParentId'] = str(int(elem.get('ParentId', '')))
            answerers.append(answer)        
    root.clear()

answer_df = pd.DataFrame(answerers, columns=['AnswererId', 'ParentId', "Tags"])
answer_df.to_csv('answer_dataframe.csv', index=False)
adf = pd.read_csv('answer_dataframe.csv')

Now let’s rank the top 100 users by the number of answer posts.

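# look up each answer's parent question by Id and copy over its first tag;
# answers whose parent question is not in qdf (e.g. asked before 2015) get NaN
adf['Tags'] = adf['ParentId'].map(qdf.set_index('Id')['Tags'])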

adfRanked = pd.DataFrame(adf.groupby('AnswererId').size().rename('AnswerCount'), columns=['AnswerCount', 'Tags']).nlargest(100, 'AnswerCount')
adf = adf.groupby('AnswererId').aggregate(lambda t: list(t))
adfRanked['Tags'] = adf['Tags']
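The join and ranking step is easier to follow on toy data. The sketch below uses made-up question and answer rows (the values are hypothetical, the column names match the ones above) and shows one way to attach each answer’s parent tag by Id and then rank answerers. It uses pandas named aggregation, which is a different route from the code above and assumes a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical toy data with the same columns as qdf and adf.
qdf = pd.DataFrame({'Id': [1, 2, 3], 'Tags': ['python', 'java', 'python']})
adf = pd.DataFrame({
    'AnswererId': [7, 7, 8, 7, 9, 8],
    'ParentId':   [3, 1, 2, 99, 2, 1],  # 99 has no matching question
})

# Map each answer to the first tag of its parent question (NaN if not found).
tag_by_question = qdf.set_index('Id')['Tags']
adf['Tags'] = adf['ParentId'].map(tag_by_question)

# Count answers per user, collect their tags, and keep the top 2 answerers.
ranked = (
    adf.groupby('AnswererId')
       .agg(AnswerCount=('ParentId', 'size'), Tags=('Tags', list))
       .nlargest(2, 'AnswerCount')
)
print(ranked)
```

Matching on the Id value rather than on row position matters here: the two DataFrames have different lengths and orderings, so a positional assignment would pair answers with unrelated questions.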

I compare users based on the types of questions they answer. Each user can be characterized by the tags of the questions they answered and how often each tag appears. Once we have the collection of tags for each user in 2015, feature extraction and vectorization turn this categorical tag data into numeric count vectors. We can then use the Euclidean distance between those vectors as a measure of dissimilarity: the smaller the distance, the more similar the two users.

# one string of tags per user; a list keeps one row per user even if two users share identical tags
tags = [str(t) for t in adfRanked['Tags']]

# Vectorize tags for comparison
v = CountVectorizer()
x = v.fit_transform(tags).toarray()

# Perform kmeans clustering
kmeans = KMeans(init='k-means++', n_clusters=3, n_init=100)
kmeans.fit_predict(x)

# Use Euclidean distance to assess user similarity
eucldists = metrics.euclidean_distances(x)
labels = kmeans.labels_
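For intuition about what the vectorize-then-measure step computes, here is a small hand-rolled sketch using only the standard library. The three users and their tags are invented for illustration, and this is not what CountVectorizer does internally: by default CountVectorizer tokenizes on word boundaries and drops single-character tokens, so tags such as c or c# may be counted differently there.

```python
import math
from collections import Counter

# Hypothetical users and the first tags of the questions they answered.
user_tags = {
    'user_a': ['python', 'python', 'pandas'],
    'user_b': ['python', 'pandas', 'pandas'],
    'user_c': ['java', 'java', 'java'],
}

# Shared vocabulary: one vector position per distinct tag.
vocabulary = sorted({tag for tags in user_tags.values() for tag in tags})


def count_vector(tags):
    """Turn a list of tags into a vector of counts over the vocabulary."""
    counts = Counter(tags)
    return [counts[tag] for tag in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


vectors = {user: count_vector(tags) for user, tags in user_tags.items()}
dist_ab = euclidean(vectors['user_a'], vectors['user_b'])
dist_ac = euclidean(vectors['user_a'], vectors['user_c'])
print(vocabulary, vectors, dist_ab, dist_ac)
```

Users a and b answer the same kinds of questions in different proportions, so they end up much closer to each other than either is to user c.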

Now we can plot the pairwise distances between the top 100 users as a heatmap.

fig, ax1 = plt.subplots(1,1,figsize=(8,8))
stackheatmap = sns.heatmap(eucldists, xticklabels=25, yticklabels=25, linewidths=0, cbar=False, ax=ax1)
Fig. 1 - Heatmap of Top 100 Users.

It seems like the 74th top user is very different from the rest of the top 100 users. After investigating this further, I found that this user almost exclusively answered NaN tagged questions! I guess that’s one way to become a top answerer.