This dataset contains the features extracted from the Stack Overflow dataset retrieved from https://archive.org/details/stackexchange. The features were used for the paper:
D.V. van Dijk, M. Tsagkias & M. de Rijke (2015). Early detection of topical expertise in community question and answering. In SIGIR 2015: 38th international ACM SIGIR conference on Research and development in information retrieval.
author = "Dijk, D.V. van, Tsagkias, M. \& Rijke, M. de",
title = "Early detection of topical expertise in community question and answering",
booktitle = "SIGIR 2015: 38th international ACM SIGIR conference on Research and development in information retrieval",
year = "2015",
doi = "http://dx.doi.org/10.1145/2766462.2767840"
The archive file (144.5 MB) contains this README and a dataset file (565 MB when extracted).
The dataset file is a text dump of the features table, which holds 3,619,440 rows.
Each line in the dataset corresponds to a topic and a user for a given period. A period is either the time between the moment a user joins the forum and the moment they provide a best answer (static), or the time between best answers (dynamic). Features for each line are computed over the content available for a user on a topic within that period. Each line consists of 32 tab-separated fields (7 metadata fields followed by 25 feature fields).
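The line layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' loading code: the function name split_record is hypothetical, and it assumes the 7 metadata fields come first on each line (as the field positions in this README suggest). Values are kept as strings because the README does not state how numbers or missing values are written in the dump.

```python
N_METADATA = 7   # fields 1-7 (assumed to be the leading fields)
N_FEATURES = 25  # fields 8-32


def split_record(line):
    """Split one tab-separated dataset line into (metadata, features).

    Both parts are returned as lists of strings; no type conversion is
    attempted, since the exact value formats are not documented here.
    """
    fields = line.rstrip("\r\n").split("\t")
    expected = N_METADATA + N_FEATURES
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    return fields[:N_METADATA], fields[N_METADATA:]
```

Checking the field count on every line is a cheap way to notice a truncated or corrupted extraction early.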
The table contains the following fields:
Field position | Short name | Description
1 | Topic identifier | Identifier of the topic
2 | User identifier | Identifier of the user
3 | Best answer nr | The N-th best answer provided by a user for the topic since joining the forum
4 | Static or temp | Static: period from the moment of joining until the best answer is provided; temp: period between best answers (the dynamic setting)
5 | Expert or not | 1 if the user has provided more than 9 best answers on the topic, else 0
6 | Start date | Start date of the period over which the features are extracted, which relates to fields 3 and 4
7 | End date | End date of the period over which the features are extracted, which relates to fields 3 and 4
8 | LM | Model 2 using language modeling scoring
9 | BM25 | Model 2 using BM25 scoring
10 | TFIDF | Model 2 using tf.idf scoring
11 | Question | Number of questions by a user
12 | Answer | Number of answers by a user
13 | Comment | Number of comments by a user
14 | Z-Score | Question-answering ratio
15 | Q.-A. | Nr. of questions divided by nr. of answers
16 | A.-C. | Nr. of answers divided by nr. of comments
17 | C.-Q. | Nr. of comments divided by nr. of questions
18 | First Answer | Number of first answers a user has posted
19 | Timely Answer | Nr. of answers posted within 4h by a user
20 | Time Interval | Days between joining and N-th best answer
21 | LM/T | LM / Time interval
22 | BM25/T | BM25 / Time interval
23 | TFIDF/T | TFIDF / Time interval
24 | Question/T | Question / Time interval
25 | Answer/T | Answer / Time interval
26 | Comment/T | Comment / Time interval
27 | Z-Score/T | Z-Score / Time interval
28 | Q.-A./T | Q.-A. / Time interval
29 | A.-C./T | A.-C. / Time interval
30 | C.-Q./T | C.-Q. / Time interval
31 | First Answer/T | First Answer / Time interval
32 | Timely Answer/T | Timely Answer / Time interval
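Fields 21 to 32 in the table above are the features in fields 8 to 19 divided by the time interval in field 20. The sketch below illustrates that relation. It is an illustration under stated assumptions, not the code used for the paper: the names BASE_FEATURES and rate_features are hypothetical, the inputs are assumed to be already parsed into floats, and returning None for a zero time interval is an assumption, since this README does not say how that case is stored.

```python
# Short names of fields 8-19, in table order.
BASE_FEATURES = [
    "LM", "BM25", "TFIDF", "Question", "Answer", "Comment",
    "Z-Score", "Q.-A.", "A.-C.", "C.-Q.", "First Answer", "Timely Answer",
]


def rate_features(base, time_interval):
    """Compute the "/T" features (fields 21-32) from fields 8-19.

    base: dict mapping each name in BASE_FEATURES to a float.
    time_interval: value of field 20 (days).
    """
    if time_interval == 0:
        # Assumption: the handling of a zero interval is not documented.
        return {name + "/T": None for name in BASE_FEATURES}
    return {name + "/T": base[name] / time_interval for name in BASE_FEATURES}
```

For example, a user with 6 answers over a 3-day interval gets an Answer/T value of 2.0. Recomputing these values from fields 8 to 20 can serve as a consistency check after loading the file.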
If you require more details or have enquiries about the dataset, please contact the principal author, David van Dijk at email@example.com.
Dataset released on 15/05/2015 based on data from 08/2008 to 09/2014.