CORPUS

To develop your software, we provide a training data set of source code written in Java. Each program is labelled with the personality traits of its author, with each trait scored on a continuous scale from 20 to 80.

Click here to download the training corpus (The file is password-protected. To obtain the password, you need to register first)

Your software must generate a file containing one line per document in the dataset. Each line holds the following comma-separated fields (the same format as the truth file provided in the training corpus):

Author id, emotional stability / neuroticism, extroversion, openness to experience, agreeableness, conscientiousness

For example, the following line:

5,74,38,46,42,46

corresponds to author 5, with 74 for emotional stability, 38 for extroversion, 46 for openness to experience, 42 for agreeableness, and 46 for conscientiousness.
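
The output format described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not an official tool of the task: the function names (format_line, parse_line, write_predictions) and the short trait names are hypothetical, the author id is treated as an opaque string, and the bounds 20 and 80 are assumed to be inclusive.

```python
# Hypothetical short names for the five traits, in the order the format requires.
TRAITS = (
    "emotional_stability",
    "extroversion",
    "openness",
    "agreeableness",
    "conscientiousness",
)

# Assumption: both ends of the 20-80 scale are valid scores.
MIN_SCORE, MAX_SCORE = 20, 80


def format_line(author_id, scores):
    """Build one output line: the author id followed by the five trait scores."""
    if len(scores) != len(TRAITS):
        raise ValueError(f"expected {len(TRAITS)} scores, got {len(scores)}")
    for trait, value in zip(TRAITS, scores):
        if not MIN_SCORE <= value <= MAX_SCORE:
            raise ValueError(f"{trait} score {value} is outside the 20-80 scale")
    return ",".join([str(author_id)] + [str(value) for value in scores])


def parse_line(line):
    """Split one line into the author id (kept as a string) and a trait-to-score dict."""
    fields = line.strip().split(",")
    if len(fields) != 1 + len(TRAITS):
        raise ValueError(f"expected {1 + len(TRAITS)} fields, got {len(fields)}")
    scores = {trait: float(field) for trait, field in zip(TRAITS, fields[1:])}
    return fields[0], scores


def write_predictions(path, predictions):
    """Write one line per (author_id, scores) pair to the file at path."""
    with open(path, "w", encoding="utf-8") as handle:
        for author_id, scores in predictions:
            handle.write(format_line(author_id, scores) + "\n")
```

For instance, format_line("5", [74, 38, 46, 42, 46]) reproduces the example line, and parse_line reads the same line back into a dictionary keyed by trait name.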

Test corpus

Click here to download the test corpus (The file is password-protected with the same password as the training set. To obtain the password, you need to register first)

Click here to download the ground truth for the test corpus (The file is password-protected with the same password as the training set. To obtain the password, you need to register first)

Author profiling consists of predicting an author’s demographics (e.g. age, gender, personality) from her writing. In the PR-SOCO shared task, we address the problem of predicting an author’s personality from her source code. Personality traits influence most, if not all, human activities: the way people write (Celli et al., 2014; Rangel et al., 2015), the way they interact with others, and the way they make decisions. For developers, this includes the criteria they consider when selecting a software project to participate in (Paruma-Parbón et al., 2016) and the way they write and structure their source code.