The CSO Classifier is an application that classifies the content of scientific papers (i.e., full-text, abstract, and title) according to the Computer Science Ontology (CSO). Specifically, given a research paper, the classifier takes text from its title, abstract, or full-text as input and outputs a list of relevant concepts from CSO. It does so by mapping the n-grams in the text to concepts in CSO and then inferring their super-concepts. It accepts four optional parameters:
min_similarity, which controls the minimum similarity value for mapping n-grams to concepts.
infer_super_topics, which controls whether the classifier infers, for a given topic (e.g., Linked Data), only its direct super-topics (e.g., Semantic Web) or all of its super-topics (e.g., Semantic Web, WWW, Computer Science).
num_children, which controls the number of concepts necessary for inferring a super concept. For example, when this factor is set to three, the topic Semantic Web will be inferred if at least three of its sub-topics (e.g., OWL, RDF, Linked Data) are present.
verbose, a flag that controls the verbosity level of the result.
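How num_children and infer_super_topics interact can be sketched as follows. This is a minimal illustration, not the classifier's actual code: the BROADER dictionary is a hypothetical toy fragment of the ontology, and the function name infer_supers and its all_levels argument are assumptions made for the example. It also assumes that, when climbing more than one level, an inferred super-topic counts as a present topic for the next level.

```python
from collections import Counter

# Hypothetical toy fragment of the ontology: topic -> its direct super-topics.
BROADER = {
    "owl": ["semantic web"],
    "rdf": ["semantic web"],
    "linked data": ["semantic web"],
    "semantic web": ["www"],
    "www": ["computer science"],
}

def infer_supers(found, num_children=1, all_levels=False):
    """Return the super-topics supported by at least `num_children` present topics.

    With all_levels=False only direct super-topics are inferred; with
    all_levels=True the inference is repeated until nothing new is found.
    """
    known = set(found)
    inferred = set()
    while True:
        # Count, for each super-topic, how many of its sub-topics are present.
        votes = Counter()
        for topic in known:
            for parent in BROADER.get(topic, []):
                votes[parent] += 1
        new = {p for p, n in votes.items() if n >= num_children and p not in known}
        if not new:
            break
        inferred |= new
        known |= new
        if not all_levels:
            break
    return inferred
```

With the three sub-topics OWL, RDF, and Linked Data present and num_children set to three, the sketch infers Semantic Web; with only two of them present it infers nothing.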
The CSO Classifier removes English stop words and collects unigrams, bigrams, and trigrams. Then, for each n-gram, it computes the Levenshtein similarity with the labels of the topics in CSO. Research topics whose similarity with an n-gram is equal to or higher than the minimum similarity threshold are added to the final set of topics. To further enrich the set of inferred topics, the CSO Classifier also infers their super-topics by exploiting the skos:broaderGeneric relationships within CSO [1]. The output of this process can contain equivalent topics linked by relatedEquivalent relationships in CSO, e.g., Ontology Matching and Ontology Mapping. Therefore, the CSO Classifier also cleans up these redundant concepts by preserving only one of them.
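The matching step described above can be sketched in a few lines of pure Python. This is an illustrative approximation under stated assumptions, not the classifier's implementation: the stop-word list is a tiny stand-in, the similarity is assumed to be the edit distance normalized by the longer string's length, the 0.9 default threshold is chosen only for the example, and all function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"a", "and", "for", "in", "is", "of", "on", "the", "to", "with"}

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Edit distance normalized to [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def ngrams(tokens, max_n=3):
    """Yield all unigrams, bigrams, and trigrams as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def match_topics(text, labels, min_similarity=0.9):
    """Map each matched topic label to its best (n-gram, score) pair."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOP_WORDS]
    matches = {}
    for gram in ngrams(tokens):
        for label in labels:
            score = similarity(gram, label)
            if score >= min_similarity and score > matches.get(label, (None, 0.0))[1]:
                matches[label] = (gram, score)
    return matches

def drop_equivalents(topics, canonical):
    """Collapse equivalent topics onto one representative.

    `canonical` is a hypothetical mapping from a topic to the equivalent
    topic that should be kept.
    """
    return {canonical.get(t, t) for t in topics}
```

For instance, calling match_topics on "We publish linked data on the semantic web" with the labels "semantic web", "linked data", and "ontology matching" matches the first two labels exactly and leaves the third unmatched. Comparing every n-gram with every label is quadratic, which is acceptable for a sketch but would need indexing or pruning on the full ontology.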
The algorithm produces two kinds of results, depending on the verbose parameter. When it is set to true, the algorithm returns a detailed list of topics, together with the matched n-grams and the computed similarity scores. Conversely, when verbose is set to false, the algorithm returns a more concise list of topics.
More info: http://oro.open.ac.uk/55908/