TL;DR: we need to load the labelled data and train a model to predict a label from text input.
We can start with a simple model using Support Vector Machines (SVMs) before figuring out how to do this by fine-tuning a language model using transformers.
need to create a lower-case version of the title, without spaces or punctuation, to allow for merging
need to get rid of documents without abstracts because those can’t be used for learning
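The two preprocessing steps above could look like this (a sketch using a toy in-memory frame; the real labelled data, its path, and its column names are assumptions):

```python
import re
import pandas as pd

# toy stand-in for the labelled export (hypothetical columns)
labelled = pd.DataFrame({
    "title": ["A Study of Impact!", "No Abstract Here"],
    "abstract": ["some text", None],
})
# merge key: lower-case title with all non-word characters removed
labelled["title_lcase"] = labelled["title"].apply(lambda x: re.sub(r"\W", "", str(x)).lower())
# documents without abstracts can't be used for learning
labelled = labelled.dropna(subset=["abstract"])
print(labelled["title_lcase"].tolist())  # ['astudyofimpact']
```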
then we load the OpenAlex data and create the same title variable for merging
import re
import pandas as pd

# load the OpenAlex export; rename "id" so it survives merging with the labelled data
oa_data = pd.read_csv("data/openalex_data.csv").rename(columns={"id": "OA_id"})
# merge key: lower-case title with all non-word characters stripped
oa_data["title_lcase"] = oa_data["title"].apply(lambda x: re.sub(r"\W", "", x).lower())
# drop documents without abstracts - they can't be used for learning
oa_data = oa_data.dropna(subset=["abstract"])
oa_data["seen"] = 0  # 0 marks rows that have not been labelled yet
print(oa_data.shape)
oa_data.head()
add the OpenAlex rows which are not in the labelled data to the labelled data
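One way to do this is to filter on the `title_lcase` merge key before concatenating (a sketch on toy frames; the real frames will have many more columns):

```python
import pandas as pd

# toy frames standing in for the labelled data and the OpenAlex data
labelled = pd.DataFrame({"title_lcase": ["paperone"], "abstract": ["a1"], "seen": [1]})
oa_data = pd.DataFrame({"title_lcase": ["paperone", "papertwo"], "abstract": ["a1", "a2"], "seen": [0, 0]})

# keep only OpenAlex rows whose merge key is not already in the labelled set
new_rows = oa_data[~oa_data["title_lcase"].isin(labelled["title_lcase"])]
combined = pd.concat([labelled, new_rows], ignore_index=True)
print(combined.shape)  # (2, 3)
```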
Figure out how many documents have been labelled and how many have not
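With the `seen` flag from above, this is a single `value_counts` call (assuming the convention that `seen == 1` means already labelled):

```python
import pandas as pd

# toy combined frame: 1 = labelled, 0 = not yet labelled
combined = pd.DataFrame({"seen": [1, 1, 0, 0, 0]})
counts = combined["seen"].value_counts()
print(counts)
```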
plot how many examples we have of each impact type
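A minimal version of that plot, assuming a hypothetical `impact_type` column holding one label per document:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import pandas as pd

# hypothetical impact-type labels
df = pd.DataFrame({"impact_type": ["health", "policy", "health", "economic", "health"]})
counts = df["impact_type"].value_counts()
ax = counts.plot(kind="bar", title="Examples per impact type")
print(counts.to_dict())
```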
predict the binary inclusion label
move on to the impact type (multilabel output → from the graph above)
encode text input numerically → for use in models
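The multilabel target in the list above could be encoded with scikit-learn's `MultiLabelBinarizer` (a sketch assuming each document carries a list of impact types):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical per-document impact-type lists
impact_types = [["health"], ["health", "policy"], ["economic"]]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(impact_types)  # one binary column per impact type
print(mlb.classes_)  # ['economic' 'health' 'policy']
print(Y)
```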
^^ code examples above from scikit-learn
each document is a row, each column is a feature
SVMs work by finding a hyperplane in a multi-dimensional space that separates samples of different classes.
→ example: the document-feature matrix above defines that multi-dimensional space (one dimension per feature)
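Putting the pieces together, a first pass at the binary inclusion model could be a TF-IDF + linear SVM pipeline (a sketch on an invented four-document dataset; the texts and labels are purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# tiny illustrative dataset (hypothetical labels: 1 = include, 0 = exclude)
texts = [
    "impact on patient health outcomes",
    "economic impact of the research programme",
    "routine administrative report",
    "internal meeting minutes",
]
labels = [1, 1, 0, 0]

# each document becomes a row in a TF-IDF matrix; the SVM finds a separating hyperplane
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["impact of research on health"]))
```

The same pipeline shape carries over to the multilabel impact-type step by wrapping the classifier in `OneVsRestClassifier`.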