TL;DR: we need to load the labelled data and train a model to predict a label from text input.
We can start with a simple model using Support Vector Machines (SVMs) before figuring out how to do this by fine-tuning a language model using transformers.
need to create a lower-case version of the title, without spaces or punctuation, to allow for merging
need to get rid of documents without abstracts because those can’t be used for learning
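The two preprocessing steps above could look like this (a sketch using a toy in-memory frame; the real labelled data, its path, and its column names are assumptions):

```python
import re
import pandas as pd

# toy stand-in for the labelled export (hypothetical columns)
labelled = pd.DataFrame({
    "title": ["A Study of Impact!", "No Abstract Here"],
    "abstract": ["some text", None],
})
# merge key: lower-case title with all non-word characters removed
labelled["title_lcase"] = labelled["title"].apply(lambda x: re.sub(r"\W", "", str(x)).lower())
# documents without abstracts can't be used for learning
labelled = labelled.dropna(subset=["abstract"])
print(labelled["title_lcase"].tolist())  # ['astudyofimpact']
```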
then we load the OpenAlex data and create the same title variable for merging
import re
import pandas as pd

# load the OpenAlex export; rename "id" so it survives merging with the labelled data
oa_data = pd.read_csv("data/openalex_data.csv").rename(columns={"id": "OA_id"})
# merge key: lower-case title with all non-word characters stripped
oa_data["title_lcase"] = oa_data["title"].apply(lambda x: re.sub(r"\W", "", x).lower())
# drop documents without abstracts - they can't be used for learning
oa_data = oa_data.dropna(subset=["abstract"])
oa_data["seen"] = 0  # 0 marks rows that have not been labelled yet
print(oa_data.shape)
oa_data.head()
add the OpenAlex rows which are not in the labelled data to the labelled data
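One way to do this is to filter on the `title_lcase` merge key before concatenating (a sketch on toy frames; the real frames will have many more columns):

```python
import pandas as pd

# toy frames standing in for the labelled data and the OpenAlex data
labelled = pd.DataFrame({"title_lcase": ["paperone"], "abstract": ["a1"], "seen": [1]})
oa_data = pd.DataFrame({"title_lcase": ["paperone", "papertwo"], "abstract": ["a1", "a2"], "seen": [0, 0]})

# keep only OpenAlex rows whose merge key is not already in the labelled set
new_rows = oa_data[~oa_data["title_lcase"].isin(labelled["title_lcase"])]
combined = pd.concat([labelled, new_rows], ignore_index=True)
print(combined.shape)  # (2, 3)
```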
Figure out how many documents have been labelled and how many have not
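With the `seen` flag from above, this is a single `value_counts` call (assuming the convention that `seen == 1` means already labelled):

```python
import pandas as pd

# toy combined frame: 1 = labelled, 0 = not yet labelled
combined = pd.DataFrame({"seen": [1, 1, 0, 0, 0]})
counts = combined["seen"].value_counts()
print(counts)
```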
plot how many examples we have of each impact type
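A minimal version of that plot, assuming a hypothetical `impact_type` column holding one label per document:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import pandas as pd

# hypothetical impact-type labels
df = pd.DataFrame({"impact_type": ["health", "policy", "health", "economic", "health"]})
counts = df["impact_type"].value_counts()
ax = counts.plot(kind="bar", title="Examples per impact type")
print(counts.to_dict())
```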
predict the binary inclusion label
move on to the impact type (multilabel output → from the graph above)
encode text input numerically → for use in models
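The multilabel target in the list above could be encoded with scikit-learn's `MultiLabelBinarizer` (a sketch assuming each document carries a list of impact types):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical per-document impact-type lists
impact_types = [["health"], ["health", "policy"], ["economic"]]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(impact_types)  # one binary column per impact type
print(mlb.classes_)  # ['economic' 'health' 'policy']
print(Y)
```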
^^ code examples above from scikit-learn
each document is a row, each column is a feature
SVMs work by finding a hyperplane in a multi-dimensional space that separates samples of different classes.
→ example: the document-feature matrix above defines that multi-dimensional space (one dimension per feature)
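Putting the pieces together, a first pass at the binary inclusion model could be a TF-IDF + linear SVM pipeline (a sketch on an invented four-document dataset; the texts and labels are purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# tiny illustrative dataset (hypothetical labels: 1 = include, 0 = exclude)
texts = [
    "impact on patient health outcomes",
    "economic impact of the research programme",
    "routine administrative report",
    "internal meeting minutes",
]
labels = [1, 1, 0, 0]

# each document becomes a row in a TF-IDF matrix; the SVM finds a separating hyperplane
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["impact of research on health"]))
```

The same pipeline shape carries over to the multilabel impact-type step by wrapping the classifier in `OneVsRestClassifier`.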