Kaggle doesn’t have comprehensive data for this kind of project. Using OpenAlex instead: it’s an open, comprehensive catalog of scholarly papers and related records (authors, institutions, and so on)
it has a nice API that I think I can just call from my code, and the documentation is good
since we’re trying to reproduce the previous study → we won’t be able to do it as well, but that’s okay
generally good practice to search for literature with boolean queries → lists of search terms joined by AND or OR operators
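A minimal sketch of building such a boolean query string, assuming synonyms are grouped together (OR inside a group, AND between groups). The function name `build_boolean_query` and the example search terms are hypothetical and not taken from the study being reproduced. OpenAlex expects the operators in uppercase.

```python
def build_boolean_query(term_groups):
    """Join synonyms with OR inside each group, then join the groups with AND."""
    clauses = []
    for group in term_groups:
        # Quote each term so multi-word phrases are searched as phrases.
        quoted = [f'"{term}"' for term in group]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical search terms, for illustration only.
example_query = build_boolean_query([
    ["machine learning", "deep learning"],
    ["ecology", "biodiversity"],
])
print(example_query)
# ("machine learning" OR "deep learning") AND ("ecology" OR "biodiversity")
```

Keeping the terms in lists makes it easy to add or drop a synonym without retyping the whole query.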
Getting results with the OpenAlex API
the API returns JSON → parse it into a dict to store the data temporarily
a single request only returns one page of results; to get them all, we use pagination → cursor paging
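A rough sketch of cursor paging against the works endpoint, using only the standard library. It assumes the query goes in the `search` parameter; the names `fetch_page` and `fetch_all_works` are my own, and the `get_page` argument exists only so the loop can be tried without network access. The first request sends `cursor=*`, and each response carries the next cursor in `meta.next_cursor`.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.openalex.org/works"


def fetch_page(params):
    """Request one page of works and parse the JSON body into a dict."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def fetch_all_works(query, get_page=fetch_page, per_page=200):
    """Keep following meta.next_cursor until the API stops returning one."""
    works = []
    cursor = "*"  # "*" asks for the first page of a cursor-paged query
    while cursor:
        res = get_page({"search": query, "per-page": per_page, "cursor": cursor})
        works.extend(res["results"])
        cursor = res["meta"].get("next_cursor")
    return works
```

Calling `fetch_page({"search": query, "per-page": 200, "cursor": "*"})` on its own returns a single page as a dict, which is handy for inspecting one record before downloading everything.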
result = res["results"][0]
result
need to collect the relevant information → for this project I need a CSV with a row for each record and basic info (title, abstract, authorship, and publication year)
use a function to rebuild the abstract → OpenAlex gives the abstract as an inverted index, not as plain text
abstractInvertedIndex is a dictionary of word:[indices]. From this dictionary, first get a list of (word,index) pairs
word_index = []
for k, v in abstractInvertedIndex.items():
    for index in v:
        word_index.append([k, index])
Now sort this list word_index to retain index order
word_index = sorted(word_index, key=lambda x: x[1])
And finally join only the words from the word_index list with a space
from: https://stackoverflow.com/questions/72093757/running-python-loop-to-iterate-and-undo-inverted-index
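Putting the three steps together, a small sketch of the whole function. The name `rebuild_abstract` and the example dictionary are made up for illustration. The empty-input check is my own addition, since some records come back with no abstract at all.

```python
def rebuild_abstract(abstractInvertedIndex):
    """Turn a word:[indices] dictionary back into the abstract text."""
    if not abstractInvertedIndex:
        return ""  # no abstract available for this record

    # Step 1: flatten the dictionary into [word, index] pairs.
    word_index = []
    for k, v in abstractInvertedIndex.items():
        for index in v:
            word_index.append([k, index])

    # Step 2: sort by index so the words are back in reading order.
    word_index = sorted(word_index, key=lambda x: x[1])

    # Step 3: keep only the words and join them with a space.
    return " ".join(word for word, _ in word_index)


# Made-up example: "in" appears at positions 3 and 6.
example = {
    "Despite": [0], "growing": [1], "interest": [2], "in": [3, 6],
    "open": [4], "data": [5], "science": [7],
}
print(rebuild_abstract(example))
# Despite growing interest in open data in science
```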
I need to cycle through the works and put them into a list of dicts
cycling through the abstract search results as well
result: the data is in a convenient format, yay
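A minimal sketch of that collection step, assuming each work is a dict with the `title`, `publication_year`, `authorships`, and `abstract_inverted_index` fields that OpenAlex returns. The helper names, the `"; "` separator for authors, and the use of the standard `csv` module are my own choices here, not necessarily what the project ends up using.

```python
import csv

FIELDS = ["title", "abstract", "authors", "publication_year"]


def rebuild_abstract(inverted_index):
    """Turn a word:[indices] dict back into text; empty string if missing."""
    if not inverted_index:
        return ""
    pairs = [(i, word) for word, idxs in inverted_index.items() for i in idxs]
    return " ".join(word for _, word in sorted(pairs))


def work_to_row(work):
    """Pick out the basic info for one work as a flat dict."""
    authors = [
        (a.get("author") or {}).get("display_name") or ""
        for a in work.get("authorships") or []
    ]
    return {
        "title": work.get("title"),
        "abstract": rebuild_abstract(work.get("abstract_inverted_index")),
        "authors": "; ".join(authors),
        "publication_year": work.get("publication_year"),
    }


def works_to_csv(works, path):
    """Build the list of dicts and write one CSV row per record."""
    rows = [work_to_row(w) for w in works]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Returning the rows as well as writing the file means the same list of dicts can be reused for quick checks without reading the CSV back in.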
extra: we can explore the data by plotting things like the number of publications per year in our search results, or the number of publications that contain a specific commonly used keyword. Numeric data is the easiest to plot, so the publication_year data would work best for this situation
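A small sketch of the per-year count, assuming the records are a list of dicts with a `publication_year` key. The function names are hypothetical, and the plotting part assumes matplotlib is installed. It is imported inside the plotting function so the counting still works without it.

```python
from collections import Counter


def publications_per_year(rows):
    """Count records per publication year, skipping rows with no year."""
    years = [
        int(row["publication_year"])  # int() also handles years read from a CSV
        for row in rows
        if row.get("publication_year")
    ]
    return dict(sorted(Counter(years).items()))


def plot_publications_per_year(counts):
    """Bar chart of the counts; assumes matplotlib is available."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_xlabel("Publication year")
    ax.set_ylabel("Number of publications")
    plt.show()
```

Usage would be something like `plot_publications_per_year(publications_per_year(rows))`, where `rows` is the list of dicts collected earlier.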