looking for data

Kaggle doesn’t have comprehensive data for this kind of project. Using OpenAlex - it’s an open source and comprehensive catalog of scholarly papers and related things (authors, institutions, sources)

has a nice API that I think I can just call from my code, and the documentation is good too


since we’re trying to reproduce the previous study → we won’t be able to do it as well, but that’s okay

Boolean query → API call for OpenAlex

generally good practice to search for literature with boolean queries → lists of search terms joined by AND or OR operators
- one or more blocks of terms capturing different concepts, linked with “AND” operators so that results must contain one or more matches from each block.
- terms within blocks are linked by “OR” operators. in the example there are three blocks (one with climate, one with impacts, and one with attribution)
  - query developed by adding words to the query until it returned every available record from a list of references known to be relevant
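
rough sketch of how the blocks turn into one query string. The terms are placeholders I made up (not the actual query from the study), and `build_boolean_query` is just a helper name for these notes:

```python
def build_boolean_query(blocks):
    """Join terms with OR inside each block, then join the blocks with AND."""
    grouped = []
    for terms in blocks:
        # Quote multi-word terms so they are treated as phrases.
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        grouped.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(grouped)


# Placeholder terms, one list per concept block.
blocks = [
    ["climate", "global warming"],
    ["impact", "effect"],
    ["attribution", "attributed"],
]
query = build_boolean_query(blocks)
print(query)
```

a record only matches if it hits at least one term in every block, so adding a term to a block widens the search and adding a block narrows it.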
Getting results with the OpenAlex API
- query the API using requests
  - requests lets you send HTTP requests - you don’t need to manually add query strings to your URLs or form-encode the PUT and POST data - you can use the json method
JSON → store temporary data
- “meta” object says that there are more than 10,000 results for the query
  - to get them all, we use pagination → cursor paging
```
# first work in the parsed JSON response (res)
result = res["results"][0]
result
```
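
minimal sketch of the cursor paging mentioned above, assuming the `/works` endpoint with the `filter`, `per-page` and `cursor` parameters and a `next_cursor` field inside `meta` (that is how I understand the OpenAlex docs, worth double checking). `fetch_json` and `iter_works` are names I made up, and `max_pages` is my own addition for stopping early:

```python
API_URL = "https://api.openalex.org/works"


def fetch_json(url, params):
    """Send a GET request and return the parsed JSON body."""
    import requests  # third-party; imported here so the paging logic runs without it

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def iter_works(filter_str, max_pages=None, fetch=fetch_json):
    """Yield works one by one, following the cursor from page to page."""
    cursor = "*"  # "*" asks for the first page
    pages = 0
    while cursor and (max_pages is None or pages < max_pages):
        params = {"filter": filter_str, "per-page": 200, "cursor": cursor}
        res = fetch(API_URL, params)
        if not res["results"]:
            break  # an empty page means there is nothing left
        yield from res["results"]
        cursor = res["meta"].get("next_cursor")  # None once results run out
        pages += 1
```

something like `works = list(iter_works("title.search:climate", max_pages=50))` would then collect at most 50 pages. Passing a different `fetch` function makes it possible to try the loop without hitting the network.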
need to collect relevant information → for this project I need a CSV with a row for each record and basic info (title, abstract, authorship, and publication year)
- abstract is stored in an inverted index lol
  - use function →
    1. abstractInvertedIndex is a dictionary of word:[indices]. From this dictionary, first get a list of (word,index) pairs
      
      word_index = []
      for k, v in abstractInvertedIndex.items():
          for index in v:
              word_index.append([k, index])
    2. Now sort this list word_index to retain index order
      
      word_index = sorted(word_index,key = lambda x : x[1])
    3. And finally join only the words from word_index list with a space
    from: https://stackoverflow.com/questions/72093757/running-python-loop-to-iterate-and-undo-inverted-index
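
the three steps above put together as one function, as a sketch. `abstract_from_inverted_index` is my own name for it, and the example dictionary is made up; the empty-input check is my addition, on the assumption that some works come back without an abstract:

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild the abstract text from a {word: [positions]} dictionary."""
    if not inverted_index:
        return ""  # nothing to rebuild
    # Step 1: flatten into (word, position) pairs.
    word_index = [
        (word, position)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Step 2: sort by position to restore reading order.
    word_index.sort(key=lambda pair: pair[1])
    # Step 3: keep only the words and join them with spaces.
    return " ".join(word for word, _ in word_index)


example = {"the": [0, 3], "cat": [1], "chased": [2], "mouse": [4]}
print(abstract_from_inverted_index(example))  # the cat chased the mouse
```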
I need to cycle through the works and put them into a list of dicts
- then turn it into a DataFrame → two-dimensional, mutable, tabular data (it’s kind of like a spreadsheet)
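
sketch of the works → list of dicts → CSV step using only the standard library. The record below is made up but shaped like an OpenAlex work (`title`, `abstract_inverted_index`, `authorships`, `publication_year`); the helper names and the "; " separator for authors are my own choices:

```python
import csv
import io

FIELDS = ["title", "abstract", "authors", "publication_year"]


def rebuild_abstract(inverted_index):
    """Turn a {word: [positions]} dict back into text ('' if missing)."""
    pairs = [
        (pos, word)
        for word, positions in (inverted_index or {}).items()
        for pos in positions
    ]
    return " ".join(word for _, word in sorted(pairs))


def work_to_row(work):
    """Keep only the fields needed for the CSV."""
    authors = [
        a.get("author", {}).get("display_name", "")
        for a in work.get("authorships", [])
    ]
    return {
        "title": work.get("title"),
        "abstract": rebuild_abstract(work.get("abstract_inverted_index")),
        "authors": "; ".join(authors),
        "publication_year": work.get("publication_year"),
    }


def rows_to_csv(rows):
    """Write the list of dicts as CSV text, one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up record for illustration only.
works = [{
    "title": "An example paper",
    "abstract_inverted_index": {"Hello": [0], "world": [1]},
    "authorships": [{"author": {"display_name": "A. Author"}}],
    "publication_year": 2020,
}]
rows = [work_to_row(w) for w in works]
print(rows_to_csv(rows))
```

the same `rows` list can go straight into pandas with `pd.DataFrame(rows)`, and `df.to_csv("works.csv", index=False)` writes the file (the filename is just an example).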
cycling through for abstract search as well
- problem: takes many hours to run it for abstract search since the abstracts tend to be significantly longer than the titles, so we limit it to the first 50 pages of results

result: data is in convenient format yay

extra: we can explore the data by plotting things like the number of publications in each year, or even the number of publications with a specific commonly used keyword. Numeric data is the best for plotting, so the publication_year data would work best for this situation
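
small sketch of the per-year count that a plot would be built from, assuming each row is a dict with a `publication_year` key. The rows are made up and `publications_per_year` is a name I picked:

```python
from collections import Counter


def publications_per_year(rows):
    """Count records per publication year, skipping rows with no year."""
    counts = Counter(
        r["publication_year"]
        for r in rows
        if r.get("publication_year") is not None
    )
    return dict(sorted(counts.items()))  # sorted so the x-axis is in order


# Made-up rows for illustration.
rows = [
    {"title": "a", "publication_year": 2019},
    {"title": "b", "publication_year": 2020},
    {"title": "c", "publication_year": 2020},
    {"title": "d", "publication_year": None},
]
per_year = publications_per_year(rows)
print(per_year)  # {2019: 1, 2020: 2}
```

with matplotlib installed, `plt.bar(list(per_year), list(per_year.values()))` would turn this into a bar chart of publications per year.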