
Automating Scraping & Summarizing Google Scholar Articles with Scrapingdog Scholar API, Python & OpenAI


Research papers contain valuable insights, but manually analyzing and summarizing them can be time-consuming. In this article, we will scrape research papers from Google Scholar using Scrapingdog’s Google Scholar API and then summarize them using the OpenAI API. This sounds interesting, right?

By the end of this guide, you’ll have a streamlined workflow to fetch academic papers from Google Scholar and generate concise AI-powered summaries.

 

Requirements

I hope you have Python installed on your machine. If not, you can download it from here.

Create a folder with any name you like. We will keep our Python files inside this folder.

mkdir scholar

We’ll need a trial account on Scrapingdog to scrape research papers. Along with this, we need to sign up for OpenRouter, which will give us access to the OpenAI APIs. We are using OpenRouter because it provides a fallback strategy and lets us access almost every major LLM through a single API. You can sign up to get its API key.

Additionally, we must install the requests library for making an HTTP connection with the APIs.

pip install requests

Create a Python file. I am naming mine summary.py.
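
One small habit worth adopting: rather than hard-coding the API keys inside summary.py, you can read them from environment variables. Here is a minimal sketch; the variable names SCRAPINGDOG_API_KEY and OPENROUTER_API_KEY are just my own choice, not something either API requires.

import os

# Read the keys from environment variables, falling back to placeholders.
# The variable names below are arbitrary; pick whatever suits your setup.
SCRAPINGDOG_API_KEY = os.environ.get("SCRAPINGDOG_API_KEY", "your-api-key")
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "your-openrouter-key")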

 

Setup

This article is divided into three parts.

  1. First, we will use Scrapingdog’s Google Scholar API to pull JSON data on a particular topic. It is advisable to read the documentation before proceeding with the code. You can also watch this video to get a better idea of how the API works.

  2. Then we will use Scrapingdog’s web scraping API to scrape the content of that research paper.

  3. The last step will be to use the OpenRouter API to pull a 10-point summary of the research paper.

 

Scraping Google Scholar Data

We will analyze a research paper written on cancer. Once you open Scrapingdog’s dashboard, you will find a Google Scholar Scraper. You just need to fill in a query (cancer, in our case) in the form, and you will get the full Python code on the right.

Just copy this and paste it into the file summary.py.

 

import requests

api_key = "your-api-key"
url = "https://api.scrapingdog.com/google_scholar"

params = {
    "api_key": api_key,
    "query": "cancer",
    "language": "en",
    "page": 0,
    "results": 10
}

response = requests.get(url, params=params)

if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f"Request failed with status code: {response.status_code}")

 

The code is very simple, but let me explain it step by step.

  • We are making a GET request to the /google_scholar endpoint of Scrapingdog. Remember to use your own API key while making the request.
  • We are passing the API key, query, language, page number, and number of results as parameters (a pagination sketch follows this list if you ever need more than one page).
  • Then, if the request is successful, we print the results; if it fails, we print an error.
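
As a side note, the page and results parameters control pagination. If you need more than the first ten results, you could loop over pages and collect everything into one list. A minimal sketch, reusing the url and params from the script above:

all_results = []

for page in range(3):  # scrape the first three pages; adjust as needed
    params["page"] = page
    resp = requests.get(url, params=params)
    if resp.status_code == 200:
        all_results.extend(resp.json().get("scholar_results", []))
    else:
        print(f"Page {page} failed with status code: {resp.status_code}")

print(f"Collected {len(all_results)} results")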

Once you run summary.py, you will get a structured JSON response from the API.

The research paper link can be found as the value of the title_link key inside the scholar_results array. We will only consider the first object of this response, so we will analyze just the first research article from the JSON we got from the API.
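
To make that concrete, here is a small sketch of how you might pull the first paper’s link out of the parsed response. Apart from scholar_results and title_link, which the code above relies on, the exact fields can vary from query to query.

# `data` is the parsed JSON returned by the Google Scholar API call above
scholar_results = data.get("scholar_results", [])

if scholar_results:
    first_paper = scholar_results[0]
    paper_link = first_paper["title_link"]      # URL of the research paper
    print("Title:", first_paper.get("title"))   # assuming a 'title' field is present
    print("Link:", paper_link)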

 

Extracting Google Scholar Text

Now we will extract the text from the research paper using Scrapingdog’s Data Extraction API.

import requests

api_key = "your-api-key"
url = "https://api.scrapingdog.com/google_scholar"

params = {
    "api_key": api_key,
    "query": "cancer",
    "language": "en",
    "page": 0,
    "results": 10
}

response = requests.get(url, params=params)

if response.status_code == 200:
    print('got the research paper link')
    data = response.json()
    paper_link = data['scholar_results'][0]['title_link']
    # Let requests build the query string so the paper URL is properly encoded
    html_data = requests.get(
        "https://api.scrapingdog.com/scrape",
        params={"api_key": api_key, "url": paper_link}
    )
    if html_data.status_code == 200:
        print(html_data.text)
    else:
        print(f"Scrape request failed with status code: {html_data.status_code}")
else:
    print(f"Request failed with status code: {response.status_code}")

 

After getting a 200 status code from the Scholar API, we hit the /scrape endpoint with the URL of the research paper. Again, we print the data to the console after a successful request.

Now we have the data that can be passed to OpenRouter.
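
Keep in mind that the /scrape endpoint returns raw HTML, which carries a lot of markup and navigation noise alongside the paper’s text. Before handing it to the model, you may want to reduce it to plain text. A minimal sketch, assuming you also run pip install beautifulsoup4:

from bs4 import BeautifulSoup

def html_to_text(html):
    """Strip scripts, styles, and tags, returning readable plain text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop elements with no readable content
    return " ".join(soup.get_text(separator=" ").split())

# paper_text = html_to_text(html_data.text)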

 

Extracting the Summary with OpenRouter

This step is the most interesting part of the entire tutorial. We will use the GPT-4 model from OpenAI to generate a summary of the research paper. I am pretty sure you will be amazed by the results you get from this step.

import requests

OPENROUTER_API_KEY = "Your-openrouter-key"

# OpenRouter API URL
API_URL = "https://openrouter.ai/api/v1/chat/completions"

def summarize_research_paper(paper_text):
    """Generate a 10-point summary of a research paper using OpenRouter."""

    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "openai/gpt-4",  # Use OpenAI's GPT-4 via OpenRouter
        "messages": [
            {"role": "system", "content": "Summarize the given research paper in exactly 10 key points."},
            {"role": "user", "content": paper_text}
        ],
        "temperature": 0.7,  # Adjust randomness (0 = strict, 1 = creative)
        "max_tokens": 500  # Limit output length
    }

    response = requests.post(API_URL, json=payload, headers=headers)

    if response.status_code == 200:
        return response.json()["choices"][0]["message"]["content"]
    else:
        return f"Error {response.status_code}: {response.text}"

api_key = "Scrapingdog-API-key"
url = "https://api.scrapingdog.com/google_scholar"

params = {
    "api_key": api_key,
    "query": "cancer",
    "language": "en",
    "page": 0,
    "results": 10
}

response = requests.get(url, params=params)

if response.status_code == 200:
    print('got the research paper link')
    data = response.json()
    paper_link = data['scholar_results'][0]['title_link']
    # Let requests encode the paper URL instead of interpolating it into the query string
    html_data = requests.get(
        "https://api.scrapingdog.com/scrape",
        params={"api_key": api_key, "url": paper_link}
    )
    if html_data.status_code == 200:
        print("paper scraped")
        summary = summarize_research_paper(html_data.text)
        print("\n🔹 **10-Point Summary:**\n")
        print(summary)
    else:
        print(f"Scrape request failed with status code: {html_data.status_code}")
else:
    print(f"Request failed with status code: {response.status_code}")

 

Let me explain what we are doing inside the summarize_research_paper() function.

  • We have set headers to authenticate our request.
  • Then, inside the payload, we specify the model we are going to use to generate the response. The system message instructs the model to summarize the research paper in exactly 10 points, temperature 0.7 balances accuracy and creativity, and max_tokens 500 caps the response length to avoid excessive output (a note on handling very long inputs follows this list).
  • Finally, we make a POST request to the OpenRouter endpoint.
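
One practical caveat: a full research paper, especially as raw HTML, can easily exceed GPT-4’s context window, and OpenRouter will return an error in that case. A simple safeguard is to cap the input length before calling summarize_research_paper(). The character limit below is a rough assumption, not an exact token count:

MAX_CHARS = 20000  # rough cap; tune this to the model's context window

def truncate_for_model(text, max_chars=MAX_CHARS):
    """Trim overly long paper text so the request stays within the model's limits."""
    return text if len(text) <= max_chars else text[:max_chars]

# summary = summarize_research_paper(truncate_for_model(html_data.text))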

Once you run the code, you will get a concise 10-point summary like the one below.

 

				
					1. The research paper titled "The Immunobiology of Cancer Immunosurveillance and Immunoediting" was published in the journal Immunity in 2004.
2. The paper explores the reemergence of interest in the study of cancer immunosurveillance and the expansion of this concept into cancer immunoediting.
3. Cancer immunoediting is a process that consists of three phases: elimination (i.e., cancer immunosurveillance), equilibrium, and escape.
4. The study provides strong experimental data derived from murine tumor models and correlative data from studying human cancer.
5. The findings suggest that the immune system not only protects the host against the development of primary nonviral cancers but also sculpts tumor immunogenicity.
6. The elimination phase involves the immune system recognizing and destroying cancer cells.
7. In the equilibrium phase, the immune system and cancer cells are in a state of balance, with neither being able to eliminate the other.
8. The escape phase refers to when cancer cells manage to evade the immune system and continue to proliferate.
9. The research contributes to the understanding of how the immune system interacts with cancer cells and the potential implications of this process for cancer treatment strategies.
10. The study suggests a need for further research to fully understand the process of cancer immunoediting and its role in cancer development and progression.
				
			

 

Isn’t that amazing? We just turned a lengthy research paper into a handful of concise points. Students and teachers can take advantage of this approach to get through research papers much faster.

Conclusion

Automating research paper summarization using Python, Scrapingdog’s Google Scholar API, and OpenAI’s GPT-4 (accessed through OpenRouter) streamlines the research process, saving valuable time and effort. By integrating web scraping and AI-driven summarization, we can efficiently extract relevant academic papers and generate concise 10-point summaries for quick insights.

This approach is particularly useful for students, researchers, and content writers who need to process large volumes of academic information quickly. With further optimizations, such as categorizing research topics, keyword extraction, or citation management, this workflow can become an even more powerful tool for research automation.

 

