
Automated Blog Posts

Oct 14, 2023 · automation, ai, github, huggingface

GitHub repository

Description

In this project I wanted to experiment with an automated way to create new posts on my personal page on a scheduled basis. For this purpose I am using GitHub Actions to execute a JavaScript file, and in that script I connect to a chatbot model hosted on Hugging Face.

Research

First I wanted to check the different models available on the web. My go-to was ChatGPT by OpenAI, since it is a widely known and used model.

Option 1 - ChatGPT

ChatGPT is "a model which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests"

I tried to use the OpenAI API: I created an API key and tried to use the model through curl

curl https://api.openai.com/v1/models \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "OpenAI-Organization: org-axYbMuJ2r0fsYI7elH0P0hQt"

At this point I realized that OpenAI doesn't provide a free quota, so you need to buy credits to use the API. Since I didn't want to spend money on this project, I searched for another solution.

Option 2 - Self-Hosted GPT4All

GPT4All is "a free-to-use, locally running, privacy-aware chatbot. No GPU or internet required"

This option seemed like a viable way to talk to a GPT-style model to get a post.

The main downside of this is that hosting a model requires a lot of resources: memory to download the model (3 - 8 GB) and CPU for the inference to generate an answer. I searched for a hosting service with these characteristics, but there isn't anything out there for free for this kind of project, and I don't have my own infrastructure to host something like this.

Option 3 - Replicate AI

Replicate is "A cloud API that enables users to quickly and easily run open-source machine learning models"

They have some conversational models such as:

- meta/llama-2-70b-chat

- lucataco/qwen-vl-chat

These models work well for my application, and Replicate provides API tokens linked to your account to communicate with the models through their API. The downside is that they don't provide a way to know how much of your free quota is left, or whether this quota resets at some point, so communication with the API could break at any moment.

Option 4 - HuggingFace Spaces

HuggingFace is "The platform where the machine learning community collaborates on models, datasets, and applications"

HuggingFace provides a hosting solution called Spaces, where anyone can host project code, similar to GitHub, and deploy the application right away using Gradio.

Gradio is "the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it, anywhere!"

So Gradio is a frontend for data science applications, in this case Python applications running machine learning models.

Gradio provides a UI, but it also provides an API for the application. It is like a wrapper around your Python application: you can interact with the text boxes, buttons, etc., all through the API.

Also, we don't require an API token to communicate with the API. The only downside is that if a lot of people are using the same app, you'll be placed in a queue and will need to wait until your turn arrives. But for this kind of application we don't require a fast response, so it's fine for us.

There are some good models hosted in HuggingFace, such as:

- artificialguybr/qwen-14b-chat-demo

- yuntian-deng/ChatGPT

Scheduled Run

To execute the script on a scheduled basis I'm using GitHub Actions.

GitHub Actions is "a continuous integration and continuous delivery (CI/CD) platform that allows you to automate your build, test, and deployment pipeline"

GitHub Actions provides different ways to trigger a workflow. In this case I needed a scheduled trigger, and scheduled triggers in GitHub Actions follow the POSIX cron syntax.

For example, trigger the workflow every day at 5:30 and 17:30 UTC:

on:
  schedule:
    - cron: '30 5,17 * * *'

My workflow runs on GitHub Actions hosted runners. GitHub provides a free runtime quota for their runners: for a GitHub Free account, they provide 2,000 minutes of runtime per month. For my purposes this is more than enough, since I will trigger the script once a day, from Monday to Friday; supposing that my script takes 5 minutes to run, I would be consuming around 100 min/month out of the 2,000 min/month available.

NodeJS Script

For the script I'm using Node.js. Simplifying the steps, this is what I needed to do (a minimal sketch follows the list):

  • Create an env var to hold the USER_PROMPT value; this value will be the question passed to the model through the Hugging Face API
  • Connect to Hugging Face Spaces; as mentioned before, Gradio provides Python and JavaScript clients for its API, so I needed to install the JS client using npm
  • Execute the model with the USER_PROMPT and store the answer
  • Parse the answer to get the needed value, which is the model response
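Putting those steps together, this is a minimal sketch of the script, assuming the @gradio/client npm package; the Space name, endpoint name, and the shape of the returned data are placeholders, since the real values depend on the Space you connect to (check its "Use via API" page):

// post-generator.mjs - minimal sketch, not the exact script
import { client } from "@gradio/client";

// USER_PROMPT is the question passed to the model
const USER_PROMPT = process.env.USER_PROMPT;

// Connect to a Space (placeholder Space id, e.g. one of the models listed above)
const app = await client("yuntian-deng/ChatGPT");

// Execute the model; "/predict" and the argument list are placeholders,
// each Space documents its own endpoint signature
const result = await app.predict("/predict", [USER_PROMPT]);

// Parse the answer: the model response usually lives somewhere inside result.data
const answer = result.data[0];
console.log(answer);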

Connection to Sanity IO

Since I'm using Sanity IO as the CMS for my page, I needed to connect to Sanity IO through their API and create a new document in my posts schema with the information received from the model.

This can be done with Sanity HTTP mutations; the create mutation allows us to create a new document with the specified fields.
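As a sketch of what that request looks like (the project ID, dataset, and token env vars, the post fields, and the API version date are placeholders for my actual setup; it assumes Node 18+ for the global fetch):

// create-post.mjs - minimal sketch of a create mutation over Sanity's HTTP API
const projectId = process.env.SANITY_PROJECT_ID;
const dataset = process.env.SANITY_DATASET;
const token = process.env.SANITY_TOKEN;

const url = `https://${projectId}.api.sanity.io/v2023-10-01/data/mutate/${dataset}`;

// "post" and its fields must match the posts schema defined in the Sanity studio
const mutations = [
  {
    create: {
      _type: "post",
      title: "Automated post",
      body: "Parsed model response goes here",
      publishedAt: new Date().toISOString(),
    },
  },
];

const response = await fetch(url, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${token}`,
  },
  body: JSON.stringify({ mutations }),
});

console.log(await response.json());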

Conclusion

In the end, we have a JS script that executes at 8:00 am PST from Monday to Friday. This script executes a Hugging Face model with a custom prompt, the result is parsed, and, connecting to Sanity IO, we create a new document under our area of interest.

This project shows the potential uses of Gradio applications hosted on Hugging Face, how easy it is to integrate a data science application into an automated flow, and how we can take advantage of the GitHub Actions free runners to automate processes.

This project has areas to improve: since I am adding new posts without formatting, everything is added as plain text, and I need a way to process the document as Markdown to render the titles and text.

Result:


Update (Oct 15, 2023)

There were some improvements added to this project:

  • Firstly, I added data parsing to get a better visualization of the posts: previously I was asking for the post in Markdown format and I was thinking of implementing a Gatsby plugin for data visualization, but that approach requires more effort on the front-end side, so I decided to instead ask GPT for the post in a specific HTML format that I can parse easily, and build the Sanity portable text blocks in the required format (a rough sketch of this parsing follows the list).
  • The second improvement is a filter to check whether the generated post is similar to previous posts: I noticed that sometimes the GPT model was generating a post similar to existing ones, so I added the Levenshtein distance algorithm, which calculates the similarity between words or phrases. Now, before publishing a post, I calculate the distance against the published posts, and if there is a similar post, I simply don't publish it (see the similarity sketch after this list).
  • The third improvement is that I added a block at the end of each post to warn readers that the content was generated by GPT and may contain false information.
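This is roughly what the parsing step looks like, assuming the model is asked to answer using only <h2> and <p> tags; the block shape is the standard Sanity portable text structure, simplified:

// html-to-blocks.mjs - sketch: turn "<h2>...</h2>" / "<p>...</p>" fragments from the
// model answer into Sanity portable text blocks
function htmlToBlocks(html) {
  const blocks = [];
  const tagRegex = /<(h2|p)>([\s\S]*?)<\/\1>/g;
  let match;
  while ((match = tagRegex.exec(html)) !== null) {
    const [, tag, text] = match;
    blocks.push({
      _type: "block",
      _key: Math.random().toString(36).slice(2), // Sanity expects a _key on array items
      style: tag === "h2" ? "h2" : "normal",
      markDefs: [],
      children: [
        { _type: "span", _key: Math.random().toString(36).slice(2), text, marks: [] },
      ],
    });
  }
  return blocks;
}

console.log(htmlToBlocks("<h2>Title</h2><p>Some generated text.</p>"));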
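And this is a sketch of the similarity filter: a classic Levenshtein distance, normalized into a 0..1 similarity score. The 0.8 threshold is just an illustrative value, not necessarily the one used in the project:

// similarity.mjs - sketch of the duplicate-post filter
function levenshtein(a, b) {
  // dp[i][j] = edit distance between the first i chars of a and the first j chars of b
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return dp[a.length][b.length];
}

function isTooSimilar(newPost, existingPosts, threshold = 0.8) {
  return existingPosts.some((post) => {
    const distance = levenshtein(newPost, post);
    const similarity = 1 - distance / Math.max(newPost.length, post.length);
    return similarity >= threshold;
  });
}

console.log(isTooSimilar("a post about gradio apps", ["a post about gradio app"])); // true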

As a future improvement, I'm thinking of including snippets of code in the posts (when relevant).

Result:

