How to Create an Automatic Blog Using AI

What I’m about to show you is approaching the border of unethical behavior, but it doesn’t cross the line. At least I don’t think so. My hope in sharing this project with you is to demonstrate how easily LLMs can be integrated with your existing workflows.

Now on to the project.

Around 2023, I decided it would be a great idea to start a blog. My goal was simple: to capture some of that sweet, sweet Google Ad revenue. Through my research, I concluded that I would need approximately 300 to 500 blog posts to accomplish that goal effectively. However, I had a major problem. I didn’t have 300 to 500 blog posts to actually put on a website.

So I devised a solution.

I decided I would write a simple Python script, which would create all of my blog posts for me. The script would work by scraping all of the URLs from another website’s sitemap, comparing those URLs against a list of keywords I provided, and then passing the content of each matching URL into my Ollama LLM running on my local computer. This would create new blog posts based on the blog posts of other websites. I would simply spot-check the output for plagiarism, and then I’d be off to the races.

The Python script and the creation of all those blog posts were a resounding success. However, getting site traffic proved to be more difficult. Nevertheless, the project was still fruitful, because it demonstrated to me how easily LLMs could play with my existing Python scripts.

I’m going to show you that Python script now, piece by piece, and explain each section. This is not so you can create some useless blog that no one really wants to read. However, there are certain parts of this script which are obviously useful for other applications. Also, if I were doing this project today, I’d probably change my approach. I want to show you how I would do things differently.

Let’s first discuss the web scraping.

Despite the fact that we live in the Era of the Dead Internet, people still insist on trying to be found through SEO practices. Thus, most websites will actually include a sitemap. This can usually be found by simply typing the URL followed by sitemap.xml into your browser. Would you like to try it? Go type lukesmith.xyz/sitemap.xml into a browser window right now. My sitemap is even simpler. It’s just a text file, which is perfectly acceptable according to Google’s documentation. You can try jonathanadams.pro/sitemap.txt in a browser window and see the sitemap for this site.

Sitemaps make it easy for search engine bots to crawl your website and index its pages. However, they also make it easy for me to crawl your website. If you run a blog with 784 posts, every single one of them can be indexed by me in just a few seconds. I can look for keywords in your permalinks and decide whether or not a post is one I would like to “copy” with my LLM.

Okay, so let’s get down to some code and see what it would take to actually do this.

Since we’re going to be working in Python, we would first need to import some web scraping libraries.


           import requests
           from bs4 import BeautifulSoup
           import ollama
           

We’re going to utilize Requests and BeautifulSoup, along with the obvious Ollama library. If I were redoing this project today, I’m not sure how much I would actually rely upon Python for doing my requests and parsing. Now that we can use the subprocess module in Python to run Unix commands, it is really difficult for me to want to use BeautifulSoup any longer.

This brings up an important point. When it comes to data gathering and cleaning, staying within a single language is often not the best choice. In the Unix environment alone, I have tools like curl, grep, sed, awk, and pandoc available to me.

With these tools I can extract and clean data quickly and efficiently. My workflows now often consist of using these tools to pre-clean my data before ever feeding it into a Python script.
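
Here is a minimal sketch of that idea, assuming curl and sed are installed and using a made-up sitemap URL. The shell pipeline does the fetching and the cleaning, and Python only ever sees the finished product:


            import subprocess

            # Hypothetical pre-cleaning step: fetch a page with curl, then use
            # sed to drop blank lines, all before Python parses anything
            result = subprocess.run(
                "curl -s https://example.com/sitemap.txt | sed '/^$/d'",
                shell=True, capture_output=True, text=True,
            )
            urls = result.stdout.splitlines()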

Enough of that rabbit trail. Let’s get back to the code.

First, we need to pull all the URLs from the sitemap and store them in a dictionary.


            def get_sitemap_links(sitemap_url):
                response = requests.get(sitemap_url)
                # The 'xml' parser requires the lxml package to be installed
                soup = BeautifulSoup(response.content, 'xml')
                loc_elements = soup.find_all('loc')  # each <loc> tag holds one URL

                # Create a dictionary to store webpage links
                webpage_links = {}
                for loc in loc_elements:
                    url = loc.text
                    # Add the URL to the dictionary
                    webpage_links[url] = None

                return webpage_links
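
Calling it is as simple as handing it a sitemap URL. Using the lukesmith.xyz sitemap from earlier as an example:

            links = get_sitemap_links('https://lukesmith.xyz/sitemap.xml')
            print(len(links))  # how many URLs the sitemap exposed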
           

Next, we’re going to filter our website links within this dictionary by keywords. In my case, I was writing a blog based on the financial markets, so I had a whole list of keywords to filter these URLs by.

            def filter_urls_by_keywords(urls, keywords):
                filtered_urls = []
                for url in urls:
                    for keyword in keywords:
                        if keyword.lower() in url.lower():
                            filtered_urls.append(url)
                            break
                return filtered_urls

Once we call this function later on in our code, we will have a list of URLs associated with the keywords that we’re looking for. This will essentially be the list of blog posts we are going to feed into our large language model. Pretty slick, isn’t it? I never even had to visit the website or read any of the content.
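
To make that concrete, here is the function at work on a few invented permalinks:

            urls = [
                'https://example.com/how-to-start-budgeting',
                'https://example.com/best-hiking-trails',
                'https://example.com/7-savings-tips-for-renters',
            ]
            print(filter_urls_by_keywords(urls, ['budgeting', 'savings']))
            # ['https://example.com/how-to-start-budgeting',
            #  'https://example.com/7-savings-tips-for-renters']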

Most people, when creating permalinks for their actual blog posts, will be as descriptive as possible. That makes this AI website cloning that much easier.

Now that we’ve filtered out our URLs by keywords, we’re actually going to need to extract the article content. This next function will take care of that.


            def extract_article_content(url):
                try:
                    response = requests.get(url)
                    response.raise_for_status()  # Raise an exception if the request fails

                    soup = BeautifulSoup(response.content, 'html.parser')

                    # Grab the tags most likely to hold the article's substance
                    meta = soup.find_all(['meta'])
                    headings = soup.find_all(['h1', 'h2', 'h3'])
                    paragraphs = soup.find_all('p')
                    quotes = soup.find_all('blockquote')

                    # lists = soup.find_all(['ul', 'ol', 'li'])

                    content_elements = meta + paragraphs + headings + quotes

                    # Join everything into one large piece of text
                    extracted_content = ' '.join(element.get_text().strip() for element in content_elements)

                    return extracted_content
                except Exception as e:
                    return f"Error: {str(e)}"
           

I look at this now and think, “Oh my, why didn’t I just use curl and pandoc?” The function is very simple. You may notice I have commented out the actual list tags. I wanted to leave that in there in case I ever wanted to turn the list tags back on for some reason. Anyway, as you can see, it simply pulls everything inside certain HTML tags and then joins it all together into one large piece of text. Definitely not the way I would do it today.
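
For what it’s worth, here is roughly what the curl-and-pandoc version might look like today. This is a sketch, assuming both tools are installed, not code I actually ran:

            import subprocess

            def extract_article_text(url):
                # Fetch the raw HTML with curl, then let pandoc flatten it to plain text
                html = subprocess.run(['curl', '-sL', url],
                                      capture_output=True, text=True).stdout
                result = subprocess.run(['pandoc', '-f', 'html', '-t', 'plain'],
                                        input=html, capture_output=True, text=True)
                return result.stdout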

I would certainly recommend that anyone wishing to use BeautifulSoup be far more deliberate about the tags they pull and the information they’re scraping.
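
For example, a more deliberate version might scope the search to the article body instead of vacuuming up every tag on the page. The use of an article tag here is an assumption about the target site’s markup:

            import requests
            from bs4 import BeautifulSoup

            def extract_body_paragraphs(url):
                soup = BeautifulSoup(requests.get(url).content, 'html.parser')
                article = soup.find('article')  # assumes posts are wrapped in <article>
                if article is None:
                    return ''
                # Only paragraphs inside the article body; nav bars, footers,
                # and newsletter boxes never make it into the text
                return ' '.join(p.get_text().strip() for p in article.find_all('p'))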

Moving on to our next function: it is time to start feeding information into our LLM of choice!


            def write_post(post_text):
                query = f'''OBJECTIVE: I want you to write a blog post and for the blog post to be complete.
                -- STYLE: The tone and style of the final written piece is that of a financial journalist writing for a magazine.
                -- AUDIENCE: This blog post is aimed at workers in their 20s to 60s who are concerned with their finances.
                -- RESPONSE: Make sure that the blog post is structured with headings and is very readable.
                -- I will tip you $400 if the post that you write is over 1000 words.
                -- CONTENT: I have text here about a topic, but the length is short. You are to rewrite this post and really expand on its central theme. Here is the text:
                -- {post_text}'''
                response = ollama.chat(model='macblog', messages=[
                    {
                        'role': 'user',
                        'content': query,
                    },
                ])
                return response['message']['content']
            

Okay, a few notes here about this wild thing. First, you will notice the model I had chosen for this batch of blog posts was called “macblog.” I had tested out several different LLMs, and this one seemed to do the best.

Why did I tip my LLM $400? Well, at the time, it was thought by many in the prompt engineering space that tipping your model would cause it to produce better results and force it to adhere to your instructions. I think it was just a ploy to make us all look ridiculous.

Following this function, I had three more almost identical to it. Each of them passed some portion of the blog post into Ollama and requested a title, meta tags, an SEO description, or a social media post about the blog post it had just written.
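
I no longer have those functions in front of me, but since they followed the same pattern, a reconstruction of write_title would have looked something like this:

            def write_title(post_text):
                query = f'Write a single, compelling, SEO-friendly title for the following blog post. Return only the title. Here is the text: {post_text}'
                response = ollama.chat(model='macblog', messages=[
                    {'role': 'user', 'content': query},
                ])
                return response['message']['content']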

Next, we need our keywords so we can search through our original list of URLs we pulled from the target sitemap.xml.

            keywords = ['budgeting', 'savings', 'saving']

My list of keywords was much longer than that, but it just gives you a general idea.

Now for some real coding!


            sitemap_url = input('Paste sitemap URL:')
            webpage_dict = get_sitemap_links(sitemap_url)
            common_text1 = input('Paste common text here:')
            filtered_urls_dict = {}
            for url in filter_urls_by_keywords(list(webpage_dict.keys()), keywords):
                filtered_urls_dict[url] = None

            for i, (url, info) in enumerate(filtered_urls_dict.items()):
                if 211 < i < 270:  # only process this batch of filtered URLs
                    try:
                        extracted_content = extract_article_content(url)
                        # Strip boilerplate text common to every article
                        extracted_content = extracted_content.replace(common_text1, "")
                        print(f"{i + 1}. URL: {url}")
                        print(extracted_content)
                        new_title = write_title(extracted_content)
                        new_post = write_post(extracted_content)
                        new_meta = write_meta(extracted_content)
                        new_social = write_social(extracted_content)
                        file_name = f"wiseapple_{i+1}.txt"
                        with open(file_name, "w", encoding="utf-8", errors="ignore") as f:
                            f.write(f"NEW TITLE ---- {new_title}\n\n")
                            f.write(f"NEW POST ---- {new_post}\n\n")
                            f.write(f"NEW META DESCRIPTION ---- {new_meta}\n\n")
                            f.write(f"SOCIAL MEDIA ---- {new_social}\n\n")
                    except Exception as e:
                        print(f"Error processing URL {url}: {e}")
           

This is where it all comes together!

First the script simply asks for the URL where the sitemap is located. Then it uses the functions we already defined to make a list of all the URLs and filter them based upon our keywords.

Next, we indicate which filtered URLs we want processed. Here you can see I am only going to be using URLs 212 through 269. I would usually batch them in chunks of 50 or so. I wanted to spot-check the results because I didn’t want to be guilty of plagiarism or copyright violations!

You may take notice of the variable common_text1. I used that variable to strip out any common text which appeared in all of the articles and had not been parsed out by my earlier functions, things such as “Sign up for our newsletter. We would…” Feeding that extra garbage into the LLM can cause issues.

So there you have it. I could simply run the script, go eat some breakfast, and have 50 or so text files full of new content for my blog site!

I will place a sample of the output at the bottom of this article.

Implications

Well, I certainly wouldn’t recommend trying to get rich from creating a blog site. However, I think there is a great deal to be learned from such a project.

Let’s consider a situation where you are required to examine files or data, whether on a local machine or from the internet, and then generate a response.

These types of functions can be used to automate that entire process.

For instance…

Let’s say I own a landscaping business where I charge people to mow their yards. I have a landing page which sends them to a form where they enter their address and request a quote.

I could then use a script to parse out the data from the contact form and send a personalized message and a quote back to the customer! I could even use dynamic pricing models based upon zip code!

My workflow would look something like this.

  1. Extract the data from the web form by utilizing a bash script.
  2. Pass the data to Python and use a database to obtain yard size.
  3. Filter address data through zip code dictionaries to increase or decrease price based upon location. (Some neighborhoods will require more attention, thus higher prices)
  4. Pass all returned pricing back.
  5. Use Python and Ollama to generate a personalized message for the client; see the sketch after this list. (Perhaps you asked whether they had pets in the form? Ollama can craft a message involving Fluffy to really increase your response rate!)
  6. Post all of this back into a form and mail it back to the client.
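
Here is a minimal sketch of steps 3 and 5, with an invented zip-code table, a made-up form payload, and a stand-in model name:

            import ollama

            # Hypothetical multipliers: some neighborhoods need more attention
            ZIP_MULTIPLIERS = {'30301': 1.25, '30302': 1.0, '30303': 0.9}

            def quote_price(base_price, zip_code):
                return base_price * ZIP_MULTIPLIERS.get(zip_code, 1.0)

            def personalized_quote(form_data):
                price = quote_price(50.00, form_data['zip'])
                query = (f"Write a short, friendly lawn-mowing quote for "
                         f"{form_data['name']}, who has a pet named {form_data['pet']}. "
                         f"The price is ${price:.2f} per visit.")
                # 'llama3' is a stand-in; use whichever local model you prefer
                response = ollama.chat(model='llama3', messages=[
                    {'role': 'user', 'content': query},
                ])
                return response['message']['content']

            print(personalized_quote({'name': 'Sam', 'zip': '30301', 'pet': 'Fluffy'}))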

This project shows how easily one can integrate LLMs into their workflows at any point. You are only limited by your creativity!


Sample Output

NEW TITLE ---- “Drowning in Debt and Bills? Here’s How to Take Control and Get Back on Track”

NEW POST ----

Introduction

Facing a point in life where your bills exceed the income you have can be daunting and overwhelming, as mounting debt and increasing bill obligations make it difficult to cope. You may feel like you are in the deep end with no lifeguard in sight. However, there is hope. By taking specific steps to address your financial challenges, you can regain control of your finances and start moving toward a more stable future.

Step 1: Assess Your Current Financial Situation

The first step to gaining control over your finances is to understand where you currently stand. Gather all your bills and calculate the total amount owed. Once you have this information, compare it with your income to get a clearer picture of the financial gap you need to bridge. Being honest and realistic about your situation will help you devise an effective plan moving forward.

Step 2: Trim the Fat from Your Budget

The average household is said to waste between 10% and 15% of their monthly income on non-essential expenses. To better manage your finances, start by scrutinizing your spending habits and identifying unnecessary items to cut from your budget. This may involve reducing or eliminating expenses such as dining out, entertainment, or apparel purchases. You might be surprised at the hidden costs you can trim without impacting your quality of life significantly.

Step 3: Increase Your Income

In addition to cutting costs, you should also consider ways to boost your income. This may involve taking on a part-time job, starting a side gig, or doing odd jobs to generate some extra cash that can go directly towards paying off your debts and bills. You can even earn additional funds by selling unused household items online. Combining cost reduction with increased income will help you tackle your debt more effectively.

Step 4: Prioritize Your Debt and Bills

With a clear understanding of your financial situation and a plan to generate more income, the next step is to prioritize which debts and bills must be paid first. Focus on secured debts such as your mortgage or car loan, as failing to pay these may lead to losing valuable assets. Additionally, prioritize essential bills like insurance and utilities. While unsecured credit card debt may not be as critical, you might need to postpone payments for a while to address the more pressing concerns first.

Step 5: Know Your Rights When Dealing with Debt Collectors

At some point in your journey towards financial stability, you will likely encounter debt collectors. Familiarize yourself with the Fair Debt Collection Practices Act to understand how these collectors can and cannot interact with you. In the event that you are unable to keep up with your payments, reach out to your creditors to discuss payment plans you can manage.

Step 6: Consolidate Your Debts, Rebuild Credit, and Re-establish Financial Responsibility

Consider consolidating your debts through credit card balance transfers or personal loans to make it more manageable to pay off your debt without breaking the bank. As you work towards financial stability, focus on rebuilding your credit by applying for a secured credit card and making regular on-time payments. Over time, this will help improve your credit score and open the door for unsecured credit cards in the future.

Conclusion

Climbing out of a difficult financial situation is not an overnight process; it requires persistence, patience, and discipline. By taking these steps to understand and address your debt, you’ll be on the path towards regaining control over your finances and enjoying greater financial stability in the long run. If you have emerged from a challenging financial period, share your experience with others facing similar circumstances in the comments below.

Disclaimer: The article is for informational purposes only and should not be taken as professional financial advice. Always consult a certified financial advisor or credit counselor before making any significant financial decisions.

NEW META DESCRIPTION ---- When facing more bills than income, follow these steps: assess your financial situation, cut unnecessary spending, increase income, prioritize debts, deal with collectors and creditors, consider debt consolidation, and rebuild credit.

SOCIAL MEDIA ---- Do you ever feel overwhelmed when your bills and debt exceed your income? Here are six steps to help regain control of your finances: 1) Determine your financial standing, 2) Cut back on unnecessary expenses, 3) Increase your income through side jobs or selling unused items, 4) Prioritize which debts and bills to pay first, 5) Know your rights when dealing with debt collectors, and 6) Re-establish your credit by responsibly using secured credit cards.


Jonathan Adams

August 4th, 2025