Comparison with similar tools #50

Closed
ftoppi started this conversation in General
Apr 9, 2024 · 6 comments · 11 replies

Hello,
how does it compare to similar tools (i.e. with the same goal) in terms of

  • speed (I know it's going to depend on the GPU)
  • quality (how accurate is the output)
  • ease of use (how easy is it to use)
  • reliability (does it always produce the same output)

Good idea though ;)



By "similar tools", I mean things like:

  • trafilatura
  • browserless
  • beautifulsoup4
  • mozilla readability.js
  • etc.
0 replies

OK, now the question is clearer.
The goal was to create a scraper that works with AI without requiring any knowledge of how HTML works, using the OpenAI API, Gemini, or local models.
In our tests the output seems precise, at the expense of speed, which is slower because a neural network is needed.
The advantage, however, lies in the speed of code creation and in the modularity: as you can see, you can just take a pre-built script (https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_openai.py) and only change the prompt and the link.
This remains the main advantage, together with modularity and fault tolerance: the script above will keep working even if the source code of the linked page keeps changing.
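The reuse pattern described here (change only the prompt and the link, keep everything else) can be sketched as a tiny helper. `build_job` and its defaults are a hypothetical illustration, not part of the scrapegraphai API:

```python
# Hypothetical convenience wrapper illustrating the "change only the
# prompt and the link" reuse pattern; not part of the library itself.
def build_job(prompt: str, source: str, model: str = "gpt-3.5-turbo") -> dict:
    """Bundle everything a SmartScraperGraph run needs into one dict."""
    return {
        "prompt": prompt,
        "source": source,
        "config": {"llm": {"model": model, "temperature": 0}},
    }

# Two different scraping jobs share the exact same code path;
# only the prompt and the target URL change between them.
news_job = build_job("List me all the news with their description.",
                     "https://www.wired.com")
docs_job = build_job("List all article titles.",
                     "https://www.wired.com/category/science/")
```

Each resulting dict could then be handed to a graph constructor; the point is that no HTML-specific selector code changes between jobs.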
Thanks for the question, and if you like the project, please leave a star; it would really help it grow.

0 replies

I understand the goal of the project.
I am asking you for facts.
In the README, show what your project can do. Show how good it is compared to the standard tools, and show the tradeoffs.

2 replies
@MarcoVinciguerra

Yeah, it could be a great idea, thank you!

@VinciGit00

Hi, we did this; let me know if you have more ideas: link


Not impressive at all with the first example:

```
root@66d61d8a5f59:/# cat test1.py
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "base_url": "http://ollama:11434",
        "model": "ollama/dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "model_tokens": 16000
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the news with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://www.wired.com",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
root@66d61d8a5f59:/# python test1.py
--- Executing Fetch Node ---
Fetching pages: 100%|##########| 1/1 [00:00<00:00,  8.58it/s]
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|##########| 1/1 [00:00<00:00, 5275.85it/s]
{'news': [{'title': 'KitchenAid’s New Espresso Machine Won’t Wake Up Your Roommates', 'description': 'KitchenAid’s new compact espresso machine is thoughtfully designed and reliable--after you tune it a bit.'},
          {'title': 'BYD’s Entry-Level Electric SUV Lacks Excitement', 'description': 'Chinese automaker BYD has unveiled its entry-level electric SUV, the Seal, but it may not be as exciting as some might hope.'},
          {'title': "Cherry's New MX2A Switch: A Keyboard Nerd's Dream?", 'description': 'The new Cherry MX2A switch is designed for a more tactile typing experience, but it may not be the perfect fit for every keyboard nerd.'},
          {'title': 'Fender Tone Master Pro: An All-in-One Guitar Studio', 'description': 'The Fender Tone Master Pro is an all-inclusive guitar studio that offers a wide range of sounds and features for musicians.'},
          {'title': 'North Korea Hacked Him. So He Took Down Its Internet', 'description': "Disappointed with the lack of US response to the Hermit Kingdom's attacks against US security researchers, one hacker took matters into his own hands."},
          {'title': 'Sam Bankman-Fried Built a Crypto Paradise in the Bahamas--Now He’s a Bad Memory', 'description': 'The Strain: Inside the Battle to Define Mental Illness'},
          {'title': 'What Really Happened When Google Ousted Timnit Gebru', 'description': 'Machine Not Learning: What Really Happened When Google Ousted Timnit Gebru'}]}
```
9 replies
@ftoppi

Your output is better (it starts at the top and continues sequentially), but it still does not include all the news.

I used Ollama in Docker.
It took several minutes, but that was expected since I ran it on CPU only. That said, most of the time was spent on embedding, since you are using the LLM to perform the embeddings.
I think you tried to fix it in this commit: 8b915e3; however, since embedder_model is always None, it will not be used.

You should try to make embeddings work with this model, it will be much faster: https://ollama.com/library/nomic-embed-text

@PeriniM

Hey there, thank you for the feedback!
Right now we are working to make it more deterministic, so as not to lose any important information. As you might have noticed, some websites with a complex structure are a little more difficult to scrape (try instead https://www.wired.com/category/science/), but we are working on it; if you are interested, contact us on Discord :)

Regarding the OllamaEmbeddings: if the embedding model is not specified in the graph_config, it will use the LLM model. I noticed this was properly implemented only in the SearchGraph, but now it is fixed in the other graphs as well, in commit 982f142.

I tried the model you suggested for the embeddings and it is actually pretty fast! Thanks :)

@PeriniM

@ftoppi Try with this graph_config:

```python
graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        # "model_tokens": 2000,  # set context length arbitrarily
        # "base_url": "http://ollama:11434",  # set ollama URL arbitrarily
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
    }
}
```
@VinciGit00

Using the embedding model you are talking about, we improved speed by 50% (from 6 to 3 minutes).
If you run it with Ollama it takes less than 1 minute.
Thank you for the help.

@ftoppi

You're welcome.

The embeddings dict in the config does not seem to work properly. It tries to connect to localhost.

```python
graph_config = {
    "llm": {
        "base_url": "http://ollama:11434",
        "model": "ollama/dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "model_tokens": 8000,  # set context length arbitrarily
    },
    "embeddings": {
        "base_url": "http://embeddings:11434",
        "model": "ollama/nomic-embed-text:137m-v1.5-fp16",
        "temperature": 0,
        "model_tokens": 8000,
    }
}
```

```
ValueError: Error raised by inference endpoint: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/embeddings (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcb1c6f5f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
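One plausible failure mode (a sketch under assumptions, not the project's actual code) is that the embeddings section's `base_url` is never read, so the client falls back to Ollama's default localhost address. A per-section lookup like `resolve_base_url` below (a hypothetical helper) would honor the embeddings endpoint:

```python
# Sketch of per-section base_url resolution; resolve_base_url is a
# hypothetical helper, not part of scrapegraphai.
DEFAULT_OLLAMA_URL = "http://localhost:11434"

def resolve_base_url(graph_config: dict, section: str) -> str:
    """Prefer the section's own base_url, then the llm section's,
    then Ollama's default. If the per-section lookup is skipped,
    the client silently falls back to localhost, matching the
    connection-refused error above."""
    cfg = graph_config.get(section, {})
    return (cfg.get("base_url")
            or graph_config.get("llm", {}).get("base_url")
            or DEFAULT_OLLAMA_URL)

graph_config = {
    "llm": {"base_url": "http://ollama:11434"},
    "embeddings": {"base_url": "http://embeddings:11434"},
}
```

With this chain, `resolve_base_url(graph_config, "embeddings")` yields the dedicated embeddings host rather than localhost.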

It is a fairly deep topic.

The bigger the model, the worse the embedding performance, and CPU instead of GPU is of course the worst of the four possible combinations (small/large model, CPU/GPU).

The MTEB leaderboard is the way to go when you are choosing performant models for your hardware.

Most everyone should be able to make sure the NVIDIA SDK is loaded in the Docker container, and I think most projects auto-install it if you don't have it. Ollama running everything on the GPU is going to be the best, and most embedding models are small enough that this should be your best bet.

Ollama has quite a few nice embedding models in its list.

The Nomic embedding worked quite well for me (they always do good stuff); however, I found the Mixed Bread model slightly more performant.

This is helpful when you are upserting hundreds of documents into your vector database for your RAG pipeline. I'm not 100% sure of the buffer size for the scraping delay in this project, but there was a post going around on Reddit, so it piqued my interest :)
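For intuition on why embedding model choice matters in a RAG pipeline: retrieval ranks chunks by cosine similarity between the query vector and each chunk vector. A minimal self-contained illustration with toy vectors (no Ollama call involved; the vectors are made up for the example):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional vectors standing in for real embeddings of a
# query and two candidate chunks (real models emit hundreds of dims).
query = [0.9, 0.1, 0.0]
chunk_on_topic = [0.8, 0.2, 0.1]
chunk_off_topic = [0.0, 0.1, 0.9]

# A better embedding model is one that widens this gap for
# genuinely related versus unrelated text.
assert cosine_similarity(query, chunk_on_topic) > cosine_similarity(query, chunk_off_topic)
```

Benchmarks like MTEB essentially measure how reliably a model produces this separation across many retrieval tasks.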

0 replies

Will this be helpful for websites which have enabled anti-scraping techniques?

0 replies
Category
General
Labels
None yet
6 participants
@ftoppi @PeriniM @MarcoVinciguerra @VinciGit00 @FarVision2 @gaurav15113010
