Comparison with similar tools #50

Closed
ftoppi started this conversation in General
Apr 9, 2024 · 6 comments · 11 replies

Hello,
how does it compare to similar tools (i.e. with the same goal) in terms of

  • speed (I know it's going to depend on the GPU)
  • quality (how accurate is the output)
  • ease of use (how easy is it to use)
  • reliability (does it always produce the same output)

Good idea though ;)



By "similar tools", I mean things like:

  • trafilatura
  • browserless
  • beautifulsoup4
  • mozilla readability.js
  • etc.
0 replies

OK, now the question is clearer.
The goal was to create a scraper that works with AI without requiring any knowledge of how HTML works, using the OpenAI API, Gemini, or local models.
In our tests the output seems precise, at the expense of speed, which is slower because a neural network is needed.
The advantage, however, lies in the speed of code creation and in the modularity: as you can see, you can just take a pre-built script (https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/openai/smart_scraper_openai.py) and only change the prompt and the link.
This remains the main advantage, together with modularity and fault tolerance: the script above will keep working even if the source code of the linked page keeps changing.
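The reuse pattern described here (change only the prompt and the link, keep everything else) can be sketched as a tiny helper. `build_job` and its defaults are a hypothetical illustration, not part of the scrapegraphai API:

```python
# Hypothetical convenience wrapper illustrating the "change only the
# prompt and the link" reuse pattern; not part of the library itself.
def build_job(prompt: str, source: str, model: str = "gpt-3.5-turbo") -> dict:
    """Bundle everything a SmartScraperGraph run needs into one dict."""
    return {
        "prompt": prompt,
        "source": source,
        "config": {"llm": {"model": model, "temperature": 0}},
    }

# Two different scraping jobs share the exact same code path;
# only the prompt and the target URL change between them.
news_job = build_job("List me all the news with their description.",
                     "https://www.wired.com")
docs_job = build_job("List all article titles.",
                     "https://www.wired.com/category/science/")
```

Each resulting dict could then be handed to a graph constructor; the point is that no HTML-specific selector code changes between jobs.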
Thanks for the question, and if you like the project, please leave a star; it would really help it grow.

0 replies

I understand the goal of the project.
I am asking you for facts.
In the README, show what your project can do. Show how good it is compared to the standard tools, and show the tradeoffs.

2 replies
@MarcoVinciguerra

Yeah, it could be a great idea, thank you!

@VinciGit00

Hi, we did this; let me know if you have more ideas: link


Not impressive at all with the first example:

```
root@66d61d8a5f59:/# cat test1.py
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "base_url": "http://ollama:11434",
        "model": "ollama/dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "model_tokens": 16000
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the news with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://www.wired.com",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
root@66d61d8a5f59:/# python test1.py
--- Executing Fetch Node ---
Fetching pages: 100%|##########| 1/1 [00:00<00:00,  8.58it/s]
--- Executing Parse Node ---
--- Executing RAG Node ---
--- (updated chunks metadata) ---
--- (tokens compressed and vector stored) ---
--- Executing GenerateAnswer Node ---
Processing chunks: 100%|##########| 1/1 [00:00<00:00, 5275.85it/s]
{'news': [{'title': 'KitchenAid’s New Espresso Machine Won’t Wake Up Your Roommates', 'description': 'KitchenAid’s new compact espresso machine is thoughtfully designed and reliable--after you tune it a bit.'},
          {'title': 'BYD’s Entry-Level Electric SUV Lacks Excitement', 'description': 'Chinese automaker BYD has unveiled its entry-level electric SUV, the Seal, but it may not be as exciting as some might hope.'},
          {'title': "Cherry's New MX2A Switch: A Keyboard Nerd's Dream?", 'description': 'The new Cherry MX2A switch is designed for a more tactile typing experience, but it may not be the perfect fit for every keyboard nerd.'},
          {'title': 'Fender Tone Master Pro: An All-in-One Guitar Studio', 'description': 'The Fender Tone Master Pro is an all-inclusive guitar studio that offers a wide range of sounds and features for musicians.'},
          {'title': 'North Korea Hacked Him. So He Took Down Its Internet', 'description': "Disappointed with the lack of US response to the Hermit Kingdom's attacks against US security researchers, one hacker took matters into his own hands."},
          {'title': 'Sam Bankman-Fried Built a Crypto Paradise in the Bahamas--Now He’s a Bad Memory', 'description': 'The Strain: Inside the Battle to Define Mental Illness'},
          {'title': 'What Really Happened When Google Ousted Timnit Gebru', 'description': 'Machine Not Learning: What Really Happened When Google Ousted Timnit Gebru'}]}
```
9 replies
@ftoppi

Your output is better (it starts at the top and continues sequentially), but it still does not include all the news.

I used Ollama in Docker.
It took several minutes, but that was expected since I ran it on CPU only. That said, most of the time was spent on embedding, since you are using the LLM to perform the embeddings.
I think you tried to fix it in this commit: 8b915e3; however, since embedder_model is always None, it will not be used.

You should try to make embeddings work with this model, it will be much faster: https://ollama.com/library/nomic-embed-text

@PeriniM

Hey there, thank you for the feedback!
Right now we are working to make it more deterministic, so as not to lose any important information. As you might have noticed, some websites with a complex structure are a little more difficult to scrape (try instead https://www.wired.com/category/science/), but we are working on it; if you are interested, contact us on Discord :)

Regarding the OllamaEmbeddings: if the embedding model is not specified in the graph_config, it will use the LLM model. I noticed this was properly implemented only in the SearchGraph, but now it is fixed in the other graphs as well, in commit 982f142.

I tried the model you suggested for the embeddings and it is actually pretty fast! Thanks :)

@PeriniM

@ftoppi Try with this graph_config:

```python
graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        # "model_tokens": 2000,  # set context length arbitrarily
        # "base_url": "http://ollama:11434",  # set ollama URL arbitrarily
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
    }
}
```
@VinciGit00

Using the embedding model you are talking about, we improved speed by 50% (from 6 to 3 minutes).
If you run it with Ollama it takes less than 1 minute.
Thank you for the help.

@ftoppi

You're welcome.

The embeddings dict in the config does not seem to work properly. It tries to connect to localhost.

```python
graph_config = {
    "llm": {
        "base_url": "http://ollama:11434",
        "model": "ollama/dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "model_tokens": 8000,  # set context length arbitrarily
    },
    "embeddings": {
        "base_url": "http://embeddings:11434",
        "model": "ollama/nomic-embed-text:137m-v1.5-fp16",
        "temperature": 0,
        "model_tokens": 8000,
    }
}
```

```
ValueError: Error raised by inference endpoint: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/embeddings (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcb1c6f5f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
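One plausible failure mode (a sketch under assumptions, not the project's actual code) is that the embeddings section's `base_url` is never read, so the client falls back to Ollama's default localhost address. A per-section lookup like `resolve_base_url` below (a hypothetical helper) would honor the embeddings endpoint:

```python
# Sketch of per-section base_url resolution; resolve_base_url is a
# hypothetical helper, not part of scrapegraphai.
DEFAULT_OLLAMA_URL = "http://localhost:11434"

def resolve_base_url(graph_config: dict, section: str) -> str:
    """Prefer the section's own base_url, then the llm section's,
    then Ollama's default. If the per-section lookup is skipped,
    the client silently falls back to localhost, matching the
    connection-refused error above."""
    cfg = graph_config.get(section, {})
    return (cfg.get("base_url")
            or graph_config.get("llm", {}).get("base_url")
            or DEFAULT_OLLAMA_URL)

graph_config = {
    "llm": {"base_url": "http://ollama:11434"},
    "embeddings": {"base_url": "http://embeddings:11434"},
}
```

With this chain, `resolve_base_url(graph_config, "embeddings")` yields the dedicated embeddings host rather than localhost.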

It is a fairly deep topic.

The bigger the model, the worse the embedding performance, and CPU instead of GPU is of course the worst of the four possible combinations (small/large model, CPU/GPU).

The MTEB leaderboard is the way to go when you are choosing performant models for your hardware.

Most everyone should be able to make sure the NVIDIA SDK is loaded in the Docker container, and I think most projects auto-install it if you don't have it. Ollama running everything on the GPU is going to be the best, and most embedding models are small enough that this should be your best bet.

Ollama has quite a few nice embedding models in its list.

The Nomic embedding worked quite well for me (they always do good stuff); however, I found the Mixed Bread model slightly more performant.

This is helpful when you are upserting hundreds of documents into your vector database for your RAG pipeline. I'm not 100% sure of the buffer size for the scraping delay in this project, but there was a post going around on Reddit, so it piqued my interest :)
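For intuition on why embedding model choice matters in a RAG pipeline: retrieval ranks chunks by cosine similarity between the query vector and each chunk vector. A minimal self-contained illustration with toy vectors (no Ollama call involved; the vectors are made up for the example):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional vectors standing in for real embeddings of a
# query and two candidate chunks (real models emit hundreds of dims).
query = [0.9, 0.1, 0.0]
chunk_on_topic = [0.8, 0.2, 0.1]
chunk_off_topic = [0.0, 0.1, 0.9]

# A better embedding model is one that widens this gap for
# genuinely related versus unrelated text.
assert cosine_similarity(query, chunk_on_topic) > cosine_similarity(query, chunk_off_topic)
```

Benchmarks like MTEB essentially measure how reliably a model produces this separation across many retrieval tasks.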

0 replies

Will this be helpful for websites which have enabled anti-scraping techniques?

0 replies
Category
General
Labels
None yet
6 participants
@ftoppi @PeriniM @MarcoVinciguerra @VinciGit00 @FarVision2 @gaurav15113010
