Add VN publisher (VnExpress) #802
Hey, thank you so much for adding our first Vietnamese publisher! This looks quite good already. I only have a couple of remarks to simplify the code.
| sources=[ | ||
| RSSFeed("https://vnexpress.net/rss/tin-moi-nhat.rss"), | ||
| Sitemap("https://vnexpress.net/sitemap.xml"), | ||
| NewsMap("https://vnexpress.net/google-news-sitemap.xml"), |
It seems they tricked you with the sitemap links: they all redirect to the home page.
Instead, you can add the other RSS feeds listed at https://vnexpress.net/rss as sources.
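One way to catch sources like these before adding them is to compare the requested URL with the URL the server finally answers with. The sketch below is only an illustration using the standard library; `redirects_to_home` and `final_url` are hypothetical helper names and not part of this project.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def redirects_to_home(requested: str, final: str) -> bool:
    """Return True if a non-root URL ended up on the site's root page."""
    requested_path = urlsplit(requested).path.strip("/")
    final_path = urlsplit(final).path.strip("/")
    return requested_path != "" and final_path == ""


def final_url(url: str, timeout: float = 10.0) -> str:
    """Follow redirects and return the URL that was finally served (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

Calling `redirects_to_home(url, final_url(url))` for each candidate sitemap or feed would flag the ones that only lead back to the home page.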
| @attribute | ||
| def title(self) -> Optional[str]: | ||
| title_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/headline") |
You can use scalar=True here, and it will not return a List. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/headline") to be unreliable? Usually, relying on the JSON should be sufficient.
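To illustrate the difference, here is a minimal sketch with a stand-in object. `FakeLinkedData` is hypothetical and only mimics the call shape from the diff; it is not the library's implementation, and the behavior for empty results (returning `None`) is an assumption.

```python
from typing import Any, Dict, List, Optional, Union


class FakeLinkedData:
    """Tiny stand-in for the linked-data object; NOT the library's implementation."""

    def __init__(self, items: List[Dict[str, Any]]) -> None:
        self._items = items

    def xpath_search(self, query: str, scalar: bool = False) -> Union[Any, List[Any]]:
        # Only understands the "//Type/key" shape used in this review.
        ld_type, key = query.lstrip("/").split("/")
        results = [item[key] for item in self._items if item.get("@type") == ld_type and key in item]
        if scalar:
            # Assumption: scalar mode yields a single value, or None when nothing matches.
            return results[0] if results else None
        return results


ld = FakeLinkedData([{"@type": "NewsArticle", "headline": "Example headline"}])

# List mode: the caller has to unpack the result manually.
title_list = ld.xpath_search("//NewsArticle/headline")
title_from_list: Optional[str] = title_list[0] if title_list else None

# Scalar mode: the value comes back directly.
title_scalar: Optional[str] = ld.xpath_search("//NewsArticle/headline", scalar=True)
```

With scalar mode, the attribute body can shrink to a single return statement instead of checking and indexing a list.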
| @attribute | ||
| def authors(self) -> List[str]: | ||
| author_data_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/author") |
You can pass whatever you get back directly into generic_author_parsing. It is designed to work with various inputs. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/author") to be unreliable? Usually, relying on the JSON should be sufficient.
| @attribute | ||
| def publishing_date(self) -> Optional[datetime]: | ||
| date_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/datePublished") |
Here, you can use scalar=True as well, and it should be sufficient to use the JSON value.
| @attribute | ||
| def topics(self) -> List[str]: | ||
| ld_topics = self._parse_ld_keywords() |
You can simplify this greatly by just using generic_topic_parsing(self.precomputed.meta.get("keywords")), which essentially does the same thing your custom helper methods do.
| class VnExpressIntlParser(ParserProxy): | ||
| class V1(BaseParser): |
The images attribute seems to be missing in this parser.
| class VnExpressIntlParser(ParserProxy): | ||
| class V1(BaseParser): | ||
| _summary_selector = CSSSelector("p.description") | ||
| _paragraph_selector = CSSSelector("article.fck_detail > p") |
In this article, the author is also extracted from the bottom of the article.
| ld_topics = self._parse_ld_keywords() | ||
| if ld_topics: | ||
| return ld_topics | ||
| return self._parse_meta_topics() No newline at end of file |
There are also some bloat topics like Tin nóng (= hot news), which should be removed.
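A minimal sketch of such a filter, assuming the topics are already available as a list of strings. `BLOAT_TOPICS` and `remove_bloat_topics` are hypothetical names, only "Tin nóng" is taken from this review, and "Kinh tế" in the usage line is just an example value.

```python
from typing import Iterable, List

# Assumption: only "Tin nóng" is named in the review; extend this set as needed.
BLOAT_TOPICS = {"tin nóng"}


def remove_bloat_topics(topics: Iterable[str]) -> List[str]:
    """Drop generic tags (compared case-insensitively) while keeping the original order."""
    return [topic for topic in topics if topic.strip().casefold() not in BLOAT_TOPICS]
```

For example, `remove_bloat_topics(["Tin nóng", "Kinh tế"])` keeps only the second entry. The filter could be applied to whatever list the topics attribute would otherwise return.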
| class SupportsBool(Protocol): | ||
| def __bool__(self) -> bool: | ||
| ... | ||
| def __bool__(self) -> bool: ... |
You probably have a different black version installed. This PR should normally not edit these files.
Hi, I’ve added a new VN publisher (VnExpress).
Please review when you have time. Thank you!