Add VN publisher (VnExpress) #802


Open
bachthyaglx wants to merge 4 commits into flairNLP:master from bachthyaglx:add-vn-publisher

bachthyaglx commented Oct 22, 2025 (edited)

Hi, I’ve added a new VN publisher (VnExpress).
Please review when you have time. Thank you!

addie9800 (Collaborator) left a comment:

Hey, thank you so much for adding our first Vietnamese publisher! This looks quite good already. I only have a couple of remarks to simplify the code.

sources=[
    RSSFeed("https://vnexpress.net/rss/tin-moi-nhat.rss"),
    Sitemap("https://vnexpress.net/sitemap.xml"),
    NewsMap("https://vnexpress.net/google-news-sitemap.xml"),

It seems they tricked you with the sitemap links: they all redirect to the home page.


Instead, you can add the other RSS feeds listed at https://vnexpress.net/rss as sources.


@attribute
def title(self) -> Optional[str]:
    title_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/headline")

You can use scalar=True here, and it will not return a List. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/headline") to be unreliable? Usually, relying on the JSON should be sufficient.
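
To illustrate why a scalar lookup simplifies the attribute, here is a minimal stand-in that does not import the library. The `search` function is hypothetical and only mimics a list-or-scalar lookup; it assumes scalar mode yields the single match or None, which may differ from the real behaviour in edge cases.

```python
from typing import Any, List, Optional, Union


def search(values: List[Any], scalar: bool = False) -> Union[List[Any], Optional[Any]]:
    """Hypothetical stand-in for a list-or-scalar lookup (not the library's code)."""
    if not scalar:
        return values
    if not values:
        return None
    if len(values) > 1:
        # Assumption: more than one match is treated as an error in scalar mode.
        raise ValueError("expected at most one match")
    return values[0]


def title_from_list(matches: List[Any]) -> Optional[str]:
    # List mode: the caller has to unpack and guard against empty results.
    result = search(matches)
    return str(result[0]) if result else None


def title_from_scalar(matches: List[Any]) -> Optional[Any]:
    # Scalar mode: the lookup already yields the value or None.
    return search(matches, scalar=True)
```

Both helpers give the same result for zero or one match; the scalar variant just needs no unpacking in the parser.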


@attribute
def authors(self) -> List[str]:
    author_data_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/author")

You can pass whatever you get back directly into generic_author_parsing. It is designed to work with various inputs. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/author") to be unreliable? Usually, relying on the JSON should be sufficient.
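
As a rough picture of what "works with various inputs" can mean, the sketch below normalises the shapes an author field in JSON-LD commonly takes (a string, a dict with a name, or a list of either). `normalize_authors` is a hypothetical helper written for this illustration; it is not the implementation of generic_author_parsing and the comma splitting is an assumption.

```python
from typing import Any, List


def normalize_authors(value: Any) -> List[str]:
    """Hypothetical helper: flatten common author shapes into a list of names."""
    if value is None:
        return []
    if isinstance(value, str):
        # Assumption: several names in one string are comma-separated.
        return [name.strip() for name in value.split(",") if name.strip()]
    if isinstance(value, dict):
        # JSON-LD Person/Organization objects usually carry the name under "name".
        return normalize_authors(value.get("name"))
    if isinstance(value, list):
        authors: List[str] = []
        for item in value:
            for name in normalize_authors(item):
                if name not in authors:  # keep order, drop duplicates
                    authors.append(name)
        return authors
    return []
```

With a helper of this kind the parser does not need to branch on the input type itself, which is the simplification the comment is asking for.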


@attribute
def publishing_date(self) -> Optional[datetime]:
    date_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/datePublished")

Here, you can use scalar=True as well, and it should be sufficient to use the JSON value.


@attribute
def topics(self) -> List[str]:
    ld_topics = self._parse_ld_keywords()

You can simplify this greatly by just using generic_topic_parsing(self.precomputed.meta.get("keywords")), which essentially does the same thing your custom helper methods do.



class VnExpressIntlParser(ParserProxy):
    class V1(BaseParser):

The images attribute seems to be missing in this parser.

class VnExpressIntlParser(ParserProxy):
    class V1(BaseParser):
        _summary_selector = CSSSelector("p.description")
        _paragraph_selector = CSSSelector("article.fck_detail > p")

In this article, the author is also extracted from the bottom of the article.

ld_topics = self._parse_ld_keywords()
if ld_topics:
    return ld_topics
return self._parse_meta_topics()

There are also some bloat topics, such as Tin nóng (= hot news), which should be removed.
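
One possible shape for that filtering, as a self-contained sketch: split the keywords string, drop empty entries, duplicates, and anything on a block list. `parse_topics` and `BLOAT_TOPICS` are hypothetical names for this illustration, the comma delimiter is an assumption, and the block list contains only the one topic named above.

```python
import unicodedata
from typing import List, Optional


def _key(text: str) -> str:
    # Normalise so composed and decomposed Vietnamese accents compare equal.
    return unicodedata.normalize("NFC", text).casefold()


# Hypothetical block list; only "Tin nóng" is named in the review.
BLOAT_TOPICS = {_key("Tin nóng")}


def parse_topics(keywords: Optional[str], delimiter: str = ",") -> List[str]:
    """Split a keywords string and remove bloat topics and duplicates."""
    if not keywords:
        return []
    topics: List[str] = []
    for raw in keywords.split(delimiter):
        topic = raw.strip()
        if not topic or _key(topic) in BLOAT_TOPICS:
            continue
        if topic not in topics:  # keep first occurrence, preserve order
            topics.append(topic)
    return topics
```

Comparing on a normalised, case-folded key keeps the original spelling in the output while still catching variants such as "tin nóng".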

class SupportsBool(Protocol):
    def __bool__(self) -> bool:
        ...
    def __bool__(self) -> bool: ...

You probably have a different black version installed. This PR should normally not edit these files.

addie9800 self-assigned this Oct 26, 2025

Reviewers: addie9800 (requested changes)
Assignees: addie9800
