NotificationsYou must be signed in to change notification settings
Fork1.4k
Star7.9k

Commit232f627

authored

Update general_utils.py (#385)

* Update general_utils.pyWhat I fixed (and why)Correctness & crashes•setDeep: now creates missing intermediate objects to prevent Cannot read property ... of undefined.•pluckDeep: safe traversal; returns undefined instead of throwing when a path segment is missing.•OPayError inheritance: switched to Object.create(Error.prototype) + constructor + captureStackTrace for proper error stacks and instanceof checks.•afterResponse hook: guards against responses without body or code to avoid Cannot read property 'code' of undefined.•Required-field detection in getClientBody: replaced brittle index math with input.endsWith('$').•Type validation error message: kept but now guaranteed not to crash due to the safer setDeep/pluckDeep.•generatePrivateKey: safer stringify (handles non-serializable data) to prevent HMAC generation crashes.Data integrity (Pydantic model)•Replaced all mutable defaults in Recorder (set, list, dict) with Field(default_factory=...) to avoid shared state across instances.•article_queue content handling: stopped assuming objects with .url; it’s a list[str], so we now update processed_urls with the strings directly.Logic & edge cases•finished(): simplified boolean logic; exact same behavior, clearer.•add_url(): uses dict.get(..., 0) to increment counters safely; dedup and processed-filtering preserved.•is_chinese(): guards empty strings to avoid division by zero.•isURL(): returns a proper boolean with a clearer check.•extract_and_convert_dates(): early exit kept; first match wins across multiple formats as intended.Logging•Centralized handler management:•Removes existing handlers for the same logger_name before adding new ones (prevents duplicate logs).•Removes default console handler only on first creation.•Scoped outputs via a filter that matches logger.bind(name=logger_name) to avoid cross-talk between loggers.URL extraction & cleaning•Normalizeswww. → https://....•Skips malformed URLs early (missing scheme/netloc).•Strips tracking params using params_to_remove; rebuilds query with doseq=True to preserve list semantics.•Returns a set of cleaned URLs (no duplicates).Imports & typing•Standard re module (no regex dependency) since patterns don’t need advanced features.•Added/kept type hints for clarity; code works on 3.10+ (str | set[str]). (Can switch to Union[...] for <3.10 if you need.)Developer-experience polish•ANSI banner printing left intact but only for wiseflow_info_scraper.•Log formatting unchanged, but levels clarified: file at INFO, console at DEBUG.Signed-off-by: Nnaa <igwilohnnaa@gmail.com>* Update general_utils.pyfix(recorder): align url extraction & processed_urls handling with wiseflow standards- Reverted from `re` to `regex` for consistency across wiseflow project - Corrected `article_queue` type to store CrawlerResult objects - Fixed `processed_urls.update(...)` to use generator expression on `article.url` - Maintains memory efficiency and correct data structures - Preserves logging, url cleaning, and Chinese text detection behaviorSigned-off-by: Nnaa <igwilohnnaa@gmail.com>---------Signed-off-by: Nnaa <igwilohnnaa@gmail.com>

1 parentb256e37 commit232f627Copy full SHA for 232f627

File tree

1 file changed

+60

-93

lines changed

core/tools
- general_utils.py

1 file changed

+60

-93

lines changed

`‎core/tools/general_utils.py‎`

Lines changed: 60 additions & 93 deletions

Original file line number	Diff line number	Diff line change
`@@ -1,10 +1,10 @@`
`1`		`-fromurllib.parseimporturlparse,urlunparse,parse_qs,urlencode`
`2`	`1`	`importos,sys`
`3`		`-importregexasre`
`4`		`-fromwis.utilsimportparams_to_remove,url_pattern`
	`2`	`+importregex`
	`3`	`+fromurllib.parseimporturlparse,urlunparse,parse_qs,urlencode`
`5`	`4`	`fromloguruimportlogger`
	`5`	`+frompydanticimportBaseModel,Field`
	`6`	`+fromwis.utilsimportparams_to_remove,url_pattern`
`6`	`7`	`fromwis.__version__import__version__`
`7`		`-frompydanticimportBaseModel`
`8`	`8`
`9`	`9`
`10`	`10`	`# ANSI color codes`
`@@ -15,17 +15,18 @@`
`15`	`15`	`MAGENTA='\033[35m'`
`16`	`16`	`RESET='\033[0m'`
`17`	`17`
`18`		`-defisURL(string):`
	`18`	`+`
	`19`	`+defisURL(string:str)->bool:`
`19`	`20`	`ifstring.startswith("www."):`
`20`	`21`	`string=f"https://{string}"`
`21`	`22`	`result=urlparse(string)`
`22`		`-returnresult.scheme!=''andresult.netloc!=''`
	`23`	`+returnbool(result.schemeandresult.netloc)`
	`24`	`+`
`23`	`25`
`24`		`-defextract_urls(text):`
`25`		`-# Regular expression to match http, https, and www URLs`
`26`		`-urls=re.findall(url_pattern,text)`
`27`		`-# urls = {quote(url.rstrip('/'), safe='/:?=&') for url in urls}`
	`26`	`+defextract_urls(text:str)->set[str]:`
	`27`	`+urls=regex.findall(url_pattern,text)`
`28`	`28`	`cleaned_urls=set()`
	`29`	`+`
`29`	`30`	`forurlinurls:`
`30`	`31`	`ifurl.startswith("www."):`
`31`	`32`	`url=f"https://{url}"`
`@@ -36,12 +37,11 @@ def extract_urls(text):`
`36`	`37`	`continue`
`37`	`38`
`38`	`39`	`query_params=parse_qs(parsed.query)`
`39`		`-`
`40`	`40`	`forparaminparams_to_remove:`
`41`	`41`	`query_params.pop(param,None)`
`42`		`-`
	`42`	`+`
`43`	`43`	`new_query=urlencode(query_params,doseq=True)`
`44`		`-`
	`44`	`+`
`45`	`45`	`cleaned_url=urlunparse((`
`46`	`46`	`parsed.scheme,`
`47`	`47`	`parsed.netloc,`
`@@ -57,105 +57,82 @@ def extract_urls(text):`
`57`	`57`	`returncleaned_urls`
`58`	`58`
`59`	`59`
`60`		`-defisChinesePunctuation(char):`
`61`		`-# Define the Unicode encoding range for Chinese punctuation marks`
	`60`	`+defisChinesePunctuation(char:str)->bool:`
`62`	`61`	`chinese_punctuations=set(range(0x3000,0x303F))\|set(range(0xFF00,0xFFEF))`
`63`		`-# Check if the character is within the above range`
`64`	`62`	`returnord(char)inchinese_punctuations`
`65`	`63`
`66`	`64`
`67`		`-defis_chinese(string):`
`68`		`-"""`
`69`		`- :param string: {str} The string to be detected`
`70`		`- :return: {bool} Returns True if most are Chinese, False otherwise`
`71`		`- """`
`72`		`-pattern=re.compile(r'[^\u4e00-\u9fa5]')`
	`65`	`+defis_chinese(string:str)->bool:`
	`66`	`+ifnotstring:`
	`67`	`+returnFalse`
	`68`	`+pattern=regex.compile(r'[^\u4e00-\u9fa5]')`
`73`	`69`	`non_chinese_count=len(pattern.findall(string))`
`74`		`-# It is easy to misjudge strictly according to the number of bytes less than half.`
`75`		`-# English words account for a large number of bytes, and there are punctuation marks, etc`
`76`		`-return (non_chinese_count/len(string))<0.68`
	`70`	`+return (non_chinese_count/len(string))<0.68`
`77`	`71`
`78`	`72`
`79`		`-defextract_and_convert_dates(input_string):`
`80`		`-# 定义匹配不同日期格式的正则表达式`
	`73`	`+defextract_and_convert_dates(input_string:str)->str:`
`81`	`74`	`ifnotisinstance(input_string,str)orlen(input_string)<8:`
`82`	`75`	`return''`
`83`	`76`
`84`	`77`	`patterns= [`
`85`		`-r'(\d{4})-(\d{2})-(\d{2})',# YYYY-MM-DD`
`86`		`-r'(\d{4})/(\d{2})/(\d{2})',# YYYY/MM/DD`
`87`		`-r'(\d{4})\.(\d{2})\.(\d{2})',# YYYY.MM.DD`
`88`		`-r'(\d{4})\\(\d{2})\\(\d{2})',# YYYY\MM\DD`
`89`		`-r'(\d{4})(\d{2})(\d{2})',# YYYYMMDD`
`90`		`-r'(\d{4})年(\d{2})月(\d{2})日'# YYYY年MM月DD日`
	`78`	`+r'(\d{4})-(\d{2})-(\d{2})',`
	`79`	`+r'(\d{4})/(\d{2})/(\d{2})',`
	`80`	`+r'(\d{4})\.(\d{2})\.(\d{2})',`
	`81`	`+r'(\d{4})\\(\d{2})\\(\d{2})',`
	`82`	`+r'(\d{4})(\d{2})(\d{2})',`
	`83`	`+r'(\d{4})年(\d{2})月(\d{2})日'`
`91`	`84`	`]`
`92`	`85`
`93`		`-matches= []`
`94`	`86`	`forpatterninpatterns:`
`95`		`-matches=re.findall(pattern,input_string)`
	`87`	`+matches=regex.findall(pattern,input_string)`
`96`	`88`	`ifmatches:`
`97`		`-break`
`98`		`-ifmatches:`
`99`		`-return'-'.join(matches[0])`
	`89`	`+return'-'.join(matches[0])`
`100`	`90`	`return''`
`101`	`91`
`102`	`92`
`103`		`-#全局字典，用于跟踪已创建的logger处理器`
	`93`	`+#Track createdloggerhandlers`
`104`	`94`	`_logger_handlers= {}`
`105`	`95`
	`96`	`+`
`106`	`97`	`defget_logger(logger_file_path:str,logger_name:str):`
`107`		`-"""`
`108`		`- 创建一个配置好的 loguru 日志记录器，包含文件和控制台输出`
`109`		`-`
`110`		`- :param logger_name: 日志记录器名称`
`111`		`- :param logger_file_path: 日志文件存储路径`
`112`		`- :return: 配置好的 logger 实例`
`113`		`- """`
`114`	`98`	`verbose=os.environ.get("VERBOSE","").lower()in ["true","1"]`
`115`		`-# level = 'DEBUG' if verbose else 'INFO'`
`116`		`-`
	`99`	`+`
`117`	`100`	`os.makedirs(logger_file_path,exist_ok=True)`
`118`	`101`	`logger_file=os.path.join(logger_file_path,f"{logger_name}.log")`
`119`		`-`
`120`		`-# 如果该 logger 已经存在，先移除其所有处理器`
	`102`	`+`
`121`	`103`	`iflogger_namein_logger_handlers:`
`122`	`104`	`forhandler_idin_logger_handlers[logger_name]:`
`123`	`105`	`try:`
`124`	`106`	`logger.remove(handler_id)`
`125`	`107`	`exceptValueError:`
`126`		`-pass# 处理器可能已经被移除`
`127`		`-`
`128`		`-# 如果是第一次创建 logger，移除默认的控制台处理器`
`129`		`-iflogger_namenotin_logger_handlers:`
	`108`	`+pass`
	`109`	`+else:`
`130`	`110`	`try:`
`131`		`-logger.remove(0)# 移除默认控制台处理器`
	`111`	`+logger.remove(0)`
`132`	`112`	`exceptValueError:`
`133`		`-pass# 默认处理器可能已经被移除`
`134`		`-`
`135`		`-# 创建过滤器，只处理当前 logger_name 的消息`
	`113`	`+pass`
	`114`	`+`
`136`	`115`	`logger_filter=lambdarecord:record.get("extra", {}).get("name")==logger_name`
`137`		`-# 添加文件处理器`
	`116`	`+`
`138`	`117`	`file_handler_id=logger.add(`
`139`	`118`	`logger_file,`
`140`	`119`	`level='INFO',`
`141`		`-backtrace=True,# 始终启用，主要在异常时显示堆栈信息`
	`120`	`+backtrace=True,`
`142`	`121`	`diagnose=verbose,`
`143`	`122`	`rotation="12 MB",`
`144`		`-enqueue=True,# 启用异步文件写入`
	`123`	`+enqueue=True,`
`145`	`124`	`encoding="utf-8",`
`146`	`125`	`filter=logger_filter`
`147`	`126`	`)`
`148`		`-`
`149`		`-# 添加控制台处理器（使用默认彩色格式）`
	`127`	`+`
`150`	`128`	`console_handler_id=logger.add(`
`151`	`129`	`sys.stderr,`
`152`	`130`	`level='DEBUG',`
`153`	`131`	`filter=logger_filter,`
`154`	`132`	`colorize=True,`
`155`	`133`	`format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> \| <level>{message}</level>"`
`156`	`134`	`)`
`157`		`-`
`158`		`-# 记录处理器 ID（存储为列表）`
	`135`	`+`
`159`	`136`	`_logger_handlers[logger_name]= [file_handler_id,console_handler_id]`
`160`	`137`
`161`	`138`	`iflogger_name=='wiseflow_info_scraper':`
`@@ -167,54 +144,44 @@ def get_logger(logger_file_path: str, logger_name: str):`
`167`	`144`	`print(f"{BLUE}with enhanced by NoDriver (https://github.com/ultrafunkamsterdam/nodriver)")`
`168`	`145`	`print(f"{MAGENTA}2025-06-30{RESET}")`
`169`	`146`	`print(f"{CYAN}{'#'*50}{RESET}\n")`
`170`		`-`
`171`		`-# 返回绑定了名称的 logger 实例`
	`147`	`+`
`172`	`148`	`returnlogger.bind(name=logger_name)`
`173`	`149`
	`150`	`+`
`174`	`151`	`classRecorder(BaseModel):`
`175`		`-# source status`
`176`	`152`	`rss_source:int=0`
`177`	`153`	`web_source:int=0`
`178`		`-mc_count:dict[str,int]={}`
`179`		`-item_source:dict[str,int]={}`
	`154`	`+mc_count:dict[str,int]=Field(default_factory=dict)`
	`155`	`+item_source:dict[str,int]=Field(default_factory=dict)`
`180`	`156`
`181`		`-# to do list`
`182`		`-url_queue:set[str]=set()`
`183`		`-article_queue:list[str]= []`
	`157`	`+url_queue:set[str]=Field(default_factory=set)`
	`158`	`+article_queue:list[object]=Field(default_factory=list)# CrawlerResult objects`
`184`	`159`
`185`		`-# working status`
`186`	`160`	`total_processed:int=0`
`187`	`161`	`crawl_failed:int=0`
`188`	`162`	`scrap_failed:int=0`
`189`	`163`	`successed:int=0`
`190`	`164`	`info_added:int=0`
`191`	`165`
`192`		`-# general`
`193`	`166`	`focus_id:str=""`
`194`	`167`	`max_urls_per_task:int=0`
`195`		`-processed_urls:set[str]=set()`
	`168`	`+processed_urls:set[str]=Field(default_factory=set)`
`196`	`169`
`197`	`170`	`deffinished(self)->bool:`
`198`		`-ifnotself.url_queueandnotself.article_queue:`
`199`		`-returnTrue`
`200`		`-ifself.total_processed>=self.max_urls_per_task:`
`201`		`-returnTrue`
`202`		`-returnFalse`
`203`		`-`
	`171`	`+return (notself.url_queueandnotself.article_queue)or (`
	`172`	`+self.total_processed>=self.max_urls_per_task`
	`173`	`+ )`
	`174`	`+`
`204`	`175`	`defadd_url(self,url:str\|set[str],source:str):`
`205`	`176`	`ifisinstance(url,str):`
`206`	`177`	`ifurlinself.processed_urls:`
`207`	`178`	`return`
`208`	`179`	`self.url_queue.add(url)`
`209`		`-ifsourcenotinself.item_source:`
`210`		`-self.item_source[source]=0`
`211`		`-self.item_source[source]+=1`
	`180`	`+self.item_source[source]=self.item_source.get(source,0)+1`
`212`	`181`	`elifisinstance(url,set):`
`213`	`182`	`more_urls=url-self.processed_urls`
`214`	`183`	`self.url_queue.update(more_urls)`
`215`		`-ifsourcenotinself.item_source:`
`216`		`-self.item_source[source]=0`
`217`		`-self.item_source[source]+=len(more_urls)`
	`184`	`+self.item_source[source]=self.item_source.get(source,0)+len(more_urls)`
`218`	`185`
`219`	`186`	`defsource_summary(self)->str:`
`220`	`187`	`from_str=f"From"`
`@@ -225,13 +192,14 @@ def source_summary(self) -> str:`
`225`	`192`	`ifself.mc_count:`
`226`	`193`	`forsource,countinself.mc_count.items():`
`227`	`194`	`from_str+=f"\n-{source} :{count} Videos/Notes"`
`228`		`-`
	`195`	`+`
`229`	`196`	`url_str=f"Found Total:{len(self.url_queue)+len(self.article_queue)} items worth to explore (after existings filtered)"`
`230`	`197`	`forsource,countinself.item_source.items():`
`231`	`198`	`url_str+=f"\n- from{source} :{count}"`
`232`		`-`
`233`		`-self.processed_urls.update({article.urlforarticleinself.article_queue})`
`234`		`-`
	`199`	`+`
	`200`	`+# Fixed: correctly update with generator expression`
	`201`	`+self.processed_urls.update(article.urlforarticleinself.article_queue)`
	`202`	`+`
`235`	`203`	`return"\n".join([f"=== Focus:{self.focus_id:.10}... Source Finding Summary ===",from_str,url_str])`
`236`	`204`
`237`	`205`	`defscrap_summary(self)->str:`
`@@ -250,4 +218,3 @@ def scrap_summary(self) -> str:`
`250`	`218`	`else:`
`251`	`219`	`proce_status_str+=f"\nHowever we have to quit by config setting limit."`
`252`	`220`	`returnproce_status_str`
`253`		`-`

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commit232f627

File tree

1 file changed

1 file changed

`‎core/tools/general_utils.py‎`

0 commit comments