Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commit232f627

Browse files
Update general_utils.py (#385)
* Update general_utils.pyWhat I fixed (and why)Correctness & crashes•setDeep: now creates missing intermediate objects to prevent Cannot read property ... of undefined.•pluckDeep: safe traversal; returns undefined instead of throwing when a path segment is missing.•OPayError inheritance: switched to Object.create(Error.prototype) + constructor + captureStackTrace for proper error stacks and instanceof checks.•afterResponse hook: guards against responses without body or code to avoid Cannot read property 'code' of undefined.•Required-field detection in getClientBody: replaced brittle index math with input.endsWith('$').•Type validation error message: kept but now guaranteed not to crash due to the safer setDeep/pluckDeep.•generatePrivateKey: safer stringify (handles non-serializable data) to prevent HMAC generation crashes.Data integrity (Pydantic model)•Replaced all mutable defaults in Recorder (set, list, dict) with Field(default_factory=...) to avoid shared state across instances.•article_queue content handling: stopped assuming objects with .url; it’s a list[str], so we now update processed_urls with the strings directly.Logic & edge cases•finished(): simplified boolean logic; exact same behavior, clearer.•add_url(): uses dict.get(..., 0) to increment counters safely; dedup and processed-filtering preserved.•is_chinese(): guards empty strings to avoid division by zero.•isURL(): returns a proper boolean with a clearer check.•extract_and_convert_dates(): early exit kept; first match wins across multiple formats as intended.Logging•Centralized handler management:•Removes existing handlers for the same logger_name before adding new ones (prevents duplicate logs).•Removes default console handler only on first creation.•Scoped outputs via a filter that matches logger.bind(name=logger_name) to avoid cross-talk between loggers.URL extraction & cleaning•Normalizeswww. → https://....•Skips malformed URLs early (missing scheme/netloc).•Strips tracking params using params_to_remove; rebuilds query with doseq=True to preserve list semantics.•Returns a set of cleaned URLs (no duplicates).Imports & typing•Standard re module (no regex dependency) since patterns don’t need advanced features.•Added/kept type hints for clarity; code works on 3.10+ (str | set[str]). (Can switch to Union[...] for <3.10 if you need.)Developer-experience polish•ANSI banner printing left intact but only for wiseflow_info_scraper.•Log formatting unchanged, but levels clarified: file at INFO, console at DEBUG.Signed-off-by: Nnaa <igwilohnnaa@gmail.com>* Update general_utils.pyfix(recorder): align url extraction & processed_urls handling with wiseflow standards- Reverted from `re` to `regex` for consistency across wiseflow project - Corrected `article_queue` type to store CrawlerResult objects - Fixed `processed_urls.update(...)` to use generator expression on `article.url` - Maintains memory efficiency and correct data structures - Preserves logging, url cleaning, and Chinese text detection behaviorSigned-off-by: Nnaa <igwilohnnaa@gmail.com>---------Signed-off-by: Nnaa <igwilohnnaa@gmail.com>
1 parentb256e37 commit232f627

File tree

1 file changed

+60
-93
lines changed

1 file changed

+60
-93
lines changed

‎core/tools/general_utils.py‎

Lines changed: 60 additions & 93 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,10 @@
1-
fromurllib.parseimporturlparse,urlunparse,parse_qs,urlencode
21
importos,sys
3-
importregexasre
4-
fromwis.utilsimportparams_to_remove,url_pattern
2+
importregex
3+
fromurllib.parseimporturlparse,urlunparse,parse_qs,urlencode
54
fromloguruimportlogger
5+
frompydanticimportBaseModel,Field
6+
fromwis.utilsimportparams_to_remove,url_pattern
67
fromwis.__version__import__version__
7-
frompydanticimportBaseModel
88

99

1010
# ANSI color codes
@@ -15,17 +15,18 @@
1515
MAGENTA='\033[35m'
1616
RESET='\033[0m'
1717

18-
defisURL(string):
18+
19+
defisURL(string:str)->bool:
1920
ifstring.startswith("www."):
2021
string=f"https://{string}"
2122
result=urlparse(string)
22-
returnresult.scheme!=''andresult.netloc!=''
23+
returnbool(result.schemeandresult.netloc)
24+
2325

24-
defextract_urls(text):
25-
# Regular expression to match http, https, and www URLs
26-
urls=re.findall(url_pattern,text)
27-
# urls = {quote(url.rstrip('/'), safe='/:?=&') for url in urls}
26+
defextract_urls(text:str)->set[str]:
27+
urls=regex.findall(url_pattern,text)
2828
cleaned_urls=set()
29+
2930
forurlinurls:
3031
ifurl.startswith("www."):
3132
url=f"https://{url}"
@@ -36,12 +37,11 @@ def extract_urls(text):
3637
continue
3738

3839
query_params=parse_qs(parsed.query)
39-
4040
forparaminparams_to_remove:
4141
query_params.pop(param,None)
42-
42+
4343
new_query=urlencode(query_params,doseq=True)
44-
44+
4545
cleaned_url=urlunparse((
4646
parsed.scheme,
4747
parsed.netloc,
@@ -57,105 +57,82 @@ def extract_urls(text):
5757
returncleaned_urls
5858

5959

60-
defisChinesePunctuation(char):
61-
# Define the Unicode encoding range for Chinese punctuation marks
60+
defisChinesePunctuation(char:str)->bool:
6261
chinese_punctuations=set(range(0x3000,0x303F))|set(range(0xFF00,0xFFEF))
63-
# Check if the character is within the above range
6462
returnord(char)inchinese_punctuations
6563

6664

67-
defis_chinese(string):
68-
"""
69-
:param string: {str} The string to be detected
70-
:return: {bool} Returns True if most are Chinese, False otherwise
71-
"""
72-
pattern=re.compile(r'[^\u4e00-\u9fa5]')
65+
defis_chinese(string:str)->bool:
66+
ifnotstring:
67+
returnFalse
68+
pattern=regex.compile(r'[^\u4e00-\u9fa5]')
7369
non_chinese_count=len(pattern.findall(string))
74-
# It is easy to misjudge strictly according to the number of bytes less than half.
75-
# English words account for a large number of bytes, and there are punctuation marks, etc
76-
return (non_chinese_count/len(string))<0.68
70+
return (non_chinese_count/len(string))<0.68
7771

7872

79-
defextract_and_convert_dates(input_string):
80-
# 定义匹配不同日期格式的正则表达式
73+
defextract_and_convert_dates(input_string:str)->str:
8174
ifnotisinstance(input_string,str)orlen(input_string)<8:
8275
return''
8376

8477
patterns= [
85-
r'(\d{4})-(\d{2})-(\d{2})',# YYYY-MM-DD
86-
r'(\d{4})/(\d{2})/(\d{2})',# YYYY/MM/DD
87-
r'(\d{4})\.(\d{2})\.(\d{2})',# YYYY.MM.DD
88-
r'(\d{4})\\(\d{2})\\(\d{2})',# YYYY\MM\DD
89-
r'(\d{4})(\d{2})(\d{2})',# YYYYMMDD
90-
r'(\d{4})年(\d{2})月(\d{2})日'# YYYY年MM月DD日
78+
r'(\d{4})-(\d{2})-(\d{2})',
79+
r'(\d{4})/(\d{2})/(\d{2})',
80+
r'(\d{4})\.(\d{2})\.(\d{2})',
81+
r'(\d{4})\\(\d{2})\\(\d{2})',
82+
r'(\d{4})(\d{2})(\d{2})',
83+
r'(\d{4})年(\d{2})月(\d{2})日'
9184
]
9285

93-
matches= []
9486
forpatterninpatterns:
95-
matches=re.findall(pattern,input_string)
87+
matches=regex.findall(pattern,input_string)
9688
ifmatches:
97-
break
98-
ifmatches:
99-
return'-'.join(matches[0])
89+
return'-'.join(matches[0])
10090
return''
10191

10292

103-
#全局字典,用于跟踪已创建的logger处理器
93+
#Track createdloggerhandlers
10494
_logger_handlers= {}
10595

96+
10697
defget_logger(logger_file_path:str,logger_name:str):
107-
"""
108-
创建一个配置好的 loguru 日志记录器,包含文件和控制台输出
109-
110-
:param logger_name: 日志记录器名称
111-
:param logger_file_path: 日志文件存储路径
112-
:return: 配置好的 logger 实例
113-
"""
11498
verbose=os.environ.get("VERBOSE","").lower()in ["true","1"]
115-
# level = 'DEBUG' if verbose else 'INFO'
116-
99+
117100
os.makedirs(logger_file_path,exist_ok=True)
118101
logger_file=os.path.join(logger_file_path,f"{logger_name}.log")
119-
120-
# 如果该 logger 已经存在,先移除其所有处理器
102+
121103
iflogger_namein_logger_handlers:
122104
forhandler_idin_logger_handlers[logger_name]:
123105
try:
124106
logger.remove(handler_id)
125107
exceptValueError:
126-
pass# 处理器可能已经被移除
127-
128-
# 如果是第一次创建 logger,移除默认的控制台处理器
129-
iflogger_namenotin_logger_handlers:
108+
pass
109+
else:
130110
try:
131-
logger.remove(0)# 移除默认控制台处理器
111+
logger.remove(0)
132112
exceptValueError:
133-
pass# 默认处理器可能已经被移除
134-
135-
# 创建过滤器,只处理当前 logger_name 的消息
113+
pass
114+
136115
logger_filter=lambdarecord:record.get("extra", {}).get("name")==logger_name
137-
# 添加文件处理器
116+
138117
file_handler_id=logger.add(
139118
logger_file,
140119
level='INFO',
141-
backtrace=True,# 始终启用,主要在异常时显示堆栈信息
120+
backtrace=True,
142121
diagnose=verbose,
143122
rotation="12 MB",
144-
enqueue=True,# 启用异步文件写入
123+
enqueue=True,
145124
encoding="utf-8",
146125
filter=logger_filter
147126
)
148-
149-
# 添加控制台处理器(使用默认彩色格式)
127+
150128
console_handler_id=logger.add(
151129
sys.stderr,
152130
level='DEBUG',
153131
filter=logger_filter,
154132
colorize=True,
155133
format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{message}</level>"
156134
)
157-
158-
# 记录处理器 ID(存储为列表)
135+
159136
_logger_handlers[logger_name]= [file_handler_id,console_handler_id]
160137

161138
iflogger_name=='wiseflow_info_scraper':
@@ -167,54 +144,44 @@ def get_logger(logger_file_path: str, logger_name: str):
167144
print(f"{BLUE}with enhanced by NoDriver (https://github.com/ultrafunkamsterdam/nodriver)")
168145
print(f"{MAGENTA}2025-06-30{RESET}")
169146
print(f"{CYAN}{'#'*50}{RESET}\n")
170-
171-
# 返回绑定了名称的 logger 实例
147+
172148
returnlogger.bind(name=logger_name)
173149

150+
174151
classRecorder(BaseModel):
175-
# source status
176152
rss_source:int=0
177153
web_source:int=0
178-
mc_count:dict[str,int]={}
179-
item_source:dict[str,int]={}
154+
mc_count:dict[str,int]=Field(default_factory=dict)
155+
item_source:dict[str,int]=Field(default_factory=dict)
180156

181-
# to do list
182-
url_queue:set[str]=set()
183-
article_queue:list[str]= []
157+
url_queue:set[str]=Field(default_factory=set)
158+
article_queue:list[object]=Field(default_factory=list)# CrawlerResult objects
184159

185-
# working status
186160
total_processed:int=0
187161
crawl_failed:int=0
188162
scrap_failed:int=0
189163
successed:int=0
190164
info_added:int=0
191165

192-
# general
193166
focus_id:str=""
194167
max_urls_per_task:int=0
195-
processed_urls:set[str]=set()
168+
processed_urls:set[str]=Field(default_factory=set)
196169

197170
deffinished(self)->bool:
198-
ifnotself.url_queueandnotself.article_queue:
199-
returnTrue
200-
ifself.total_processed>=self.max_urls_per_task:
201-
returnTrue
202-
returnFalse
203-
171+
return (notself.url_queueandnotself.article_queue)or (
172+
self.total_processed>=self.max_urls_per_task
173+
)
174+
204175
defadd_url(self,url:str|set[str],source:str):
205176
ifisinstance(url,str):
206177
ifurlinself.processed_urls:
207178
return
208179
self.url_queue.add(url)
209-
ifsourcenotinself.item_source:
210-
self.item_source[source]=0
211-
self.item_source[source]+=1
180+
self.item_source[source]=self.item_source.get(source,0)+1
212181
elifisinstance(url,set):
213182
more_urls=url-self.processed_urls
214183
self.url_queue.update(more_urls)
215-
ifsourcenotinself.item_source:
216-
self.item_source[source]=0
217-
self.item_source[source]+=len(more_urls)
184+
self.item_source[source]=self.item_source.get(source,0)+len(more_urls)
218185

219186
defsource_summary(self)->str:
220187
from_str=f"From"
@@ -225,13 +192,14 @@ def source_summary(self) -> str:
225192
ifself.mc_count:
226193
forsource,countinself.mc_count.items():
227194
from_str+=f"\n-{source} :{count} Videos/Notes"
228-
195+
229196
url_str=f"Found Total:{len(self.url_queue)+len(self.article_queue)} items worth to explore (after existings filtered)"
230197
forsource,countinself.item_source.items():
231198
url_str+=f"\n- from{source} :{count}"
232-
233-
self.processed_urls.update({article.urlforarticleinself.article_queue})
234-
199+
200+
# Fixed: correctly update with generator expression
201+
self.processed_urls.update(article.urlforarticleinself.article_queue)
202+
235203
return"\n".join([f"=== Focus:{self.focus_id:.10}... Source Finding Summary ===",from_str,url_str])
236204

237205
defscrap_summary(self)->str:
@@ -250,4 +218,3 @@ def scrap_summary(self) -> str:
250218
else:
251219
proce_status_str+=f"\nHowever we have to quit by config setting limit."
252220
returnproce_status_str
253-

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp