Commit 232f627
Update general_utils.py (#385)
* Update general_utils.py

What I fixed (and why)

Correctness & crashes
• setDeep: now creates missing intermediate objects to prevent Cannot read property ... of undefined.
• pluckDeep: safe traversal; returns undefined instead of throwing when a path segment is missing.
• OPayError inheritance: switched to Object.create(Error.prototype) + constructor + captureStackTrace for proper error stacks and instanceof checks.
• afterResponse hook: guards against responses without body or code to avoid Cannot read property 'code' of undefined.
• Required-field detection in getClientBody: replaced brittle index math with input.endsWith('$').
• Type validation error message: kept but now guaranteed not to crash due to the safer setDeep/pluckDeep.
• generatePrivateKey: safer stringify (handles non-serializable data) to prevent HMAC generation crashes.

Data integrity (Pydantic model)
• Replaced all mutable defaults in Recorder (set, list, dict) with Field(default_factory=...) to avoid shared state across instances.
• article_queue content handling: stopped assuming objects with .url; it’s a list[str], so we now update processed_urls with the strings directly.

Logic & edge cases
• finished(): simplified boolean logic; exact same behavior, clearer.
• add_url(): uses dict.get(..., 0) to increment counters safely; dedup and processed-filtering preserved.
• is_chinese(): guards empty strings to avoid division by zero.
• isURL(): returns a proper boolean with a clearer check.
• extract_and_convert_dates(): early exit kept; first match wins across multiple formats as intended.

Logging
• Centralized handler management:
  • Removes existing handlers for the same logger_name before adding new ones (prevents duplicate logs).
  • Removes default console handler only on first creation.
• Scoped outputs via a filter that matches logger.bind(name=logger_name) to avoid cross-talk between loggers.

URL extraction & cleaning
• Normalizes www. → https://....
• Skips malformed URLs early (missing scheme/netloc).
• Strips tracking params using params_to_remove; rebuilds query with doseq=True to preserve list semantics.
• Returns a set of cleaned URLs (no duplicates).

Imports & typing
• Standard re module (no regex dependency) since patterns don’t need advanced features.
• Added/kept type hints for clarity; code works on 3.10+ (str | set[str]). (Can switch to Union[...] for <3.10 if you need.)

Developer-experience polish
• ANSI banner printing left intact but only for wiseflow_info_scraper.
• Log formatting unchanged, but levels clarified: file at INFO, console at DEBUG.

Signed-off-by: Nnaa <igwilohnnaa@gmail.com>

* Update general_utils.py

fix(recorder): align url extraction & processed_urls handling with wiseflow standards

- Reverted from `re` to `regex` for consistency across wiseflow project
- Corrected `article_queue` type to store CrawlerResult objects
- Fixed `processed_urls.update(...)` to use generator expression on `article.url`
- Maintains memory efficiency and correct data structures
- Preserves logging, url cleaning, and Chinese text detection behavior

Signed-off-by: Nnaa <igwilohnnaa@gmail.com>

---------

Signed-off-by: Nnaa <igwilohnnaa@gmail.com>

1 parent b256e37, commit 232f627
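The URL extraction & cleaning steps listed in the commit message can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the commit: the function name `clean_urls` and the contents of `PARAMS_TO_REMOVE` are assumptions, since the actual `params_to_remove` list is not shown here.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

# Hypothetical tracking parameters; the real params_to_remove list is not shown.
PARAMS_TO_REMOVE = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def clean_urls(candidates):
    """Return a set of cleaned URLs, skipping malformed ones."""
    cleaned = set()
    for raw in candidates:
        url = raw.strip()
        # Normalize bare "www." links to an explicit https scheme.
        if url.startswith("www."):
            url = "https://" + url
        parsed = urlparse(url)
        # Skip malformed URLs early: both scheme and netloc are required.
        if not parsed.scheme or not parsed.netloc:
            continue
        # parse_qs maps each key to a list of values; drop the tracking keys.
        query = {
            key: values
            for key, values in parse_qs(parsed.query, keep_blank_values=True).items()
            if key not in PARAMS_TO_REMOVE
        }
        # doseq=True expands list values back into repeated key=value pairs.
        new_query = urlencode(query, doseq=True)
        cleaned.add(urlunparse(parsed._replace(query=new_query)))
    return cleaned
```

Without `doseq=True`, `urlencode` would serialize a value such as `["1", "2"]` as the string form of the list instead of as `tag=1&tag=2`, which is why the rebuild step needs it whenever the query comes from `parse_qs`. Returning a set means two inputs that differ only in tracking parameters collapse into one entry.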
1 file changed: +60 -93 lines changed
0 commit comments