User talk:Dalba

From Meta, a Wikimedia project coordination wiki

JAMA Pages

The JAMA pages are currently grabbing website links but not adding the title as seen here.

It would be nice if the bot could cite the journal article from the link, but I understand if it is a pain to code. --Cs california (talk) 08:05, 29 December 2023 (UTC)[reply]

Thanks for taking the time to report the issue. It appears that the website employs a rate-limiting algorithm that temporarily blocks Toolforge's IP address when an excessive number of requests are made within a brief period. To address this, Citer has been recently configured to return a general citation template, instead of displaying an error message, in such cases. This allows users to manually fill in or add missing parameters, including the title. While ideally a warning would be displayed to inform users about the issue, I haven't been able to carve out the time to implement it. As a workaround, you can pass Citer the DOI of the article instead of its URL, which not only bypasses the rate-limiting issue, but usually results in more accurate citations. Dalba 16:20, 29 December 2023 (UTC)[reply]
This has started to work again. I'm not sure if it is due to recent changes in citer or something temporary. Dalba 16:50, 23 February 2024 (UTC)[reply]

Highlighting words on the target page

Hi again. If I enter this address to the citer:

https://books.google.com/books?id=GVIEAAAAMBAJ&dq=oppenheimer+%22if+the+radiance+of+a+thousand+suns+were+to+%22&pg=PA133

it results in a citation without the highlighting instructions:

https://books.google.com/books?id=GVIEAAAAMBAJ&pg=PA133

Of course, I can add it back in, but would there be any way of having the citer retain the highlighting? Swood100 (talk) 17:49, 29 August 2023 (UTC)[reply]

Hi! Actually, retaining the search highlights was the way Citer originally used to work, but then some users asked for the highlights to be removed automatically, arguing that the highlights may be irrelevant/distracting. I can't remember if Google provided the "Clear search" link at the time, but I can see it is there now, so it should be easy for users to remove the highlights themselves. I'm OK with this change, just not sure how other editors will receive it. Dalba 11:07, 1 September 2023 (UTC)[reply]
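For illustration: the only difference between the two addresses above is the dq query parameter. A minimal sketch of making its removal optional follows. The helper name cleanGoogleBooksUrl and the keepHighlight flag are hypothetical and are not part of Citer.

```javascript
// Hypothetical helper, not Citer's actual code.
// Removes the "dq" search parameter from a Google Books address unless
// the caller asks to keep the highlight.
function cleanGoogleBooksUrl(address, keepHighlight = false) {
  if (keepHighlight) {
    return address; // leave the address exactly as entered
  }
  const url = new URL(address);
  // In the example address above, "dq" holds the searched phrase.
  url.searchParams.delete('dq');
  return url.toString();
}
```

With the default flag the function returns the shorter address shown above; with keepHighlight set to true it returns the input unchanged.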

An unknown error occurred

Hi again. The following address:

https://www.isu.org/inside-isu/rules-regulations/isu-statutes-constitution-regulations-technical/29326-constitution-general-regulations-2022/file

results in the message: "An unknown error occurred." Swood100 (talk) 19:03, 10 September 2023 (UTC)[reply]

Related to the above is this address:

https://www.isu.org/inside-isu/rules-regulations/isu-statutes-constitution-regulations-technical

In the result, the website= parameter is "-", as is the first part of the ref name. Also, two words in the title are run together. Continuing to love the citer! Swood100 (talk) 19:49, 10 September 2023 (UTC)[reply]

Fixed Dalba 07:33, 15 September 2023 (UTC)[reply]

In case it helps you find the cause of the problem, I just had the same issue with this address:

https://stillmed.olympics.com/media/Documents/News/2023/03/Participation-for-Individual-Neutral-Athletes-Personnel-with-a-Russian-or-Belarusian-Passport.pdf

Best regards Swood100 (talk) 00:34, 13 September 2023 (UTC)[reply]

And also with this address:

https://s3.documentcloud.org/documents/23387965/bills-117hr7776eas-rcp117-70.pdf

Are all pdf files broken at this point? Swood100 (talk) 19:28, 13 September 2023 (UTC)[reply]

Hi there. Sorry for the late response. Yes. Citer does not currently support PDF files. They are hard to process and the results are usually very unreliable. But at the very least I have to improve the error message. Dalba 11:55, 14 September 2023 (UTC)[reply]
OK, no problem. How about the following as a message:
Citer does not yet support PDFs but here’s a generic stub you can begin with: <ref name="">{{cite web | title= | url=[address entered] | access-date=[today’s date]| date= | website=  }}</ref>
By the way, I also got that message when this address was entered:
https://packaged-media.redd.it/phb67w58bvl61/pb/m2-res_480p.mp4?m=DASHPlaylist.mpd&v=1&e=1694710800&s=d44b3be14fe4bcdfc10a04b918fb1e226914557e#t=0
which is a video clip. Also not supported? (I'm not complaining—the Citer is a godsend!) Swood100 (talk) 18:53, 14 September 2023 (UTC)[reply]
OK, Citer now generates a general template for these URLs instead of returning an error. But one should note that the {{sfn}} and ref name are almost always useless in such cases and will need to be modified manually if used. Dalba 07:42, 15 September 2023 (UTC)[reply]
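A minimal sketch of the kind of stub proposed above for unsupported formats. The function name genericCiteWeb is hypothetical, and the exact template Citer emits may differ; this only mirrors the stub suggested in this thread.

```javascript
// Hypothetical helper, not Citer's actual code.
// Builds an incomplete {{cite web}} reference for a URL whose content
// could not be parsed (for example a PDF or a video file).
function genericCiteWeb(url, accessDate) {
  // title, date, and website are left empty for the editor to fill in;
  // the ref name is also left empty because it cannot be derived reliably.
  return `<ref name="">{{cite web | title= | url=${url} | access-date=${accessDate} | date= | website= }}</ref>`;
}
```

The caller supplies the access date as already-formatted text, so the sketch makes no assumption about the date format in use.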
Terrific! That will be very helpful. But allow me to suggest that you accompany this result with a clear statement that the result you are supplying is incomplete. As is, it is not explained to the user that this result is in a different category from the others, in that PDF files are not supported yet. As it is, the user might be left with the impression that, for some unknown reason, some of the results they get back are unaccountably incomplete. I would put, as the first line in the result box, something like this:
***ATTENTION*** You supplied a citation source that was a PDF file, but the Citer does not yet support these. However, as a convenience we have constructed a template for you to begin with as you create your own. Please examine it and supply the missing parameters.
If you could display this in red that would be even better. The trouble is that after they supply the necessary information and click Copy, they also get the warning text, and if you don't give them the warning text they never read it. Here's an alternative: put the warning text in red on the screen (and perhaps beep at them). Then, beep at them each time they click "Copy" until they click the box titled "I understand" that is next to the warning. It's a lot of work, but you don't want to leave it as it is. Swood100 (talk) 17:30, 15 September 2023 (UTC)[reply]
I think I would put the message right inside and at the top of the result box, and beep. Then if they click "Copy," beep again and copy a null instead (to clear the previous contents) and give them a message that after completing the citation themselves they must manually copy it. A special procedure for a special case. This way, you retain your reputation for reliability. Swood100 (talk) 17:52, 15 September 2023 (UTC)[reply]
Since this applies to the video as well, maybe you have to use a more generic message. So instead of "source that was a PDF file" you might say "source that was in a format not yet supported by Citer." Swood100 (talk) 18:03, 15 September 2023 (UTC)[reply]
Thanks for all the suggestions! Delivering a warning requires quite a few changes in the code base. Not sure when I'll find the time to implement it, but I have that in mind. Dalba 04:20, 16 September 2023 (UTC)[reply]
OK, all these refs seem to be missing is a title, and that is pointed out when they do a Wikipedia preview. But the addition is a tremendous help. Thanks! Swood100 (talk) 19:09, 16 September 2023 (UTC)[reply]

HTTPError

My assumption is that you would rather hear about issues than not. The changes you made to present PDF citations in partial form have been a terrific help. I just need to add the title and the author. However, the following URL

https://www.icj-cij.org/public/files/case-related/182/182-20220316-ORD-01-00-EN.pdf

produces: HTTPError

How popular is Citer? Do you keep track of how many uses per day it is getting? Best regards. Swood100 (talk) 19:28, 10 December 2023 (UTC)[reply]

I do, thank you. I wish I had more time to work on parsing PDF files. It might be possible to extract more information from them; I'm just concerned about the performance. Anyway, the problem with this particular URL is that it is behind some Cloudflare restriction mechanism. I'm not actually sure why, but I cannot download the file from the command line either:
 
$ wget https://www.icj-cij.org/public/files/case-related/182/182-20220316-ORD-01-00-EN.pdf
--2023-12-15 14:58:52--  https://www.icj-cij.org/public/files/case-related/182/182-20220316-ORD-01-00-EN.pdf
Resolving www.icj-cij.org (www.icj-cij.org)... 104.22.41.99, 172.67.26.159, 104.22.40.99, ...
Connecting to www.icj-cij.org (www.icj-cij.org)|104.22.41.99|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-12-15 14:58:52 ERROR 403: Forbidden.
Citer cannot access the URL over HTTP, hence the HTTPError. I guess the result could be improved by returning a partial cite web template instead, but it may take a while before I can get to it.
Regarding popularity, I really don't know, and I regularly clear the limited logs that toolforge provides. But since you asked, I just looked, and for the past 6 hours there have been around 324 requests processed. Not sure how many of them are unique though; the logs are anonymized.
Dalba 15:23, 15 December 2023 (UTC)[reply]

HTTPStatusError

Hi again,

This link:

https://www.jpeds.com/article/S0022-3476(22)00185-8/fulltext

produces the above error, though supplying the DOI listed on that page works fine:

doi.org/10.1016/j.jpeds.2022.03.005

Best regards, Swood100 (talk) 15:20, 27 December 2023 (UTC)[reply]

Unfortunately the website has blocked toolforge's IP address. :( Dalba 09:53, 28 December 2023 (UTC)[reply]
Seems to be Fixed using curl-impersonate. Dalba 07:27, 25 February 2024 (UTC)[reply]

ConnectError

Hi again, when I ran this URL I got the above message:

https://web.archive.org/web/20161105162350/https:/thejungsoul.com/guidance-for-parents-of-teens-with-rapid-onset-gender-dysphoria/

However, when I switched at random to this different saved version it worked fine:

https://web.archive.org/web/20171106084816/http://thejungsoul.com/guidance-for-parents-of-teens-with-rapid-onset-gender-dysphoria

I see what the problem is. In the first one the second https: is only followed by a single '/' instead of two. Looks like a screwball error from the page I got this URL from, because I got another URL from that page:

https://web.archive.org/web/20161209083621/http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16

This one also has a single '/' after the http: but it results in a ref that retains the error in two locations:

<ref name="Arnold 2016 i520">{{cite web | last=Arnold | first=James | title=The Weekly Digest: 8-24-16 | website=web.archive.org | date=24 August 2016 | url=http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16 | archive-url=https://web.archive.org/web/20161209083621/http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16 | archive-date=9 December 2016 | url-status=dead | access-date=29 December 2023}}</ref>

This results in a "{{cite web}}: Check |url= value (help)" red error message, a reference to this page, and a tooltip when I hover over the link:

Arnold, James (24 August 2016). "The Weekly Digest: 8-24-16". web.archive.org. Archived from [http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16 the original] on 9 December 2016. Retrieved 29 December 2023. {{cite web}}: Check |url= value (help)

When I add another '/' to the http: in the "url" param in the produced ref the error goes away. I suppose it is asking too much for Citer to correct errors in the URLs it is supplied.

Swood100 (talk) 04:28, 29 December 2023 (UTC)[reply]

Hi there! For me, none of the URLs work. I believe this is another case of toolforge's IP address being blocked by a third party server. Unfortunately, there is not much I can do in these cases. There might be some workarounds, but it will take me a while to implement and test. Dalba 04:07, 31 December 2023 (UTC)[reply]
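The single-slash problem described in this thread could in principle be repaired mechanically before a URL is used. A minimal sketch follows; the helper name repairSchemeSlashes is hypothetical, and nothing in this thread says Citer does this.

```javascript
// Hypothetical helper, not Citer's actual code.
// Turns "http:/host" or "https:/host" into "http://host" / "https://host"
// wherever the scheme is followed by exactly one slash.
function repairSchemeSlashes(address) {
  // (?!\/) ensures that an already-correct "://" is left untouched.
  return address.replace(/(https?:)\/(?!\/)/g, '$1//');
}
```

Applied to the archived address above, this changes only the inner "http:/adflegal.org" part and leaves the outer "https://web.archive.org" prefix as it is. It does not help when the remote server refuses the connection, which is the cause suggested in the reply above.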

HTTPStatusError

Hi again,

This URL:

https://www.reuters.com/world/middle-east/iraq-pays-last-chunk-524-billion-gulf-war-reparations-un-2022-02-09/

Results in the above error. Another website blocking toolforge's IP address? Why do they do that? Is it always rate-limiting? Best regards, Swood100 (talk) 20:25, 6 January 2024 (UTC)[reply]

Hi. Yes, reuters.com has blocked the IP address of toolforge. It's completely blocked as far as I can tell; no rate limiting here. I can only guess, but I believe that after the recent OpenAI and New York Times confrontation, websites have become more stringent about who can access their contents. Toolforge, being the host of several citation-generating tools, sends more requests than usual, and therefore websites have started blocking its IP address. Dalba 08:17, 12 January 2024 (UTC)[reply]
This seems to be Fixed now that citer is using curl-impersonate. Dalba 07:25, 25 February 2024 (UTC)[reply]

Allowing citer requests from en.wikipedia.org

Hi Dalba, I'm writing a citation script for myself on en.wikipedia.org and encountered a CORS error when trying to use citer.toolforge.org. Would it be possible to enable CORS by setting the "Access-Control-Allow-Origin" header appropriately on the citer web server? This page has more information. Your tool is awesome, by the way. Thanks. Daniel Quinlan (talk) 08:40, 6 February 2024 (UTC)[reply]

Hi there! Done. Just note that since I'm not maintaining a stable API yet, the response format might change in the future without any deprecation period. (I have had some thoughts about using Citoid response format, but it's unlikely I'll be able to implement it anytime soon.) Dalba 17:27, 6 February 2024 (UTC)[reply]
Thank you so much! One thing that might help scripts a bit would be adding a parameter to get a raw text response (if you have to choose, just the latter format). I haven't really used Citoid because it doesn't seem to extract enough information to make it worthwhile. Daniel Quinlan (talk) 13:43, 7 February 2024 (UTC)[reply]
Not sure how you are using it right now, but if you send a POST request instead of a GET request and send the user_input in the body of the request, then citer will return a json response which I guess might be more easily digestible by scripts. Something like await (await fetch('https://citer.toolforge.org/', {'method': 'POST', 'body': 'https://example.com/somepath.html' })).json() should work. Dalba 07:39, 8 February 2024 (UTC)[reply]
I've barely started, but I was doing a GET request and parsing the document. JSON is so much better. For easier updates in the future, you might consider returning a JSON dictionary with named keys like "sfn", "cite", and "ref-name". Also, can the date format be included in the POST request? Thanks again. Daniel Quinlan (talk) 13:49, 8 February 2024 (UTC)[reply]
All parameters of a GET request also work on a POST request if they remain in the URL. The only difference between GET and POST is that `user_input` value should be the body and not in the URL. My previous example with a date_format parameter would become: await (await fetch('https://citer.toolforge.org/?date_format=%Y-%m-%d', {'method': 'POST', 'body': 'https://citer.toolforge.org/' })).json(). You are right about returning a dictionary, it's more flexible and easier to understand. I will probably change it in the future. Dalba 14:06, 8 February 2024 (UTC)[reply]
Thanks! Daniel Quinlan (talk) 07:28, 9 February 2024 (UTC)[reply]
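A minimal sketch of the request shape described in the messages above: user_input goes in the body and other parameters stay in the URL. The helper name buildCiterRequest is hypothetical. The API is described above as not stable, and a later thread on this page reports that the POST format changed, so treat this as a record of the format on 8 February 2024 only.

```javascript
// Hypothetical helper that only builds the request; it does not send it.
function buildCiterRequest(userInput, dateFormat) {
  const endpoint = new URL('https://citer.toolforge.org/');
  if (dateFormat) {
    // searchParams.set percent-encodes "%" as "%25", so "%Y-%m-%d"
    // reaches the server as a correctly encoded query value.
    endpoint.searchParams.set('date_format', dateFormat);
  }
  return {
    url: endpoint.toString(),
    options: { method: 'POST', body: userInput },
  };
}

// Possible usage in a script (performs a network call, so not run here):
//   const { url, options } = buildCiterRequest('https://example.com/somepath.html', '%Y-%m-%d');
//   const result = await (await fetch(url, options)).json();
```

Building the URL with URLSearchParams avoids hand-writing percent signs in the query string.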

Citing via archive links

Hello again Dalba. I've been having some issues trying to use citer with archive.org links. It is frequently returning a 500 code with "ConnectError" in the JSON almost immediately. archive.org can be exceptionally slow retrieving archives; it often takes 15 to 30 seconds and sometimes probably even more than that. It's also possible citer is just being rate limited by archive.org and my limited testing might be enough to drive it from bad to worse. Any ideas?

I've also tried using archive.today links like https://archive.today/N3fQ (they also use archive.is and archive.ph, and probably a few more aliases) and that always seems to result in a ReadTimeout error from citer. Would it be possible to support archive.today archive links?

By the way, I did reach out to archive.org to request that they enable CORS for *.wikipedia.org. If they do that, it's possible that clients could make the request to archive.org and then POST the archive link and the entire web page result to citer for data extraction. That might help if rate limits are the issue. Anyhow, I'll let you know if my request goes anywhere. Regards. Daniel Quinlan (talk) 07:13, 13 February 2024 (UTC)[reply]

The more I look at it, the more archive.today is starting to look like a good addition for dead links. They do comment out scripts including application/ld+json, but that's easy to work around. I'm not sure how aggressive the server is about blocking non-interactive clients, but the maintainer has been willing to whitelist IP addresses in the past. Daniel Quinlan (talk) 18:48, 13 February 2024 (UTC)[reply]
Hi!
  • archive.org: I currently cannot reproduce this. It's probably a rate limit. Citer is set to wait for 10 seconds before aborting the request, so if you are getting the response immediately then it is not a timeout; perhaps the server has declined the request sooner, or there is some other issue. There might be some clues in the logs; I might need to dig into them. Let me know if they enable CORS for Wikipedia, and I'll implement a way to submit HTML content to citer.
  • archive.today: I would love to add support, but apparently the server does not reply to toolforge requests, no matter the timeout. Here is the verbose output of a curl call:
:$ time curl -I https://archive.today/N3fQ --connect-timeout 300 -v
:*   Trying 51.38.69.52...
:* TCP_NODELAY set
:* Connected to archive.today (51.38.69.52) port 443 (#0)
:* ALPN, offering h2
:* ALPN, offering http/1.1
:* successfully set certificate verify locations:
:*   CAfile: none
:  CApath: /etc/ssl/certs
:* TLSv1.3 (OUT), TLS handshake, Client hello (1):
:* TLSv1.3 (IN), TLS handshake, Server hello (2):
:* TLSv1.2 (IN), TLS handshake, Certificate (11):
:* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
:* TLSv1.2 (IN), TLS handshake, Server finished (14):
:* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
:* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
:* TLSv1.2 (OUT), TLS handshake, Finished (20):
:* TLSv1.2 (IN), TLS handshake, Finished (20):
:* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES256-GCM-SHA384
:* ALPN, server accepted to use h2
:* Server certificate:
:*  subject: CN=archive.today
:*  start date: Feb  4 02:20:57 2024 GMT
:*  expire date: May  4 02:20:56 2024 GMT
:*  subjectAltName: host "archive.today" matched cert's "archive.today"
:*  issuer: C=US; O=Let's Encrypt; CN=R3
:*  SSL certificate verify ok.
:* Using HTTP2, server supports multi-use
:* Connection state changed (HTTP/2 confirmed)
:* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
:* Using Stream ID: 1 (easy handle 0x5645340fd110)
:> HEAD /N3fQ HTTP/2
:> Host: archive.today
:> User-Agent: curl/7.64.0
:> Accept: */*
:>
:* TLSv1.2 (IN), TLS alert, close notify (256):
:* Empty reply from server
:* Connection #0 to host archive.today left intact
:curl: (52) Empty reply from server
:real    1m0.404s
:user    0m0.030s
:sys     0m0.009s
:
Copying `User-Agent` and other headers from browser did not help either. I suspect they have blacklisted toolforge. Dalba 06:26, 22 February 2024 (UTC)[reply]
I suspect archive.today has done something to block non-interactive requests. It might be necessary to use something like Selenium. As an alternative, would it be possible for Citer to support submitting the web page content in a POST request along with the original link and the archive link (if the content is from an archive server)? That would help with sites blocking tools like curl and it might help with rate limits and timeouts too.
Also, archive.today responded positively to two of my requests: CORS requests now work and they also added back some <meta> tags as <old-meta>. The application/ld+json data is available as well (it's commented out, but easy to extract). Daniel Quinlan (talk) 07:00, 22 February 2024 (UTC)[reply]
They are using SSL handshake fingerprinting to detect non-browser requests. I was able to access the website using https://github.com/lwthiker/curl-impersonate . I might be able to embed that into citer; it just might take me some time.
The POST request idea is also possible and I do plan to implement it. Dalba 13:11, 22 February 2024 (UTC)[reply]
OK, archive.today URLs are now expected to work (not tested thoroughly though).
Also, you can now submit HTML using a POST request. In order to implement this I had to change the POST submit format. Now all parameters should be submitted within the body of the request in JSON format. To submit HTML, "input_type" should be set to "html" and "user_input" should be an object containing two keys: {"html": "<HTML string of the page>", "url": "<URL>"}. Dalba 17:00, 23 February 2024 (UTC)[reply]
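A minimal sketch of a request body in the JSON format described in the message above. The helper name buildHtmlSubmission is hypothetical. Whether a Content-Type header is required is not stated in this thread, so none is set here.

```javascript
// Hypothetical helper that only builds fetch options; it does not send anything.
function buildHtmlSubmission(pageUrl, pageHtml) {
  const payload = {
    input_type: 'html',
    // user_input carries both the page source and the address it came from.
    user_input: { html: pageHtml, url: pageUrl },
  };
  return { method: 'POST', body: JSON.stringify(payload) };
}

// Possible usage in a script (performs a network call, so not run here):
//   const options = buildHtmlSubmission(location.href, document.documentElement.outerHTML);
//   const result = await (await fetch('https://citer.toolforge.org/', options)).json();
```

Any other parameters would go into the same JSON object, since the message above says all parameters are now submitted in the body.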

RequestsError

Hi again,

I copied a DOI address from a web page. It was split over two lines which resulted in a space being placed in the middle:

https://doi.org/10.1371/%20journal.pgph.0000245

This resulted in Citer returning the message: "RequestsError". When I removed the '%20' from the string I got the right result. If it is true that a space is never appropriate in the middle of a DOI string, then stripping any such spaces before running the query might result in more satisfied and less confused users (or in the alternative, substitute the message, "You did not enter a valid DOI. Please check your source."). Swood100 (talk) 20:45, 15 March 2024 (UTC)[reply]

Hi. Thanks for the suggestion. I had to refer to the DOI Handbook to see whether a space is a valid character. According to section 3.2.1 GENERAL CHARACTERISTICS OF THE DOI SYNTAX: "The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters of Unicode." Apparently, a space is considered both a graphic character and a printable character. That being said, I have not seen any DOI containing a space character.
Currently citer does not consider a space a valid DOI character, but https://doi.org/10.1371/%20journal.pgph.0000245 is still a valid URL, so citer tries to connect to its server and fails with RequestsError because the server responds with a 404 error code.
It is possible to add a separate input type for DOIs. That way citer would not mistake a DOI for a URL. However, I believe a separate input type would be a little less convenient for users. For now I'm going to leave citer as it is, but I might reconsider if other users report similar issues. Dalba 08:33, 22 March 2024 (UTC)[reply]
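For reference, the clean-up suggested at the start of this thread could look roughly like the sketch below. The helper name cleanDoiUrl is hypothetical, and this is not how Citer behaves. As noted above, the DOI Handbook does not strictly forbid spaces, so this would break any DOI that really contains one.

```javascript
// Hypothetical helper, not Citer's actual code.
// Removes literal whitespace and percent-encoded spaces from the DOI part
// of a https://doi.org/ address; any other address is returned unchanged.
function cleanDoiUrl(address) {
  const prefix = 'https://doi.org/';
  if (!address.startsWith(prefix)) {
    return address;
  }
  const doi = address.slice(prefix.length).replace(/%20|\s+/g, '');
  return prefix + doi;
}
```

Restricting the clean-up to doi.org addresses keeps it from touching ordinary URLs, where %20 can be meaningful.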