User talk:Dalba

Kew POWO citations format

For Kew Plants of the World Online citations, can the format be changed from this:

<ref name="Plants of the World Online k345">{{cite web | title=Melocactus estevesii P.J.Braun | website=Plants of the World Online | url=https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:938363-1 | access-date=2024-04-29}}</ref>

to this:

<ref name="Plants of the World Online k345">{{BioRef|powo | title=''Melocactus estevesii'' P.J.Braun | id=938363-1 | access-date=2024-04-29}}</ref>

One of the users complained on my talk page about the cites. -Cs california (talk) 05:34, 4 May 2024 (UTC)

HTTPError

My assumption is that you would rather hear about issues than not. The changes you made to present PDF citations in partial form have been a terrific help. I just need to add the title and the author. However, the following URL

https://www.icj-cij.org/public/files/case-related/182/182-20220316-ORD-01-00-EN.pdf

produces: HTTPError

How popular is Citer? Do you keep track of how many uses per day it is getting? Best regards. Swood100 (talk) 19:28, 10 December 2023 (UTC)

I do, thank you. I wish I had more time to work on parsing PDF files. It might be possible to extract more information from them; I'm just concerned about performance. Anyway, the problem with this particular URL is that it is behind some Cloudflare restriction mechanism. I'm not sure why, but I cannot download the file from the command line either:
 
$ wget https://www.icj-cij.org/public/files/case-related/182/182-20220316-ORD-01-00-EN.pdf
--2023-12-15 14:58:52--  https://www.icj-cij.org/public/files/case-related/182/182-20220316-ORD-01-00-EN.pdf
Resolving www.icj-cij.org (www.icj-cij.org)... 104.22.41.99, 172.67.26.159, 104.22.40.99, ...
Connecting to www.icj-cij.org (www.icj-cij.org)|104.22.41.99|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-12-15 14:58:52 ERROR 403: Forbidden.
Citer cannot access the URL over HTTP, hence the HTTPError. I guess the result could be improved by returning a partial cite web template instead, but it may take a while before I can get to it.
Regarding popularity, I really don't know and I regularly clear the limited logs that toolforge provides. But since you asked, I just looked, and for the past 6 hours there have been around 324 requests processed. Not sure how many of them are unique though; the logs are anonymized.
Dalba 15:23, 15 December 2023 (UTC)
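
Note for script authors: the "partial cite web template" idea mentioned above could look roughly like the sketch below. This is not Citer's code. The function name `partialCiteWeb`, the empty `title=` placeholder, and the choice to percent-encode pipes are all assumptions made for illustration.

```javascript
// Hypothetical helper, not part of Citer: builds a partial {{cite web}}
// template when the page itself cannot be fetched (e.g. 403 Forbidden).
function partialCiteWeb(url, accessDate) {
  // A raw "|" inside a template parameter would be read as a parameter
  // separator, so percent-encode it (assumption: the URL is otherwise usable).
  const safeUrl = url.replace(/\|/g, '%7C');
  // The title is left empty for the editor to fill in by hand.
  return `{{cite web | title= | url=${safeUrl} | access-date=${accessDate}}}`;
}
```

Called with the ICJ URL from this thread and an access date, it returns a template that only needs the title and author added by hand.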

HTTPStatusError

Hi again,

This link:

https://www.jpeds.com/article/S0022-3476(22)00185-8/fulltext

produces the above error, though supplying the DOI listed on that page works fine:

doi.org/10.1016/j.jpeds.2022.03.005

Best regards, Swood100 (talk) 15:20, 27 December 2023 (UTC)

Unfortunately the website has blocked toolforge's IP address. :( Dalba 09:53, 28 December 2023 (UTC)
Seems to be Fixed using curl-impersonate. Dalba 07:27, 25 February 2024 (UTC)

ConnectError

Hi again, when I ran this URL I got the above message:

https://web.archive.org/web/20161105162350/https:/thejungsoul.com/guidance-for-parents-of-teens-with-rapid-onset-gender-dysphoria/

However, when I switched at random to this different saved version it worked fine:

https://web.archive.org/web/20171106084816/http://thejungsoul.com/guidance-for-parents-of-teens-with-rapid-onset-gender-dysphoria

I see what the problem is. In the first one the second https: is only followed by a single '/' instead of two. Looks like a screwball error from the page I got this URL from, because I got another URL from that page:

https://web.archive.org/web/20161209083621/http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16

This one also has a single '/' after the http: but it results in a ref that retains the error in two locations:

<ref name="Arnold 2016 i520">{{cite web | last=Arnold | first=James | title=The Weekly Digest: 8-24-16 | website=web.archive.org | date=24 August 2016 | url=http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16 | archive-url=https://web.archive.org/web/20161209083621/http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16 | archive-date=9 December 2016 | url-status=dead | access-date=29 December 2023}}</ref>

This results in a "{{cite web}}: Check |url= value (help)" red error message, a reference to this page, and a tooltip when I hover over the link:

Arnold, James (24 August 2016). "The Weekly Digest: 8-24-16". web.archive.org. Archived from [http:/adflegal.org/detailspages/blog-details/allianceedge/2016/08/24/the-weekly-digest-8-24-16 the original] on 9 December 2016. Retrieved 29 December 2023. {{cite web}}: Check |url= value (help)

When I add another '/' to the http: in the "url" param in the produced ref the error goes away. I suppose it is asking too much for Citer to correct errors in the URLs it is supplied.

Swood100 (talk) 04:28, 29 December 2023 (UTC)

Hi there! For me, none of the URLs work. I believe this is another case of toolforge's IP address being blocked by a third-party server. Unfortunately, there is not much I can do in these cases. There might be some workarounds, but they will take me a while to implement and test. Dalba 04:07, 31 December 2023 (UTC)
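
Note for script authors: the single-slash problem described in this thread can be repaired before a URL is submitted. The sketch below is a hypothetical client-side helper (`fixSingleSlashScheme` is not a Citer function). It only addresses the malformed `http:/` form and does nothing about servers that block toolforge.

```javascript
// Hypothetical helper: turns "http:/host" or "https:/host" (one slash) into
// the two-slash form, including a URL embedded inside an archive link.
function fixSingleSlashScheme(url) {
  // Match the scheme followed by exactly one slash; the negative lookahead
  // leaves already-correct "http://" and "https://" untouched.
  return url.replace(/(https?:)\/(?!\/)/g, '$1//');
}
```

Applied to the adflegal.org archive link above, it leaves the outer `https://web.archive.org` part alone and restores the missing slash in the embedded `http:/adflegal.org` part.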

HTTPStatusError

Hi again,

This URL:

https://www.reuters.com/world/middle-east/iraq-pays-last-chunk-524-billion-gulf-war-reparations-un-2022-02-09/

results in the above error. Another website blocking toolforge's IP address? Why do they do that? Is it always rate-limiting? Best regards, Swood100 (talk) 20:25, 6 January 2024 (UTC)

Hi. Yes, reuters.com has blocked the IP address of toolforge. It's completely blocked as far as I can tell, no rate limiting here. I can only guess, but I believe that after the recent OpenAI and New York Times confrontation, websites have become more stringent about who can access their content. Toolforge, being the host of several citation-generating tools, sends more requests than usual, and therefore websites have started blocking its IP address. Dalba 08:17, 12 January 2024 (UTC)
This seems to be Fixed now that citer is using curl-impersonate. Dalba 07:25, 25 February 2024 (UTC)

Allowing citer requests from en.wikipedia.org

Hi Dalba, I'm writing a citation script for myself on en.wikipedia.org and encountered a CORS error when trying to use citer.toolforge.org. Would it be possible to enable CORS by setting the "Access-Control-Allow-Origin" header appropriately on the citer web server? This page has more information. Your tool is awesome, by the way. Thanks. Daniel Quinlan (talk) 08:40, 6 February 2024 (UTC)

Hi there! Done. Just note that since I'm not maintaining a stable API yet, the response format might change in the future without any deprecation period. (I have had some thoughts about using Citoid response format, but it's unlikely I'll be able to implement it anytime soon.) Dalba 17:27, 6 February 2024 (UTC)
Thank you so much! One thing that might help scripts a bit would be adding a parameter to get a raw text response (if you have to choose, just the latter format). I haven't really used Citoid because it doesn't seem to extract enough information to make it worthwhile. Daniel Quinlan (talk) 13:43, 7 February 2024 (UTC)
Not sure how you are using it right now, but if you send a POST request instead of a GET request and send the `user_input` in the body of the request, then citer will return a JSON response, which I guess might be more easily digestible by scripts. Something like `await (await fetch('https://citer.toolforge.org/', {'method': 'POST', 'body': 'https://example.com/somepath.html' })).json()` should work. Dalba 07:39, 8 February 2024 (UTC)
I've barely started, but I was doing a GET request and parsing the document. JSON is so much better. For easier updates in the future, you might consider returning a JSON dictionary with named keys like "sfn", "cite", and "ref-name". Also, can the date format be included in the POST request? Thanks again. Daniel Quinlan (talk) 13:49, 8 February 2024 (UTC)
All parameters of a GET request also work on a POST request if they remain in the URL. The only difference between GET and POST is that the `user_input` value should be in the body and not in the URL. My previous example with a `date_format` parameter would become: `await (await fetch('https://citer.toolforge.org/?date_format=%Y-%m-%d', {'method': 'POST', 'body': 'https://citer.toolforge.org/' })).json()`. You are right about returning a dictionary; it's more flexible and easier to understand. I will probably change it in the future. Dalba 14:06, 8 February 2024 (UTC)
Thanks! Daniel Quinlan (talk) 07:28, 9 February 2024 (UTC)
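
Note for script authors: a minimal sketch of the request described in this thread, with parameters in the URL and `user_input` as the POST body. The helper names `buildCiterUrl` and `fetchCitation` are hypothetical. The sketch builds the query string with `URLSearchParams`, which percent-encodes the `%` in a date format as `%25`; whether Citer needs that encoding is not stated here, but the encoded form decodes to the same value. As noted above, the response format is not stable, and a later thread on this page reports that the POST format has since changed.

```javascript
const CITER_ENDPOINT = 'https://citer.toolforge.org/';

// Builds the request URL with date_format as a query parameter.
function buildCiterUrl(dateFormat) {
  const url = new URL(CITER_ENDPOINT);
  url.searchParams.set('date_format', dateFormat);
  return url.toString();
}

// Sends the user input as the raw POST body and parses the JSON reply.
async function fetchCitation(userInput, dateFormat) {
  const response = await fetch(buildCiterUrl(dateFormat), {
    method: 'POST',
    body: userInput,
  });
  if (!response.ok) {
    // Assumption: treat any non-2xx status as a failure in this sketch.
    throw new Error(`Citer responded with HTTP ${response.status}`);
  }
  return response.json();
}
```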

Citing via archive links

Hello again Dalba. I've been having some issues trying to use citer with archive.org links. It frequently returns a 500 code with "ConnectError" in the JSON almost immediately. archive.org can be exceptionally slow at retrieving archives: it often takes 15 to 30 seconds, and sometimes probably even longer. It's also possible citer is just being rate limited by archive.org and my limited testing might be enough to drive it from bad to worse. Any ideas?

I've also tried using archive.today links like https://archive.today/N3fQ (they also use archive.is and archive.ph, and probably a few more aliases) and that always seems to result in a ReadTimeout error from citer. Would it be possible to support archive.today archive links?

By the way, I did reach out to archive.org to request that they enable CORS for *.wikipedia.org. If they do that, it's possible that clients could make the request to archive.org and then POST the archive link and the entire web page result to citer for data extraction. That might help if rate limits are the issue. Anyhow, I'll let you know if my request goes anywhere. Regards. Daniel Quinlan (talk) 07:13, 13 February 2024 (UTC)

The more I look at it, the more archive.today is starting to look like a good addition for dead links. They do comment out scripts including application/ld+json, but that's easy to work around. I'm not sure how aggressive the server is about blocking non-interactive clients, but the maintainer has been willing to whitelist IP addresses in the past. Daniel Quinlan (talk) 18:48, 13 February 2024 (UTC)
Hi!
  • archive.org: I currently cannot reproduce this. It's probably a rate limit. Citer is set to wait for 10 seconds before aborting the request, so if you are getting the response immediately, it is not a timeout; perhaps the server has declined the request sooner, or there is some other issue. There might be some clues in the logs; I might need to dig into them. Let me know if they enable CORS for wikipedia, and I'll implement a way to submit HTML content to citer.
  • archive.today: I would love to add support, but apparently the server does not reply to toolforge requests, no matter the timeout. Here is the verbose output of a curl call:
$ time curl -I https://archive.today/N3fQ --connect-timeout 300 -v
*   Trying 51.38.69.52...
* TCP_NODELAY set
* Connected to archive.today (51.38.69.52) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=archive.today
*  start date: Feb  4 02:20:57 2024 GMT
*  expire date: May  4 02:20:56 2024 GMT
*  subjectAltName: host "archive.today" matched cert's "archive.today"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x5645340fd110)
> HEAD /N3fQ HTTP/2
> Host: archive.today
> User-Agent: curl/7.64.0
> Accept: */*
>
* TLSv1.2 (IN), TLS alert, close notify (256):
* Empty reply from server
* Connection #0 to host archive.today left intact
curl: (52) Empty reply from server
real    1m0.404s
user    0m0.030s
sys     0m0.009s
Copying `User-Agent` and other headers from a browser did not help either. I suspect they have blacklisted toolforge. Dalba 06:26, 22 February 2024 (UTC)
I suspect archive.today has done something to block non-interactive requests. It might be necessary to use something like Selenium. As an alternative, would it be possible for Citer to support submitting the web page content in a POST request along with the original link and the archive link (if the content is from an archive server)? That would help with sites blocking tools like curl and it might help with rate limits and timeouts too.
Also, archive.today responded positively to two of my requests: CORS requests now work and they also added back some <meta> tags as <old-meta>. The application/ld+json data is available as well (it's commented out, but easy to extract). Daniel Quinlan (talk) 07:00, 22 February 2024 (UTC)
They are using SSL handshake fingerprinting to detect non-browser requests. I was able to access the website using https://github.com/lwthiker/curl-impersonate . I might be able to embed that into citer, it just might take me some time.
The POST request idea is also possible and I do plan to implement it. Dalba 13:11, 22 February 2024 (UTC)
OK, archive.today URLs are now expected to work (not tested thoroughly though).
Also, you can now submit HTML using a POST request. In order to implement this I had to change the POST submit format. Now all parameters should be submitted within the body of the request in JSON format. To submit HTML, "input_type" should be set to "html" and "user_input" should be an object containing two keys: {"html": "<HTML string of the page>", "url": "<URL>"}. Dalba 17:00, 23 February 2024 (UTC)
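
Note for script authors: a minimal sketch of the JSON body described above for submitting a page's HTML. The helper names `buildHtmlSubmission` and `submitHtml` are hypothetical, and the `Content-Type` header is an assumption, since the thread does not say whether Citer requires it.

```javascript
// Builds the JSON body for an HTML submission: input_type "html" plus a
// user_input object holding the page HTML and its URL.
function buildHtmlSubmission(html, url) {
  return JSON.stringify({
    input_type: 'html',
    user_input: { html: html, url: url },
  });
}

// Posts the body to Citer and parses the JSON reply.
async function submitHtml(html, url) {
  const response = await fetch('https://citer.toolforge.org/', {
    method: 'POST',
    // Assumption: this header is not mentioned in the thread.
    headers: { 'Content-Type': 'application/json' },
    body: buildHtmlSubmission(html, url),
  });
  return response.json();
}
```

In a browser, a JSON content type makes the request non-simple under CORS, so a preflight request is sent first. If that causes trouble from a user script, dropping the header may be worth trying.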

RequestsError

Hi again,

I copied a DOI address from a web page. It was split over two lines which resulted in a space being placed in the middle:

https://doi.org/10.1371/%20journal.pgph.0000245

This resulted in Citer returning the message: "RequestsError". When I removed the '%20' from the string I got the right result. If it is true that a space is never appropriate in the middle of a DOI string, then stripping any such spaces before running the query might result in more satisfied and less confused users (or in the alternative, substitute the message, "You did not enter a valid DOI. Please check your source."). Swood100 (talk) 20:45, 15 March 2024 (UTC)

Hi. Thanks for the suggestion. I had to refer to the DOI Handbook to see whether a space is a valid character or not. According to section 3.2.1 GENERAL CHARACTERISTICS OF THE DOI SYNTAX: "The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters of Unicode." Apparently, a space is considered both a graphic character and a printable character. That being said, I have not seen any DOI containing the space character.
Currently citer does not consider the space a valid DOI character, but https://doi.org/10.1371/%20journal.pgph.0000245 is still a valid URL, so citer tries to connect to its server and fails with RequestsError because the server responds with a 404 error code.
It is possible to add a separate input type for DOIs. That way citer would not confuse a DOI for a URL. However, I believe a separate input type would be a little less convenient for users. For now I'm going to leave citer as it is but might reconsider if other users report similar issues. Dalba 08:33, 22 March 2024 (UTC)
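
Note for script authors: since citer is left as it is, the stray space can be removed on the client side before submitting. The sketch below is a hypothetical pre-processing step (`stripDoiWhitespace` is not a Citer function). It assumes, as suggested above, that no space is intended in the DOI, which the DOI Handbook wording quoted in this thread does not strictly guarantee.

```javascript
// Hypothetical helper: removes literal whitespace and "%20" from inputs that
// look like a doi.org URL or a bare DOI; other inputs are returned unchanged.
function stripDoiWhitespace(input) {
  const trimmed = input.trim();
  // Heuristic (assumption): an optional doi.org prefix followed by "10.",
  // four to nine digits, and a slash.
  const looksLikeDoi = /^(https?:\/\/(dx\.)?doi\.org\/)?10\.\d{4,9}\//i.test(trimmed);
  if (!looksLikeDoi) return input;
  return trimmed.replace(/%20|\s+/g, '');
}
```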

Twin ISSN generated by Citer in cite journal

In quite a few cases, Citer generates a twin ISSN in the form issn=<ISSN1>, <ISSN2> in the {{Cite journal}}. The magazines now routinely declare twin ISSNs, one for Internet, one for print. Is it possible to channel the second ISSN into eissn= ? Thank you in advance! Викидим (talk) 19:29, 23 April 2024 (UTC)

Could you provide an example input that has this issue? Dalba 05:14, 26 April 2024 (UTC)
For example, https://www.jstor.org/stable/1687467 produces "issn=00368075, 10959203", which does not work with cite templates. The first ISSN is print, the second is online. Викидим (talk) 18:19, 27 April 2024 (UTC)
Fixed. AFAICT, JSTOR does not provide any info about which ISSN is the electronic one, so I decided to ignore the second one and use the first as |issn=. Dalba 18:03, 2 May 2024 (UTC)
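
Note for script authors: the "use the first ISSN" decision can be mirrored when cleaning up older citations by hand or by script. This is a hypothetical sketch, not Citer's implementation. The name `firstIssn` and the hyphenated output form are assumptions.

```javascript
// Hypothetical helper: keeps the first ISSN from a comma-separated value
// and returns it in hyphenated form, or null if it is not a plausible ISSN.
function firstIssn(rawValue) {
  const first = rawValue.split(',')[0].trim().replace(/-/g, '');
  // An ISSN has eight characters: seven digits plus a check digit or "X".
  if (!/^\d{7}[\dXx]$/.test(first)) return null;
  return `${first.slice(0, 4)}-${first.slice(4).toUpperCase()}`;
}
```

With the JSTOR value quoted above, `firstIssn('00368075, 10959203')` yields `0036-8075`.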

DOI 10.1109/5992.805138

With input 10.1109/5992.805138, the result is unexpected: the submit button stays grayed out, and I have to close the window to continue. There is no result either. While at it, this is a truly great tool! Thank you! Викидим (talk) 18:25, 27 April 2024 (UTC)

Thank you! Should be fixed now. Dalba 17:58, 2 May 2024 (UTC)