NCBI Connection: Difference between revisions

From Research Computing Center Wiki
Latest revision as of 13:45, 11 October 2017

Please contact us to request permission if you need to connect to NCBI from GACRC clusters. In your request, please specify the path to your scripts, the block of code that connects to NCBI, and the command used to test your script.

NCBI's main mission is to provide interactive web access. Fetching data directly from www.ncbi.nlm.nih.gov is blocked at GACRC.

To fetch data from NCBI (e.g. FASTA, FASTQ, SRA), a scripting API is available so that users can avoid intensive visits to NCBI web URLs. The API is called E-utilities, and you must use it for all your scripts. Scraping the web pages offers no advantage or functionality that the API cannot provide. Please refer to:

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=coursework&part=eutils

http://www.ncbi.nlm.nih.gov/books/NBK25501

If you feel you must write a script in order to download data, please follow this policy:

- If you are searching PubMed or Entrez, please use the E-utilities if you have not already done so (http://www.ncbi.nlm.nih.gov/books/NBK25497).

- Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov, not the standard NCBI web address.

- Use the &email= and the &tool= fields so that we can track your project and contact you if there is a problem.

- For all scripts, do not send more than 3 requests per second. http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.General_Usage_Guidelines

- Run only one job with one thread at a time. Do not use multiple threads or multiple jobs to run queries at the same time.

- If you are sending unique identifiers (GIs, accession numbers, Gene IDs, etc.), you can send multiple UIDs per request (up to 500 per request) rather than one per request. Or use the webenv and retmax fields for batches of records (http://www.ncbi.nlm.nih.gov/books/NBK1058/#eutils_esayers-5-4-3).

- Limit all scripts to the off peak hours of 9 PM to 5 AM Eastern Standard Time (USA).
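
The policy above can be sketched in Python as follows. This is a minimal illustration, not an official GACRC or NCBI client. The function names (`is_off_peak`, `chunk_ids`, `build_efetch_url`, `fetch_all`) are hypothetical, and the `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi` endpoint and the `db`, `id`, `rettype` and `retmode` parameters come from NCBI's E-utilities documentation rather than from this page. The sketch uses a single thread, batches of at most 500 UIDs, a pause between requests to stay under 3 per second, and a fixed UTC-5 offset for Eastern Standard Time.

```python
import time
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint, taken from NCBI's E-utilities documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
MAX_UIDS_PER_REQUEST = 500        # policy: up to 500 UIDs per request
MIN_INTERVAL_SECONDS = 1.0 / 3    # policy: no more than 3 requests per second
EST = timezone(timedelta(hours=-5))  # fixed offset, ignores daylight saving


def is_off_peak(now=None):
    """Return True between 9 PM and 5 AM Eastern Standard Time."""
    now = now or datetime.now(timezone.utc)
    hour = now.astimezone(EST).hour
    return hour >= 21 or hour < 5


def chunk_ids(ids, size=MAX_UIDS_PER_REQUEST):
    """Split a sequence of UIDs into batches of at most `size`."""
    ids = list(ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def build_efetch_url(db, ids, email, tool, rettype="fasta", retmode="text"):
    """Build one EFetch URL carrying several UIDs plus the email and tool fields."""
    params = {
        "db": db,
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": retmode,
        "email": email,
        "tool": tool,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_all(db, ids, email, tool, fetch=None, sleep=time.sleep):
    """Fetch all batches one after another in a single thread."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET returning the body as text.
        fetch = lambda url: urlopen(url).read().decode()
    results = []
    for i, batch in enumerate(chunk_ids(ids)):
        if i > 0:
            sleep(MIN_INTERVAL_SECONDS)  # pause before every request after the first
        results.append(fetch(build_efetch_url(db, batch, email, tool)))
    return results
```

A script might call `is_off_peak()` before starting and then `fetch_all("nuccore", accession_list, "you@example.edu", "my_tool")`, where the database name, email and tool name are placeholders. For long UID lists NCBI's documentation suggests HTTP POST; the sketch uses GET for brevity.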

Please note that abstracts in PubMed may incorporate material that may be protected by U.S. and foreign copyright laws. All persons reproducing, redistributing, or making commercial use of this information are expected to adhere to the terms and conditions asserted by the copyright holder. Transmission or reproduction of protected items beyond that allowed by fair use as defined in the copyright laws requires the written permission of the copyright owners. If you wish to do a large-scale data mining project on PubMed, you can enter into a licensing agreement and lease the raw data, MEDLINE. For more information, please see http://www.nlm.nih.gov/databases/leased.html
