NCBI Connection

From Research Computing Center Wiki
Latest revision as of 13:45, 11 October 2017

Please contact us to get permission if you need to connect to NCBI on GACRC clusters. In your request, please specify the path to your scripts, the block of code that connects to NCBI, and the command used to test your script.

NCBI's main mission is to provide interactive web access. Fetching data directly from www.ncbi.nlm.nih.gov is blocked at GACRC.

To fetch data from NCBI (e.g. FASTA, FASTQ, SRA), a scripting API is available so that users can avoid intensive visits to NCBI web URLs. The API is called E-utilities, and you must use it for all your scripts. Scraping the web pages offers no advantage or functionality that the API cannot provide. Please refer to:

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=coursework&part=eutils

http://www.ncbi.nlm.nih.gov/books/NBK25501
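
As a rough illustration of what an E-utilities request looks like, the sketch below builds an EFetch URL with Python's standard library. It only constructs the URL and does not contact NCBI. The database name, accession ID, tool name and e-mail address are placeholder values chosen for this example, and the https scheme and `/entrez/eutils/efetch.fcgi` path are assumptions taken from NCBI's E-utilities documentation rather than from this page; check that documentation for the parameters your own query needs.

```python
from urllib.parse import urlencode

# Base address of the E-utilities server (not www.ncbi.nlm.nih.gov).
# Assumption: https scheme and /entrez/eutils path, per NCBI's documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(db, ids, tool, email, rettype="fasta", retmode="text"):
    """Return an EFetch URL for the given database and list of UIDs."""
    params = {
        "db": db,                # Entrez database to query
        "id": ",".join(ids),     # several UIDs can share one request
        "rettype": rettype,      # record format requested
        "retmode": retmode,      # plain text vs. XML, etc.
        "tool": tool,            # identifies your script to NCBI
        "email": email,          # lets NCBI contact you about problems
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_efetch_url("protein", ["NP_000509.1"],
                           "my_gacrc_script", "user@example.edu"))
```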

If you feel you must write a script to download data, please follow this policy:

- If you are searching PubMed or Entrez, please use the E-utilities if you have not already done so (http://www.ncbi.nlm.nih.gov/books/NBK25497).

- Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov, not the standard NCBI web address.

- Use the &email= and &tool= fields so that we can track your project and contact you if there is a problem.

- For all scripts, do not send more than 3 requests per second. http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.General_Usage_Guidelines

- Run only one job with one thread at a time. Do not use multiple threads or multiple jobs to run queries at the same time.

- If you are sending unique identifiers (GIs, accession numbers, Gene IDs, etc.), you can send multiple UIDs per request (up to 500 per request) rather than one per request. Alternatively, use the webenv and retmax fields for batches of records (http://www.ncbi.nlm.nih.gov/books/NBK1058/#eutils_esayers-5-4-3).

- Limit all scripts to the off-peak hours of 9 PM to 5 AM Eastern Standard Time (USA).
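
The batching, rate and time-of-day rules above can be sketched in a few lines of Python. This is a minimal illustration, not a GACRC-provided tool: `send_request` is a hypothetical callable that you would supply to perform the actual E-utilities call, and the off-peak check assumes a fixed UTC-5 offset for Eastern Standard Time, so it does not follow daylight saving time.

```python
import time
from datetime import datetime, timedelta, timezone

MAX_UIDS_PER_REQUEST = 500   # batch size limit stated in the policy
MIN_INTERVAL = 1.0 / 3       # at most 3 requests per second
EST = timezone(timedelta(hours=-5))  # assumption: fixed offset, no DST


def chunk_uids(uids, size=MAX_UIDS_PER_REQUEST):
    """Split a list of UIDs into batches of at most `size` items."""
    return [uids[i:i + size] for i in range(0, len(uids), size)]


def is_off_peak(now=None):
    """Return True between 9 PM and 5 AM; `now` is expected to be in EST."""
    if now is None:
        now = datetime.now(EST)
    return now.hour >= 21 or now.hour < 5


def run_batches(uids, send_request, sleep=time.sleep, clock=time.monotonic):
    """Call `send_request(batch)` one batch at a time, in a single thread,
    leaving at least MIN_INTERVAL seconds between the starts of two calls."""
    results = []
    last = None
    for batch in chunk_uids(uids):
        if last is not None:
            wait = MIN_INTERVAL - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        results.append(send_request(batch))
    return results
```

A script might call `is_off_peak()` before starting and exit early if it returns False, then pass its list of UIDs and its own request function to `run_batches`. The `sleep` and `clock` parameters exist only so the pacing can be tested without real delays.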

Please note that abstracts in PubMed may incorporate material that may be protected by U.S. and foreign copyright laws. All persons reproducing, redistributing, or making commercial use of this information are expected to adhere to the terms and conditions asserted by the copyright holder. Transmission or reproduction of protected items beyond that allowed by fair use as defined in the copyright laws requires the written permission of the copyright owners. If you wish to do a large-scale data mining project on PubMed, you can enter into a licensing agreement and lease the raw data, MEDLINE. For more information, please see http://www.nlm.nih.gov/databases/leased.html
