-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable use of cloud reference files #1804
Conversation
@markjschreiber This is the branch I mentioned in email - would be useful to know if this works with AWS. |
This would work for S3. Any reason for using |
Oh yeah - thanks for catching that - we definitely should use PicardHtsPath - I'm just trying to do too many things in parallel today. I won't be able to finish a full PR today, but I'm testing getReferencePath() on a hacked Picard tool on GCS now. |
If this works then I think we should deprecate direct access of |
3828906
to
4a099f5
Compare
Naturally, this was a little more complicated than I first thought, but I think it's correct now. It will allow us to incrementally migrate tools to support cloud references without introducing any compatibility issues. Specifically, it retains and populates the |
628a84a
to
0d3398e
Compare
ce16448
to
d3c6084
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good apart from the lack of a test with a reference on the cloud -- we'd need that before this can go in
@takutosato This PR is currently draft because it needs a cloud test. We should resurrect it when you start on the Picard cloud work, and add a test once we have a way to do that. |
return null; | ||
} else if (picardPath.getScheme().equals(PicardHtsPath.FILE_SCHEME)) { | ||
// file on a local file system | ||
return picardPath == null ? null : picardPath.toPath().toFile(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return picardPath.toPath().toFile()
because picardPath
is never null if the code gets here
d3c6084
to
36bc7d1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tsato. I started reviewing, but I think the previous PR (the one that included GATKBucketUtils
and GATKIOUtils
) propagates some code from GATK that will cause us to go down a path where we reproduce a lot of the patterns from GATK that we're trying to eradicate. I put a couple of comments here inline - take a look at them and see if they make sense to you, and maybe we can meet to discuss if you want.
But my basic uber-comment is to try to eliminate any code that attempts to manipulate/modify/interrogate a PicardHtsPath by converting it to a String
, and to only convert to File
, or String
if you absolutely have to.
So for instance, the methods in GATKBucketUtils
that accept or return a String would ideally be replaced with versions that manipulate a PicardHtsPath
. Of course there may be cases where we have no choice, but I think most of the code here can be modified to eliminate these code patterns.
@@ -43,6 +43,9 @@ private GATKBucketUtils(){} //private so that no one will instantiate this class | |||
*/ | |||
public static String getTempFilePath(String prefix, String extension){ | |||
if (isGcsUrl(prefix) || (isHadoopUrl(prefix))){ | |||
if (!extension.startsWith(".")) { | |||
extension = "." + extension; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big fan of modifying extension here - I think it would be preferable to have this method use the extension as presented, and document that behavior, and require the call sites specify the "." if thats what they want (we haven't been consistent about this in the past).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why does this file have a GATK prefixes in the name ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here was that the method was behaving inconsistently when the input is a local file path (always return .txt, modifying the extension where necessary) vs a gs:// file path (do not modify the extension, which resulted in temp file path like "filetxt"). I like your approach too; will change it for both cases.
final String cloudPath = GATKBucketUtils.getTempFilePath(GCloudTestUtils.getTestInputPath(), "txt"); | ||
final String localPath = GATKBucketUtils.getTempFilePath("test", "txt"); | ||
|
||
int d = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused.
validateSamFile.INPUT = downsampled; | ||
Assert.assertEquals(validateSamFile.doWork(), 0); | ||
// temporarily skip this check for cram files until ValidateSam is updated the support cloud reference input | ||
if (! downsampled.getRawInputString().endsWith(".cram")){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (! downsampled.getRawInputString().endsWith(".cram")){ | |
if (! downsampled.hasExtension(FileExtensions.CRAM)){ |
Converting these paths (like downSampled
) to String
or File
is exactly the code pattern we want to eradicate. In an ideal world everything would be done in PicardHTSPath
space, and we would only just-in-time convert to String
or Path
, or in the worst case File
, if its absolutely necessary because we're calling some external API that requires it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -181,30 +184,40 @@ private Iterable<SAMSequenceRecord> getSamSequenceRecordsIterable() { | |||
*/ | |||
protected String[] customCommandLineValidation() { | |||
if (URI == null) { | |||
URI = "file:" + referenceSequence.getReferenceFile().getAbsolutePath(); | |||
if (GATKBucketUtils.isRemoteStorageUrl(referenceSequence.getHtsPath().getURIString())){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converting to a String
like this is exactly the kind of thing we want to avoid wherever possible. It would be much preferable to change isRemoteStorageURL
to take a PicardHtsPath
directly (or make isRemoteStorageURL
a PicardHtsPath
method), to discourage this kind of interconversion (see my comment elsewhere on String interconversion).
@@ -167,7 +170,7 @@ public SAMSequenceDictionary makeSequenceDictionary(final File referenceFile) { | |||
private Iterable<SAMSequenceRecord> getSamSequenceRecordsIterable() { | |||
return () -> { | |||
final SequenceDictionaryUtils.SamSequenceRecordsIterator iterator = | |||
new SequenceDictionaryUtils.SamSequenceRecordsIterator(REFERENCE_SEQUENCE, | |||
new SequenceDictionaryUtils.SamSequenceRecordsIterator(referenceSequence.getHtsPath().toPath(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use 'referenceSequence.getReferencePath()' instead of duplicating the 'toPath()' call here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
URI = referenceSequence.getHtsPath().getURIString(); | ||
} else { | ||
URI = "file:" + referenceSequence.getHtsPath().getURIString(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, this is the kind of string manipulation (and bug) that plagues the GATK code base - its really hard to preserve the structure of the URI when you do string manipulation like this. So we don't want to reproduce these patterns in Picard.
As it stands, this code will produce a URI string with a duplicate "file:" prefix for local files, since the protocol scheme is already present in the URI string returned by getURIString()
(i.e., if the reference is a file called "myref.fasta", the value assigned to URI
resulting from this will be "file:file:///path/to/myref.fasta").
Instead, you can just use:
} | |
URI = referenceSequence.getHtsPath().getURIString(); |
But then the whole enclosing if block (starting in line 187) becomes redundant and can be replaced with one line:
} | |
URI = referenceSequence.getHtsPath().getURIString(); |
There is no need to test whether the file is remote or not - PicardHtsPath
is always backed by a valid URI that includes a protocol scheme. If the file is local, the protocol scheme will be "file://".
This is a great example of why PicardHtsPath
and friends (the base classes) are better currency than File
/Path
/String
, etc. The whole idea is to always use PicardHtsPath as the datatype wherever possible, including in function signatures, and then only interconvert to String
, Path
, or, in the worst case, File
, when you absolutely have to, because you're calling some method that you have no control over. In this case, since the URI
variable is a String, there is no getting around the conversion to String, but thats a terminal operation, and any intermediate processing should be done on the structured type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, added a test case for URI, will look around for other places where I did this
/** | ||
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath. | ||
*/ | ||
public PicardHtsPath getREFERENCE_SEQUENCE() { return REFERENCE_SEQUENCE; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method can be eliminated, and this class and RequiredReferenceArgument
both should override getHtsPath
(I wish we had better names) to return REFERENCE_SEQUENCE directly. Ideally getHtsPath
, and getReferencePath
would then become the primary way for tools to obtain the reference (at least for nio-aware tools).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, sounds good. I think getHtsPath
is a good name!
@cmnbroad made the changes you requested, back to you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkpointing a partial review again due to time constraints.
There is also one other issue for this whole PR that we should think about, which is that we're using the concrete class HtsPicardPath
everywhere, including in method signatures and return values, but really, we should only be using that when we need a concrete class (i.e., for things annotated with @Argument
annotation, or when we need to construct a new instance of one of these objects), but everywhere else we should really just use the IOPath
interface. That would make it easier to share code with GATK (though we haven't really done this in GATK either), or to migrate functionality to htsjdk. We might want to discuss this at the team meeting next week.
@Test | ||
public void testExtensionConsistent(){ | ||
final String cloudPath = GATKBucketUtils.getTempFilePath(GCloudTestUtils.getTestInputPath(), "txt"); | ||
final String localPath = GATKBucketUtils.getTempFilePath("test", "txt"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, GSTKBucketUtils.getTempFilePath
would return an HtsPicardPath
(or HtsPath
if it were to be moved to htsjdk).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You were looking at old code here for some reason; this class is now called PicardBucketUtilsTest
, and the getTempFilePath returns an object of type PicardHtsPath
|
||
// tsato: does it work with the System.getProperty() way? What if the gcloud path was specified as an environment variable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as Defaults.REFERENCE_FASTA
in htsjdk is of type File
, I don't think this will work if a gcs path (or any URI with a non-file protocol scheme) is provided as an environment variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that sounds right. thanks.
|
||
/** | ||
* @return The reference provided by the user, if any, or the default defined by {@code htsjdk.samtools.Defaults.REFERENCE_FASTA} | ||
*/ | ||
@Override | ||
public File getReferenceFile() { | ||
return ReferenceArgumentCollection.getFileSafe(REFERENCE_SEQUENCE, log); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could inherit the javadoc from the base class, but if the javadoc is kept here I would suggest adding a including comment saying that this method is preserved for compatibility, and that getReferencePath
and getHtsPath
are preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the javadoc here and moved the disclaimer about backward compatibility to the base class too.
} | ||
|
||
/** | ||
* @return The reference provided by the user, if any, or the default, if any, as an nio Path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add "may be null".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I delete this javadoc too since the base class has a better, more detailed javadoc?
* | ||
* @return The reference provided by the user or the default as a File. May be null. | ||
*/ | ||
File getReferenceFile(); // tsato: update tools that call this method to convert File to PicardHtsPath and use getHtsPath() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is going to be retained, I would suggest updating the javadoc to say that this method is retained for compatibility, and that getReferencePath
and getHtsPath
are preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment to yourself on this line can probably be removed now.
} | ||
if (OUTPUT == null) { | ||
OUTPUT = ReferenceSequenceFileFactory.getDefaultDictionaryForReferenceSequence(referenceSequence.getReferenceFile()); | ||
final Path outputPath = ReferenceSequenceFileFactory.getDefaultDictionaryForReferenceSequence(referenceSequence.getHtsPath().toPath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit of a nit, but this could be getReferencePath
instead of getHtsPath
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
try { | ||
Files.delete(OUTPUT.toPath()); | ||
} catch (IOException e2) { | ||
throw new PicardException("Unknown problem encountered, and failed to delete the incomplete output.", e2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a good idea to include the text of the first exception in the error message, since it describes the original problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, added e.getMessage()
in the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit of code is a little ugly….
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see you want e
given to the exception, not e2
try (final BufferedWriter writer = Files.newBufferedWriter(METRICS_FILE.toPath())){ | ||
metricsFile.write(writer); | ||
} catch (IOException e) { | ||
throw new PicardException("Encountered an error writing the metrics file: " + METRICS_FILE.getURIString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, retain the original exception (pass e
to the PicardException
constructor) since it describes the actual problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
*/ | ||
public static String getRequesterPaysBucket() { | ||
return getSystemProperty("PICARD_REQUESTER_PAYS_BUCKET", REQUESTER_PAYS_BUCKET_DEFAULT); | ||
return getSystemProperty("PICARD_REQUESTER_PAYS_BUCKET", REQUESTER_PAYS_BUCKET_DEFAULT.getURIString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is unused, so it could be removed. But if we keep it, it could return an HtsPath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. @lbergelson let us know if you object since this is your code...
if (append){ | ||
return new PicardHtsPath(path.getURIString() + newExtension); | ||
} else { | ||
return new PicardHtsPath(FilenameUtils.removeExtension(path.getURI().toString() + newExtension)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a handy method to have, but I think its a bit tricky to actually get this right in all cases - it will need a good suite of test cases.
As it stands, I think its adding the new extension first, then immediately removing it because the parentheses are wrong. You would need to remove the extension first, and THEN add on the new one:
return new PicardHtsPath(FilenameUtils.removeExtension(path.getURI().toString() + newExtension)); | |
return new PicardHtsPath(FilenameUtils.removeExtension(path.getURI().toString()) + newExtension); |
But also I'm a bit leary of relying on the apache FilenameUtils
class, since I'm not sure that it handles all forms of the URI correctly.
I think the right way to do this is to use Path::getFileName
to get just the file name part of the existing path, massage that to the new file name with the new extension, and then use Path::resolveSibling against the original Path to get the new Path, and then finally recreate a new HtsPath using the URI string resulting from the new path. Again, I think it may be a bit tricky to get right, but it would be super valuable to have.
Also, this could be an instance method on the original HtsPath rather than a static method (since the first arg is a PicardHtsPath anyway). This is also true of the resolve
method below. If we can get it right, its a good candidate to move right into htsjdk on the HtsPath class. Maybe @lbergelson has thoughts on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. Now using resolveSibling. Will add some tests and see if FilnameUtils can be replaced by something in house. Let's discuss whether we should move this to htsjdk during next week's meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced FileNameUtils::removeExtension with a combination of IOPath::getExtension() and String::replaceAll(). Also added tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for moving this method to resolve, I think replaceExtension() belongs in PicardHtsPath because it returns a new PicardHtsPath object (much like PicardHtsPath::fromPath()). If it's in HtsPath we would have to override it, right?
public static final String GOOGLE_CLOUD_STORAGE_FILESYSTEM_SCHEME = "gs"; | ||
public static final String HTTP_FILESYSTEM_PROVIDER_SCHEME = "http"; | ||
public static final String HTTPS_FILESYSTEM_PROVIDER_SCHEME = "https"; | ||
public static final String HDFS_SCHEME = "hdfs"; | ||
public final static String FILE_SCHEME = "file"; | ||
|
||
// To the extent possible, avoid converting e.g. a PicardHtsPath to a string and calling beginsWith() against these | ||
// string variables. Instead, work directly with PicardHtsPath/HtsPath using e.g. PicardHtsPath::getScheme() | ||
private static final String GCS_PREFIX = "gs://"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad removed these string variables as discussed
@cmnbroad thanks again for your review. I made the changes you requested. I also took a stab at changing PicardHtsPath to IOPath where appropriate; these changes are in a separate (latest) commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still haven't gotten all the way through, but there is a fair amount of cleanup to be done, mostly in extraneous javadoc comments (I commented where appropriate). Its converging, but slowly - I expect we'll need at least one more round, and maybe two before we can merge.
As for the use of IOPath, there are quite a few more places where that could be used (basically, everywhere EXCEPT where a new object is created (new
operator, plus in @argument annotated variables, since those need require a concrete type). I kind of feel like we should do it everywhere possible, or not at all. I'm not sure how many more changes would be required if we try to do it everywhere, so perhaps we can discuss at the next meeting. I hate to not do it in this PR, but maybe its too much.
Also, is it intentional that we retained the hadoop code paths ? I fine either way, just wondering if we need those. Sometime soon we'll need similar support for Azure/AWS, but we might want to discuss whether we want retain the hadoop code paths.
|
||
/** | ||
* Base interface for a reference argument collection. | ||
*/ | ||
public interface ReferenceArgumentCollection { | ||
/** | ||
* @return The reference provided by the user, if any, or the default, if any. | ||
* This method is retained for backward compatibility with legacy tools that have not been updated to support PicardHtsPath input files. | ||
* The preferred methods for accessing the reference file in the command line argument is either getHtsPath() or getReferencePath(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* The preferred methods for accessing the reference file in the command line argument is either getHtsPath() or getReferencePath(). | |
* The preferred methods for accessing the reference file provided on the command line is either getHtsPath() or getReferencePath(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this change locally
* | ||
* @return The reference provided by the user or the default as a File. May be null. | ||
*/ | ||
File getReferenceFile(); // tsato: update tools that call this method to convert File to PicardHtsPath and use getHtsPath() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment to yourself on this line can probably be removed now.
* We do not currently support setting a remote path via an environment variable. For this, we would have to update | ||
* HtsJdk.Defaults.REFERENCE_FASTA to support remote paths. | ||
* | ||
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath. May be null. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* We do not currently support setting a remote path via an environment variable. For this, we would have to update | |
* HtsJdk.Defaults.REFERENCE_FASTA to support remote paths. | |
* | |
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath. May be null. | |
*/ | |
* Return the reference provided by the user as a PicardHtsPath. This currently does not support remote paths set via HtsJdk.Defaults.REFERENCE_FASTA, since that uses a `File` object. | |
* | |
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath. May be null. | |
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to keep this comment to myself as a TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
/** | ||
* @return A "safe" way to obtain a File object for any reference path. | ||
* | ||
* For files that reside on a local file system, this returns a valid File object. Files that reside on | ||
* a non-local file system can't be accessed via a File object, so return a placeholder File object that | ||
* will defer failure when/if the file is actually accessed. This allows code paths that blindly propagate | ||
* the value returned by calls to getReferenceFile to not get an NPE, and to fail gracefully downstream | ||
* with an error message that includes the reference file specifier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I wrote this, but a couple of proposed changes:
/** | |
* @return A "safe" way to obtain a File object for any reference path. | |
* | |
* For files that reside on a local file system, this returns a valid File object. Files that reside on | |
* a non-local file system can't be accessed via a File object, so return a placeholder File object that | |
* will defer failure when/if the file is actually accessed. This allows code paths that blindly propagate | |
* the value returned by calls to getReferenceFile to not get an NPE, and to fail gracefully downstream | |
* with an error message that includes the reference file specifier. | |
/** | |
* @return A "safe" File object for any reference path. | |
* | |
* For files that reside on a local file system, this returns a valid File object. Files that reside on | |
* a non-local file system can't be accessed via a File object, so return a placeholder File object that | |
* will defer failure until the the File object is actually accessed. This allows code paths that blindly propagate | |
* the value returned by calls to getReferenceFile to not get an NPE, and to fail gracefully downstream | |
* with an error message that includes the reference file specifier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incorporated
|
||
@Argument(shortName = StandardOptionDefinitions.REFERENCE_SHORT_NAME, doc = "Reference sequence file.", common = false) | ||
public File REFERENCE_SEQUENCE; | ||
@Argument(shortName = StandardOptionDefinitions.REFERENCE_SHORT_NAME, doc = "Reference sequence file.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, ok.
@@ -66,13 +64,12 @@ public static String getTempFilePath(String prefix, String extension){ | |||
* @param prefix The beginning of the file name | |||
* @param suffix The end of the file name, e.g. ".tmp" | |||
*/ | |||
private static String randomRemotePath(String stagingLocation, String prefix, String suffix) { | |||
private static Path randomRemotePath(String stagingLocation, String prefix, String suffix) { | |||
if (isGcsUrl(stagingLocation)) { | |||
// Go through URI because Path.toString isn't guaranteed to include the "gs://" prefix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this comment was already here, but I'm not sure what it refers to. I think its obsolete or something since it seems out or place (and possibly even wrong). I'd remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
GATKUtils.nonNull(path); | ||
return path.startsWith(GCS_PREFIX); | ||
return path.startsWith("gs://"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the GOOGLE_CLOUD_STORAGE_FILESYSTEM_SCHEME
constant here (similar to the isHadoopUrl(String) one is written).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
new File(file.getAbsolutePath() + FileExtensions.TABIX_INDEX).deleteOnExit(); | ||
new File(file.getAbsolutePath() + ".bai").deleteOnExit(); | ||
new File(file.getAbsolutePath() + ".md5").deleteOnExit(); | ||
new File(file.getAbsolutePath().replaceAll(extension + "$", ".bai")).deleteOnExit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the FileExtensions constants for these wherever there is on available (i.e., FileExtensions.BAI_INDEX).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I know its like this in GATK, but I'd make the same comment as elsewhere about putting the .bai variants adjacent to each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
/** | ||
* | ||
* tsato: convert to string | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this comment ? Obsolete ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
/** | ||
* This default implementation is here to support tools that have not been upgraded to support cloud reference files. | ||
* (The alternative is to make it abstract, which necessitates we update all the classes that implement this interface with | ||
* an implementation like what we have below). | ||
* | ||
* The cloud-reference-enabled tools should override this method e.g. return REFERENCE_SEQUENCE (see DownsampleSam::makeReferenceArgumentCollection) | ||
* Once all the relevant subclasses have been updated, we plan to make this abstract. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is somewhat out of place here and a bit confusing. Either way, I think the override cases are the exception, and should be (and I think are) documented where they occur. The javadoc here should really just describe the method itself. I would suggest the simplifying:
/** | |
* This default implementation is here to support tools that have not been upgraded to support cloud reference files. | |
* (The alternative is to make it abstract, which necessitates we update all the classes that implement this interface with | |
* an implementation like what we have below). | |
* | |
* The cloud-reference-enabled tools should override this method e.g. return REFERENCE_SEQUENCE (see DownsampleSam::makeReferenceArgumentCollection) | |
* Once all the relevant subclasses have been updated, we plan to make this abstract. | |
/** | |
* Returns a PicardHtsPath for the reference input. May be null. Implementers of this interface should override with an appropriate implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@cmnbroad I made the changes you requested and otherwise cleaned up comments and code. Back to you. Thanks again for reviewing. |
Before this PR is merged, the accompanying htsjdk PR must be merged first. samtools/htsjdk#1681 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting closer, but still a few more issues to resolve (also, I finally had time this time to look at all the code, so hopefully this finally covers all of the issues).
|
||
/** | ||
* @return The reference provided by the user, if any, or the default defined by {@code htsjdk.samtools.Defaults.REFERENCE_FASTA} | ||
* @return The reference provided by the user, if any, or the default, if any, as an nio Path. May be null. (tsato: remove this javadoc since it's already present in the base class?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd remove the javadoc from this class too, like you've done for the required one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* there is no guarantee that this will be used as the root of the tmp file name, a local prefix may be placed in the tmp folder for example | ||
* @param extension and extension for the temporary file path, the resulting path will end in this | ||
* @return a path to use as a temporary file, on remote file systems which don't support an atomic tmp file reservation a path is chosen with a long randomized name | ||
* @param directory the directory where the temporary fill will be placed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this method is mirroring what the GATK one does, but it has weird semantics in that this param is only used if it specifies a remote path. Given that, I would recommend including the language used in the GATK version of that that says that there is no guaranty that the file will actually be placed in this directory (or maybe, it seems like this is only used for remote files, so you could even be explicit and say that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the behavior of prefix when creating a local tmp file was incorrect here. Code updated. Thanks for catching.
@@ -83,7 +78,7 @@ private static String randomRemotePath(String stagingLocation, String prefix, St | |||
* on Spark because using the fat, shaded jar breaks the registration of the GCS FilesystemProvider. | |||
* To transform other types of string URLs into Paths, use IOUtils.getPath instead. | |||
*/ | |||
public static java.nio.file.Path getPathOnGcs(String gcsUrl) { | |||
public static CloudStoragePath getPathOnGcs(String gcsUrl) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this can't be private ? It's currently only used within this file, and it would be great to prevent it from proliferating further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -92,22 +87,24 @@ public static java.nio.file.Path getPathOnGcs(String gcsUrl) { | |||
} | |||
|
|||
/** | |||
* Note: to the extent possible, avoid converting an HtsPath object to a String then calling this method. | |||
* Instead, use the same method below that takes in a PicardHtsPath as input. | |||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I'm all for discouraging use of this method, and its private anyway (which is good), I would either remove this comment entirely or else clarify the reason for it with something that explains that we want to discourage use of this pattern of manipulating Strings. (Also, it no longer matches the signature of the actual method below). As it stands I don't think the comment will make sense to someone who doesn't already know the reason for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
I would like to document "best practices" for manipulating PicardHtsPath/IOPath objects --- e.g. avoid converting to a string then editing it…perhaps at the top of a some class as a comment. I will look for a good place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - would love to see the best practices documentation for this somewhere (maybe htsjdk, since that where the base implementation is). But I think thats a tall order, especially since we're still ironing those out via this code review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, sounds good. I will just keep the notes in a private file for now, and maybe we'll add it to the repo one day.
* @param append If set to true, append the new extension to the original basename. If false, replace the original extension | ||
* with the new extension. If append = false and the original name has no extension, an exception will be thrown. | ||
* @param newExtension A new file extension. Must include the leading dot e.g. ".txt", ".bam" | ||
* @return A new PicardHtsPath object with the new extension |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should mention that it only replaces the LAST component of an extension, i.e., given "my.fasta.gz, it will only replace the ".gz" part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
{"testdata/picard/sam/test.bam", ".bai", true, new File("testdata/picard/sam/test.bam.bai").getAbsolutePath()}, | ||
{"/Users/jdoe/workspace/picard/testdata/picard/sam/test.bam", ".bai", false, "/Users/jdoe/workspace/picard/testdata/picard/sam/test.bai"}, | ||
{"/Users/jdoe/workspace/picard/testdata/picard/sam/test.bam", ".md5", true, "/Users/jdoe/workspace/picard/testdata/picard/sam/test.bam.md5"} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest some test cases with query params - see my comment on the PicardHtsPath changes.
} catch (IOException e2) { | ||
throw new PicardException("Unknown problem encountered, and failed to delete the incomplete output.", e); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall if we talked about this previously, but why is this second deletion attempt necessary ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessary; I removed the first attempt, OUTPUT.toPath().toFile().delete();
, which as you pointed out probably doesn't work on a remote path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks.
} catch (IllegalArgumentException e) { | ||
// in case of an unexpected error delete the file so that there isn't a | ||
// truncated result which might be valid yet wrong. | ||
OUTPUT.delete(); | ||
OUTPUT.toPath().toFile().delete(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm - does OUTPUT.toPath().toFile()
return a legitimate, useable File for a remote path ?
* Takes an IOPath and returns a new PicardHtsPath object that keeps the same basename as the original but has | ||
* a new extension. If the input path is a relative local path, it is converted to an absolute path via | ||
* PicardHtsPath::toPath (we might change this behavior and avoid converting to an absolute path in the future, | ||
* by for instance working directly with a string). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior described here is a bit of an implementation detail, and as you mention here, could be fixed by changing the implementation start with the rawInputString (I'm not suggesting we do that now - since I think as a practical matter this only impacts your tests because you want to compare the old and new paths). Anyway I'd change this comment slightly if you want to retain it:
* Takes an IOPath and returns a new PicardHtsPath object that keeps the same basename as the original but has | |
* a new extension. If the input path is a relative local path, it is converted to an absolute path via | |
* PicardHtsPath::toPath (we might change this behavior and avoid converting to an absolute path in the future, | |
* by for instance working directly with a string). | |
* Takes an IOPath and returns a new PicardHtsPath object that keeps the same basename as the original but has | |
* a new extension. If the input path was created from a rawInputString that specifies a relative local path, the new path will | |
* always have a rawInputString that specifies an absolute path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the javadoc is meant for the users of the function and not where the developers describe the implementation details. At the same time I think this is unnatural behavior, and the reason why it's implemented this way should be documented. I think here in the javadoc seems the most fitting place to put this information (the second best place might be the right before we return the output, because that's where this local to absolute path conversion happens under the hood).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that it this behavior is nonintuitive, and I'm fine with retaining this here, but I still suggest the change I proposed. In your text, the term "input path" is ambiguous (I interpret it to mean to the path used to create the original object). So I tried to fix that in my proposed text to eliminate the ambiguity by being a bit more explicit. At any rate, do whatever you think is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I incorporated your text.
|
||
public class PicardBucketUtilsTest { | ||
|
||
// Check that the extension scheme (e.g. "txt" vs ".txt" as the second argument) is consistent for cloud and local files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is obsolete since there is no longer a "txt" vs ".txt" issue (and also the reference to the "second argument" is a bit ambiguous given that there are two overloads being used here). Maybe:
// Check that the extension scheme (e.g. "txt" vs ".txt" as the second argument) is consistent for cloud and local files | |
// Check that the extension scheme is consistent for cloud and local files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@cmnbroad thanks again for your edits. I responded to all your comments. Back to you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're almost there- I left a couple of responses to Takuto's questions. This also has conflicts and needs a rebase.
@@ -92,22 +87,24 @@ public static java.nio.file.Path getPathOnGcs(String gcsUrl) { | |||
} | |||
|
|||
/** | |||
* Note: to the extent possible, avoid converting an HtsPath object to a String then calling this method. | |||
* Instead, use the same method below that takes in a PicardHtsPath as input. | |||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - would love to see the best practices documentation for this somewhere (maybe htsjdk, since that where the base implementation is). But I think thats a tall order, especially since we're still ironing those out via this code review.
if (append){ | ||
return new PicardHtsPath(path.getURIString() + newExtension); | ||
} else { | ||
final Optional<String> oldExtension = path.getExtension(); | ||
|
||
if (oldExtension.isEmpty()){ | ||
throw new PicardException("The original path must have an extension when append = false: " + path.getURIString()); | ||
} | ||
|
||
final String oldFileName = path.toPath().getFileName().toString(); | ||
final String newFileName = oldFileName.replaceAll(oldExtension.get() + "$", newExtension); | ||
return PicardHtsPath.fromPath(path.toPath().resolveSibling(newFileName)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True that, but a file extension, and the file name part of a URI, are strings. So you're right that in those cases its unavoidable, but I think those are fine. My objection/obsession is with turning the entire URI into a string and then trying to parse/extend it.
As for question 1, I don't think its a matter of telling users to clean up the URIs; GCS, Azure, refget, DRS, etc., all use query parameters to support various features. If we don't properly support query parameters, I think we will have entirely missed the boat. And I don't actually think its that hard to do (though I know it probably feels a bit painful in this first PR!). My fear is if we don't get it right in this PR, it will really hard to retrofit.
The second question is a good one. I can certainly imagine cases where the user doesn't want to preserve the query parameters when changing the extension, but we can add methods to support that in a future PR without breaking backward compatibility. For now I would implement what you proposed, that is, preserve them (http://some/file.txt.log? http://some/file.txt.log?someParam=someValue).
* @param targetDir Directory in which to create the temp file. If null, the default temp directory is used. | ||
* @return A file in the temporary directory starting with name, ending with extension, which will be deleted after the program exits. | ||
*/ | ||
public static File createTempFileInDirectory(final String name, String extension, final File targetDir) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this can wait.
} catch (IOException e2) { | ||
throw new PicardException("Unknown problem encountered, and failed to delete the incomplete output.", e); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks.
* Takes an IOPath and returns a new PicardHtsPath object that keeps the same basename as the original but has | ||
* a new extension. If the input path is a relative local path, it is converted to an absolute path via | ||
* PicardHtsPath::toPath (we might change this behavior and avoid converting to an absolute path in the future, | ||
* by for instance working directly with a string). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that it this behavior is nonintuitive, and I'm fine with retaining this here, but I still suggest the change I proposed. In your text, the term "input path" is ambiguous (I interpret it to mean to the path used to create the original object). So I tried to fix that in my proposed text to eliminate the ambiguity by being a bit more explicit. At any rate, do whatever you think is best.
|
||
// Test cram (input/output) | ||
// Note: this test is very slow---takes about 5 min on Broad internal network | ||
final boolean testCram = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow - thats surprising. I was going to try to debug it, but I don't have you htsjdk changes. I would say we should retain the test (these run in separate cloud test job in CI anyway, right ?), and cleanup the code to remove the references to your local file system, remove the unused variables, etc.
@tsato Not sure where this stands (we might be waiting for each other ?). It seems like most things have either been addressed or dismissed. Unless you have more changes to make, the conflicts should probably be resolved and then I can take one final look (though I'll be out the next two weeks). |
Hi @cmnbroad, sorry about the delay. It took me a while to test all the different cases for the cram performance issue that I was looking at with @lbergelson. Should be able to push my changes soon (hopefully tonight) so they should be ready when you return in two weeks. |
@cmnbroad it's ready for your final review. One disappointment is that for the reasons I listed above in comment, I could not write tests for query params (http:// paths are currently not supported in Picard, and gcloud actively rejects URIs with a query parameter). But other than that I incorporated all the changes you suggested. |
(Merge conflict is pretty minor so I will resolve it when it's ready to merge) |
…mentCollection implementations to use nio paths. Change CleanSam to use getReferencePath. preliminary changes test passing for now CreateSequenceDictionary looks good working on downsamplesam temporary as I fix the issue remove 2 just a commit near ready another checkpoint checkpoint all good except maybe the cram ready chris comments no 1 review edits 1 done proofreading chris edit 2 PicardHtsPath -> IOPath where appropriate another edit another round of edits louis update htsjdk chris suggestion replaceExtension Co-authored-by: Chris Norman <[email protected]> chris edit reorganize proofread manually resolve conflicts khalid conflict that was not caught some fix fixed more fails
78260d1
to
b1f0946
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good now - down to a couple of minor requests (plus we still need to move the code Khalid added, and decide if we want to replace the CRAM tests with something that doesn't use a full reference).
{"testdata/picard/sam/test.bam", ".bai", false, new File("testdata/picard/sam/test.bai").getAbsolutePath()}, // a relative input will be converted to absolute input | ||
{"testdata/picard/sam/test.bam", ".bai", true, new File("testdata/picard/sam/test.bam.bai").getAbsolutePath()}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're going to have the REPLACE/APPEND aliases for true/false, you might as well use them everywhere. Otherwise it looks like something strange is going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be a good idea to add some test cases that have the replacement extension embedded in the pathname somewhere in addition to being at the end, to make sure the replacement only happens at the end, i.e.:
{"gs://hellbender/test/resources.fasta/hg19mini.fasta", ".fai", REPLACE, "gs://hellbender/test/resources.fasta/hg19mini.fai"}, {"gs://hellbender/test/resources.fasta/hg19mini.fasta", ".fai", APPEND, "gs://hellbender/test/resources.fasta/hg19mini.fasta.fai"},
as well as some cases where the input has no extension:
{"gs://hellbender/test/resources/hg19mini", ".fai", APPEND, "gs://hellbender/test/resources/hg19mini.fai"},
and also a new test for negative case(s?) that throw:
{"gs://hellbender/test/resources/hg19mini", ".fai", REPLACE},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, added these cases (and made the failure cases a separate test; let me know if there is a better way)
try { | ||
Files.delete(OUTPUT.toPath()); | ||
} catch (IOException e2) { | ||
throw new PicardException("Unknown problem encountered, and failed to delete the incomplete output.", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand why we need the second attempt at deletion here, but I thought you had fixed this to include the text of e2.getMessage().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I'm not sure why this change got lost, but I put it back in; removed the first delete attempt and added e2.message() to the string given by PicardException
@@ -71,7 +71,7 @@ public class CleanSam extends CommandLineProgram { | |||
protected int doWork() { | |||
IOUtil.assertFileIsReadable(INPUT); | |||
IOUtil.assertFileIsWritable(OUTPUT); | |||
final SamReaderFactory factory = SamReaderFactory.makeDefault().referenceSequence(REFERENCE_SEQUENCE); | |||
final SamReaderFactory factory = SamReaderFactory.makeDefault().referenceSequence(referenceSequence.getReferencePath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to change this for the reader, we should also do it for the writer (line 79). Right now it looks like two different references. I'd either revert this change, or make it consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to revert this change and cloudify the tool in a separate PR.
@@ -705,7 +705,7 @@ public void testUnmapBacterialContamination() throws IOException { | |||
|
|||
//yuck! | |||
final RequiredReferenceArgumentCollection requiredReferenceArgumentCollection = new RequiredReferenceArgumentCollection(); | |||
requiredReferenceArgumentCollection.REFERENCE_SEQUENCE = reference; | |||
requiredReferenceArgumentCollection.REFERENCE_SEQUENCE = new PicardHtsPath(reference.getAbsolutePath()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be bit more straightforward to just use reference.getHtsPath()
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you meant requiredReferenceArgumentCollection.REFERENCE_SEQUENCE = new PicardHtsPath(reference);
?
Thanks again for your review @cmnbroad. When using the full hg19 reference vs the chr20 + 21 reference for the cram test, the runtimes are actually roughly the same because in both cases we download the entire chromosome 20. So it doesn't seem to matter which one we use (I picked the 20 + 21 one). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now! Approved (in spirit anyway - github won't let me merge it because its not marked as approved, and it won't let me mark it as approved, I assume because it was originally my branch and I authored some of the commits).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGT@cmnbroad
...by updating the ReferenceArgumentCollection implementations to use nio paths.
(Edited by tsato on 8/14/23)
This PR updates
ReferenceArgumentCollection
and classes that implement it to support reference files stored in the cloud. NowDownsampleSam
andCreateSequenceDictionary
support cloud reference files. Updating of the remaining tools will be done in a separate PR.Legacy tools may continue to access the reference via
File REFERENCE_SEQUENCE
inCommandLineProgram
or bygetReferenceFile()
inReferenceArgumentCollection
.Going forward the tools should access the reference via these methods in
ReferenceArgumentCollection
:Path getReferencePath()
: access the reference asPath
while avoiding NPE in case it's null.PicardHtsPath getHtsPath()
: the main way to access the reference