[AstroPy] Astropy and large VOTable files

Mon May 18 16:29:57 EDT 2015

I understood Jennifer's question to be specific to the VOTable XML file 
format, and the problems are really specific to parsing XML.  An 
iterator/streaming interface could probably be built on top of it, 
however.  Arbitrary (random access) slicing isn't really possible with 
XML, though.
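
To make the streaming idea concrete, here is a minimal sketch of reading 
rows in chunks and applying a cut to each chunk.  It uses only the 
standard library's xml.etree.ElementTree.iterparse, not 
astropy.io.votable, and it assumes the table uses the plain TABLEDATA 
serialization (TR/TD elements) rather than BINARY.  The helper names 
iter_row_chunks and select_bright, the sample columns, and the magnitude 
cut are all hypothetical and only there for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace such as '{http://...}TR' down to 'TR'."""
    return tag.rsplit("}", 1)[-1]


def iter_row_chunks(source, chunk_size=1000):
    """Yield lists of rows (each row a list of cell strings).

    Hypothetical helper: assumes TABLEDATA serialization only.
    """
    chunk = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "TR":
            continue
        chunk.append([td.text for td in elem if local_name(td.tag) == "TD"])
        # Release the cells of this row; the emptied TR element itself
        # stays attached to its parent, so memory still grows slowly.
        elem.clear()
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def select_bright(source, mag_limit=15.0, chunk_size=2):
    """Keep rows whose second column (assumed to be a magnitude) passes the cut."""
    kept = []
    for chunk in iter_row_chunks(source, chunk_size):
        kept.extend(row for row in chunk if float(row[1]) < mag_limit)
    return kept


# Tiny made-up table, just to exercise the functions above.
SAMPLE = b"""<?xml version="1.0"?>
<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
 <RESOURCE>
  <TABLE>
   <FIELD name="id" datatype="int"/>
   <FIELD name="mag" datatype="double"/>
   <DATA>
    <TABLEDATA>
     <TR><TD>1</TD><TD>14.2</TD></TR>
     <TR><TD>2</TD><TD>19.8</TD></TR>
     <TR><TD>3</TD><TD>12.5</TD></TR>
    </TABLEDATA>
   </DATA>
  </TABLE>
 </RESOURCE>
</VOTABLE>"""

bright = select_bright(io.BytesIO(SAMPLE))
```

Only the rows that pass the cut are kept, so peak memory depends on the 
chunk size and the selected rows rather than on the whole file.  The 
parser still has to walk the file from the start, though, which is why 
random-access slicing remains out of reach.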

Mike

On 05/18/2015 02:09 PM, Andrew Hearin wrote:
> Being able to read large data in chunks, make cuts on the chunks, and 
> return a table of rows that pass the cuts is a pretty common data 
> mining task that I think would be good to include in Astropy. I’m 
> happy to (re-)raise a GitHub issue for this purpose, and contribute 
> some code, but first: Jennifer, this is the functionality you are 
> describing, right? If so: Mike, do you see any fundamental obstacles 
> with this?
>
>
>
> On May 18, 2015, at 2:00 PM, Michael Droettboom <mdroe at stsci.edu> wrote:
>
>> Thanks for the question.
>>
>> Unfortunately, it will read the entire file into memory each time.  
>> It does read it in as a NumPy array, though, so depending on the 
>> content, the memory used should generally be less than the space on 
>> disk.
>>
>> XML doesn't really support the kind of slicing that FITS (or another 
>> binary format) can, because you can't know how big something is (or 
>> even what it is!) without parsing the whole file.  That said, given 
>> the constraint of the file format, minimal memory usage is one of the 
>> main design features of astropy.io.votable, so I'd recommend trying 
>> it on large files and seeing how it goes.  It shouldn't ever take 
>> significantly more memory than a binary array of data, i.e. the same 
>> as the equivalent FITS file loaded entirely into memory.
>>
>> Cheers,
>> Mike
>>
>> On 05/17/2015 10:11 AM, Jennifer Baldwin wrote:
>>> Hi all,
>>>
>>> I was trying to find an answer to this but could not. I am wondering 
>>> if parse_single_table will attempt to read an entire VOTable file? 
>>> Or if it will operate the same way as for FITS files so that when 
>>> you slice the returned data array, it only loads the part it needs 
>>> into memory? I'm concerned with how it will perform with extremely 
>>> large xml files, but could not find a direct answer anywhere in the 
>>> documentation.
>>>
>>> Thanks!
>>>
>>>
>>> _______________________________________________
>>> AstroPy mailing list
>>> AstroPy at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/astropy
>>
>
>
>
