Questions on Pandas

Tommy C tommyc168168 at gmail.com
Fri Jun 26 00:34:09 EDT 2015


Hi there, I have a number of questions about the pandas exercises in the book Python for Data Analysis by Wes McKinney, specifically from Chapter 6. It'd be much appreciated if you could answer the following questions!

1.
[code]
Input: pd.read_csv('ch06/ex2.csv', header=None)
Output:
 X.1 X.2 X.3 X.4 X.5
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

[/code]

Do the column names appear as "X.#" by default when header is set to None?
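For reference, a minimal sketch that runs on a current pandas, with an inline string as a stand-in for ch06/ex2.csv (the "X.1"-style names in the book come from the older pandas version it was written against; newer versions use default integer names):

```python
import io

import pandas as pd

# Hypothetical inline stand-in for ch06/ex2.csv
data = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# With header=None, pandas does not treat the first row as column names;
# it assigns default names instead (integers 0..N-1 in current versions,
# "X.1"-style names in the older pandas the book's printing used)
df = pd.read_csv(io.StringIO(data), header=None)
print(df.columns.tolist())
```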

2.
[code]
Input: chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
Input: chunker
Output: <pandas.io.parsers.TextParser at 0x8398150>

[/code]

Could you explain the idea of chunksize and what this output means?
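My understanding, sketched with a small inline stand-in for ch06/ex6.csv: passing chunksize makes read_csv return an iterator (the `<TextParser ...>` repr in the book; called TextFileReader in newer pandas) instead of a DataFrame, and each iteration yields a DataFrame of at most chunksize rows:

```python
import io

import pandas as pd

# Hypothetical stand-in for ch06/ex6.csv: a 'key' column with 10 rows
data = "key\n" + "\n".join(list("aabbbcccca"))

# chunksize=4 -> an iterator of DataFrames, each with at most 4 rows
chunker = pd.read_csv(io.StringIO(data), chunksize=4)
sizes = [len(piece) for piece in chunker]
print(sizes)  # three chunks: 4 + 4 + 2 rows
```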


3.
[code]
The TextParser object returned by read_csv allows you to iterate over the parts of the
file according to the chunksize. For example, we can iterate over ex6.csv, aggregating
the value counts in the 'key' column like so:
chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)
tot = Series([])
for piece in chunker:
 tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.order(ascending=False)
We have then:
In [877]: tot[:10]
Out[877]:
E 368
X 364
L 346
O 343
Q 340
M 338
J 337
F 335
K 334
H 330

[/code]

I couldn't get the Series call to run successfully... is there something missing in this code?
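Two likely causes: the book assumes `from pandas import Series, DataFrame`, so the bare `Series([])` call raises NameError without that import, and `Series.order` was renamed `sort_values` in later pandas releases. A sketch that runs on current pandas, with a small inline stand-in for ex6.csv:

```python
import io

import pandas as pd

# Hypothetical stand-in for ch06/ex6.csv (counts: a=3, b=3, c=4)
data = "key\n" + "\n".join(list("aabbbcccca"))
chunker = pd.read_csv(io.StringIO(data), chunksize=4)

# Spelled out with the pd. prefix, and sort_values in place of the
# removed Series.order method
tot = pd.Series([], dtype="float64")
for piece in chunker:
    tot = tot.add(piece["key"].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
print(tot)
```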

4.
[code]
Data can also be exported to delimited format. Let's consider one of the CSV files read
above:
In [878]: data = pd.read_csv('ch06/ex5.csv')
In [879]: data
Out[879]:
 something a b c d message
0 one 1 2 3 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11 12 foo

Missing values appear as empty strings in the output. You might want to denote them
by some other sentinel value:
In [883]: data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


[/code]

An error occurred when I tried to run this code with sys.stdout.
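The usual cause of that error is a missing `import sys` in the session. Writing to an in-memory buffer works the same way and avoids the issue entirely; a sketch with a hypothetical two-row frame mirroring ex5.csv:

```python
import io

import pandas as pd

# Hypothetical stand-in frame with a missing value, mirroring ex5.csv
data = pd.DataFrame({"a": [1, 5], "b": [2.0, float("nan")]})

# na_rep replaces the empty-string representation of NaN on output
buf = io.StringIO()
data.to_csv(buf, na_rep="NULL")
print(buf.getvalue())
```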


5.
[code]
Defining a simple subclass of csv.Dialect:
class my_dialect(csv.Dialect):
 lineterminator = '\n'
 delimiter = ';'
 quotechar = '"'
reader = csv.reader(f, dialect=my_dialect)

[/code]

An error occurred when I tried to run this code: "quotechar must be an 1-character integer...". Please explain.
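One likely cause: a csv.Dialect subclass must define every dialect attribute, and leaving `quoting` unset trips the dialect validation. A self-contained sketch that adds it (reading from an in-memory string rather than an open file f):

```python
import csv
import io

# Adding the quoting attribute lets the Dialect subclass validate
class my_dialect(csv.Dialect):
    lineterminator = "\n"
    delimiter = ";"
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

# In-memory stand-in for the file object f
f = io.StringIO('1;2;3\n"a";"b";"c"\n')
reader = csv.reader(f, dialect=my_dialect)
rows = list(reader)
print(rows)
```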


6.
[code]
with open('mydata.csv', 'w') as f:
 writer = csv.writer(f, dialect=my_dialect)
 writer.writerow(('one', 'two', 'three'))
 writer.writerow(('1', '2', '3'))
 writer.writerow(('4', '5', '6'))
 writer.writerow(('7', '8', '9'))
[/code]

An error occurred when I ran this code. Please explain the cause of the error.
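If my_dialect from the previous question never validated (or was not defined in the current session), this block fails with the same error or a NameError. As an alternative, the dialect options can be passed straight to csv.writer as keyword arguments; a sketch writing to an in-memory buffer instead of mydata.csv:

```python
import csv
import io

# Dialect options as keyword arguments, no Dialect subclass needed
buf = io.StringIO()
writer = csv.writer(buf, delimiter=";", lineterminator="\n")
writer.writerow(("one", "two", "three"))
writer.writerow(("1", "2", "3"))
print(buf.getvalue())
```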


7.
[code]
But these are objects representing HTML elements; to get the URL and link text you
have to use each element's get method (for the URL) and text_content method (for
the display text):
In [908]: lnk = links[28]
In [909]: lnk
Out[909]: <Element a at 0x6c48dd0>
In [910]: lnk.get('href')
Out[910]: 'http://biz.yahoo.com/special.html'
In [911]: lnk.text_content()
Out[911]: 'Special Editions'
Thus, getting a list of all URLs in the document is a matter of writing this list comprehension:
In [912]: urls = [lnk.get('href') for lnk in doc.findall('.//a')]
In [913]: urls[-10:]
Out[913]:
['http://info.yahoo.com/privacy/us/yahoo/finance/details.html',
 'http://info.yahoo.com/relevantads/',
 'http://docs.yahoo.com/info/terms/',
 'http://docs.yahoo.com/info/copyright/copyright.html',
 'http://help.yahoo.com/l/us/yahoo/finance/forms_index.html',
 'http://help.yahoo.com/l/us/yahoo/finance/quotes/fitadelay.html',
 'http://help.yahoo.com/l/us/yahoo/finance/quotes/fitadelay.html',

[/code]

An error related to input line 912 occurred when I tried to run the code. Please explain.
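The `findall('.//a')` / `get('href')` pattern itself also works with the stdlib ElementTree on well-formed markup, which can help isolate whether the problem is lxml or the earlier step that builds `doc` (the Yahoo page download, for instance). A sketch on a hypothetical snippet:

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed snippet standing in for the downloaded page
snippet = """<html><body>
<a href="http://example.com/one">One</a>
<a href="http://example.com/two">Two</a>
</body></html>"""

doc = ET.fromstring(snippet)
# Same list-comprehension pattern as In [912]
urls = [lnk.get("href") for lnk in doc.findall(".//a")]
print(urls)
```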

8.
[code]
Using lxml.objectify, we parse the file and get a reference to the root node of the XML
file with getroot:
from lxml import objectify
path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

[/code]

An error occurred when I tried to run the code to access the XML file. Please explain.
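A common failure here is simply that Performance_MNR.xml is not in the current working directory, so open() raises an IOError before lxml is even involved. A self-contained round trip with a temporary file, using the stdlib parser so nothing extra needs to be installed:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical one-record stand-in for Performance_MNR.xml
xml_text = "<INDICATOR><AGENCY_NAME>Metro-North</AGENCY_NAME></INDICATOR>"
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write(xml_text)
    path = f.name

# parse + getroot, as in the book, but on a path that is known to exist
root = ET.parse(path).getroot()
print(root.find("AGENCY_NAME").text)
os.remove(path)
```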


9.
[code]
XML data can get much more complicated than this example. Each tag can have metadata,
too. Consider an HTML link tag which is also valid XML:
from StringIO import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
You can now access any of the fields (like href) in the tag or the link text:
In [930]: root
Out[930]: <Element a at 0x88bd4b0>
In [931]: root.get('href')
Out[931]: 'http://www.google.com'
In [932]: root.text
Out[932]: 'Google'
[/code]

The outputs for lines 930 and 931 are the same as for line 932 (i.e., Google). Please explain.
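The same three lookups can be reproduced with the stdlib parser; the element repr, the href attribute (from `.get()`), and the link text (from `.text`) are genuinely different values, so identical outputs would suggest the wrong object is being inspected:

```python
import xml.etree.ElementTree as ET

tag = '<a href="http://www.google.com">Google</a>'
root = ET.fromstring(tag)

print(root.get("href"))  # the attribute: http://www.google.com
print(root.text)         # the link text: Google
```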


10.

[code]
One of the easiest ways to store data efficiently in binary format is using Python's builtin
pickle serialization. Conveniently, pandas objects all have a save method which
writes the data to disk as a pickle:
In [933]: frame = pd.read_csv('ch06/ex1.csv')
In [934]: frame
Out[934]:
 a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
In [935]: frame.save('ch06/frame_pickle')
You read the data back into Python with pandas.load, another pickle convenience
function:
In [936]: pd.load('ch06/frame_pickle')
Out[936]:
 a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

[/code]

I couldn't run this code successfully. Please explain.


11.
[code]
PyTables builds on HDF5 to provide multiple flexible data containers, table indexing,
querying capability, and some support for out-of-core computations.
pandas has a minimal dict-like HDFStore class, which uses PyTables to store pandas
objects:
In [937]: store = pd.HDFStore('mydata.h5')
In [938]: store['obj1'] = frame
In [939]: store['obj1_col'] = frame['a']
In [940]: store
Out[940]:
<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
obj1 DataFrame
obj1_col Series
Objects contained in the HDF5 file can be retrieved in a dict-like fashion:
In [941]: store['obj1']
Out[941]:
 a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

[/code]

Do I need to import PyTables in order to run this code successfully?


12.
[code]
We can then make a list of the tweet fields of interest then pass the results list to DataFrame:
In [951]: tweet_fields = ['created_at', 'from_user', 'id', 'text']
In [952]: tweets = DataFrame(data['results'], columns=tweet_fields)
In [953]: tweets
Out[953]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns:
created_at 15 non-null values
from_user 15 non-null values
id 15 non-null values
text 15 non-null values
dtypes: int64(1), object(3)
Each row in the DataFrame now has the extracted data from each tweet:
In [121]: tweets.ix[7]
Out[121]:
created_at Thu, 23 Jul 2012 09:54:00 +0000
from_user deblike
id 227419585803059201
text pandas: powerful Python data analysis toolkit
Name: 7
[/code]

An error message occurred:
KeyError: 'Results'

Please explain.
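Dict keys are case-sensitive, so `data['Results']` raises KeyError even when the parsed JSON has a 'results' key; note also that the old Twitter search API the book used has since been retired, so a fresh download may not contain 'results' at all. A sketch with a hypothetical stand-in for the parsed JSON:

```python
import pandas as pd

# Hypothetical stand-in for the JSON the (now retired) search API returned
data = {"results": [
    {"created_at": "Thu, 23 Jul 2012 09:54:00 +0000",
     "from_user": "deblike", "id": 227419585803059201,
     "text": "pandas: powerful Python data analysis toolkit"},
]}

tweet_fields = ["created_at", "from_user", "id", "text"]
# Lowercase 'results' matches the key actually present in the dict
tweets = pd.DataFrame(data["results"], columns=tweet_fields)
print(tweets.shape)  # (1, 4)
```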

13.
[code]
Storing and Loading Data in MongoDB
NoSQL databases take many different forms. Some are simple dict-like key-value stores
like BerkeleyDB or Tokyo Cabinet, while others are document-based, with a dict-like
object being the basic unit of storage. I've chosen MongoDB (http://mongodb.org) for
my example. I started a MongoDB instance locally on my machine, and connect to it
on the default port using pymongo, the official driver for MongoDB:
import pymongo
con = pymongo.Connection('localhost', port=27017)
[/code]

I have trouble importing pymongo as it is a database package. How do I import MongoDB properly into Python 2.7?
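My understanding is that pymongo is a third-party package installed with pip (pip install pymongo), not something imported from the MongoDB server itself; note also that newer pymongo releases replaced `Connection` with `MongoClient`. A small check that runs whether or not the driver is installed:

```python
# Probe for the pymongo driver without assuming it is installed
try:
    import pymongo  # noqa: F401
    HAVE_PYMONGO = True
except ImportError:
    HAVE_PYMONGO = False

print(HAVE_PYMONGO)
```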

Again, your help will be greatly appreciated!


