PostgreSQL¶
Connecting to PostgreSQL databases from the command line¶
The psql client is the standard tool for connecting to PostgreSQL databases from the command line. It is pre-installed on all Faculty servers.
To connect to a PostgreSQL database, you need to know:
its hostname: this is usually a string like
customers.mydomain.com
;its port: the default port for PostgreSQL databases is 5432;
the name of the database that you want to connect to. If this is a new server, the only database on it is likely to be named
postgres
;your username and password for the database server (note that this is different to your Faculty username). If you are unsure of these, you should ask your database administrator.
You can then connect to the database with:
$ psql --host HOSTNAME --port PORT --user USERNAME --dbname DATABASE_NAME
You should then be prompted for a password. If the port is the default 5432,
you may omit the --port PORT
argument.
For instance, I can connect to database github
running on host
github-scraper.cpllpj7texvd.eu-west-1.rds.amazonaws.com
with username
awsuser
:
If you get stuck, run psql --help | less
in a terminal to view a list of
all available options.
Connecting to PostgreSQL databases from Python¶
To connect to PostgreSQL databases from Python, you can use the Python module psycopg2, which comes pre-installed on Faculty Jupyter and JupyterLab servers:
import psycopg2
connection = psycopg2.connect(
host='customers.mydomain.com', # host on which the database is running
database='database_name', # name of the database to connect to
user='username', # username to connect with
password='password' # password associated with your username
)
cursor = connection.cursor()
cursor.execute('SELECT * FROM customers')
customers = list(cursor.fetchall())
print('We have {} customers'.format(len(customers))) # This is data science!
connection.close()
Note
We close the connection to allow the database server to reclaim resources. This can be critical in a Jupyter notebook, since the kernel remains alive for a long time.
The psycopg2 object can also be used in a with
statement to ensure it gets
closed automatically:
import psycopg2
with psycopg2.connect(
host='customers.mydomain.com', database='database_name',
user='username', password='password'
) as connection:
cursor = connection.cursor()
cursor.execute('SELECT * FROM customers')
customers = list(cursor.fetchall())
print('We have {} customers'.format(len(customers)))
You can also use the read_sql()
function in pandas to read the result of a
query directly into a DataFrame:
import psycopg2
import pandas
with psycopg2.connect(
host='customers.mydomain.com', database='database_name',
user='username', password='password'
) as connection:
df = pandas.read_sql('SELECT * FROM customers', connection)
print(df)
Note
We recommend avoiding pasting database passwords and other connection details in many notebooks in a project. Have a look at Factoring connection details into a package for a recommended strategy for managing database connection details.
Connecting to PostgreSQL databases from R¶
To connect to PostgreSQL databases from an RStudio server, firstly install the dependencies for the database driver by running the following command:
$ sudo apt-get install unixodbc-dev odbc-postgresql
Alternately, specify unixodbc-dev
and odbc-postgresql
in the System
section of a custom environment, and apply this environment to your RStudio
server (see Custom server environments).
Once the ODBC drivers are installed, install the odbc
library for R by
running install.packages("odbc")
in an R session. Run library(odbc)
to
import the library, and check which drivers are installed on the system by
running odbc::odbcListDrivers()
.
To connect to a PostgreSQL database from R create a connection object:
connection <- DBI::dbConnect(
odbc::odbc(),
Driver = "psqlodbca.so",
Server = "customer.mydomain.com", # Host on which the database is running
Database = "database_name", # Name of the database
UID = "username", # Database username
PWD = "password", # Database password
Port = 5432 # Database port (5432 is the PosgreSQL default)
)
Once you have a connection object, you can list the tables in the database with:
dbListTables(connection)
You can then query a table with dbSendQuery
and dbFetch
:
result <- dbSendQuery(connection, "SELECT * FROM customers")
dbFetch(result)
where dbSendQuery
sends the query and dbFetch
retrieves the results. For
other kind of queries, just use dbSendQuery
.